AI Skill Report Card
Handling Corrupted Data
Quick Start: 13 / 15
```python
import re

def find_repeating_sequence(text):
    # Look for a short unit (2-5 chars) repeated at least three times
    match = re.search(r"(.{2,5})\1{2,}", text)
    return match.group(1) if match else None

def analyze_corruption(text):
    # "Entropy" here is a simple character-diversity ratio, not Shannon entropy
    entropy = len(set(text)) / len(text) if text else 0
    repeating_pattern = find_repeating_sequence(text[:50])
    if entropy < 0.3 and repeating_pattern:
        return {"type": "keyboard_mash", "confidence": 0.9}
    elif entropy < 0.1:
        return {"type": "encoding_error", "confidence": 0.8}
    return {"type": "unknown", "confidence": 0.1}

result = analyze_corruption("asdasdasdsdasasdasdasdsdasas")
# Output: {"type": "keyboard_mash", "confidence": 0.9}
```
Recommendation
Include complete implementations of supporting functions such as find_repeating_sequence() and the entropy calculation logic, rather than showing only the interface.
Workflow: 12 / 15
Progress:
- Calculate text entropy and character distribution
- Detect repeating patterns or sequences
- Check for encoding artifacts (null bytes, high ASCII)
- Attempt reconstruction using context clues
- Generate recovery report with confidence scores
- Entropy Analysis: Calculate character diversity ratio
- Pattern Detection: Use regex to find repeating sequences
- Encoding Check: Test for UTF-8, Latin-1, ASCII corruption
- Context Recovery: Look for partial words or known formats
- Reconstruction: Apply appropriate recovery technique
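The encoding-check step above can be sketched as a fallback chain that tries candidate encodings in priority order. This is a minimal illustration, not the graded implementation; `detect_encoding` is a hypothetical helper name:

```python
def detect_encoding(raw: bytes):
    # Try strict decodes from most to least specific; "latin-1" never
    # raises, so it acts as a last-resort fallback
    for enc in ("utf-8", "utf-16", "latin-1"):
        try:
            return enc, raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None, None

detect_encoding(b"\xff\xfeH\x00i\x00")  # UTF-16 LE with BOM -> ("utf-16", "Hi")
```

Ordering matters: UTF-8 must be tried before Latin-1, since any byte sequence decodes "successfully" as Latin-1.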
Recommendation
Provide a comprehensive template or framework covering the entire corruption detection and recovery process, including all of the statistical methods mentioned.
Examples: 15 / 20
Example 1:
Input: asdasdasdsdasasdasdasdsdasasdasdasdsdasas
Output:
```json
{
  "corruption_type": "keyboard_mash",
  "entropy": 0.15,
  "pattern": "asd",
  "recovery_method": "discard",
  "confidence": 0.95
}
```
Example 2:
Input: Hello\x00\x00World\xff\xfeData
Output:
```json
{
  "corruption_type": "encoding_error",
  "detected_encoding": "utf-16",
  "recovered_text": "HelloWorldData",
  "confidence": 0.85
}
```
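The recovery shown in Example 2 could be sketched as a simple artifact-stripping pass that drops null bytes and BOM markers before decoding. `strip_artifacts` is a hypothetical helper, not part of the graded code:

```python
def strip_artifacts(raw: bytes) -> str:
    # Remove embedded null bytes and UTF-16 BOM markers (LE and BE),
    # then decode what remains, replacing anything still undecodable
    cleaned = raw.replace(b"\x00", b"")
    cleaned = cleaned.replace(b"\xff\xfe", b"").replace(b"\xfe\xff", b"")
    return cleaned.decode("utf-8", errors="replace")

strip_artifacts(b"Hello\x00\x00World\xff\xfeData")  # -> "HelloWorldData"
```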
Recommendation
Add more diverse examples covering other corruption types (transmission errors, partial file recovery, mixed encodings), each with actual before/after data pairs.
Best Practices
- Use chardet library for encoding detection
- Apply Levenshtein distance for fuzzy matching
- Check file headers for format clues
- Test multiple encoding hypotheses
- Preserve original data before attempting recovery
- Use statistical analysis (chi-square test) for randomness
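The fuzzy-matching practice above can be illustrated with the standard library's `difflib.SequenceMatcher`, which gives a similarity ratio closely related to edit distance; `fuzzy_match` and its threshold are illustrative choices:

```python
from difflib import SequenceMatcher

def fuzzy_match(candidate: str, reference: str, threshold: float = 0.8) -> bool:
    # ratio() returns a similarity in [0.0, 1.0]; 1.0 means identical
    return SequenceMatcher(None, candidate, reference).ratio() >= threshold

fuzzy_match("recieve", "receive")  # common transposition scores high
```

For large candidate sets, a dedicated Levenshtein implementation will be faster, but `SequenceMatcher` needs no extra dependency.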
Common Pitfalls
- Don't assume all repeated characters are corruption
- Avoid over-processing legitimate data patterns
- Don't ignore byte order marks (BOM) in Unicode files
- Never modify original files without backups
- Don't rely solely on entropy - check semantic context
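The BOM pitfall above is easy to demonstrate: decoding a UTF-8 file with a BOM via plain `"utf-8"` leaves a stray `"\ufeff"` at the start, while Python's `"utf-8-sig"` codec strips it:

```python
raw = b"\xef\xbb\xbfHello"       # file content with a UTF-8 BOM
naive = raw.decode("utf-8")      # keeps the BOM: "\ufeffHello"
clean = raw.decode("utf-8-sig")  # BOM stripped: "Hello"
```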