AI Skill Report Card

Handling Corrupted Data

Grade: B- · Score: 72 / 100 · Apr 10, 2026 · Source: Web
Quick Start: 13 / 15
```python
import re
from difflib import SequenceMatcher

def analyze_corruption(text):
    # Check entropy and patterns
    entropy = len(set(text)) / len(text) if text else 0
    repeating_pattern = find_repeating_sequence(text[:50])
    if entropy < 0.3 and repeating_pattern:
        return {"type": "keyboard_mash", "confidence": 0.9}
    elif entropy < 0.1:
        return {"type": "encoding_error", "confidence": 0.8}
    return {"type": "unknown", "confidence": 0.1}

result = analyze_corruption("asdasdasdsdasasdasdasdsdasas")
# Output: {"type": "keyboard_mash", "confidence": 0.9}
```
Recommendation
Include complete implementations of key functions like find_repeating_sequence() and the entropy calculation logic instead of just showing the interface
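A sketch along these lines could close the gap. The helper names follow the snippet's interface, the diversity-ratio formula mirrors its entropy check, and Shannon entropy is offered as a stricter alternative; none of this is the skill's own implementation.

```python
import math
from collections import Counter

def find_repeating_sequence(text, max_unit=5):
    """Return the shortest unit that repeats at least three times
    consecutively from the start of text, else None."""
    for size in range(1, max_unit + 1):
        unit = text[:size]
        if unit and unit * 3 == text[:size * 3]:
            return unit
    return None

def diversity_ratio(text):
    """Character diversity ratio, the 'entropy' the snippet computes."""
    return len(set(text)) / len(text) if text else 0.0

def shannon_entropy(text):
    """Shannon entropy in bits per character (a stricter measure)."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())
```

For the sample input, `find_repeating_sequence` returns the unit `"asd"`, which is what trips the keyboard-mash branch above.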
Workflow: 12 / 15

Progress:

  • Calculate text entropy and character distribution
  • Detect repeating patterns or sequences
  • Check for encoding artifacts (null bytes, high ASCII)
  • Attempt reconstruction using context clues
  • Generate recovery report with confidence scores

Workflow steps:

  1. Entropy Analysis: Calculate character diversity ratio
  2. Pattern Detection: Use regex to find repeating sequences
  3. Encoding Check: Test for UTF-8, Latin-1, ASCII corruption
  4. Context Recovery: Look for partial words or known formats
  5. Reconstruction: Apply appropriate recovery technique
Recommendation
Provide a comprehensive template or framework for the entire corruption detection and recovery process with all the statistical methods mentioned
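The five workflow steps above could be sketched as one pipeline. The thresholds, regex, and method names below are illustrative assumptions, not the evaluated skill's prescribed framework.

```python
import re

def detect_corruption(data: bytes) -> dict:
    """Illustrative five-step pipeline; thresholds are assumptions."""
    report = {"corruption_type": "unknown", "confidence": 0.1}

    # 1. Entropy analysis: character diversity ratio over a lossless view
    text = data.decode("latin-1")
    diversity = len(set(text)) / len(text) if text else 0.0
    report["entropy"] = round(diversity, 2)

    # 2. Pattern detection: shortest unit repeated 3+ times at the start
    m = re.match(rb"(.{1,5}?)\1{2,}", data[:50], re.DOTALL)
    if m and diversity < 0.3:
        report.update(corruption_type="keyboard_mash",
                      pattern=m.group(1).decode("latin-1"),
                      confidence=0.9)
    # 3. Encoding check: null bytes or UTF-16 byte order marks
    elif b"\x00" in data or data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        report.update(corruption_type="encoding_error", confidence=0.8)

    # 4. Context recovery: keep printable ASCII runs as candidates
    fragments = re.findall(rb"[ -~]{3,}", data)
    report["recovered_fragments"] = [f.decode("ascii") for f in fragments]

    # 5. Reconstruction: pick a recovery method from the findings
    report["recovery_method"] = {
        "keyboard_mash": "discard",
        "encoding_error": "strip_and_decode",
    }.get(report["corruption_type"], "manual_review")
    return report
```

Running it on the two sample inputs from the Examples section reproduces the keyboard-mash and encoding-error classifications, which suggests the template scales to the statistical methods listed above.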
Examples: 15 / 20

Example 1
Input: asdasdasdsdasasdasdasdsdasasdasdasdsdasas
Output:

```json
{
  "corruption_type": "keyboard_mash",
  "entropy": 0.15,
  "pattern": "asd",
  "recovery_method": "discard",
  "confidence": 0.95
}
```

Example 2
Input: Hello\x00\x00World\xff\xfeData
Output:

```json
{
  "corruption_type": "encoding_error",
  "detected_encoding": "utf-16",
  "recovered_text": "HelloWorldData",
  "confidence": 0.85
}
```
Recommendation
Add more diverse examples showing different corruption types (transmission errors, partial file recovery, mixed encoding scenarios) with actual before/after data pairs
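One such before/after pair can be reproduced with a stdlib-only recovery sketch for Example 2's input; the BOM-stripping order and fallback chain here are assumptions, not the skill's prescribed method.

```python
import codecs

def strip_encoding_artifacts(raw: bytes) -> str:
    """Recover readable text from bytes polluted by null bytes and
    byte order marks, as in Example 2. Real transmission errors
    usually need format-aware handling beyond this."""
    # Drop byte order marks wherever they appear, then null bytes
    for bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE, codecs.BOM_UTF8):
        raw = raw.replace(bom, b"")
    raw = raw.replace(b"\x00", b"")
    # Decode with a lenient fallback chain
    for enc in ("utf-8", "latin-1"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")

# strip_encoding_artifacts(b"Hello\x00\x00World\xff\xfeData")
# → "HelloWorldData", matching Example 2's recovered_text
```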
Best practices:

  • Use the chardet library for encoding detection
  • Apply Levenshtein distance for fuzzy matching
  • Check file headers for format clues
  • Test multiple encoding hypotheses
  • Preserve original data before attempting recovery
  • Use statistical analysis (chi-square test) for randomness

Pitfalls:

  • Don't assume all repeated characters are corruption
  • Avoid over-processing legitimate data patterns
  • Don't ignore byte order marks (BOM) in Unicode files
  • Never modify original files without backups
  • Don't rely solely on entropy; check semantic context
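The "test multiple encoding hypotheses" tip might look like this in practice. The candidate list is an illustrative default; chardet, where available, can supply better-informed priors.

```python
def rank_encoding_hypotheses(raw: bytes,
                             candidates=("utf-8", "utf-16", "latin-1", "ascii")):
    """Decode raw under several candidate encodings and rank the
    survivors by the fraction of printable text they produce."""
    ranked = []
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except (UnicodeDecodeError, UnicodeError):
            continue  # hypothesis rejected outright
        printable = sum(ch.isprintable() or ch.isspace() for ch in text)
        ranked.append((enc, round(printable / len(text), 2) if text else 0.0))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Note that decoding never touches the original bytes, in line with the backup rule above; recovery should always operate on a copy.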
Grade: B- · AI Skill Framework
Scorecard
Criteria Breakdown
  Quick Start: 13/15
  Workflow: 12/15
  Examples: 15/20
  Completeness: 5/20
  Format: 15/15
  Conciseness: 12/15