AI Skill Report Card
Detecting File Types
File Type Detection with Magika
Quick Start: 15 / 15
```python
from magika import Magika

# Initialize once (loads model)
m = Magika()

# Detect from bytes
result = m.identify_bytes(b'function log(msg) {console.log(msg);}')
print(result.output.label)  # javascript

# Detect from file path
result = m.identify_path('./document.pdf')
print(result.output.label)  # pdf

# Detect from stream
with open('./file.bin', 'rb') as f:
    result = m.identify_stream(f)
    print(result.output.label)
```
Recommendation
Add installation instructions (`pip install magika`) to Quick Start so the first example can be run immediately.
Workflow: 14 / 15
Single File Detection
- Initialize Magika: one-time setup that loads the ML model
- Choose an input method: bytes, file path, or stream
- Call the matching identify method (`identify_bytes`, `identify_path`, or `identify_stream`): returns a result with a confidence score
- Extract information: label, MIME type, description, extensions
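The extraction step can be sketched as a small helper. This is a minimal sketch, not part of Magika: `summarize_result` is a hypothetical name, and the attribute names `mime_type`, `description`, and `extensions` on `result.output`, plus `score` on the result itself, are assumed from the 0.6-style Python API and should be checked against the installed version.

```python
def summarize_result(result):
    """Flatten a Magika-style result into a plain dict.

    Assumption: `result.output` exposes label, mime_type, description and
    extensions, and the confidence is available as `result.score`.
    """
    output = result.output
    return {
        "label": output.label,
        "mime_type": output.mime_type,
        "description": output.description,
        "extensions": list(output.extensions),
        "score": float(result.score),
    }
```

With a real instance this would be called as `summarize_result(m.identify_path('./document.pdf'))`, which keeps the rest of a script independent of the result object's shape.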
Batch Processing
```python
from magika import Magika

# Process multiple files
m = Magika()
files = ['file1.bin', 'file2.txt', 'file3.unknown']
for file_path in files:
    result = m.identify_path(file_path)
    print(f"{file_path}: {result.output.label} ({result.output.description})")
```
Progress checklist for large batches:
- Initialize Magika once (not per file)
- Collect file paths
- Process in loop with error handling
- Log results for verification
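The checklist above could be turned into a loop along these lines. This is a sketch under assumptions: `scan_paths` and the logger name are hypothetical, and `detector` stands for a single `Magika()` instance created once by the caller (any object with an `identify_path` method works, which also makes the loop easy to test).

```python
import logging

logger = logging.getLogger("magika_batch")  # hypothetical logger name


def scan_paths(detector, paths):
    """Identify each path with one shared detector; never raise per file.

    Returns a list of (path, label_or_None, error_or_None) tuples so the
    results can be reviewed or written out after the run.
    """
    rows = []
    for path in paths:
        try:
            result = detector.identify_path(path)
            label = result.output.label
            rows.append((str(path), label, None))
            logger.info("%s -> %s", path, label)
        except Exception as exc:  # record the failure and keep going
            rows.append((str(path), None, repr(exc)))
            logger.warning("%s failed: %r", path, exc)
    return rows
```

Returning the rows instead of printing them leaves the choice of output (CSV, JSON, a database) to the caller, and the error column shows which files need a second look.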
Recommendation
Include a simple template/framework section showing a complete file processing script pattern
Examples: 20 / 20
Example 1: JavaScript Detection
Input: b'function log(msg) {console.log(msg);}'
Output: javascript (label), text/javascript (MIME), score: 0.997
Example 2: Unknown Binary File
Input: Random binary data
Output: unknown (label), application/octet-stream (MIME)
Example 3: Command Line Bulk Processing
```bash
# Recursive directory scan
magika -r /path/to/files --json > results.json

# Process with confidence scores
magika file1.bin file2.txt --output-score
```
Example 4: Stream Processing
```python
import io

from magika import Magika

m = Magika()
data = b'<!DOCTYPE html><html><body>Hello</body></html>'
stream = io.BytesIO(data)
result = m.identify_stream(stream)
print(result.output.label)  # Output: html
```
Recommendation
Add more edge case handling examples like permission errors, network files, or corrupted files
Best Practices
Performance:
- Initialize Magika once per application/script
- Model loading is a one-time overhead, paid when `Magika()` is constructed
- Inference takes about 5 ms per file on a single CPU, nearly independent of file size
- Use batch processing for multiple files
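Timing figures like these depend on hardware, so it can be worth measuring them locally. The helper below is a generic sketch (the name `mean_seconds` is hypothetical and it is not part of Magika); it averages the wall-clock time of any zero-argument callable.

```python
import time


def mean_seconds(fn, repeats=20):
    """Return the mean wall-clock time in seconds of calling fn()."""
    if repeats < 1:
        raise ValueError("repeats must be >= 1")
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats
```

For example, `mean_seconds(Magika, repeats=3)` would estimate the model-loading cost and `mean_seconds(lambda: m.identify_bytes(data))` the per-call inference cost, assuming `m` and `data` already exist.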
Confidence Handling:
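```python
# Check confidence before trusting result
# (in the API version that provides output.label, the score is
# read from the result itself rather than from result.output)
if result.score > 0.8:
    file_type = result.output.label
else:
    file_type = "unknown_low_confidence"
```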
Error Handling:
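```python
# Illustrative wrapper; `m` is a Magika instance created once by the caller
def detect_label(m, file_path):
    try:
        result = m.identify_path(file_path)
        if result.status == "ok":
            return result.output.label
        return "detection_failed"  # non-ok status, e.g. unreadable file
    except Exception:
        return "detection_failed"
```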
Memory Efficiency:
- Use `identify_stream()` for large files; Magika only reads a subset of the file content
- Memory usage is constant regardless of file size
Common Pitfalls
Don't reinstantiate Magika repeatedly:
```python
# BAD - loads model each time
for file in files:
    m = Magika()  # Expensive!
    result = m.identify_path(file)

# GOOD - reuse instance
m = Magika()  # Load once
for file in files:
    result = m.identify_path(file)
```
Don't ignore confidence scores:
- Scores below 0.5 may indicate generic detection
- Very low scores suggest truly unknown formats
- Use appropriate thresholds for your use case
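One way to make thresholds depend on the use case is a per-label table with a fallback. This is a sketch only: `accept_label` is a hypothetical helper, and the threshold values shown in the usage note are illustrative, not recommendations from Magika.

```python
def accept_label(label, score, thresholds, default_threshold=0.8):
    """Return `label` if `score` clears its threshold, else None.

    `thresholds` maps labels to minimum scores; labels that are not
    listed fall back to `default_threshold`.
    """
    required = thresholds.get(label, default_threshold)
    return label if score >= required else None
```

A stricter bar for risky types might look like `accept_label(label, score, {"pdf": 0.95})`, where `label` and `score` are taken from a detection result; a `None` return can then route the file to manual review.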
Don't assume file extensions match content:
- Magika detects actual content, not extension
- Use for validation: flag files where `detected != expected_from_extension`
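The validation check can be sketched by comparing the file's suffix with the extensions reported for the detected type. `extension_mismatch` is a hypothetical helper; the list it receives is assumed to come from `result.output.extensions`, with entries given without leading dots.

```python
from pathlib import Path


def extension_mismatch(file_path, detected_extensions):
    """Return True when the file's suffix is not among the detected ones.

    The comparison is case-insensitive and ignores leading dots. Only the
    final suffix is considered, so "a.tar.gz" is treated as "gz".
    """
    suffix = Path(file_path).suffix.lower().lstrip(".")
    expected = {ext.lower().lstrip(".") for ext in detected_extensions}
    return suffix not in expected
```

Note that an empty extension list (for example, for an unknown type) always counts as a mismatch here, so callers may want to handle that case separately.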
Don't process empty files without checking:
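```python
import os

# Handle edge cases (illustrative helper; returns None for non-empty files)
def empty_file_label(file_path):
    if os.path.getsize(file_path) == 0:
        return "empty_file"
```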