AI Skill Report Card
Anonymizing Medical Documents
Quick Start
```python
import re

import fitz  # PyMuPDF


def anonymize_pdf(input_path, output_path):
    doc = fitz.open(input_path)

    # Common PHI patterns. The 'names' pattern matches any two capitalized
    # words, so expect false positives and review the output by hand.
    patterns = {
        'names': r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',
        'mrn': r'MRN:?\s*\d{6,10}',
        'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
        'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
        'dob': r'\b\d{1,2}/\d{1,2}/\d{4}\b',
    }

    for page in doc:
        text = page.get_text()
        for pattern_name, pattern in patterns.items():
            # search_for() looks up literal text, not regexes, so find the
            # PHI strings with re first, then locate each one on the page.
            for hit in {m.group() for m in re.finditer(pattern, text)}:
                for rect in page.search_for(hit):
                    page.add_redact_annot(rect, fill=(0, 0, 0))
        page.apply_redactions()  # permanently removes the covered text

    doc.save(output_path)
    doc.close()
```
Workflow
Progress:
- Inventory Documents - Catalog all files by type (PDF, DOCX, images)
- Backup Originals - Create secure copies before processing
- Identify PHI Categories - Patient names, MRNs, DOBs, addresses, phone numbers
- Set Anonymization Rules - Replace vs redact, fake data generation
- Process by File Type - Use appropriate tools for each format
- Quality Check - Manual review of sample files
- Validate Medical Content - Ensure clinical information remains intact
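The inventory step can be sketched with the standard library alone. This is a minimal example, not a fixed part of the workflow: the `FILE_TYPES` grouping and the `inventory_documents` name are illustrative assumptions, so adjust the extensions to the formats you actually hold.

```python
from collections import defaultdict
from pathlib import Path

# Assumed grouping of extensions; extend it for other formats you use.
FILE_TYPES = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
    ".tif": "image",
    ".tiff": "image",
}


def inventory_documents(root):
    """Return {type: [paths]} for every file under root, in sorted order."""
    catalog = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Anything with an unrecognized extension is listed under "other"
            kind = FILE_TYPES.get(path.suffix.lower(), "other")
            catalog[kind].append(path)
    return dict(catalog)
```

Files that land under "other" are worth a manual look before processing, since no tool in the pipeline will touch them.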
Step-by-Step Process
1. Create folder structure:

   ```
   portfolio/
   ├── originals/    (secure backup)
   ├── anonymized/   (portfolio ready)
   └── scripts/      (automation tools)
   ```

2. Install required tools (pytesseract also needs the Tesseract OCR engine installed on the system):

   ```bash
   pip install PyMuPDF python-docx Pillow pytesseract
   ```

3. Run anonymization pipeline:
   - PDFs: Text replacement, plus OCR for scanned documents
   - DOCX: Find/replace with python-docx
   - Images: OCR, then overlay or blur sensitive areas

4. Generate consistent fake data:

   ```python
   import random
   from datetime import timedelta

   fake_names = ["Patient A", "Patient B", "John D.", "Jane S."]
   fake_mrns = ["MRN: 12345", "MRN: 67890"]
   # original_date is a datetime.date from the source document; shift it by up to ±30 days
   fake_date = original_date + timedelta(days=random.randint(-30, 30))
   ```
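One way to keep replacements consistent is to hand out placeholders from a single lookup table rather than picking from a list each time. The sketch below is an illustration under that assumption; the `PseudonymMap` class and its method names are hypothetical, not part of any library.

```python
class PseudonymMap:
    """Assign each real identifier a stable placeholder such as 'Patient A'."""

    def __init__(self, prefix="Patient"):
        self.prefix = prefix
        self._assigned = {}

    def get(self, real_value):
        # Normalize whitespace and case so "Sarah  Johnson" and
        # "sarah johnson" resolve to the same placeholder.
        key = " ".join(real_value.split()).lower()
        if key not in self._assigned:
            label = self._label(len(self._assigned))
            self._assigned[key] = f"{self.prefix} {label}"
        return self._assigned[key]

    @staticmethod
    def _label(index):
        # 0 -> A, 25 -> Z, 26 -> AA (spreadsheet-style column letters)
        letters = ""
        index += 1
        while index:
            index, rem = divmod(index - 1, 26)
            letters = chr(ord("A") + rem) + letters
        return letters


names = PseudonymMap()
print(names.get("Sarah Johnson"))   # Patient A
print(names.get("Robert Wilson"))   # Patient B
print(names.get("sarah  johnson"))  # Patient A
```

The table itself links real names to placeholders, so treat it like the originals: keep it in memory or store it with the secure backup, never alongside the anonymized files.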
Examples
Example 1: PDF Report
Input: "Patient: Sarah Johnson, DOB: 03/15/1985, MRN: 987654321"
Output: "Patient: Patient A, DOB: 04/12/1985, MRN: 12345"

Example 2: DOCX Letter
Input: "Dear Dr. Smith, Regarding your patient Robert Wilson..."
Output: "Dear Dr. [Physician], Regarding your patient Patient B..."

Example 3: Lab Image
Input: Lab result with patient name in header
Output: Same lab values, header shows "Patient C" with blurred original text
Best Practices
- Consistent replacement - Same patient should get same fake name across all documents
- Preserve medical timeline - Shift dates by same offset to maintain relative timing
- Keep clinical relevance - Age ranges, approximate dates matter for medical context
- Use blackout for handwriting - OCR often misses handwritten PHI
- Maintain document formatting - Headers, layouts should look professional
- Test with colleagues - Have others review for missed identifiers
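The timeline advice above can be sketched as a per-patient offset applied to every date for that patient. Deriving the offset from a hash of a secret plus a patient key is an assumption made here so the same patient gets the same shift in every document; `patient_offset` and `shift_dates` are hypothetical helper names.

```python
import hashlib
from datetime import date, timedelta


def patient_offset(patient_key, secret, max_days=30):
    """Derive a stable, non-zero offset within ±max_days for one patient."""
    digest = hashlib.sha256(f"{secret}:{patient_key}".encode()).digest()
    value = int.from_bytes(digest[:4], "big") % (2 * max_days)
    # Map 0..2*max_days-1 onto -max_days..-1 and +1..+max_days (skipping 0)
    days = value - max_days if value < max_days else value - max_days + 1
    return timedelta(days=days)


def shift_dates(dates, offset):
    """Apply one offset to every date so intervals between them are unchanged."""
    return [d + offset for d in dates]


admission, discharge = date(2023, 1, 10), date(2023, 1, 14)
offset = patient_offset("Sarah Johnson", secret="keep-this-private")
new_admission, new_discharge = shift_dates([admission, discharge], offset)
# The length of stay is preserved and discharge still follows admission
assert new_discharge - new_admission == discharge - admission
```

Because every date moves by the same amount, a shifted discharge can never land before its admission. Keep the secret out of the anonymized folder, since anyone holding it can recompute the offsets.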
Common Pitfalls
- Inconsistent anonymization - Same patient having different fake names across documents
- Missing image-embedded text - Scanned documents need OCR processing
- Forgetting metadata - Document properties often contain original author/patient info
- Over-anonymization - Removing medically relevant information (age, general location)
- Redacting signature blocks - Your own signature and credentials should remain intact
- Date logic errors - Creating impossible timelines (discharge before admission)
- Skipping manual review - Automated tools miss context-dependent identifiers
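Several of these pitfalls can be caught with a second pass over the text extracted from the anonymized files. The following is a rough sketch that supports manual review rather than replacing it: it reuses the Quick Start patterns for MRN, SSN, and phone, and `find_residual_phi` with its `known_names` parameter is a hypothetical helper.

```python
import re

RESIDUAL_PATTERNS = {
    "mrn": re.compile(r"MRN:?\s*\d{6,10}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}


def find_residual_phi(text, known_names=()):
    """Return sorted (category, matched_text) pairs still present in the text."""
    findings = set()
    for category, pattern in RESIDUAL_PATTERNS.items():
        findings.update((category, m.group()) for m in pattern.finditer(text))
    # Real names collected during anonymization should no longer appear at all
    for name in known_names:
        if re.search(re.escape(name), text, flags=re.IGNORECASE):
            findings.add(("name", name))
    return sorted(findings)


leaky = "Patient A, MRN: 12345. Call 555-123-4567. Signed: sarah johnson"
print(find_residual_phi(leaky, known_names=["Sarah Johnson"]))
# [('name', 'Sarah Johnson'), ('phone', '555-123-4567')]
```

An empty result only means these specific patterns were not found; handwriting, metadata, and context-dependent identifiers still need a human check.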