AI Skill Report Card

Anonymizing Medical Documents

B+ · 78 · Jan 16, 2026

Quick Start

Python
import re

import fitz  # PyMuPDF

# Common PHI patterns. The name pattern is deliberately broad and also matches
# other capitalized word pairs, so review the redacted output manually.
PATTERNS = {
    'names': r'\b[A-Z][a-z]+ [A-Z][a-z]+\b',
    'mrn': r'MRN:?\s*\d{6,10}',
    'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
    'phone': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b',
    'dob': r'\b\d{1,2}/\d{1,2}/\d{4}\b',
}

def anonymize_pdf(input_path, output_path):
    doc = fitz.open(input_path)
    for page in doc:
        text = page.get_text()
        for pattern in PATTERNS.values():
            # search_for() matches literal strings, not regexes, so collect
            # the matching strings from the extracted text first.
            for found in set(re.findall(pattern, text)):
                for rect in page.search_for(found):
                    page.add_redact_annot(rect, fill=(0, 0, 0))
        # Removes the text under each annotation, not just covers it.
        page.apply_redactions()
    doc.save(output_path)
    doc.close()

Workflow

Progress:

  • Inventory Documents - Catalog all files by type (PDF, DOCX, images)
  • Backup Originals - Create secure copies before processing
  • Identify PHI Categories - Patient names, MRNs, DOBs, addresses, phone numbers
  • Set Anonymization Rules - Replace vs redact, fake data generation
  • Process by File Type - Use appropriate tools for each format
  • Quality Check - Manual review of sample files
  • Validate Medical Content - Ensure clinical information remains intact
  1. Create folder structure:

    portfolio/
    ├── originals/     (secure backup)
    ├── anonymized/    (portfolio ready)
    └── scripts/       (automation tools)
    
  2. Install required tools:

    Bash
    pip install PyMuPDF python-docx Pillow pytesseract
    # pytesseract is only a wrapper; the Tesseract OCR engine must be installed separately
  3. Run anonymization pipeline:

    • PDFs: Text replacement + OCR for scanned docs
    • DOCX: Find/replace with python-docx
    • Images: OCR + overlay/blur sensitive areas
  4. Generate consistent fake data:

    Python
    fake_names = ["Patient A", "Patient B", "John D.", "Jane S."]
    # generate_shifted_dates and original_date are placeholders, not defined here
    fake_dates = generate_shifted_dates(original_date, -30, +30)  # ±30 days
    fake_mrns = ["MRN: 12345", "MRN: 67890"]
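
Step 4 only works if the same patient maps to the same fake values in every document. A minimal sketch of one way to do that, assuming patients can be keyed by a stable identifier such as the original MRN; `PseudonymRegistry` and its methods are hypothetical names for illustration, not part of any library.

```python
import string
from itertools import count


class PseudonymRegistry:
    """Hands out stable fake names and MRNs, one set per real patient key."""

    def __init__(self):
        self._by_key = {}
        self._counter = count()

    def lookup(self, patient_key):
        # Reuse the existing pseudonym so every document agrees.
        if patient_key not in self._by_key:
            n = next(self._counter)
            self._by_key[patient_key] = {
                "name": f"Patient {self._label(n)}",
                "mrn": f"MRN: {10000 + n}",
            }
        return self._by_key[patient_key]

    @staticmethod
    def _label(n):
        # 0 -> A, 25 -> Z, 26 -> AA (spreadsheet-style labels)
        letters = ""
        n += 1
        while n > 0:
            n, rem = divmod(n - 1, 26)
            letters = string.ascii_uppercase[rem] + letters
        return letters


registry = PseudonymRegistry()
print(registry.lookup("987654321")["name"])  # Patient A
print(registry.lookup("987654321")["name"])  # Patient A again
```

The registry links real identifiers to fake ones, so it belongs with the secure originals rather than in the portfolio-ready folder.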

Examples

Example 1: PDF Report
Input: "Patient: Sarah Johnson, DOB: 03/15/1985, MRN: 987654321"
Output: "Patient: Patient A, DOB: 04/12/1985, MRN: 12345"

Example 2: DOCX Letter
Input: "Dear Dr. Smith, Regarding your patient Robert Wilson..."
Output: "Dear Dr. [Physician], Regarding your patient Patient B..."

Example 3: Lab Image
Input: Lab result with patient name in header
Output: Same lab values, header shows "Patient C" with blurred original text
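
The transformation in Example 1 can be reproduced on plain text with the standard library. This is a rough sketch, assuming the real name and its replacement are already known and that dates use MM/DD/YYYY; `anonymize_line` is a hypothetical helper, and the 28-day offset is simply the shift that takes 03/15/1985 to 04/12/1985.

```python
import re
from datetime import datetime, timedelta

DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
MRN_RE = re.compile(r"MRN:?\s*\d{6,10}")


def anonymize_line(text, real_name, fake_name, fake_mrn, offset_days):
    """Replace one patient's name, MRN, and dates in a line of text."""

    def shift(match):
        # Raises ValueError if the matched digits are not a real date.
        original = datetime.strptime(match.group(0), "%m/%d/%Y")
        return (original + timedelta(days=offset_days)).strftime("%m/%d/%Y")

    text = text.replace(real_name, fake_name)
    text = MRN_RE.sub(fake_mrn, text)
    return DATE_RE.sub(shift, text)


line = "Patient: Sarah Johnson, DOB: 03/15/1985, MRN: 987654321"
print(anonymize_line(line, "Sarah Johnson", "Patient A", "MRN: 12345", 28))
# Patient: Patient A, DOB: 04/12/1985, MRN: 12345
```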

Best Practices

  • Consistent replacement - Same patient should get same fake name across all documents
  • Preserve medical timeline - Shift dates by same offset to maintain relative timing
  • Keep clinical relevance - Age ranges, approximate dates matter for medical context
  • Use blackout for handwriting - OCR often misses handwritten PHI
  • Maintain document formatting - Headers, layouts should look professional
  • Test with colleagues - Have others review for missed identifiers
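
The date-shifting practice above can be sketched as follows: every date for a given patient moves by one fixed offset, so intervals between events are unchanged. This assumes the offset is derived from a secret value and a patient key with an HMAC; `offset_for` and `shift_date` are hypothetical names, and the ±30-day range mirrors the workflow step.

```python
import hashlib
import hmac
from datetime import date, timedelta


def offset_for(patient_key, secret, max_days=30):
    """Derive a stable offset in [-max_days, +max_days] for one patient."""
    digest = hmac.new(secret, patient_key.encode(), hashlib.sha256).digest()
    span = 2 * max_days + 1
    return int.from_bytes(digest[:4], "big") % span - max_days


def shift_date(d, patient_key, secret):
    # Same patient, same offset: relative timing is preserved.
    return d + timedelta(days=offset_for(patient_key, secret))


secret = b"keep-this-with-the-originals"  # placeholder value
admit, discharge = date(2024, 3, 1), date(2024, 3, 6)
new_admit = shift_date(admit, "patient-1", secret)
new_discharge = shift_date(discharge, "patient-1", secret)
print(new_discharge - new_admit == discharge - admit)  # True
```

An offset of zero is possible with this scheme; a stricter version could re-derive the offset in that case.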

Common Pitfalls

  • Inconsistent anonymization - Same patient having different fake names across documents
  • Missing image-embedded text - Scanned documents need OCR processing
  • Forgetting metadata - Document properties often contain original author/patient info
  • Over-anonymization - Removing medically relevant information (age, general location)
  • Stripping signature blocks - Your own signature/credentials should remain intact
  • Date logic errors - Creating impossible timelines (discharge before admission)
  • Skipping manual review - Automated tools miss context-dependent identifiers
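
An automated pass before the manual review can catch some of these pitfalls. The sketch below scans already-anonymized text for leftover identifier patterns and for real names you know were in the source. `find_residual_phi` is a hypothetical helper, the patterns are the ones from the Quick Start, and an empty result does not replace a human check.

```python
import re

RESIDUAL_PATTERNS = {
    "mrn": re.compile(r"MRN:?\s*\d{6,10}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}


def find_residual_phi(text, known_names=()):
    """Return (category, matched_text) pairs that still look like PHI."""
    hits = []
    for category, pattern in RESIDUAL_PATTERNS.items():
        hits.extend((category, m.group(0)) for m in pattern.finditer(text))
    lowered = text.lower()
    for name in known_names:
        # Case-insensitive check for names known to appear in the originals.
        if name.lower() in lowered:
            hits.append(("name", name))
    return hits


clean = "Patient: Patient A, DOB: 04/12/1985, MRN: 12345"
leaky = "Patient A (note still mentions Sarah Johnson), call 555-123-4567"
print(find_residual_phi(clean, ["Sarah Johnson"]))  # []
print(find_residual_phi(leaky, ["Sarah Johnson"]))
# [('phone', '555-123-4567'), ('name', 'Sarah Johnson')]
```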

Grade B+
AI Skill Framework

Scorecard

Criteria Breakdown

  • Quick Start - 11/15
  • Workflow - 11/15
  • Examples - 15/20
  • Completeness - 15/20
  • Format - 11/15
  • Conciseness - 11/15