AI Skill Report Card

Manipulating PDFs

A- · 85 · Jan 24, 2026

Quick Start

Python
import PyPDF2

# Extract text from every page of a PDF
with open('document.pdf', 'rb') as file:
    reader = PyPDF2.PdfReader(file)
    text = ""
    for page in reader.pages:
        # Image-only pages may yield no text, so fall back to an empty string
        text += page.extract_text() or ""

print(text)

Workflow

Progress:

  • Identify PDF type (text-based, scanned, form)
  • Choose appropriate extraction method
  • Process content (text, tables, forms)
  • Output in desired format
  • Validate results
  1. Install Dependencies

    Bash
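    # tabula-py also needs a Java runtime, pytesseract the Tesseract binary, pdf2image Poppler
    pip install PyPDF2 pandas tabula-py fpdf2 pytesseract pdf2image pdfplumber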
  2. Analyze PDF Structure

    Python
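    import PyPDF2

    reader = PyPDF2.PdfReader('file.pdf')
    print(f"Encrypted: {reader.is_encrypted}")
    # Page access fails on an encrypted file until reader.decrypt() succeeds
    if not reader.is_encrypted:
        print(f"Pages: {len(reader.pages)}")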
  3. Choose Processing Method

    • Text-based: PyPDF2/pdfplumber
    • Scanned: OCR with pytesseract
    • Tables: tabula-py
    • Forms: PyPDF2 form handling
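The choice among these methods can be sketched as a small decision function. This is only an illustration: the names `choose_method` and `inspect_pdf`, the returned labels, and the `min_chars` threshold of 20 characters are assumptions, not part of any library.

```python
from typing import List


def choose_method(is_encrypted: bool, has_form_fields: bool,
                  chars_per_page: List[int], min_chars: int = 20) -> str:
    """Map basic facts about a PDF to one of the processing routes."""
    if is_encrypted:
        return "decrypt-first"
    if has_form_fields:
        return "forms"
    # Pages that yield almost no text are probably scanned images.
    if not chars_per_page or all(n < min_chars for n in chars_per_page):
        return "ocr"
    return "text"


def inspect_pdf(path: str) -> str:
    """Collect the inputs for choose_method() from a real file."""
    import PyPDF2  # imported lazily so choose_method() works without it

    reader = PyPDF2.PdfReader(path)
    if reader.is_encrypted:
        return choose_method(True, False, [])
    has_fields = bool(reader.get_fields())  # get_fields() returns None when there are none
    counts = [len(page.extract_text() or "") for page in reader.pages]
    return choose_method(False, has_fields, counts)
```

Table detection is left out on purpose: whether a page contains tables usually has to be decided by trying tabula-py and checking whether it returns anything.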

Examples

Example 1: Merge PDFs
Input: Multiple PDF files in directory

Python
import glob
import PyPDF2

merger = PyPDF2.PdfMerger()
# Merge in filename order, skipping the output of a previous run
for pdf in sorted(glob.glob("*.pdf")):
    if pdf != "merged.pdf":
        merger.append(pdf)
merger.write("merged.pdf")
merger.close()

Output: Single merged PDF file
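One thing to watch when merging by filename: plain `sorted()` orders names lexicographically, so `page10.pdf` comes before `page2.pdf`. A minimal sketch of a natural-order sort key, assuming the files carry numeric suffixes (the helper name `natural_key` is hypothetical):

```python
import re
from typing import List, Union


def natural_key(name: str) -> List[Union[int, str]]:
    """Split a name into text and number chunks so numbers compare by value."""
    # re.split with a capturing group keeps the digit runs in the result,
    # and the chunks always alternate text, number, text, ...
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


names = ["page10.pdf", "page2.pdf", "page1.pdf"]
print(sorted(names))                   # lexicographic order
print(sorted(names, key=natural_key))  # numeric-aware order
```

Passing `key=natural_key` to the `sorted()` call in the merge loop would then append the files in the order a reader expects.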

Example 2: Extract Tables
Input: PDF with tabular data

Python
import tabula

# read_pdf returns a list of DataFrames, one per detected table
tables = tabula.read_pdf("report.pdf", pages="all")
if tables:
    tables[0].to_csv("extracted_table.csv", index=False)

Output: CSV file with table data

Example 3: Split PDF
Input: Multi-page PDF

Python
import PyPDF2

reader = PyPDF2.PdfReader("document.pdf")
# Write each page to its own file, numbered from 1
for i, page in enumerate(reader.pages):
    writer = PyPDF2.PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Output: Individual PDF files per page
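A variation on the split above is to write fixed-size chunks rather than single pages. The sketch below computes the page ranges from the actual page count instead of hardcoding them; `chunk_ranges`, `split_pdf`, and the `part_N.pdf` naming are hypothetical choices for illustration.

```python
from typing import List, Tuple


def chunk_ranges(page_count: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, stop) page index ranges covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [(start, min(start + chunk_size, page_count))
            for start in range(0, page_count, chunk_size)]


def split_pdf(path: str, chunk_size: int, prefix: str = "part") -> None:
    """Write one output file per chunk of pages."""
    import PyPDF2  # imported lazily so chunk_ranges() works without it

    reader = PyPDF2.PdfReader(path)
    ranges = chunk_ranges(len(reader.pages), chunk_size)
    for number, (start, stop) in enumerate(ranges, start=1):
        writer = PyPDF2.PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        with open(f"{prefix}_{number}.pdf", "wb") as output:
            writer.write(output)
```

With a five-page input and `chunk_size=2`, the last chunk simply holds the single remaining page.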

Example 4: OCR Scanned PDF
Input: Image-based PDF

Python
from pdf2image import convert_from_path
import pytesseract

# Render each page to an image, then run OCR on it
images = convert_from_path("scanned.pdf")
text = ""
for img in images:
    text += pytesseract.image_to_string(img, lang="eng")

Output: Extracted text from scanned document
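Many real documents mix text pages and scanned pages, so running OCR on everything is often wasteful. One possible approach, sketched here under assumptions, is to keep the embedded text where it exists and OCR only the pages that came back nearly empty. The names `merge_text_with_ocr` and `ocr_single_page` and the 20-character threshold are hypothetical.

```python
from typing import Callable, List


def merge_text_with_ocr(page_texts: List[str],
                        ocr_page: Callable[[int], str],
                        min_chars: int = 20) -> List[str]:
    """Keep extracted text per page; call ocr_page(index) for near-empty pages."""
    result = []
    for index, text in enumerate(page_texts):
        if len(text.strip()) < min_chars:
            text = ocr_page(index)
        result.append(text)
    return result


def ocr_single_page(path: str, index: int, lang: str = "eng") -> str:
    """OCR one page (0-based index) of a PDF."""
    from pdf2image import convert_from_path  # lazy imports: only needed for OCR
    import pytesseract

    # pdf2image numbers pages from 1
    images = convert_from_path(path, first_page=index + 1, last_page=index + 1)
    return pytesseract.image_to_string(images[0], lang=lang)
```

A call might look like `merge_text_with_ocr(texts, lambda i: ocr_single_page("scanned.pdf", i))`, where `texts` holds the per-page results of `extract_text()`.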

Example 5: Fill PDF Forms
Input: PDF form with fillable fields

Python
import PyPDF2

reader = PyPDF2.PdfReader("form.pdf")
writer = PyPDF2.PdfWriter()
# Copy the whole document, including the form definition, into the writer
writer.clone_reader_document_root(reader)
# "field_name" is a placeholder; reader.get_fields() lists the real names
writer.update_page_form_field_values(
    writer.pages[0], {"field_name": "New Value"}
)
with open("filled_form.pdf", "wb") as output:
    writer.write(output)

Output: PDF with populated form fields
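A misspelled field name can leave the form unchanged without any obvious sign of failure, so it can help to check the requested names against the form before writing. The following is a rough sketch; `unknown_fields` and `fill_form` are hypothetical helpers, and the `/Annots` guard is a defensive assumption about pages that carry no form widgets.

```python
from typing import Dict, Iterable, List


def unknown_fields(available: Iterable[str], values: Dict[str, str]) -> List[str]:
    """Return the requested field names that the form does not define."""
    return sorted(set(values) - set(available))


def fill_form(src: str, dst: str, values: Dict[str, str]) -> None:
    """Fill a form after verifying that every requested field exists."""
    import PyPDF2  # imported lazily so unknown_fields() works without it

    reader = PyPDF2.PdfReader(src)
    available = reader.get_fields() or {}  # None when the PDF has no form
    missing = unknown_fields(available, values)
    if missing:
        raise KeyError(f"Fields not present in form: {missing}")

    writer = PyPDF2.PdfWriter()
    writer.clone_reader_document_root(reader)
    for page in writer.pages:
        if "/Annots" in page:  # skip pages without form widgets
            writer.update_page_form_field_values(page, values)
    with open(dst, "wb") as output:
        writer.write(output)
```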

Best Practices

  • Use pdfplumber for complex text extraction - Better layout preservation than PyPDF2
  • Batch process with pathlib - Handle multiple files efficiently
  • Set OCR language explicitly - pytesseract.image_to_string(img, lang='eng')
  • Handle encrypted PDFs - Check is_encrypted before processing
  • Preserve metadata - Copy document info when merging/splitting
  • Use temporary files for large operations - Avoid memory issues
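Two of these practices, batch processing with pathlib and checking `is_encrypted`, combine naturally. A minimal sketch, assuming a directory tree of PDFs; the helper names `iter_pdfs` and `page_counts` are hypothetical:

```python
from pathlib import Path
from typing import Dict, Iterator, Union


def iter_pdfs(root: Union[str, Path]) -> Iterator[Path]:
    """Yield every PDF under root, matching the extension case-insensitively."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            yield path


def page_counts(root: Union[str, Path]) -> Dict[Path, int]:
    """Count pages per file, skipping encrypted documents."""
    import PyPDF2  # imported lazily so iter_pdfs() works without it

    counts = {}
    for path in iter_pdfs(root):
        with path.open("rb") as handle:
            reader = PyPDF2.PdfReader(handle)
            if reader.is_encrypted:
                continue  # would need reader.decrypt(password) first
            counts[path] = len(reader.pages)
    return counts
```

Opening each file in a `with` block also keeps only one handle open at a time, which matters when the batch is large.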

Common Pitfalls

  • Don't assume text extraction always works - Some PDFs are images
  • Don't ignore PDF permissions - Check if extraction is allowed
  • Don't process without validation - Verify page count and structure first
  • Don't hardcode page numbers - Use dynamic page detection
  • Don't forget to close file handles - Use context managers (with statements)
  • Don't mix coordinate systems - PDF coordinates have their origin at the bottom-left of the page, while image tools and pdfplumber measure from the top-left
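The coordinate pitfall comes down to flipping the y axis against the page height. A small sketch of the conversion, assuming boxes given as `(x0, y0, x1, y1)` in points with a bottom-left origin; both function names are hypothetical:

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def point_to_top_left(x: float, y: float, page_height: float) -> Tuple[float, float]:
    """Convert a bottom-left-origin point to top-left-origin coordinates."""
    return (x, page_height - y)


def bbox_to_top_left(bbox: Box, page_height: float) -> Box:
    """Convert (x0, y0, x1, y1) from bottom-left origin to top-left origin."""
    x0, y0, x1, y1 = bbox
    # The old upper edge (y1) becomes the new top; the old lower edge the new bottom.
    return (x0, page_height - y1, x1, page_height - y0)


# A US Letter page is 612 x 792 points.
print(bbox_to_top_left((72, 700, 200, 720), 792))  # (72, 72, 200, 92)
```

Applying the same function twice returns the original box, which makes a convenient sanity check.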
Grade A-
AI Skill Framework

Scorecard

Criteria Breakdown

  • Quick Start: 11/15
  • Workflow: 11/15
  • Examples: 15/20
  • Completeness: 15/20
  • Format: 11/15
  • Conciseness: 11/15