AI Skill Report Card
Manipulating PDFs
Quick Start
```python
import PyPDF2

# Extract text from every page of a PDF
with open("document.pdf", "rb") as file:
    reader = PyPDF2.PdfReader(file)
    text = ""
    for page in reader.pages:
        # extract_text() can come back empty for image-only pages
        text += page.extract_text() or ""

print(text)
```
Workflow
PDF Analysis & Processing Checklist
Progress:
- Identify PDF type (text-based, scanned, form)
- Choose appropriate extraction method
- Process content (text, tables, forms)
- Output in desired format
- Validate results
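The first checklist item, identifying the PDF type, can be approximated in code. The following is a minimal sketch, assuming PyPDF2 3.x. The helper names `classify_pdf` and `inspect_pdf` and the 20-character threshold are illustrative assumptions, not part of any library.

```python
def classify_pdf(has_form_fields, text_chars, min_chars=20):
    """Return 'form', 'text-based', or 'scanned' from two simple signals.

    min_chars is an arbitrary illustrative threshold, not a library default.
    """
    if has_form_fields:
        return "form"
    if text_chars >= min_chars:
        return "text-based"
    return "scanned"


def inspect_pdf(path):
    """Open a PDF with PyPDF2 and classify it (hypothetical helper)."""
    import PyPDF2  # imported lazily so classify_pdf works without it

    reader = PyPDF2.PdfReader(path)
    if reader.is_encrypted:
        return "encrypted"  # decrypt before inspecting any further
    fields = reader.get_fields()  # None when the file has no form fields
    # Count extractable characters; image-only pages contribute nothing
    text_chars = sum(
        len((page.extract_text() or "").strip()) for page in reader.pages
    )
    return classify_pdf(bool(fields), text_chars)
```

Calling `inspect_pdf("document.pdf")` returns one label that can drive the choice of extraction method. A file with very little real text may be misclassified as scanned, so treat the result as a hint.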
Step-by-Step Process
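- Install Dependencies

  ```bash
  pip install PyPDF2 pandas tabula-py fpdf2 pytesseract pdf2image pdfplumber
  ```

  tabula-py needs a Java runtime, pytesseract needs the Tesseract binary, and pdf2image needs Poppler; install these separately.

- Analyze PDF Structure

  ```python
  import PyPDF2

  reader = PyPDF2.PdfReader("file.pdf")
  print(f"Encrypted: {reader.is_encrypted}")
  # Reading the page count of an encrypted file requires decrypting it first
  if not reader.is_encrypted:
      print(f"Pages: {len(reader.pages)}")
  ```

- Choose Processing Method
  - Text-based: PyPDF2 or pdfplumber
  - Scanned: OCR with pytesseract
  - Tables: tabula-py
  - Forms: PyPDF2 form handling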
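pdfplumber is listed as an option for text-based PDFs, so a short extraction sketch may help. It assumes pdfplumber is installed (`pip install pdfplumber`); `join_pages` and `extract_with_pdfplumber` are illustrative names rather than library functions.

```python
def join_pages(page_texts):
    """Join per-page text, treating None (no text found) as empty."""
    return "\n".join(text or "" for text in page_texts)


def extract_with_pdfplumber(path):
    """Extract text page by page with pdfplumber (illustrative helper)."""
    import pdfplumber  # lazy import: only needed when this function runs

    # The context manager closes the underlying file handle on exit
    with pdfplumber.open(path) as pdf:
        return join_pages(page.extract_text() for page in pdf.pages)
```

The `text or ""` guard matters because a page without extractable text may yield `None` rather than an empty string.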
Examples
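Example 1: Merge PDFs

Input: Multiple PDF files in directory

```python
import glob

import PyPDF2

output_name = "merged.pdf"
merger = PyPDF2.PdfMerger()
# Skip a previous output file so a rerun does not merge it into itself
for pdf in sorted(glob.glob("*.pdf")):
    if pdf != output_name:
        merger.append(pdf)
merger.write(output_name)
merger.close()
```

Output: Single merged PDF file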
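A merged file does not necessarily inherit the title or author of its inputs. The sketch below reuses the first file's document info, assuming PyPDF2 3.x, where `PdfReader.metadata` may be `None` and `PdfMerger.add_metadata` accepts a dictionary. `clean_metadata` and `merge_with_metadata` are hypothetical helper names.

```python
def clean_metadata(info):
    """Drop None values and coerce keys and values to plain strings."""
    if not info:
        return {}
    return {
        str(key): str(value)
        for key, value in info.items()
        if value is not None
    }


def merge_with_metadata(paths, output_path):
    """Merge PDFs and reuse the first file's document info (illustrative)."""
    import PyPDF2  # lazy import so clean_metadata has no dependencies

    first_info = PyPDF2.PdfReader(paths[0]).metadata  # may be None
    merger = PyPDF2.PdfMerger()
    for path in paths:
        merger.append(path)
    merger.add_metadata(clean_metadata(first_info))
    merger.write(output_path)
    merger.close()
```

Taking the metadata from the first input is a design choice made for this sketch; a pipeline could just as well pass its own `/Title` and `/Author` values.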
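Example 2: Extract Tables

Input: PDF with tabular data

```python
import tabula

# read_pdf returns a list of DataFrames, one per detected table
tables = tabula.read_pdf("report.pdf", pages="all")
for i, table in enumerate(tables, start=1):
    table.to_csv(f"extracted_table_{i}.csv", index=False)
```

Output: One CSV file per detected table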
Example 3: Split PDF

Input: Multi-page PDF

```python
import PyPDF2

reader = PyPDF2.PdfReader("document.pdf")
for i, page in enumerate(reader.pages):
    # A fresh writer per page yields one single-page file each
    writer = PyPDF2.PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)
```

Output: Individual PDF files per page
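Splitting often targets a subset of pages rather than every page. The sketch below parses a 1-based range specification and validates it against the actual page count. The spec syntax (`"1-3,5"`) and the names `parse_page_ranges` and `extract_pages` are assumptions made for illustration.

```python
def parse_page_ranges(spec, page_count):
    """Turn a 1-based spec like '1-3,5' into sorted 0-based page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        if first < 1 or last > page_count or first > last:
            raise ValueError(f"Page range {part!r} is outside 1-{page_count}")
        indices.update(range(first - 1, last))
    return sorted(indices)


def extract_pages(src, dest, spec):
    """Write the pages selected by spec to a new PDF (illustrative helper)."""
    import PyPDF2  # lazy import so the parser can be used on its own

    reader = PyPDF2.PdfReader(src)
    writer = PyPDF2.PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dest, "wb") as output:
        writer.write(output)
```

Because the range is checked against `len(reader.pages)`, an out-of-range request fails with a clear error instead of an `IndexError` deep inside the loop.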
Example 4: OCR Scanned PDF

Input: Image-based PDF

```python
from pdf2image import convert_from_path
import pytesseract

# Render each page to an image, then run OCR on it
images = convert_from_path("scanned.pdf")
text = ""
for img in images:
    text += pytesseract.image_to_string(img, lang="eng")

print(text)
```

Output: Extracted text from scanned document
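OCR is slow, so it is usually worth trying plain text extraction first and falling back to OCR only when little or nothing comes back. This is a minimal sketch of that idea; the 20-character threshold and the names `needs_ocr` and `extract_text_with_fallback` are illustrative assumptions.

```python
def needs_ocr(text, min_chars=20):
    """Return True when extracted text is too short to be a real text layer."""
    return len((text or "").strip()) < min_chars


def extract_text_with_fallback(path, lang="eng"):
    """Try PyPDF2 first, then fall back to OCR (illustrative helper)."""
    import PyPDF2  # lazy imports keep needs_ocr dependency-free

    reader = PyPDF2.PdfReader(path)
    text = "".join((page.extract_text() or "") for page in reader.pages)
    if not needs_ocr(text):
        return text

    # Only pay the OCR cost when the text layer is missing or nearly empty
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(path)
    return "\n".join(
        pytesseract.image_to_string(img, lang=lang) for img in images
    )
```

The threshold applies to the whole document, so a mostly scanned file with one text page would skip OCR; a per-page check is a reasonable refinement.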
Example 5: Fill PDF Forms

Input: PDF form with fillable fields

```python
import PyPDF2

reader = PyPDF2.PdfReader("form.pdf")
writer = PyPDF2.PdfWriter()
# Copy the document root (including the form definition) and all pages,
# so that writer.pages is populated before the fields are updated
writer.clone_document_from_reader(reader)
writer.update_page_form_field_values(
    writer.pages[0], {"field_name": "New Value"}
)
with open("filled_form.pdf", "wb") as output:
    writer.write(output)
```

Output: PDF with populated form fields
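Field names in a real form rarely match a guess like `field_name`, so it helps to list them and check the values to be written before filling. A minimal sketch, assuming PyPDF2 3.x, where `PdfReader.get_fields()` returns a dictionary or `None`; `list_form_fields` and `unknown_fields` are hypothetical helpers.

```python
def unknown_fields(available, values):
    """Return the keys in values that do not exist in the form, sorted."""
    return sorted(set(values) - set(available))


def list_form_fields(path):
    """Return the sorted field names of a PDF form (illustrative helper)."""
    import PyPDF2  # lazy import so unknown_fields works without it

    fields = PyPDF2.PdfReader(path).get_fields() or {}
    return sorted(fields)
```

A typical check before filling is `unknown_fields(list_form_fields("form.pdf"), {"field_name": "New Value"})`; a non-empty result means at least one value would have no field to land in.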
Best Practices
- Use pdfplumber for complex text extraction - Better layout preservation than PyPDF2
- Batch process with pathlib - Handle multiple files efficiently
- Set OCR language explicitly - `pytesseract.image_to_string(img, lang='eng')`
- Handle encrypted PDFs - Check `is_encrypted` before processing
- Preserve metadata - Copy document info when merging/splitting
- Use temporary files for large operations - Avoid memory issues
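Two of the practices above, batch processing with pathlib and checking `is_encrypted`, combine naturally. The sketch below collects page counts for every readable PDF in a directory and assumes PyPDF2 3.x; `is_pdf`, `find_pdfs`, and `page_counts` are illustrative names.

```python
from pathlib import Path


def is_pdf(path):
    """Return True for a .pdf suffix, ignoring case."""
    return Path(path).suffix.lower() == ".pdf"


def find_pdfs(directory):
    """Return the PDF files directly inside directory, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir() if p.is_file() and is_pdf(p)
    )


def page_counts(directory):
    """Map each unencrypted PDF's file name to its page count."""
    import PyPDF2  # lazy import so the path helpers stay dependency-free

    counts = {}
    for path in find_pdfs(directory):
        with open(path, "rb") as handle:
            reader = PyPDF2.PdfReader(handle)
            if reader.is_encrypted:
                continue  # would need reader.decrypt(password) first
            counts[path.name] = len(reader.pages)
    return counts
```

Skipping encrypted files silently is a simplification; a production script would more likely log them or prompt for a password.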
Common Pitfalls
- Don't assume text extraction always works - Some PDFs are images
- Don't ignore PDF permissions - Check if extraction is allowed
- Don't process without validation - Verify page count and structure first
- Don't hardcode page numbers - Use dynamic page detection
- Don't forget to close file handles - Use context managers (`with` statements)
- Don't mix coordinate systems - PDF coordinates start bottom-left, not top-left
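The coordinate pitfall comes down to flipping the y axis against the page height. A minimal sketch, assuming coordinates in PDF points and a page whose media box starts at (0, 0); the function names are illustrative.

```python
def to_top_left(x, y, page_height):
    """Convert a bottom-left-origin PDF point to top-left-origin coordinates."""
    return x, page_height - y


def to_bottom_left(x, y_from_top, page_height):
    """Convert a top-left-origin point back to PDF's bottom-left origin."""
    return x, page_height - y_from_top
```

With PyPDF2 the page height is available as `float(page.mediabox.height)`. On a US Letter page (792 points tall), a point one inch below the top edge sits at y = 720 in PDF space and y = 72 in top-left space.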