Finds differences between two PDF documents
pdf-diff
Finds differences between two PDF documents:
- Compares the text layers of two PDF documents and outputs the bounding boxes of changed text in JSON.
- Rasterizes the changed pages in the PDFs to a PNG and draws red outlines around changed text.
The script is written in Python 3 and relies on the pdftotext program, which ships with Poppler.
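As a rough illustration of the text-layer step: pdftotext's `-bbox` option emits XHTML with one `<word>` element per word, each carrying `xMin`, `yMin`, `xMax` and `yMax` attributes. The sketch below parses that output with the standard library. The function names `parse_bbox_xhtml` and `pdf_word_boxes` and the dictionary layout are assumptions made for this example, not pdf-diff's internal API.

```python
import subprocess
import xml.etree.ElementTree as ET


def parse_bbox_xhtml(xhtml):
    """Return one dict per word found in `pdftotext -bbox` output."""
    boxes = []
    page_number = 0
    # iter() walks the tree in document order, so each <page> is seen
    # before the <word> elements it contains.
    for elem in ET.fromstring(xhtml).iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XHTML namespace, if any
        if tag == "page":
            page_number += 1
        elif tag == "word":
            x_min, y_min = float(elem.get("xMin")), float(elem.get("yMin"))
            x_max, y_max = float(elem.get("xMax")), float(elem.get("yMax"))
            boxes.append({
                "page": page_number,
                "x": x_min,
                "y": y_min,
                "width": x_max - x_min,
                "height": y_max - y_min,
                "text": elem.text or "",
            })
    return boxes


def pdf_word_boxes(pdf_path):
    """Run pdftotext (must be on PATH) and parse its bounding-box output."""
    result = subprocess.run(
        ["pdftotext", "-bbox", pdf_path, "-"],  # "-" sends output to stdout
        check=True,
        capture_output=True,
    )
    return parse_bbox_xhtml(result.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to exercise the parsing logic on a saved XHTML snippet without having Poppler installed.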
Requirements
libxml2 >= 2.7.0, libxslt >= 1.1.23, poppler
Requirements installation for Ubuntu:
sudo apt-get install python3-lxml poppler-utils
Requirements installation for OS X:
brew install libxml2 libxslt poppler
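After installing the requirements, it can be useful to confirm that the external program is actually reachable. The following is a small sketch using only the standard library; the helper name `missing_tools` is hypothetical and not part of pdf-diff.

```python
import shutil


def missing_tools(tools=("pdftotext",)):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing external programs:", ", ".join(missing))
    else:
        print("All external programs found.")
```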
Installation
From PyPI:
pip install pdf-diff
From source:
sudo python3 setup.py install
Running
Turn two PDFs into one large PNG image showing the differences:
pdf-diff before.pdf after.pdf > comparison_output.png
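The same invocation can be driven from a Python script, for example in a batch job. This sketch assumes only what the command above shows, namely that `pdf-diff` takes two PDF paths and writes the PNG to standard output; the helper names `build_command` and `write_comparison` are made up for this example.

```python
import subprocess


def build_command(before_pdf, after_pdf):
    """Argument list equivalent to `pdf-diff before.pdf after.pdf`."""
    return ["pdf-diff", str(before_pdf), str(after_pdf)]


def write_comparison(before_pdf, after_pdf, output_png):
    """Run pdf-diff and save the PNG it writes to stdout."""
    with open(output_png, "wb") as out:
        # check=True raises CalledProcessError if pdf-diff exits non-zero.
        subprocess.run(build_command(before_pdf, after_pdf), stdout=out, check=True)
```

If the command fails, the output file may be left empty or incomplete, so callers may want to delete it when `CalledProcessError` is raised.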
Maintainer Notes
To deploy:
python3 -m pip install --user --upgrade setuptools wheel twine
python3 setup.py sdist bdist_wheel
python3 -m twine upload dist/*
Function flow diagram
compute_changes
│
├── serialize_pdf (called twice)
│   ├── pdf_to_bboxes
│   ├── mark_eol_hyphens
│   │   └── mark_eol_hyphen
│   └── Processes bounding boxes and text
│
├── perform_diff
│   └── Calls external `fast_diff_match_patch`
│
└── process_hunks
    ├── Iterates through diff hunks
    └── mark_difference (called multiple times)

render_changes
│
├── simplify_changes
├── make_pages_images
│   └── pdftopng (converts PDF pages to images)
├── realign_pages
│   ├── Splits pages into sub-pages
│   └── Adjusts box coordinates
├── draw_red_boxes
│   └── Annotates images with rectangles or lines
└── zealous_crop
    └── Crops the image to reduce unnecessary margins

stack_pages
│
└── Combines processed images into a final output
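The compute_changes branch of the diagram (serialize, diff, then map hunks back to boxes) can be approximated in a few lines. This is a minimal sketch of the general technique, not the project's code: it substitutes the standard library's `difflib` for the external `fast_diff_match_patch` module, and the helpers `serialize` and `changed_boxes`, as well as the shape of the box dictionaries, are hypothetical.

```python
import difflib


def serialize(boxes):
    """Join box texts with spaces and record which box owns each character."""
    parts, owners = [], []
    for index, box in enumerate(boxes):
        if parts:
            parts.append(" ")
            owners.append(None)  # the separator belongs to no box
        parts.append(box["text"])
        owners.extend([index] * len(box["text"]))
    return "".join(parts), owners


def changed_boxes(before, after):
    """Return (changed boxes in `before`, changed boxes in `after`)."""
    text_a, owners_a = serialize(before)
    text_b, owners_b = serialize(after)
    hit_a, hit_b = set(), set()
    matcher = difflib.SequenceMatcher(None, text_a, text_b, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "equal":
            continue
        # Map the changed character ranges back to the boxes that own them.
        hit_a.update(i for i in owners_a[a0:a1] if i is not None)
        hit_b.update(i for i in owners_b[b0:b1] if i is not None)
    return [before[i] for i in sorted(hit_a)], [after[i] for i in sorted(hit_b)]
```

Diffing one serialized string per document, rather than comparing word lists pairwise, keeps the diff independent of how the text happens to be split into boxes; the per-character ownership list is what lets each hunk be traced back to the boxes to outline.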