A wrapper around the pdftoppm and pdftocairo command line tools to convert PDF to a PIL Image list.

pdf2image

A Python 2.7 and 3.4+ module that wraps pdftoppm and pdftocairo to convert a PDF into a list of PIL Image objects.

How to install

First you need poppler-utils

pdftoppm and pdftocairo are the tools that do the actual work. They are distributed as part of a larger package called poppler.

Using pip

Windows users will have to install poppler for Windows, then add the bin/ folder to PATH.

Mac users will have to install poppler for Mac.

Linux users will find both tools pre-installed on Ubuntu 16.04+ and Arch Linux. If they are missing, run sudo apt install poppler-utils
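Before installing the Python package, it can help to confirm that both tools are actually reachable. The following is a minimal sketch, assuming Python 3 and that the tools should be found on PATH; the helper name find_poppler_tools is hypothetical and not part of pdf2image.

```python
import shutil


def find_poppler_tools(tools=("pdftoppm", "pdftocairo")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, location in find_poppler_tools().items():
        print("{}: {}".format(tool, location or "not found on PATH"))
```

If either tool is reported as missing, revisit the platform instructions above before continuing.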

Using conda

conda install -c conda-forge poppler

Then you can install the pip package!

pip install pdf2image

If you don't already have Pillow, install it with pip install pillow

How does it work?

from pdf2image import convert_from_path, convert_from_bytes

from pdf2image.exceptions import (
    PDFInfoNotInstalledError,
    PDFPageCountError,
    PDFSyntaxError
)

Then simply do:

images = convert_from_path('/home/belval/example.pdf')

OR

with open('/home/belval/example.pdf', 'rb') as f:
    images = convert_from_bytes(f.read())  # the with block closes the file

OR better yet

import tempfile

with tempfile.TemporaryDirectory() as path:
    images_from_path = convert_from_path('/home/belval/example.pdf', output_folder=path)
    # Do something here

images will be a list of PIL Image objects, one for each page of the PDF document.
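As one possible way to fill in the "Do something here" step, the sketch below writes every page to a PNG file. The helper names (page_filename, save_pages, convert_and_save) and the zero-padded naming scheme are assumptions made for illustration; only convert_from_path and PIL's Image.save come from the libraries themselves.

```python
import os


def page_filename(index, stem="page", ext="png"):
    """Build a zero-padded file name such as page_0001.png (index is 1-based)."""
    return "{}_{:04d}.{}".format(stem, index, ext)


def save_pages(images, out_dir, stem="page"):
    """Save each PIL Image as a PNG in out_dir and return the paths written."""
    paths = []
    for index, image in enumerate(images, start=1):
        path = os.path.join(out_dir, page_filename(index, stem))
        image.save(path, "PNG")
        paths.append(path)
    return paths


def convert_and_save(pdf_path, out_dir):
    """Convert a PDF and save all of its pages (hypothetical convenience wrapper)."""
    # Imported lazily so the helpers above can be used without pdf2image installed.
    from pdf2image import convert_from_path

    return save_pages(convert_from_path(pdf_path), out_dir)
```

Calling convert_and_save('/home/belval/example.pdf', some_existing_folder) would then return the list of files written, one per page.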

Here are the definitions:

convert_from_path(pdf_path, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False)

convert_from_bytes(pdf_file, dpi=200, output_folder=None, first_page=None, last_page=None, fmt='ppm', thread_count=1, userpw=None, use_cropbox=False, strict=False, transparent=False, single_file=False, output_file=str(uuid.uuid4()), poppler_path=None, grayscale=False)
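To illustrate how these keyword arguments combine, here is a small sketch that renders only the first page at low resolution, for example as a preview. The function names and the chosen values (72 DPI, JPEG, grayscale) are illustrative assumptions; the parameter names themselves are taken from the definitions above.

```python
def first_page_preview_options(dpi=72, grayscale=True):
    """Keyword arguments for a low-resolution render of page 1 only."""
    return {
        "dpi": dpi,
        "first_page": 1,   # page numbers are 1-based
        "last_page": 1,    # same as first_page, so a single page is rendered
        "fmt": "jpeg",
        "grayscale": grayscale,
    }


def render_preview(pdf_path, **overrides):
    """Return a PIL Image of the first page (hypothetical helper)."""
    # Imported lazily so the options helper works without pdf2image installed.
    from pdf2image import convert_from_path

    options = first_page_preview_options()
    options.update(overrides)
    return convert_from_path(pdf_path, **options)[0]
```

Any option can be overridden per call, for instance render_preview(path, dpi=150).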

What's new?

  • grayscale parameter allows you to convert images to grayscale (-gray in pdftoppm CLI)
  • single_file parameter allows you to convert the first PDF page only, without adding digits at the end of the output_file
  • Allow the user to specify poppler's installation path with poppler_path
  • Fixed a bug where a PNG buffer containing a non-terminating I-E-N-D sequence would throw an exception
  • Fixed a bug that left open file descriptors when using convert_from_bytes() (Thank you @FabianUken)
  • fmt='tiff' parameter allows you to create .tiff files (You need pdftocairo for this)
  • transparent parameter allows you to generate images with no background instead of the usual white one (You need pdftocairo for this)
  • strict parameter allows you to catch pdftoppm syntax errors with a custom exception type, PDFSyntaxError
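The strict parameter and the exception types imported earlier can be combined roughly as follows. This is a sketch rather than a recommended pattern: the function name convert_strict, the tuple return value, and the error messages are all assumptions made for illustration.

```python
def convert_strict(pdf_path):
    """Return (images, error_message); images is None when conversion fails."""
    # Imported lazily so that importing this module does not require pdf2image.
    from pdf2image import convert_from_path
    from pdf2image.exceptions import (
        PDFInfoNotInstalledError,
        PDFPageCountError,
        PDFSyntaxError,
    )

    try:
        return convert_from_path(pdf_path, strict=True), None
    except PDFInfoNotInstalledError:
        return None, "poppler was not found; is its bin/ folder on PATH?"
    except PDFPageCountError:
        return None, "could not read the page count of the PDF"
    except PDFSyntaxError:
        return None, "the PDF has syntax errors (reported because strict=True)"
```

Returning a message instead of re-raising is only one design choice; a caller that prefers exceptions can simply let them propagate.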

Performance tips

  • Using an output folder is significantly faster if you are using an SSD. Otherwise I/O usually becomes the bottleneck.
  • Using multiple threads can give you some gains, but avoid more than 4 as this will cause an I/O bottleneck (even on my NVMe SSD!).
  • If I/O is your bottleneck, using the JPEG format can lead to significant gains.
  • The PNG format is pretty slow because of its compression.
  • If you want to know the best settings (most settings will be fine anyway) you can clone the project and run python tests.py to get timings.
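For a quick comparison on your own documents, without cloning the project, a small timing harness along these lines may be enough. It is a sketch assuming Python 3; time_call and compare_settings are hypothetical helpers, and the settings passed in must not include output_folder, since the harness supplies its own temporary folder.

```python
import tempfile
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def compare_settings(pdf_path, settings, repeats=1):
    """Time convert_from_path once per labelled dict of keyword arguments."""
    # Imported lazily so time_call can be used without pdf2image installed.
    from pdf2image import convert_from_path

    results = {}
    for label, options in settings.items():
        with tempfile.TemporaryDirectory() as folder:
            results[label] = time_call(
                convert_from_path, pdf_path,
                output_folder=folder, repeats=repeats, **options
            )
    return results
```

A call such as compare_settings(path, {"jpeg": {"fmt": "jpeg"}, "png": {"fmt": "png"}, "4 threads": {"thread_count": 4}}) returns a dict of timings in seconds keyed by label.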

Limitations / known issues

  • A relatively big PDF will use up all your memory and cause the process to be killed (unless you use an output folder)
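Besides using an output folder, another way to bound memory use is to convert a few pages at a time with first_page and last_page. This is an alternative technique that the README does not itself describe. The sketch below assumes the caller already knows the page count; page_chunks and convert_in_chunks are hypothetical helpers.

```python
def page_chunks(page_count, chunk_size):
    """Yield inclusive, 1-based (first_page, last_page) pairs covering the document."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for first in range(1, page_count + 1, chunk_size):
        yield first, min(first + chunk_size - 1, page_count)


def convert_in_chunks(pdf_path, page_count, chunk_size=10):
    """Yield PIL Images one at a time, holding at most one chunk in memory."""
    # Imported lazily so page_chunks can be used without pdf2image installed.
    from pdf2image import convert_from_path

    for first, last in page_chunks(page_count, chunk_size):
        for image in convert_from_path(pdf_path, first_page=first, last_page=last):
            yield image
```

Because convert_in_chunks is a generator, each page can be processed and released before the next chunk is rendered.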
