No project description provided

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

form-tools

The raw data for many case management and data systems exist as paper forms. form-tools is a package to help with preprocessing scanned images of these paper forms for further analysis and / or processing. It does this by making use of a template for the form to match and align scanned versions of the document to it, before taking thumbnails of the fields in the scanned document.

Before you begin

form-tools makes use of the pdf2image package for converting document images stored as pdf to image files. As such, you'll need to install poppler. See the pdf2image readme for guidance on how to do so.
The current default OCR engine for matching pages in a form template to its scanned image is tesseract. Please follow the instructions at the link for how to install it.
Computer vision is performed by using the opencv library. This project makes use of the pre-compiled python library for opencv which will be installed by default but you may wish to install opencv from source instead.

On Ubuntu, you can install all the necessary packages by running

sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config poppler-utils

Installation

To install the library run:

pip install form-tools

Basic use

Extracting form metadata

Say you have a form with a pdf template my_form.pdf. To pre-process scanned copies of the form you'll first need to create an image directory for your template as well as a FormMetadata compliant json file.

To do this from the command line and output the metadata to my_form_meta.json and your images to a directory template_images you would run:

form-tools extract-meta my_form.pdf my_form_meta.json --form-image-directory template_images

To interact with the API directly in python you should use the built in PdfFormMetaExtractor class.

from form_tools.form_meta.extractors.pdf_form_extractor import PdfFormMetaExtractor

# Instantiate extractor
pfme = PdfFormMetaExtractor()

# Create FormMetadata object and populate
# image directory template_images
form_metadata = pfme.extract_meta(
    form_template_path="my_form.pdf",
    form_image_dir="template_images"
)

# Write FormMetadata to json file
form_metadata.to_json(
    "my_form_meta.json",
)

The output metadata should contain bounding box coordinates for each field in the form that correspond to regions in the images outputted to template_images.

Note: The output metadata will not be able to be used immediately to align a scanned image to the template as the form_identifier key and identifier key for each form_page in the metadata will need to be populated with a valid regular expression so that the correct page in the scanned image can be compared with the correct page in the template images.

Aligning scanned images to a template

Once you have a complete form metadata file for your template and a populated image directory you can attempt to align a scanned form, say my_scanned_form.pdf to the template and extract field thumbnails.

You will first need to prepare a config file to specify the opencv algorithms to use for the alignment process. An example config.yaml would be as follows:

detector:
  name: SIFT
matcher:
  id: FLANN
  args:
    - algorithm: 1
      trees: 5
    - check: 50
knn: 2
proportion: 0.7
ocr_options:
  rotation_engine: tesseract
  text_extraction_engine: tesseract
pass_directory: s3://my-bucket/pass_directory
fail_directory: s3://my-bucket/fail_directory
form_metadata_directory: metadata

This config specifies that the SIFT algorithm should be used for keypoint detection and the FLANN algorithm should be used for keypoint matching, with 70% of the best keypoints kept (using KNN to decide on which of these are best). Also, note that we've put the output metadata in a metadata subdirectory in our working directory.

To align the scanned image from the command line you would then run:

form-tools process-form my_scanned_form.pdf config.yaml

To interact with the API directly in python you would use the FormOperator class.

from form_tools.form_operators import FormOperator

form_operator = FormOperator.create_from_config("config.yaml")

_ = form_operator.run_full_pipeline(
    form_path="my_scanned_form.pdf",
    pass_dir="s3://my-bucket/pass_directory",
    fail_dir="s3://my-bucket/fail_directory",
    form_meta_directory="metadata",
)

Note: The scanned image could be stored in an AWS S3 bucket. In that case you would pass the S3 path (e.g. s3://my-bucket/my_scanned_form.pdf). Only the config and metadata directory need to be located in your local working directory.

Running documentation locally

mkdocs is used to document form-tools. To run the documentation locally, run mkdocs serve on the command line and follow the link to the local host.

Project details

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Release history Release notifications | RSS feed

This version

0.1.6

Mar 20, 2024

0.1.5

Nov 21, 2023

0.1.4

Nov 9, 2023

0.1.3

Jul 6, 2023

0.1.2

Feb 28, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

form_tools-0.1.6.tar.gz (32.5 kB view hashes)

Uploaded Mar 20, 2024 Source

Built Distribution

form_tools-0.1.6-py3-none-any.whl (38.8 kB view hashes)

Uploaded Mar 20, 2024 Python 3

Hashes for form_tools-0.1.6.tar.gz

Hashes for form_tools-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`ff17d1412ad4d66454fa447139cdabb2021a9f39b25de213328493da3e60b612`
MD5	`e952ce601584f9f60f4591b8d86fdcff`
BLAKE2b-256	`d558ae2bf80da984e44976011927e2df6e22d8fee070b0bfe8a1f1505cbc9bff`

Hashes for form_tools-0.1.6-py3-none-any.whl

Hashes for form_tools-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`11d57f92ac4f893a32e9bee78282d5e5bb7752cbad1ec39567902a6e7393ee6c`
MD5	`b38ed7107b11bf39328509ee22c5b546`
BLAKE2b-256	`38dcccd0acf5775d37956799e0b8773a110656effe252ad6fb8c7cbeceed2777`