Tesserocr bindings

ocrd_tesserocr

Crop, deskew, segment into regions / lines / words, or recognize with tesserocr

Introduction

This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a step in the OCR-D functional model, and can be replaced with an alternative implementation. Data is represented within METS/PAGE.)

This includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, line, word segmentation) and OCR proper. Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. Image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
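To make the PAGE elements named above more concrete, here is a minimal sketch (not part of ocrd_tesserocr) that walks a small, hand-written PAGE document with the Python standard library and reads Coords/@points and TextEquiv at region, line and word level. The sample XML, the element ids and the helper names `parse_points` and `outline` are assumptions made for illustration; the namespace is that of the 2019-07-15 PAGE schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 PAGE schema; files in other schema versions
# use a different date in this URI.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written sample for illustration only, not the output of any processor.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="1000" imageHeight="1500">
    <Border><Coords points="10,10 990,10 990,1490 10,1490"/></Border>
    <TextRegion id="r1" orientation="0.5">
      <Coords points="50,50 950,50 950,300 50,300"/>
      <TextLine id="r1_l1">
        <Coords points="60,60 940,60 940,120 60,120"/>
        <Word id="r1_l1_w1">
          <Coords points="60,60 200,60 200,120 60,120"/>
          <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
        </Word>
        <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a Coords/@points string ("x1,y1 x2,y2 ...") into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def outline(page_xml):
    """Collect (level, id, polygon, text) for every region, line and word."""
    root = ET.fromstring(page_xml.strip())
    rows = []
    for level in ("TextRegion", "TextLine", "Word"):
        for elem in root.iter("{%s}%s" % (NS["pc"], level)):
            # Only direct children are queried, so a region does not
            # pick up the coordinates or text of its lines and words.
            coords = elem.find("pc:Coords", NS)
            text = elem.findtext("pc:TextEquiv/pc:Unicode", default=None, namespaces=NS)
            rows.append((level, elem.get("id"), parse_points(coords.get("points")), text))
    return rows


if __name__ == "__main__":
    for level, ident, polygon, text in outline(SAMPLE):
        print(level, ident, polygon, text)
```

In a real workflow the PAGE files are read and written through the OCR-D workspace rather than parsed by hand; the sketch is only meant to show where segmentation and text results live in the hierarchy.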

Installation

Required Ubuntu packages:

  • Tesseract headers (libtesseract-dev)
  • Some Tesseract language models (tesseract-ocr-{eng,deu,frk,...}) or script models (tesseract-ocr-script-{latn,frak,...})
  • Leptonica headers (libleptonica-dev)

From PyPI

This is the best option if you want to use the stable, released version.


NOTE

ocrd_tesserocr requires Tesseract >= 4.1.0. The Tesseract packages bundled with Ubuntu < 19.10 are too old. If you are on Ubuntu 18.04 LTS, please enable Alexander Pozdnyakov's PPA, which has up-to-date builds of Tesseract and its dependencies:

sudo add-apt-repository ppa:alex-p/tesseract-ocr
sudo apt-get update

sudo apt-get install git python3 python3-pip libtesseract-dev libleptonica-dev tesseract-ocr-eng tesseract-ocr wget
pip install ocrd_tesserocr
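Before installing, it can help to confirm that the Tesseract found on your PATH meets the version requirement from the note above. The following is a minimal sketch, not part of ocrd_tesserocr: the helper names `parse_version`, `is_supported` and `installed_banner` are hypothetical, and it assumes that the first line of `tesseract --version` contains the version number (as in `tesseract 4.1.1`).

```python
import re
import shutil
import subprocess

MINIMUM = (4, 1, 0)  # minimum Tesseract version stated in the note above


def parse_version(banner):
    """Extract (major, minor, patch) from text such as 'tesseract 4.1.1'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", banner)
    if match is None:
        raise ValueError("no version number found in %r" % banner)
    # A missing patch level is treated as 0; suffixes like '-beta.1' are ignored.
    return tuple(int(part or 0) for part in match.groups())


def is_supported(banner, minimum=MINIMUM):
    """Return True if the version in the banner is at least the minimum."""
    return parse_version(banner) >= minimum


def installed_banner():
    """Return the first line of `tesseract --version`, or None if unavailable."""
    exe = shutil.which("tesseract")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the banner on stderr
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    banner = installed_banner()
    if banner is None:
        print("tesseract not found on PATH")
    elif is_supported(banner):
        print("OK:", banner)
    else:
        print("too old:", banner)
```

Because pre-release suffixes are ignored, a release candidate of 4.1.0 would pass this check; tighten the comparison if that matters for your setup.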

With docker

This is the best option if you want to run the software in a container.

You need to have Docker installed.

docker pull ocrd/tesserocr

From git

This is the best option if you want to change the source code or install the latest, unpublished changes.

We strongly recommend using venv.

git clone https://github.com/OCR-D/ocrd_tesserocr
cd ocrd_tesserocr
make deps-ubuntu # or manually with apt-get
make deps        # or pip install -r requirements.txt
make install     # or pip install .

Usage

See the docstrings in the individual processors and the ocrd-tool.json descriptions.

Available processors are:

To run with docker:

docker run ocrd/tesserocr ocrd-tesserocr-crop ...

Testing

make test

This downloads some test data from https://github.com/OCR-D/assets under repo/assets, and runs some basic tests of the Python API as well as the CLIs.

Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).

Development

The latest changes, which require a pre-release of ocrd >= 2.0.0, are kept in the edge branch.
