Tesserocr bindings

Crop, deskew, segment into regions / lines / words, or recognize with tesserocr

Introduction

This package offers OCR-D compliant workspace processors for (much of) the functionality of Tesseract via its Python API wrapper tesserocr. (Each processor is a step in the OCR-D functional model, and can be replaced with an alternative implementation. Data is represented within METS/PAGE.)

This includes image preprocessing (cropping, binarization, deskewing), layout analysis (region, line, word segmentation) and OCR proper. Most processors can operate on different levels of the PAGE hierarchy, depending on the workflow configuration. Image results are referenced (read and written) via AlternativeImage, text results via TextEquiv, deskewing via @orientation, cropping via Border and segmentation via Region / TextLine / Word elements with Coords/@points.
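To illustrate how these results appear in a PAGE file, the following minimal sketch reads them back with the Python standard library. The sample document, its file names, ids and values are made up for illustration, and the namespace assumes the 2019-07-15 version of the PAGE schema; real workspaces would normally be accessed through the OCR-D API rather than parsed by hand.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE schema version 2019-07-15; adjust the namespace to your files.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical PAGE document showing where each kind of result is stored.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.tif" imageWidth="1000" imageHeight="1500">
    <AlternativeImage filename="page.bin.png" comments="binarized"/>
    <Border><Coords points="50,50 950,50 950,1450 50,1450"/></Border>
    <TextRegion id="r1" orientation="-1.2">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 900,100 900,180 100,180"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a Coords/@points string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


root = ET.fromstring(SAMPLE)
page = root.find("pc:Page", NS)

# Image results: derived images are referenced via AlternativeImage
derived_images = [img.get("filename") for img in page.findall("pc:AlternativeImage", NS)]

# Cropping result: the Border polygon
border = parse_points(page.find("pc:Border/pc:Coords", NS).get("points"))

# Deskewing result: the @orientation angle of a region
region = page.find("pc:TextRegion", NS)
angle = float(region.get("orientation"))

# Segmentation and OCR results: line polygon and recognized text
line = region.find("pc:TextLine", NS)
line_polygon = parse_points(line.find("pc:Coords", NS).get("points"))
line_text = line.find("pc:TextEquiv/pc:Unicode", NS).text

print(derived_images, border, angle, line_polygon, line_text)
```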

Installation

Required Ubuntu packages:

  • Tesseract headers (libtesseract-dev)

  • Some Tesseract language models (tesseract-ocr-{eng,deu,frk,...}) or script models (tesseract-ocr-script-{latn,frak,...})

  • Leptonica headers (libleptonica-dev)

make deps-ubuntu # or manually
make deps # or pip install -r requirements
make install # or pip install .

If tesserocr fails to compile with an error like:

$PREFIX/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type; did you mean ‘stdin’?
       static string CleanupString(const char* utf8_str) {
              ^~~~~~
              stdin

This is due to some inconsistencies in the installed Tesseract C headers (a fix is expected for the next Ubuntu upgrade and is already in Debian). Replace string with std::string in $PREFIX/include/tesseract/unicharset.h (from 265:5 onward) and in $PREFIX/include/tesseract/unichar.h (from 164:10 onward).

If tesserocr fails with an error about LSTM/CUBE, you have a mismatch between the Tesseract header, data and pkg-config versions. Run apt policy libtesseract-dev to list the apt-installable versions, and keep all of them consistent. Make sure there are no spurious pkg-config artifacts, e.g. in /usr/local/lib/pkgconfig/tesseract.pc. The same goes for language models.
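As a rough aid for spotting such a mismatch, the sketch below compares the version that pkg-config reports with the version that the compiled tesserocr module is linked against. The helper names are hypothetical, and treating "same major version" as consistent is a simplifying assumption of this sketch, not a rule from this project. It relies on pkg-config --modversion and on tesserocr.tesseract_version(), and returns None where either is unavailable.

```python
import re
import shutil
import subprocess


def major_version(version):
    """Extract the leading major number from a version string such as '4.1.1'."""
    match = re.match(r"\s*(\d+)", version)
    return int(match.group(1)) if match else None


def pkgconfig_version(package="tesseract"):
    """Return the version reported by pkg-config, or None if it cannot be queried."""
    if shutil.which("pkg-config") is None:
        return None
    result = subprocess.run(
        ["pkg-config", "--modversion", package],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip() if result.returncode == 0 else None


def linked_version():
    """Return the Tesseract version tesserocr was built against, or None if not importable."""
    try:
        import tesserocr
    except ImportError:
        return None
    # The first line of the version banner looks like 'tesseract 4.1.1'
    return tesserocr.tesseract_version().splitlines()[0].split()[-1]


def versions_consistent(a, b):
    """True if both versions share a major number, None if either one is unknown."""
    if a is None or b is None:
        return None
    return major_version(a) == major_version(b)


if __name__ == "__main__":
    from_pkgconfig = pkgconfig_version()
    from_module = linked_version()
    print("pkg-config:", from_pkgconfig)
    print("tesserocr: ", from_module)
    print("consistent:", versions_consistent(from_pkgconfig, from_module))
```

If the two versions disagree, checking for a stray tesseract.pc in /usr/local/lib/pkgconfig is a reasonable next step.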

Usage

See the docstrings in the individual processors and the descriptions in ocrd-tool.json.

Available processors are:

Testing

make test

This downloads some test data from <https://github.com/OCR-D/assets> to repo/assets, and runs some basic tests of the Python API as well as the CLIs.

Set PYTEST_ARGS="-s --verbose" to see log output (-s) and individual test results (--verbose).
