OpenNMT tokenization library

Tokenizer

Tokenizer is a fast, generic, and customizable text tokenization library for C++ and Python with minimal dependencies.

Overview

By default, the Tokenizer applies a simple tokenization based on Unicode types. It can be customized in several ways:

  • Reversible tokenization
    Marking joints or spaces by annotating tokens or injecting modifier characters.
  • Subword tokenization
    Support for training and using BPE and SentencePiece models.
  • Advanced text segmentation
    Split digits, segment on case or alphabet change, segment each character of selected alphabets, etc.
  • Case management
    Lowercase text and return case information as a separate feature or inject case modifier tokens.
  • Protected sequences
    Sequences can be protected against tokenization with the special characters "｟" (U+FF5F) and "｠" (U+FF60).

See the available options for an overview of supported features.
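
To make the idea of protected sequences concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the function name tokenize_with_protection and the regular expression are assumptions chosen for illustration, and the splitting rule (runs of word characters, single punctuation marks) is only a rough stand-in for the Tokenizer's Unicode-based behavior.

```python
import re

# Marker characters for protected sequences (U+FF5F and U+FF60).
PH_OPEN = "\uff5f"
PH_CLOSE = "\uff60"

# Hypothetical pattern: try a whole protected span first, then fall back to
# runs of word characters or single non-space, non-word characters.
_TOKEN = re.compile(
    re.escape(PH_OPEN) + r".*?" + re.escape(PH_CLOSE)  # protected span, kept whole
    + r"|\w+"                                           # word-like run
    + r"|[^\w\s]"                                       # single punctuation mark
)


def tokenize_with_protection(text):
    """Split text into tokens, leaving marked spans untouched (illustrative only)."""
    return _TOKEN.findall(text)


if __name__ == "__main__":
    # The phone number is split apart unless it is wrapped in the markers.
    print(tokenize_with_protection("Call +1 555-0100 now!"))
    print(tokenize_with_protection("Call \uff5f+1 555-0100\uff60 now!"))
```

Because the protected alternative comes first in the pattern, anything between the two markers, including spaces and punctuation, is returned as a single token.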

Using

The Tokenizer can be used from Python, C++, or the command line. Each mode exposes the same set of options.

Python API

$ pip install pyonmttok

>>> import pyonmttok
>>> tokenizer = pyonmttok.Tokenizer("conservative", joiner_annotate=True)
>>> tokens, _ = tokenizer.tokenize("Hello World!")
>>> tokens
['Hello', 'World', '￭!']
>>> tokenizer.detokenize(tokens)
'Hello World!'

See the Python API description for more details.
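
The example above is reversible because the joiner marker records where a token was attached to its neighbor. The following pure-Python sketch illustrates that principle without depending on pyonmttok. The functions sketch_tokenize and sketch_detokenize are hypothetical names, and the splitting rule is a deliberate simplification: it always attaches the marker to the left side of a token, which the real Tokenizer does not necessarily do, and it assumes words are separated by single spaces.

```python
import re

# Joiner marker (U+FFED), assumed here to mean "no space before this token".
JOINER = "\uffed"


def sketch_tokenize(text):
    """Split on whitespace, then split words from punctuation.

    Every piece after the first one in a whitespace-delimited chunk gets a
    joiner prefix, recording that it was glued to the previous piece.
    """
    tokens = []
    for chunk in text.split():
        pieces = re.findall(r"\w+|[^\w\s]", chunk)
        for index, piece in enumerate(pieces):
            tokens.append(piece if index == 0 else JOINER + piece)
    return tokens


def sketch_detokenize(tokens):
    """Rebuild the text: joiner-prefixed tokens attach directly, others get a space."""
    parts = []
    for token in tokens:
        if token.startswith(JOINER):
            parts.append(token[len(JOINER):])
        else:
            if parts:
                parts.append(" ")
            parts.append(token)
    return "".join(parts)


if __name__ == "__main__":
    tokens = sketch_tokenize("Hello World!")
    print(tokens)
    print(sketch_detokenize(tokens))
```

Without the marker, the token list ['Hello', 'World', '!'] could come from either "Hello World!" or "Hello World !", so detokenization would have to guess.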

C++ API

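#include <string>
#include <vector>

#include <onmt/Tokenizer.h>

using namespace onmt;

int main() {
  // Conservative mode, with joiner annotation so the output can be detokenized.
  Tokenizer tokenizer(Tokenizer::Mode::Conservative, Tokenizer::Flags::JoinerAnnotate);
  std::vector<std::string> tokens;
  tokenizer.tokenize("Hello World!", tokens);  // fills `tokens` with the result
}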

See the Tokenizer class for more details.

Command line clients

$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate
Hello World ￭!
$ echo "Hello World!" | cli/tokenize --mode conservative --joiner_annotate | cli/detokenize
Hello World!

Use the -h flag to list the available options.

Development

Dependencies

Compiling

CMake and a compiler that supports the C++11 standard are required to compile the project.

mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=<Release or Debug> ..
make

This produces the dynamic library libOpenNMTTokenizer and the tokenization clients in cli/.

  • To compile only the library, use the -DLIB_ONLY=ON flag.
  • To compile with the ICU Unicode backend, use the -DWITH_ICU=ON flag.

Testing

The tests use Google Test, which is included as a Git submodule. Run the tests with:

test/onmt_tokenizer_test ../test/data
