
Fast and Customizable Tokenizers



Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Python bindings over the Rust implementation. If you are interested in the high-level design, you can check it out there.

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (BERT WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation: it takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for both research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, and add the special tokens your model needs.
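As an illustration of that last point, here is a toy pure-Python sketch of the truncate/pad/special-tokens step. The token names `[CLS]`, `[SEP]`, `[PAD]` and the `preprocess` helper are invented for this example; the library does all of this for you, in Rust:

```python
# Toy sketch of the pre-processing pipeline: truncate to a maximum
# length, add special tokens, then pad. The token names and the
# `preprocess` helper are invented for this illustration.

def preprocess(tokens, max_len=8, pad_token="[PAD]"):
    # Reserve two slots for the special tokens, truncating if needed
    tokens = tokens[: max_len - 2]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Pad on the right up to max_len
    tokens += [pad_token] * (max_len - len(tokens))
    return tokens

print(preprocess(["i", "can", "feel", "the", "magic"]))
# ['[CLS]', 'i', 'can', 'feel', 'the', 'magic', '[SEP]', '[PAD]']
```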

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install setuptools_rust
python setup.py install

Using the provided Tokenizers

Using a pre-trained tokenizer is really simple:

from tokenizers import BPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = BPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train yours just as simply:

from tokenizers import BPETokenizer

# Initialize a tokenizer
tokenizer = BPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# And you can use it
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory", "my-bpe")
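To build some intuition for what `train` learns, here is a toy pure-Python sketch of a single BPE merge step. It is illustrative only, not the library's actual (Rust) implementation: each step counts adjacent symbol pairs across the corpus and merges the most frequent one into a new symbol.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus of words."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a new merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in words.items()}

# Words (symbols separated by spaces) with their corpus frequencies
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
best = most_frequent_pair(words)   # ('w', 'e'), seen 8 times
words = merge_pair(words, best)
print(words)                       # {'l o w': 5, 'l o we r': 2, 'n e we s t': 6}
```

Real BPE training repeats this step until the target vocabulary size is reached, recording each merge in order (that ordered list is what `merges.txt` stores).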

Provided Tokenizers

  • BPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
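For intuition, WordPiece (the scheme behind BertWordPieceTokenizer) splits each word greedily into the longest matching vocabulary pieces, marking word-internal pieces with a `##` prefix. A toy pure-Python sketch with a made-up vocabulary (the real tokenizer also handles normalization, casing, and configurable unknown-token policies):

```python
# Toy sketch of greedy longest-match-first WordPiece tokenization.
# The vocabulary below is invented for this illustration.

def wordpiece(word, vocab, unk="[UNK]"):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one matches
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            return [unk]  # no piece matched: the whole word is unknown
    return tokens

vocab = {"magic", "mag", "##ic", "feel", "##ing"}
print(wordpiece("magic", vocab))    # ['magic']
print(wordpiece("feeling", vocab))  # ['feel', '##ing']
```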

Build your own

You can also easily build your own tokenizers, by putting all the different parts you need together:

Use a pre-trained tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load a BPE Model
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
bpe = models.BPE.from_files(vocab, merges)

# Initialize a tokenizer
tokenizer = Tokenizer(bpe)

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

# Or tokenize multiple sentences at once:
encoded = tokenizer.encode_batch([
	"I can feel the magic, can you?",
	"The quick brown fox jumps over the lazy dog"
])
print(encoded)
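The alignment tracking listed in the features can be pictured in plain Python: if each token carries (start, end) character offsets into the original input, the original span is always recoverable by slicing. The offsets below are hand-written for illustration; in the library they are computed through normalization for you.

```python
# Plain-Python illustration of offset-based alignment: slicing the
# original sentence with a token's offsets recovers its source span.

sentence = "I can feel the magic, can you?"
# (token, (start, end)) pairs, hand-written for this example
tokens_with_offsets = [("I", (0, 1)), ("can", (2, 5)), ("feel", (6, 10))]

for token, (start, end) in tokens_with_offsets:
    # Each token maps back to the exact part of the original sentence
    assert sentence[start:end] == token
```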

Train a new tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE.empty())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()

# And then train
trainer = trainers.BpeTrainer(vocab_size=20000, min_frequency=2)
tokenizer.train(trainer, [
	"./path/to/dataset/1.txt",
	"./path/to/dataset/2.txt",
	"./path/to/dataset/3.txt"
])

# Now we can encode
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded)


Download files

Download the file for your platform.

Source Distribution

  • tokenizers-0.4.1.tar.gz (62.7 kB)

Built Distributions

  • tokenizers-0.4.1-cp38-cp38-win_amd64.whl (1.1 MB): CPython 3.8, Windows x86-64
  • tokenizers-0.4.1-cp38-cp38-manylinux1_x86_64.whl (7.5 MB): CPython 3.8, manylinux1 x86-64
  • tokenizers-0.4.1-cp38-cp38-macosx_10_15_x86_64.whl (1.1 MB): CPython 3.8, macOS 10.15+ x86-64
  • tokenizers-0.4.1-cp37-cp37m-win_amd64.whl (1.1 MB): CPython 3.7m, Windows x86-64
  • tokenizers-0.4.1-cp37-cp37m-manylinux1_x86_64.whl (5.6 MB): CPython 3.7m, manylinux1 x86-64
  • tokenizers-0.4.1-cp37-cp37m-macosx_10_15_x86_64.whl (1.1 MB): CPython 3.7m, macOS 10.15+ x86-64
  • tokenizers-0.4.1-cp36-cp36m-win_amd64.whl (1.1 MB): CPython 3.6m, Windows x86-64
  • tokenizers-0.4.1-cp36-cp36m-manylinux1_x86_64.whl (3.7 MB): CPython 3.6m, manylinux1 x86-64
  • tokenizers-0.4.1-cp36-cp36m-macosx_10_15_x86_64.whl (1.1 MB): CPython 3.6m, macOS 10.15+ x86-64
  • tokenizers-0.4.1-cp35-cp35m-win_amd64.whl (1.1 MB): CPython 3.5m, Windows x86-64
  • tokenizers-0.4.1-cp35-cp35m-manylinux1_x86_64.whl (1.9 MB): CPython 3.5m, manylinux1 x86-64
  • tokenizers-0.4.1-cp35-cp35m-macosx_10_15_x86_64.whl (1.1 MB): CPython 3.5m, macOS 10.15+ x86-64

File details

All files were uploaded via twine/3.1.1 pkginfo/1.5.0.1 requests/2.22.0 setuptools/41.2.0 requests-toolbelt/0.9.1 tqdm/4.42.1 CPython/3.8.1, without Trusted Publishing.

File hashes

tokenizers-0.4.1.tar.gz
  SHA256      7eda561f7a415198dacee25d3add18fafd61a9eeb4d76c807714b2e82a110ed1
  MD5         7930fd812fbaa02ce1d9bf6c742d17d6
  BLAKE2b-256 987f1492a8c8c045293756e5a2599e5f063f4719b7bbf1c6d7e8ab6688ae5379

tokenizers-0.4.1-cp38-cp38-win_amd64.whl
  SHA256      6b372790b78c7f7f24fec5b33d1e70c8829d4a6267dc31e8bfbc710035a3d355
  MD5         e4c05a213d9446acfcba60f4c8cbaecf
  BLAKE2b-256 8f3f18156ec89eea44f946a695d87b3f7653c84f56bcdb08870ce94a206dbd9f

tokenizers-0.4.1-cp38-cp38-manylinux1_x86_64.whl
  SHA256      7192fd9ba4fcb489da68909cbb1fb5d60d4cb6bc28bb62752577e468f29e800a
  MD5         48e7831c9b9f272fbe91c1b5497ceb41
  BLAKE2b-256 51acca54f809341fca7053c9549963f3da78e9634e229e2a6a5fc58f45ddc2d5

tokenizers-0.4.1-cp38-cp38-macosx_10_15_x86_64.whl
  SHA256      53b13eb03ac7faacd454d946cd9d4cabee62c5c80bbd0adbbe58c85fd5117460
  MD5         02be9cd35d9f0b1a1dd0150c2fd0dcf8
  BLAKE2b-256 890aafe5e3d9bc45b06b02fbe8b6949812ff16d8be3e84bfee641e549767f651

tokenizers-0.4.1-cp37-cp37m-win_amd64.whl
  SHA256      fb4908e954a893e78c67d29c78feec44d22d29137d14db3752efa77edb22a299
  MD5         59cab464be25e7c0232465d620bfc781
  BLAKE2b-256 83083d8930d6d77d7defc4298c792e56d5319a353feebb7131e7101dafd0137a

tokenizers-0.4.1-cp37-cp37m-manylinux1_x86_64.whl
  SHA256      7a684f2a91755d4f215983a789480a6b8c8b667d8501c3900705497af3cd6023
  MD5         112ce732ed7f4deea8c66a0721e783ec
  BLAKE2b-256 e2ae36e0617a349f1842851feb8ec642ae2968f5495a25fc3de2028f4ee1dd0b

tokenizers-0.4.1-cp37-cp37m-macosx_10_15_x86_64.whl
  SHA256      2f28627cf5c0e7498023007b98b61ba343e5a29cf313081326a0d38e0ff15b63
  MD5         53a981658d330583ad686b7f9a8a28dc
  BLAKE2b-256 cd47f2eb7ea00921a23bbd784c9f7deebe9e41a404031ffe2c2184d3fc6d3a24

tokenizers-0.4.1-cp36-cp36m-win_amd64.whl
  SHA256      4885b631c883775eb6947135f2c59864c59caf403a57bd38f72699c4cddeba2e
  MD5         ec6f21ff38083c3814527f47986d1836
  BLAKE2b-256 0c7848e029461fe90015173351f2fe76acb10bfcd7f04f99a2bbf2f3cc4debfc

tokenizers-0.4.1-cp36-cp36m-manylinux1_x86_64.whl
  SHA256      a2aa5c6851dd7a10198aa6b5cba5f7634f43c02d151b0bd83fa253d9603b9d98
  MD5         84442e3e87a1c2a0c2563290c3e56541
  BLAKE2b-256 935498a42750038639f3f0ecb2dce65b35a84f1e6abb342720e9e01def86f0db

tokenizers-0.4.1-cp36-cp36m-macosx_10_15_x86_64.whl
  SHA256      8fd8b743a10bb4b2a0c8992ae0662ded11e59c017e64a45ad9ec343f1adef4e7
  MD5         7fb1b8e64f8cd181b42d321a61e0f091
  BLAKE2b-256 99a7071791512a71f561e85118ec05f4071ba200e56ebfa5b5b4643d71134e7c

tokenizers-0.4.1-cp35-cp35m-win_amd64.whl
  SHA256      bec28714a3963d0814a57b284d591cefd9851f7fdbb8281515607bb8b5245940
  MD5         d30da47ded3ea2d7b5a09b0d292ae241
  BLAKE2b-256 608135803dbd642ddbc88ef8f7f402ae9b26386edb31bd2ade07c69fdfe61717

tokenizers-0.4.1-cp35-cp35m-manylinux1_x86_64.whl
  SHA256      ecea049dd3b75fdd6acdd19059772fc2b61492d60e5180966199c89028cd4601
  MD5         7bfad11e59988f75b08c151d4d929d7d
  BLAKE2b-256 0d17ddd3dfdd750c59de239aa783aa2ca2f9375075750ebe60aa040c9e4d16a1

tokenizers-0.4.1-cp35-cp35m-macosx_10_15_x86_64.whl
  SHA256      ea4121bed4c379b5995a6282d866b41c890dd8c25189fedcc9785cee1ca7c810
  MD5         06f90d355c2a729d9f02dff282211c7a
  BLAKE2b-256 157ad7b539271971641e21973a0636e6c12317c3e75e115e51a43d245481548b
