
AI21 Labs Tokenizer

A SentencePiece-based tokenizer for production use



Installation

pip

pip install ai21-tokenizer

poetry

poetry add ai21-tokenizer

Usage

Tokenizer Creation

from ai21_tokenizer import Tokenizer

tokenizer = Tokenizer.get_tokenizer()
# Your code here

Another way would be to use our Jurassic model directly:

from ai21_tokenizer import JurassicTokenizer

model_path = "<Path to your vocabulary file. This is usually a binary file that ends with .model>"
config = {}  # dictionary containing the contents of your config.json file
tokenizer = JurassicTokenizer(model_path=model_path, config=config)
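If your config lives in a config.json file on disk, a small helper can load it into the dict the constructor expects. This is a minimal sketch using only the standard library; the file path and keys shown are placeholders, not values from the package:

```python
import json

def load_config(path: str) -> dict:
    """Read a tokenizer config.json into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# config = load_config("config.json")
# tokenizer = JurassicTokenizer(model_path=model_path, config=config)
```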

Functions

Encode and Decode

These functions let you encode your text to a list of token ids and decode it back to plaintext.

text_to_encode = "apple orange banana"
encoded_text = tokenizer.encode(text_to_encode)
print(f"Encoded text: {encoded_text}")

decoded_text = tokenizer.decode(encoded_text)
print(f"Decoded text: {decoded_text}")

What if you want to convert your ids to tokens, or vice versa?

tokens = tokenizer.convert_ids_to_tokens(encoded_text)
print(f"Tokens corresponding to the ids: {tokens}")

ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Ids corresponding to the tokens: {ids}")
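Conceptually, these two conversions are inverse vocabulary lookups. The following self-contained sketch illustrates the idea with a made-up four-token vocabulary; it is not the library's actual implementation:

```python
# Toy vocabulary: SentencePiece marks a leading space with the "▁" symbol.
vocab = {"<unk>": 0, "▁apple": 1, "▁orange": 2, "▁banana": 3}
inv_vocab = {i: t for t, i in vocab.items()}

def convert_tokens_to_ids(tokens):
    # Tokens missing from the vocabulary fall back to the <unk> id.
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

def convert_ids_to_tokens(ids):
    return [inv_vocab[i] for i in ids]

ids = convert_tokens_to_ids(["▁apple", "▁orange", "▁banana"])      # [1, 2, 3]
tokens = convert_ids_to_tokens(ids)  # ["▁apple", "▁orange", "▁banana"]
```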

For more examples, please see our examples folder.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ai21_tokenizer-0.3.10.tar.gz (2.6 MB)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

ai21_tokenizer-0.3.10-py3-none-any.whl (2.7 MB)

Uploaded Python 3

File details

Details for the file ai21_tokenizer-0.3.10.tar.gz.

File metadata

  • Download URL: ai21_tokenizer-0.3.10.tar.gz
  • Upload date:
  • Size: 2.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.9.18

File hashes

Hashes for ai21_tokenizer-0.3.10.tar.gz
  • SHA256: 860ec23f9b3b70cd516d010720a6628ea5af64c717f562ceca670d596651e0b7
  • MD5: 74ea123e5c998a4c112193f7246616d7
  • BLAKE2b-256: 43d4e3fbb23b09748313db8ffb810a34e7856183785871a3a3c296a0084fa4ba

See more details on using hashes here.

File details

Details for the file ai21_tokenizer-0.3.10-py3-none-any.whl.

File hashes

Hashes for ai21_tokenizer-0.3.10-py3-none-any.whl
  • SHA256: 6b1c404b7fc75f5526d625cc3462c9ae763dde7a8e7bd17b1e366c1deb5f84ba
  • MD5: dbc4dd02288582e5b33a844d8693c9f1
  • BLAKE2b-256: 714ff92808788dd19edcc140a70e1b7cf9bb6bfaf4ec04b99d9a6db64acce758

See more details on using hashes here.
