Unified Tokenizer

UniTok v4

Unified preprocessing for heterogeneous ML tables: text, categorical, and numerical columns in one pipeline.

  • Python package: unitok
  • Current package version: 4.4.2 (from setup.py)
  • Legacy v3 docs: README_v3.md

Why UniTok

UniTok turns raw tabular data into model-ready numeric tables while preserving:

  • Consistent vocabularies across multiple datasets
  • Clear feature definitions (column -> tokenizer -> output feature)
  • Reproducible metadata and saved artifacts
  • Simple unions across datasets via shared keys

Core Ideas

  • UniTok: Orchestrates the preprocessing lifecycle and holds the processed data.
  • Feature: Binds a column to a tokenizer and an output name.
  • Tokenizer: Encodes objects to ids (entity, split, digit, transformers).
  • Vocab: A global token index, shareable across datasets.
  • Meta: Stores the schema, tokenizers, vocabularies, and feature definitions.
  • State: initialized -> tokenized -> organized (sketched below).
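
A minimal sketch of that lifecycle, using only calls from the quickstart below. Mapping calls to state names is our reading of the list above, and the organized state (reached via load/union) is not shown:

import pandas as pd
from unitok import UniTok, Vocab
from unitok.tokenizer import EntityTokenizer

df = pd.DataFrame({'id': ['a', 'b', 'c']})

with UniTok() as ut:  # state: initialized
    # Bind the 'id' column to an EntityTokenizer and mark it as the key.
    ut.add_feature(tokenizer=EntityTokenizer(vocab=Vocab(name='id')), column='id', key=True)

ut.tokenize(df)               # state: tokenized; columns are now numeric
ut.save('sample-ut/minimal')  # writes meta.json, data.pkl, *.vocab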

Install

pip install unitok

Requirements: Python 3.7+, pandas, transformers, tqdm, rich.

Quickstart

import pandas as pd
from unitok import UniTok, Vocab
from unitok.tokenizer import BertTokenizer, TransformersTokenizer, EntityTokenizer, SplitTokenizer, DigitTokenizer

item = pd.read_csv(
    'news-sample.tsv', sep='\t',
    names=['nid', 'category', 'subcategory', 'title', 'abstract'],
    usecols=['nid', 'category', 'subcategory', 'title', 'abstract'],
)
item['abstract'] = item['abstract'].fillna('')

user = pd.read_csv(
    'user-sample.tsv', sep='\t',
    names=['uid', 'history'],
)

interaction = pd.read_csv(
    'interaction-sample.tsv', sep='\t',
    names=['uid', 'nid', 'click'],
)

item_vocab = Vocab(name='nid')
user_vocab = Vocab(name='uid')

with UniTok() as item_ut:
    bert = BertTokenizer(vocab='bert')
    llama = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')

    item_ut.add_feature(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid', key=True)
    item_ut.add_feature(tokenizer=bert, column='title', name='title@bert', truncate=20)
    item_ut.add_feature(tokenizer=llama, column='title', name='title@llama', truncate=20)
    item_ut.add_feature(tokenizer=bert, column='abstract', name='abstract@bert', truncate=50)
    item_ut.add_feature(tokenizer=llama, column='abstract', name='abstract@llama', truncate=50)
    item_ut.add_feature(tokenizer=EntityTokenizer(vocab='category'), column='category')
    item_ut.add_feature(tokenizer=EntityTokenizer(vocab='subcategory'), column='subcategory')

with UniTok() as user_ut:
    user_ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid', key=True)
    user_ut.add_feature(tokenizer=SplitTokenizer(vocab=item_vocab, sep=','), column='history', truncate=30)

with UniTok() as inter_ut:
    inter_ut.add_index_feature(name='index')
    inter_ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid')
    inter_ut.add_feature(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid')
    inter_ut.add_feature(tokenizer=DigitTokenizer(vocab='click', vocab_size=2), column='click')

item_ut.tokenize(item).save('sample-ut/item')
item_vocab.deny_edit()  # lock the item vocabulary so later tables only reference existing ids
user_ut.tokenize(user).save('sample-ut/user')
inter_ut.tokenize(interaction).save('sample-ut/interaction')

Loading Saved Data

from unitok import UniTok

ut = UniTok.load('sample-ut/item')
print(len(ut))
print(ut[0])
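
Since the loaded table supports len() and integer indexing, as shown above, a quick way to eyeball the first few rows:

for i in range(min(3, len(ut))):
    print(ut[i])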

Combining Datasets (Union)

with inter_ut:
    inter_ut.union(user_ut)
    print(inter_ut[0])

  • Soft union (default): links tables and resolves on access
  • Hard union: materializes merged columns (see the analogy below)
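
A rough plain-Python analogy for the two modes; this is not UniTok API, only the join semantics:

# Two "tables": interactions reference users by uid.
users = {0: {'history': [3, 1, 4]}, 1: {'history': [1, 5]}}
interactions = [{'uid': 0, 'nid': 7, 'click': 1}]

# Soft union: resolve the linked user row lazily, on access.
def soft_get(i):
    row = dict(interactions[i])
    row.update(users[row['uid']])  # joined at read time
    return row

# Hard union: materialize the merged columns once, up front.
hard = [{**row, **users[row['uid']]} for row in interactions]

assert soft_get(0) == hard[0]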

CLI

Summarize a saved table:

unitok path/to/data

Add a feature into an existing table (integrate):

unitok integrate path/to/data --file data.tsv --column title --name title@bert \
  --vocab bert --tokenizer transformers --t.key bert-base-uncased

Remove a feature from a saved table:

unitok remove path/to/data --name title@bert

Data Artifacts

Saved directories include:

  • meta.json with schema, tokenizers, and vocabularies (see the peek below)
  • data.pkl with tokenized columns
  • *.vocab pickled vocabularies
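
meta.json is plain JSON, so it can be inspected directly. Its exact keys are not documented here, so treat this as a peek rather than a stable interface:

import json

with open('sample-ut/item/meta.json') as f:
    meta = json.load(f)
print(list(meta))  # list the top-level sections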

Migration From v3

If you have v3 artifacts:

unidep-upgrade-v4 <path>

Notes and Constraints

  • The key feature must be atomic (its tokenizer returns a single id, not a list).
  • Unions require that the shared vocabularies match.
  • truncate=None marks a feature as atomic; list features must set truncate (see the sketch after this list).
  • Feature supersedes the deprecated Job class.
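
For example, re-using the quickstart tokenizers: the key feature below is atomic, while the history feature yields a list of ids and therefore must set truncate:

from unitok import UniTok, Vocab
from unitok.tokenizer import EntityTokenizer, SplitTokenizer

user_vocab = Vocab(name='uid')
item_vocab = Vocab(name='nid')

with UniTok() as ut:
    # Atomic feature: one id per row, so it may serve as the key.
    ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid', key=True)
    # List feature: SplitTokenizer returns a list of ids, so truncate is required.
    ut.add_feature(tokenizer=SplitTokenizer(vocab=item_vocab, sep=','), column='history', truncate=30)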

Repository Layout (High-Level)

  • unitok/ core library
  • UniTokv3/ legacy v3 code
  • dist/ built distributions
  • setup.py, requirements.txt

License

MIT License. See LICENSE.
