UniTok v4: Unified Tokenizer
Unified preprocessing for heterogeneous ML tables: text, categorical, and numerical columns in one pipeline.
- Python package: unitok
- Current package version: 4.4.2 (from setup.py)
- Legacy v3 docs: README_v3.md
Why UniTok
UniTok turns raw tabular data into model-ready numeric tables while preserving:
- Consistent vocabularies across multiple datasets
- Clear feature definitions (column -> tokenizer -> output feature)
- Reproducible metadata and saved artifacts
- Simple unions across datasets via shared keys
Core Ideas
- UniTok: Orchestrates preprocessing lifecycle and holds processed data.
- Feature: Binds a column to a tokenizer and output name.
- Tokenizer: Encodes objects to ids (entity, split, digit, transformers).
- Vocab: Global index for tokens; shared across datasets.
- Meta: Stores schema, tokenizers, vocabularies, and feature definitions.
- State: lifecycle progresses initialized -> tokenized -> organized.
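The Vocab and Tokenizer ideas can be illustrated with a minimal toy sketch. This is not UniTok's actual implementation; the class and method names here are hypothetical, chosen only to mirror the concepts above:

```python
# Toy illustration: a vocabulary assigns a stable integer id to every
# token it has seen, and a tokenizer maps column values onto those ids.
# Hypothetical names, not UniTok's API.

class ToyVocab:
    def __init__(self, name):
        self.name = name
        self._token2id = {}

    def append(self, token):
        # Assign the next free id on first sight; reuse it afterwards.
        if token not in self._token2id:
            self._token2id[token] = len(self._token2id)
        return self._token2id[token]


class ToyEntityTokenizer:
    """Encodes each value as a single id (an atomic feature)."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, value):
        return self.vocab.append(value)


category_vocab = ToyVocab('category')
tok = ToyEntityTokenizer(category_vocab)
ids = [tok(c) for c in ['sports', 'news', 'sports']]
print(ids)  # repeated tokens share one id: [0, 1, 0]
```

Sharing one ToyVocab instance between two tokenizers is the toy analogue of UniTok's shared vocabularies across datasets.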
Install
pip install unitok
Requirements: Python 3.7+, pandas, transformers, tqdm, rich.
Quickstart
import pandas as pd

from unitok import UniTok, Vocab
from unitok.tokenizer import BertTokenizer, TransformersTokenizer, EntityTokenizer, SplitTokenizer, DigitTokenizer

# Load the three raw tables
item = pd.read_csv(
    'news-sample.tsv', sep='\t',
    names=['nid', 'category', 'subcategory', 'title', 'abstract'],
    usecols=['nid', 'category', 'subcategory', 'title', 'abstract'],
)
item['abstract'] = item['abstract'].fillna('')

user = pd.read_csv(
    'user-sample.tsv', sep='\t',
    names=['uid', 'history'],
)

interaction = pd.read_csv(
    'interaction-sample.tsv', sep='\t',
    names=['uid', 'nid', 'click'],
)

# Shared vocabularies let all three tables reference the same ids
item_vocab = Vocab(name='nid')
user_vocab = Vocab(name='uid')

with UniTok() as item_ut:
    bert = BertTokenizer(vocab='bert')
    llama = TransformersTokenizer(vocab='llama', key='huggyllama/llama-7b')

    item_ut.add_feature(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid', key=True)
    item_ut.add_feature(tokenizer=bert, column='title', name='title@bert', truncate=20)
    item_ut.add_feature(tokenizer=llama, column='title', name='title@llama', truncate=20)
    item_ut.add_feature(tokenizer=bert, column='abstract', name='abstract@bert', truncate=50)
    item_ut.add_feature(tokenizer=llama, column='abstract', name='abstract@llama', truncate=50)
    item_ut.add_feature(tokenizer=EntityTokenizer(vocab='category'), column='category')
    item_ut.add_feature(tokenizer=EntityTokenizer(vocab='subcategory'), column='subcategory')

with UniTok() as user_ut:
    user_ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid', key=True)
    user_ut.add_feature(tokenizer=SplitTokenizer(vocab=item_vocab, sep=','), column='history', truncate=30)

with UniTok() as inter_ut:
    inter_ut.add_index_feature(name='index')
    inter_ut.add_feature(tokenizer=EntityTokenizer(vocab=user_vocab), column='uid')
    inter_ut.add_feature(tokenizer=EntityTokenizer(vocab=item_vocab), column='nid')
    inter_ut.add_feature(tokenizer=DigitTokenizer(vocab='click', vocab_size=2), column='click')

item_ut.tokenize(item).save('sample-ut/item')
item_vocab.deny_edit()  # freeze the item vocabulary before other tables reuse it
user_ut.tokenize(user).save('sample-ut/user')
inter_ut.tokenize(interaction).save('sample-ut/interaction')
Loading Saved Data
from unitok import UniTok
ut = UniTok.load('sample-ut/item')
print(len(ut))
print(ut[0])
Combining Datasets (Union)
with inter_ut:
    inter_ut.union(user_ut)

print(inter_ut[0])
- Soft union (default): links tables and resolves on access
- Hard union: materializes merged columns
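The soft vs. hard distinction can be sketched in plain Python. This is only a rough analogy over dictionaries keyed on `uid`, not UniTok's internals:

```python
# Rough analogy for soft vs hard union over a shared key ('uid').
# Data and helper names are illustrative, not part of UniTok.
interactions = [{'uid': 0, 'nid': 10}, {'uid': 1, 'nid': 11}]
users = {0: {'history': [10]}, 1: {'history': [11, 12]}}

# Soft union: tables stay separate; linked columns resolve on access.
def soft_get(row):
    return {**row, **users[row['uid']]}

# Hard union: merged rows are materialized up front.
hard = [{**row, **users[row['uid']]} for row in interactions]

print(soft_get(interactions[0]))  # {'uid': 0, 'nid': 10, 'history': [10]}
print(hard[1])                    # {'uid': 1, 'nid': 11, 'history': [11, 12]}
```

Both forms yield the same rows; the trade-off is lookup cost on access (soft) versus memory for the merged copy (hard).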
CLI
Summarize a saved table:
unitok path/to/data
Add a feature into an existing table (integrate):
unitok integrate path/to/data --file data.tsv --column title --name title@bert \
--vocab bert --tokenizer transformers --t.key bert-base-uncased
Remove a feature from a saved table:
unitok remove path/to/data --name title@bert
Data Artifacts
Saved directories include:
- meta.json with schema, tokenizers, and vocabularies
- data.pkl with tokenized columns
- *.vocab pickled vocabularies
Migration From v3
If you have v3 artifacts:
unidep-upgrade-v4 <path>
Notes and Constraints
- Key feature must be atomic (tokenizer returns a single id, not a list).
- Shared vocabularies must match for unions.
- truncate=None means an atomic feature; list features must set a truncate length.
- Feature supersedes the deprecated Job class.
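The atomic vs. list distinction can be sketched in plain Python (illustrative only; UniTok's actual encoding and truncation logic may differ):

```python
# Illustrative sketch, not UniTok's code: an atomic feature
# (truncate=None) yields exactly one id, while a list feature yields a
# list of ids cut to at most `truncate` entries.

def encode(ids, truncate=None):
    if truncate is None:
        # Atomic: exactly one id is allowed.
        assert len(ids) == 1, "atomic feature must produce a single id"
        return ids[0]
    # List feature: keep at most `truncate` ids (head of the list here).
    return ids[:truncate]

print(encode([7]))                        # atomic -> 7
print(encode([1, 2, 3, 4], truncate=2))   # list -> [1, 2]
```

This is why a key feature must be atomic: a key has to resolve to a single id so rows can be addressed unambiguously.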
Repository Layout (High-Level)
- unitok/: core library
- UniTokv3/: legacy v3 code
- dist/: built distributions
- setup.py, requirements.txt
License
MIT License. See LICENSE.