Unicode (and other integer) table packer
Project description
packTab
Pack static integer tables into compact multi-level lookup tables to save space. Generates C or Rust code.
Installation
```
pip install packtab
```
Usage
Command line
```sh
# Generate C lookup code
python -m packTab 1 2 3 4

# Generate Rust lookup code
python -m packTab --rust 1 2 3 4

# Generate Rust with unsafe array access
python -m packTab --rust --unsafe 1 2 3 4

# Analyze compression without generating code
python -m packTab --analyze 1 2 3 4

# Read data from stdin
seq 0 255 | python -m packTab --rust

# Tune compression (higher = smaller, slower)
echo "1 2 3 4" | python -m packTab --compression 5
```
As a library
```python
from packTab import pack_table, Code, languages

data = [0, 1, 2, 3, 0, 1, 2, 3]
solution = pack_table(data, default=0, compression=1)

code = Code("mytable")
solution.genCode(code, "lookup", language="c", private=False)
code.print_code(language="c")
```
The pack_table function accepts:
- A list of integers, or a dict mapping integer keys to values
- default: value for missing keys (default 0)
- compression: tunes the size-vs-speed tradeoff (default 1)
- mapping: optional mapping between string values and integers
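Conceptually, the dict form with default behaves like a dict.get over the full key range. A minimal pure-Python model of that lookup semantics (not packTab code; the keys below are illustrative):

```python
def model_lookup(table, default=0):
    """Model of the lookup semantics for dict input:
    keys absent from the dict map to `default`."""
    def lookup(u):
        return table.get(u, default)
    return lookup

# Illustrative sparse mapping: two keys set, everything else defaults to 0.
lookup = model_lookup({0x41: 1, 0x61: 2}, default=0)
```

The packed tables generated by pack_table reproduce exactly this behavior, just without storing the full range explicitly.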
Rust with unsafe access
```python
from packTab import pack_table, Code, languageClasses

data = list(range(256)) * 4
solution = pack_table(data, default=0)

lang = languageClasses["rust"](unsafe_array_access=True)
code = Code("mytable")
solution.genCode(code, "lookup", language=lang, private=False)
code.print_code(language=lang)
```
Examples
Simple linear data
For data that's already sequential, the identity optimization kicks in:
```
$ python -m packTab --analyze $(seq 0 255)
Original data: 256 values, range [0..255]
Original storage: 8 bits/value, 256 bytes total
Found 1 Pareto-optimal solutions:
  0 lookups, 5 extra ops, 0 bytes
Compression ratio: ∞ (computed inline, no storage)
```

The generated code just returns the input: `return u < 256 ? u : 0`.
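In Python terms, that identity-optimized result is equivalent to this sketch (the function name is illustrative):

```python
def lookup(u, default=0):
    # No table at all: for identity data the index IS the value,
    # so the generated code only needs a bounds check.
    return u if 0 <= u < 256 else default
```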
Sparse data
For sparse lookup tables with many repeated values:
```python
from packTab import pack_table, Code

# Sparse Unicode-like table: mostly 0, some special values
data = [0] * 100
data[10] = 5
data[20] = 10
data[50] = 15
data[80] = 20

solution = pack_table(data, default=0)
code = Code("sparse")
solution.genCode(code, "lookup", language="c")
code.print_code(language="c")
```
The packer will use multi-level tables and sub-byte packing to minimize storage.
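To illustrate the multi-level idea (a hand-built sketch, not packTab's actual output): split each index into a high part that selects a block and a low part that indexes within it, and store identical blocks only once. For the sparse data above, the three all-zero blocks collapse into one:

```python
# Hand-built two-level lookup for the sparse example above.
data = [0] * 100
data[10], data[20], data[50], data[80] = 5, 10, 15, 20

BLOCK = 16
seen = {}       # block contents -> position in `values`
index = []      # level 1: which block each high part uses
values = []     # level 2: deduplicated block contents, concatenated

padded = data + [0] * (-len(data) % BLOCK)
for i in range(0, len(padded), BLOCK):
    block = tuple(padded[i:i + BLOCK])
    if block not in seen:
        seen[block] = len(seen)
        values.extend(block)
    index.append(seen[block])

def lookup(u, default=0):
    if not (0 <= u < len(padded)):
        return default
    # u >> 4 selects the block; u & 15 indexes inside it.
    return values[index[u >> 4] * BLOCK + (u & 15)]
```

Here the two levels together (87 entries) are already smaller than the 100-entry original; packTab additionally searches over split points and element widths to do much better.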
Generated code structure
For small datasets, values are inlined as bit-packed constants:
```c
// Input: [1, 2, 3, 4]
extern inline uint8_t data_get (unsigned u)
{
  return u<4 ? (uint8_t)(u)+(uint8_t)(((15u>>(u))&1)) : 0;
}
// Uses identity optimization: data[i] = i + 1, stored as 0b1111
```
For larger datasets, generates lookup tables:
```rust
// Input: 256 values with pattern
static data_u8: [u8; 256] = [ ... ];

#[inline]
pub(crate) fn data_get (u: usize) -> u8
{
    if u<256 { data_u8[u] as u8 } else { 0 }
}
```
How it works
The algorithm builds multi-level lookup tables using dynamic programming to find optimal split points. Values that fit in fewer bits get packed into sub-byte storage (1, 2, or 4 bits per item). An outer layer applies arithmetic reductions (GCD factoring, bias subtraction) before splitting.
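As an illustration of the arithmetic reductions and sub-byte packing (a hand-rolled sketch, not packTab's internals): values like [100, 104, 108, 112] share a bias of 100 and a GCD of 4, so only [0, 1, 2, 3] needs storing, and those fit in 2 bits each:

```python
from math import gcd
from functools import reduce

def reduce_values(values):
    """Subtract the common bias, then factor out the GCD."""
    bias = min(values)
    shifted = [v - bias for v in values]
    g = reduce(gcd, shifted) or 1
    return [v // g for v in shifted], g, bias

def pack_2bit(items):
    """Pack four 2-bit items per byte."""
    out = bytearray((len(items) + 3) // 4)
    for i, v in enumerate(items):
        out[i >> 2] |= (v & 3) << ((i & 3) * 2)
    return bytes(out)

def unpack_2bit(packed, i):
    return (packed[i >> 2] >> ((i & 3) * 2)) & 3

values = [100, 104, 108, 112]
reduced, g, bias = reduce_values(values)  # [0, 1, 2, 3], g=4, bias=100
packed = pack_2bit(reduced)               # one byte instead of four values
lookup = lambda i: unpack_2bit(packed, i) * g + bias
```

The generated code undoes the reductions at lookup time with a multiply and an add, which is why solutions are scored in "extra ops" as well as bytes.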
The solver produces a set of Pareto-optimal solutions trading off table
size against lookup speed, and pick_solution selects the best one based
on the compression parameter.
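The Pareto filter itself is simple to sketch; a minimal version (packTab's real solutions carry more state than these illustrative (bytes, ops) pairs):

```python
def pareto_front(solutions):
    """Keep (bytes, ops) pairs not dominated on both size and cost."""
    front = []
    for nbytes, ops in sorted(solutions):
        # Sorted by bytes ascending; keep only strictly improving op counts.
        if not front or ops < front[-1][1]:
            front.append((nbytes, ops))
    return front
```

For example, a 300-byte/2-op solution is dominated by a 256-byte/1-op one and gets dropped, while a 64-byte/3-op solution survives as the small-but-slower end of the front.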
Testing
```
pytest
```
History
I first wrote something like this back in 2001 when I needed it in FriBidi:
https://github.com/fribidi/fribidi/blob/master/gen.tab/packtab.c
In 2019 I wanted to use that to produce more compact Unicode data tables for HarfBuzz, but for convenience I wanted to use it from Python. While I considered wrapping the C code in a module, it occurred to me that I could rewrite it in pure Python in a much cleaner way. That C code remains a stain on my resume in terms of readability (or lack thereof!). :D
This Python version builds on the same ideas, but is different from the C version in two major ways:
- Whereas the C version uses backtracking to find the best split opportunities, the same can be achieved with dynamic programming. The Python version implements the DP approach, which is much faster.
- The C version does not try packing multiple items into a single byte; the Python version does. That is, if items fit, they may get packed into 1, 2, or 4 bits per item.
There's also a bunch of other optimizations, which (eventually, when complete) will make the Python version more generic and usable for a wider variety of data tables.
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file packtab-1.2.0.tar.gz.
File metadata
- Download URL: packtab-1.2.0.tar.gz
- Upload date:
- Size: 37.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2a63a34eee79a9297007e40a3b6b7fa45e3b21979e4933855c3fcca90ba2c852 |
| MD5 | 3f1c181b5b4fc4ed7c9bc7e8f5db5893 |
| BLAKE2b-256 | d8edef7a330438d3377e3cf7552690d7ead4156637786c48beff36063e823f09 |
File details
Details for the file packtab-1.2.0-py3-none-any.whl.
File metadata
- Download URL: packtab-1.2.0-py3-none-any.whl
- Upload date:
- Size: 34.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | ceda815b92604efcfd48d0b81c8fa9991994fddf44787af9b2e5a3d8b2f055c9 |
| MD5 | 1b9d40f06a937b66da9b86adf775172f |
| BLAKE2b-256 | f62a859940e4eba793b9dfc1e29194b6f2a674641483c27b5fe8b313efb7ea42 |