Python package implementing ML feature engineering and pre-processing for polars or pandas dataframes.
Feature engineering on polars and pandas dataframes for machine learning!
tubular implements pre-processing steps for tabular data commonly used in machine learning pipelines.
The transformers are compatible with scikit-learn Pipelines. Each has a transform method to apply the pre-processing step to data and a fit method to learn the relevant information from the data, if applicable.
The transformers in tubular are written in narwhals, so they are agnostic between pandas and polars dataframes and will utilise the chosen (pandas/polars) API under the hood.
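To make the fit/transform contract described above concrete, here is a minimal pure-Python sketch. `SketchMeanImputer` is a hypothetical stand-in that works on a plain dict of lists; it is not tubular's implementation and only illustrates the idea that `fit` learns values from the data and `transform` applies them.

```python
from statistics import mean


class SketchMeanImputer:
    """Hypothetical stand-in for a transformer; not tubular's implementation."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, data):
        # Learn one fill value per column, ignoring missing entries.
        self.impute_values_ = {
            col: mean(v for v in data[col] if v is not None)
            for col in self.columns
        }
        return self

    def transform(self, data):
        # Apply the learnt values; other columns pass through unchanged.
        out = dict(data)
        for col in self.columns:
            fill = self.impute_values_[col]
            out[col] = [fill if v is None else v for v in data[col]]
        return out


data = {"a": [1, 5], "b": [10, None]}
imputer = SketchMeanImputer(columns=["b"]).fit(data)
result = imputer.transform(data)
print(result)
```

Returning `self` from `fit` is what allows the chained `fit(...).transform(...)` style that scikit-learn Pipelines rely on.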
There are a variety of transformers to assist with:
- capping
- dates
- imputation
- mapping
- categorical encoding
- numeric operations
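Here is a simple example of applying capping to two columns:
import polars as pl
from tubular.capping import CappingTransformer

# Cap column "a" to [10, 20] and column "b" to [1, 3]; column "c" is left untouched
transformer = CappingTransformer(
    capping_values={"a": [10, 20], "b": [1, 3]},
)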
test_df = pl.DataFrame({"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]})
transformer.transform(test_df)
# ->
# shape: (4, 3)
# ┌─────┬─────┬─────┐
# │ a ┆ b ┆ c │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╡
# │ 10 ┆ 3 ┆ 1 │
# │ 15 ┆ 2 ┆ 2 │
# │ 18 ┆ 3 ┆ 3 │
# │ 20 ┆ 1 ┆ 4 │
# └─────┴─────┴─────┘
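For readers who want to check the numbers by hand, the capping rule in the example above can be sketched in plain Python. The helper `cap` and the dict-of-lists layout are illustrative assumptions, not part of tubular; the sketch only reproduces the arithmetic of clipping each value into its `[lower, upper]` interval.

```python
def cap(values, lower, upper):
    """Clip every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


capping_values = {"a": [10, 20], "b": [1, 3]}
data = {"a": [1, 15, 18, 25], "b": [6, 2, 7, 1], "c": [1, 2, 3, 4]}

# Columns without an entry in capping_values are copied through unchanged.
capped = {
    col: cap(vals, *capping_values[col]) if col in capping_values else list(vals)
    for col, vals in data.items()
}
print(capped)
```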
Tubular also supports saving/reading transformers and pipelines to/from json format (goodbye .pkls!), which we demo below:
import polars as pl
from tubular.imputers import MeanImputer, MedianImputer
from sklearn.pipeline import Pipeline
from tubular.pipeline import dump_pipeline_to_json, load_pipeline_from_json
# Create a simple dataframe
df = pl.DataFrame({"a": [1, 5], "b": [10, None]})
# Add imputers
median_imputer = MedianImputer(columns=["b"])
mean_imputer = MeanImputer(columns=["b"])
# Create and fit the pipeline
original_pipeline = Pipeline(
    [("MedianImputer", median_imputer), ("MeanImputer", mean_imputer)]
)
original_pipeline = original_pipeline.fit(df)
# Dumping the pipeline to JSON
pipeline_json = dump_pipeline_to_json(original_pipeline)
pipeline_json
# Printed value:
# ->
# {
#   'MedianImputer': {
#     'tubular_version': '2.6.1',
#     'classname': 'MedianImputer',
#     'init': {
#       'columns': ['b'],
#       'copy': False,
#       'verbose': False,
#       'return_native': True,
#       'weights_column': None
#     },
#     'fit': {
#       'impute_values_': {'b': 10.0}
#     }
#   },
#   'MeanImputer': {
#     'tubular_version': '2.6.1',
#     'classname': 'MeanImputer',
#     'init': {
#       'columns': ['b'],
#       'copy': False,
#       'verbose': False,
#       'return_native': True,
#       'weights_column': None
#     },
#     'fit': {
#       'impute_values_': {'b': 10.0}
#     }
#   }
# }
# Load the pipeline from JSON
pipeline = load_pipeline_from_json(pipeline_json)
# Verify the reconstructed pipeline
print(pipeline)
# Printed value:
# Pipeline(steps=[('MedianImputer', MedianImputer(columns=['b'])),
# ('MeanImputer', MeanImputer(columns=['b']))])
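The printed value above suggests that the dumped pipeline is a plain dict of JSON-compatible values. Assuming that holds, it can be written to disk and read back with the standard library `json` module. The sketch below hard-codes part of the printed output so that it runs without tubular installed; the file name `pipeline.json` is an arbitrary choice.

```python
import json
import tempfile
from pathlib import Path

# Assumed shape, copied from the printed output above (MedianImputer step only).
pipeline_json = {
    "MedianImputer": {
        "tubular_version": "2.6.1",
        "classname": "MedianImputer",
        "init": {
            "columns": ["b"],
            "copy": False,
            "verbose": False,
            "return_native": True,
            "weights_column": None,
        },
        "fit": {"impute_values_": {"b": 10.0}},
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "pipeline.json"
    # Serialise to text on disk, then parse it back into a dict.
    path.write_text(json.dumps(pipeline_json, indent=2))
    restored = json.loads(path.read_text())

print(restored == pipeline_json)
```

In a real workflow the restored dict would then be passed to `load_pipeline_from_json`, as in the example above.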
We are currently rolling out support for polars lazyframes! Track our progress below:
| transformer | polars_compatible | pandas_compatible | jsonable | lazyframe_compatible |
|---|---|---|---|---|
| AggregateColumnsOverRowTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| AggregateRowsOverColumnTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| ArbitraryImputer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| BetweenDatesTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| CappingTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| ColumnDtypeSetter | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| CompareTwoColumnsTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DateDifferenceTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DatetimeComponentExtractor | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DatetimeInfoExtractor | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DatetimeSinusoidCalculator | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| DifferenceTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| GroupRareLevelsTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| MappingTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| MeanImputer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| MeanResponseTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| MedianImputer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| ModeImputer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| NullIndicator | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| OneDKmeansTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| OneHotEncodingTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| OutOfRangeNullTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: |
| RatioTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| RenameColumnsTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| SetValueTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| ToDatetimeTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
| WhenThenOtherwiseTransformer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: |
Installation
The easiest way to get tubular is to install it directly from PyPI:
pip install tubular
Documentation
The documentation for tubular can be found on readthedocs.
Instructions for building the docs locally can be found in docs/README.
Examples
We utilise doctest to keep valid usage examples in the docstrings of transformers in the package, so please see these for getting started!
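As a rough illustration of how doctest-backed docstrings work, the sketch below defines a hypothetical function `cap_value` (not part of tubular) whose docstring contains usage examples, then runs those examples programmatically. In everyday use, `python -m doctest` or `doctest.testmod()` does the same job for a whole module.

```python
import doctest


def cap_value(value, lower, upper):
    """Clip value into the closed interval [lower, upper].

    >>> cap_value(25, 10, 20)
    20
    >>> cap_value(1, 10, 20)
    10
    """
    return min(max(value, lower), upper)


# Parse the examples out of the docstring and execute them.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    cap_value.__doc__, {"cap_value": cap_value}, "cap_value", None, 0
)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)
print(results)
```

Because the examples are executed as tests, a docstring that drifts out of date fails the run rather than silently misleading readers.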
Issues
For bugs and feature requests please open an issue.
Build and test
The test framework we use for this project is pytest. To build the package locally and run the tests, follow the steps below.
First clone the repo and move to the root directory:
git clone https://github.com/azukds/tubular.git
cd tubular
Next install tubular and the development dependencies:
pip install . -r requirements-dev.txt
Finally run the test suite with pytest:
pytest
Contribute
tubular is under active development, and we're super excited if you're interested in contributing!
See the CONTRIBUTING file for the full details of our working practices.