CLI for fast, flexible concatenation of tabular data using Polars.
Project description
joinem provides a CLI for fast, flexible concatenation of tabular data using polars.
- Free software: MIT license
- Repository: https://github.com/mmore500/joinem
- Documentation: https://github.com/mmore500/joinem/blob/master/README.md
Install
python3 -m pip install joinem
Features
- Lazily streams I/O to expeditiously handle numerous large files.
- Supports CSV and parquet input files.
- Due to current polars limitations, JSON and feather files are not supported.
- Input formats may be mixed.
- Supports output to CSV, JSON, parquet, and feather file types.
- Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
- Provides a progress bar with --progress.
- Adds programmatically-generated columns to output.
Example Usage
Pass input filenames via stdin, one filename per line.
find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
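When the file list is built inside a Python script rather than a shell pipeline, the same one-filename-per-line convention applies. The following is a minimal sketch, not part of joinem itself: `stdin_payload` and `run_joinem` are hypothetical helper names, and `run_joinem` assumes joinem is installed for the interpreter that runs it.

```python
import subprocess
import sys


def stdin_payload(paths):
    """Format paths the way joinem reads them: one filename per line."""
    return "".join(f"{path}\n" for path in paths)


def run_joinem(paths, output_file, *flags):
    """Hypothetical helper: run `python -m joinem` with the paths on stdin."""
    # Assumption: joinem is installed for the current interpreter.
    return subprocess.run(
        [sys.executable, "-m", "joinem", output_file, *flags],
        input=stdin_payload(paths),
        text=True,
        check=True,  # raise CalledProcessError if joinem exits non-zero
    )


# Only the payload is built here; nothing is executed until run_joinem is called.
payload = stdin_payload(["path/to/a.parquet", "path/to/b.csv"])
print(payload, end="")
```

A call such as `run_joinem(["a.csv", "b.csv"], "out.parquet", "--how", "diagonal")` would then mirror the shell pipelines in this section.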
Output file type is inferred from the extension of the output file name. Supported output types are feather, JSON, parquet, and CSV.
find . -name '*.parquet' | python3 -m joinem out.json
If file columns may mismatch, use --how diagonal.
find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
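As a rough picture of what diagonal concatenation means, the plain-Python sketch below takes the union of the columns across inputs and fills the gaps with `None` (null). It only illustrates the semantics; `diagonal_concat` is a hypothetical helper operating on lists of dicts, not joinem or polars code.

```python
def diagonal_concat(tables):
    """Stack tables (lists of row dicts) over the union of their columns."""
    columns = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    # Rows keep their original order; missing columns become None.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


first = [{"a": 1, "b": 2}]   # stands in for a file with columns a, b
second = [{"a": 3, "c": 4}]  # stands in for a file with columns a, c
combined = diagonal_concat([first, second])
print(combined)
```

Every output row carries all three columns, with `None` wherever the source file lacked one.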
If some files may be empty, use --how diagonal_relaxed.
To run via Singularity/Apptainer:
ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem:v0.11.1 out.feather
Advanced Usage
Add literal value column to output.
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'
Cast a column to categorical in the output, shrink dtypes, and tune compression.
ls -1 *.csv | python3 -m joinem out.pqt \
--with-column 'pl.col("uuid").cast(pl.Categorical)' --string-cache \
--shrink-dtypes \
--write-kwarg 'compression_level=10' --write-kwarg 'compression="zstd"'
Alias an existing column in the output.
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'
Apply regex on source datafile paths to create new column in output.
ls -1 path/to/*.csv | python3 -m joinem out.csv \
--with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
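To see what this pattern extracts, the replacement can be approximated with Python's standard `re` module. This is only an approximation under stated assumptions: polars uses its own regex engine and writes the capture group as `${1}`, whereas `re` writes it as `\1`. `filename_stem` is a hypothetical helper name.

```python
import re

# Same pattern as in the --with-column expression.
PATTERN = r".*?([^/]*)\.csv"


def filename_stem(filepath):
    """Approximate the str.replace call using Python's re module."""
    # count=1 replaces only the first match, which is assumed to mirror
    # the behavior of the polars str.replace expression.
    return re.sub(PATTERN, r"\1", filepath, count=1)


for filepath in ["path/to/foo.csv", "bar.csv", "a/b/run_01.csv"]:
    print(filepath, "->", filename_stem(filepath))
```

The lazy `.*?` consumes the leading directories, and `([^/]*)` captures the final path component up to the `.csv` extension.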
Read data from stdin and write data to stdout.
cat foo.csv | python3 -m joinem "/dev/stdout" --stdin \
--output-filetype csv --input-filetype csv
Write to parquet via stdout using pv to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.
ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" --output-filetype pqt \
--with-column 'pl.col("myValue").cast(pl.Categorical)' \
--write-kwarg 'compression="lz4"' \
| pv > concat.pqt
API
usage: __main__.py [-h] [--version] [--progress] [--stdin] [--drop DROP]
[--select SELECT] [--eager-read] [--eager-write]
[--filter FILTERS] [--head HEAD] [--tail TAIL]
[--gather-every GATHER_EVERY] [--sample SAMPLE]
[--shuffle] [--seed SEED] [--with-column WITH_COLUMNS]
[--shrink-dtypes] [--string-cache]
[--how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}]
[--input-filetype INPUT_FILETYPE]
[--output-filetype OUTPUT_FILETYPE]
[--read-kwarg READ_KWARGS] [--write-kwarg WRITE_KWARGS]
output_file
CLI for fast, flexbile concatenation of tabular data using Polars.
positional arguments:
output_file Output file name
options:
-h, --help show this help message and exit
--version show program's version number and exit
--progress Show progress bar.
--stdin Read data from stdin.
--drop DROP Column names to drop. Flag may be repeated to
provide multiple column names.
--select SELECT Column names to select; otherwise, all columns are
selected. Flag may be repeated to provide multiple
column names.
--eager-read Use read_* instead of scan_*. Can improve
performance in some cases.
--eager-write Use write_* instead of sink_*. Can improve
performance in some cases.
--filter FILTERS Expression to be evaluated and passed to polars
DataFrame.filter. Example: 'pl.col("thing") == 0'
--head HEAD Number of rows to include in output, counting from
front.
--tail TAIL Number of rows to include in output, counting from
back.
--gather-every GATHER_EVERY
Take every nth row.
--sample SAMPLE Number of rows to include in output, sampled
uniformly. Pass --seed for deterministic behavior.
--shuffle Should output be shuffled? Pass --seed for
deterministic behavior.
--seed SEED Integer seed for deterministic behavior.
--with-column WITH_COLUMNS
Expression to be evaluated to add a column, has
access to each datafile's filepath as `filepath` and
polars as `pl`. Flag may be repeated to provide
multiple expressions. Example:
'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
r"${1}").alias("filename stem")'
--shrink-dtypes Shrink numeric columns to the minimal required
datatype.
--string-cache Enable Polars global string cache.
--how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}
How to concatenate frames. See
<https://docs.pola.rs/py-polars/html/reference/api/polars.concat.html>
for more information.
--input-filetype INPUT_FILETYPE
Filetype of input. Otherwise, inferred. Example:
csv, parquet, json, feather
--output-filetype OUTPUT_FILETYPE
Filetype of output. Otherwise, inferred. Example:
csv, parquet
--read-kwarg READ_KWARGS
Additional keyword arguments to pass to pl.read_* or
pl.scan_* call(s). Provide as 'key=value'. Specify
multiple kwargs by using this flag multiple times.
Arguments will be evaluated as Python expressions.
Example: 'infer_schema_length=None'
--write-kwarg WRITE_KWARGS
Additional keyword arguments to pass to pl.write_*
or pl.sink_* call. Provide as 'key=value'. Specify
multiple kwargs by using this flag multiple times.
Arguments will be evaluated as Python expressions.
Example: 'compression="lz4"'
Provide input filepaths via stdin. Example: find path/to/ -name '*.csv' |
python3 -m joinem out.csv
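Because the value half of each --read-kwarg and --write-kwarg is evaluated as a Python expression, string values need their own quotes inside the shell quotes (`'compression="lz4"'`), while numbers and `None` do not. The sketch below shows one way such `key=value` strings could be turned into keyword arguments. `parse_kwarg` is a hypothetical helper written for illustration and is not taken from joinem's source.

```python
def parse_kwarg(spec):
    """Split 'key=value' on the first '=' and evaluate the value as Python."""
    key, sep, value = spec.partition("=")
    if not sep or not key.strip():
        raise ValueError(f"expected 'key=value', got {spec!r}")
    # eval mirrors "evaluated as Python expressions"; use on trusted input only.
    return key.strip(), eval(value)


specs = ['compression="zstd"', "compression_level=10", "infer_schema_length=None"]
kwargs = dict(parse_kwarg(spec) for spec in specs)
print(kwargs)

# Without inner quotes, zstd is read as a variable name and evaluation fails.
try:
    parse_kwarg("compression=zstd")
    unquoted_failed = False
except NameError as error:
    unquoted_failed = True
    print("unquoted string fails:", error)
```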
Citing
If joinem contributes to a scholarly work, please cite it as
Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182
@software{moreno2024joinem,
author = {Matthew Andres Moreno},
title = {mmore500/joinem},
month = feb,
year = 2024,
publisher = {Zenodo},
doi = {10.5281/zenodo.10701182},
url = {https://doi.org/10.5281/zenodo.10701182}
}
And don't forget to leave a star on GitHub!