CLI for fast, flexible concatenation of tabular data using Polars.

joinem provides a CLI for fast, flexible concatenation of tabular data using Polars.

Install

python3 -m pip install joinem

Features

  • Lazily streams I/O to expeditiously handle numerous large files.
  • Supports CSV and parquet input files.
    • Due to current polars limitations, JSON and feather files are not supported.
    • Input formats may be mixed.
  • Supports output to CSV, JSON, parquet, and feather file types.
  • Allows mismatched columns and/or empty data files with --how diagonal and --how diagonal_relaxed.
  • Provides a progress bar with --progress.
  • Adds programmatically generated columns to output.

Example Usage

Pass input filenames via stdin, one filename per line.

find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
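
When the file list comes from a Python script rather than find, the same "one filename per line on stdin" convention can be reproduced with the standard library. The sketch below is only an illustration: build_joinem_command, format_stdin, and run_joinem are hypothetical helper names, not part of joinem, and run_joinem assumes joinem is installed for the current interpreter.

```python
import subprocess
import sys


def build_joinem_command(output_file, how=None, progress=False):
    """Assemble an argv list for `python -m joinem` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "joinem", str(output_file)]
    if how is not None:
        cmd += ["--how", how]  # e.g. "diagonal", as listed in the help text
    if progress:
        cmd.append("--progress")
    return cmd


def format_stdin(paths):
    """Render input paths as joinem expects them: one filename per line."""
    return "".join(f"{path}\n" for path in paths)


def run_joinem(paths, output_file, **options):
    """Pipe the filenames to joinem on stdin; needs joinem to be installed."""
    return subprocess.run(
        build_joinem_command(output_file, **options),
        input=format_stdin(paths),
        text=True,
        check=True,
    )


# Inspect the command without running it (the interpreter path is omitted).
print(build_joinem_command("out.csv", how="diagonal")[1:])
# → ['-m', 'joinem', 'out.csv', '--how', 'diagonal']
```

Calling run_joinem(["a.csv", "b.csv"], "out.csv", how="diagonal") would then behave like the shell pipeline, with the filenames supplied by the script.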

Output file type is inferred from the extension of the output file name. Supported output types are feather, JSON, parquet, and CSV.

find -name '*.parquet' | python3 -m joinem out.json

If file columns may mismatch, use --how diagonal.

find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal

If some files may be empty, use --how diagonal_relaxed.
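
To make the diagonal strategy concrete, here is a plain-Python illustration of its column-union behavior: columns from all inputs are combined, and values missing from a given file become null. This is not how joinem or polars implements it; read_table and diagonal_concat are hypothetical names, values stay as strings, and the dtype coercion that diagonal_relaxed adds is not modeled.

```python
import csv
import io


def read_table(text):
    """Parse CSV text into (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def diagonal_concat(tables):
    """Stack rows from all tables over the union of their columns."""
    columns = []
    for fieldnames, _ in tables:
        for name in fieldnames:
            if name not in columns:  # keep first-seen column order
                columns.append(name)
    stacked = [
        {name: row.get(name) for name in columns}  # absent values become None
        for _, rows in tables
        for row in rows
    ]
    return columns, stacked


first = read_table("a,b\n1,2\n")
second = read_table("a,c\n3,4\n")
empty = read_table("")  # an empty file contributes no columns and no rows

columns, rows = diagonal_concat([first, second, empty])
print(columns)
for row in rows:
    print(row)
```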

To run via Singularity/Apptainer:

ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem:v0.11.1 out.feather

Advanced Usage

Add a literal-value column to the output.

ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'

Cast a column to categorical in the output, shrink dtypes, and tune compression.

ls -1 *.csv | python3 -m joinem out.pqt \
  --with-column 'pl.col("uuid").cast(pl.Categorical)' --string-cache \
  --shrink-dtypes \
  --write-kwarg 'compression_level=10' --write-kwarg 'compression="zstd"'
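
The help text states that --read-kwarg and --write-kwarg values are given as 'key=value' and evaluated as Python expressions, which is why string values need their own inner quotes (compression="zstd") while numbers do not. The sketch below shows one way such strings could be interpreted. It is an assumption for illustration, not joinem's actual parser: parse_kwarg and parse_kwargs are hypothetical names, and ast.literal_eval is used here so that only literals are accepted.

```python
import ast


def parse_kwarg(spec):
    """Split 'key=value' on the first '=' and evaluate value as a literal."""
    key, sep, value = spec.partition("=")
    if not sep or not key.strip():
        raise ValueError(f"expected 'key=value', got {spec!r}")
    return key.strip(), ast.literal_eval(value.strip())


def parse_kwargs(specs):
    """Combine repeated 'key=value' flags into one keyword-argument dict."""
    return dict(parse_kwarg(spec) for spec in specs)


kwargs = parse_kwargs(['compression_level=10', 'compression="zstd"'])
print(kwargs)
# → {'compression_level': 10, 'compression': 'zstd'}

# Without the inner quotes, zstd is a bare name rather than a string literal.
try:
    parse_kwarg("compression=zstd")
except ValueError as error:
    print("rejected:", error)
```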

Alias an existing column in the output.

ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'

Apply a regex to source datafile paths to create a new column in the output.

ls -1 path/to/*.csv | python3 -m joinem out.csv \
  --with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
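
Before running a long job, the pattern can be checked quickly against sample paths with Python's re module. This is only an approximation of the polars call: polars writes the capture-group reference as ${1} where re uses \1, and the two regex engines differ in general, although they should agree for this simple pattern. filename_stem is a hypothetical helper name.

```python
import re

# Same pattern as in the --with-column expression above.
PATTERN = r".*?([^/]*)\.csv"


def filename_stem(filepath):
    """Replace the first match with capture group 1 (the name before .csv)."""
    return re.sub(PATTERN, r"\1", filepath, count=1)


print(filename_stem("path/to/foo.csv"))      # → foo
print(filename_stem("foo.csv"))              # → foo
print(filename_stem("path/to/foo.parquet"))  # → path/to/foo.parquet
```

Paths that do not end in .csv pass through unchanged, so inputs of mixed formats would keep their full path in the new column unless the pattern is generalized.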

Read data from stdin and write data to stdout.

cat foo.csv | python3 -m joinem "/dev/stdout" --stdin \
  --output-filetype csv --input-filetype csv

Write to parquet via stdout using pv to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.

ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" --output-filetype pqt \
  --with-column 'pl.col("myValue").cast(pl.Categorical)' \
  --write-kwarg 'compression="lz4"' \
  | pv > concat.pqt

API

usage: __main__.py [-h] [--version] [--progress] [--stdin] [--drop DROP]
                   [--select SELECT] [--eager-read] [--eager-write]
                   [--filter FILTERS] [--head HEAD] [--tail TAIL]
                   [--gather-every GATHER_EVERY] [--sample SAMPLE]
                   [--shuffle] [--seed SEED] [--with-column WITH_COLUMNS]
                   [--shrink-dtypes] [--string-cache]
                   [--how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}]
                   [--input-filetype INPUT_FILETYPE]
                   [--output-filetype OUTPUT_FILETYPE]
                   [--read-kwarg READ_KWARGS] [--write-kwarg WRITE_KWARGS]
                   output_file

CLI for fast, flexible concatenation of tabular data using Polars.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar.
  --stdin               Read data from stdin.
  --drop DROP           Column names to drop. Flag may be repeated to
                        provide multiple column names.
  --select SELECT       Column names to select; otherwise, all columns are
                        selected. Flag may be repeated to provide multiple
                        column names.
  --eager-read          Use read_* instead of scan_*. Can improve
                        performance in some cases.
  --eager-write         Use write_* instead of sink_*. Can improve
                        performance in some cases.
  --filter FILTERS      Expression to be evaluated and passed to polars
                        DataFrame.filter. Example: 'pl.col("thing") == 0'
  --head HEAD           Number of rows to include in output, counting from
                        front.
  --tail TAIL           Number of rows to include in output, counting from
                        back.
  --gather-every GATHER_EVERY
                        Take every nth row.
  --sample SAMPLE       Number of rows to include in output, sampled
                        uniformly. Pass --seed for deterministic behavior.
  --shuffle             Should output be shuffled? Pass --seed for
                        deterministic behavior.
  --seed SEED           Integer seed for deterministic behavior.
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, has
                        access to each datafile's filepath as `filepath` and
                        polars as `pl`. Flag may be repeated to provide
                        multiple expressions. Example:
                        'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --shrink-dtypes       Shrink numeric columns to the minimal required
                        datatype.
  --string-cache        Enable Polars global string cache.
  --how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}
                        How to concatenate frames. See
                        <https://docs.pola.rs/py-polars/html/reference/api/polars.concat.html>
                        for more information.
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example:
                        csv, parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example:
                        csv, parquet
  --read-kwarg READ_KWARGS
                        Additional keyword arguments to pass to pl.read_* or
                        pl.scan_* call(s). Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'infer_schema_length=None'
  --write-kwarg WRITE_KWARGS
                        Additional keyword arguments to pass to pl.write_*
                        or pl.sink_* call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'compression="lz4"'

Provide input filepaths via stdin. Example: find path/to/ -name '*.csv' |
python3 -m joinem out.csv

Citing

If joinem contributes to a scholarly work, please cite it as

Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182

@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}

And don't forget to leave a star on GitHub!
