
Chroma DB Data Pipes is a collection of tools for working with data in Chroma DB and building RAG systems


ChromaDB Data Pipes 🖇️| Rediscover AI/ML the Unix Way

ChromaDB Data Pipes is a collection of tools for building data pipelines for Chroma DB, inspired by the Unix philosophy of "do one thing and do it well".

Installation

pip install chromadb-data-pipes

Usage

Get help:

cdp --help

Importing

Import data from HuggingFace Datasets to a .jsonl file:

cdp imp hf --uri "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl

Import data from HuggingFace Datasets to Chroma DB:

The command below imports the train split of the given dataset into the chroma-qna collection in Chroma. The collection is created if it does not exist, and documents are upserted.

cdp imp hf --uri "hf://tazarov/chroma-qna?split=train" | cdp imp chroma --uri "http://localhost:8000/default_database/chroma-qna" --upsert --create
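The Chroma URIs in this README appear to follow the pattern `http://<host>:<port>/<database>/<collection>`. When several commands target the same server, a small shell helper can keep them consistent. This is only a sketch: the function name `chroma_uri` is an illustrative assumption and not part of `cdp`.

```shell
# Hypothetical helper (not part of cdp): build a Chroma URI in the
# http://<host>:<port>/<database>/<collection> form used in this README.
chroma_uri() {
  # $1 = host, $2 = port, $3 = database, $4 = collection
  printf 'http://%s:%s/%s/%s\n' "$1" "$2" "$3" "$4"
}

# Example: the same import as above, with the target URI built by the helper.
# cdp imp hf --uri "hf://tazarov/chroma-qna?split=train" \
#   | cdp imp chroma --uri "$(chroma_uri localhost 8000 default_database chroma-qna)" --upsert --create
```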

Exporting

Export data from Chroma DB to a .jsonl file:

The command below exports the first 10 documents from the chroma-qna collection to the chroma-qna.jsonl file.

cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" --limit 10 > chroma-qna.jsonl
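For larger collections, the `--limit` flag can presumably be combined with `--offset` (as in the examples in this README) to export in fixed-size pages. The sketch below assumes that `--offset 0` is accepted and that the export order is stable between calls; neither is confirmed here. The function name `export_pages` is hypothetical.

```shell
# Hypothetical helper (not part of cdp): export a collection page by page.
# Assumes --limit/--offset behave as in this README's examples and that the
# export order does not change between calls.
export_pages() {
  # $1 = Chroma URI, $2 = page size, $3 = number of pages
  page=0
  while [ "$page" -lt "$3" ]; do
    offset=$((page * $2))
    cdp exp chroma --uri "$1" --limit "$2" --offset "$offset"
    page=$((page + 1))
  done
}

# Example: export three pages of 10 documents into one file.
# export_pages "http://localhost:8000/default_database/chroma-qna" 10 3 > chroma-qna.jsonl
```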

Export data from Chroma DB to HuggingFace Datasets:

The command below exports 10 documents, starting at offset 10, from the chroma-qna collection and uploads them to the tazarov/chroma-qna-modified dataset on HuggingFace.

!!! note HF Auth and Privacy

Make sure you have `HF_TOKEN=hf_....` environment variable set.
If you want your dataset to be private, add the `--private` flag to the `cdp exp hf` command.

cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" --limit 10 --offset 10 | cdp exp hf --uri "hf://tazarov/chroma-qna-modified"
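Because the upload depends on `HF_TOKEN`, it can help to check for the variable before starting the pipeline rather than failing partway through. The guard below is a minimal sketch; `require_hf_token` is a hypothetical name and not part of `cdp`.

```shell
# Hypothetical guard (not part of cdp): stop early if HF_TOKEN is missing.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before running cdp exp hf" >&2
    return 1
  fi
}

# Example: only run the export when the token is present.
# require_hf_token && cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" --limit 10 --offset 10 \
#   | cdp exp hf --uri "hf://tazarov/chroma-qna-modified"
```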

To export the dataset to a local file instead, use --uri with the file:// prefix:

cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" --limit 10 --offset 10 | cdp exp hf --uri "file://chroma-qna"

!!! note File Location

The file path is relative to the current working directory.

Transforming

Copy documents from one Chroma collection to another and re-embed them:

cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" | cdp tx embed --ef default | cdp imp chroma --uri "http://localhost:8000/default_database/chroma-qna-def-emb" --upsert --create
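If this copy-and-re-embed step is run often, the three-stage pipeline can be wrapped in a shell function. This is a sketch that only reuses the flags shown above; the name `reembed_collection` is hypothetical.

```shell
# Hypothetical wrapper (not part of cdp): copy one collection into another,
# re-embedding with the default embedding function, as in the pipeline above.
reembed_collection() {
  # $1 = source Chroma URI, $2 = destination Chroma URI
  cdp exp chroma --uri "$1" \
    | cdp tx embed --ef default \
    | cdp imp chroma --uri "$2" --upsert --create
}

# Example:
# reembed_collection "http://localhost:8000/default_database/chroma-qna" \
#   "http://localhost:8000/default_database/chroma-qna-def-emb"
```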

Import a dataset from HuggingFace into Chroma and embed the documents:

cdp imp hf --uri "hf://tazarov/ds2?split=train" | cdp tx embed --ef default | cdp imp chroma --uri "http://localhost:8000/default_database/chroma-qna-def-emb-hf" --upsert --create

Misc

Count the number of documents in a collection:

cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" | wc -l
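The same counting trick can serve as a rough check that a copy between collections moved every document. The sketch below assumes the export emits one document per line, as the `wc -l` example implies; `count_docs` and `same_count` are hypothetical names.

```shell
# Hypothetical helpers (not part of cdp).
count_docs() {
  # $1 = Chroma URI; prints the number of exported lines (one document per line)
  cdp exp chroma --uri "$1" | wc -l | tr -d ' '
}

same_count() {
  # $1 = source URI, $2 = destination URI; succeeds when both counts match
  [ "$(count_docs "$1")" -eq "$(count_docs "$2")" ]
}

# Example:
# same_count "http://localhost:8000/default_database/chroma-qna" \
#   "http://localhost:8000/default_database/chroma-qna-def-emb" && echo "counts match"
```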


