
ChromaDB Data Pipes 🖇️ - The easiest way to get data into and out of ChromaDB

ChromaDB Data Pipes is a collection of tools to build data pipelines for Chroma DB, inspired by the Unix philosophy of "do one thing and do it well".

Roadmap:

  • ✅ Integration with LangChain 🦜🔗
  • 🚫 Integration with LlamaIndex 🦙
  • ✅ Support for embedding functions other than all-MiniLM-L6-v2 (head over to Embedding Processors for more info)
  • 🚫 Multimodal support
  • ♾️ Much more!

Installation

pip install chromadb-data-pipes

Usage

Get help:

cdp --help

Example Use Cases

Here is a short list of use cases to help you evaluate whether this is the right tool for your needs:

  • Import large datasets from local documents (PDF, TXT, etc.), from HuggingFace, from a local persisted Chroma DB, or even from another remote Chroma DB.
  • Export large datasets to HuggingFace or to any other data format supported by the library (if your format is not supported, either implement it in a small function or open an issue).
  • Create a dataset from your data that you can share with others (including the embeddings).
  • Clone a collection with a different embedding function, distance function, or other HNSW fine-tuning parameters.
  • Re-embed the documents in a collection with a different embedding function.
  • Back up your data to a .jsonl file.
  • Use existing Unix (or other) tools to transform your data after exporting it from, or before importing it into, Chroma DB; see the sketch after this list.
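
As an example of the last use case, here is a sketch that exports a collection, tags every record with jq, and imports the result into a new collection. The metadata field used here (reviewed) is purely illustrative; inspect your exported .jsonl first to confirm the record layout.

cdp export "file://chroma-data/chroma-qna" | jq -c '.metadata.reviewed = "true"' | cdp import "file://chroma-data/chroma-qna-reviewed" --upsert --create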

Importing

Import data from HuggingFace Datasets to a .jsonl file:

cdp ds-get "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl

Import data from HuggingFace Datasets to Chroma DB:

The below command will import the train split of the given dataset into the chroma-qna collection in Chroma. The collection will be created if it does not exist, and documents will be upserted.

cdp ds-get "hf://tazarov/chroma-qna?split=train" | cdp import "http://localhost:8000/chroma-qna" --upsert --create

Import from a directory of PDF files into a local persisted Chroma DB:

cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500 | cdp embed --ef default | cdp import "file://chroma-data/my-pdfs" --upsert --create

Note: The above command will take the PDF matching 2401.02412.pdf from the sample-data/papers/ directory (keeping only the first matching record), chunk it into 500-word chunks, embed each chunk, and import the chunks into the my-pdfs collection in Chroma DB.

Exporting

Export data from a local persisted Chroma DB to a .jsonl file:

The below command will export the first 10 documents from the chroma-qna collection to the chroma-qna.jsonl file.

cdp export "file://chroma-data/chroma-qna" --limit 10 > chroma-qna.jsonl

Export data from a local persisted Chroma DB to a .jsonl file with a filter:

The below command will export data from a local persisted Chroma DB to a .jsonl file, using a where filter to select the documents to export.

cdp export "file://chroma-data/chroma-qna" --where '{"document_id": "123"}' > chroma-qna.jsonl

Export data from Chroma DB to HuggingFace Datasets:

The below command will export 10 documents, starting at offset 10, from the chroma-qna collection to the tazarov/chroma-qna-modified dataset on HuggingFace. The dataset will be uploaded to the HuggingFace Hub.

HF Auth and Privacy: Make sure you have the HF_TOKEN=hf_.... environment variable set. If you want your dataset to be private, add the --private flag to the cdp ds-put command.

cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "hf://tazarov/chroma-qna-modified"

To export a dataset to a file, use a dataset URI with the file:// prefix:

cdp export "http://localhost:8000/chroma-qna" --limit 10 --offset 10 | cdp ds-put "file://chroma-qna"

File location: The file path is relative to the current working directory.

Processing

Copy one Chroma collection to another and re-embed the documents:

cdp export "http://localhost:8000/chroma-qna" | cdp embed --ef default | cdp import "http://localhost:8000/chroma-qna-def-emb" --upsert --create

Note: See Embedding Processors for more info about supported embedding functions.
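
As a rough sketch, re-embedding with a different embedding function only changes the --ef argument. The identifier below (openai) is an assumption rather than a confirmed value; check the Embedding Processors documentation for the identifiers your installation supports, and note that hosted embedding functions typically require an API key in the environment (e.g. OPENAI_API_KEY).

cdp export "http://localhost:8000/chroma-qna" | cdp embed --ef openai | cdp import "http://localhost:8000/chroma-qna-openai" --upsert --create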

Import a dataset from HF into a local persisted Chroma DB and embed the documents:

cdp ds-get "hf://tazarov/ds2?split=train" | cdp embed --ef default | cdp import "file://chroma-data/chroma-qna-def-emb-hf" --upsert --create

Chunk large documents:

cdp imp pdf sample-data/papers/ | grep "2401.02412.pdf" | head -1 | cdp chunk -s 500

Misc

Count the number of documents in a collection:

cdp export "http://localhost:8000/chroma-qna" | wc -l
