Chroma DB Data Pipes is a collection of tools for working with data in Chroma DB and building RAG systems.
ChromaDB Data Pipes 🖇️| Rediscover AI/ML the Unix Way
ChromaDB Data Pipes is a collection of tools to build data pipelines for Chroma DB, inspired by the Unix philosophy of "do one thing and do it well".
Installation
pip install chromadb-data-pipes
Usage
Get help:
cdp --help
Importing
Import data from HuggingFace Datasets to a `.jsonl` file:
cdp imp hf --uri "hf://tazarov/chroma-qna?split=train" > chroma-qna.jsonl
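A quick way to sanity-check a `.jsonl` file like the one written above is to confirm that every line looks like a single JSON object. The following is a rough sketch only: `sample.jsonl` and its field names are hypothetical stand-ins, not the actual record schema emitted by `cdp`, and the pattern match is a shape check rather than a full JSON parse.

```shell
# Hypothetical two-record file standing in for chroma-qna.jsonl.
# The field names are illustrative, not the cdp schema.
printf '%s\n' '{"id": "1", "text": "hello"}' '{"id": "2", "text": "world"}' > sample.jsonl

# Total number of lines in the file.
total=$(wc -l < sample.jsonl | tr -d ' ')

# Number of lines that start with "{" and end with "}".
valid=$(grep -c '^{.*}$' sample.jsonl)

echo "lines: $total, object-shaped lines: $valid"
```

If the two numbers differ, some line is probably truncated or is not a JSON object.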
Import data from HuggingFace Datasets to Chroma DB:
The command below imports the `train` split of the given dataset into the `chroma-qna` collection in Chroma. The collection will be created if it does not exist, and documents will be upserted.
cdp imp hf --uri "hf://tazarov/chroma-qna?split=train" | cdp imp chroma --uri "http://localhost:8000/default_database/chroma-qna" --upsert --create
Exporting
Export data from Chroma DB to a `.jsonl` file:
The command below exports the first 10 documents from the `chroma-qna` collection to the `chroma-qna.jsonl` file.
cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" --limit 10 > chroma-qna.jsonl
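For larger collections, `--limit` can be paired with the `--offset` flag (used in the HuggingFace export example in this document) to export in pages. The sketch below is a dry run: it only prints the commands it would issue, so it works without a Chroma server. The page size and total are assumed values, not defaults of `cdp`, and `print_pages` is a hypothetical helper name.

```shell
# Dry-run sketch: print one export command per page instead of running it.
# page_size and total are assumed values chosen for illustration.
print_pages() {
  uri="http://localhost:8000/default_database/chroma-qna"
  page_size=10
  total=30
  offset=0
  while [ "$offset" -lt "$total" ]; do
    echo "cdp exp chroma --uri \"$uri\" --limit $page_size --offset $offset"
    offset=$((offset + page_size))
  done
}

print_pages
```

Each printed command could have its output appended to the same `.jsonl` file with `>>`, since every exported document occupies one line.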
Export data from Chroma DB to HuggingFace Datasets:
The command below exports 10 documents, starting at offset 10, from the `chroma-qna` collection to the HuggingFace Datasets `tazarov/chroma-qna-modified` dataset. The dataset will be uploaded to HF.
!!! note HF Auth and Privacy
Make sure you have the `HF_TOKEN=hf_....` environment variable set.
If you want your dataset to be private, add the `--private` flag to the `cdp exp hf` command.
cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" --limit 10 --offset 10 | cdp exp hf --uri "hf://tazarov/chroma-qna-modified"
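Since the upload depends on `HF_TOKEN`, it can help to fail early when the variable is missing. A minimal sketch of such a guard follows; `require_hf_token` is a hypothetical helper and not part of `cdp`.

```shell
# Return non-zero, with a message on stderr, if HF_TOKEN is unset or empty.
require_hf_token() {
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set; export it before running cdp exp hf" >&2
    return 1
  fi
}

# Usage, left commented out so the sketch runs without a Chroma server:
# require_hf_token && \
#   cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" --limit 10 --offset 10 \
#   | cdp exp hf --uri "hf://tazarov/chroma-qna-modified"
```

The guard only checks that the variable is non-empty. It does not validate the token with HuggingFace.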
To export a dataset to a file, use `--uri` with the `file://` prefix:
cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" --limit 10 --offset 10 | cdp exp hf --uri "file://chroma-qna"
!!! note File Location
The file path is resolved relative to the current working directory.
Transforming
Copy documents from one Chroma collection to another and re-embed them:
cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" | cdp tx embed --ef default | cdp imp chroma --uri "http://localhost:8000/default_database/chroma-qna-def-emb" --upsert --create
Import a dataset from HF into Chroma and embed the documents:
cdp imp hf --uri "hf://tazarov/ds2?split=train" | cdp tx embed --ef default | cdp imp chroma --uri "http://localhost:8000/default_database/chroma-qna-def-emb-hf" --upsert --create
Misc
Count the number of documents in a collection:
cdp exp chroma --uri "http://localhost:8000/default_database/chroma-qna" | wc -l
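Because each exported document occupies one line, line counts can also serve as a rough check that a copy, such as the one in the Transforming section, moved every document. The sketch below compares two local exports. `src.jsonl`, `dst.jsonl`, their contents, and the `same_count` helper are all hypothetical.

```shell
# Succeed only if both JSONL files contain the same number of lines.
same_count() {
  a=$(wc -l < "$1")
  b=$(wc -l < "$2")
  [ "$a" -eq "$b" ]
}

# Hypothetical stand-ins for exports of the source and copied collections.
printf '%s\n' '{"id": "1"}' '{"id": "2"}' > src.jsonl
printf '%s\n' '{"id": "1"}' '{"id": "2"}' > dst.jsonl

if same_count src.jsonl dst.jsonl; then
  echo "counts match"
else
  echo "counts differ"
fi
```

Matching counts do not prove the contents are identical. They only rule out dropped or duplicated documents.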
Hashes for chromadb_data_pipes-0.0.1.tar.gz
Algorithm | Hash digest
---|---
SHA256 | 1a223e73afb4bcda183a548526d6cc1552ef622c91fb74c6f95b3ddf7369dcf4
MD5 | c269c0964e24763b36dd646906297d77
BLAKE2b-256 | 17feb245bb3432ea644e0885b221520fd1a209dfab6b2b8f33f47049b7b432d0
Hashes for chromadb_data_pipes-0.0.1-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 56f24e02523d47e403a3d4fb31042588746453e0be11b1e563a7f0e246fec2dc
MD5 | cda1a1c5f04e9321faf26b4e10c06564
BLAKE2b-256 | 3c8cba31378d7318c4245051dc53c734fa33d78eb20bf1e1472b05e3924cbb23