Skip to main content

No project description provided

Project description

drawingVexpresso

Vexpresso is a simple and scalable multi-modal vector database built with Daft

Querying Pokemon with images and text

Features

🍵 Simple: Vexpresso is lightweight and is very easy to get started!

🔌 Flexible: Unlike many other vector databases, Vexpresso supports arbitrary datatypes. This means that you can query muti-modal objects (images, audio, video, etc...)

🌐 Scalable: Because Vexpresso uses Daft, it can be scaled using Ray to multi-gpu / cpu clusters.

📚 Persistent: Easy Saving and Loading functionality: Vexpresso has easily accessible functions for saving / loading to huggingface datasets.

Installation

To install from PyPi:

pip install vexpresso

To install from source:

git clone git@github.com:shyamsn97/vexpresso.git
cd vexpresso
pip install -e .

Usage

🔥 Check out our Showcase notebook for a more detailed walkthrough!

In this simple example, we create a simple collection and embed using huggingface sentence transformers.

from typing import List, Any
import vexpresso
# import embedding functions from vexpresso
import vexpresso.embedding_functions as ef

# creating a collection object!
collection = vexpresso.create(
    data = {
        "documents":[
            "This is document1",
            "This is document2",
            "This is document3",
            "This is document4",
            "This is document5",
            "This is document6"
        ],
        "source":["notion", "google-docs", "google-docs", "notion", "google-docs", "google-docs"],
        "num_lines":[10, 20, 30, 40, 50, 60]
    }
    # backend="ray" # turn this flag on to start / connect to a ray cluster!
)

# create a simple embedding function from sentence_transformers
def hf_embed_fn(content: List[Any]):
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(content, convert_to_tensor=True).detach().cpu().numpy()

# or use a langchain embedding function
def langchain_embed_fn(content: List[Any]):
    from langchain.embeddings import OpenAIEmbeddings
    embeddings_model = OpenAIEmbeddings()
    return embeddings_model.embed_documents(content)

# embed function creates a column in the collection with embeddings. There can be more than one embedding column!
# lazy execution until .execute is called
collection = collection.embed(
    "documents",
    embedding_fn=hf_embed_fn,
    to="document_embeddings",
    # lazy=False # if this is false, execute doesn't need to be called
).execute()

# creating a queried collection with a subset of content closest to query
queried_collection = collection.query(
    "document_embeddings",
    query="query document6",
    k = 4, # return 2 closest
    lazy=False
    # query_embedding=[query1, query2, ...]
    # filter_conditions={"metadata_field":{"operator, ex: 'eq'":"value"}} # optional metadata filter
)

# batch query -- return a list of collections
# batch_queried_collection = collection.batch_query(
#     "document_embeddings",
#     queries=["doc1", "doc2"],
#     k = 2
# )

# filter collection for documents with num_lines less than or equal to 30
filtered_collection = queried_collection.filter(
    {
        "num_lines": {"lte":30}
    }
).execute()

# show dataframe
filtered_collection.show()

# convert to dictionary
filtered_dict = filtered_collection.to_dict()
documents = filtered_dict["documents"]

# add an entry!
collection = collection.add(
    [
        {"documents":"new documents 1", "source":"notion", "num_lines":2},
        {"documents":"new documents 2", "source":"google-docs", "num_lines":40}
    ]
)
collection = collection.execute()

Resources

Contributing

Feel free to make a pull request or open an issue for a feature!

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

vexpresso-0.0.2.tar.gz (21.3 kB view hashes)

Uploaded Source

Built Distribution

vexpresso-0.0.2-py3-none-any.whl (23.7 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page