## Introduction

Implements a sentence-embedding retriever with a local cache, built on top of an embedding store abstraction.
## Features

- Embedding store abstraction class
- Jina client implementation of the embedding store
- Save the cache to a Parquet file
- Load the cache from an existing Parquet file
## Installation

```bash
pip install embestore
```
## Quick Start

### Option 1. Serve the embedding model with a Jina Flow

- To start the Jina Flow service with the sentence embedding model
  `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, clone this
  GitHub repo and serve it with the Docker container:

```bash
git clone https://github.com/ycc789741ycc/sentence-embedding-dataframe-cache.git
cd sentence-embedding-dataframe-cache
make serve-jina-embedding
```
- Retrieve the embeddings:

```python
from embestore.jina import JinaEmbeddingStore

JINA_embestore_GRPC = "grpc://0.0.0.0:54321"

query_sentences = ["I want to listen the music.", "Music don't want to listen me."]

jina_embestore = JinaEmbeddingStore(embedding_grpc=JINA_embestore_GRPC)
results = jina_embestore.retrieve_embeddings(sentences=query_sentences)
```
- Stop the Docker container:

```bash
make stop-jina-embedding
```
### Option 2. Use the local sentence embedding model `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`

```python
from embestore.torch import TorchEmbeddingStore

query_sentences = ["I want to listen the music.", "Music don't want to listen me."]

torch_embestore = TorchEmbeddingStore()
results = torch_embestore.retrieve_embeddings(sentences=query_sentences)
```
### Option 3. Inherit from the abstract class

```python
from typing import List, Text

import numpy as np
from sentence_transformers import SentenceTransformer

from embestore.base import EmbeddingStore

model = SentenceTransformer(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
).eval()


class TorchEmbeddingStore(EmbeddingStore):
    def _retrieve_embeddings_from_model(self, sentences: List[Text]) -> np.ndarray:
        return model.encode(sentences)
```
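The subclassing hook above suggests a template-method design: a public retrieval method checks the cache and delegates only the misses to `_retrieve_embeddings_from_model`. The following is a dependency-free sketch of that pattern under stated assumptions; the class names and the dummy model are hypothetical and are not the actual `embestore.base` source:

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Text

import numpy as np


class MiniEmbeddingStore(ABC):
    """Simplified sketch of the store/cache split (hypothetical, not embestore's code)."""

    def __init__(self) -> None:
        self._cache: Dict[Text, np.ndarray] = {}

    @abstractmethod
    def _retrieve_embeddings_from_model(self, sentences: List[Text]) -> np.ndarray:
        """Embed only the sentences that were not found in the cache."""

    def retrieve_embeddings(self, sentences: List[Text]) -> np.ndarray:
        # Cache misses hit the model once; hits are served from memory.
        missing = [s for s in sentences if s not in self._cache]
        if missing:
            vectors = self._retrieve_embeddings_from_model(missing)
            self._cache.update(zip(missing, vectors))
        return np.stack([self._cache[s] for s in sentences])


class DummyEmbeddingStore(MiniEmbeddingStore):
    """Uses a deterministic stand-in 'model' so the example runs offline."""

    def __init__(self) -> None:
        super().__init__()
        self.model_calls = 0

    def _retrieve_embeddings_from_model(self, sentences: List[Text]) -> np.ndarray:
        self.model_calls += 1  # count invocations to demonstrate the cache works
        return np.stack([np.full(3, float(len(s))) for s in sentences])


store = DummyEmbeddingStore()
store.retrieve_embeddings(["a", "bb"])
store.retrieve_embeddings(["a", "bb"])  # second call is served entirely from cache
```

With this split, a concrete store only has to supply the model call; batching of misses and cache bookkeeping live in the base class.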
### Save the cache

```python
torch_embestore.save("cache.parquet")
```
### Load from the cache

```python
torch_embestore = TorchEmbeddingStore("cache.parquet")
```
## Road Map

- [Done] Prototype abstraction
- [Done] Unit tests, integration tests
- [Done] Embedding retriever implementations: PyTorch, Jina
  - [Done] Jina
  - [Done] Sentence Embedding
- [Done] Docker service
- [Todo] Examples, documentation
- [Todo] Embedding monitor
- [Todo] `pip install` support
- [Improve] Accelerate the pandas retriever efficiency