llm-cluster
LLM plugin for clustering embeddings.
Installation
Install this plugin in the same environment as LLM.
llm install llm-cluster
Usage
The plugin adds a new command, llm cluster. This command takes the name of an embedding collection and the number of clusters to return.
First, use paginate-json and jq to populate a collection. In this case we are embedding the title of every issue in the llm repository, and storing the result in an issues.db database:
paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
| jq '[.[] | {id: .id, title: .title}]' \
| llm embed-multi llm-issues - \
--database issues.db --store
The --store flag causes the content to be stored in the database along with the embedding vectors.
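If jq is unfamiliar, here is a rough Python sketch of what the filter in that pipeline does: it keeps only the id and title of each issue. The function name and the sample records are made up for illustration; real GitHub API responses carry many more fields.

```python
import json


def to_embed_records(issues):
    """Keep only the id and title of each issue, like the jq filter does."""
    # .get() returns None for a missing key, similar to jq yielding null.
    return [{"id": issue.get("id"), "title": issue.get("title")} for issue in issues]


# Illustrative sample shaped like a GitHub issues response, with most fields omitted.
sample_issues = [
    {"id": 1650662628, "title": "Initial design", "state": "closed"},
    {"id": 1650682379, "title": "Log prompts and responses to SQLite", "state": "closed"},
]

records = to_embed_records(sample_issues)
print(json.dumps(records, indent=2))
```

The resulting list of {id, title} objects is the shape that gets piped into llm embed-multi in the command above.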
Now we can cluster those embeddings into 10 groups:
llm cluster llm-issues 10 \
-d issues.db
If you omit the -d option, the default embeddings database will be used.
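This README does not say which clustering algorithm the plugin uses. To give a feel for what grouping embedding vectors into a fixed number of clusters means, here is a minimal k-means sketch in pure Python. It is an illustration under that assumption, not the plugin's implementation, and the toy 2D vectors stand in for real embeddings, which have far more dimensions.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain k-means: returns one cluster index per input vector."""
    # Assumption: seed the centroids with the first k vectors. This is
    # deterministic but naive, and results depend on the input order.
    centroids = [tuple(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(tuple(v), centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


# Toy data: two points near the origin, two points near (5, 5).
toy_vectors = [(0.0, 0.1), (5.0, 5.1), (0.1, 0.0), (5.1, 5.0)]
labels = kmeans(toy_vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Vectors that share a label end up in the same group, which is the idea behind the numbered clusters the command returns.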
The output should look something like this (truncated):
[
    {
        "id": "2",
        "items": [
            {
                "id": "1650662628",
                "content": "Initial design"
            },
            {
                "id": "1650682379",
                "content": "Log prompts and responses to SQLite"
            }
        ]
    },
    {
        "id": "4",
        "items": [
            {
                "id": "1650760699",
                "content": "llm web command - launches a web server"
            },
            {
                "id": "1759659476",
                "content": "`llm models` command"
            },
            {
                "id": "1784156919",
                "content": "`llm.get_model(alias)` helper"
            }
        ]
    },
    {
        "id": "7",
        "items": [
            {
                "id": "1650765575",
                "content": "--code mode for outputting code"
            },
            {
                "id": "1659086298",
                "content": "Accept PROMPT from --stdin"
            },
            {
                "id": "1714651657",
                "content": "Accept input from standard in"
            }
        ]
    }
]
The content displayed is truncated to 100 characters. Pass --truncate 0 to disable truncation, or --truncate X to truncate to X characters.
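Because the command emits JSON, the result is easy to post-process. The following sketch loads output of the shape shown above and counts the items in each cluster. The embedded sample reuses a few entries from that output; in practice you would read the JSON from wherever you saved it.

```python
import json

# Sample with the same structure as the cluster output: a list of clusters,
# each with an "id" and a list of "items".
raw_output = """
[
  {"id": "2", "items": [
    {"id": "1650662628", "content": "Initial design"},
    {"id": "1650682379", "content": "Log prompts and responses to SQLite"}
  ]},
  {"id": "7", "items": [
    {"id": "1650765575", "content": "--code mode for outputting code"},
    {"id": "1659086298", "content": "Accept PROMPT from --stdin"},
    {"id": "1714651657", "content": "Accept input from standard in"}
  ]}
]
"""

clusters = json.loads(raw_output)

# Map each cluster id to the number of items it contains.
sizes = {cluster["id"]: len(cluster["items"]) for cluster in clusters}

# Print the clusters, largest first.
for cluster_id, size in sorted(sizes.items(), key=lambda pair: -pair[1]):
    print(f"cluster {cluster_id}: {size} items")
```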
Generating summaries for each cluster
The --summary flag causes the plugin to generate a summary for each cluster by passing the content of its items (truncated according to the --truncate option) through a prompt to a Large Language Model.
This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.
Since this can run a large amount of text through an LLM, it can be expensive, depending on which model you are using.
This feature only works for embeddings that have had their associated content stored in the database using the --store flag.
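To make the cost consideration concrete, here is a hypothetical sketch of how the text for one cluster might be assembled before it is sent to a model. The function name, the one-item-per-line joining, and the skipping of empty items are assumptions for illustration; only the default prompt text and the truncation behaviour come from this README.

```python
# Default prompt text as quoted in this README.
DEFAULT_PROMPT = "Short, concise title for this cluster of related documents."


def build_summary_input(items, truncate=100):
    """Join item contents, one per line, cutting each to `truncate` characters.

    A truncate value of 0 disables truncation, mirroring --truncate 0.
    (Hypothetical helper: the plugin's real logic may differ.)
    """
    lines = []
    for item in items:
        content = item.get("content") or ""
        if truncate:
            content = content[:truncate]
        # Skip items with no stored content.
        if content.strip():
            lines.append(content)
    return "\n".join(lines)


items = [
    {"id": "1659086298", "content": "Accept PROMPT from --stdin"},
    {"id": "1714651657", "content": "Accept input from standard in"},
]

text = build_summary_input(items, truncate=12)
print(DEFAULT_PROMPT)
print(text)
```

The size of the assembled text grows with both the number of items and the truncation length, which is why a lower --truncate value can reduce how much text each summary request contains.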
You can use it like this:
llm cluster llm-issues 10 \
-d issues.db \
--summary
This uses the default prompt and the default model.
To use a different model, e.g. GPT-4, pass the --model option:
llm cluster llm-issues 10 \
-d issues.db \
--summary \
--model gpt-4
The default prompt used is:
Short, concise title for this cluster of related documents.
To use a custom prompt, pass --prompt:
llm cluster llm-issues 10 \
-d issues.db \
--summary \
--model gpt-4 \
--prompt 'Summarize this in a short line in the style of a bored, angry panda'
A "summary" key will be added to each cluster, containing the generated summary.
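As a sketch of how that key can be consumed, the snippet below reads cluster output and prints one line per cluster. The summary strings in the sample are invented placeholders, not real model output.

```python
import json

# Sample cluster output; the "summary" values are made-up placeholders.
raw_output = """
[
  {"id": "2", "summary": "Example summary A", "items": [
    {"id": "1650662628", "content": "Initial design"}
  ]},
  {"id": "7", "summary": "Example summary B", "items": [
    {"id": "1659086298", "content": "Accept PROMPT from --stdin"},
    {"id": "1714651657", "content": "Accept input from standard in"}
  ]}
]
"""

clusters = json.loads(raw_output)

# .get() keeps this working on output produced without --summary.
summaries = {cluster["id"]: cluster.get("summary") for cluster in clusters}

for cluster in clusters:
    label = cluster.get("summary", "(no summary)")
    print(f'{cluster["id"]}: {label} ({len(cluster["items"])} items)')
```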
Development
To set up this plugin locally, first check out the code. Then create a new virtual environment:
cd llm-cluster
python3 -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest