llm-cluster

LLM plugin for clustering embeddings.

Installation

Install this plugin in the same environment as LLM.

llm install llm-cluster

Usage

The plugin adds a new command, llm cluster. This command takes the name of an embedding collection and the number of clusters to return.
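To illustrate what grouping embedding vectors into a fixed number of clusters means, here is a minimal k-means sketch in pure Python. It is an illustration only and is not the plugin's implementation: the choice of k-means, the `kmeans` function, the seeding strategy and the toy vectors are all assumptions made for this example.

```python
import math

def kmeans(vectors, k, iterations=10):
    """Group vectors into k clusters; returns one cluster index per vector."""
    # Assumption: seed centroids with the first k vectors so the sketch is deterministic.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Toy 2-dimensional "embeddings" (hypothetical values; real embeddings have many more dimensions).
toy_vectors = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]
labels = kmeans(toy_vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```

Vectors that sit close together receive the same label, which is the same idea as the numbered clusters that llm cluster returns.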

First, use paginate-json and jq to populate a collection. In this case we are embedding the title of every issue in the llm repository, and storing the result in an issues.db database:

paginate-json 'https://api.github.com/repos/simonw/llm/issues?state=all&filter=all' \
  | jq '[.[] | {id: .id, title: .title}]' \
  | llm embed-multi llm-issues - \
    --database issues.db --store

The --store flag causes the content to be stored in the database along with the embedding vectors.
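If jq is not available, the filter step in the pipeline above can be approximated in Python. This is a minimal sketch, assuming the issues have already been loaded as a list of dictionaries; the `pick_id_and_title` helper and the placeholder `body` values are hypothetical and exist only for this example.

```python
import json

def pick_id_and_title(issues):
    """Python equivalent of the jq filter [.[] | {id: .id, title: .title}]."""
    # .get() yields None for a missing key, mirroring jq's null.
    return [{"id": issue.get("id"), "title": issue.get("title")} for issue in issues]

# Ids and titles are taken from the example output in this README.
# The "body" values are placeholders standing in for the other fields an issue carries.
sample = [
    {"id": 1650662628, "title": "Initial design", "body": "placeholder"},
    {"id": 1650682379, "title": "Log prompts and responses to SQLite", "body": "placeholder"},
]

# The resulting JSON array has the same shape that the jq step produces.
print(json.dumps(pick_id_and_title(sample), indent=2))
```

Only the id and title survive the filter, so the title is the only text that gets embedded and stored.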

Now we can cluster those embeddings into 10 groups:

llm cluster llm-issues 10 \
  -d issues.db

If you omit the -d option the default embeddings database will be used.

The output should look something like this (truncated):

[
  {
    "id": "2",
    "items": [
      {
        "id": "1650662628",
        "content": "Initial design"
      },
      {
        "id": "1650682379",
        "content": "Log prompts and responses to SQLite"
      }
    ]
  },
  {
    "id": "4",
    "items": [
      {
        "id": "1650760699",
        "content": "llm web command - launches a web server"
      },
      {
        "id": "1759659476",
        "content": "`llm models` command"
      },
      {
        "id": "1784156919",
        "content": "`llm.get_model(alias)` helper"
      }
    ]
  },
  {
    "id": "7",
    "items": [
      {
        "id": "1650765575",
        "content": "--code mode for outputting code"
      },
      {
        "id": "1659086298",
        "content": "Accept PROMPT from --stdin"
      },
      {
        "id": "1714651657",
        "content": "Accept input from standard in"
      }
    ]
  }
]

The content displayed is truncated to 100 characters. Pass --truncate 0 to disable truncation, or --truncate X to truncate to X characters.
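Because the output is JSON, it is easy to post-process. The following sketch, which assumes the structure shown in the example output above, counts the items in each cluster. The `cluster_sizes` helper is a hypothetical name, and the data is an abbreviated copy of that example.

```python
import json

# Abbreviated copy of the example output above (clusters "2" and "7" only).
EXAMPLE_OUTPUT = """
[
  {"id": "2", "items": [
    {"id": "1650662628", "content": "Initial design"},
    {"id": "1650682379", "content": "Log prompts and responses to SQLite"}
  ]},
  {"id": "7", "items": [
    {"id": "1650765575", "content": "--code mode for outputting code"},
    {"id": "1659086298", "content": "Accept PROMPT from --stdin"},
    {"id": "1714651657", "content": "Accept input from standard in"}
  ]}
]
"""

def cluster_sizes(clusters):
    """Map each cluster id to the number of items it contains."""
    return {cluster["id"]: len(cluster["items"]) for cluster in clusters}

clusters = json.loads(EXAMPLE_OUTPUT)
sizes = cluster_sizes(clusters)

# Print the largest clusters first.
for cluster_id, size in sorted(sizes.items(), key=lambda pair: pair[1], reverse=True):
    print(f"cluster {cluster_id}: {size} items")
```

In practice you would probably read the JSON from a file or from standard input rather than from an embedded string.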

Generating summaries for each cluster

The --summary flag will cause the plugin to generate a summary for each cluster, by passing the content of the items (truncated according to the --truncate option) through a prompt to a Large Language Model.

This feature is still experimental. You should experiment with custom prompts to improve the quality of your summaries.

Because this can run a large amount of text through an LLM, it can be expensive, depending on which model you are using.

This feature only works for embeddings that have had their associated content stored in the database using the --store flag.

You can use it like this:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary

This uses the default prompt and the default model.

To use a different model, e.g. GPT-4, pass the --model option:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4

The default prompt used is:

Short, concise title for this cluster of related documents.

To use a custom prompt, pass --prompt:

llm cluster llm-issues 10 \
  -d issues.db \
  --summary \
  --model gpt-4 \
  --prompt 'Summarize this in a short line in the style of a bored, angry panda'

A "summary" key will be added to each cluster, containing the generated summary.
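A script that consumes the output can read that key alongside the items. This is a minimal sketch, assuming the cluster structure shown earlier in this README; the `summary_lines` helper is hypothetical, and the summary text below is a placeholder rather than real model output.

```python
def summary_lines(clusters):
    """Build one human-readable line per cluster."""
    lines = []
    for cluster in clusters:
        # "summary" is only present when --summary was used, so fall back gracefully.
        summary = cluster.get("summary", "(no summary)")
        lines.append(f'{cluster["id"]}: {summary} ({len(cluster["items"])} items)')
    return lines

clusters = [
    {
        "id": "2",
        "summary": "PLACEHOLDER SUMMARY",  # placeholder, not generated by a model
        "items": [
            {"id": "1650662628", "content": "Initial design"},
            {"id": "1650682379", "content": "Log prompts and responses to SQLite"},
        ],
    },
    {
        "id": "7",  # no "summary" key, as when --summary is omitted
        "items": [
            {"id": "1650765575", "content": "--code mode for outputting code"},
            {"id": "1659086298", "content": "Accept PROMPT from --stdin"},
            {"id": "1714651657", "content": "Accept input from standard in"},
        ],
    },
]

for line in summary_lines(clusters):
    print(line)
```

Using `.get()` with a fallback lets the same script handle output produced with or without --summary.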

Development

To set up this plugin locally, first check out the code. Then create a new virtual environment:

cd llm-cluster
python3 -m venv venv
source venv/bin/activate

Now install the dependencies and test dependencies:

pip install -e '.[test]'

To run the tests:

pytest
