Universal library for evaluating AI models

AutoEvals

AutoEvals is a tool for quickly and easily evaluating AI model outputs. It comes with a variety of evaluation methods, including heuristic (e.g. Levenshtein distance), statistical (e.g. BLEU), and model-based (using LLMs).

Many of the model-based evaluations are adapted from OpenAI's excellent evals project, but are implemented so you can flexibly run them on individual examples, tweak the prompts, and debug their outputs.

You can also add your own custom prompts and let AutoEvals handle adding Chain-of-Thought, parsing outputs, and managing exceptions.

Installation

To install AutoEvals, run the following command:

pip install autoevals

Example

# The LLM-based evaluators call out to an LLM (OpenAI by default), so an API key
# (OPENAI_API_KEY) is required.
from autoevals.llm import *

# Fact judges whether the output is factually consistent with the expected answer.
evaluator = Fact()
result = evaluator(
    output="People's Republic of China",
    expected="China",
    input="Which country has the highest population?",
)
print(result.score)     # numeric score for this example
print(result.metadata)  # additional details about the evaluation

Supported Evaluation Methods

Heuristic

  • Levenshtein distance
  • Jaccard distance
  • BLEU
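
These heuristic scorers follow the same callable interface as the LLM-based evaluators, but run locally and don't need an API key. A minimal sketch, assuming a Levenshtein scorer class is exposed by the package (the exact module and class name may differ across versions):

# Assumption: a Levenshtein scorer with the same callable interface as Fact above.
from autoevals.string import Levenshtein

evaluator = Levenshtein()
result = evaluator(output="People's Republic of China", expected="China")
print(result.score)  # closer to 1 means the two strings are more similar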

Model-Based Classification

  • Battle
  • ClosedQA
  • Humor
  • Factuality
  • Security
  • Summarization
  • SQL
  • Translation
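
Each of these classifiers is called the same way as Fact in the example above; only the fields it expects differ. For instance, Battle asks an LLM which of two responses better follows a set of instructions. The sketch below is a rough illustration; the parameter names are assumptions inferred from that description and may differ in practice:

# Assumption: Battle takes the instructions plus the two responses to compare.
from autoevals.llm import Battle

evaluator = Battle()
result = evaluator(
    instructions="Add the following numbers: 1, 2, 3",
    output="600",
    expected="6",
)
print(result.score)  # higher when `output` is judged the better response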

Other Model-Based

  • Embedding distance
  • Fine-tuned classifiers
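
Embedding distance scores two strings by the similarity of their embeddings instead of asking an LLM to classify them. A rough sketch, assuming an embedding-similarity scorer with the same callable interface (the class name here is an assumption and may not match this release):

# Assumption: an EmbeddingSimilarity scorer that embeds both strings and compares
# them (e.g. cosine similarity); also requires an OpenAI API key.
from autoevals.string import EmbeddingSimilarity

evaluator = EmbeddingSimilarity()
result = evaluator(output="People's Republic of China", expected="China")
print(result.score)  # closer to 1 means the embeddings are more similar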

Custom Evaluation Prompts

AutoEvals supports custom evaluation prompts. To use one, pass in a prompt template and a scoring mechanism (a mapping from the model's choice to a score):

from autoevals import LLMClassifier

evaluator = LLMClassifier(
    """
You are a technical project manager who helps software engineers generate better titles for their GitHub issues.
You will look at the issue description, and pick which of two titles better describes it.

I'm going to provide you with the issue description, and two possible titles.

Issue Description: {{page_content}}

1: {{output}}
2: {{expected}}

Please discuss each title briefly (one line for pros, one for cons).
""",
    {"1": 1, "2": 0},
    use_cot=False,
)

page_content = """
As suggested by Nicolo, we should standardize the error responses coming from GoTrue, postgres, and realtime (and any other/future APIs) so that it's better DX when writing a client,

We can make this change on the servers themselves, but since postgrest and gotrue are fully/partially external may be harder to change, it might be an option to transform the errors within the client libraries/supabase-js, could be messy?

Nicolo also dropped this as a reference: http://spec.openapis.org/oas/v3.0.3#openapi-specification"""

gen_title = "Standardize error responses from GoTrue, Postgres, and Realtime APIs for better DX"
original_title = "Standardize Error Responses across APIs"


response = evaluator(gen_title, original_title, page_content=page_content)

print(f"Score: {response.score}")
