
Booktest is a snapshot testing library for review-driven testing.

Project description

Booktest - Review-Driven Testing for Data Science

License: MIT | Python 3.10+

Booktest is a regression testing tool for systems where outputs aren't strictly right or wrong — ML models, LLM applications, NLP pipelines, and other data science systems that need expert review rather than binary assertions.

Booktest Demo

In these systems, the hard problem isn't checking correctness — it's seeing what changed. When you update a prompt, retrain a model, or tweak parameters, behavior shifts across the system. Most testing tools reduce this to a single verdict: pass or fail. In practice, that's a "computer says no" experience — a failure signal without diagnostics, raising more questions than it answers.

Booktest treats regressions as information, not verdicts. It captures test outputs as readable markdown, tracks them in Git, and makes behavioral changes reviewable — the same way you review code. The richer your diagnostics, the faster you find root causes and the fewer iterations you need.

Tolerance metrics separate real regressions from noise. AI evaluation scales review beyond what humans can do manually. A build-system-style dependency graph makes each iteration faster by letting you re-run one step of a pipeline without re-running everything before it.

Built by Netigate (formerly Lumoa) after years of production use testing NLP, ML, and LLM models that process millions of customer feedback messages. The author has built and used similar tools over a two-decade career in DS/ML, information retrieval, and predictive database R&D.

import booktest as bt

def test_gpt_response(t: bt.TestCaseRun):
    response = generate_response("What is the capital of France?")

    # Capture output as reviewable markdown
    t.h1("GPT Response")
    t.iln(response)

    # AI evaluates AI outputs
    r = t.start_review()
    r.iln(response)
    r.reviewln("Is response accurate?", "Yes", "No")
    r.reviewln("Is it concise?", "Yes", "No")

    # Tolerance metrics - catch regressions, ignore noise
    accuracy = evaluate_accuracy(response)
    t.tmetric(accuracy, tolerance=0.05)  # 85% ± 5% = OK

# Review changes interactively
booktest -v -i

# See exactly what changed:
?  * Prediction: 54% Positive (should be Negative)     |  * Prediction: 51% Negative (ok)
?  * Accuracy: 93.3% (was 98.4%, delta -5.1%)

    test_model DIFF 3027 ms
    (a)ccept, (c)ontinue, (q)uit, (v)iew, (l)ogs, (d)iff or fast (D)iff

The Three Problems Booktest Solves

1. No Correct Answer

Traditional software testing has clear pass/fail:

assert result == "Paris"  # Clear right/wrong

Data science doesn't:

# Which is "correct"?
result1 = "Paris"
result2 = "The capital of France is Paris, which is located..."
result3 = "Paris, France"

assert result == ???  # No single correct answer

You need expert review, statistical thresholds, human judgment — but manual review doesn't scale to 1,000 test cases.

Booktest solution:

import booktest as bt

def test_gpt_response(t: bt.TestCaseRun):
    response = generate_response("What is the capital of France?")

    # 1. Human review via markdown output & Git diffs
    t.h1("GPT Response")
    t.iln(response)

    # 2. AI reviews AI outputs automatically
    r = t.start_review()
    r.iln(response)
    r.reviewln("Is response accurate?", "Yes", "No")
    r.reviewln("Is it concise?", "Yes", "No")

    # 3. Tolerance metrics - catch regressions, not noise
    accuracy = evaluate_accuracy(response)
    t.tmetric(accuracy, tolerance=0.05)  # 85% +/- 5% = OK

Three-tier quality control: human review via markdown, AI evaluation at scale, tolerance metrics for trends.


2. Regressions Without Visibility

This is the scenario every data scientist recognizes — spooky action at a distance:

  • Change one prompt -> tests fail in unrelated areas
  • Update training data -> model behaves differently everywhere
  • Tweak hyperparameters -> metrics shift across the board
  • Refine system context -> LLM loses focus on some other detail

Traditional testing gives you binary pass/fail, no visibility into what actually changed, and no way to accept "close enough" changes.

Booktest treats test outputs like code:

def test_model_predictions(t: bt.TestCaseRun):
    model = load_model()
    predictions = model.predict(test_data)

    # Snapshot everything as markdown
    t.h1("Model Predictions")
    t.tdf(predictions)  # DataFrame -> readable markdown table

    # Track metrics
    t.key("Accuracy:").tmetric(accuracy, tolerance=0.05)
    t.key("F1 Score:").tmetric(f1, tolerance=0.05)

Review changes like code:

booktest -v -i

# See exactly what changed:
   ...
?  ?  * Prediction: 54% Positive (should be Negative)                          |  * Prediction: 51% Negative (ok)
   ...
?  ?  * Accuracy: 93.3% (was 98.4%, delta -5.1%)
   ...

    test/datascience/test_model.py::test_model_predictions DIFF 3027 ms (snapshots updated)
    (a)ccept, (c)ontinue, (q)uit, (v)iew, (l)ogs, (d)iff or fast (D)iff

Regressions become reviewable, not catastrophic. Git history tracks how your model evolved.
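
Because snapshots are plain markdown tracked in Git (under books/, as shown in the Quick Start below), you can also inspect how they evolved with ordinary Git tooling, independent of the booktest CLI:

# Inspect accepted snapshot changes with plain Git
git log --oneline -- books/
git diff HEAD~1 -- books/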


3. Expensive Operations, Slow Iteration

A typical ML pipeline: load data, clean, featurize, train, validate, test, generate reports.

Traditional testing forces you to run all steps every time, even when you're only changing the last one.

Example pipeline:

  1. Load data: 5 min
  2. Train model: 20 min
  3. Evaluate: 4 min
  4. Generate report: 1 min

Total: 30 minutes to test a report formatting change

Booktest is a build system for tests:

Tests return objects (like Make targets). Other tests depend on them. Change step 4 -> only step 4 re-runs.

# Step 1: Load data (slow, runs once)
def test_load_data(t: bt.TestCaseRun):
    data = expensive_data_load()  # 5 minutes
    t.tln(f"Loaded {len(data)} rows")
    return data  # Cache result

# Step 2: Train model (slow, depends on step 1)
@bt.depends_on(test_load_data)
def test_train_model(t: bt.TestCaseRun, data):
    model = train_large_model(data)  # 20 minutes
    t.key("Accuracy:").tmetric(model.accuracy, tolerance=0.05)
    return model  # Cache result

# Step 3: Evaluate (depends on step 2)
@bt.depends_on(test_train_model)
def test_evaluate(t: bt.TestCaseRun, model):
    results = evaluate(model, test_data)  # 4 minutes
    t.tdf(results)
    return results

# Step 4: Generate report (depends on step 3)
@bt.depends_on(test_evaluate)
def test_report(t: bt.TestCaseRun, results):
    report = generate_report(results)  # 1 minute
    t.h1("Final Report")
    t.tln(report)

Iteration speed:

  • Change formatting in step 4? Only step 4 re-runs (1 min, not 30 min)
  • Change model params in step 2? Steps 2-4 re-run (25 min, step 1 cached)
  • All steps in parallel? booktest test -p8 -> smart scheduling

Plus HTTP mocking:

@bt.snapshot_httpx()  # Record once, replay forever
def test_openai_prompts(t):
    response = openai.chat(...)  # 5s first run, instant after

Test each pipeline step in isolation, reuse expensive results.

Real example: 3-step agent testing. Break the agent into plan, answer, and validate steps, then iterate on the validation logic without re-running plan generation.


How Booktest Compares

Problem                    Jupyter             pytest + syrupy    promptfoo    Booktest
Expert review at scale     Manual              No support         LLM only     AI-assisted
Tolerance metrics          None                None               None         Built-in
Pipeline decomposition     No                  No                 No           Built-in
Git-trackable outputs      No                  Basic              No           Markdown
HTTP/LLM mocking           Manual              Complex            No           Automatic
Parallel execution         No                  Limited            Limited      Native
Data science ergonomics    Exploration only    No                 No           Yes

Jupyter: Great for exploration, not for regression testing. No automated review, no Git tracking, no CI/CD integration.

pytest + syrupy: Built for deterministic outputs. No concept of "good enough" — either exact match or fail.

promptfoo/langsmith: LLM-focused evaluation platforms. Missing: dataframe support, metric tracking with tolerance, resource sharing, parallel dependency resolution.

Booktest: Combines review-driven workflow + tolerance metrics + snapshot testing + parallel execution for data science.


Key Capabilities

Tolerance-Based Metrics

Track metrics with acceptable ranges instead of exact matches:

t.tmetric(accuracy, tolerance=0.05)  # 87% -> 86% = OK
                                      # 87% -> 80% = DIFF

t.assertln("Accuracy >= 80%", accuracy >= 0.80)  # Hard minimum threshold

Catch real regressions, ignore noise.
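
As a mental model, the tolerance check amounts to comparing the new value against the previously accepted snapshot value. The sketch below illustrates that semantics with a hypothetical helper; it is not booktest's internal implementation:

# Hypothetical helper illustrating the tolerance semantics described above
def within_tolerance(new_value: float, snapshot_value: float, tolerance: float) -> bool:
    # OK when the new value stays within +/- tolerance of the accepted snapshot value
    return abs(new_value - snapshot_value) <= tolerance

within_tolerance(0.86, 0.87, tolerance=0.05)  # True  -> OK
within_tolerance(0.80, 0.87, tolerance=0.05)  # False -> DIFF, flagged for review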

AI-Powered Review

Two capabilities for scaling test review:

AI evaluation of outputs — use LLMs to evaluate LLM outputs:

r = t.start_review()
r.iln(response)
r.reviewln("Is code syntactically correct?", "Yes", "No")
r.reviewln("Does it solve the problem?", "Yes", "No")

First run: AI evaluates and records decisions. Subsequent runs: reuses evaluations (instant, deterministic, free). Only re-evaluates when outputs change.

AI-assisted diff review — when many tests change output, AI triages which changes need human attention:

booktest -R        # AI reviews all diffs automatically
booktest -R -i     # Interactive: press 'R' for AI review on individual tests

AI classifies changes from ACCEPT (auto-approve) to FAIL (auto-reject), with intermediate categories flagged for human review. See the Feature Guide for details.

DVC Integration

Large HTTP/LLM snapshots are stored in DVC instead of Git, while markdown outputs stay in Git for easy review:

@bt.snapshot_httpx()
def test_gpt(t: bt.TestCaseRun):
    response = openai.chat(...)  # Cassette -> DVC, Git tracks only manifest hash
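
In CI, the DVC-tracked cassettes would typically be fetched before the run with standard DVC commands; a sketch, assuming a DVC remote is already configured:

# Fetch large snapshots from the DVC remote, then run the suite in parallel
dvc pull
booktest -p8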

Auto-Report on Failures

Tests that fail now show a detailed report automatically — no need to remember verbose flags:

booktest -p8    # Run in parallel; failures show detailed report automatically

Reviewable Changes

Review and selectively accept or reject changes:

booktest -w       # Interactive review of failures
booktest -u -c    # Accept all changes

Quick Start

# Install
pip install booktest

# Initialize
booktest --setup

# Create your first test
cat > test/test_hello.py << EOF
import booktest as bt

def test_hello(t: bt.TestCaseRun):
    t.h1("My First Test")
    t.tln("Hello, World!")
EOF

# Run
booktest

# Or run with verbose output during execution
booktest -v

# Or run interactively to review each test
booktest -v -i

Output: Test results saved to books/test/test_hello.md

# My First Test

Hello, World!

When tests fail: Detailed failure report appears automatically.

Next steps: See the Getting Started Guide for LLM evaluation, metric tracking, and more.


Real-World Examples

At Netigate: Testing sentiment classification across 50 languages x 20 topic models x 100 customer segments = 100,000 test combinations. Booktest reduced CI time from 12 hours to 45 minutes while catching 3x more regressions through systematic review.

LLM Application Testing

@bt.snapshot_httpx()  # Mock OpenAI automatically
def test_code_generation(t: bt.TestCaseRun):
    code = generate_code("fizzbuzz in python")

    r = t.start_review()
    r.h1("Generated Code")
    r.icode(code, "python")

    # Use LLM to evaluate LLM output
    r.reviewln("Is code syntactically correct?", "Yes", "No")
    r.reviewln("Does it solve fizzbuzz?", "Yes", "No")
    r.reviewln("Code quality?", "Excellent", "Good", "Poor")

ML Model Evaluation

def test_sentiment_model(t: bt.TestCaseRun):
    model = load_model()
    predictions = model.predict(test_data)

    t.h1("Predictions")
    t.tdf(predictions)  # Snapshot as table

    # Two-tier evaluation
    t.h2("Metrics (with tolerance)")
    t.key("Accuracy:").tmetric(accuracy, tolerance=0.05)
    t.key("F1 Score:").tmetric(f1, tolerance=0.05)

    t.h2("Minimum Requirements")
    t.assertln("Accuracy >= 80%", accuracy >= 0.80)
    t.assertln("F1 >= 0.75", f1 >= 0.75)

Agent Testing with Build System

# Step 1: Agent plans approach (slow: loads docs, calls GPT)
@snapshot_gpt()
def test_agent_step1_plan(t: bt.TestCaseRun):
    context = load_documentation()  # Expensive
    plan = llm.create_plan(context)
    return {"context": context, "plan": plan}  # Cache for next steps

# Step 2: Agent generates answer (depends on step 1)
@bt.depends_on(test_agent_step1_plan)
@snapshot_gpt()
def test_agent_step2_answer(t, state):
    answer = llm.generate_answer(state["plan"])  # Uses cached state
    return {**state, "answer": answer}

# Step 3: Agent validates (depends on step 2)
@bt.depends_on(test_agent_step2_answer)
@snapshot_gpt()
def test_agent_step3_validate(t, state):
    validation = llm.validate(state["answer"])
    t.key("Quality:").tmetric(validation.score, tolerance=10)

Iteration speed:

  • Iterating on step 3? Steps 1-2 cached (instant)
  • First run: ~30 seconds (3 GPT calls)
  • Subsequent runs: ~100ms (all snapshotted)

Full example: test/datascience/test_agent.py

More examples: test/examples/ and test/datascience/


Core Features

Review and evaluation:

  • Human review via markdown - Git-tracked outputs, review changes like code diffs
  • AI-assisted review - LLM evaluates outputs automatically; -R flag for AI diff review
  • Tolerance metrics and assertions - Track trends with tmetric(), set thresholds with assertln()

Regression management:

  • Snapshot testing - Git-track all outputs as markdown
  • Git diff visibility - See exactly what changed
  • Selective acceptance - Accept good changes, reject bad ones
  • DVC integration - Large snapshots outside Git

Performance and pipeline:

  • Build system for tests - Tests return objects, other tests depend on them (like Make/Bazel)
  • Pipeline decomposition - Turn 10-step pipeline into 10 tests, iterate on any step independently
  • Automatic HTTP/LLM mocking - HTTP/HTTPX requests recorded and replayed with @snapshot_httpx()
  • Parallel execution - Native multi-core support with intelligent dependency scheduling
  • Resource sharing - Share expensive resources (models, data) across tests with @depends_on()

Data science ergonomics:

  • Markdown output - Human-readable, reviewable test reports
  • DataFrame support - Snapshot pandas DataFrames as tables
  • Image support - Snapshot plots and visualizations
  • Environment mocking - Control and snapshot env vars
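
A minimal sketch tying these ergonomics together, using only calls shown elsewhere in this README (t.h1, t.h2, t.tdf, t.key().tmetric()); the DataFrame contents and metric value are illustrative:

import booktest as bt
import pandas as pd

def test_label_distribution(t: bt.TestCaseRun):
    # Illustrative data; in practice this would come from a model or pipeline step
    df = pd.DataFrame({
        "label": ["positive", "negative", "neutral"],
        "count": [412, 97, 31],
    })

    t.h1("Label Distribution")
    t.tdf(df)  # snapshotted as a readable markdown table

    t.h2("Metrics")
    t.key("Positive share:").tmetric(412 / 540, tolerance=0.05)  # tracked against the accepted snapshot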

Documentation

See the Getting Started Guide and the Feature Guide for detailed walkthroughs of LLM evaluation, metric tracking, and AI-assisted review.


Use Cases

Works well for:

  • Testing LLM applications (ChatGPT, Claude, etc.)
  • ML model evaluation and monitoring
  • Data pipeline regression testing
  • Prompt engineering and optimization
  • Non-deterministic system testing
  • Exploratory data analysis that needs regression testing

Not the right fit for:

  • Traditional unit testing (use pytest)
  • Testing with strict equality requirements
  • Systems without a review component

FAQ

Q: Why not just use pytest-regtest or syrupy? A: Those work well for deterministic outputs. They don't handle tolerance-based metrics, subjective quality review, or large test matrices where you need to scale evaluation.

Q: Why not promptfoo or langsmith? A: They're good for LLM-specific evaluation. Booktest is complementary — it handles the broader data science workflow (dataframes, metrics, resource management, parallel execution) and integrates review-driven testing into your Git workflow.

Q: Won't AI reviews give inconsistent results? A: Reviews are snapshotted. First run records the AI's evaluation, subsequent runs reuse it (instant, deterministic, free). Re-evaluation only happens when output changes.

Q: Why Git-track test outputs? Won't that bloat my repo? A: Markdown outputs are small (human-readable summaries). Large snapshots (HTTP cassettes, binary data) go to DVC. You get reviewable diffs in Git without bloat.

Q: Does this replace pytest? A: No, it complements it. Use pytest for unit tests with clear pass/fail. Use booktest for integration tests, LLM outputs, model evaluation — anything requiring expert review or tolerance.

Q: How is this different from Make or Bazel? A: Similar concept (dependency graph, incremental builds) but purpose-built for testing. Tests return Python objects (models, dataframes), not files. Built-in review workflow, tolerance metrics, parallel scheduling with resource management.


Why "Booktest"?

Test outputs are organized like a book — chapters (test files), sections (test cases), with all results in readable markdown. Review your tests like reading a book, track changes in Git like code.


Community

Built by Netigate - Enterprise feedback and experience management platform.


License

MIT - See LICENSE for details.


Download files

Download the file for your platform.

Source Distribution

booktest-1.1.1.tar.gz (4.5 MB)

Uploaded Source

Built Distribution


booktest-1.1.1-py3-none-any.whl (112.0 kB)

Uploaded Python 3

File details

Details for the file booktest-1.1.1.tar.gz.

File metadata

  • Download URL: booktest-1.1.1.tar.gz
  • Upload date:
  • Size: 4.5 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for booktest-1.1.1.tar.gz

Algorithm     Hash digest
SHA256        134394aa9f8eae1658b261fd89aab649c567157db425f94bad046d9391a319a3
MD5           c1ee1c1ff722a451f23768e5a005071c
BLAKE2b-256   997a6c49efd44f7bfc118e3205dfb86706554b9fbcf4b1902b36b7430dc954b9


Provenance

The following attestation bundles were made for booktest-1.1.1.tar.gz:

Publisher: actions.yml on lumoa-oss/booktest

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file booktest-1.1.1-py3-none-any.whl.

File metadata

  • Download URL: booktest-1.1.1-py3-none-any.whl
  • Upload date:
  • Size: 112.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for booktest-1.1.1-py3-none-any.whl

Algorithm     Hash digest
SHA256        2ecf6ce8f8a4e6ac5d0ae4a78e00cc2b24f75ddc261620231a2eab72da238257
MD5           489a8ba72aa24c6377d10f14268868dd
BLAKE2b-256   bb19bc3d57e0db35ad0f69d6725a57da0bdd7cd2a7f3a72d0c7314d6ad350c79


Provenance

The following attestation bundles were made for booktest-1.1.1-py3-none-any.whl:

Publisher: actions.yml on lumoa-oss/booktest

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
