Booktest is a snapshot testing library for review-driven testing.
Booktest - Review-Driven Testing for Data Science
Booktest is a regression testing tool for systems where outputs aren't strictly right or wrong — ML models, LLM applications, NLP pipelines, and other data science systems that need expert review rather than binary assertions.
In these systems, the hard problem isn't checking correctness — it's seeing what changed. When you update a prompt, retrain a model, or tweak parameters, behavior shifts across the system. Most testing tools reduce this to a single verdict: pass or fail. In practice, that's a "computer says no" experience — a failure signal without diagnostics, raising more questions than it answers.
Booktest treats regressions as information, not verdicts. It captures test outputs as readable markdown, tracks them in Git, and makes behavioral changes reviewable — the same way you review code. The richer your diagnostics, the faster you find root causes and the fewer iterations you need.
Tolerance metrics separate real regressions from noise. AI evaluation scales review beyond what humans can do manually. A build-system-style dependency graph makes each iteration faster by letting you re-run one step of a pipeline without re-running everything before it.
Built by Netigate (formerly Lumoa) after years of production use testing NLP, ML, and LLM models that process millions of customer feedback messages. The author has built and used similar tools over a two-decade career supporting DS/ML, information retrieval, and predictive database R&D.
```python
import booktest as bt

def test_gpt_response(t: bt.TestCaseRun):
    response = generate_response("What is the capital of France?")

    # Capture output as reviewable markdown
    t.h1("GPT Response")
    t.iln(response)

    # AI evaluates AI outputs
    r = t.start_review()
    r.iln(response)
    r.reviewln("Is response accurate?", "Yes", "No")
    r.reviewln("Is it concise?", "Yes", "No")

    # Tolerance metrics - catch regressions, ignore noise
    accuracy = evaluate_accuracy(response)
    t.tmetric(accuracy, tolerance=0.05)  # 85% ± 5% = OK
```

```
# Review changes interactively
booktest -v -i

# See exactly what changed:
? * Prediction: 54% Positive (should be Negative) | * Prediction: 51% Negative (ok)
? * Accuracy: 93.3% (was 98.4%, delta -5.1%)

test_model DIFF 3027 ms
(a)ccept, (c)ontinue, (q)uit, (v)iew, (l)ogs, (d)iff or fast (D)iff
```
The Three Problems Booktest Solves
1. No Correct Answer
Traditional software testing has clear pass/fail:
```python
assert result == "Paris"  # Clear right/wrong
```
Data science doesn't:
```python
# Which is "correct"?
result1 = "Paris"
result2 = "The capital of France is Paris, which is located..."
result3 = "Paris, France"

assert result == ???  # No single correct answer
```
You need expert review, statistical thresholds, human judgment — but manual review doesn't scale to 1,000 test cases.
Booktest solution:
```python
import booktest as bt

def test_gpt_response(t: bt.TestCaseRun):
    response = generate_response("What is the capital of France?")

    # 1. Human review via markdown output & Git diffs
    t.h1("GPT Response")
    t.iln(response)

    # 2. AI reviews AI outputs automatically
    r = t.start_review()
    r.iln(response)
    r.reviewln("Is response accurate?", "Yes", "No")
    r.reviewln("Is it concise?", "Yes", "No")

    # 3. Tolerance metrics - catch regressions, not noise
    accuracy = evaluate_accuracy(response)
    t.tmetric(accuracy, tolerance=0.05)  # 85% +/- 5% = OK
```
Three-tier quality control: human review via markdown, AI evaluation at scale, tolerance metrics for trends.
2. Regressions Without Visibility
This is the scenario every data scientist recognizes — spooky action at a distance:
- Change one prompt -> tests fail in unrelated areas
- Update training data -> model behaves differently everywhere
- Tweak hyperparameters -> metrics shift across the board
- Refine system context -> LLM loses focus on some other detail
Traditional testing gives you binary pass/fail, no visibility into what actually changed, and no way to accept "close enough" changes.
Booktest treats test outputs like code:
```python
def test_model_predictions(t: bt.TestCaseRun):
    model = load_model()
    predictions = model.predict(test_data)

    # Snapshot everything as markdown
    t.h1("Model Predictions")
    t.tdf(predictions)  # DataFrame -> readable markdown table

    # Track metrics
    t.key("Accuracy:").tmetric(accuracy, tolerance=0.05)
    t.key("F1 Score:").tmetric(f1, tolerance=0.05)
```
Review changes like code:
```
booktest -v -i

# See exactly what changed:
...
? * Prediction: 54% Positive (should be Negative) | * Prediction: 51% Negative (ok)
...
? * Accuracy: 93.3% (was 98.4%, delta -5.1%)
...
test/datascience/test_model.py::test_model_predictions DIFF 3027 ms (snapshots updated)
(a)ccept, (c)ontinue, (q)uit, (v)iew, (l)ogs, (d)iff or fast (D)iff
```
Regressions become reviewable, not catastrophic. Git history tracks how your model evolved.
3. Expensive Operations, Slow Iteration
A typical ML pipeline: load data, clean, featurize, train, validate, test, generate reports.
Traditional testing forces you to run all steps every time, even when you're only changing the last one.
Example pipeline:
- Load data: 5 min
- Train model: 20 min
- Evaluate: 4 min
- Generate report: 1 min
Total: 30 minutes to test a report formatting change
Booktest is a build system for tests:
Tests return objects (like Make targets). Other tests depend on them. Change step 4 -> only step 4 re-runs.
```python
# Step 1: Load data (slow, runs once)
def test_load_data(t: bt.TestCaseRun):
    data = expensive_data_load()  # 5 minutes
    t.tln(f"Loaded {len(data)} rows")
    return data  # Cache result

# Step 2: Train model (slow, depends on step 1)
@bt.depends_on(test_load_data)
def test_train_model(t: bt.TestCaseRun, data):
    model = train_large_model(data)  # 20 minutes
    t.key("Accuracy:").tmetric(model.accuracy, tolerance=0.05)
    return model  # Cache result

# Step 3: Evaluate (depends on step 2)
@bt.depends_on(test_train_model)
def test_evaluate(t: bt.TestCaseRun, model):
    results = evaluate(model, test_data)  # 4 minutes
    t.tdf(results)
    return results

# Step 4: Generate report (depends on step 3)
@bt.depends_on(test_evaluate)
def test_report(t: bt.TestCaseRun, results):
    report = generate_report(results)  # 1 minute
    t.h1("Final Report")
    t.tln(report)
```
Iteration speed:
- Change formatting in step 4? Only step 4 re-runs (1 min, not 30 min)
- Change model params in step 2? Steps 2-4 re-run (25 min, step 1 cached)
- All steps in parallel? `booktest test -p8` -> smart scheduling
Plus HTTP mocking:
```python
@bt.snapshot_httpx()  # Record once, replay forever
def test_openai_prompts(t):
    response = openai.chat(...)  # 5s first run, instant after
```
Test each pipeline step in isolation, reuse expensive results.
Real example: 3-step agent testing - break the agent into plan, answer, and validate steps, then iterate on the validation logic without re-running plan generation.
How Booktest Compares
| Problem | Jupyter | pytest + syrupy | promptfoo | Booktest |
|---|---|---|---|---|
| Expert review at scale | Manual | No support | LLM only | AI-assisted |
| Tolerance metrics | None | None | None | Built-in |
| Pipeline decomposition | No | No | No | Built-in |
| Git-trackable outputs | No | Basic | No | Markdown |
| HTTP/LLM mocking | Manual | Complex | No | Automatic |
| Parallel execution | No | Limited | Limited | Native |
| Data science ergonomics | Exploration only | No | No | Yes |
Jupyter: Great for exploration, not for regression testing. No automated review, no Git tracking, no CI/CD integration.
pytest + syrupy: Built for deterministic outputs. No concept of "good enough" — either exact match or fail.
promptfoo/langsmith: LLM-focused evaluation platforms. Missing: dataframe support, metric tracking with tolerance, resource sharing, parallel dependency resolution.
Booktest: Combines review-driven workflow + tolerance metrics + snapshot testing + parallel execution for data science.
Key Capabilities
Tolerance-Based Metrics
Track metrics with acceptable ranges instead of exact matches:
```python
t.tmetric(accuracy, tolerance=0.05)  # 87% -> 86% = OK
                                     # 87% -> 80% = DIFF

t.assertln("Accuracy >= 80%", accuracy >= 0.80)  # Hard minimum threshold
```
Catch real regressions, ignore noise.
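To make the acceptance band concrete, here is an illustrative sketch of the comparison implied by the comments above (a simplification for intuition only, not booktest's actual implementation):

```python
# Simplified illustration: a metric passes when the new value stays within
# +/- tolerance of the previously snapshotted value.
def within_tolerance(current: float, snapshot: float, tolerance: float) -> bool:
    return abs(current - snapshot) <= tolerance

assert within_tolerance(0.86, 0.87, tolerance=0.05)      # small drift -> OK
assert not within_tolerance(0.80, 0.87, tolerance=0.05)  # 7-point drop -> DIFF
```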
AI-Powered Review
Two capabilities for scaling test review:
AI evaluation of outputs — use LLMs to evaluate LLM outputs:
```python
r = t.start_review()
r.iln(response)
r.reviewln("Is code syntactically correct?", "Yes", "No")
r.reviewln("Does it solve the problem?", "Yes", "No")
```
First run: AI evaluates and records decisions. Subsequent runs: reuses evaluations (instant, deterministic, free). Only re-evaluates when outputs change.
AI-assisted diff review — when many tests change output, AI triages which changes need human attention:
```bash
booktest -R     # AI reviews all diffs automatically
booktest -R -i  # Interactive: press 'R' for AI review on individual tests
```
AI classifies changes from ACCEPT (auto-approve) to FAIL (auto-reject), with intermediate categories flagged for human review. See the Feature Guide for details.
DVC Integration
Large HTTP/LLM snapshots stored in DVC instead of Git. Markdown outputs stay in Git for easy review:
```python
@bt.snapshot_httpx()
def test_gpt(t: bt.TestCaseRun):
    response = openai.chat(...)  # Cassette -> DVC, Git tracks only manifest hash
```
Auto-Report on Failures
Tests that fail now show a detailed report automatically — no need to remember verbose flags:
```bash
booktest -p8  # Run in parallel; failures show detailed report automatically
```
Reviewable Changes
Review and selectively accept or reject changes:
```bash
booktest -w     # Interactive review of failures
booktest -u -c  # Accept all changes
```
Quick Start
```bash
# Install
pip install booktest

# Initialize
booktest --setup

# Create your first test
cat > test/test_hello.py << EOF
import booktest as bt

def test_hello(t: bt.TestCaseRun):
    t.h1("My First Test")
    t.tln("Hello, World!")
EOF

# Run
booktest

# Or run with verbose output during execution
booktest -v

# Or run interactively to review each test
booktest -v -i
```
Output: Test results saved to books/test/test_hello.md
```markdown
# My First Test

Hello, World!
```
When tests fail: Detailed failure report appears automatically.
Next steps: See Getting Started Guide for LLM evaluation, metric tracking, and more.
Real-World Examples
At Netigate: Testing sentiment classification across 50 languages x 20 topic models x 100 customer segments = 100,000 test combinations. Booktest reduced CI time from 12 hours to 45 minutes while catching 3x more regressions through systematic review.
LLM Application Testing
```python
@bt.snapshot_httpx()  # Mock OpenAI automatically
def test_code_generation(t: bt.TestCaseRun):
    code = generate_code("fizzbuzz in python")

    r = t.start_review()
    r.h1("Generated Code")
    r.icode(code, "python")

    # Use LLM to evaluate LLM output
    r.reviewln("Is code syntactically correct?", "Yes", "No")
    r.reviewln("Does it solve fizzbuzz?", "Yes", "No")
    r.reviewln("Code quality?", "Excellent", "Good", "Poor")
```
ML Model Evaluation
```python
def test_sentiment_model(t: bt.TestCaseRun):
    model = load_model()
    predictions = model.predict(test_data)

    t.h1("Predictions")
    t.tdf(predictions)  # Snapshot as table

    # Two-tier evaluation
    t.h2("Metrics (with tolerance)")
    t.key("Accuracy:").tmetric(accuracy, tolerance=0.05)
    t.key("F1 Score:").tmetric(f1, tolerance=0.05)

    t.h2("Minimum Requirements")
    t.assertln("Accuracy >= 80%", accuracy >= 0.80)
    t.assertln("F1 >= 0.75", f1 >= 0.75)
```
Agent Testing with Build System
```python
# Step 1: Agent plans approach (slow: loads docs, calls GPT)
@snapshot_gpt()
def test_agent_step1_plan(t: bt.TestCaseRun):
    context = load_documentation()  # Expensive
    plan = llm.create_plan(context)
    return {"context": context, "plan": plan}  # Cache for next steps

# Step 2: Agent generates answer (depends on step 1)
@bt.depends_on(test_agent_step1_plan)
@snapshot_gpt()
def test_agent_step2_answer(t, state):
    answer = llm.generate_answer(state["plan"])  # Uses cached state
    return {**state, "answer": answer}

# Step 3: Agent validates (depends on step 2)
@bt.depends_on(test_agent_step2_answer)
@snapshot_gpt()
def test_agent_step3_validate(t, state):
    validation = llm.validate(state["answer"])
    t.key("Quality:").tmetric(validation.score, tolerance=10)
```
Iteration speed:
- Iterating on step 3? Steps 1-2 cached (instant)
- First run: ~30 seconds (3 GPT calls)
- Subsequent runs: ~100ms (all snapshotted)
Full example: test/datascience/test_agent.py
More examples: test/examples/ and test/datascience/
Core Features
Review and evaluation:
- Human review via markdown - Git-tracked outputs, review changes like code diffs
- AI-assisted review - LLM evaluates outputs automatically; `-R` flag for AI diff review
- Tolerance metrics and assertions - Track trends with `tmetric()`, set thresholds with `assertln()`
Regression management:
- Snapshot testing - Git-track all outputs as markdown
- Git diff visibility - See exactly what changed
- Selective acceptance - Accept good changes, reject bad ones
- DVC integration - Large snapshots outside Git
Performance and pipeline:
- Build system for tests - Tests return objects, other tests depend on them (like Make/Bazel)
- Pipeline decomposition - Turn 10-step pipeline into 10 tests, iterate on any step independently
- Automatic HTTP/LLM mocking - HTTP/HTTPX requests recorded and replayed with `@snapshot_httpx()`
- Parallel execution - Native multi-core support with intelligent dependency scheduling
- Resource sharing - Share expensive resources (models, data) across tests with `@depends_on()`
Data science ergonomics:
- Markdown output - Human-readable, reviewable test reports
- DataFrame support - Snapshot pandas DataFrames as tables (see the sketch after this list)
- Image support - Snapshot plots and visualizations
- Environment mocking - Control and snapshot env vars
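As a concrete illustration of the ergonomics above, here is a minimal sketch that snapshots a small pandas DataFrame as a markdown table. It uses only the `t.h1` and `t.tdf` calls shown in the earlier examples; the test name and data are made up for illustration:

```python
import booktest as bt
import pandas as pd

def test_dataframe_snapshot(t: bt.TestCaseRun):
    # Hypothetical data purely for illustration
    df = pd.DataFrame({
        "label": ["positive", "negative", "neutral"],
        "count": [120, 45, 30],
    })
    t.h1("Label Distribution")
    t.tdf(df)  # snapshotted as a readable markdown table, diffable in Git
```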
Documentation
- Getting Started Guide - Your first test
- Use Case Gallery - Quick recipes for common scenarios
- Complete Feature Guide - Comprehensive documentation of all features
- CI/CD Integration - GitHub Actions, GitLab CI, CircleCI
- API Reference - Full API documentation
- Examples - Copy-pasteable examples
- Development Guide - Contributing to booktest
Use Cases
Works well for:
- Testing LLM applications (ChatGPT, Claude, etc.)
- ML model evaluation and monitoring
- Data pipeline regression testing
- Prompt engineering and optimization
- Non-deterministic system testing
- Exploratory data analysis that needs regression testing
Not the right fit for:
- Traditional unit testing (use pytest)
- Testing with strict equality requirements
- Systems without a review component
FAQ
Q: Why not just use pytest-regtest or syrupy? A: Those work well for deterministic outputs. They don't handle tolerance-based metrics, subjective quality review, or large test matrices where you need to scale evaluation.
Q: Why not promptfoo or langsmith? A: They're good for LLM-specific evaluation. Booktest is complementary — it handles the broader data science workflow (dataframes, metrics, resource management, parallel execution) and integrates review-driven testing into your Git workflow.
Q: Won't AI reviews give inconsistent results? A: Reviews are snapshotted. First run records the AI's evaluation, subsequent runs reuse it (instant, deterministic, free). Re-evaluation only happens when output changes.
Q: Why Git-track test outputs? Won't that bloat my repo? A: Markdown outputs are small (human-readable summaries). Large snapshots (HTTP cassettes, binary data) go to DVC. You get reviewable diffs in Git without bloat.
Q: Does this replace pytest? A: No, it complements it. Use pytest for unit tests with clear pass/fail. Use booktest for integration tests, LLM outputs, model evaluation — anything requiring expert review or tolerance.
Q: How is this different from Make or Bazel? A: Similar concept (dependency graph, incremental builds) but purpose-built for testing. Tests return Python objects (models, dataframes), not files. Built-in review workflow, tolerance metrics, parallel scheduling with resource management.
Why "Booktest"?
Test outputs are organized like a book — chapters (test files), sections (test cases), with all results in readable markdown. Review your tests like reading a book, track changes in Git like code.
Community
- GitHub: lumoa-oss/booktest
- Issues: Report bugs or request features
- Discussions: Ask questions, share use cases
Built by Netigate - Enterprise feedback and experience management platform.
License
MIT - See LICENSE for details.