Overview

Twin Fidelity evals let you measure how well your twin represents your knowledge. Define expected Q&A pairs, run them against your twin, and track the score over time.

Commands

Add an Eval Question

bestmate eval add \
  --question "What's our AI adoption approach?" \
  --expected "We use the Augment vs Automate framework" \
  --tags "methodology,core" \
  --weight high

Bulk Import

bestmate eval import --file evals.json
JSON format:
[
  {
    "question": "What's our pricing model?",
    "expected": "Advisory + Sprint, starting at $15K/month",
    "tags": ["pricing", "core"],
    "weight": "high"
  },
  {
    "question": "How do we handle skeptical CEOs?",
    "expected": "Three forms of option value framework",
    "tags": ["objections"],
    "weight": "medium"
  }
]
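If you maintain many eval questions, the evals.json file can also be generated from code. A minimal sketch of building and validating entries against the documented format (the validate_eval helper is illustrative, not part of bestmate):

```python
import json

# Entries follow the documented evals.json schema.
evals = [
    {
        "question": "What's our pricing model?",
        "expected": "Advisory + Sprint, starting at $15K/month",
        "tags": ["pricing", "core"],
        "weight": "high",
    },
]

def validate_eval(entry):
    """Check an entry has the fields shown in the documented format."""
    required = {"question", "expected"}
    allowed_weights = {"high", "medium", "low"}
    missing = required - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if entry.get("weight", "medium") not in allowed_weights:
        raise ValueError(f"unknown weight: {entry['weight']}")
    return True

for e in evals:
    validate_eval(e)

# Write the file that `bestmate eval import --file evals.json` expects.
with open("evals.json", "w") as f:
    json.dump(evals, f, indent=2)
```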

List Eval Questions

bestmate eval list

Run Evals

bestmate eval run
Output:
🧪 Running 12 evals...

  ✅ What's our AI adoption approach?          92%
  ✅ How do we handle skeptical CEOs?           87%
  ✅ What's the Kickstart structure?            85%
  ⚠️  What tools do we recommend?               41%
  ⚠️  Budget framework for SMBs?                38%
  ❌ Healthcare AI playbook?                    12%
  ...

Score: 68/100  (8 pass, 2 warn, 2 fail)

View History

bestmate eval history
Output:
Twin Fidelity History

  2026-03-15  ████████░░░░░░░░░░░░  52/100
  2026-03-22  ████████████░░░░░░░░  61/100  (+9)
  2026-03-28  █████████████░░░░░░░  68/100  (+7)
  2026-04-01  ██████████████░░░░░░  74/100  (+6)
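The fidelity bars above can be reproduced with a simple renderer. This is a sketch, not bestmate's actual code, and the floor-based fill rule is an assumption:

```python
def fidelity_bar(score, width=20):
    # One plausible rendering: fill width * score/100 cells, floored.
    filled = int(score / 100 * width)
    return "█" * filled + "░" * (width - filled)

print(f"2026-03-28  {fidelity_bar(68)}  68/100")
```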

Delete

bestmate eval delete <eval-id>

Scoring

  Threshold   Status    Meaning
  70-100      ✅ Pass   Twin covers the key information
  40-69       ⚠️ Warn   Partially covered, needs more content
  0-39        ❌ Fail   Missing or wrong — add content to your KB

Scoring uses semantic similarity (not exact match). The LLM judges whether the twin’s answer covers the same key information as your expected answer.
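The thresholds above amount to a simple classifier. A sketch (the function name is illustrative):

```python
def classify(score):
    # Thresholds from the table: 70+ pass, 40-69 warn, below 40 fail.
    if score >= 70:
        return "pass"
    if score >= 40:
        return "warn"
    return "fail"
```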

Weight

  Weight    Multiplier   Use for
  high      2x           Core methodology, pricing, key differentiators
  medium    1x           Standard knowledge (default)
  low       0.5x         Nice-to-have, edge cases
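One plausible reading of the multiplier table is a weighted average: each eval's score is scaled by its weight before averaging, so a high-weight failure drags the overall score down twice as hard as a medium one. bestmate's exact aggregation isn't documented here, so treat this as an assumption:

```python
# Multipliers from the Weight table above.
WEIGHT_MULTIPLIER = {"high": 2.0, "medium": 1.0, "low": 0.5}

def overall_score(results):
    """Weighted average of (score, weight) pairs, scores on a 0-100 scale."""
    total_weight = sum(WEIGHT_MULTIPLIER[w] for _, w in results)
    weighted = sum(score * WEIGHT_MULTIPLIER[w] for score, w in results)
    return round(weighted / total_weight)

# (92*2 + 87*1 + 12*2) / 5 = 59
print(overall_score([(92, "high"), (87, "medium"), (12, "high")]))
```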

Workflow

  1. Baseline — Add 10-20 eval questions covering your core knowledge
  2. Run — Get your baseline score
  3. Improve — Ingest content targeting failed evals
  4. Re-run — Track improvement
  5. Repeat — Add new evals as your twin’s scope grows