Overview

Twin Fidelity evals let you measure how well your twin represents your knowledge. Define expected Q&A pairs, run them against your twin, and track the score over time.

Commands

Add an Eval Question

bestmate eval add \
  --question "What's our AI adoption approach?" \
  --expected "We use the Augment vs Automate framework" \
  --tags "methodology,core" \
  --weight high

Bulk Import

bestmate eval import --file evals.json
JSON format:
[
  {
    "question": "What's our pricing model?",
    "expected": "Advisory + Sprint, starting at $15K/month",
    "tags": ["pricing", "core"],
    "weight": "high"
  },
  {
    "question": "How do we handle skeptical CEOs?",
    "expected": "Three forms of option value framework",
    "tags": ["objections"],
    "weight": "medium"
  }
]
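If you maintain many eval questions, the evals.json file can also be generated from code. A minimal sketch of building and validating entries against the documented format (the validate_eval helper is illustrative, not part of bestmate):

```python
import json

# Entries follow the documented evals.json schema.
evals = [
    {
        "question": "What's our pricing model?",
        "expected": "Advisory + Sprint, starting at $15K/month",
        "tags": ["pricing", "core"],
        "weight": "high",
    },
]

def validate_eval(entry):
    """Check an entry has the fields shown in the documented format."""
    required = {"question", "expected"}
    allowed_weights = {"high", "medium", "low"}
    missing = required - entry.keys()
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if entry.get("weight", "medium") not in allowed_weights:
        raise ValueError(f"unknown weight: {entry['weight']}")
    return True

for e in evals:
    validate_eval(e)

# Write the file that `bestmate eval import --file evals.json` expects.
with open("evals.json", "w") as f:
    json.dump(evals, f, indent=2)
```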

List Eval Questions

bestmate eval list

Run Evals

bestmate eval run
Output:
🧪 Running 12 evals...

  ✅ What's our AI adoption approach?          92%
  ✅ How do we handle skeptical CEOs?           87%
  ✅ What's the Kickstart structure?            85%
  ⚠️  What tools do we recommend?               41%
  ⚠️  Budget framework for SMBs?                38%
  ❌ Healthcare AI playbook?                    12%
  ...

Score: 68/100  (8 pass, 2 warn, 2 fail)

View History

bestmate eval history
Output:
Twin Fidelity History

  2026-03-15  ████████░░░░░░░░░░░░  52/100
  2026-03-22  ████████████░░░░░░░░  61/100  (+9)
  2026-03-28  █████████████░░░░░░░  68/100  (+7)
  2026-04-01  ██████████████░░░░░░  74/100  (+6)
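The fidelity bars above can be reproduced with a simple renderer. This is a sketch, not bestmate's actual code, and the floor-based fill rule is an assumption:

```python
def fidelity_bar(score, width=20):
    # One plausible rendering: fill width * score/100 cells, floored.
    filled = int(score / 100 * width)
    return "█" * filled + "░" * (width - filled)

print(f"2026-03-28  {fidelity_bar(68)}  68/100")
```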

Delete

bestmate eval delete <eval-id>

Scoring

  Threshold   Status    Meaning
  70-100      ✅ Pass   Twin covers the key information
  40-69       ⚠️ Warn   Partially covered, needs more content
  0-39        ❌ Fail   Missing or wrong — add content to your KB

Scoring uses semantic similarity (not exact match). The LLM judges whether the twin’s answer covers the same key information as your expected answer.
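The thresholds above amount to a simple classifier. A sketch (the function name is illustrative):

```python
def classify(score):
    # Thresholds from the table: 70+ pass, 40-69 warn, below 40 fail.
    if score >= 70:
        return "pass"
    if score >= 40:
        return "warn"
    return "fail"
```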

Weight

  Weight    Multiplier   Use for
  high      2x           Core methodology, pricing, key differentiators
  medium    1x           Standard knowledge (default)
  low       0.5x         Nice-to-have, edge cases
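One plausible reading of the multiplier table is a weighted average: each eval's score is scaled by its weight before averaging, so a high-weight failure drags the overall score down twice as hard as a medium one. bestmate's exact aggregation isn't documented here, so treat this as an assumption:

```python
# Multipliers from the Weight table above.
WEIGHT_MULTIPLIER = {"high": 2.0, "medium": 1.0, "low": 0.5}

def overall_score(results):
    """Weighted average of (score, weight) pairs, scores on a 0-100 scale."""
    total_weight = sum(WEIGHT_MULTIPLIER[w] for _, w in results)
    weighted = sum(score * WEIGHT_MULTIPLIER[w] for score, w in results)
    return round(weighted / total_weight)

# (92*2 + 87*1 + 12*2) / 5 = 59
print(overall_score([(92, "high"), (87, "medium"), (12, "high")]))
```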

Workflow

  1. Baseline — Add 10-20 eval questions covering your core knowledge
  2. Run — Get your baseline score
  3. Improve — Ingest content targeting failed evals
  4. Re-run — Track improvement
  5. Repeat — Add new evals as your twin’s scope grows