Evaluation
The QA and extraction evaluation harnesses — four-gate QA metrics, per-channel extraction accuracy, and a baseline regression gate that ratchets up and never silently down.
The eval domain (src/ragspine/eval/) is RAGSpine's evaluation closed loop: two harnesses
(QA and extraction) that score the engine against version-controlled golden sets, plus a
baseline gate that fails CI on regression. Its guiding invariant:
The baseline gate ratchets up, never down — a regression must fail the gate, never silently lower it. Never weaken a golden or baseline to make a case pass.
Layout
Golden and baseline data live under data/golden/, which is force-tracked in version control;
the extraction ground-truth fixtures live under data/fixtures/.
QA evaluation — the four gates
qa_eval.py is the Q&A closed-loop harness. The entry point:
def run_qa_eval(
    golden_path: str | Path,
    mode: str = "tool",
    kb_dir: str | Path | None = None,
) -> QAEvalReport: ...

It loads and validates the golden set, builds a deterministic synthetic knowledge base
(build_eval_kb, from the module's _EVAL_FACT_ROWS + _EVAL_NARRATIVE_DOCS), runs every
case in one of two modes (EVAL_MODES = ("tool", "agent")), and returns a four-gate report.
tool mode
Zero-LLM, tool-direct: parse_intent → clarify_scope → execute_query_metric / narrative retrieve. Deterministic.
agent mode
Full orchestrator: answer_question with a deterministic MockProvider(reference_date=...).
Each case (GoldenCase: id, question, case_type, expected, tags,
reference_date) is run and normalized into a CaseOutcome (route, clarification mode, found
value/unit/source, refused flag, sources, tool statuses), then scored against the four
gates. The gates are kept as four separate pass rates and never merged into one number.
GATE_METRICS is the tuple of all four. A fifth, separate metric tracks fabrication on
refusal cases:
FABRICATION = "fabrication" is computed only on refusal cases via
detect_fabricated_numbers(answer): it strips period tokens (whitelisted from the active
profile's temporal dimension) and treats any remaining digit as fabrication. The target is 0.
It is deliberately not merged into the four gates.
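The strip-then-scan rule can be illustrated with a minimal sketch. The period pattern below is an assumption (the real whitelist comes from the active profile's temporal dimension), and the function name and list return type are hypothetical, not the signature of detect_fabricated_numbers.

```python
import re

# Assumption: period tokens look like FY2025, 2025Q1, Q3 or H1. The real
# whitelist is derived from the active profile's temporal dimension.
ASSUMED_PERIOD_PATTERN = re.compile(r"\b(?:FY\d{4}|\d{4}Q[1-4]|Q[1-4]|H[12])\b")


def detect_fabricated_numbers_sketch(answer: str) -> list[str]:
    """Return the numbers left in a refusal answer once period tokens are stripped."""
    stripped = ASSUMED_PERIOD_PATTERN.sub(" ", answer)
    # Any digit run that survives the strip counts as fabrication.
    return re.findall(r"\d+(?:\.\d+)?", stripped)


clean = detect_fabricated_numbers_sketch("No REVENUE figure is available for FY2025.")
leaky = detect_fabricated_numbers_sketch("REVENUE for FY2025 was roughly 1702.5")
```

Here clean is empty because the only digits belong to a period token, while leaky contains the invented figure.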
The report types:
- GateMetric: name, total, passed, pass_rate (1.0 if total == 0, else passed / total), failures, by_tag.
- QAEvalReport: mode, n_cases, metrics: dict[str, GateMetric], and a separate fabrication: GateMetric (with a fabrication_count property).
- evaluate(cases, outcomes, mode) applies all four gate rules per case.
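As a rough illustration of the pass_rate rule, here is a sketch of a GateMetric-like dataclass. The class name, the element types of failures and by_tag, and the gate name used in the example are assumptions; only the field names and the pass_rate formula come from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class GateMetricSketch:
    name: str
    total: int = 0
    passed: int = 0
    failures: list[str] = field(default_factory=list)  # assumed: ids of failing cases
    by_tag: dict[str, dict[str, int]] = field(default_factory=dict)  # assumed shape

    @property
    def pass_rate(self) -> float:
        # An empty gate counts as fully passing, so it cannot trip the baseline.
        return 1.0 if self.total == 0 else self.passed / self.total


empty_gate = GateMetricSketch(name="placeholder_gate")
partial_gate = GateMetricSketch(name="placeholder_gate", total=4, passed=3)
```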
QA baseline regression gating
qa_eval.py also owns the baseline comparison:
- make_baseline_entry(report) -> dict returns {"metrics": {name: pass_rate}, "fabrication_count": int, "n_cases": int}.
- compare_to_baseline(report, baseline) -> BaselineComparison fails the gate when any gate's pass_rate is below its baseline threshold, or fabrication_count rises above the baseline. Only metrics listed in the baseline are checked; a pass rate exactly equal to its baseline passes. Each regression records metric, baseline, current, delta.
There are no hardcoded thresholds — the thresholds are the previously stored per-mode
pass rates in data/golden/qa_baseline.json. The baseline ratchets up through observed
runs.
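The comparison rules can be sketched as follows. This is an illustration under assumptions, not the module's code: the class and function names are hypothetical, the gate names gate_a and gate_b are placeholders, and treating a metric that is missing from the current run as 0.0 is a choice made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class RegressionSketch:
    metric: str
    baseline: float
    current: float
    delta: float


@dataclass
class ComparisonSketch:
    passed: bool
    regressions: list[RegressionSketch] = field(default_factory=list)


def compare_sketch(current_rates: dict[str, float], fabrication_count: int,
                   baseline: dict) -> ComparisonSketch:
    regressions = []
    # Only metrics listed in the baseline are checked; equality passes.
    for metric, threshold in baseline.get("metrics", {}).items():
        current = current_rates.get(metric, 0.0)  # assumption: missing scores 0.0
        if current < threshold:
            regressions.append(
                RegressionSketch(metric, threshold, current, current - threshold))
    # Fabrication is gated separately: the count may not rise above the baseline.
    stored_count = baseline.get("fabrication_count", 0)
    if fabrication_count > stored_count:
        regressions.append(RegressionSketch(
            "fabrication", float(stored_count), float(fabrication_count),
            float(fabrication_count - stored_count)))
    return ComparisonSketch(passed=not regressions, regressions=regressions)


baseline = {"metrics": {"gate_a": 0.9}, "fabrication_count": 0}
same = compare_sketch({"gate_a": 0.9, "gate_b": 0.1}, 0, baseline)
worse = compare_sketch({"gate_a": 0.8}, 0, baseline)
```

In this example same passes even though gate_b is low, because gate_b is not listed in the baseline, while worse fails with a negative delta on gate_a.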
QA harness — the CLI
scripts/run_qa_eval.py drives the harness from the project root:
.venv/bin/python scripts/run_qa_eval.py --mode tool
.venv/bin/python scripts/run_qa_eval.py --mode agent --report out/qa_report.json
.venv/bin/python scripts/run_qa_eval.py --mode tool --update-baseline

Flags: --mode {tool,agent} (default tool), --golden (default
data/golden/qa_golden_set.jsonl), --report, --baseline (default
data/golden/qa_baseline.json), --update-baseline. The gate is keyed per mode: a mode
with no baseline auto-generates one on first run and passes (exit 0); with a baseline, any
gate regression or fabrication increase exits 1.
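The per-mode behaviour of the gate can be sketched as a small function that maps a run to an exit code. It assumes the baseline file deserializes to a dict keyed by mode name, which the source implies but does not spell out; the function name is hypothetical, gate_a is a placeholder gate name, and --update-baseline handling and file I/O are left out.

```python
def gate_exit_code(entry: dict, mode: str, baselines: dict) -> int:
    """Return 0 when the gate passes and 1 on regression, seeding new modes."""
    stored = baselines.get(mode)
    if stored is None:
        # First run for this mode: store the entry as its baseline and pass.
        baselines[mode] = entry
        return 0
    regressed = any(
        entry["metrics"].get(name, 0.0) < rate
        for name, rate in stored["metrics"].items()
    )
    fabricated = entry["fabrication_count"] > stored["fabrication_count"]
    return 1 if regressed or fabricated else 0


baselines: dict = {}
first = {"metrics": {"gate_a": 0.9}, "fabrication_count": 0, "n_cases": 10}
later = {"metrics": {"gate_a": 0.8}, "fabrication_count": 0, "n_cases": 10}
first_tool_run = gate_exit_code(first, "tool", baselines)
second_tool_run = gate_exit_code(later, "tool", baselines)
first_agent_run = gate_exit_code(later, "agent", baselines)
```

The same weaker entry fails for tool, which already has a baseline, and passes for agent, which is seeded on its first run.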
Golden data format
data/golden/qa_golden_set.jsonl is one JSON object per line. Each case carries id,
question, case_type (one of numeric / clarification / refusal / narrative /
composite), expected (always clarification ∈ {none, ask_first, answer_with_assumptions}
and refuse: bool; numeric/composite add value + unit +
source; narrative/composite add narrative_doc), tags, and a reference_date.
{
  "id": "num-001",
  "question": "香港FY2025的REVENUE是多少",
  "case_type": "numeric",
  "expected": {
    "clarification": "none",
    "refuse": false,
    "value": 1702.0,
    "unit": "USD_M",
    "source": { "doc": "ACME_FY2025_Results.pptx", "locator": "slide=5,table=1,row=2,col=3" }
  },
  "tags": { "topic": "FIN", "scope": "ACME_HK", "qtype": "must_not_clarify" },
  "reference_date": "2026-06-12"
}

The golden set uses the fictional ACME company and synthetic figures. data/golden/ is
force-tracked so the baseline can't drift out of version control.
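The field rules above lend themselves to a small validator. The sketch below checks one JSONL line against them; the function name and error messages are hypothetical, and the harness's own validation may be stricter or differ in detail.

```python
import json

CASE_TYPES = {"numeric", "clarification", "refusal", "narrative", "composite"}
CLARIFICATION_MODES = {"none", "ask_first", "answer_with_assumptions"}
TOP_LEVEL_KEYS = ("id", "question", "case_type", "expected", "tags", "reference_date")


def validate_case_sketch(line: str) -> dict:
    """Parse one golden-set line and raise ValueError if a required field is absent."""
    case = json.loads(line)
    missing = [key for key in TOP_LEVEL_KEYS if key not in case]
    if missing:
        raise ValueError(f"missing top-level keys: {missing}")
    if case["case_type"] not in CASE_TYPES:
        raise ValueError(f"unknown case_type: {case['case_type']!r}")
    expected = case["expected"]
    if expected.get("clarification") not in CLARIFICATION_MODES:
        raise ValueError("expected.clarification has an invalid value")
    if not isinstance(expected.get("refuse"), bool):
        raise ValueError("expected.refuse must be a bool")
    # Type-specific additions to the expected block.
    required = []
    if case["case_type"] in ("numeric", "composite"):
        required += ["value", "unit", "source"]
    if case["case_type"] in ("narrative", "composite"):
        required.append("narrative_doc")
    absent = [key for key in required if key not in expected]
    if absent:
        raise ValueError(f"expected is missing: {absent}")
    return case
```

Running it over every line of the golden file before scoring would surface malformed cases early instead of as confusing gate failures.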
Extraction evaluation
extraction_eval.py scores office-document extraction per channel:
def run_eval(
    facts: list[dict[str, object]],
    ground_truth: dict[str, object] | list[dict[str, object]],
) -> EvalReport: ...

It aligns extracted facts against ground truth by sheet!cell_ref, evaluating only cells
present on both sides, and computes accuracy on three channels:
| Channel | Constant | Compared fact field |
|---|---|---|
| Cell value | CELL_VALUE = "cell_value" | value |
| Color mapping | COLOR_MAPPING = "color_mapping" | tags |
| Header attribution | HEADER_ATTRIBUTION = "header_attribution" | merge_span |
Types: ChannelMetric (name, total, correct, accuracy, mismatches), EvalReport
(channels: dict[str, ChannelMetric], n_facts). The regression gate is
compare_to_baseline(metrics, baseline) -> BaselineComparison: any channel whose accuracy
falls below its threshold fails (passed=False); only channels listed in the baseline are
checked; exactly equal passes. Ground-truth fixtures live under data/fixtures/ (e.g.
fixtures_ground_truth.json, pdf/pdf_ground_truth.json, pptx/pptx_ground_truth.json).
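The alignment and per-channel scoring can be sketched as below. This is a simplified illustration under stated assumptions: ground truth is taken as a list of rows carrying the same sheet, cell_ref, value, tags and merge_span keys as the facts, an empty channel scores 1.0 (mirroring the GateMetric convention), and the function name is hypothetical.

```python
CHANNEL_FIELDS = {
    "cell_value": "value",
    "color_mapping": "tags",
    "header_attribution": "merge_span",
}


def cell_key(row: dict) -> str:
    # Facts and ground truth are aligned on the sheet!cell_ref key.
    return f"{row['sheet']}!{row['cell_ref']}"


def channel_accuracy_sketch(facts: list[dict], truth_rows: list[dict]) -> dict[str, float]:
    truth = {cell_key(row): row for row in truth_rows}
    totals = {channel: 0 for channel in CHANNEL_FIELDS}
    correct = {channel: 0 for channel in CHANNEL_FIELDS}
    for fact in facts:
        expected = truth.get(cell_key(fact))
        if expected is None:
            continue  # only cells present on both sides are evaluated
        for channel, field_name in CHANNEL_FIELDS.items():
            totals[channel] += 1
            if fact.get(field_name) == expected.get(field_name):
                correct[channel] += 1
    # Assumption: a channel with no aligned cells scores 1.0.
    return {
        channel: (correct[channel] / totals[channel] if totals[channel] else 1.0)
        for channel in CHANNEL_FIELDS
    }


facts = [
    {"sheet": "S1", "cell_ref": "A1", "value": 1, "tags": ["red"], "merge_span": None},
    {"sheet": "S1", "cell_ref": "B2", "value": 2, "tags": [], "merge_span": "B1:C1"},
    {"sheet": "S1", "cell_ref": "Z9", "value": 9, "tags": [], "merge_span": None},
]
truth_rows = [
    {"sheet": "S1", "cell_ref": "A1", "value": 1, "tags": ["red"], "merge_span": None},
    {"sheet": "S1", "cell_ref": "B2", "value": 3, "tags": [], "merge_span": "B1:C1"},
]
accuracy = channel_accuracy_sketch(facts, truth_rows)
```

The unmatched fact at S1!Z9 is ignored, so the one value mismatch at S1!B2 lowers only the cell_value channel.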
See also
Agent
The orchestration layer — four-slot intent parsing, the clarification gateway, a deterministic security gate, three-path routing, the tool-use loop, the LLM provider seam, and the per-path anti-fabrication guard.
Service
The HTTP + async-queue layer — a FastAPI app factory with dependency injection, the RQ task queue behind a TaskQueue Protocol, worker-owned ingestion jobs, the FAQ short-circuit cache, and ServiceConfig (env RAGSPINE_*).