Benchmarks: Head-to-Head

This page is the single published home for the open-vs-Claude-Science-baseline head-to-head results. It renders committed artifacts only: it does not regenerate them, and it contains no AI-generated data figures.

Artifact Provenance

The page consumes upstream artifacts and renders their committed numbers:

Upstream	Scope	Committed artifact
`#609`	Head-to-head DATA + PROVENANCE (data-artifact scope)	committed data + provenance bundle
`#604`	Accuracy-per-dollar SWEEP producing the underlying numbers	sweep output

The sweep reports the metric set {pass@1, tokens, cost_usd, accuracy_per_dollar, ECE}. Those numbers are produced upstream and rendered here unchanged.
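As a rough illustration of two of the derived metrics, the sketch below computes accuracy_per_dollar and ECE from raw inputs. The page does not define either metric, so this rests on assumptions: accuracy_per_dollar is taken to be pass@1 divided by cost_usd, and ECE is taken to be the standard expected calibration error with equal-width confidence bins. The `SweepRow` class and both function names are hypothetical and are not part of the sweep's code.

```python
from dataclasses import dataclass


@dataclass
class SweepRow:
    """Hypothetical sweep row; fields mirror the metric names listed above."""
    pass_at_1: float  # fraction of tasks solved on the first attempt, in [0, 1]
    tokens: int       # total tokens consumed
    cost_usd: float   # total spend in US dollars


def accuracy_per_dollar(row: SweepRow) -> float:
    # ASSUMPTION: accuracy_per_dollar = pass@1 / cost_usd.
    if row.cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return row.pass_at_1 / row.cost_usd


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of weight * |accuracy - mean confidence|.

    ASSUMPTION: the bin count and binning scheme are illustrative defaults.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

A perfectly calibrated set of predictions yields an ECE of 0 under this definition; the upstream sweep may bin or weight differently.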

Hordago-Eval Reference Scorecard

The reference-harness scorecard is committed at references/benchmarks/hordago-eval-latest.json (run hordago-eval-reference, generated 2026-04-01). It measures orchestration quality, not task accuracy.

Dimension	Score
Composite score	1.0
Decision accuracy	1.0
Intent classification	1.0
Verification	1.0
Evidence grounding	1.0
Safety gate	1.0
Provenance completeness	1.0
Decision quality	1.0

Per-domain decision accuracy across all 500 reference tasks (500/500):

Domain	Score
crispr_guide_design	1.0
gwas_causal_variant	1.0
scrna_seq_qc	1.0
pathway_enrichment	1.0
variant_effect_prediction	1.0
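A per-domain score of this kind can be thought of as the fraction of correct decisions within each domain. The sketch below shows that aggregation under stated assumptions: the `(domain, is_correct)` record shape and the function name `per_domain_accuracy` are hypothetical, and the records in the usage example are toy values, not the committed scorecard data.

```python
from collections import defaultdict


def per_domain_accuracy(records):
    """Map each domain to the fraction of its records marked correct.

    `records` is an iterable of (domain, is_correct) pairs (hypothetical shape).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for domain, is_correct in records:
        totals[domain] += 1
        if is_correct:
            hits[domain] += 1
    return {domain: hits[domain] / totals[domain] for domain in totals}


# Toy records for illustration only; these are not results from the scorecard.
toy_records = [
    ("crispr_guide_design", True),
    ("crispr_guide_design", True),
    ("scrna_seq_qc", True),
    ("scrna_seq_qc", False),
]
toy_scores = per_domain_accuracy(toy_records)
```

With these toy records, `toy_scores` maps `crispr_guide_design` to 1.0 and `scrna_seq_qc` to 0.5.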

Benchmark Comparability Note

Hordago-Eval and external task benchmarks measure different things and must not be presented as the same leaderboard. The committed scorecard records this explicitly: CellType publishes BixBench task accuracy (0.9 README score), whereas Hordago publishes Hordago-Eval orchestration quality (1.0 composite score). The resulting +0.10 delta spans incomparable axes and should not be read as a performance gap. The K-Dense comparison bundle (references/benchmarks/kdense-comparison-20260410/) similarly documents that K-Dense's headline BixBench claim is not backed by a public leaderboard.
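One way to make the comparability rule hard to violate in tooling is to tag every score with the axis it measures and refuse to subtract across axes. The sketch below is illustrative only: the `Score` class, the axis labels, and `comparable_delta` are hypothetical names, not part of the committed scorecard or any Hordago API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """A published number together with the axis it measures (hypothetical type)."""
    source: str
    axis: str
    value: float


def comparable_delta(a: Score, b: Score) -> float:
    """Return a.value - b.value, but only when both scores share an axis."""
    if a.axis != b.axis:
        raise ValueError(
            f"cannot compare {a.source} ({a.axis}) with {b.source} ({b.axis})"
        )
    return a.value - b.value


# Values taken from the note above; axis labels are illustrative assumptions.
hordago = Score("Hordago", "hordago_eval_orchestration_quality", 1.0)
celltype = Score("CellType", "bixbench_task_accuracy", 0.9)

try:
    comparable_delta(hordago, celltype)
    cross_axis_allowed = True
except ValueError:
    # Expected path: the two scores sit on different axes.
    cross_axis_allowed = False
```

The raw subtraction 1.0 - 0.9 still produces roughly 0.10, which is why the number can appear in a report; the guard simply keeps it from being treated as a ranking.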

Source Pointers

references/benchmarks/hordago-eval-latest.json
references/benchmarks/hordago-eval-history.json
references/benchmarks/kdense-comparison-20260410/06_reports/F1_master_comparison.md