Benchmarks: Head-to-Head
This page is the single published home for the open-vs-Claude-Science-baseline head-to-head results. It renders committed artifacts without regenerating them and contains no AI-generated data figures; every number shown comes from a real committed artifact.
Artifact Provenance
The page consumes upstream artifacts and renders their committed numbers:
| Upstream | Scope | Committed artifact |
|---|---|---|
| #609 | Head-to-head DATA + PROVENANCE (data-artifact scope) | committed data + provenance bundle |
| #604 | Accuracy-per-dollar SWEEP producing the underlying numbers | sweep output |
The sweep reports the metric set {pass@1, tokens, cost_usd, accuracy_per_dollar,
ECE}. Those numbers are produced upstream and rendered here unchanged.
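
As a rough illustration of how the five metrics relate, the sketch below derives them from per-task records. It is not the sweep's implementation: the `TaskResult` fields, the definition of `accuracy_per_dollar` as pass@1 divided by total cost, and the 10-bin equal-width ECE are all assumptions made for this example. The actual definitions live upstream in #604.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are assumptions,
    # not the sweep's actual schema.
    passed: bool        # did the single sampled attempt pass?
    confidence: float   # model-reported confidence in [0, 1]
    tokens: int
    cost_usd: float


def expected_calibration_error(results, n_bins=10):
    """Equal-width binned ECE: sum of bin_weight * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for r in results:
        # Clamp so that confidence == 1.0 lands in the last bin.
        idx = min(int(r.confidence * n_bins), n_bins - 1)
        bins[idx].append(r)
    total = len(results)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        acc = sum(r.passed for r in bucket) / len(bucket)
        conf = sum(r.confidence for r in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(acc - conf)
    return ece


def summarize(results):
    """Aggregate per-task records into the metric set named above."""
    if not results:
        raise ValueError("no results to summarize")
    pass_at_1 = sum(r.passed for r in results) / len(results)
    cost_usd = sum(r.cost_usd for r in results)
    return {
        "pass@1": pass_at_1,
        "tokens": sum(r.tokens for r in results),
        "cost_usd": cost_usd,
        # Assumed definition: accuracy divided by total spend;
        # None when total cost is zero.
        "accuracy_per_dollar": pass_at_1 / cost_usd if cost_usd > 0 else None,
        "ECE": expected_calibration_error(results),
    }
```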
Hordago-Eval Reference Scorecard
The reference-harness scorecard is committed at
references/benchmarks/hordago-eval-latest.json (run hordago-eval-reference,
generated 2026-04-01). It measures orchestration quality, not task accuracy.
| Dimension | Score |
|---|---|
| Composite score | 1.0 |
| Decision accuracy | 1.0 |
| Intent classification | 1.0 |
| Verification | 1.0 |
| Evidence grounding | 1.0 |
| Safety gate | 1.0 |
| Provenance completeness | 1.0 |
| Decision quality | 1.0 |
Per-domain decision accuracy (all 500/500 reference tasks):
| Domain | Score |
|---|---|
| crispr_guide_design | 1.0 |
| gwas_causal_variant | 1.0 |
| scrna_seq_qc | 1.0 |
| pathway_enrichment | 1.0 |
| variant_effect_prediction | 1.0 |
Benchmark Comparability Note
Hordago-Eval and external task benchmarks measure different things and must not
be presented as the same leaderboard. The committed scorecard records this
explicitly: CellType publishes BixBench task accuracy (0.9 README score),
whereas Hordago publishes Hordago-Eval orchestration quality (1.0 composite
score). The nominal +0.10 delta between them spans incomparable axes and is
not a ranking. The K-Dense comparison bundle
(references/benchmarks/kdense-comparison-20260410/) similarly documents that
K-Dense's headline BixBench claim is not backed by a public leaderboard.
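
One way to make this rule hard to violate in tooling is to tag every score with what it measures and refuse to subtract across tags. The sketch below is a minimal illustration of that idea, not part of the committed harness: `PublishedScore`, `IncomparableScores`, `delta`, and the axis labels `task_accuracy` and `orchestration_quality` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublishedScore:
    # Hypothetical record type; not the committed scorecard schema.
    publisher: str
    benchmark: str   # e.g. "BixBench" or "Hordago-Eval"
    axis: str        # what the number measures (labels are assumptions)
    value: float


class IncomparableScores(ValueError):
    """Raised when two scores measure different things."""


def delta(a: PublishedScore, b: PublishedScore) -> float:
    """Return a.value - b.value only if both share benchmark and axis."""
    if (a.benchmark, a.axis) != (b.benchmark, b.axis):
        raise IncomparableScores(
            f"{a.publisher} ({a.benchmark}/{a.axis}) vs "
            f"{b.publisher} ({b.benchmark}/{b.axis}) are on different axes"
        )
    return a.value - b.value


# The two scores named in the note above.
hordago = PublishedScore("Hordago", "Hordago-Eval", "orchestration_quality", 1.0)
celltype = PublishedScore("CellType", "BixBench", "task_accuracy", 0.9)
```

With these definitions, `delta(hordago, celltype)` raises `IncomparableScores` instead of returning a number, which matches the note's position that the two figures do not belong on one leaderboard.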
Source Pointers
- references/benchmarks/hordago-eval-latest.json
- references/benchmarks/hordago-eval-history.json
- references/benchmarks/kdense-comparison-20260410/06_reports/F1_master_comparison.md
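
Because this page only renders committed artifacts, a quick presence check on the pointers above can be useful before publishing. The following is a small sketch of such a check; the paths are copied from this page, while the helper name `missing_artifacts` and the idea of running it from a repository checkout are assumptions for illustration.

```python
from pathlib import Path

# Paths copied from the source pointers on this page.
SOURCE_POINTERS = [
    "references/benchmarks/hordago-eval-latest.json",
    "references/benchmarks/hordago-eval-history.json",
    "references/benchmarks/kdense-comparison-20260410/06_reports/F1_master_comparison.md",
]


def missing_artifacts(repo_root, pointers=SOURCE_POINTERS):
    """Return the pointers that do not resolve to a regular file under repo_root."""
    root = Path(repo_root)
    return [p for p in pointers if not (root / p).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: run from the root of a checkout.
    missing = missing_artifacts(".")
    if missing:
        raise SystemExit("missing committed artifacts:\n" + "\n".join(missing))
    print("all source pointers resolve")
```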