Benchmarks: Head-to-Head

This page is the single published home for the open-vs-Claude-Science-baseline head-to-head results. It renders committed artifacts without regenerating them, and it contains no AI-generated data figures.

Artifact Provenance

The page consumes upstream artifacts and renders their committed numbers:

| Upstream | Scope | Committed artifact |
|---|---|---|
| #609 | Head-to-head DATA + PROVENANCE (data-artifact scope) | committed data + provenance bundle |
| #604 | Accuracy-per-dollar SWEEP producing the underlying numbers | sweep output |

The sweep reports the metric set {pass@1, tokens, cost_usd, accuracy_per_dollar, ECE}. Those numbers are produced upstream and rendered here unchanged.
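This page does not spell out how the sweep defines its metrics. As a rough illustration only, the sketch below assumes that accuracy_per_dollar is pass@1 divided by cost_usd and that ECE is the standard equal-width-bin expected calibration error. The `SweepRow` type, the function names, and the default bin count are hypothetical and are not taken from #604.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SweepRow:
    """Hypothetical sweep record; field names mirror the metric set above."""
    pass_at_1: float  # fraction of tasks solved on the first attempt, in [0, 1]
    tokens: int       # total tokens consumed
    cost_usd: float   # total spend in US dollars


def accuracy_per_dollar(row: SweepRow) -> float:
    """Assumed definition: pass@1 divided by cost in USD."""
    if row.cost_usd <= 0:
        raise ValueError("cost_usd must be positive")
    return row.pass_at_1 / row.cost_usd


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (n_b / N) * |accuracy_b - confidence_b|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hits[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[b] / count[b]
        mean_conf = conf_sum[b] / count[b]
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece
```

Under these assumptions, a lower ECE means the reported confidences track observed accuracy more closely, while accuracy_per_dollar rewards configurations that reach the same pass@1 at lower cost.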

Hordago-Eval Reference Scorecard

The reference-harness scorecard is committed at references/benchmarks/hordago-eval-latest.json (run hordago-eval-reference, generated 2026-04-01). It measures orchestration quality, not task accuracy.

| Dimension | Score |
|---|---|
| Composite score | 1.0 |
| Decision accuracy | 1.0 |
| Intent classification | 1.0 |
| Verification | 1.0 |
| Evidence grounding | 1.0 |
| Safety gate | 1.0 |
| Provenance completeness | 1.0 |
| Decision quality | 1.0 |

Per-domain decision accuracy (500/500 reference tasks):

| Domain | Score |
|---|---|
| crispr_guide_design | 1.0 |
| gwas_causal_variant | 1.0 |
| scrna_seq_qc | 1.0 |
| pathway_enrichment | 1.0 |
| variant_effect_prediction | 1.0 |
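Rendering committed numbers unchanged, as in the scorecard tables above, could look roughly like the sketch below. It reads values from already-parsed JSON and formats them without recomputing anything. The schema of hordago-eval-latest.json is not shown on this page, so the `per_domain` key and the `render_scores` helper are hypothetical.

```python
import json


def render_scores(title: str, scores: dict[str, float]) -> str:
    """Format a name -> score mapping as a plain-text two-column table.

    Values are printed exactly as read; nothing is recomputed or rounded.
    """
    width = max([len(title), *(len(name) for name in scores)])
    lines = [f"{title.ljust(width)}  Score"]
    for name, value in scores.items():
        lines.append(f"{name.ljust(width)}  {value}")
    return "\n".join(lines)


# Hypothetical payload layout (assumption); only the two values come from this page.
SAMPLE = '{"per_domain": {"crispr_guide_design": 1.0, "scrna_seq_qc": 1.0}}'

scorecard = json.loads(SAMPLE)
print(render_scores("Domain", scorecard["per_domain"]))
```

Keeping the renderer free of any scoring logic is one way to make sure the page can only display what the committed artifact already contains.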

Benchmark Comparability Note

Hordago-Eval and external task benchmarks measure different things and must not be presented as the same leaderboard. The committed scorecard records this explicitly: CellType publishes BixBench task accuracy (0.9 README score), whereas Hordago publishes Hordago-Eval orchestration quality. The +0.10 delta between the two figures spans incomparable axes and should not be read as a performance gap. The K-Dense comparison bundle (references/benchmarks/kdense-comparison-20260410/) similarly documents that K-Dense's headline BixBench claim is not backed by a public leaderboard.
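One way to enforce the comparability rule above in tooling is to tag every score with the benchmark and axis it belongs to, and to refuse to compute a delta when the tags differ. This is a minimal sketch under that assumption; the `Score` type and `comparable_delta` function are hypothetical and are not part of the committed scorecard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """A published score tagged with the axis it measures (hypothetical type)."""
    publisher: str
    benchmark: str  # e.g. "BixBench" or "Hordago-Eval"
    measures: str   # e.g. "task accuracy" or "orchestration quality"
    value: float


def comparable_delta(a: Score, b: Score) -> float:
    """Return a.value - b.value only if both scores share a benchmark and axis."""
    if (a.benchmark, a.measures) != (b.benchmark, b.measures):
        raise ValueError(
            f"not comparable: {a.benchmark} ({a.measures}) "
            f"vs {b.benchmark} ({b.measures})"
        )
    return a.value - b.value


# Figures as stated on this page.
hordago = Score("Hordago", "Hordago-Eval", "orchestration quality", 1.0)
celltype = Score("CellType", "BixBench", "task accuracy", 0.9)

try:
    comparable_delta(hordago, celltype)
except ValueError as err:
    print(err)  # the guard rejects the cross-axis comparison
```

With a guard like this, the +0.10 figure can still be recorded as a documented note, but it cannot be produced by code paths that feed a leaderboard.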

Source Pointers

  • references/benchmarks/hordago-eval-latest.json
  • references/benchmarks/hordago-eval-history.json
  • references/benchmarks/kdense-comparison-20260410/06_reports/F1_master_comparison.md