CompBioBench++ Weekly Scoring
The CompBioBench++ weekly scoring automation runs the Hordago-Eval benchmark scoring engine on a fixed schedule, commits the refreshed results into the repo, and publishes an observable job status. It keeps a durable, reproducible record of benchmark scores without any manual step.
Schedule
| Field | Value |
|---|---|
| Workflow | .github/workflows/compbiobench-scoring.yml |
| Trigger | schedule — every Monday at 06:00 UTC (cron: "0 6 * * 1") |
| Manual trigger | workflow_dispatch (run on demand from the Actions tab) |
| Runner | [self-hosted, Linux, X64, general-ci, pr-fast] |
The Monday 06:00 UTC slot is offset from the 05:00 UTC weekly cost rollup so the two automations never contend for the same runner window.
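As a rough illustration of what the `0 6 * * 1` cron expression means, the sketch below computes the next Monday 06:00 UTC slot after a given moment. The helper name `next_weekly_run` is hypothetical and not part of the repository. GitHub Actions may also start scheduled runs later than the nominal time, so treat the result as the earliest expected start.

```python
from datetime import datetime, timedelta, timezone


def next_weekly_run(now: datetime, weekday: int = 0, hour: int = 6) -> datetime:
    """Return the first Monday 06:00 UTC strictly after `now`.

    `now` must be timezone-aware. `weekday` follows datetime.weekday()
    (Monday == 0), which corresponds to day-of-week 1 in the cron field.
    """
    now = now.astimezone(timezone.utc)
    # Start from today's slot, then move forward to the target weekday.
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - now.weekday()) % 7)
    # If this week's slot has already passed (or is exactly now), use next week's.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

For example, calling `next_weekly_run(datetime.now(timezone.utc))` gives the slot at which the next scheduled scoring run becomes due.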
What it produces
The scoring backend is scripts/run_compbiobench_scoring.py, which scores every
benchmark domain in eval/domains/ through eval/scoring.py and writes a
deterministic bundle into eval/results/:
| Artifact | Contents |
|---|---|
| eval/results/compbiobench-scoring.json | Per-domain scores, benchmark-wide totals, and a SHA-256 input manifest of the scored domain files |
| eval/results/compbiobench-scoring.md | Human-readable results table |
The automation commits any change to these files back to the repo, so the committed bundle always reflects the latest scored domain data.
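The sketch below shows one way a SHA-256 input manifest and a deterministic JSON bundle could be built. It is not the actual code in scripts/run_compbiobench_scoring.py: the function names, the JSON keys (`domains`, `totals`, `input_manifest`), and the choice of a mean as the benchmark-wide total are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def build_input_manifest(domains_dir: Path) -> dict[str, str]:
    """Map each file under `domains_dir` to its SHA-256 hex digest.

    Keys are POSIX-style relative paths, visited in sorted order so the
    manifest does not depend on filesystem listing order.
    """
    manifest: dict[str, str] = {}
    for path in sorted(p for p in domains_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(domains_dir).as_posix()
        manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def render_bundle(scores: dict[str, float], manifest: dict[str, str]) -> str:
    """Serialize scores and manifest to JSON with a stable byte layout.

    Hypothetical bundle shape: no timestamp is included, keys are sorted,
    and the total is summed in sorted-key order.
    """
    ordered = [scores[name] for name in sorted(scores)]
    bundle = {
        "domains": scores,
        "totals": {
            "domain_count": len(ordered),
            "mean_score": sum(ordered) / len(ordered) if ordered else 0.0,
        },
        "input_manifest": manifest,
    }
    return json.dumps(bundle, indent=2, sort_keys=True) + "\n"
```

Sorting keys and omitting wall-clock data are what make the serialized output depend only on the inputs, which is the property the committed bundle relies on.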
Reproducibility
The bundle carries no wall-clock timestamp. Re-running the scorer against unchanged domain data yields byte-identical output, so the committed artifacts are reproducible. Regenerate locally with:
Verify the committed bundle matches the current domain data (this is what the automation and the test suite assert):
A non-zero exit means the committed bundle has drifted from the domain data and must be regenerated.
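The drift check described above amounts to a byte-for-byte comparison between the committed bundle and a freshly regenerated one. The following is a minimal sketch of that idea, assuming the regenerated bundle is already available as bytes. The function names are hypothetical, and the real scorer's command-line interface is not shown here.

```python
import sys
from pathlib import Path


def bundle_has_drifted(committed: Path, regenerated: bytes) -> bool:
    """Return True when the committed file is missing or differs by any byte."""
    return not committed.is_file() or committed.read_bytes() != regenerated


def check_exit_code(committed: Path, regenerated: bytes) -> int:
    """Return 0 when the bundle is up to date and 1 when it has drifted."""
    if bundle_has_drifted(committed, regenerated):
        print(f"{committed} is out of date; regenerate the bundle", file=sys.stderr)
        return 1
    return 0
```

A script would typically pass the return value to `sys.exit()`, so that CI and the test suite see the non-zero status.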
Observability and failure visibility
- Job summary: every run appends the results table (or a failure notice) to $GITHUB_STEP_SUMMARY, visible on the run page.
- Artifact: the scoring bundle is uploaded as a workflow artifact (compbiobench-scoring-<run_id>, 90-day retention) even when a step fails.
- Failures stay visible: if the scoring run fails, no bundle is produced, the summary says so, and the workflow run is marked failed rather than passing silently.
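The job-summary behaviour can be sketched as follows. GitHub Actions exposes the summary file path in the `GITHUB_STEP_SUMMARY` environment variable, and Markdown appended to that file is rendered on the run page. The function names and the wording of the failure notice are assumptions for illustration, not the workflow's actual implementation.

```python
import os
from pathlib import Path
from typing import Mapping


def summary_body(results_md: Path) -> str:
    """Return the results table if it exists, otherwise a failure notice."""
    if results_md.is_file():
        return results_md.read_text(encoding="utf-8")
    # Hypothetical failure text; the real workflow's wording may differ.
    return "### CompBioBench++ scoring failed\n\nNo bundle was produced.\n"


def append_job_summary(markdown: str, env: Mapping[str, str] = os.environ) -> bool:
    """Append Markdown to the job summary file; return False outside Actions."""
    summary_path = env.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return False
    # Append rather than overwrite so earlier steps' summaries are preserved.
    with open(summary_path, "a", encoding="utf-8") as fh:
        fh.write(markdown.rstrip("\n") + "\n")
    return True
```

Returning `False` when the variable is unset lets the same helper run locally without side effects.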