CompBioBench++ Weekly Scoring

The CompBioBench++ weekly scoring automation runs the Hordago-Eval benchmark scoring engine on a fixed schedule, commits the refreshed results into the repo, and publishes an observable job status. It keeps a durable, reproducible record of benchmark scores without any manual step.

Schedule

| Field | Value |
| --- | --- |
| Workflow | .github/workflows/compbiobench-scoring.yml |
| Trigger | schedule, every Monday at 06:00 UTC (cron: "0 6 * * 1") |
| Manual trigger | workflow_dispatch (run on demand from the Actions tab) |
| Runner | [self-hosted, Linux, X64, general-ci, pr-fast] |

The Monday 06:00 UTC slot sits one hour after the 05:00 UTC weekly cost rollup, so the two automations do not contend for the same runner window.
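To make the cron expression concrete, the following is a minimal sketch that computes the next scheduled start for "0 6 * * 1" (minute 0, hour 6, day-of-week 1 = Monday, evaluated in UTC). The helper name `next_monday_0600_utc` is hypothetical and not part of the repository; it only illustrates the schedule, and actual GitHub Actions start times can lag the nominal slot.

```python
from datetime import datetime, timedelta, timezone


def next_monday_0600_utc(now: datetime) -> datetime:
    """Return the first Monday 06:00 UTC strictly after `now`.

    `now` must be timezone-aware. Hypothetical helper for illustration only.
    """
    now = now.astimezone(timezone.utc)
    # Start from 06:00 UTC on the current day.
    candidate = now.replace(hour=6, minute=0, second=0, microsecond=0)
    # datetime.weekday() returns 0 for Monday; move forward to the next Monday
    # (zero days if today already is Monday).
    candidate += timedelta(days=(0 - now.weekday()) % 7)
    # If that slot has already passed (or is exactly now), use next week's slot.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

For example, called with Wednesday 2024-01-03 12:00 UTC, the helper returns Monday 2024-01-08 06:00 UTC.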

What it produces

The scoring backend is scripts/run_compbiobench_scoring.py, which scores every benchmark domain in eval/domains/ through eval/scoring.py and writes a deterministic bundle into eval/results/:

| Artifact | Contents |
| --- | --- |
| eval/results/compbiobench-scoring.json | Per-domain scores, benchmark-wide totals, and a SHA-256 input manifest of the scored domain files |
| eval/results/compbiobench-scoring.md | Human-readable results table |

The automation commits any change to these files back to the repo, so the committed bundle always reflects the latest scored domain data.
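As an illustration of what a SHA-256 input manifest and a deterministic bundle can look like, here is a minimal sketch. It is not the code in scripts/run_compbiobench_scoring.py: the function names and the JSON keys `domains` and `input_manifest` are assumptions made for this example, and the real bundle layout may differ.

```python
import hashlib
import json
from pathlib import Path


def build_input_manifest(domains_dir: Path) -> dict[str, str]:
    """Map each file under `domains_dir` to the SHA-256 hex digest of its bytes.

    Keys are POSIX-style paths relative to `domains_dir`, visited in sorted
    order so the manifest does not depend on filesystem listing order.
    """
    manifest: dict[str, str] = {}
    for path in sorted(p for p in domains_dir.rglob("*") if p.is_file()):
        rel = path.relative_to(domains_dir).as_posix()
        manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def render_bundle(scores: dict[str, float], manifest: dict[str, str]) -> str:
    """Serialize scores and manifest to JSON text.

    Sorted keys, fixed indentation, and no timestamp mean identical inputs
    produce identical text. The key names here are hypothetical.
    """
    bundle = {"domains": scores, "input_manifest": manifest}
    return json.dumps(bundle, indent=2, sort_keys=True) + "\n"
```

Hashing raw bytes rather than decoded text keeps the manifest sensitive to any change in a domain file, including whitespace and line endings.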

Reproducibility

The bundle carries no wall-clock timestamp. Re-running the scorer against unchanged domain data yields byte-identical output, so the committed artifacts are reproducible. Regenerate locally with:

python3 scripts/run_compbiobench_scoring.py

Verify the committed bundle matches the current domain data (this is what the automation and the test suite assert):

python3 scripts/run_compbiobench_scoring.py --check

A non-zero exit means the committed bundle has drifted from the domain data and must be regenerated.
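The drift check described above can be sketched as a byte-for-byte comparison between a freshly rendered bundle and the committed file. This is an assumption about how a check of this kind might work, not the actual implementation of `--check`; the name `check_bundle` is hypothetical.

```python
import sys
from pathlib import Path


def check_bundle(expected: str, committed_path: Path) -> int:
    """Return 0 if the committed file equals `expected` byte for byte, else 1.

    `expected` is the bundle text regenerated from the current domain data.
    Hypothetical helper; the real script may compare differently.
    """
    if not committed_path.is_file():
        print(f"missing committed bundle: {committed_path}", file=sys.stderr)
        return 1
    if committed_path.read_bytes() != expected.encode("utf-8"):
        print(f"drift detected: {committed_path} must be regenerated", file=sys.stderr)
        return 1
    return 0
```

A caller would pass the return value to `sys.exit(...)`, so that drift surfaces as a non-zero exit code in CI and in the test suite.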

Observability and failure visibility

  • Job summary — every run appends the results table (or a failure notice) to $GITHUB_STEP_SUMMARY, visible on the run page.
  • Artifact — the scoring bundle is uploaded as a workflow artifact (compbiobench-scoring-<run_id>, 90-day retention) even when a step fails.
  • Failures stay visible — if the scoring run fails, no bundle is produced, the summary says so, and the workflow run is marked failed rather than passing silently.
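The job-summary behaviour in the list above can be sketched as follows. GitHub Actions exposes the summary file path in the GITHUB_STEP_SUMMARY environment variable, and text appended to that file is rendered on the run page. The function name `append_job_summary`, the heading, and the failure wording are assumptions for illustration, not the workflow's actual output.

```python
import os
from typing import Optional


def append_job_summary(results_table: Optional[str]) -> bool:
    """Append the results table, or a failure notice, to the job summary.

    Pass None when scoring failed and no bundle was produced.
    Returns False when GITHUB_STEP_SUMMARY is unset (for example, a local run).
    """
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return False
    heading = "## CompBioBench++ weekly scoring\n\n"  # hypothetical heading
    if results_table is None:
        body = heading + "Scoring failed; no bundle was produced.\n"
    else:
        body = heading + results_table.rstrip("\n") + "\n"
    # Open in append mode so summaries from earlier steps are preserved.
    with open(summary_path, "a", encoding="utf-8") as fh:
        fh.write(body)
    return True
```

After writing a failure notice, the calling script would still exit with a non-zero status so the run is marked failed rather than passing silently.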