How We Evaluate Matbench (Offline)

This page outlines our offline-first, reproducible evaluation path for Matbench-style tasks, and the manual GitHub workflows that package artifacts for review.

Principles

  • Offline-first: no network calls or external datasets required.

  • Conservative reporting: uncertainty via bootstrap CIs (see the sketch after this list); avoid comparative claims by default.

  • Provenance: attestation with file hashes, environment info, and git commit when available.
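
The bootstrap CIs mentioned above can be reproduced with a plain percentile bootstrap. A minimal sketch, assuming the metric is a mean over per-item values at a 95% level (the tools' exact resampling details may differ):

    import numpy as np

    def bootstrap_ci(values, metric=np.mean, n_boot=1000, alpha=0.05, seed=0):
        """Percentile bootstrap CI for a scalar metric over per-item values."""
        rng = np.random.default_rng(seed)
        values = np.asarray(values)
        stats = np.empty(n_boot)
        for i in range(n_boot):
            # Resample items with replacement, recompute the metric each time.
            sample = rng.choice(values, size=len(values), replace=True)
            stats[i] = metric(sample)
        lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
        return metric(values), (lo, hi)

    # Example: point estimate and 95% CI for a handful of regret values.
    point, (lo, hi) = bootstrap_ci([0.12, 0.05, 0.30, 0.0, 0.18])
    print(f"mean={point:.3f}  95% CI=({lo:.3f}, {hi:.3f})")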

CLI pipeline (local)

  • Generate demo CSV:

    • python examples/generate_matbench_demo.py --out data/matbench_demo.csv

  • Calibrate λ (score = kappa − λ·leak) on the validation split; report test AURC with CIs (see the sketch after this list):

    • python tools/calibrate_matbench_srmf.py --path data/matbench_demo.csv --objective-col y_true --mode max --topk-grid 1,5,10 --lambda-grid 0.0,0.5,1.0 --bootstrap 1000 --seed 0 --out-json reports/matbench_calibration.json --scores-out reports/matbench_scores_test.csv

  • Evaluate regret with tuned λ (including per-group curves if group exists):

    • python tools/eval_matbench_regret.py --path data/matbench_demo.csv --objective-col y_true --mode max --use-srmf --lambda-weight $(jq -r .best_lambda reports/matbench_calibration.json) --topk-grid 1,5,10 --group-col group --out-csv reports/matbench_regret.csv --out-json reports/matbench_regret.json --out-group-csv reports/matbench_regret_groups.csv --bootstrap 1000 --seed 0

  • Attestation:

    • python tools/generate_matbench_attestation.py --input-csv data/matbench_demo.csv --calibration-json reports/matbench_calibration.json --regret-json reports/matbench_regret.json --out reports/matbench_attestation.json
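
The calibration step selects λ by grid search and the evaluation step scores the held-out data with it. A minimal sketch of the scoring rule and the selection loop; the kappa/leak/y_true column names follow the flags above, and aurc here is a hypothetical stand-in (mean regret@k over the k grid), not necessarily the tool's exact definition:

    import numpy as np
    import pandas as pd

    def regret_at_k(df, score_col, k, objective_col="y_true"):
        """Opportunity loss vs. oracle (mode max): best true value overall
        minus the best true value among the top-k items ranked by score."""
        top = df.nlargest(k, score_col)
        return df[objective_col].max() - top[objective_col].max()

    def aurc(df, score_col, topk_grid=(1, 5, 10)):
        """Stand-in AURC: mean regret@k across the k grid."""
        return float(np.mean([regret_at_k(df, score_col, k) for k in topk_grid]))

    # Assumes kappa/leak/y_true columns; the tool splits this into val/test.
    val = pd.read_csv("data/matbench_demo.csv")
    best_lam, best_aurc = None, float("inf")
    for lam in (0.0, 0.5, 1.0):                 # mirrors --lambda-grid
        val["score"] = val["kappa"] - lam * val["leak"]
        a = aurc(val, "score")
        if a < best_aurc:
            best_lam, best_aurc = lam, a
    print(f"best_lambda={best_lam}  val AURC={best_aurc:.4f}")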

Manual GitHub workflows (no releases)

  • matbench_offline (workflow_dispatch): calibrates and evaluates on a committed CSV path and uploads artifacts:

    • Calibration JSON (tuned λ, AURC, CI, splits)

    • Test scores CSV (kappa, leak, score)

    • Regret@k CSV and summary JSON (AURC, optional CI)

    • Attestation JSON (hashes, env, commit); see the provenance sketch below

  • materials_audit (workflow_dispatch): runs a live audit using secrets.MP_API_KEY and uploads the resulting CSV.
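
Both paths end in the same attestation record. A minimal sketch of the provenance capture it describes, with illustrative field names (the actual JSON schema is defined by tools/generate_matbench_attestation.py):

    import hashlib, json, platform, subprocess, sys

    def sha256_file(path):
        """Stream a file through SHA-256 for a stable content hash."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 16), b""):
                h.update(chunk)
        return h.hexdigest()

    def git_commit():
        """Current commit hash, or None when not in a git checkout."""
        try:
            return subprocess.check_output(
                ["git", "rev-parse", "HEAD"], text=True).strip()
        except (subprocess.CalledProcessError, FileNotFoundError):
            return None

    attestation = {
        "files": {p: sha256_file(p) for p in ["data/matbench_demo.csv"]},
        "env": {"python": sys.version, "platform": platform.platform()},
        "commit": git_commit(),
    }
    print(json.dumps(attestation, indent=2))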

Artifacts and review

  • Prefer linking the attestation JSON alongside CSV/JSON outputs.

  • Provide the exact CLI commands used (with seeds) and the CSV schema.

  • Keep claims conservative; regret is reported as opportunity loss vs. oracle, i.e. at each k, the oracle's best objective value minus the best value among the top-k selections.

Baselines and Plots

  • Baseline regret: tools/eval_baseline_regret.py --model ridge --folds 5 --topk-grid 1,5,10 --plot (sketched after this list)

  • If matplotlib is present, PNG plots are written to reports/.
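
A minimal sketch of such a baseline: ridge regression scored with 5-fold out-of-fold predictions, so no item is ranked by a model that trained on it. The feature columns are an assumption for illustration; the tool's featurization may differ:

    import pandas as pd
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_predict

    df = pd.read_csv("data/matbench_demo.csv")
    X = df[["kappa", "leak"]].to_numpy()   # assumed feature columns
    y = df["y_true"].to_numpy()

    # Out-of-fold predictions: each row is scored by a fold that excluded it.
    df["pred"] = cross_val_predict(Ridge(), X, y, cv=5)

    for k in (1, 5, 10):                   # mirrors --topk-grid
        top = df.nlargest(k, "pred")
        print(f"regret@{k} = {y.max() - top['y_true'].max():.4f}")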

Emergent Layers

  • Explore SRMF layers via quantiles or k-means: tools/explore_matbench_layers.py

  • Outputs CSV and JSON with per-layer AURC; use these to guide per-layer λ tuning (see the sketch below).
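
A minimal sketch of the two layering strategies on the test-scores CSV (kappa, leak, score columns as above); per-layer AURC and λ tuning then proceed within each layer:

    import pandas as pd
    from sklearn.cluster import KMeans

    df = pd.read_csv("reports/matbench_scores_test.csv")

    # Strategy 1: quantile layers on the SRMF score.
    df["layer_q"] = pd.qcut(df["score"], q=4, labels=False, duplicates="drop")

    # Strategy 2: k-means layers in (kappa, leak) space.
    km = KMeans(n_clusters=4, n_init=10, random_state=0)
    df["layer_km"] = km.fit_predict(df[["kappa", "leak"]])

    # Per-layer summaries indicate where a shared λ under- or over-penalizes leak.
    print(df.groupby("layer_q")[["kappa", "leak", "score"]].mean())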

Exporting a Task CSV

  • Export via Materials Project (requires MP_API_KEY): tools/export_matbench_task_csv.py --from-mp --elements La Ni O --nelements 3 --objective band_gap --limit 500 --out data/mp_matbench_task.csv

  • Or generate a synthetic task: tools/export_matbench_task_csv.py --offline-mock --out data/matbench_task.csv

  • Then run the offline workflow or local pipeline on the exported CSV.
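
If neither export path fits, a synthetic CSV with the columns the pipeline expects can be written directly. The y_true/kappa/leak/group schema below is inferred from the flags above, not a documented contract:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(0)
    n = 500
    y = rng.normal(size=n)                             # objective, mode max
    df = pd.DataFrame({
        "y_true": y,
        "kappa": y + rng.normal(scale=0.5, size=n),    # noisy utility signal
        "leak": rng.random(n),                         # penalty term for SRMF
        "group": rng.choice(["A", "B", "C"], size=n),  # optional group column
    })
    df.to_csv("data/matbench_task.csv", index=False)
    print(df.head())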