How We Evaluate Matbench (Offline)¶
This page outlines our offline-first, reproducible evaluation path for Matbench-style tasks, and the manual GitHub workflows that package artifacts for review.
Principles

- Offline-first: no network calls or external datasets required.
- Conservative reporting: uncertainty via bootstrap CIs (see the sketch after this list); avoid comparative claims by default.
- Provenance: attestation with file hashes, environment info, and git commit when available.
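For reference, here is a minimal percentile-bootstrap sketch of how such CIs can be computed; `bootstrap_ci` is a hypothetical helper, not the exact routine the tools use:

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of `values` (illustrative only)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and collect the statistic of interest.
    means = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return values.mean(), lo, hi

# Example: point estimate with a 95% CI for a small sample.
print(bootstrap_ci([0.1, 0.3, 0.2, 0.4, 0.25]))
```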
CLI pipeline (local)
Generate demo CSV:
python examples/generate_matbench_demo.py --out data/matbench_demo.csv
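The generator script defines the real schema; purely as an illustration, this sketch writes a CSV with the columns referenced elsewhere on this page (y_true, kappa, leak, group) — treat that column set as an assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "y_true": rng.normal(size=n),             # objective column (maximized here)
    "kappa": rng.uniform(size=n),             # utility/confidence score
    "leak": rng.uniform(size=n),              # penalty term in the SRMF score
    "group": rng.choice(["A", "B", "C"], n),  # optional grouping column
})
df.to_csv("data/matbench_demo.csv", index=False)
```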
Calibrate λ for the selection score kappa − λ·leak on the validation split; report test AURC with bootstrap CIs:
python tools/calibrate_matbench_srmf.py --path data/matbench_demo.csv --objective-col y_true --mode max --topk-grid 1,5,10 --lambda-grid 0.0,0.5,1.0 --bootstrap 1000 --seed 0 --out-json reports/matbench_calibration.json --scores-out reports/matbench_scores_test.csv
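Conceptually, calibration sweeps the λ grid and keeps the value that minimizes validation AURC for the score kappa − λ·leak. The sketch below assumes AURC is the mean of regret@k over the top-k grid, which may not match the tool's exact definition:

```python
import numpy as np

def regret_at_k(y_true, scores, k):
    """Opportunity loss vs. oracle: best true value minus best true value in the top-k by score."""
    top = np.argsort(scores)[::-1][:k]
    return y_true.max() - y_true[top].max()

def aurc(y_true, scores, topk_grid=(1, 5, 10)):
    """Assumed here: mean regret@k over the top-k grid."""
    return np.mean([regret_at_k(y_true, scores, k) for k in topk_grid])

def calibrate_lambda(kappa, leak, y_true, lambda_grid=(0.0, 0.5, 1.0)):
    """Pick the λ that minimizes validation AURC of score = kappa − λ·leak."""
    return min(lambda_grid, key=lambda lam: aurc(y_true, kappa - lam * leak))
```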
Evaluate regret with the tuned λ (including per-group curves if a group column exists):
python tools/eval_matbench_regret.py --path data/matbench_demo.csv --objective-col y_true --mode max --use-srmf --lambda-weight $(jq -r .best_lambda reports/matbench_calibration.json) --topk-grid 1,5,10 --group-col group --out-csv reports/matbench_regret.csv --out-json reports/matbench_regret.json --out-group-csv reports/matbench_regret_groups.csv --bootstrap 1000 --seed 0
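A rough sketch of what the per-group evaluation computes, assuming the score form above and a placeholder λ (in practice, read the tuned value from reports/matbench_calibration.json):

```python
import numpy as np
import pandas as pd

def regret_at_k(y_true, scores, k):
    top = np.argsort(scores)[::-1][:k]
    return y_true.max() - y_true[top].max()

df = pd.read_csv("data/matbench_demo.csv")
lam = 0.5  # stand-in for best_lambda from reports/matbench_calibration.json
df["score"] = df["kappa"] - lam * df["leak"]

# One regret value per (group, k) pair, mirroring the per-group curves.
rows = [
    {"group": g, "k": k,
     "regret": regret_at_k(sub["y_true"].to_numpy(), sub["score"].to_numpy(), k)}
    for g, sub in df.groupby("group")
    for k in (1, 5, 10)
]
print(pd.DataFrame(rows))
```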
Attestation:
python tools/generate_matbench_attestation.py --input-csv data/matbench_demo.csv --calibration-json reports/matbench_calibration.json --regret-json reports/matbench_regret.json --out reports/matbench_attestation.json
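In spirit, the attestation ties outputs to inputs; here is a minimal sketch assuming sha256 file hashes and a best-effort git commit (field names are illustrative, not the tool's exact schema):

```python
import hashlib
import json
import platform
import subprocess
from pathlib import Path

def sha256(path):
    """Hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def git_commit():
    """Current commit, or None outside a git checkout ("when available")."""
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except (OSError, subprocess.CalledProcessError):
        return None

attestation = {
    "files": {p: sha256(p) for p in [
        "data/matbench_demo.csv",
        "reports/matbench_calibration.json",
        "reports/matbench_regret.json",
    ]},
    "env": {"python": platform.python_version(), "platform": platform.platform()},
    "git_commit": git_commit(),
}
Path("reports/matbench_attestation.json").write_text(json.dumps(attestation, indent=2))
```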
Manual GitHub workflows (no releases)
matbench_offline (workflow_dispatch): calibrates and evaluates on a committed CSV path and uploads artifacts:

- Calibration JSON (tuned λ, AURC, CI, splits)
- Test scores CSV (kappa, leak, score)
- Regret@k CSV and summary JSON (AURC, optional CI)
- Attestation JSON (hashes, env, commit)
materials_audit (workflow_dispatch): runs a live audit with secrets.MP_API_KEY and uploads the resulting CSV.
Artifacts and review

- Prefer linking the attestation JSON alongside the CSV/JSON outputs.
- Provide the exact CLI commands used (with seeds) and the CSV schema.
- Keep claims conservative; regret is reported as opportunity loss vs. the oracle.
Baselines and Plots¶
Baseline regret: tools/eval_baseline_regret.py --model ridge --folds 5 --topk-grid 1,5,10 --plot
If matplotlib is present, PNGs are written to reports/.
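For intuition, a minimal sketch of a cross-validated ridge baseline with regret@k scoring; the synthetic X and y stand in for features derived from the real task CSV:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic stand-ins for the task's feature matrix X and objective y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=200)

regrets = []
for train, test in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    preds = Ridge().fit(X[train], y[train]).predict(X[test])
    top = np.argsort(preds)[::-1][:10]  # top-10 by predicted objective
    regrets.append(y[test].max() - y[test][top].max())
print("mean regret@10:", float(np.mean(regrets)))
```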
Emergent Layers¶
Explore SRMF layers via quantiles or k-means: tools/explore_matbench_layers.py
Outputs CSV and JSON with per-layer AURC; use them to guide per-layer λ tuning.
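A minimal sketch of quantile layering with per-layer regret, assuming the SRMF score form kappa − λ·leak with a placeholder λ = 0.5 (the tool also supports k-means layers):

```python
import numpy as np
import pandas as pd

def regret_at_k(y_true, scores, k=5):
    top = np.argsort(scores)[::-1][:k]
    return y_true.max() - y_true[top].max()

df = pd.read_csv("data/matbench_demo.csv")
df["score"] = df["kappa"] - 0.5 * df["leak"]            # assumed SRMF score, λ = 0.5
df["layer"] = pd.qcut(df["score"], q=4, labels=False)   # quantile layering

# Per-layer regret suggests where a different λ might help.
for layer, sub in df.groupby("layer"):
    print(layer, regret_at_k(sub["y_true"].to_numpy(), sub["score"].to_numpy()))
```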
Exporting a Task CSV¶
Export via the Materials Project (requires MP_API_KEY): tools/export_matbench_task_csv.py --from-mp --elements La Ni O --nelements 3 --objective band_gap --limit 500 --out data/mp_matbench_task.csv
Or generate a synthetic task: tools/export_matbench_task_csv.py --offline-mock --out data/matbench_task.csv
Then run the offline workflow or local pipeline on the exported CSV.