# How We Evaluate Matbench (Offline)

This page outlines our offline-first, reproducible evaluation path for Matbench-style tasks, and the manual GitHub workflows that package artifacts for review.

## Principles

- Offline-first: no network calls or external datasets required.
- Conservative reporting: uncertainty via bootstrap CIs (sketched at the end of this page); avoid comparative claims by default.
- Provenance: attestation with file hashes, environment info, and git commit when available.

## CLI pipeline (local)

- Generate the demo CSV:
  - `python examples/generate_matbench_demo.py --out data/matbench_demo.csv`
- Calibrate λ (score = kappa − λ·leak) on validation; report test AURC with CIs (see the λ-tuning sketch at the end of this page):
  - `python tools/calibrate_matbench_srmf.py --path data/matbench_demo.csv --objective-col y_true --mode max --topk-grid 1,5,10 --lambda-grid 0.0,0.5,1.0 --bootstrap 1000 --seed 0 --out-json reports/matbench_calibration.json --scores-out reports/matbench_scores_test.csv`
- Evaluate regret with the tuned λ (including per-group curves if a `group` column exists):
  - `python tools/eval_matbench_regret.py --path data/matbench_demo.csv --objective-col y_true --mode max --use-srmf --lambda-weight $(jq -r .best_lambda reports/matbench_calibration.json) --topk-grid 1,5,10 --group-col group --out-csv reports/matbench_regret.csv --out-json reports/matbench_regret.json --out-group-csv reports/matbench_regret_groups.csv --bootstrap 1000 --seed 0`
- Generate the attestation (see the attestation sketch at the end of this page):
  - `python tools/generate_matbench_attestation.py --input-csv data/matbench_demo.csv --calibration-json reports/matbench_calibration.json --regret-json reports/matbench_regret.json --out reports/matbench_attestation.json`

## Manual GitHub workflows (no releases)

- `matbench_offline` (workflow_dispatch): calibrates and evaluates on a committed CSV path and uploads artifacts:
  - Calibration JSON (tuned λ, AURC, CI, splits)
  - Test scores CSV (kappa, leak, score)
  - Regret@k CSV and summary JSON (AURC, optional CI)
  - Attestation JSON (hashes, env, commit)
- `materials_audit` (workflow_dispatch): runs a live audit with `secrets.MP_API_KEY` and uploads the resulting CSV.

## Artifacts and review

- Prefer linking the attestation JSON alongside the CSV/JSON outputs.
- Provide the exact CLI commands used (with seeds) and the CSV schema.
- Keep claims conservative; regret is reported as opportunity loss vs. the oracle.

## Baselines and Plots

- Baseline regret: `python tools/eval_baseline_regret.py --model ridge --folds 5 --topk-grid 1,5,10 --plot`
- If matplotlib is present, PNGs are written to `reports/`.

## Emergent Layers

- Explore SRMF layers via quantiles or k-means: `python tools/explore_matbench_layers.py` (see the layer-binning sketch at the end of this page).
- Outputs CSV and JSON with per-layer AURC; use these to guide per-layer λ tuning.

## Exporting a Task CSV

- Export via the Materials Project (requires `MP_API_KEY`):
  - `python tools/export_matbench_task_csv.py --from-mp --elements La Ni O --nelements 3 --objective band_gap --limit 500 --out data/mp_matbench_task.csv`
- Or generate a synthetic task:
  - `python tools/export_matbench_task_csv.py --offline-mock --out data/matbench_task.csv`
- Then run the offline workflow or local pipeline on the exported CSV.
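
## Appendix: illustrative sketches (non-normative)

The snippets below illustrate the computations the tools above report. They are minimal sketches under stated assumptions, not the tools' actual implementations; column names such as `kappa`, `leak`, and `y_true` follow the CLI flags above, and everything else is hypothetical.

First, λ calibration. The calibration step scores each candidate as kappa − λ·leak and picks λ on the validation split. A minimal sketch, assuming the selection objective is mean validation regret@k over the top-k grid (an assumption about `calibrate_matbench_srmf.py`; defer to its code for the real criterion):

```python
import numpy as np
import pandas as pd

def regret_at_k(df, score_col, k, objective_col="y_true"):
    """Opportunity loss vs. the oracle (mode=max): the best objective value
    overall minus the best objective value among the top-k items by score."""
    top_k = df.nlargest(k, score_col)
    return df[objective_col].max() - top_k[objective_col].max()

def tune_lambda(val_df, lambda_grid=(0.0, 0.5, 1.0), topk_grid=(1, 5, 10)):
    """Grid-search lambda for score = kappa - lambda * leak, keeping the
    value with the lowest mean validation regret across the top-k grid."""
    best_lam, best_obj = None, np.inf
    for lam in lambda_grid:
        scored = val_df.assign(score=val_df["kappa"] - lam * val_df["leak"])
        obj = np.mean([regret_at_k(scored, "score", k) for k in topk_grid])
        if obj < best_obj:
            best_lam, best_obj = lam, obj
    return best_lam
```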
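
AURC is reported by both the calibration and regret tools. The sketch below assumes AURC is the trapezoidal area under the regret-vs-k curve, normalized by the k range; that definition is a guess, so check the repo's implementation:

```python
import numpy as np

def aurc(regrets, ks):
    """Trapezoidal area under the regret-vs-k curve, normalized by the
    k range. Assumed definition; the tools may normalize differently."""
    ks = np.asarray(ks, dtype=float)
    regrets = np.asarray(regrets, dtype=float)
    area = np.sum((regrets[1:] + regrets[:-1]) / 2.0 * np.diff(ks))
    return area / (ks[-1] - ks[0])

# Example with the default top-k grid from the commands above:
print(aurc(regrets=[0.20, 0.10, 0.05], ks=[1, 5, 10]))
```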
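
The `--bootstrap 1000 --seed 0` flags above drive the reported CIs. A percentile bootstrap over a per-item metric looks roughly like this; the statistic (the mean) and the interval convention are assumptions:

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`. Seeded so the
    reported interval is reproducible, matching the --seed flag above."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    stats = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), (lo, hi)
```

Applying the same resampling within each group is one way the per-group CIs mentioned above could be produced.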
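
For the Emergent Layers exploration, quantile binning of the SRMF score is the simpler of the two options mentioned above. A sketch assuming the test-scores CSV produced by calibration; the `score` column matches that output, while the choice of four layers is arbitrary:

```python
import pandas as pd

# Bin items into "layers" by score quantile, then compare layers; the
# per-layer AURC reported by explore_matbench_layers.py would be computed
# on each group separately (e.g. with the aurc() sketch above).
df = pd.read_csv("reports/matbench_scores_test.csv")
df["layer"] = pd.qcut(df["score"], q=4, labels=False, duplicates="drop")
print(df.groupby("layer")["score"].agg(["count", "mean", "min", "max"]))
```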
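
Finally, provenance. The attestation JSON carries file hashes, environment info, and the git commit when available. A standard-library sketch of assembling such a record; the field names are illustrative, not the actual schema of `generate_matbench_attestation.py`:

```python
import hashlib
import json
import platform
import subprocess
import sys
from pathlib import Path

def sha256_of(path):
    """Stream the file through SHA-256 so large CSVs are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_attestation(paths):
    """Illustrative attestation record: hashes, environment, commit."""
    try:
        commit = subprocess.check_output(["git", "rev-parse", "HEAD"],
                                         text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = None  # the commit is recorded only "when available"
    return {
        "files": {str(p): sha256_of(p) for p in map(Path, paths)},
        "python": sys.version,
        "platform": platform.platform(),
        "git_commit": commit,
    }

print(json.dumps(build_attestation(["data/matbench_demo.csv"]), indent=2))
```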