# How We Evaluate Matbench (Offline)

This page outlines our offline-first, reproducible evaluation path for Matbench-style tasks, and the manual GitHub workflows that package artifacts for review.

## Principles

- Offline-first: no network calls or external datasets required.
- Conservative reporting: uncertainty via bootstrap CIs (sketched at the end of this page); avoid comparative claims by default.
- Provenance: attestation with file hashes, environment info, and git commit when available.

## CLI pipeline (local)

- Generate the demo CSV:
  - `python examples/generate_matbench_demo.py --out data/matbench_demo.csv`
- Calibrate λ (score = kappa − λ·leak) on validation; report test AURC with CIs (see the λ-tuning sketch at the end of this page):
  - `python tools/calibrate_matbench_srmf.py --path data/matbench_demo.csv --objective-col y_true --mode max --topk-grid 1,5,10 --lambda-grid 0.0,0.5,1.0 --bootstrap 1000 --seed 0 --out-json reports/matbench_calibration.json --scores-out reports/matbench_scores_test.csv`
- Evaluate regret with the tuned λ (including per-group curves if a `group` column exists):
  - `python tools/eval_matbench_regret.py --path data/matbench_demo.csv --objective-col y_true --mode max --use-srmf --lambda-weight $(jq -r .best_lambda reports/matbench_calibration.json) --topk-grid 1,5,10 --group-col group --out-csv reports/matbench_regret.csv --out-json reports/matbench_regret.json --out-group-csv reports/matbench_regret_groups.csv --bootstrap 1000 --seed 0`
- Generate the attestation (see the attestation sketch at the end of this page):
  - `python tools/generate_matbench_attestation.py --input-csv data/matbench_demo.csv --calibration-json reports/matbench_calibration.json --regret-json reports/matbench_regret.json --out reports/matbench_attestation.json`

## Manual GitHub workflows (no releases)

- `matbench_offline` (workflow_dispatch): calibrates and evaluates on a committed CSV path and uploads artifacts:
  - Calibration JSON (tuned λ, AURC, CI, splits)
  - Test scores CSV (kappa, leak, score)
  - Regret@k CSV and summary JSON (AURC, optional CI)
  - Attestation JSON (hashes, env, commit)
- `materials_audit` (workflow_dispatch): runs a live audit with `secrets.MP_API_KEY` and uploads the resulting CSV.

## Artifacts and review

- Prefer linking the attestation JSON alongside the CSV/JSON outputs.
- Provide the exact CLI commands used (with seeds) and the CSV schema.
- Keep claims conservative; regret is reported as opportunity loss vs. the oracle.

## Baselines and Plots

- Baseline regret: `python tools/eval_baseline_regret.py --model ridge --folds 5 --topk-grid 1,5,10 --plot`
- If matplotlib is present, PNGs are written to `reports/`.

## Emergent Layers

- Explore SRMF layers via quantiles or k-means: `python tools/explore_matbench_layers.py` (see the layer-binning sketch at the end of this page).
- Outputs CSV and JSON with per-layer AURC; use these to guide per-layer λ tuning.

## Exporting a Task CSV

- Export via the Materials Project (requires `MP_API_KEY`):
  - `python tools/export_matbench_task_csv.py --from-mp --elements La Ni O --nelements 3 --objective band_gap --limit 500 --out data/mp_matbench_task.csv`
- Or generate a synthetic task:
  - `python tools/export_matbench_task_csv.py --offline-mock --out data/matbench_task.csv`
- Then run the offline workflow or local pipeline on the exported CSV.
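
## Appendix: illustrative sketches (non-normative)

The snippets below illustrate the computations the tools above report. They are minimal sketches under stated assumptions, not the tools' actual implementations; column names such as `kappa`, `leak`, and `y_true` follow the CLI flags above, and everything else is hypothetical.

First, λ calibration. The calibration step scores each candidate as kappa − λ·leak and picks λ on the validation split. A minimal sketch, assuming the selection objective is mean validation regret@k over the top-k grid (an assumption about `calibrate_matbench_srmf.py`; defer to its code for the real criterion):

```python
import numpy as np
import pandas as pd

def regret_at_k(df, score_col, k, objective_col="y_true"):
    """Opportunity loss vs. the oracle (mode=max): the best objective value
    overall minus the best objective value among the top-k items by score."""
    top_k = df.nlargest(k, score_col)
    return df[objective_col].max() - top_k[objective_col].max()

def tune_lambda(val_df, lambda_grid=(0.0, 0.5, 1.0), topk_grid=(1, 5, 10)):
    """Grid-search lambda for score = kappa - lambda * leak, keeping the
    value with the lowest mean validation regret across the top-k grid."""
    best_lam, best_obj = None, np.inf
    for lam in lambda_grid:
        scored = val_df.assign(score=val_df["kappa"] - lam * val_df["leak"])
        obj = np.mean([regret_at_k(scored, "score", k) for k in topk_grid])
        if obj < best_obj:
            best_lam, best_obj = lam, obj
    return best_lam
```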
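
AURC is reported by both the calibration and regret tools. The sketch below assumes AURC is the trapezoidal area under the regret-vs-k curve, normalized by the k range; that definition is a guess, so check the repo's implementation:

```python
import numpy as np

def aurc(regrets, ks):
    """Trapezoidal area under the regret-vs-k curve, normalized by the
    k range. Assumed definition; the tools may normalize differently."""
    ks = np.asarray(ks, dtype=float)
    regrets = np.asarray(regrets, dtype=float)
    area = np.sum((regrets[1:] + regrets[:-1]) / 2.0 * np.diff(ks))
    return area / (ks[-1] - ks[0])

# Example with the default top-k grid from the commands above:
print(aurc(regrets=[0.20, 0.10, 0.05], ks=[1, 5, 10]))
```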
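
The `--bootstrap 1000 --seed 0` flags above drive the reported CIs. A percentile bootstrap over a per-item metric looks roughly like this; the statistic (the mean) and the interval convention are assumptions:

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`. Seeded so the
    reported interval is reproducible, matching the --seed flag above."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    stats = np.array([
        rng.choice(values, size=values.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), (lo, hi)
```

Applying the same resampling within each group is one way the per-group CIs mentioned above could be produced.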
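
For the Emergent Layers exploration, quantile binning of the SRMF score is the simpler of the two options mentioned above. A sketch assuming the test-scores CSV produced by calibration; the `score` column matches that output, while the choice of four layers is arbitrary:

```python
import pandas as pd

# Bin items into "layers" by score quantile, then compare layers; the
# per-layer AURC reported by explore_matbench_layers.py would be computed
# on each group separately (e.g. with the aurc() sketch above).
df = pd.read_csv("reports/matbench_scores_test.csv")
df["layer"] = pd.qcut(df["score"], q=4, labels=False, duplicates="drop")
print(df.groupby("layer")["score"].agg(["count", "mean", "min", "max"]))
```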
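
Finally, provenance. The attestation JSON carries file hashes, environment info, and the git commit when available. A standard-library sketch of assembling such a record; the field names are illustrative, not the actual schema of `generate_matbench_attestation.py`:

```python
import hashlib
import json
import platform
import subprocess
import sys
from pathlib import Path

def sha256_of(path):
    """Stream the file through SHA-256 so large CSVs are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_attestation(paths):
    """Illustrative attestation record: hashes, environment, commit."""
    try:
        commit = subprocess.check_output(["git", "rev-parse", "HEAD"],
                                         text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        commit = None  # the commit is recorded only "when available"
    return {
        "files": {str(p): sha256_of(p) for p in map(Path, paths)},
        "python": sys.version,
        "platform": platform.platform(),
        "git_commit": commit,
    }

print(json.dumps(build_attestation(["data/matbench_demo.csv"]), indent=2))
```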