Data Policy & Security

Scope

Strong separation of concerns: core library and CI run offline by default; data acquisition is manual and opt-in.
Reproducible, conservative results: artifacts include uncertainty and attestation with hashes and environment info.

Acquisition

Prefer CSV/JSON inputs you control. Avoid serialized Python objects (pickle) due to code execution risk.
Materials Project access is manual: use MP_API_KEY in the materials_audit workflow or local sessions only.
The RouterBench dataset remains gated; the helper script fetches it from the canonical source and warns about the code execution risk of .pkl files.
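
The preference for CSV/JSON over pickle can be enforced at load time. The following is a minimal sketch, not the project's actual loader: the function name `load_table` and the set of refused suffixes are assumptions made for illustration.

```python
import csv
import json
from pathlib import Path

# Suffixes refused outright; this set is an illustrative assumption.
UNSAFE_SUFFIXES = {".pkl", ".pickle"}


def load_table(path):
    """Load data from a CSV or JSON file; refuse pickle files (hypothetical helper)."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix in UNSAFE_SUFFIXES:
        # Unpickling untrusted data can execute arbitrary code, so never attempt it.
        raise ValueError(f"refusing to load {p.name}: pickle files are not accepted")
    if suffix == ".csv":
        with p.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))  # one dict per row, keyed by header
    if suffix == ".json":
        with p.open(encoding="utf-8") as fh:
            return json.load(fh)
    raise ValueError(f"unsupported input format: {suffix or '(none)'}")
```

Rejecting by file suffix is only a first line of defense; it does not inspect file contents, so inputs should still come from sources you control.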

CI/CD boundaries

Default CI does not download external data or call external APIs.
Manual workflows (workflow_dispatch) accept user-provided CSV paths and secrets, and upload artifacts (no Releases).
Attestation JSON (hashes, environment, commit) accompanies outputs so that reviewers can reproduce them.
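
As a rough illustration of what such an attestation might contain, the sketch below hashes output files with SHA-256 and records the Python version, platform, and Git commit. The function names and the JSON field names (`hashes`, `env`, `commit`) are assumptions for this example; the project's actual schema may differ.

```python
import hashlib
import json
import platform
import subprocess
import sys


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit():
    """Return the current Git commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def build_attestation(paths):
    """Assemble an attestation dict (hypothetical schema) for the given files."""
    return {
        "hashes": {str(p): sha256_file(p) for p in paths},
        "env": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "commit": current_commit(),
    }
```

Calling `json.dumps(build_attestation(paths), indent=2)` yields a document that can be written next to the outputs and uploaded with them.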

Security

Bandit scans src/compitum, tools, examples, scripts in CI.
We do not unpickle arbitrary files in CI; .pkl files are advisory only and never committed.
Secrets are scoped to manual jobs and not echoed in logs; outputs exclude secrets.
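
One way to keep secrets out of logs and outputs is to mask their values before any text is written. This is a minimal sketch under stated assumptions: the helper `redact` is hypothetical, and it assumes secrets are exposed to the job as environment variables such as MP_API_KEY.

```python
import os


def redact(text, secret_names=("MP_API_KEY",)):
    """Replace the values of the named environment secrets in text with '***'."""
    for name in secret_names:
        value = os.environ.get(name)
        if value:  # skip unset or empty secrets so we never replace the empty string
            text = text.replace(value, "***")
    return text
```

Masking is a safety net rather than a substitute for not printing secrets in the first place; it will not catch transformed or encoded copies of a value.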

Recommended layout

data/ for local inputs and products (ignored by Git). Suggested subfolders:
- data/external/ for third-party CSVs
- data/samples/ for small examples (checked in)
Avoid committing third-party data; instead, reference provenance in attestation or a small README next to the file.

Reproducibility

Use the offline Matbench workflow to calibrate and evaluate from a repo CSV path, producing:
- Calibration JSON, Regret CSV/JSON (optionally with groups/budget), Baseline CSV/JSON, Layers CSV/JSON, and Attestation JSON.