Data Policy & Security

Scope

  • Strong separation of concerns: core library and CI run offline by default; data acquisition is manual and opt-in.

  • Reproducible, conservative results: artifacts report uncertainty and ship with an attestation that records hashes and environment info.

Acquisition

  • Prefer CSV/JSON inputs you control. Avoid serialized Python objects (pickle) due to code execution risk.

  • Materials Project access is manual: set MP_API_KEY only in the materials_audit workflow or in local sessions.

  • The RouterBench dataset remains gated; the helper script fetches it from the canonical source and warns about the .pkl risk.
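
A minimal sketch of the "CSV/JSON only" rule above, assuming a hypothetical helper named load_table. The function name and the suffix lists are illustrative and are not part of the project's API. The sketch refuses pickle files before opening them and parses only CSV and JSON.

```python
import csv
import json
from pathlib import Path

# Hypothetical suffix list for illustration; the project may define its own.
UNSAFE_SUFFIXES = {".pkl", ".pickle"}


def load_table(path):
    """Load rows from a CSV or JSON file; refuse pickle files outright."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix in UNSAFE_SUFFIXES:
        # Unpickling can run arbitrary code, so the file is never opened.
        raise ValueError(f"refusing to load {p.name}: pickle files are not accepted")
    if suffix == ".csv":
        with p.open(newline="", encoding="utf-8") as fh:
            # Each row becomes a dict keyed by the header line; values stay strings.
            return list(csv.DictReader(fh))
    if suffix == ".json":
        with p.open(encoding="utf-8") as fh:
            return json.load(fh)
    raise ValueError(f"unsupported input format: {suffix or '(none)'}")
```

Rejecting by suffix is only a first line of defense: a renamed pickle would fail at the CSV or JSON parser instead of being deserialized, which is the property that matters here.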

CI/CD boundaries

  • Default CI does not download external data or call external APIs.

  • Manual workflows (workflow_dispatch) accept user-provided CSV paths and secrets, and upload results as workflow artifacts; they do not publish Releases.

  • An attestation JSON (hashes, env, commit) accompanies the outputs so that reviewers can reproduce them.
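
The attestation described above could be assembled roughly as follows. This is a sketch under stated assumptions: the function names (sha256_of, current_commit, build_attestation) and the JSON field names are hypothetical, and the project's actual attestation schema may differ.

```python
import hashlib
import platform
import subprocess
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_commit():
    """Return the current Git commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def build_attestation(paths):
    """Collect file hashes, environment info, and commit (assumed field names)."""
    return {
        "hashes": {str(p): sha256_of(p) for p in paths},
        "env": {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        },
        "commit": current_commit(),
    }
```

The resulting dictionary contains only strings and None, so it can be written next to the outputs with json.dump.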

Security

  • Bandit scans src/compitum, tools, examples, and scripts in CI.

  • We do not unpickle arbitrary files in CI; .pkl files are treated as advisory only and are never committed.

  • Secrets are scoped to manual jobs and not echoed in logs; outputs exclude secrets.
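
One way to keep a secret such as MP_API_KEY out of logs and outputs is to read it from the environment and mask it in any text before printing. The helpers below (require_secret, redact) are hypothetical names for illustration, not the project's implementation.

```python
import os


def require_secret(name="MP_API_KEY"):
    """Read a secret from the environment; fail without echoing any value."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; this step is manual and opt-in")
    return value


def redact(text, secrets):
    """Replace every occurrence of each non-empty secret with a fixed mask."""
    for secret in secrets:
        if secret:
            text = text.replace(secret, "***")
    return text
```

Passing log messages and error strings through redact before they are written gives a second safeguard on top of the CI platform's own secret masking.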

Recommended layout

  • data/ for local inputs and products (ignored by Git except where noted below). Suggested subfolders:

    • data/external/ for third-party CSVs

    • data/samples/ for small examples (checked in)

  • Avoid committing third-party data; instead, reference provenance in attestation or a small README next to the file.

Reproducibility

  • Use the offline Matbench workflow to calibrate and evaluate from a repo CSV path, producing:

    • Calibration JSON, Regret CSV/JSON (plus optional groups/budget outputs), Baseline CSV/JSON, Layers CSV/JSON, and Attestation JSON.