RouterBench Fairness Notes

Purpose

  • Document how we use RouterBench fairly: identical prompts, budgets, cost accounting, and seeds across baselines and Compitum, with transparent panel composition and configs.

Panel and Configs

  • Panel summary: see the Panel Summary for the tasks, models, WTP slices, and evaluation-unit counts detected from the latest runs.

  • Config files (examples):

    • data/rb_clean/evaluate_routers.yaml

    • data/rb_clean/evaluate_routers_multitask.yaml

  • WTP grid: fixed slices used for headline tables are {0.1, 1.0}. Sensitivity grid for frontier plots includes {1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0}.
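
To make the role of the WTP slices concrete, here is a minimal sketch of a linear quality/cost trade-off evaluated over both grids. The function name, the `quality - cost / wtp` convention, and the example numbers are assumptions for illustration, not the accounting used by the actual eval configs, which may weight cost differently.

```python
def utility(quality: float, cost: float, wtp: float) -> float:
    """Linear quality/cost trade-off at a given willingness-to-pay.

    Higher WTP down-weights cost. This convention is an assumption;
    the real configs may use e.g. quality - lambda * cost instead.
    """
    return quality - cost / wtp

# The two grids from the notes above.
HEADLINE_GRID = [0.1, 1.0]
SENSITIVITY_GRID = [1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0]

# Score one hypothetical decision (quality 0.8, cost 0.02) at every slice.
scores = {w: utility(quality=0.8, cost=0.02, wtp=w) for w in SENSITIVITY_GRID}
```

The same decision can dominate at high WTP and lose at low WTP, which is why headline tables pin the slices and frontier plots sweep the wider grid.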

Fairness Principles

  • Equal prompts: baselines and Compitum operate over the same prompt sets for each evaluation unit.

  • Identical budgets and accounting: WTP (lambda) and token/cost accounting are identical across all systems for the unit being compared.

  • Paired comparisons: “best baseline” is computed per evaluation unit at fixed WTP; regret and deltas are paired before aggregation.

  • Seeds and grids: seeds are fixed; any limited hyperparameter sweep is preregistered, small, and frozen before final evaluation on held-out splits (see PEER_REVIEW).

  • Oracle exclusion: oracle rows are excluded from baseline comparisons, since an oracle selects with ground-truth knowledge no deployable router has.

Reproducibility and Transparency

  • Offline runs: evaluation uses locally cached, licensed inputs; no network fetching is performed by default scripts.

  • Artifact integrity: we generate a manifest with SHA-256 checksums for key outputs.
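
A minimal sketch of such a manifest generator, using only the standard library; the function names and JSON layout are assumptions, not the repo's actual manifest format.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def write_manifest(outputs, manifest_path: Path) -> dict:
    """Map each output path to its checksum and persist as sorted JSON."""
    manifest = {str(p): sha256_of(Path(p)) for p in outputs}
    manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Sorting the keys keeps the manifest byte-stable across runs, so the manifest file itself can be diffed or checksummed.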

  • Evidence breakdown: in addition to panel averages, we provide per-baseline win rates, frontier gaps (with 95% bootstrap CIs), and per-task summaries.
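
For the 95% bootstrap CIs, a percentile-bootstrap sketch over per-unit frontier gaps is shown below. Treating the evaluation unit as the resampling unit, the replicate count, and the fixed seed value are assumptions for illustration.

```python
import random
from statistics import mean

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `values`.

    Resamples whole evaluation units with replacement; the seed is
    fixed, consistent with the seed policy in these notes.
    """
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Because gaps are paired per unit before resampling, the CI reflects between-unit variation rather than mixing units and systems.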

Agreement and Diagnostics

  • Per-baseline agreement tooling: see tools/per_baseline_agreement.py for measuring agreement rates across models/routers.
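
To illustrate the quantity that tool measures, here is a minimal sketch of a pairwise agreement rate between two routers' model choices; tools/per_baseline_agreement.py is the real implementation, and this sketch does not reflect its interface.

```python
def agreement_rate(choices_a, choices_b) -> float:
    """Fraction of evaluation units where two routers pick the same model."""
    if len(choices_a) != len(choices_b):
        raise ValueError("choice sequences must be aligned per evaluation unit")
    matches = sum(a == b for a, b in zip(choices_a, choices_b))
    return matches / len(choices_a)
```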

  • Column inspection: see tools/inspect_eval_cols.py to verify expected columns and accounting fields in the evaluation CSVs.
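
The kind of check that tool performs can be sketched as below. The expected column names here are hypothetical placeholders, not the real evaluation-CSV schema; see the tool itself for the authoritative field list.

```python
import csv
import io

# Hypothetical accounting fields; the real schema lives in
# tools/inspect_eval_cols.py.
EXPECTED = {"unit_id", "system", "quality", "cost", "wtp"}

def missing_columns(csv_text: str) -> set:
    """Return expected columns absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = set(next(reader))
    return EXPECTED - header
```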

Versioning and Compatibility

  • RouterBench integration is kept in a separate folder and not modified by default CI or linters.

  • We welcome PRs that update or extend fairness checks, panel definitions, or cost accounting notes. Please include the YAML changes and a short justification of their impact on comparability.