RouterBench Fairness Notes
Purpose
Document how we use RouterBench fairly: equal prompts, budgets, cost accounting, and seeds across baselines and Compitum; transparent panel composition and configs.
Panel and Configs
Panel Summary: see the Panel Summary for the tasks, models, WTP slices, and evaluation-unit counts detected from the latest runs.
Config files (examples):
data/rb_clean/evaluate_routers.yaml
data/rb_clean/evaluate_routers_multitask.yaml
WTP grid: fixed slices used for headline tables are {0.1, 1.0}. Sensitivity grid for frontier plots includes {1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0}.
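As a rough illustration, the two WTP grids above can be written down as constants with a small sanity check. The names `HEADLINE_WTP`, `SENSITIVITY_WTP`, and `check_wtp_grids` are hypothetical and do not come from the repository. The requirement that headline slices lie on the sensitivity grid is an assumption that the stated values happen to satisfy.

```python
# Fixed WTP slices used for headline tables (values taken from the notes).
HEADLINE_WTP = (0.1, 1.0)
# Sensitivity grid used for frontier plots (values taken from the notes).
SENSITIVITY_WTP = (1e-4, 1e-3, 1e-2, 0.1, 1.0, 10.0)


def check_wtp_grids(headline=HEADLINE_WTP, sensitivity=SENSITIVITY_WTP):
    """Return True if the grids look consistent, else raise ValueError.

    Assumed checks (not a documented project rule): all values positive,
    the sensitivity grid strictly increasing, and every headline slice
    present on the sensitivity grid.
    """
    if any(w <= 0 for w in sensitivity):
        raise ValueError("WTP values must be positive")
    if list(sensitivity) != sorted(set(sensitivity)):
        raise ValueError("sensitivity grid must be strictly increasing")
    missing = [w for w in headline if w not in sensitivity]
    if missing:
        raise ValueError(f"headline slices not on sensitivity grid: {missing}")
    return True
```

Running such a check before an evaluation makes it harder for a headline table to be produced at a WTP value that the frontier plots never cover.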
Fairness Principles
Equal prompts: baselines and Compitum operate over the same prompt sets for each evaluation unit.
Identical budgets and accounting: WTP (lambda) and token/cost accounting are identical across all systems for the unit being compared.
Paired comparisons: “best baseline” is computed per evaluation unit at fixed WTP; regret and deltas are paired before aggregation.
Seeds and grids: seeds are fixed; any limited hyperparameter sweep is preregistered, small, and frozen before final evaluation on held-out splits (see PEER_REVIEW).
Oracle exclusion: oracle rows are excluded from baseline comparisons.
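The pairing and oracle-exclusion rules above can be sketched as follows. This is a minimal illustration, not the project's evaluation code. It assumes each row already carries a scalar `utility` computed under the shared WTP and cost accounting. The field names (`unit`, `router`, `wtp`, `utility`, `is_oracle`), the router labels, and the function `paired_regret` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def paired_regret(rows, target="compitum", wtp=1.0):
    """Per-unit regret of `target` against the best baseline at one WTP.

    regret = best_baseline_utility - target_utility, so a positive value
    means the target is worse than the best baseline on that unit.
    Oracle rows are dropped before the best baseline is chosen.
    """
    by_unit = defaultdict(dict)
    for r in rows:
        # Exact float match is fine for grid values read from one config.
        if r["wtp"] != wtp or r.get("is_oracle", False):
            continue
        by_unit[r["unit"]][r["router"]] = r["utility"]

    regrets = {}
    for unit, utils in by_unit.items():
        baselines = {k: v for k, v in utils.items() if k != target}
        if target not in utils or not baselines:
            continue  # no valid pair on this unit
        regrets[unit] = max(baselines.values()) - utils[target]
    return regrets


# Made-up rows for illustration only; these are not measured results.
rows = [
    {"unit": "u1", "router": "compitum", "wtp": 1.0, "utility": 0.70},
    {"unit": "u1", "router": "knn", "wtp": 1.0, "utility": 0.65},
    {"unit": "u1", "router": "mlp", "wtp": 1.0, "utility": 0.72},
    {"unit": "u1", "router": "oracle", "wtp": 1.0, "utility": 0.90, "is_oracle": True},
    {"unit": "u2", "router": "compitum", "wtp": 1.0, "utility": 0.80},
    {"unit": "u2", "router": "knn", "wtp": 1.0, "utility": 0.75},
]
regrets = paired_regret(rows)
mean_regret = mean(regrets.values())  # aggregate only after pairing
```

The best baseline is chosen separately for each unit, so the aggregate compares Compitum with the strongest competitor on every unit rather than with the single baseline that has the best average.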
Reproducibility and Transparency
Offline runs: evaluation uses locally cached, licensed inputs; no network fetching is performed by default scripts.
Artifact integrity: we generate a manifest with SHA-256 checksums for key outputs.
Evidence breakdown: in addition to panel averages, we provide per-baseline win rates, frontier gaps (with 95% bootstrap CIs), and per-task summaries.
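For the artifact-integrity point, a manifest of SHA-256 checksums could be produced along these lines. This sketch uses only the Python standard library and is not the project's manifest script. The function names, the file patterns, and the JSON layout are assumptions.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root, patterns=("*.csv", "*.json")):
    """Map each matching file (path relative to root) to its SHA-256 digest."""
    root = Path(root)
    entries = {}
    for pattern in patterns:  # hypothetical patterns for "key outputs"
        for p in root.rglob(pattern):
            if p.is_file():
                entries[p.relative_to(root).as_posix()] = sha256_file(p)
    return dict(sorted(entries.items()))  # stable ordering for diffs


def write_manifest(root, out_path):
    """Write the manifest as sorted, indented JSON and return it."""
    manifest = build_manifest(root)
    Path(out_path).write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return manifest
```

Writing the manifest outside the scanned directory avoids hashing the manifest file itself on a later run.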
Agreement and Diagnostics
Per-baseline agreement tooling: see tools/per_baseline_agreement.py for measuring agreement rates across models/routers.
Column inspection: see tools/inspect_eval_cols.py to verify expected columns and accounting fields in the evaluation CSVs.
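A column check of this kind might look like the sketch below. It does not reproduce tools/inspect_eval_cols.py, whose interface is not described here. The column names in `EXPECTED_COLUMNS` and the function `missing_columns` are hypothetical placeholders.

```python
import csv

# Hypothetical column names; substitute the fields the evaluation CSVs use.
EXPECTED_COLUMNS = {"unit", "router", "wtp", "score", "cost"}


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the sorted list of expected columns absent from the CSV header."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        header = next(csv.reader(f), [])  # empty file -> empty header
    present = {name.strip() for name in header}
    return sorted(set(expected) - present)
```

An empty result means every expected field is present. A non-empty result lists the accounting columns to investigate before any comparison is run.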
Versioning and Compatibility
RouterBench integration is kept in a separate folder and not modified by default CI or linters.
We welcome PRs to update or extend fairness checks, panel definitions, or cost accounting notes. Please include YAML changes and a short justification of the impact on comparability.