Results Summary¶
Canonical table with fixed budgets (WTP), bootstrap 95% confidence intervals, and seeds.
Task |
Budget (WTP) |
Best Baseline Utility |
Compitum Utility |
Delta |
95% CI (Delta) |
Seeds |
|---|---|---|---|---|---|---|
grade-school-math |
0.1 |
- |
- |
- |
[-, -] |
n= |
hellaSWAG |
0.1 |
- |
- |
- |
[-, -] |
n= |
MBPP |
0.1 |
- |
- |
- |
[-, -] |
n= |
Panel Avg |
0.1 |
- |
- |
- |
[-, -] |
n= |
grade-school-math |
1.0 |
- |
- |
- |
[-, -] |
n= |
hellaSWAG |
1.0 |
- |
- |
- |
[-, -] |
n= |
MBPP |
1.0 |
- |
- |
- |
[-, -] |
n= |
Panel Avg |
1.0 |
- |
- |
- |
[-, -] |
n= |
Notes
Baselines: KNN/MLP/cascade and RouterBench common routers. “Best Baseline” is per-evaluation unit at fixed WTP.
CIs: nonparametric bootstrap (1,000 resamples) over evaluation units. Seeds fixed; Hypothesis derandomized.
Reproduce via scripts/run_peer_review.bat (Windows) or Make target
peer-review(below).
Reproduce¶
Windows (one-shot):
scripts\run_peer_review.bat
Make (Windows):
make peer-review