---
title: Results Summary
---

# Results Summary

Canonical table with fixed budgets (WTP), bootstrap 95% confidence intervals, and seeds.

| Task              | Budget (WTP) | Best Baseline Utility | Compitum Utility | Delta | 95% CI (Delta) | Seeds |
|---                |---:          |---:                   |---:              |---:   |---             |---:   |
| grade-school-math | 0.1          | -                     | -                | -     | [-, -]         | n=    |
| hellaSWAG         | 0.1          | -                     | -                | -     | [-, -]         | n=    |
| MBPP              | 0.1          | -                     | -                | -     | [-, -]         | n=    |
| Panel Avg         | 0.1          | -                     | -                | -     | [-, -]         | n=    |
| grade-school-math | 1.0          | -                     | -                | -     | [-, -]         | n=    |
| hellaSWAG         | 1.0          | -                     | -                | -     | [-, -]         | n=    |
| MBPP              | 1.0          | -                     | -                | -     | [-, -]         | n=    |
| Panel Avg         | 1.0          | -                     | -                | -     | [-, -]         | n=    |

Notes
- Baselines: KNN/MLP/cascade and RouterBench common routers. "Best Baseline" is per-evaluation unit at fixed WTP.
- CIs: nonparametric bootstrap (1,000 resamples) over evaluation units. Seeds fixed; Hypothesis derandomized.
- Reproduce via scripts/run_peer_review.bat (Windows) or Make target `peer-review` (below).

## Reproduce

Windows (one-shot):

```bat
scripts\run_peer_review.bat
```

Make (Windows):

```bash
make peer-review
```