Results Summary

Canonical table with fixed budgets (WTP), bootstrap 95% confidence intervals, and seeds.

Task

Budget (WTP)

Best Baseline Utility

Compitum Utility

Delta

95% CI (Delta)

Seeds

grade-school-math

0.1

-

-

-

[-, -]

n=

hellaSWAG

0.1

-

-

-

[-, -]

n=

MBPP

0.1

-

-

-

[-, -]

n=

Panel Avg

0.1

-

-

-

[-, -]

n=

grade-school-math

1.0

-

-

-

[-, -]

n=

hellaSWAG

1.0

-

-

-

[-, -]

n=

MBPP

1.0

-

-

-

[-, -]

n=

Panel Avg

1.0

-

-

-

[-, -]

n=

Notes

  • Baselines: KNN/MLP/cascade and RouterBench common routers. “Best Baseline” is per-evaluation unit at fixed WTP.

  • CIs: nonparametric bootstrap (1,000 resamples) over evaluation units. Seeds fixed; Hypothesis derandomized.

  • Reproduce via scripts/run_peer_review.bat (Windows) or Make target peer-review (below).

Reproduce

Windows (one-shot):

scripts\run_peer_review.bat

Make (Windows):

make peer-review