Generated 2025-10-30 04:05 UTC
This report summarizes Compitum's test and benchmark results alongside common baselines.
Charts: bars show Compitum (blue/green) and the baselines (gray). The scatter plot shows average cost versus average performance for all models.
WTP policy: best-of-grid over [0.0001, 0.001, 0.01, 0.1, 1.0]; regret is reported at the best WTP.
Coverage: win-rate denominator = 86 evals; Compitum evals at this WTP = 86.
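The best-of-grid policy can be sketched as follows. This is a minimal illustration, not Compitum's implementation: it assumes each eval provides a (performance, cost) pair per model, that utility is performance minus WTP × cost, that regret is the gap between the best achievable utility and the routed model's utility, and that the "best" WTP is the grid value with the lowest mean regret. The names `utility`, `eval_regret`, and `best_of_grid` are hypothetical.

```python
from statistics import mean

# WTP grid copied from the report; everything else is a hypothetical sketch.
WTP_GRID = [0.0001, 0.001, 0.01, 0.1, 1.0]


def utility(perf, cost, wtp):
    # Assumed scalarization: performance minus WTP-weighted cost.
    return perf - wtp * cost


def eval_regret(outcomes, routed, wtp):
    # outcomes: {model_name: (performance, cost)} for a single eval.
    # Regret = best achievable utility on this eval minus the routed model's utility.
    best = max(utility(p, c, wtp) for p, c in outcomes.values())
    return best - utility(*outcomes[routed], wtp)


def best_of_grid(evals, routed_by_wtp, grid=WTP_GRID):
    # evals: list of outcome dicts, one per eval.
    # routed_by_wtp: {wtp: [model picked by the router on each eval at that WTP]}.
    mean_regret = {
        wtp: mean(eval_regret(o, m, wtp) for o, m in zip(evals, routed_by_wtp[wtp]))
        for wtp in grid
    }
    # "Best" WTP = grid value with the lowest mean regret (assumption).
    best_wtp = min(mean_regret, key=mean_regret.get)
    return best_wtp, mean_regret[best_wtp]


if __name__ == "__main__":
    toy_evals = [
        {"big": (0.9, 2.0), "small": (0.6, 0.1)},
        {"big": (0.8, 2.0), "small": (0.8, 0.1)},
    ]
    # Toy router output: always "big", regardless of WTP.
    toy_routes = {w: ["big", "big"] for w in WTP_GRID}
    print(best_of_grid(toy_evals, toy_routes))
```

With a router that always picks the expensive model, mean regret grows with WTP, so the smallest grid value wins; a router that actually reacts to WTP would shift that choice.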
Benchmark: 6 tests (times in ns; values in parentheses are relative to the best result in each column)

| Test | Min | Max | Mean | StdDev | Median | IQR | Outliers | OPS (Kops/s) | Rounds | Iterations |
|---|---|---|---|---|---|---|---|---|---|---|
| test_iso_utility_savings_vs_fixed_best | 40.5170 (1.0) | 4,212.9310 (1.0) | 43.5317 (1.0) | 19.9865 (1.0) | 42.2415 (1.0) | 0.8621 (1.0) | 933;17557 | 22,971.7826 (1.0) | 192308 | 116 |
| test_energy_drift | 1,899.9854 (46.89) | 190,899.9984 (45.31) | 2,121.8334 (48.74) | 1,046.0108 (52.34) | 2,099.9869 (49.71) | 100.0008 (116.00) | 532;4360 | 471.2905 (0.02) | 114943 | 1 |
| test_spd_det_and_trust_radius_bounds | 2,799.9922 (69.11) | 1,770,700.0079 (420.30) | 3,124.5824 (71.78) | 6,779.2280 (339.19) | 2,999.9937 (71.02) | 100.0008 (116.00) | 148;2668 | 320.0428 (0.01) | 78125 | 1 |
| test_constraint_violation_rate | 2,899.9930 (71.57) | 995,899.9872 (236.39) | 3,223.0994 (74.04) | 5,168.7014 (258.61) | 3,099.9945 (73.39) | 100.0008 (116.00) | 132;2203 | 310.2604 (0.01) | 46512 | 1 |
| test_router_throughput_and_latency | 19,900.0060 (491.15) | 409,600.0048 (97.22) | 21,331.2078 (490.02) | 6,074.6730 (303.94) | 20,700.0121 (490.04) | 699.9762 (811.97) | 385;1883 | 46.8797 (0.00) | 30675 | 1 |
| test_mean_regret_and_pareto | 74,999.9890 (>1000.0) | 1,412,000.0124 (335.16) | 82,819.6565 (>1000.0) | 28,107.3446 (>1000.0) | 78,700.0172 (>1000.0) | 2,600.0198 (>1000.0) | 112;439 | 12.0744 (0.00) | 3612 | 1 |

Legend: Outliers are reported as `A;B`, where A counts rounds more than 1 standard deviation from the mean and B counts rounds more than 1.5 IQR (interquartile range) below the 1st quartile or above the 3rd quartile. OPS: operations per second, computed as 1 / Mean.

142 passed in 67.54s (0:01:07)
Exit code: 0
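The OPS column and the `Outliers` pair follow directly from the legend. The sketch below recomputes OPS from a mean in nanoseconds and counts both outlier kinds from raw per-round timings. The helper names are hypothetical, and `statistics.quantiles` may place quartiles slightly differently from pytest-benchmark, so the B count can differ at the margins.

```python
import statistics


def kops_per_sec(mean_ns):
    # OPS = 1 / Mean. Mean is in nanoseconds; the report lists OPS in thousands per second.
    return (1e9 / mean_ns) / 1e3


def outlier_counts(samples):
    # Returns "A;B" in the layout of the Outliers column:
    #   A = samples more than 1 standard deviation from the mean
    #   B = samples beyond 1.5 * IQR below Q1 or above Q3 (Tukey fences)
    mu = statistics.mean(samples)
    sd = statistics.stdev(samples)
    q1, _, q3 = statistics.quantiles(samples, n=4)
    iqr = q3 - q1
    a = sum(1 for x in samples if abs(x - mu) > sd)
    b = sum(1 for x in samples if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr)
    return f"{a};{b}"


if __name__ == "__main__":
    # Mean of test_iso_utility_savings_vs_fixed_best, copied from the table.
    print(round(kops_per_sec(43.5317), 2))
    # Toy timings: nine fast rounds and one slow one.
    print(outlier_counts([10] * 9 + [100]))
```

For the first row, `kops_per_sec(43.5317)` gives about 22,971.77, close to the reported 22,971.7826; the small gap comes from the mean being rounded to four decimals.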
| Step | Return Code | Timed Out | Duration (s) | Timeout Cap (s) |
|---|---|---|---|---|
| Unit Tests | 0 | NO | 69.879 | 900 |
| RouterBench | 0 | NO | 902.056 | 1200 |
| Compitum | 0 | NO | 159.36 | 900 |
Full machine-readable log stored alongside this report as JSON.
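A step runner that produces rows like the step table, plus a JSON log, could look like this. It is a sketch assuming each step is an external command with its own timeout cap; the step commands, the `run_log.json` filename, and the helpers `run_step` and `to_markdown` are hypothetical, since the report does not show how the pipeline launches its steps.

```python
import json
import subprocess
import sys
import time


def run_step(name, cmd, timeout_cap_s):
    # Runs one step; subprocess.run kills the child and raises TimeoutExpired past the cap.
    start = time.perf_counter()
    try:
        return_code = subprocess.run(cmd, timeout=timeout_cap_s).returncode
        timed_out = False
    except subprocess.TimeoutExpired:
        return_code, timed_out = None, True
    return {
        "step": name,
        "return_code": return_code,
        "timed_out": timed_out,
        "duration_s": round(time.perf_counter() - start, 3),
        "timeout_cap_s": timeout_cap_s,
    }


def to_markdown(rows):
    # Renders rows in the same column layout as the step table.
    lines = [
        "| Step | Return Code | Timed Out | Duration (s) | Timeout Cap (s) |",
        "|---|---|---|---|---|",
    ]
    for r in rows:
        flag = "YES" if r["timed_out"] else "NO"
        lines.append(
            f"| {r['step']} | {r['return_code']} | {flag} | {r['duration_s']} | {r['timeout_cap_s']} |"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical placeholder command standing in for a real pipeline step.
    rows = [run_step("Demo", [sys.executable, "-c", "print('ok')"], 900)]
    with open("run_log.json", "w", encoding="utf-8") as fh:
        json.dump(rows, fh, indent=2)  # machine-readable log (hypothetical filename)
    print(to_markdown(rows))
```

Recording `None` as the return code on a timeout keeps the "Timed Out" flag and the exit status from being confused with a real non-zero exit.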
Source files:

- `C:\Users\paulc\projects\compitum\data\rb_clean\eval_results\eval_results__10-26-17__rb_clean.csv`
- `C:\Users\paulc\projects\compitum\data\rb_clean\eval_results\eval_results-eval-all-10-26-17-07-val_split.csv`

| Model | Avg Performance | Avg Total Cost |
|---|---|---|
| compitum | 0.7091 | 3.371482 |
| claude-instant-v1 | 0.3923 | 0.257096 |
| claude-v1 | 0.4612 | 2.491849 |
| claude-v2 | 0.5111 | 2.611772 |
| gpt-3.5-turbo-1106 | 0.5842 | 0.300727 |
| gpt-4-1106-preview | 0.7091 | 3.371482 |
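The cost-versus-performance scatter is easiest to read against a Pareto frontier (higher performance, lower cost). The sketch below computes one from the averages in this table; `dominates` and `pareto_front` are hypothetical helpers implementing a generic dominance check, not necessarily what `test_mean_regret_and_pareto` does.

```python
# (avg performance, avg total cost), copied from the table above.
MODELS = {
    "compitum": (0.7091, 3.371482),
    "claude-instant-v1": (0.3923, 0.257096),
    "claude-v1": (0.4612, 2.491849),
    "claude-v2": (0.5111, 2.611772),
    "gpt-3.5-turbo-1106": (0.5842, 0.300727),
    "gpt-4-1106-preview": (0.7091, 3.371482),
}


def dominates(a, b):
    # a dominates b if it is no worse on both axes and strictly better on at least one.
    (perf_a, cost_a), (perf_b, cost_b) = a, b
    no_worse = perf_a >= perf_b and cost_a <= cost_b
    strictly_better = perf_a > perf_b or cost_a < cost_b
    return no_worse and strictly_better


def pareto_front(points):
    # Names of models that no other model dominates, sorted alphabetically.
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


if __name__ == "__main__":
    print(pareto_front(MODELS))
```

On these averages, claude-v1 and claude-v2 are dominated by gpt-3.5-turbo-1106 (higher performance at lower cost), and the compitum row is identical to gpt-4-1106-preview, so neither of those two dominates the other and both stay on the frontier.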
| Mean Regret | P95 Regret | Win Rate | Avg Cost Delta on Wins |
|---|---|---|---|
| 0.004249 | 0.025114 | 86.0% | 0.000000 |
Regret computed at the best WTP (0.0001) selected by the best-of-grid policy.
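The four summary columns can be computed from per-eval records. This sketch rests on assumptions rather than on Compitum's code: each record carries the regret at the chosen WTP, Compitum's cost, and the cost of the best-utility baseline; a "win" is an eval whose regret is zero within a tolerance; P95 uses the nearest-rank definition; and the win-rate denominator is every eval (86 in this report). The record keys and the functions `p95_nearest_rank` and `summarize` are hypothetical.

```python
import math
from statistics import mean


def p95_nearest_rank(values):
    # Nearest-rank percentile: smallest value with at least 95% of the data at or below it.
    ordered = sorted(values)
    return ordered[math.ceil(0.95 * len(ordered)) - 1]


def summarize(records, tol=1e-9):
    # records: list of dicts with hypothetical keys
    #   "regret": utility gap between the best option and Compitum's pick at the chosen WTP
    #   "cost": Compitum's cost on the eval
    #   "best_baseline_cost": cost of the best-utility baseline on the eval
    regrets = [r["regret"] for r in records]
    wins = [r for r in records if r["regret"] <= tol]  # assumed definition of a win
    return {
        "mean_regret": mean(regrets),
        "p95_regret": p95_nearest_rank(regrets),
        "win_rate": len(wins) / len(records),  # denominator = all evals
        "avg_cost_delta_on_wins": (
            mean(r["cost"] - r["best_baseline_cost"] for r in wins) if wins else 0.0
        ),
    }


if __name__ == "__main__":
    toy = [{"regret": 0.0, "cost": 1.0, "best_baseline_cost": 1.0}] * 3 + [
        {"regret": 0.2, "cost": 2.0, "best_baseline_cost": 0.5}
    ]
    print(summarize(toy))
```

Under this definition, an average cost delta of 0.000000 on wins would mean Compitum paid exactly the best baseline's cost whenever it matched the best utility.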