Compitum Report

Overview

This report summarizes Compitum's test and benchmark results alongside common baselines.

Performance: average task accuracy (higher is better).
Total cost: average token and compute cost in USD (lower is better).
Utility: performance - WTP * cost, where WTP (willingness to pay) scales the cost penalty.
Mean regret: average gap to the best baseline utility at a selected WTP (lower is better).
Win rate: fraction of evaluations where Compitum's utility >= best baseline utility at the selected WTP.
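
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not Compitum's implementation: the function names, the data layout (one (performance, cost) pair per evaluation), and the toy numbers are hypothetical, and clipping regret at zero is an assumption the report does not state.

```python
from statistics import mean

def utility(performance: float, cost: float, wtp: float) -> float:
    """Utility as defined above: performance - WTP * cost."""
    return performance - wtp * cost

def regret_and_win_rate(router, baselines, wtp):
    """router: list of (performance, cost), one pair per evaluation.
    baselines: dict mapping a model name to a list of the same length."""
    regrets, wins = [], 0
    for i, (perf, cost) in enumerate(router):
        u_router = utility(perf, cost, wtp)
        # Best baseline utility on this evaluation at the chosen WTP.
        u_best = max(utility(*rows[i], wtp) for rows in baselines.values())
        # Assumption: a router that beats every baseline has zero regret.
        regrets.append(max(u_best - u_router, 0.0))
        # A win is utility >= best baseline utility (ties count as wins).
        wins += u_router >= u_best
    return mean(regrets), wins / len(router)

# Hypothetical toy data: three evaluations, two baselines.
router = [(1.0, 2.0), (0.0, 0.5), (1.0, 0.3)]
baselines = {
    "model_a": [(1.0, 2.0), (1.0, 3.0), (0.0, 0.2)],
    "model_b": [(0.0, 0.1), (0.0, 0.1), (1.0, 0.3)],
}
mean_regret, win_rate = regret_and_win_rate(router, baselines, wtp=0.1)
print(f"mean regret = {mean_regret:.4f}, win rate = {win_rate:.1%}")
```

In the toy data the router matches the best baseline on two of three evaluations and loses the third, so ties count toward the win rate while only the loss contributes regret.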

Charts: bars include Compitum (blue/green) and baselines (gray). The scatter shows average cost vs performance for all models.

Topline Takeaways

Win rate (share of evaluations where Compitum's utility >= best baseline): 86.0%.
Mean regret (utility gap to best baseline at the selected, best-of-grid WTP on this evaluation suite): 0.004249. Lower is better.
Avg cost delta on wins (Compitum cost - best baseline cost, averaged over wins): 0.00 USD (parity).
Average performance: Compitum 0.7091 vs gpt-4-1106-preview 0.7091 (highest-performing baseline).
Average cost: Compitum 3.371482 USD vs claude-instant-v1 0.257096 USD (cheapest baseline).

Unit Tests

........................................................................ [ 50%]
......................................................................   [100%]

------------------------------------------------------------------------------------------------------------ benchmark: 6 tests ------------------------------------------------------------------------------------------------------------
Name (time in ns)                                  Min                       Max                   Mean                 StdDev                 Median                   IQR             Outliers  OPS (Kops/s)            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_iso_utility_savings_vs_fixed_best         40.5170 (1.0)          4,212.9310 (1.0)          43.5317 (1.0)          19.9865 (1.0)          42.2415 (1.0)          0.8621 (1.0)      933;17557   22,971.7826 (1.0)      192308         116
test_energy_drift                           1,899.9854 (46.89)      190,899.9984 (45.31)     2,121.8334 (48.74)     1,046.0108 (52.34)     2,099.9869 (49.71)      100.0008 (116.00)    532;4360      471.2905 (0.02)     114943           1
test_spd_det_and_trust_radius_bounds        2,799.9922 (69.11)    1,770,700.0079 (420.30)    3,124.5824 (71.78)     6,779.2280 (339.19)    2,999.9937 (71.02)      100.0008 (116.00)    148;2668      320.0428 (0.01)      78125           1
test_constraint_violation_rate              2,899.9930 (71.57)      995,899.9872 (236.39)    3,223.0994 (74.04)     5,168.7014 (258.61)    3,099.9945 (73.39)      100.0008 (116.00)    132;2203      310.2604 (0.01)      46512           1
test_router_throughput_and_latency         19,900.0060 (491.15)     409,600.0048 (97.22)    21,331.2078 (490.02)    6,074.6730 (303.94)   20,700.0121 (490.04)     699.9762 (811.97)    385;1883       46.8797 (0.00)      30675           1
test_mean_regret_and_pareto                74,999.9890 (>1000.0)  1,412,000.0124 (335.16)   82,819.6565 (>1000.0)  28,107.3446 (>1000.0)  78,700.0172 (>1000.0)  2,600.0198 (>1000.0)    112;439       12.0744 (0.00)       3612           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
142 passed in 67.54s (0:01:07)

Exit code: 0
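
The OPS column in the benchmark table follows from the legend's formula, 1 / Mean. Because Mean is reported in nanoseconds and OPS in Kops/s, a unit conversion is needed. The helper below is a hypothetical sketch of that arithmetic (the function name is not part of pytest-benchmark); small differences from the table come from the table's Mean being rounded for display.

```python
def kops_per_second(mean_ns: float) -> float:
    """Convert a mean time per call (ns) to thousands of operations per second."""
    ops_per_second = 1e9 / mean_ns  # 1 / Mean, with Mean converted from ns to s
    return ops_per_second / 1e3     # ops/s -> Kops/s

# Mean values copied from the benchmark table above.
fastest = kops_per_second(43.5317)      # test_iso_utility_savings_vs_fixed_best
slowest = kops_per_second(82819.6565)   # test_mean_regret_and_pareto
print(f"{fastest:,.2f} Kops/s, {slowest:.4f} Kops/s")
```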

Step Status

Step	Timed Out	Duration (s)	Timeout Cap (s)
Unit Tests	NO	69.879	900
RouterBench	NO	902.056	1200
Compitum	NO	159.36	900

Full machine-readable log stored alongside this report as JSON.

RouterBench Artifacts

C:\Users\paulc\projects\compitum\data\rb_clean\eval_results\eval_results__10-26-17__rb_clean.csv

Compitum Artifacts

C:\Users\paulc\projects\compitum\data\rb_clean\eval_results\eval_results-eval-all-10-26-17-07-val_split.csv

Numerical Summary

Model	Avg Performance	Avg Total Cost (USD)
compitum	0.7091	3.371482
claude-instant-v1	0.3923	0.257096
claude-v1	0.4612	2.491849
claude-v2	0.5111	2.611772
gpt-3.5-turbo-1106	0.5842	0.300727
gpt-4-1106-preview	0.7091	3.371482
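
Because utility is linear in performance and cost, the average utility of a model equals the utility of its averages, so the table above is enough to compute each model's average utility at any WTP. The sketch below does that; the function names are hypothetical. It ranks models by average utility only, which is not the per-evaluation best-baseline comparison used for regret and win rate.

```python
# (avg performance, avg total cost in USD), copied from the table above.
summary = {
    "compitum": (0.7091, 3.371482),
    "claude-instant-v1": (0.3923, 0.257096),
    "claude-v1": (0.4612, 2.491849),
    "claude-v2": (0.5111, 2.611772),
    "gpt-3.5-turbo-1106": (0.5842, 0.300727),
    "gpt-4-1106-preview": (0.7091, 3.371482),
}

def utilities(wtp: float) -> dict:
    """Average utility per model: performance - WTP * cost."""
    return {name: perf - wtp * cost for name, (perf, cost) in summary.items()}

def best_model(wtp: float) -> str:
    """Model with the highest average utility (first listed wins a tie)."""
    u = utilities(wtp)
    return max(u, key=u.get)

for wtp in [0.0001, 0.001, 0.01, 0.1, 1.0]:
    print(f"WTP={wtp}: best average utility -> {best_model(wtp)}")
```

With these averages, compitum and gpt-4-1106-preview tie for the highest average utility at WTP values of 0.01 and below, while gpt-3.5-turbo-1106 ranks first at 0.1 and 1.0, where the cost penalty dominates.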

Regret & Wins (WTP-selected)

Mean Regret	P95 Regret	Win Rate	Avg Cost Delta on Wins (USD)
0.004249	0.025114	86.0%	0.000000

Regret computed at best WTP = 0.0001. WTP policy: best-of-grid over [0.0001, 0.001, 0.01, 0.1, 1.0].
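
One way to read the best-of-grid policy is sketched below: compute mean regret at every WTP on the grid and report the WTP with the smallest value. This reading is an assumption, since the report does not define how "best" is chosen. The function names and toy data are hypothetical; only the grid values come from the report.

```python
from statistics import mean

WTP_GRID = [0.0001, 0.001, 0.01, 0.1, 1.0]  # grid listed in the report

def mean_regret(router, baselines, wtp):
    """router: list of (performance, cost) per evaluation.
    baselines: list of such lists, one per baseline model."""
    gaps = []
    for i, (perf, cost) in enumerate(router):
        u_router = perf - wtp * cost
        u_best = max(p - wtp * c for p, c in (rows[i] for rows in baselines))
        gaps.append(max(u_best - u_router, 0.0))  # assumption: clipped at zero
    return mean(gaps)

def best_of_grid(router, baselines, grid=WTP_GRID):
    """Assumed policy: return the grid WTP that minimizes mean regret."""
    scores = {wtp: mean_regret(router, baselines, wtp) for wtp in grid}
    best = min(scores, key=scores.get)
    return best, scores[best]

# Hypothetical toy data: two evaluations, two baselines.
router = [(1.0, 3.0), (1.0, 0.3)]
baselines = [
    [(1.0, 3.0), (1.0, 3.0)],
    [(1.0, 0.5), (1.0, 0.5)],
]
best_wtp, best_regret = best_of_grid(router, baselines)
print(f"best WTP = {best_wtp}, mean regret = {best_regret:.6f}")
```

In this toy case the router's only loss is an overspend, so regret scales with WTP and the smallest grid value is selected. A best-of-grid choice made on the same data it is scored on is optimistic by construction, which is worth keeping in mind when comparing regret figures across WTP policies.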

Glossary

Performance: Average task accuracy across the evaluation set (higher is better).
Total cost: Average token and compute cost in USD (lower is better).
Willingness to Pay (WTP): How much performance is worth relative to cost; utility = performance - WTP * cost.
Utility: Single-number trade-off of performance and cost at a chosen WTP.
Mean regret: Average utility gap to the best baseline at the chosen WTP (lower is better).
Win rate: Share of evaluations where Compitum's utility >= best baseline utility.
