Reproducibility

Baseline

  • Benchmark tool: RouterBench, vendored at src/routerbench and pinned to upstream commit cc67d10 (origin/main).

  • We do not modify the RouterBench source for final runs; all behavioral changes are applied via an external wrapper.

How We Run RouterBench (Upstream)

  • Wrapper: tools/run_routerbench_clean.py (invoked by scripts/run_routerbench_clean.bat).

    • Suppresses tokencost stdout warnings.

    • Standardizes token counting via tiktoken, using encoding_for_model with a cl100k_base fallback (see Tokenization Policy below).

    • On Windows, sanitizes unsafe filename characters when saving per-eval CSVs (see the sketch after this list).

    • Tokenizer backend can be selected with --tokenizer-backend=tiktoken|tokencost|hf (default: tiktoken).
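
  • The Windows filename fix boils down to replacing characters that NTFS rejects before writing per-eval CSVs. A minimal sketch of the idea; the helper name and replacement character are illustrative, not the wrapper's actual code:

      import re

      # Characters Windows/NTFS forbids in filenames: < > : " / \ | ? *
      _WINDOWS_UNSAFE = re.compile(r'[<>:"/\\|?*]')

      def safe_eval_filename(name: str) -> str:
          """Replace characters Windows rejects so per-eval CSV saves succeed."""
          return _WINDOWS_UNSAFE.sub("_", name)

      # e.g. a model id like "openai/gpt-4" becomes "openai_gpt-4.csv"
      csv_path = safe_eval_filename("openai/gpt-4") + ".csv"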

Fast Config

  • Config: data/rb_clean/evaluate_routers.yaml

  • Run: scripts\run_routerbench_clean.bat --config=data/rb_clean/evaluate_routers.yaml --local

Compitum Evaluation

  • Our router lives in src/compitum; the adapter lives outside the RouterBench tree at tools/routerbench/routers/compitum_router.py (a sketch of the adapter pattern follows this list).

  • Driver: tools/evaluate_compitum.py (invoked by scripts/run_compitum_eval.bat).

  • Run: scripts\run_compitum_eval.bat --config=data/rb_clean/evaluate_routers.yaml
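
  • For orientation, the adapter is a thin wrapper: it imports our router from src/compitum and exposes it behind the interface RouterBench expects. The sketch below is illustrative only; the import path, class name, and method names are all assumptions, and the real interface is whatever upstream RouterBench defines and tools/routerbench/routers/compitum_router.py implements:

      # Illustrative adapter sketch; all names and signatures here are
      # assumptions, not the actual RouterBench or Compitum interfaces.
      from compitum import CompitumRouter  # hypothetical import path

      class CompitumRouterAdapter:
          """Bridge Compitum to a RouterBench-style route(prompt) -> model id call."""

          def __init__(self) -> None:
              self._router = CompitumRouter()  # hypothetical constructor

          def route(self, prompt: str) -> str:
              # Delegate model selection to Compitum; return the chosen model id.
              return self._router.select_model(prompt)  # hypothetical method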

Outputs

  • RouterBench upstream run: CSV/PKL under data/rb_clean/eval_results.

  • Compitum run: per-eval CSVs under data/rb_clean/eval_results (written by the Compitum eval script).

Tokenization Policy

  • Default: tiktoken, using encoding_for_model with a cl100k_base fallback (sketched after this list).

  • Alternatives: --tokenizer-backend=tokencost (use library defaults) or --tokenizer-backend=hf (experimental; falls back to tiktoken on errors).

  • Rationale: consistency of counts across routers is prioritized over absolute accuracy; OpenAI models receive the most accurate counts under tiktoken.
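
  • A minimal sketch of the default counting path (tiktoken's encoding_for_model raises KeyError for model names it does not know, at which point we fall back to cl100k_base):

      import tiktoken

      def count_tokens(text: str, model: str) -> int:
          """Count tokens with the model's own encoding, else cl100k_base."""
          try:
              enc = tiktoken.encoding_for_model(model)
          except KeyError:
              # Unknown or non-OpenAI model names: use one fixed encoding so
              # counts stay consistent across routers, if not exact per model.
              enc = tiktoken.get_encoding("cl100k_base")
          return len(enc.encode(text))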

Diff Against Upstream

  • A full diff of the prior forked changes is archived at src/routerbench/DIFF_WITH_UPSTREAM.patch for reference; final runs do not rely on those changes.

One-Command Report

  • End-to-end (tests + both benchmarks + HTML report): scripts\run_full_report.bat

  • The output report path (under reports/) is printed at the end of the run.

Individual Steps

  • Unit tests only: python tools/ci_orchestrator.py --tests

  • RouterBench only: python tools/ci_orchestrator.py --bench-routerbench --config=data/rb_clean/evaluate_routers.yaml

  • Compitum only: python tools/ci_orchestrator.py --bench-compitum --config=data/rb_clean/evaluate_routers.yaml

  • Build report from latest artifacts: python tools/ci_orchestrator.py --report-out reports/report.html

Quality Suite (Lint/Types/Sec/Tests/Mutation)

  • One command: scripts\run_quality.bat

  • Runs ruff (style), mypy (types), bandit (security), pytest with coverage (line + branch), and cosmic-ray (mutation); results are written to reports/quality_*.json (a summary sketch follows this list).
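
  • A quick way to summarize those artifacts without assuming their exact schema (keys vary per tool, so the sketch only lists each report's top-level fields):

      import json
      from pathlib import Path

      # Print each quality report's name and its top-level JSON keys.
      for report in sorted(Path("reports").glob("quality_*.json")):
          data = json.loads(report.read_text())
          summary = list(data)[:5] if isinstance(data, dict) else type(data).__name__
          print(f"{report.name}: {summary}")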

Licensing & Data Use

  • RouterBench inputs follow their upstream licenses; we do not redistribute proprietary datasets in this repository or distributions.

  • Evaluation runs operate offline on locally cached inputs; scripts do not auto-fetch datasets or call judge models.

  • Generated artifacts (JSON/CSV/HTML) are local; a SHA-256 manifest is available at reports/artifact_manifest.json (a verification sketch follows).
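
  • Verifying artifacts against the manifest takes a few lines; the sketch below assumes the manifest maps relative paths to hex SHA-256 digests (that layout is an assumption, so check reports/artifact_manifest.json for the actual schema):

      import hashlib
      import json
      from pathlib import Path

      def sha256_of(path: Path) -> str:
          """Stream a file through SHA-256 and return the hex digest."""
          h = hashlib.sha256()
          with path.open("rb") as f:
              for chunk in iter(lambda: f.read(1 << 20), b""):
                  h.update(chunk)
          return h.hexdigest()

      # Assumed layout: {"relative/path.csv": "<hex sha-256>", ...}
      manifest = json.loads(Path("reports/artifact_manifest.json").read_text())
      for rel_path, expected in manifest.items():
          status = "OK" if sha256_of(Path(rel_path)) == expected else "MISMATCH"
          print(f"{status}  {rel_path}")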