# Reproducibility Baseline

- Benchmark tool: RouterBench (vendored at `src/routerbench`), pinned to upstream commit `cc67d10` (origin/main).
- We do not modify the RouterBench source for final runs. All behavior changes are applied via an external wrapper.

## How We Run RouterBench (Upstream)

- Wrapper: `tools/run_routerbench_clean.py` (invoked by `scripts/run_routerbench_clean.bat`). The wrapper:
  - suppresses tokencost stdout warnings (sketched at the end of this document);
  - standardizes token counts via tiktoken with a `cl100k_base` fallback;
  - on Windows, fixes unsafe filename characters in the per-eval CSV save (sketched at the end of this document).
- The tokenizer backend can be selected with `--tokenizer-backend=tiktoken|tokencost|hf` (default: `tiktoken`).

## Fast Config

- Config: `data/rb_clean/evaluate_routers.yaml`
- Run: `scripts\run_routerbench_clean.bat --config=data/rb_clean/evaluate_routers.yaml --local`

## Compitum Evaluation

- Our router remains in `src/compitum`. The adapter lives outside RouterBench at `tools/routerbench/routers/compitum_router.py`.
- Driver: `tools/evaluate_compitum.py` (invoked by `scripts/run_compitum_eval.bat`).
- Run: `scripts\run_compitum_eval.bat --config=data/rb_clean/evaluate_routers.yaml`

## Outputs

- RouterBench upstream run: CSV/PKL under `data/rb_clean/eval_results`.
- Compitum run: CSV under `data/rb_clean/eval_results` (the Compitum eval script writes per-eval CSVs).

## Tokenization Policy

- Default: `tiktoken` with `encoding_for_model` and a `cl100k_base` fallback (sketched at the end of this document).
- Alternatives: `--tokenizer-backend=tokencost` (library defaults) or `--tokenizer-backend=hf` (experimental; falls back to tiktoken on errors).
- Rationale: consistency across routers is prioritized over absolute counts; OpenAI models receive the most accurate counts under `tiktoken`.

## Diff Against Upstream

- A full diff of the prior forked changes is kept at `src/routerbench/DIFF_WITH_UPSTREAM.patch` as an archival reference. Final runs do not rely on these changes.

## One-Command Report

- End-to-end (tests + both benchmarks + HTML report):
  - `scripts\run_full_report.bat`
- The output report path is printed at the end; reports are written under `reports/`.

## Individual Steps

- Unit tests only: `python tools/ci_orchestrator.py --tests`
- RouterBench only: `python tools/ci_orchestrator.py --bench-routerbench --config=data/rb_clean/evaluate_routers.yaml`
- Compitum only: `python tools/ci_orchestrator.py --bench-compitum --config=data/rb_clean/evaluate_routers.yaml`
- Build a report from the latest artifacts: `python tools/ci_orchestrator.py --report-out reports/report.html`

## Quality Suite (Lint/Types/Sec/Tests/Mutation)

- One command: `scripts\run_quality.bat`
- Runs ruff (style), mypy (types), bandit (security), pytest with coverage (line + branch), and cosmic-ray (mutation). Results land in `reports/quality_*.json`.

## Licensing & Data Use

- RouterBench inputs follow their upstream licenses; we do not redistribute proprietary datasets in this repository or in distributions.
- Evaluation runs operate offline on locally cached inputs; the scripts do not auto-fetch datasets or call judge models.
- Generated artifacts (JSON/CSV/HTML) are local; a SHA-256 manifest is available in `reports/artifact_manifest.json` (manifest generation is sketched at the end of this document).
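
## Sketch: Default Tokenizer Fallback

The default tokenization policy reduces to "ask tiktoken for the model-specific encoding, otherwise fall back to `cl100k_base`". The following is a minimal illustrative sketch of that behavior; the `count_tokens` helper is hypothetical, and the actual logic lives in `tools/run_routerbench_clean.py`.

```python
"""Illustrative sketch of the default tokenization policy (assumed helper,
not the actual wrapper code)."""
import tiktoken


def count_tokens(text: str, model: str) -> int:
    """Count tokens with the model-specific encoding, falling back to cl100k_base."""
    try:
        # Most accurate path for OpenAI models that tiktoken knows about.
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Unknown model: use cl100k_base so counts stay consistent across routers.
        enc = tiktoken.get_encoding("cl100k_base")
    return len(enc.encode(text))


if __name__ == "__main__":
    print(count_tokens("How many tokens is this?", "gpt-4"))
    print(count_tokens("How many tokens is this?", "some-non-openai-model"))
```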
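
## Sketch: Suppressing tokencost Stdout Warnings

The wrapper silences tokencost's stdout warnings rather than patching the library. One way to do this is to redirect stdout while tokencost is imported, as sketched below with the standard library's `contextlib.redirect_stdout`; this illustrates the approach only and is not the wrapper's actual code.

```python
"""Illustrative sketch: import tokencost without letting it write to stdout."""
import contextlib
import io

# Anything tokencost prints during import is captured in an in-memory buffer
# instead of cluttering the benchmark output.
with contextlib.redirect_stdout(io.StringIO()):
    import tokencost  # noqa: F401  (imported for its side effects)
```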
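
## Sketch: Windows-Safe Per-Eval CSV Filenames

On Windows, characters such as `:`, `/`, or `?` (which often appear in model identifiers) are not allowed in filenames, so the wrapper sanitizes the per-eval CSV names before saving. A minimal sketch with a hypothetical `safe_filename` helper; the real fix is applied inside `tools/run_routerbench_clean.py`.

```python
"""Illustrative sketch of sanitizing per-eval CSV filenames for Windows."""
import re


def safe_filename(name: str) -> str:
    """Replace characters Windows forbids in filenames with underscores."""
    return re.sub(r'[<>:"/\\|?*]', "_", name)


if __name__ == "__main__":
    # A model identifier containing a slash would otherwise break the CSV save.
    print(safe_filename("meta/llama-2-70b:chat_eval.csv"))
```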
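
## Sketch: SHA-256 Artifact Manifest

The artifact manifest pairs each generated JSON/CSV/HTML file with its SHA-256 digest. The sketch below shows how such a manifest could be produced; the `build_manifest` helper and the assumption that all artifacts sit under `reports/` are illustrative, and the real `reports/artifact_manifest.json` is written by the report tooling.

```python
"""Illustrative sketch of building a SHA-256 manifest for generated artifacts."""
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large artifacts are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(report_dir: Path = Path("reports")) -> Path:
    """Hash every JSON/CSV/HTML artifact under report_dir and write the manifest."""
    manifest = {
        str(p.relative_to(report_dir)): sha256_of(p)
        for p in sorted(report_dir.rglob("*"))
        if p.is_file()
        and p.suffix in {".json", ".csv", ".html"}
        and p.name != "artifact_manifest.json"
    }
    out = report_dir / "artifact_manifest.json"
    out.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out


if __name__ == "__main__":
    print(build_manifest())
```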