Reproducibility

Baseline

  • Benchmark tool: RouterBench, vendored at src/routerbench and pinned to upstream commit cc67d10 (origin/main).

  • We do not modify the RouterBench source for final runs; all behavioral changes are applied via an external wrapper.

How We Run RouterBench (Upstream)

  • Wrapper: tools/run_routerbench_clean.py (invoked by scripts/run_routerbench_clean.bat).

    • Suppresses tokencost stdout warnings.

    • Standardizes token counting via tiktoken, using encoding_for_model with a cl100k_base fallback (see Tokenization Policy below).

    • On Windows, sanitizes unsafe filename characters when saving per-eval CSVs (see the sketch after this list).

    • Tokenizer backend can be selected with --tokenizer-backend=tiktoken|tokencost|hf (default: tiktoken).
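
  • The Windows filename fix boils down to replacing characters that NTFS rejects before writing per-eval CSVs. A minimal sketch of the idea; the helper name and replacement character are illustrative, not the wrapper's actual code:

      import re

      # Characters Windows/NTFS forbids in filenames: < > : " / \ | ? *
      _WINDOWS_UNSAFE = re.compile(r'[<>:"/\\|?*]')

      def safe_eval_filename(name: str) -> str:
          """Replace characters Windows rejects so per-eval CSV saves succeed."""
          return _WINDOWS_UNSAFE.sub("_", name)

      # e.g. a model id like "openai/gpt-4" becomes "openai_gpt-4.csv"
      csv_path = safe_eval_filename("openai/gpt-4") + ".csv"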

Fast Config

  • Config: data/rb_clean/evaluate_routers.yaml

  • Run: scripts\run_routerbench_clean.bat --config=data/rb_clean/evaluate_routers.yaml --local

Compitum Evaluation

  • Our router lives in src/compitum; the adapter lives outside the RouterBench tree at tools/routerbench/routers/compitum_router.py (a sketch of the adapter pattern follows this list).

  • Driver: tools/evaluate_compitum.py (invoked by scripts/run_compitum_eval.bat).

  • Run: scripts\run_compitum_eval.bat --config=data/rb_clean/evaluate_routers.yaml
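
  • For orientation, the adapter is a thin wrapper: it imports our router from src/compitum and exposes it behind the interface RouterBench expects. The sketch below is illustrative only; the import path, class name, and method names are all assumptions, and the real interface is whatever upstream RouterBench defines and tools/routerbench/routers/compitum_router.py implements:

      # Illustrative adapter sketch; all names and signatures here are
      # assumptions, not the actual RouterBench or Compitum interfaces.
      from compitum import CompitumRouter  # hypothetical import path

      class CompitumRouterAdapter:
          """Bridge Compitum to a RouterBench-style route(prompt) -> model id call."""

          def __init__(self) -> None:
              self._router = CompitumRouter()  # hypothetical constructor

          def route(self, prompt: str) -> str:
              # Delegate model selection to Compitum; return the chosen model id.
              return self._router.select_model(prompt)  # hypothetical method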

Outputs

  • RouterBench upstream run: CSV/PKL under data/rb_clean/eval_results.

  • Compitum run: per-eval CSVs under data/rb_clean/eval_results (written by the Compitum eval script).

Tokenization Policy

  • Default: tiktoken, using encoding_for_model with a cl100k_base fallback (sketched after this list).

  • Alternatives: --tokenizer-backend=tokencost (use library defaults) or --tokenizer-backend=hf (experimental; falls back to tiktoken on errors).

  • Rationale: consistency of counts across routers is prioritized over absolute accuracy; OpenAI models receive the most accurate counts under tiktoken.
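
  • A minimal sketch of the default counting path (tiktoken's encoding_for_model raises KeyError for model names it does not know, at which point we fall back to cl100k_base):

      import tiktoken

      def count_tokens(text: str, model: str) -> int:
          """Count tokens with the model's own encoding, else cl100k_base."""
          try:
              enc = tiktoken.encoding_for_model(model)
          except KeyError:
              # Unknown or non-OpenAI model names: use one fixed encoding so
              # counts stay consistent across routers, if not exact per model.
              enc = tiktoken.get_encoding("cl100k_base")
          return len(enc.encode(text))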

Diff Against Upstream

  • A full diff of the prior forked changes is archived at src/routerbench/DIFF_WITH_UPSTREAM.patch for reference; final runs do not rely on those changes.

One-Command Report

  • End-to-end (tests + both benchmarks + HTML report): scripts\run_full_report.bat

  • The output report path (under reports/) is printed at the end of the run.

Individual Steps

  • Unit tests only: python tools/ci_orchestrator.py --tests

  • RouterBench only: python tools/ci_orchestrator.py --bench-routerbench --config=data/rb_clean/evaluate_routers.yaml

  • Compitum only: python tools/ci_orchestrator.py --bench-compitum --config=data/rb_clean/evaluate_routers.yaml

  • Build report from latest artifacts: python tools/ci_orchestrator.py --report-out reports/report.html

Quality Suite (Lint/Types/Sec/Tests/Mutation)

  • One command: scripts\run_quality.bat

  • Runs ruff (style), mypy (types), bandit (security), pytest with coverage (line + branch), and cosmic-ray (mutation); results are written to reports/quality_*.json (a summary sketch follows this list).
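
  • A quick way to summarize those artifacts without assuming their exact schema (keys vary per tool, so the sketch only lists each report's top-level fields):

      import json
      from pathlib import Path

      # Print each quality report's name and its top-level JSON keys.
      for report in sorted(Path("reports").glob("quality_*.json")):
          data = json.loads(report.read_text())
          summary = list(data)[:5] if isinstance(data, dict) else type(data).__name__
          print(f"{report.name}: {summary}")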

Licensing & Data Use

  • RouterBench inputs follow their upstream licenses; we do not redistribute proprietary datasets in this repository or distributions.

  • Evaluation runs operate offline on locally cached inputs; scripts do not auto-fetch datasets or call judge models.

  • Generated artifacts (JSON/CSV/HTML) are local; a SHA-256 manifest is available at reports/artifact_manifest.json (a verification sketch follows).
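
  • Verifying artifacts against the manifest takes a few lines; the sketch below assumes the manifest maps relative paths to hex SHA-256 digests (that layout is an assumption, so check reports/artifact_manifest.json for the actual schema):

      import hashlib
      import json
      from pathlib import Path

      def sha256_of(path: Path) -> str:
          """Stream a file through SHA-256 and return the hex digest."""
          h = hashlib.sha256()
          with path.open("rb") as f:
              for chunk in iter(lambda: f.read(1 << 20), b""):
                  h.update(chunk)
          return h.hexdigest()

      # Assumed layout: {"relative/path.csv": "<hex sha-256>", ...}
      manifest = json.loads(Path("reports/artifact_manifest.json").read_text())
      for rel_path, expected in manifest.items():
          status = "OK" if sha256_of(Path(rel_path)) == expected else "MISMATCH"
          print(f"{status}  {rel_path}")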