Language Perspective (cs.CL)

This page frames Compitum for cs.CL reviewers: an NLP router for LLM tasks that trades off cost against quality at a fixed willingness-to-pay (lambda), enforces policy/region constraints, detects ambiguity, and emits an auditable certificate for every decision, all without a judge model.

Problem in NLP Terms

  • Inputs: prompts and lightweight prompt-derived features (PGD), plus optional embeddings.

  • Models: a small set of LLM backends (fast, thinking, auto) with different quality/cost/latency profiles.

  • Objective: maximize utility U = quality − lambda * total_cost at fixed lambda, subject to constraints (e.g., region/policy/rate).

  • Decision: route each prompt to the best feasible model; optionally defer when ambiguity is high (boundary region).

Code anchors: src/compitum/router.py:80 (route), src/compitum/constraints.py:36 (feasibility), src/compitum/energy.py:33 (utility), src/compitum/boundary.py:19 (ambiguity), src/compitum/router.py:25 (certificate schema).
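
A minimal sketch of this selection rule, with hypothetical names (the real signatures live at the anchors above): filter to feasible backends, score U = quality − lambda * total_cost, and take the argmax.

# Sketch of feasibility-first routing at fixed lambda. Candidate, route, and the
# example panel are illustrative; the real code is at src/compitum/router.py:80,
# src/compitum/constraints.py:36, and src/compitum/energy.py:33.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str           # e.g. "fast", "thinking", "auto"
    quality: float      # predicted task quality for this prompt
    total_cost: float   # scalarized $ + latency cost
    feasible: bool      # region/policy/rate checks passed

def route(candidates: list[Candidate], lam: float) -> Candidate | None:
    feasible = [c for c in candidates if c.feasible]   # feasibility first
    if not feasible:
        return None                                    # caller may defer or fall back
    # Scalarized utility U = quality - lambda * total_cost, maximized over the panel.
    return max(feasible, key=lambda c: c.quality - lam * c.total_cost)

# At lambda = 1.0 the cheaper backend wins here despite lower raw quality.
panel = [Candidate("fast", quality=0.72, total_cost=0.05, feasible=True),
         Candidate("thinking", quality=0.80, total_cost=0.20, feasible=True)]
print(route(panel, lam=1.0).name)   # -> fast (U = 0.67 vs 0.60)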

What We Emit (Certificate)

Per decision, a certificate exposes:

  • utility components (quality, latency, cost) and the overall utility U

  • constraint status (feasibility, approximate local shadow prices)

  • boundary diagnostics (gap to runner-up, entropy, uncertainty)

  • drift/trust-region state (for online adaptation)

This makes routing auditable and suitable for post-hoc error analysis and deferral policies.
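
As a rough illustration of the record shape (field names here are placeholders; the authoritative schema is at src/compitum/router.py:25):

# Illustrative certificate record; field names are hypothetical, the real
# schema is defined at src/compitum/router.py:25.
from dataclasses import dataclass, asdict

@dataclass
class Certificate:
    chosen_model: str
    # Utility components and the scalarized total U.
    quality: float
    latency: float
    cost: float
    utility: float
    # Constraint diagnostics.
    feasible: bool
    shadow_prices: dict[str, float]   # approximate, diagnostic only
    # Boundary diagnostics used by deferral policies.
    gap_to_runner_up: float
    entropy: float
    uncertainty: float
    # Drift/trust-region state for online adaptation.
    drift_state: str
    trust_region_radius: float

cert = Certificate("fast", quality=0.72, latency=0.40, cost=0.05, utility=0.67,
                   feasible=True, shadow_prices={"region": 0.0},
                   gap_to_runner_up=0.07, entropy=0.31, uncertainty=0.12,
                   drift_state="stable", trust_region_radius=0.5)
print(asdict(cert))   # plain dict, easy to log as JSON for audits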

Relation to cs.CL Literature

  • Mixture-of-Experts and routing: Compitum is a deterministic router across a small panel of LLMs, optimizing a scalarized utility with constraints, not a learned soft gate within a single model.

  • Selective prediction/abstention: boundary flags align with deferral; we measure deferral quality against high-regret items.

  • Cost-aware evaluation: fixed lambda slices capture cost–quality tradeoffs; we report regret, win rate, and cost deltas (when available).

  • Calibration: we use calibrated component predictors and report reliability of uncertainty vs. absolute regret.

Constraints and Safety

  • Policy/rate/region constraints are enforced before selection (feasibility-first); constraint compliance should be ~100%.

  • Approximate shadow prices (finite-difference estimates) help identify binding constraints; they are diagnostic only (a sketch follows).
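
One way to obtain such an estimate, assuming the constrained optimum can be re-solved with a slightly relaxed bound (a sketch, not the project's exact routine at src/compitum/constraints.py:36):

# Sketch: approximate the shadow price of a constraint bound b via a one-sided
# finite difference on the constrained optimum. best_utility is a placeholder
# for "re-run selection with the bound set to b and return the optimal U".
from typing import Callable

def shadow_price(best_utility: Callable[[float], float],
                 bound: float, eps: float = 1e-3) -> float:
    """Estimate d(optimal utility)/d(bound): ~0 when slack, positive when binding."""
    return (best_utility(bound + eps) - best_utility(bound)) / eps

# Toy example: a rate cap that stops mattering once utility saturates at 0.8.
best_u = lambda rate_cap: min(0.8, 0.5 + 0.1 * rate_cap)
print(shadow_price(best_u, bound=2.0))   # ~0.1 -> binding
print(shadow_price(best_u, bound=5.0))   # ~0.0 -> slack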

Coherence and Ambiguity

  • A metric-aware KDE prior (whitened) nudges decisions toward familiar contexts with predictable behavior.

  • The boundary diagnostic combines the gap to the runner-up, entropy, and uncertainty to flag ambiguous prompts where deferral or conservative routing is prudent (sketched below).
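
A sketch of how such a flag might combine the three signals (weights, temperature, and the example numbers are placeholders; the actual diagnostic is at src/compitum/boundary.py:19):

# Sketch of an ambiguity score mixing the gap to the runner-up, normalized
# softmax entropy over candidate utilities, and a predictive-uncertainty term.
# Weights and the deferral threshold are placeholders.
import math

def boundary_score(utilities: list[float], uncertainty: float,
                   temperature: float = 1.0) -> float:
    ranked = sorted(utilities, reverse=True)
    gap = ranked[0] - ranked[1]                    # small gap => near-tie
    exps = [math.exp(u / temperature) for u in utilities]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs) / math.log(len(probs))
    # Higher score = more ambiguous; defer (or route conservatively) above a tuned threshold.
    return 0.4 * (1.0 - min(gap, 1.0)) + 0.3 * entropy + 0.3 * uncertainty

print(boundary_score([0.67, 0.66, 0.30], uncertainty=0.5))   # near-tie: higher score
print(boundary_score([0.90, 0.40, 0.30], uncertainty=0.1))   # clear winner: lower score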

Evaluation for cs.CL

  • Panel and per-task summaries at fixed lambda (e.g., 0.1 and 1.0):

    • regret and win rate vs. the best baseline (a computation sketch follows this list)

    • boundary/deferral rate and quality

    • optional cost delta on wins if cost columns are present

  • Calibration diagnostics: reliability curve (uncertainty bins) and Spearman rho(uncertainty, |regret|).

  • Constraint compliance rate (~100%).
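
A sketch of these per-item summaries on an evaluation frame (column names are assumptions for illustration, not the actual eval_results schema):

# Sketch of fixed-lambda summary metrics. Column names (utility_router,
# utility_best_baseline, uncertainty) are hypothetical stand-ins for whatever
# the eval_results CSV actually provides.
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    regret = df["utility_best_baseline"] - df["utility_router"]
    return {
        "mean_regret": float(regret.mean()),
        "win_rate": float((regret <= 0).mean()),   # router matches or beats the best baseline
        "spearman_rho": float(df["uncertainty"].corr(regret.abs(), method="spearman")),
    }

df = pd.DataFrame({"utility_best_baseline": [0.80, 0.70, 0.90],
                   "utility_router":        [0.80, 0.65, 0.90],
                   "uncertainty":           [0.10, 0.60, 0.05]})
print(summarize(df))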

Helper commands:

  • cs.CL summary

python tools\analysis\cl_summary.py ^
  --input data\rb_clean\eval_results\<latest-compitum-csv>.csv ^
  --out-json reports\cl_summary.json ^
  --out-md reports\cl_summary.md

  • Reliability curve and CEI (see docs/Statistical-Notes.md and docs/Control-of-Error.md)

  • Decision curves (ambiguity-based deferral upper bound + boundary AP/AUROC)

python tools\analysis\cl_decision_curves.py ^
  --input data\rb_clean\eval_results\<latest-compitum-csv>.csv ^
  --quantiles 0,0.05,0.1,0.2,0.3,0.4,0.5 ^
  --out-json reports\cl_decision_curves.json ^
  --out-md reports\cl_decision_curves.md ^
  --out-png reports\cl_decision_curve.png

Reproducibility

  • Deterministic evaluation with fixed seeds and offline artifacts.

  • Attach reports/cl_summary.md, reliability_curve.md/png, cei_report.md, fixed_wtp_summary.md.

Determinism & Explainability (0.1.1)

  • Determinism

    • Repeated route and batch determinism under fixed seeds/embeddings (a test sketch follows this list).

    • Tests: tests/invariants/test_invariants_router_determinism.py, tests/router/test_router_batch_determinism.py

  • Paraphrase robustness

    • Flip budget under small lexical/format edits; flips must be explainable via certificate deltas (distance or feasibility changes).

    • Tests: tests/invariants/test_paraphrase_invariance.py, tests/invariants/test_paraphrase_explainability.py
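
The determinism invariant, sketched with a stub in place of the real router (the actual tests use the project's fixtures; StubRouter only demonstrates the fix-inputs, route-twice, compare pattern):

# Sketch of the determinism check in the spirit of
# tests/invariants/test_invariants_router_determinism.py. StubRouter is a
# placeholder with the property the invariant asserts: routing is a pure
# function of the (seeded) inputs.
import hashlib

class StubRouter:
    def route(self, prompt: str) -> str:
        digest = hashlib.sha256(prompt.encode("utf-8")).digest()
        return ["fast", "thinking", "auto"][digest[0] % 3]

def test_route_is_deterministic():
    prompts = ["Translate to French: good morning.",
               "Summarize this abstract in one sentence."]
    router = StubRouter()
    first = [router.route(p) for p in prompts]
    second = [router.route(p) for p in prompts]
    assert first == second, "identical inputs must yield identical routes"

if __name__ == "__main__":
    test_route_is_deterministic()
    print("determinism invariant holds for the stub")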

Limits (cs.CL)

  • No judge model; utility proxies depend on upstream task scoring and pricing assumptions.

  • Shadow prices are approximate diagnostics; the coherence prior is bounded and carries a small weight.

  • Router panel size is intentionally small; extending to larger panels is future work.