---
title: Operations Runbook
description: Practical guidance for logging, monitoring, alerting, and run/rollback procedures using Compitum’s mechanistic signals.
---

# Operations Runbook

Overview

- Compitum emits one structured certificate per decision (CLI `--trace`, API `cert.to_json()`). The certificate fields are designed to map directly onto the logs and metrics that SRE/ops teams consume.

Logging (Structured)

- For each decision, log the following fields as JSON:
  - `model`, `utility`, `utility_components.quality`, `utility_components.cost`
  - `constraints.feasible`, `constraints.shadow_prices`
  - `boundary_analysis.gap`, `boundary_analysis.entropy`, `boundary_analysis.sigma`
  - `drift_status.trust_radius`, `drift_status.ema`
- Example: see `examples/cert_to_logging.py`.
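
As a rough illustration of the field list above, the sketch below flattens one certificate into a single JSON log line. It is not the contents of `examples/cert_to_logging.py`. It assumes that `cert.to_json()` returns a JSON string and that the dotted names above correspond to nested objects; the helpers `dig`, `cert_to_log_record`, and `log_decision` are hypothetical names introduced here.

```python
import json
import logging

# Dotted field paths from the list above. The nested layout is an assumption
# inferred from these names, not a confirmed schema.
LOG_FIELDS = [
    "model",
    "utility",
    "utility_components.quality",
    "utility_components.cost",
    "constraints.feasible",
    "constraints.shadow_prices",
    "boundary_analysis.gap",
    "boundary_analysis.entropy",
    "boundary_analysis.sigma",
    "drift_status.trust_radius",
    "drift_status.ema",
]


def dig(obj, dotted_path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    for key in dotted_path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def cert_to_log_record(cert_json):
    """Flatten a certificate JSON string into a {dotted_path: value} dict."""
    cert = json.loads(cert_json)
    return {path: dig(cert, path) for path in LOG_FIELDS}


def log_decision(logger, cert_json):
    """Emit one JSON log line per decision (hypothetical helper)."""
    logger.info(json.dumps(cert_to_log_record(cert_json), sort_keys=True))
```

Missing fields are logged as `null` rather than dropped, so downstream parsers always see the same set of keys.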

Metrics (Suggested)

- Gauges/histograms:
  - Utility (U), quality, cost
  - Gap, entropy, sigma
  - Trust radius, EMA
- Counters:
  - Feasible/infeasible decisions
  - Deferrals (if policy triggers on ambiguity)
- Derived:
  - “At frontier” rate: share of decisions with gap ≈ 0
  - Active-constraint count: number of nonzero shadow prices per decision
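
The two derived metrics can be computed from logged values roughly as follows. This is a minimal sketch: the tolerances `eps` and `tol` are assumptions (the runbook only says "gap ~ 0" and "nonzero"), and because the shape of `shadow_prices` is not specified here, the helper accepts either a list of numbers or a name-to-price mapping.

```python
def at_frontier_rate(gaps, eps=1e-3):
    """Fraction of decisions whose gap is within eps of zero.

    eps is an assumed tolerance; tune it to the scale of your gap values.
    """
    gaps = list(gaps)
    if not gaps:
        return 0.0
    return sum(1 for g in gaps if abs(g) <= eps) / len(gaps)


def active_constraint_count(shadow_prices, tol=1e-9):
    """Number of constraints with a nonzero shadow price.

    Accepts a list of prices or a {constraint_name: price} dict, since the
    exact certificate shape is an assumption here.
    """
    if isinstance(shadow_prices, dict):
        values = shadow_prices.values()
    else:
        values = shadow_prices
    return sum(1 for price in values if abs(price) > tol)
```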

Alerts (Initial Thresholds)

- Constraint compliance (share of feasible decisions) below 99.9% over a 5–15 min window
- Prolonged high ambiguity:
  - More than 1% of decisions in a 15 min window have gap < 0.02 and entropy > 0.8
- Drift tightness:
  - Trust radius stays low (e.g., < 0.2) for more than N decisions in a row
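
The following sketch evaluates these three conditions over one window of decisions. It assumes the caller has already selected the window (for example, the last 15 minutes) and mapped each certificate onto the hypothetical `Decision` record. "Persistently low" is interpreted here as an unbroken run at the end of the window, and `n_low` stands in for N; both are assumptions to adjust.

```python
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical per-decision record built from certificate fields."""
    feasible: bool
    gap: float
    entropy: float
    trust_radius: float


def trailing_low_trust_run(window, threshold):
    """Length of the unbroken run of low trust radius at the end of the window."""
    run = 0
    for decision in reversed(window):
        if decision.trust_radius >= threshold:
            break
        run += 1
    return run


def evaluate_alerts(window, n_low=50):
    """Return the names of alerts that fire for this window of decisions."""
    if not window:
        return []
    alerts = []
    total = len(window)

    # Constraint compliance below 99.9%.
    compliance = sum(1 for d in window if d.feasible) / total
    if compliance < 0.999:
        alerts.append("constraint_compliance")

    # More than 1% of decisions are ambiguous (small gap, high entropy).
    ambiguous = sum(1 for d in window if d.gap < 0.02 and d.entropy > 0.8)
    if ambiguous / total > 0.01:
        alerts.append("high_ambiguity")

    # Trust radius below 0.2 for more than n_low consecutive recent decisions.
    if trailing_low_trust_run(window, 0.2) > n_low:
        alerts.append("low_trust_radius")

    return alerts
```

In practice these rules would more likely live in the alerting system itself; the function is mainly useful for checking thresholds against recorded traffic before enabling them.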

Dashboards (Minimal)

- Efficiency: U, quality, cost (p50/p90)
- Ambiguity: gap, entropy (p50/p90), at-frontier rate
- Constraints: feasible rate, active constraint count, top shadow prices
- Drift: trust radius and EMA trend
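
If your metrics backend does not compute percentiles for you, the p50/p90 panel values can be derived from logged samples along these lines. This is a sketch using the nearest-rank definition of a percentile, which is one of several common conventions; `summarize_panel` is a hypothetical helper.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of an empty sample is undefined")
    # Rank is the smallest index covering at least p percent of the samples.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_panel(series_by_name):
    """Map each metric name to its p50 and p90 over the given samples."""
    return {
        name: {"p50": percentile(samples, 50), "p90": percentile(samples, 90)}
        for name, samples in series_by_name.items()
    }
```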

Run Procedures

- Standard run:
  - Use fixed configs (defaults, constraints)
  - Log certificate JSON for each decision
  - Export metrics from logs via your pipeline (e.g., ELK/OTel)
- Rollback:
  - Revert to previous frozen config (tagged release)
  - Reduce update stride or tighten trust radius if instability appears

Knobs (Tuning)

- `lambda` (willingness to pay, WTP): cost sensitivity
- Metric params: `D`, `rank`, `delta` (stability)
- Boundary thresholds: `gap_threshold`, `entropy_threshold`, `sigma_threshold`
- Update cadence: `update_stride`

SRE Tests (Smoke)

- Route a fixed prompt set and assert:
  - No infeasible certificates
  - U, gap, entropy within expected bands
  - Logs parse as valid JSON; metrics exporter sees fields
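
A smoke check along these lines could run over the certificate lines captured while routing the fixed prompt set. It is a sketch under assumptions: certificates arrive as JSON strings with nested objects matching the dotted field names, and the values in `EXPECTED_BANDS` are placeholders, not recommended bands. Routing itself is left out because the call depends on your deployment.

```python
import json

# Placeholder bands; replace with ranges observed on your own prompt set.
EXPECTED_BANDS = {
    "utility": (0.0, 1.0),
    "boundary_analysis.gap": (0.0, 1.0),
    "boundary_analysis.entropy": (0.0, 2.0),
}


def dig(obj, dotted_path):
    """Follow a dotted path through nested dicts; return None if a key is missing."""
    for key in dotted_path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


def smoke_check(cert_lines, bands=EXPECTED_BANDS):
    """Return a list of human-readable failures; an empty list means pass."""
    failures = []
    for i, line in enumerate(cert_lines):
        try:
            cert = json.loads(line)
        except json.JSONDecodeError as exc:
            failures.append(f"decision {i}: invalid JSON ({exc.msg})")
            continue
        # Every certificate must be explicitly feasible.
        if dig(cert, "constraints.feasible") is not True:
            failures.append(f"decision {i}: infeasible or missing constraints.feasible")
        # Each banded field must be present and inside its expected range.
        for path, (low, high) in bands.items():
            value = dig(cert, path)
            if value is None:
                failures.append(f"decision {i}: missing {path}")
            elif not low <= value <= high:
                failures.append(f"decision {i}: {path}={value} outside [{low}, {high}]")
    return failures
```

A CI job can fail the build whenever `smoke_check(...)` returns a non-empty list and print the failures as the job output.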

References

- {doc}`Certificate-Schema`
- {doc}`PEER_REVIEW` (Routing Certificate)
- {doc}`Panel-Summary`
- `examples/cert_to_logging.py`