Instantaneous Feedback (Deep Dive)

We articulate why mechanistic, judge‑free signals available at decision time (feasibility, boundary ambiguity, calibrated uncertainty, trust‑region state) can improve learning dynamics relative to delayed‑reward feedback.

Setup

  • At step t, Compitum observes S_t = {feasible_t, gap_t, entropy_t, uncertainty_t, r_t} and updates the geometry parameters θ_t (e.g., L_t) with a capped step θ_{t+1} = θ_t − η_eff ∇_θ U(x_t, m_t; θ_t), where η_eff ≤ κ/(‖∇‖+ε); a minimal sketch of this capped step follows this list.

  • No external judge is consulted; the signals are endogenous and instantaneous.

  • In bandit/RL settings, feedback often arrives as a random reward R_{t+τ}, τ ≥ 1, requiring temporal credit assignment and introducing additional variance/noise.
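
As a concrete reading of the capped step, the sketch below (Python, with assumed names such as capped_step and a placeholder gradient) clamps η_eff so the update norm never exceeds κ; it is an illustration under these assumptions, not Compitum's implementation.

    import numpy as np

    def capped_step(theta, grad, eta=0.05, kappa=0.1, eps=1e-8):
        # Bounded update: theta_{t+1} = theta_t - eta_eff * grad,
        # with eta_eff <= kappa / (||grad|| + eps), so the step norm stays <= kappa.
        eta_eff = min(eta, kappa / (np.linalg.norm(grad) + eps))
        return theta - eta_eff * grad

    # Placeholder gradient standing in for grad_theta U(x_t, m_t; theta_t), which in
    # Compitum would be driven by the endogenous signals S_t.
    theta = np.zeros(4)
    grad_U = np.array([0.8, -0.3, 0.1, 2.0])
    theta = capped_step(theta, grad_U)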

Intuition: Variance and Latency

  • Immediate surrogates S_t reduce update delay and eliminate the need for temporal credit assignment. Under bounded steps, this empirically accelerates regret reduction.

  • Conceptually, the signals in S_t serve as full‑information control variates for ∇_θ U, whereas delayed rewards require estimating long‑horizon contributions; a toy illustration of this variance gap follows this list.

  • We do not claim new regret bounds; our claim is empirical and supported by CEI and control KPIs.
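
To make the variance point concrete, the toy simulation below (not Compitum's estimator; all names and noise levels are assumptions) compares the spread of a full‑information gradient signal with a gradient recovered from a noisy return observed τ steps later.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setting: the true per-step gradient is known. The immediate surrogate
    # observes it directly, while the delayed path only sees a return whose noise
    # accumulates over tau steps before any update can be made.
    true_grad, tau, reward_noise, n = 1.5, 5, 2.0, 10_000

    surrogate_grads = np.full(n, true_grad)
    delayed_grads = true_grad + rng.normal(0.0, reward_noise * np.sqrt(tau), n)

    print("surrogate variance:", surrogate_grads.var())  # 0.0
    print("delayed variance  :", delayed_grads.var())    # ~ tau * reward_noise**2 = 20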

What to Measure

  • CEI components: deferral quality (AP/AUROC), calibration (Spearman ρ and reliability curve), stability under trust‑radius shrink, and compliance; a measurement sketch follows this list.

  • Control KPIs: shrink/expand counts, r statistics, Spearman ρ(shrink, future improvement).

  • Fixed‑λ regret/win‑rate with paired bootstrap CIs across panels.
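
The sketch below shows how these quantities could be computed offline; the array names (defer_label, uncertainty, wins_a, ...) are hypothetical stand‑ins for Compitum's logs, and the sketch does not reproduce the full CEI definition.

    import numpy as np
    from scipy.stats import spearmanr
    from sklearn.metrics import average_precision_score, roc_auc_score

    rng = np.random.default_rng(0)

    # Hypothetical per-decision logs: whether deferral was the right call, the score
    # that drove it, a calibrated uncertainty, and the realized error it should track.
    defer_label = rng.integers(0, 2, 500)
    defer_score = defer_label + rng.normal(0.0, 0.8, 500)
    uncertainty = rng.random(500)
    realized_err = uncertainty + rng.normal(0.0, 0.3, 500)

    print("deferral AP    :", average_precision_score(defer_label, defer_score))
    print("deferral AUROC :", roc_auc_score(defer_label, defer_score))
    rho, _ = spearmanr(uncertainty, realized_err)
    print("calibration rho:", rho)

    # Paired bootstrap CI on a fixed-lambda win-rate difference over the same panel.
    def paired_bootstrap_ci(wins_a, wins_b, n_boot=2000, alpha=0.05):
        idx = np.arange(len(wins_a))
        diffs = []
        for _ in range(n_boot):
            s = rng.choice(idx, size=len(idx), replace=True)  # resample decisions jointly
            diffs.append(wins_a[s].mean() - wins_b[s].mean())
        return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

    wins_a = rng.integers(0, 2, 300)
    wins_b = rng.integers(0, 2, 300)
    print("win-rate delta 95% CI:", paired_bootstrap_ci(wins_a, wins_b))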

Engineering Consequences

  • No judge model is needed; evaluation remains deterministic and offline.

  • Updates remain PD‑safe via M = L L^T + δI and Cholesky checks; step sizes are capped by SRMF. A sketch of the PD‑safety check follows this list.

  • The bounded coherence prior ensures robustness to sweeps over the KDE bandwidth and β_s.
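
The PD‑safety check can be sketched as follows (assumed names, not the actual implementation): M = L L^T + δI is positive definite by construction for δ > 0, and a finite‑value check plus a Cholesky factorization guard against numerical breakage from an aggressive step, falling back to the previous L if either fails.

    import numpy as np

    def pd_safe_metric(L, delta=1e-6):
        # M = L L^T + delta*I is PD by construction for delta > 0; the checks below
        # catch numerical breakage (NaN/Inf entries, overflow) from a bad update.
        M = L @ L.T + delta * np.eye(L.shape[0])
        if not np.all(np.isfinite(M)):
            return None
        try:
            np.linalg.cholesky(M)
            return M
        except np.linalg.LinAlgError:
            return None

    # Hypothetical acceptance logic inside the bounded update loop.
    L_prev = np.eye(3)
    L_prop = L_prev - 0.05 * np.ones((3, 3))  # placeholder for a capped gradient step on L
    M = pd_safe_metric(L_prop)
    if M is None:
        M = pd_safe_metric(L_prev)            # fall back to the last PD-safe state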

Limits and Scope

  • This is not a theoretical reduction of RL; it is a practical observation about estimator variance and latency under bounded online updates.

  • Shadow prices are approximate diagnostics; selection is feasibility‑first argmax U.

  • The approach complements rather than replaces delayed‑reward formulations in settings where only bandit feedback is available.