Replay, Simulation, and Calibration
Phase 3 turns runtime governance from a blocking layer into a tunable control plane. This page is the focused guide for testing policy changes before they affect production traffic.
What Runs Where
sf_policy separates live enforcement from candidate-policy analysis:
- evaluate() records the production decision that actually governed the request
- simulate() runs a candidate bundle without changing the production result
- replay() runs historical events through a candidate bundle
- compare_policies() summarizes how baseline and candidate actions differ
- record_review() captures false-positive and analyst feedback
- suggest_threshold() proposes threshold adjustments from reviewed outcomes
Typical Workflow
- Start from a versioned baseline bundle for dev, staging, or prod.
- Create a candidate bundle with one concrete change.
- Run simulate() for representative live-like requests.
- Run replay() against historical trace-linked events.
- Use compare_policies() to quantify action changes.
- Record review outcomes for false positives and justified escalations.
- Promote the candidate only after the calibration data is acceptable.
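The replay and comparison steps above can be sketched in plain Python. This is a toy model rather than the sf_policy API: the bundle dictionaries, the event fields, and the helpers decide, toy_replay, and toy_compare are all hypothetical names chosen for illustration. It only shows the shape of the workflow: one change in the candidate, the same history run through both bundles, and a count of how actions moved.

```python
from collections import Counter

# Hypothetical bundles: plain dicts, NOT the SDK's bundle format.
BASELINE = {
    "version": "baseline-v1",
    "grounding_min": 0.50,
    "low_grounding_action": "allow+log",
}
# The candidate differs from the baseline in exactly one setting.
CANDIDATE = {
    **BASELINE,
    "version": "candidate-v2",
    "low_grounding_action": "human_review",
}

def decide(bundle, event):
    """Toy rule: low grounding confidence triggers the bundle's configured action."""
    if event["grounding_confidence"] < bundle["grounding_min"]:
        return bundle["low_grounding_action"]
    return "allow"

def toy_replay(bundle, events):
    """Run historical events through a bundle and return {trace_id: action}."""
    return {e["trace_id"]: decide(bundle, e) for e in events}

def toy_compare(baseline_actions, candidate_actions):
    """Count (baseline_action, candidate_action) pairs across all traces."""
    return Counter(
        (baseline_actions[t], candidate_actions[t]) for t in baseline_actions
    )

# Illustrative history; the values are made up for the sketch.
HISTORY = [
    {"trace_id": "t1", "grounding_confidence": 0.91},
    {"trace_id": "t2", "grounding_confidence": 0.42},
    {"trace_id": "t3", "grounding_confidence": 0.30},
]

baseline_actions = toy_replay(BASELINE, HISTORY)
candidate_actions = toy_replay(CANDIDATE, HISTORY)
transitions = toy_compare(baseline_actions, candidate_actions)

# Keep only the pairs where the candidate would have acted differently.
changed = {pair: n for pair, n in transitions.items() if pair[0] != pair[1]}
print(changed)
```

Keeping the candidate to a single change means every entry in changed can be attributed to that change, which is what makes the comparison output usable for a promotion decision.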
Tuning Targets
The main GA calibration knobs are:
- grounding confidence thresholds
- scope violation handling
- RBAC violation handling
- explainability coverage rules
Examples:
- move low-grounding responses from allow+log to human_review
- tighten a scope rule from human_review to block
- relax an RBAC rule that is producing known false positives in staging
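One way to keep a tuning change reviewable is to check that the candidate differs from the baseline in exactly one setting. The sketch below assumes a nested-dictionary bundle layout, which is hypothetical and not the sf_policy bundle schema; diff_paths is likewise an illustrative helper, not an SDK function.

```python
import copy

def diff_paths(baseline, candidate, prefix=""):
    """Return {dotted_path: (old, new)} for every value that differs."""
    changes = {}
    for key in sorted(set(baseline) | set(candidate)):
        path = f"{prefix}{key}"
        old, new = baseline.get(key), candidate.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            # Recurse into nested sections so paths stay specific.
            changes.update(diff_paths(old, new, prefix=f"{path}."))
        elif old != new:
            changes[path] = (old, new)
    return changes

# Hypothetical bundle layout covering three of the tuning targets.
BASELINE = {
    "environment": "staging",
    "grounding": {"min_confidence": 0.5, "below_min_action": "allow+log"},
    "scope_violation": {"action": "human_review"},
    "rbac_violation": {"action": "block"},
}

# Tighten the scope rule from human_review to block, and nothing else.
CANDIDATE = copy.deepcopy(BASELINE)
CANDIDATE["scope_violation"]["action"] = "block"

changes = diff_paths(BASELINE, CANDIDATE)
if len(changes) != 1:
    raise ValueError(f"expected one concrete change, found {len(changes)}")
print(changes)
```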
Production Separation
Replay and simulation records are intentionally separate from production evidence:
- production decisions describe what actually governed the request
- simulation decisions describe what a candidate policy would have done
- replay outputs are for tuning and approval, not for claiming live enforcement
That separation matters for audits and incident review. Operators should be able to tell whether a result came from live control enforcement or from policy testing.
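A minimal sketch of that separation, assuming each stored decision carries an explicit origin tag. The DecisionRecord type, the Mode enum, and the field names are hypothetical and do not describe the record format sf_policy actually persists.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(str, Enum):
    PRODUCTION = "production"   # live enforcement
    SIMULATION = "simulation"   # candidate bundle on live-like requests
    REPLAY = "replay"           # candidate bundle on historical events

@dataclass(frozen=True)
class DecisionRecord:
    trace_id: str
    action: str
    bundle_version: str
    mode: Mode

def enforcement_evidence(records):
    """Return only the decisions that actually governed a request."""
    return [r for r in records if r.mode is Mode.PRODUCTION]

# The same trace can have several records; only one is live evidence.
records = [
    DecisionRecord("t2", "allow+log", "baseline-v1", Mode.PRODUCTION),
    DecisionRecord("t2", "human_review", "candidate-v2", Mode.SIMULATION),
    DecisionRecord("t2", "human_review", "candidate-v2", Mode.REPLAY),
]

evidence = enforcement_evidence(records)
print([(r.trace_id, r.action, r.mode.value) for r in evidence])
```

With the origin stored on the record itself, an audit query can exclude testing output by construction instead of relying on naming conventions or separate log locations.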
False-Positive Review Loop
The practical loop is:
- review a blocked or escalated decision
- classify it as justified or false positive
- persist that review with record_review()
- use suggest_threshold() and comparison output to adjust the candidate bundle
- rerun replay or simulation before activation
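The loop above can be illustrated with a toy threshold heuristic. This page does not describe how suggest_threshold() computes its proposal, so the function below is a stand-in with an assumed rule: among thresholds no higher than the current one, pick the value that disagrees with the fewest analyst verdicts. The Review type and its fields are also hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Review:
    trace_id: str
    grounding_confidence: float
    verdict: str  # "justified" or "false_positive"

def toy_suggest_threshold(reviews, current):
    """Pick a threshold <= current that disagrees with the fewest reviews.

    A decision is escalated when grounding_confidence < threshold.
    """
    def cost(t):
        # False positives that would still be escalated at threshold t.
        still_flagged = sum(
            1 for r in reviews
            if r.verdict == "false_positive" and r.grounding_confidence < t
        )
        # Justified escalations that would be lost at threshold t.
        missed = sum(
            1 for r in reviews
            if r.verdict == "justified" and r.grounding_confidence >= t
        )
        return still_flagged + missed

    candidates = {current, *(r.grounding_confidence for r in reviews)}
    candidates = [t for t in candidates if t <= current]
    # On ties, prefer the higher (more conservative) threshold.
    return min(candidates, key=lambda t: (cost(t), -t))

# Illustrative reviews of decisions escalated under a 0.50 threshold.
reviews = [
    Review("t1", 0.20, "justified"),
    Review("t2", 0.30, "justified"),
    Review("t3", 0.45, "false_positive"),
    Review("t4", 0.48, "false_positive"),
]

suggested = toy_suggest_threshold(reviews, current=0.50)
print(suggested)
```

A suggestion like this is an input to the candidate bundle, not an activation step: the adjusted bundle still goes back through replay or simulation before it is promoted.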
Related Docs
- Runtime Governance GA Guide
- Runtime Governance Contracts
- API Reference: spanforge.sdk.policy
- Runtime Governance Demo