Replay, Simulation, and Calibration

Phase 3 turns runtime governance from a blocking layer into a tunable control plane. This page explains how to test policy changes before they affect production traffic.

What Runs Where

sf_policy separates live enforcement from candidate-policy analysis:

evaluate() records the production decision that actually governed the request
simulate() runs a candidate bundle without changing the production result
replay() runs historical events through a candidate bundle
compare_policies() summarizes how baseline and candidate actions differ
record_review() captures false-positive and analyst feedback
suggest_threshold() proposes threshold adjustments from reviewed outcomes
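
The split between live enforcement and candidate analysis can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in written for this page: the function bodies, the dictionary-shaped bundles, the `Decision` fields, and the `grounding_threshold` key are assumptions, not the actual sf_policy signatures.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for illustration only; not the real sf_policy API.

@dataclass(frozen=True)
class Decision:
    action: str          # "allow" or "human_review" in this toy example
    bundle_version: str  # which bundle produced the decision
    mode: str            # "production" or "simulation"

def decide(bundle: dict, request: dict) -> str:
    """Toy rule: escalate when grounding confidence is below the threshold."""
    if request["grounding_confidence"] < bundle["grounding_threshold"]:
        return "human_review"
    return "allow"

def evaluate(bundle: dict, request: dict) -> Decision:
    """Produce the decision that actually governs the request."""
    return Decision(decide(bundle, request), bundle["version"], "production")

def simulate(candidate: dict, request: dict) -> Decision:
    """Run a candidate bundle; the result is informational only."""
    return Decision(decide(candidate, request), candidate["version"], "simulation")

baseline = {"version": "prod-v1", "grounding_threshold": 0.5}
candidate = {"version": "prod-v2-candidate", "grounding_threshold": 0.7}
request = {"grounding_confidence": 0.6}

live = evaluate(baseline, request)     # governs the request
shadow = simulate(candidate, request)  # never changes the live result

print(live.action, shadow.action)  # prints "allow human_review"
```

The request is still allowed in production, while the simulated decision shows that the stricter candidate would have escalated it.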

Typical Workflow

Start from a versioned baseline bundle for dev, staging, or prod.
Create a candidate bundle with one concrete change.
Run simulate() for representative live-like requests.
Run replay() against historical trace-linked events.
Use compare_policies() to quantify action changes.
Record review outcomes for false positives and justified escalations.
Promote the candidate only after the calibration data is acceptable.
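
The replay and comparison steps of this workflow might look roughly like the sketch below. The `replay` and `compare_policies` functions here are simplified local stand-ins, and the event fields, the bundle shape, and the `MAX_CHANGE_RATE` promotion gate are all assumptions made for the example rather than documented sf_policy behavior.

```python
from collections import Counter

# Hypothetical stand-ins for illustration only; not the real sf_policy API.

def decide(bundle: dict, event: dict) -> str:
    """Toy rule: escalate when grounding confidence is below the threshold."""
    if event["grounding_confidence"] < bundle["grounding_threshold"]:
        return "human_review"
    return "allow"

def replay(candidate: dict, events: list) -> list:
    """Run historical events through a candidate bundle."""
    return [decide(candidate, event) for event in events]

def compare_policies(baseline_actions: list, candidate_actions: list) -> dict:
    """Summarize how many actions changed and which transitions occurred."""
    transitions = Counter(
        (old, new)
        for old, new in zip(baseline_actions, candidate_actions)
        if old != new
    )
    total = len(baseline_actions)
    changed = sum(transitions.values())
    return {
        "total": total,
        "changed": changed,
        "change_rate": changed / total if total else 0.0,
        "transitions": dict(transitions),
    }

# Historical events with the action production actually took (made-up data).
events = [
    {"trace_id": "t1", "grounding_confidence": 0.90, "production_action": "allow"},
    {"trace_id": "t2", "grounding_confidence": 0.60, "production_action": "allow"},
    {"trace_id": "t3", "grounding_confidence": 0.30, "production_action": "human_review"},
    {"trace_id": "t4", "grounding_confidence": 0.65, "production_action": "allow"},
]
candidate = {"version": "prod-v2-candidate", "grounding_threshold": 0.7}

baseline_actions = [event["production_action"] for event in events]
candidate_actions = replay(candidate, events)
summary = compare_policies(baseline_actions, candidate_actions)

MAX_CHANGE_RATE = 0.25  # hypothetical promotion gate, chosen for this example
ready_to_promote = summary["change_rate"] <= MAX_CHANGE_RATE
print(summary, ready_to_promote)
```

In this made-up data set the candidate changes two of four decisions from allow to human_review, so the example gate holds the promotion until reviewers have looked at those cases.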

Tuning Targets

The main GA calibration knobs are:

grounding confidence thresholds
scope violation handling
RBAC violation handling
explainability coverage rules

Examples:

move low-grounding responses from allow+log to human_review
tighten a scope rule from human_review to block
relax an RBAC rule that is producing known false positives in staging
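
As a sketch of what "one concrete change" can mean in practice, the example below copies a baseline bundle, changes a single field, and verifies that nothing else moved. The bundle layout, rule names, and field names are hypothetical; the real bundle format may differ.

```python
import copy

# Hypothetical bundle layout for illustration; the real format may differ.
baseline = {
    "version": "staging-v4",
    "rules": {
        "grounding": {"min_confidence": 0.5, "below_threshold_action": "allow+log"},
        "scope": {"violation_action": "human_review"},
        "rbac": {"violation_action": "block"},
    },
}

def make_candidate(base: dict, version: str, rule: str, field: str, value) -> dict:
    """Copy the baseline and change exactly one rule field."""
    candidate = copy.deepcopy(base)  # never mutate the versioned baseline
    candidate["version"] = version
    candidate["rules"][rule][field] = value
    return candidate

def diff_rules(old: dict, new: dict) -> list:
    """List (rule, field, old_value, new_value) for every changed field."""
    changes = []
    for rule, fields in old["rules"].items():
        for field, old_value in fields.items():
            new_value = new["rules"][rule][field]
            if new_value != old_value:
                changes.append((rule, field, old_value, new_value))
    return changes

# First example from the list above: low-grounding responses go to human review.
candidate = make_candidate(
    baseline, "staging-v5-candidate",
    "grounding", "below_threshold_action", "human_review",
)
changes = diff_rules(baseline, candidate)
print(changes)
```

Keeping each candidate to a single diff entry makes it easier to attribute any shift in replay results to that one change.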

Production Separation

Replay and simulation records are intentionally separate from production evidence:

production decisions describe what actually governed the request
simulation decisions describe what a candidate policy would have done
replay outputs are for tuning and approval, not for claiming live enforcement

That separation matters for audits and incident review. Operators should be able to tell whether a result came from live control enforcement or from policy testing.
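
One way to keep that distinction visible is to tag every stored decision with its origin and to filter on that tag when assembling audit evidence. The sketch below assumes a hypothetical `DecisionRecord` shape and `Source` enum; neither is taken from sf_policy.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical record shape for illustration; not the real sf_policy schema.

class Source(Enum):
    PRODUCTION = "production"
    SIMULATION = "simulation"
    REPLAY = "replay"

@dataclass(frozen=True)
class DecisionRecord:
    trace_id: str
    action: str
    bundle_version: str
    source: Source

def enforcement_evidence(records: list) -> list:
    """Return only records that describe live enforcement."""
    return [r for r in records if r.source is Source.PRODUCTION]

records = [
    DecisionRecord("t1", "allow", "prod-v1", Source.PRODUCTION),
    DecisionRecord("t1", "human_review", "prod-v2-candidate", Source.SIMULATION),
    DecisionRecord("t1", "human_review", "prod-v2-candidate", Source.REPLAY),
]

evidence = enforcement_evidence(records)
print([(r.trace_id, r.action, r.source.value) for r in evidence])
```

The same trace can carry several records, but only the production-tagged one says what actually governed the request.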

False-Positive Review Loop

The practical loop is:

review a blocked or escalated decision
classify it as justified or false positive
persist that review with record_review()
use suggest_threshold() and comparison output to adjust the candidate bundle
rerun replay or simulation before activation
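
To make the loop concrete, the sketch below shows one simple heuristic for turning reviewed outcomes into a threshold proposal. It is not the algorithm behind suggest_threshold(); the function here is a local stand-in, and the review tuples, the verdict labels, and the error-counting rule are assumptions chosen for illustration.

```python
# Hypothetical heuristic for illustration; not the real suggest_threshold().

def errors_at(threshold: float, reviews: list) -> int:
    """Count reviewed decisions the threshold would get wrong."""
    errors = 0
    for confidence, verdict in reviews:
        escalated = confidence < threshold
        if escalated and verdict == "false_positive":
            errors += 1  # would still escalate a known false positive
        elif not escalated and verdict == "justified":
            errors += 1  # would stop escalating a justified case
    return errors

def suggest_threshold(current: float, reviews: list) -> float:
    """Pick the observed value with the fewest errors; prefer stricter on ties."""
    candidates = sorted({confidence for confidence, _ in reviews} | {current})
    return min(candidates, key=lambda t: (errors_at(t, reviews), -t))

# (grounding confidence, analyst verdict) for escalated decisions (made-up data).
reviews = [
    (0.30, "justified"),
    (0.42, "justified"),
    (0.48, "justified"),
    (0.55, "false_positive"),
    (0.61, "false_positive"),
    (0.66, "false_positive"),
]

current_threshold = 0.7
suggested = suggest_threshold(current_threshold, reviews)
print(current_threshold, "->", suggested)
```

Because the reviews cover only decisions that were already escalated, this heuristic can propose keeping or lowering the threshold but has no evidence for raising it. Any suggested value should still go through replay or simulation before activation.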
