
spanforge.baseline & spanforge.drift

Overview

spanforge.baseline builds a statistical summary of an agent's historical behaviour. spanforge.drift uses that summary at runtime to detect statistically significant deviations and emit drift.* events.


DistributionStats

from spanforge.baseline import DistributionStats

Frozen dataclass that holds a numeric distribution snapshot.

| Attribute | Type | Description |
| --- | --- | --- |
| `mean` | `float` | Arithmetic mean |
| `stddev` | `float` | Sample standard deviation (`0.0` when `sample_count < 2`) |
| `p50` | `float` | 50th percentile (median) |
| `p95` | `float` | 95th percentile |
| `p99` | `float` | 99th percentile |
| `sample_count` | `int` | Number of observations |

Factory

stats = DistributionStats.from_samples([10.2, 15.3, 11.8, ...])

Serialisation

d = stats.to_dict()          # → dict
stats2 = DistributionStats.from_dict(d)
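The fields in the table above have conventional definitions. The following is an illustrative sketch in plain Python, not the library's implementation — in particular, the nearest-rank percentile convention and the `stddev` fallback for small samples are assumptions based on the attribute descriptions:

```python
import statistics

def percentile(sorted_samples, pct):
    # Nearest-rank percentile over pre-sorted samples (one plausible convention).
    if not sorted_samples:
        raise ValueError("need at least one sample")
    k = round(pct / 100 * (len(sorted_samples) - 1))
    return sorted_samples[max(0, min(len(sorted_samples) - 1, k))]

def snapshot(samples):
    # Compute the fields a DistributionStats-style snapshot holds.
    s = sorted(samples)
    n = len(s)
    return {
        "mean": statistics.fmean(s),
        # Sample stddev is undefined for n < 2; the table above says it falls back to 0.0.
        "stddev": statistics.stdev(s) if n >= 2 else 0.0,
        "p50": percentile(s, 50),
        "p95": percentile(s, 95),
        "p99": percentile(s, 99),
        "sample_count": n,
    }

stats = snapshot([10.2, 15.3, 11.8, 9.7, 14.1])
```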

BehaviouralBaseline

from spanforge.baseline import BehaviouralBaseline

Dataclass that captures the typical behaviour of a single agent across a traffic window.

Note — LLM spans only. BehaviouralBaseline tracks tokens and confidence_by_type only for LLM spans (event types matching llm.trace.*). Tool-call and decision spans contribute to latency_by_operation and tool_rate_per_hour but never populate the tokens distribution. Passing non-LLM spans to from_events() is intentional and safe — the baseline simply accumulates the metrics each span type supports.

| Attribute | Type | Description |
| --- | --- | --- |
| `tokens` | `DistributionStats` | Token count distribution across all LLM spans |
| `confidence_by_type` | `dict[str, DistributionStats]` | Per-decision-type confidence score distributions |
| `latency_by_operation` | `dict[str, DistributionStats]` | Per-operation latency (ms) distributions |
| `tool_rate_per_hour` | `dict[str, float]` | Observed tool invocation rate per tool name (calls/h) |
| `decision_rate_per_hour` | `dict[str, float]` | Observed decision rate per decision type (decisions/h) |
| `event_count` | `int` | Number of events consumed to build this baseline |
| `window_seconds` | `float` | Duration of the baseline traffic window (seconds) |
| `recorded_at` | `str` | ISO 8601 UTC timestamp when the baseline was created |
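The per-hour rates are counts normalised by the window length. A minimal sketch of that arithmetic, assuming each rate is simply the invocation count divided by the window expressed in hours (the actual bucketing inside `from_events()` may differ):

```python
from collections import Counter

def rates_per_hour(names, window_seconds):
    # Turn a sequence of tool/decision names into calls-per-hour rates.
    hours = window_seconds / 3600.0
    return {name: count / hours for name, count in Counter(names).items()}

# Hypothetical tool names, observed over a 2 h window.
tool_calls = ["search", "search", "calculator"]
rates = rates_per_hour(tool_calls, window_seconds=7200.0)
```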

Building from events

from spanforge.baseline import BehaviouralBaseline

baseline = BehaviouralBaseline.from_events(
    events,               # Iterable[Event]
    max_events=1000,      # optional upper bound on events consumed
    window_seconds=86400.0,  # 24 h — denominator for rate calculations
)

Persistence

baseline.save("baseline.json")
baseline2 = BehaviouralBaseline.load("baseline.json")

# JSON round-trip
json_str = baseline.to_json()
baseline3 = BehaviouralBaseline.from_json(json_str)

DriftDetector

from spanforge.drift import DriftDetector

Maintains a sliding window of recent metric observations and compares them against a BehaviouralBaseline using Z-score and KL-divergence statistics.
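The two statistics can be pictured in isolation. This is an illustrative sketch, not the detector's internal code: a Z-score measures how many baseline standard deviations a single observation sits from the baseline mean, and KL divergence compares a discretised distribution of recent observations against the baseline (the smoothing constant and binning here are assumptions):

```python
import math

def z_score(observed, baseline_mean, baseline_stddev):
    # How many baseline standard deviations the observation is from the mean.
    if baseline_stddev == 0.0:
        return None  # undefined for a degenerate baseline
    return (observed - baseline_mean) / baseline_stddev

def kl_divergence(p, q, eps=1e-9):
    # D_KL(P || Q) for two discrete distributions over the same bins,
    # with a small epsilon to avoid log(0) on empty bins.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

z = z_score(250.0, baseline_mean=100.0, baseline_stddev=40.0)  # 3.75 sigma
drifted = z is not None and abs(z) > 3.0  # the default z_threshold
```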

Constructor

detector = DriftDetector(
    baseline,                        # BehaviouralBaseline
    agent_id="my-agent",
    window_size=500,                 # observations in the sliding window (default 500)
    z_threshold=3.0,                 # σ deviation before alert
    kl_threshold=0.5,                # KL-divergence threshold
    window_seconds=3600,             # rate-calculation window in seconds
    auto_emit=True,                  # emit drift.* events automatically
    metric_ttl_seconds=86400,        # seconds before an idle metric window expires
)

Constructor parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `baseline` | `BehaviouralBaseline` | (required) | Baseline to compare observations against. |
| `agent_id` | `str` | (required) | Non-empty string identifying the agent being monitored. |
| `window_size` | `int` | `500` | Maximum number of observations kept in the rolling window. |
| `z_threshold` | `float` | `3.0` | Z-score above which a metric is considered drifted. Must be finite and positive. |
| `kl_threshold` | `float` | `0.5` | KL-divergence above which a distribution shift is flagged. |
| `window_seconds` | `int` | `3600` | Time window (seconds) used for rate calculations. Must be > 0. |
| `auto_emit` | `bool` | `True` | When `True`, `DriftDetector` automatically emits `drift.*` events on breach. |
| `metric_ttl_seconds` | `int` | `86400` | Seconds of inactivity before an idle metric window is evicted. |
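The interaction between `window_size` and `metric_ttl_seconds` can be pictured with a plain bounded deque. A sketch under stated assumptions — the detector's internal bookkeeping is not documented here, and the `MetricWindow` class below is purely illustrative:

```python
import time
from collections import deque

class MetricWindow:
    """Bounded observation window with idle expiry, mirroring
    the window_size / metric_ttl_seconds semantics."""

    def __init__(self, window_size=500, ttl_seconds=86400):
        self.values = deque(maxlen=window_size)  # oldest observations fall off automatically
        self.ttl_seconds = ttl_seconds
        self.last_seen = time.monotonic()

    def add(self, value, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_seen > self.ttl_seconds:
            self.values.clear()  # idle for longer than the TTL: evict the stale window
        self.values.append(value)
        self.last_seen = now

w = MetricWindow(window_size=3, ttl_seconds=60)
for i, v in enumerate([1.0, 2.0, 3.0, 4.0]):
    w.add(v, now=float(i))
# window_size=3, so only the newest three observations remain
```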

Recording events

results: list[DriftResult] = detector.record(event)

Returns zero or more DriftResult objects — one per metric that breached a threshold in this observation.

DriftResult

| Attribute | Type | Description |
| --- | --- | --- |
| `metric_name` | `str` | Metric that drifted |
| `status` | `str` | `"alert"` or `"warn"` |
| `observed_value` | `float` | The triggering observation |
| `baseline_mean` | `float` | Baseline `DistributionStats.mean` for this metric |
| `baseline_stddev` | `float` | Baseline `DistributionStats.stddev` |
| `zscore` | `float \| None` | Z-score of the observation |
| `kl_divergence` | `float \| None` | KL divergence of window vs. baseline |
| `window_size` | `int` | Current window size at detection time |
| `event_id` | `str` | ID of the triggering event |
| `agent_id` | `str` | Agent the detector is tracking |

Inspection helpers

# Returns (mean, stddev, n) for the current window
stats = detector.window_stats("latency_ms")

# True if the given metric is currently in breach
if detector.in_breach("latency_ms"):
    ...

# Reset one or all windows
detector.reset_window("latency_ms")
detector.reset_window()          # reset all metrics

Emitting drift events

from spanforge._stream import emit_rfc_event
from spanforge.types import EventType

for result in detector.record(event):
    emit_rfc_event(
        EventType("drift." + result.status),
        result.__dict__,
    )

When auto_emit=True (the default), DriftDetector does this automatically.