spanforge.metrics

Batch aggregation API for computing structured metrics from collections of spanforge events.


aggregate

def aggregate(events: Iterable[Event]) -> MetricsSummary

Compute a fully-populated MetricsSummary from any iterable of Event objects (file stream, in-memory list, or TraceStore result).

import spanforge
from spanforge.stream import EventStream

# Read every event from a JSONL file, then aggregate them in a single call.
events = list(EventStream.from_file("events.jsonl"))
summary = spanforge.metrics.aggregate(events)

MetricsSummary

@dataclass
class MetricsSummary:
    trace_count: int
    span_count: int
    agent_success_rate: float
    avg_trace_duration_ms: float
    p50_trace_duration_ms: float
    p95_trace_duration_ms: float
    total_input_tokens: int
    total_output_tokens: int
    total_cost_usd: float
    llm_latency_ms: LatencyStats
    tool_failure_rate: float
    token_usage_by_model: dict[str, TokenUsage]
    cost_by_model: dict[str, float]
    drift_incidents: int
    confidence_trend: list[float]
    baseline_deviation_pct: float

| Field | Description |
| --- | --- |
| trace_count | Number of distinct trace IDs in the input |
| span_count | Total number of spans |
| agent_success_rate | Fraction of traces that contain no error spans (0–1) |
| avg_trace_duration_ms | Mean end-to-end duration across all traces |
| p50_trace_duration_ms | Median trace duration |
| p95_trace_duration_ms | 95th-percentile trace duration |
| total_input_tokens | Sum of all input_tokens values across LLM spans |
| total_output_tokens | Sum of all output_tokens values |
| total_cost_usd | Sum of all cost_usd values |
| llm_latency_ms | LatencyStats(min, max, p50, p95, p99) for LLM spans |
| tool_failure_rate | Fraction of tool spans with status="error" (0–1) |
| token_usage_by_model | Per-model TokenUsage aggregate |
| cost_by_model | Per-model USD total |
| drift_incidents | Count of drift.threshold_breach events in the stream |
| confidence_trend | Rolling mean confidence per 50-event window; empty if no confidence.sample events |
| baseline_deviation_pct | Coefficient of variation of confidence scores (stddev / mean × 100); 0.0 when unavailable |
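
The two confidence fields are the least obvious ones above, so here is a minimal sketch of how they could be derived from a plain list of confidence scores. It is not spanforge's implementation: the function names are hypothetical, the windows are assumed to be consecutive and non-overlapping (with a shorter final window if the count is not a multiple of 50), and the population standard deviation is assumed.

```python
from statistics import mean, pstdev

WINDOW = 50  # window size taken from the confidence_trend description


def confidence_trend(scores: list[float], window: int = WINDOW) -> list[float]:
    """Mean score per consecutive, non-overlapping window (an assumption)."""
    return [mean(scores[i:i + window]) for i in range(0, len(scores), window)]


def baseline_deviation_pct(scores: list[float]) -> float:
    """Coefficient of variation in percent; 0.0 when it cannot be computed."""
    if len(scores) < 2:
        return 0.0
    avg = mean(scores)
    if avg == 0:
        return 0.0  # avoid dividing by a zero mean
    # Population standard deviation is assumed here, not confirmed by the docs.
    return pstdev(scores) / avg * 100


# Hypothetical input: 50 samples at 0.8 followed by 50 samples at 0.6.
scores = [0.8] * 50 + [0.6] * 50
trend = confidence_trend(scores)            # one mean per 50-score window
deviation = baseline_deviation_pct(scores)  # stddev 0.1 over mean 0.7, in percent
```

With this input the trend has two entries, one per window, and the deviation is roughly 14.3 percent.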

LatencyStats

@dataclass
class LatencyStats:
    min: float
    max: float
    p50: float
    p95: float
    p99: float

Single-metric helpers

All helpers accept the same Iterable[Event] as aggregate() and compute their result in a single pass. Each returns a scalar, a simple dataclass, or, in the case of token_usage, a per-model dict.

agent_success_rate(events) -> float

Fraction of traces (by trace_id) that contain no event with status="error" or status="timeout".
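
The grouping rule can be sketched as follows. This is an illustration rather than the library's code: events are modelled as plain dicts with hypothetical "trace_id" and "status" keys instead of real Event objects, and returning 0.0 for empty input is an assumption.

```python
from collections import defaultdict

FAILURE_STATUSES = {"error", "timeout"}


def success_rate(events: list[dict]) -> float:
    """Fraction of trace IDs with no failing event (0.0 for empty input, assumed)."""
    failed_by_trace: dict[str, bool] = defaultdict(bool)
    for event in events:
        # A single failing event marks the whole trace as failed.
        failed_by_trace[event["trace_id"]] |= event.get("status") in FAILURE_STATUSES
    if not failed_by_trace:
        return 0.0
    succeeded = sum(1 for failed in failed_by_trace.values() if not failed)
    return succeeded / len(failed_by_trace)


# Hypothetical events: t1 and t4 succeed, t2 has an error, t3 times out.
sample = [
    {"trace_id": "t1", "status": "ok"},
    {"trace_id": "t1", "status": "ok"},
    {"trace_id": "t2", "status": "ok"},
    {"trace_id": "t2", "status": "error"},
    {"trace_id": "t3", "status": "timeout"},
    {"trace_id": "t4", "status": "ok"},
]
rate = success_rate(sample)  # 2 of 4 traces are clean
```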

llm_latency(events) -> LatencyStats

Latency percentiles computed from all llm_call span durations.
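
The reference does not say which percentile method is used. The sketch below assumes the nearest-rank method and uses a stand-in dataclass, LatencyStatsSketch, that mirrors the fields of LatencyStats; both the method and the helper names are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class LatencyStatsSketch:  # stand-in mirroring the LatencyStats fields
    min: float
    max: float
    p50: float
    p95: float
    p99: float


def nearest_rank(ordered: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at 1-based rank ceil(pct * n / 100)."""
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_stats(durations_ms: list[float]) -> LatencyStatsSketch:
    if not durations_ms:
        raise ValueError("at least one duration is required")
    ordered = sorted(durations_ms)
    return LatencyStatsSketch(
        min=ordered[0],
        max=ordered[-1],
        p50=nearest_rank(ordered, 50),
        p95=nearest_rank(ordered, 95),
        p99=nearest_rank(ordered, 99),
    )


# Hypothetical durations of 1 ms through 100 ms.
stats = latency_stats([float(ms) for ms in range(1, 101)])
```

Interpolating methods give slightly different values on small samples, so results from this sketch may not match the library's output exactly.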

tool_failure_rate(events) -> float

Fraction of tool_call spans with status="error".
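
Unlike agent_success_rate, this rate is described per span and counts only status="error", so a timed-out tool span would not be counted as a failure under that reading. A minimal sketch, using plain dicts with a hypothetical "kind" key to mark the span type and assuming 0.0 when there are no tool spans:

```python
def failure_rate(events: list[dict]) -> float:
    """Share of tool_call spans whose status is "error" (0.0 if none, assumed)."""
    tool_spans = [e for e in events if e.get("kind") == "tool_call"]
    if not tool_spans:
        return 0.0
    failed = sum(1 for e in tool_spans if e.get("status") == "error")
    return failed / len(tool_spans)


# Hypothetical events: three tool spans (one error) and one LLM span.
sample = [
    {"kind": "tool_call", "status": "ok"},
    {"kind": "tool_call", "status": "error"},
    {"kind": "tool_call", "status": "timeout"},  # not counted as a failure here
    {"kind": "llm_call", "status": "error"},     # ignored: not a tool span
]
rate = failure_rate(sample)  # 1 failed span out of 3 tool spans
```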

token_usage(events) -> dict[str, TokenUsage]

Per-model TokenUsage aggregate keyed by ModelInfo.name.
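
The fields of TokenUsage are not listed in this reference. The sketch below assumes a stand-in, TokenUsageSketch, with input_tokens and output_tokens counters, and models events as plain dicts with a hypothetical "model" key in place of ModelInfo.name.

```python
from dataclasses import dataclass


@dataclass
class TokenUsageSketch:  # hypothetical stand-in for TokenUsage
    input_tokens: int = 0
    output_tokens: int = 0


def usage_by_model(events: list[dict]) -> dict[str, TokenUsageSketch]:
    """Sum token counts per model name, skipping events without a model."""
    totals: dict[str, TokenUsageSketch] = {}
    for event in events:
        model = event.get("model")
        if model is None:
            continue
        usage = totals.setdefault(model, TokenUsageSketch())
        usage.input_tokens += event.get("input_tokens", 0)
        usage.output_tokens += event.get("output_tokens", 0)
    return totals


# Hypothetical events: two calls to model-a, one to model-b, one non-LLM event.
sample = [
    {"model": "model-a", "input_tokens": 10, "output_tokens": 5},
    {"model": "model-a", "input_tokens": 20, "output_tokens": 15},
    {"model": "model-b", "input_tokens": 7, "output_tokens": 3},
    {"status": "ok"},
]
usage = usage_by_model(sample)
```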


Usage with TraceStore

import spanforge

# Enable the in-memory trace store so the last agent run can be retrieved.
spanforge.configure(exporter="console", enable_trace_store=True)

# ... run your agent ...

# Fall back to an empty list if no run has been recorded.
events = spanforge.get_last_agent_run() or []
summary = spanforge.metrics.aggregate(events)