Alert Routing Service (sf-alert)

Added in: 2.0.6 (Phase 7)
Module: spanforge.sdk.alert
Singleton: spanforge.sdk.sf_alert

The sf-alert service provides topic-based alert routing with deduplication, CRITICAL escalation, per-project rate limiting, maintenance-window suppression, and six production-ready sink integrations. Every dispatch is audit-logged to sf_audit schema spanforge.alert.v1 on a best-effort basis.

Installation

sf-alert ships with the core package — no extra dependencies required. Sink credentials are loaded from environment variables at startup.

pip install spanforge

Getting started

from spanforge.sdk import sf_alert

# Publish a CRITICAL drift alert
result = sf_alert.publish(
    "halluccheck.drift.red",
    {"model": "gpt-4o", "drift_score": 0.91, "threshold": 0.80},
    severity="critical",
    project_id="proj-abc123",
)
print(result.alert_id)    # UUID4, e.g. "3f9a7e12-..."
print(result.suppressed)  # False — alert was dispatched

# Acknowledge to cancel the 15-min escalation timer
sf_alert.acknowledge(result.alert_id)

Configuration via environment variables

Set the following variables before starting your application:

# Sink credentials
export SPANFORGE_ALERT_SLACK_WEBHOOK="https://hooks.slack.com/services/..."
export SPANFORGE_ALERT_TEAMS_WEBHOOK="https://xxx.webhook.office.com/webhookb2/..."
export SPANFORGE_ALERT_PAGERDUTY_KEY="pd-routing-key"
export SPANFORGE_ALERT_OPSGENIE_KEY="og-api-key"
export SPANFORGE_ALERT_OPSGENIE_REGION="us"   # or "eu"
export SPANFORGE_ALERT_VICTOROPS_URL="https://alert.victorops.com/integrations/..."
export SPANFORGE_ALERT_WEBHOOK_URL="https://hooks.example.com/alert"
export SPANFORGE_ALERT_WEBHOOK_SECRET="my-hmac-secret"

# Tuning
export SPANFORGE_ALERT_DEDUP_SECONDS=300     # default: 300
export SPANFORGE_ALERT_RATE_LIMIT=60         # default: 60 alerts/min/project
export SPANFORGE_ALERT_ESCALATION_WAIT=900   # default: 900 s (15 min)

Topic registry

Eight built-in topics from the HallucCheck event taxonomy are pre-registered:

Topic	Default severity
`halluccheck.drift.red`	`critical`
`halluccheck.drift.amber`	`warning`
`halluccheck.pii.detected`	`high`
`halluccheck.cost.exceeded`	`warning`
`halluccheck.latency.breach`	`warning`
`halluccheck.audit.gap`	`high`
`halluccheck.security.violation`	`critical`
`halluccheck.compliance.breach`	`critical`

Registering custom topics

sf_alert.register_topic(
    "myapp.pipeline.failed",
    description="ML pipeline execution failure",
    default_severity="high",
    runbook_url="https://runbooks.example.com/pipeline",
    dedup_window_seconds=600.0,  # 10-minute per-topic dedup window
)

Publishing to an unregistered topic logs a WARNING and routes to all sinks.

Deduplication

The same (topic, project_id) pair is suppressed for dedup_window_seconds (default: 300 s) after the first dispatch. Per-topic overrides take precedence over the client-wide setting.

# First publish → dispatched
r1 = sf_alert.publish("halluccheck.drift.red", {}, project_id="proj-1")
assert not r1.suppressed

# Second publish within 5 minutes → suppressed
r2 = sf_alert.publish("halluccheck.drift.red", {}, project_id="proj-1")
assert r2.suppressed

Alert grouping

Multiple alerts sharing the same (topic_prefix, project_id) — where prefix is everything before the last . — are coalesced within a 2-minute window. The first alert is dispatched immediately; subsequent alerts are buffered and sent as a single notification when the timer fires.

# Both share prefix "halluccheck.drift" → r2 buffered
r1 = sf_alert.publish("halluccheck.drift.red", {"i": 1})   # dispatched
r2 = sf_alert.publish("halluccheck.drift.amber", {"i": 2}) # buffered
assert r2.routed_to == []  # deferred until flush

CRITICAL escalation

CRITICAL alerts schedule an escalation timer (default: 900 s = 15 min). If the alert is not acknowledged before the timer fires, it is re-dispatched with an [ESCALATED] title prefix.

result = sf_alert.publish(
    "halluccheck.security.violation",
    {"attacker_ip": "203.0.113.1"},
    severity="critical",
)

# Cancel escalation after investigating
sf_alert.acknowledge(result.alert_id)

Maintenance windows

Suppress all alerts for a project during a planned maintenance period:

from datetime import datetime, timezone, timedelta

# Suppress for 2 hours
sf_alert.set_maintenance_window(
    project_id="proj-abc",
    start=datetime.now(timezone.utc),
    end=datetime.now(timezone.utc) + timedelta(hours=2),
)

# Alert during window → suppressed and audit-logged
result = sf_alert.publish("halluccheck.drift.red", {}, project_id="proj-abc")
assert result.suppressed

# Remove when maintenance is over
removed = sf_alert.remove_maintenance_windows("proj-abc")
print(f"Removed {removed} windows")

Alert history

from datetime import datetime, timezone, timedelta

records = sf_alert.get_alert_history(
    project_id="proj-abc",
    topic="halluccheck.drift.red",
    from_dt=datetime.now(timezone.utc) - timedelta(hours=1),
    status="open",
    limit=50,
)

for r in records:
    print(r.alert_id, r.severity, r.timestamp, r.sinks_notified)

Adding sinks at runtime

from spanforge.sdk.alert import OpsGenieAlerter, WebhookAlerter

# Add OpsGenie
sf_alert.add_sink(OpsGenieAlerter(api_key="og-key", region="eu"), "opsgenie-eu")

# Add a generic HMAC-signed webhook
sf_alert.add_sink(
    WebhookAlerter(url="https://hooks.example.com/alert", secret="secret"),
    "custom-webhook",
)

Rate limiting

Per-project sliding-window rate limiting (default: 60 alerts/minute). When exceeded in normal mode, the alert is suppressed and logged. In strict mode (local_fallback_enabled=False), SFAlertRateLimitedError is raised.

from spanforge.sdk._base import SFClientConfig
from spanforge.sdk.alert import SFAlertClient

# Strict mode: raise instead of suppress
client = SFAlertClient(
    SFClientConfig(local_fallback_enabled=False),
    rate_limit_per_minute=10,
)

Sink security

Sink	Secret handling
`WebhookAlerter`	`secret` field has `repr=False`; HMAC computed with `hmac.new()`; constant-time compare
`OpsGenieAlerter`	`api_key` has `repr=False`
`IncidentIOAlerter`	`api_key` has `repr=False`
`SMSAlerter`	`auth_token` has `repr=False`
All URL-based sinks	URLs validated by SSRF guard (rejects private/loopback IPs, non-HTTPS)

Circuit breakers

Each sink has an independent _CircuitBreaker (5-failure threshold, 30 s auto-reset). A failing sink is bypassed without blocking other sinks.

status = sf_alert.get_status()
print(status.sink_count)  # number of registered sinks
print(status.healthy)     # True when worker thread is alive

Graceful shutdown

# Drains queue, cancels escalation timers, stops worker thread
sf_alert.shutdown(timeout=10.0)