1.1 — What is AI observability and why it matters | The SpanForge Book

Modern software systems are observable: mature tooling exists to log, measure, and trace what they do. Modern AI systems, despite appearing similar on the surface, are not, at least not with the same assumptions, tools, or guarantees. This distinction is where SpanForge begins.

In a conventional service, a request produces a predictable outcome. Failures are explicit: an exception is thrown, a timeout occurs, a dependency returns an error. Observability systems are built around these properties. They capture logs, aggregate metrics, and trace execution paths under the assumption that behavior is largely deterministic and reproducible.

AI systems violate these assumptions at a fundamental level. The same input may produce different outputs. A system may return a response within latency targets and still be factually incorrect. Failures are often semantic rather than operational: a hallucinated clause, a misleading summary, a subtly incorrect inference. Nothing crashes. Nothing alerts. And yet, something has gone wrong.

SpanForge is designed to make such failures visible—not by inferring them after the fact, but by capturing system behavior in a form that can be reconstructed, inspected, and reasoned about.

To understand how this is done, we begin with the primitives.

From Request to Reconstruction

Consider a contract analysis assistant. A user submits a request:

"Summarize this contract and highlight potential risks."

The system retrieves relevant clauses, constructs a prompt, invokes a language model, evaluates the response against guardrails, and returns a summary.

Now consider a failure. The system returns a well-formed response, within acceptable latency, but includes a clause that does not exist in the source document. A hallucination has occurred.

Operationally, nothing is wrong. The request succeeded. Metrics remain healthy. Yet the output is incorrect in a way that matters.
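The gap between operational health and semantic correctness can be made concrete with a small sketch. Everything here is an assumption made for illustration: the `SummaryResponse` shape, the clause IDs, and the 2000 ms latency budget are hypothetical and are not part of SpanForge. The sketch only shows that a check on status and latency passes while a check against the source document fails.

```python
from dataclasses import dataclass


@dataclass
class SummaryResponse:
    """Hypothetical response shape, invented for this illustration."""
    status: str                   # transport-level outcome, e.g. "ok"
    latency_ms: float             # end-to-end latency in milliseconds
    cited_clause_ids: list[str]   # clauses the summary claims to rely on


def operationally_healthy(resp: SummaryResponse, latency_budget_ms: float = 2000.0) -> bool:
    """The question conventional monitoring asks: did the request succeed in time?"""
    return resp.status == "ok" and resp.latency_ms <= latency_budget_ms


def ungrounded_clauses(resp: SummaryResponse, source_clause_ids: set[str]) -> list[str]:
    """Return cited clause IDs that do not exist in the source document."""
    return [cid for cid in resp.cited_clause_ids if cid not in source_clause_ids]


# Assumed example data: the contract contains three clauses,
# and the summary cites one clause that is not among them.
source_clause_ids = {"1.1", "4.2", "7.3"}
response = SummaryResponse(status="ok", latency_ms=840.0, cited_clause_ids=["4.2", "9.7"])

healthy = operationally_healthy(response)
invented = ungrounded_clauses(response, source_clause_ids)
print(healthy, invented)  # True ['9.7']
```

The first check is what dashboards and alerts typically see. The second is only possible if the system recorded which clauses were in the source and which ones the response cited.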

The problem is no longer availability. It is explainability.

SpanForge addresses this by transforming execution into structured events, connecting them through a trace, and preserving them in a form that can be queried later. The result is not just a record of activity, but a system that can answer a precise question:

What happened—and why?
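As a rough sketch of that idea, the pipeline from the contract example can emit one structured event per step, tag each event with a shared trace ID, and keep the events where they can be queried later. This is not SpanForge's API: `Event`, `TraceStore`, `handle_request`, `explain`, and the event names are hypothetical, and the retrieval and model steps are stand-ins that simulate a hallucinated clause.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    """One structured record of a pipeline step (hypothetical schema)."""
    trace_id: str
    name: str
    attributes: dict
    timestamp: float = field(default_factory=time.time)


class TraceStore:
    """In-memory stand-in for a durable, queryable event store."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def record(self, trace_id: str, name: str, **attributes) -> Event:
        event = Event(trace_id, name, attributes)
        self._events.append(event)
        return event

    def events_for(self, trace_id: str) -> list[Event]:
        return [e for e in self._events if e.trace_id == trace_id]

    def export_jsonl(self) -> str:
        """Serialize every event as one JSON object per line."""
        return "\n".join(json.dumps(asdict(e)) for e in self._events)


def handle_request(store: TraceStore, clauses: dict[str, str], question: str) -> str:
    """Run a simulated pipeline, recording each step under one trace ID."""
    trace_id = uuid.uuid4().hex
    store.record(trace_id, "request.received", question=question)

    retrieved = sorted(clauses)  # stand-in retrieval: take every clause ID
    store.record(trace_id, "retrieval.completed", clause_ids=retrieved)

    # Stand-in model call: cites one real clause and one that does not exist.
    cited = retrieved[:1] + ["9.7"]
    store.record(trace_id, "model.responded", cited_clause_ids=cited)
    return trace_id


def explain(store: TraceStore, trace_id: str) -> list[str]:
    """Reconstruct the trace and list cited clauses that were never retrieved."""
    by_name = {e.name: e for e in store.events_for(trace_id)}
    retrieved = set(by_name["retrieval.completed"].attributes["clause_ids"])
    cited = by_name["model.responded"].attributes["cited_clause_ids"]
    return [cid for cid in cited if cid not in retrieved]


store = TraceStore()
trace_id = handle_request(
    store,
    {"1.1": "Term of agreement", "4.2": "Limitation of liability"},
    "Summarize this contract and highlight potential risks.",
)
print(explain(store, trace_id))  # ['9.7']
```

The design choice worth noting is that `explain` does no inference of its own. It answers the question only because the retrieval and model steps each left an event behind, linked by the same `trace_id`.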