llm.eval — Scoring & Evaluation

Auto-documented module: spanforge.namespaces.eval_

The llm.eval.* namespace records evaluation scores, regression detections, and evaluation scenario lifecycle events (RFC-0001 §5).

Payload classes

Class                          Event type                    Description
-----------------------------  ----------------------------  ---------------------------------------------
EvalScoreRecordedPayload       llm.eval.score.recorded       A numeric score was recorded for a metric
EvalRegressionDetectedPayload  llm.eval.regression.detected  A metric score crossed a regression threshold
EvalScenarioStartedPayload     llm.eval.scenario.started     An evaluation scenario started
EvalScenarioCompletedPayload   llm.eval.scenario.completed   An evaluation scenario completed
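
The table does not say how a score is judged to have "crossed a regression threshold". As a rough illustration only, one plausible reading is a comparison of a new score against a baseline. The function name `is_regression`, the `baseline` and `max_drop` parameters, and the higher-is-better convention below are all assumptions made for this sketch and are not part of spanforge.

```python
def is_regression(baseline: float, current: float, max_drop: float) -> bool:
    """Return True when `current` has fallen more than `max_drop` below `baseline`.

    Hypothetical helper, not a spanforge API. Assumes higher scores are better.
    """
    if max_drop < 0:
        raise ValueError("max_drop must be non-negative")
    return (baseline - current) > max_drop


print(is_regression(baseline=0.85, current=0.70, max_drop=0.10))  # True
print(is_regression(baseline=0.85, current=0.80, max_drop=0.10))  # False
```

A caller could run a check like this after each recorded score and emit an llm.eval.regression.detected event only when it returns True.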

EvalScoreRecordedPayload — key fields

Field             Type          Required  Description
----------------  ------------  --------  -----------------------------------------------------------
evaluator         str                     Evaluator identifier (e.g. "human", "gpt-4o", "rubric-v2")
metric_name       str                     Name of the metric being scored (e.g. "faithfulness")
score             float                   Numeric score value
score_min         float | None            Minimum of the scoring scale
score_max         float | None            Maximum of the scoring scale
threshold         float | None            Pass/fail threshold
passed            bool | None             Whether the score met the threshold
subject_event_id  str | None              ULID of the event being evaluated
subject_type      str | None              Type of the evaluated subject (e.g. "span", "agent_run")
eval_run_id       str | None              Evaluation run identifier
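
The fields score, score_min, score_max, threshold and passed are related, and the relationship can be sketched in plain Python. The helpers `derive_passed` and `normalize_score` are hypothetical names introduced here, not spanforge functions. The sketch assumes that a score passes when it is greater than or equal to the threshold, which the table does not state.

```python
from typing import Optional


def derive_passed(score: float, threshold: Optional[float]) -> Optional[bool]:
    """Compute a value for `passed`; None when no threshold is set.

    Assumption: a score passes when score >= threshold (higher is better).
    """
    if threshold is None:
        return None
    return score >= threshold


def normalize_score(
    score: float, score_min: Optional[float], score_max: Optional[float]
) -> Optional[float]:
    """Rescale `score` to the 0..1 range, or return None if the bounds are unusable."""
    if score_min is None or score_max is None or score_max <= score_min:
        return None
    return (score - score_min) / (score_max - score_min)


print(derive_passed(0.85, 0.7))         # True
print(derive_passed(0.85, None))        # None
print(normalize_score(4.0, 1.0, 5.0))   # 0.75
```

Normalizing against score_min and score_max makes scores from evaluators with different scales (for example 1 to 5 versus 0 to 1) comparable before they are thresholded.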

Example

from spanforge import Event, EventType
from spanforge.namespaces.eval_ import EvalScoreRecordedPayload

# Describe one score: gpt-4o rated faithfulness at 0.85 on a 0.0 to 1.0 scale.
payload = EvalScoreRecordedPayload(
    evaluator="gpt-4o",
    metric_name="faithfulness",
    score=0.85,
    score_min=0.0,
    score_max=1.0,
    threshold=0.7,
    passed=True,
)

# Wrap the payload in an event envelope; the payload is passed as a plain dict.
event = Event(
    event_type=EventType.EVAL_SCORE_RECORDED,
    source="eval-worker@1.0.0",
    org_id="org_01HX",
    payload=payload.to_dict(),
)
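
Once several scores share an eval_run_id, they can be summarized per metric. The following is a minimal sketch that works on plain dicts. It assumes the dicts use the field names from the table above as keys (the exact output of `to_dict()` is not shown here), and `summarize_run` is a hypothetical helper rather than part of spanforge.

```python
from collections import defaultdict
from statistics import mean
from typing import Optional


def summarize_run(
    records: list[dict], eval_run_id: str
) -> dict[str, dict[str, Optional[float]]]:
    """Group one run's score records by metric_name; report mean score and pass rate."""
    by_metric: dict[str, list[dict]] = defaultdict(list)
    for rec in records:
        if rec.get("eval_run_id") == eval_run_id:
            by_metric[rec["metric_name"]].append(rec)

    summary: dict[str, dict[str, Optional[float]]] = {}
    for metric, recs in by_metric.items():
        # Only records with an explicit pass/fail verdict count toward the pass rate.
        judged = [r for r in recs if r.get("passed") is not None]
        summary[metric] = {
            "mean_score": mean(r["score"] for r in recs),
            "pass_rate": sum(r["passed"] for r in judged) / len(judged) if judged else None,
        }
    return summary


records = [
    {"eval_run_id": "run-1", "metric_name": "faithfulness", "score": 0.5, "passed": False},
    {"eval_run_id": "run-1", "metric_name": "faithfulness", "score": 1.0, "passed": True},
    {"eval_run_id": "run-2", "metric_name": "faithfulness", "score": 0.25, "passed": False},
]
print(summarize_run(records, "run-1"))
# {'faithfulness': {'mean_score': 0.75, 'pass_rate': 0.5}}
```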