spanforge.regression — Pass/fail regression detection

Module: spanforge.regression
Added in: 2.0.3

spanforge.regression provides generic, pass/fail-based regression detection over evaluation runs. Unlike the mean-based spanforge.eval.RegressionDetector, this module focuses on two concrete failure signals:

  1. New failures — test cases that passed in the baseline but fail now.
  2. Score drops — test cases whose numeric score fell by more than a configurable threshold.
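
As a rough illustration of the two signals, the per-case checks might look like the sketch below. This is a standalone approximation, not spanforge's implementation: the helper names is_new_failure and is_score_drop are hypothetical, and the strict ">" comparison is an assumption taken from the wording "more than".

```python
def is_new_failure(baseline_passed: bool, current_passed: bool) -> bool:
    """Signal 1: the case passed in the baseline but fails now."""
    return baseline_passed and not current_passed


def is_score_drop(baseline_score: float, current_score: float,
                  threshold: float = 0.10) -> bool:
    """Signal 2: the score fell by more than `threshold`.

    Assumption: the comparison is strict (>), following "more than".
    """
    return (baseline_score - current_score) > threshold


# A drop from 0.95 to 0.93 is about 0.02, below the default 0.10 threshold.
small_drop = is_score_drop(0.95, 0.93)
# Going from passing to failing is a new failure regardless of the score.
new_failure = is_new_failure(True, False)
```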

Quick example

from spanforge.regression import compare

baseline = [
    {"id": "tc-001", "passed": True,  "score": 0.95},
    {"id": "tc-002", "passed": True,  "score": 0.88},
]
current = [
    {"id": "tc-001", "passed": True,  "score": 0.93},  # small drop — OK
    {"id": "tc-002", "passed": False, "score": 0.45},  # NEW FAILURE
]

report = compare(
    baseline, current,
    key_fn=lambda x: x["id"],
    passed_fn=lambda x: x["passed"],
    score_fn=lambda x: x["score"],
    score_drop_threshold=0.10,
)

if report.has_regression:
    print(report.summary())
    # "1 new failure(s), 0 score drop(s)"

API

compare() — convenience function

def compare(
    baseline: list[T],
    current: list[T],
    *,
    key_fn: Callable[[T], Any],
    passed_fn: Callable[[T], bool],
    score_fn: Callable[[T], float],
    score_drop_threshold: float = 0.10,
) -> RegressionReport[T]: ...

One-shot helper equivalent to RegressionDetector(score_drop_threshold).compare(...).


RegressionDetector

class RegressionDetector(Generic[T]):
    def __init__(self, score_drop_threshold: float = 0.10) -> None: ...

    def compare(
        self,
        baseline: list[T],
        current: list[T],
        *,
        key_fn: Callable[[T], Any],
        passed_fn: Callable[[T], bool],
        score_fn: Callable[[T], float],
    ) -> RegressionReport[T]: ...

| Parameter | Description |
|---|---|
| score_drop_threshold | Minimum score decrease (0–1) to count as a regression. Default 0.10. |
| key_fn | Function that extracts a unique key from each result item. |
| passed_fn | Function that returns True if the item passed. |
| score_fn | Function that returns the numeric score for the item. |

Only items whose key appears in both baseline and current are compared. Items that appear only in current (newly added) or only in baseline (removed) are ignored.
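
The matching rule above can be sketched as a key intersection. The helper paired_items is hypothetical and only illustrates the documented behaviour; it assumes keys are hashable and unique within each list, which the library may or may not enforce.

```python
from typing import Any, Callable, Iterator, TypeVar

T = TypeVar("T")


def paired_items(
    baseline: list[T],
    current: list[T],
    key_fn: Callable[[T], Any],
) -> Iterator[tuple[T, T]]:
    """Yield (baseline_item, current_item) for keys present in both lists."""
    # Index the current run by key (assumes hashable, unique keys).
    current_by_key = {key_fn(item): item for item in current}
    for item in baseline:
        key = key_fn(item)
        if key in current_by_key:
            yield item, current_by_key[key]


baseline = [{"id": "tc-001"}, {"id": "tc-002"}]  # tc-002 is missing from current
current = [{"id": "tc-001"}, {"id": "tc-003"}]   # tc-003 is new in current

pairs = list(paired_items(baseline, current, key_fn=lambda x: x["id"]))
matched_ids = [b["id"] for b, _ in pairs]
```

Here only tc-001 is paired, so neither tc-002 nor tc-003 would contribute to a report.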


RegressionReport

@dataclass
class RegressionReport(Generic[T]):
    new_failures:  list[T]
    score_drops:   list[tuple[T, T]]   # (baseline_item, current_item)

    @property
    def has_regression(self) -> bool: ...

    def summary(self) -> str: ...

| Attribute / method | Description |
|---|---|
| new_failures | Items that were passing in baseline but failing now. |
| score_drops | Pairs (baseline, current) where the score drop exceeded the threshold. |
| has_regression | True when either list is non-empty. |
| summary() | Short human-readable string, e.g. "1 new failure(s), 1 score drop(s)" or "no regression detected". |
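
To make the report semantics concrete, here is a minimal stand-in that mirrors the documented fields. ReportSketch is a hypothetical class written for illustration; the summary format is inferred from the example strings above and may differ in detail from what the library produces.

```python
from dataclasses import dataclass, field
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class ReportSketch(Generic[T]):
    """Hypothetical stand-in mirroring the documented report fields."""
    new_failures: list[T] = field(default_factory=list)
    score_drops: list[tuple[T, T]] = field(default_factory=list)

    @property
    def has_regression(self) -> bool:
        # True when either list is non-empty.
        return bool(self.new_failures or self.score_drops)

    def summary(self) -> str:
        # Assumed format, based on the documented example strings.
        if not self.has_regression:
            return "no regression detected"
        return (f"{len(self.new_failures)} new failure(s), "
                f"{len(self.score_drops)} score drop(s)")


clean = ReportSketch()
failing = ReportSketch(new_failures=[{"id": "tc-002"}])
```

In a CI job, has_regression is the natural value to turn into a non-zero exit code, with summary() printed for the log.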

Difference from spanforge.eval.RegressionDetector

| | spanforge.eval.RegressionDetector | spanforge.regression.RegressionDetector |
|---|---|---|
| Signal | Mean score across all metrics | Per-case pass/fail and score delta |
| Input | EvalReport objects | Any list of generic items |
| Use case | Overall eval pipeline health | CI gating, per-case diff |
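
A simplified illustration of why the two signals can disagree: when one case improves and another degrades by the same amount, an average stays flat while a per-case check still flags the drop. The snippet uses neither spanforge API. A plain average over cases stands in for the mean-based signal, and the data is made up for the example.

```python
from statistics import mean

# Made-up scores: tc-001 improves by 0.30 while tc-002 degrades by 0.30.
baseline_scores = {"tc-001": 0.60, "tc-002": 0.90}
current_scores = {"tc-001": 0.90, "tc-002": 0.60}

# Mean-based view: the average is unchanged, so nothing looks wrong.
mean_delta = mean(current_scores.values()) - mean(baseline_scores.values())

# Per-case view: tc-002 fell by 0.30, well past a 0.10 threshold.
threshold = 0.10
dropped = [
    case_id
    for case_id, old in baseline_scores.items()
    if case_id in current_scores and old - current_scores[case_id] > threshold
]
```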