spanforge.regression — Pass/fail regression detection

Module: spanforge.regression
Added in: 2.0.3

spanforge.regression provides generic, pass/fail-based regression detection over evaluation runs. Unlike the mean-based spanforge.eval.RegressionDetector, it compares runs case by case and reports two concrete failure signals:

New failures — test cases that passed in the baseline but fail now.
Score drops — test cases whose numeric score fell by more than a configurable threshold.

Quick example

from spanforge.regression import compare

baseline = [
    {"id": "tc-001", "passed": True,  "score": 0.95},
    {"id": "tc-002", "passed": True,  "score": 0.88},
]
current = [
    {"id": "tc-001", "passed": True,  "score": 0.93},  # small drop — OK
    {"id": "tc-002", "passed": False, "score": 0.45},  # NEW FAILURE
]

report = compare(
    baseline, current,
    key_fn=lambda x: x["id"],
    passed_fn=lambda x: x["passed"],
    score_fn=lambda x: x["score"],
    score_drop_threshold=0.10,
)

if report.has_regression:
    print(report.summary())
    # "1 new failure(s), 0 score drop(s)"

API

`compare()` — convenience function

def compare(
    baseline: list[T],
    current: list[T],
    *,
    key_fn: Callable[[T], Any],
    passed_fn: Callable[[T], bool],
    score_fn: Callable[[T], float],
    score_drop_threshold: float = 0.10,
) -> RegressionReport[T]: ...

One-shot helper equivalent to RegressionDetector(score_drop_threshold).compare(...).

`RegressionDetector`

class RegressionDetector(Generic[T]):
    def __init__(self, score_drop_threshold: float = 0.10) -> None: ...

    def compare(
        self,
        baseline: list[T],
        current: list[T],
        *,
        key_fn: Callable[[T], Any],
        passed_fn: Callable[[T], bool],
        score_fn: Callable[[T], float],
    ) -> RegressionReport[T]: ...

| Parameter | Description |
|---|---|
| `score_drop_threshold` | Minimum score decrease (0–1) to count as a regression. Default `0.10`. |
| `key_fn` | Function that extracts a unique key from each result item. |
| `passed_fn` | Function that returns `True` if the item passed. |
| `score_fn` | Function that returns the numeric score for the item. |

Only items whose key appears in both baseline and current are compared. Items that appear in only one of the two lists (newly added to current, or present in baseline but missing from current) are ignored.
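
The matching rule can be sketched in plain Python. This is a minimal illustration, not spanforge's implementation: the helper name `diff_runs` is hypothetical, and two behaviours are assumptions rather than documented facts. First, a score drop counts only when it is strictly greater than the threshold. Second, a case already reported as a new failure is not also reported as a score drop, which is inferred from the summary string in the quick example.

```python
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def diff_runs(
    baseline: list[T],
    current: list[T],
    *,
    key_fn: Callable[[T], Any],
    passed_fn: Callable[[T], bool],
    score_fn: Callable[[T], float],
    score_drop_threshold: float = 0.10,
) -> tuple[list[T], list[tuple[T, T]]]:
    """Hypothetical stand-in: returns (new_failures, score_drops)."""
    baseline_by_key = {key_fn(item): item for item in baseline}
    new_failures: list[T] = []
    score_drops: list[tuple[T, T]] = []

    for cur in current:
        key = key_fn(cur)
        if key not in baseline_by_key:
            continue  # present only in current: ignored
        base = baseline_by_key[key]
        if passed_fn(base) and not passed_fn(cur):
            new_failures.append(cur)
        elif score_fn(base) - score_fn(cur) > score_drop_threshold:
            # Assumption: strict ">" and no double-reporting of new failures.
            score_drops.append((base, cur))
    # Items present only in baseline are never visited, so they are ignored too.
    return new_failures, score_drops


baseline = [
    {"id": "a", "passed": True, "score": 0.9},
    {"id": "b", "passed": True, "score": 0.9},
    {"id": "gone", "passed": True, "score": 0.9},   # missing from current
]
current = [
    {"id": "a", "passed": True, "score": 0.7},      # still passes, score fell
    {"id": "b", "passed": False, "score": 0.2},     # passed before, fails now
    {"id": "new", "passed": False, "score": 0.0},   # not in baseline
]

new_failures, score_drops = diff_runs(
    baseline, current,
    key_fn=lambda x: x["id"],
    passed_fn=lambda x: x["passed"],
    score_fn=lambda x: x["score"],
)
print([item["id"] for item in new_failures])      # ['b']
print([cur["id"] for _, cur in score_drops])      # ['a']
```

Indexing the baseline by key keeps the comparison linear in the number of items, and it makes the "present in both" rule explicit: `gone` and `new` never reach either list.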

`RegressionReport`

@dataclass
class RegressionReport(Generic[T]):
    new_failures:  list[T]
    score_drops:   list[tuple[T, T]]   # (baseline_item, current_item)

    @property
    def has_regression(self) -> bool: ...

    def summary(self) -> str: ...

| Attribute / method | Description |
|---|---|
| `new_failures` | Items that were passing in baseline but failing now. |
| `score_drops` | Pairs `(baseline, current)` where the score drop exceeded the threshold. |
| `has_regression` | `True` when either list is non-empty. |
| `summary()` | Short human-readable string, e.g. `"1 new failure(s), 1 score drop(s)"` or `"no regression detected"`. |
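
To make the report's behaviour concrete, here is a small stand-in that follows the fields and strings described above. The class name `ReportSketch` and the empty-list defaults are assumptions made for illustration. The real `RegressionReport` may be constructed differently.

```python
from dataclasses import dataclass, field
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass
class ReportSketch(Generic[T]):
    """Hypothetical stand-in mirroring the documented RegressionReport fields."""

    new_failures: list[T] = field(default_factory=list)
    score_drops: list[tuple[T, T]] = field(default_factory=list)

    @property
    def has_regression(self) -> bool:
        # True when either list is non-empty.
        return bool(self.new_failures or self.score_drops)

    def summary(self) -> str:
        if not self.has_regression:
            return "no regression detected"
        return (
            f"{len(self.new_failures)} new failure(s), "
            f"{len(self.score_drops)} score drop(s)"
        )


empty = ReportSketch()
failing = ReportSketch(new_failures=[{"id": "tc-002"}])
dropped = ReportSketch(
    score_drops=[({"id": "tc-001", "score": 0.95}, {"id": "tc-001", "score": 0.70})]
)

print(empty.summary())    # no regression detected
print(failing.summary())  # 1 new failure(s), 0 score drop(s)

# Each score drop is a (baseline_item, current_item) pair, so both scores are available.
for base, cur in dropped.score_drops:
    print(f"{cur['id']}: {base['score']:.2f} -> {cur['score']:.2f}")  # tc-001: 0.95 -> 0.70
```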

Difference from `spanforge.eval.RegressionDetector`

| | `spanforge.eval.RegressionDetector` | `spanforge.regression.RegressionDetector` |
|---|---|---|
| Signal | Mean score across all metrics | Per-case pass/fail and score delta |
| Input | `EvalReport` objects | Any list of generic items |
| Use case | Overall eval pipeline health | CI gating, per-case diff |
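
For the CI gating use case, one possible pattern is to turn the report into a process exit code. The sketch below is an assumption about how a pipeline might do this, not part of spanforge: `gate_exit_code` and `StubReport` are hypothetical names, and the stub exposes only the `has_regression` and `summary()` members described for `RegressionReport`.

```python
from dataclasses import dataclass


@dataclass
class StubReport:
    """Hypothetical stand-in exposing only the two members the gate reads."""

    regressed: bool
    text: str

    @property
    def has_regression(self) -> bool:
        return self.regressed

    def summary(self) -> str:
        return self.text


def gate_exit_code(report) -> int:
    """Return 1 to block the build on a regression, 0 to let it pass."""
    print(report.summary())
    return 1 if report.has_regression else 0


code = gate_exit_code(StubReport(True, "1 new failure(s), 0 score drop(s)"))
# A CI script would typically finish with: sys.exit(code)
```

Returning the code instead of calling `sys.exit` inside the helper keeps the gate easy to unit-test.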
