spanforge.sdk.pii — PII Service Client

Module: spanforge.sdk.pii
Added in: 2.0.3 (Phase 3: PII Service Hardening)

spanforge.sdk.pii provides the Phase 3 PII service client with full-text scanning, payload anonymisation, pipeline action enforcement, GDPR/HIPAA/CCPA/DPDP/PIPL compliance helpers, and training-data auditing.

The pre-built sf_pii singleton is available at the top level:

from spanforge.sdk import sf_pii

Presidio NLP backend

When the presidio optional extra is installed, spanforge automatically switches from the built-in regex engine to a full NLP-powered Presidio analysis pipeline:

pip install "spanforge[presidio]"
python -m spacy download en_core_web_lg   # ~400 MB — required for NER

No code changes are needed. The backend is detected and activated at import time.

Entity types covered

| Entity type | Label returned | Severity |
| --- | --- | --- |
| Credit card | credit_card | high |
| Cryptocurrency address | crypto_address | medium |
| Email address | email | medium |
| IBAN bank code | iban | high |
| IPv4 / IPv6 address | ip_address | low |
| Person name | person_name | medium |
| Phone number | phone | medium |
| US Social Security Number | ssn | high |
| UK NHS number | uk_nhs | high |
| US driver's license | us_driver_license | high |
| US passport | us_passport | high |
| India Aadhaar number | aadhaar | high |
| India PAN card | pan | high |
| Medical license number | medical_license | medium |
| UK National Insurance number | uk_national_insurance | high |

Custom recognizers

The following entity types have custom PatternRecognizer rules registered on top of the default Presidio English model to improve recall:

  • PHONE_NUMBER: three format patterns — +1-NXX-NXX-XXXX (0.75), (NXX) NXX-XXXX (0.75), NXX-NXX-XXXX (0.60).
  • IN_AADHAAR: DDDD DDDD DDDD (space- or hyphen-delimited 12-digit groups), confidence 0.85.
  • IN_PAN: 5 uppercase letters + 4 digits + 1 uppercase letter, confidence 0.85.
  • UK_NATIONAL_INSURANCE: standard NI number format, confidence 0.85.
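
As a rough illustration of the formats listed above, the following standalone sketch expresses the Aadhaar and PAN shapes as plain regular expressions. The exact expressions spanforge registers are not shown in this reference, so the patterns here are assumptions, and `AADHAAR_RE`, `PAN_RE`, `PATTERNS`, and `find_matches` are hypothetical names.

```python
import re

# Assumed patterns; the rules actually registered with Presidio may differ.
AADHAAR_RE = re.compile(r"\b\d{4}[ -]\d{4}[ -]\d{4}\b")  # DDDD DDDD DDDD
PAN_RE = re.compile(r"\b[A-Z]{5}\d{4}[A-Z]\b")           # 5 letters, 4 digits, 1 letter

# Confidence values are the ones quoted in the list above.
PATTERNS = {
    "IN_AADHAAR": (AADHAAR_RE, 0.85),
    "IN_PAN": (PAN_RE, 0.85),
}


def find_matches(text: str) -> list[tuple[str, int, int, float]]:
    """Return (entity_type, start, end, score) for every pattern hit, by position."""
    hits = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((entity_type, m.start(), m.end(), score))
    return sorted(hits, key=lambda h: h[1])


hits = find_matches("PAN ABCDE1234F, Aadhaar 1234 5678 9012")
```

Each hit carries a fixed score because a pattern match alone says nothing about context; a downstream threshold then decides which hits are acted on.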

Post-filters

After Presidio analysis, spanforge applies additional filters to reduce false positives:

  • Lowercase PERSON suppression — all-lowercase matches for PERSON are discarded; they typically indicate programming identifiers or function names, not human names.
  • IPv4 boundary check — IPv4 matches are validated (4 octets, each 0–255) and must not be adjacent to other digit or dot characters. This prevents OID fragments like 1.101.3.4 inside longer strings from being flagged as IP addresses.
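
The two filters above can be sketched with the standard library alone. This is a minimal illustration of the rules as described, not spanforge's implementation; `is_plausible_person`, `is_standalone_ipv4`, and `find_ipv4` are hypothetical names.

```python
import re

IPV4_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")


def is_plausible_person(match_text: str) -> bool:
    """Discard all-lowercase PERSON matches (likely identifiers, not names)."""
    return not match_text.islower()


def is_standalone_ipv4(text: str, start: int, end: int) -> bool:
    """Accept a candidate only if it has 4 octets in 0-255 and is not
    directly adjacent to another digit or dot."""
    octets = text[start:end].split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        return False
    before = text[start - 1] if start > 0 else ""
    after = text[end] if end < len(text) else ""
    return not (before.isdigit() or before == "." or after.isdigit() or after == ".")


def find_ipv4(text: str) -> list[str]:
    """Return only the IPv4 candidates that pass the boundary check."""
    return [m.group() for m in IPV4_RE.finditer(text)
            if is_standalone_ipv4(text, m.start(), m.end())]
```

Read strictly, the boundary rule also rejects an address immediately followed by a sentence-ending full stop; whether spanforge special-cases that is not stated here.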

GA accuracy gates (verified)

| Metric | Threshold | Result |
| --- | --- | --- |
| False-positive rate | < 0.5 % | ✅ Passed (191-item clean corpus) |
| True-positive rate | ≥ 95 % | ✅ Passed (100 % = 25/25 samples) |

Custom regex patterns alongside Presidio

The extra_patterns argument to scan_text() and scan_payload() is always honoured, even when the Presidio backend is active. Results from Presidio and from custom regex patterns are merged before returning.


Quick example

from spanforge.sdk import sf_pii

# Scan raw text
result = sf_pii.scan_text("Contact alice@example.com or call +1 555-867-5309")
print(result.detected)           # True
for entity in result.entities:
    print(entity.type, entity.start, entity.end, entity.score)

# Anonymise a payload dict
anon = sf_pii.anonymise({
    "user": "alice@example.com",
    "note": "SSN 078-05-1120",
})
print(anon.clean_payload)        # {"user": "<EMAIL_ADDRESS>", "note": "<US_SSN>"}

SFPIIClient

class SFPIIClient(SFServiceClient)

All methods are thread-safe. The class can be used standalone or via the sf_pii singleton exported from spanforge.sdk.

scan_text()

def scan_text(
    self,
    text: str,
    *,
    language: str = "en",
) -> PIITextScanResult

Scan raw text for PII entities. Uses Presidio when installed; falls back to the regex-based redact.scan_payload() engine automatically.

| Parameter | Default | Description |
| --- | --- | --- |
| text | (required) | The text string to scan. |
| language | "en" | ISO 639-1 language code (e.g. "zh" for Chinese). |

Returns: PIITextScanResult

Example:

result = sf_pii.scan_text("SSN: 078-05-1120", language="en")
assert result.detected
assert result.entities[0].type == "US_SSN"

anonymise()

def anonymise(
    self,
    payload: dict,
    *,
    language: str = "en",
    max_depth: int = 10,
) -> PIIAnonymisedResult

Recursively walk payload, scan every string value, and replace PII hits with <TYPE> placeholders. Returns the cleaned payload alongside a full redaction_manifest recording each replacement.

Returns: PIIAnonymisedResult

Example:

anon = sf_pii.anonymise({"email": "alice@example.com", "meta": {"ip": "203.0.113.4"}})
# anon.clean_payload == {"email": "<EMAIL_ADDRESS>", "meta": {"ip": "<IP_ADDRESS>"}}
# anon.redaction_manifest[0].field_path == "email"
# anon.redaction_manifest[0].original_hash  (SHA-256 of original — never raw PII)
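
To make the recursive walk and the manifest concrete, here is a standalone sketch that uses a single email regex as a stand-in detector. It is not the library's code: `anonymise_sketch`, `EMAIL_RE`, and the dict-shaped manifest entries are hypothetical, and leaving values below `max_depth` untouched is an assumption, since the behaviour at the depth limit is not documented here.

```python
import hashlib
import re
from copy import deepcopy

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # stand-in detector


def anonymise_sketch(payload: dict, max_depth: int = 10):
    """Return (clean_payload, manifest) without mutating the input."""
    manifest = []

    def walk(node, path, depth):
        if depth > max_depth:
            return node  # assumption: deeper levels are left untouched
        if isinstance(node, dict):
            return {k: walk(v, f"{path}.{k}" if path else str(k), depth + 1)
                    for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v, f"{path}[{i}]", depth + 1) for i, v in enumerate(node)]
        if isinstance(node, str) and EMAIL_RE.search(node):
            manifest.append({
                "field_path": path,
                "entity_type": "EMAIL_ADDRESS",
                # Only the digest is recorded, never the raw value.
                "original_hash": hashlib.sha256(node.encode("utf-8")).hexdigest(),
                "replacement": "<EMAIL_ADDRESS>",
            })
            return EMAIL_RE.sub("<EMAIL_ADDRESS>", node)
        return node

    return walk(deepcopy(payload), "", 0), manifest


source = {"email": "alice@example.com", "meta": {"note": "mail bob@example.org now"}}
clean, manifest = anonymise_sketch(source)
```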

scan_async()

Added in: 1.0.0

async def scan_async(
    self,
    text: str,
    *,
    language: str = "en",
    score_threshold: float = 0.5,
) -> PIITextScanResult

Non-blocking async variant of scan_text(). Delegates to the synchronous method via asyncio.get_event_loop().run_in_executor() so it never blocks the event loop.

| Parameter | Default | Description |
| --- | --- | --- |
| text | (required) | The text string to scan. |
| language | "en" | ISO 639-1 language code. |
| score_threshold | 0.5 | Filter out entities below this confidence score. |

Returns: PIITextScanResult

Example:

import asyncio
from spanforge.sdk import sf_pii

result = asyncio.run(sf_pii.scan_async("alice@example.com"))
assert result.detected
assert result.entities[0].type == "EMAIL_ADDRESS"
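
The delegation pattern described above (run the synchronous scan in an executor, then filter by score) can be sketched as follows. `scan_sync` is a hypothetical stand-in scanner with made-up scores, and the sketch calls `asyncio.get_running_loop()`, which inside a coroutine returns the same loop as `get_event_loop()`.

```python
import asyncio


def scan_sync(text: str) -> list[tuple[str, float]]:
    """Stand-in synchronous scanner returning (entity_type, score) pairs."""
    hits = []
    if "@" in text:
        hits.append(("EMAIL_ADDRESS", 0.9))
    if any(ch.isdigit() for ch in text):
        hits.append(("PHONE_NUMBER", 0.4))  # deliberately low-confidence
    return hits


async def scan_async_sketch(text: str, score_threshold: float = 0.5):
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor, so the
    # blocking call runs in a worker thread instead of on the loop.
    hits = await loop.run_in_executor(None, scan_sync, text)
    return [h for h in hits if h[1] >= score_threshold]


kept = asyncio.run(scan_async_sketch("alice@example.com, ext 42"))
```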

scan_batch()

async def scan_batch(
    self,
    texts: list[str],
    *,
    language: str = "en",
) -> list[PIITextScanResult]

Scan multiple texts in parallel via asyncio.gather. Falls back to sequential execution when no running event loop is present.

Example:

import asyncio

results = asyncio.run(sf_pii.scan_batch(["alice@example.com", "hello world"]))
assert results[0].detected
assert not results[1].detected
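
A minimal sketch of the fan-out with `asyncio.gather`, using a hypothetical `scan_one` coroutine in place of the real per-text scan. It shows only the ordering guarantee of `gather`; the sequential fallback mentioned above is not modelled.

```python
import asyncio


async def scan_one(text: str) -> bool:
    await asyncio.sleep(0)  # yield to the loop, as a real awaitable scan would
    return "@" in text


async def scan_batch_sketch(texts: list[str]) -> list[bool]:
    # gather() returns results in the order of its arguments,
    # regardless of which scan finishes first.
    return list(await asyncio.gather(*(scan_one(t) for t in texts)))


flags = asyncio.run(scan_batch_sketch(["alice@example.com", "hello world"]))
```

Because result order matches input order, callers can zip the results back onto the original texts.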

apply_pipeline_action()

def apply_pipeline_action(
    self,
    text: str,
    *,
    action: str = "flag",
    threshold: float = 0.85,
    language: str = "en",
) -> PIIPipelineResult

Scan text and enforce a pipeline action on any PII found above threshold.

| Action | Behaviour |
| --- | --- |
| "flag" | Return result with detected set; no text modification. |
| "redact" | Replace PII spans with <TYPE> placeholders in text; detected=True. |
| "block" | Raise SFPIIBlockedError; never returns. |

The threshold parameter filters entities: only those with score >= threshold trigger the action (default 0.85). Sub-threshold hits are recorded in low_confidence_hits for audit purposes.

Returns: PIIPipelineResult

Raises: SFPIIBlockedError (action "block" only), SFPIIScanError if text is not a str or action is invalid.

Example:

from spanforge.sdk._exceptions import SFPIIBlockedError

try:
    pipeline = sf_pii.apply_pipeline_action(
        "My SSN is 078-05-1120", action="block", threshold=0.85
    )
except SFPIIBlockedError as exc:
    print("Blocked:", exc.entity_types)   # ["US_SSN"]
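
The threshold split and the three actions can be illustrated with a standalone sketch that takes already-detected entities as input. The names `BlockedError`, `PipelineOutcome`, and `apply_action` are hypothetical, and the sketch assumes that "block" raises only when at least one entity meets the threshold.

```python
from dataclasses import dataclass, field
from typing import Optional


class BlockedError(Exception):
    """Hypothetical stand-in for SFPIIBlockedError."""

    def __init__(self, entity_types):
        super().__init__(f"blocked: {entity_types}")
        self.entity_types = entity_types


@dataclass
class PipelineOutcome:
    action: str
    detected: bool
    entity_types: list
    redacted_text: Optional[str] = None
    low_confidence_hits: list = field(default_factory=list)


def apply_action(text, entities, action="flag", threshold=0.85):
    """entities: (type, start, end, score) tuples from any scanner."""
    if action not in ("flag", "redact", "block"):
        raise ValueError(f"invalid action: {action!r}")
    strong = [e for e in entities if e[3] >= threshold]
    weak = [e for e in entities if e[3] < threshold]  # kept for audit only
    types = sorted({e[0] for e in strong})
    if strong and action == "block":  # assumption: no hits means no block
        raise BlockedError(types)
    redacted = None
    if action == "redact":
        redacted = text
        # Replace right-to-left so earlier offsets stay valid.
        for etype, start, end, _ in sorted(strong, key=lambda e: e[1], reverse=True):
            redacted = redacted[:start] + f"<{etype}>" + redacted[end:]
    return PipelineOutcome(action, bool(strong), types, redacted, weak)


text = "SSN 078-05-1120 from 10.0.0.1"
entities = [("US_SSN", 4, 15, 0.95), ("IP_ADDRESS", 21, 29, 0.6)]
outcome = apply_action(text, entities, action="redact")
```

Here the IP address scores below 0.85, so it stays in the text and is recorded under `low_confidence_hits` instead of being redacted.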

get_status()

def get_status(self) -> PIIServiceStatus

Return the current sf-pii service status.

Returns: PIIServiceStatus

Example:

status = sf_pii.get_status()
print(status.presidio_available)      # True / False
print(status.entity_types_loaded)     # ["EMAIL_ADDRESS", "PHONE_NUMBER", ...]
print(status.last_scan_at)            # "2026-04-17T12:00:00Z" or None

erase_subject()

def erase_subject(
    self,
    subject_id: str,
    project_id: str,
) -> ErasureReceipt

GDPR Article 17 — Right to Erasure. Locate all audit events associated with subject_id in project_id and issue erasure instructions to the downstream store.

The subject_id is SHA-256 hashed in the returned ErasureReceipt — the raw identifier is never persisted.

Returns: ErasureReceipt

Example:

receipt = sf_pii.erase_subject("user-12345", "proj-abc")
print(receipt.records_erased)         # 42
print(receipt.erased_at)              # "2026-04-17T12:00:00Z"
print(receipt.subject_id_hash)        # SHA-256 of "user-12345"
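
For reference, a SHA-256 hex digest of an identifier can be produced with the standard library as below. Whether spanforge salts the value or uses a different encoding is not stated in this reference, so plain UTF-8 without a salt is an assumption, and `hash_subject_id` is a hypothetical helper.

```python
import hashlib


def hash_subject_id(subject_id: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded subject identifier."""
    return hashlib.sha256(subject_id.encode("utf-8")).hexdigest()


digest = hash_subject_id("user-12345")
```

The digest is deterministic, so a receipt can later be matched to a request by re-hashing the identifier, without the raw ID ever being stored.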

export_subject_data()

def export_subject_data(
    self,
    subject_id: str,
    project_id: str,
) -> DSARExport

CCPA / GDPR — Data Subject Access Request (DSAR). Aggregate all events referencing subject_id from the audit store for project_id.

Returns: DSARExport

Example:

export = sf_pii.export_subject_data("user-12345", "proj-abc")
print(export.total_records)
for record in export.records:
    print(record["event_type"], record["created_at"])

safe_harbor_deidentify()

def safe_harbor_deidentify(self, text: str) -> SafeHarborResult

HIPAA Safe Harbor De-identification per 45 CFR §164.514(b)(2). Removes or generalises all 18 PHI identifier categories from text:

| Transformation | Rule |
| --- | --- |
| Dates (except year) | Replaced with year only: April 17 2026 → 2026 |
| Ages > 89 | Replaced with "90+" |
| ZIP codes | Truncated to first 3 digits: 90210 → 902XX |
| Phone/fax | Removed |
| Email | Removed |
| SSN, MRN, account/certificate numbers | Removed |
| URLs, IP addresses, device identifiers | Removed |
| Names, geographic subdivisions, biometric data | Removed |

Returns: SafeHarborResult

Example:

result = sf_pii.safe_harbor_deidentify(
    "Patient John Doe (DOB: 04/17/1932, MRN 0000-4321) lives at 902 Oak Lane, 90210."
)
print(result.deidentified_text)
# "Patient [NAME] (DOB: 1932, MRN [REMOVED]) lives at [ADDRESS], 902XX."
print(result.identifiers_removed)    # 5
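
The three generalisation rules in the table (dates, ages, ZIP codes) can be sketched with regular expressions. This is a simplified illustration covering one date format and one age phrasing, not the library's implementation; all function names are hypothetical.

```python
import re


def generalise_dates(text: str) -> str:
    """Reduce MM/DD/YYYY dates to the year only."""
    return re.sub(r"\b\d{1,2}/\d{1,2}/(\d{4})\b", r"\1", text)


def cap_ages(text: str) -> str:
    """Replace 'age N' with 'age 90+' when N is greater than 89."""
    def repl(m: re.Match) -> str:
        return f"{m.group(1)}90+" if int(m.group(2)) > 89 else m.group(0)
    return re.sub(r"\b(age\s+)(\d{1,3})\b", repl, text, flags=re.IGNORECASE)


def generalise_zip(text: str) -> str:
    """Keep the first 3 digits of 5-digit ZIP codes: 90210 -> 902XX.
    Note: this sketch treats any standalone 5-digit number as a ZIP code."""
    return re.sub(r"\b(\d{3})\d{2}\b", r"\1XX", text)


def deidentify_sketch(text: str) -> str:
    # Apply the three independent rewrites in a fixed order.
    return generalise_zip(cap_ages(generalise_dates(text)))


out = deidentify_sketch("DOB 04/17/1932, age 93, ZIP 90210")
```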

audit_training_data()

def audit_training_data(
    self,
    dataset_path: str,
    *,
    max_records: int = 10_000,
    language: str = "en",
) -> TrainingDataPIIReport

EU AI Act Article 10 — Training Data Governance. Scan each line of a JSONL dataset file for PII and produce a prevalence report. Lines that are not valid JSON are counted as malformed_lines and skipped.

| Parameter | Default | Description |
| --- | --- | --- |
| dataset_path | (required) | Path to a JSONL dataset file. |
| max_records | 10_000 | Stop after scanning this many records. |
| language | "en" | Language code forwarded to scan_text. |

Returns: TrainingDataPIIReport

Example:

report = sf_pii.audit_training_data("dataset/train.jsonl", max_records=5000)
print(f"{report.pii_record_count} / {report.total_records} records contain PII")
for entry in report.entity_breakdown:
    print(f"  {entry.entity_type}: {entry.count}")
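
The line-by-line audit loop described above might look like the following sketch, which uses an email regex as a stand-in detector and returns a plain dict. `audit_jsonl` is a hypothetical name, and whether `max_records` counts malformed lines is an assumption (here it counts parsed records only).

```python
import json
import re
from collections import Counter

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # stand-in detector


def audit_jsonl(path: str, max_records: int = 10_000) -> dict:
    total = with_pii = malformed = 0
    breakdown = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if total >= max_records:
                break
            line = line.strip()
            if not line:
                continue  # blank lines are neither records nor malformed
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                malformed += 1  # counted, then skipped
                continue
            total += 1
            # Scan the serialised record so nested string values are covered.
            hits = EMAIL_RE.findall(json.dumps(record))
            if hits:
                with_pii += 1
                breakdown["EMAIL_ADDRESS"] += len(hits)
    return {"total_records": total, "pii_record_count": with_pii,
            "malformed_lines": malformed, "entity_breakdown": dict(breakdown)}
```

Streaming the file one line at a time keeps memory use flat regardless of dataset size.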

get_pii_stats()

def get_pii_stats(self, project_id: str) -> list[PIIHeatMapEntry]

Aggregate PII detection statistics per entity type for project_id. Powers the SpanForge dashboard PII heat-map.

Returns: list[PIIHeatMapEntry]

Example:

for entry in sf_pii.get_pii_stats("proj-abc"):
    print(entry.entity_type, entry.count, entry.last_seen_at)

Return types

PIIEntityResult

@dataclass(frozen=True)
class PIIEntityResult:
    type: str
    start: int
    end: int
    score: float

A single detected PII entity.

| Field | Description |
| --- | --- |
| type | Entity type label, e.g. "EMAIL_ADDRESS", "US_SSN", "PIPL_NATIONAL_ID". |
| start | Byte offset of the match start in the input text. |
| end | Byte offset of the match end (exclusive). |
| score | Confidence score in [0.0, 1.0]. |

PIITextScanResult

@dataclass
class PIITextScanResult:
    detected: bool
    entities: list[PIIEntityResult]
    redacted_text: str

| Field | Description |
| --- | --- |
| detected | True if at least one entity was found. |
| entities | All entities found (regardless of score threshold). |
| redacted_text | Input text with all detected PII replaced by <TYPE> placeholders. |

PIIRedactionManifestEntry

@dataclass(frozen=True)
class PIIRedactionManifestEntry:
    field_path: str
    entity_type: str
    original_hash: str
    replacement: str

One entry in an anonymise() redaction manifest.

| Field | Description |
| --- | --- |
| field_path | Dot-delimited path to the field in the original payload (e.g. "user.email"). |
| entity_type | Entity type label. |
| original_hash | SHA-256 of the original field value. The raw value is never stored. |
| replacement | The <TYPE> placeholder that replaced the original. |

PIIAnonymisedResult

@dataclass
class PIIAnonymisedResult:
    clean_payload: dict
    redaction_manifest: list[PIIRedactionManifestEntry]

| Field | Description |
| --- | --- |
| clean_payload | Deep copy of the input payload with all PII replaced. |
| redaction_manifest | Full list of replacements made. |

PIIPipelineResult

@dataclass
class PIIPipelineResult:
    action: str
    detected: bool
    entity_types: list[str]
    redacted_text: str | None

| Field | Description |
| --- | --- |
| action | The action that was applied: "flag", "redact", or "block". |
| detected | Whether PII above the threshold was present. |
| entity_types | List of distinct entity type labels that triggered the action. |
| redacted_text | Redacted text (only set when action="redact"). |

PIIServiceStatus

@dataclass
class PIIServiceStatus:
    status: str
    presidio_available: bool
    entity_types_loaded: list[str]
    last_scan_at: str | None

| Field | Description |
| --- | --- |
| status | "ok" or "degraded". |
| presidio_available | True when the Presidio engine is importable and healthy. |
| entity_types_loaded | Entity type labels currently registered in the active engine. |
| last_scan_at | ISO-8601 timestamp of the most recent scan, or None. |

ErasureReceipt

@dataclass
class ErasureReceipt:
    subject_id_hash: str
    project_id: str
    records_erased: int
    erased_at: str

| Field | Description |
| --- | --- |
| subject_id_hash | SHA-256 hex digest of the subject ID (never the raw ID). |
| project_id | Project the erasure was scoped to. |
| records_erased | Number of audit records removed. |
| erased_at | ISO-8601 UTC timestamp of the erasure. |

DSARExport

@dataclass
class DSARExport:
    subject_id_hash: str
    project_id: str
    total_records: int
    records: list[dict]
    exported_at: str

| Field | Description |
| --- | --- |
| subject_id_hash | SHA-256 hex digest of the subject ID. |
| project_id | Project the export was scoped to. |
| total_records | Number of records in the export. |
| records | List of serialisable event dicts. |
| exported_at | ISO-8601 UTC timestamp of the export. |

SafeHarborResult

@dataclass
class SafeHarborResult:
    deidentified_text: str
    identifiers_removed: int

| Field | Description |
| --- | --- |
| deidentified_text | The input text after Safe Harbor de-identification. |
| identifiers_removed | Count of identifier instances removed or generalised. |

TrainingDataPIIReport

@dataclass
class TrainingDataPIIReport:
    dataset_path: str
    total_records: int
    pii_record_count: int
    malformed_lines: int
    entity_breakdown: list[PIIHeatMapEntry]

| Field | Description |
| --- | --- |
| dataset_path | Path that was scanned. |
| total_records | Lines successfully parsed. |
| pii_record_count | Lines that contained at least one PII entity. |
| malformed_lines | Lines that could not be parsed as JSON (skipped). |
| entity_breakdown | Per-entity-type occurrence counts. |

PIIHeatMapEntry

@dataclass
class PIIHeatMapEntry:
    entity_type: str
    count: int
    last_seen_at: str | None

| Field | Description |
| --- | --- |
| entity_type | Entity type label (e.g. "EMAIL_ADDRESS"). |
| count | Number of times this entity type was detected in the project. |
| last_seen_at | ISO-8601 timestamp of the most recent detection, or None. |

Exceptions

| Exception | Description |
| --- | --- |
| SFPIIError | Base class for all sf-pii SDK errors. |
| SFPIIBlockedError(entity_types, count) | Raised by apply_pipeline_action(action="block"). entity_types lists the types that triggered the block. |
| SFPIIDPDPConsentMissingError(subject_id_hash, entity_types) | Raised when a DPDP-regulated entity is detected but no valid consent record exists for the current purpose. subject_id_hash is a SHA-256 digest. |
| SFPIIScanError | Wraps unexpected engine failures. |

See exceptions reference for full details.


PIPL entity types

China PIPL-specific entity types registered in presidio_backend.py:

| Type | Pattern |
| --- | --- |
| PIPL_NATIONAL_ID | Chinese national ID: 17 digits followed by a digit or X |
| PIPL_MOBILE | Chinese mobile: 1[3-9] followed by 9 digits |
| PIPL_BANK_CARD | Chinese bank card: 16–19 digit card numbers |

These types are flagged as pipl_sensitive for cross-border transfer controls.
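
The three shapes in the table translate directly into regular expressions. The sketch below is derived from the table only; the expressions actually registered in presidio_backend.py are not shown in this reference, so treat `PIPL_PATTERNS` and `classify_pipl` as hypothetical.

```python
import re
from typing import Optional

# Assumed regexes derived from the table; the registered ones may differ.
PIPL_PATTERNS = [
    ("PIPL_NATIONAL_ID", re.compile(r"\d{17}[\dX]")),
    ("PIPL_MOBILE", re.compile(r"1[3-9]\d{9}")),
    ("PIPL_BANK_CARD", re.compile(r"\d{16,19}")),
]


def classify_pipl(token: str) -> Optional[str]:
    """Return the first PIPL type whose pattern matches the whole token."""
    for entity_type, pattern in PIPL_PATTERNS:
        if pattern.fullmatch(token):
            return entity_type
    return None
```

An all-numeric 18-digit string fits both the national ID and bank card shapes, so the list order decides the label in this sketch; a production recognizer would more likely disambiguate with checksums or context words.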


Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| SPANFORGE_SF_PII_ENDPOINT | (none) | Remote sf-pii service URL. When set, SFPIIClient forwards scans to the service; falls back to local engine on network error. |
| SPANFORGE_PII_ACTION | "flag" | Default pipeline action ("flag" / "redact" / "block"). |
| SPANFORGE_PII_THRESHOLD | 0.85 | Default confidence threshold for pipeline action enforcement. |
| SPANFORGE_PII_LANGUAGE | "en" | Default language code forwarded to scan calls. |
| SPANFORGE_PII_MAX_DEPTH | 10 | Maximum recursion depth for anonymise(). |
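
For applications that want to mirror these defaults in their own configuration layer, the variables can be read as in the sketch below. `PIISettings` and `load_settings` are hypothetical helpers, not part of spanforge, and rejecting unknown actions is an assumption about sensible validation.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PIISettings:
    endpoint: Optional[str]
    action: str
    threshold: float
    language: str
    max_depth: int


def load_settings(env=os.environ) -> PIISettings:
    """Read the variables from the table, applying the documented defaults."""
    action = env.get("SPANFORGE_PII_ACTION", "flag")
    if action not in ("flag", "redact", "block"):
        raise ValueError(f"unsupported SPANFORGE_PII_ACTION: {action!r}")
    return PIISettings(
        endpoint=env.get("SPANFORGE_SF_PII_ENDPOINT") or None,
        action=action,
        threshold=float(env.get("SPANFORGE_PII_THRESHOLD", "0.85")),
        language=env.get("SPANFORGE_PII_LANGUAGE", "en"),
        max_depth=int(env.get("SPANFORGE_PII_MAX_DEPTH", "10")),
    )
```

Passing the environment as a parameter keeps the function easy to test with a plain dict.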

See also