spanforge.sdk.pii — PII Service Client

Module: spanforge.sdk.pii
Added in: 2.0.3 (Phase 3: PII Service Hardening)

spanforge.sdk.pii provides the Phase 3 PII service client with full-text scanning, payload anonymisation, pipeline action enforcement, GDPR/HIPAA/CCPA/DPDP/PIPL compliance helpers, and training-data auditing.

The pre-built sf_pii singleton is available at the top level:

from spanforge.sdk import sf_pii

Quick example

from spanforge.sdk import sf_pii

# Scan raw text
result = sf_pii.scan_text("Contact alice@example.com or call +1 555-867-5309")
print(result.detected)           # True
for entity in result.entities:
    print(entity.type, entity.start, entity.end, entity.score)

# Anonymise a payload dict
anon = sf_pii.anonymise({
    "user": "alice@example.com",
    "note": "SSN 078-05-1120",
})
print(anon.clean_payload)        # {"user": "<EMAIL_ADDRESS>", "note": "<US_SSN>"}

SFPIIClient

class SFPIIClient(SFServiceClient)

All methods are thread-safe. The class can be used standalone or via the sf_pii singleton exported from spanforge.sdk.

scan_text()

def scan_text(
    self,
    text: str,
    *,
    language: str = "en",
) -> PIITextScanResult

Scan raw text for PII entities. Uses Presidio when installed; falls back to the regex-based redact.scan_payload() engine automatically.

| Parameter | Default | Description |
| --- | --- | --- |
| text | (required) | The text string to scan. |
| language | "en" | ISO 639-1 language code (e.g. "zh" for Chinese). |

Returns: PIITextScanResult

Example:

result = sf_pii.scan_text("SSN: 078-05-1120", language="en")
assert result.detected
assert result.entities[0].type == "US_SSN"
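
The regex fallback mentioned above can be pictured with a minimal standard-library sketch. It does not reproduce SpanForge's own engine: the `FALLBACK_PATTERNS` table, the `Entity` dataclass, the fixed score of 0.85, and the `regex_scan` helper are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns for illustration; a real engine ships far more robust rules.
FALLBACK_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


@dataclass(frozen=True)
class Entity:
    type: str
    start: int
    end: int
    score: float


def regex_scan(text: str) -> list[Entity]:
    """Return every pattern match as an Entity, ordered by start offset."""
    hits = [
        Entity(label, m.start(), m.end(), 0.85)  # fixed score: an assumption
        for label, pattern in FALLBACK_PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(hits, key=lambda e: e.start)
```

Slicing the input with `text[e.start:e.end]` recovers the matched span for each entity in this sketch.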

anonymise()

def anonymise(
    self,
    payload: dict,
    *,
    language: str = "en",
    max_depth: int = 10,
) -> PIIAnonymisedResult

Recursively walk payload, scan every string value, and replace PII hits with <TYPE> placeholders. Returns the cleaned payload alongside a full redaction_manifest recording each replacement.

Returns: PIIAnonymisedResult

Example:

anon = sf_pii.anonymise({"email": "alice@example.com", "meta": {"ip": "203.0.113.4"}})
# anon.clean_payload == {"email": "<EMAIL_ADDRESS>", "meta": {"ip": "<IP_ADDRESS>"}}
# anon.redaction_manifest[0].field_path == "email"
# anon.redaction_manifest[0].original_hash  (SHA-256 of original — never raw PII)
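
To make the recursive walk and the manifest concrete, here is a minimal sketch using only the standard library. The `anonymise_sketch` function, the single `EMAIL` pattern, and the choice to leave values beyond `max_depth` untouched are assumptions for illustration, not the behaviour of `SFPIIClient.anonymise()`.

```python
import hashlib
import re

# Hypothetical single-pattern detector standing in for the real scan engine.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymise_sketch(payload, max_depth=10):
    """Return (clean_copy, manifest) without mutating the input payload."""
    manifest = []

    def walk(value, path, depth):
        if isinstance(value, dict):
            if depth >= max_depth:
                return value  # assumption: deeper levels are left untouched
            return {
                k: walk(v, f"{path}.{k}" if path else str(k), depth + 1)
                for k, v in value.items()
            }
        if isinstance(value, str) and EMAIL.search(value):
            manifest.append({
                "field_path": path,
                "entity_type": "EMAIL_ADDRESS",
                # Only a digest of the original value is recorded.
                "original_hash": hashlib.sha256(value.encode("utf-8")).hexdigest(),
                "replacement": "<EMAIL_ADDRESS>",
            })
            return EMAIL.sub("<EMAIL_ADDRESS>", value)
        return value

    return walk(payload, "", 0), manifest
```

The dot-delimited `field_path` is built up as the walk descends, so a nested key such as `meta.contact` can be located again from the manifest alone.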

scan_async()

Added in: 2.0.14

async def scan_async(
    self,
    text: str,
    *,
    language: str = "en",
    score_threshold: float = 0.5,
) -> PIITextScanResult

Non-blocking async variant of scan_text(). Delegates to the synchronous method via asyncio.get_event_loop().run_in_executor() so it never blocks the event loop.

| Parameter | Default | Description |
| --- | --- | --- |
| text | (required) | The text string to scan. |
| language | "en" | ISO 639-1 language code. |
| score_threshold | 0.5 | Filter out entities below this confidence score. |

Returns: PIITextScanResult

Example:

import asyncio
from spanforge.sdk import sf_pii

result = asyncio.run(sf_pii.scan_async("alice@example.com"))
assert result.detected
assert result.entities[0].type == "EMAIL_ADDRESS"
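
The executor delegation described above can be sketched as follows. `blocking_scan` is a hypothetical stand-in for the synchronous scan, and the sketch uses `asyncio.get_running_loop()`, which is the form the asyncio documentation recommends inside coroutines; the library's actual implementation may differ.

```python
import asyncio
import functools


def blocking_scan(text: str, language: str = "en") -> dict:
    """Hypothetical stand-in for a synchronous scan."""
    return {"detected": "@" in text, "language": language}


async def scan_async_sketch(text: str, *, language: str = "en") -> dict:
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor; partial carries the
    # keyword argument because run_in_executor only forwards positionals.
    call = functools.partial(blocking_scan, text, language=language)
    return await loop.run_in_executor(None, call)
```

While the worker thread runs the scan, the event loop stays free to service other coroutines.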

scan_batch()

async def scan_batch(
    self,
    texts: list[str],
    *,
    language: str = "en",
) -> list[PIITextScanResult]

Scan multiple texts in parallel via asyncio.gather. Falls back to sequential execution when no running event loop is present.

Example:

import asyncio

results = asyncio.run(sf_pii.scan_batch(["alice@example.com", "hello world"]))
assert results[0].detected
assert not results[1].detected
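
A minimal sketch of the `asyncio.gather` fan-out, assuming a hypothetical `scan_one` coroutine in place of the real per-text scan:

```python
import asyncio


async def scan_one(text: str) -> bool:
    """Hypothetical async scan: runs a trivial check in a worker thread."""
    return await asyncio.to_thread(lambda: "@" in text)


async def scan_batch_sketch(texts: list[str]) -> list[bool]:
    # gather() returns results in the order the awaitables were passed in,
    # regardless of which finishes first.
    return await asyncio.gather(*(scan_one(t) for t in texts))
```

Because `gather` preserves input order, `results[i]` always corresponds to `texts[i]`.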

apply_pipeline_action()

def apply_pipeline_action(
    self,
    scan_result: PIITextScanResult,
    action: str = "flag",
    *,
    threshold: float = 0.85,
) -> PIIPipelineResult

Enforce a pipeline action based on a previous scan_text() result.

| Action | Behaviour |
| --- | --- |
| "flag" | Return result with detected set; no text modification. |
| "redact" | Replace PII spans in redacted_text; detected=True. |
| "block" | Raise SFPIIBlockedError; never returns. |

The threshold parameter filters entities: only those with score >= threshold are considered when deciding whether to fire the action (default 0.85). Sub-threshold entities are still included in scan_result for audit purposes.

Returns: PIIPipelineResult

Raises: SFPIIBlockedError (action "block" only)

Example:

from spanforge.sdk._exceptions import SFPIIBlockedError

scan = sf_pii.scan_text("My SSN is 078-05-1120")
try:
    pipeline = sf_pii.apply_pipeline_action(scan, action="block", threshold=0.85)
except SFPIIBlockedError as exc:
    print("Blocked:", exc.entity_types)   # ["US_SSN"]
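
The interaction between the threshold and the three actions can be sketched in a few lines. Everything here is hypothetical (`Entity`, `BlockedError`, `apply_action_sketch`), and the sketch assumes that "block" raises only when at least one entity meets the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    type: str
    start: int
    end: int
    score: float


class BlockedError(Exception):
    """Hypothetical stand-in for SFPIIBlockedError."""

    def __init__(self, entity_types):
        super().__init__(f"blocked: {entity_types}")
        self.entity_types = entity_types


def apply_action_sketch(text, entities, action="flag", threshold=0.85):
    # Only entities at or above the threshold can trigger the action.
    hits = [e for e in entities if e.score >= threshold]
    types = sorted({e.type for e in hits})
    if action == "block" and hits:
        raise BlockedError(types)
    redacted = None
    if action == "redact":
        redacted = text
        # Replace from the right so earlier offsets stay valid.
        for e in sorted(hits, key=lambda e: e.start, reverse=True):
            redacted = redacted[:e.start] + f"<{e.type}>" + redacted[e.end:]
    return {"action": action, "detected": bool(hits),
            "entity_types": types, "redacted_text": redacted}
```

In this sketch a sub-threshold entity is left in place by "redact", which mirrors the documented rule that only entities with score >= threshold are considered.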

get_status()

def get_status(self) -> PIIServiceStatus

Return the current sf-pii service status.

Returns: PIIServiceStatus

Example:

status = sf_pii.get_status()
print(status.presidio_available)      # True / False
print(status.entity_types_loaded)     # ["EMAIL_ADDRESS", "PHONE_NUMBER", ...]
print(status.last_scan_at)            # "2026-04-17T12:00:00Z" or None

erase_subject()

def erase_subject(
    self,
    subject_id: str,
    project_id: str,
) -> ErasureReceipt

GDPR Article 17 — Right to Erasure. Locate all audit events associated with subject_id in project_id and issue erasure instructions to the downstream store.

The subject_id is SHA-256 hashed in the returned ErasureReceipt — the raw identifier is never persisted.

Returns: ErasureReceipt

Example:

receipt = sf_pii.erase_subject("user-12345", "proj-abc")
print(receipt.records_erased)         # 42
print(receipt.erased_at)              # "2026-04-17T12:00:00Z"
print(receipt.subject_id_hash)        # SHA-256 of "user-12345"
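
For reference, a SHA-256 hex digest of an identifier can be computed with `hashlib`. The `subject_hash` helper and the UTF-8 encoding are assumptions; the source does not state how the SDK encodes or salts the value.

```python
import hashlib


def subject_hash(subject_id: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded identifier (encoding is assumed)."""
    return hashlib.sha256(subject_id.encode("utf-8")).hexdigest()
```

The digest is deterministic, so the same subject_id always maps to the same 64-character hex string, which allows receipts to be correlated without storing the raw identifier.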

export_subject_data()

def export_subject_data(
    self,
    subject_id: str,
    project_id: str,
) -> DSARExport

CCPA / GDPR — Data Subject Access Request (DSAR). Aggregate all events referencing subject_id from the audit store for project_id.

Returns: DSARExport

Example:

export = sf_pii.export_subject_data("user-12345", "proj-abc")
print(export.total_records)
for record in export.records:
    print(record["event_type"], record["created_at"])

safe_harbor_deidentify()

def safe_harbor_deidentify(self, text: str) -> SafeHarborResult

HIPAA Safe Harbor De-identification per 45 CFR §164.514(b)(2). Removes or generalises all 18 PHI identifier categories from text:

| Identifier category | Rule |
| --- | --- |
| Dates (except year) | Replaced with year only, e.g. April 17 2026 → 2026 |
| Ages > 89 | Replaced with "90+" |
| ZIP codes | Truncated to first 3 digits, e.g. 90210 → 902XX |
| Phone/fax | Removed |
| Email | Removed |
| SSN, MRN, account/certificate numbers | Removed |
| URLs, IP addresses, device identifiers | Removed |
| Names, geographic subdivisions, biometric data | Removed |

Returns: SafeHarborResult

Example:

result = sf_pii.safe_harbor_deidentify(
    "Patient John Doe (DOB: 04/17/1932, MRN 0000-4321) lives at 902 Oak Lane, 90210."
)
print(result.deidentified_text)
# "Patient [NAME] (DOB: 1932, MRN [REMOVED]) lives at [ADDRESS], 902XX."
print(result.identifiers_removed)    # 5
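
Two of the generalisation rules (dates and ZIP codes) can be illustrated with plain regular expressions. This is a sketch only: the `ZIP` and `US_DATE` patterns and the `generalise_sketch` helper are hypothetical, cover a single date format, and would treat any standalone five-digit number as a ZIP code.

```python
import re

ZIP = re.compile(r"\b(\d{3})\d{2}\b")
US_DATE = re.compile(r"\b\d{1,2}/\d{1,2}/(\d{4})\b")


def generalise_sketch(text: str) -> tuple[str, int]:
    """Apply the ZIP and date rules only; returns (text, replacements made)."""
    text, n_dates = US_DATE.subn(r"\1", text)   # keep the year only
    text, n_zips = ZIP.subn(r"\1XX", text)      # keep the first three digits
    return text, n_dates + n_zips
```

Dates are rewritten before ZIP codes so that the four-digit year left behind is not mistaken for part of a five-digit number.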

audit_training_data()

def audit_training_data(
    self,
    dataset_path: str,
    *,
    max_records: int = 10_000,
    language: str = "en",
) -> TrainingDataPIIReport

EU AI Act Article 10 — Training Data Governance. Scan each line of a JSONL dataset file for PII and produce a prevalence report. Lines that are not valid JSON are counted as malformed_lines and skipped.

| Parameter | Default | Description |
| --- | --- | --- |
| dataset_path | (required) | Path to a JSONL dataset file. |
| max_records | 10_000 | Stop after scanning this many records. |
| language | "en" | Language code forwarded to scan_text. |

Returns: TrainingDataPIIReport

Example:

report = sf_pii.audit_training_data("dataset/train.jsonl", max_records=5000)
print(f"{report.pii_record_count} / {report.total_records} records contain PII")
for entry in report.entity_breakdown:
    print(f"  {entry.entity_type}: {entry.count}")
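
The line-by-line JSONL audit can be sketched as below. The `audit_jsonl_sketch` function and its email-only detector are hypothetical, and two behaviours are assumptions: blank lines are skipped without being counted as malformed, and `max_records` counts successfully parsed records.

```python
import json
import re
from collections import Counter

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")  # hypothetical detector


def audit_jsonl_sketch(path: str, max_records: int = 10_000) -> dict:
    total = with_pii = malformed = 0
    breakdown = Counter()
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if total >= max_records:
                break
            if not line.strip():
                continue  # assumption: blank lines are ignored, not malformed
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                malformed += 1
                continue
            total += 1
            # Scan the re-serialised record so nested strings are covered too.
            hits = EMAIL.findall(json.dumps(record))
            if hits:
                with_pii += 1
                breakdown["EMAIL_ADDRESS"] += len(hits)
    return {"total_records": total, "pii_record_count": with_pii,
            "malformed_lines": malformed, "entity_breakdown": dict(breakdown)}
```

Reading the file one line at a time keeps memory use flat regardless of dataset size.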

get_pii_stats()

def get_pii_stats(self, project_id: str) -> list[PIIHeatMapEntry]

Aggregate PII detection statistics per entity type for project_id. Powers the SpanForge dashboard PII heat-map.

Returns: list[PIIHeatMapEntry]

Example:

for entry in sf_pii.get_pii_stats("proj-abc"):
    print(entry.entity_type, entry.count, entry.last_seen_at)
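
A per-entity-type aggregation of this kind can be sketched as a single pass over detection events. The `heat_map_sketch` helper and the `(entity_type, iso_timestamp)` input shape are hypothetical; the source does not describe how the service stores detections.

```python
def heat_map_sketch(detections):
    """detections: iterable of (entity_type, iso_timestamp) pairs (hypothetical shape)."""
    stats = {}
    for entity_type, seen_at in detections:
        count, last = stats.get(entity_type, (0, None))
        # Same-format UTC ISO-8601 strings sort chronologically as plain text.
        newest = seen_at if last is None or seen_at > last else last
        stats[entity_type] = (count + 1, newest)
    return stats
```

Each value is a `(count, last_seen_at)` pair, matching the two statistics carried by a heat-map entry.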

Return types

PIIEntityResult

@dataclass(frozen=True)
class PIIEntityResult:
    type: str
    start: int
    end: int
    score: float

A single detected PII entity.

| Field | Description |
| --- | --- |
| type | Entity type label, e.g. "EMAIL_ADDRESS", "US_SSN", "PIPL_NATIONAL_ID". |
| start | Byte offset of the match start in the input text. |
| end | Byte offset of the match end (exclusive). |
| score | Confidence score in [0.0, 1.0]. |

PIITextScanResult

@dataclass
class PIITextScanResult:
    detected: bool
    entities: list[PIIEntityResult]
    redacted_text: str

| Field | Description |
| --- | --- |
| detected | True if at least one entity was found. |
| entities | All entities found (regardless of score threshold). |
| redacted_text | Input text with all detected PII replaced by <TYPE> placeholders. |

PIIRedactionManifestEntry

@dataclass(frozen=True)
class PIIRedactionManifestEntry:
    field_path: str
    entity_type: str
    original_hash: str
    replacement: str

One entry in an anonymise() redaction manifest.

| Field | Description |
| --- | --- |
| field_path | Dot-delimited path to the field in the original payload (e.g. "user.email"). |
| entity_type | Entity type label. |
| original_hash | SHA-256 of the original field value. The raw value is never stored. |
| replacement | The <TYPE> placeholder that replaced the original. |

PIIAnonymisedResult

@dataclass
class PIIAnonymisedResult:
    clean_payload: dict
    redaction_manifest: list[PIIRedactionManifestEntry]

| Field | Description |
| --- | --- |
| clean_payload | Deep copy of the input payload with all PII replaced. |
| redaction_manifest | Full list of replacements made. |

PIIPipelineResult

@dataclass
class PIIPipelineResult:
    action: str
    detected: bool
    entity_types: list[str]
    redacted_text: str | None

| Field | Description |
| --- | --- |
| action | The action that was applied: "flag", "redact", or "block". |
| detected | Whether PII above the threshold was present. |
| entity_types | List of distinct entity type labels that triggered the action. |
| redacted_text | Redacted text (only set when action="redact"). |

PIIServiceStatus

@dataclass
class PIIServiceStatus:
    status: str
    presidio_available: bool
    entity_types_loaded: list[str]
    last_scan_at: str | None

| Field | Description |
| --- | --- |
| status | "ok" or "degraded". |
| presidio_available | True when the Presidio engine is importable and healthy. |
| entity_types_loaded | Entity type labels currently registered in the active engine. |
| last_scan_at | ISO-8601 timestamp of the most recent scan, or None. |

ErasureReceipt

@dataclass
class ErasureReceipt:
    subject_id_hash: str
    project_id: str
    records_erased: int
    erased_at: str

| Field | Description |
| --- | --- |
| subject_id_hash | SHA-256 hex digest of the subject ID (never the raw ID). |
| project_id | Project the erasure was scoped to. |
| records_erased | Number of audit records removed. |
| erased_at | ISO-8601 UTC timestamp of the erasure. |

DSARExport

@dataclass
class DSARExport:
    subject_id_hash: str
    project_id: str
    total_records: int
    records: list[dict]
    exported_at: str

| Field | Description |
| --- | --- |
| subject_id_hash | SHA-256 hex digest of the subject ID. |
| project_id | Project the export was scoped to. |
| total_records | Number of records in the export. |
| records | List of serialisable event dicts. |
| exported_at | ISO-8601 UTC timestamp of the export. |

SafeHarborResult

@dataclass
class SafeHarborResult:
    deidentified_text: str
    identifiers_removed: int

| Field | Description |
| --- | --- |
| deidentified_text | The input text after Safe Harbor de-identification. |
| identifiers_removed | Count of identifier instances removed or generalised. |

TrainingDataPIIReport

@dataclass
class TrainingDataPIIReport:
    dataset_path: str
    total_records: int
    pii_record_count: int
    malformed_lines: int
    entity_breakdown: list[PIIHeatMapEntry]

| Field | Description |
| --- | --- |
| dataset_path | Path that was scanned. |
| total_records | Lines successfully parsed. |
| pii_record_count | Lines that contained at least one PII entity. |
| malformed_lines | Lines that could not be parsed as JSON (skipped). |
| entity_breakdown | Per-entity-type occurrence counts. |

PIIHeatMapEntry

@dataclass
class PIIHeatMapEntry:
    entity_type: str
    count: int
    last_seen_at: str | None

| Field | Description |
| --- | --- |
| entity_type | Entity type label (e.g. "EMAIL_ADDRESS"). |
| count | Number of times this entity type was detected in the project. |
| last_seen_at | ISO-8601 timestamp of the most recent detection, or None. |

Exceptions

| Exception | Description |
| --- | --- |
| SFPIIError | Base class for all sf-pii SDK errors. |
| SFPIIBlockedError(entity_types, count) | Raised by apply_pipeline_action(action="block"). entity_types lists the types that triggered the block. |
| SFPIIDPDPConsentMissingError(subject_id_hash, entity_types) | Raised when a DPDP-regulated entity is detected but no valid consent record exists for the current purpose. subject_id_hash is a SHA-256 digest. |
| SFPIIScanError | Wraps unexpected engine failures. |

See exceptions reference for full details.


PIPL entity types

China PIPL-specific entity types registered in presidio_backend.py:

| Type | Pattern |
| --- | --- |
| PIPL_NATIONAL_ID | Chinese national ID: 17 digits followed by a digit or X |
| PIPL_MOBILE | Chinese mobile: 1[3-9] followed by 9 digits |
| PIPL_BANK_CARD | Chinese bank card: 16–19 digit card numbers |

These types are flagged as pipl_sensitive for cross-border transfer controls.
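
The three patterns in the table above translate roughly into the regular expressions below. This is a sketch derived from the prose descriptions only: `PIPL_PATTERNS` and `classify_pipl` are hypothetical names, the digit-boundary lookarounds are an assumption, and no checksum validation is attempted, so the recognisers in presidio_backend.py may well be stricter.

```python
import re

# Hypothetical regexes derived from the table; no checksum validation.
PIPL_PATTERNS = {
    "PIPL_NATIONAL_ID": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    "PIPL_MOBILE": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "PIPL_BANK_CARD": re.compile(r"(?<!\d)\d{16,19}(?!\d)"),
}


def classify_pipl(text: str) -> list[str]:
    """Return the labels whose pattern matches anywhere in text."""
    return [label for label, rx in PIPL_PATTERNS.items() if rx.search(text)]
```

Note that the length ranges overlap: an 18-digit national ID ending in a digit also satisfies the 16–19 digit card pattern, so pattern matching alone cannot separate the two without extra context or checksum rules.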


Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| SPANFORGE_SF_PII_ENDPOINT | (none) | Remote sf-pii service URL. When set, SFPIIClient forwards scans to the service; falls back to local engine on network error. |
| SPANFORGE_PII_ACTION | "flag" | Default pipeline action ("flag" / "redact" / "block"). |
| SPANFORGE_PII_THRESHOLD | 0.85 | Default confidence threshold for pipeline action enforcement. |
| SPANFORGE_PII_LANGUAGE | "en" | Default language code forwarded to scan calls. |
| SPANFORGE_PII_MAX_DEPTH | 10 | Maximum recursion depth for anonymise(). |
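
As an illustration of how these variables and their defaults fit together, the sketch below reads them into a small settings object. `PIISettings` and `load_settings` are hypothetical helpers, not part of the SDK, and the validation of the action value is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PIISettings:
    endpoint: Optional[str]
    action: str
    threshold: float
    language: str
    max_depth: int


def load_settings(env=os.environ) -> PIISettings:
    """Read the documented variables, applying the documented defaults."""
    action = env.get("SPANFORGE_PII_ACTION", "flag")
    if action not in {"flag", "redact", "block"}:
        raise ValueError(f"unsupported action: {action!r}")  # assumed validation
    return PIISettings(
        endpoint=env.get("SPANFORGE_SF_PII_ENDPOINT") or None,
        action=action,
        threshold=float(env.get("SPANFORGE_PII_THRESHOLD", "0.85")),
        language=env.get("SPANFORGE_PII_LANGUAGE", "en"),
        max_depth=int(env.get("SPANFORGE_PII_MAX_DEPTH", "10")),
    )
```

Passing a plain dict as `env` makes the helper easy to exercise in tests without touching the real process environment.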

See also