spanforge.sdk.pii — PII Service Client

Module: spanforge.sdk.pii
Added in: 2.0.3 (Phase 3: PII Service Hardening)

spanforge.sdk.pii provides the Phase 3 PII service client with full-text scanning, payload anonymisation, pipeline action enforcement, GDPR/HIPAA/CCPA/ DPDP/PIPL compliance helpers, and training-data auditing.

The pre-built sf_pii singleton is available at the top level:

from spanforge.sdk import sf_pii

Quick example

from spanforge.sdk import sf_pii

# Scan raw text
result = sf_pii.scan_text("Contact alice@example.com or call +1 555-867-5309")
print(result.detected)           # True
for entity in result.entities:
    print(entity.type, entity.start, entity.end, entity.score)

# Anonymise a payload dict
anon = sf_pii.anonymise({
    "user": "alice@example.com",
    "note": "SSN 078-05-1120",
})
print(anon.clean_payload)        # {"user": "<EMAIL_ADDRESS>", "note": "<US_SSN>"}

`SFPIIClient`

class SFPIIClient(SFServiceClient)

All methods are thread-safe. The class can be used standalone or via the sf_pii singleton exported from spanforge.sdk.

`scan_text()`

def scan_text(
    self,
    text: str,
    *,
    language: str = "en",
) -> PIITextScanResult

Scan raw text for PII entities. Uses Presidio when installed; falls back to the regex-based redact.scan_payload() engine automatically.

Parameter	Default	Description
`text`	(required)	The text string to scan.
`language`	`"en"`	ISO 639-1 language code (e.g. `"zh"` for Chinese).

Returns: PIITextScanResult

Example:

result = sf_pii.scan_text("SSN: 078-05-1120", language="en")
assert result.detected
assert result.entities[0].type == "US_SSN"

`anonymise()`

def anonymise(
    self,
    payload: dict,
    *,
    language: str = "en",
    max_depth: int = 10,
) -> PIIAnonymisedResult

Recursively walk payload, scan every string value, and replace PII hits with <TYPE> placeholders. Returns the cleaned payload alongside a full redaction_manifest recording each replacement.

Returns: PIIAnonymisedResult

Example:

anon = sf_pii.anonymise({"email": "alice@example.com", "meta": {"ip": "203.0.113.4"}})
# anon.clean_payload == {"email": "<EMAIL_ADDRESS>", "meta": {"ip": "<IP_ADDRESS>"}}
# anon.redaction_manifest[0].field_path == "email"
# anon.redaction_manifest[0].original_hash  (SHA-256 of original — never raw PII)

`scan_async()`

Added in: 2.0.14

async def scan_async(
    self,
    text: str,
    *,
    language: str = "en",
    score_threshold: float = 0.5,
) -> PIITextScanResult

Non-blocking async variant of scan_text(). Delegates to the synchronous method via asyncio.get_event_loop().run_in_executor() so it never blocks the event loop.

Parameter	Default	Description
`text`	(required)	The text string to scan.
`language`	`"en"`	ISO 639-1 language code.
`score_threshold`	`0.5`	Filter out entities below this confidence score.

Returns: PIITextScanResult

Example:

import asyncio
from spanforge.sdk import sf_pii

result = asyncio.run(sf_pii.scan_async("alice@example.com"))
assert result.detected
assert result.entities[0].type == "EMAIL_ADDRESS"

`scan_batch()`

async def scan_batch(
    self,
    texts: list[str],
    *,
    language: str = "en",
) -> list[PIITextScanResult]

Scan multiple texts in parallel via asyncio.gather. Falls back to sequential execution when no running event loop is present.

Example:

import asyncio

results = asyncio.run(sf_pii.scan_batch(["alice@example.com", "hello world"]))
assert results[0].detected
assert not results[1].detected

`apply_pipeline_action()`

def apply_pipeline_action(
    self,
    scan_result: PIITextScanResult,
    action: str = "flag",
    *,
    threshold: float = 0.85,
) -> PIIPipelineResult

Enforce a pipeline action based on a previous scan_text() result.

Action	Behaviour
`"flag"`	Return result with `detected` set; no text modification.
`"redact"`	Replace PII spans in `redacted_text`; `detected=True`.
`"block"`	Raise `SFPIIBlockedError`; never returns.

The threshold parameter filters entities: only those with score >= threshold are considered when deciding whether to fire the action (default 0.85). Sub-threshold entities are still included in scan_result for audit purposes.

Returns: PIIPipelineResult

Raises: SFPIIBlockedError (action "block" only)

Example:

from spanforge.sdk._exceptions import SFPIIBlockedError

scan = sf_pii.scan_text("My SSN is 078-05-1120")
try:
    pipeline = sf_pii.apply_pipeline_action(scan, action="block", threshold=0.85)
except SFPIIBlockedError as exc:
    print("Blocked:", exc.entity_types)   # ["US_SSN"]

`get_status()`

def get_status(self) -> PIIServiceStatus

Return the current sf-pii service status.

Returns: PIIServiceStatus

Example:

status = sf_pii.get_status()
print(status.presidio_available)      # True / False
print(status.entity_types_loaded)     # ["EMAIL_ADDRESS", "PHONE_NUMBER", ...]
print(status.last_scan_at)            # "2026-04-17T12:00:00Z" or None

`erase_subject()`

def erase_subject(
    self,
    subject_id: str,
    project_id: str,
) -> ErasureReceipt

GDPR Article 17 — Right to Erasure. Locate all audit events associated with subject_id in project_id and issue erasure instructions to the downstream store.

The subject_id is SHA-256 hashed in the returned ErasureReceipt — the raw identifier is never persisted.

Returns: ErasureReceipt

Example:

receipt = sf_pii.erase_subject("user-12345", "proj-abc")
print(receipt.records_erased)         # 42
print(receipt.erased_at)              # "2026-04-17T12:00:00Z"
print(receipt.subject_id_hash)        # SHA-256 of "user-12345"

`export_subject_data()`

def export_subject_data(
    self,
    subject_id: str,
    project_id: str,
) -> DSARExport

CCPA / GDPR — Data Subject Access Request (DSAR). Aggregate all events referencing subject_id from the audit store for project_id.

Returns: DSARExport

Example:

export = sf_pii.export_subject_data("user-12345", "proj-abc")
print(export.total_records)
for record in export.records:
    print(record["event_type"], record["created_at"])

`safe_harbor_deidentify()`

def safe_harbor_deidentify(self, text: str) -> SafeHarborResult

HIPAA Safe Harbor De-identification per 45 CFR §164.514(b)(2). Removes or generalises all 18 PHI identifier categories from text:

Transformation	Rule
Dates (except year)	Replaced with year only — `April 17 2026` → `2026`
Ages > 89	Replaced with `"90+"`
ZIP codes	Truncated to first 3 digits — `90210` → `902XX`
Phone/fax	Removed
Email	Removed
SSN, MRN, account/certificate numbers	Removed
URLs, IP addresses, device identifiers	Removed
Names, geographic subdivisions, biometric data	Removed

Returns: SafeHarborResult

Example:

result = sf_pii.safe_harbor_deidentify(
    "Patient John Doe (DOB: 04/17/1932, MRN 0000-4321) lives at 902 Oak Lane, 90210."
)
print(result.deidentified_text)
# "Patient [NAME] (DOB: 1932, MRN [REMOVED]) lives at [ADDRESS], 902XX."
print(result.identifiers_removed)    # 5

`audit_training_data()`

def audit_training_data(
    self,
    dataset_path: str,
    *,
    max_records: int = 10_000,
    language: str = "en",
) -> TrainingDataPIIReport

EU AI Act Article 10 — Training Data Governance. Scan each line of a JSONL dataset file for PII and produce a prevalence report. Lines that are not valid JSON are counted as malformed_lines and skipped.

Parameter	Default	Description
`dataset_path`	(required)	Path to a JSONL dataset file.
`max_records`	`10_000`	Stop after scanning this many records.
`language`	`"en"`	Language code forwarded to `scan_text`.

Returns: TrainingDataPIIReport

Example:

report = sf_pii.audit_training_data("dataset/train.jsonl", max_records=5000)
print(f"{report.pii_record_count} / {report.total_records} records contain PII")
for entry in report.entity_breakdown:
    print(f"  {entry.entity_type}: {entry.count}")

`get_pii_stats()`

def get_pii_stats(self, project_id: str) -> list[PIIHeatMapEntry]

Aggregate PII detection statistics per entity type for project_id. Powers the SpanForge dashboard PII heat-map.

Returns: list[PIIHeatMapEntry]

Example:

for entry in sf_pii.get_pii_stats("proj-abc"):
    print(entry.entity_type, entry.count, entry.last_seen_at)

Return types

`PIIEntityResult`

@dataclass(frozen=True)
class PIIEntityResult:
    type: str
    start: int
    end: int
    score: float

A single detected PII entity.

Field	Description
`type`	Entity type label, e.g. `"EMAIL_ADDRESS"`, `"US_SSN"`, `"PIPL_NATIONAL_ID"`.
`start`	Byte offset of the match start in the input text.
`end`	Byte offset of the match end (exclusive).
`score`	Confidence score in `[0.0, 1.0]`.

`PIITextScanResult`

@dataclass
class PIITextScanResult:
    detected: bool
    entities: list[PIIEntityResult]
    redacted_text: str

Field	Description
`detected`	`True` if at least one entity was found.
`entities`	All entities found (regardless of score threshold).
`redacted_text`	Input text with all detected PII replaced by `<TYPE>` placeholders.

`PIIRedactionManifestEntry`

@dataclass(frozen=True)
class PIIRedactionManifestEntry:
    field_path: str
    entity_type: str
    original_hash: str
    replacement: str

One entry in an anonymise() redaction manifest.

Field	Description
`field_path`	Dot-delimited path to the field in the original payload (e.g. `"user.email"`).
`entity_type`	Entity type label.
`original_hash`	SHA-256 of the original field value. The raw value is never stored.
`replacement`	The `<TYPE>` placeholder that replaced the original.

`PIIAnonymisedResult`

@dataclass
class PIIAnonymisedResult:
    clean_payload: dict
    redaction_manifest: list[PIIRedactionManifestEntry]

Field	Description
`clean_payload`	Deep copy of the input payload with all PII replaced.
`redaction_manifest`	Full list of replacements made.

`PIIPipelineResult`

@dataclass
class PIIPipelineResult:
    action: str
    detected: bool
    entity_types: list[str]
    redacted_text: str | None

Field	Description
`action`	The action that was applied: `"flag"`, `"redact"`, or `"block"`.
`detected`	Whether PII above the threshold was present.
`entity_types`	List of distinct entity type labels that triggered the action.
`redacted_text`	Redacted text (only set when `action="redact"`).

`PIIServiceStatus`

@dataclass
class PIIServiceStatus:
    status: str
    presidio_available: bool
    entity_types_loaded: list[str]
    last_scan_at: str | None

Field	Description
`status`	`"ok"` or `"degraded"`.
`presidio_available`	`True` when the Presidio engine is importable and healthy.
`entity_types_loaded`	Entity type labels currently registered in the active engine.
`last_scan_at`	ISO-8601 timestamp of the most recent scan, or `None`.

`ErasureReceipt`

@dataclass
class ErasureReceipt:
    subject_id_hash: str
    project_id: str
    records_erased: int
    erased_at: str

Field	Description
`subject_id_hash`	SHA-256 hex digest of the subject ID (never the raw ID).
`project_id`	Project the erasure was scoped to.
`records_erased`	Number of audit records removed.
`erased_at`	ISO-8601 UTC timestamp of the erasure.

`DSARExport`

@dataclass
class DSARExport:
    subject_id_hash: str
    project_id: str
    total_records: int
    records: list[dict]
    exported_at: str

Field	Description
`subject_id_hash`	SHA-256 hex digest of the subject ID.
`project_id`	Project the export was scoped to.
`total_records`	Number of records in the export.
`records`	List of serialisable event dicts.
`exported_at`	ISO-8601 UTC timestamp of the export.

`SafeHarborResult`

@dataclass
class SafeHarborResult:
    deidentified_text: str
    identifiers_removed: int

Field	Description
`deidentified_text`	The input text after Safe Harbor de-identification.
`identifiers_removed`	Count of identifier instances removed or generalised.

`TrainingDataPIIReport`

@dataclass
class TrainingDataPIIReport:
    dataset_path: str
    total_records: int
    pii_record_count: int
    malformed_lines: int
    entity_breakdown: list[PIIHeatMapEntry]

Field	Description
`dataset_path`	Path that was scanned.
`total_records`	Lines successfully parsed.
`pii_record_count`	Lines that contained at least one PII entity.
`malformed_lines`	Lines that could not be parsed as JSON (skipped).
`entity_breakdown`	Per-entity-type occurrence counts.

`PIIHeatMapEntry`

@dataclass
class PIIHeatMapEntry:
    entity_type: str
    count: int
    last_seen_at: str | None

Field	Description
`entity_type`	Entity type label (e.g. `"EMAIL_ADDRESS"`).
`count`	Number of times this entity type was detected in the project.
`last_seen_at`	ISO-8601 timestamp of the most recent detection, or `None`.

Exceptions

Exception	Description
`SFPIIError`	Base class for all sf-pii SDK errors.
`SFPIIBlockedError(entity_types, count)`	Raised by `apply_pipeline_action(action="block")`. `entity_types` lists the types that triggered the block.
`SFPIIDPDPConsentMissingError(subject_id_hash, entity_types)`	Raised when a DPDP-regulated entity is detected but no valid consent record exists for the current purpose. `subject_id_hash` is a SHA-256 digest.
`SFPIIScanError`	Wraps unexpected engine failures.

See exceptions reference for full details.

PIPL entity types

China PIPL-specific entity types registered in presidio_backend.py:

Type	Pattern
`PIPL_NATIONAL_ID`	Chinese national ID — 17 digits followed by a digit or `X`
`PIPL_MOBILE`	Chinese mobile — `1[3-9]` followed by 9 digits
`PIPL_BANK_CARD`	Chinese bank card — 16–19 digit card numbers

These types are flagged as pipl_sensitive for cross-border transfer controls.

Environment variables

Variable	Default	Description
`SPANFORGE_SF_PII_ENDPOINT`	(none)	Remote sf-pii service URL. When set, `SFPIIClient` forwards scans to the service; falls back to local engine on network error.
`SPANFORGE_PII_ACTION`	`"flag"`	Default pipeline action (`"flag"` / `"redact"` / `"block"`).
`SPANFORGE_PII_THRESHOLD`	`0.85`	Default confidence threshold for pipeline action enforcement.
`SPANFORGE_PII_LANGUAGE`	`"en"`	Default language code forwarded to scan calls.
`SPANFORGE_PII_MAX_DEPTH`	`10`	Maximum recursion depth for `anonymise()`.

spanforge.sdk.pii — PII Service Client

Quick example

SFPIIClient

scan_text()

anonymise()

scan_async()

scan_batch()

apply_pipeline_action()

get_status()

erase_subject()

export_subject_data()

safe_harbor_deidentify()

audit_training_data()

get_pii_stats()

Return types

PIIEntityResult

PIITextScanResult

PIIRedactionManifestEntry

PIIAnonymisedResult

PIIPipelineResult

PIIServiceStatus

ErasureReceipt

DSARExport

SafeHarborResult

TrainingDataPIIReport

PIIHeatMapEntry

Exceptions

PIPL entity types

Environment variables

See also

`SFPIIClient`

`scan_text()`

`anonymise()`

`scan_async()`

`scan_batch()`

`apply_pipeline_action()`

`get_status()`

`erase_subject()`

`export_subject_data()`

`safe_harbor_deidentify()`

`audit_training_data()`

`get_pii_stats()`

`PIIEntityResult`

`PIITextScanResult`

`PIIRedactionManifestEntry`

`PIIAnonymisedResult`

`PIIPipelineResult`

`PIIServiceStatus`

`ErasureReceipt`

`DSARExport`

`SafeHarborResult`

`TrainingDataPIIReport`

`PIIHeatMapEntry`