
How to Detect PII in Training Data (With Examples and Commands)

The problem

You're preparing a dataset for model training. It looks clean. But buried in 50,000 rows are email addresses, phone numbers, names, and government IDs—collected during normal product usage.

You won't find them manually. And if you don't find them before training, you've baked GDPR violations into your model weights.


Why this matters

Training on PII without consent creates compounding risk:

  • GDPR Article 5(1)(c): personal data must be adequate, relevant, and limited to what is necessary for the purpose ("data minimization")
  • EU AI Act Article 10: training, validation, and testing data for high-risk systems must be relevant, sufficiently representative, and, to the best extent possible, free of errors and complete, under documented data governance practices
  • CCPA / US state laws: model outputs that reproduce personal data create re-identification exposure

Most teams discover this during audits—not before. By then, the model is in production.

SpanForge detects it before production.


What SpanForge scans for

PII Type            Examples              Detection Method
------------------  --------------------  ---------------------
Email addresses     user@company.com      Regex + contextual
Phone numbers       +1-555-867-5309       E.164 + local formats
Names               John Smith            NER + dictionary
Government IDs      SSN, NIN, passport    Pattern + checksum
IP addresses        192.168.1.1           CIDR-aware
Financial data      Card numbers, IBANs   Luhn + BIC
Health identifiers  MRN, NPI              Healthcare-specific
Free-text PII       "my name is Sarah"    Contextual NLP
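To make the "Detection Method" column concrete, here is a minimal sketch of three of the simpler techniques: a regex for email addresses, the Luhn checksum for card numbers, and IP validation with Python's standard `ipaddress` module. This is not SpanForge's implementation. The names `EMAIL_RE`, `luhn_valid`, and `is_ip_address` are hypothetical, and the email pattern is deliberately simplified.

```python
import ipaddress
import re

# Simplified pattern for illustration only; real email grammar is broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 12:  # too short to be a payment card number
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_ip_address(token: str) -> bool:
    """Return True if `token` parses as an IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


emails = EMAIL_RE.findall("Contact user@company.com for details")
card_ok = luhn_valid("4111 1111 1111 1111")  # widely used test card number
ip_ok = is_ip_address("192.168.1.1")
```

A checksum such as Luhn is useful mainly for cutting false positives: a 16-digit order number that fails the check is unlikely to be a card number.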

Example

Input dataset row:

{
  "prompt": "Help me draft an email to john.doe@acme.com about the Q3 invoice",
  "completion": "Hi John, I'm writing on behalf of..."
}

SpanForge scan output:

[PII DETECTED]
Field: prompt
Match: john.doe@acme.com
Type: EMAIL_ADDRESS
Confidence: 0.99
Regulation: GDPR Article 5, EU AI Act Article 10
Action: REDACT | EXCLUDE | FLAG

After redaction:

{
  "prompt": "Help me draft an email to [REDACTED:EMAIL] about the Q3 invoice",
  "completion": "Hi [REDACTED:FIRSTNAME], I'm writing on behalf of..."
}
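The email half of this redaction can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not SpanForge's code: `EMAIL_RE`, `redact_emails`, and `redact_row` are hypothetical names, and it handles only email addresses. Replacing "John" with `[REDACTED:FIRSTNAME]` needs name detection (NER), which this sketch does not attempt.

```python
import re

# Simplified pattern for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_emails(text: str) -> str:
    """Replace every email address in `text` with a typed token."""
    return EMAIL_RE.sub("[REDACTED:EMAIL]", text)


def redact_row(row: dict) -> dict:
    """Return a copy of `row` with emails redacted in every string field."""
    return {
        key: redact_emails(value) if isinstance(value, str) else value
        for key, value in row.items()
    }


row = {
    "prompt": "Help me draft an email to john.doe@acme.com about the Q3 invoice",
    "completion": "Hi John, I'm writing on behalf of...",
}
clean = redact_row(row)
```

Typed tokens such as `[REDACTED:EMAIL]` keep the sentence structure intact, so the row remains usable as a training example.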

Try this in 30 seconds

pip install spanforge

# Scan a dataset for PII
spanforge validate --dataset data.jsonl --pii-check

# Redact PII and write a cleaned copy
spanforge redact --dataset data.jsonl --output data.clean.jsonl

# Generate a compliance report
spanforge audit --dataset data.jsonl --format pdf
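For a sense of what a dataset scan does under the hood, here is a rough standard-library sketch that walks JSONL lines and emits one finding per email match. It is not the SpanForge CLI or SDK. `scan_jsonl` and the finding fields are hypothetical, chosen to resemble the scan output shown earlier, and only email addresses are covered.

```python
import json
import re
from typing import Iterable, Iterator

# Simplified pattern for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def scan_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one finding per email address found in any string field."""
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        if not isinstance(row, dict):
            continue  # this sketch only handles object rows
        for field, value in row.items():
            if not isinstance(value, str):
                continue
            for match in EMAIL_RE.finditer(value):
                yield {
                    "line": line_no,
                    "field": field,
                    "match": match.group(0),
                    "type": "EMAIL_ADDRESS",
                }


sample = [
    '{"prompt": "Help me draft an email to john.doe@acme.com about the Q3 invoice", '
    '"completion": "Hi John, I\'m writing on behalf of..."}',
]
findings = list(scan_jsonl(sample))
```

Because `scan_jsonl` accepts any iterable of lines, an open file handle works as well: `with open("data.jsonl") as f: findings = list(scan_jsonl(f))`.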

What you get

  • PII scan report — every match, field, confidence score, and regulation reference
  • Redacted dataset — ready-to-train file with PII replaced by typed tokens
  • Compliance evidence — signed, timestamped artifact you can hand to a regulator
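One way to think about a "signed, timestamped artifact" is a small record that binds a hash of the dataset to a timestamp and a signature. The sketch below uses an HMAC from Python's standard library purely as an illustration. SpanForge's actual evidence format and signing scheme are not described here, so `build_evidence`, `verify_evidence`, and the field names are all assumptions. An artifact meant for third parties would more plausibly use an asymmetric signature, since HMAC verification requires sharing the secret key.

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone


def build_evidence(dataset_bytes: bytes, findings_count: int, key: bytes) -> dict:
    """Build a timestamped record and sign it with HMAC-SHA256."""
    record = {
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "findings_count": findings_count,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    # Sort keys so signer and verifier serialize the record identically.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record


def verify_evidence(record: dict, key: bytes) -> bool:
    """Recompute the signature over the unsigned fields and compare."""
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    payload = json.dumps(unsigned, sort_keys=True).encode("utf-8")
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record.get("signature", ""))


evidence = build_evidence(b'{"prompt": "hello"}\n', findings_count=0, key=b"demo-key")
tampered = dict(evidence, findings_count=5)
```

Hashing the dataset ties the report to one exact file: if a single row changes after the scan, the hash no longer matches and the evidence no longer applies.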

Compliance mapping

Requirement            Standard           SpanForge Feature
---------------------  -----------------  -----------------
Data minimization      GDPR Art. 5(1)(c)  sf_pii scanner
Training data quality  EU AI Act Art. 10  sf_validate
Purpose limitation     GDPR Art. 5(1)(b)  sf_audit trail
Documented controls    ISO 27001 A.8      Evidence export


Run this with SpanForge

pip install spanforge

# Scan your dataset for PII
spanforge validate --dataset data.jsonl --pii-check

# Redact PII and write a cleaned copy
spanforge redact --dataset data.jsonl --output data.clean.jsonl

# Generate a signed compliance report
spanforge audit export --standard gdpr --format pdf

What you get: a redacted dataset plus a signed PDF showing every detection, confidence score, and regulation reference. Hand it directly to an auditor or enterprise procurement team.


