Training Data Compliance Standard
The SpanForge framework for compliant AI training data. It defines what constitutes compliant training data under the EU AI Act, GDPR, DPDP, and CCPA, and provides a detection framework covering 30+ PII categories, redaction standards, data lineage requirements, and audit-ready sign-off checklists.
What this standard provides
AI models trained on personal data can perpetuate biases, violate privacy regulations, and expose organisations to regulatory liability. High-risk decisions require transparent, auditable training data — yet no broadly adopted standard existed for what "compliant training data" means in practice.
This standard fills that gap. It is an open specification for data engineers, ML practitioners, compliance officers, and auditors — designed to be incrementally adoptable and regulator-facing.
- 30+ PII data types in the detection framework
- 6 redaction and anonymization techniques
- 4 regulatory frameworks covered
- 5 compliance checklist phases with sign-off
Regulations covered
This standard applies to datasets used to train or fine-tune AI models that process personal data, are classified as high-risk under the EU AI Act, or are deployed in regulated jurisdictions.
| Regulation | Article / Scope | Compliance area |
|---|---|---|
| EU AI Act | Article 10 | High-Risk Systems — training data quality, governance, bias checks |
| GDPR | Art. 5, 6, 13, 30 | Lawfulness, consent, records of processing activities |
| DPDP Act | India, 2023 | Digital Personal Data Protection — purpose limitation, consent |
| CCPA | California | Consumer rights, opt-out, data minimization for ML training |
Eight sections
The standard is structured in eight sections covering the full compliance lifecycle — from data sourcing through audit sign-off.
30+ PII categories across 10 groups
The framework covers every major PII category in regulated jurisdictions. Detection combines regex patterns, ML model-based contextual analysis, and mandatory manual spot-checks — regex alone is insufficient.
| Group | Category | Examples |
|---|---|---|
| A | Identifiers | Full name, email, phone, SSN, passport, driver license, national ID, tax ID |
| B | Biometric Data | Facial recognition, fingerprints, iris scan, voice print, DNA, palm print |
| C | Location Data | GPS coordinates, IP address, MAC address, WiFi SSID, geofence data |
| D | Financial Data | Credit card, bank account, routing number, IBAN, cryptocurrency, PayPal |
| E | Health Data | Medical record number, medication, diagnosis, lab results, fitness biometrics |
| F | Communication Data | Email messages, chat, phone calls, SMS, Slack/Teams content |
| G | Work / Education | Employee ID, job title, employer, school name, student ID, grades |
| H | Behavioral Data | Shopping history, browsing history, social media activity, location history |
| I | Government / Legal | Visa number, refugee status, criminal record, court records |
| J | Demographic Data | Exact age, ZIP code, race/ethnicity, gender, religion, sexual orientation, union membership, political affiliation |
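The regex layer of the detection approach described above can be sketched as follows. This is a minimal illustration, not the SpanForge implementation: the three patterns, the `PII_PATTERNS` table, and the `scan_record` helper are assumptions made for the example, and they cover only a few Group A and Group C categories. Contextual ML analysis and manual spot-checks would still be needed on top of it.

```python
import re

# Illustrative patterns only (hypothetical names); a real scanner needs far
# broader coverage and locale-specific variants.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_record(record):
    """Return (field, category, matched_text) tuples for every regex hit."""
    findings = []
    for field, value in record.items():
        if not isinstance(value, str):
            continue  # this sketch only inspects string values
        for category, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(value):
                findings.append((field, category, match.group(0)))
    return findings


record = {"note": "Contact jane.doe@example.com from 10.0.0.12", "age": 34}
findings = scan_record(record)
for field, category, text in findings:
    print(f"{field}: {category} -> {text}")
```

A pattern table like this catches only explicit, well-formatted values. A sentence such as "she attends mosque on Fridays" contains no match at all, which is why regex alone is treated as insufficient.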
⚠️ Context matters: Behavioral data that reveals protected characteristics (religion, sexual orientation, health, union membership, politics) is high-risk even after name removal. Quasi-identifier combinations (age + ZIP + job title) can re-identify individuals, so k-anonymity with k ≥ 5 is required: each combination of quasi-identifier values must be shared by at least five records.
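A k-anonymity check of the kind required above can be sketched in a few lines. The function names and the sample records are hypothetical. The idea is simply to group records by their quasi-identifier values and look at the smallest group.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest quasi-identifier group (0 if empty)."""
    sizes = group_sizes(records, quasi_identifiers)
    return min(sizes.values()) if sizes else 0


def violating_groups(records, quasi_identifiers, k=5):
    """Return the combinations shared by fewer than k records."""
    sizes = group_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Hypothetical, already-generalized records.
records = [
    {"age": "30-39", "zip": "941**", "job": "engineer"} for _ in range(5)
] + [{"age": "40-49", "zip": "100**", "job": "nurse"}]

qi = ["age", "zip", "job"]
k = k_anonymity(records, qi)
bad = violating_groups(records, qi, k=5)
print(f"k = {k}, violating groups: {bad}")
```

Here the single nurse record forms a group of size one, so the dataset fails the k ≥ 5 requirement until that record is further generalized or removed.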
Six redaction techniques
Each technique carries different utility/privacy trade-offs. The standard requires redaction completeness verification in four mandatory steps — automated re-scan, manual validation, quasi-identifier check, and domain expert review.
| # | Technique | Best use case | Key risk |
|---|---|---|---|
| 1 | Complete Removal | Data not needed for training | May lose context |
| 2 | Hashing (One-Way) | Preserve uniqueness without identity | Rainbow table attacks — salt required |
| 3 | Generalization | Approximate values (age 30–40, ZIP 941**) | Reduced model utility |
| 4 | Differential Privacy | Mathematical privacy proof with noise | Computational overhead |
| 5 | Synthetic Replacement | Statistical clones of real data | May miss rare edge cases |
| 6 | Tokenization | Reversible token with encrypted lookup | Lookup table must be protected |
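Three of the techniques in the table are small enough to sketch with the Python standard library. Everything below is illustrative and assumes hypothetical helper names. The hashing example uses a keyed HMAC-SHA256 (a secret key playing the role of the salt) rather than a plain salted hash. The generalization helpers use non-overlapping ten-year age buckets and ZIP prefix masking. The noisy count is a textbook Laplace mechanism for a counting query with sensitivity 1, and is not hardened against floating-point attacks.

```python
import hashlib
import hmac
import random


def keyed_hash(value: str, secret_key: bytes) -> str:
    """Technique 2: one-way pseudonym; the secret key defeats rainbow tables."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def generalize_age(age: int, width: int = 10) -> str:
    """Technique 3: replace an exact age with its bucket, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Technique 3: keep the first `keep` characters and mask the rest."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Technique 4: Laplace mechanism for a count (sensitivity 1, scale 1/epsilon)."""
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


key = b"replace-with-a-secret-from-a-key-vault"  # placeholder, never hard-code
pseudonym = keyed_hash("jane.doe@example.com", key)
print(pseudonym[:16], generalize_age(34), generalize_zip("94105"))
print(noisy_count(120, epsilon=0.5, rng=random.Random()))
```

The same input and key always produce the same pseudonym, which preserves joins across tables. Rotating or destroying the key breaks that linkage, a property that plain unsalted hashing cannot offer.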
SpanForge CLI integration
The SpanForge CLI provides a reference implementation of the scanning, reporting, and validation pipeline defined in this standard. Tool output is advisory — all critical findings require human review and stakeholder sign-off.
    # Step 1: Scan dataset for PII
    spanforge validate --dataset training_data.jsonl \
      --dataset-scan \
      --check-pii-field-names \
      --check-pii-values \
      --use-ml-model contextual_pii \
      --manual-review-sample 100 \
      --fail-on-violations \
      --format json > pii_report.json

    # Step 2: Check required fields
    spanforge validate --dataset training_data.jsonl \
      --required-fields employee_id,name,email \
      --format text

    # Step 3: Generate compliance report
    spanforge compliance report training_data.jsonl \
      --framework EU_AI_ACT \
      --format pdf > compliance_report.pdf
What the CLI detects automatically
- Obvious PII patterns (email, SSN, phone)
- Field-name heuristics
- 30+ detection categories
- High-risk records flagged for review
What requires human review
- Contextual / implicit PII judgement
- Legal basis validity assessment
- Domain expert manual review
- Regulatory compliance determination
Five-phase compliance lifecycle
Pre-Training
- Data source documentation
- Legal basis (GDPR Art. 6)
- EU AI Act high-risk classification
- Prohibited source verification
Data Preparation
- Automated PII scan (30+ patterns)
- Manual spot-check (50–100 records)
- PII remediation & redaction
- Redaction completeness verification
Data Lineage
- Provenance metadata schema
- Transformation log (every step)
- Consent & legal basis records
- Chain of custody documentation
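One way to picture the lineage items above is a provenance record that carries the legal basis and an append-only transformation log. The schema below is a hypothetical sketch: the class names, field names, and the use of SHA-256 content hashes to link each step's input and output are assumptions made for illustration, not the schema defined by the standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Content hash used to tie a log entry to the exact bytes it describes."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class TransformationStep:
    name: str
    performed_by: str
    input_sha256: str
    output_sha256: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


@dataclass
class ProvenanceRecord:
    dataset_id: str
    source: str
    legal_basis: str  # e.g. the GDPR Art. 6 basis relied upon
    steps: list = field(default_factory=list)

    def log_step(self, name, performed_by, input_bytes, output_bytes):
        """Append one transformation to the log and return the new entry."""
        step = TransformationStep(
            name=name,
            performed_by=performed_by,
            input_sha256=sha256_hex(input_bytes),
            output_sha256=sha256_hex(output_bytes),
        )
        self.steps.append(step)
        return step


raw = b'{"email": "jane.doe@example.com"}\n'
redacted = b'{"email": "[REDACTED]"}\n'

record = ProvenanceRecord(
    dataset_id="hr-tickets-v1",  # hypothetical identifiers
    source="internal ticketing export",
    legal_basis="GDPR Art. 6(1)(f) legitimate interests",
)
record.log_step("pii_redaction", "data-eng@example.com", raw, redacted)
serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```

Because each step stores the hash of its input and its output, an auditor can confirm that the output of one step is the input of the next, which gives a simple chain-of-custody check.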
Bias Analysis
- Demographic representation check
- Class imbalance assessment
- Fairness metrics (parity, odds, calibration)
- Disparate impact (80% rule)
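The disparate impact check in the list above is commonly computed as the ratio of the lowest group selection rate to the highest, with values below 0.8 flagged for review. The sketch below assumes that "selected" means a positive label in the training data and uses made-up groups and counts. The function names are hypothetical.

```python
def selection_rates(outcomes):
    """outcomes: iterable of (group, selected) pairs -> {group: selection rate}."""
    totals, selected = {}, {}
    for group, is_selected in outcomes:
        totals[group] = totals.get(group, 0) + 1
        selected[group] = selected.get(group, 0) + int(bool(is_selected))
    return {g: selected[g] / totals[g] for g in totals}


def disparate_impact_ratio(outcomes):
    """Lowest group selection rate divided by the highest."""
    rates = selection_rates(outcomes)
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any positive outcomes")
    return min(rates.values()) / highest


# Hypothetical labels: group A has 5 of 10 positives, group B has 3 of 10.
outcomes = (
    [("A", True)] * 5 + [("A", False)] * 5
    + [("B", True)] * 3 + [("B", False)] * 7
)

ratio = disparate_impact_ratio(outcomes)
flagged = ratio < 0.8
print(f"disparate impact ratio = {ratio:.2f}, flagged = {flagged}")
```

With rates of 0.5 and 0.3 the ratio is 0.6, so this dataset would be flagged. The ratio is a screening heuristic and would normally be read alongside the other fairness metrics listed above.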
Review & Sign-Off
- Data owner sign-off
- Compliance officer sign-off
- Security review
- Optional: third-party audit
Sign-off does not eliminate liability
Completing this standard's checklists documents that due diligence was performed. It does NOT guarantee regulatory compliance or the absence of PII or bias, and it does not protect signers from liability if issues emerge later. Signers remain responsible for the accuracy of their assessment, actual compliance with regulations, ongoing monitoring, and response to issues. Consult legal counsel on high-risk model classification under the EU AI Act.