TDCS-1.0 · Open Standard · RFC-pending

Training Data Compliance Standard

The SpanForge framework for compliant AI training data. It defines what constitutes compliant training data under the EU AI Act, GDPR, the DPDP Act, and CCPA, and includes a PII detection framework covering 30+ data types, redaction standards, data lineage requirements, and audit-ready sign-off checklists.

Version: 1.0
Status: Open Standard (RFC-pending)
Effective Date: May 2, 2026
Last Updated: May 2, 2026
Target Audience: Data engineers, ML practitioners, compliance officers, auditors

Executive Summary

What this standard provides

AI models trained on personal data can perpetuate biases, violate privacy regulations, and expose organisations to regulatory liability. High-risk decisions require transparent, auditable training data — yet no broadly adopted standard existed for what "compliant training data" means in practice.

This standard fills that gap. It is an open specification for data engineers, ML practitioners, compliance officers, and auditors — designed to be incrementally adoptable and regulator-facing.

  • 30+ PII data types in the detection framework
  • 6 redaction and anonymization techniques
  • 4 regulatory frameworks covered
  • 5 compliance checklist phases with sign-off

Regulatory scope

Regulations covered

This standard applies to datasets used to train or fine-tune AI models that process personal data, are classified as high-risk under the EU AI Act, or are deployed in regulated jurisdictions.

| Regulation | Article / Scope | Compliance area |
| --- | --- | --- |
| EU AI Act | Article 10 | High-Risk Systems: training data quality, governance, bias checks |
| GDPR | Art. 5, 6, 13, 30 | Lawfulness, consent, records of processing activities |
| DPDP Act | India, 2023 | Digital Personal Data Protection: purpose limitation, consent |
| CCPA | California | Consumer rights, opt-out, data minimization for ML training |

Document structure

Eight sections

The standard is structured in eight sections covering the full compliance lifecycle — from data sourcing through audit sign-off.

§1 Framework Overview

Scope of the standard — what training data is covered, what is excluded, and why compliance matters for regulators and practitioners.

§2 Training Data Compliance Checklist

Five-phase checklist: pre-training data sourcing, data preparation, lineage documentation, bias analysis, and compliance review with sign-off.

§3 PII Detection Framework

30+ PII categories across identifiers, biometrics, location, financial, health, communication, work, behavioral, government, and demographic data — with detection patterns.

§4 Data Redaction & Anonymization

Six redaction techniques (removal, hashing, generalization, differential privacy, synthetic replacement, tokenization) with verification requirements.

§5 Sensitive Data Categories

GDPR Article 9 special categories, quasi-identifier risks, and data minimization principles for compliant model training.

§6 Data Lineage & Provenance

Data origin metadata schema, transformation log format, and reproducibility requirements for audit-ready datasets.

§7 Compliance Checklist & Sign-Off

Self-assessment form, section-by-section sign-off templates, and accountability guidance for compliance officers, legal, security, and model teams.

§8 Tools & Implementation

SpanForge CLI commands for automated PII scanning, compliance reporting, and integration into CI/CD pipelines.

§3 — PII Detection Framework

30+ PII categories across 10 groups

The framework covers every major PII category in regulated jurisdictions. Detection combines regex patterns, ML model-based contextual analysis, and mandatory manual spot-checks — regex alone is insufficient.

| Group | Category | Examples |
| --- | --- | --- |
| A | Identifiers | Full name, email, phone, SSN, passport, driver license, national ID, tax ID |
| B | Biometric Data | Facial recognition, fingerprints, iris scan, voice print, DNA, palm print |
| C | Location Data | GPS coordinates, IP address, MAC address, WiFi SSID, geofence data |
| D | Financial Data | Credit card, bank account, routing number, IBAN, cryptocurrency, PayPal |
| E | Health Data | Medical record number, medication, diagnosis, lab results, fitness biometrics |
| F | Communication Data | Email messages, chat, phone calls, SMS, Slack/Teams content |
| G | Work / Education | Employee ID, job title, employer, school name, student ID, grades |
| H | Behavioral Data | Shopping history, browsing history, social media activity, location history |
| I | Government / Legal | Visa number, refugee status, criminal record, court records |
| J | Demographic Data | Exact age, ZIP code, race/ethnicity, gender, religion, sexual orientation, union membership, political affiliation |
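
As a minimal sketch of the regex layer only, the Python below scans a record for two Group A identifiers. The pattern strings, the `PATTERNS` table, and the `scan_record` helper are illustrative assumptions, not the standard's detection patterns or the SpanForge implementation. In line with the note that regex alone is insufficient, hits from a scan like this would still feed into contextual analysis and manual spot-checks.

```python
import re

# Illustrative patterns only (assumed, not taken from the standard).
# Real detection needs broader, locale-aware rules plus contextual review.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_record(record: dict) -> list[tuple[str, str]]:
    """Return (field, pii_type) pairs for each string value matching a pattern."""
    hits = []
    for field, value in record.items():
        if not isinstance(value, str):
            continue  # this sketch only inspects string values
        for pii_type, pattern in PATTERNS.items():
            if pattern.search(value):
                hits.append((field, pii_type))
    return hits


record = {"note": "Contact jane.doe@example.com, SSN 123-45-6789", "age": 41}
print(scan_record(record))  # [('note', 'email'), ('note', 'us_ssn')]
```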

⚠️ Context matters: Behavioral data that reveals protected characteristics (religion, sexual orientation, health, union membership, politics) is high-risk even after name removal. Quasi-identifier combinations (age + ZIP + job title) can re-identify individuals, so k-anonymity with k ≥ 5 is required: every combination of quasi-identifier values must be shared by at least five records.
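
The k-anonymity requirement can be checked by grouping records on their quasi-identifier values and finding the smallest group. The sketch below assumes records are plain dictionaries. The field names (`age_band`, `zip3`, `job`) and the helper names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def _group_sizes(records: list[dict], quasi_identifiers: list[str]) -> Counter:
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r.get(q) for q in quasi_identifiers) for r in records)


def k_anonymity(records: list[dict], quasi_identifiers: list[str]) -> int:
    """Return the size of the smallest quasi-identifier group (0 if no records)."""
    if not records:
        return 0
    return min(_group_sizes(records, quasi_identifiers).values())


def violating_groups(records: list[dict], quasi_identifiers: list[str], k: int = 5) -> dict:
    """Return the combinations shared by fewer than k records."""
    sizes = _group_sizes(records, quasi_identifiers)
    return {combo: n for combo, n in sizes.items() if n < k}


# Hypothetical field names, for illustration only.
QI = ["age_band", "zip3", "job"]
records = [{"age_band": "30-39", "zip3": "941", "job": "engineer"}] * 5
records = records + [{"age_band": "50-59", "zip3": "100", "job": "nurse"}]

print(k_anonymity(records, QI))       # 1, so the dataset fails k >= 5
print(violating_groups(records, QI))  # {('50-59', '100', 'nurse'): 1}
```

Groups that fall below k are candidates for further generalization or removal before the check is rerun.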

§4 — Data Redaction & Anonymization

Six redaction techniques

Each technique carries different utility/privacy trade-offs. The standard requires redaction completeness verification in four mandatory steps — automated re-scan, manual validation, quasi-identifier check, and domain expert review.

| # | Technique | Best use case | Key risk |
| --- | --- | --- | --- |
| 1 | Complete Removal | Data not needed for training | May lose context |
| 2 | Hashing (One-Way) | Preserve uniqueness without identity | Rainbow table attacks; salt required |
| 3 | Generalization | Approximate values (age 30–40, ZIP 941**) | Reduced model utility |
| 4 | Differential Privacy | Mathematical privacy proof with noise | Computational overhead |
| 5 | Synthetic Replacement | Statistical clones of real data | May miss rare edge cases |
| 6 | Tokenization | Reversible token with encrypted lookup | Lookup table must be protected |

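
Techniques 2 and 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the standard's reference implementation. For hashing, a keyed hash (HMAC-SHA256 with a secret key) stands in here for the "salt required" note, which is a substitution on my part. The function names and the bucket widths are hypothetical. Note that the age buckets below are non-overlapping (30-39), which differs slightly from the "30–40" shorthand in the table.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """One-way keyed hash of a normalized identifier (HMAC-SHA256, hex digest).

    Without the secret key, precomputed (rainbow table) lookups do not apply.
    """
    normalized = value.strip().lower()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def generalize_age(age: int, width: int = 10) -> str:
    """Replace an exact age with a non-overlapping band, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the first `keep` characters and mask the rest, e.g. '94107' -> '941**'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)


key = b"example-key-load-from-a-secret-store"  # placeholder; never hard-code real keys
print(pseudonymize("Jane.Doe@example.com", key)[:16])
print(generalize_age(34))       # 30-39
print(generalize_zip("94107"))  # 941**
```

The same input always maps to the same digest under one key, so joins and de-duplication still work, while the key itself must be stored and rotated like any other secret.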
§8 — Tools & Implementation

SpanForge CLI integration

The SpanForge CLI provides a reference implementation of the scanning, reporting, and validation pipeline defined in this standard. Tool output is advisory — all critical findings require human review and stakeholder sign-off.

```bash
# Step 1: Scan dataset for PII
spanforge validate --dataset training_data.jsonl \
  --dataset-scan \
  --check-pii-field-names \
  --check-pii-values \
  --use-ml-model contextual_pii \
  --manual-review-sample 100 \
  --fail-on-violations \
  --format json > pii_report.json

# Step 2: Check required fields
spanforge validate --dataset training_data.jsonl \
  --required-fields employee_id,name,email \
  --format text

# Step 3: Generate compliance report
spanforge compliance report training_data.jsonl \
  --framework EU_AI_ACT \
  --format pdf > compliance_report.pdf
```
✅ Detects
  • Obvious PII patterns (email, SSN, phone)
  • Field-name heuristics
  • 30+ detection categories
  • High-risk records flagged for review
❌ Cannot replace
  • Contextual / implicit PII judgement
  • Legal basis validity assessment
  • Domain expert manual review
  • Regulatory compliance determination
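
For CI/CD integration, one possible approach is a small Python wrapper that runs the Step 1 scan and gates the pipeline on its result. The sketch below reuses only flags shown in the commands above. It assumes that `--fail-on-violations` causes a non-zero exit code when violations are found, which this page does not confirm. The `build_scan_command` and `scan_passed` names are hypothetical.

```python
import subprocess


def build_scan_command(dataset_path: str) -> list[str]:
    """Assemble the Step 1 scan command from flags shown in this section."""
    return [
        "spanforge", "validate",
        "--dataset", dataset_path,
        "--dataset-scan",
        "--check-pii-field-names",
        "--check-pii-values",
        "--fail-on-violations",
        "--format", "json",
    ]


def scan_passed(dataset_path: str, runner=subprocess.run) -> bool:
    """Run the scan and report whether it exited cleanly.

    ASSUMPTION: exit code 0 means no violations were found. Verify this
    against the CLI's own documentation before relying on it.
    """
    result = runner(build_scan_command(dataset_path), capture_output=True, text=True)
    return result.returncode == 0
```

A pipeline step could call `scan_passed("training_data.jsonl")` and fail the build when it returns `False`. Because tool output is advisory, a passing gate would still leave the human review and sign-off steps in place.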
§2 — Compliance checklist

Five-phase compliance lifecycle

Phase 2.1: Pre-Training

  • Data source documentation
  • Legal basis (GDPR Art. 6)
  • EU AI Act high-risk classification
  • Prohibited source verification

Phase 2.2: Data Preparation

  • Automated PII scan (30+ patterns)
  • Manual spot-check (50–100 records)
  • PII remediation & redaction
  • Redaction completeness verification

Phase 2.3: Data Lineage

  • Provenance metadata schema
  • Transformation log (every step)
  • Consent & legal basis records
  • Chain of custody documentation

Phase 2.4: Bias Analysis

  • Demographic representation check
  • Class imbalance assessment
  • Fairness metrics (demographic parity, equalized odds, calibration)
  • Disparate impact (80% rule)

Phase 2.5: Review & Sign-Off

  • Data owner sign-off
  • Compliance officer sign-off
  • Security review
  • Optional: third-party audit
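
The disparate impact check in Phase 2.4 is commonly computed as the ratio of the lowest group's favorable-outcome rate to the highest group's rate, with a ratio below 0.8 flagged for review. The sketch below assumes per-group counts of favorable labels and totals are already available. The group names, counts, and function names are made up for illustration.

```python
def selection_rates(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Map each group to favorable / total, given (favorable, total) counts."""
    return {group: fav / total for group, (fav, total) in outcomes.items()}


def disparate_impact_ratio(outcomes: dict[str, tuple[int, int]]) -> float:
    """Lowest group selection rate divided by the highest group selection rate."""
    rates = selection_rates(outcomes)
    highest = max(rates.values())
    if highest == 0:
        raise ValueError("no group has any favorable outcomes")
    return min(rates.values()) / highest


# Hypothetical counts: (favorable labels, total records) per group.
outcomes = {"group_a": (50, 100), "group_b": (30, 100)}
ratio = disparate_impact_ratio(outcomes)
print(round(ratio, 2), ratio >= 0.8)  # 0.6 False
```

A ratio under 0.8 does not by itself establish unlawful bias. It marks the dataset for the closer review and sign-off described in Phase 2.5.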
Accountability

Sign-off does not eliminate liability

Completing this standard's checklists documents that due diligence was performed. It does NOT guarantee regulatory compliance or the absence of PII or bias, and it does not protect signers from liability if issues emerge later. Signers remain responsible for the accuracy of their assessment, actual compliance with regulations, ongoing monitoring, and response to issues. Consult legal counsel for high-risk model classification under the EU AI Act.
