Data Factory: The Simulation Engine
Data Factory is a hyper-realistic synthetic data engine. It doesn't just generate random strings; it simulates business behaviors to stress-test modern data platforms.
The Problem
Building data platforms is hard. Testing them is harder.
- Production data is sensitive (PII).
- Staging data is stale.
- Edge cases (e.g. thermal limit failure) rarely happen naturally.
Data Factory solves this by allowing developers to script reality.
Capabilities
1. Behavior Packs
Data Factory uses a plugin system called "Packs" to define domain logic:
- Retail Pack: Simulates POS transactions, inventory drift, and customer churn.
- IoT Pack: Simulates sensor telemetry, heartbeat loss, and threshold alerts.
- Healthcare Pack: Generates HL7/FHIR-like claim streams with tunable error rates.
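As a rough illustration of how a pack might plug domain logic into the engine, here is a minimal sketch. The registry (`PACKS`), the `register_pack` decorator, `run_pack`, and every event field name are hypothetical and chosen for this example only; they are not Data Factory's actual plugin API.

```python
import random
from typing import Callable, Dict, Iterator, List

# Hypothetical registry: pack name -> function that builds an event stream.
PackFn = Callable[[random.Random], Iterator[dict]]
PACKS: Dict[str, PackFn] = {}


def register_pack(name: str) -> Callable[[PackFn], PackFn]:
    """Decorator that registers a pack under a name (assumed design)."""
    def decorator(fn: PackFn) -> PackFn:
        PACKS[name] = fn
        return fn
    return decorator


@register_pack("retail")
def retail_pack(rng: random.Random) -> Iterator[dict]:
    """Yield POS-style sales for one SKU, restocking when stock runs short."""
    stock = 100
    txn_id = 0
    while True:
        qty = rng.randint(1, 3)
        if stock < qty:
            stock += 50
            yield {"type": "restock", "sku": "SKU-1", "stock": stock}
            continue
        stock -= qty
        txn_id += 1
        yield {"type": "sale", "txn_id": txn_id, "sku": "SKU-1",
               "qty": qty, "stock": stock}


@register_pack("iot")
def iot_pack(rng: random.Random) -> Iterator[dict]:
    """Yield drifting temperature readings plus an alert past a threshold."""
    temp = 60.0
    while True:
        temp += rng.uniform(-2.0, 3.0)
        yield {"type": "reading", "sensor": "s-1", "temp_c": round(temp, 2)}
        if temp > 80.0:
            yield {"type": "alert", "sensor": "s-1", "reason": "thermal_limit"}
            temp = 60.0  # simulate a cooldown after the alert


def run_pack(name: str, n: int, seed: int = 0) -> List[dict]:
    """Collect the first n events from a pack, reproducibly for a given seed."""
    stream = PACKS[name](random.Random(seed))
    return [next(stream) for _ in range(n)]
```

Calling `run_pack("retail", 5)` returns five events from the retail stream. Passing a seeded `random.Random` into each pack keeps a run reproducible, which is useful when a downstream failure has to be replayed.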
2. The "Live Ledger" Orchestrator
At its core is a serverless state machine (AWS Step Functions) that coordinates millions of events across three modes:
- Batch Mode: Dumps terabytes of history to S3 (Parquet/JSON).
- Stream Mode: Pushes thousands of events/sec to Kinesis/Kafka.
- Chaos Mode: Intentionally corrupts schemas or drops data to test downstream dead-letter queues (DLQs).
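To make Chaos Mode concrete, the sketch below wraps any event stream and randomly drops or corrupts records. It is an assumption-laden illustration, not the orchestrator's real code: the function names `chaos` and `corrupt`, the `drop_rate` and `corrupt_rate` parameters, and the two corruption strategies are all invented here, and events are assumed to be flat dicts with string keys.

```python
import random
from typing import Iterable, Iterator, Optional


def corrupt(event: dict, rng: random.Random) -> dict:
    """Return a schema-breaking copy of an event (hypothetical strategies)."""
    broken = dict(event)  # shallow copy so the clean event is left untouched
    if not broken:
        return {"_unexpected": None}
    key = rng.choice(sorted(broken))
    if rng.random() < 0.5:
        del broken[key]  # strategy 1: a required field goes missing
    else:
        broken[key] = {"unexpected": broken[key]}  # strategy 2: type changes
    return broken


def chaos(events: Iterable[dict], drop_rate: float = 0.05,
          corrupt_rate: float = 0.05,
          seed: Optional[int] = None) -> Iterator[dict]:
    """Pass events through, dropping or corrupting a random fraction."""
    rng = random.Random(seed)
    for event in events:
        roll = rng.random()
        if roll < drop_rate:
            continue  # simulate data loss
        if roll < drop_rate + corrupt_rate:
            yield corrupt(event, rng)
        else:
            yield event
```

For example, `list(chaos(events, drop_rate=0.1, corrupt_rate=0.1, seed=42))` produces a degraded copy of `events` that should send some records to a consumer's DLQ. Fixing `seed` makes a given failure pattern repeatable.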
Architecture
The system is defined entirely as Infrastructure-as-Code (Terraform) and deployed per tenant.
flowchart TD
    UI[Console UI] --> API[API Gateway]
    API --> Lambda[Orchestrator Lambda]
    subgraph Engine [Step Function State Machine]
        Gen[Generator Kernels]
        Chaos[Chaos Monkey]
    end
    Lambda --> Engine
    Engine -->|Batch| S3[(S3 Buckets)]
    Engine -->|Stream| Kinesis[Kinesis Data Streams]
    S3 --> Telemetry[Telemetry API]
    Kinesis --> Telemetry
    Telemetry --> UI
Key Components
- Generator Kernels: Optimized Python modules that inflate "Seed" templates into millions of variant records.
- Lifecycle Manager: Handles the spin-up and tear-down of ephemeral test resources.
- Telemetry API: Provides real-time "Events Per Second" metrics back to the console.
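The way a Generator Kernel inflates a seed template can be sketched roughly as follows. This assumes a template is a plain dict whose values are either constants or callables taking a record index and a random generator; that convention, the `inflate` function, and the sample field names are hypothetical, not the kernels' real format.

```python
import random
from typing import Any, Dict, Iterator

# Assumed template shape: field name -> constant, or callable(index, rng).
SeedTemplate = Dict[str, Any]


def inflate(template: SeedTemplate, count: int, seed: int = 0) -> Iterator[dict]:
    """Lazily expand one seed template into `count` variant records."""
    rng = random.Random(seed)
    for i in range(count):
        yield {
            field: (value(i, rng) if callable(value) else value)
            for field, value in template.items()
        }


# Example template (all fields are illustrative).
template: SeedTemplate = {
    "store": "store-001",                      # constant across variants
    "txn_id": lambda i, rng: i + 1,            # sequential identifier
    "amount": lambda i, rng: round(rng.uniform(1.0, 500.0), 2),
    "channel": lambda i, rng: rng.choice(["pos", "web"]),
}

records = list(inflate(template, 1000, seed=7))
```

Because `inflate` is a generator, records are produced one at a time, so the same pattern could feed a batch writer or a stream producer without holding millions of rows in memory.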
Why it Matters
Data Factory lets FMAI (and other platforms) demonstrate their value under realistic conditions. By generating "Bad Data" on demand, it shows whether their governance layers actually catch the problems they are meant to catch.