dbt Dimensional Modeling for Streaming Pipelines
Dimensional modeling principles don't change just because your data comes from Kafka. How I structured the dbt layer in AthleteOS to stay clean as the schema evolved.
- dbt
- Data Modeling
- Kafka
- Snowflake
When people hear "streaming pipeline," they often assume that means real-time dashboards and no data warehouse. The reality is that most streaming use cases still need a warehouse layer for historical analysis, model training, and reporting. The data just gets there faster.
In AthleteOS, Apache Flink consumes from two Kafka topics, joins streams in real time, and writes to Snowflake every 30 seconds. dbt then runs on a schedule to transform raw Snowflake tables into clean dimensional models.
The raw layer
Flink writes to two raw Snowflake tables: raw.pose_events (one row per MediaPipe BlazePose frame) and raw.biometric_readings (one row per WHOOP API poll). These tables are append-only — Flink never updates or deletes.
Staging models
The staging layer handles type casting, null handling, and renaming. One staging model per raw source. Nothing clever happens in staging — if you're doing business logic in staging, it's in the wrong layer.
Intermediate models
The intermediate layer is where the streaming-specific complexity lives. Because Flink writes every 30 seconds, a single training session produces ~180 pose event records per hour. The intermediate layer aggregates these into session-level metrics.
The key insight: the intermediate layer converts streaming cardinality (one row per 30s) into analytical cardinality (one row per session). Everything downstream works with sessions, not events.
Mart models
The mart layer joins intermediates into the final fact and dimension tables consumed by the application and analytics tools. The left join between pose and biometric data matters — some sessions have pose data but no biometric data (WHOOP API was offline). The model should still produce a row for those sessions.
Incremental models for high-cardinality sources
The raw tables grow continuously. Full refreshes become expensive fast. I use incremental models for everything in the staging and intermediate layers, filtering on _ingested_at. The on_schema_change='append_new_columns' setting prevents dbt from failing when Flink adds new fields to the Kafka topic.
The principle that held
Kimball's dimensional modeling principles are 30 years old and they hold for streaming data just as well as batch. The only adjustment is the intermediate aggregation step that converts streaming cardinality into analytical cardinality. Everything else transfers directly.