When Streaming Is Overkill (And When It Isn't)
Kafka and Flink are powerful, but most data problems don't need them. A practical framework for deciding when streaming infrastructure is actually justified.
- Data Engineering
- Kafka
- Architecture
- Batch
I built AthleteOS with Kafka and Apache Flink. I built the financial transaction pipeline with n8n polling Gmail every few minutes. Both process data soon after it arrives, but the architectures are very different, and knowing which one to reach for is most of the design decision.
The actual question
The question isn't "do I need real-time data?" It's "what is the cost of data being N minutes stale, and does that cost justify streaming infrastructure?"
For AthleteOS: the cost of pose data being 30 seconds stale during a workout is high — you're giving coaching feedback based on form from 30 seconds ago, which is useless. Kafka + Flink is justified.
For the financial pipeline: the cost of transaction data being 5 minutes stale is effectively zero, because the finance team reviews reports once a day. A scheduled poll, whether n8n on a timer or a cron-triggered Airflow DAG, produces identical business value at a fraction of the operational complexity.
The complexity cost of streaming
Kafka + Flink brings a list of concerns you now have to own: at-least-once vs exactly-once delivery semantics, consumer group offset management, schema evolution across producers and consumers, watermarking for out-of-order events, backpressure handling, and separate monitoring for the streaming stack itself. This is real operational overhead. For a single engineer, each of those is a potential 3am incident.
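To make one item on that list concrete, here is a minimal pure-Python sketch of the decision a watermark has to make. It is not Flink's API: the function name, the 10-second window, and the 5-second allowed lateness are all assumptions chosen for illustration. Events arrive out of order, and the watermark determines which late events still count and which are dropped.

```python
from collections import defaultdict

WINDOW_SECONDS = 10    # tumbling window size (assumed for illustration)
ALLOWED_LATENESS = 5   # watermark lags the newest event time by this much (assumed)


def window_counts(events):
    """Count events per tumbling window using a simple watermark.

    `events` is an iterable of (event_time_seconds, payload) in ARRIVAL
    order, which may differ from event-time order.
    Returns (closed_windows, open_windows, dropped_events).
    """
    open_windows = defaultdict(int)   # window start -> running count
    closed = {}                       # window start -> final count
    dropped = []                      # events that arrived after their window closed
    max_event_time = float("-inf")

    for event_time, payload in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - ALLOWED_LATENESS

        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            # The watermark has already passed the end of this window.
            dropped.append((event_time, payload))
        else:
            open_windows[window_start] += 1

        # Finalize every window whose end is at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW_SECONDS <= watermark:
                closed[start] = open_windows.pop(start)

    return closed, dict(open_windows), dropped


if __name__ == "__main__":
    arrivals = [(1, "a"), (3, "b"), (12, "c"), (4, "d"), (16, "e"), (2, "f")]
    print(window_counts(arrivals))
```

With these settings, the event at t=4 that arrives after t=12 is still counted, but the event at t=2 that arrives after t=16 is dropped because its window has already closed. A streaming engine handles the mechanics, but you still have to choose those numbers and own the consequences.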
The micro-batch middle ground
Most "real-time" requirements are actually "low-latency batch." A dashboard that updates every 5 minutes isn't streaming — it's micro-batch. Airflow with a 5-minute schedule covers this without Kafka.
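The core of a micro-batch job is small enough to sketch in plain Python. This is an illustration under assumptions, not an Airflow DAG: `run_micro_batch`, the `created_at` field, and the in-memory `records` list are hypothetical stand-ins for whatever the scheduler invokes and whatever store it reads from. It also assumes a record becomes visible no later than its `created_at` timestamp.

```python
from datetime import datetime, timedelta


def run_micro_batch(records, last_checkpoint, now):
    """Return (batch, new_checkpoint) for one scheduled run.

    Selecting the half-open interval (last_checkpoint, now] means that
    consecutive runs neither skip a record nor process it twice.
    """
    batch = [r for r in records if last_checkpoint < r["created_at"] <= now]
    return batch, now


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    records = [
        {"id": 1, "created_at": start + timedelta(minutes=2)},
        {"id": 2, "created_at": start + timedelta(minutes=7)},
    ]
    checkpoint = start
    # Simulate two runs of a 5-minute schedule.
    for run in (1, 2):
        now = start + timedelta(minutes=5 * run)
        batch, checkpoint = run_micro_batch(records, checkpoint, now)
        print(now.time(), [r["id"] for r in batch])
```

The only state carried between runs is the checkpoint, which is why this pattern is so much cheaper to operate than a consumer group with offsets.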
The threshold I use: if the business needs data to be actionable within seconds and the source is a continuous, high-volume event stream (thousands of events per minute), consider Kafka. If it needs data within minutes and the event rate is moderate, micro-batch is almost always the right call.
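That rule of thumb can be written down as a small function. This is a sketch, and the exact cutoffs are assumptions: I read "within seconds" as under 60 seconds and "thousands of events per minute" as at least 1,000. The `Workload` type and `recommend` function are hypothetical names, and anything that matches only one of the two conditions is deliberately returned as a judgment call rather than forced into a bucket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    max_staleness_seconds: float  # how stale data can get before it costs something
    events_per_minute: float      # sustained event rate


def recommend(workload, seconds_cutoff=60, high_volume_epm=1000):
    """Apply the two-condition rule of thumb; cutoffs are assumed defaults."""
    needs_seconds = workload.max_staleness_seconds < seconds_cutoff
    high_volume = workload.events_per_minute >= high_volume_epm

    if needs_seconds and high_volume:
        return "consider streaming"
    if not needs_seconds and not high_volume:
        return "micro-batch"
    # Only one condition holds: the rule does not decide this case.
    return "judgment call"


if __name__ == "__main__":
    # Hypothetical workloads, not measurements from a real system.
    print(recommend(Workload(max_staleness_seconds=2, events_per_minute=5000)))
    print(recommend(Workload(max_staleness_seconds=300, events_per_minute=50)))
```

Writing the inputs down this way forces the useful conversation: someone has to state the staleness tolerance as a number before anyone provisions a cluster.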
When I'd use streaming
- Real-time recommendations where stale data degrades user experience directly
- Fraud detection where the detection window closes in seconds
- IoT sensor fusion joining streams from multiple devices in real time
- Event sourcing architectures where Kafka is the system of record
The honest answer
Most data engineering work is batch. Most "real-time" requirements are low-latency batch. Build the simplest thing that meets the latency requirement. Optimize only when you can measure the cost of not doing so.