Flink Intro: Row-Based Streaming Model | Generated by AI

2026.06

Question: Give an intro to Apache Flink and clarify whether its data model is row-based or column-based.

Answer:

Apache Flink is a distributed stream-processing engine. Its core abstraction is an unbounded stream of events, processed record by record with low latency. Batch is treated as a special case of streaming (a bounded stream), so both run on one runtime. This is the “everything is a stream” thesis that distinguishes Flink from Spark’s “everything is a batch (micro-batch)” origin.
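To make the "batch is a bounded stream" idea concrete, here is a minimal plain-Python sketch. It does not use Flink's API; `running_total` and `sensor_readings` are hypothetical names chosen for illustration. The same record-at-a-time operator is applied to a finite input and to an endless one.

```python
import itertools
from typing import Iterable, Iterator


def running_total(events: Iterable[int]) -> Iterator[int]:
    """Record-at-a-time operator: emits an updated total after every event."""
    total = 0
    for value in events:
        total += value
        yield total


def sensor_readings() -> Iterator[int]:
    """Hypothetical unbounded source: yields 1, 2, 3, ... forever."""
    return itertools.count(start=1)


# Bounded input ("batch"): the stream ends, so the last emitted value is final.
batch_result = list(running_total([1, 2, 3, 4]))

# Unbounded input ("streaming"): same operator, but only a prefix can be observed.
stream_prefix = list(itertools.islice(running_total(sensor_readings()), 4))

print(batch_result)
print(stream_prefix)
```

The operator code is identical in both cases. Only the source differs in whether it ever ends, which is the sense in which one runtime can serve both workloads.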

The pillars:

Row-based or Column-based?

Flink’s runtime is fundamentally row-based. The streaming model processes one record (one row) at a time as it flows through the operator DAG. This is the natural fit for streaming: you can’t wait to accumulate a column’s worth of values when events arrive continuously and you need millisecond latency.
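A rough way to picture one row flowing through an operator DAG is a chain of Python generators. This is a sketch under simplifying assumptions (single process, no parallelism, no checkpointing), and `Txn`, `drop_non_positive`, and `running_sum_per_account` are made-up names, not Flink classes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class Txn(NamedTuple):
    account: str
    amount: float


def drop_non_positive(rows: Iterable[Txn]) -> Iterator[Txn]:
    """Stateless operator: forwards a row only if its amount is positive."""
    for row in rows:
        if row.amount > 0:
            yield row


def running_sum_per_account(rows: Iterable[Txn]) -> Iterator[Tuple[str, float]]:
    """Stateful operator: keeps one running sum per key, updated row by row."""
    state: Dict[str, float] = defaultdict(float)
    for row in rows:
        state[row.account] += row.amount
        yield row.account, state[row.account]


events = [Txn("a", 10.0), Txn("b", 5.0), Txn("a", -3.0), Txn("a", 2.5)]

# Chaining generators means each event passes through every operator
# before the next event is pulled from the source.
pipeline = running_sum_per_account(drop_non_positive(iter(events)))
results = list(pipeline)
print(results)
```

Each output is produced as soon as its input row arrives, and the per-key state holds whatever the operator needs from earlier rows.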

Concretely:

The nuance you flagged earlier is correct: Flink’s file connectors can read from and write to columnar formats. When Flink reads a Parquet or ORC file, it does vectorized columnar reads for efficiency at the source, then exposes the data as row-oriented RowData for processing. So columnar appears at the I/O boundary, not in the streaming core.
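The columnar-to-row handoff at a source can be sketched in a few lines of plain Python. This illustrates the idea only, not Flink's Parquet or ORC reader; `read_batches` and `to_rows` are hypothetical helpers, and a dict of lists stands in for a column batch.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

ColumnBatch = Dict[str, List]


def read_batches() -> Iterator[ColumnBatch]:
    """Hypothetical columnar source: one dict of equal-length columns per batch."""
    yield {"id": [1, 2, 3], "price": [9.5, 12.0, 7.25]}
    yield {"id": [4], "price": [3.0]}


def to_rows(batches: Iterable[ColumnBatch], fields: List[str]) -> Iterator[Tuple]:
    """Pivot each column batch into per-record tuples for downstream operators."""
    for batch in batches:
        columns = [batch[name] for name in fields]
        # zip(*columns) walks the columns in lockstep, producing one row per index.
        for row in zip(*columns):
            yield row


rows = list(to_rows(read_batches(), ["id", "price"]))
print(rows)
```

Downstream operators see only rows. The batch structure is an implementation detail of the source.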

There’s an exception worth knowing: Flink has added batch operators with columnar/vectorized execution in some paths (and integrates with Apache Arrow for PyFlink UDFs, where Arrow’s columnar in-memory format avoids per-row serialization overhead crossing the JVM↔Python boundary). But the default streaming engine is row-at-a-time.
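The benefit of batching at a language boundary can be illustrated without Arrow. In the sketch below, `pickle` stands in for whatever serialization the boundary really uses, and `Boundary` and `add_fee` are hypothetical names. It only counts round trips and says nothing about real PyFlink performance.

```python
import pickle
from typing import Any, Callable


class Boundary:
    """Stand-in for a process boundary: each call pays one serialization round trip."""

    def __init__(self, udf: Callable[[Any], Any]) -> None:
        self.udf = udf
        self.crossings = 0

    def call(self, payload: Any) -> Any:
        self.crossings += 1
        received = pickle.loads(pickle.dumps(payload))  # simulate sending the input
        result = self.udf(received)
        return pickle.loads(pickle.dumps(result))       # simulate returning the output


def add_fee(amount: float) -> float:
    return amount + 1.0


amounts = [10.0, 20.0, 30.0, 40.0]

# Row at a time: one crossing per record.
per_row = Boundary(add_fee)
row_results = [per_row.call(a) for a in amounts]

# Column at a time: the whole column crosses once and the UDF loops inside.
per_batch = Boundary(lambda column: [add_fee(a) for a in column])
batch_results = per_batch.call(amounts)

print(per_row.crossings, per_batch.crossings)
```

Both paths compute the same values. The batched path simply amortizes the fixed per-call cost over many rows, which is the motivation for shipping columnar batches to Python UDFs.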

Why This Matters for Your Mental Model

| Engine | Native model | Optimized for |
| --- | --- | --- |
| Flink | Row, streaming | Low-latency, stateful event processing (OLTP-ish streams) |
| Spark | Micro-batch, increasingly columnar (Tungsten + Arrow) | Throughput, batch analytics |
| DuckDB / ClickHouse | Columnar, vectorized | OLAP scans/aggregations |

Row-based is the right default when you process whole events as they arrive and need full-record state. Columnar wins when you scan few columns over huge static datasets. Flink lives in the former world; it borrows columnar tricks only at the edges (file I/O, cross-language UDFs) where vectorization pays off without compromising the streaming model.
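The trade-off between the two layouts shows up even in a toy example. The sketch below flattens three made-up records into a row-major list and into per-column lists, then compares the two access patterns. The names and data are illustrative only.

```python
records = [(1, "a", 10.0), (2, "b", 5.0), (3, "a", 2.5)]
WIDTH = 3    # fields per record: id, account, amount
AMOUNT = 2   # position of the amount field

# Row layout: all fields of one record sit next to each other.
row_store = [field for record in records for field in record]

# Columnar layout: all values of one field sit next to each other.
column_store = [list(column) for column in zip(*records)]

# Whole-record access (the streaming case): one contiguous slice in row layout...
second_event = tuple(row_store[1 * WIDTH:(1 + 1) * WIDTH])
# ...but one lookup per column in columnar layout.
second_event_from_columns = tuple(column[1] for column in column_store)

# Single-column scan (the analytics case): contiguous in columnar layout...
total_columnar = sum(column_store[AMOUNT])
# ...but a strided walk that skips over unused fields in row layout.
total_row = sum(row_store[AMOUNT::WIDTH])

print(second_event, total_columnar)
```

Both layouts give the same answers. What differs is which access pattern is contiguous, and that is what drives the choice between them.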

If you’re touching Flink at the bank (common in financial event/transaction pipelines), the practical takeaway: think in rows for your processing logic (RowData in the Table/SQL runtime, your own event types in the DataStream API), and let connectors handle the columnar↔row conversion at sources/sinks.

