Parquet vs Arrow: Different Layers

Question: Between Parquet and Arrow, which has better performance?

Answer:

They’re not really competitors — they solve different problems at different layers of the data stack. But let’s compare them precisely.

What They Are

	Parquet	Arrow
Type	Columnar storage format (on-disk)	Columnar memory format (in-memory)
Designed for	Long-term storage, compression, query pushdown	Zero-copy IPC, in-process analytics, SIMD
Row group / batch	Row groups (MB-scale chunks)	RecordBatches (configurable)
Encoding	RLE, dict, delta, bit-packing + Snappy/Zstd/LZ4	Raw typed buffers (no compression by default)
Spec owner	Apache (originated at Twitter/Cloudera)	Apache (co-created by Wes McKinney, then at Cloudera, with other contributors)

Performance Breakdown

Disk I/O → Parquet wins

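Parquet's encoding and compression typically produce files several times smaller than uncompressed Arrow IPC files (roughly 3–10x in the figures below, depending heavily on the data). For cold reads from S3/NFS, Parquet usually wins by a wide margin because of column pruning and predicate pushdown: the reader fetches only the requested columns and uses row-group statistics to skip chunks that cannot match a filter, so less data is read at the file-format level.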

# Parquet reads only the columns you ask for — at the C++ reader level
import pyarrow.parquet as pq
table = pq.read_table("data.parquet", columns=["user_id", "event"])

In-memory compute → Arrow wins (not even close)

Arrow buffers are contiguous, typed, and aligned for SIMD and cache-friendly access, and they can be shared zero-copy between processes and languages. No deserialization step is needed.

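import pyarrow as pa
import pyarrow.compute as pc

# Small in-memory table for illustration; any Arrow table with a numeric column works
table = pa.table({"revenue": [10.0, 20.5, 30.0]})

# Vectorized C++ kernel, no Python loop overhead
result = pc.sum(table.column("revenue"))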

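Parquet data must be decoded into an in-memory columnar layout before any compute can run. Polars and pandas (via PyArrow) decode it into Arrow; DuckDB decodes it into its own vector format, which it can exchange with Arrow cheaply.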

Serialization / IPC → Arrow wins

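The Arrow IPC file format (also known as Feather v2) stores buffers in the same layout they have in memory, so an uncompressed file can be memory-mapped and read with essentially no copying. Rough, hardware-dependent figures for writing and reading back a 1GB dataset: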

Parquet write: ~2–5s (compression cost)
Arrow IPC write: ~0.1–0.3s (memcpy speed)

Parquet read: ~0.5–2s (decompress + decode)
Arrow IPC read: ~0.05–0.1s (mmap, near zero)

Query engines use both together

The real pattern is Parquet on disk → Arrow in memory:

S3/disk         ──[Parquet]──►  DuckDB/Polars  ──[Arrow]──►  result
(compressed,                    (decodes to
 column pruned,                  Arrow buffers,
 predicate pushed)               runs SIMD ops)

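Polars and DataFusion are built directly on the Arrow memory model. DuckDB, Spark, and BigQuery use their own internal columnar formats, but they read Parquet and can exchange data as Arrow, so the overall pattern is the same.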

Concrete numbers (rough, hardware-dependent)

Operation	Parquet	Arrow IPC
Read 1GB from disk	~0.5–2s	~0.05–0.1s (mmap)
Write 1GB	~2–5s	~0.1–0.3s
File size (1GB raw)	~100–300MB	~900MB–1GB
Column scan (in-memory)	N/A (must be decoded into memory first)	~1–5 GB/s SIMD

When to use which

Use Parquet when:

Storing datasets long-term (S3, HDFS, local)
Dataset is large and you query subsets of columns
You need interoperability across systems (Spark, BigQuery, Athena, DuckDB)
Cold storage / archival

Use Arrow (IPC/Feather) when:

Passing large datasets between processes (e.g., Python → Rust CLI agent)
Caching intermediate results in a pipeline (avoid re-decoding)
Building a data service where latency matters
Shared memory between processes (memory-mapped Arrow IPC files; the Plasma object store has been removed from recent Arrow releases)

Practical example — your CLI agent workflow

If your ww/zz agents process large datasets:

import pyarrow as pa
import pyarrow.ipc as ipc
import pyarrow.parquet as pq

# Load from cold storage (Parquet, compressed)
table = pq.read_table("events.parquet", columns=["ts", "event", "user_id"])

# Cache hot working set as Arrow IPC for fast re-reads
with ipc.new_file("events_cache.arrow", table.schema) as writer:
    writer.write_table(table)

# Next run: memory-map the cache so the reload avoids copying and decoding
with pa.memory_map("events_cache.arrow", "r") as source:
    table = ipc.open_file(source).read_all()

TL;DR

Parquet = best for storage efficiency and disk I/O (compression, column pruning, ecosystem compatibility)
Arrow = best for compute throughput and IPC (SIMD, zero-copy, no deserialization overhead)
In practice: Parquet stores it, Arrow processes it — they’re designed to work together, not compete.
