Efficient Large Excel Handling in Python | Generated by AI

Home 2026.05

Question: How can Python handle large Excel files using libraries like polars, pyexcel, openpyxl, etc.?

Answer:

The core problem with large Excel files is memory and speed. openpyxl loads everything into RAM by default — a 500k-row xlsx will OOM a 16GB machine. Here’s the landscape ranked by performance:


Library Comparison

Library Best For Memory Speed
polars Large data analysis Low (lazy) Fastest
pandas chunked Medium files, familiar API Medium Medium
openpyxl read_only Read-only streaming Low Slow (pure Python)
calamine / fastexcel Fast read into arrow/polars Very Low Fastest reads
xlrd Old .xls Low Fast
pyexcel Multi-format abstraction High Slow

pyexcel is a high-level wrapper — avoid it for large files, it has no streaming.


fastexcel is a Rust-backed reader that outputs Apache Arrow natively — feeding directly into polars with near-zero copy:

pip install fastexcel polars
import fastexcel
import polars as pl

# Read entire sheet as polars LazyFrame
reader = fastexcel.read_excel("large.xlsx")
print(reader.sheet_names)

df = reader.load_sheet(0).to_polars()  # Arrow-native, very fast
print(df.shape, df.dtypes)

# Filter before materializing (push-down)
result = (
    df.lazy()
    .filter(pl.col("amount") > 1000)
    .select(["id", "amount", "date"])
    .collect()
)

For a 1M-row file this is typically 5-10x faster than pandas and uses ~3x less RAM.


2. openpyxl Streaming (read_only mode)

When you must use openpyxl (e.g. need formula-level access), stream rows:

from openpyxl import load_workbook

wb = load_workbook("large.xlsx", read_only=True, data_only=True)
ws = wb.active

for i, row in enumerate(ws.iter_rows(values_only=True)):
    # Process row tuple — never loads full sheet
    if i % 100_000 == 0:
        print(f"Row {i}: {row[:3]}")

wb.close()  # MUST close explicitly in read_only mode

Gotcha: read_only=True + data_only=True returns cached formula values, not recalculated ones. If the file was never opened in Excel, formula cells return None.


3. pandas chunked reading

pandas doesn’t natively chunk xlsx (unlike CSV), but you can simulate it:

import pandas as pd

# Read in chunks via skiprows/nrows
CHUNK = 50_000
df_iter = pd.read_excel(
    "large.xlsx",
    engine="calamine",   # faster than openpyxl engine
    chunksize=CHUNK      # only works for CSV; for xlsx use skiprows
)

# Manual chunking for xlsx:
def read_excel_chunks(path, chunksize=50_000):
    nrows_total = pd.read_excel(path, usecols=[0], engine="calamine").shape[0]
    for start in range(0, nrows_total, chunksize):
        yield pd.read_excel(
            path,
            skiprows=range(1, start + 1),  # skip header=0, then data rows
            nrows=chunksize,
            header=0 if start == 0 else None,
            engine="calamine"
        )

for chunk in read_excel_chunks("large.xlsx"):
    print(chunk.shape)

For CSV this is trivial — just use pd.read_csv(..., chunksize=N).


4. python-calamine engine for pandas

calamine is another Rust-backed reader, pluggable as a pandas engine:

pip install python-calamine
import pandas as pd

df = pd.read_excel("large.xlsx", engine="calamine")
# ~3-5x faster than default openpyxl engine

5. Writing large files: xlsxwriter > openpyxl

For writing, xlsxwriter has a constant-memory mode:

import xlsxwriter

wb = xlsxwriter.Workbook("output.xlsx", {"constant_memory": True})
ws = wb.add_worksheet()

for i in range(1_000_000):
    ws.write_row(i, 0, [i, i * 2, f"row_{i}"])

wb.close()
# Streams to disk row by row — O(1) memory

openpyxl write_only mode does the same:

from openpyxl import Workbook

wb = Workbook(write_only=True)
ws = wb.create_sheet()

for i in range(1_000_000):
    ws.append([i, i * 2, f"row_{i}"])  # streams to disk

wb.save("output.xlsx")

Decision Tree

Large xlsx (>100k rows)?
├── Just reading data → fastexcel + polars  ← best default
├── Need pandas API → pd.read_excel(engine="calamine")
├── Need formula values/styles → openpyxl read_only=True, iter_rows
├── Writing large file → xlsxwriter constant_memory or openpyxl write_only
└── Old .xls format → xlrd (openpyxl doesn't support .xls)

Quick benchmark mental model

For 1M rows × 10 cols:

Your M2 Air (16GB) will handle up to ~500k rows fine with calamine/fastexcel. For anything bigger, stream with read_only or dump to parquet first.


References


Back Donate