SSH Banner Stall Under Memory Pressure | Generated by AI
Question: Why does sshd accept TCP connections but never send its banner when the machine is under severe memory/CPU pressure?
Answer:
The SSH Banner Problem — Root Cause
An SSH connection starts in two phases, the TCP handshake and then the SSH banner:
Client                 Server
  |                      |
  |--- TCP SYN --------->|   ← port 22 open, kernel accepts (listen backlog)
  |<-- TCP SYN-ACK ------|   ← this works even on a frozen machine
  |--- TCP ACK --------->|
  |                      |
  |<-- SSH-2.0-OpenSSH-  |   ← THIS NEVER ARRIVES (banner)
The TCP handshake (SYN/SYN-ACK/ACK) is handled by the kernel’s network stack — it lives in kernel space and runs even when userspace is completely frozen. The kernel has a socket accept queue; it will complete TCP connections autonomously.
But sending the SSH banner requires sshd to actually run — it’s a userspace process. It needs to:
- Be scheduled by the CPU scheduler
- accept() the connection from the kernel queue
- Allocate memory for the session
- Write SSH-2.0-OpenSSH-x.x\r\n to the socket
If the machine is swapping or CPU-starved, sshd never gets scheduled, or tries to allocate memory and blocks waiting for swap pages to come in — so the banner never arrives. Your SSH client just… waits.
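The split between the two phases can be reproduced and diagnosed from the client side. The sketch below is illustrative only: `probe_ssh_banner` and `open_silent_listener` are hypothetical helper names, and the listener simply never calls accept(), which imitates a frozen sshd rather than reproducing real memory pressure.

```python
import socket


def open_silent_listener():
    """Listen on a loopback port but never call accept(), imitating a frozen sshd."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(4)
    return server, server.getsockname()[1]


def probe_ssh_banner(host, port=22, timeout=5.0):
    """Classify a host as unreachable, connected without a banner, or sending a banner."""
    try:
        # The kernel on the far side completes this handshake on its own.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return ("unreachable", None)
    with sock:
        try:
            # Only a running userspace server can produce these bytes.
            data = sock.recv(256)
        except socket.timeout:
            return ("connected_no_banner", None)
        except OSError:
            return ("unreachable", None)
    if not data:
        return ("closed", None)
    return ("banner", data.decode("ascii", errors="replace").strip())
```

Against the silent listener the probe connects but times out waiting for the banner, which is the same symptom an SSH client sees on the starved machine.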
Why the Tokenize Job Causes This
Memory path
16 workers × 270MB parquet file = 4.3 GB just for raw reads
+ tokenized output lists in each worker
+ main process memory
+ OS page cache
─────────────────────────────────────────────────────
easily 10–15 GB total on a 16GB or 32GB machine
When RAM fills up, the kernel starts swapping. A spinning disk sustains roughly 100 MB/s for sequential transfers and far less for the scattered page-sized reads that swapping produces. With gigabytes to move, the machine becomes essentially frozen: every process, including sshd, waits on I/O to page its code and stack back in before it can run.
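The arithmetic above can be turned into a rough sizing helper. This is a back-of-the-envelope sketch, not a measurement: the function names are made up for illustration, and `per_worker_gb` is an assumed figure you would have to measure for your own job (decoded Parquet data and Python token lists are usually larger than the file on disk).

```python
def raw_read_gb(workers: int, file_mb: float) -> float:
    """Memory held if every worker has one whole file loaded at the same time."""
    return workers * file_mb / 1000.0


def max_workers_for_budget(budget_gb: float, per_worker_gb: float) -> int:
    """Largest worker count whose combined footprint fits the budget (never below 1)."""
    return max(1, int(budget_gb // per_worker_gb))


# The figures from the text: 16 workers, one 270 MB file each.
print(f"{raw_read_gb(16, 270):.2f} GB for raw reads alone")
```

For example, with 8 GB set aside for workers and an assumed 2 GB peak per worker, `max_workers_for_budget(8.0, 2.0)` gives 4.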
CPU path
mp.Pool with 16 workers on a machine with (say) 8 cores means the scheduler is juggling 16 CPU-bound Python processes plus the main process. sshd gets no special priority by default, so it has to compete with all of them for CPU time and can be starved.
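One way to leave CPU headroom for sshd is to size the pool below the core count and lower the workers' scheduling priority. The sketch below assumes a Unix host; `pick_worker_count` and `lower_worker_priority` are hypothetical names, and the defaults (reserve one core, cap at 4, nice increment of 10) are illustrative choices rather than tuned values.

```python
import os


def pick_worker_count(cpu_count=None, reserve=1, cap=4):
    """Leave `reserve` cores free for the rest of the system and never exceed `cap`."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cap, cpus - reserve))


def lower_worker_priority(increment=10):
    """Pool initializer: raise this worker's nice value so other processes run first."""
    if hasattr(os, "nice"):  # os.nice is only available on Unix
        os.nice(increment)
```

It would be wired in as `mp.Pool(pick_worker_count(), initializer=lower_worker_priority)`. Lower priority helps with CPU contention only; it does nothing for a machine that is already swapping.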
The silent hang in pool.imap
for result in pool.imap(worker_fn, files):  # main process can block here forever
    ...
If a worker gets OOM-killed by the kernel (SIGKILL from the OOM killer), pool.imap doesn’t raise an exception. The pool quietly starts a replacement worker, but the task that died with the old one is never retried, so its result is never yielded. The main process sits in the for loop waiting. No timeout, no error. This is a well-known Python multiprocessing footgun.
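A simple guard is to pull results with a timeout instead of iterating directly. The iterator returned by pool.imap has a `next(timeout=...)` method that raises `multiprocessing.TimeoutError`. The wrapper below is a minimal sketch: `imap_with_timeout` is a hypothetical helper, and the timeout has to be chosen longer than the slowest legitimate file.

```python
import multiprocessing as mp


def imap_with_timeout(pool, fn, items, timeout_s):
    """Yield results in order, failing loudly if one does not arrive in time."""
    items = list(items)
    results = pool.imap(fn, items)
    for item in items:
        try:
            yield results.next(timeout=timeout_s)
        except mp.TimeoutError as exc:
            # A lost task (for example an OOM-killed worker) ends up here
            # instead of hanging the main process forever.
            raise RuntimeError(
                f"no result for {item!r} within {timeout_s}s; "
                "a worker may have been killed"
            ) from exc
```

In the loop above this would read `for result in imap_with_timeout(pool, worker_fn, files, 600):`. An alternative is `concurrent.futures.ProcessPoolExecutor`, which raises `BrokenProcessPool` when a worker dies abruptly.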
Why ping works but SSH doesn’t
| Operation | Who handles it | Survives memory starvation? |
|---|---|---|
| ping reply (ICMP echo) | kernel network stack | ✅ yes |
| TCP SYN-ACK (port 22) | kernel TCP stack | ✅ yes |
| SSH banner | sshd userspace process | ❌ no |
| Any SSH command | bash + userspace | ❌ no |
The kernel stays partially alive even in extreme conditions — it handles interrupts, network packets, and the OOM killer itself. Userspace dies first.
The Fix (streaming approach)
Instead of loading whole parquet files:
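import pyarrow.parquet as pq

def worker_streaming(path):
    # `tokenizer` is assumed to be a module-level object available in each worker.
    pf = pq.ParquetFile(path)
    tokens = []
    # Read 1000 rows of the "content" column at a time instead of the whole file.
    for batch in pf.iter_batches(batch_size=1000, columns=["content"]):
        for text in batch.column("content").to_pylist():
            tokens.extend(tokenizer.encode(text))
    return tokens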
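This limits the raw text each worker holds at any moment to about batch_size rows instead of the entire file. The tokens list still grows to cover the whole file, so if the tokenized output is itself large, write it out per batch rather than returning it in one piece. Also drop workers to 2–4 so you don’t multiply the pressure.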
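If the accumulated output is also a concern, the same idea can be pushed one step further by yielding tokens per batch. This is a sketch under assumptions: `iter_token_chunks` and `parquet_text_batches` are hypothetical names, `encode` stands in for `tokenizer.encode`, and skipping null cells is a choice made here, not something the original worker does.

```python
from typing import Callable, Iterable, Iterator, List


def iter_token_chunks(
    text_batches: Iterable[List[str]],
    encode: Callable[[str], List[int]],
) -> Iterator[List[int]]:
    """Yield one flat token list per batch instead of accumulating the whole file."""
    for texts in text_batches:
        chunk: List[int] = []
        for text in texts:
            if text is None:  # assumption: skip null cells
                continue
            chunk.extend(encode(text))
        yield chunk


def parquet_text_batches(path, column="content", batch_size=1000):
    """Yield lists of strings, one list per batch of rows read from a Parquet file."""
    import pyarrow.parquet as pq  # imported lazily so the helper above has no dependency

    pf = pq.ParquetFile(path)
    for batch in pf.iter_batches(batch_size=batch_size, columns=[column]):
        yield batch.column(column).to_pylist()
```

A worker would then loop over `iter_token_chunks(parquet_text_batches(path), tokenizer.encode)`, append each chunk to an output file, and return only the output path or a token count, since a generator cannot be sent back through the pool.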