Optimizing Data Transfers and Preprocessing Workflows | Generated by AI
Here’s a breakdown of what’s happening in this sequence:
1. The Problem
You had a VPS and a GPU droplet (likely a cloud server with an H200 GPU).
Initially, transferring 61 GB of data between them over the external (public) network would have taken ~5 hours (probably due to limited bandwidth or high latency).
2. Setting Up Internal Network Communication
Both servers are on the same internal network (private IPs):
- VPS at 10.100.0.2
- GPU droplet at 10.100.3
You verified they can talk internally with very low latency (1.56 ms), much faster than going over the internet.
SSH Key Setup
You set up SSH key authentication from VPS → GPU droplet so transfers wouldn’t require a password.
One minor hiccup: the key was concatenated without a newline, breaking authentication; you fixed it.
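The newline hiccup is worth illustrating: sshd parses `authorized_keys` line by line, so two keys fused onto one line form a single malformed entry and neither authenticates. A minimal sketch (the key material and `valid_keys` helper are hypothetical, not from the actual setup):

```python
# sshd reads authorized_keys one line per key; appending a key without a
# preceding newline fuses it onto the previous entry, breaking both.
def valid_keys(authorized_keys_text):
    """Return lines that look like well-formed single-key entries."""
    keys = []
    for line in authorized_keys_text.splitlines():
        fields = line.split()
        # Minimal shape check: key type, base64 blob, optional comment.
        if len(fields) in (2, 3) and fields[0].startswith("ssh-"):
            keys.append(line)
    return keys

broken = "ssh-ed25519 AAAA... old@hostssh-ed25519 BBBB... vps@host"
fixed = "ssh-ed25519 AAAA... old@host\nssh-ed25519 BBBB... vps@host"
```

Here `valid_keys(broken)` finds no usable entries, while `valid_keys(fixed)` finds both, which is exactly the symptom and fix described above.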
3. Internal Network Transfer Speed
Once the internal network path was ready, you transferred the 61 GB dataset directly:
- Speed peaked at ~450 MB/s
- Total transfer time: ~5 minutes (vs. ~5 hours over the external network)
This huge speedup comes from:
- Private network — no internet routing bottlenecks
- High internal bandwidth (likely 10 Gbps or more)
- Low latency between nodes
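A quick back-of-the-envelope check of these figures (a sketch using only the numbers quoted above):

```python
# Sanity-check the transfer numbers quoted above (assumed, not re-measured).
size_mb = 61 * 1000                  # 61 GB in MB (decimal units)
peak_rate = 450                      # internal peak, MB/s
external_hours = 5                   # quoted external estimate

best_case_s = size_mb / peak_rate                   # ~136 s at sustained peak
external_rate = size_mb / (external_hours * 3600)   # ~3.4 MB/s effective
```

The ~5-minute actual time implies the average rate was roughly half the 450 MB/s peak, and the internal path is still about two orders of magnitude faster than the ~3.4 MB/s effective external rate.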
4. Data Preparation on GPU Droplet
After the transfer, you started data preprocessing on the GPU droplet (the H200 machine).
You ran 12 parallel workers, each using ~93% CPU — meaning the task was CPU-bound during preprocessing, not yet GPU-bound.
Processing Speed
- ~40 million tokens per second
- Each shard (likely a chunk of the dataset) processed in ~2 seconds
- Total shards: ~610
- After ~1 minute: 42 shards done (7.9 GB output) → ~7% done
- ETA: ~14–20 minutes to finish preprocessing
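These figures are internally consistent, which a short arithmetic check (using the quoted values) confirms:

```python
# Sanity-check the progress and ETA figures quoted above (assumed values).
total_shards = 610
done_after_1min = 42

rate_per_min = done_after_1min / 1.0                      # ~42 shards/min
remaining_min = (total_shards - done_after_1min) / rate_per_min  # ~13.5 min
progress = done_after_1min / total_shards                 # ~0.07 -> ~7% done
```

So ~42 shards/minute leaves roughly 13.5 minutes of work, matching the low end of the ~14-20 minute ETA.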
5. Why This Matters
This workflow shows a common ML data pipeline optimization:
- Transfer data internally — not over public internet — to avoid slow speeds.
- Use high-performance storage and compute (H200 + many CPU cores) to preprocess quickly.
- Parallelize tokenization/shuffling/writing to maximize throughput.
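The shard-parallel pattern behind that last point can be sketched as follows. This is a minimal illustration, assuming a per-shard worker function; `process_shard` and `preprocess` are hypothetical stand-ins, not the actual pipeline code:

```python
# Shard-parallel preprocessing: CPU-bound work fanned out across processes,
# one shard per task, matching the 12-worker / ~93%-CPU pattern above.
from multiprocessing import Pool

def process_shard(shard_id):
    # Stand-in for the real work: read a raw chunk, tokenize,
    # shuffle, and write one packed output shard.
    token_count = 1000 + shard_id        # placeholder result
    return shard_id, token_count

def preprocess(num_shards, workers=12):
    # Pool.map distributes shards across worker processes and
    # collects (shard_id, token_count) results in order.
    with Pool(processes=workers) as pool:
        return dict(pool.map(process_shard, range(num_shards)))
```

Because each shard is independent, throughput scales nearly linearly with worker count until the CPUs (or disk) saturate, which is what the ~93%-per-worker CPU usage indicates.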
The result:
From raw data on a VPS → preprocessed dataset ready on a GPU server in under 30 minutes total, where the naïve approach would have taken hours just for the transfer.
Let me know if you’d like a deeper explanation of the internal networking setup, the parallel preprocessing logic, or how to monitor such pipelines.