arxiv:2602.06499

FCDP: Fully Cached Data Parallel for Communication-Avoiding Large-Scale Training

Published on Feb 6

Authors:

Abstract

FCDP reduces inter-node communication in distributed deep learning training by caching forward-pass parameters in host memory, achieving significantly higher throughput than existing methods while maintaining memory efficiency.

AI-generated summary

Training billion-parameter models requires distributing model states across GPUs using fully sharded data parallel (i.e., ZeRO-3). While ZeRO-3 succeeds on clusters with high-bandwidth NVLink and InfiniBand interconnects, researchers with commodity hardware face severe inter-node all-gather bottlenecks. Existing optimizations take two approaches: GPU memory caching (MiCS, ZeRO++) trades memory capacity for reduced communication, triggering out-of-memory failures on large models; host memory offloading (ZeRO-Offload, ZeRO-Infinity) extends capacity but degrades throughput due to PCIe overhead. We observe that on bandwidth-limited clusters, host memory can serve not as an overflow tier but as a fast caching layer that outperforms inter-node communication. Based on this insight, we propose FCDP, which eliminates redundant inter-node communication while preserving ZeRO-3's minimal GPU memory footprint. FCDP caches forward-pass parameters in host memory and reuses them during the backward pass via fast intra-node all-gather, reducing inter-node all-gather by 50%. For parameter-efficient fine-tuning (PEFT), FCDP selectively communicates only trainable parameters to maximize caching, reducing inter-node traffic by over 99%. In our commodity cluster setup, FCDP achieves up to 100x higher throughput than ZeRO-3 and 51x higher than ZeRO++, while maintaining ZeRO-3's maximum batch size.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2602.06499

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2602.06499 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2602.06499 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2602.06499 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.