arxiv:2602.06499

FCDP: Fully Cached Data Parallel for Communication-Avoiding Large-Scale Training

Published on Feb 6

AI-generated summary

FCDP reduces inter-node communication in distributed deep learning training by caching forward-pass parameters in host memory, achieving significantly higher throughput than existing methods while maintaining memory efficiency.

Abstract

Training billion-parameter models requires distributing model states across GPUs using fully sharded data parallel (i.e., ZeRO-3). While ZeRO-3 succeeds on clusters with high-bandwidth NVLink and InfiniBand interconnects, researchers with commodity hardware face severe inter-node all-gather bottlenecks. Existing optimizations take two approaches: GPU memory caching (MiCS, ZeRO++) trades memory capacity for reduced communication, triggering out-of-memory failures on large models; host memory offloading (ZeRO-Offload, ZeRO-Infinity) extends capacity but degrades throughput due to PCIe overhead. We observe that on bandwidth-limited clusters, host memory can serve not as an overflow tier but as a fast caching layer that outperforms inter-node communication. Based on this insight, we propose FCDP, which eliminates redundant inter-node communication while preserving ZeRO-3's minimal GPU memory footprint. FCDP caches forward-pass parameters in host memory and reuses them during the backward pass via fast intra-node all-gather, reducing inter-node all-gather by 50%. For parameter-efficient fine-tuning (PEFT), FCDP selectively communicates only trainable parameters to maximize caching, reducing inter-node traffic by over 99%. In our commodity cluster setup, FCDP achieves up to 100x higher throughput than ZeRO-3 and 51x higher than ZeRO++, while maintaining ZeRO-3's maximum batch size.
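The 50% and "over 99%" figures can be sanity-checked with a back-of-the-envelope count. The sketch below is not the paper's implementation; it is a simplified counting model under stated assumptions: ZeRO-3 performs one inter-node parameter all-gather for the forward pass and one for the backward pass, a host-memory cache makes the backward gather free of inter-node traffic, frozen parameters under PEFT stay valid in the cache across steps, and gradient reduce-scatter and the one-time initial fetch of frozen parameters are ignored. The function names, the 7B model size, the 1% trainable fraction, and the 2-byte parameter width are all hypothetical.

```python
def allgather_volume(total_params, trainable_params=None,
                     bytes_per_param=2, host_cache=False):
    """Per-step inter-node all-gather bytes under a simplified counting model.

    Assumption: without a host cache, every parameter is gathered twice
    (forward and backward). With a host cache, only parameters that changed
    since the last step (the trainable ones) are gathered, and only once.
    """
    if trainable_params is None:
        trainable_params = total_params  # full fine-tuning or pre-training
    if not host_cache:
        return 2 * total_params * bytes_per_param
    return trainable_params * bytes_per_param


def reduction(baseline_bytes, cached_bytes):
    """Fraction of inter-node all-gather traffic avoided."""
    return 1.0 - cached_bytes / baseline_bytes


total = 7_000_000_000          # hypothetical 7B-parameter model
baseline = allgather_volume(total)

# Full training: the backward pass reuses the host-cached parameters.
full_cached = allgather_volume(total, host_cache=True)
full_saving = reduction(baseline, full_cached)

# PEFT: assume 1% of parameters are trainable; frozen ones stay cached.
peft_cached = allgather_volume(total, trainable_params=total // 100,
                               host_cache=True)
peft_saving = reduction(baseline, peft_cached)

print(f"full training saving: {full_saving:.1%}")
print(f"PEFT saving:          {peft_saving:.1%}")
```

Under these assumptions the full-training case halves the all-gather volume, matching the abstract's 50% figure, and the PEFT case exceeds 99% whenever the trainable fraction is a small percentage of the model. Real savings depend on details this model leaves out, such as gradient traffic and cache warm-up.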
