SPARTA is a novel algorithm for distributed training in low-communication environments. During my time doing research at EXO labs, most of my effort went into designing and running experiments to probe the limits of this algorithm and identify the constraints under which it succeeds. More work is needed before SPARTA matches true data parallelism, but it already outperforms single-node training significantly with only a minimal increase in communication time. I also found that communication and training can be performed in parallel, hiding communication time entirely while preserving the same gains in convergence speed.
Training large language models (LLMs) at scale requires efficient training across multiple computational nodes. Traditional distributed training approaches synchronize data between nodes at each iteration on the scale of the full model size, resulting in high communication overhead. While recent advances in federated learning mitigate this by allowing infrequent synchronization, they still require transmitting the entire set of model gradients during updates. In this work, we propose SPARTA, a sparse parameter averaging scheme in which a small fraction of the model parameters is averaged between workers at each step. SPARTA offers several advantages: it supports asynchronous updates of up to 10 steps without performance degradation, is resilient to network faults, and is simple to implement. Our experiments demonstrate that SPARTA outperforms state-of-the-art federated learning methods such as DiLoCo in both bandwidth efficiency and practical robustness.
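To make the idea concrete, here is a minimal sketch of sparse parameter averaging, assuming a PyTorch setup with torch.distributed already initialized and an identical model replica on every worker. The fraction `p`, the per-step seeding trick used to keep index selection in sync across workers, and the function name are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.distributed as dist


def sparse_average_step(model: torch.nn.Module, step: int, p: float = 0.005) -> None:
    """Average a random fraction `p` of each parameter tensor across all workers."""
    world_size = dist.get_world_size()
    for i, param in enumerate(model.parameters()):
        flat = param.data.view(-1)
        n = max(1, int(p * flat.numel()))
        # Seed from the step and parameter index so every worker picks the
        # same entries; broadcasting the indices would also work.
        g = torch.Generator().manual_seed(step * 100003 + i)
        idx = torch.randperm(flat.numel(), generator=g)[:n].to(flat.device)
        chunk = flat[idx].clone()
        dist.all_reduce(chunk, op=dist.ReduceOp.SUM)  # sum the selected entries
        flat[idx] = chunk / world_size                # write back the average
```

In a training loop this would be called once per iteration after `optimizer.step()`, so only `p` of the model's parameters cross the network each step instead of the full gradient or weight tensor.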
Matt Beton, Mohamed Baioumy, Tycho van der Ouderaa, Matthew Reed, Seth Howes, Alex Cheema, Gelu Vrabie