Abstract
Distributed model training is vulnerable to byzantine system failures and adversarial compute nodes, i.e., nodes that use malicious updates to corrupt the global model stored at a parameter server (PS). To guarantee some form of robustness, recent work suggests using variants of the ge ometric median as an aggregation rule, in place of gradient averaging. Unfortunately, median-based rules can incur a prohibitive computational overhead in large-scale settings, and their convergence guarantees often require strong assumptions. In this work, we present D RACO, a scalable framework for robust distributed training that uses idea from coding theory. In D RACO, each compute node evaluates redundant gradients that are used by the parameter server to eliminate the effects of adversarial updates. D RACO comes with problemindependent robustness guarantees, and the model that it trains is identical to the one trained in t adversary-free setup. We provide extensive experiments on real datasets and distributed setups across a variety of large-scale models, where we show that D RACO is several times, to orders of magnitude faster than median-based approaches.