Ring All-Reduce vs Parameter Server

Video: https://www.loom.com/embed/d5b4429eb9e249dc95307492ce6adb7e

This Loom presents a final project implementing two collective communication algorithms for distributed machine learning: a naive parameter server and a ring all-reduce based on NCCL and PyTorch.

It first explains a token bucket rate limiter that fills at a fixed number of bytes per second and blocks each write until enough tokens are available. It then details ring all-reduce's two phases, scatter-reduce and all-gather, noting that send and receive must run concurrently to avoid deadlock. It contrasts this with the parameter server's linear communication bottleneck, where one machine receives and broadcasts n times the array size.

The author demonstrates correctness with 3 workers and a 9-element array, then tests extreme cases with 1 million 64-bit floats per worker (8 MB each) using a simulated bandwidth of 100 MB/s per link.
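The token bucket described above can be sketched as follows. This is a minimal illustration, not the author's code: the class name `TokenBucket`, the helper `limited_send`, the 64 KiB chunk size, and the choice to start with a full bucket are all assumptions made for the example.

```python
import threading
import time


class TokenBucket:
    """Byte-based token bucket: refills at `rate` bytes/s, capped at `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.tokens = float(capacity)  # assumption: the bucket starts full
        self._clock = clock            # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = clock()
        self._lock = threading.Lock()

    def _refill(self):
        # Add tokens for the time elapsed since the last refill.
        now = self._clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self._last) * self.rate)
        self._last = now

    def consume(self, nbytes):
        """Block until `nbytes` tokens are available, then spend them."""
        if nbytes > self.capacity:
            raise ValueError("request exceeds bucket capacity; split the write")
        while True:
            with self._lock:
                self._refill()
                if self.tokens >= nbytes:
                    self.tokens -= nbytes
                    return
                # Time needed for the missing tokens to accumulate.
                wait = (nbytes - self.tokens) / self.rate
            self._sleep(wait)


def limited_send(sock, data, bucket, chunk=64 * 1024):
    """Hypothetical helper: pay for each chunk before writing it to `sock`."""
    for i in range(0, len(data), chunk):
        piece = data[i:i + chunk]
        bucket.consume(len(piece))
        sock.sendall(piece)
```

To simulate a 100 MB/s link, one bucket per link could be created with `TokenBucket(rate=100e6, capacity=1e6)`; the capacity bounds how large a burst is allowed and is a free parameter here.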
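The two phases of ring all-reduce can be traced with a small single-process simulation. This is a sketch under stated assumptions rather than the project's implementation: the function name `ring_all_reduce` and the input values are illustrative, and the network is replaced by in-memory lists. Each step first snapshots every worker's outgoing chunk and only then applies the receives, which stands in for the requirement that send and receive run concurrently.

```python
def ring_all_reduce(worker_data):
    """Simulate a sum all-reduce over a ring of len(worker_data) workers."""
    n = len(worker_data)
    length = len(worker_data[0])
    assert length % n == 0, "this sketch assumes the array splits evenly"
    size = length // n

    # chunks[r][c] is worker r's local copy of chunk c.
    chunks = [[list(d[c * size:(c + 1) * size]) for c in range(n)]
              for d in worker_data]

    # Phase 1: scatter-reduce. At each step worker r sends chunk (r - step) % n
    # to its right neighbour, which adds it to its own copy.
    for step in range(n - 1):
        outgoing = [((r - step) % n, list(chunks[r][(r - step) % n]))
                    for r in range(n)]  # snapshot: all sends happen "at once"
        for r, (c, payload) in enumerate(outgoing):
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]

    # Worker r now holds the fully reduced chunk (r + 1) % n.
    # Phase 2: all-gather. Reduced chunks travel around the ring and
    # overwrite the stale copies.
    for step in range(n - 1):
        outgoing = [((r + 1 - step) % n, list(chunks[r][(r + 1 - step) % n]))
                    for r in range(n)]
        for r, (c, payload) in enumerate(outgoing):
            chunks[(r + 1) % n][c] = payload

    # Flatten each worker's chunks back into one array.
    return [[x for chunk in worker for x in chunk] for worker in chunks]


if __name__ == "__main__":
    # 3 workers, 9 elements each (values chosen only for illustration).
    data = [[float(i + 10 * w) for i in range(9)] for w in range(3)]
    for row in ring_all_reduce(data):
        print(row)
```

Every worker ends with the same element-wise sum. In a real socket-based version, each worker would need to send on one thread while receiving on another (or use non-blocking I/O); if every worker blocks in send first, the ring can deadlock once buffers fill.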
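The bottleneck comparison can be made concrete with a rough cost model. This sketch rests on several assumptions not stated in the description: latency is ignored, 1 MB is taken as 10^6 bytes, the parameter server's receives and broadcasts are serialized at the link rate, and all ring links transfer in parallel. The function names are hypothetical, and the worker count of 3 is borrowed from the correctness demo because the count used in the large test is not given.

```python
def parameter_server_time(n, array_bytes, link_bytes_per_s):
    """Server receives n arrays, then sends n arrays back (serialized)."""
    return 2 * n * array_bytes / link_bytes_per_s


def ring_all_reduce_time(n, array_bytes, link_bytes_per_s):
    """Each worker sends 2 * (n - 1) chunks of array_bytes / n bytes;
    all links are assumed to run in parallel."""
    return 2 * (n - 1) * (array_bytes / n) / link_bytes_per_s


if __name__ == "__main__":
    array_bytes = 1_000_000 * 8   # 1 million 64-bit floats = 8 MB
    link = 100_000_000            # 100 MB/s
    for n in (3, 8, 32):          # worker counts are illustrative
        ps = parameter_server_time(n, array_bytes, link)
        ring = ring_all_reduce_time(n, array_bytes, link)
        print(f"n={n:2d}  parameter server: {ps:.3f} s  ring: {ring:.3f} s")
```

Under this model the parameter server's time grows linearly with n, while the ring's time approaches 2 × array size / bandwidth (0.16 s for these numbers) regardless of how many workers join. Measured times from the simulated links may differ.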