{"type":"video","version":"1.0","html":"<iframe src=\"https://www.loom.com/embed/d5b4429eb9e249dc95307492ce6adb7e\" frameborder=\"0\" width=\"1662\" height=\"1246\" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>","height":1246,"width":1662,"provider_name":"Loom","provider_url":"https://www.loom.com","thumbnail_height":1246,"thumbnail_width":1662,"thumbnail_url":"https://cdn.loom.com/sessions/thumbnails/d5b4429eb9e249dc95307492ce6adb7e-2fb7444c4d26b4e0.gif","duration":289.72,"title":"Ring All-Reduce vs Parameter Server","description":"This Loom presents a final project implementing two collective communication algorithms for distributed machine learning: a naive parameter server and a ring all-reduce based on NCCL and PyTorch. It explains a token bucket rate limiter that refills at a fixed rate in bytes per second and blocks before each write until enough tokens are available, then details ring all-reduce’s scatter-reduce and all-gather phases, noting that send and receive must run concurrently to avoid deadlock. It contrasts this with the parameter server’s linear communication bottleneck, where a single machine must receive and broadcast n times the array size for n workers. The author demonstrates correctness with 3 workers and a 9-element array, then tests extreme cases with 1 million 64-bit floats per worker (8 MB each) using a simulated bandwidth of 100 MB/s per link."}