<?xml version="1.0" encoding="UTF-8"?><oembed><type>video</type><version>1.0</version><html>&lt;iframe src=&quot;https://www.loom.com/embed/d5b4429eb9e249dc95307492ce6adb7e&quot; frameborder=&quot;0&quot; width=&quot;1662&quot; height=&quot;1246&quot; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;</html><height>1246</height><width>1662</width><provider_name>Loom</provider_name><provider_url>https://www.loom.com</provider_url><thumbnail_height>1246</thumbnail_height><thumbnail_width>1662</thumbnail_width><thumbnail_url>https://cdn.loom.com/sessions/thumbnails/d5b4429eb9e249dc95307492ce6adb7e-2fb7444c4d26b4e0.gif</thumbnail_url><duration>289.72</duration><title>Ring All-Reduce vs Parameter Server</title><description>This Loom presents a final project implementing two collective communication algorithms for distributed machine learning: a naive parameter server and a ring all-reduce based on NCCL and PyTorch. It explains a token bucket rate limiter that fills at a fixed rate in bytes per second and blocks before each write until enough tokens are available, then details ring all-reduce’s scatter-reduce and all-gather phases, noting that send and receive must run concurrently to avoid deadlock. It contrasts this with the parameter server’s linear communication bottleneck, where one machine receives and broadcasts n times the array size. The author demonstrates correctness with 3 workers and a 9-element array and then tests extreme cases with 1 million 64-bit floats per worker (8 MB each) using a simulated 100 MB/s per-link bandwidth.</description></oembed>