<?xml version="1.0" encoding="UTF-8"?><oembed><type>video</type><version>1.0</version><html>&lt;iframe src=&quot;https://www.loom.com/embed/08680fe052ba4686bb48185123fef9ba&quot; frameborder=&quot;0&quot; width=&quot;1658&quot; height=&quot;1243&quot; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;</html><height>1243</height><width>1658</width><provider_name>Loom</provider_name><provider_url>https://www.loom.com</provider_url><thumbnail_height>1243</thumbnail_height><thumbnail_width>1658</thumbnail_width><thumbnail_url>https://cdn.loom.com/sessions/thumbnails/08680fe052ba4686bb48185123fef9ba-64ed0e934b427ad6.gif</thumbnail_url><duration>91.563</duration><title>Vision Transformers and Foundation Model Advances</title><description>This Loom explains how transformers expanded from NLP to become a dominant architecture in computer vision. It starts with attention mechanisms, covering queries, keys, and values, scaled dot-product attention, and multi-head attention. The video then shows how transformers process images by splitting them into patches treated like NLP tokens, adding patch embeddings and positional encodings to capture spatial structure. It outlines the vision transformer architecture with transformer blocks and multi-head attention, and also briefly covers modern foundation models including CLIP, DINOv2, and SAM for segmentation.</description></oembed>