<?xml version="1.0" encoding="UTF-8"?><oembed><type>video</type><version>1.0</version><html>&lt;iframe src=&quot;https://www.loom.com/embed/78cd9255812749118487fd38169f0f17&quot; frameborder=&quot;0&quot; width=&quot;1670&quot; height=&quot;1252&quot; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;</html><height>1252</height><width>1670</width><provider_name>Loom</provider_name><provider_url>https://www.loom.com</provider_url><thumbnail_height>1252</thumbnail_height><thumbnail_width>1670</thumbnail_width><thumbnail_url>https://cdn.loom.com/sessions/thumbnails/78cd9255812749118487fd38169f0f17-76353b6be8e840e7.gif</thumbnail_url><duration>172.906667</duration><title>Predict GPU Failures with Human Approval</title><description>This Loom explains an AI-driven GPU failure prediction system for data centers. It describes monitoring that predicts GPU failures before they happen, identifying recurring failure patterns by searching known data signatures and analyzing memory logs. Using a real GPU trace from the GWBG research cluster, it detects and estimates failure risk about 18 hours before a crash, then guides operators with clear actions while keeping humans in control of critical decisions. It notes that the model cannot initiate destructive actions, and that Slack integration is available for notifications and coordination.</description></oembed>