<?xml version="1.0" encoding="UTF-8"?><oembed><type>video</type><version>1.0</version><html>&lt;iframe src=&quot;https://www.loom.com/embed/88cac91c41e741cdb72310ccf7c22904&quot; frameborder=&quot;0&quot; width=&quot;2260&quot; height=&quot;1695&quot; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;</html><height>1695</height><width>2260</width><provider_name>Loom</provider_name><provider_url>https://www.loom.com</provider_url><thumbnail_height>1695</thumbnail_height><thumbnail_width>2260</thumbnail_width><thumbnail_url>https://cdn.loom.com/sessions/thumbnails/88cac91c41e741cdb72310ccf7c22904-6639824ebe38deb4.gif</thumbnail_url><duration>80.86</duration><title>Watch Demo</title><description>I set up a consistent prompt and evaluated multiple customer support models against a dataset of real questions with reference answers. I used an LLM as a judge, pointing it to the reference answers, and Ziggy can help flag issues and fix the evaluation prompt if needed. I chose my judge model and then compared a few options like Frontier and Open Source to see differences in quality, speed, and cost. We are ready to run inference, starting with an overview ranking the models.</description></oembed>