{"type":"video","version":"1.0","html":"<iframe src=\"https://www.loom.com/embed/88cac91c41e741cdb72310ccf7c22904\" frameborder=\"0\" width=\"2260\" height=\"1695\" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>","height":1695,"width":2260,"provider_name":"Loom","provider_url":"https://www.loom.com","thumbnail_height":1695,"thumbnail_width":2260,"thumbnail_url":"https://cdn.loom.com/sessions/thumbnails/88cac91c41e741cdb72310ccf7c22904-6639824ebe38deb4.gif","duration":80.86,"title":"Watch Demo","description":"I set up a consistent prompt and evaluated multiple customer support models against a dataset of real questions with reference answers. I used an LLM as a judge and pointed it to the reference answers; Ziggy can help flag issues and fix the evaluation prompt if needed. I chose my judge model, then compared a few options, such as Frontier and Open Source, to see differences in quality, speed, and cost. We are ready to run inference, starting with an overview that ranks the models."}