Watch Demo

<iframe src="https://www.loom.com/embed/88cac91c41e741cdb72310ccf7c22904" frameborder="0" width="2260" height="1695" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>

I set up a consistent prompt and evaluated multiple customer support models against a dataset of real questions with reference answers. I used an LLM as a judge, pointing it to the reference answers; Ziggy can help flag issues and fix the evaluation prompt if needed. I chose my judge model and then compared a few options, such as Frontier and Open Source, to see the differences in quality, speed, and cost. We are now ready to run inference, starting with an overview that ranks the models.
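The workflow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool shown in the demo: `Example`, `ModelSpec`, `evaluate`, `PROMPT_TEMPLATE`, and the flat `cost_per_call` figure are all hypothetical names introduced here. `overlap_judge` is a simple word-overlap placeholder standing in for the LLM judge, so the sketch runs without any API access.

```python
import time
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# One fixed prompt shared by every candidate model (hypothetical wording).
PROMPT_TEMPLATE = (
    "You are a customer support agent. Answer the question.\n\n"
    "Question: {question}"
)


@dataclass
class Example:
    question: str
    reference: str  # reference answer the judge compares against


@dataclass
class ModelSpec:
    name: str
    generate: Callable[[str], str]  # prompt -> answer; wrap a real client here
    cost_per_call: float            # assumed flat cost per request


# Judge signature: (question, answer, reference) -> score in [0, 1].
Judge = Callable[[str, str, str], float]


def overlap_judge(question: str, answer: str, reference: str) -> float:
    """Placeholder judge: fraction of reference words found in the answer.

    In a real setup this function would call the chosen judge model and
    parse its verdict; word overlap is only a stand-in for illustration.
    """
    ref_words = set(reference.lower().split())
    ans_words = set(answer.lower().split())
    return len(ref_words & ans_words) / len(ref_words) if ref_words else 0.0


def evaluate(models: List[ModelSpec], dataset: List[Example],
             judge: Judge) -> List[Dict[str, object]]:
    """Run every model on every example and rank models by mean judge score."""
    rows = []
    for model in models:
        scores, latencies = [], []
        for ex in dataset:
            prompt = PROMPT_TEMPLATE.format(question=ex.question)
            start = time.perf_counter()
            answer = model.generate(prompt)
            latencies.append(time.perf_counter() - start)
            scores.append(judge(ex.question, answer, ex.reference))
        rows.append({
            "model": model.name,
            "quality": mean(scores),
            "avg_latency_s": mean(latencies),
            "total_cost": model.cost_per_call * len(dataset),
        })
    # Overview ranking: best quality first.
    return sorted(rows, key=lambda r: r["quality"], reverse=True)
```

Keeping the judge as a plain function makes it easy to swap the placeholder for a real judge model later without touching the ranking logic, and recording latency and cost alongside the score gives the quality, speed, and cost comparison in a single pass.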