<?xml version="1.0" encoding="UTF-8"?><oembed><type>video</type><version>1.0</version><html>&lt;iframe src=&quot;https://www.loom.com/embed/c548c766916f4ecebdfb997d55ab5fa0&quot; frameborder=&quot;0&quot; width=&quot;1108&quot; height=&quot;831&quot; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;</html><height>831</height><width>1108</width><provider_name>Loom</provider_name><provider_url>https://www.loom.com</provider_url><thumbnail_height>831</thumbnail_height><thumbnail_width>1108</thumbnail_width><thumbnail_url>https://cdn.loom.com/sessions/thumbnails/c548c766916f4ecebdfb997d55ab5fa0-a6a9b2300d59a6c0.gif</thumbnail_url><duration>637.378</duration><title>Agent Evals in Crafting</title><description>We demonstrate how to use Crafting to evaluate and iterate on a customer-facing agent. Since agent runs are non-deterministic, we set up multiple parallel agent runs in sandboxes to measure performance changes from prompt improvements. After establishing a baseline, we open a PR with improvements to automatically trigger another round of evals inside Crafting.

https://github.com/crafting-demo/agent-validation</description></oembed>