<?xml version="1.0" encoding="UTF-8"?><oembed><type>video</type><version>1.0</version><html>&lt;iframe src=&quot;https://www.loom.com/embed/50452061e9394ef28d5001b6c9227631&quot; frameborder=&quot;0&quot; width=&quot;1662&quot; height=&quot;1246&quot; webkitallowfullscreen mozallowfullscreen allowfullscreen&gt;&lt;/iframe&gt;</html><height>1246</height><width>1662</width><provider_name>Loom</provider_name><provider_url>https://www.loom.com</provider_url><thumbnail_height>1246</thumbnail_height><thumbnail_width>1662</thumbnail_width><thumbnail_url>https://cdn.loom.com/sessions/thumbnails/50452061e9394ef28d5001b6c9227631-94b36a8604a0c89c.gif</thumbnail_url><duration>230.891</duration><title>AI Engineer Assignment Evaluation Results</title><description>I start by showing that the agent itself was not touched and that all Sourcetree tests and hard metrics passed. I then open the evaluation reports, which show an 82 percent pass rate with three failing cases, all failing for the same Marksteps-related reason. Next, I run the evaluation with the efficiency filter and no exteriors, using API calls with concurrency 2 and a maximum usage of $5; it finishes in about 8 seconds, and everything passes. Finally, I change the judge validation prompts for Judge V1 and V2 and rerun the evaluation, which fails because the answer-rejects pattern does not match. No action was requested from viewers.</description></oembed>