Detecting Sandbagging in LLMs: A Methodology Overview

video1.0<iframe src="https://www.loom.com/embed/86aaf5726d69496789d73763fadd56b8" frameborder="0" width="790" height="592" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>592790Loomhttps://www.loom.com592790https://cdn.loom.com/sessions/thumbnails/86aaf5726d69496789d73763fadd56b8-b83dd37c290d8e6b.gif253.9204Detecting Sandbagging in LLMs: A Methodology OverviewIn this video, I present my hackathon project called Playing Dumb, which focuses on detecting sandbagging in language models through consistency checks. I utilized the weapons of mass destruction proxy dataset, specifically the biosecurity subset, to compare how models respond to formal versus casual question framings. My trials showed that when models are prompted casually, they perform significantly better, indicating they are sandbagging when evaluated formally. I encourage you to review the full write-up for more details on the methodology and findings.