What went wrong, why I misdiagnosed it, and what I changed. Forensic, not emotional.
I built an autonomous loop where an AI agent builds a complete app. Version 1 claimed perfect scores. The app didn't work. Here's what I misdiagnosed, and the single change that fixed everything.