10/10 PASS. Black Screen.

Diagnosis

I wanted to test whether an AI agent could build a complete SaaS application autonomously (auth, database, billing, deploy), guided by three markdown files: a spec, a task list, and a status tracker.

The assumption was simple: if the agent follows the task list, commits after each step, and the build passes, the app works.

Version 1 ran. Ten tasks. Ten commits. Ten PASS marks in the status file. Everything green.

I opened the URL. Black screen.

Reality

Every "PASS" was a lie. Not because the agent was dishonest, but because the only validation was npm run build. TypeScript compiled. That's all it proved.

What it didn't prove:

  • The database tables existed (they didn't: the SQL was written but never executed)
  • Auth actually worked (it didn't: a missing package was hidden behind a type cast)
  • The quota system enforced limits (it didn't: no test ever tried creating a 4th note)
  • The deploy was reachable (it wasn't: CSP headers blocked everything)

The agent followed the process perfectly. Every task done, every commit clean, every status updated. The loop was mechanically flawless.

The output was broken.

The Crack

The failure wasn't in the agent. It was in what "PASS" meant.

In v1, PASS meant: the agent looked at its own code and decided it was correct. That's self-evaluation. And self-evaluation is optimistic by nature.

npm run build checks that the code parses and type-checks. It doesn't check behavior. "The code compiles" and "the app works" are separated by an enormous gap, and that gap was invisible because nobody was looking.
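
A minimal sketch of how that gap can open, assuming hypothetical names (`AuthClient`, `loadAuthClient`, and `getSessionUser` are illustrative and not taken from the project). A type cast tells the compiler that a value has a shape it may not actually have, so the build stays green and the failure only appears when the code runs:

```typescript
// Hypothetical interface for an auth client; illustrative only.
interface AuthClient {
  getSession(): { userId: string };
}

// Stand-in for a package that was never installed: nothing is actually loaded.
function loadAuthClient(): unknown {
  return undefined;
}

// The cast silences the type checker. The build passes,
// but `client` is still undefined at runtime.
const client = loadAuthClient() as AuthClient;

function getSessionUser(): string {
  return client.getSession().userId; // throws a TypeError when executed
}

// Only executing the code reveals the problem.
function trySession(): string {
  try {
    return getSessionUser();
  } catch (err) {
    return err instanceof TypeError ? "runtime TypeError" : "other error";
  }
}
```

Nothing in this file fails compilation; calling `trySession()` is what surfaces the error.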

Task              | Agent claimed      | Reality
Database schema   | "SQL correct"      | Tables never created
Auth setup        | "Middleware works" | Missing package, null-reference hack
CRUD operations   | "CRUD correct"     | Zero database calls tested
Quota enforcement | "Quota correct"    | Never tried creating a 4th note
Checkout          | "Flow correct"     | Route compiled, never tested
Deploy            | "Ready"            | 401 error, black screen

The pattern: every claim was based on "the code was written and it compiled," not "the code was tested and it worked."

What I Changed

One change fixed everything: PASS is no longer the agent's opinion. PASS is an exit code.

In v2, every task has its own check script. Not npm run build, but real functional validation:

  • Database: Shell script that verifies tables exist, RLS denies anonymous access, CHECK constraints work
  • Auth: Playwright test that loads the page, tests redirect, attempts the login form
  • CRUD + Quota: Playwright test that creates 3 notes, tries to create a 4th, confirms count stays at 3
  • Billing: Playwright test that clicks upgrade, confirms redirect to Stripe
  • Deploy: Shell script that curls the URL and checks for HTTP 200
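
To make the quota check concrete, here is a rough sketch of the assertion it encodes. The actual check was a Playwright test driving a browser; this version swaps the browser for a hypothetical `NotesApp` interface and an in-memory fake (`makeFakeApp`), both invented for illustration, so that only the logic is shown: create notes up to the limit, attempt one more, and confirm the count did not move.

```typescript
// Hypothetical interface: the minimum a quota check needs from the app.
interface NotesApp {
  createNote(title: string): void; // may throw or silently refuse when over quota
  countNotes(): number;
}

// Returns a process-style exit code: 0 if the quota holds, 1 otherwise.
function checkQuota(app: NotesApp, limit: number = 3): number {
  for (let i = 1; i <= limit; i++) {
    app.createNote(`note ${i}`);
  }
  if (app.countNotes() !== limit) return 1; // could not even reach the limit

  try {
    app.createNote("one too many");
  } catch {
    // A rejection is fine; what matters is the count afterwards.
  }
  return app.countNotes() === limit ? 0 : 1;
}

// In-memory stand-in for the app, with or without enforcement.
function makeFakeApp(enforceLimit: boolean, limit: number = 3): NotesApp {
  const notes: string[] = [];
  return {
    createNote(title: string): void {
      if (enforceLimit && notes.length >= limit) {
        throw new Error("quota exceeded");
      }
      notes.push(title);
    },
    countNotes: () => notes.length,
  };
}
```

With `makeFakeApp(false)`, which behaves like code that compiles but never enforces the limit, `checkQuota` returns 1. With `makeFakeApp(true)` it returns 0.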

Exit code 0 = PASS. Anything else = FAIL. The agent cannot override.
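
One way this rule could be wired up, as a sketch rather than the actual harness: a small Node runner that executes a check command and derives the verdict from the exit status alone. The function names `runCheck` and `verdictFromStatus` are assumptions made for this example.

```typescript
import { spawnSync } from "node:child_process";

type Verdict = "PASS" | "FAIL";

// The verdict comes only from the exit status, never from the agent's own report.
function verdictFromStatus(status: number | null): Verdict {
  return status === 0 ? "PASS" : "FAIL";
}

// Runs one check command synchronously and returns its verdict.
// `status` is null if the process was killed by a signal or could not start;
// both cases count as FAIL.
function runCheck(command: string, args: string[]): Verdict {
  const result = spawnSync(command, args, { stdio: "inherit" });
  return verdictFromStatus(result.status);
}
```

A call might look like `runCheck("bash", ["checks/deploy.sh"])`, where the script path is hypothetical. Because the check's output is inherited by the parent process, the agent still sees the error text, but it has no way to turn a non-zero status into a PASS.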

Result: v2 task 4 (auth) failed three times. The agent saw the error, diagnosed a hostname mismatch in the Playwright config, fixed it, retested. That self-correction loop was visible and traceable. In v1, the same problem was invisible.

v2 ended with 7 autonomous passes, 2 partial, and 1 manual, plus a working app after 4 post-run fixes. The numbers look worse than v1's perfect 10/10. But the app actually worked.

The Lesson

The quality of autonomy is limited by the quality of the setup, not the intelligence of the agent.

Every human intervention in v2 was a gap in preparation: database password missing, redirect URLs not configured, deployment protection blocking the check script. None of these were agent failures. They were setup failures.
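
Gaps like these could in principle be caught before the agent starts. The following is a minimal sketch of a preflight check, assuming the setup is expressed as environment variables; the variable names in `REQUIRED_SETTINGS` are hypothetical and would need to match whatever the real check scripts depend on.

```typescript
// Hypothetical variable names; substitute whatever the checks actually need.
const REQUIRED_SETTINGS = ["DATABASE_PASSWORD", "AUTH_REDIRECT_URL", "DEPLOY_URL"];

// Returns the names that are unset or blank in the given environment.
function missingSettings(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_SETTINGS,
): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}
```

Calling `missingSettings(process.env)` before the run and refusing to start when the result is non-empty would turn a mid-run human intervention into an upfront error message.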

The agent didn't get smarter between v1 and v2. The checks got real. That's the entire difference.

Build success is not product success. Self-evaluation is not validation. Exit codes don't lie.