I put an AI agent in a box and told it to build

Here's the experiment: take an AI agent, lock it in a Docker container, give it three markdown files, and see if it can build a working app from scratch. No hand-holding. No pair programming. Just a spec, a task list, and a status tracker.

The app itself doesn't matter. it's a simple notes app with login and a 3-note limit. A test vehicle. What matters is the loop.

The setup

The container is the agent's entire world. Node.js, Playwright, git, a database connection. Nothing else. No access to my Mac, no browsing, no escape hatch.

Three files control everything:

SPEC.md. what to build. A notes app: email login, create/delete notes, 3-note quota for free users, Stripe checkout for paid. Simple enough to build in an hour, complex enough to break in interesting ways.

TASKS.md. ten discrete steps. Each one has a checkscript attached. Not "does it compile". real tests. Playwright opens the browser, clicks buttons, creates notes, tries to break the quota.

STATUS.md. the agent's logbook. After each task: PASS or FAIL, iteration count, error notes. This is where you watch it think.

What happens when you press play

The agent reads SPEC. Understands the app. Reads TASKS. Picks up task 1. Builds. Runs the check. If exit code 0: PASS, commit, next task. If not: read the error, fix, retry.

graph LR
    A[SPEC.md] --> B[Agent reads task]
    B --> C[Build]
    C --> D{Checkscript}
    D -->|Exit 0| E[PASS → Commit]
    E --> B
    D -->|Exit 1| F[Read error]
    F --> G[Fix]
    G --> D

That's it. That's the entire architecture. Three markdown files and exit codes.

The beautiful part is watching it get stuck and unstick itself. Task 4. the login page. failed three times. The agent's test tried to load the page on 127.0.0.1, but Next.js internally redirects to localhost. Three failures. Each time the agent read the error, made a hypothesis, tried a fix. Third attempt: updated the Playwright config. PASS.

No human told it what was wrong. It read the error output, reasoned about it, and fixed it. In a container. Alone.

Why a container

Because trust is a spectrum, and "run my agent on my laptop with full system access" is at the wrong end of it.

The container enforces boundaries. The agent can read its project files, run builds, talk to the database, push to GitHub. That's it. It can't read my SSH keys, open random ports, or install whatever it wants. The spec lists the exact stack. nothing else gets in.

Every tool the agent needs is pre-loaded. Every credential is scoped. If something isn't in the container, the agent can't access it. That's not a limitation. that's the design.

What actually worked

Seven out of ten tasks passed autonomously. The agent built the full app: auth flow, note CRUD, quota enforcement at the database level (try creating a 4th note. the database literally won't let you), and a Stripe checkout redirect.

The quota test is my favorite. The Playwright script creates three notes, then tries to create a fourth. Checks the count. Still three. The database said no. Not the frontend. the database. That's real enforcement, and the agent built it because the check demanded it.

What didn't

Three tasks needed me. Not because the agent was dumb. because the setup had holes.

The database password wasn't in the environment. The agent can't guess a password. That's a setup problem, not an intelligence problem.

Supabase redirect URLs weren't configured. The agent can't click through a web dashboard. Another setup gap.

Vercel deployment protection was on. The agent's deploy check got a 401. It couldn't verify its own work was live.

Every human intervention was a hole in my preparation. Not a flaw in the loop.

The real insight

The first version of this experiment had the agent claim 10/10 PASS. Perfect score. I opened the app. Black screen. Every "PASS" was based on npm run build. which only proves TypeScript compiles. Compiles ≠ works. That's a whole separate story.

Version two added real checks. Playwright tests, database queries, curl commands. Suddenly the agent started failing. and then fixing its own failures. The scores looked worse (7 autonomous, 2 partial, 1 manual). But the app worked.

Worse numbers. Better reality. I'll take that trade every time.

What's next

Version three. Goal: zero human interventions. Every gap from v2 becomes a preflight check. Database password in env before the container starts. Redirect URLs pre-configured. Deploy protection off or bypass token ready.

The agent doesn't need to get smarter. The setup needs to get tighter. That's the whole game.

Three markdown files. A locked container. An agent that reads, builds, tests, and fixes. Not science fiction. running on my Mac right now.

Well, not right now. Right now it's waiting for me to close those last three holes. Almost there.