Most people use AI agents like a slot machine. You type a prompt, pull the lever, and hope the output is good. I got tired of hoping, so I built what belongs in between: a layer that sits between intent and output and refuses to let bad output through. I treat my agents like a compiler.
The mapping is exact, not poetic
A compiler takes source code and produces something that runs, and it refuses to produce anything if the source does not typecheck. My setup maps onto that almost one to one. The source is intent: a brief, written in structured text. The passes are agents, each doing one narrow job. The intermediate representation is contracts, structured descriptions of what must be true rather than loose prose. The typechecker is made of eval gates: deterministic checks first, like does it build, do the tests pass and does it match the schema, and where judgment is needed, a critic agent that scores the previous agent's work and sends it back the moment it falls short. The linker is an aggregation pass that assembles the loose parts into one whole. And the release gate is me.
graph LR
A["Intent (a brief)"] --> B["Agents (passes)"]
B --> C{"Eval gates"}
C -->|rejected| B
C -->|passes| D["Aggregate (linker)"]
D --> E["GO, that is me"]
Where it runs
It does not run on my laptop with everything held in memory. It runs on GitHub, where every part has a fixed role: issues are the queue, labels hold the state, branches give isolation, pull requests are the audit trail and GitHub Actions provides the compute. Because of that, nothing important lives inside a process that can crash. Every run is durable and reviewable after the fact, and if a step fails along the way the system picks it up and retries on its own.
graph TD
R["A run on GitHub"] --> I["Issues = the queue"]
R --> L["Labels = the state"]
R --> Br["Branches = isolation"]
R --> P["Pull requests = audit trail"]
R --> Ac["Actions = the compute"]
What it has actually produced
Two working instances on the same architecture. The first is a research engine: five agents research five angles of a problem in parallel, each one shadowed by a critic that rejects weak sourcing. A coherence pass then catches the contradictions between them, and a synthesis pass turns it into one coherent document.
graph LR
Q["A problem"] --> F["5 pillars, in parallel"]
F --> CR["Agent + critic per pillar (retry until PASS)"]
CR --> CO["Coherence pass"]
CO --> SY["Synthesis pass"]
SY --> DOC["One document"]
The second is a product engine that turns a spec into built software phase by phase, schema before API before interface, with tests gating each phase before the next one is allowed to start.
The honest part
This is early work. The research engine runs; the product engine has a first version and is not yet autonomous from end to end. I will not pretend otherwise.
It is also non-deterministic by design: the same brief produces slightly different output on a different run. I am not chasing identical output, because crushing that variation would kill the value. What I am chasing is output that passes every gate, every time. For this kind of work that is a more honest definition of correct than bit-for-bit reproducibility.
What it taught me
The shift that really mattered was not making the agents smarter. It was admitting they are unreliable and building the verification around that fact, instead of waving it away. A compiler does not trust the programmer. It checks, and then it checks again. That is how I started treating the models: not trusting them, but checking their work.