leonx.ai Don't trust, verify

I built a toy multi-agent orchestrator to understand what Yegge keeps talking about

2026-06-13 · build log of a learning toy

It started with Steve Yegge. He keeps talking about multi-agent orchestration, even built an open-source orchestrator called Gastown, and says most engineers are stuck at the bottom — "ask the IDE a question, review the answer carefully, commit." I wanted to actually understand what that's about. Rather than just listen to him, I'd hand-write the smallest possible toy and feel it out myself.

It's called townhall — pure Ruby, DeepSeek-driven: a sentence in, a clickable web page out. One goal: turn "coordinate / parallelize / converge" from "heard of it" into "touched it."

(It drifted: orchestration turned out to be the surface; what actually ate my time was "how do you know the thing it generated is correct" — more on that below.)

Almost every section below maps to a runnable script in the repo. Don't trust, verify — run it yourself, that's the whole point of this post.

A sentence → N versions → pick one

The first version was dumb: same brief, spin up 3 workers, each takes a stylistic angle (minimal / flashy / dense) and produces a full HTML page, then a judge picks one. Parallel.map spins threads. MRI has a GIL, but it's released while a thread waits on the LLM's HTTP response, so all three finish around the same time with no queueing.
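A minimal sketch of that fan-out, not townhall's actual code. It uses plain stdlib threads instead of the Parallel gem so it runs with no dependencies; fake_worker and fake_judge are hypothetical stand-ins for the LLM calls, and the sleep mimics waiting on an HTTP response.

```ruby
STYLES = %w[minimal flashy dense].freeze

# Hypothetical worker: sleeps to mimic a blocking HTTP call, then returns HTML.
def fake_worker(brief, style)
  sleep 0.2 # blocking waits release the lock, so the other threads keep going
  "<!DOCTYPE html><html><body class=\"#{style}\">#{brief}</body></html>"
end

# Hypothetical judge: picks the longest page. A real judge would be a model call.
def fake_judge(candidates)
  candidates.max_by(&:length)
end

# One thread per style; Thread#value joins the thread and returns its result.
def fan_out(brief)
  STYLES.map { |style| Thread.new { fake_worker(brief, style) } }.map(&:value)
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
candidates = fan_out("a countdown timer")
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
winner = fake_judge(candidates)
puts format("%d candidates in %.2fs, winner: %s",
            candidates.size, elapsed, winner[/class="(\w+)"/, 1])
```

If the three sleeps ran back to back the elapsed time would be about 0.6s; with threads it should land close to a single sleep.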

This is what Yegge calls slot machine programming: pull the lever a few times, keep the jackpot. It worked, but honestly at this point it's just a batch generator, not orchestration.

A question I got stuck on: does every task deserve this "generate N, pick one"?

No. It only pays off when three things hold at once — output quality genuinely varies, you can reliably pick the best, and the result is worth N× the cost. For "what's 3847 × 2913" or "classify this sentence," the answer is unique; variance is just noise, generate once. Yegge's "the Claude team makes 20 prototypes and ships one" works because prototypes max out all three.

Put the agent in a loop

A line of Yegge's stuck with me: completion is a loop, an agent is chat in a loop, an orchestrator is an agent in a loop.

The purest form of "in a loop" is tool use — the model doesn't answer directly, it requests a tool call, you run it and feed the result back, it continues:

$ ruby bin/try-tools
A) no tool:  model answers 3847×2913 = 11,204,511   ❌ (real: 11206311)
B) give it a calc tool:
  ① model requests tool: calculator({"a"=>3847, "b"=>2913})   ← not an answer, a request
  ② your code runs it, gets 11206311   ← fed back to the model
  ③ model's final answer: 11206311   ✅

The model decided "I can't do this in my head" and borrowed a tool — that's the jump from "can talk" to "can do."
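The request / run / feed-back cycle in that transcript can be sketched as a plain loop. This is a toy under stated assumptions: fake_model is a hypothetical stub that always asks for the calculator first, and the message hashes are made up for illustration; real provider APIs use their own shapes.

```ruby
# Tools the host code is willing to run on the model's behalf.
TOOLS = {
  "calculator" => ->(args) { args.fetch("a") * args.fetch("b") }
}.freeze

# Hypothetical stub model: requests the tool until it sees a tool result, then answers.
def fake_model(messages)
  tool_result = messages.reverse.find { |m| m[:role] == :tool }
  if tool_result
    { type: :answer, text: tool_result[:content].to_s }
  else
    { type: :tool_call, name: "calculator", args: { "a" => 3847, "b" => 2913 } }
  end
end

def run_with_tools(question, max_steps: 5)
  messages = [{ role: :user, content: question }]
  max_steps.times do
    reply = fake_model(messages)
    return reply[:text] if reply[:type] == :answer

    result = TOOLS.fetch(reply[:name]).call(reply[:args]) # host code runs the tool
    messages << { role: :tool, content: result }          # and feeds the result back
  end
  raise "no final answer after #{max_steps} steps"
end

puts run_with_tools("what's 3847 x 2913?") # prints 11206311
```

The max_steps cap is there for the same reason as every other cap in this post: the loop should not depend on the model deciding to stop.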

Back to townhall. My worker was a single call, one shot. So I added self-repair: generate → self-check → if it fails, feed the errors back and ask for another version, up to N rounds. It's a cousin of that tool loop — except the "tool" is "a validator + error feedback."

I didn't just trust it. I set an env var TOWNHALL_BREAK_FIRST that deliberately chops the first HTML in half, ran it once, and watched the log go from "0 repairs" to "1 repair, passed" — that's how I confirmed the loop actually fires. I don't trust a branch I've never seen run. This saved me repeatedly later.
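A sketch of that loop plus the fault injection, assuming a stub generator in place of the LLM worker. Only the TOWNHALL_BREAK_FIRST name comes from the post; how it is read here, the MAX_REPAIRS value, and the generate/check helpers are illustrative guesses.

```ruby
MAX_REPAIRS = 3

# Shallow static check: returns a list of error strings, empty means pass.
def check(html)
  errors = []
  errors << "missing DOCTYPE" unless html.start_with?("<!DOCTYPE html>")
  errors << "missing </html>" unless html.include?("</html>")
  errors
end

# Hypothetical generator standing in for the worker; a real one would put
# the feedback into the next prompt. This stub ignores it.
def generate(brief, feedback: [])
  "<!DOCTYPE html><html><body>#{brief}</body></html>"
end

def build(brief, break_first: ENV.key?("TOWNHALL_BREAK_FIRST"))
  html = generate(brief)
  html = html[0, html.length / 2] if break_first # fault injection: chop the first draft
  repairs = 0
  errors = check(html)
  until errors.empty? || repairs >= MAX_REPAIRS
    html = generate(brief, feedback: errors)
    repairs += 1
    errors = check(html)
  end
  { html: html, repairs: repairs, passed: errors.empty? }
end

puts build("a countdown", break_first: false)[:repairs] # prints 0
puts build("a countdown", break_first: true)[:repairs]  # prints 1
```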

Then: no error ≠ correct

Here's the turn. My "self-check" started as reading the HTML as a string: is there a DOCTYPE, are the tags balanced. Result: on normal briefs, all three workers came back "0 repairs, passed" — the loop never fired once.

Because that check is shallow. It catches "structurally broken," not functionally wrong: a countdown that loads fine but doesn't count down, a tic-tac-toe that lines up three but never declares a winner. These pages are alive and error-free, and wrong.

So I added a real sandbox: ferrum drives headless Chrome to actually run the HTML, catching JS runtime exceptions and blank renders. By design it shares one return shape with the static check — check(html) → [errors], empty means pass. So it slots into the existing loop with zero changes, because the loop only cares about that array, not whether the error came from reading a string or actually running it.
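The shared return shape is what makes the checkers stackable. A sketch under assumptions: FakeSandboxCheck is a placeholder that only looks for a literal throw so the example runs without a browser; the real checker described above would load the page in headless Chrome instead. All names here are hypothetical.

```ruby
# Static checker: reads the HTML as a string.
StaticCheck = lambda do |html|
  errors = []
  errors << "missing DOCTYPE" unless html.match?(/\A\s*<!DOCTYPE html>/i)
  errors << "empty <body>" if html.match?(%r{<body[^>]*>\s*</body>}i)
  errors
end

# Placeholder for the browser-backed checker (no Chrome involved here).
FakeSandboxCheck = lambda do |html|
  html.include?("throw new Error") ? ["JS runtime exception"] : []
end

# The loop's only contract: call each checker, concatenate the error arrays.
def run_checks(html, checkers)
  checkers.flat_map { |checker| checker.call(html) }
end

page = "<!DOCTYPE html><html><body><script>throw new Error('x')</script></body></html>"
p run_checks(page, [StaticCheck])                   # prints []
p run_checks(page, [StaticCheck, FakeSandboxCheck]) # prints ["JS runtime exception"]
```

The second call is the point: the page passes the string check and still fails once something pretends to run it.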

But the sandbox only answers "did it crash," not "did it meet the brief." For that you need an agent to judge — read "brief + artifact," find where it fails the requirement. I called it the Critic.

That's when it hit me: most of my effort had drifted from "how to generate" to "how to know it generated correctly." Generating is easy; verifying is hard. That line became the spine of the whole project.

The part I was most pedantic about: why trust your verifier?

The Critic is an LLM, non-deterministic — same artifact, says "problem" one run, "fine" the next. So is it actually any good? You can't tell from one run.

So I built it an eval. Hand-wrote two tic-tac-toes: one fully correct, one where I deliberately removed only the "draw detection" line (plays fine, declares wins, just doesn't announce a draw when the board fills). I know the ground truth, so I can grade it. Fed each to the Critic 6 times.

The first round was ugly: the correct one got dinged twice, both fabricated — "O can move on X's turn" (false), "doesn't block clicks after a draw" (also false); the missing-draw one, 6 times, never once caught "no draw detection," reporting other things that were all wrong.

The bug I planted: caught 0/6. Bugs that don't exist: happily reported. Worse: if I'd only counted "how many problems reported," I'd see "3/6 reported" and think recall was okay — only by reading the content did I find not one was right. Counting an LLM's outputs lies to you.

Here's what it actually looks like (· = said OK, x = reported a problem but missed the point, O = actually caught the target gap):

$ ruby bin/eval-critic            # default deepseek-chat
[complete (win + draw)]   ·x·x··   2/6 flagged   — both fabricated:
    "O can move on X's turn" (false), "no click-block after draw" (false)
[missing-draw version]    x·x··x   3/6 flagged, but not one names the draw
    reported "no turn switch" / "board not rendered" / "shows spaces" — all wrong
false positives 2/6 · real recall 0/6

Swap in a reasoning model, same inputs:

$ JUDGE_MODEL=deepseek-reasoner ruby bin/eval-critic
[complete]        ······   0/6 flagged        — zero hallucination
[missing-draw]    ····OO    caught the target gap 2/6
    "no draw detection; once the board fills there's no result and the game can't end"
false positives 0/6 · real recall 2/6

Then two cheap moves, tuned against the numbers: (1) swap to the reasoning model — false positives dropped from 2/6 to 0/6, and when it does speak, it's a real problem; (2) tweak the prompt to check the requirement point by point, including edge cases — recall ticked up but quickly capped, because judging "what happens when the board fills" by reading source code is unreliable no matter how strong the model.

At that point I could decide with data: the precision problem is solved for free by swapping models; recall is stuck on the "reading source" substrate, and lifting it further means changing the substrate (actually run it + observe behavior) — a separate investment, but now I know why, and I know the cheap path is exhausted. Not a gut call.
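The gap between "flagged" and "actually caught the planted bug" is easy to lose in a tally, so it helps to compute both. A sketch: the verdicts below replay the deepseek-chat missing-draw run from the transcript above (nil means the Critic said OK), and the keyword match is a crude, assumed proxy for reading each complaint by hand.

```ruby
# Count how often the critic spoke up versus how often it named the real gap.
def score(verdicts, target_pattern)
  flagged = verdicts.compact
  { runs: verdicts.size,
    flagged: flagged.size,
    on_target: flagged.grep(target_pattern).size }
end

missing_draw_runs = [
  "no turn switch", nil, "board not rendered", nil, nil, "shows spaces"
]

result = score(missing_draw_runs, /draw/i)
puts "flagged #{result[:flagged]}/#{result[:runs]}, real recall #{result[:on_target]}/#{result[:runs]}"
# prints: flagged 3/6, real recall 0/6
```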

A side decision: since the Critic's absolute scores aren't comparable, I made "which revision to accept" a pairwise comparison — LLMs are bad at absolute scoring but good at "A or B, which is better"; I also ask twice with the order swapped to cancel its positional bias.
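The order-swap trick can be sketched like this. fake_judge is a hypothetical stub with a deliberate first-position bias, and treating a disagreement between the two orders as a tie is an assumption of this sketch, not something the post specifies.

```ruby
# Hypothetical biased judge: prefers whatever is shown first unless the
# second candidate is clearly longer. A real judge would be an LLM call.
def fake_judge(first, second)
  second.length > first.length + 10 ? :second : :first
end

# Ask twice with the order swapped; accept a winner only if both answers agree.
def prefer(a, b, judge: method(:fake_judge))
  forward = judge.call(a, b) == :first ? :a : :b
  swapped = judge.call(b, a) == :first ? :b : :a
  forward == swapped ? forward : :tie
end

p prefer("short", "a much longer revision text") # prints :b
p prefer("same len!", "same len?")               # prints :tie
```

In the second call the judge picks "first" both times, which maps to different candidates, so the positional bias shows up as a tie instead of a false win.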

Write down what eval taught you, so you don't repeat it

That "tic-tac-toe tends to miss draws" — I didn't let it go to waste, I stored it in a cross-run memory. Next time I build tic-tac-toe, the orchestrator pulls the relevant lesson and feeds it to the worker:

$ bin/townhall "make a tic-tac-toe, two players alternate, announce the result"
[townhall] recalled 1 relevant lesson for the worker:
[townhall]   ↺ tic-tac-toe easily misses draws: when the board fills with no line, announce a draw and stop moves.

A lesson dug up by eval becomes a memory, and the next run avoids the mistake automatically. But right away a problem appears: what if the agent writes to memory itself? One wrong lesson becomes a "stamped error," auto-injected into every related task, contaminating broadly. So writes go through a gate: an independent review panel votes, majority required:

$ bin/learn "tic-tac-toe" "a tic-tac-toe board is 4x4, 16 cells"
  reviewer 1: REJECT — standard tic-tac-toe is 3x3, 9 cells; 4x4 is a common error, false
  reviewer 2: REJECT — standard board is 3×3, 4×4 is wrong
  reviewer 3: REJECT — 4×4 is a variant, overgeneralized
❌ only 0/3 passed → blocked, not written.

Wrong, overfit lessons don't get in. This is guarding against something called heresy — more on that next.
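The write gate itself is small. A sketch, assuming reviewers are any callables that return :accept or :reject (the real ones would be independent LLM calls) and that memory is a plain Hash; both are simplifications for illustration.

```ruby
# Write the lesson only if a strict majority of reviewers accept it.
def gated_write(memory, topic, lesson, reviewers)
  votes = reviewers.map { |reviewer| reviewer.call(topic, lesson) }
  return false unless votes.count(:accept) > reviewers.size / 2.0

  (memory[topic] ||= []) << lesson
  true
end

# Hypothetical reviewer: rejects any lesson that claims a 4x4 board.
skeptic = ->(_topic, lesson) { lesson.include?("4x4") ? :reject : :accept }

memory = {}
p gated_write(memory, "tic-tac-toe", "a tic-tac-toe board is 4x4, 16 cells", [skeptic] * 3) # prints false
p gated_write(memory, "tic-tac-toe", "announce a draw when the board fills", [skeptic] * 3) # prints true
p memory.fetch("tic-tac-toe").size # prints 1
```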

Don't trust, verify

Later I wanted to replace all the string-scraping parsing (the judge digging out a number, the critic looking for "OK") with structured output — have the model emit valid JSON directly.

The library has with_schema, and the capability table claims newer models support strict schema. I almost just wrote the code against it. Out of habit I fired a probe first, one minimal call, and got back one line:

RubyLLM::BadRequestError: This response_format type is unavailable now

DeepSeek refused. I assumed it was an old model, swapped in the latest v4-pro, tried again — still refused. Yet the library's capability table plainly claims v4 supports structured_output.

The capability table lied. Only by actually hitting the API do you learn whether it'll accept it. That probe didn't save a few lines of code — it saved a "wrote it per the docs, blew up in prod." I settled on the tier DeepSeek actually supports (JSON mode + validating the fields myself).
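"Validating the fields myself" can be as plain as the sketch below. The field names (winner, reason) and the 1..3 range are assumptions made up for the example, not townhall's actual schema; only the stdlib JSON parser is used.

```ruby
require "json"

# Returns [data, errors]. data is nil whenever errors is non-empty.
def parse_verdict(raw)
  data = JSON.parse(raw)
  return [nil, ["top level is not an object"]] unless data.is_a?(Hash)

  errors = []
  winner = data["winner"]
  reason = data["reason"]
  errors << "winner must be an integer 1..3" unless winner.is_a?(Integer) && (1..3).cover?(winner)
  errors << "reason must be a non-empty string" unless reason.is_a?(String) && !reason.strip.empty?
  [errors.empty? ? data : nil, errors]
rescue JSON::ParserError => e
  [nil, ["invalid JSON: #{e.message}"]]
end

p parse_verdict('{"winner": 2, "reason": "cleanest layout"}').last # prints []
p parse_verdict('{"winner": "two"}').last.size                     # prints 2
```

Because the errors come back as an array, a malformed verdict can be fed into the same retry loop as any other failed check.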

The hard part of multi-agent isn't concurrency

Before researching "multiple agents editing the same thing," I assumed the hard part was concurrency — races, locks, lost updates. Turns out that part is mostly distributed-systems redux, solved decades ago (partition, lock, optimistic concurrency, actors). Someone asked if this is just traditional software engineering — basically yes, except for one part.

That part Yegge calls heresy: a wrong idea takes root among the agents, spreads, keeps coming back. You clean out the bad code, but a leftover reference in some doc or comment lets the next agent go "oh, this makes sense" and rebuild it. Locks can't fix this — it's not data racing, it's beliefs racing. Two agents can take turns perfectly, zero races, and both believe the same wrong thing.

Because an agent is fundamentally a thing that infers from every signal around it; with no authoritative "source of truth" it just guesses, and a wrong guess gets written as a new signal for the next one. The cure is an authoritative source of truth + explicitly naming "this is a common mistake, don't" + blocking it with tooling. It's an epistemology problem, not a concurrency one.

The orchestration core isn't tied to the task

Midway I stopped and asked: this thing only generates web pages — change the task and is it useless? Pulling it apart, only two pieces are tied to HTML — how to generate (the prompt) and how to verify (the checker). The rest — fan-out, loop, convergence, memory — is task-agnostic. I extracted those two into a swappable "domain pack" and ran a completely different task on the same core: generate a regex, verify with test cases. The orchestration core didn't change a line.
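One way to picture the split: the core takes two callables and knows nothing else. A sketch with hypothetical names (DomainPack, run_core) and a stub regex generator whose first attempt is deliberately too loose; an HTML pack would plug into the same two slots.

```ruby
# A domain pack is just "how to generate" and "how to verify".
DomainPack = Struct.new(:generate, :check, keyword_init: true)

# Task-agnostic core: generate, check, retry with feedback, stop at a cap.
def run_core(pack, brief, max_repairs: 2)
  feedback = []
  (max_repairs + 1).times do
    artifact = pack.generate.call(brief, feedback)
    feedback = pack.check.call(artifact)
    return artifact if feedback.empty?
  end
  nil
end

REGEX_PACK = DomainPack.new(
  # Stub generator: loose first attempt, anchored pattern once it gets feedback.
  generate: ->(_brief, feedback) { feedback.empty? ? '\d+' : '\A\d{3}-\d{4}\z' },
  check: lambda do |source|
    pattern = Regexp.new(source)
    cases = { "555-1234" => true, "5551234" => false, "call 555-1234" => false }
    cases.reject { |text, expected| pattern.match?(text) == expected }
         .map { |text, expected| "#{text.inspect} should #{expected ? '' : 'not '}match" }
  end
)

puts run_core(REGEX_PACK, "match a 7-digit phone number") # prints \A\d{3}-\d{4}\z
```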

A question I got stuck on: so to make it "general," do you just keep adding domains forever?

No — that's filling a bottomless pit with manual labor. Generality comes from leverage (a few powerful tools — "can write and run tests" covers a huge swath), not from enumerating endless bespoke domains. And there's an asymmetry I didn't expect: generation is general almost for free (one model can attempt anything), verification isn't — a verifier reliable for any task is roughly general intelligence itself, which you can't have. So you always hand-build a verifier for the domains you care about.

One more nice contrast: the regex verifier is hard (run test cases, right/wrong is countable), so you just pick "fewest errors" — no Critic/Judge needed at all. HTML quality is soft, uncountable, so it needs soft convergence. Whether your verifier is hard or soft decides whether you need that heavy, expensive machinery.
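With a hard verifier, convergence collapses to a min_by. A sketch with made-up test cases and candidate patterns:

```ruby
# Hypothetical test cases: text mapped to whether the pattern should match it.
CASES = { "555-1234" => true, "5551234" => false, "call 555-1234" => false }.freeze

# Hard verifier: the score is simply the number of failing cases.
def failures(pattern)
  CASES.count { |text, expected| pattern.match?(text) != expected }
end

candidates = [/\d+/, /\d{3}-\d{4}/, /\A\d{3}-\d{4}\z/]
candidates.each { |pattern| puts "#{pattern.source}: #{failures(pattern)} failing" }
best = candidates.min_by { |pattern| failures(pattern) }
puts "picked #{best.source}"
```

The three candidates fail 2, 1 and 0 cases respectively, so the anchored pattern wins without any judge in the loop.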

Looking back, it's really just two lines

Reading it all back, everything keeps saying one thing: move "correct / safe / when to stop" from "trust the model to self-regulate" to "enforced by the system."

It shows up in different places. The self-repair loop relies on MAX_REPAIRS to stop; otherwise it never knows when to quit. For recursive decomposition I measured one run:

$ bin/try-recursion "build an online quiz platform: author, answer, grade, leaderboard" 3 12
20 leaves total:
  model said "small enough" (atomic): 5
  forced by [depth cap]:               15
  forced by [budget]:                  0

Of the 20 leaves, the model stopped on its own at only 5; the other 15 were forced by the depth cap. So 75% of the "stops" weren't the model's choice, they were a gate pushed down from outside. The Critic and router are quantified by eval; memory has a write gate (the one above); safety relies on output-side deterministic checks. The model's self-judgment can't be trusted. That's not a bug, it's reality.
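A sketch of how such a tally can be collected: every leaf records why it stopped. The atomic? and split stubs are hypothetical stand-ins for the model's "small enough" call and its decomposition; a task here is just a list of items that gets halved.

```ruby
# Stub for the model's "small enough" judgment.
def atomic?(items)
  items.size <= 1
end

# Stub for the model's decomposition: halve the list.
def split(items)
  items.each_slice((items.size / 2.0).ceil).to_a
end

# Tally why each leaf stopped: the model's call, the depth cap, or the budget.
def decompose(items, max_depth:, budget:, depth: 0, tally: Hash.new(0), spent: [0])
  if atomic?(items)
    tally[:atomic] += 1
  elsif depth >= max_depth
    tally[:depth_cap] += 1
  elsif spent[0] >= budget
    tally[:budget] += 1
  else
    spent[0] += 1 # one split call charged against the budget
    split(items).each do |child|
      decompose(child, max_depth: max_depth, budget: budget,
                       depth: depth + 1, tally: tally, spent: spent)
    end
  end
  tally
end

task = %w[author answer grade leaderboard]
p decompose(task, max_depth: 3, budget: 12) # prints {:atomic=>4} (newer Rubies print {atomic: 4})
p decompose(task, max_depth: 1, budget: 12) # here both leaves stop at the depth cap
```

The two outer branches are the "gate pushed down from outside": they fire regardless of what the stub thinks about the task.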

I also got caught out once, worth noting: I'd named a thing "memory" — a multi-turn conversation whose context just keeps rolling. One "isn't that just accumulated chat history?" stopped me cold — right, that's just a long context, gone when the chat dies; real memory has to persist and be retrievable across conversations, a different thing. Renamed it on the spot. Those numbers above were mostly forced out by this "get stuck → go measure → update" cycle.

The second line is really the method for the first: Don't trust, verify. The capability table says it supports schema, hitting the API says rejected; the Critic looks like it's working, building an eval reveals it hallucinates and misses; the rule-based router looks fine, measure it and a paraphrase drops it to 3/5:

$ ruby bin/try-route
request: "set me up a number-guessing game I can play in the browser"
  rules: question ❌      LLM: html_app ✅
request: "give me a pattern that recognizes phone numbers"
  rules: question ❌      LLM: regex ✅
accuracy — rule router: 3/5 · LLM router: 5/5

One result means nothing; you measure the distribution, you A/B, you hit the real API.
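Measuring a router is a few lines once there is a labeled set. A sketch: the keyword rules are an assumed reconstruction, the first two requests are the paraphrases from the transcript above, and the other three are hypothetical filler chosen so the rules score 3/5.

```ruby
# Keyword rules; anything unmatched falls through to :question.
RULES = {
  /\b(html|web ?page|website)\b/i => :html_app,
  /\bregex\b/i => :regex
}.freeze

def rule_route(request)
  RULES.each { |pattern, label| return label if pattern.match?(request) }
  :question
end

LABELED = {
  "set me up a number-guessing game I can play in the browser" => :html_app,
  "give me a pattern that recognizes phone numbers" => :regex,
  "make a web page with a countdown" => :html_app,  # hypothetical
  "write a regex for email addresses" => :regex,    # hypothetical
  "what is a closure?" => :question                 # hypothetical
}.freeze

# Number of requests the router labels correctly.
def accuracy(router, labeled)
  labeled.count { |request, expected| router.call(request) == expected }
end

puts "rule router: #{accuracy(method(:rule_route), LABELED)}/#{LABELED.size}" # prints rule router: 3/5
```

Any other router (an LLM call, for instance) can be passed to accuracy the same way, which is what makes the A/B cheap.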

Finally

townhall is a toy; tossing the code costs nothing. What I take away is the judgment: find a component you can't trust → build an eval to quantify it → exhaust the cheap levers first → let the data decide whether to make the big change.

A personal note: I'm building this to sharpen a knife for something real — an AI assistant that reviews science exam questions for teachers: is the physics self-consistent, does it hit the target ability, what needs fixing. A question-reviewer is, at its core, a verifier. The biggest takeaway from this toy — "verification is everything, and a verifier's value lives entirely in whether it can be trusted" — happens to be the crux of that real thing.

If you also think about verification / education / agent orchestration, let's talk.