How MONO decides what to do: anatomy of the router
Why the router exists
The naive architecture — common in 2023 — was: dump every tool description into GPT-4's or Claude's system prompt and hope the model picks well. This works with 5 tools. It breaks at 30.
The problem is twofold: tokens and distraction. Every skill in MONO has a manifest with description, examples and parameters — ~800 tokens minimum. Times 47 skills: ~38,000 tokens before you include the user's message. Claude Sonnet has a long window but charges per token and gets confused when the right option is buried among 46 irrelevant ones.
So MONO inserts a prior step: a router based on Claude Haiku (Anthropic's smallest and cheapest model) that reads the message and returns a compact JSON with the 1-3 relevant skills and the execution "mode".
What the router sees
The router gets three things:
- The user message (the plain text as it arrived on WhatsApp).
- A compressed catalog of all skills: name, 1-line description, and 2-3 real input examples that should activate it.
- Working memory: data and entities mentioned in the last few turns (if you said "Maria" 2 messages ago, the router knows "send Maria a message" doesn't need to ask for a surname).
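The three inputs above might be bundled like this. A minimal sketch: all type and field names here are hypothetical, not MONO's actual API.

```go
package main

import "fmt"

// SkillEntry is a hypothetical shape for one entry of the compressed
// catalog: name, one-line description, and real trigger examples.
type SkillEntry struct {
	Name        string
	Description string
	Examples    []string
}

// RouterInput bundles the three things the router receives.
type RouterInput struct {
	Message string            // raw text as it arrived on WhatsApp
	Catalog []SkillEntry      // compressed catalog of all skills
	Memory  map[string]string // entities from recent turns
}

func main() {
	in := RouterInput{
		Message: "send Maria a message",
		Catalog: []SkillEntry{
			{Name: "calendar", Description: "create/list calendar events",
				Examples: []string{"book me a meeting with Juan at 4pm"}},
			{Name: "messages", Description: "send WhatsApp messages",
				Examples: []string{"tell Maria I'm running late"}},
		},
		// "Maria" was mentioned 2 messages ago, so she resolves
		// without asking for a surname.
		Memory: map[string]string{"Maria": "contact:42"},
	}
	fmt.Println(len(in.Catalog), in.Memory["Maria"])
}
```

The catalog entry is deliberately tiny (one line plus examples) so 47 of them fit in a small model's prompt, instead of 47 full manifests in the big model's.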
And it returns, in ~200ms:
{
"skills": ["calendar", "reminders"],
"mode": "agent",
"confidence": 0.92
}
The three modes: tool, think, agent
One of the most useful design decisions was distinguishing how a skill executes:
- Tool mode: a Go function that runs without an LLM. "Book me a meeting with Juan at 4pm" executes calendar.create_event with extracted parameters. Fast, deterministic, cheap.
- Think mode: the big LLM reasons over pre-fetched tool output but doesn't initiate further tool calls. "Summarize my week" runs calendar.list_events + expenses.summary, then Claude composes the narrative.
- Agent mode: the LLM makes multiple calls, possibly with intermediate reasoning. "Analyze my pipeline and tell me which deal to close first" iterates: list deals → score probability → search relevant memories → write recommendation.
The router picks the mode. Most messages (~65%) land in tool mode — dirt cheap. ~25% are think. Only ~10% are agent, where the compute actually gets spent.
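The mode split above suggests a simple dispatch at the executor. A sketch, assuming the JSON field names shown earlier; the branch bodies are placeholders, not MONO's implementation:

```go
package main

import "fmt"

// RouterDecision mirrors the JSON the router returns
// (field names assumed from the example above).
type RouterDecision struct {
	Skills     []string
	Mode       string // "tool" | "think" | "agent"
	Confidence float64
}

// dispatch branches on the router's mode: tool runs plain Go,
// think does one LLM composition pass, agent runs an iterative loop.
func dispatch(d RouterDecision) string {
	switch d.Mode {
	case "tool":
		return "run Go function directly, no LLM"
	case "think":
		return "gather tool outputs, one LLM pass to compose"
	case "agent":
		return "iterative LLM loop with intermediate reasoning"
	default:
		return "fallback: ask the user to clarify"
	}
}

func main() {
	d := RouterDecision{Skills: []string{"calendar"}, Mode: "tool", Confidence: 0.92}
	fmt.Println(dispatch(d))
}
```

The cost asymmetry is the point: the ~65% of traffic landing in the first branch never touches an LLM at all.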
What happens when the router gets it wrong
It happens. When the user writes something ambiguous, or two skills are too similar, the router can pick wrong. Three mitigations:
- Complete logging. Every router decision is stored in SurrealDB with message, picked skills, timing, and final outcome.
- Automated analyzer. Every 6h, a job scans the log for cases where the picked skill didn't produce a useful result (e.g. the user re-asked). Those cases get flagged and used to improve the examples in the manifests.
- Thumbs-up/down feedback. If you mark a result as bad, the original message joins the improvement dataset.
This closes a self-improvement loop: the router's mistakes become the examples that sharpen the manifests. Not magic, just logging discipline and analytics.
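The log record and the analyzer's flagging rule can be sketched like this. The struct shape and outcome labels are assumptions for illustration, not the actual SurrealDB schema:

```go
package main

import (
	"fmt"
	"time"
)

// RouterLog is a hypothetical record shape for the decision log:
// message, picked skills, timing, and the eventual outcome.
type RouterLog struct {
	Message   string
	Skills    []string
	Mode      string
	LatencyMS int
	Outcome   string // e.g. "ok", "user_reasked", "thumbs_down"
	At        time.Time
}

// needsReview flags the cases the periodic analyzer job pulls out:
// decisions whose outcome suggests the picked skill wasn't useful.
func needsReview(l RouterLog) bool {
	return l.Outcome == "user_reasked" || l.Outcome == "thumbs_down"
}

func main() {
	l := RouterLog{
		Message: "remind me tmrw", Skills: []string{"calendar"},
		Mode: "tool", LatencyMS: 180, Outcome: "user_reasked", At: time.Now(),
	}
	fmt.Println(needsReview(l))
}
```

Flagged records feed back into the manifests as new trigger examples, which is all the "training" the router gets.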
The architectural lesson
The tempting intuition when building agents is to throw everything at a big model and hope for the best. It works for demos. It doesn't work in production.
The systems that scale have specialization layers: small + fast model for control decisions, big + slow model for reasoning, deterministic code for anything doable without an LLM. Each layer optimizes its trade-off: the router optimizes latency, the executor optimizes quality, the tools optimize cost.
It's the same pattern we saw in processors (branch predictor → pipeline → execution units) and on the web (CDN → app server → database). AI agents are learning that systems architecture principles didn't stop applying just because the "processor" is now an LLM.
By the numbers
Router p50 latency: 180ms. Cost per decision: $0.00008. Tokens saved from the big model: ~36,000 per message. Correct-decision rate: 94% on the last 8 weeks of production traffic. The router pays for itself at ~500 messages/month.
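The per-message saving is easy to estimate from the numbers above. One assumption not in the source: a big-model input price of about $3 per million tokens; swap in the real rate to recompute.

```go
package main

import "fmt"

func main() {
	// From the section: router cost and tokens trimmed per message.
	const routerCost = 0.00008   // USD per router decision
	const tokensSaved = 36000.0  // big-model input tokens avoided
	// ASSUMPTION: big-model input price, not stated in the source.
	const pricePerMTok = 3.0 // USD per million input tokens

	savedPerMsg := tokensSaved / 1e6 * pricePerMTok
	netPerMsg := savedPerMsg - routerCost
	fmt.Printf("saved %.4f, net %.4f per message\n", savedPerMsg, netPerMsg)
}
```

At that assumed rate the router saves roughly three orders of magnitude more than it costs per message, so the ~500 messages/month break-even mainly reflects the fixed effort of running it.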