
Skill vs Tool: the distinction that makes MONO different

MONO Team · 5 min read

Why the separation matters

Most agent frameworks (LangChain, AutoGPT, etc.) treat the "tool" or "function" as the base atom. The agent is handed a flat list of, say, 50 functions, and the LLM decides which one to call.

This works with 5-10 tools. It breaks at 50. The prompt bloats with repetitive descriptions, and similar tools (log_expense vs create_transaction vs record_spending) confuse the model.

Skills introduce a layer of semantic grouping. Instead of "here are 83 functions, pick one", we say "here are 21 skills; which one(s) are relevant?". The Haiku router picks 1-3 skills. Then, for each selected skill, only that skill's subset of tools is exposed to the model.
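To make the two-stage idea concrete, here is a minimal Go sketch. It is not MONO's router: the real one is an LLM call, while this stand-in scores skills by keyword hits. The `Skill` shape, the `Keywords` field, and the `route` and `exposedTools` names are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Skill groups tool names under one domain (hypothetical shape).
type Skill struct {
	Name     string
	Keywords []string // stand-in for the manifest's activation examples
	Tools    []string
}

// route is a keyword-based stand-in for the LLM router. It returns at
// most limit skills, ordered by number of keyword hits (best first).
func route(msg string, skills []Skill, limit int) []Skill {
	msg = strings.ToLower(msg)
	type scored struct {
		skill Skill
		hits  int
	}
	var matches []scored
	for _, s := range skills {
		hits := 0
		for _, kw := range s.Keywords {
			if strings.Contains(msg, kw) {
				hits++
			}
		}
		if hits > 0 {
			matches = append(matches, scored{s, hits})
		}
	}
	sort.SliceStable(matches, func(i, j int) bool {
		return matches[i].hits > matches[j].hits
	})
	if len(matches) > limit {
		matches = matches[:limit]
	}
	out := make([]Skill, 0, len(matches))
	for _, m := range matches {
		out = append(out, m.skill)
	}
	return out
}

// exposedTools returns only the tools that belong to the selected skills.
func exposedTools(selected []Skill) []string {
	var tools []string
	for _, s := range selected {
		tools = append(tools, s.Tools...)
	}
	return tools
}

func main() {
	skills := []Skill{
		{Name: "expenses", Keywords: []string{"spent", "spend", "expense"},
			Tools: []string{"log_expense", "expense_summary"}},
		{Name: "calendar", Keywords: []string{"meeting", "event", "schedule"},
			Tools: []string{"create_event", "list_events", "delete_event"}},
	}
	selected := route("spent 300 on uber", skills, 3)
	fmt.Println(exposedTools(selected)) // [log_expense expense_summary]
}
```

The point of the sketch is the shape of the pipeline: the second stage only ever sees the tools of the skills that survived the first.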

Anatomy of a skill

A skill in MONO contains:

  • Tools: the functions it executes (create_event, list_events, delete_event for the calendar skill).
  • Manifest: YAML file with description, activation examples, UI conditions, boundaries.
  • Renderer: Go function that turns results into Dynamic UI (rendered HTML).
  • Modes: declaration of which modes it supports (tool/think/agent — see the router post).
  • Proactive monitors (optional): cron jobs that run without user input (e.g. calendar's morning brief at 7am).
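One way to picture these five parts is as a single Go type. The sketch below is a guess at the shape, not MONO's actual definitions: every type, field name, and function signature is hypothetical, and the calendar values (its modes, the monitor name) are placeholders.

```go
package main

import "fmt"

// Mode is one of the execution modes a skill can declare.
type Mode string

const (
	ModeTool  Mode = "tool"
	ModeThink Mode = "think"
	ModeAgent Mode = "agent"
)

// Tool is a single executable function (hypothetical signature).
type Tool struct {
	Name string
	Run  func(args map[string]string) (string, error)
}

// Manifest mirrors the fields the YAML file is described as holding.
type Manifest struct {
	Description  string
	Examples     []string
	UIConditions []string
	Boundaries   []string
}

// Monitor is an optional proactive job driven by a cron expression.
type Monitor struct {
	Name string
	Cron string
	Run  func() error
}

// Skill bundles tools, manifest, renderer, modes, and monitors.
type Skill struct {
	Name     string
	Tools    []Tool
	Manifest Manifest
	Render   func(result string) string // returns an HTML fragment
	Modes    []Mode
	Monitors []Monitor
}

// Supports reports whether the skill declares the given mode.
func (s Skill) Supports(m Mode) bool {
	for _, mode := range s.Modes {
		if mode == m {
			return true
		}
	}
	return false
}

func main() {
	calendar := Skill{
		Name:     "calendar",
		Tools:    []Tool{{Name: "create_event"}, {Name: "list_events"}, {Name: "delete_event"}},
		Manifest: Manifest{Description: "Create, list, and delete calendar events."},
		Render:   func(result string) string { return "<p>" + result + "</p>" },
		Modes:    []Mode{ModeTool}, // placeholder: the real skill may declare more
		Monitors: []Monitor{{Name: "morning_brief", Cron: "0 7 * * *"}}, // daily at 07:00
	}
	fmt.Println(calendar.Name, len(calendar.Tools), calendar.Supports(ModeAgent))
}
```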

Example: the Expenses skill

Tools (6): log_expense, list_expenses, expense_summary, delete_expense, categorize_expense, analyze_expenses (agent mode).

Manifest examples: "spent 300 on uber" → log_expense. "how much did I spend this month?" → expense_summary. "explain my patterns" → analyze_expenses (agent mode).

Renderer: HTML table with categories, monthly projection, delta vs last month, bar chart for top-5 categories.

Monitor: every Friday 6pm, compute the weekly summary and proactively send it if the user exceeded their average.

The LLM never sees these 6 tools until the router has picked "expenses" as the relevant skill. Only then are they exposed in context.

Architectural consequences

Composability: adding a new skill = creating a Go package + a YAML + a renderer. No need to touch the router, executor, or UI system.
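A common way to get this kind of decoupling is a registry that each skill package registers itself with, so the router only ever reads from the registry. Whether MONO does it this way is not stated; the `Registry` type and its methods below are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Skill is reduced to a name and its tool names for this sketch.
type Skill struct {
	Name  string
	Tools []string
}

// Registry is a hypothetical central index of skills.
type Registry struct {
	skills map[string]Skill
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{skills: make(map[string]Skill)}
}

// Register adds a skill and rejects empty or duplicate names.
func (r *Registry) Register(s Skill) error {
	if s.Name == "" {
		return fmt.Errorf("skill name is empty")
	}
	if _, exists := r.skills[s.Name]; exists {
		return fmt.Errorf("skill %q already registered", s.Name)
	}
	r.skills[s.Name] = s
	return nil
}

// Names lists registered skills in sorted order.
func (r *Registry) Names() []string {
	names := make([]string, 0, len(r.skills))
	for name := range r.skills {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	reg := NewRegistry()
	newSkills := []Skill{
		{Name: "expenses", Tools: []string{"log_expense", "list_expenses"}},
		{Name: "calendar", Tools: []string{"create_event", "list_events"}},
	}
	for _, s := range newSkills {
		if err := reg.Register(s); err != nil {
			fmt.Println("register:", err)
		}
	}
	fmt.Println(reg.Names()) // [calendar expenses]
}
```

Adding a skill then means adding one more `Register` call; nothing that consumes the registry has to change.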

Isolated testing: tools are tested independently. The manifest is validated against a schema. The renderer is tested with output fixtures.
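As an illustration of manifest validation, here is a small hand-written checker in Go. It stands in for real schema validation, which would more likely use a schema library; the `Manifest` and `Example` types and the specific rules are assumptions, not MONO's schema.

```go
package main

import (
	"fmt"
	"strings"
)

// Example maps an utterance to the tool it should activate.
type Example struct {
	Utterance string
	Tool      string
}

// Manifest is a reduced, hypothetical view of the YAML manifest.
type Manifest struct {
	Name        string
	Description string
	Examples    []Example
}

// validate returns a list of problems; an empty list means the
// manifest passed these (assumed) rules.
func validate(m Manifest, tools []string) []string {
	var problems []string
	if strings.TrimSpace(m.Name) == "" {
		problems = append(problems, "name is required")
	}
	if strings.TrimSpace(m.Description) == "" {
		problems = append(problems, "description is required")
	}
	if len(m.Examples) == 0 {
		problems = append(problems, "at least one activation example is required")
	}
	known := make(map[string]bool, len(tools))
	for _, t := range tools {
		known[t] = true
	}
	for _, ex := range m.Examples {
		if !known[ex.Tool] {
			problems = append(problems,
				fmt.Sprintf("example %q points at unknown tool %q", ex.Utterance, ex.Tool))
		}
	}
	return problems
}

func main() {
	m := Manifest{
		Name:        "expenses",
		Description: "Log and analyze personal spending.",
		Examples: []Example{
			{Utterance: "spent 300 on uber", Tool: "log_expense"},
			{Utterance: "how much did I spend this month?", Tool: "expense_sumary"}, // deliberate typo
		},
	}
	for _, p := range validate(m, []string{"log_expense", "expense_summary"}) {
		fmt.Println(p)
	}
}
```

Checking that every activation example points at a tool the skill actually owns catches a class of mistakes that neither tool tests nor renderer fixtures would see.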

Billing per skill: we can charge for individual skills (email $2/mo, calls $7/mo) because they're discrete units with clear boundaries.
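Because each skill is a discrete unit, the bill reduces to a sum over the skills a user has activated. A minimal Go sketch, using the two prices quoted above and a hypothetical `monthlyTotalCents` helper (amounts are kept in integer cents to avoid floating-point rounding):

```go
package main

import "fmt"

// monthlyTotalCents sums the prices of the active skills. Skills that
// are missing from the price list count as free (an assumption).
func monthlyTotalCents(prices map[string]int, active []string) int {
	total := 0
	for _, name := range active {
		total += prices[name]
	}
	return total
}

func main() {
	prices := map[string]int{"email": 200, "calls": 700} // cents per month
	total := monthlyTotalCents(prices, []string{"email", "calls"})
	fmt.Printf("$%d.%02d/mo\n", total/100, total%100) // $9.00/mo
}
```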

User-facing vocabulary: users think in skills ("I want to activate the Fitness skill"), not tools. Internally we have ~83 tools, but the UI has 21 skills. That's the right abstraction for the product.

TL;DR

Tool = executable function. Skill = domain capability that groups tools + manifest + renderer + monitors. The separation lets us scale to 83 tools without confusing the LLM, bill per capability, and test each domain in isolation.