In Code Mode the model sees only two tools by default: list_tools(pattern) and execute_code(code). list_tools takes a regex and returns TypeScript signatures for matching tools. execute_code runs JavaScript that calls them.
So when the model actually needs, say, the GitHub API, it calls list_tools("github.*pull"), gets back just the typed signatures for those endpoints, and then writes code against them. Your second hypothesis is the mechanism: a meta-tool that queries on demand. The typed signatures (first hypothesis) are what the model reasons over once it has them.
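A minimal sketch of the flow (the GitHub signatures here are illustrative, not the actual ones a DADL would produce):

    // Step 1: list_tools("github.*pull") returns typed signatures, e.g.:
    //   declare function github_pulls_list(args: { owner: string; repo: string;
    //     state?: "open" | "closed" }): Promise<{ number: number; title: string }[]>;
    //   declare function github_pulls_merge(args: { owner: string; repo: string;
    //     pull_number: number }): Promise<void>;

    // Step 2: the model writes code against them and hands it to execute_code.
    // Loops and filtering happen inside one call, not as N round-trips.
    const pulls = await github_pulls_list({ owner: "acme", repo: "api", state: "open" });
    for (const p of pulls.filter(p => p.title.startsWith("chore:"))) {
      await github_pulls_merge({ owner: "acme", repo: "api", pull_number: p.number });
    }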
That is what really brings the cost down. A large API exposed as MCP tool definitions easily runs to 40-50k tokens upfront. The same API via list_tools + execute_code costs ~1k tokens for the two tool descriptions, plus only the signatures the model pulls per query.
A bit more on DADL, since this is what people typically ask first - why ANOTHER standard?
DADL is deliberately narrower than, say, OpenAPI. It describes only the tool surface an agent is allowed to call - not the full API contract that humans, SDK generators, gateways, docs and mocks need. In practice this means fewer parts to think about: method, path, parameters, access class, descriptions, and policy metadata. The point is to make the allowed actions explicit and small enough to actually review.
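Roughly, an entry looks like this (simplified sketch built from the parts above, not the literal schema):

    tools:
      - name: github_pulls_list
        description: List pull requests for a repository
        method: GET
        path: /repos/{owner}/{repo}/pulls
        parameters:
          owner: { type: string, required: true }
          repo:  { type: string, required: true }
          state: { type: string, enum: [open, closed, all] }
        access: read-only          # vs. e.g. dangerous
        policy:
          rate_limit: 30/min       # illustrative policy metadata

An entry like this is small enough that a reviewer can see at a glance what an agent may call.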
Every MCP project I have seen wraps APIs imperatively - custom code per backend. DADL is the only declarative format I know of that makes the allowed surface reviewable in a PR diff. That is a deliberate trade-off: less flexibility, more auditability.
Why YAML? Because humans actually read and edit it. I wanted something small enough to review in a PR, that diffs cleanly, and that is easy for LLMs to generate AND for humans to write by hand when needed. In practice this matters more than maximum expressiveness.
What DADL can do: describe HTTP tools with typed parameters, declare auth requirements, attach policy metadata and caller constraints, provide a compact tool surface to the model - and attach a simple 'access' badge to each tool that flags it as read-only or dangerous. And errors come back in a form the LLM can reason about, not as crashes that break the flow.
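For example, a failed call might come back as a structured value instead of a thrown exception (shape simplified for illustration, not the literal format):

    // Sketch of a reasoning-friendly error the model can branch on:
    type ToolError = {
      ok: false;
      status: number;        // HTTP status from the upstream API
      message: string;       // explanation the model can read
      retryable?: boolean;   // a hint the model can act on
    };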
What DADL is not trying to do: replace OpenAPI, capture every edge case of complex APIs, or be a full SDK generation format.
A few questions I get often:
Does this work for every API?
No. APIs with very stateful flows, weird auth handshakes, streaming edge cases or messy responses still need custom handling. Some APIs map cleanly to DADL, some do not - but for those that do not, you can still plug in an existing MCP server through ToolMesh, and Code Mode applies to it too.
Why not generate from OpenAPI?
OpenAPI is great source material. You point an LLM at the DADL specification and at the OpenAPI definition - and you get a valid DADL that usually only needs optimisation.
So far there are 20 DADLs in the public registry covering 1,833 tools (GitHub, Cloudflare, GitLab, DeepL, Hetzner Cloud and more). If there is a specific API you would want to see as DADL, just ask - I am happy to add it.
If you want to try before cloning: https://demo.toolmesh.io is a public instance with the HN APIs loaded (login dadl/toolmesh). Works with Claude.ai, Claude Desktop, Claude Code, and ChatGPT - setup takes 30 seconds: https://toolmesh.io/demo
Well, the LLM may well have read all the operating procedures etc. - just like the teenager - but it has not *experienced* issues itself. If you look at the published accounts of "AI deleted my...", those AI models clearly knew what went wrong AFTER they caused it...
I get your point, but mine is that if there are ways to nudge it to do better when creating code, there must be ways to nudge it to be more careful with operations and avoid the same issues.
So instead of saying it is "broken", I wonder if we are "holding it wrong" to get those outcomes?
The web is similarly full of "LLMs created this crappy codebase and/or code change" stories, but others use them very successfully.
You are also right, and I half agree - but the same issue arises with new employees. They already have some of the experience you are talking about - but usually there is something about the very environment they are in that causes them to learn lessons the hard way. And that experience does not transfer easily to the next colleague - and even less to an LLM. The only way I see to solve this is an "organisational memory".
I only use a decay function to see how "hot" a chunk is - not for forgetting old ones. What concerns me more are memory chunks with errors in them - they need to be corrected/removed by some other mechanism, not by decay (since they might get retrieved often).
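To be concrete about what I mean by decay - a recency-weighted score along these lines (the exponential form and constant are just one choice):

    // Recency-weighted "hotness": each access contributes exp(-age / TAU).
    // Nothing is deleted by this score - cold chunks just rank lower.
    const TAU_DAYS = 14; // decay constant, tuned per system

    function hotness(accessTimestamps: Date[], now = new Date()): number {
      return accessTimestamps.reduce((score, t) => {
        const ageDays = (now.getTime() - t.getTime()) / 86_400_000;
        return score + Math.exp(-ageDays / TAU_DAYS);
      }, 0);
    }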
Think of AI like a genius 16-year-old. Accidents will happen - only let AI and the 16-year-old access systems where you are sure you have a recovery plan.
The better the tokenizer maps text to the model's internal representation, the better the model understands what you are saying - or coding! But 4.7 is much more verbose in my experience, and this probably drives cost and limit usage up a lot.
Great site! I once created muelltonne.de (German for "trashcan"), where users could send in (spam) mails they did not like - and got back poems or jokes made from exactly the same letters (plus whatever leftovers could not be used). Reading tweets nowadays cries out for a new, enhanced version...
Great idea, I like the concept of giving the LLM more information and context so it can decide which approach is better. But why did you implement it as an MCP server and not as a proxy, to have the full context?
Great question! We initially started with a proxy, and plan to support one for users that already have something like LiteLLM set up. However, we chose the MCP route, and soon a plugin route, because it was a much lower lift for installation. Setting up a proxy and configuring it, especially for Cursor, can be a bit tricky. There is actually still some code in the MCP server for a LiteLLM integration, which we hope to support officially soon.
Another large part of it is that without MCP - with only a proxy in a client like Cursor - the agent doesn't reason about the budget. When it has to use the budget as a tool, it actively thinks about the cost of its actions.
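Rough sketch of the difference (names illustrative, not our actual API):

    // As a tool, the balance lands in the agent's context, so the model can
    // weigh cost before acting. A proxy deducts out of band; the model never
    // sees the numbers.
    interface Budget { limitUsd: number; spentUsd: number; }

    function checkBudget(b: Budget): string {
      const remaining = b.limitUsd - b.spentUsd;
      return `$${b.spentUsd.toFixed(2)} of $${b.limitUsd.toFixed(2)} spent, ` +
             `$${remaining.toFixed(2)} remaining.`;
    }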