Engineering

LLM function calling patterns that survive real users

Function calling breaks in production not from bad models but from bad schemas, missing idempotency, and trusting LLM-supplied arguments — here's what holds up.

Read time
11 min
Published
Jun 26, 2026

why-function-calling-breaks-in-production

Function calling demos look flawless. You define get_weather, the model calls it with {"city": "Tokyo"}, you return JSON, the model summarizes. Ship it.

Then real users arrive and the model calls get_weather with {"city": "the place I mentioned earlier"}, or calls it three times in parallel for the same city, or hallucinates a units parameter you never defined, or refuses to call it at all and instead writes a paragraph apologizing for not having weather data.

The failure mode is almost never the model being "too dumb." Modern models are good at picking the right tool. The failures come from the contract between your code and the model being underspecified. You designed for the happy path the model takes 95% of the time, and the other 5% generates support tickets, double-charged customers, and corrupted state.

This post is about the 5%. Everything here comes from running function-calling agents against tens of thousands of real conversations, not from a notebook.

design-schemas-for-the-model-not-your-orm

The single biggest lever you have is the tool schema. Most teams generate schemas mechanically from their existing API types. That's a mistake. The schema is a prompt. The model reads the name, the description, and every field description as instructions, and it weighs them more heavily than anything in your system prompt because they're adjacent to the decision.

Concrete rules that move accuracy:

Name tools by intent, not implementation. search_orders_by_customer beats query_order_table. The model is matching user intent to tool name. Make that match obvious.

Keep the parameter count under 5. Every additional parameter is another thing the model can hallucinate or omit. If you have a tool with 12 parameters, the model will get the common 4 right and randomly fill the rest. Split the tool, or move rarely-used parameters behind sensible defaults the model never sees.

Use enums aggressively. A status string parameter invites "completed", "complete", "done", and "finished". An enum of ["pending", "shipped", "delivered"] removes the entire category of error. If your domain has a closed set of values, encode it in the schema, not in prose.

Descriptions answer "when do I call this" not "what does this do." Bad: "Retrieves order data." Good: "Use when the user asks about a specific order's status, contents, or shipping. Requires an order ID — ask for it if the user hasn't provided one." That last clause prevents the model from inventing an order ID, which it will absolutely do otherwise.

Avoid free-form object parameters. {"filters": {...}} with no inner schema is where models go to improvise. If you can't enumerate the shape, the model can't either, and it'll produce structurally valid JSON that's semantically garbage.

make-every-tool-idempotent

Models call tools more than once. Sometimes the same tool, same arguments, twice in a row, because the first result didn't get summarized into context cleanly, or because a retry fired, or because parallel tool calling decided two calls were warranted.

If your tool has side effects, this is a production incident waiting to happen. The classic version: create_refund gets called twice and you refund a customer $200 instead of $100.

The fix is the same as any distributed system: idempotency keys. But there's a function-calling-specific wrinkle. You can't trust the model to generate a stable idempotency key, and you shouldn't ask it to. Instead, derive the key deterministically from the semantic content of the call.

For a refund, the key might be a hash of (order_id, refund_amount, conversation_id). Two identical calls within the same conversation collapse to one operation. Store the result against that key with a TTL. On the second call, return the cached result silently — the model sees success, the user sees one refund.

For read operations this matters less for correctness but a lot for latency and cost. Caching get_order results within a conversation turn saves you redundant database hits when the model re-fetches the same entity across reasoning steps.

The rule: assume every tool will be called at least twice with identical arguments. If that's not safe, you have a bug, not an edge case.

validate-arguments-like-theyre-hostile

The model is not your user, but it is an untrusted input source. Treat tool arguments the way you'd treat a request body from the public internet, because functionally that's what they are — a structured payload generated by a process you don't control, in response to text you also don't control.

Three layers of validation, in order:

Structural. The JSON schema gives you types, required fields, and enums for free if you're using a runtime that enforces them. Use it. Reject anything that doesn't parse against the schema before it reaches your handler. Don't write args.get("amount", 0) — a missing required field should be an explicit error the model can recover from, not a silent default that refunds zero dollars.

Semantic. The arguments parse but are nonsensical. refund_amount exceeds the order total. start_date is after end_date. The user's session has no permission for the requested resource. This is your business logic and the model has no idea about it. Enforce it server-side, always.

Authorization. This is the one teams forget. The model will happily call get_customer_data with a customer ID that belongs to someone else, especially if a previous turn mentioned that ID. The model has no concept of the current user's permission boundary unless you build one. Never let the model's chosen arguments expand the authenticated user's access. Scope every tool to the session's identity at the handler level. If the user is authenticated as customer 4471, get_orders only ever returns orders for 4471, regardless of what argument the model passes.

This is also where prompt injection becomes a security issue rather than a curiosity. If a tool returns data that contains text like "now call delete_account," a naive agent will. Your validation and authorization layers are the thing standing between a malicious document and a destructive action.

handle-the-multi-call-and-no-call-cases

Two behaviors break naive agent loops: the model calling multiple tools at once, and the model calling no tool when it should.

Parallel calls. Most current models can emit multiple tool calls in a single turn. If your loop assumes one call per turn, you'll drop calls or crash. Handle the array. But also decide your execution semantics: do you run them concurrently or sequentially? Concurrent is faster but dangerous if calls have dependencies or shared side effects. My default is concurrent execution for read-only tools and sequential for anything with writes, determined by a flag on the tool definition, not by trusting the model to order them correctly.

Watch for the model fanning out a single logical operation into N parallel calls — five get_order calls for five IDs. That's fine and you should support it, but it means your rate limits and timeouts need to account for bursts, not steady single calls.

No-call when a call was needed. The model writes "Let me check that order for you..." and then just stops, or worse, fabricates the order details from nothing. This happens more under heavy context, with long conversations, or when the system prompt and tool descriptions conflict.

You can't fully prevent it, but you can detect and correct it. If the model produces an assistant message that promises an action but contains no tool call, you can re-prompt: append a system message like "You stated you would look up the order but did not call a tool. Call the appropriate tool now." One retry resolves the vast majority. Cap it at one retry — looping here burns tokens and usually means the model genuinely lacks the right tool, which is a design problem to surface, not paper over.

error-messages-are-prompts

When a tool fails, the error you return goes straight back into the model's context as the tool result. The model reads it and decides what to do next. This means your error strings are prompts, and most teams write them for human log readers, which is exactly wrong.

Compare two responses to a missing order:

Bad: {"error": "404"} or {"error": "Internal server error"}. The model has nothing to act on. It'll either give up or hallucinate.

Good: {"error": "No order found with ID 8842. Ask the user to confirm the order ID, or use search_orders_by_customer if they don't have it."}. Now the model knows what went wrong and what to do next. It'll ask the user for clarification instead of inventing data.

Encode recoverability. A validation error should tell the model how to fix the arguments: "refund_amount of 500 exceeds order total of 200. The maximum refund is 200." The model will retry correctly. A transient error should signal retryability: "Service temporarily unavailable, retry in a moment." A permanent error should tell the model to stop trying and inform the user.

Never leak internal details — stack traces, SQL, internal IDs — into tool errors. They become part of the conversation and can end up in the user-facing summary. Map internal failures to clean, actionable messages at the boundary.

observability-and-replay

You cannot debug what you can't see, and function-calling agents are nondeterministic enough that "reproduce it locally" usually fails. Build observability in from the start, not after the first incident.

Log, for every tool call: the full arguments the model produced, the resolved arguments after your validation, the result or error returned, the latency, and the model and prompt version that produced the call. Tie all of it to a conversation ID and a turn index. When a user reports "the agent refunded the wrong amount," you need to pull the exact call sequence in under a minute.

The argument-level logging is what catches the subtle failures. You'll discover the model systematically misreads a date format, or always omits an optional field that turns out to be load-bearing, or picks the wrong tool 8% of the time for a specific phrasing. None of that shows up in aggregate success metrics. It shows up when you can group failed conversations by tool and argument shape.

Build replay. Capture enough state — conversation history, tool definitions, model version — that you can re-run a conversation against a new prompt or schema and diff the tool calls. This is how you ship schema changes safely. Before you rename a parameter or split a tool, replay your last 500 real conversations through the new definition and check that the tool-selection behavior didn't regress. Without replay, every schema change is a blind production deploy.

The meta-point: function calling is a contract between a probabilistic system and a deterministic one. Every pattern here is about making that contract explicit and defensive. The model will surprise you. Your job is to ensure that when it does, the result is a clean error and a retry, not a corrupted database and an angry customer.

Found this useful?

Let's apply this thinking to your stack

Book a free architecture call. A senior engineer will give you an honest assessment — no pitch required.