What Tool Use Actually Means
Tool use — also called function calling — is the mechanism by which an LLM decides to invoke an external function, API, or system based on the conversation context. Instead of generating a text response, the model outputs a structured call to a specific tool with specific arguments. Your application executes the tool and returns the result to the model, which then continues reasoning.
This is the fundamental capability that turns a language model into an agent. Without tool use, an LLM can only generate text. With tool use, it can search the web, query databases, send emails, create tickets, execute code, and interact with any system that exposes an API.
How It Works Across Providers
OpenAI Function Calling
OpenAI's implementation uses a `tools` parameter in the Chat Completions API. You define tools as JSON Schema objects describing the function name, description, and parameters.
The key fields in a tool definition:
- `name`: A concise function name (e.g., `search_database`). The model uses this to decide which tool to call.
- `description`: Critical for model decision-making. Write it as if you are explaining the tool to a junior developer: be explicit about what it does, when to use it, and when NOT to use it.
- `parameters`: A JSON Schema object defining the expected arguments. Use `required` fields and `enum` constraints wherever possible to reduce parameter hallucination.
OpenAI also supports `tool_choice` to force the model to call a specific tool (`tool_choice: {"type": "function", "function": {"name": "search_database"}}`) or to let the model decide (`tool_choice: "auto"`). In production, use forced tool calling at the start of workflows where you know the first step, then switch to auto for the reasoning loop.
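As a concrete sketch of how these pieces fit together with the official openai Python SDK: the search_database tool, its fields, and the model name below are illustrative assumptions, not a definitive implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical tool definition; name, description, and schema are illustrative.
tools = [{
    "type": "function",
    "function": {
        "name": "search_database",
        "description": (
            "Search the product database for current prices, inventory, and "
            "availability. Do NOT use for historical data."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text search query."},
                "category": {
                    "type": "string",
                    # enum constraints narrow the model's choices
                    "enum": ["electronics", "clothing", "food"],
                },
                "max_results": {"type": "integer", "description": "Max rows to return."},
            },
            # required fields reduce parameter hallucination
            "required": ["query", "category"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # any tool-capable model
    messages=[{"role": "user", "content": "Find red shoes under $50."}],
    tools=tools,
    # Force the known first step; switch to "auto" for the reasoning loop.
    tool_choice={"type": "function", "function": {"name": "search_database"}},
)
print(response.choices[0].message.tool_calls)
```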
Anthropic Tool Use
Anthropic's implementation passes tools in the `tools` array of the Messages API. The structure is similar to OpenAI but with notable differences (a round-trip sketch follows the list):
- Tool descriptions are processed through Claude's instruction-following training, meaning longer, more detailed descriptions often produce better results than terse ones. Anthropic explicitly recommends including example inputs and outputs in tool descriptions.
- When Claude decides to call a tool, the response contains a `tool_use` content block with the tool name and a JSON input. You execute the tool and send the result back as a `tool_result` content block in the next message.
- Claude supports calling multiple tools in parallel within a single response, which is particularly useful for agents that need to gather data from multiple sources simultaneously.
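A sketch of one round trip with the anthropic Python SDK, again using the hypothetical search_database tool. Note the flatter schema (input_schema at the top level) and the run_tool dispatcher, which is assumed, not part of the SDK.

```python
import json
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "search_database",
    "description": (
        "Search the product database for current listings. "
        "Example input: {\"query\": \"red shoes\"}. Returns a JSON array of products."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
}]

messages = [{"role": "user", "content": "Find red shoes."}]
response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # any tool-capable model
    max_tokens=1024,
    tools=tools,
    messages=messages,
)

# Execute every tool_use block (there may be several, in parallel) and
# return all results in a single user message.
results = []
for block in response.content:
    if block.type == "tool_use":
        output = run_tool(block.name, block.input)  # your dispatcher (hypothetical)
        results.append({
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": json.dumps(output),
        })

messages.append({"role": "assistant", "content": response.content})
messages.append({"role": "user", "content": results})
# Call client.messages.create again with the updated messages to continue.
```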
Google Gemini Function Calling
Gemini's tool use is defined through `function_declarations` in the generation config. The JSON Schema for parameters is the same standard, but Gemini adds a function-calling mode (set via `tool_config`) that controls how aggressively the model uses tools:
- `AUTO`: Model decides whether to call a tool or respond with text (default).
- `ANY`: Model must call at least one tool for every response.
- `NONE`: Tool calling is disabled; the model responds with text only.
Gemini also supports `code_execution` as a built-in tool, allowing the model to write and run Python code in a sandboxed environment. This is uniquely powerful for math, data analysis, and data transformation tasks.
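A sketch with the google-generativeai Python SDK; the search_database function is a hypothetical placeholder, and the mode strings match the list above.

```python
import google.generativeai as genai

genai.configure(api_key="...")  # your API key

# The SDK can build a function declaration from a typed, documented callable.
def search_database(query: str, category: str) -> dict:
    """Search the product database for current listings."""
    ...  # hypothetical implementation

model = genai.GenerativeModel("gemini-1.5-pro", tools=[search_database])
response = model.generate_content(
    "Find red shoes.",
    # Mode controls tool-calling aggressiveness: AUTO (default), ANY, or NONE.
    tool_config={"function_calling_config": {"mode": "ANY"}},
)

# The built-in code execution tool is enabled by name.
code_model = genai.GenerativeModel("gemini-1.5-pro", tools="code_execution")
```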
Tool Description Best Practices
The quality of your tool descriptions is the single biggest factor in whether your agent uses tools correctly. Here are patterns that work consistently across all providers, combined into a single worked description after the list:
- Be explicit about when to use the tool: "Use this tool when the user asks about current prices, inventory levels, or product availability. Do NOT use this tool for historical data — use the analytics_query tool instead."
- Include parameter constraints in the description: "The date parameter must be in YYYY-MM-DD format. The category must be one of: electronics, clothing, food."
- Provide an example: "Example: search_database({query: 'red shoes', max_results: 5, category: 'clothing'})"
- Describe the return format: "Returns a JSON array of product objects with fields: id, name, price, in_stock."
- State error conditions: "Returns an error message if the query is empty or the category is not recognized."
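Putting all five patterns together, a full description might read like this (the tool and its fields are the same hypothetical example used above):

```python
SEARCH_DATABASE_DESCRIPTION = """\
Search the live product database. Use this tool when the user asks about
current prices, inventory levels, or product availability. Do NOT use it
for historical data; use the analytics_query tool instead.

The date parameter must be in YYYY-MM-DD format. The category must be
one of: electronics, clothing, food.

Example: search_database({query: 'red shoes', max_results: 5, category: 'clothing'})

Returns a JSON array of product objects with fields: id, name, price,
in_stock. Returns an error message if the query is empty or the category
is not recognized.
"""
```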
Error Handling in Production
Tool execution failures are inevitable. Your agent framework needs to handle them gracefully:
Retry Logic
Implement retries with exponential backoff for transient failures (network timeouts, rate limits). But do not retry tool calls where the parameters were wrong — send the error back to the model so it can correct its approach.
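A sketch of that split: the TransientToolError class is a hypothetical marker for timeouts and rate limits, and anything else is surfaced to the model instead of retried.

```python
import time

class TransientToolError(Exception):
    """Raised by tools for timeouts and rate limits; safe to retry."""

def execute_with_retry(tool_fn, args, max_attempts=3, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return {"ok": True, "result": tool_fn(**args)}
        except TransientToolError:
            if attempt == max_attempts - 1:
                return {"ok": False, "error": "Tool unavailable after retries."}
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff: 1s, 2s, 4s...
        except ValueError as exc:
            # Bad parameters: do NOT retry; return the error so the model can correct.
            return {"ok": False, "error": f"Invalid parameters: {exc}"}
```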
Fallback Tools
For critical workflows, define fallback tools that provide degraded but functional alternatives. If the primary database search fails, a cached or simplified search can keep the agent moving forward.
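A minimal sketch, assuming hypothetical search_database and search_cache functions:

```python
def search_with_fallback(query: str) -> dict:
    try:
        return search_database(query)  # primary: live database
    except Exception:
        result = search_cache(query)   # fallback: stale but functional
        result["note"] = "Results served from cache; live search was unavailable."
        return result
```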
Error Messages to the Model
When a tool call fails, the error message you send back to the model matters enormously. Instead of a generic "Error occurred," send structured feedback: "The search_database tool returned an error: invalid date format '2026/03/30'. Expected format is YYYY-MM-DD. Please retry with the correct format." This gives the model the information it needs to self-correct.
Validation and Safety
Never execute tool calls without validation. The model can hallucinate tool names, generate malformed parameters, or request actions outside its authorized scope.
- Schema validation: Validate every tool call against its JSON Schema before execution. Libraries like Pydantic (Python) or Zod (TypeScript) make this straightforward; see the sketch after this list.
- Authorization checks: Verify that the tool call is permitted given the current user's permissions. An agent serving customer A should not be able to query customer B's data.
- Rate limiting: Prevent agents from making excessive tool calls. A bug or adversarial input could cause an infinite tool-calling loop that racks up costs or overwhelms downstream services.
- Output sanitization: Tool results should be sanitized before being sent back to the model, especially if they contain user-generated content that could constitute a prompt injection.
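A sketch of the schema-validation step with Pydantic; the argument model mirrors the hypothetical search_database tool from earlier.

```python
from typing import Literal
from pydantic import BaseModel, ValidationError

class SearchDatabaseArgs(BaseModel):
    query: str
    category: Literal["electronics", "clothing", "food"]
    max_results: int = 10

TOOL_SCHEMAS = {"search_database": SearchDatabaseArgs}  # one model per tool

def validate_tool_call(name: str, raw_args: dict):
    """Return (validated_args, error). Exactly one of the two is None."""
    if name not in TOOL_SCHEMAS:
        return None, f"Unknown tool: {name}"  # the model hallucinated a tool name
    try:
        return TOOL_SCHEMAS[name](**raw_args), None
    except ValidationError as exc:
        return None, f"Invalid arguments for {name}: {exc}"  # feed back to the model
```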
Advanced Patterns
Parallel Tool Calling
Both OpenAI and Anthropic support parallel tool calling — the model can request multiple tool calls in a single response. This is essential for performance: if an agent needs to check inventory and look up pricing simultaneously, parallel calls cut latency in half.
To enable effective parallel tool calling, design your tools to be independent — each tool should accept all the parameters it needs without depending on the output of another tool. When tools have dependencies (tool B needs the output of tool A), the model will naturally sequence them across multiple turns.
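Executing the requested calls concurrently is your application's job. A sketch with asyncio, assuming each registered tool has an async implementation:

```python
import asyncio

async def execute_parallel(tool_calls, registry):
    """Run every tool call from a single model response concurrently."""
    async def run(call):
        fn = registry[call["name"]]  # look up the async tool function
        return await fn(**call["arguments"])
    # gather preserves order, so results line up with the original calls
    return await asyncio.gather(*(run(c) for c in tool_calls))

# Example: both calls run at once instead of back to back.
# results = asyncio.run(execute_parallel(
#     [{"name": "check_inventory", "arguments": {"sku": "A1"}},
#      {"name": "get_pricing", "arguments": {"sku": "A1"}}],
#     registry={"check_inventory": check_inventory, "get_pricing": get_pricing},
# ))
```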
Streaming with Tool Calls
In streaming mode, tool calls arrive as partial JSON that must be accumulated until complete. This is a common source of bugs. Use the provider's streaming helpers (OpenAI's stream event handler, Anthropic's stream manager) rather than parsing the raw stream yourself. The edge cases around partial JSON, multiple tool calls in a single chunk, and interleaved text and tool content are tricky to handle correctly.
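For intuition, the accumulation those helpers perform looks roughly like this sketch over OpenAI-style stream deltas; it is illustrative, not a substitute for the SDK helpers.

```python
import json

def accumulate_tool_calls(stream):
    """Merge streamed tool-call deltas into complete calls, keyed by index."""
    calls = {}
    for chunk in stream:
        if not chunk.choices:
            continue  # e.g. a final usage-only chunk
        for delta in chunk.choices[0].delta.tool_calls or []:
            call = calls.setdefault(delta.index, {"name": "", "arguments": ""})
            if delta.function.name:
                call["name"] = delta.function.name
            if delta.function.arguments:
                call["arguments"] += delta.function.arguments  # partial JSON fragment
    # Parse only after the stream ends; partial JSON will not parse.
    return [
        {"name": c["name"], "arguments": json.loads(c["arguments"])}
        for c in calls.values()
    ]
```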
Tool Call Chaining
Sophisticated agents chain tool calls — using the output of one tool as the input to another in a multi-step workflow. The agent might: (1) search for a customer by email, (2) retrieve their order history, (3) find the specific order they are asking about, (4) check the return eligibility, and (5) initiate the return. Each step depends on the previous one.
For reliable chaining, implement state management between tool calls. Store intermediate results in a session object that persists across the agent loop. This prevents the agent from losing context between steps and allows you to resume chains that are interrupted by errors or timeouts.
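A minimal session object for that purpose; the persistence backend and step names are up to you.

```python
from dataclasses import dataclass, field

@dataclass
class AgentSession:
    """Intermediate results that persist across the agent loop."""
    state: dict = field(default_factory=dict)

    def record(self, step: str, result: object) -> None:
        self.state[step] = result

    def get(self, step: str, default=None):
        return self.state.get(step, default)

# In the loop: session.record("customer", find_customer(email))
# Later steps read session.get("customer")["id"] instead of re-deriving it,
# and a chain interrupted by an error can resume from the last recorded step.
```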
Dynamic Tool Registration
In production systems, the set of available tools may change based on the user's permissions, the conversation context, or the current step in a workflow. Rather than loading all tools at the start, dynamically adjust the tool list at each step. This reduces the model's decision space (fewer tools to choose from means more accurate selection) and enforces authorization at the tool-availability level.
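A sketch of per-turn tool filtering; the registry shape and permission names are hypothetical.

```python
def tools_for_turn(registry: dict, user_permissions: set, workflow_step: str) -> list:
    """Return only the tool specs this user may use at this step."""
    return [
        spec for spec in registry.values()
        if spec["permission"] in user_permissions  # authorization at availability level
        and workflow_step in spec["steps"]         # smaller decision space per step
    ]

# tools = tools_for_turn(REGISTRY, user_permissions={"read"}, workflow_step="triage")
# Pass `tools` to the model for this turn only, and rebuild the list next turn.
```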
Testing Tool Use
Testing tool use requires a specific strategy:
- Unit test each tool independently: Verify that tools handle all expected inputs, edge cases, and error conditions correctly. This is traditional testing and should be comprehensive.
- Test tool selection: Create test cases that present the model with scenarios where different tools are appropriate. Verify that the model selects the correct tool with the correct parameters. This is where LLM non-determinism makes testing harder: run each test case multiple times and check for consistency (see the sketch after this list).
- Test error recovery: Inject tool failures (timeouts, invalid responses, rate limits) and verify that the agent handles them gracefully. The error recovery behavior is often more important than the happy path.
- Test authorization boundaries: Verify that the agent cannot access tools outside its authorized scope, even when prompted to do so. This is a security concern as much as a quality concern.
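A sketch of a tool-selection consistency check, assuming a call_model helper that returns the name of the tool the model picked for a prompt:

```python
def test_tool_selection_consistency(call_model, runs=5, threshold=0.8):
    """Run each scenario several times and require a consistent, correct pick."""
    scenarios = [
        ("What is the current price of red shoes?", "search_database"),
        ("How did shoe sales trend last quarter?", "analytics_query"),
    ]
    for prompt, expected_tool in scenarios:
        hits = sum(call_model(prompt) == expected_tool for _ in range(runs))
        assert hits / runs >= threshold, (
            f"{prompt!r}: expected {expected_tool}, got it {hits}/{runs} times"
        )
```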
Real-World Tool Use at Scale
At production scale, tool use introduces challenges that do not appear in prototypes:
Tool latency budgets: Each tool call adds latency to the agent's response time. If an agent makes three sequential tool calls averaging 500 ms each, that is 1.5 seconds of tool execution time alone. Set latency budgets for each tool and optimize aggressively: caching, connection pooling, and pre-fetching can all reduce tool latency significantly.
Tool versioning: When you update a tool's behavior or interface, you need to consider that the LLM was trained (or prompted) with the old tool description. Update tool descriptions alongside tool implementations, and run evaluation suites to verify the model still uses updated tools correctly.
Graceful degradation: Design your agent to function (with reduced capability) when individual tools are unavailable. If the search tool is down, can the agent still answer questions using its training data? If the database tool is slow, can it provide a partial answer while the query completes? Graceful degradation is the difference between a production-ready agent and a fragile demo.