- Your dev environment should let you iterate on agent behavior in seconds, not minutes.
- Observability is not optional — you need tracing from day one, not after your first production incident.
- API key management is a real security concern; set it up properly before you start building.
- Local LLMs are essential for fast iteration and cost control during development.
AI agent engineering has a unique development workflow. Unlike traditional software where you write code, run it, and check the output, agent development involves non-deterministic behavior, expensive API calls, multi-step reasoning chains, and failure modes that are difficult to reproduce.
Your toolkit needs to account for all of this. Here is the complete setup that working AI agent engineers use in 2026, organized by category.
IDE and Editor Setup
Primary Editor: VS Code or Cursor
Most AI agent engineers use VS Code with AI-assisted coding extensions, or Cursor as a purpose-built AI coding editor. The key extensions and settings:
- Python extension pack — Pylance, Black formatter, Ruff linter
- Jupyter extension — essential for interactive prompt prototyping and response analysis
- REST Client or Thunder Client — for quick API testing against LLM endpoints
- YAML/JSON Schema — for validating agent configuration files and tool definitions
- GitLens — for tracking prompt template changes over time (prompt versioning matters)
Editor Configuration That Matters
Set your editor to auto-save on focus change. When you are iterating on prompts, you want changes to propagate immediately to your running agent. Also configure your terminal to show timestamps — when debugging multi-step agent runs, timing information is critical.
Core Libraries and Frameworks
Agent Frameworks
| Framework | Best For | Notes |
| --- | --- | --- |
| OpenAI Agents SDK | Production agents with OpenAI models | Excellent tool-calling, built-in guardrails, handoff patterns |
| LangGraph | Complex multi-agent workflows | Graph-based orchestration, good for stateful agents |
| CrewAI | Role-based multi-agent systems | Good abstractions for team-of-agents pattern |
| Anthropic Claude Agent SDK | Agents built on Claude models | Computer use, MCP tool protocol, strong reasoning |
| Pydantic AI | Type-safe agent development | Pydantic-native, great for structured outputs |
Essential Python Libraries
- httpx — async HTTP client for API calls (prefer over requests for async agent code)
- pydantic — data validation for tool inputs/outputs, agent state, and configuration
- tenacity — retry logic with exponential backoff for flaky LLM API calls
- tiktoken / Anthropic's tokenizer — token counting for cost estimation and context window management
- jinja2 — prompt templating with variables, conditionals, and loops
- structlog — structured logging that makes agent traces searchable
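To show why tenacity-style retries matter for flaky LLM calls, here is the pattern in plain stdlib Python: exponential backoff with jitter around a transient failure. The `TransientAPIError` and `flaky_llm_call` names are illustrative stand-ins; in real code you would let tenacity's `@retry` decorator handle this.

```python
import random
import time
from functools import wraps


class TransientAPIError(Exception):
    """Stands in for a rate-limit or timeout error from an LLM API."""


def retry_with_backoff(max_attempts=4, base_delay=0.01):
    # tenacity provides this out of the box; this stdlib sketch just
    # shows the shape: exponential delay plus a little random jitter.
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except TransientAPIError:
                    if attempt == max_attempts:
                        raise
                    delay = base_delay * (2 ** (attempt - 1))
                    time.sleep(delay + random.uniform(0, base_delay))
        return wrapper
    return decorator


calls = {"n": 0}


@retry_with_backoff()
def flaky_llm_call():
    # Fails twice, then succeeds — simulating a rate-limited endpoint.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientAPIError("429 rate limited")
    return "ok"


result = flaky_llm_call()
```

With tenacity, the equivalent is roughly `@retry(retry=retry_if_exception_type(...), wait=wait_exponential(...))`, and you get logging and stop conditions for free.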
Debugging and Observability
This is where most agent developers underinvest, and it costs them dearly in production.
Tracing Tools
- Langfuse — open-source LLM observability. Traces every LLM call, tool invocation, and agent step. Self-hostable. This is the most popular choice among agent engineers in 2026.
- LangSmith — LangChain's tracing platform. Best if you are already in the LangChain ecosystem.
- Braintrust — combined eval and observability platform. Strong on evaluation-driven development.
- OpenTelemetry + Jaeger — if you prefer standard distributed tracing. Requires more setup but integrates with existing infrastructure.
Debugging Techniques
Agent debugging is fundamentally different from traditional software debugging:
- Trace replay: Record full agent runs (inputs, LLM calls, tool calls, outputs) and replay them deterministically. This is the single most valuable debugging technique.
- Step-through execution: Build a mode where the agent pauses after each step and shows its reasoning, planned next action, and available tools.
- Prompt diffing: When agent behavior changes, diff the rendered prompts between the working and broken versions. Often the bug is in prompt construction, not agent logic.
- Token budget monitoring: Track context window usage throughout the agent run. Many agent failures are caused by exceeding the context window silently.
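Prompt diffing, the third technique above, needs no special tooling — the stdlib `difflib` module is enough. This sketch compares two rendered prompts (the strings here are made-up examples) and surfaces exactly what changed between the working and broken versions:

```python
import difflib


def diff_prompts(working: str, broken: str) -> str:
    """Unified diff of two rendered prompts, for spotting construction bugs."""
    return "\n".join(
        difflib.unified_diff(
            working.splitlines(),
            broken.splitlines(),
            fromfile="working",
            tofile="broken",
            lineterm="",
        )
    )


working = "System: You are a helpful agent.\nTools: search, calculator"
broken = "System: You are a helpful agent.\nTools: search"
diff = diff_prompts(working, broken)
```

Here the diff immediately reveals that the calculator tool was dropped during prompt construction — the kind of bug that otherwise shows up only as the agent mysteriously refusing math tasks.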
Testing Frameworks
Unit Testing
- pytest with pytest-asyncio — the standard for async agent code
- respx or vcrpy — HTTP request mocking for deterministic LLM response testing
- Inline snapshots — capture expected LLM outputs and detect regressions
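respx and vcrpy mock at the HTTP layer; the same idea can be shown one level up with a hand-rolled fake client, which keeps this sketch dependency-free. `summarize` and `FakeClient` are hypothetical names, not part of any library:

```python
def summarize(client, text: str) -> str:
    # Hypothetical agent function that delegates to an LLM client.
    response = client.complete(prompt=f"Summarize: {text}")
    return response.strip()


class FakeClient:
    """Deterministic stand-in for a real LLM client, for unit tests."""

    def complete(self, prompt: str) -> str:
        # Verify the contract (prompt shape), return a canned response.
        assert prompt.startswith("Summarize:")
        return "  a short summary  "


def test_summarize():
    assert summarize(FakeClient(), "long article text") == "a short summary"


test_summarize()
```

The test is fast, free, and deterministic — exactly what you want in the unit-test tier, with real-model behavior checked separately in the eval tier.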
Evaluation Frameworks
- Braintrust Eval — programmatic eval with scoring functions and dataset management
- promptfoo — CLI-based prompt and agent testing with YAML configuration
- DeepEval — Python-native eval framework with built-in metrics (faithfulness, relevancy, hallucination)
- Custom eval harnesses — many teams build their own because agent evaluation is highly domain-specific
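A custom eval harness can start very small. This sketch (names and the keyword-presence scorer are illustrative — real scorers are usually LLM-as-judge or domain-specific checks) computes a pass rate you can later gate CI on:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    input: str
    expected_keyword: str  # simple keyword-presence scorer for the sketch


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run the agent over every case and return the pass rate."""
    passed = sum(
        1
        for case in cases
        if case.expected_keyword.lower() in agent(case.input).lower()
    )
    return passed / len(cases)


def toy_agent(query: str) -> str:
    # Placeholder agent: echoes the query back.
    return f"Echoing your question about {query}."


cases = [
    EvalCase("refund policy", "refund"),
    EvalCase("shipping times", "shipping"),
    EvalCase("pricing tiers", "discounts"),  # intentionally failing case
]
score = run_evals(toy_agent, cases)
```

Even a harness this simple gives you a single number to track across prompt changes, which is the core of evaluation-driven development.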
API Key and Secret Management
Agent engineers often manage 4-8 different API keys (OpenAI, Anthropic, Google, Perplexity, various tool APIs). This is a real security surface area.
Local Development
- Use direnv with a .envrc file that sources a .env file (add .env to .gitignore)
- Never hardcode API keys in source code — treat this as a firing offense
- Use 1Password CLI or Doppler for team key sharing instead of Slack messages
- Rotate keys quarterly and after any team member departure
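Once direnv has loaded your `.env`, it pays to fail fast at startup rather than mid-run with a cryptic auth error on step 7 of an agent loop. A minimal check, assuming your project needs the two keys named here (adjust the tuple to your providers):

```python
import os

REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")  # adjust per project


def check_env(environ=os.environ) -> list[str]:
    """Return the names of any required API keys that are missing or empty."""
    return [key for key in REQUIRED_KEYS if not environ.get(key)]


# Example with a partial environment: one key present, one missing.
missing = check_env({"OPENAI_API_KEY": "sk-..."})
```

Call `check_env()` at process start and raise if the list is non-empty; the error message then names the missing key instead of surfacing as a 401 deep inside a tool call.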
Production
- Use your platform's secret management (Railway secrets, Vercel env vars, AWS Secrets Manager)
- Implement key-scoped rate limiting so a leaked key has limited blast radius
- Set up billing alerts on all LLM API accounts — a runaway agent can generate thousands in API costs in hours
Local LLM Setup
Running models locally is essential for fast iteration, cost control, and offline development.
Recommended Local Setup
- Ollama — the easiest way to run local models. One command to download and serve any supported model. Compatible with the OpenAI API format.
- Recommended models for local development: Llama 3.1 8B (fast, good for tool-calling tests), Qwen 2.5 Coder 7B (code-related agents), Mistral 7B (general purpose)
- vLLM — for higher-throughput local inference when you need to run batch evaluations
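Because Ollama exposes an OpenAI-compatible endpoint (by default at `localhost:11434`), the same client code can point at a local model in dev and a hosted model in prod by swapping the base URL. This sketch only builds the request payload — the commented `httpx.post` line shows how you would send it against a running Ollama instance:

```python
# Ollama serves an OpenAI-compatible API on localhost:11434 by default;
# the model name is whatever you have pulled, e.g. "llama3.1:8b".
OLLAMA_BASE_URL = "http://localhost:11434/v1"


def build_chat_request(model: str, user_message: str) -> dict:
    """Build an OpenAI-format chat payload that works against Ollama."""
    return {
        "url": f"{OLLAMA_BASE_URL}/chat/completions",
        "json": {
            "model": model,
            "messages": [{"role": "user", "content": user_message}],
        },
    }


request = build_chat_request("llama3.1:8b", "List three tools you can call.")
# With a live Ollama instance:
#   httpx.post(request["url"], json=request["json"], timeout=60)
```

Keeping the base URL in config (not code) makes the local/cloud switch a one-line environment change.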
When to Use Local vs. Cloud
- Local: Prompt iteration, tool-calling logic debugging, evaluation runs, CI/CD pipelines
- Cloud: Final quality evaluation, production deployments, benchmarking against real model behavior
CI/CD for Agent Projects
Agent CI/CD has unique requirements beyond standard software pipelines:
- Eval gates: Run your evaluation suite in CI. Block merges if agent quality metrics drop below thresholds.
- Cost estimation: Estimate the API cost of changes before deploying. A prompt change that doubles token usage should be flagged.
- Canary deployments: Route a small percentage of traffic to the new agent version and compare quality metrics before full rollout.
- Prompt versioning: Tag prompt templates with versions so you can correlate quality changes to specific prompt updates.
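The cost-estimation gate above can be sketched in a few lines. This version uses a rough chars/4 heuristic so it stays dependency-free — a real pipeline would use tiktoken or your provider's tokenizer for exact counts, and the prompts and price here are made-up examples:

```python
def rough_token_count(text: str) -> int:
    # Heuristic: ~4 characters per token for English text. Swap in
    # tiktoken (or your provider's tokenizer) for exact counts.
    return max(1, len(text) // 4)


def cost_delta(old_prompt: str, new_prompt: str, usd_per_1k_tokens: float) -> float:
    """Estimated per-call input-cost change introduced by a prompt edit."""
    delta_tokens = rough_token_count(new_prompt) - rough_token_count(old_prompt)
    return delta_tokens / 1000 * usd_per_1k_tokens


old = "You are a support agent."
new = "You are a support agent." + " Follow the policy manual." * 40
delta = cost_delta(old, new, usd_per_1k_tokens=0.003)
```

Run this over every prompt template touched by a PR and flag any positive delta above a threshold — multiplied by daily call volume, a per-call difference of fractions of a cent can still be a large monthly number.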
GitHub Actions Workflow Example
A typical CI pipeline for an agent project runs: lint and type check, unit tests with mocked LLM responses, evaluation suite against local model, cost estimation diff, and (on main branch) deploy with canary.
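The eval-gate step of that pipeline reduces to a small script that turns a pass rate into an exit code CI can act on. The threshold and the hard-coded pass rate here are placeholders — in a real workflow you would read the rate produced by the eval step from a file or CLI argument:

```python
THRESHOLD = 0.85  # minimum pass rate; tune to your eval suite


def gate(pass_rate: float, threshold: float = THRESHOLD) -> int:
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    if pass_rate < threshold:
        print(f"Eval gate FAILED: {pass_rate:.2%} < {threshold:.2%}")
        return 1
    print(f"Eval gate passed: {pass_rate:.2%}")
    return 0


# In CI you would call sys.exit(gate(measured_pass_rate)); a non-zero
# exit code fails the job and blocks the merge.
exit_code = gate(0.91)
```

Wiring this into GitHub Actions is then just a `python eval_gate.py` step — the workflow fails automatically on a non-zero exit code.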
Project Structure
Here is the directory structure that most production agent projects converge on:
- agents/ — agent definitions and configurations
- tools/ — tool implementations that agents can call
- prompts/ — prompt templates with version tracking
- evals/ — evaluation datasets, scoring functions, and test harnesses
- traces/ — recorded agent runs for replay testing
- config/ — model configs, rate limits, feature flags
Setting Up From Scratch: A 30-Minute Checklist
If you are starting a new agent project today, here is the setup order:
1. Create the project with uv (Python) or your preferred package manager
2. Install core deps: your chosen agent framework, pydantic, httpx, tenacity, structlog
3. Set up direnv + .env file with your API keys
4. Install Ollama and pull a small local model for fast iteration
5. Set up Langfuse (or your preferred tracing tool) and add the tracing decorator to your first agent
6. Create a basic eval harness with 5-10 test cases
7. Write a Makefile with common commands: run, test, eval, lint
8. Add pre-commit hooks for linting and secret detection
That is the foundation. Everything else you add should be in response to a specific problem you encounter while building.
For more on how these tools map to real job requirements, explore the AI agent engineering roles on AgenticCareers.co, or browse our glossary to understand the terminology you will encounter in job descriptions.