AI Agent Failure Modes: A Testing Taxonomy for Engineering Teams
When your product is powered by a single LLM call, failure is relatively straightforward to reason about: the model returns bad output, you catch it in evals, you fix the prompt or swap the model. When your product is powered by an agentic workflow — a multi-step system where an LLM autonomously selects tools, reasons over intermediate results, and drives towards a goal — failure becomes structurally different. Agents fail in ways that single-call systems never do, and most engineering teams aren’t testing for them.
After working with dozens of B2B SaaS teams integrating agentic features into their products, QualityArk has identified six primary failure mode categories that consistently surface in production. This article lays them out concretely, with examples and testable signals for each. This is the foundation before you write a single eval.
1. Tool Call Errors: Wrong Tool, Wrong Arguments, Wrong Timing
The most visible failure category. An agent selects a tool that doesn’t match the intent (e.g., calling search_database when lookup_user_profile was the correct choice), passes malformed arguments (a string where an integer is required, a missing required parameter), or invokes a tool at the wrong point in a multi-step sequence. In ReAct-style agents, tool call errors are the primary failure signal — and yet many teams test tool availability without testing tool selection accuracy.
How to test for it: Build a golden dataset of task descriptions paired with expected tool call sequences. Use PromptFoo’s is-json and schema validation assertions to verify argument structure. Track “tool accuracy rate” — the percentage of steps where the agent selects the correct tool given the task state. A production threshold of ≥ 92% tool accuracy is a reasonable starting point for most agentic features; below that, reliability degrades visibly for end users.
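A minimal sketch of the golden-dataset scoring loop, assuming a simple list-of-dicts dataset format and a `run_agent` callable that returns the tool names invoked — all names here are illustrative, not a specific framework’s API:

```python
# Hypothetical sketch: scoring tool selection accuracy against a golden dataset.

def tool_accuracy(golden_cases, run_agent):
    """Fraction of steps where the agent picked the expected tool."""
    correct = total = 0
    for case in golden_cases:
        actual_calls = run_agent(case["task"])  # e.g. ["lookup_user_profile", ...]
        for expected, actual in zip(case["expected_tools"], actual_calls):
            total += 1
            if expected == actual:
                correct += 1
    return correct / total if total else 0.0

golden_cases = [
    {"task": "Find the account owner for acme.com",
     "expected_tools": ["lookup_user_profile"]},
    {"task": "List invoices overdue by 30 days",
     "expected_tools": ["search_database", "filter_results"]},
]

def fake_agent(task):
    # Stand-in for a real agent run; returns the tool names it invoked.
    return ["lookup_user_profile"] if "owner" in task else ["search_database", "filter_results"]

accuracy = tool_accuracy(golden_cases, fake_agent)
assert accuracy >= 0.92, f"tool accuracy {accuracy:.0%} below production threshold"
```

In a real suite the threshold assertion would run in CI against the full golden dataset, so a regression in tool selection fails the build rather than surfacing in production.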
2. Hallucinated Reasoning Chains
Chain-of-thought prompting and scratchpad reasoning improve agent performance — but they also introduce a failure mode that doesn’t exist in direct-answer systems: the model constructs a plausible-sounding reasoning chain that leads to an incorrect action. The agent “thinks” it has retrieved a customer record when it hasn’t. It “confirms” a precondition that was never validated. The reasoning looks coherent; the action is wrong.
This is particularly dangerous in autonomous agents with write access — agents that can modify records, send emails, or trigger downstream workflows. A hallucinated confirmation that “the user has an active subscription” can cause real harm when it precedes an action predicated on that assumption.
How to test for it: Use LLM-as-judge evals (frameworks like DeepEval support this natively with their HallucinationMetric) to assess whether the agent’s reasoning claims are grounded in the actual retrieved context. Specifically, compare what the scratchpad asserts was retrieved against what the tool actually returned. Flag divergences as hallucinated reasoning events. Set a zero-tolerance policy for hallucinated preconditions on write-path actions.
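Before reaching for an LLM judge, a cheap deterministic pass can catch the blunt cases. This sketch diffs what the scratchpad claims was retrieved against what the tool actually returned; the trace structure and field names are assumptions for illustration, not a specific framework’s format:

```python
# Illustrative sketch: flag hallucinated reasoning by comparing scratchpad
# claims against actual tool output. Simple substring grounding — a real
# suite would back this with an LLM-as-judge metric for paraphrased claims.

def hallucinated_events(trace):
    """Return steps where a claimed retrieval has no grounding in tool output."""
    events = []
    for step in trace:
        claimed = step.get("claimed_facts", [])   # facts asserted in the scratchpad
        returned = step.get("tool_output", "")    # raw text the tool actually returned
        for fact in claimed:
            if fact not in returned:
                events.append({"step": step["n"], "ungrounded_claim": fact})
    return events

trace = [
    {"n": 1, "claimed_facts": ["subscription: active"],
     "tool_output": "plan=pro subscription: active renew=2025-01-01"},
    {"n": 2, "claimed_facts": ["subscription: active"],
     "tool_output": ""},  # tool returned nothing, but the agent "confirmed" it anyway
]

flagged = hallucinated_events(trace)
# Zero-tolerance on write paths: any flagged precondition should block the action.
```

Substring matching will miss paraphrases, which is exactly where the LLM-as-judge layer earns its cost — the two checks are complementary, not alternatives.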
3. State Corruption and Context Window Drift
In long-horizon tasks, agents maintain state across many steps — either in an explicit memory store, a structured context object, or implicitly in the accumulating conversation history passed to the model. Context window drift occurs when earlier task state is misrepresented or lost as the context grows. State corruption occurs when the agent overwrites or misinterprets prior results due to ambiguous intermediate outputs.
This failure mode is insidious because it often doesn’t manifest until late in a task — the agent reaches step 8 of 10 and produces a final output that’s inconsistent with the constraints established at step 2. By the time the failure surfaces, tracing it back to its source requires full step-by-step replay.
How to test for it: Instrument your agent to log state at each step boundary. Build test cases that verify state invariants mid-task: “After the user profile is retrieved in step 2, is the user ID still correctly referenced in step 6?” Langfuse’s tracing supports multi-step span inspection that makes this kind of mid-task state assertion practical at scale.
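A mid-task invariant check over those step-boundary logs can be as simple as the sketch below. The log format (one dict per step with a `state` snapshot) is an assumption; adapt it to whatever your instrumentation emits:

```python
# Minimal sketch: assert a state invariant holds from the step where it was
# established through to the end of the task.

def check_invariant(step_logs, key, established_at):
    """Verify a state key keeps the value established at an earlier step index."""
    expected = step_logs[established_at]["state"][key]
    for log in step_logs[established_at:]:
        actual = log["state"].get(key)
        if actual != expected:
            raise AssertionError(
                f"state drift at step {log['step']}: {key}={actual!r}, expected {expected!r}"
            )
    return True

step_logs = [
    {"step": 1, "state": {}},
    {"step": 2, "state": {"user_id": "u_184"}},  # profile retrieved here
    {"step": 6, "state": {"user_id": "u_184"}},  # still referenced correctly
]

check_invariant(step_logs, "user_id", established_at=1)  # list index of step 2
```

Because drift typically surfaces several steps after the corruption, failing with the exact step number at the first divergence saves the full-replay debugging session the article describes.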
4. Goal Drift and Off-Task Behavior
An agent tasked with „schedule a follow-up email for next Tuesday” drifts into also updating the CRM record, creating a calendar invite, and querying the customer’s recent purchase history — none of which were requested. This is goal drift: the agent’s scope expands beyond the defined task boundary, often because the model generalizes from similar observed patterns in training.
Goal drift is particularly common in agents given broad tool access without explicit scope constraints. It’s usually benign in low-stakes workflows, catastrophic in systems with sensitive data access or external side effects.
How to test for it: Define per-task tool allowlists and test that the agent respects scope boundaries. PromptFoo supports configuration of allowed function calls per test case. Measure “scope violation rate” — tasks where the agent invokes tools outside the allowed set — and treat any non-zero rate as a regression signal requiring prompt or architecture investigation.
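The metric itself is a set difference per run. A hypothetical sketch, with the run structure and tool names invented for illustration:

```python
# Hypothetical sketch: scope violation rate against per-task allowlists.

def scope_violation_rate(runs):
    """Fraction of task runs that invoked any tool outside the allowed set."""
    violations = sum(
        1 for run in runs
        if set(run["tools_called"]) - set(run["allowed_tools"])
    )
    return violations / len(runs)

runs = [
    {"allowed_tools": ["send_email", "get_calendar"],
     "tools_called": ["send_email"]},
    {"allowed_tools": ["send_email", "get_calendar"],
     "tools_called": ["send_email", "update_crm"]},  # goal drift: out-of-scope call
]

rate = scope_violation_rate(runs)  # 0.5 here — any non-zero rate is a regression signal
```

Note that this catches drift only at the tool-invocation layer; an agent can also drift within an allowed tool (e.g., an overly broad query), which needs argument-level assertions on top.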
5. Infinite Loops and Unresolved Planning Cycles
Agents running under ReAct or plan-and-execute architectures can enter failure states where they cycle through the same tool calls repeatedly — typically because a tool returns ambiguous or empty results and the agent lacks a defined termination condition for that state. Without a hard step limit and explicit empty-result handling, these loops consume tokens indefinitely and return nothing to the user.
How to test for it: Inject test cases with tools that return empty results, error responses, and ambiguous outputs. Verify the agent reaches a termination condition (either a valid answer or a graceful failure response) within your defined step budget. For most production agentic features, a maximum of 10–15 steps before graceful failure is a reasonable constraint. Measure “task completion rate within step budget” as a core eval metric.
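The step budget and loop detection can live in a thin harness around the agent loop. This is a sketch under assumed names — `agent_step` stands in for one ReAct iteration, and the action dict format is invented for illustration:

```python
# Sketch: step-budget harness with simple repeated-call loop detection.

MAX_STEPS = 12          # within the suggested 10-15 step budget
LOOP_WINDOW = 3         # identical consecutive tool calls that count as a loop

def run_with_budget(agent_step, task):
    history = []
    for _ in range(MAX_STEPS):
        action = agent_step(task, history)
        if action["type"] == "final_answer":
            return {"status": "completed", "answer": action["content"]}
        history.append(action)
        # Detect a tight loop: the same tool+args repeated LOOP_WINDOW times in a row.
        recent = history[-LOOP_WINDOW:]
        if len(recent) == LOOP_WINDOW and len({(a["tool"], a["args"]) for a in recent}) == 1:
            return {"status": "failed_loop", "answer": None}
    return {"status": "failed_budget", "answer": None}

def stuck_agent(task, history):
    # Simulates an agent retrying the same empty-result search forever.
    return {"type": "tool_call", "tool": "search_database", "args": "q="}

result = run_with_budget(stuck_agent, "find the renewal date")
```

Terminating on a detected loop rather than only at the hard budget returns a graceful failure to the user several steps (and several thousand tokens) sooner.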
6. Cascading Errors in Multi-Agent Pipelines
The most complex failure mode emerges in systems where multiple agents hand off to each other — an orchestrator delegates sub-tasks to specialist agents, whose outputs feed back into the orchestrator’s reasoning. A bad output from a sub-agent propagates downstream and compounds. By the final output, the error is far removed from its source.
Testing multi-agent pipelines requires component-level isolation testing in addition to end-to-end evaluation. Each agent must have its own eval suite; integration tests must cover the handoff interfaces specifically.
How to test for it: Apply QualityArk’s QA SPINE™ framework at each agent boundary — test Scope (does this agent stay in role?), Precision (does it return well-structured outputs the downstream agent can parse?), Integrity (does it handle bad inputs gracefully without silent failure?), Non-determinism (does it behave consistently across temperature and seed variation?), and Edge cases (what happens when the upstream agent sends malformed data?). Treat each handoff interface as a contract with a dedicated test suite.
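One way to make the “handoff as a contract” idea concrete is a schema check at each boundary. The sketch below hand-rolls the validation with invented field names; a real suite would more likely use jsonschema or pydantic, but the Precision and Edge-case checks look the same:

```python
# Illustrative contract test for a handoff interface between two agents.
# Field names and types are assumptions for this example.

HANDOFF_CONTRACT = {
    "sub_task_id": str,
    "result": str,
    "confidence": float,
}

def validate_handoff(payload, contract=HANDOFF_CONTRACT):
    """Return a list of contract violations ([] means the payload is valid)."""
    errors = []
    for field, expected_type in contract.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(payload[field]).__name__}")
    return errors

good = {"sub_task_id": "t1", "result": "done", "confidence": 0.9}
bad = {"sub_task_id": "t1", "confidence": "high"}  # missing result, wrong type
```

Running the same validator on the producer side (before handoff) and the consumer side (after) turns silent upstream corruption into a loud, attributable failure at the boundary where it originated.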
Building Your Agent Failure Mode Coverage Map
Before you write your first agentic eval, map your agent architecture against these six failure categories. For each one, ask: do we have test cases that can catch this? Most teams find they have reasonable coverage of tool call errors (because they’re visible) and almost no coverage of state corruption or goal drift (because they’re subtle). Coverage gaps at the failure-mode level are more revealing than coverage gaps at the feature level.
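The coverage map itself doesn’t need tooling — even a checked-in dict that CI can interrogate works as a starting point. Category names follow this article; the counts are illustrative:

```python
# A minimal failure-mode coverage map, sketched as plain data.
# "covered" here means at least one targeted test case exists.

COVERAGE_MAP = {
    "tool_call_errors":       {"test_cases": 24, "covered": True},
    "hallucinated_reasoning": {"test_cases": 3,  "covered": True},
    "state_corruption":       {"test_cases": 0,  "covered": False},
    "goal_drift":             {"test_cases": 0,  "covered": False},
    "infinite_loops":         {"test_cases": 5,  "covered": True},
    "cascading_errors":       {"test_cases": 1,  "covered": True},
}

gaps = [name for name, c in COVERAGE_MAP.items() if not c["covered"]]
# Build the next eval batch against these gaps first.
```

The skew in this example is deliberate: it mirrors the pattern described above, where visible failure modes accumulate tests and subtle ones accumulate none.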
Once you have the map, build your eval dataset to target the uncovered quadrants. Tools like DeepEval, PromptFoo, and LangSmith all support the kinds of structured, multi-step test cases these failure modes require — but the test design has to be intentional. The tools can run the evals; only you can define what “correct” looks like at each step.
If your team is building agentic features and needs a systematic approach to AI agent QA — from failure mode mapping to production monitoring — QualityArk works with B2B SaaS engineering teams to build the eval infrastructure that makes agent reliability measurable. Reach out to start the conversation.