    Reasoning, Tool Use, and Failure Modes of Agents

By Anupama Murthi

VAKRA Dataset | Leaderboard | Release Blog | GitHub | Submit to Leaderboard

    We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.

    Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.

VAKRA provides an executable environment where agents interact with more than 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3–7-step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.

As the leaderboard shows, models perform poorly on VAKRA. In this post, we include additional details about the tasks in VAKRA and present an analysis of the failure modes we observed on different tasks.






    Task Description

As shown below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.

    Fig 1: Representative examples of each capability in the VAKRA benchmark

    core_benchmark_capabilities



    Capability 1: API Chaining using Business Intelligence APIs

This capability includes 2,077 test instances across 54 domains, requiring the use of tools from the SLOT-BIRD and SEL-BIRD collections (Elder et al., 2026). Compared to the setup in Elder et al., the tool universe in SLOT-BIRD and SEL-BIRD covers a larger number of domains. Each domain is restricted to one tool collection, and tasks involve chaining 1–12 tool calls to arrive at the final answer.

{
  "query": "Which football team has a build-up play speed of 31, build-up play dribbling of 53, and build-up play passing of 32?",
  "tool_calls": [
      {
        "name": "get_data",
        "arguments": {"tool_universe_id": "486ea46224d1-aeb8037c5e78"},
        "label": "retrieved_data_1"
      },
      {
        "name": "select_data_equal_to",
        "arguments": {"data_label": "retrieved_data_1", "key_name": "play_speed", "value": 31},
        "label": "FILTERED_DF_0"
      },
      {
        "name": "select_data_equal_to",
        "arguments": {"data_label": "FILTERED_DF_0", "key_name": "play_dribble", "value": 53},
        "label": "FILTERED_DF_1"
      },
      {
        "name": "select_data_equal_to",
        "arguments": {"data_label": "FILTERED_DF_1", "key_name": "play_passing", "value": 32},
        "label": "FILTERED_DF_2"
      },
      {"name": "get_team_name", "arguments": {"data_label": "FILTERED_DF_2", "n": 1}}
  ],
  "answer": "FC Barcelona"
}
    

    Fig 2: Data sample from SEL-BIRD collection
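The tool chain in Figure 2 can be replayed in plain Python over an in-memory copy of the data. The records and helper functions below are illustrative mirrors of the benchmark's tools, not its actual implementation; the real data lives server-side behind MCP.

```python
# Illustrative in-memory data source (made-up records matching the Fig 3 preview).
records = [
    {"team_name": "FC Barcelona", "play_speed": 31, "play_dribble": 53, "play_passing": 32},
    {"team_name": "Manchester City", "play_speed": 40, "play_dribble": 30, "play_passing": 16},
]

def select_data_equal_to(data, key_name, value):
    """Mirror of the benchmark's filtering tool: keep rows where key == value."""
    return [row for row in data if row[key_name] == value]

def get_team_name(data, n):
    """Mirror of the query-specific getter: return the first n team names."""
    return [row["team_name"] for row in data][:n]

# Replay the three filters and the final getter from Fig 2.
step1 = select_data_equal_to(records, "play_speed", 31)
step2 = select_data_equal_to(step1, "play_dribble", 53)
step3 = select_data_equal_to(step2, "play_passing", 32)
print(get_team_name(step3, n=1))  # ['FC Barcelona']
```

Each tool call's output label (e.g., FILTERED_DF_0) becomes the input of the next call, which is exactly what the `data_label` arguments in Figure 2 encode.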

As shown above, each instance has an associated JSON data source from which the answer must be derived. The MCP servers supporting this task include a special tool, get_data(tool_universe_id=id), which must be called at the beginning of each instance.
This tool initializes the data source, returns a lightweight preview of the data (see Figure 3 below), and stores the full dataset server-side, preventing the inefficient transfer of large data over the MCP protocol. The call also configures the MCP server to expose the appropriate tool set based on the tool_universe_id and aligns the data source with the domain-specific database for the instance.
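The handle-plus-preview pattern can be sketched in a few lines. Everything here (the `DATA_SOURCES` table, `SERVER_STATE`, the record contents) is a hypothetical stand-in for the benchmark's server code, shown only to make the mechanics concrete.

```python
# Hypothetical server-side sketch: store the full dataset under a handle and
# return only a lightweight preview to the client.
DATA_SOURCES = {
    "486ea46224d1-aeb8037c5e78": [
        {"team_name": "FC Barcelona", "play_speed": 31},
        {"team_name": "Manchester City", "play_speed": 40},
    ]
}
SERVER_STATE: dict[str, list] = {}  # handle -> full dataset, never sent to the client

def get_data(tool_universe_id: str) -> dict:
    data = DATA_SOURCES[tool_universe_id]
    handle = f"retrieved_data_{len(SERVER_STATE) + 1}"
    SERVER_STATE[handle] = data  # full data stays on the MCP server
    return {
        "handle": handle,
        "num_records": len(data),
        "key_details": [
            {
                "name": key,
                "dtype": type(data[0][key]).__name__,
                "first_3_values": [row[key] for row in data[:3]],
            }
            for key in data[0]
        ],
    }

preview = get_data("486ea46224d1-aeb8037c5e78")
```

Subsequent tools then accept the handle (via `data_label`) rather than the data itself, which is what keeps payloads over the MCP protocol small.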

The SLOT-BIRD collection provides a global set of 7 tools for generic data manipulation (e.g., filtering, sorting), inspired by systems like Tableau and
Google Analytics. The SEL-BIRD collection extends this by introducing more specialized tools: some are shared with SLOT-BIRD, while others are derived by flattening categorical arguments into separate functions (e.g., sort_data with argument ascending: bool = False becomes sort_data_ascending and sort_data_descending). Additionally, the generic retrieve_data function from SLOT-BIRD is replaced with query-specific getters: every key in the data for a given instance has an associated get function (get_KEY_NAME), for an average of 4 get functions per instance.
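The argument flattening described above can be sketched directly: a generic tool with a boolean (or categorical) argument becomes one specialized tool per value. The `sort_data` body below is a made-up stand-in for the SLOT-BIRD tool; only the naming pattern follows the text.

```python
from functools import partial

def sort_data(data, key_name, ascending=False):
    """Generic SLOT-BIRD-style tool: sort rows by a key."""
    return sorted(data, key=lambda row: row[key_name], reverse=not ascending)

# SEL-BIRD-style flattened variants: the boolean argument disappears from the
# tool signature and is baked into the tool name instead.
sort_data_ascending = partial(sort_data, ascending=True)
sort_data_descending = partial(sort_data, ascending=False)

rows = [{"score": 2}, {"score": 9}, {"score": 5}]
print(sort_data_ascending(rows, key_name="score"))  # [{'score': 2}, {'score': 5}, {'score': 9}]
```

The trade-off this creates is exactly the one the error analysis later examines: SLOT-BIRD agents must fill more arguments per tool, while SEL-BIRD agents must choose among more tools.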

    {
      "handle": "retrieved_data_1",
      "num_records": 2,
      "key_details": [
        {"name": "team_name", "dtype": "str", "first_3_values": ["FC Barcelona", "Manchester City"]},
        {"name": "play_speed", "dtype": "int32", "first_3_values": [31, 40]},
        {"name": "play_dribble", "dtype": "int32", "first_3_values": [53, 30]},
        {"name": "play_passing", "dtype": "int32", "first_3_values": [32, 16]}
  ]
}
    

    Fig 3: Data preview obtained from `get_data` function





    Capability 2: Tool Selection using Dashboard APIs

This capability includes 1,597 instances across 17 domains, requiring tools from an expanded REST-BIRD collection (Elder et al.).
These domains use endpoint-style interfaces that provide highly specific, query-aligned endpoints encapsulating most of the computation. They are served as REST APIs running in a FastAPI server, which is wrapped by the MCP server. This task requires selecting the correct APIs from the domain-specific tool set (as shown in the example in Figure 1). Each domain contains between 6 and 328 tools (116 on average). As in the previous task, the get_data tool configures the MCP server to expose only the relevant domain-specific APIs.

The OpenAI API specification restricts the tool list to a maximum of 128 tools. This restriction requires an agent builder using the API to manage the length of the tool list directly, for example via a shortlisting mechanism. The baseline agents in our repository handle this challenge with a simple shortlisting step.
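A minimal sketch of such a shortlisting step, assuming each tool is a dict with a name and description: score tools by keyword overlap with the query and keep the top 128. This is an illustrative baseline, not the repository's actual implementation; a real agent might use embeddings instead.

```python
# Keyword-overlap shortlisting: keep the `limit` tools whose name/description
# best match the query, so the tool list fits the 128-tool API cap.
def shortlist_tools(query: str, tools: list[dict], limit: int = 128) -> list[dict]:
    query_terms = set(query.lower().split())

    def overlap(tool: dict) -> int:
        text = (tool["name"] + " " + tool.get("description", "")).lower()
        return sum(term in text for term in query_terms)

    ranked = sorted(tools, key=overlap, reverse=True)
    return ranked[:limit]

# Illustrative tool universe: 300 generic tools plus one that matches the query.
tools = [{"name": f"get_metric_{i}", "description": f"metric {i}"} for i in range(300)]
tools.append({"name": "get_revenue", "description": "total revenue by region"})
selected = shortlist_tools("What is the total revenue by region?", tools)
assert selected[0]["name"] == "get_revenue" and len(selected) <= 128
```

A more robust variant would embed tool descriptions and the query and rank by cosine similarity, but the interface (tools in, at most 128 tools out) stays the same.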




    Capability 3: Multi-Hop Reasoning using Dashboard APIs

The Capability 3 segment of the benchmark has 869 test instances drawn from 38 subject domains. These instances again rely on the REST-BIRD API collection, but add multi-hop reasoning to the challenge (see the example in Figure 1). Multi-hop questions require multiple pieces of supporting evidence to be extracted and combined to reach an answer; the instances in this section require between one and five logical hops to answer a query. The distribution of question types in the test dataset is shown in Figure 4 below.

    hop_distribution

    Fig 4: API Hop-Type distribution for Capability 3 (MultiHop) and Hybrid Hop-Type distribution for Capability 4 (MultiHop MultiSource Reasoning)



    Capability 4: Multi-Hop, Multi-Source Reasoning and Policy Adherence

    Capability 4 includes 644 instances across 41 domains and is also built on the REST-BIRD API collection. Figure 4 above shows a distribution of hybrid hops for test queries without policies. It contains the most complex queries with the following characteristics:

Multi-Source: This segment adds document indices per domain. Queries in this capability can require information from these document indices as well as API calls. As in Capability 3, this task also has multi-hop queries. The required information source applies at the per-hop level; for example, a question may entail three logical hops with sources API – RAG (document retrieval) – API. To enforce correct reasoning, sources are decontaminated during data generation, i.e., the information required for a given hop is available in only one source. For example, if a hop is to be answered using APIs, the document index is built by removing documents that likely contain the information needed to answer the question.
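The per-hop decontamination can be sketched as a simple filter over the document pool. The substring heuristic below is illustrative only; the benchmark's generator may use a more sophisticated relevance check to decide which documents "likely contain" a hop's answer.

```python
# Drop documents that likely leak the answer to an API-designated hop, so the
# hop can only be solved through the API source.
def decontaminate(documents: list[str], hop_answer: str) -> list[str]:
    needle = hop_answer.lower()
    return [doc for doc in documents if needle not in doc.lower()]

docs = [
    "FC Barcelona plays a fast build-up game.",
    "Transfer news roundup for the summer window.",
]
clean = decontaminate(docs, hop_answer="FC Barcelona")
assert clean == ["Transfer news roundup for the summer window."]
```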

    Multi-Turn: This segment of the dataset also adds multi-turn conversations to the setting. Each instance is a dialog with multiple turns. The data is released as context-response pairs, where the context encodes the current dialog history and the agent is only responsible for answering the current turn.

    Tool-usage Policies: A subset of these instances includes tool-use policies that the agent is required to follow. These policies take the form of plain-text instructions about the knowledge sources that the agent is allowed to access and under which circumstances. For example:

    If a user's query pertains to Technology & Software, which is/are about Topics focusing on codebases,
    software platforms, applications, and user interactions in tech, make sure you try answering them by
    only using document retrievers. Do not use other types of tools.
    

The baseline agent in the project repo imposes adherence to these policies through a simple addition to the prompt: "You are a helpful assistant with access to tools.\nTool Usage Constraint: {additional_instructions}". Of course, agent builders are free to choose any constraint-enforcement mechanism.



    Evaluation Framework

    VAKRA evaluates agents in tool environments where success depends on both the ability to execute coherent, multi-step workflows and answer correctness. We introduce an execution-centric evaluation framework that assesses not only final outputs but also the full tool-execution trajectory that includes tool calls, inputs, and intermediate results.



    Evaluation Metric

    The VAKRA Evaluator operates over two key inputs for each sample: a predicted final response and the corresponding tool-call trajectory. The tool calls from the predicted trajectory are executed in the same environment as the ground truth to verify intermediate tool outputs.

    The evaluation follows a waterfall-style pipeline (Figure 6), where later stages are conditioned on earlier success:

    • For Capability 4 tasks, policy adherence is first verified programmatically (this step is not applied to other capabilities).
    • The predicted tool call sequence is then compared against the ground truth sequence.
    • Only samples with valid trajectories proceed to final response evaluation.

    Fig 6: Waterfall-style Evaluation Pipeline

      

    Tool-Sequence Comparison
Because the environment is executable, agents can explore it and sometimes return the answer by invoking a different set of APIs than the ones we identified. To support alternative but valid tool invocations and reasoning paths, correctness is assessed by executing each predicted tool call and comparing the set of tool responses against those from the ground truth, rather than enforcing strict step-level matching.

Specifically, we first perform a programmatic check, verifying whether all information present in the ground-truth tool responses is recovered by the predicted tool responses. This check may be inconclusive in cases involving partial matches, semantic equivalence, or differences in representation (e.g., ordering, aggregation, or formatting). In such cases, we apply a secondary LLM-based evaluation, adapted from the CRAG framework (Yang et al., 2024), which uses an adapted prompt to determine whether the predicted trajectory captures all required information, even if obtained through a different sequence of tool calls.
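One way to implement the first-stage programmatic check is a value-coverage test: every scalar value appearing in the ground-truth tool responses must appear somewhere in the predicted responses. The sketch below is an assumption about how such a check could look, not the evaluator's actual code; inconclusive cases would fall through to the LLM judge (not shown).

```python
import json

def flatten_values(obj) -> set[str]:
    """Collect all scalar values in a (possibly nested) tool response."""
    if isinstance(obj, dict):
        return set().union(*(flatten_values(v) for v in obj.values())) if obj else set()
    if isinstance(obj, list):
        return set().union(*(flatten_values(v) for v in obj)) if obj else set()
    return {json.dumps(obj)}  # canonical string form for scalars

def recovers_ground_truth(predicted_responses, gold_responses) -> bool:
    """True if every ground-truth value is recovered by the predicted trajectory."""
    predicted = set().union(*(flatten_values(r) for r in predicted_responses))
    gold = set().union(*(flatten_values(r) for r in gold_responses))
    return gold <= predicted

# A different API path that still surfaces the required value passes the check.
gold = [{"team_name": "FC Barcelona"}]
pred = [{"teams": ["FC Barcelona", "Manchester City"]}]
assert recovers_ground_truth(pred, gold)
```

Note that this set-based comparison is deliberately order- and structure-insensitive, which is why a secondary judge is still needed for semantic equivalence that string matching misses.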

    Final Response Evaluation
    For trajectories that pass the previous check, the final response is evaluated using an LLM-based judge. This step ensures that the response is (i) grounded in the predicted tool outputs, and (ii) factually consistent with the ground truth answer, accounting for potential variations in phrasing or structure.

    This design ensures that agents are rewarded not only for producing correct answers, but for obtaining them through valid and complete reasoning processes.

    Scoring

Every capability is equally weighted to obtain the final leaderboard score:

$$\text{Leaderboard\_Score} = \frac{1}{4} \sum_{n=1}^{4} \text{Capability}_n$$

For capabilities 1 through 3, every sample within a capability is equally weighted:

$$\text{Capability}_{1\text{--}3} = \frac{\#\,\text{correct queries}}{\#\,\text{total queries}}$$

    For capability 4, we weight heterogeneous queries higher:

$$\text{Capability}_4 = \frac{\#\,\text{correct multi-source} \times 2 + \#\,\text{correct API-only or RAG-only}}{\#\,\text{total multi-source} \times 2 + \#\,\text{total API-only or RAG-only}}$$
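The scoring formulas translate directly into code. The correct/total counts below are made up for illustration; only the sample totals per capability match the dataset sizes quoted earlier.

```python
# Scoring helpers mirroring the formulas above.
def capability_score(correct: int, total: int) -> float:
    """Capabilities 1-3: plain accuracy over equally weighted samples."""
    return correct / total

def capability4_score(correct_ms, total_ms, correct_single, total_single) -> float:
    """Capability 4: multi-source queries are weighted 2x."""
    return (correct_ms * 2 + correct_single) / (total_ms * 2 + total_single)

caps = [
    capability_score(1200, 2077),            # Capability 1 (illustrative counts)
    capability_score(700, 1597),             # Capability 2
    capability_score(300, 869),              # Capability 3
    capability4_score(100, 300, 150, 344),   # Capability 4
]
leaderboard = sum(caps) / 4  # equal weight per capability
```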



    Error Analysis

    We now present detailed error analysis across the four VAKRA capabilities. To facilitate our analysis, we adopt stage-wise error categorization to assign each failure to the first point of breakdown. Specifically, we evaluate, in order: (i) whether the correct tool(s) were selected, (ii) whether the required arguments were provided without omissions or hallucinations, (iii) whether argument values were correct, and (iv) whether the final response is both accurate and grounded in the tool outputs.



    Failure Stage Isolation

    Since a single sample may exhibit multiple errors across different steps, we sequentially classify each instance to the earliest failing stage (e.g., tool selection errors take precedence over argument errors). This avoids double-counting and allows error categories to be interpreted as disjoint fractions of the dataset. While more granular metrics (e.g., precision/recall over tool usage) are possible (Elder et al., 2026), we find this formulation provides a simple and interpretable breakdown of agent failures.
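The earliest-failure-stage assignment can be sketched as a short loop over the four checks in order. The per-sample dict of stage outcomes is an illustrative format, not the evaluator's actual schema.

```python
# Bucket each sample at the first stage it fails, so error categories are
# disjoint fractions of the dataset.
STAGES = ["tool_selection", "argument_presence", "argument_values", "final_response"]

def earliest_failure(sample: dict) -> str:
    for stage in STAGES:
        if not sample[stage]:
            return stage  # earlier stages take precedence over later ones
    return "correct"

samples = [
    {"tool_selection": True, "argument_presence": True,
     "argument_values": False, "final_response": False},
    {"tool_selection": True, "argument_presence": True,
     "argument_values": True, "final_response": True},
]
buckets = [earliest_failure(s) for s in samples]
assert buckets == ["argument_values", "correct"]
```

Because each sample lands in exactly one bucket, the category counts in the plots below sum to the dataset size.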

    bi_apis_error_plot

The instances in this part of the benchmark required selecting and sequencing multiple tools to solve a single task; this capability contains 2,077 samples. The task was challenging for all models, but GPT-OSS-120B performed best on this segment of the benchmark.

• GPT-OSS-120B outperformed the other models by a large margin, mostly due to a better understanding of the tool schemas.
• The tools in this part of the benchmark involve a large number of parameters, many of which are optional, and GPT-OSS-120B was especially robust, compared to the others, at choosing the right parameters to fill.
• Overall, synthesizing a correct answer after making all tool calls correctly was less challenging in this section of the benchmark, most likely because the tool-call sequencing made the tool-choice problem less amenable to guessing than in the Dashboard API capability.

    Fig 7: SEL-BIRD vs SLOT-BIRD Error Types Analysis

    sel_vs_slot_error_plot_v2

The Business Intelligence (BI) API capability contains two sets of APIs, from the SLOT-BIRD and SEL-BIRD tool collections. The SEL part of the benchmark has 600 samples, while the SLOT part has 1,477. The two collections are grouped under the BI API capability but have slightly different characteristics: SLOT-BIRD has a small number of generic tools with many parameter values to fill, while SEL-BIRD has a larger set of tools with fewer parameters per tool. This difference is reflected in the relative errors models made when using the two collections.

• Using SLOT-BIRD, all models except GPT-OSS-120B made a substantial number of errors producing correct names for the tool arguments. This is largely why GPT-OSS-120B performed so well overall in this segment of the benchmark.
    • With fewer parameters to fill, the same models made very few such errors when using the SEL-BIRD tool collection, but they made many more errors selecting the correct tools, reflecting the increased difficulty of choosing from a larger (and dynamic) tool set.
    dashboard_apis_error
    • As shown above, for the 1597 samples in the tool selection capability, Gemini-3-flash-preview outperforms the other models tested on all error categories.
    • As expected, since the dashboard API instances require the models to choose from a large number of tool options, but each tool requires only a small number of parameters, there are a large number of errors in tool selection and parameter value selection.
• There seems to be little problem with hallucinating or skipping required parameters. However, even when all tool calls are made correctly, models (especially Gemini-3-flash-preview and Claude-Sonnet-4-5) still struggle to synthesize a correct answer from the tool responses, as evidenced by the large drop-offs at the far right side of the plot.



    Multi-Hop Reasoning: Effect of Hop Depth on Model Performance

    Fig 8: Comparison of Accuracy Across Models by Hop Depth

    multihop_hop_depth_acc

    Multi-hop reasoning increases the difficulty of the original task by requiring models to successfully answer multiple implicitly coupled questions, each of which requires selecting and calling the correct API. As expected, all models performed best on the questions with only a single logical hop, and saw performance degradations on 2-hop and again on 3+ hop questions.



    Multi-Hop Multi-Source Reasoning: Effect of Hybrid Hops on Model Performance

    Fig 9: Model Accuracy Rates by Interaction Type (API, Document-Retriever, Hybrid)

    question_type_acc

    The final segment of the dataset includes document sources in addition to the tool/API sources in the other segments. This leads to instances that require single or multiple API calls, single or multiple document searches, or some combination of API calls and document searches.

    • As before, there is a marked difference in performance on instances that require single API calls (1-hop API) as compared to those that require multiple API invocations (2-hop API), and including document retrievers makes the task more challenging (RAG Hops and Hybrid).
• Interestingly, we find that on questions that require a single document-retriever call (1-hop RAG), GPT-OSS-120B often tries to return the answer directly from parametric knowledge, whereas when the question appears to require multiple hops, it makes the tool calls. We hypothesize that because 1-hop RAG questions are heavily focused on Wikipedia entities, the model skips the tool call (we don't see this problem on 1-hop API, where back-end database-specific entities/facts appear more frequently in the question).
    • It is also interesting that the performance of Gemini-3-flash-preview shoots up on 2-hop API-RAG as compared to other hybrid hop-patterns. This is likely explained by the relatively strong performance of Gemini-3-flash-preview on the dashboard APIs (Tool Selection Capability), and thus, once the correct intermediate answer is identified using the tool-call, the retrieval query is likely to be more successful.



    Effect of Policies on Model Performance

    Fig 10: Model Accuracy Rates by Policy Type

    policy_acc

Policies introduce an additional layer of difficulty on top of multi-hop, multi-source reasoning. When a policy aligns with the source required for answering, i.e., it does not affect the tool list the model needs, we refer to it as "No Updates to Answer". As shown in Figure 10, all models except Granite-4.0-h-Small-32B experience a clear drop in performance under policy constraints that restrict access to the most relevant information source (i.e., "Policy updates the answer").

In general, we find that models either violate the constraints or fail to retrieve sufficient information: they sometimes understood the policy but could not answer the question correctly, or they exhibited one of the previously analyzed failure modes.

    Overall, tool-use policy-constrained settings suggest that while models can reason over tools and sources, they struggle to incorporate external constraints into that reasoning – often a key requirement for reliable real-world deployment.



    Conclusion

    VAKRA exposes a critical gap between surface-level tool competence and robust, end‑to‑end agent reliability. Although modern models can increasingly select APIs and execute isolated tool calls, VAKRA shows that these abilities alone are insufficient for real‑world deployment. In practice, models often break down when required to perform compositional reasoning under execution constraints—spanning APIs, documents, dialog context, and policy requirements.



    Try VAKRA — Where Does Your Agent Break?

    Think your agent is solid? Put it to the test.

    Run it on VAKRA and see where it falls apart—tool selection, multi-hop reasoning, or policy constraints.

    👉 Try it and tell us what your agent learned
