Pinpointing Agent Failures in LLM Multi-Agent Systems: A New Benchmark and Automated Attribution Methods

Introduction

Large language model (LLM) multi-agent systems have gained significant traction for their ability to collaboratively solve complex tasks. However, these systems are prone to failure: a single agent’s mistake, a miscommunication between agents, or lost information can derail the entire process. Developers often struggle to identify which agent caused a failure and at what point it occurred, a search akin to finding a needle in a haystack. To address this, researchers from Penn State University and Duke University, in collaboration with Google DeepMind, the University of Washington, Meta, Nanyang Technological University, and Oregon State University, have introduced the problem of automated failure attribution. Their work, accepted as a spotlight presentation at ICML 2025, presents the first benchmark dataset for this problem, Who&When, and evaluates several automated attribution methods. The research is a step toward more reliable and debuggable multi-agent systems.

Source: syncedreview.com

The Challenge of Debugging Multi-Agent Failures

LLM-driven multi-agent systems operate autonomously, with agents exchanging long chains of messages. When a task fails, developers typically fall back on manual log review, sometimes called manual log archaeology: sifting through the full interaction history to pinpoint the root cause. This is time-consuming and depends heavily on the developer’s familiarity with the system. Without efficient debugging, iterating on and optimizing a system becomes impractical. Automated failure attribution targets this bottleneck, with the aim of saving developer hours and accelerating the refinement of these systems.

Why Automation Matters

Automated attribution can substantially reduce debugging time. Instead of manually tracing every agent’s actions, a developer could rely on an attribution tool to flag the failing agent and the critical moment, and then focus on fixing the actual problem rather than hunting for it. As multi-agent systems grow in complexity, manual methods become unsustainable, so automated tools are essential for scaling and reliability.

The Benchmark: Who&When

The researchers constructed Who&When, the first benchmark dataset specifically designed for automated failure attribution in LLM multi-agent systems. The dataset includes multiple scenarios in which agents collaborate on tasks, each with an annotated failure point that identifies which agent caused the failure (the "who") and the step at which the decisive error occurred (the "when"). This benchmark provides a standardized way to evaluate attribution methods. The dataset and code are open-source, available on Hugging Face and GitHub, enabling the community to build upon this work.
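As a rough illustration of what such an annotated failure record might contain, the sketch below models a log as a list of steps plus a ground-truth label. The class and field names (`Step`, `FailureRecord`, `failure_agent`, `failure_step`) and the example task are assumptions made for illustration; they are not taken from the released dataset's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One message in a multi-agent conversation log."""
    index: int    # position of the message in the log
    agent: str    # name of the agent that produced the message
    content: str  # text of the message


@dataclass
class FailureRecord:
    """A failed task plus its human annotation (hypothetical schema)."""
    task: str
    steps: List[Step] = field(default_factory=list)
    failure_agent: str = ""  # annotated "who"
    failure_step: int = -1   # annotated "when", as a step index


# Invented example: the searcher retrieves the wrong page at step 1.
record = FailureRecord(
    task="Find the opening hours of the nearest library",
    steps=[
        Step(0, "planner", "Search the web for the library's hours."),
        Step(1, "searcher", "Found a page for a library in another city."),
        Step(2, "writer", "The library opens at 9 am."),
    ],
    failure_agent="searcher",
    failure_step=1,
)
```

Storing the label as an (agent, step) pair keeps the ground truth directly comparable with whatever an attribution method predicts.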

Dataset Construction

To build Who&When, the team simulated multi-agent interactions in various domains, such as software development, data analysis, and question answering. They introduced controlled errors—like misinterpretations, missing information, or incorrect outputs—and recorded the resulting failures. Each failure was labeled by human experts with the responsible agent and the failure time. This rigorous annotation process ensures high-quality ground truth for evaluation.

Automated Attribution Methods

The researchers developed and tested several methods for automated failure attribution. These range from simple heuristics to more sophisticated LLM-based approaches. They categorized the methods into three groups:

Log-based methods: Analyze interaction logs for anomalies, such as repeated requests, missing responses, or sudden changes in agent behavior.
LLM-driven methods: Use a separate LLM to review logs and predict which agent likely caused the failure and when. This leverages the reasoning capabilities of LLMs.
Hybrid methods: Combine log analysis with LLM reasoning for improved accuracy.
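To make the log-based category concrete, here is a minimal sketch of one such heuristic: it flags the first step at which an agent repeats one of its own earlier messages. The log format (a list of `(agent, message)` tuples) and the function name `first_repeated_message` are illustrative assumptions, not the researchers' implementation.

```python
from typing import List, Optional, Tuple


def first_repeated_message(
    log: List[Tuple[str, str]],
) -> Optional[Tuple[str, int]]:
    """Return (agent, step) for the first repeat of a message by the same agent.

    Messages are compared after trimming whitespace and lowercasing.
    Returns None when no agent repeats itself.
    """
    seen = set()
    for step, (agent, message) in enumerate(log):
        key = (agent, message.strip().lower())
        if key in seen:
            return agent, step
        seen.add(key)
    return None


# Invented log: the planner re-issues its request instead of answering.
log = [
    ("planner", "Please fetch the sales data."),
    ("fetcher", "Which quarter do you need?"),
    ("planner", "Please fetch the sales data."),
    ("fetcher", "Which quarter do you need?"),
]
prediction = first_repeated_message(log)
```

An LLM-driven or hybrid method could return the same kind of (agent, step) pair, for instance by asking a reviewing model to confirm or override the heuristic's candidate, so that all methods remain comparable.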

Each method outputs a predicted agent and failure time, which are then compared against the ground-truth labels in Who&When to measure performance.
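The comparison against ground truth can be sketched as two separate accuracies, one for the agent and one for the failure time. The exact-match scoring rule and the name `attribution_accuracy` are assumptions for illustration; the benchmark's official scoring may differ in detail, and the example predictions are invented.

```python
from typing import List, Tuple

Prediction = Tuple[str, int]  # (agent name, failure step)


def attribution_accuracy(
    predictions: List[Prediction],
    ground_truth: List[Prediction],
) -> Tuple[float, float]:
    """Return (agent accuracy, step accuracy) as fractions in [0, 1]."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not predictions:
        raise ValueError("need at least one example")
    pairs = list(zip(predictions, ground_truth))
    # Count exact matches on the agent name and on the step index separately.
    agent_hits = sum(p[0] == g[0] for p, g in pairs)
    step_hits = sum(p[1] == g[1] for p, g in pairs)
    n = len(pairs)
    return agent_hits / n, step_hits / n


# Invented predictions and labels for four failed tasks.
preds = [("searcher", 1), ("planner", 0), ("writer", 4), ("coder", 2)]
truth = [("searcher", 1), ("planner", 3), ("coder", 4), ("coder", 5)]
agent_acc, step_acc = attribution_accuracy(preds, truth)
```

Reporting the two numbers separately shows whether a method struggles more with the "who" or the "when".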

Evaluation and Results

The evaluation showed that automated attribution is a challenging task. While LLM-driven methods performed relatively well, they still had significant room for improvement. Key findings include:

LLM-based methods outperformed simple heuristics, especially in complex scenarios with multiple agents.
Hybrid approaches achieved the highest accuracy, leveraging both structural log patterns and semantic understanding.
Identifying the responsible agent was easier than pinpointing the exact step at which the failure occurred, suggesting that locating the decisive error within a long interaction is the more difficult subproblem.

The results underscore the importance of continued research in this area. The benchmark provides a basis for future work, and the methods serve as baselines for comparison.

Implications for Future Systems

This research has practical implications for developers of LLM multi-agent systems. By integrating automated attribution tools, they can:

Reduce debugging time by quickly identifying failure sources.
Improve system reliability through targeted fixes based on attribution insights.
Enable iterative development with faster feedback loops.

Moreover, the open-source release of Who&When and the code allows the community to contribute improved attribution methods. This work aligns with the broader goal of making AI systems more transparent and trustworthy.

Conclusion

Automated failure attribution is a critical step toward robust LLM multi-agent systems. The Who&When benchmark and the evaluated methods provide a foundation for future research. As multi-agent systems become more prevalent, tools to diagnose and fix failures will be essential, and this work opens a path toward building them.

For more details, see the full paper on arXiv (PDF) and the dataset on Hugging Face (Who&When).