The Illusion of Multi-Agent Advantage

A Systematic Evaluation of Automatic-MAS Against Single-Agent Baselines

We perform a rigorous, systematic evaluation of automatic-MAS against Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and complex agentic tasks such as BrowseComp-Plus, automatic-MAS consistently underperform CoT-SC while incurring up to 10× higher inference cost. To investigate whether task suitability explains this gap, we introduce a diagnostic synthetic dataset, Synthetic Multi-Hop Financial Reasoning (SMFR), designed with explicit sub-task decomposition, context separation, and parallelization potential. Our mechanistic deconstruction of six state-of-the-art frameworks reveals pervasive architectural bloat and functional collapse: complex automatic-MAS workflows degenerate into basic ensembling loops indistinguishable from CoT-SC, exposing a fundamental misalignment with core multi-agent principles.

Systematic Benchmarking Cost-Efficiency Analysis SMFR Diagnostic Benchmark Architectural Deconstruction
HL
Hehai Lin2*
CL
Chuyuan Li3
FJ
Fangkai Jiao4
SW
Sudong Wang2
CQ
Chengwei Qin2
GC
Giuseppe Carenini3

1 Salesforce AI Research   2 HKUST-Guangzhou   3 UBC   4 NTU

* Equal contribution

Abstract

Prevailing wisdom posits that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages like context protection, parallel processing and distributed decision-making. However, empirical support for this claim relies primarily on comparisons with SAS baselines using benchmarks that prioritize isolated reasoning tasks, which do not adequately assess these advantages. Focusing on automatic-MAS that are designed for enhanced generalizability over manually-designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic-MAS consistently underperform CoT-SC despite being up to 10× more expensive. To isolate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored for MAS featuring explicit task decomposition, context separation and parallelization potential. We show that expert-architected MAS consistently outperform automatic-MAS architectures in both raw performance and cost-efficiency on this dataset, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of automatic-MAS architectures reveals that current automatic-MAS design paradigms produce architectural bloat that prioritizes superficial complexity which does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.

Our Contributions

A Critical Re-evaluation of the MAS Advantage

We demonstrate through systematic benchmarking that automatic-MAS rarely outperform SAS baselines when accounting for cost-efficiency and baseline strength. CoT-SC frequently achieves higher accuracy at less than 10% of the computational cost.

The SMFR Diagnostic Benchmark

We introduce Synthetic Multi-Hop Financial Reasoning (SMFR), a diagnostic task featuring explicit sub-structures and a gold-standard Expert-MAS to establish an empirical performance upper bound for multi-agent coordination.

Architectural Deconstruction

We provide a rigorous analysis of synthesized MAS workflows across six frameworks, exposing functional collapse where complex automatic-MAS designs revert to basic single-agent execution in practice.

Overview

Figure 1
Figure 1: The Illusion of Multi-Agent Advantage. Theory promises specialization through coordinated multi-agent workflows (left); in practice, automatic-MAS frameworks suffer from redundancy and functional collapse (right), often incurring ≈ 10× the inference cost of CoT-SC for negligible gains (Section 4).

Critical Re-Evaluation of the MAS Advantage

To investigate whether automatic-MAS deliver consistent, cost-effective advantages over strong SAS baselines, we conduct a large-scale audit spanning four benchmarks (GPQA-Diamond, HLE-Maths, SWE-Bench Lite, and BrowseComp-Plus) across four backbone LLMs: GPT-4o, GPT-5, GPT-OSS-120B, and Gemini-2.5-Pro. We evaluate six representative automatic-MAS frameworks covering both inference-time and optimized paradigms: DyLAN, MAS-Zero, ADAS, AFlow, MaAS, and MAS-Orchestra.

Figure 2
Figure 2: The MAS Efficiency Frontier. CoT-SC provides the optimal balance of performance and cost-efficiency across all benchmarks. Automatic-MAS (e.g., ADAS, MAS-Orchestra) frequently incur 10× inference costs versus SAS baselines for negligible gains, except on HLE-Math. GPT-OSS-120B is excluded from SWE-Bench Lite due to consistent formatting failures in code patches.

Key Findings

A Dominant Efficiency Gap

CoT-SC consistently outperforms automatic-MAS frameworks across all benchmarks, frequently achieving higher accuracy at less than 10% of the computational cost. This suggests that architectural complexity is an inefficient substitute for simple stochastic sampling.

Competency Floor for MAS

Significant MAS uplift only occurs on HLE-Math using GPT-5 and Gemini-2.5-Pro, indicating that architectural complexity only yields benefits when the underlying backbone already possesses the high inherent reasoning capabilities necessary to navigate complex coordination.

Challenging Capability Bridging

Our results directly challenge the prevailing assumption that sophisticated orchestration can elevate weaker models to frontier-level performance. Automatic-MAS provide no consistent improvements for mid-tier models (GPT-4o, GPT-OSS), while a single-agent GPT-5 instance using CoT-SC reliably outperforms the most sophisticated GPT-4o-based automatic-MAS frameworks, including ADAS and AFlow, while consuming less than half the total tokens.

Architectural Bloat Across the MAS Ecosystem

The significant performance-cost gap provides empirical evidence for architectural bloat: the sophisticated multi-agent graphs generated by these frameworks do not translate into functional reasoning gains. Instead, they represent a failure of automated search to discover configurations that outperform unstructured scaling, confirming that current MAS designs have yet to move beyond redundant high-cost iterations.

The SMFR Diagnostic Benchmark

CoT-SC outperforms automatic-MAS across all standard benchmarking datasets in both accuracy and cost-effectiveness. However, existing works have critiqued the use of benchmarks predicated on simple input-output flows for evaluating multi-agent coordination. To isolate task suitability as a contributing factor, we introduce the Synthetic Multi-Hop Financial Reasoning (SMFR) dataset: a diagnostic task with explicit sub-structures, context-heavy inputs, and clear opportunities for parallelization and specialization.

Task Construction

Each SMFR problem is built programmatically from historical US equity prices. We first sample stock tickers, transaction types (buy/sell), price types (open/close), the number of investors (2–6), and a target profit/loss percentage (0.1–2.0%). The resulting instance presents an agent with a stock price haystack (historical open/close prices for B companies over a 30-day window) and a set of investor transactions. For each investor, the agent must extract relevant transactions, compute profit and loss to derive a target price, retrieve the matching date(s), and finally aggregate across investors to select the winner according to an earliest- or latest-date criterion.

The pipeline is deliberately parallelizable per investor and resists shortcut strategies: correct solutions require multi-step context extraction and numerical reasoning rather than pattern matching. Problems are generated from real price distributions but remain immune to data contamination, since each instance is a synthetic composition. The dataset contains 588 test samples and 16 validation samples, balanced across transaction types, aggregation logic, and target percentages.

Figure 3
Figure 3: SMFR task construction pipeline. Each problem samples stock data and investor parameters, then follows a multi-hop workflow of context extraction, calculation, and aggregation. The per-investor stages (highlighted) are explicitly parallelizable.

Expert-MAS Baseline

To establish a competitive reference upper bound, we design an Expert-MAS that enforces a strict separation between context processing and logical control. A Meta-Agent first parses the problem topology (question type, investor names, target percentage) into a structured schema. A deterministic Python executor then orchestrates specialized sub-agents in parallel: an ExtractorAgent retrieves transaction and price data from the haystack, a CalculatorAgent computes portfolio value and target prices, and a second ExtractorAgent locates matching dates. An Aggregator agent sorts results and selects the winning investor. By offloading coordination to deterministic code, Expert-MAS eliminates the orchestration noise that plagues automatic-MAS designs.

Figure 4
Figure 4: Expert-MAS architecture for SMFR. A Meta-Agent parses the problem schema, then per-investor Extractor–Calculator–Extractor pipelines run in parallel before an Aggregator produces the final answer.

Evaluation Results

We evaluate CoT-SC, six automatic-MAS frameworks, and Expert-MAS on SMFR across GPT-4o, GPT-OSS-120B, and GPT-5. If task suitability were the primary bottleneck for automatic-MAS, we would expect them to close the gap on this explicitly agentic benchmark. Instead, automatic-MAS consistently fail to surpass CoT-SC efficiency, while Expert-MAS achieves substantially better accuracy-cost trade-offs on stronger backbones.

Figure 5
Figure 5: Automatic-MAS consistently fail to surpass CoT-SC efficiency on SMFR. Expert-MAS achieves superior accuracy-cost trade-offs except on GPT-4o, where base-model reasoning limitations bottleneck performance regardless of orchestration. Gemini-2.5-Pro is omitted due to non-viable MAS cost multipliers (> 10×).

Task Suitability Is Not the Limiting Factor

Despite explicit agentic requirements (multi-step planning, state tracking, and long-context retrieval), automatic-MAS consistently fail to surpass CoT-SC efficiency on SMFR, proving that task suitability is not the root cause of their underperformance. Instead, these systems incur massive computational overhead without proportional gains. Expert-MAS, by contrast, achieves substantial improvements with cost comparable to CoT-SC: GPT-OSS improves from 22.4% to 36.0%, while GPT-5 jumps from 49.1% to a near-perfect 96.7%.

A Threshold Competency for MAS

The sole exception is GPT-4o, where persistent calculation and retrieval failures bottleneck the system regardless of orchestration. This reinforces our finding that MAS require a threshold baseline competency to be effective. Thus, while the multi-agent paradigm is fundamentally viable, current automatic-MAS frameworks fail to exploit task-specific opportunities effectively or economically.

Architectural Deconstruction

While our results establish a clear efficiency gap between single- and multi-agent systems, they do not reveal whether the internal mechanisms of MAS (role specialization, consensus, routing) provide latent benefits that justify their complexity. To address this, we deconstruct the generated architectures and investigate whether their features contribute meaningfully to the reasoning process. We find that in most automatic-MAS workflows, these mechanisms are either sub-optimal or purely decorative rather than manifestations of emergent intelligence.

(a) DyLAN: Functional Collapse and Structural Redundancy

Figure 9
Figure 9: Highest ranked agents by importance score for different role settings in DyLAN. Results use GPT-4o and GPT-5 as backbone models on GPQA-Diamond. Plots reveal positional bias towards the first agent regardless of assigned role or backbone model.

DyLAN posits that performance is driven by agent diversity, yet this fails to manifest in practice. Agents reach immediate, unanimous consensus in ~70% of GPT-4o cases and >90% of GPT-5 cases, effectively functioning as a unanimous CoT-SC baseline rather than a dynamic negotiation. When interaction does occur, task-specific roles provide no marginal utility: an all-assistant configuration achieved better accuracy than task-specific experts (54.4% vs. 53.4%).

(b) MAS-Zero: Verifier Bias and Consensus Collapse

Figure 6
Figure 6: Judge model selection frequency of MAS-Zero across four datasets, using GPT-4o (blue) and GPT-5 (orange) as both worker and verifier. Indices 0–3 correspond to four fundamental reasoning paradigms: vanilla CoT, CoT-SC, Reflexion, and Debate. Indices 4–8 represent the subsequent 5 rounds of multi-agent organization search.

The verifier disproportionately favors earlier entries in the context window: GPT-4o selects the initial block in over 45% of instances. Outputs from later search rounds are rarely selected, accounting for less than 15% of final decisions, effectively turning subsequent worker agents into “expensive witnesses” that incur full inference costs while exerting near-zero causal influence on the output.

(c) ADAS: Convergence on Heuristic Search Artifacts

Figure 7
Figure 7: ADAS (GPT-5) validation accuracies on different seeded runs on GPQA-Diamond. Search dynamics are non-monotonic; accuracy frequently peaks early and subsequently regresses.

Our motif analysis reveals that the primary positive signal originates from Self-consistency motifs; specialized coordination motifs yield negligible gains. On GPQA-Diamond, architectures incorporating Self-consistency achieved a mean accuracy of 82.19% (+1.34% over the global average). ADAS functions as a heuristic explorer rather than a principled optimizer, where performance gains reflect stochastic “lucky” iterations rather than structural evolution.

(d) AFlow: Degeneration to Trivial Ensembles

Figure 10
Figure 10: The final MAS workflows generated by AFlow on GPQA Diamond and SMFR degenerate to trivial ensembling rather than sophisticated multi-agent orchestration.

Across 14 final workflows generated by GPT-4o, GPT-5, and GPT-OSS-120B on five datasets, 50% (7/14) adopted a structure that simply iterates a single custom prompt three times before aggregation, which is functionally identical to standard CoT-SC; four of these underperformed the CoT-SC baseline. This confirms that automated search often converges on rediscovering CoT-SC-style sampling under more complex labels rather than inventing novel multi-agent strategies.

(e) MaAS: Incentive Misalignment in Dynamic Routing

Table 1
Table 1: Operator Activation Distribution (%) of MaAS (GPT-5). On context-heavy BrowseComp-Plus, the controller collapses to I/O calls due to cost-dominated optimization; on GPQA-Diamond, it spreads calls more evenly but fails to outperform CoT-SC.

With highly capable base models, the accuracy gradient flattens to ~1/K, causing the controller to ignore task-specific logic and collapse into two distinct failure modes: (i) Cost-Minimizing Collapse on BrowseComp-Plus (74.2% trivial I/O calls) and (ii) Stochastic Stalling on GPQA-Diamond, where negligible cost differentials trap the controller in its initialized near-uniform distribution.

(f) MAS-Orchestra: Policy Collapse in Dynamic Orchestration

Table 2
Table 2: Agent Selection Distribution (%) by MAS-Orchestra Across Datasets.

The system largely ignores its diverse agent pool, converging on a rigid binary preference for high-overhead Debate and Reflexion agents. The orchestrator fails to scale agent complexity to task difficulty: despite GPQA-Diamond posing a lower reasoning ceiling than HLE-Math, the system exhibits a higher reliance on Debate agents for the former (84.9%) than the latter (79.2%). These behaviors confirm that automated orchestrators do not learn task-adaptive strategies, but instead settle into static, greedy local minima.

Discussion

Our evaluation reveals a systematic divergence between the theoretical complexity of MAS frameworks and their empirical execution. While intended to foster emergent collaboration, current automatic-MAS paradigms frequently result in mechanistic trivialization: sophisticated architectural designs collapse into elementary single-agent workflows.

The Ensembling Trap

A primary driver of this collapse is the reliance on CoT and CoT-SC as the fundamental building blocks of MAS. While these primitives ensure generalization and leverage ensembling effects, the resulting architectures fail to implement them efficiently. Rather than synergistic coordination, frameworks like AFlow and ADAS settle into structural degeneration, rediscovering basic ensembling motifs under the guise of optimized multi-agent graphs. The ~10× increase in inference cost thus buys little more than a redundant, poorly routed approximation of standard CoT-SC.

Towards Mechanistic Interpretability of MAS

As model capability scales, the MAS advantage further erodes due to two reinforcing factors. Signal Saturation: in models like GPT-5, accuracy gradients flatten, causing controllers (MaAS) to lose the signal necessary for nuanced routing, collapsing into cheap shortcuts or static policies. Positional and Primacy Biases: verifiers and controllers (MAS-Zero, DyLAN) disproportionately favor early reasoning steps, effectively terminating the multi-agent benefit before interaction can occur. The success of Expert-MAS on SMFR reinforces that multi-agent coordination excels only when architectures are specifically engineered to exploit parallelizable sub-problems or context protection. Future research should pivot away from black-box automated graph generation, which defaults to redundant ensembling, and toward the mechanistic interpretability of agent interactions. We argue that to move beyond creating “expensive witnesses,” MAS must be evaluated on their structural fidelity: the degree to which assigned agentic roles exert measurable causal influence on the final decision. Without such grounding, increased architectural complexity serves only to mask computational inefficiency.

Conclusion

Our systematic evaluation identifies a critical efficiency gap in modern MAS design, where architectural complexity often masks a fundamental functional collapse into simpler, stochastic baselines. By introducing the SMFR benchmark and isolating the mechanistic failures of six major frameworks, we provide a roadmap for more principled, cost-effective agentic design. Our architectural deconstruction reveals that current automatic-MAS workflows frequently degenerate into redundant ensembling loops functionally identical to CoT-SC. Ultimately, moving beyond “expensive witnesses” requires a pivot from black-box graph searching towards architectures grounded in verifiable task decomposition and causal role-alignment.

BibTeX

@misc{xx
}