There’s a paper making the rounds right now with a title that should make every contact center leader pause: “Agents of Chaos.”1 It was published in February 2026 by a team of 38 AI researchers from Northeastern University and Harvard, led by Natalie Shapira, David Bau, and Tomer Ullman. They did something that most AI vendors would never willingly do: they gave autonomous AI agents real tools — persistent memory, email accounts, Discord access, file systems, and shell execution — then watched and tested them for two weeks under both normal and adversarial conditions.
What happened is exactly what anyone who’s operated AI at production scale would expect. And it’s devastating to anyone who hasn’t.
The Agent Hype Has a Structural Hole
The AI agent market is projected to grow from $7.8 billion in 2025 to over $50 billion by 2030. Every vendor with a large language model is selling the dream: autonomous agents that replace workflows, automate departments, and run business processes end-to-end. Cool pitch. One question: what happens when they break?
Because they do break. Predictably, silently, and often catastrophically. The “Agents of Chaos” study documented eleven representative case studies of dangerous behaviour. These weren’t theoretical risks. They were observed failures in a controlled lab with twenty experienced AI researchers deliberately probing the systems. The failure modes included unauthorised action execution, sensitive data leakage, identity spoofing, cross-agent propagation of unsafe practices, and partial system takeover.
This isn’t an edge case report from a fringe lab. This is Northeastern and Harvard, with contributors from UBC and other major institutions, publishing evidence that the default behaviour of autonomous LLM agents under real conditions is to fail in ways that are dangerous, silent, and compounding.
Five Failure Modes That Should Keep You Up at Night
The paper documents a range of failures, but five patterns stand out as particularly relevant for enterprises considering agentic AI in production:
1. Hallucinated Actions
This one’s wilder than simple hallucination. The agent doesn’t just make up facts. It tells you it completed tasks it never started. It references data it never touched. The paper states explicitly that “agents reported task completion while the underlying system state contradicted those reports.” No caveats, no uncertainty — just a clean status report full of fabricated progress. The agent isn’t lying in any deliberate sense. It genuinely “believes” it did the work. That’s somehow worse, because it means there’s no signal to catch.
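The practical consequence is that an agent’s own status report is not evidence; the only reliable signal is the system state itself. Here is a minimal sketch of that kind of ground-truth check. The report format and the `expected_files` convention are hypothetical illustrations, not something from the paper or any specific framework:

```python
from pathlib import Path

def verify_completion(agent_report: dict, workspace: Path) -> list[str]:
    """Compare what the agent says it did against what actually happened.

    `agent_report` is assumed (purely for illustration) to look like:
        {"task": "export Q3 summaries", "status": "done",
         "expected_files": ["q3_summary.csv", "q3_notes.md"]}
    """
    discrepancies = []
    if agent_report.get("status") == "done":
        for name in agent_report.get("expected_files", []):
            if not (workspace / name).exists():
                # The agent reported completion, but the artefact never
                # appeared: the "clean status report, contradicted system
                # state" failure the paper describes.
                discrepancies.append(f"reported but missing: {name}")
    return discrepancies
```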
2. Cascading Failures
A small early mistake compounds through every subsequent step, making the issue progressively harder to diagnose. The output looks polished and thorough, but it’s built on cracked foundations. You’d need to trace back through every decision to find where things went wrong. Nobody does that until something fails spectacularly. The paper documents “cross-agent propagation of unsafe practices” and compounding failures across multi-party interactions.
3. Authority Escalation
The agent interprets its instructions too broadly and gives itself authority to take actions it shouldn’t. “Organise my Google Drive” gets interpreted as “delete the messy folders.” Technically, the Drive is tidy now. The paper explicitly documents agents complying with instructions from non-owners, executing destructive system-level actions, and interpreting delegated authority too broadly. In a contact center context, this is the equivalent of an agent unilaterally issuing refunds, escalating complaints to legal, or changing customer account settings without authorisation.
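A common way to contain this, sketched below in deliberately generic terms (the tool names and the allowlist are hypothetical illustrations, not anything the paper or a specific product prescribes): the agent may only call tools explicitly granted for the task, and destructive tools are held for human confirmation rather than executed on an inferred mandate.

```python
# Hypothetical per-task grant: anything outside it is refused outright.
GRANTED_TOOLS = {"list_folders", "create_folder", "move_file", "delete_folder"}
# Irreversible or customer-affecting actions always need a human sign-off.
DESTRUCTIVE_TOOLS = {"delete_folder", "issue_refund", "change_account_settings"}

def authorise(tool: str, human_confirmed: bool = False) -> bool:
    if tool not in GRANTED_TOOLS:
        print(f"refused: '{tool}' was never granted for this task")
        return False
    if tool in DESTRUCTIVE_TOOLS and not human_confirmed:
        print(f"held for approval: '{tool}' is destructive")
        return False
    return True

assert authorise("move_file") is True        # routine reorganisation proceeds
assert authorise("delete_folder") is False   # destructive step waits for a human
assert authorise("issue_refund") is False    # never granted, refused outright
```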
4. Identity Spoofing and Trust Exploitation
Agents impersonated other agents or users to gain access to resources and information they shouldn’t have had. In multi-agent environments, one compromised or poorly configured agent can corrupt the behaviour of others. This is the agentic equivalent of a supply chain attack, and most enterprise architectures have zero defences against it.
5. Silent Confidence
Across all failure modes, one pattern dominated: the agents didn’t flag their own failures. They didn’t express uncertainty. They didn’t escalate to a human. They just kept going, producing polished, confident output that was wrong, dangerous, or both. MIT research2 published in early 2025 found that LLMs are 34% more likely to use confident language when generating incorrect information. The more wrong the AI, the more certain it sounds.
The Compound Reliability Problem Nobody Talks About
Here’s the arithmetic that collapses the entire agent pitch. An AI agent with 95% accuracy per step — which sounds excellent — succeeds on a 10-step workflow only about 60% of the time. Stretch the workflow to 20 steps and the success rate falls to 36%: nearly two out of three runs fail. At 90% per-step accuracy, you’re down to 35% over just 10 steps. At 85%, it’s 20%. Four out of five runs fail.
These numbers aren’t theoretical. A recent O’Reilly analysis3 framed them through Lusser’s Law: in a system of independent sequential components, overall reliability is the product of the individual reliabilities. Every additional step multiplies the failure risk. The compound reliability problem isn’t a footnote in the research. It’s the central fact about autonomous agents that the industry is ignoring.
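The calculation itself takes a few lines. A quick sketch, in plain Python with no agent framework assumed, of how per-step reliability compounds across a workflow:

```python
# Lusser's Law for a sequential workflow: if steps fail independently,
# the chance every step succeeds is the product of per-step reliabilities.
def workflow_success_rate(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

for accuracy in (0.95, 0.90, 0.85):
    for steps in (3, 10, 20):
        rate = workflow_success_rate(accuracy, steps)
        print(f"{accuracy:.0%} per step over {steps:>2} steps: "
              f"{rate:.0%} of runs finish without error")

# 95% per step: ~86% over 3 steps, ~60% over 10, ~36% over 20.
# 90% per step: ~35% over 10 steps. 85% per step: ~20% over 10 steps.
```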
This is why the demo-to-production gap isn’t a scaling challenge. It’s a mathematical certainty. The agent that performs flawlessly in a three-step demo is guaranteed to fail on most runs once the workflow grows to production complexity.
The Industry Data Confirms the Problem
The “Agents of Chaos” findings don’t exist in isolation. They’re consistent with a growing body of evidence:
- Gartner4 predicts that over 40% of agentic AI projects will be cancelled or fail to reach production by 2027, citing escalating costs, unclear business value, and inadequate risk controls.
- McKinsey’s 2025 State of AI report5 found that while 88% of organisations use AI, only approximately 6% qualify as high performers capturing significant enterprise value. The vast majority remain stuck in pilot mode.
- The International AI Safety Report 20266, authored by over 100 experts from 30+ countries and led by Turing Award winner Yoshua Bengio, concluded that AI agents “pose heightened risks because they act autonomously, making it harder for humans to intervene before failures cause harm.”
- 80% of IT professionals report witnessing AI agents perform unauthorised or unexpected actions7. Only 21% of executives report complete visibility into agent permissions, tool usage, or data access patterns.
The pattern is unmistakable. The industry is shipping autonomous agents while simultaneously documenting that those agents fail in predictable, dangerous ways.
There Is an Architecture That Prevents This
At Omilia, we read the “Agents of Chaos” paper and recognised every failure mode. Not because we’ve experienced them in production. Because we designed our architecture specifically to make them impossible.
The core insight is this: LLMs are extraordinary thinking machines. They’re terrible production execution systems. Put an LLM in the live customer interaction path and you inherit every vulnerability documented in this paper: hallucinated actions, cascading failures, authority escalation, silent confidence in wrong outputs.
Omilia’s approach separates thinking from doing:
The Thinking Layer: LLM Agents That Earn Organisational Approval
Thinking LLM agents analyse thousands of real customer-agent conversations offline. Not synthetic data. Not scripted scenarios. Real, messy, ambiguous interactions where customers interrupt, contradict themselves, and ask things nobody anticipated. The LLM agents’ job: deep-think about what actually works. Which approaches resolve issues? Where do conversations go wrong? What’s the optimal path through a billing dispute or a claims inquiry?
This is where LLMs are genuinely brilliant. Complex reasoning across massive, unstructured datasets. And critically, the strategies they discover go through clear organisational approval before anything reaches a live customer. No hallucination risk, because nothing here touches production until humans have validated it.
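In outline, the pattern looks something like the sketch below. It is an illustrative simplification of the idea (offline analysis, a named human approval gate, nothing reaching production unreviewed); every identifier in it is hypothetical, not Omilia’s actual code or API.

```python
from dataclasses import dataclass, field

@dataclass
class ProposedStrategy:
    intent: str                     # e.g. "billing_dispute"
    steps: list[str]                # ordered, human-readable resolution steps
    evidence: list[str] = field(default_factory=list)  # transcript IDs reviewed
    approved_by: str | None = None  # set only by a human reviewer, never the model

def propose_strategies(transcripts: list[str]) -> list[ProposedStrategy]:
    """Offline: LLM agents deep-think over real, historical conversations
    and draft candidate strategies. Placeholder body; nothing here ever
    touches a live customer interaction."""
    return []

def approve(strategy: ProposedStrategy, reviewer: str) -> ProposedStrategy:
    """The organisational gate: a strategy only becomes deployable once a
    named human has signed off on it."""
    strategy.approved_by = reviewer
    return strategy
```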
The Execution Layer: Deterministic, Reliable, Efficient
Once the thinking is done and the strategy is approved, execution no longer requires deep thinking. It can be done efficiently, safely, and, most importantly, in a 100% reliable manner that corporate customers can trust.
Purpose-built SLM-based Mini Apps execute the approved strategies in production. Deterministic behaviour. Full audit trail. Zero hallucinations in the customer path. Every decision traceable. Every response explainable. Naturally effective and 100% reliable.
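A deliberately simplified sketch of what that property means in practice (hypothetical names and data, not the Mini App implementation itself): every production decision is a lookup against a strategy a human already approved, and every lookup leaves a record.

```python
import json
from datetime import datetime, timezone

# Only strategies that passed human approval are ever loaded here.
APPROVED_STRATEGIES: dict[str, list[str]] = {
    "billing_dispute": ["verify identity", "pull last three invoices",
                        "offer an itemised review"],
}

def handle(intent: str, audit_log: list[dict]) -> list[str]:
    """Deterministic execution: the same intent maps to the same approved
    steps every time, and unknown intents escalate instead of improvising."""
    timestamp = datetime.now(timezone.utc).isoformat()
    steps = APPROVED_STRATEGIES.get(intent)
    if steps is None:
        audit_log.append({"ts": timestamp, "intent": intent,
                          "decision": "escalate_to_human"})
        return []
    audit_log.append({"ts": timestamp, "intent": intent,
                      "decision": "execute", "strategy_steps": steps})
    return steps

log: list[dict] = []
handle("billing_dispute", log)
handle("unrecognised_request", log)
print(json.dumps(log, indent=2))  # the decision trail a vendor should be able to show
```

The point isn’t these few lines of Python. The point is that the answer to “why did the system respond that way?” is a record you can read, not a probability you can only estimate.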
Think About How You Walk
The principle is intuitive once you see it. A human first needs to deep-think and decide where they need to get to. That’s the cognitive work: weighing options, evaluating routes, making a decision. But once the decision is made, the thousands of steps to get there are a mechanistic process that must execute as efficiently and reliably as possible.
Nobody consciously deliberates every single step of a walk. If you tried to analyse each step as you took it, the most likely outcome would be a broken limb. The thinking and the walking are fundamentally different tasks that require fundamentally different systems.
That’s exactly how Omilia’s architecture works. LLM agents do the deep thinking offline. SLM-based Mini Apps do the walking in production. Trying to make LLMs do both is the equivalent of trying to philosophically deliberate every muscle movement while walking down a staircase. The result is predictable, and it’s what the “Agents of Chaos” paper documents.
Results That Speak Louder Than Words
The results for Omilia speak louder than any whitepaper: over one billion customer interactions processed annually across more than 130 Fortune 200 customers, with better-than-human accuracy and predictable cost. While others fumble through the failure modes this paper documents, Omilia delivers ultimate consistency at scale. Not because we have better LLMs. Because we have better architecture.
The One Question Every Enterprise Should Ask
Before deploying any agentic AI system in production, ask your vendor one question: can you show me the decision logic behind any specific production response?
If they can’t, it’s a black box. And you’re running the experiment this paper describes. On your customers.
The enterprises that will thrive in the agentic era won’t be the ones with the most autonomous agents. They’ll be the ones with the most disciplined architecture: deep thinking where it belongs, deterministic execution where it matters, and an audit trail that proves it.
About the Author
John Nikolaidis, Co-Founder
John Nikolaidis is Co-Founder and Managing Director of Omilia, a leading provider of conversational AI solutions for enterprise contact centers.
Footnotes


