McKinsey: Why Most AI Agents Never Reach Production
Every company now has a story about an agent that looked brilliant in a workshop, stunned a steering committee, and then disappeared before launch. That is why the line “95% of corporate agents never reach prod” feels true even if the exact share shifts by sector. The pattern is easy to see in the data: McKinsey found that only 11% of companies worldwide were using gen AI at scale, and in operations just 3% of surveyed organizations had scaled a gen AI use case. MIT research points to the same gap from another angle: 62% of firms were still in the first two AI maturity stages, where financial performance sat below industry averages.
Meet Northstar Claims Assistant
Let’s walk through one fictional project.
A regional insurer launches Northstar Claims Assistant, an agent built to help adjusters read incident reports, pull policy language, draft claimant emails, and recommend next actions. In week one, the prototype works on a neat sample set. Leaders see a faster claims cycle, lower handling cost, and happier staff. The team gets applause, a budget, and a target to go live in six months.
Six months later, Northstar is stuck in a folder named pilot_v7_final_FINAL.
Why?
Failure 1: The company treated a demo as a business case
The first crack shows up early. Northstar was approved because the demo looked sharp, not because the business chose one narrow, high-value workflow with clear economics. MIT CISR found that 28% of firms were still in the "experiment and prepare" stage and 34% in "build pilots and capabilities." Together, that means 62% had not yet reached enterprise-scale AI ways of working. Those stage-one and stage-two firms also trailed their industries on growth and profit, while stage-three and stage-four firms moved above industry average.
That is the first reason so many agents die: companies confuse proof of concept with proof of value. Northstar could answer questions. It could not yet prove where margin, cycle time, leakage, or customer retention would improve.
Failure 2: The data layer was not ready
Once the team moved past the demo, Northstar met the company’s real environment: duplicate policy files, missing claims notes, messy document naming, and access rules no one had cleaned up in years. The agent was smart; the plumbing was not.
MIT Sloan reported that 57% of chief data officers said they had not made the necessary changes to their data strategy to support generative AI. In the same research, 46% said data quality and choosing the right use cases were the biggest roadblocks, while 93% said data strategy was crucial to getting value from gen AI.
Northstar failed here because the team started with the model and postponed the data work. In corporate settings, that order usually ends badly.
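The data work does not have to be exotic, either. Before any retrieval index is built, a basic readiness check can surface exactly the problems described above. Below is a minimal sketch of that idea; the metadata fields (doc_id, policy_id, claim_notes, last_access_review) are illustrative assumptions, not Northstar's real schema.

```python
from collections import Counter
from datetime import datetime, timedelta

# Illustrative document metadata; field names are assumptions for the sketch,
# not the insurer's actual claims schema.
docs = [
    {"doc_id": "CLM-1001", "policy_id": "P-88", "claim_notes": "rear-end collision", "last_access_review": "2021-03-01"},
    {"doc_id": "CLM-1001", "policy_id": "P-88", "claim_notes": "rear-end collision", "last_access_review": "2021-03-01"},
    {"doc_id": "CLM-1002", "policy_id": None,   "claim_notes": "",                   "last_access_review": "2019-07-15"},
]

def readiness_report(docs, review_window_days=365):
    """Flag the plumbing problems that sink retrieval quality before any model is involved."""
    dup_ids = [d for d, n in Counter(x["doc_id"] for x in docs).items() if n > 1]
    missing_policy = [x["doc_id"] for x in docs if not x["policy_id"]]
    empty_notes = [x["doc_id"] for x in docs if not x["claim_notes"].strip()]
    cutoff = datetime.now() - timedelta(days=review_window_days)
    stale_access = [
        x["doc_id"] for x in docs
        if datetime.strptime(x["last_access_review"], "%Y-%m-%d") < cutoff
    ]
    return {
        "duplicate_doc_ids": dup_ids,
        "missing_policy_link": missing_policy,
        "empty_claim_notes": empty_notes,
        "stale_access_reviews": stale_access,
    }

print(readiness_report(docs))
```

A report like this, run before model selection rather than after the pilot stalls, turns "the data is messy" into a concrete backlog someone can own.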
Failure 3: The workflow never changed
The agent could draft summaries, but adjusters still had to swivel across five systems, copy text into an old claims platform, and ask supervisors for approval through email. The tool was added on top of the old process rather than built into a new one.
McKinsey’s 2025 state-of-AI survey found that workflow redesign had the biggest effect on whether organizations saw EBIT impact from gen AI. Yet only 21% of respondents said their organizations had fundamentally redesigned at least some workflows.
This is where many executive teams get the math wrong. They fund an agent but not the operating change around it. Northstar did not need a clever prompt library. It needed a rewritten claims process.
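What "built into a new process" might look like is mundane: the agent's draft lands directly in the claims platform and a supervisor approval queue, instead of being copied by hand between five tools. The sketch below is hypothetical; the agent, claims_platform, and approval_queue objects stand in for whatever the insurer actually runs, and only the shape of the handoff matters.

```python
from dataclasses import dataclass

@dataclass
class ClaimDraft:
    claim_id: str
    summary: str
    recommended_action: str

def process_claim(claim_id, incident_report, agent, claims_platform, approval_queue):
    """Redesigned step: agent output flows straight into the system of record
    and a review queue, with no manual copy-paste between systems."""
    draft = ClaimDraft(
        claim_id=claim_id,
        summary=agent.summarize(incident_report),          # hypothetical agent interface
        recommended_action=agent.recommend(incident_report),
    )
    claims_platform.attach_draft(draft)                     # lands where adjusters already work
    approval_queue.submit(draft, reviewer="supervisor")     # replaces approval-by-email
    return draft
```

The point is not the code; it is that someone has to redesign and fund the step it represents.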
Failure 4: Governance arrived late, after trust was already gone
Then legal reviewed the pilot. Compliance asked who owned retrieval rules, what sources were approved, which outputs required human signoff, and how the system handled bias, privacy, and error logging. No one had full answers.
MIT Sloan Management Review and BCG reported that 70% of respondents acknowledged at least one AI system failure. In separate MIT Sloan Management Review research, 82% agreed that responsible AI should be a top management agenda item, but only 55% said it actually was. McKinsey also found that only 28% of respondents said their CEO oversaw AI governance, even though CEO oversight was one of the factors most strongly correlated with bottom-line impact.
Northstar did not fail because governance existed. It failed because governance showed up as a brake instead of as design input from day one.
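Had those answers existed on day one, much of them could have lived in a declarative policy the agent enforces at runtime. A hypothetical sketch follows; the source names, action types, and policy fields are assumptions made for illustration, not a compliance framework.

```python
# Hypothetical governance policy checked before any output is released.
GOVERNANCE_POLICY = {
    "approved_sources": {"policy_library", "claims_history"},        # retrieval allowlist
    "human_signoff_required": {"claim_denial", "settlement_offer"},  # actions a human must approve
    "log_every_output": True,
}

def release_output(action_type, sources_used, output, audit_log):
    """Block unapproved sources, route sensitive actions to a human,
    and log everything so compliance can audit decisions later."""
    unapproved = set(sources_used) - GOVERNANCE_POLICY["approved_sources"]
    if unapproved:
        audit_log.append({"action": action_type, "blocked": True,
                          "reason": f"unapproved sources: {sorted(unapproved)}"})
        raise ValueError(f"Output used unapproved sources: {sorted(unapproved)}")
    needs_human = action_type in GOVERNANCE_POLICY["human_signoff_required"]
    if GOVERNANCE_POLICY["log_every_output"]:
        audit_log.append({"action": action_type, "needs_human": needs_human, "output": output})
    return {"output": output, "status": "pending_review" if needs_human else "released"}
```

Written this early, governance becomes a design constraint the build team works within rather than a review gate it hits at the end.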
Failure 5: No scaling road map, no KPI discipline
The pilot team celebrated user quotes such as “this feels faster,” but it never locked in hard measures. No one tracked assisted handle time, appeal rate, settlement accuracy, supervisor overrides, or the cost per closed claim. No phased rollout plan existed either. The project stayed stuck between excitement and accountability.
McKinsey found that less than one-third of respondents said their organizations were following most of twelve adoption and scaling practices for gen AI. It also found that less than one in five were tracking well-defined KPIs for gen AI solutions.
That stat explains a lot of dead agents. If value is not measured, scale becomes a matter of opinion. Opinion rarely beats budget pressure.
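Hard measures do not require much machinery. Below is a minimal sketch of per-claim event logging that would have made the scale decision a numbers question; the metric names mirror the ones listed above, and the sample values are invented for illustration.

```python
import statistics

# One record per closed claim; the fields mirror the measures the pilot never tracked.
claims = [
    {"handle_minutes": 42, "agent_assisted": True,  "supervisor_override": False, "appealed": False, "cost": 310},
    {"handle_minutes": 67, "agent_assisted": False, "supervisor_override": False, "appealed": True,  "cost": 455},
    {"handle_minutes": 39, "agent_assisted": True,  "supervisor_override": True,  "appealed": False, "cost": 298},
]

def kpi_snapshot(claims):
    """Compare assisted and unassisted claims on the KPIs that decide whether to scale."""
    assisted = [c for c in claims if c["agent_assisted"]]
    manual = [c for c in claims if not c["agent_assisted"]]
    return {
        "assisted_handle_time": statistics.mean(c["handle_minutes"] for c in assisted),
        "manual_handle_time": statistics.mean(c["handle_minutes"] for c in manual),
        "supervisor_override_rate": sum(c["supervisor_override"] for c in assisted) / len(assisted),
        "appeal_rate": sum(c["appealed"] for c in claims) / len(claims),
        "cost_per_closed_claim": statistics.mean(c["cost"] for c in claims),
    }

print(kpi_snapshot(claims))
```

With a snapshot like this reviewed every sprint, "this feels faster" turns into a trend line that can survive a budget discussion.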
Failure 6: The operating model split ownership into pieces
IT owned the stack. Claims owned the process. Legal owned approval. Data owned access. Security owned model risk. No single leader owned the outcome. Northstar became everyone’s priority in slides and no one’s priority in practice.
McKinsey found that risk and compliance, along with data governance, were often centralized, while tech talent and adoption were more often handled in hybrid models. That can work, but only when the handoffs are tight and the road map is explicit. MIT CISR's maturity model puts scalable architecture, reuse, dashboards, and test-and-learn habits at stage three. Northstar never got there.
Why the 95% line keeps ringing true
Call the number 95%, 90%, or “almost all of them.” The point is the same. Corporate agents usually die in the handoff from demo to disciplined execution. McKinsey found that 45% of finance functions were piloting gen AI, but only 6% had achieved scale. In service operations, only 3% had scaled a use case. Those are not model problems. They are company problems.
What would have saved Northstar
Northstar had a path to prod, but it was boring compared with the demo. It needed one claims workflow, cleaner governed data, a named executive owner, human-review rules, KPI tracking, and a phased rollout tied to business value. That is also what the MIT and McKinsey numbers keep saying: firms that move past experimentation build reuse, process change, management discipline, and top-level accountability. The agent is rarely the hard part. The company usually is.