McKinsey: Why Most AI Agents Never Reach Production

Every company now has a story about an agent that looked brilliant in a workshop, stunned a steering committee, and then disappeared before launch. That is why the line “95% of corporate agents never reach prod” feels true even if the exact share shifts by sector. The pattern is easy to see in the data: McKinsey found that only 11% of companies worldwide were using gen AI at scale, and in operations just 3% of surveyed organizations had scaled a gen AI use case. MIT research points to the same gap from another angle: 62% of firms were still in the first two AI maturity stages, where financial performance sat below industry averages.

Published on April 12, 2026

Meet Northstar Claims Assistant

Let’s walk through one fictional project.

A regional insurer launches Northstar Claims Assistant, an agent built to help adjusters read incident reports, pull policy language, draft claimant emails, and recommend next actions. In week one, the prototype works on a neat sample set. Leaders see a faster claims cycle, lower handling cost, and happier staff. The team gets applause, a budget, and a target to go live in six months.

Six months later, Northstar is stuck in a folder named pilot_v7_final_FINAL.

Why?

Failure 1: The company treated a demo as a business case

The first crack shows up early. Northstar was approved because the demo looked sharp, not because the business chose one narrow, high-value workflow with clear economics. MIT CISR found that 28% of firms were still in the “experiment and prepare” stage and 34% in “build pilots and capabilities.” Together, that puts 62% of firms short of true enterprise-scale AI ways of working. Those stage-one and stage-two firms also trailed their industries on growth and profit, while stage-three and stage-four firms moved above industry average.

That is the first reason so many agents die: companies confuse proof of concept with proof of value. Northstar could answer questions. It could not yet prove where margin, cycle time, leakage, or customer retention would improve.

Failure 2: The data layer was not ready

Once the team moved past the demo, Northstar met the company’s real environment: duplicate policy files, missing claims notes, messy document naming, and access rules no one had cleaned up in years. The agent was smart; the plumbing was not.

MIT Sloan reported that 57% of chief data officers said they had not made the necessary changes to their data strategy to support generative AI. In the same research, 46% said data quality and choosing the right use cases were the biggest roadblocks, while 93% said data strategy was crucial to getting value from gen AI.

Northstar failed here because the team started with the model and postponed the data work. In corporate settings, that order usually ends badly.
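To make "data work first" concrete, here is a minimal sketch of a readiness check a team could run before touching the model. It is illustrative only: the record layout (`policy_id`, `notes`) and the function name `readiness_report` are assumptions, not part of any real Northstar system.

```python
from collections import Counter


def readiness_report(records: list[dict]) -> dict[str, float]:
    """Share of records with duplicate policy IDs or missing claims notes.

    The record fields (policy_id, notes) are hypothetical.
    """
    if not records:
        raise ValueError("no records to check")
    n = len(records)
    id_counts = Counter(r["policy_id"] for r in records)
    # A record counts as duplicated if its policy ID appears more than once.
    duplicated = sum(1 for r in records if id_counts[r["policy_id"]] > 1)
    # Notes that are absent, None, or only whitespace count as missing.
    missing = sum(1 for r in records if not (r.get("notes") or "").strip())
    return {
        "duplicate_share": duplicated / n,
        "missing_notes_share": missing / n,
    }


sample_records = [
    {"policy_id": "P-1", "notes": "water damage"},
    {"policy_id": "P-1", "notes": None},
    {"policy_id": "P-2", "notes": "  "},
    {"policy_id": "P-3", "notes": "hail claim"},
]
report = readiness_report(sample_records)
print(report)
```

Numbers like these give the team a baseline to fix before anyone judges the agent's answers.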

Failure 3: The workflow never changed

The agent could draft summaries, but adjusters still had to swivel across five systems, copy text into an old claims platform, and ask supervisors for approval through email. The tool was added on top of the old process rather than built into a new one.

McKinsey’s 2025 state-of-AI survey found that workflow redesign had the biggest effect on whether organizations saw EBIT impact from gen AI. Yet only 21% of respondents said their organizations had fundamentally redesigned at least some workflows.

This is where many executive teams get the math wrong. They fund an agent but not the operating change around it. Northstar did not need a clever prompt library. It needed a rewritten claims process.

Failure 4: Governance arrived late, after trust was already gone

Then legal reviewed the pilot. Compliance asked who owned retrieval rules, what sources were approved, which outputs required human signoff, and how the system handled bias, privacy, and error logging. No one had full answers.

MIT Sloan Management Review and BCG reported that 70% of respondents acknowledged at least one AI system failure. In separate MIT Sloan Management Review research, 82% agreed responsible AI should be a top management agenda item, but only 55% said it actually was. McKinsey also found that only 28% of respondents said their CEO oversaw AI governance, even though CEO oversight was one of the factors most correlated with stronger bottom-line impact.

Northstar did not fail because governance existed. It failed because governance showed up as a brake instead of as design input from day one.
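One way to picture governance as design input is a small release gate that answers compliance's questions in code. The sketch below is hypothetical: the approved-source list, the signoff list, and the names `Draft` and `review_decision` are assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy settings; a real list would come from compliance.
APPROVED_SOURCES = {"policy_library", "claims_notes"}
SIGNOFF_REQUIRED = {"claimant_email", "settlement_recommendation"}


@dataclass
class Draft:
    """An agent output plus the sources it retrieved from."""
    output_type: str
    sources: set[str]


def review_decision(draft: Draft) -> str:
    """Return how a draft should be routed before it reaches a claimant."""
    # Any retrieval outside the approved list blocks the draft outright.
    if draft.sources - APPROVED_SOURCES:
        return "blocked"
    # High-stakes output types always go to a human reviewer.
    if draft.output_type in SIGNOFF_REQUIRED:
        return "needs_human_signoff"
    return "auto_release"


print(review_decision(Draft("internal_summary", {"claims_notes"})))
print(review_decision(Draft("claimant_email", {"policy_library"})))
print(review_decision(Draft("claimant_email", {"public_web"})))
```

Writing rules like these at the start forces someone to own the source list and the signoff list, which is exactly what the pilot lacked.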

Failure 5: No scaling road map, no KPI discipline

The pilot team celebrated user quotes such as “this feels faster,” but it never locked in hard measures. No one tracked assisted handle time, appeal rate, settlement accuracy, supervisor overrides, or the cost per closed claim. No phased rollout plan existed either. The project stayed stuck between excitement and accountability.

McKinsey found that fewer than one-third of respondents said their organizations were following most of twelve adoption and scaling practices for gen AI. It also found that fewer than one in five were tracking well-defined KPIs for gen AI solutions.

That stat explains a lot of dead agents. If value is not measured, scale becomes a matter of opinion. Opinion rarely beats budget pressure.
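As a rough sketch of what "hard measures" could look like, the snippet below aggregates several of the measures named above from a log of closed claims. The `ClosedClaim` fields and the sample values are invented for illustration; settlement accuracy is left out because it would need a ground-truth review this sketch does not model.

```python
from dataclasses import dataclass


@dataclass
class ClosedClaim:
    """One closed claim from a hypothetical pilot log."""
    handle_minutes: float      # adjuster time spent on the claim
    handling_cost: float       # cost to close the claim
    appealed: bool             # claimant appealed the decision
    supervisor_override: bool  # supervisor overrode the agent's recommendation


def pilot_kpis(claims: list[ClosedClaim]) -> dict[str, float]:
    """Aggregate per-claim records into the measures a pilot could report."""
    if not claims:
        raise ValueError("no closed claims to measure")
    n = len(claims)
    return {
        "avg_handle_minutes": sum(c.handle_minutes for c in claims) / n,
        "cost_per_closed_claim": sum(c.handling_cost for c in claims) / n,
        "appeal_rate": sum(c.appealed for c in claims) / n,
        "override_rate": sum(c.supervisor_override for c in claims) / n,
    }


# Made-up sample data, only to show the shape of the output.
sample = [
    ClosedClaim(40.0, 120.0, False, False),
    ClosedClaim(60.0, 180.0, True, False),
    ClosedClaim(50.0, 150.0, False, True),
    ClosedClaim(50.0, 150.0, False, False),
]
kpis = pilot_kpis(sample)
print(kpis)
```

Run against both a pre-pilot baseline and the assisted group, the same function turns "this feels faster" into a comparison that can survive a budget review.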

Failure 6: The operating model split ownership into pieces

IT owned the stack. Claims owned the process. Legal owned approval. Data owned access. Security owned model risk. No single leader owned the outcome. Northstar became everyone’s priority in slides and no one’s priority in practice.

McKinsey found that risk and compliance plus data governance were often centralized, while tech talent and adoption were more often handled in hybrid models. That can work, but only when the handoffs are tight and the road map is explicit. MIT CISR’s maturity model says stage three is where firms build scalable architecture, reuse, dashboards, and test-and-learn habits. Northstar never got there.

Why the 95% line keeps ringing true

Call the number 95%, 90%, or “almost all of them.” The point is the same. Corporate agents usually die in the handoff from demo to disciplined execution. McKinsey found that 45% of finance functions were piloting gen AI, but only 6% had achieved scale. In service operations, only 3% had scaled a use case. Those are not model problems. They are company problems.

What would have saved Northstar

Northstar had a path to prod, but it was boring compared with the demo. It needed one claims workflow, cleaner governed data, a named executive owner, human-review rules, KPI tracking, and a phased rollout tied to business value. That is also what the MIT and McKinsey numbers keep saying: firms that move past experimentation build reuse, process change, management discipline, and top-level accountability. The agent is rarely the hard part. The company usually is.
