McKinsey: Why Most AI Agents Never Reach Production
Every company now has a story about an agent that looked brilliant in a workshop, stunned a steering committee, and then disappeared before launch. That is why the line “95% of corporate agents never reach prod” feels true even if the exact share shifts by sector. The pattern is easy to see in the data: McKinsey found that only 11% of companies worldwide were using gen AI at scale, and in operations just 3% of surveyed organizations had scaled a gen AI use case. MIT research points to the same gap from another angle: 62% of firms were still in the first two AI maturity stages, where financial performance sat below industry averages.
Meet Northstar Claims Assistant
Let’s walk through one fictional project.
A regional insurer launches Northstar Claims Assistant, an agent built to help adjusters read incident reports, pull policy language, draft claimant emails, and recommend next actions. In week one, the prototype works on a neat sample set. Leaders see a faster claims cycle, lower handling cost, and happier staff. The team gets applause, a budget, and a target to go live in six months.
Six months later, Northstar is stuck in a folder named pilot_v7_final_FINAL.
Why?
Failure 1: The company treated a demo as a business case
The first crack shows up early. Northstar was approved because the demo looked sharp, not because the business chose one narrow, high-value workflow with clear economics. MIT CISR found that 28% of firms were still in the "experiment and prepare" stage and 34% in "build pilots and capabilities." Together, that means 62% had not yet reached enterprise-scale AI ways of working. Those stage-one and stage-two firms also trailed their industries on growth and profit, while stage-three and stage-four firms moved above industry average.
That is the first reason so many agents die: companies confuse proof of concept with proof of value. Northstar could answer questions. It could not yet prove where margin, cycle time, leakage, or customer retention would improve.
Failure 2: The data layer was not ready
Once the team moved past the demo, Northstar met the company’s real environment: duplicate policy files, missing claims notes, messy document naming, and access rules no one had cleaned up in years. The agent was smart; the plumbing was not.
MIT Sloan reported that 57% of chief data officers said they had not made the necessary changes to their data strategy to support generative AI. In the same research, 46% said data quality and choosing the right use cases were the biggest roadblocks, while 93% said data strategy was crucial to getting value from gen AI.
Northstar failed here because the team started with the model and postponed the data work. In corporate settings, that order usually ends badly.
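The data work does not have to be exotic, either. Before any retrieval index is built, a basic readiness check can surface exactly the problems described above. Below is a minimal sketch of that idea; the metadata fields (doc_id, policy_id, claim_notes, last_access_review) are illustrative assumptions, not Northstar's real schema.

```python
from collections import Counter
from datetime import datetime, timedelta

# Illustrative document metadata; field names are assumptions for the sketch,
# not the insurer's actual claims schema.
docs = [
    {"doc_id": "CLM-1001", "policy_id": "P-88", "claim_notes": "rear-end collision", "last_access_review": "2021-03-01"},
    {"doc_id": "CLM-1001", "policy_id": "P-88", "claim_notes": "rear-end collision", "last_access_review": "2021-03-01"},
    {"doc_id": "CLM-1002", "policy_id": None,   "claim_notes": "",                   "last_access_review": "2019-07-15"},
]

def readiness_report(docs, review_window_days=365):
    """Flag the plumbing problems that sink retrieval quality before any model is involved."""
    dup_ids = [d for d, n in Counter(x["doc_id"] for x in docs).items() if n > 1]
    missing_policy = [x["doc_id"] for x in docs if not x["policy_id"]]
    empty_notes = [x["doc_id"] for x in docs if not x["claim_notes"].strip()]
    cutoff = datetime.now() - timedelta(days=review_window_days)
    stale_access = [
        x["doc_id"] for x in docs
        if datetime.strptime(x["last_access_review"], "%Y-%m-%d") < cutoff
    ]
    return {
        "duplicate_doc_ids": dup_ids,
        "missing_policy_link": missing_policy,
        "empty_claim_notes": empty_notes,
        "stale_access_reviews": stale_access,
    }

print(readiness_report(docs))
```

A report like this, run before model selection rather than after the pilot stalls, turns "the data is messy" into a concrete backlog someone can own.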
Failure 3: The workflow never changed
The agent could draft summaries, but adjusters still had to swivel across five systems, copy text into an old claims platform, and ask supervisors for approval through email. The tool was added on top of the old process rather than built into a new one.
McKinsey’s 2025 state-of-AI survey found that workflow redesign had the biggest effect on whether organizations saw EBIT impact from gen AI. Yet only 21% of respondents said their organizations had fundamentally redesigned at least some workflows.
This is where many executive teams get the math wrong. They fund an agent but not the operating change around it. Northstar did not need a clever prompt library. It needed a rewritten claims process.
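What "built into a new process" might look like is mundane: the agent's draft lands directly in the claims platform and a supervisor approval queue, instead of being copied by hand between five tools. The sketch below is hypothetical; the agent, claims_platform, and approval_queue objects stand in for whatever the insurer actually runs, and only the shape of the handoff matters.

```python
from dataclasses import dataclass

@dataclass
class ClaimDraft:
    claim_id: str
    summary: str
    recommended_action: str

def process_claim(claim_id, incident_report, agent, claims_platform, approval_queue):
    """Redesigned step: agent output flows straight into the system of record
    and a review queue, with no manual copy-paste between systems."""
    draft = ClaimDraft(
        claim_id=claim_id,
        summary=agent.summarize(incident_report),          # hypothetical agent interface
        recommended_action=agent.recommend(incident_report),
    )
    claims_platform.attach_draft(draft)                     # lands where adjusters already work
    approval_queue.submit(draft, reviewer="supervisor")     # replaces approval-by-email
    return draft
```

The point is not the code; it is that someone has to redesign and fund the step it represents.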
Failure 4: Governance arrived late, after trust was already gone
Then legal reviewed the pilot. Compliance asked who owned retrieval rules, what sources were approved, which outputs required human signoff, and how the system handled bias, privacy, and error logging. No one had full answers.
MIT Sloan Management Review and BCG reported that 70% of respondents acknowledged at least one AI system failure. In separate MIT Sloan Management Review research, 82% agreed that responsible AI should be a top management agenda item, but only 55% said it actually was. McKinsey also found that only 28% of respondents said their CEO oversaw AI governance, even though CEO oversight was one of the factors most strongly correlated with bottom-line impact.
Northstar did not fail because governance existed. It failed because governance showed up as a brake instead of as design input from day one.
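Had those answers existed on day one, much of them could have lived in a declarative policy the agent enforces at runtime. A hypothetical sketch follows; the source names, action types, and policy fields are assumptions made for illustration, not a compliance framework.

```python
# Hypothetical governance policy checked before any output is released.
GOVERNANCE_POLICY = {
    "approved_sources": {"policy_library", "claims_history"},        # retrieval allowlist
    "human_signoff_required": {"claim_denial", "settlement_offer"},  # actions a human must approve
    "log_every_output": True,
}

def release_output(action_type, sources_used, output, audit_log):
    """Block unapproved sources, route sensitive actions to a human,
    and log everything so compliance can audit decisions later."""
    unapproved = set(sources_used) - GOVERNANCE_POLICY["approved_sources"]
    if unapproved:
        audit_log.append({"action": action_type, "blocked": True,
                          "reason": f"unapproved sources: {sorted(unapproved)}"})
        raise ValueError(f"Output used unapproved sources: {sorted(unapproved)}")
    needs_human = action_type in GOVERNANCE_POLICY["human_signoff_required"]
    if GOVERNANCE_POLICY["log_every_output"]:
        audit_log.append({"action": action_type, "needs_human": needs_human, "output": output})
    return {"output": output, "status": "pending_review" if needs_human else "released"}
```

Written this early, governance becomes a design constraint the build team works within rather than a review gate it hits at the end.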
Failure 5: No scaling road map, no KPI discipline
The pilot team celebrated user quotes such as “this feels faster,” but it never locked in hard measures. No one tracked assisted handle time, appeal rate, settlement accuracy, supervisor overrides, or the cost per closed claim. No phased rollout plan existed either. The project stayed stuck between excitement and accountability.
McKinsey found that less than one-third of respondents said their organizations were following most of twelve adoption and scaling practices for gen AI. It also found that less than one in five were tracking well-defined KPIs for gen AI solutions.
That stat explains a lot of dead agents. If value is not measured, scale becomes a matter of opinion. Opinion rarely beats budget pressure.
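Hard measures do not require much machinery. Below is a minimal sketch of per-claim event logging that would have made the scale decision a numbers question; the metric names mirror the ones listed above, and the sample values are invented for illustration.

```python
import statistics

# One record per closed claim; the fields mirror the measures the pilot never tracked.
claims = [
    {"handle_minutes": 42, "agent_assisted": True,  "supervisor_override": False, "appealed": False, "cost": 310},
    {"handle_minutes": 67, "agent_assisted": False, "supervisor_override": False, "appealed": True,  "cost": 455},
    {"handle_minutes": 39, "agent_assisted": True,  "supervisor_override": True,  "appealed": False, "cost": 298},
]

def kpi_snapshot(claims):
    """Compare assisted and unassisted claims on the KPIs that decide whether to scale."""
    assisted = [c for c in claims if c["agent_assisted"]]
    manual = [c for c in claims if not c["agent_assisted"]]
    return {
        "assisted_handle_time": statistics.mean(c["handle_minutes"] for c in assisted),
        "manual_handle_time": statistics.mean(c["handle_minutes"] for c in manual),
        "supervisor_override_rate": sum(c["supervisor_override"] for c in assisted) / len(assisted),
        "appeal_rate": sum(c["appealed"] for c in claims) / len(claims),
        "cost_per_closed_claim": statistics.mean(c["cost"] for c in claims),
    }

print(kpi_snapshot(claims))
```

With a snapshot like this reviewed every sprint, "this feels faster" turns into a trend line that can survive a budget discussion.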
Failure 6: The operating model split ownership into pieces
IT owned the stack. Claims owned the process. Legal owned approval. Data owned access. Security owned model risk. No single leader owned the outcome. Northstar became everyone’s priority in slides and no one’s priority in practice.
McKinsey found that risk and compliance, along with data governance, were often centralized, while tech talent and adoption were more often handled in hybrid models. That can work, but only when the handoffs are tight and the road map is explicit. MIT CISR's maturity model puts scalable architecture, reuse, dashboards, and test-and-learn habits at stage three. Northstar never got there.
Why the 95% line keeps ringing true
Call the number 95%, 90%, or “almost all of them.” The point is the same. Corporate agents usually die in the handoff from demo to disciplined execution. McKinsey found that 45% of finance functions were piloting gen AI, but only 6% had achieved scale. In service operations, only 3% had scaled a use case. Those are not model problems. They are company problems.
What would have saved Northstar
Northstar had a path to prod, but it was boring compared with the demo. It needed one claims workflow, cleaner governed data, a named executive owner, human-review rules, KPI tracking, and a phased rollout tied to business value. That is also what the MIT and McKinsey numbers keep saying: firms that move past experimentation build reuse, process change, management discipline, and top-level accountability. The agent is rarely the hard part. The company usually is.