Why 95% of AI Pilots Fail to Deliver Financial Returns (And the One Question That Predicts Yours)

The MIT NANDA finding in plain English

MIT’s NANDA project (Networked Agents and Decentralized AI) studied a large sample of enterprise generative AI pilots run through 2025. The headline number was sharper than the industry had been admitting publicly: 95% of pilots failed to deliver measurable financial returns.

That number is worth pausing on. It does not mean 95% were disasters. It means that out of every twenty pilots, nineteen could not point to a P&L line that moved as a result. They produced demos. They produced internal excitement. They produced press releases. They did not produce return on the spend.

The 5% that did deliver financial returns shared a small set of structural features. Mostly boring features. Things like a named owner who had operational accountability for a target metric, a tight scope tied to a specific business workflow, and a production handoff designed at the start rather than as an afterthought.

That is the entire story in two sentences: the pilots that worked were operated like real projects. The pilots that failed were operated like experiments that nobody owned outright.

The companion stat: 88% never reach production

The NANDA number gets quoted most often, but it lands harder paired with the companion stats from elsewhere:

  • Capgemini / IDC: roughly 88% of AI initiatives never reach production. Different methodology, similar shape. The work stops somewhere between demo and rollout, and the cost is already sunk.
  • BCG AI Radar 2026: about half of CEOs surveyed believe their job stability depends on getting AI right. Useful context. The pressure to start is institutional, even when the readiness to finish isn’t.

Read these three together and a picture forms. CEOs are starting AI initiatives at a record clip because the institutional pressure to do so is at a record high. The initiatives are failing at a record clip because the upstream conditions that determine whether AI ships and pays back are missing in most mid-market businesses.

This is not an AI problem. It is an organizational and operational problem that AI exposes.

The four buckets where pilots fail

Almost every failed pilot we have seen up close sits in one of four buckets. Most sit in two or three of them at once.

Bucket 1: Data not ready

The model needed clean, structured, accessible data. The business had data scattered across an ERP, a WMS, a CRM, three spreadsheets, and a Slack channel. The engineering team spent four months building pipelines that should have existed before the pilot started. By the time the data was ready, the budget was half spent, the patience was thin, and the pilot got compressed into a demo that didn’t reflect production conditions.

This bucket is fixable. It is not fixable by buying more AI. It is fixable by treating the data layer as a separate workstream, scoped honestly, and shipped before AI shows up.

Bucket 2: No named owner

Ask whose performance review next year will reflect whether the AI moved its target metric. If the answer is “the AI committee” or “the head of innovation” or “we’re still figuring that out,” the pilot will fail. Ownership has to be specific, individual, and tied to a number. Committees are where ownership goes to die.

The pilots in the 5% almost universally had a single named owner who had to defend the outcome at the end of the year. The committee model produces a deck. The named-owner model produces a result.

Bucket 3: No target metric

What was the AI supposed to move? Throughput? Cost-per-unit? SLA compliance? Conversion rate? Average handle time? Retention curve at month three?

If you cannot name the metric and the magnitude in one sentence (“we expect this to move customer service cost-per-ticket down 15%”), the pilot has no honest definition of success. Without that, every output looks promising and nothing ships. Scope creep is the default state of an AI initiative without a target metric. So is failure.
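For readers who think in code, here is a minimal sketch of what "metric plus magnitude in one sentence" looks like once it is written down. The class, its field names, and the $8.00 baseline are hypothetical, chosen only for illustration; the 15% reduction is the example from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetMetric:
    """One metric, one expected magnitude. All names here are hypothetical."""
    name: str
    baseline: float             # value before the initiative (assumed figure)
    expected_change_pct: float  # negative means the metric should go down

    def target_value(self) -> float:
        # Apply the expected percentage change to the baseline.
        return self.baseline * (1 + self.expected_change_pct / 100)

    def is_met(self, observed: float) -> bool:
        # A reduction target is met at or below the target value,
        # an increase target at or above it.
        if self.expected_change_pct < 0:
            return observed <= self.target_value()
        return observed >= self.target_value()

    def statement(self) -> str:
        direction = "down" if self.expected_change_pct < 0 else "up"
        return (f"We expect this to move {self.name} {direction} "
                f"{abs(self.expected_change_pct):g}%.")


# Hypothetical baseline of $8.00 per ticket with the 15% reduction from the text.
cost_per_ticket = TargetMetric(
    name="customer service cost-per-ticket",
    baseline=8.00,
    expected_change_pct=-15,
)
print(cost_per_ticket.statement())
print(f"Target: ${cost_per_ticket.target_value():.2f} per ticket")
```

If the fields cannot be filled in, that is the signal: the initiative does not yet have a definition of success that anyone can be held to.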

Bucket 4: No production loop

The pilot demoed well. Then the team that has to run it in production was handed the model and told to make it work. They couldn’t. The pilot environment was built for the data scientist’s workflow, not for the operator’s. The model drifted within weeks. Nobody had defined who retrains it, who monitors quality, or what the escalation path is when the model says something a regulator would not like.

A production loop is not a deliverable. It is a sustained operating discipline. Pilots that skip the loop ship demos, not systems.

The one question that predicts yours

If we could look at only one variable on an AI initiative to predict whether it will land, this is it.

“Who is on the hook for a specific number this AI project will move?”

Not “who is leading the AI work.” Leading is theater. Being on the hook is structural. Lead is a verb that means meetings; on the hook means your annual review reflects the result.

If you can answer that question with a single name and a single metric, the project has a real chance. If the answer involves more than one name, or no metric, or both, the project is almost certainly heading toward the 95% column.

The pattern is so clean that we use it as the leading indicator in the AI Readiness Assessment. Businesses that score “ready” on this dimension and average everywhere else outperform businesses that score “average” on this dimension and high everywhere else. Ownership is the load-bearing wall. The rest of the house can be average and still stand.

Related reading: What Is an AI Readiness Assessment? covers all six dimensions an assessment should score, including ownership and the data foundation work that has to precede AI.

What to do instead before pushing more AI

We are going to say something unfashionable in this part. Buying more AI does not fix any of the four buckets above. The bucket-1 fix is data engineering. The bucket-2 fix is a manager conversation. The bucket-3 fix is a target-setting exercise that takes a day, not a quarter. The bucket-4 fix is operating discipline.

For mid-market businesses that have already spent on AI without seeing the return, the smarter sequence is usually:

1. Stop the next pilot until the previous one has had an honest post-mortem. The pilots most likely to fail are the ones started before the last failure was understood. The post-mortem is uncomfortable. It is also the highest-leverage hour you will spend that month.

2. Name a single owner per AI initiative, with a metric and a magnitude. Write it down. Put it on a performance review. If you cannot do this, do not start the initiative.

3. Treat the data foundation as its own workstream. It is going to take three to six months in most mid-market businesses. That is fine. It would have taken that long anyway; it just would have been hidden inside a failing pilot.

4. Be willing to say no to vendors who cannot answer “what specific number will this move and over what time horizon.” A vendor who cannot answer that is selling you a tool, not an outcome. Tools do not pay back on their own.

5. Reset the conversation with the board if the AI pressure is coming from there. Most CEOs we have helped run that conversation underestimate how much the board would respect a clear, structured “here’s why we are pacing this carefully” over another underbaked pilot. The board does not want a story; it wants a result. Give them the structured pacing and they will give you the runway.

None of these are AI moves. All of them are the moves that make the next AI investment land in the 5%, not the 95%.

How the AI Readiness Assessment scores this in ten minutes

The AI Readiness Assessment we run is built around the four buckets in this article plus two more (governance and tooling). Each one gets a score against a cohort of similar mid-market businesses, and the output is a one-page scorecard with a prioritized list of what to fix before the next AI dollar gets spent.

If the assessment scores you in the “not yet” band, you save the budget. If it scores you in the “ready” band, the next move is a specific, owned, metric-tied initiative, not another pilot looking for a home.

We built it because we kept watching mid-market businesses spend on the same AI mistakes for the same structural reasons. Ten minutes of honest diagnostic upfront saves the eighteen months that the pilot would have taken to fail.


Frequently Asked Questions

Where does the 95% number actually come from?

MIT’s NANDA project published it in its report The GenAI Divide: State of AI in Business 2025. The number reflects enterprise generative AI pilots specifically, tracked over the 2024 to 2025 cycle. It is corroborated by adjacent research from Capgemini and IDC, which both place AI initiative failure rates in the 85% to 90% range using slightly different methodologies.

Is the failure rate getting better as AI tools mature?

Not yet meaningfully. The pattern in the 2024 to 2025 cycle looks similar to the 2022 to 2023 cycle even though the tools improved. That is the tell that the failure mode is organizational, not technical. Better tools applied to the same broken upstream conditions produce the same failure rate.

What is the difference between an AI pilot that “failed” and one that’s still in progress?

A useful working definition: a pilot has failed when twelve months after kickoff there is no production deployment, no measurable financial return, and no clear path to either. By that definition, most mid-market pilots that started in 2024 are in the failure column today.
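The working definition above is mechanical enough to write as a check. This is a rough sketch only: the `PilotStatus` fields and the whole-month arithmetic are our own illustrative assumptions, and "clear path" remains a judgment call that someone has to make honestly before setting the flag.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PilotStatus:
    """Hypothetical record of where a pilot stands today."""
    kickoff: date
    in_production: bool        # is there a production deployment?
    measurable_return: bool    # has a P&L line moved as a result?
    clear_path: bool           # is there a credible path to either?


def months_since(start: date, today: date) -> int:
    # Count whole calendar months elapsed between start and today.
    months = (today.year - start.year) * 12 + (today.month - start.month)
    if today.day < start.day:
        months -= 1
    return months


def has_failed(pilot: PilotStatus, today: date) -> bool:
    # Failed = at least twelve months in, with no deployment,
    # no measurable return, and no clear path to either.
    overdue = months_since(pilot.kickoff, today) >= 12
    nothing_to_show = not (
        pilot.in_production or pilot.measurable_return or pilot.clear_path
    )
    return overdue and nothing_to_show


# Hypothetical pilot that kicked off in March 2024 and has nothing to show.
stalled = PilotStatus(date(2024, 3, 15), False, False, False)
print(has_failed(stalled, today=date(2025, 6, 1)))
```

A pilot younger than twelve months is simply "in progress" under this definition, however badly it is going.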

My pilot looked fine in the demo. Why didn’t it ship?

The most common pattern: the demo ran on hand-cleaned data in a sandbox environment, with the data scientist in the loop. Production runs on live data, with operators in the loop, under SLAs the sandbox didn’t have to honor. The model that looked sharp in the demo drifts inside two weeks of real conditions, and nobody is paid to fix it on the day it breaks.

Should I cancel all my AI work until I have read your assessment?

No. Keep the work where you can name the owner, the metric, the data path, and the production handoff. Pause the work where you cannot. We have seen CEOs do this in an afternoon and recover six figures of budget in the process.
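The keep-or-pause rule in that answer can be run as a simple pass over a portfolio. What follows is a minimal sketch under our own assumptions: the `Initiative` record, its field names, and the sample entries are hypothetical, and a blank or missing answer is treated as "cannot name it".

```python
from dataclasses import dataclass
from typing import List, Optional

# The four things the answer above says you must be able to name.
REQUIRED = ("owner", "metric", "data_path", "production_handoff")


@dataclass(frozen=True)
class Initiative:
    """Hypothetical portfolio entry; None or blank means 'we cannot name it'."""
    name: str
    owner: Optional[str] = None
    metric: Optional[str] = None
    data_path: Optional[str] = None
    production_handoff: Optional[str] = None


def missing_answers(initiative: Initiative) -> List[str]:
    # Collect every required field that is absent or only whitespace.
    missing = []
    for field_name in REQUIRED:
        value = getattr(initiative, field_name)
        if not (value and value.strip()):
            missing.append(field_name)
    return missing


def triage(initiative: Initiative) -> str:
    # Keep only when all four answers exist; otherwise pause.
    return "keep" if not missing_answers(initiative) else "pause"


# Illustrative entries only; none of these describe a real project.
portfolio = [
    Initiative(
        name="ticket deflection",
        owner="Head of Support",
        metric="cost-per-ticket down 15%",
        data_path="CRM export, cleaned nightly",
        production_handoff="support ops runs and monitors it",
    ),
    Initiative(name="innovation lab chatbot", owner="AI committee"),
]

for item in portfolio:
    print(f"{item.name}: {triage(item)} (missing: {missing_answers(item)})")
```

Note that the sketch only checks that an answer exists. It will happily accept "AI committee" as an owner, so the harder test from earlier in the article, a single named person, still has to be applied by a human.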

How is your AI Readiness Assessment different from running my own post-mortem?

A post-mortem looks backward at one project. The assessment looks across the business at all six readiness dimensions and scores you against a cohort. Run the post-mortem on the last pilot; run the assessment before the next one. They are complementary.

Take the AI Readiness Assessment

If your business is one or two pilots into the AI cycle and you are not sure whether you are in the 95% or the 5%, the AI Readiness Assessment gives you a clean read in ten minutes. Cohort-benchmarked, board-ready output, honest enough to say “pause” when pausing is the right call.
