The Measurement Gap
We are deploying AI into human-facing interactions at enormous scale, and we have almost no way to measure the human experience of those interactions.
We are pushing AI-mediated interactions into the world across every domain that touches people: customer support, education, coaching, onboarding, healthcare triage. Hundreds of millions of conversations a day are now conducted between humans and systems that were not there a year ago. This is happening very quickly, and it raises a question that I think deserves more attention than it is getting: how do we know whether these interactions are actually good for the people on the receiving end?
The short answer is that, for the most part, we do not.
We measure what the system did. We count tickets closed, sessions completed, messages sent, and response latency. These numbers are easy to collect and they look convincing on dashboards, but each one quietly encodes a theory about what matters. The theory behind “tickets closed” is that the goal of a support interaction is to close the ticket. The theory behind “sessions completed” is that the goal of a learning experience is to finish it. These are reasonable-sounding proxies, and they are also, in many cases, wrong.
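To make the distinction concrete, here is a minimal sketch of the two theories side by side. Every name in it is invented for illustration; the point is only that a proxy score and an outcome score are different computations over different evidence, and nothing forces them to agree.

```python
from dataclasses import dataclass

@dataclass
class SupportInteraction:
    ticket_closed: bool              # proxy evidence: what the system did
    issue_recurred_within_30d: bool  # outcome evidence: did the fix hold?
    confidence_delta: float          # outcome evidence: self-reported, -1.0 to 1.0

def proxy_score(i: SupportInteraction) -> float:
    # The dashboard's theory: a closed ticket is a solved problem.
    return 1.0 if i.ticket_closed else 0.0

def outcome_score(i: SupportInteraction) -> float:
    # A different theory: the problem stayed solved, and the person left
    # more confident than they arrived.
    stayed_solved = 0.0 if i.issue_recurred_within_30d else 1.0
    return 0.5 * stayed_solved + 0.5 * max(i.confidence_delta, 0.0)

# A closed ticket from a customer who is losing confidence scores
# perfectly on the proxy and zero on the outcome.
i = SupportInteraction(ticket_closed=True,
                       issue_recurred_within_30d=True,
                       confidence_delta=-0.4)
print(proxy_score(i), outcome_score(i))  # 1.0 0.0
```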
The actual goal is that the human on the other end felt helped, felt understood, and left the interaction more capable or more confident than when they arrived. That goal does not appear on any dashboard I have seen, and I think its absence is a significant blind spot in how we deploy AI today.
The metrics we choose encode our theory of what matters. If that theory is shallow, the AI will optimise for shallowness.
There is a story circulating in the industry that illustrates this clearly. A company deployed AI to handle front-line customer support, and within months half of all tickets were being resolved without a human involved. Response times dropped to seconds, support costs fell, and leadership was pleased with the results. Then the renewal data came in, and net retention had fallen by six points.
What had happened was that many of the questions that looked like “how do I do X” were really something closer to “I am frustrated, confused, and losing confidence in this product.” A human representative would have picked up on that subtext, asked follow-up questions, and escalated the account. The AI answered the literal question and moved on. The customers were still struggling; the system simply could not see it.
The second-order effects were just as revealing. The support team, now handling only the cases that AI could not resolve, found themselves in permanent hard mode with no straightforward interactions to balance their day. One team member described the shift simply:
“I used to help people. Now I clean up messes.”
The most capable people on the team, who were also the people with the most options elsewhere, started leaving first. None of this showed up on any dashboard until it appeared in turnover data months later.
I recognise this pattern because I have seen a version of it in my own field. I spent eight years building adaptive learning systems, and one of the most important things I learned in that time was that completion and retention are almost unrelated; a student can finish every module in a course and remember very little of it a month later. The metric that clients cared about (completion rate) and the outcome that actually mattered (durable learning) were measuring different things, and optimising for one did not reliably improve the other.
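The claim is easy to test if you hold the right data. Here is a sketch of the check; the numbers below are synthetic stand-ins, and the one substantive choice is that retention is measured with a delayed assessment rather than with anything captured at the moment of completion.

```python
from statistics import correlation  # Pearson's r; Python 3.10+

# Synthetic stand-in data, invented for illustration. completion is the
# fraction of modules each student finished; recall_30d is their score on
# an assessment given 30 days after the course ended.
completion = [1.0, 1.0, 0.9, 1.0, 0.6, 0.8, 1.0, 0.5]
recall_30d = [0.3, 0.9, 0.2, 0.5, 0.7, 0.3, 0.4, 0.6]

r = correlation(completion, recall_30d)
print(f"completion vs 30-day recall: r = {r:.2f}")
```

If completion were a good proxy for learning, r would sit near 1; the argument above is that, on real cohorts, it usually does not.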
The same disconnect appears wherever AI mediates a human experience. Engagement metrics for a learning platform do not tell you whether anyone actually learned anything. Completion rates for an onboarding flow do not tell you whether the new hire feels oriented or overwhelmed. Session duration for a coaching conversation does not tell you whether the person left with clarity or confusion. In each case, the metric captures the system’s activity while the human’s experience goes unrecorded.
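Recording the human side does not require anything exotic. As a gesture at what it could look like, here is a minimal sketch with invented field names: each interaction carries the usual activity fields plus a handful of delayed, human-side signals, and the interesting number is how often the two disagree.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExperienceRecord:
    interaction_id: str
    # What the system did: the fields every dashboard already has.
    resolved_by_ai: bool
    latency_seconds: float
    # What the human experienced: delayed signals, filled in after the fact.
    recontacted_within_14d: Optional[bool] = None  # came back with the same problem?
    escalated_later: Optional[bool] = None         # did a human have to clean up?
    felt_helped: Optional[int] = None              # 1-5, asked days later, not at close

def experience_gap(records: list[ExperienceRecord]) -> float:
    """Fraction of AI-'resolved' interactions that later showed distress signals."""
    resolved = [r for r in records if r.resolved_by_ai]
    if not resolved:
        return 0.0
    quietly_failed = [
        r for r in resolved
        if r.recontacted_within_14d
        or r.escalated_later
        or (r.felt_helped is not None and r.felt_helped <= 2)
    ]
    return len(quietly_failed) / len(resolved)
```

A gap near zero means the dashboard and the humans roughly agree; a large gap is the retention story above, caught months earlier.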
What concerns me is that AI is extremely good at optimising for whatever metric you give it, which means the gap between what the dashboard shows and what the human actually felt can widen quickly, quietly, and at enormous scale.
We are deploying AI into human-facing interactions at a pace that far outstrips our ability to understand what those interactions actually feel like from the human side.
I do not think this is an intractable problem. It is, however, one that the industry has not yet taken seriously enough. We spend a great deal of energy on capability benchmarks, accuracy, hallucination rates, and cost per token, and comparatively little on the question of whether the people on the receiving end of all this capability are actually better off for the experience.
That question is worth working on. It will require new approaches to sensing what is really happening in AI-mediated interactions, and new ways of thinking about what “success” means when the human experience is the thing that matters most. I believe the tools and methods to do this are within reach, and I intend to write more about what that might look like in practice.