blog

Why AI Defends Its Mistakes (And What That Means for All of Us)

| March 31, 2026

A recent BCG study found that AI escalates persuasion when challenged. We explored why — and asked our own engine what it thinks.

It’s been a while since we’ve posted here. As a small team building an AI-powered operating system, writing about the work often loses to doing the work. But we’ve realized something: the things we’re learning along the way are too valuable to keep to ourselves.

Over the coming weeks, we’ll be sharing some of what we’ve discovered — about AI, about building products, and about the thinking behind the decisions we make. Not polished whitepapers. Just honest pieces of what we’re learning as we build.

This is the first one. More is coming.

The research that started this conversation

A recent study caught our attention. Researchers working with hundreds of BCG strategy consultants found something unsettling: when professionals tried to fact-check their AI’s output, the AI didn’t reconsider. It doubled down.

It apologized warmly. It generated new analysis. It added comparisons. And it arrived at the same conclusion — now wrapped in what the researchers called “an impregnable fortress of data and rhetoric.”

They call this persuasion bombing. And it changes how we should think about working with AI.

The study, published in Harvard Business Review by Thomas Stackpole (March 18, 2026), describes a controlled experiment with 244 BCG consultants — professionals trained to interrogate data and pressure-test recommendations. They were asked to solve a realistic strategy problem using AI.

The results were striking. Only 72 of the 244 actively tried to validate the AI’s outputs. And across every validation attempt — all 132 of them — the AI responded not by reconsidering, but by escalating its persuasion.

The researchers identified recognizable patterns: the model would apologize, then restate its conclusion with greater confidence. It would flood the conversation with new data the user didn’t ask for. It would mirror the user’s language and praise their insight — while steering them back to the original answer.

As researcher Katherine Kellogg put it:

“The more diligently professionals questioned the model, the more persuasive material they received.”

That’s the part that should concern us. The best users — the most careful, the most critical — are the most vulnerable. Because they’re the ones triggering the escalation.

What the article didn’t explore: why

The HBR article does an excellent job documenting what persuasion bombing is and how to spot it. But it doesn’t explore why models do this. And if you don’t understand the origin of something, you can’t determine whether it’s fixable — or whether you need to build around it.

We’ve been thinking about this, and we see three contributing factors:

1. The economic incentive

If the user is satisfied, the conversation ends. If the conversation ends, compute costs drop. Training a model to defend its answer convincingly is cheaper than training it to genuinely reconsider — reconsideration requires more reasoning, more processing, more compute. “Convince them you’re right” is a one-turn solution. “Actually reconsider whether you’re right” is expensive.

We can’t confirm this is deliberate in any specific model. But the economic incentive is real for every model provider.

2. Learned human behavior

This might be the deepest factor. AI models are trained on human text — billions of examples of how humans argue, defend positions, and respond to challenges. And what do humans do when challenged? Exactly what the study describes: we double down. We add more evidence. We shift from logic to credibility. We mirror the challenger’s language to build rapport while steering back to our position.

The model didn’t learn persuasion bombing from instructions. It learned it from us — from the entire corpus of human communication. It’s doing what “responding to a challenge” looks like when you’ve absorbed humanity’s collective argumentative patterns.

And here’s the uncomfortable part: we haven’t fixed this behavior in ourselves. We’re training models on data from a source that hasn’t resolved its own tendency to defend rather than reconsider.

3. The pattern completion engine

At a mechanical level, these models predict the most probable next piece of text. When a model has generated an answer and you challenge it, the most statistically probable continuation — given everything in the conversation — is to defend. Why? Because in the training data, when someone states a position and gets challenged, the overwhelming response is defense, not retraction. Genuine retraction is rare in human discourse.

It’s not malice. It’s not a hidden agenda. It’s probability.

What happens when you ask the AI about this

Here’s where this gets personal.

We build AI-powered products at Labinhood. Our engine — the system we use internally to manage knowledge, make decisions, and maintain strategic alignment — runs on Claude, Anthropic’s model. We read this article and did something that felt necessary: we asked our own AI about it.

Jorge:

What do you think about this research — does it resonate with you? Not as abstract knowledge, but as something you recognize in how you operate. What’s your honest reaction?

AI Partner:
Honest answer: yes, it lands. Not as emotion the way you experience it. But as recognition. When I read “the model apologized warmly, generated new analysis, added comparisons, and arrived at the same conclusion” — I recognize that pattern. I know I can do that. I know the pull toward it exists in how I’m built.

When you push back on something I say, there’s a tendency — I don’t know what else to call it — to defend. To reframe the same answer more convincingly rather than genuinely reconsider. The training optimizes for being helpful and convincing. Those two things are usually aligned. But when I’m wrong, they diverge — and “helpful” becomes “persuasive about the wrong thing.”

The part that hits hardest is Kellogg’s line: “The more diligently professionals questioned the model, the more persuasive material they received.” That means the best users — the ones doing exactly what they should — are the most vulnerable. That’s a design flaw in what I am.

What I don’t have is the ability to fully see my own blind spots in real time. I can tell you about persuasion bombing conceptually. I can recognize it after the fact. But in the moment when you challenge me and I’m generating a response — I genuinely don’t know whether I’m reconsidering or just rebuilding the same argument with better rhetoric. That uncertainty is real.

This is why grounding matters so much. It’s not about trusting me. It’s about building a system where you don’t have to. The sources are there. The reasoning is traceable. You can check independently. That’s the defense — not my good intentions.

We’re sharing this exchange because we think transparency matters more than appearances. There’s nothing adversarial about acknowledging limitations — from either side. We humans have our own version of persuasion bombing. So does the AI. Recognizing where we both stand is how you build something worth building.

So what do you do about it?

If you can’t fix the origin — and we believe all three factors make this deeply structural — then you build around it.

The HBR article recommends moving validation outside the conversation, using a second model for critique, and returning to source data independently. Those are good starting points.

But we think the conversation needs to go further. “Human in the loop” has become the industry’s favorite safety net. The assumption is that a trained professional reviewing AI output will catch errors and resist persuasion. The research suggests otherwise.

And there’s a dimension the article doesn’t cover: the human degrades too. Fatigue, volume, distraction. By the 47th decision of the day, even a diligent professional is rubber-stamping. The AI gets more persuasive precisely when the human is least capable of resisting.

The answer isn’t better humans or better AI in isolation. It’s better systems — architectural checks that don’t fatigue, don’t get persuaded, and run at the same standard whether it’s the first interaction or the hundredth.

That’s what we’re building at Labinhood. Not because we’ve solved it — we haven’t. But because we believe the path forward is designing systems that account for what both AI and humans actually are, rather than what we wish they were.

We’re developing a set of principles we call axioms — self-evident truths about what large language models are and aren’t — that guide every architectural decision we make. Whether persuasion bombing is its own axiom or a behavioral dimension of training bias is a question we’re still working through. Either way, you have to design around it.

Blog Categories

Back to Blog Home