When AI Says X and Your Gut Says Y: The Three-Question Test

A Product Owner sits in her sprint planning meeting on a Wednesday morning. The new AI prioritization tool the team has been piloting tells her to push the data export feature down two slots. Her senior engineer leans forward and says, "I think that's wrong. Three customers have called this out in the last month."

She has the tool's recommendation. She has her engineer's gut. They do not agree. The room goes quiet, waiting for her to make a call. And she freezes.

That scene is playing out in product organizations all over the place right now. The AI is faster than the gut. The gut has been right before. Nobody knows which one to trust on this particular Wednesday, and nobody has a habit of resolving it.

The core idea

People do not trust what they cannot reason about. That is not a flaw; it is a feature of human judgment. The job of leadership is not to make people trust AI more. It is to give them a repeatable way to interrogate AI outputs honestly, so they end up trusting the right ones for the right reasons.

The comfortable lie about AI adoption

The comfortable lie is that AI will earn trust by being right often enough. Show your team enough accurate outputs, the thinking goes, and the resistance will dissolve.

The uncomfortable reality is that being right is not the same as being trustworthy. Trustworthy means the team can explain why the right answers deserved belief, and why the tool's answers still call for caution where it tends to be wrong. Without that, "right most of the time" feels like luck. People stop using the tool the first time it confidently delivers a wrong answer about something that matters.

The NIST AI Risk Management Framework points the same way. It lists "explainable and interpretable" among the characteristics of trustworthy AI, alongside "valid and reliable," the category that covers accuracy. In NIST's framing, accuracy alone does not make a system trustworthy: a model that produces accurate outputs but cannot tell you why is still missing one of those characteristics.

Source: NIST AI Risk Management Framework, nist.gov/itl/ai-risk-management-framework

The two failure modes most teams fall into

When AI outputs conflict with human judgment, most teams default to one of two failures.

Blind compliance

The team rubber-stamps the AI output because the model said so. This usually happens in organizations where AI adoption itself is a metric. People defer to the tool to avoid looking resistant. The result is that AI errors propagate further into the workflow before anyone catches them, and the team's actual expertise quietly atrophies.

Blind rejection

The team dismisses the AI output because the model could not possibly understand the domain. This usually happens in organizations with deep specialist expertise. People defer to their gut to avoid looking gullible. The result is that genuinely useful AI input gets thrown out, and the team loses the upside of having a second perspective in the room.

Real talk: most organizations I work with do both, depending on who is in the room. Engineers default to blind rejection. Product people default to blind compliance. The org does not have a position; it has two factions.

(There is a broader pattern here, and we have written about why teams resist AI tools they do not yet trust. The interrogation habit is one of the moves that turns resistance into productive skepticism.)

The third path: honest interrogation

There is a middle path. It is a learned skill, not a personality trait. It does not require trusting AI more or trusting your gut less. It requires the habit of asking three questions before you act on any AI output that matters.

Question 1: What evidence is this output based on?

What data was the model trained on? What context did it see in this specific prompt? What assumptions are baked in?

If the AI suggests you deprioritize a feature, it is doing so based on signals such as usage data, customer feedback patterns, or similar feature outcomes from its training. Name the signal. If you cannot name it, you do not have an output to act on; you have a guess wearing the costume of authority.

Question 2: What would have to be true for this output to be wrong?

This is a pre-mortem of the AI's answer. If the model is wrong, what would be the most likely reason? Stale data. Missing context that the AI never saw. A pattern that fits historically but does not fit the moment.

Annie Duke's writing on decision-making treats this kind of thinking as part of deciding under uncertainty. The discipline is not about getting to certainty; it is about understanding the shape of the uncertainty before you bet.

Question 3: What does my gut say, and where might the gut be wrong?

Your gut is a model too. It was trained on years of pattern recognition in this domain, and it is performing the same kind of inference as the AI, just with different training data and a smaller sample size.

Your gut will be right about some things the AI cannot see, like the texture of customer conversations, the politics of the next stakeholder review, and the technical debt nobody has documented. Your gut will be wrong about some things the AI sees clearly, like statistical patterns across a dataset bigger than any one person has held in their head. Name where your gut is strong and where it is weak before you weigh it against the AI.

Leadership cue

When a team can safely disagree with an AI output, that is a psychological safety signal. The DORA 2024 research found that psychological safety is among the strongest predictors of software delivery performance, and the same pattern plausibly holds for how teams work with AI. If your team defers in only one direction, either always to the model or always to the gut, the issue may not be the tool; it may be that the room does not yet feel safe enough for honest interrogation.

Four traps to watch for

The "the model said" trap. Treating AI outputs like oracle pronouncements. The phrase itself is a tell. When someone defends a decision with "the model said," ask them what evidence the model actually used. Most of the time, they do not know.

The "junior employee" trap. Treating AI like an unreliable intern whose work must be redone from scratch every time. This wastes the AI's actual contribution, which is processing speed and pattern matching across data you cannot hold in your head. The right frame is not "do not trust the intern"; it is "review the intern's work, then decide."

The "no measurement" trap. Never tracking whether AI advice produced better outcomes than the gut would have. Most organizations have no honest record of when AI was right, when the gut was right, and when both were wrong. Without that record, the trust conversation runs on anecdote and personality.

The "fluent equals accurate" trap. Confusing well-written AI output for correct AI output. Modern models produce confident, grammatical, well-structured answers regardless of whether the answer is right. Fluency is not evidence of accuracy. (We have written about how this exact problem can quietly slow your team down even when everyone thinks AI is helping.)

Try this next week

Pick one AI output you will receive this week. A recommendation from a prioritization tool, a summary from a meeting assistant, or a pull request review from a coding agent. Whatever is in front of you.

Before you act on it, run it through the three questions. Write down your answers. Compare them to what your gut says. Decide intentionally, then track which one was right after the fact, even informally.

Do this five times over the next two weeks, and you will have a small dataset on your own AI judgment. You will know which kinds of outputs your AI tools get right consistently, where they hallucinate, and where your gut still beats them. That dataset is more valuable than any vendor benchmark, because it is yours.
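
If a notebook or spreadsheet feels too loose, the same record can live in a few lines of Python. What follows is a minimal sketch, not a prescribed tool: the Decision fields, the outcome labels, and the three sample entries are all invented for illustration, and you would swap in your own outputs and results.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    """One AI output run through the three questions (field names are illustrative)."""
    output: str            # what the AI recommended
    evidence: str          # Question 1: the signal the output rests on
    failure_reason: str    # Question 2: the most likely reason it could be wrong
    gut_call: str          # Question 3: what your gut said
    followed: str          # "ai" or "gut": which one you acted on
    ai_right: Optional[bool] = None    # filled in after the fact
    gut_right: Optional[bool] = None   # filled in after the fact


def outcome(d: Decision) -> str:
    """Label a decision once both results are known."""
    if d.ai_right is None or d.gut_right is None:
        return "pending"
    if d.ai_right and d.gut_right:
        return "both right"
    if d.ai_right:
        return "ai right"
    if d.gut_right:
        return "gut right"
    return "both wrong"


def summarize(log: list) -> Counter:
    """Count how many decisions fall under each outcome label."""
    return Counter(outcome(d) for d in log)


# Made-up sample entries; the results here are invented, not real data.
log = [
    Decision("Push data export down two slots", "usage data",
             "recent customer calls never reached the tool",
             "keep it where it is", followed="gut",
             ai_right=False, gut_right=True),
    Decision("Meeting summary lists no open action items", "call transcript",
             "a side conversation was not recorded",
             "one action item is missing", followed="ai",
             ai_right=True, gut_right=False),
    Decision("Code review flags a missing null check", "the diff only",
             "the check may exist upstream",
             "the code is fine", followed="ai"),
]

summary = summarize(log)
for label, count in summary.items():
    print(f"{label}: {count}")
```

The one design choice worth keeping, whatever format you use, is recording the AI's result and the gut's result separately. That is what lets the log show "both wrong" as its own outcome instead of hiding it.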

If you are building this judgment muscle for your team and want a structured way to learn it together, our AI for Product Owners course covers exactly this kind of decision-making with AI tools, including the interrogation frameworks you can take back to your team the same week.
