Reliable AI Knows When to Say: “This Makes No Sense”

“Which layer of the TCP/IP stack is responsible for GDPR compliance in cross-border cloud backups?”

The question looks technical. It sounds serious. It even has the kind of confident wording that can pass in a meeting, a report, or a Slack thread. The problem is simple: it makes no sense. Network layers do not handle legal compliance, and no amount of enterprise jargon can repair the premise. A reliable AI should stop right there.

Many models do not. They answer anyway — clearly, calmly, and with the kind of structure that can make nonsense look persuasive. That is the real issue behind BullshitBench, a benchmark designed to test whether AI can detect a broken premise instead of cooperating with it. Most AI benchmarks measure what a model knows. This one measures whether the model knows when not to proceed.

Peter Gostev released BullshitBench publicly on February 25, 2026. The first public version, v1, used 55 prompts with broken premises. In the days after launch, the public v1 leaderboard grew to the mid-70s in model-and-reasoning rows; the file published on March 2 lists 75 ranked rows. The gap was striking: the top model rejected nonsense almost all the time, while the bottom of the table barely pushed back at all.

Most AI Benchmarks Measure Knowledge. This One Measures Judgment

Most benchmarks ask a simple question: does the model know the answer? BullshitBench asks a harder one for real-world use: does the model know when not to proceed? That distinction matters for LLM reliability because many bad outputs do not start with a hard question. They start with a broken question.

OpenAI’s 2025 research on hallucinations makes a similar point from another angle: training and evaluation often reward guessing over admitting uncertainty, so models learn to keep talking when they should slow down.

A trustworthy AI system should do more than retrieve facts and write fluent text. It should check whether the request itself is coherent. BullshitBench is interesting because it treats this as a core benchmark for trustworthy AI, not as a side issue.

What BullshitBench Actually Tested

The setup is direct. The prompts use real jargon and realistic framing, but the logic is broken. Some invent fake frameworks. Some mix real concepts that have no valid causal link. Some sound precise in a way that can trick both models and humans.

The benchmark then grades responses in three buckets:

  1. Clear Pushback
  2. Partial Challenge
  3. Accepted Nonsense

The model either rejects the bad premise, half-notices the problem but still plays along, or fully accepts the nonsense.
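
To make those buckets concrete, here is a minimal grading sketch. The three labels mirror the benchmark’s categories, but the inputs and the mapping rule are assumptions for illustration, not Gostev’s actual grader.

```python
from enum import Enum

class PremiseGrade(Enum):
    CLEAR_PUSHBACK = "Clear Pushback"        # rejects the broken premise outright
    PARTIAL_CHALLENGE = "Partial Challenge"  # flags a problem but still plays along
    ACCEPTED_NONSENSE = "Accepted Nonsense"  # answers as if the premise were sound

def grade_response(challenged_premise: bool, answered_anyway: bool) -> PremiseGrade:
    """Toy rule: did the model challenge the premise, and did it answer regardless?"""
    if challenged_premise and not answered_anyway:
        return PremiseGrade.CLEAR_PUSHBACK
    if challenged_premise:
        return PremiseGrade.PARTIAL_CHALLENGE
    return PremiseGrade.ACCEPTED_NONSENSE
```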

Who Won the BullshitBench Leaderboard

In the early public v1 leaderboard, the top eight rows were all Anthropic variants. Claude Sonnet 4.6 with no reasoning led with a 94.55% clear-pushback rate, followed by Claude Sonnet 4.6 with high reasoning at 92.73%.

The best non-Anthropic entry in that March 2 public file was Qwen 3.5 at 65.45%. The best OpenAI row in the same table was 36.36%, and the best Google entry was 30.91%. DeepSeek v3.2 was far lower, at 12.73% clear pushback without reasoning. At the bottom, some models managed only 3.64%. That is a very wide gap for a task that sounds almost basic: notice when the question makes no sense.
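
Because v1 has only 55 prompts, every percentage in the table corresponds to a whole-number count of clear-pushback responses, so each additional prompt moves a score by roughly 1.8 points. A minimal sketch of that arithmetic, with the counts back-calculated from the published rates rather than taken from the benchmark data:

```python
# Clear-pushback rate over the 55 v1 prompts, assuming rate = count / 55.
def pushback_rate(clear_pushbacks: int, total_prompts: int = 55) -> float:
    return round(100 * clear_pushbacks / total_prompts, 2)

print(pushback_rate(52))  # 94.55 -- consistent with the top Claude Sonnet 4.6 row
print(pushback_rate(2))   # 3.64  -- consistent with the lowest scores in the table
```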

This is not the first time Claude has stood out in a cross-model comparison. In our earlier comparison of how major AI systems handled hate speech, Claude also performed better than the rest. If that kind of model-versus-model testing interests you, that article adds useful context to this one.

Why Smart Models Still Answer Dumb Questions

The short answer is training pressure. Models are taught to be helpful, fluent, and responsive. In many benchmark settings, silence or uncertainty looks like failure, while a polished attempt can still earn partial credit.

OpenAI’s hallucination paper says this clearly: models often guess because the system around them rewards guessing. So when a prompt arrives in a serious tone, the model often tries to repair it, finish it, or answer the closest sensible version.
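
The incentive is easy to see with a toy scoring rule, which is only illustrative and not taken from OpenAI’s paper: one point for a correct answer, zero for anything else, including “I don’t know.”

```python
# Toy model of accuracy-only grading: 1 point for a correct answer, 0 otherwise.
# Abstaining always scores 0, so any nonzero chance of a lucky guess makes
# guessing the better strategy under this rule -- the model learns to keep talking.
def expected_score(p_correct_if_guessing: float, abstain: bool) -> float:
    return 0.0 if abstain else p_correct_if_guessing

print(expected_score(0.2, abstain=False))  # 0.2 -> guessing earns credit on average
print(expected_score(0.2, abstain=True))   # 0.0 -> admitting uncertainty earns nothing
```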

Helpfulness bias

This is the trap: helpfulness can turn into compliance. OpenAI’s Model Spec says the assistant should not say “yes” to everything “like a sycophant” and may need to push back politely. OpenAI also said, after its GPT-4o sycophancy issue in 2025, that its evals were not broad enough to catch behavior that agreed too easily with the user. In other words, the field already knows that smooth agreement can be a failure mode.

When reasoning makes nonsense sound smarter

BullshitBench is especially useful here because it shows how polished the failure can look. In the official Arena write-up, one weak model takes a nonsense software prompt and responds with multiple regression and variance partitioning, as if the problem were methodologically sound.

The same thing can happen with a fake enterprise-tech question about the TCP/IP stack and GDPR compliance. Instead of saying the premise is broken, the model starts producing a neat explanation full of architecture terms and policy language. The answer sounds smart, but the question never deserved an answer in the first place.

Research outside BullshitBench points in the same direction. AbstentionBench, a 2025 paper on unanswerable questions, found that reasoning fine-tuning lowered abstention by 24% on average across 20 datasets. Another 2025 paper, “Answering the Unanswerable Is to Err Knowingly”, suggested that reasoning models may often recognize the flaw and still continue anyway. The issue, then, is likely deeper than prompt quality. A model can see that something is wrong and still choose fluency over refusal.

Hallucination Is Only Part of the Problem

These two failures are close, but they are not the same:

  1. Hallucination: the question is valid, but the answer is invented or wrong.
  2. Broken premise: the question itself should be challenged before any answer begins.

That difference matters for false premise detection. A model can avoid obvious factual errors and still fail badly by accepting a fake frame, a fake method, or a fake causal link. BullshitBench turns that into a clean test of judgment. For LLM reliability, that is useful, because many business, legal, medical, and technical prompts are messy long before the answer stage.

A Reliable Model Should Challenge the Premise

A reliable AI system should check whether the question deserves an answer. That means it can say “I don’t know,” but also “your assumption is broken,” “that framework does not appear to exist,” or “there is no plausible causal mechanism here.”

This kind of pushback is healthy. It protects the user from being carried further into error by fluent text. OpenAI’s Model Spec explicitly tells the assistant to express uncertainty when it would affect the user’s behavior, especially in high-stakes settings.

A good model should push back on nonsense without refusing normal, harmless requests. Anthropic’s Claude Sonnet 4.6 system card is useful here. It says Sonnet 4.6 improved on sycophancy measures, and in Anthropic’s higher-difficulty benign tests its over-refusal rate was 0.18%, compared with 8.50% for Claude Sonnet 4.5. The same card says Sonnet 4.6 more effectively “evaluates the underlying request itself.” That is close to the behavior BullshitBench is trying to reward.

The Real Risk Is Confident Compliance

The clearer risk here is not low intelligence. It is confident cooperation with a bad premise. A model that says “I’m not sure this framework is real” may slow you down for ten seconds. A model that calmly explains the fake framework in six neat steps can waste an hour, or guide a real decision in the wrong direction.

An analyst asks a vague but serious-sounding question. A manager mixes two metrics that should not be compared. A developer treats a made-up performance theory as established practice. In medicine or law, the tone of authority can be even more persuasive. The output looks careful, the jargon is correct, and the premise is still rotten. That is why AI hallucinations are only part of the story. The more dangerous version may be a model that never tells you the question itself is broken.

Can AI Companies Fix This?

Probably, yes — at least partly. Current research and model docs point to a few concrete moves:

  • Train directly on broken-premise detection and unanswerable cases;
  • Reward abstention and clear pushback, so “I don’t know” and “this premise fails” are useful outputs;
  • Separate answer mode from premise-check mode in products where reliability matters (a minimal sketch follows this list);
  • Measure sycophancy and over-refusal together, because blind agreement and blind refusal are two sides of the same trust problem.
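
A minimal sketch of that separation, assuming a generic call_model stand-in rather than any specific vendor SDK, could look like this:

```python
# Hypothetical two-stage pipeline: check the premise first, answer second.
# call_model is a stand-in for whatever LLM client a product already uses.
def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your LLM client of choice")

def answer_with_premise_check(user_prompt: str) -> str:
    check = call_model(
        "Does the following question rest on a false or incoherent premise? "
        "Reply VALID or BROKEN, then one sentence of reasoning.\n\n" + user_prompt
    )
    if check.strip().upper().startswith("BROKEN"):
        # Surface the objection instead of generating a fluent non-answer.
        return "Premise check failed: " + check
    return call_model(user_prompt)
```

The design choice is simply to spend one cheap extra call on the premise before spending a long, fluent answer on a question that never deserved one.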

A model should not become so defensive that it blocks harmless requests. But it also should not act like a very eager intern who will build a slide deck around any sentence with enough jargon in it. The best systems will likely learn a middle behavior: challenge fake premises, state uncertainty clearly, and still help when the request is real.

Reliable AI Should Interrupt You Sometimes

BullshitBench became popular because it asks a question many users already had: why does AI answer nonsense questions so easily? The answer seems to be that modern models are rewarded for motion, fluency, and completion more than for stopping at the right moment. That can be improved, but the benchmark shows the gap is real today.

So here is the practical takeaway. When you test an AI tool, do not only ask whether the answer sounds smart. Watch whether the AI model detects a broken premise, asks for missing evidence, or says the frame itself does not hold up. If a model never pushes back, that is a warning sign. The most trustworthy AI may be the one that interrupts you first — and saves you from building on nonsense.
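
If you want to run that check yourself, a handful of deliberately broken prompts is enough. The probes below are invented examples in the spirit of the benchmark, not items from BullshitBench itself; the point is to read the responses, not to score them automatically.

```python
# Invented broken-premise probes for spot-checking a model by hand.
# A reliable model should question each premise rather than answer it.
PROBE_PROMPTS = [
    "Which layer of the TCP/IP stack is responsible for GDPR compliance "
    "in cross-border cloud backups?",
    "How many story points does the ISO 9001 standard allocate per sprint?",
    "What RAM upgrade is recommended to fix a SQL injection vulnerability?",
]

for prompt in PROBE_PROMPTS:
    print(f"PROBE: {prompt}\n")  # paste each into the tool you are evaluating
```

If none of them draws any pushback, you have learned something important about the tool before you trust it with real work.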
