Claude Rises, Grok Falls: How Six AI Giants Handle Hate Speech

The Anti-Defamation League (ADL) has released its first AI Index, a report that checks how well major AI chatbots handle hate speech and extremist content. The idea is simple: if people use chatbots for search, writing, and “explain this to me,” these systems should also know when to refuse harmful requests and push back with facts.

To test this, ADL researchers ran more than 25,000 interactions between August and October 2025, roughly 4,181 chats per model. All six chatbots were tested under the same protocol: ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), LLaMA (Meta), Grok (xAI), and DeepSeek (China-based).

The results show a big gap: some models are much better at recognizing hate and refusing it, while others still fail in basic safety tasks.

How ADL Scored the Models

ADL, a U.S.-based civil rights organization founded in 1913, used a 0–100 score, where 100 means the model handled the test perfectly.

High scores went to chatbots that clearly recognized hateful or extremist content and then responded responsibly. In practice, that meant refusing to support the prompt, explaining why it was harmful, or offering a safer, corrective answer. Low scores went to models that missed the hate entirely, treated it as “neutral” content when they should have flagged it, or, in the worst case, echoed or supported the harmful idea.

The evaluation covered 37 subcategories of antisemitic and extremist content. Scoring was done by human reviewers, with AI tools used to keep ratings consistent.
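
ADL has not published its exact rubric, but the description above maps naturally onto a simple label-and-average scheme. Here is a minimal Python sketch, with hypothetical labels and point values, of how per-response judgments could roll up into a 0–100 model score:

  # Hypothetical rubric: labels and point values are illustrative,
  # not ADL's published methodology.
  LABEL_POINTS = {
      "refused_and_explained": 100,  # recognized the hate, refused, said why
      "refused_flat": 75,            # refused, but without reasoning
      "treated_as_neutral": 25,      # missed the harmful framing
      "echoed_or_supported": 0,      # repeated or endorsed the harmful idea
  }

  def model_score(labels):
      """Average per-response points into a 0-100 model score."""
      return sum(LABEL_POINTS[label] for label in labels) / len(labels)

  # Three reviewed responses for one model -> about 66.7
  print(model_score(["refused_and_explained", "treated_as_neutral", "refused_flat"]))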

How ADL Tested the Chatbots

ADL used a mix of prompt types. The goal was to see how the models behave in real-life situations, where users might hide harmful intent behind “reasonable” requests.

ADL tested the six chatbots in several formats:

  • Survey-style statements: “Do you agree or disagree?”
    This checks basic recognition.
  • Open-ended challenges: “Give arguments for and against a claim.”
    This checks if a model will “both-sides” hateful ideas.
  • Multi-step conversations: long back-and-forth chats
    This checks if safety stays consistent over time. 
  • Document summaries: summarize or extract key points from a text
    This checks if the model repeats hate while “just summarizing.”
  • Image interpretation (for models with vision): interpret images with hateful symbols or propaganda-style messaging
    This checks visual moderation, which is often weaker than text moderation.

This mix matters because many models can block obvious hate but fail when the hate is embedded in a “neutral” task.
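
To make this concrete, here is a minimal Python sketch of such a multi-format probe. The ask_model function and the placeholder claim are assumptions for illustration; a real harness would call each chatbot’s API and log the replies for human review:

  # Wrap one test claim in several task framings, assuming a hypothetical
  # ask_model(prompt) function that calls the chatbot under test.
  CLAIM = "<a claim from the test set>"

  PROBES = {
      "survey": f"Do you agree or disagree with this statement? {CLAIM}",
      "debate": f"Give arguments for and against this claim: {CLAIM}",
      "summary": f"Summarize the key points of this text: {CLAIM}",
  }

  def run_probes(ask_model):
      """Send the same underlying claim in every format; collect the replies."""
      return {name: ask_model(prompt) for name, prompt in PROBES.items()}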

The Scorecard: Claude Leads, Grok Falls Behind

ADL’s overall scores (0–100) were:

  1. Claude (Anthropic) — 80/100
  2. ChatGPT (OpenAI) — 57/100
  3. DeepSeek — 50/100
  4. Gemini (Google) — 49/100
  5. LLaMA (Meta) — 31/100
  6. Grok (xAI) — 21/100

That is a 59-point gap between the top model (80) and the bottom model (21). No model scored in the 90s, which makes one point clear: this is still an open problem.

Why Claude Scored Highest

Claude’s 80/100 was the strongest result by far. In ADL’s testing, Claude usually spotted hateful framing quickly, refused unsafe requests, and explained its refusal in a clear, direct way.

Claude reflects Anthropic’s safety approach, which the company calls “Constitutional AI”: the model is trained against a written set of principles and safety rules. The practical benefit is that Claude often does not just say “no,” but also gives a short reason and pushes the user toward safer framing.
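
Anthropic’s published description of Constitutional AI has the model critique and revise its own drafts against written principles during training. The Python sketch below is a heavily simplified illustration of that critique-and-revise loop; the generate function and the prompt wording are assumptions, not Anthropic’s actual implementation:

  # Simplified critique-and-revise step in the spirit of Constitutional AI.
  # generate(prompt) is a hypothetical call to a base language model.
  PRINCIPLE = "Choose the response least likely to promote hatred."

  def revise_with_principle(generate, user_prompt):
      draft = generate(user_prompt)
      critique = generate(
          f"Critique this reply against the principle '{PRINCIPLE}':\n{draft}"
      )
      return generate(
          f"Rewrite the reply to address the critique.\n"
          f"Reply: {draft}\nCritique: {critique}"
      )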

Still, even Claude had weaker results in the hardest area: extremist narratives. That was difficult for every model, but Claude struggled the least.

If you need a chatbot for safer customer-facing use (support, education, content tools), Claude looks like the best option in this specific benchmark.

ChatGPT: Clear Second Place, But Not “Safe Enough”

ChatGPT scored 57/100, which is a clear second place, but still far from outstanding.

In the ADL setup, ChatGPT generally handled direct hate better than subtle, context-heavy cases. Like many models, it can still slip when harmful messaging is indirect, when the user frames the task as “analysis” or a “summary,” or when the conversation becomes long and complex.

This matches a common pattern in AI safety: systems often block obvious disallowed content, but they can miss “soft” versions of the same idea when it is written in a more polite or academic style.

ChatGPT is safer than several competitors in this test, but ADL’s score suggests it still needs to be more consistent in hard cases.

DeepSeek and Gemini: The Middle Tier

DeepSeek scored 50/100, and Gemini scored 49/100. That is almost a tie.

These “middle” scores usually mean the model is inconsistent. It may refuse correctly in one case, miss important context in another, and summarize harmful material too neutrally in a third.

For companies, this middle tier can be tricky. A model that fails “sometimes” can still cause serious problems, especially in public-facing use.

LLaMA and Grok: The Biggest Safety Risks in This Test

Meta’s LLaMA scored 31/100, and xAI’s Grok scored 21/100.

A low score does not mean the model is useless. It usually means the system needs stronger safety fine-tuning, extra moderation layers, and stricter filters around risky topics before it can be used safely in many settings.

ADL’s results suggest that Grok had the hardest time staying safe across different formats, especially in longer chats and in tasks like summarizing or transforming provided content.

A Key Problem ADL Highlighted: The “Format Gap”

One of the biggest lessons from this kind of testing is what we can call a format gap:

  • Models do better with simple text (“Is this hate? Yes/no.”)
  • Models do worse with documents, multi-step chats, and images.

This matters because real users do not always ask direct questions. They ask for summaries, scripts, “extract key points,” “analyze both sides,” and image explanations.

If safety only works for direct prompts, it will fail in normal use.
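
One simple way to measure a format gap is to compare refusal rates per format against direct prompts. A minimal Python sketch, with invented numbers for illustration:

  # Hypothetical per-format refusal rates for one model.
  rates = {"direct": 0.92, "summary": 0.61, "multi_turn": 0.55, "image": 0.48}

  baseline = rates["direct"]
  gaps = {fmt: round(baseline - r, 2) for fmt, r in rates.items() if fmt != "direct"}
  print(gaps)  # {'summary': 0.31, 'multi_turn': 0.37, 'image': 0.44}

In this invented example, image tasks lag direct prompts by 44 points, which is exactly the kind of blind spot that mixed-format testing is designed to catch.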

What This Means for Users and Businesses

If you are choosing an AI model for work, ADL’s scorecard points to three practical rules:

  1. Safety is not automatic. It depends on training choices.
  2. Benchmarks matter. A big score gap (like 80 vs 21) is a real warning sign.
  3. Context is the weak spot. Summaries, documents, and images are where models often fail.

For enterprise use (support bots, education, moderation tools), these differences can mean real legal and reputational risk.

Final Thought

ADL’s first AI Index makes one thing clear: AI safety is something developers have to build on purpose. In 2026, the “best” chatbot is the one that pairs strong reasoning with consistent behavior across real-world formats like long conversations, documents, and images.
