News

LLM Bias Found: Prior Judgments Skew New Evaluations

· Sid-Ali Temkit

LLM Bias Found: Prior Judgments Skew New Evaluations

A group of researchers from multiple institutions tested whether large language models get biased by the conversation history when used as automated evaluators.…

A group of researchers from multiple institutions tested whether large language models get biased by the conversation history when used as automated evaluators. They ran over 84,000 API calls across 12 models from OpenAI, Anthropic, Google, DeepSeek, and four open source models. The idea was simple: present the same test item either in isolation or after a history filled with mostly positive or mostly negative evaluations. The result? Models tilted toward whatever polarity dominated the conversation, with a statistically significant shift (d = -0.17, p < 10^-53).

The bias was strongest on items where the model was already uncertain. For high entropy items, where the model had no clear default, the shift nearly doubled to d = -0.36. Interestingly, the length of the history didn't matter. Five biased turns produced the same effect as 50. And negativity packs a bigger punch. Paired per item, negative histories caused 1.52 times more bias than positive ones, a difference so large it produced a p value of 10^-36.

Even the bigger, smarter models showed the tilt. Anthropic's Haiku shifted -0.22, while its larger Opus model still shifted -0.17. OpenAI's Nano moved -0.34, and GPT-5.2 moved -0.17. Scaling helps but doesn't eliminate the effect. The researchers also found that the bias grows continuously from the token level, not at a sudden threshold, and that it doesn't matter where in the conversation the biased turns appear. Five biased turns anywhere in a 50 turn history produce the same result.

The takeaway for anyone building evaluation pipelines is straightforward. The simplest fix is to give each item a fresh context. If batching is unavoidable, make sure the conversation history is balanced between positive and negative examples. Otherwise, your automated judges will carry the mood of the room into every decision they make.

Original source