
How SPRT Slashes LLM Debate Costs 3.7x with Minimal Accuracy Loss

Andrea Morandi


A new paper proposes a simple fix for a persistent headache in multi-agent LLM systems: deciding how many debate rounds to run. Current approaches pick a fixed number, which wastes compute on easy questions and cuts hard ones off too soon. The authors adapt a classical statistical method, the Sequential Probability Ratio Test (SPRT), to act as a plug-in compute governor for LLM debates.

Here is how it works. After each round of debate, a separate LLM judge assigns a consensus score between 0 and 1. The SPRT monitor accumulates these scores as evidence about whether the debate is usefully converging or just spinning its wheels. It stops the debate once the evidence is strong enough either way, or when a maximum round limit is reached. Under some statistical assumptions, the stopping rule inherits the error guarantees of the original SPRT, but the authors note that in practice the calibration matters more than the math.
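The stopping rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the article does not say how judge scores are turned into a likelihood ratio, so the sketch assumes the scores are Gaussian with one mean when the debate is converging and a lower mean when it is stalled. The names SprtMonitor and run_debate, the means, the standard deviation, the error rates, and the round cap are all hypothetical. Only the thresholds follow Wald's classical SPRT.

```python
import math
from dataclasses import dataclass


@dataclass
class SprtMonitor:
    # All defaults are illustrative assumptions, not values from the paper.
    mu_converging: float = 0.8  # assumed mean judge score when converging (H1)
    mu_stalled: float = 0.4     # assumed mean judge score when stalled (H0)
    sigma: float = 0.15         # assumed shared standard deviation of scores
    alpha: float = 0.05         # target false "converged" rate
    beta: float = 0.05          # target false "stalled" rate
    max_rounds: int = 5         # hard cap on debate rounds
    llr: float = 0.0            # running log-likelihood ratio, H1 vs H0
    rounds: int = 0

    def update(self, score: float) -> str:
        """Feed one judge score; return 'converged', 'stalled', 'capped' or 'continue'."""
        self.rounds += 1
        # Gaussian log-likelihood ratio increment for a single observation.
        diff = self.mu_converging - self.mu_stalled
        centered = 2 * score - self.mu_converging - self.mu_stalled
        self.llr += diff * centered / (2 * self.sigma ** 2)
        # Wald's approximate decision thresholds.
        upper = math.log((1 - self.beta) / self.alpha)
        lower = math.log(self.beta / (1 - self.alpha))
        if self.llr >= upper:
            return "converged"
        if self.llr <= lower:
            return "stalled"
        if self.rounds >= self.max_rounds:
            return "capped"
        return "continue"


def run_debate(judge_scores, monitor=None):
    """Consume judge scores round by round until the monitor says stop."""
    monitor = monitor or SprtMonitor()
    decision = "continue"
    for score in judge_scores:
        decision = monitor.update(score)
        if decision != "continue":
            break
    return decision, monitor.rounds
```

With these assumed parameters, a single strong score such as run_debate([0.9]) stops after one round as "converged", while a run of ambiguous scores near 0.6 never crosses either threshold and ends as "capped" at the round limit. The two outcomes loosely mirror the early stopping and capping behaviors the article reports.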

The team ran two evaluations. The first was a Monte Carlo study to characterize the method's behavior. The second was a real test on 200 questions from MMLU and 200 from GSM8K, using three different LLMs as debaters (gpt-5, claude-opus-4-6, gemini-2.5-pro) and a fourth as judge. On GSM8K, the adaptive debate stopped after an average of 1.01 rounds (4.06 LLM calls) with 97% accuracy, compared to 99% for a fixed 5-round debate using 15 calls. That is roughly a 3.7x reduction in LLM calls (15 / 4.06) for 2 percentage points of accuracy loss. On MMLU, the calibration essentially collapsed and the rule capped out on 99.5% of items at 2.1x the cost.
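The headline GSM8K figures can be checked with simple arithmetic. The snippet below uses only the numbers quoted above. The function name cost_reduction is a hypothetical helper introduced for illustration.

```python
def cost_reduction(fixed_calls: float, adaptive_calls: float) -> float:
    """Ratio of LLM calls used by the fixed debate to those used by the adaptive one."""
    return fixed_calls / adaptive_calls


# Figures reported for GSM8K in the article.
ratio = cost_reduction(fixed_calls=15, adaptive_calls=4.06)
accuracy_loss_pp = 99 - 97  # fixed-debate accuracy minus adaptive accuracy

print(f"{ratio:.1f}x fewer calls, {accuracy_loss_pp} percentage points lower accuracy")
```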

The key takeaway is not that SPRT makes debates more accurate. It does not. Rather, it offers a cheap way to control compute spend and to detect when a multi-agent system is failing. Think of it as a classical statistical governor that adds a layer of cost control and failure detection to what is otherwise a brute-force approach.
