Large language models (LLMs) often behave unpredictably during inference, producing different outputs even when given the same prompt.
Thinking Machines, an AI company founded by former OpenAI CTO Mira Murati, says it has identified the root cause of this nondeterminism and developed a solution that could make inference reproducible and reliable.
In a blog post titled “Defeating Nondeterminism in LLM Inference”, the company explained that the problem goes deeper than the commonly cited combination of floating-point rounding and GPU concurrency.
While rounding errors from parallel computations do play a role, Thinking Machines argues that the real culprit is the lack of batch invariance in widely used inference kernels.
Batch invariance means that a model’s output for a given prompt should remain identical regardless of the batch size or how requests are grouped together. In current systems, many operations—such as matrix multiplications, attention mechanisms, and normalisation—change their internal computation strategies depending on batch size.
Because floating-point arithmetic is not associative, a different reduction order produces slightly different results, and these tiny numerical differences can cascade into divergent outputs over long generations.
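The underlying effect is easy to see outside of any inference engine. The sketch below is a general illustration, not code from the blog post: it sums the same float32 values with two different reduction orders, the kind of difference that arises when a kernel switches strategies with batch size.

```python
import numpy as np

# Float32 addition is not associative, so two reduction orders over the
# same data can disagree in the last bits.
rng = np.random.default_rng(0)
values = rng.standard_normal(100_000).astype(np.float32)

# Strategy 1: a single left-to-right accumulation.
sequential = np.float32(0.0)
for v in values:
    sequential += v

# Strategy 2: a blocked (tree-like) reduction, as a tiled kernel might use.
blocked = values.reshape(1_000, 100).sum(axis=1).sum()

print(sequential, blocked, sequential == blocked)  # typically unequal
```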
To address this, the team built batch-invariant kernels for key operations, including RMSNorm, matmul, and attention. Testing on the Qwen-3-8B model, they found that under default settings, 1,000 runs of the same prompt at temperature 0 produced 80 unique completions. With the modified kernels, all 1,000 completions were identical, demonstrating full reproducibility.
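As a rough picture of how such a check could be scripted (the function below and its `generate` callable are hypothetical stand-ins, not the blog's actual test harness):

```python
from typing import Callable

def count_unique_completions(generate: Callable[[str], str],
                             prompt: str,
                             runs: int = 1_000) -> int:
    """Run greedy (temperature-0) decoding `runs` times and count distinct outputs."""
    completions = {generate(prompt) for _ in range(runs)}
    return len(completions)

# A fully deterministic stack returns 1 here; the experiment described above
# reports 80 unique completions with default kernels and a single completion
# with the batch-invariant kernels.
```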
The trade-off, however, is speed. The batch-invariant setup ran slower than default inference, though optimisations to the attention kernel helped reduce the gap. Still, Thinking Machines argues that the performance cost is a fair price for the gains in determinism, especially for use cases in research, safety, and debugging.
“Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models,” the blog noted, adding that eliminating nondeterminism could also reduce discrepancies between the training and inference phases of LLM deployment.
By reframing nondeterminism as a batch invariance problem, the company hopes to influence the design of future inference engines, where determinism may become as critical as raw speed.