Alibaba’s Qwen team has introduced Qwen3-Next, a new large language model architecture designed to improve efficiency in both training and inference for ultra-long context and large-parameter settings.
At its core, Qwen3-Next combines a hybrid attention mechanism with a highly sparse mixture-of-experts (MoE) design, activating just three billion of its 80 billion parameters during inference.
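In a sparse MoE of this kind, a router scores a large pool of expert feed-forward networks for each token and only the few with the highest scores are actually computed, so the bulk of the 80 billion parameters sit idle on any given forward pass. The sketch below illustrates that routing pattern with a generic top-k mixture-of-experts layer; the layer sizes, expert count and router are placeholders, not Qwen3-Next's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative mixture-of-experts layer: only top_k experts run per token.
    Sizes and routing details are placeholders, not Qwen3-Next's real config."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)         # router score per expert
        weights, idx = scores.topk(self.top_k, dim=-1)     # keep only the top_k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(4, 512)                                    # 4 tokens
print(SparseMoELayer()(x).shape)                           # torch.Size([4, 512])
```

The ratio of experts selected to experts available is what separates total from active parameters; in Qwen3-Next's case that gap works out to roughly 3 billion active out of 80 billion total.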
The announcement blog explains that the new architecture allows the base model to match, and in some cases outperform, the dense Qwen3-32B while using less than 10% of its training compute. In inference, throughput reaches more than 10x that of Qwen3-32B at context lengths beyond 32,000 tokens.
Two post-trained versions are being released: Qwen3-Next-80B-A3B-Instruct and Qwen3-Next-80B-A3B-Thinking. The Instruct model performs close to the 235B flagship and shows clear advantages in ultra-long context tasks of up to 256,000 tokens. The Thinking model, aimed at complex reasoning, outperforms mid-tier Qwen3 variants and even the closed-source Gemini-2.5-Flash-Thinking on several benchmarks.
Among the key technical innovations are Gated DeltaNet layers mixed with standard attention, training stabilised via Zero-Centred RMSNorm, and Multi-Token Prediction for faster speculative decoding. These designs also address stability issues typically seen in reinforcement learning training with sparse MoE structures.
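Of these, the normalisation change is the simplest to illustrate. The announcement frames it as a way to stop norm weights from growing abnormally large; one common way to build a zero-centred RMSNorm is to parameterise the scale as 1 plus a weight initialised at zero, so that weight decay pulls the scale back toward the identity. The snippet below assumes that formulation and is a sketch, not Qwen's actual implementation.

```python
import torch
import torch.nn as nn

class ZeroCenteredRMSNorm(nn.Module):
    """RMSNorm whose scale is (1 + weight), with weight initialised to zero.
    Applying weight decay to `weight` then pulls the scale back toward 1,
    one way to keep norm weights bounded (assumed formulation)."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))   # zero-centred: scale = 1 + weight

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * (1.0 + self.weight)

norm = ZeroCenteredRMSNorm(512)
print(norm(torch.randn(2, 512)).shape)   # torch.Size([2, 512])
```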
Pretrained on a 15-trillion-token dataset, Qwen3-Next demonstrates not just higher accuracy than Qwen3-32B but also greater efficiency, requiring only 9.3% of its training compute. Its architecture enables near-linear scaling of throughput, delivering up to 7x speedup in prefill and 4x in decode at shorter context lengths.
The models are available via Hugging Face, ModelScope, Alibaba Cloud Model Studio and NVIDIA API Catalog, with support from inference frameworks like SGLang and vLLM. According to the company, this marks a step towards Qwen3.5, targeting even greater efficiency and reasoning capabilities.
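For reference, loading the Instruct checkpoint should look much like any other Hugging Face causal language model, assuming a transformers build recent enough to include Qwen3-Next support. The model ID below matches the published repository name, but the generation settings are illustrative and worth checking against the official model card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"   # repo name as published on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # let the checkpoint choose its precision
    device_map="auto",       # shard the 80B parameters across available GPUs
)

messages = [{"role": "user",
             "content": "Summarise the Qwen3-Next architecture in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Serving through vLLM or SGLang instead follows those projects' standard OpenAI-compatible server workflows rather than the raw transformers loop shown here.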