According to monitoring by Dongcha Beating, Chinese large-model company MiniMax has officially open-sourced the weights of MiniMax M3, its natively multimodal mixture-of-experts (MoE) model, on Hugging Face. MiniMax M3 has 428 billion total parameters, of which 23 billion are activated per token, and natively supports a context window of one million tokens. To reduce the memory cost of deployment, the development team simultaneously released an MXFP8 quantized version that is compatible with mainstream inference frameworks such as SGLang, vLLM, and Transformers. On the multimodal side, MiniMax M3 is trained jointly on text, images, and video during pre-training to achieve native semantic fusion, rather than aligning the modalities in post-training. The model offers two inference modes: a Thinking mode for complex reasoning and tool orchestration, and a Non-thinking mode for low-latency dialogue and code generation. The million-token context is supported by MiniMax Sparse Attention (MSA), a lightweight attention kernel library that has also been open-sourced. According to official figures, MSA uses a chunk retrieval mechanism built on grouped-query attention (GQA), and in tests with one million tokens on the NVIDIA Blackwell (SM100) architecture it delivers more than 9x faster prefill and 15x faster decoding while significantly reducing inference costs.
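To put the parameter figures in perspective, the sketch below works out what share of the model is active per token and roughly how much memory the weights alone would occupy. It is a back-of-the-envelope estimate, not an official number: the byte widths (2 bytes per parameter for BF16, about 1 byte for an 8-bit format) are assumptions, and the calculation ignores the small per-block scale overhead of MX formats as well as the KV cache and activations.

```python
# Figures quoted in the report.
TOTAL_PARAMS = 428e9   # total parameters
ACTIVE_PARAMS = 23e9   # parameters activated per token

# Assumed storage widths in bytes per parameter (not from the report).
BYTES_PER_PARAM = {"bf16": 2.0, "8-bit (approx.)": 1.0}


def active_fraction(total: float, active: float) -> float:
    """Share of the parameters used for any single token."""
    return active / total


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Weights-only footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    frac = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
    print(f"active per token: {frac:.1%}")
    for name, width in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(TOTAL_PARAMS, width)
        print(f"{name}: ~{gb:.0f} GB of weights")
```

Under these assumptions, about 5.4% of the parameters are active for any given token, and the weights take roughly 856 GB at BF16 versus roughly 428 GB at 8 bits, which is the kind of saving the quantized release targets.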