According to monitoring by Beating, AI startup Boson AI has released the weights for Higgs Audio v3 TTS, its autoregressive text-to-speech (TTS) model. The model is built on Qwen3-4B, has approximately 4 billion parameters, and is optimized for streaming interaction in real-time voice agents. It can begin synthesizing speech before the input text has been fully generated, which reduces latency in real-time voice conversations. Higgs Audio v3 TTS supports more than 100 languages and dialects and achieves single-digit word error rates on test sets such as Seed-TTS, CV3, and MiniMax-Multilingual. The model supports zero-shot voice cloning, and more than 20 emotions and various inline control tags (covering tone, speech rate, pitch, and pauses, as well as effects such as coughing, sighing, and laughter) can be embedded directly in the input text for fine-grained control over vocal expressiveness. Boson AI collaborated with the LMSYS team to optimize the end-to-end serving performance of Higgs Audio v3 TTS on the SGLang-Omni inference framework. In testing on an H100 GPU, the model achieved a real-time factor (RTF) of 0.147 at a concurrency of one. The weights are publicly available on Hugging Face under a non-commercial research license.
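To put the reported RTF in context, the short Python sketch below works through the arithmetic. It assumes the common definition of real-time factor, synthesis time divided by the duration of the audio produced; the report does not state which definition was used. The function names and the 10-second clip are illustrative only.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: compute time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(rtf: float) -> float:
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


REPORTED_RTF = 0.147  # figure quoted for a single request on an H100

# Hypothetical 10-second utterance: estimated compute time at the reported RTF.
clip_seconds = 10.0
estimated_compute = REPORTED_RTF * clip_seconds

print(f"Estimated compute for {clip_seconds:.0f}s of audio: {estimated_compute:.2f}s")
print(f"Speed relative to real time: {speedup_over_real_time(REPORTED_RTF):.1f}x")
```

Under that assumption, an RTF below 1 means audio is generated faster than it plays back, which is the precondition for streaming speech without gaps.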