According to a source familiar with the discussions, there is previously undisclosed news: earlier this month, OpenAI engineers informed some colleagues that, relying on several newly developed optimization technologies, they have found a solution that can reduce model inference costs by more than half. After applying this new technology to scenarios where free/paid account visitors use ChatGPT, the number of required Nvidia graphics processing units (GPUs) was reduced to just a few hundred — a remarkably low figure. It is currently unclear what specific technical means OpenAI used to achieve this significant improvement in computational efficiency. Common optimization methods in the industry generally include: quantization compression, key-value caching, batch processing of user queries instead of computing them individually, and redirecting some requests to lower-power lightweight models or model shards for responses.
All Comments