According to monitoring by Dongcha Beating, Logan Kilpatrick, Senior Product Manager at Google DeepMind and head of Google AI Studio, stated on X that every company building products on top of AI should establish its own benchmark tests to measure model performance. He described this as a way to make model advancements 'disproportionately beneficial to your company' and suggested that founders and business owners 'start tomorrow.'

Currently, most companies rely on public leaderboards to select AI models, but those rankings measure general capabilities and are often disconnected from specific business scenarios. A company focused on contract review, for instance, cares primarily about the accuracy of clause extraction, a task that public benchmarks do not cover, which makes it hard to judge how a model will actually perform there.

Building in-house benchmarks offers two benefits. First, a company can evaluate each model update against its own business tasks and pick the model that performs best in its specific context, rather than the one that ranks highest publicly. Second, it can feed these test sets back to model providers, encouraging continuous optimization in the areas it cares about. Kilpatrick noted that companies such as Zapier and Sierra are already doing this, stating, 'There is a lot of alpha (excess return) to be created here.'
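To make the idea concrete, below is a minimal sketch of what such an internal benchmark harness could look like for the contract-review example. Everything in it is illustrative: the test cases, the `call_model` placeholder, the exact-match metric, and the model names are assumptions for demonstration, not anything published by Kilpatrick, Google, Zapier, or Sierra.

```python
# Minimal sketch of an internal, business-specific benchmark (illustrative only).
# Each test case pairs a contract excerpt with the clause the business expects
# to be extracted. `call_model` is a placeholder to be replaced with a real
# provider API call via that provider's official SDK.

from typing import Callable

TEST_CASES = [
    {
        "contract": "... full contract text ...",
        "expected_clause": "Either party may terminate with 30 days' written notice.",
    },
    # ... more cases drawn from the company's own documents
]


def call_model(model_name: str, prompt: str) -> str:
    """Placeholder: swap in the actual model API call here."""
    raise NotImplementedError


def exact_match(predicted: str, expected: str) -> bool:
    # Simplest possible metric; real harnesses often use fuzzy matching
    # or judge-model scoring instead of strict string equality.
    return predicted.strip().lower() == expected.strip().lower()


def run_benchmark(model_name: str,
                  score: Callable[[str, str], bool] = exact_match) -> float:
    """Score one model on the fixed in-house test set and return accuracy."""
    correct = 0
    for case in TEST_CASES:
        prompt = (
            "Extract the termination clause from this contract:\n"
            f"{case['contract']}"
        )
        prediction = call_model(model_name, prompt)
        if score(prediction, case["expected_clause"]):
            correct += 1
    return correct / len(TEST_CASES)


# Usage (hypothetical model names): re-run the same fixed test set whenever a
# new model version ships, and keep the scores over time.
# for model in ["model-a-latest", "model-b-latest"]:
#     print(model, run_benchmark(model))
```

The point of keeping the test set fixed and re-running it on every release is exactly what the statement describes: model choice is then driven by the company's own tasks, and the same test set can be shared with providers as concrete feedback.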