When OpenAI first introduced its o3 “reasoning” model back in December, it made waves in the AI world. Touted as a major leap forward, o3 was showcased alongside the ARC-AGI benchmark, a rigorous test designed to measure advanced AI reasoning skills. But just a few months later, new data has emerged that paints a slightly different picture of its true cost and performance.
The Arc Prize Foundation, which oversees the ARC-AGI benchmark, recently revised its earlier estimate of how much it costs to run OpenAI’s most powerful o3 configuration, dubbed “o3 high.” Initially, the foundation pegged the cost at around $3,000 per task. But according to its latest assessment, that figure could be closer to $30,000 per task, a tenfold increase.
This update underscores a key challenge facing today’s cutting-edge AI systems: staggering operational costs. OpenAI hasn’t made the o3 model publicly available yet, or even revealed its official pricing. In the meantime, the Arc Prize Foundation says OpenAI’s highest-tier available model, o1-pro, might offer a good comparison point for estimating what o3 could cost.
Mike Knoop, co-founder of the Arc Prize Foundation, explained that o1-pro is a better stand-in for o3 high due to the enormous amount of computing power both models require at test time. However, he also clarified that this remains speculative. Until OpenAI officially announces pricing, o3 remains marked as “preview” on the ARC-AGI leaderboard to reflect that uncertainty.
The jump in cost makes more sense when you consider how much compute o3 high actually needs. According to the foundation, this version of the model used 172 times more computing resources than o3 low, the most efficient configuration tested.
There’s also growing speculation about how OpenAI might price the model for enterprise clients. Earlier this year, The Information reported that OpenAI was exploring AI-powered “agents” for specific tasks, such as writing code, and could charge up to $20,000 per month for access. That level of pricing could place o3 firmly in the enterprise tier.
Of course, some argue that even at those prices, o3 might still be cheaper than hiring a full-time software engineer. But others, including AI researcher Toby Ord, have raised concerns about efficiency. Ord noted that to achieve its best scores on ARC-AGI, o3 high had to attempt each task 1,024 times, a potentially wasteful approach when compared to human reasoning.
As OpenAI continues to push the envelope on AI capabilities, the real-world costs of running these powerful models could shape how — and where — they’re deployed. While o3’s reasoning skills may be impressive, its financial demands might limit how broadly it can be adopted in the near term.