
OpenAI wins gold at the International Mathematical Olympiad. The new LLM clears a historic benchmark for mathematical reasoning


An announcement from researcher Alexander Wei marks a milestone for artificial intelligence: OpenAI’s new experimental model achieves gold medal-level performance in the world’s most prestigious mathematics competition, sparking debate over the limits, merits, and implications of this leap in algorithmic reasoning.




An announcement that shakes the AI community: “Gold medal-level” for OpenAI’s LLM

On the morning of July 19, 2025, Alexander Wei, a leading researcher at OpenAI, posted a thread on X that quickly spread across the web. The reason was clear: the team’s new experimental reasoning model had achieved a result that, until recently, seemed out of reach for general-purpose AI. Evaluated on the actual problems of the 2025 International Mathematical Olympiad (IMO) under replicated competition conditions (six high-difficulty problems, solutions written out as natural-language proofs, no access to the internet or external tools), the system scored at gold medal level.

The announcement was accompanied by a symbolic image: a stylized strawberry, a recurring emblem of OpenAI’s reasoning models, standing on a podium and wearing a medal with the OpenAI logo.


A historic challenge: why the IMO is the “Mount Everest” of AI reasoning

The International Mathematical Olympiad has long been the pinnacle of competitive mathematics for the world’s strongest pre-university students. Solving five or six of the six problems, under tight time constraints (two 4.5-hour sessions of three problems each), has always been considered the gold standard of mathematical creativity and abstract reasoning.

In AI, this benchmark has long been a “grand challenge”: most previous systems, even those equipped with calculation plug-ins or chain-of-thought prompting, reached at best average or “silver-level” scores, short of the threshold required for gold. OpenAI’s new LLM thus marks a genuine break with the past.



How the model was evaluated: methodology and results

According to statements from Alexander Wei and other team members, the model was tested on the official IMO 2025 problems and solved five of the six, for a total of 35/42 points, the same range as the top human medalists this year. All answers were produced in natural language, with detailed explanations, and without access to the internet or external tools.
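For context, a scoring detail not spelled out in the thread but standard for the competition: each of the six IMO problems is graded out of 7 points, so the maximum score is 6 × 7 = 42. Solving five problems for 35 points therefore implies full marks, 5 × 7 = 35, on every problem the model completed, with the sixth left unsolved.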

Noam Brown, another researcher involved in the project, clarified that the model was not specifically trained on IMO problems but was evaluated as a general-purpose reasoner, further underscoring the robustness of its reasoning and problem-solving capabilities.


The reactions: enthusiasm, open questions, and calls for independent verification

Within hours, the news had spread across X, LinkedIn, and Reddit, with comments ranging from excitement to the caution that typically greets major AI milestones. Researchers and industry insiders emphasized the historic significance of the achievement but called for independent review of the solutions, given past incidents of “hallucination” even in advanced models. Some threads compared the feat to DeepMind’s AlphaProof, suggesting a new era of intense competition among the major AI research labs.


