New secret math benchmark stumps AI models and PhDs alike

November 12, 2024


Epoch AI allowed Fields Medal winners Terence Tao and Timothy Gowers to review portions of the benchmark. “These are extremely challenging,” Tao said in feedback provided to Epoch. “I think that in the near term basically the only way to solve them, short of having a real domain expert in the area, is by a combination of a semi-expert like a graduate student in a related field, maybe paired with some combination of a modern AI and lots of other algebra packages.”

A chart showing AI models’ limited success on the FrontierMath problems, taken from Epoch AI’s research paper.

Credit: Epoch AI

To make answers verifiable during testing, each FrontierMath problem must have an answer that can be checked automatically by computation, either an exact integer or another mathematical object. The designers made the problems “guessproof” by requiring large numerical answers or complex mathematical solutions, so that a random guess has less than a 1 percent chance of being correct.
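As a rough illustration of what automatic checking by computation can look like, here is a minimal Python sketch of an exact-match verifier. It is not Epoch AI's actual grading code: the function name `is_correct` and the reference answer are hypothetical, chosen only to show that an exact integer or exact mathematical object can be compared with no tolerance and no partial credit.

```python
from fractions import Fraction


def is_correct(submitted, reference):
    """Exact-match check: no numerical tolerance, no partial credit.

    Hypothetical helper for illustration only; not Epoch AI's grader.
    """
    # Require the same type so a float approximation cannot pass for an exact integer.
    return type(submitted) is type(reference) and submitted == reference


# Hypothetical reference answer: a large exact integer that is hard to guess.
REFERENCE = 2**127 - 1

print(is_correct(2**127 - 1, REFERENCE))         # exact integer: accepted
print(is_correct(float(2**127 - 1), REFERENCE))  # float approximation: rejected

# Exact rationals inside a tuple compare by value, so 2/4 matches 1/2.
print(is_correct((Fraction(2, 4), 3), (Fraction(1, 2), 3)))
```

Because the check is plain equality on exact values, a large answer space leaves little room for a lucky guess, which is the property the article describes as “guessproof.”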

Mathematician Evan Chen, writing on his blog, explained how he thinks FrontierMath differs from traditional math competitions like the International Mathematical Olympiad (IMO). Problems in that competition typically require creative insight while avoiding complex implementation and specialized knowledge, he says. But for FrontierMath, “they keep the first requirement, but outright invert the second and third requirement,” Chen wrote.

While IMO problems avoid specialized knowledge and complex calculations, FrontierMath embraces them. “Because an AI system has vastly greater computational power, it’s actually possible to design problems with easily verifiable solutions using the same idea that IOI or Project Euler does—basically, ‘write a proof’ is replaced by ‘implement an algorithm in code,'” Chen explained.
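To give a sense of the “implement an algorithm in code” style Chen mentions, here is a short Python sketch in the spirit of the first Project Euler problem. It is far easier than anything in FrontierMath and is not taken from the benchmark; the function name `sum_of_multiples` is our own. The point is only that the answer is a single exact integer that a program can produce and a checker can verify.

```python
def sum_of_multiples(limit):
    """Sum every non-negative integer below `limit` divisible by 3 or 5."""
    return sum(n for n in range(limit) if n % 3 == 0 or n % 5 == 0)


# The result is one exact integer, so checking it is a simple comparison.
print(sum_of_multiples(10))    # 3 + 5 + 6 + 9 = 23
print(sum_of_multiples(1000))  # 233168
```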

Epoch AI plans to evaluate AI models against the benchmark regularly while expanding its problem set. The organization says it will release additional sample problems in the coming months to help the research community test their systems.
