😮 Researchers impressed by ChatGPT's latest model

OpenAI's new model, o1, outperforms previous models on advanced scientific and mathematical tasks. In about an hour, o1 reproduced work on a black hole calculation that had previously taken a researcher several months. It solved 83% of the problems on a qualifying exam for the International Mathematical Olympiad.

Mathias Sundin 04.Nov.2024 2 min read

Improved performance in science and mathematics

OpenAI's new language model, o1, shows significant progress in science and mathematics compared with previous models. On the challenging Graduate-Level Google-Proof Q&A Benchmark (GPQA), o1 scored 78 percent overall and did particularly well in physics, where it answered 93 percent of the questions correctly, Nature reports.

These scores surpass those of previous models and even those of researchers with doctoral degrees.

In mathematics, o1 also showed remarkable progress. On a qualifying exam for the International Mathematical Olympiad, o1 solved 83 percent of the problems correctly, compared with just 13 percent for GPT-4o, OpenAI's previous top model.

Benefits for scientific work

Researchers who have tested o1 report that the model can be of great help in scientific work. Mario Krenn, leader of the Artificial Scientist Lab at the Max Planck Institute, used o1 in a tool that scans scientific literature and generates new research ideas. He states that o1 produces "much more interesting ideas" than previous models.

Kyle Kabasares, a data scientist at the Bay Area Environmental Research Institute, used o1 to replicate code from his doctoral project that calculates the mass of black holes. He describes the experience as impressive and notes that o1 accomplished in about an hour what had previously taken him several months.

Improved reasoning and longer processing time

A distinguishing feature of o1 is its improved ability to reason. The model uses a method called "chain-of-thought," in which it reasons its way to a solution step by step and corrects itself along the way. This makes its answers slower but more capable, especially in areas where right and wrong answers can be clearly defined.

Testing and limitations

OpenAI has allowed a group of researchers to test o1 before launch. While the model proved useful for developing scientific experimental protocols, testers also noted some shortcomings. For example, important safety information related to dangerous steps in the experiments was sometimes missing.

Andrew White, a chemist at FutureHouse, points out that o1 is still not perfect or reliable enough to be used without careful scrutiny. He recommends that the model be primarily used as guidance for experts rather than beginners, as expert knowledge is required to assess the quality of o1's output.