Unveiling the Truth Behind OpenAI's GPT-4o Performance

A comparison of OpenAI's GPT-4o and GPT-4 Turbo shows that GPT-4o falls short in reading comprehension and only slightly outperforms its predecessor in other areas. Was the hype surrounding GPT-4o justified, or was it just a way to troll Google?

OpenAI stole the spotlight from Google ahead of Google's major event, Google I/O. However, when OpenAI's announcement arrived, all it had to show was a language model that was only a slight improvement on the previous one. The "magic" part of the model wasn't even in the Alpha testing stage yet.

This might have left users feeling like a mom receiving a vacuum cleaner for Mother's Day, but it did manage to divert press attention away from Google's significant event.

The Letter O

The first hint that there’s at least a little trolling going on is the name of the new GPT model, 4“o”, with the letter “o” as in the name of Google’s event, I/O.

OpenAI says that the letter O stands for Omni, which means everything, but it sure seems like there’s a subtext to that choice.

GPT-4o Oversold As Magic

Sam Altman teased an upcoming announcement on Twitter, hinting at exciting new developments that he described as feeling like "magic." He mentioned that it was not related to GPT-5 or a search engine, but rather something different that they had been working hard on and believed people would enjoy.

OpenAI co-founder Greg Brockman tweeted:

“Introducing GPT-4o, our new model which can reason across text, audio, and video in real time.

It’s incredibly versatile, enjoyable to interact with, and represents a significant advancement in how humans interact with computers (and even how computers interact with each other).”

The previous versions of ChatGPT utilized three models to handle audio input. One model converted the audio input into text, another model completed the task and produced the text version, and a third model converted the text output back into audio. The breakthrough with GPT-4o is its ability to process audio input and output within a single model, delivering results in the same time it takes a human to listen and respond to a question.
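The difference between the two designs can be sketched in a few lines of Python. This is only an illustration of the flow described above, not OpenAI's implementation: every function name here (transcribe, generate_reply, synthesize, cascaded_voice_reply, single_model_voice_reply) is hypothetical, and the stubs simply treat audio as UTF-8 bytes so the example can run.

```python
from typing import Callable


def transcribe(audio: bytes) -> str:
    """Hypothetical stage 1: turn audio input into text."""
    return audio.decode("utf-8")  # stand-in for a real speech-to-text model


def generate_reply(prompt: str) -> str:
    """Hypothetical stage 2: produce a text answer from the text input."""
    return f"You said: {prompt}"  # stand-in for a real language model


def synthesize(text: str) -> bytes:
    """Hypothetical stage 3: turn the text answer back into audio."""
    return text.encode("utf-8")  # stand-in for a real text-to-speech model


def cascaded_voice_reply(audio: bytes) -> bytes:
    """Older design: three separate models chained one after another."""
    text_in = transcribe(audio)
    text_out = generate_reply(text_in)
    return synthesize(text_out)


def single_model_voice_reply(audio: bytes, model: Callable[[bytes], bytes]) -> bytes:
    """Newer design: one model maps audio input directly to audio output."""
    return model(audio)


if __name__ == "__main__":
    print(cascaded_voice_reply(b"hello"))
```

In the chained version, each hand-off between stages adds delay, which is why folding the three steps into one model can shorten response time.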

Unfortunately, the audio part is not available online at the moment. The team is currently focused on fixing the guardrails, and it will be a few weeks before they release an Alpha version for testing. Alpha versions may have some bugs, while Beta versions are usually closer to the final product.

This delay was explained by OpenAI as follows:

We understand that there are new risks associated with GPT-4o's audio features. We are now unveiling text and image inputs and text outputs. In the following weeks and months, we will focus on enhancing the technical setup, usability after training, and safety measures needed for releasing the remaining features.

Although the audio input and output, which are key components of GPT-4o, have been completed, we are still working on ensuring the safety measures before making them available to the public.

Some Users Disappointed

It’s inevitable that an incomplete and oversold product would generate some negative sentiment on social media.

AI engineer Maziyar Panahi expressed his disappointment in a tweet. He said he had been testing the new GPT-4o (Omni) in ChatGPT and was not impressed at all, and that faster, cheaper, and multimodal were not features he needed. He concluded:

“Code interpreter, that’s all I care and it’s as lazy as it was before!”

He followed up with:

I know that for startups and businesses, things like affordability, speed, and audio features are appealing. However, personally, I only use the Chat function, and I find it to be quite similar. Especially when it comes to the Data Analytics assistant.

In addition, I don't think I'm getting any additional value for my $20. Not at the moment!

Did OpenAI Oversell GPT-4o?

There were some people on Facebook and X who shared similar feelings, while others were pleased with the perceived speed and cost benefits of using the API.

It seems like the release of GPT-4o was rushed to overshadow Google I/O. Releasing an unfinished product right before Google's big event may have made it seem like GPT-4o is just a small upgrade.

Currently, GPT-4o is not a significant advancement. However, once the audio part of the model finishes testing and moves to the next stage, we can expect some major changes in large language models. But by then, Google and Anthropic may have already made their mark in this field.

OpenAI’s own announcement is modest about the new model, saying it performs similarly to GPT-4 Turbo. The main gains are in languages other than English and for API users.

According to OpenAI, it performs as well as GPT-4 Turbo on English text and code, but does better in non-English languages. It is also faster and 50% cheaper in the API.

Here are the scores from six benchmarks comparing GPT-4o and GPT-4 Turbo (GPT-4T). GPT-4o outperforms GPT-4T by a few points on five of them but lags behind on DROP, a key benchmark for reading comprehension.

MMLU (Massive Multitask Language Understanding)

This benchmark measures multitask accuracy and problem-solving skills in various subjects such as math, science, history, and law. GPT-4o leads slightly with a score of 88.7, while GPT-4 Turbo follows closely behind at 86.9.

GPQA (Graduate-Level Google-Proof Q&A Benchmark)

This benchmark consists of 448 multiple-choice questions written by human experts in fields such as biology, chemistry, and physics. GPT-4o scored 53.6, outscoring GPT-4T (48.0) by 5.6 points.

Math

GPT-4o (76.6) outscores GPT-4T (72.6) by four points.

HumanEval

This is the coding benchmark. GPT-4o (90.2) outperforms GPT-4T (87.1) by about three points.

MGSM (Multilingual Grade School Math Benchmark)

This tests an LLM's grade-school-level math skills across ten different languages. GPT-4o scores 90.5 versus 88.5 for GPT-4T.

DROP (Discrete Reasoning Over Paragraphs)

This is a 96k-question benchmark that tests an LLM's comprehension of the content of paragraphs. Surprisingly, GPT-4o scored nearly three points lower than GPT-4T, which scored 86.0.
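The point gaps quoted above can be recomputed from the listed scores. The following is a minimal Python sketch that uses only the five benchmarks for which this article gives both numbers. DROP is left out because only the GPT-4T score is stated. The names scores and score_gaps are illustrative choices, not part of any benchmark tooling.

```python
# Scores as quoted in this article: (GPT-4o, GPT-4T).
scores = {
    "MMLU": (88.7, 86.9),
    "GPQA": (53.6, 48.0),
    "MATH": (76.6, 72.6),
    "HumanEval": (90.2, 87.1),
    "MGSM": (90.5, 88.5),
}


def score_gaps(table):
    """Return GPT-4o minus GPT-4T for each benchmark, rounded to one decimal."""
    return {name: round(gpt4o - gpt4t, 1) for name, (gpt4o, gpt4t) in table.items()}


gaps = score_gaps(scores)
for name, gap in gaps.items():
    # A positive gap means GPT-4o scored higher.
    print(f"{name}: {gap:+.1f}")
```

On these figures the gaps range from 1.8 points (MMLU) to 5.6 points (GPQA), with GPT-4o ahead on all five.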

Did OpenAI Play a Prank on Google with GPT-4o?

It's difficult to ignore that OpenAI's attention-grabbing model is named with the letter o. Some may think OpenAI was trying to overshadow Google's important I/O conference. Whatever the intention, OpenAI successfully diverted the focus away from Google's upcoming conference.

The question arises: is a language model that only slightly outperforms its predecessor deserving of the excessive hype and media attention it has received? The overwhelming coverage of OpenAI's upcoming announcement overshadowed Google's major event, indicating that for OpenAI, the answer is a resounding yes - it was worth all the hype.

Featured Image by Shutterstock/BeataGFX

Editor's P/S:

The hyped