Where GPT Falls Short: Study Reveals Where Humans Outperform AI

Advocates of AI superiority often highlight the achievements of GPT-4 and predict it will soon replace humans altogether. However, studies that examine the capabilities of large language models (LLMs) from a more grounded perspective receive far less attention. Let's fill that gap.

The Essence of the Study

There's a benchmark called GAIA (General AI Assistants): a set of 466 questions designed to test the abilities of AI assistants. It was created in response to the trend of comparing AI to humans in fields that require specialized skills and knowledge, such as law or chemistry. Instead, the authors built the benchmark around the following principles:

  • Real and complex questions. The questions typically involve finding and transforming information from various sources, such as provided documents or the open, ever-changing internet. This tests how well the system adapts through reasoning, uses different tools, and handles multimodal inputs. In contrast, other LLM benchmarks often feature narrowly specialized tasks or are limited to closed, synthetic environments, like a purely text-based setting that's convenient for AI but not for humans.
  • Easy interpretability. Conceptual simplicity makes the model's reasoning easy to follow, and unambiguous answers allow precise evaluation of test results (a minimal scoring sketch follows this list).
  • Resistance to memorization. To complete a task, the system must plan and successfully execute several steps, since the final answer isn’t present in the training data (see zero-shot learning). Due to the diversity and breadth of the action space, these tasks can’t be solved by brute force or memorizing answers. The ability to check reasoning, the required accuracy of answers, and the absence of those answers in open internet text prevent data contamination. In contrast, multiple-choice answers (like in the MMLU benchmark) make it harder to assess contamination, as incorrect reasoning can more easily lead to the correct choice.
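
Unambiguous, short answers mean scoring can be a (quasi) exact match against a single gold answer. Here is a minimal sketch of such a scorer; the normalization rules are illustrative assumptions of mine, not the authors' official evaluation code:

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, trim, and collapse whitespace so trivial formatting
    differences don't count as wrong answers."""
    return re.sub(r"\s+", " ", answer.strip().lower())

def quasi_exact_match(prediction: str, gold: str) -> bool:
    """Score a prediction against a single unambiguous gold answer.

    Numeric answers are compared as numbers (so "1,000" matches "1000");
    everything else falls back to normalized string equality.
    """
    pred, ref = normalize(prediction), normalize(gold)
    try:
        return float(pred.replace(",", "")) == float(ref.replace(",", ""))
    except ValueError:
        return pred == ref
```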

Question Difficulty Levels

  • Level 1 questions generally don’t require tools or need no more than one tool and up to five steps.
  • Level 2 questions usually involve more steps (about 5 to 10) and require combining various tools.
  • Level 3 questions demand arbitrarily long sequences of actions and the use of any number of tools (a rough classifier encoding these thresholds is sketched below).
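
The boundaries between levels are deliberately loose, but they can be encoded literally for illustration. In this sketch, the step and tool counts are assumed to come from human annotation, and the function itself is hypothetical, not part of GAIA:

```python
def gaia_level(num_steps: int, num_tools: int) -> int:
    """Map a question's annotated step and tool counts to a difficulty
    level, following the rough thresholds listed above."""
    if num_steps <= 5 and num_tools <= 1:
        return 1  # Level 1: no tools or a single tool, up to five steps
    if num_steps <= 10:
        return 2  # Level 2: roughly 5-10 steps, several tools combined
    return 3      # Level 3: arbitrarily long action sequences, any tools
```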

[Figure from the paper] Left: the number of tasks requiring each skill (at minimum) to solve. Right: each dot represents a GAIA question, with dot size proportional to the number of questions; for clarity, only the level with the most questions is shown. The higher the dot, the more tools are needed; the further right, the more steps required.

Test Results

To evaluate LLMs on GAIA, the authors used only API access with a prefix prompt (instructions on how the model should respond, in what format, and so on); a minimal sketch of this setup follows the list. The following were tested:

  • GPT-4 (with and without plugins)
  • AutoGPT (using GPT-4 as the backend)
  • Humans
  • Search engines (to check if the answer could be found on the first page of results)
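
For illustration, here is what such an API-only setup might look like with the OpenAI Python client. The prefix prompt below paraphrases the kind of instructions the paper describes (pinning down the output format so answers can be scored automatically); the exact wording is an assumption:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paraphrase of a GAIA-style prefix prompt: it fixes the output format
# so the final answer can be extracted and scored automatically.
PREFIX_PROMPT = (
    "You are a general AI assistant. Reason step by step if needed, "
    "then report only the final answer on the last line in the form: "
    "FINAL ANSWER: <answer>"
)

def ask(question: str, model: str = "gpt-4") -> str:
    """Send one GAIA question and extract the model's final answer."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PREFIX_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    text = response.choices[0].message.content
    marker = "FINAL ANSWER:"
    # Keep only what follows the marker; fall back to the raw reply.
    return text.split(marker, 1)[1].strip() if marker in text else text.strip()
```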

Even the most capable LLMs struggle with GAIA. Equipped with tools, GPT-4 achieves only a 30% success rate on the simplest tasks and 0% on the most difficult ones. Meanwhile, the average success rate for human respondents is 92%.

Some caveats: all human participants had higher education (bachelor's: 61%, master's: 26%, PhD: 17%). However, whether a degree significantly affects a person's ability to complete GAIA tasks remains an open question. Another limitation is that the benchmark is English-only, a constraint the authors themselves acknowledge.

GPT vs. Search Engines

The research team also sees potential in LLMs as a replacement for search engines, and here's why. For Level 1 questions, a web search can surface text that contains the correct answer, but for more complex queries this is slower and less effective than an LLM, because the user has to sift through the top results manually. Overall, a human collaborating with GPT-4 plus plugins seems to offer the best balance of results and time spent. However, this assessment doesn't account for the reliability and accuracy of the answers, which is a real issue for LLMs as search-engine replacements.

How to Challenge AI with Questions

The study provides a systematic method for creating questions that are difficult for AI, which you can use for your own experiments and tests (a small pre-flight checklist is sketched after the list):

  1. Make sure your question is based on a “source of truth” (Wikipedia, arXiv, GitHub, etc.). For levels 2 and 3, a good way to create questions is to combine multiple “truth” sources. Here, “truth” means “verifiable,” not necessarily “true.”
  2. Ensure the answer to your question does not exist on the internet as plain text.
  3. Make sure the answer is a number or at most a few words, to keep evaluation reliable.
  4. Ensure the answer won't change over time, so the question stays valid even if the underlying source changes or disappears.
  5. Make sure the answer is unambiguous.
  6. Ensure your question is “interesting”—that is, if you read it, you’d think an AI assistant that could answer it would be very helpful.
  7. Make sure a human can solve your question in a reasonable amount of time.
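
Most of these criteria call for human judgment rather than automation, but they can still be captured as an explicit checklist that a question author fills in by hand. A hypothetical sketch (the class and function names are mine, not from the paper):

```python
from dataclasses import dataclass, fields

@dataclass
class QuestionChecklist:
    """One flag per criterion above, filled in by the question author."""
    grounded_in_source_of_truth: bool        # 1. based on verifiable sources
    answer_absent_from_web_as_text: bool     # 2. not findable verbatim online
    answer_is_short: bool                    # 3. a number or a few words
    answer_is_stable_over_time: bool         # 4. won't drift as sources change
    answer_is_unambiguous: bool              # 5. exactly one correct answer
    question_is_interesting: bool            # 6. an assistant solving it helps
    human_solvable_in_reasonable_time: bool  # 7. tractable for a person

def failing_criteria(check: QuestionChecklist) -> list[str]:
    """Return the names of criteria that still fail; empty means ready."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]
```

A candidate question is only worth adding to your test set once failing_criteria returns an empty list.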

Editorial Perspective

We’re sharing this study not to prove that GPT-4 is dumb or useless. It’s simply a response to the tendency to exaggerate the capabilities of LLMs and to view the topic only through the lenses of AI evangelism or alarmism.

Approach sensational news with caution and avoid the unnecessary emotions they try to provoke. The truth is always somewhere in the middle. LLMs are excellent tools (or entities), but they can’t fully replace humans in any serious role. On the contrary, the smartest and most adaptable people will leverage this technology and become even more in-demand professionals.

At least, for now.
