Despite their usefulness, large language models (LLMs) can sometimes generate incorrect answers. OpenAI aims to address this issue and improve the reliability of its models, such as ChatGPT. To that end, the company has introduced SimpleQA, an open-source benchmark designed to assess the factual accuracy of LLM responses. The creation of this tool has drawn attention to the current limitations of AI, especially when handling certain types of queries.
SimpleQA was created to test how well OpenAI's models can answer short, clear, factual questions. The benchmark uses a set of 4,326 straightforward questions with verifiable answers, which makes evaluation easier. By focusing on specific, well-defined questions, OpenAI believes SimpleQA is an accurate way to measure the factuality of LLMs.
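To give a sense of how such a benchmark is used, here is a minimal sketch in Python of a SimpleQA-style evaluation loop, assuming the OpenAI Python SDK. The two sample questions, the grading prompt, and the model names are illustrative stand-ins, not the benchmark's actual data or grader; the real 4,326-question set and grading logic are in OpenAI's open-source release.

```python
# Minimal sketch of a SimpleQA-style evaluation loop (illustrative, not the
# official benchmark code). Requires the `openai` package and an API key in
# the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Hypothetical examples in SimpleQA's short, single-answer style; the real
# benchmark ships its own question set with verified answers.
QUESTIONS = [
    {"problem": "In which year was the Eiffel Tower completed?", "answer": "1889"},
    {"problem": "What is the chemical symbol for gold?", "answer": "Au"},
]


def ask(model: str, question: str) -> str:
    """Query the model under test with a single factual question."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


def grade(question: str, target: str, attempt: str) -> str:
    """Have a judge model label the attempt CORRECT, INCORRECT, or NOT_ATTEMPTED."""
    prompt = (
        f"Question: {question}\n"
        f"Gold answer: {target}\n"
        f"Model answer: {attempt}\n"
        "Reply with exactly one word: CORRECT, INCORRECT, or NOT_ATTEMPTED."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # judge model; an assumption for this sketch
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


correct = 0
for item in QUESTIONS:
    attempt = ask("gpt-4o-mini", item["problem"])  # model under test
    if grade(item["problem"], item["answer"], attempt) == "CORRECT":
        correct += 1

print(f"Accuracy: {correct}/{len(QUESTIONS)}")
```

Grading with a judge model rather than exact string matching is what makes short-answer benchmarks like this practical: it accepts equivalent phrasings ("1889" vs. "in 1889") while still flagging wrong or evasive answers.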
When designing SimpleQA, the researchers deliberately chose questions known to be difficult, ones on which LLMs had previously given wrong answers. Each question has a clear, factual answer that remains constant over time. The aim was to see how well AI models handle these specific, difficult questions, rather than to test their general ability to answer basic factual questions correctly.
The results show that GPT-4o (the model that currently powers ChatGPT) answers around 40% of the questions correctly, while the o1-preview model performs slightly better. Smaller models score even lower.
According to OpenAI's researchers, SimpleQA could encourage further research into making LLMs more accurate and reliable. This work is timely: OpenAI recently launched its own search engine inside ChatGPT, and other AI developers are expected to follow suit.