AI summaries on Google Search: Millions of false statements per hour?

AI summaries in search results are controversial: site operators not only fear that users will barely click through to the actual results anymore, they can measure it. Early on, incorrect or hallucinated answers were also common. With the switch to version 3 of the large language model (LLM) Gemini, reliability increased significantly. An investigation by the New York Times set out to determine how often errors still creep into AI summaries. The result: 91 percent of the answers were error-free.

That leaves nine percent in which Google’s AI summaries contain false, misleading, or fabricated information. By LLM standards that sounds impressive, but it becomes dramatic once you factor Google’s scale into the equation: since Google processes over five trillion search queries annually, an error rate of nine percent works out to around 50 million incorrect summaries per hour.
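The per-hour figure follows from simple arithmetic. A quick sketch, assuming every one of the five trillion annual queries triggers an AI summary (which overstates the real count, since not all searches show one):

```python
# Back-of-the-envelope check of the article's figure.
# Assumptions: 5 trillion queries/year, 9% error rate,
# every query shown an AI summary.
queries_per_year = 5e12
error_rate = 0.09
hours_per_year = 365 * 24  # 8760

errors_per_hour = queries_per_year * error_rate / hours_per_year
print(f"{errors_per_hour / 1e6:.0f} million incorrect summaries per hour")
# → roughly 51 million, consistent with the article's "around 50 million"
```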

Hulk Hogan is alive?
The study used a standardized question set called SimpleQA for quantification. In total, the AI summaries and their attached sources from over 4,000 Google searches were evaluated. The examples of incorrect information found are as blatant as they are entertaining: a “Classical Music Hall of Fame” that does not exist, rivers and dates mixed up, and professional wrestler Hulk Hogan declared not to have died. Asked when Bob Marley’s home was converted into a museum, Google’s AI summary gives the year 1987. In fact, it was 1986; the error can be traced to conflicting information on the Wikipedia page. The other two source links in the answer do not mention the museum’s founding date at all (or only imprecisely).

False sources
But even the correct answers were often hard to verify: in more than half of the cases, the sources linked to an answer led to websites that did not contain the information in question. This may have to do with where the sources come from: Facebook was the second most frequently linked domain, and Reddit came in fourth. Both are portals with vast amounts of data, but the reliability of their content varies greatly.

Google questions study
A Google spokesman rejects the criticism: the SimpleQA question set itself contains errors, and internally the company works with a smaller test derived from it called SimpleQA Verified, though without disclosing its own success rates. Moreover, these questions are not a realistic representation of what users actually ask. The evaluation itself was carried out by Oumi, a company that makes extensive use of AI for its assessments, which may introduce further errors into the evaluation. Another problem is the non-deterministic nature of LLMs: the same query can produce different answers with different sources. On top of that, Google internally adjusts which LLM is used depending on the question. The bottom line remains the same: a search engine’s AI summary should not be taken as fact, but as a starting point for research.
