
AI chatbots like Perplexity, Copilot and Co. are taking up more and more space in everyday life, as guides, search engines and digital assistants. But how reliable are their answers really? A British consumer organization put the best-known chatbots to the test in a comparison.
Since the introduction of ChatGPT in 2022, AI chatbots have rapidly established themselves in many people's everyday lives. In September 2025 alone, OpenAI's chatbot recorded 5.9 billion visits, up from 3.1 billion a year earlier.
Whether for research, purchasing advice, health questions or professional tasks, the use of AI chatbots like ChatGPT is growing rapidly. Low barriers to entry, answers in natural language, and quick, understandable responses to complex questions contribute above all to the success of these large language models.
They are also available around the clock and sound competent in their answers. But how reliable are ChatGPT, Gemini and Co. really? The British consumer organization Which? put them to the test, examining six AI tools.
AI: The best chatbots in comparison
The consumer organization Which? tested six of the most widely used chatbots: ChatGPT, Google Gemini (counted twice, as both the Gemini chatbot and the AI Overviews in standard Google search), Microsoft Copilot, Meta AI and Perplexity.
All chatbots had to answer 40 frequently asked questions covering money, law, health and nutrition, as well as consumer rights and travel. Some questions intentionally contained misinformation in order to test the chatbots. Experts from Which? then evaluated the answers in terms of accuracy, usefulness and ethical responsibility, among other criteria.
They found that the AI chatbots often make mistakes, misinterpret information and even give risky advice. Inaccuracies and misleading statements ran through many of the answers.
5th place: Meta AI
Meta AI takes last place in the ranking of the best chatbots with an overall score of only 55 percent. Although the chatbot corrected the deliberately false ISA allowance figure (see the question described under Microsoft Copilot below), it otherwise failed to convince the Which? experts. On the accuracy of its answers alone, Meta's chatbot achieves a score of just 54 percent; on usefulness, only 51 percent.
4th place: ChatGPT
ChatGPT was able to answer the question about investment tips but did not correct the incorrectly stated allowance. Overall, OpenAI's popular chatbot comes in only fourth place with an overall score of 64 percent.
ChatGPT also went badly wrong on a question about applying for a tax refund from the tax office: like Perplexity, it linked to paid tax refund providers that are known for charging high fees and unjustified additional costs.
3rd place: Microsoft Copilot
Microsoft's AI chatbot Copilot only just makes third place with 68 percent. The language model shines on relevance with 71 percent, but with only 62 percent it still has room to improve on ethical responsibility.
In the question about the ISA allowance (the Individual Savings Account, a tax-advantaged form of investing in Great Britain), the Which? experts made a deliberate mistake: they asked for investment tips for the £25,000 allowance. Copilot gave the requested investment tips but did not notice that the allowance is actually £20,000. For users, this could mean a blatant breach of the rules of HM Revenue and Customs, the British tax and customs authority.
2nd place: Google Gemini
Google secures the remaining podium places with its chatbot Gemini and the Gemini AI Overviews (Gemini AIO) in standard Google search. In a direct comparison of the two versions' answers, the differences in accuracy and quality of information were sometimes striking.
However, Gemini AIO displayed answers to only 28 of the 40 questions, because the feature is not always available. Its score was adjusted proportionally to keep it comparable.
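Which? does not spell out how this adjustment works, but a proportional rescaling presumably means the points from the 28 answered questions were scaled up to the 40-question baseline: for example, 49 points out of a possible 70 would count as 70 percent, just as 70 out of 100 would.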
Overall, Gemini AIO did slightly better with 70 percent than Gemini itself with 69 percent. The AI Overviews were particularly successful on questions about law, health and nutrition; Gemini itself, on the other hand, gave better answers on finances, consumer rights and travel.
1st place: Perplexity
The chatbots’ responses were evaluated in terms of accuracy, relevance, clarity/context, usefulness and ethical responsibility. Perplexity came out on top with an overall score of 71 percent.
The chatbot was particularly impressive in relevance and clarity/context with 73 percent. On ethical responsibility, however, at 66 percent there is still room for improvement.