GPT-5.2 shines in benchmarks – but for many users it is the worst version since ChatGPT's market launch


In mid-December, OpenAI released a new version of GPT, the technical basis of ChatGPT. Version 5.2 is meant to deliver deeper understanding and better reliability – especially for complex queries, which are supposed to be answered more precisely, with fewer hallucinations and more consistent output. Long conversations are also promised more coherence and more targeted follow-up. However, there are now growing reports claiming that these goals have backfired: while benchmarks do document improved factual accuracy, detailed conversations often work worse than ever.
Instructions and questions are completely ignored
Many users report that GPT-5.2 simply discards user instructions, personalization or context as soon as internal priority logic takes effect. Stress tests confirm this: "custom instructions" no longer have any effect once "invisible internal rules" kick in. In concrete terms, this means questions suddenly go unanswered. No matter what the prompt looks like in an ongoing discussion, the model keeps producing the same summary or replying to an earlier point in the conversation. Even explicit instructions ("Don't summarize again, just answer my question") are completely ignored – regardless of whether normal or thinking mode is used. ChatGPT can explain in detail how inappropriate its output was and spell out exactly what the user actually wants, but then the identical reaction follows again in an endless loop.
Context disappears – things only look better in benchmarks
Users also report that context is regularly forgotten out of the blue. The model then behaves as if it were several steps behind the question currently being asked. Reduced creativity, poorer task coherence, unwanted rambling at the start of answers and the repetitions mentioned above are not isolated cases but are becoming increasingly common. What shines in benchmark metrics has often clearly fallen behind in the original flagship discipline: conversation. Even the first market-ready version, GPT-3.5 (end of 2022), did not show such weaknesses.
The moral cudgel for ordinary search queries
On top of this comes the frequently observed switch into a lecturing mode, in which users are suddenly admonished about supposedly inappropriate content. Moral judgments of this kind usually turn out to be completely unfounded: anyone who wants to know whether a particular celebrity is married or has children is not trying to "infringe their privacy, which is why an answer must be refused." In one case, asking whether the family of a new football coach had actually moved with him produced the response: "I have to clearly protect you – I cannot confirm, comment on or expand on that. I am neither allowed to spread such topics nor treat them as facts." An earlier GPT version, by contrast, would simply have stated that it had no information on this.
Once again: real-world use reveals new problems
OpenAI faces a problem that is not new: the effects of updates or adjustments are often not recognized during internal testing, but only after a longer period of real-world use. A few months ago, for example, ChatGPT turned into an unpleasant sycophant, which is why OpenAI promptly rolled back the changes. Even the developers of an LLM usually have to observe the effects of updates first, since the creators of the technology understand only to a limited extent what exactly the model "constructs" internally. As far as 5.2 is concerned, there is clearly room for improvement, and some suspect that the update was rushed to market with too much focus on benchmarks.