When chatbots say “I don’t know”: A statistical trick improves reliability

Researchers at MIT in Cambridge, USA, have developed a method that makes the self-assessment of language models mathematically measurable, and therefore correctable. The process, called Reinforcement Learning with Calibrated Rewards (RLCR for short), directly targets a key cause of hallucinations in reasoning models.

The systematic weakness of guessing

Previous training approaches for so-called reasoning models, such as those developed by OpenAI, share a systematic weakness: the models are traditionally trained to find correct answers without assessing how confident they should be in their own judgment.

A team led by graduate students Mehul Damani and Isha Puri at MIT CSAIL finds that such simple reward systems encourage guessing. Models receive the same reward whether they reach an answer through logical deduction or simply get lucky.

The Brier score as a corrective

To correct this behavior, the team adds the so-called Brier score as an additional component of the reward function. This statistic penalizes the gap between the confidence the model states and the actual correctness of its answer.

During training, the model not only learns to solve a problem but must at the same time provide a numerical estimate of its own uncertainty. An answer given with high confidence that turns out to be wrong incurs a substantial penalty during training.
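In code terms, the core idea can be sketched roughly as follows. This is a minimal illustration of a Brier-score-augmented reward, assuming a binary correctness check and a self-reported confidence between 0 and 1; the function name is made up for this example, and the sketch is not the authors' implementation.

```python
def calibrated_reward(is_correct: bool, confidence: float) -> float:
    """Combine a binary correctness reward with a Brier-score penalty.

    is_correct: whether the model's final answer was judged correct.
    confidence: the model's self-reported probability (0.0 to 1.0)
                that its answer is correct.
    """
    correctness = 1.0 if is_correct else 0.0
    # Brier score: squared gap between stated confidence and outcome.
    brier = (confidence - correctness) ** 2
    # Correct answers earn reward, but miscalibration is subtracted,
    # so a confidently wrong answer scores worse than an honest "unsure".
    return correctness - brier

print(calibrated_reward(False, 0.95))  # -0.9025: confidently wrong
print(calibrated_reward(False, 0.10))  # -0.01:   wrong, but honest about it
print(calibrated_reward(True, 0.90))   #  0.99:   correct and confident
```

In expectation, this makes confident guessing a losing strategy: the occasional lucky hit no longer compensates for the heavy penalties on confident misses.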

Significant reduction in calibration error

The results, published on the preprint server arXiv, show that the method reduces calibration error by up to 90 percent. Particularly noteworthy: the models' overall accuracy on the tasks does not suffer from the new honesty.
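"Calibration error" here refers to the gap between a model's stated confidence and how often it is actually right. One standard way to measure it is the expected calibration error; the sketch below is illustrative only, assumes paired confidence and correctness values are already available, and is not necessarily the exact metric computed in the paper.

```python
import numpy as np

def expected_calibration_error(confidences, corrects, n_bins=10):
    """Average gap between stated confidence and observed accuracy,
    weighted by how many predictions fall into each confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    corrects = np.asarray(corrects, dtype=float)
    # Assign each prediction to one of n_bins equal-width bins.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - corrects[mask].mean())
        ece += mask.mean() * gap  # mask.mean() = fraction of samples in bin
    return ece

# A model that says "90% sure" but is right only half the time:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # 0.4
```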

The researchers were also able to show that conventional training methods actively worsen the models' self-assessment even as the models become more capable. "What is striking is that ordinary reinforcement learning not only does not improve calibration, but actively damages it," explains Isha Puri in an MIT press release.

Added value through reflection

By integrating uncertainty analysis directly into the model's reasoning process, the approach produces information that goes well beyond a decorative add-on. According to the study, smaller models benefit particularly strongly when they must explicitly reflect on their own ignorance.

A critical look at the method's practical suitability is still warranted, however: it slightly increases the computational cost of training. And even if the results are impressive, better calibration does not mean the model stops making errors.

A signal for human choice

Better calibration simply provides a more reliable signal for when users should seek a second opinion. Especially in sensitive areas such as medicine or finance, knowing what the model does not know could make the crucial difference.
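In practice, such a calibrated confidence value could drive a simple deferral rule, for instance routing low-confidence answers to a human reviewer. The threshold and names below are assumptions chosen for illustration, not part of the study.

```python
CONFIDENCE_THRESHOLD = 0.75  # assumed cut-off; would be tuned per domain

def answer_or_defer(answer: str, confidence: float) -> str:
    """Pass the answer through only when stated confidence is high;
    otherwise flag it for a human second opinion."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    return f"[Needs review, confidence {confidence:.2f}] {answer}"

print(answer_or_defer("The dosage is 5 mg twice daily.", 0.52))
```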

Previous attempts to increase the trustworthiness of AI through downstream filters have often proved inadequate compared with corrections built into the model itself. MIT's method instead intervenes at the foundation of the learning process, securing the integrity of the output from the start.

Whether the process will be adopted broadly also depends on development teams' willingness to integrate the additional complexity into their pipelines. In any case, this work lays the scientific groundwork for a more honest artificial intelligence.
