
Why AI often loses track

Many users find it frustrating when AI tools lose track of a conversation. This is no coincidence: it is the LLM memory problem, and behind it lies an architectural limit.

If you’ve been working with a large language model (LLM) like ChatGPT or Claude for a while, you’re probably familiar with this phenomenon: you’re in the middle of a complex task and suddenly the AI seems to have forgotten central parts of the previous discussion. Experts call this “the memory problem”, a fundamental architectural limitation that affects all current LLMs.

This forgetting is not intentional; it stems from a technical limit, because LLMs have no memory in the traditional sense. When you send a new message, the model does not recall the previous messages from a stored database.

Instead, it rereads the entire conversation from the beginning to generate the next response. You can think of it like writing a book where, every time a new sentence needs to be added, you have to reread the entire text from page one.
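The statelessness described above can be sketched in a few lines: a chat is just a growing list of messages that is re-sent in full on every turn. The `generate_reply` function here is a hypothetical stand-in for a real LLM API call, not an actual library function.

```python
# Sketch: the model stores nothing between calls; the *entire* history
# is passed in again on every single turn.

def generate_reply(history):
    # Placeholder for a real LLM API call that would receive the
    # full message history each time it is invoked.
    return f"(reply based on {len(history)} prior messages)"

history = []
for user_msg in ["Hello", "Summarize our plan", "What did I say first?"]:
    history.append({"role": "user", "content": user_msg})
    reply = generate_reply(history)  # full history re-read each turn
    history.append({"role": "assistant", "content": reply})
```

Note that the history only ever grows: every new turn makes the next call more expensive, which is exactly what the context window then caps.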

LLM memory problem: The context window as a bottleneck

This constant “re-reading” takes place within the so-called context window. You can think of this window as a fixed-size notepad: the entire conversation has to fit there. Capacity is measured in tokens, the basic units of text that an LLM processes.

A token is roughly equivalent to about three quarters of a word. When the notepad fills up, the system must delete older content so the conversation can continue. Anything that falls out of this window is no longer directly accessible to the AI.

The real problem is not data transmission. A 30,000-word conversation corresponds to only around 200 to 300 kilobytes of data. The real bottleneck is computing power, due to the so-called attention mechanism of LLMs, which requires the model to calculate the relationship of each word to every other word in the conversation.

This leads to a quadratic growth problem. If the input doubles, the amount of computation required quadruples. That’s why longer chats take progressively longer and require immense GPU memory to store all those relationships.
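The quadratic growth can be made concrete with a back-of-the-envelope count of pairwise relationships, a simplification of what attention actually computes:

```python
# Attention compares every token with every other token,
# so the number of pairwise relationships grows quadratically.

def attention_pairs(n_tokens):
    # Each of n tokens attends to all n tokens (including itself).
    return n_tokens * n_tokens

small = attention_pairs(1_000)  # 1,000-token prompt: 1,000,000 pairs
large = attention_pairs(2_000)  # doubling the input ...
ratio = large // small          # ... quadruples the work
```

Doubling from 1,000 to 2,000 tokens yields four times as many pairs, which is why long chats slow down faster than their length alone would suggest.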

RAG as a possible solution

A promising way to circumvent this problem is Retrieval-Augmented Generation (RAG). Instead of cramming everything into the LLM’s context window, a RAG system acts like a smart library system: it searches vast external databases and knowledge sources for the information specifically relevant to the question at hand.

Only these relevant snippets are then inserted into the LLM context window along with the question. This can make a context window that is actually limited feel almost limitless because the external databases can store millions of documents.
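The retrieve-then-prompt flow can be sketched with a toy retriever. This version ranks snippets by simple word overlap purely for illustration; production RAG systems use embedding similarity over a vector database instead.

```python
# Toy RAG sketch: rank documents by word overlap with the question,
# then build a prompt containing only the top snippets.

def score(question, snippet):
    # Crude relevance measure: count of shared lowercase words.
    return len(set(question.lower().split()) & set(snippet.lower().split()))

def retrieve(question, documents, top_k=2):
    ranked = sorted(documents, key=lambda d: score(question, d), reverse=True)
    return ranked[:top_k]

documents = [
    "The context window limits how much text an LLM can process at once.",
    "Tokens are the basic text units an LLM works with.",
    "Retrieval-Augmented Generation fetches relevant snippets from a database.",
]

question = "What limits the context window of an LLM?"
relevant = retrieve(question, documents)
prompt = "\n".join(relevant) + "\n\nQuestion: " + question
```

Only the retrieved snippets occupy context-window space; the rest of the knowledge base stays outside the model, which is what makes the window feel almost limitless.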

RAG is particularly useful for tasks such as searching technical documentation or answering questions against large knowledge bases. In classic chats, however, the memory problem will stay with us for some time.
