Understanding the Limitations of Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) enhances language models with external knowledge, but what happens when it goes wrong?

7/28/2024

As part of our work surrounding our vector database SemaDB, we encounter retrieval augmented generation (RAG) as a common use case. In the simplest terms, with RAG we let the large language model search for relevant, up-to-date content and use it as part of the prompt. This retrieved context could be anything: internet search results, internal documents, images and so on.

While the main idea behind RAG is to imbue language models with extra relevant knowledge, the whole process can be brittle: it depends on the quality of the retrieved content and on whether the language model even cares to use it.

In this post, we talk about some of the experiences we had along the way where RAG unravels into an entertaining mess.

Finding the Right Needle in a Haystack

Can we search through millions of documents? Yes. Can we search through multi-modal items like images and unstructured text? Yes. Do we get the relevant information for every query? Maybe.

The success of RAG often boils down to whether we were able to retrieve the relevant information to answer the question. While vector-based search is seeing increasing adoption, it struggles with exact values such as numbers or dates. For example, a query like “How many summer items did we sell in March?” will likely require both regular structured search, to filter sales to March, and vector search, to find items that could be “summer” related.

We predict that vector and traditional search will continue to complement each other rather than one taking over. Most databases and search engines, including our SemaDB, support this hybrid search strategy. If you have existing search pipelines, consider vector search as an additional pipeline that feeds the same model rather than a replacement for what you already have.
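To make the hybrid idea concrete, here is a toy sketch in plain Python: a structured filter narrows the candidates and vector similarity then ranks them. The sale records and embeddings are entirely made up for illustration; in a real system the embeddings would come from an embedding model and the search would run inside the database.

```python
# Toy hybrid search: a structured filter narrows the candidates (sales in
# March), then vector similarity ranks them by how "summer"-like they are.
# The 3-dimensional embeddings are invented purely for demonstration.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

sales = [
    {"item": "sunscreen",   "month": "March", "embedding": [0.9, 0.1, 0.0]},
    {"item": "wool scarf",  "month": "March", "embedding": [0.1, 0.8, 0.1]},
    {"item": "beach towel", "month": "April", "embedding": [0.8, 0.2, 0.0]},
]

summer_query = [1.0, 0.0, 0.0]  # pretend embedding of "summer items"

# Structured filter first (exact match on the month), vector ranking second.
march_sales = [s for s in sales if s["month"] == "March"]
ranked = sorted(march_sales,
                key=lambda s: cosine(s["embedding"], summer_query),
                reverse=True)

for s in ranked:
    print(s["item"], round(cosine(s["embedding"], summer_query), 2))
```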

Hallucinations: When RAG Makes Things Up

Sometimes the thing we ask is not in the corpus and there is no information that would allow a language model to answer. For example, suppose we ask for the name of the chief research officer of a company that doesn’t have one. In this case, the retrieved items from an internal company corpus may contain related information, such as the name of the CEO, but not the answer itself.

At this point we hope that the language model follows the prompt, which is usually something like “Using the information below, {retrieved documents}, answer the query: {query}. If you can’t, say there isn’t enough information.” But sometimes the urge to be a very helpful assistant clashes with not knowing the answer. So it starts to hallucinate a chief research officer for us!
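For illustration, here is a minimal sketch of what such a prompt template might look like in code. The wording and variable names are our own, not a recommended standard; whether the model actually honours the fallback instruction is exactly the failure mode discussed above.

```python
# A bare-bones RAG prompt template with an explicit fallback instruction.
# The retrieved documents are simply concatenated into the prompt.
def build_prompt(retrieved_docs: list[str], query: str) -> str:
    context = "\n\n".join(retrieved_docs)
    return (
        "Using the information below, answer the query.\n"
        "If you can't, say there isn't enough information.\n\n"
        f"Information:\n{context}\n\n"
        f"Query: {query}"
    )

print(build_prompt(
    ["Jane Doe is the CEO of Acme Ltd."],
    "Who is the chief research officer of Acme Ltd?",
))
```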

While RAG alleviates the phenomenon of hallucination and the training regimes of language models try to cater for it, these models remain probabilistic in how they generate the next token in a sequence. It would be naive to think the probability of hallucination will be zero for every possible input.

We recommend trying unorthodox questions to test the system. Prefer questions that are difficult to answer but easy to verify. These needn’t be complex, multi-level questions; sometimes it can be as simple as asking whether Alexa (who might not exist) can approve expense claims and checking what was retrieved as well as the generated response.
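As a sketch of what such a probe could look like, the snippet below assumes a retrieve and a generate function wrapping your own pipeline; both are dummy placeholders here, and the final check is only a crude heuristic.

```python
# Probe the pipeline with a question that should not be answerable and
# inspect both what was retrieved and what was generated. retrieve() and
# generate() are dummy stand-ins for your own pipeline components.
def retrieve(query: str) -> list[str]:
    # Placeholder: a real system would query the vector database here.
    return ["Expense claims over £500 are approved by the finance team."]

def generate(prompt: str) -> str:
    # Placeholder: a real system would call the language model here.
    return "There isn't enough information to answer that."

question = "Can Alexa approve expense claims?"  # Alexa might not exist at all
docs = retrieve(question)
answer = generate(
    f"Using the information below, {docs}, answer the query: {question}. "
    "If you can't, say there isn't enough information."
)

print("Retrieved:", docs)
print("Answer:   ", answer)

# A crude but useful check: a confident answer about Alexa when no retrieved
# document mentions Alexa is a red flag for hallucination.
if "alexa" in answer.lower() and not any("alexa" in d.lower() for d in docs):
    print("Warning: the answer mentions Alexa but no retrieved document does.")
```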

Bias and Fairness: RAG's Inherited Flaws

Same context, same question, but different models may lead to very different answers. Despite having the relevant information, a language model might refuse to respond because it deems itself a language model that cannot answer that query. A slight change in the prompt may then unlock a whole essay, or a verdict that your company is not diverse enough.

Retrieval augmented generation is a process consisting of multiple stages, and the language model sitting at the final generation step is crucial to bringing everything together. If you are seeking factual responses, or opinions on doing new things, you have to be aware of the lens the model is trained to look through.

A language model may ignore crucial items from its context. Why? The technical answer comes down to maximum likelihood training on its data, as well as the human feedback used to tune the model. The non-technical answer is the inherent biases of the models. When summarising a piece of text, or the retrieved context in this case, the models have the freedom to statistically determine the most likely thing we would like to see.

As models evolve and new ones get released, don’t jump in just because one has better benchmark results. A slightly higher or lower score can still hide huge biases and unseen baggage. Ideally, create an internal benchmark from your own data and past examples to see if the model plays along nicely. For example, you can ask how old the CEO is without giving the age and check whether the model hallucinates a middle-aged person or answers that it doesn’t know.
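A minimal internal benchmark can be as simple as a list of questions paired with the behaviour you expect, run against every candidate model. Everything in the sketch below, including the ask_model placeholder and the expected phrases, is hypothetical; the value lies in accumulating your own past examples, not in this particular structure.

```python
# A tiny internal benchmark: questions with the behaviour we expect. For
# questions whose answer is not in the corpus, a good pipeline should say
# it doesn't know rather than hallucinate. ask_model() is a placeholder
# for a call into your own RAG pipeline with the named model.
def ask_model(model_name: str, question: str) -> str:
    # Placeholder: call your RAG pipeline with the given model here.
    return "I don't know based on the available documents."

benchmark = [
    # (question, phrase we expect somewhere in a good answer)
    ("How old is the CEO?", "don't know"),        # the age is not in the corpus
    ("Who approves expense claims?", "finance"),  # this one is answerable
]

for model in ["current-model", "shiny-new-model"]:
    passed = sum(
        1 for question, expected in benchmark
        if expected in ask_model(model, question).lower()
    )
    print(f"{model}: {passed}/{len(benchmark)} checks passed")
```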