
Demystifying the generation of datasets to evaluate and fine-tune embedders for RAGs

Discover how Lettria tackles the challenge of generating datasets to evaluate embedder model performance in RAG (Retrieval Augmented Generation) systems, where the retriever plays a crucial role in querying databases or finding relevant nodes across knowledge graphs.


Introduction

In a RAG, retrieval is one of the most crucial steps. Indeed, a retriever consists of an embedder model that queries a vector database in order to fetch the embedded chunks closest to a user query.

The more efficient the embedder, the more relevant the retrieved chunks used to feed the context become. In the context of GraphRAGs, a sophisticated embedder model can be employed to search for relevant nodes in the knowledge graph based on the user query, further enhancing its appeal for RAGs in general.

However, evaluating the performance of an embedder model remains a very tricky task to this day. This article unveils the approach carried out at Lettria.

Key Considerations

A retriever dataset must contain two key elements:

  • questions, or more generally user queries
  • the relevant passage(s) used to answer the question (one or several)

📔 Irrelevant passages can also be provided in order to fine-tune the retriever models.

I have divided my work into two phases:

  • creation of a dataset including extractive questions. Such questions focus on specific facts that can be easily extracted from a single passage of text. Example: When was GitHub Copilot released?
  • creation of a dataset containing high-level questions. The latter require more reasoning and cognitive skills, sometimes involving cross-document analysis. Example: How might advancements in artificial intelligence revolutionize the way we approach healthcare in the future?

This approach is solely based on my intuition that embedders may operate differently in these two scenarios.

In order to assess the performance of the embedder, and thus the retriever, several metrics can be used:

  • Precision: number of relevant retrieved documents / number of retrieved documents
  • Recall: number of relevant retrieved documents / number of relevant documents
  • MRR (Mean Reciprocal Rank): mean over all queries of the reciprocal rank, i.e. the inverse of the rank of the first relevant chunk. This metric reflects the average position of the first relevant chunk across all queries. The formula can be written as:
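
$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where $Q$ is the set of queries and $\mathrm{rank}_i$ is the position of the first relevant chunk retrieved for the $i$-th query (this is the standard definition of the metric).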

The MRR, which ranges between 0 and 1, is a simple and intuitive way to get an idea of how well the retrieval system performs, but this metric only considers the first relevant chunk, which is not well suited to the complex questions usually found in client data.

In our experiments, we observed that reworded questions, or questions with spelling mistakes in common or proper nouns, make the task much more difficult, however good the retriever is. It therefore seems worthwhile to expand the dataset by generating rephrased questions from a single question.

Finally, providing irrelevant passages makes it possible to fine-tune retriever models: the model learns the features that matter for retrieving correct information and thus how to distinguish relevant from irrelevant passages.


📔 Retriever models were fine-tuned following this GitHub tutorial.

Let's first explore how we can construct a retriever dataset that includes extractive questions.

Afterwards, we will tackle the approach followed to build a retriever dataset containing high-level questions.

Creation of a dataset containing extractive questions

Issues encountered

For this first dataset, questions must be extractive, i.e. require the respondent to extract information directly from a single short section of text.

Unfortunately, as said earlier, given a question, there may be more than one relevant passage and, worse, these passages may be found across several documents.

To address this issue and narrow down the number of possible relevant passages for a given query, the idea is to ask for very specific information extracted from a given document. In addition, generating the question and the relevant passage simultaneously with an LLM was considered effective.

Regarding the generation of irrelevant passages for a given question, they are computed at the end, as they simply correspond to the relevant passages associated with other questions (see the sketch below).
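For illustration, here is a minimal sketch of this negative-mining step (field names and the number of negatives are hypothetical): each question's irrelevant passages are sampled from the relevant passages of the other questions.

```python
import random

def add_irrelevant_passages(samples: list[dict], n_negatives: int = 3, seed: int = 42) -> list[dict]:
    """For each sample, draw negatives from the relevant passages of the other questions."""
    rng = random.Random(seed)
    for i, sample in enumerate(samples):
        # Pool of candidate negatives: relevant passages belonging to other questions
        pool = [p for j, other in enumerate(samples) if j != i
                for p in other["relevant_passages"]]
        sample["irrelevant_passages"] = rng.sample(pool, k=min(n_negatives, len(pool)))
    return samples
```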

In this simple approach, despite using a capable LLM such as GPT-4o to generate questions and passages, some issues arose:

  • questions are sometimes simply paraphrases of the text in interrogative form, and the passages may then be too easy to retrieve. For instance, corporations’ legal names should rather be replaced by their more common names: a user will most likely use “Apple” instead of “Apple Inc”. More generally, some technical terminology specific to the domain of the corpus may need to be adapted.
  • Defining a persona, i.e. a role the LLM has to play when it generates the question, helps obtain more natural questions (a sketch is given after this list). The persona should be stated in the system prompt, as explained in the previous blog post about prompt engineering techniques, How To Write A “Good” Prompt?
  • questions may contain undefined references and require additional context to be understood. For instance, “Who murdered the President in November 1963?” and “Who murdered John F. Kennedy, according to the report?” are invalid, as they do not provide all the details needed to answer unambiguously, even though one can probably guess which president the first question refers to. Besides, defining a clear context in the question helps reduce the number of relevant passages.
  • The passages generated by LLMs are often slightly altered from the original text, making it difficult to verify the accuracy of the retrieved information efficiently. Even advanced LLMs may require a post-processor to locate the original passage within the document.
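As an illustration, a persona can be injected through the system prompt; the wording below is a hypothetical sketch, not the exact prompt used in our pipeline:

```python
# Hypothetical persona used as the system prompt for question generation
PERSONA_SYSTEM_PROMPT = (
    "You are a financial analyst who reads press articles to prepare client briefings. "
    "You ask short, natural questions, use the names people actually use "
    "('Apple' rather than 'Apple Inc'), and never ask a question that cannot be "
    "answered unambiguously from the provided document alone."
)
```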

As described in the blog post, the temperature is a key parameter for such implementations, where LLMs are used as validators or generators.

On the one hand, too low a temperature is not ideal for generating creative questions and building a sufficiently varied dataset. On the other hand, too high a temperature compromises the efficiency of the entire building process by introducing cast and format errors that can simply jeopardize code execution, which is very troublesome when the code is expected to run for a few hours.

One way to limit format errors during LLM generation is to set the argument response_format={"type": "json_object"} when using OpenAI generative models, forcing the model to output a JSON object. Additional validation checks can then be used, for instance to make sure the object contains all the expected keys.
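A minimal sketch of such a call with the official openai Python client (the model choice, prompt wording, and expected keys are illustrative assumptions):

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

EXPECTED_KEYS = {"question", "relevant_passage"}  # hypothetical schema

def generate_sample(document: str, temperature: float = 0.7) -> dict | None:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,
        response_format={"type": "json_object"},  # forces the model to emit valid JSON
        messages=[
            {"role": "system", "content": "You generate extractive questions as JSON."},
            {"role": "user", "content": f"Document:\n{document}\n"
                                        "Return a JSON object with the keys "
                                        "'question' and 'relevant_passage'."},
        ],
    )
    try:
        payload = json.loads(response.choices[0].message.content)
    except json.JSONDecodeError:
        return None  # discard malformed generations instead of crashing the pipeline
    # Validator check: make sure all expected keys are present
    return payload if EXPECTED_KEYS <= payload.keys() else None
```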

Post-Processor steps

Instead of striving to find a magic prompt that will produce perfect questions (questions similar to client data that can be answered unambiguously), a profitable strategy is to use LLM judges to validate the generated outputs. A sophisticated generative AI model has enough critical ability to validate or invalidate a question, for instance. In practice, instead of expecting a boolean output, the LLM is asked to give a score on a 3-point scale: questions are filtered out if the score is less than or equal to 1.
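A possible sketch of such a judge, reusing the client and json imports from the previous snippet (the exact scale wording, prompt, and the samples variable are illustrative assumptions):

```python
def judge_question(question: str, passage: str) -> int:
    """Ask an LLM judge to score a generated question on a 3-point scale."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # deterministic judging
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": "You are a strict evaluator. Answer in JSON."},
            {"role": "user", "content": f"Question: {question}\nPassage: {passage}\n"
                                        "Give an integer score from 1 (ambiguous or "
                                        "unanswerable) to 3 (self-contained and fully "
                                        'answered by the passage) as {"score": <int>}.'},
        ],
    )
    return int(json.loads(response.choices[0].message.content)["score"])

# Keep only the questions whose score is strictly greater than 1
validated = [s for s in samples if judge_question(s["question"], s["relevant_passage"]) > 1]
```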

Another post-processing step involves locating the original text in the document from the passage returned by the LLM, and replacing the generated passage so that it matches the original text exactly.

The implemented method makes use of the fuzzywuzzy package, which relies on the Levenshtein distance, and computes the similarity between the generated passage and passages within the document that start and end with the same words.
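A simplified sketch of this matching step with fuzzywuzzy (the sentence-window search below is a naive version of the idea, not the exact implementation):

```python
from fuzzywuzzy import fuzz

def find_original_passage(generated: str, document_sentences: list[str],
                          max_window: int = 5) -> str:
    """Return the span of consecutive sentences closest to the LLM-generated passage."""
    best_span, best_score = "", -1
    for start in range(len(document_sentences)):
        for end in range(start + 1, min(start + max_window, len(document_sentences)) + 1):
            candidate = " ".join(document_sentences[start:end])
            score = fuzz.ratio(generated, candidate)  # Levenshtein-based similarity (0-100)
            if score > best_score:
                best_span, best_score = candidate, score
    return best_span
```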

As mentioned previously, the dataset was also expanded by rephrasing the generated questions to better replicate real-world data. Four ways of rephrasing questions were proposed:

  • reworded questions: common words replaced with synonyms when possible and the sentence structure updated, with use of the passive form for instance.
  • slight misspellings that mainly affect common nouns and seldom proper nouns
  • important misspellings that may affect common nouns as well as proper nouns, but not numbers such as prices, dates, … as they can totally change the meaning of the question
  • query-like questions: questions written as a query a user would type into Google, for instance: “American president killed in 1963?”, “date death Kennedy”, … It is probably the most relevant rephrasing as it gives a good simulation of the queries encountered in real-world data.

Obviously, the meaning of the questions should remain the same and LLM validators can be used again to confirm the meaning is preserved.
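As an illustration of the last form, the query-like rephrasing can be requested with an instruction along these lines (a hypothetical sketch); the LLM validator then checks that the meaning of the original question is preserved:

```python
# Hypothetical instruction used to produce query-like rephrasings
QUERY_REPHRASING_PROMPT = (
    "Rewrite the following question as a short search-engine query: drop function "
    "words, keep the key entities, dates and numbers, and do not change its meaning.\n"
    "Question: {question}\n"
    "Query:"
)
```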

Pipeline for the generation of dataset of extractive questions

Evaluation on extractive questions

The retriever dataset was built based on the initial dataset from ashraq/financial-news-articles.

Our created dataset contains extractive questions and is composed of a total of 1000 samples with the five forms of questions uniformly distributed:

Question form                  Number of samples
CLEAN_AS_IN_TEXT               200
REPHRASED_WORDING              200
REPHRASED_QUERY                200
REPHRASED_SLIGHT_MISSPELLING   200
REPHRASED_BIG_MISSPELLING      200
Total                          1000

Two open-source embedders were assessed:

  • thenlper/gte-base
  • mixedbread-ai/mxbai-embed-large-v1

As explained, it is quite complicated to collect all the relevant passages for a given question. For this reason, only one relevant passage is considered as groundtruth passage and the metrics monitored are:

  • MRR (adapted when there is only one relevant passage)
  • Recall (“simple recall” based on one relevant passage)
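With a single groundtruth passage per question, both metrics reduce to simple computations over the ranked list of retrieved chunks. A minimal sketch (assuming each sample carries its groundtruth passage and the ranked retrieval results; field names are hypothetical):

```python
def evaluate(samples: list[dict], k: int = 6) -> dict:
    """Compute recall@k and MRR when each question has exactly one groundtruth passage."""
    hits, reciprocal_ranks = 0, []
    for sample in samples:
        retrieved = sample["retrieved_chunks"][:k]      # ranked chunks from the retriever
        groundtruth = sample["relevant_passage"]
        rank = next((i + 1 for i, chunk in enumerate(retrieved)
                     if groundtruth in chunk), None)    # a chunk counts if it contains the passage
        if rank is not None:
            hits += 1
            reciprocal_ranks.append(1 / rank)
        else:
            reciprocal_ranks.append(0.0)
    return {"recall@k": hits / len(samples), "mrr": sum(reciprocal_ranks) / len(samples)}
```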

In the following two evaluations, we tested the impact of the reranker model mixedbread-ai/mxbai-rerank-large-v1: the top 18 retrieved chunks are reranked and only the top 6 are then kept. When no reranker model is used, only the top 6 retrieved chunks are considered.
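For reference, a minimal sketch of this rerank-then-truncate step using the sentence-transformers CrossEncoder interface, which supports this reranker (assuming a recent version of the library and that the embedder has already returned a ranked list of chunks):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("mixedbread-ai/mxbai-rerank-large-v1")

def rerank_top_chunks(query: str, retrieved_chunks: list[str], top_k: int = 6) -> list[str]:
    """Rerank the top-18 chunks returned by the embedder and keep the top-6."""
    candidates = retrieved_chunks[:18]
    ranked = reranker.rank(query, candidates, return_documents=True, top_k=top_k)
    return [hit["text"] for hit in ranked]
```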

For all forms of questions, the embedding model mixedbread-ai/mxbai-embed-large-v1 performs better than thenlper/gte-large.

Furthermore, the reranker mixedbread-ai/mxbai-rerank-large-v1 greatly improved the quality of the retrieved documents (even though the execution takes about 40 times longer).

Finally, we observe that important misspellings may impact the performance of a retriever model, even when the reranker model is applied.


Creation of a dataset containing high-level questions

Motivations

Extractive questions were a satisfying first approach to easily assess the overall performance of retrievers, but they do not fully capture the complexity of the questions that one may find in client data.

Indeed, in a recent customer project, only 20% of the questions were considered purely extractive, whereas almost 60% required advanced reasoning processes. These high-level questions require complex cognitive skills as they involve critical thinking, problem-solving, analysis, evaluation, and synthesis. They are not typically answered by "yes" or "no," and rather often start with "why," "how," or "what do you think about...". They are used to stimulate thought, encourage conversations, and promote a deeper understanding of a topic. Contrary to extractive questions for which the answer is always contained in a single passage, high-level questions may generally require several ideas expressed in various chunks or documents, which is a perfect challenge to further assess the retrievers.

Approach followed

Since extractive questions were built from a single document, the chances of omitting relevant passages from the rest of the corpus were high. To remedy this crucial problem, and also to promote the generation of questions linking several documents, high-level questions are not generated from a single document but from a document cluster.


Due to the significant time required for embedding all documents, it is recommended that this process be carried out during development rather than at runtime.
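The article does not detail how the clusters are formed; one straightforward option, sketched below under that assumption, is to embed every document once offline and group the embeddings with k-means (the embedding model and cluster count are illustrative):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

embedder = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

def cluster_documents(documents: list[str], n_clusters: int = 20) -> list[int]:
    """Embed every document once (offline) and assign each one to a cluster."""
    embeddings = embedder.encode(documents, show_progress_bar=True)
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto").fit_predict(embeddings)
    return labels.tolist()
```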

Then, documents within the selected cluster are semantically split into chunks: cosine distances between consecutive sentence embeddings are calculated and a percentile threshold is used to determine the breakpoints, effectively splitting the text into semantic chunks that will represent candidates for the relevant passages.

As explained in this tutorial, the smaller the parameter breakpoint_percentile_threshold, the smaller the size of the passages. Since a chunk retrieved by the retrieval model is considered correct if a groundtruth passage is contained within it, the groundtruth passages must be kept as small as possible while still forming a coherent unit of meaning. The breakpoint_percentile_threshold parameter makes it possible to calibrate the passages to the right size.
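For reference, this is how a semantic splitter exposing breakpoint_percentile_threshold can be configured with LlamaIndex's SemanticSplitterNodeParser (a sketch; the exact setup in the tutorial may differ, and the threshold value here is illustrative):

```python
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="mixedbread-ai/mxbai-embed-large-v1")

# Lower thresholds create more breakpoints, hence smaller candidate passages
splitter = SemanticSplitterNodeParser(
    buffer_size=1,
    breakpoint_percentile_threshold=80,  # illustrative value, to be calibrated
    embed_model=embed_model,
)

# cluster_texts: raw texts of the documents in the selected cluster (hypothetical variable)
nodes = splitter.get_nodes_from_documents([Document(text=text) for text in cluster_texts])
```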

As with the generation of extractive questions, the high-level questions and their relevant passages are generated simultaneously by the LLM, given how similar the performance is and how much simpler a simultaneous generation is. A summary of all the documents in the cluster is fed to the LLM. This summary, obtained by summarizing the individual document summaries, captures the main ideas of the cluster and facilitates the generation of high-level questions by the LLM. In addition, the LLM context must also contain the passages obtained from the split with breakpoint_percentile_threshold: those considered relevant to the generated question are returned by the generative model. As before, however, a step consisting in finding the original passage must be implemented.
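A possible sketch of this summary-of-summaries step, reusing the OpenAI client from the earlier snippet (the prompt wording and helper names are illustrative):

```python
def summarize(text: str, max_words: int = 150) -> str:
    """Single LLM call that condenses a text into a short summary."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,
        messages=[{"role": "user",
                   "content": f"Summarize the following text in at most {max_words} words:\n{text}"}],
    )
    return response.choices[0].message.content

def cluster_summary(cluster_docs: list[str]) -> str:
    """Summary of summaries: summarize each document, then summarize the concatenation."""
    doc_summaries = [summarize(doc) for doc in cluster_docs]
    return summarize("\n\n".join(doc_summaries))
```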

Finally, this generation pipeline also contains crucial post-processing steps to validate the output and expand the dataset, such as the question validator and the question rephraser.

Pipeline for the generation of dataset of high-level questions

Evaluation on high-level questions

Given the prompt used to generate the dataset, there may be up to three groundtruth passages and, for this reason, the MRR is no longer relevant.

In this evaluation, let us stick to the recall and consider a sample as correct if one of the retrieved passages is among the groundtruth ones.

Our created dataset, based on the same original dataset (ashraq/financial-news-articles), contains high-level questions and is composed of a total of 1000 samples with the five forms of questions equally distributed:

Question form                  Number of samples
CLEAN_AS_IN_TEXT               200
REPHRASED_WORDING              200
REPHRASED_QUERY                200
REPHRASED_SLIGHT_MISSPELLING   200
REPHRASED_BIG_MISSPELLING      200
Total                          1000

Like in the previous experiment, the reranker model used is mixedbread-ai/mxbai-rerank-large-v1 and it will rerank the top 18 chunks to eventually consider the top 6 chunks.

The embedder model mixedbread-ai/mxbai-embed-large-v1 remains better than thenlper/gte-large, but the performance gap between runs with and without the reranker model is more noticeable with high-level questions.

Keep in mind that, here, a sample is considered correctly retrieved if at least one groundtruth passage is retrieved, not only if all the groundtruth passages are retrieved.

Conclusion

These two generation processes combine steps involving generative AI models but also purely algorithmic tasks relying on machine learning algorithms or natural language processing (NLP) techniques. The conclusions drawn from the two datasets have shed light on the relative effectiveness of powerful open source embedding models.

Because of the random nature of LLMs, validation functions are required to prevent errors that could jeopardize the whole execution of the code. In our use case, they were used to

  • check that the question does not require any additional context
  • check that the passage(s) answer(s) the question

whereas post-processors mainly aim at expanding the dataset with rephrased questions that simulate real-world data. What’s beneficial is that the majority of these post-processing functions are agnostic to the domain of the corpus.

Finally, the approaches described here, built through trial and error, are often the result of initiatives founded on intuition, such as the choice to address extractive questions and high-level questions separately.

Please don't hesitate to share your feedback with me on this open and fascinating topic; I would be delighted!
