How Doc Parsing is a key step in RAG applications

Discover how to create a Retrieval Augmented Generation (RAG) system for enterprise-level chatbots using the powerful Lettria API.

Doug Sillars

Mar 13, 2024

Introduction

In the last two years, large language models (LLMs) and generative chat apps like ChatGPT have taken the world by storm. The conversational manner of chatbots makes it easy to learn and extract information from the model. But LLMs are only as good as the information they have been trained on. They are prone to hallucinations: making up answers while sounding like they know what they are saying. Sometimes ChatGPT even admits that it doesn’t have the latest data and can’t answer the question accurately. Inaccurate answers and dated information are huge limitations for enterprise-wide adoption of chatbots.

A second problem is that LLMs are trained on publicly available information. Out of the box, you can’t ask ChatGPT about your company’s internal documents, or get accurate results on recent events. However, there is a way to do this. Retrieval Augmented Generation (RAG) allows you to piggyback your company’s information onto an LLM, and have the model return results from your documents and files. RAGs sit on top of existing LLMs, and thus do not require expensive modeling and training time.

By utilizing the power of an LLM without incurring the expense of creating one from scratch, RAGs are becoming a popular choice in the enterprise. In this post, we’ll walk through the general process of creating a RAG, with a close focus on the document parsing portion of the process.

I need a RAG! How do I make one?

Building an LLM-like experience that has specific knowledge for your industry, your company, or perhaps your very niche and specific hobby would be an amazing tool. Having all of your documentation available through a chat interface will allow your team to better understand and learn from the information your company has, and to find ways to better leverage that knowledge.

Data ingestion

When creating a RAG, you need to parse all of the required files and insert them into a RAG database. When parsing the files, there are two very important criteria for the tooling you use: accuracy and speed.

Accuracy: If your files are parsed incorrectly, with typos and badly formatted sentences, the responses from your RAG will not be accurate. The refrain “garbage in, garbage out” holds here. The data inserted into the RAG must be complete and accurate.
Speed: Why speed? Imagine building an ingestion engine that can update your RAG whenever a new document is created, or even whenever an existing document is updated! If your ingestion engine is slow, a backlog can form and the data being presented in your RAG could be out of date. RAGs can support real-time stock prices (https://www.linkedin.com/pulse/embracing-rag-financial-services-new-frontier-ai-technologies-nzfme/), so any delay in updating the RAG could potentially have huge consequences.
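
One way to keep an ingestion engine fast is to re-parse only the documents that are new or have changed. The sketch below is a minimal illustration of that idea using content hashes; the function names (`content_hash`, `documents_to_ingest`) and the in-memory dictionaries are hypothetical and not part of the Lettria API.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest that identifies this exact document content."""
    return hashlib.sha256(data).hexdigest()


def documents_to_ingest(documents, seen_hashes):
    """Return the names of documents that are new or changed since the last run.

    `documents` maps a document name to its raw bytes. `seen_hashes` maps a
    name to the hash recorded at its last ingestion and is updated in place.
    """
    pending = []
    for name, data in documents.items():
        digest = content_hash(data)
        if seen_hashes.get(name) != digest:
            pending.append(name)
            seen_hashes[name] = digest
    return pending


# Hypothetical sample data: the second run only picks up the changed file.
seen = {}
first_run = documents_to_ingest({"a.html": b"v1", "b.html": b"v1"}, seen)
second_run = documents_to_ingest({"a.html": b"v1", "b.html": b"v2"}, seen)
print(first_run)   # ['a.html', 'b.html']
print(second_run)  # ['b.html']
```

In a real pipeline the hash table would likely live in persistent storage, and only the names returned here would be sent on to the parser.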

There are a number of tools that can be used to parse your documents. The Lettria API is a great choice: we have compared it to competing tools, and it was far superior in accuracy and competitive in processing speed.

Using a low-quality or slow parsing engine can doom your RAG goals from the very start.

The Lettria API has a free tier that you can use to test your RAG generation models. The API can accept all popular text and image formats, converting the text into JSON that can be easily cleaned and processed into your RAG.

To learn more about Lettria’s Parse API, you can read our tutorial, or check out our documentation.

Example

Imagine that your company has built a RAG of movie data, and you’d like to keep it up to date in real time. The Academy Awards were recently presented, and here is an example of how Lettria’s parse engine could be used to immediately ingest data on the award winners.

First, we need a data source. In this example, we’ll use a free API, https://newsapi.org/, to search for awards data. This Python call searches for articles published around the evening of the Oscars that mention “Academy Award for Best Picture” and “Oscars”:

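import requests

# Search NewsAPI for articles published around the evening of the Oscars.
# Both search terms go into a single q parameter, joined with AND.
params = {
    'q': '"Academy Award for Best Picture" AND Oscars',
    'from': '2024-03-10',
    'to': '2024-03-12',
    'sortBy': 'popularity',
    'apiKey': '<api_key>',
}

response = requests.get('https://newsapi.org/v2/everything', params=params)
json_data = response.json()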

Next, we filter the results for entries with ‘best picture’ and ‘won’ in the description. This returns one URL:

# Collect the URLs of articles whose description mentions who won best picture
bp_articles = []
for article in json_data['articles']:
    # The description can be missing, so fall back to an empty string
    description = (article.get('description') or '').lower()
    if "won" in description and "best picture" in description:
        # Probably a good source of data
        bp_articles.append(article['url'])
print(bp_articles)

We then save this webpage locally:

import requests

# The URL of the webpage you want to extract
url = bp_articles[0]

# Send a GET request to the webpage
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Save the webpage content to a file
    with open('webpage.html', 'w', encoding='utf-8') as file:
        file.write(response.text)
    print("Webpage saved as 'webpage.html'.")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

Now we use the Lettria Parse engine to extract the text in a clean format for our RAG:

import requests

jwt_token = "<api key>"

api_host = 'https://api.lettria.com'
api_route = api_host + '/parse'

document_file_name = 'webpage.html'

# Open the file and send it to the Lettria Parse API
with open(document_file_name, 'rb') as file:
    response = requests.post(
        api_route,
        files=[('file', (document_file_name, file))],
        headers={"Authorization": "LettriaProKey " + jwt_token}
    ).json()

# Keep only the text chunks from the parsed response
chunks = response['chunks']
text_chunks = [chunk for chunk in chunks if chunk['type'] == 'text']

A quick examination of a few chunks shows that the correct data is being extracted:

Christopher Nolan's epic biographical drama Oppenheimer won best picture at the 96th Academy Awards on Sunday after dominating for most of the evening, taking home 7 awards across categories.

The resulting text chunks are cleanly parsed data, ready to be inserted into a RAG database. Building automations with news feeds in this way enables you to build RAG databases that contain near real-time data.
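
Before insertion, a light cleaning pass can help keep the database tidy. The sketch below is one possible approach, not Lettria's own method: it assumes the text of each chunk has already been pulled out into a plain list of strings (how to do that depends on the response schema), and the `clean_chunks` helper and its `min_words` cutoff are hypothetical.

```python
import re


def clean_chunks(texts, min_words=5):
    """Normalize whitespace, drop very short fragments, and remove duplicates."""
    cleaned = []
    seen = set()
    for text in texts:
        # Collapse runs of spaces and newlines into single spaces
        normalized = re.sub(r"\s+", " ", text).strip()
        if len(normalized.split()) < min_words:
            continue  # likely navigation text or other page residue
        key = normalized.lower()
        if key in seen:
            continue  # case-insensitive duplicate of an earlier chunk
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


# Hypothetical sample strings standing in for extracted chunk text
sample = [
    "Oppenheimer  won best picture\nat the 96th Academy Awards.",
    "Share this article",
    "oppenheimer won best picture at the 96th Academy Awards.",
]
result = clean_chunks(sample)
print(result)  # ['Oppenheimer won best picture at the 96th Academy Awards.']
```

The word-count cutoff is a blunt heuristic; a stricter or looser value may suit other kinds of documents.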

After Processing

Once you have parsed and cleaned your data (link to the first post), it is inserted into a vector database that is connected to the LLM. Vector databases store your text as numerical vectors (embeddings) that can be searched by similarity to a user’s question. Search results are pulled from the database first, and if no good results are obtained, the LLM can attempt to answer from its broader corpus of data. Now your team can use a chat interface to search, examine, and cross-reference your documentation!
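
The retrieve-first, fall-back-second flow can be sketched in a few lines. This is a toy illustration under stated assumptions: a bag-of-words count vector stands in for a real embedding model, a Python list stands in for the vector database, and the names `embed`, `retrieve`, `build_prompt`, and the 0.3 threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, threshold=0.3):
    """Return the best-matching chunk, or None if nothing reaches the threshold."""
    query = embed(question)
    best_score, best_chunk = 0.0, None
    for chunk in chunks:
        score = cosine(query, embed(chunk))
        if score > best_score:
            best_score, best_chunk = score, chunk
    return best_chunk if best_score >= threshold else None


def build_prompt(question, chunks):
    """Add retrieved context to the prompt, or pass the question through unchanged."""
    context = retrieve(question, chunks)
    if context is None:
        return question  # no good match: the LLM answers from its own training data
    return f"Answer using this context:\n{context}\n\nQuestion: {question}"


chunks = [
    "Oppenheimer won best picture at the 96th Academy Awards.",
    "Vector databases store text as numeric vectors.",
]
hit = retrieve("Which film won best picture?", chunks)
miss = retrieve("What is the capital of France?", chunks)
print(hit)   # the Oppenheimer chunk
print(miss)  # None
```

A production system would swap in a real embedding model and vector store, but the control flow (search first, fall back when similarity is low) stays the same.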

Building a RAG

When building a RAG to customize your chat interface with unique documentation, your first step is to build a pipeline for that data to be rapidly parsed and inserted into the tool. APIs like the Lettria Parse API allow for fast and accurate processing of your data and files into your RAG database, allowing for near real-time ingestion of content. If you’re interested in testing out the Lettria API for your RAG ingestion, sign up today!

Doug Sillars

Doug is a lifelong learner and educator, having focused his career on improving developer knowledge and experiences. A Google Developer Expert for the web, O’Reilly author, international keynote speaker, and a prolific blogger, he relishes in simplifying the complex.