Data cleaning with Lettria’s Parse API

Learn how to easily parse and clean data for NLP models using Lettria's Parse API and Python.

Over the last year, the world has been introduced to ChatGPT and discovered how AI can simplify everyday tasks. Many organizations realize that they have drives full of useful data, but no easy way to surface that data to their employees (or their customers). One solution is to build a Natural Language Processing (NLP) model based on the institutional knowledge stored in files across the company. Once modeled, this data becomes searchable and useful to the team!

One of the most important parts of building an NLP model is ensuring that the documents are all parsed consistently. If the data is not imported well, or in a uniform way, the results of the model will be unsatisfactory. When building an NLP model, how can you ensure that all files are parsed correctly and sent to the modeling function correctly?

Introducing the Lettria Parse API. Lettria’s Parse API accepts images, PDFs, and text-format documents, and parses the data into well-formed JSON. Once in JSON format, the data can be easily cleaned with existing Python libraries to meet your NLP model’s needs (all lowercase, special characters removed, etc.). In this post, we’ll demonstrate how to call the Lettria Parse API and show how one might clean the resulting text.

Once we have quickly and accurately parsed our files with the Lettria Parse API, we can do our text cleaning with a few lines of Python. Interested in learning how? Read on!

Using the Lettria API

To use the API, we need to have an API key and a document to parse.

To generate an API key, sign up at https://app.lettria.com/.  In the right-side navigation, navigate to API keys and click “+New Key” to create an API key. The Lettria API has a generous free tier, so we will be able to run our tests at no charge.

The Lettria API accepts text files (.txt, .docx), table-formatted documents (.xls, .csv), images (.png, .jpg, .webp), and PDFs. To make an API call, we will use a screenshot of the first chapter of Moby Dick.

Making an API call

Here is a sample API call using cURL:

curl -X POST \
  -H "Authorization: LettriaProKey <apikey>" \
  -F "file=@md.jpg;filename=md.jpg" \
  https://api.lettria.com/parse

The -H line adds the Authorization header that authenticates the request with Lettria. Replace <apikey> with your API key. The -F line attaches the file we will parse: @md.jpg is the path to the file on the local filesystem, and filename is the name sent with the upload. In the example above, we are using md.jpg, a screenshot of the first few paragraphs of Moby Dick. The last line is the API endpoint.

When we run this command, cURL reads the local file and uploads it to Lettria. The response from the API contains our parsed data in JSON format.

Python example

Getting a giant JSON response in your terminal shows that the API is working, but integrating the call into a Python workflow would be more useful. Here is the same example using the Python requests library:

import requests

# Replace with your Lettria API key
jwt_token = "<api token>"
api_host = 'https://api.lettria.com'
api_route = api_host + '/parse'

# Sample input files; pick one below
pdf_file_name = 'DougSillars_2024.pdf'
website_file_name = 'sodr.html'
jpg_file_name = 'md-screenshot.jpg'
jpg2 = 'resumess.jpg'

# Choose the file to parse
document_file_name = jpg_file_name
print(document_file_name)

# Open the file and send it to Lettria
with open(document_file_name, 'rb') as file:
    response = requests.post(
        api_route,
        files=[('file', (document_file_name, file))],
        headers={"Authorization": "LettriaProKey " + jwt_token}
    ).json()

# The parsed document comes back as a list of chunks
chunks = response['chunks']

# Keep only the text chunks
text_chunks = [chunk for chunk in chunks if chunk['type'] == 'text']
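
The call above assumes the request always succeeds. A minimal sketch of a more defensive version might look like the following. The helper names auth_header and parse_file, the 60-second timeout, and the assumption that failures surface as HTTP error statuses are ours for illustration; they are not taken from Lettria's documentation.

```python
import os


def auth_header(api_key):
    # Same header format as in the example above.
    return {"Authorization": "LettriaProKey " + api_key}


def parse_file(path, api_key, api_route="https://api.lettria.com/parse", timeout=60):
    """Upload one file and return its list of chunks (hypothetical helper)."""
    # Imported here so auth_header can be used without requests installed.
    import requests

    with open(path, "rb") as file:
        response = requests.post(
            api_route,
            files=[("file", (os.path.basename(path), file))],
            headers=auth_header(api_key),
            timeout=timeout,  # assumption: 60 s is enough for one document
        )
    # Assumption: the API reports failures with a 4xx/5xx status code.
    response.raise_for_status()
    return response.json().get("chunks", [])
```

With this in place, a call such as parse_file("md-screenshot.jpg", "<api token>") would return the same chunks list as the script above, or raise an exception if the upload fails.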

The API Response

The response for the file contains a lot of information and scrolls on and on. It isn’t complicated, but there is a lot of data. Let’s walk through the fields in the JSON response to better understand what we are receiving.

At the highest level of the JSON, all of the text data is broken into chunks, and the response holds them in an array under the chunks key:

{"chunks": [{<chunk1 data>}, {<chunk2 data>}, ...]}
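
As a rough sketch of how this top-level structure can be inspected, the snippet below counts the chunks in a response by type. The sample_response dictionary is a hand-made stand-in trimmed to three fields, and count_chunk_types is a hypothetical helper name.

```python
from collections import Counter

# Hand-made stand-in for a parsed response, trimmed to the fields used here.
sample_response = {
    "chunks": [
        {"id": 0, "type": "text", "content": "CHAPTER 1"},
        {"id": 2, "type": "text", "content": "Call me Ishmael."},  # shortened
    ]
}


def count_chunk_types(response):
    # Tally chunks by their "type" field.
    return Counter(chunk["type"] for chunk in response["chunks"])


type_counts = count_chunk_types(sample_response)
print(len(sample_response["chunks"]), dict(type_counts))
```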

Each of the chunk objects has a wealth of information.  Looking at the first chunk from our screenshot above:

{
        "content": "CHAPTER 1",
        "html_tags": [],
        "id": 0,
        "infos": {
            "lines": [
                0
            ],
            "page": 0,
            "words": [
                {
                    "bottom": 0.0556640625,
                    "content": "CHAPTER",
                    "font": "",
                    "height": 0.0185546875,
                    "indexes": [
                        0,
                        7
                    ],
                    "left": 0.01171875,
                    "right": 0.0947265625,
                    "top": 0.037109375,
                    "width": 0.0830078125
                },
                {
                    "bottom": 0.056640625,
                    "content": "1",
                    "font": "",
                    "height": 0.0224609375,
                    "indexes": [
                        8,
                        9
                    ],
                    "left": 0.1015625,
                    "right": 0.119140625,
                    "top": 0.0341796875,
                    "width": 0.017578125
                }
            ]
        },
        "line_split": [],
        "start_index": 0,
        "type": "text"
    },

The text in this chunk is simply the chapter header: “CHAPTER 1”. In the infos object, there are three parameters:

  • lines: An array of the line numbers the text appears on. In the example above, we only have one line - line 0 of the page.
  • page: The page number where this text chunk was found. In this example, we have just one page - page 0.
  • words: An array containing each word in the chunk. Each word has further details:
    • content: the text of the word.
    • font: the font used for the word, if detected.
    • indexes: where the word starts and ends in the document's text. This is a character count, starting at zero.
    • height, width, bottom, left, right, top: dimensional information about where the word is on the “page” (image, file, etc.)
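
The relationships between these fields can be checked against the sample chunk above. The sketch below assumes, based only on the example chunks shown in this post, that indexes are document-wide character offsets with an exclusive end, and that width and height are derived from the four edges. The helper name word_from_indexes is ours.

```python
# First chunk from the example, trimmed to the fields used here.
chunk = {
    "content": "CHAPTER 1",
    "start_index": 0,
    "infos": {
        "words": [
            {"content": "CHAPTER", "indexes": [0, 7],
             "left": 0.01171875, "right": 0.0947265625,
             "top": 0.037109375, "bottom": 0.0556640625,
             "width": 0.0830078125, "height": 0.0185546875},
            {"content": "1", "indexes": [8, 9],
             "left": 0.1015625, "right": 0.119140625,
             "top": 0.0341796875, "bottom": 0.056640625,
             "width": 0.017578125, "height": 0.0224609375},
        ]
    },
}


def word_from_indexes(chunk, word):
    # Shift the document-wide offsets so they are relative to this chunk.
    start, end = (i - chunk["start_index"] for i in word["indexes"])
    return chunk["content"][start:end]


for word in chunk["infos"]["words"]:
    # The slice of the chunk text matches the word's own content.
    assert word_from_indexes(chunk, word) == word["content"]
    # width = right - left, height = bottom - top (in this sample).
    assert abs((word["right"] - word["left"]) - word["width"]) < 1e-9
    assert abs((word["bottom"] - word["top"]) - word["height"]) < 1e-9
```

Subtracting start_index matters for later chunks: in the paragraph chunk, start_index is 20 and the word “Call” has indexes [20, 24], which map to characters 0 through 3 of that chunk's content.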

Skipping ahead to the first paragraph:

{
        "content": "Call me Ishmael. Some years ago--never mind how long precisely-having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people's hats off--then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophica flourish Cato throws himself upon his Sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me.",
        "html_tags": [],
        "id": 2,
        "infos": {
            "lines": [
                2,
                3,
                4,
                5,
                6,
                7,
                8,
                9,
                10,
                11,
                12,
                13,
                14,
                15,
                16,
                17,
                18
            ],
            "page": 0,
            "words": [
                {
                    "bottom": 0.171875,
                    "content": "Call",
                    "font": "",
                    "height": 0.021484375,
                    "indexes": [
                        20,
                        24
                    ],
                    "left": 0.009765625,
                    "right": 0.0615234375,
                    "top": 0.150390625,
                    "width": 0.0517578125
                },
                {
                    "bottom": 0.1728515625,
                    "content": "me",
                    "font": "",
                    "height": 0.01953125,
                    "indexes": [
                        25,
                        27
                    ],
                    "left": 0.06640625,
                    "right": 0.0966796875,
                    "top": 0.1533203125,
                    "width": 0.0302734375
                },
                {
                    "bottom": 0.169921875,
                    "content": "Ishmael.",
                    "font": "",
                    "height": 0.0185546875,
                    "indexes": [
                        28,
                        36
                    ],
                    "left": 0.103515625,
                    "right": 0.1943359375,
                    "top": 0.1513671875,
                    "width": 0.0908203125
                },
                <BIG SNIP>
            ]
        },
        "line_split": [],
        "start_index": 20,
        "type": "text"
    },

For brevity, we snip the output after the famous first line “Call me Ishmael.”

For most NLP tasks, the “content” field of each chunk will be sufficient.

Using Python to print just the content reveals Melville’s words.
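
A minimal sketch of that step is shown below. The chunks list here is a hand-made stand-in for the list returned by the API, and chunk_contents is a hypothetical helper name.

```python
# Stand-in for the `chunks` list returned by the API, trimmed for brevity.
chunks = [
    {"id": 0, "type": "text", "content": "CHAPTER 1"},
    {"id": 2, "type": "text", "content": "Call me Ishmael."},  # shortened
]


def chunk_contents(chunks):
    # Keep only text chunks and pull out their "content" strings.
    return [chunk["content"] for chunk in chunks if chunk["type"] == "text"]


for text in chunk_contents(chunks):
    print(text)
```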

Text cleaning

Once your text has been imported, and before it can be added to the model, you may wish to perform some cleaning tasks on the data. Some NLP models work best with all text set to lowercase. Since we are using Python in our example, adding .lower() to our extraction path converts the text to lowercase.
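
For instance, a lowercase pass over the text chunks could be sketched like this (text_chunks is a shortened stand-in for the list built earlier):

```python
# Shortened stand-in for the text_chunks list built from the API response.
text_chunks = [
    {"type": "text", "content": "CHAPTER 1"},
    {"type": "text", "content": "Call me Ishmael."},
]

# str.lower() returns a lowercase copy of each chunk's content.
lowercase_texts = [chunk["content"].lower() for chunk in text_chunks]
print(lowercase_texts)
```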

We could also UTF-8 encode special characters or perform any other cleaning in our Python workflow. The logic for text cleaning can be as simple or as complex as needed for your use case.
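
As one possible sketch of a slightly fuller cleaning step, the function below normalizes Unicode, lowercases the text, strips punctuation, and collapses whitespace. The function name clean_text and the specific rules are illustrative choices, not a recommendation from Lettria; adjust them to what your model expects.

```python
import re
import unicodedata


def clean_text(text):
    # Fold compatibility characters into a canonical Unicode form.
    text = unicodedata.normalize("NFKC", text)
    text = text.lower()
    # Replace anything that is not a word character (letter, digit,
    # underscore) or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse the runs of whitespace left behind by the previous step.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("Some years ago--never mind how long")
print(cleaned)

# If bytes are needed downstream, encode the cleaned string as UTF-8.
encoded = cleaned.encode("utf-8")
```

Note that replacing punctuation with spaces also splits contractions and possessives (“people's” becomes “people s”), so a real pipeline may want a gentler rule for apostrophes.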

Conclusion

In this post, we have introduced the Lettria Parse API. This API takes files of many different formats and quickly and accurately extracts the data. The results are easy to manipulate in their JSON format, allowing for easy cleaning and injection into your NLP pipeline. Fast, accurate data extraction helps reduce errors and improve the accuracy of your NLP models. Try the Lettria API today for free!
