Benchmark

In-Depth Text Cleaning Benchmark: Lettria vs. unstructured.io

The task of text cleaning and processing is pivotal in data analysis and machine learning applications. Accurately extracting and cleaning text data from various document formats can significantly influence the outcome of data-driven projects.

In this context, a thorough benchmark analysis was conducted to compare the text cleaning capabilities of two prominent APIs in the field: Lettria and unstructured.io. This analysis aimed to shed light on how these tools perform across a diverse range of document types, emphasizing their strengths and weaknesses in text cleaning.

10 documents tested

32 hours of work

28 tests run

Introduction to the Text Cleaning Benchmark

The test was designed to cover a broad range of document formats (PDF, PNG, JPG, TXT, and DOCX) as well as elements such as tables, lists, columns, bold words, and large files. Special attention was paid to challenging cases such as CID errors in PDFs.

In total, 10 varied documents were analyzed.

The core methodology involved using a script for calling the APIs and systematically recording the outcomes, with a keen focus on the quality and efficiency of text cleaning.
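The benchmark script itself is not shown in this article. As a rough sketch of the approach described above, a harness could loop over documents and tools, time each call, and record the output. The function names and the stand-in cleaners below are hypothetical; a real run would wrap each vendor's own client instead.

```python
import time
from typing import Callable, Dict, List


def run_benchmark(
    documents: Dict[str, bytes],
    cleaners: Dict[str, Callable[[bytes], str]],
) -> List[dict]:
    """Call every cleaner on every document and record output, errors, and timing."""
    results = []
    for doc_name, payload in documents.items():
        for tool_name, clean in cleaners.items():
            start = time.perf_counter()
            try:
                output, error = clean(payload), None
            except Exception as exc:  # record the failure and keep going
                output, error = "", repr(exc)
            elapsed = time.perf_counter() - start
            results.append(
                {
                    "document": doc_name,
                    "tool": tool_name,
                    "output": output,
                    "error": error,
                    "seconds": elapsed,
                }
            )
    return results


# Hypothetical stand-ins for the two tools (not the vendors' real clients).
def fake_tool_a(payload: bytes) -> str:
    return payload.decode("utf-8").strip()


def fake_tool_b(payload: bytes) -> str:
    raise RuntimeError("simulated failure")


records = run_benchmark(
    {"sample.txt": b"  hello  "},
    {"tool_a": fake_tool_a, "tool_b": fake_tool_b},
)
for record in records:
    print(record["tool"], repr(record["output"]), record["error"])
```

Catching exceptions per call keeps one failing document from ending the whole run, which matters when the goal is a complete pass/fail table.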

What are the key results?

Model              Benchmark Tests Passed
unstructured.io    25%
Lettria            71%

A total of 28 benchmark tests were run to check the outputs of both tools and see how they performed on basic tasks, depending on the file and input type.

Lettria passed 20 out of 28 tests; unstructured.io passed 7 out of 28.
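The percentages reported above follow directly from these counts. A small sketch of the arithmetic, rounding to the nearest whole percent (the helper name is illustrative):

```python
def pass_rate(passed: int, total: int) -> int:
    """Return the share of passed tests as a whole-number percentage."""
    return round(100 * passed / total)


scores = {
    "Lettria": pass_rate(20, 28),         # 20 / 28 = 0.714...
    "unstructured.io": pass_rate(7, 28),  # 7 / 28 = 0.25
}
print(scores)  # {'Lettria': 71, 'unstructured.io': 25}
```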

Processing Time in Text Cleaning

A key factor in text cleaning is the processing time, which can significantly impact the efficiency of data projects.

The analysis showed that Lettria was sometimes faster than unstructured.io's local processing, but not consistently across all tests.

unstructured.io ran locally, which eliminated internet latency, but its measured processing time included the time spent opening and reading files.
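Because file open/read time was folded into the local measurement, one way to make such timings easier to compare is to record the read and the processing step separately. The sketch below is an illustration under that assumption, not the benchmark's actual script; `process` is a hypothetical placeholder for whatever cleaning call is being timed.

```python
import os
import tempfile
import time
from typing import Callable, Dict


def time_read_and_process(path: str, process: Callable[[bytes], object]) -> Dict[str, float]:
    """Time file reading and processing as two separate phases."""
    t0 = time.perf_counter()
    with open(path, "rb") as handle:
        payload = handle.read()
    t1 = time.perf_counter()
    process(payload)
    t2 = time.perf_counter()
    return {"read_s": t1 - t0, "process_s": t2 - t1, "total_s": t2 - t0}


# Demo with a temporary file and a trivial stand-in for the cleaning step.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"some text to clean\n" * 1000)
    demo_path = tmp.name

try:
    timings = time_read_and_process(demo_path, lambda data: data.decode("utf-8").upper())
finally:
    os.remove(demo_path)

print(timings)
```

Reporting `process_s` alongside `total_s` makes it clear how much of a local measurement is disk I/O rather than cleaning work.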

Dependencies and Preparation

Both Lettria and unstructured.io required a range of Python package dependencies.

These dependencies are crucial for tasks like OCR (Optical Character Recognition), handling different file formats, and ensuring the APIs' functionality in text extraction and cleaning.
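The article does not list the specific packages, so the names below are placeholders only. As a minimal sketch, a pre-flight check can report which top-level packages are not importable before a benchmark run starts:

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Placeholder names: "json" ships with Python; the second one should not exist.
required = ["json", "surely_not_installed_pkg_xyz"]
print(missing_packages(required))  # ['surely_not_installed_pkg_xyz']
```

Checking up front avoids a run that fails halfway through on the first image or PDF that needs an absent OCR or parsing library.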

Comparative Analysis on Text Cleaning

The benchmark test covered various documents to comprehensively evaluate each API's text cleaning prowess. Here are some examples.

You can also see the documents used to run the benchmark below, and download the results in JSON format at the end of the article.

PDF Documents (Text and Columns)

Lettria excelled at removing irrelevant text such as page numbers and at separating titles from body content.

unstructured.io, while efficient, struggled with language detection and separating text elements cleanly.
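To illustrate what removing page numbers involves, here is a simple heuristic that drops lines consisting only of a page marker. This is a sketch for illustration and does not describe how either tool works internally; the pattern is an assumption about what page-number lines look like.

```python
import re

# Matches lines such as "3", "Page 4", "4/10", or "Page 4 of 10" and nothing else.
PAGE_NUMBER = re.compile(
    r"^\s*(?:page\s+)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?\s*$",
    re.IGNORECASE,
)


def drop_page_numbers(text: str) -> str:
    """Remove lines that contain only a page-number marker."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)


sample = "Title\n3\nBody text\nPage 4 of 10\nIn 2023 sales rose"
print(drop_page_numbers(sample))
```

A rule this simple would also delete a legitimate line that is only a number, such as a lone table cell, which is part of why this kind of cleaning is harder than it looks.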

OCR Accuracy on Images (JPG, PNG)

Lettria's robust OCR capabilities were evident, as it successfully minimized common character recognition errors, a critical aspect of text cleaning in image-based documents.

Conversely, unstructured.io showed weaknesses in OCR, impacting its text cleaning accuracy.
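The article does not say how OCR quality was scored. One simple way to quantify character recognition errors, shown here purely as an assumption, is to compare the OCR output with a hand-typed reference using a character-level similarity ratio from Python's standard library:

```python
from difflib import SequenceMatcher


def similarity(reference: str, ocr_output: str) -> float:
    """Return a 0..1 character-level similarity between reference and OCR text."""
    return SequenceMatcher(None, reference, ocr_output).ratio()


reference = "Invoice total: 100"
noisy = "lnvoice tota1: l00"  # typical confusions: I/l and 1/l

print(similarity(reference, reference))
print(similarity(reference, noisy))
```

A score of 1.0 means the texts are identical; typical confusions such as `I`/`l` or `1`/`l` pull the score down in proportion to how many characters are affected.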

Handling CID Errors in PDFs

CID errors are a complex text cleaning challenge. unstructured.io passed these errors through in its output, whereas Lettria returned empty output in such cases.
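For readers who want to check their own extraction output, the sketch below assumes CID errors appear as literal `(cid:NN)` placeholders, the form PDF text extractors commonly emit when a glyph has no Unicode mapping. That format is an assumption, as the article does not show the raw output.

```python
import re
from typing import Dict, Union

# Assumed marker format: "(cid:" followed by digits and a closing parenthesis.
CID_MARKER = re.compile(r"\(cid:\d+\)")


def cid_report(text: str) -> Dict[str, Union[int, bool, str]]:
    """Count CID placeholders and return the text with them stripped out."""
    markers = CID_MARKER.findall(text)
    return {
        "cid_count": len(markers),
        "has_cid_errors": bool(markers),
        "cleaned": CID_MARKER.sub("", text),
    }


report = cid_report("Total (cid:72)(cid:79) due")
print(report)
```

Stripping the markers only hides the problem, since the underlying characters are still lost; a count like this is mainly useful for flagging documents that need a different extraction path such as OCR.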

Large Document Processing

In processing large documents, unstructured.io demonstrated a significant edge in speed, suggesting better efficiency for text cleaning in voluminous documents.

Handling HTML Content in TXT Files

Lettria outshone unstructured.io by effectively removing HTML tags and logically organizing text, a crucial aspect of text cleaning.

unstructured.io did not remove HTML tags and struggled with logical text segmentation.
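As an illustration of the tag-removal step, and not a description of Lettria's implementation, the following sketch uses Python's standard `html.parser` module to keep only the text nodes of a TXT file that contains HTML:

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collect text nodes and ignore tags."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: List[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(text: str) -> str:
    """Return the text content with tags removed and whitespace normalized."""
    parser = _TextCollector()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return " ".join(" ".join(parser.parts).split())


html_in_txt = "<h1>Title</h1><p>Some <b>bold</b> text &amp; more.</p>"
print(strip_html(html_in_txt))  # Title Some bold text & more.
```

This sketch joins text nodes with spaces, so a tag placed in the middle of a word would split it, and it keeps the contents of `script` and `style` elements. Organizing the result into logical segments, as described above, needs additional structure-aware handling.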

DOCX File Handling

Both APIs showed comparable performance in handling DOCX files, with minor differences in dealing with specific elements.

Download Results

Strengths and Final Thoughts

Lettria

Lettria stands out for its OCR precision, efficient text segmentation, and handling of complex elements such as tables and HTML.

It's ideal for projects requiring detailed and accurate text cleaning.

unstructured.io

unstructured.io shines in processing speed and lighter JSON output, making it suitable for projects where quick text cleaning is essential.

Conclusion

This detailed benchmark analysis highlights that the choice between Lettria and unstructured.io for text cleaning tasks should align with the specific requirements of a project.

Lettria is preferable for intricate text cleaning and accuracy, while unstructured.io offers advantages in speed and efficiency.

Each API has distinct capabilities, making them suitable for different text cleaning scenarios in data analysis and machine learning projects.
