Introduction to the Text Cleaning Benchmark
The test was designed to cover a broad range of document formats (PDF, PNG, JPG, TXT, and DOCX) and layout elements such as tables, lists, columns, and bold words, as well as large files. Special attention went to difficult cases such as CID errors in PDFs.
In total, 10 varied documents were analyzed.
The methodology was simple: a script called each API on every document and systematically recorded the outcomes, focusing on the quality and efficiency of text cleaning.
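As an illustration of this kind of script, here is a minimal sketch of a benchmark loop. It is not the script used for this benchmark: the function name `run_benchmark`, the record fields, and the stand-in cleaners are all assumptions, and each real cleaner would wrap a call to the Lettria or unstructured.io client.

```python
import json
import time
from typing import Callable, Dict, List


def run_benchmark(paths: List[str], cleaners: Dict[str, Callable[[str], str]]) -> List[dict]:
    """Call every cleaner on every document and record the outcome."""
    records = []
    for path in paths:
        for name, clean in cleaners.items():
            start = time.perf_counter()
            try:
                text, error = clean(path), None
            except Exception as exc:  # keep going so one failure does not stop the run
                text, error = "", repr(exc)
            elapsed = time.perf_counter() - start
            records.append({
                "document": path,
                "api": name,
                "seconds": round(elapsed, 4),
                "characters": len(text),
                "error": error,
            })
    return records


if __name__ == "__main__":
    # Hypothetical stand-in cleaners; real ones would call the actual APIs.
    demo = {"api_a": lambda p: "cleaned " + p, "api_b": lambda p: p.upper()}
    print(json.dumps(run_benchmark(["sample.pdf"], demo), indent=2))
```

Recording one flat record per (document, API) pair keeps the results easy to dump as JSON and compare side by side.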
Processing Time in Text Cleaning
A key factor in text cleaning is the processing time, which can significantly impact the efficiency of data projects.
The analysis showed that Lettria occasionally processed documents faster than unstructured.io, but not consistently across all tests.
unstructured.io ran locally, which eliminated internet latency, but its measured processing time included the time needed to open and read each file.
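To see how much of a local measurement comes from file I/O rather than from the cleaning itself, the two phases can be timed separately. The sketch below is an assumption about how one might do this, not part of the original benchmark; `time_read_and_process` and the `process` callable are hypothetical names.

```python
import os
import tempfile
import time
from typing import Callable, Tuple


def time_read_and_process(path: str, process: Callable[[bytes], str]) -> Tuple[float, float]:
    """Return (read_seconds, process_seconds) for one file."""
    t0 = time.perf_counter()
    with open(path, "rb") as handle:
        data = handle.read()          # file open/read phase
    t1 = time.perf_counter()
    process(data)                     # cleaning phase (stand-in for a local parser)
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1


if __name__ == "__main__":
    # Demo on a throwaway file; the lambda stands in for a real local parser.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
        tmp.write(b"some text to clean " * 1000)
    read_s, process_s = time_read_and_process(tmp.name, lambda b: b.decode().strip())
    os.unlink(tmp.name)
    print(f"read: {read_s:.6f}s, process: {process_s:.6f}s")
```

Reporting both numbers makes a local run easier to compare with a remote API call, where the equivalent overhead is network latency rather than disk access.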
Dependencies and Preparation
Both Lettria and unstructured.io required a range of Python package dependencies.
These dependencies support tasks such as OCR (Optical Character Recognition), handling different file formats, and the text extraction and cleaning itself.
For details, see Lettria's Text Parsing documentation.
Comparative Analysis on Text Cleaning
The benchmark test covered various documents to comprehensively evaluate each API's text cleaning prowess. Here are some examples.
You can also see the documents used to run the benchmark below, and download the results in JSON format at the end of the article.
PDF Documents (Text and Columns):
Lettria removed irrelevant text such as page numbers and cleanly separated titles from body content, giving the cleaner output of the two.
unstructured.io, while efficient, struggled with language detection and separating text elements cleanly.
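For readers who want to post-process raw extractor output themselves, page numbers can often be filtered with a simple line-level heuristic. The sketch below is only an illustration under that assumption; it does not describe how either API works internally, and the pattern and the name `drop_page_numbers` are hypothetical.

```python
import re

# Lines that contain nothing but "12", "Page 12", "12 / 40", "Page 4 of 10" or "- 12 -".
PAGE_NUMBER_LINE = re.compile(
    r"^\s*(?:-\s*)?(?:page\s+)?\d{1,4}(?:\s*(?:/|of)\s*\d{1,4})?(?:\s*-)?\s*$",
    re.IGNORECASE,
)


def drop_page_numbers(text: str) -> str:
    """Remove lines that look like standalone page numbers."""
    return "\n".join(
        line for line in text.splitlines() if not PAGE_NUMBER_LINE.match(line)
    )
```

A heuristic like this will also drop any legitimate line made up of a bare number, for example a lone table cell, so it suits running text better than tabular content.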
OCR Accuracy on Images (JPG, PNG):
Lettria's robust OCR capabilities were evident, as it successfully minimized common character recognition errors, a critical aspect of text cleaning in image-based documents.
Conversely, unstructured.io showed weaknesses in OCR, impacting its text cleaning accuracy.
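One common way to quantify OCR quality is the character error rate (CER): the edit distance between the OCR output and a hand-checked reference, divided by the reference length. The benchmark does not state which metric was used, so the following is a generic sketch with hypothetical function names rather than the authors' method.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, ocr_output: str) -> float:
    """Edit distance normalised by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)
```

A lower CER means fewer character-level mistakes; typical OCR confusions such as "l" read as "1" each count as one substitution.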
Handling CID Errors in PDFs:
CID errors are a complex text cleaning challenge: they occur when characters in a PDF cannot be mapped back to readable text and come out as character codes instead. unstructured.io returned these errors in its output, while Lettria chose to return empty output in such cases.
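In extracted text, CID errors commonly surface as literal tokens of the form `(cid:123)`. The sketch below shows one way to detect them and to choose between stripping them and returning nothing, loosely mirroring the two behaviours described above. The 0.5 threshold and the names `cid_share` and `clean_or_empty` are assumptions for illustration, not either vendor's logic.

```python
import re

CID_TOKEN = re.compile(r"\(cid:\d+\)")


def cid_share(text: str) -> float:
    """Fraction of the characters in text that belong to (cid:N) tokens."""
    if not text:
        return 0.0
    cid_chars = sum(len(match.group(0)) for match in CID_TOKEN.finditer(text))
    return cid_chars / len(text)


def clean_or_empty(text: str, threshold: float = 0.5) -> str:
    """Strip CID tokens, or return '' when most of the text is unreadable."""
    if cid_share(text) > threshold:
        return ""
    without_cid = CID_TOKEN.sub("", text)
    # Collapse the double spaces left behind by removed tokens.
    return re.sub(r"[ \t]{2,}", " ", without_cid).strip()
```

Whether an empty result or a partially readable one is preferable depends on the downstream task, which is why the threshold is left as a parameter.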
Large Document Processing:
In processing large documents, unstructured.io demonstrated a significant edge in speed, suggesting better efficiency for text cleaning in voluminous documents.
Handling HTML Content in TXT Files:
Lettria outshone unstructured.io by effectively removing HTML tags and logically organizing text, a crucial aspect of text cleaning.
unstructured.io did not remove HTML tags and struggled with logical text segmentation.
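When an extractor leaves HTML tags in place, they can be removed in a post-processing step. Here is a minimal sketch using only Python's standard `html.parser` module; the class and function names and the choice of block-level tags are assumptions, and this is not a description of how Lettria handles HTML.

```python
from html.parser import HTMLParser

# Tags treated as paragraph boundaries (an illustrative, incomplete list).
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def strip_html(raw: str) -> str:
    """Return the visible text of raw, one block-level element per line."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any text still buffered by the parser
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

Breaking lines at block-level tags preserves a rough logical segmentation, so headings and paragraphs do not run together once the markup is gone.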
DOCX File Handling:
Both APIs showed comparable performance in handling DOCX files, with minor differences in dealing with specific elements.
Strengths and Final Thoughts
Lettria is a standout for its precision in OCR, efficient text segmentation, and handling of complex elements like tables and HTML.
It's ideal for projects requiring detailed and accurate text cleaning.
unstructured.io shines in its processing speed and lighter JSON output, making it suitable for projects where quick text cleaning is essential.
This detailed benchmark analysis highlights that the choice between Lettria and unstructured.io for text cleaning tasks should align with the specific requirements of a project.
Lettria is preferable for intricate text cleaning and accuracy, while unstructured.io offers advantages in speed and efficiency.
Each API has distinct capabilities, making them suitable for different text cleaning scenarios in data analysis and machine learning projects.