Apache Tika is widely used as a critical enabling technology for search in Apache Solr and other search systems. This open source library performs text and metadata extraction from numerous file formats, including PDF via an integration with Apache PDFBox. When text extraction goes wrong, the reliability of search and other natural language processing (NLP) applications suffers.
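To make the extraction step concrete, the following is a minimal sketch using Tika's Java API (AutoDetectParser, BodyContentHandler, Metadata); the file name sample.pdf is a placeholder, and it assumes the tika-parsers dependency (tika-parsers-standard-package in Tika 2.x) is on the classpath.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class ExtractExample {
    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 removes the default write limit so long documents are not truncated
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Path.of("sample.pdf"))) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // Metadata fields detected by the parser (content type, page count, etc.)
        for (String name : metadata.names()) {
            System.out.println(name + ": " + metadata.get(name));
        }
        // Extracted plain text
        System.out.println(handler.toString());
    }
}
```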
Over the last 5 years, the Tika project has gathered and published a large corpus of files (https://corpora.tika.apache.org/base/docs), and we have developed an evaluation module (tika-eval) and methodology to identify regressions in text extraction and areas for improvement in our parsers.
This talk offers an overview of Tika’s publicly available regression corpus as well as the tika-eval module. We’ll discuss how its NLP/language-modeling-based metrics can be used to identify potential mojibake, corrupt text, and bad OCR at scale. These techniques are relevant to PDF parser developers, search system integrators, and those working in archiving and accessibility.
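To illustrate the flavor of such metrics (this is not the actual tika-eval API, only a rough proxy for the kind of statistics it reports), the sketch below computes a common-word ratio over extracted text: the fraction of alphabetic tokens found in a list of frequent words for the expected language. Very low ratios on substantial text are a hint of mojibake, corrupt text, or bad OCR. The file names common_words_en.txt and extracted.txt are placeholders.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class CommonWordRatio {
    // Split on runs of non-letter characters to get alphabetic tokens
    private static final Pattern NON_LETTERS = Pattern.compile("[^\\p{L}]+");

    /** Fraction of tokens that appear in the common-word list. */
    static double commonWordRatio(String text, Set<String> commonWords) {
        long total = 0;
        long matched = 0;
        for (String token : NON_LETTERS.split(text.toLowerCase(Locale.ROOT))) {
            if (token.isEmpty()) {
                continue;
            }
            total++;
            if (commonWords.contains(token)) {
                matched++;
            }
        }
        return total == 0 ? 0.0 : (double) matched / total;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder word list: one lowercase word per line
        Set<String> commonWords = Set.copyOf(Files.readAllLines(Path.of("common_words_en.txt")));
        String extracted = Files.readString(Path.of("extracted.txt"));
        System.out.printf("common-word ratio: %.3f%n", commonWordRatio(extracted, commonWords));
    }
}
```

Run over a whole corpus of extracts, a metric like this can be aggregated per parser version to flag files whose text quality regressed.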