Making more sense of PDF structures in the wild at scale

This is a follow-on talk from our 2021 PDF Days presentation on the File Observatory. Our team built the File Observatory to support the Defense Advanced Research Projects Agency (DARPA) SafeDocs program by enabling parser developers to understand features of PDFs in the wild at scale. In the first part of our presentation, we’ll offer an overview …

Making sense of PDF structures in the wild at scale

PDFs in the wild show a bewildering variety of syntax, features, and structure. For those building or evaluating parsers, it is critical to have a broad-coverage corpus available to discover and assess the distribution of issues “in the wild” or in specific client document sets. In this talk, our team will present …

Evaluating Text Extraction at Scale

Apache Tika is widely used as a critical enabling technology for search in Apache Solr and other search systems. This open-source library performs text and metadata extraction from numerous file formats, including PDF via an integration with Apache PDFBox. As we all know, when something goes wrong with text extraction, the reliability of search …