Making sense of PDF structures in the wild at scale

PDFs in the wild exhibit a bewildering variety of syntax, features and structure. For those building or evaluating parsers, a broad-coverage corpus is critical for assessing and discovering the distribution of issues “in the wild” or in specific client document sets. In this talk, our team will present our work building a File Observatory to support the Defense Advanced Research Projects Agency’s (DARPA) SafeDocs program, which initially focuses primarily on PDF and related formats (e.g. JPEG, ICC, fonts, XMP). The talk will focus on a) gathering interesting PDFs and b) making features searchable and patterns easily discoverable with open source search technologies. In the first part, we’ll discuss gathering millions of PDFs from Common Crawl and thousands of files from the bug trackers of open source PDF parsers. In the second, we’ll outline the capabilities of the File Observatory: running multiple parsers against the files, extracting features (runtime exceptions and error messages as well as structural features, including PDF DOM keys and values and other semantic components within the PDFs’ structures) and making those features searchable with Elasticsearch. We will also briefly demonstrate how the observatory enables the discovery of spelling variations (e.g. /Subtype vs. /SubType) and of structural features that are statistically correlated with specific creator tools.
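As a rough illustration of the spelling-variation idea, the sketch below groups PDF dictionary keys that differ only in letter case. It is not the File Observatory's implementation: the function name `find_spelling_variants` and the key counts are hypothetical, and it assumes the per-key counts have already been extracted from a corpus (for example, from a search index aggregation).

```python
from collections import Counter, defaultdict


def find_spelling_variants(key_counts):
    """Group PDF name keys that differ only by letter case.

    key_counts maps a key as it appears in the files (e.g. "/Subtype")
    to the number of times it was observed in the corpus.
    """
    groups = defaultdict(dict)
    for key, count in key_counts.items():
        # Keys that are identical after case folding land in the same group.
        groups[key.lower()][key] = count
    # Keep only groups with more than one distinct spelling.
    return {
        folded: variants
        for folded, variants in groups.items()
        if len(variants) > 1
    }


# Hypothetical counts, invented for illustration only.
key_counts = Counter({
    "/Subtype": 120000,
    "/SubType": 37,
    "/Type": 250000,
    "/Length": 90000,
})

variants = find_spelling_variants(key_counts)
for folded, spellings in variants.items():
    print(folded, spellings)
```

A rare spelling that sits next to a very common one in the same group is a candidate for a producer bug or a parser leniency worth investigating.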