Making more sense of PDF structures in the wild at scale

This is a follow-on talk to our 2021 PDF Days presentation on the File Observatory. Our team built the File Observatory to support the Defense Advanced Research Projects Agency (DARPA) SafeDocs program by enabling parser developers to understand features of PDFs in the wild at scale.

In the first part of our presentation, we’ll offer an overview of the capabilities of the observatory, from gathering files, to running numerous parsers on the files, to searching and analyzing the features extracted by the parsers. In the second part, we’ll detail progress on building and packaging the “observatory in a box” for transition. In the third part, we’ll present some of the findings from an analysis of roughly 8 million PDFs from Common Crawl. This part will include an analysis of parser warnings, exceptions, and errors on the set of files, as well as statistical summaries of PDF features, including versions, languages, creator and producer tools, and more interesting syntactic features.