Interview with Tim Allison, NASA’s Jet Propulsion Laboratory, about PDF Days Online 2021
Excerpt: Tim Allison of NASA’s Jet Propulsion Laboratory talks about his PDF Days Online 2021 presentation “Making sense of PDF structures in the wild at scale”.
About the author: Tim has been working in content/metadata extraction (and evaluation), advanced search and relevance tuning for nearly 20 years. Tim is the founder of Rhapsode Consulting LLC, and he currently works …
PDF Association: At the PDF Days Online 2021, you will be giving a presentation titled “Making sense of PDF structures in the wild at scale” – what’s that about?
Tim Allison: Our team has been supporting secure parser developers working on the Defense Advanced Research Projects Agency’s (DARPA) SafeDocs program. As part of that project, our team has built a search and discovery system using open source tools to allow parser developers and specification writers to analyze and find patterns in features extracted from millions of PDFs. These patterns include, for example, correlations between structural elements and creator tools, along with many other critical features of PDFs at scale.
PDF Association: Who is your presentation aimed at?
Tim Allison: Anyone interested in making sense of PDFs as they are generated in the wild. As mentioned above, this includes anyone developing PDF processing software or writing specifications to improve such software. Our primary corpus derives from Common Crawl (https://commoncrawl.org/), and it offers a reasonable view into PDFs on the web.
PDF Association: What will the people who attend your presentation be able to take away from it?
Tim Allison: Lessons learned on methods for scaling both the gathering of millions of PDFs and the extraction of features from them, and the utility of analyzing PDFs at scale for anyone involved in PDF processing.
PDF Association: The PDF Days Online 2021 has become the leading PDF event. What makes the PDF Days so unique in your mind?
Tim Allison: The opportunity to meet so many key industry stakeholders and to learn from the people developing the future of PDF.
PDF Association: Thank you! We look forward to seeing you at the PDF Days Online 2021.