SafeDocs’ latest research into securing PDF

Excerpt: The LangSec IEEE S&P Workshop this year includes several PDF-related presentations resulting from SafeDocs research, including a paper by CTO Peter Wyatt.

About the author: Peter Wyatt is the PDF Association’s CTO and an independent technology consultant with deep file format and parsing expertise, who is a developer and researcher actively working on PDF technologies …

May 16, 2021
by Peter Wyatt

As previously described here and here, the PDF Association is actively involved in the DARPA-funded “SafeDocs” fundamental research program. This program aims to expand the principles of language-theoretic security (or “LangSec”) to research and develop novel parser methodologies for provably ensuring safety in digital content. Each year the IEEE Symposium on Security & Privacy hosts a LangSec workshop as a dedicated forum for reporting and discussing advances from across the LangSec community.

This year’s Seventh LangSec IEEE S&P Workshop (May 27-28, 2021) includes a number of presentations resulting from SafeDocs research that relate directly to PDF technology:

“Demystifying PDF through a machine-readable definition” by the PDF Association’s CTO, Peter Wyatt. This work describes the “Arlington PDF Model” as the first open-access, comprehensive, specification-derived, machine-readable definition of all formally defined PDF objects and data integrity relationships. Arlington represents the bulk of the latest 1,000-page ISO PDF 2.0 specification and is a definition for the entire PDF document object model, establishing a state-of-the-art “ground truth” for all future PDF research efforts and implementers. Expressed as a set of text-based TSV files with 12 data fields, the Arlington PDF Model currently defines 514 different PDF objects with 3,551 keys and array elements, and uses 40 custom predicates to encode over 5,000 rules. The Arlington PDF Model has been successfully validated against alternate models, as well as a sizable corpus of extant data files, and has been widely shared within the SafeDocs research community as well as the PDF Association’s PDF Technical Working Group. The Arlington PDF Model has already highlighted various extant data malformations and triggered multiple changes to the PDF 2.0 specification to reflect the de facto specification, remove ambiguities, and correct errors;
“Building a File Observatory for Secure Parser Development” by Allison et al. from NASA Jet Propulsion Laboratory (a liaison member of the PDF Association), as follow-on work from their previous efforts in establishing both the stressful “Issue Tracker” corpus and prototype “PDF Observatory” discussed in this article and briefly demonstrated in this OctoberPDFest 2020 presentation;
“Accessible Formal Methods for Verified Parser Development” by Li et al. from BAE Systems;
“Looking for Non-Compliant Documents Using Error Messages from Multiple Parsers” by Michael Robinson from American University;
“RL-GRIT: Reinforcement Learning for Grammar Inference” by Walt Woods from Galois Inc.
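
To give a feel for what a TSV-based, machine-readable object definition of the kind described for the Arlington PDF Model makes possible, here is a minimal sketch in Python. It is an illustration only: the three column names (`Key`, `Type`, `Required`), the sample rows, and the helper functions `load_definition` and `check_object` are hypothetical, and the real model uses 12 data fields per row plus custom predicates that this sketch does not attempt to reproduce.

```python
import csv
import io

# Hypothetical, trimmed-down object definition for illustration only.
# The real Arlington PDF Model uses 12 data fields per row, not three.
SAMPLE_TSV = (
    "Key\tType\tRequired\n"
    "Type\tname\tTRUE\n"
    "Pages\tdictionary\tTRUE\n"
    "Lang\tstring\tFALSE\n"
)


def load_definition(tsv_text):
    """Parse TSV text into a mapping of key name -> row dictionary."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["Key"]: row for row in reader}


def check_object(definition, obj):
    """Return a list of problems found when comparing obj to the definition."""
    problems = []
    # Every key marked as required must be present in the object.
    for key, row in definition.items():
        if row["Required"] == "TRUE" and key not in obj:
            problems.append(f"missing required key: {key}")
    # Keys that the definition does not know about are reported as well.
    for key in obj:
        if key not in definition:
            problems.append(f"unknown key: {key}")
    return problems


definition = load_definition(SAMPLE_TSV)
print(check_object(definition, {"Type": "Catalog", "Lang": "en-US", "Foo": 1}))
# prints ['missing required key: Pages', 'unknown key: Foo']
```

Because the definition is plain tab-separated text, the same rows could be consumed by any tool or language with a TSV reader, which is one plausible reason such a format suits a shared "ground truth".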

Abstracts for all LangSec 2021 papers are available from the LangSec 2021 Workshop Papers & Slides web page, with many papers also available for free download. Slide decks and videos will be added soon.

This year’s LangSec Workshop will again be a virtual two-day event and is now open for anyone to register and attend. If you want to understand some of these leading-edge advances towards provably securing PDF, then please consider attending.

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0079. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). Approved for public release.