Digitizing permanent records: the case for PDF/A-4
Excerpt: PDF/A-4 is essential to losslessly archiving PDF files that use current-generation PDF 2.0 technology… even including scanned documents. From modern Unicode support to interoperability with other specifications, PDF/A-4 is the only way to archive PDF files conforming to PDF 2.0.
About the author: Duff Johnson is a veteran of the electronic document technology marketplace. He has founded or led several software and services businesses in the electronic document industry since 1996.
- 1 Digitizing permanent records: the case for PDF/A-4
- 1.1 The rulemaking proposal
- 1.2 Concerns
- 1.2.1 Support for PDF 2.0 features
- 1.2.2 Support for modern Unicode
- 1.2.3 Superior technical disclosure
- 1.2.4 Interoperability with other current-generation specifications
- 1.2.5 Evolving PDF creation and processing technology is not reflected in old standards
- 1.3 Recommendation
The following article is derived from the PDF Association’s comment on a proposed rulemaking from NARA entitled “Federal Records Management: Digitizing Permanent Records and Reviewing Records Schedules” published on December 1, 2020 (85 FR 77095).
Our comment focuses on the proposed rulemaking’s exclusion of ISO 19005-4 (PDF/A-4) as an acceptable digital document format for digitizing permanent records.
The rulemaking proposal
The portion of the draft rulemaking of interest to the PDF technology community reads as follows:
…we propose to amend 36 CFR part 1236, Electronic Records Management, to add a new subpart establishing standards for digitizing permanent paper and photographic records, including paper and photographs contained in mixed-media records.
The proposed rulemaking identifies acceptable PDF technology for permanent records in § 1236.48 File format requirements as follows:
(a) You must digitize, encode, retain, and transfer most paper-based documents in one of the following file formats, either uncompressed or using one of the specified lossless compression codecs:
- PDF/A-1 (ZIP compression only)
- PDF/A-2 (ZIP or JPEG 2000 part 1 “lossless”)
Concerns
From the perspective of the PDF industry and its users, the draft rulemaking is problematic in that it permits nothing newer than 2008-era PDF technology. As such, this rulemaking will inhibit the uptake of current-generation technology and deny users the benefits of a more completely disclosed, modern and secure PDF ecosystem.
We identify specific concerns below.
Support for PDF 2.0 features
PDF 2.0 introduced new features to PDF and refined many existing features. In most cases the new PDF 2.0 features (e.g., page-level output intents) are not supported by either PDF/A-1 or PDF/A-2.
Software developers and end users will be discouraged from using PDF 2.0 features because files that use them cannot be archived without PDF/A-4. Alternatively, users who do take advantage of PDF 2.0 features will be forced to remove those features and downgrade their documents on conversion to PDF/A-1 or PDF/A-2, with possible data loss resulting.
Support for modern Unicode
Essential to textual semantics, Unicode has expanded dramatically since 2008. In that time the Unicode Consortium has published 13 official releases that have more than doubled the number of supported scripts by defining many new code points, including the name of the new Japanese era, emoji, and more.
Although the visual appearance of glyphs can be retained through use of the Unicode Private Use Area (PUA) or PDF Type3 fonts, without PDF/A-4 it is impossible to create indexable, searchable, reusable or accessible PDF/A documents that contain modern Unicode.
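The searchability problem can be illustrated with a short Python sketch. It uses the standard library's `unicodedata` module to find characters that sit in a Private Use Area, where a code point has no standard meaning that a search index or screen reader could rely on. The function name `private_use_positions` and the sample strings are assumptions made for this example and are not part of any PDF/A tooling.

```python
import unicodedata


def private_use_positions(text):
    """Return the indexes of characters that fall in a Unicode Private Use Area.

    General category "Co" covers the BMP Private Use Area (U+E000..U+F8FF)
    and the two supplementary private use planes.
    """
    return [i for i, ch in enumerate(text) if unicodedata.category(ch) == "Co"]


# Hypothetical text extracted from two PDFs that render the same glyphs.
standard = "\u4ee4\u548c"    # two kanji encoded with their standard code points
pua_mapped = "\ue000\ue001"  # the same glyphs mapped to arbitrary PUA code points

print(private_use_positions(standard))    # no private use characters found
print(private_use_positions(pua_mapped))  # every character is private use
```

Text made up of private use code points may look correct on the page, but a search for the standard characters will not match it.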
Superior technical disclosure
ISO 32000-2:2020 (PDF 2.0) is more fully disclosed than its predecessors, allowing superior and more reliable third-party (including open source) tooling for creation, processing and validation.
Limiting archival PDF documents to PDF/A-1 and PDF/A-2 will force archival PDFs to use older, less complete specifications. PDF/A-1, in particular, is based on Adobe’s proprietary PDF 1.4 specification from 2001, the use of which presents sustainability concerns.
Interoperability with other current-generation specifications
Interoperability is key to PDF’s value proposition, and PDF 2.0 is the foundation of all modern PDF technology. PDF/A-4 is explicitly designed to co-exist with PDF/X-6, PDF/VT-3, the forthcoming PDF/UA-2 and other PDF subsets and ISO specifications based on PDF 2.0.
PDF is a container format; indeed, being self-contained is a major feature of the format’s value proposition. Because it is based on PDF 2.0, PDF/A-4 takes advantage of PDF 2.0’s systematically updated Normative References, which have themselves been updated since the early 2000s, including ICC, TrueType, OpenType, JPEG 2000 and others.
The proposed regulation will discourage industry support for advanced PDF technology (that is, anything newer than 2008’s PDF 1.7), and will tend to limit PDF applications to those based on PDF 1.7 in order to meet interoperability requirements. PDF 2.0 documents will require a lossy conversion back to PDF 1.4 or PDF 1.7 in order to accommodate this proposal’s restrictions.
Evolving PDF creation and processing technology is not reflected in old standards
Implementation limits (internal limitations to the range of integers, string length, etc.) in PDF 1.4 and PDF 1.7 are based on a single proprietary implementation that was dominant when these specifications were developed. PDF/A-4 does not include any such implementation limits, reflecting advances in technology and modern classes of documents.
PDF documents that exceed the implementation limits specified in PDF 1.4 and PDF 1.7 are impossible to reliably “convert back” to PDF/A-1 or PDF/A-2.
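As a rough illustration of what exceeding an implementation limit means, the following Python sketch checks a handful of values against limits of the kind the older specifications impose. The limit values shown (integer range, string length and name length) are quoted from memory of the PDF 1.7 implementation-limits annex and should be treated as assumptions to verify against the specification. The `Limits` class and the `violations` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Assumed implementation limits; verify against the relevant specification."""
    min_integer: int = -2**31
    max_integer: int = 2**31 - 1
    max_string_bytes: int = 32767
    max_name_bytes: int = 127


def violations(integers, strings, names, limits=Limits()):
    """Return a description of every value that exceeds the given limits."""
    found = []
    for value in integers:
        if not limits.min_integer <= value <= limits.max_integer:
            found.append(f"integer {value} out of range")
    for s in strings:
        if len(s) > limits.max_string_bytes:
            found.append(f"string of {len(s)} bytes too long")
    for n in names:
        if len(n) > limits.max_name_bytes:
            found.append(f"name of {len(n)} bytes too long")
    return found


# A hypothetical PDF 2.0 document containing one oversized string.
problems = violations(
    integers=[42, 2**31 - 1],
    strings=[b"x" * 40000],
    names=[b"Type"],
)
print(problems)
```

Any value flagged this way would have to be altered, split or dropped to fit the older limits, which is where a conversion back to PDF/A-1 or PDF/A-2 becomes unreliable.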
Recommendation
The PDF Association has recommended to NARA that the proposed regulation be modified to include ISO 19005-4:2020 (PDF/A-4) in the set of acceptable archival formats.