Understanding UTF-8 in PDF 2.0
Excerpt: PDF Association’s CTO, Peter Wyatt, discusses text strings in PDF.
About the author: Peter Wyatt is the PDF Association’s CTO and an independent technology consultant with deep file format and parsing expertise, who is a developer and researcher actively working on PDF technologies …
In PDF, “text strings” are a formal subtype of strings, as illustrated in Figure 7 from ISO 32000-2.
Text strings in PDF are intended for character strings that could be presented to a human, such as in a graphical user interface or in the output from command-line utilities. Because modern PDF text strings support Unicode they can reliably represent any character, symbol or pictograph from any language or symbol set supported by Unicode.
Unicode describes itself as “… a character coding system designed to support the worldwide interchange, processing, and display of the written texts of the diverse languages and technical disciplines of the modern world. In addition, it supports classical and historical texts of many written languages.”
A Brief History
PDF 1.0 originally defined only PDFDocEncoding as the default text encoding used throughout PDF for “outline entries, text annotations, and strings in the Info dictionary” (Adobe PDF 1.0, p.185). With the introduction of PDF 1.2 in November 1996, Adobe added support for Unicode using the UTF-16BE encoding, although the PDF 1.2 reference does not refer to the specific encoding as “UTF” (the UTF-16 terminology was only introduced a few months before PDF 1.2 shipped).
PDFDocEncoding is a predefined text encoding unique to PDF. It supports a superset of the ISO Latin 1 character set which happens, as Adobe’s PDF Reference 1.2 puts it, to be “compatible with Unicode in that all Unicode codes less than 256 match PDFDocEncoding” (Adobe PDF 1.2, p.47).
In 2017, PDF 2.0 introduced UTF-8 encoded strings as an additional format for PDF text strings, while maintaining full backward-compatible support for the existing UTF-16BE and PDFDocEncoded text string definitions. In the years since PDF 1.7 was first published in 2006, UTF-8 had become the lingua franca of the web, operating systems, and many programming languages. Accordingly, adding UTF-8 support to PDF 2.0 aligned the ubiquitous file format with this trend and reduced the burden on conversion technologies to translate between Unicode encodings.
Because a Unicode byte order mark (BOM) is required at the start of every Unicode PDF text string, a PDF implementation can easily and unambiguously identify the encoding of each text string. The 3-byte BOM for PDF UTF-8 text strings is 239, 187, 191 in decimal (357, 273, 277 in octal; EF, BB, BF in hexadecimal). Note also that PDF does not require all text strings in a file to use the same encoding, so a single PDF file may contain text strings in all three encodings. As described in the PDF specification and elsewhere, the selected byte values of the BOMs are “… unlikely to be a meaningful beginning of a word or phrase”.
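The BOM check described above can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the PDF specification: the function names `detect_text_string_encoding` and `decode_text_string` are hypothetical, the UTF-16BE BOM bytes (FE FF) come from the Unicode standard rather than from the text above, and the PDFDocEncoding fallback is deliberately simplified to printable ASCII because Python ships no PDFDocEncoding codec.

```python
import codecs

# BOM byte values; the UTF-8 BOM matches the bytes quoted in the article.
UTF8_BOM = codecs.BOM_UTF8          # EF BB BF
UTF16BE_BOM = codecs.BOM_UTF16_BE   # FE FF


def detect_text_string_encoding(raw: bytes) -> str:
    """Classify the bytes of a PDF text string by their leading BOM."""
    if raw.startswith(UTF16BE_BOM):
        return "UTF-16BE"
    if raw.startswith(UTF8_BOM):
        return "UTF-8"
    # No BOM: the string is treated as PDFDocEncoding.
    return "PDFDocEncoding"


def decode_text_string(raw: bytes) -> str:
    """Decode a PDF text string to a Python str (hypothetical helper)."""
    encoding = detect_text_string_encoding(raw)
    if encoding == "UTF-16BE":
        return raw[len(UTF16BE_BOM):].decode("utf-16-be")
    if encoding == "UTF-8":
        return raw[len(UTF8_BOM):].decode("utf-8")
    # Simplifying assumption: only the printable ASCII range is handled
    # here. A real implementation needs a full PDFDocEncoding table.
    return raw.decode("ascii")


if __name__ == "__main__":
    samples = [
        b"\xef\xbb\xbfH\xc3\xa9llo",   # UTF-8 with BOM
        b"\xfe\xff\x00H\x00i",         # UTF-16BE with BOM
        b"Title",                      # no BOM, PDFDocEncoding
    ]
    for sample in samples:
        print(detect_text_string_encoding(sample), decode_text_string(sample))
```

Because the decision is made per string, the same routine handles a file that mixes all three encodings. Language escape sequences inside Unicode text strings are not handled by this sketch.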
Furthermore, both kinds of PDF Unicode text strings (UTF-8 and UTF-16BE) also support internal language escape sequences using a 2-byte BCP 47 language tag with an optional 2-byte ISO 3166 country code allowing the languages of Unicode text strings to be unambiguously specified. For example, the single literal PDF text string (\033enUSHello\033esMXHola) identifies “Hello” as US English and “Hola” as Mexican Spanish with