Emails for eternity
Excerpt: callas software’s Dietrich von Seggern explains how PDF/A can be used to archive emails and their attachments.
About the author: Dietrich von Seggern received his degree as a printing engineer, and in 1991 started his professional career as head of desktop prepress production in a reproduction house. He became involved in …
According to a recent survey conducted on behalf of the digital association Bitkom, a professional mailbox in Germany receives an average of 26 e-mails per day. Processing them takes up a large part of working time. E-mails are also an integral part of business processes. Some must be retained under tax law, such as purchase orders and invoices, as must any documents that may be relevant to a business transaction. In addition, electronic messages often contain valuable knowledge worth preserving. But how can e-mails be archived elegantly? To date, there is no single best solution. However, for a number of reasons, the PDF route currently seems to be the most practical.
The good news is that e-mails are digital per se and already contain metadata. This makes it fundamentally easier to archive them than paper-based communications. However, in many cases there are no company guidelines in this regard, so users decide individually how to handle their e-mails. As a result, there is a high risk that business-relevant messages are lost. E-mails are handled by various specialized systems that enable the creation, transport, viewing and storage of these electronic messages (lifecycle: client, server, relay, archiving system). To understand what archiving e-mails involves, we first need to take a closer look at what an e-mail consists of: a header, a body and, optionally, attachments.
The header is basically the equivalent of the letterhead. It contains the sender and recipient information, the creation date and some optional information, such as the subject, in the form of metadata. Often, an ID is also included here to help the e-mail client associate the message with other e-mails when a conversation consists of replies and forwards. In order to properly assess e-mails and the reliability of header information, it is important to understand that the actual routing is independent of the header data and takes place via the Simple Mail Transfer Protocol (SMTP). SMTP acts as the envelope, so to speak, and controls the routing of the electronic message. The e-mail client sends the recipient’s address to the mail server as part of the SMTP dialogue, and it is this envelope address that determines the routing. The content of the e-mail, including the header, is transmitted alongside it as payload.
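To make the difference between header and envelope tangible, here is a minimal sketch using Python's standard `email` package. The message, the addresses and the `envelope` dictionary are made up for illustration; in a real system the envelope would come from the SMTP dialogue or from server logs, not from the e-mail file itself.

```python
from email import message_from_string
from email.utils import parseaddr

# Invented sample message; all addresses and IDs are placeholders.
RAW = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Subject: Re: Invoice 4711
Date: Mon, 05 Jul 2021 09:30:00 +0200
Message-ID: <reply-1@example.com>
In-Reply-To: <orig-1@example.com>

Thanks, the invoice has been paid.
"""


def header_summary(raw):
    """Read the 'letterhead' information from the header block."""
    msg = message_from_string(raw)
    return {
        "from": parseaddr(msg["From"])[1],
        "to": parseaddr(msg["To"])[1],
        "subject": msg["Subject"],
        "message_id": msg["Message-ID"],
        "in_reply_to": msg["In-Reply-To"],
    }


# Hypothetical SMTP envelope. It is handed over separately from the message
# text (smtplib.SMTP.sendmail, for example, takes from_addr and to_addrs as
# their own arguments) and alone decides where the message is delivered.
envelope = {
    "mail_from": "alice@example.com",
    "rcpt_to": ["bob@example.com", "archive@example.com"],
}

summary = header_summary(RAW)
# Recipients that received the message although the header never names them.
not_in_header = [r for r in envelope["rcpt_to"] if r != summary["to"]]
print(summary)
print(not_in_header)
```

The second recipient appears only in the envelope, which is why header data alone is not a complete record of who actually received a message.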
The body, i.e. the actual mail content, is displayed differently depending on the user-defined settings in the e-mail software. The options are plain text (ASCII) without umlauts, simply formatted text (such as bold or italics) with support for country-specific encodings (umlauts), and extensive HTML formatting with embedded images and so on. An e-mail file can contain several of these variants at the same time, and there is no guarantee that their content matches: the variants can easily carry different text. Often, for example, the ASCII text part only contains a note that an HTML-capable e-mail client is required for display. This is a crucial aspect for possible format conversions during archiving.
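The following sketch, again based on Python's standard `email` package, builds a message whose plain-text and HTML variants deliberately disagree and then picks the richer one. The subject and texts are invented, and the preference order "HTML before plain text" is an assumption for this example, not a rule from any standard.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build an invented message with two body variants that do NOT match.
msg = EmailMessage()
msg["Subject"] = "Quarterly report"
msg.set_content("This message requires an HTML-capable e-mail client.")
msg.add_alternative(
    "<html><body><p>Revenue rose by <b>12 %</b>.</p></body></html>",
    subtype="html",
)

# Round-trip through bytes, as if the message had been read from an EML file.
parsed = message_from_bytes(msg.as_bytes(), policy=policy.default)


def list_variants(message):
    """Return the content types of all non-container parts."""
    return [p.get_content_type() for p in message.walk() if not p.is_multipart()]


def richest_body(message):
    """Pick the body variant to convert, preferring HTML (assumed order)."""
    part = message.get_body(preferencelist=("html", "plain"))
    return part.get_content_type(), part.get_content()


variants = list_variants(parsed)
ctype, body = richest_body(parsed)
print(variants)
print(ctype)
```

An archiving workflow that converted only the plain-text part of this message would preserve nothing but the note about HTML-capable clients.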
The third, optional part consists of attachments. This is where the endless field of file formats, feared by every archivist, opens up: attachments are often documents or images, possibly combined in a ZIP file, but exotic file formats, executable programs or scripts can also be included.
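Before deciding how to archive attachments, it helps to know what is actually there. The sketch below lists the attachments of a message with Python's standard `email` package. The sample files and the `EXECUTABLE_EXTENSIONS` set are invented for illustration and are by no means a complete list of risky formats.

```python
import os
from email import message_from_bytes, policy
from email.message import EmailMessage

# Illustrative only: a real policy would need a far longer list.
EXECUTABLE_EXTENSIONS = {".exe", ".bat", ".js", ".vbs"}


def build_sample():
    """Create an invented message with three dummy attachments."""
    msg = EmailMessage()
    msg["Subject"] = "Documents"
    msg.set_content("See attachments.")
    msg.add_attachment(b"%PDF-1.7 dummy", maintype="application",
                       subtype="pdf", filename="invoice.pdf")
    msg.add_attachment(b"PK\x03\x04 dummy", maintype="application",
                       subtype="zip", filename="scans.zip")
    msg.add_attachment(b"MZ dummy", maintype="application",
                       subtype="octet-stream", filename="setup.exe")
    return message_from_bytes(msg.as_bytes(), policy=policy.default)


def inventory(message):
    """List name, declared type, size and an 'executable' flag per attachment."""
    rows = []
    for part in message.iter_attachments():
        name = part.get_filename() or "(unnamed)"
        extension = os.path.splitext(name)[1].lower()
        rows.append({
            "name": name,
            "type": part.get_content_type(),
            "size": len(part.get_content()),
            "executable": extension in EXECUTABLE_EXTENSIONS,
        })
    return rows


rows = inventory(build_sample())
for row in rows:
    print(row)
```

Note that the declared content type comes from the sender and is not verified here; a production workflow would also want to inspect the file content itself.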
As already described, e-mail is transported via SMTP: from the sender’s client to the sender’s server, then via mail relays to the recipient’s server, and from there to the recipient’s client. Since e-mails are often sent in “conversations” as replies and the complete history is not always included, it would be ideal to archive the entire mail system in order to be able to fully trace the e-mail communication, with all its steps, later on. In practice, this is hardly feasible. Alternatively, it would be good if at least the receiving or sending mailbox could be archived completely, including all references between the e-mails. To date, however, there is no standardized, interoperable approach to this, although there are some interesting initiatives and approaches (e.g. a report recently produced by the University of Illinois with the support of the PDF Association).
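What "references between the e-mails" means in practice can be sketched with the `Message-ID` and `In-Reply-To` header fields. The example below is a simplified illustration with invented IDs, not a full threading algorithm: it follows the reply chain upwards and reports when the chain leads to a message that is not in the archived mailbox.

```python
from email import message_from_string


def make_raw(message_id, in_reply_to=None):
    """Build a tiny invented message; only the threading headers matter here."""
    lines = ["Message-ID: <%s>" % message_id]
    if in_reply_to:
        lines.append("In-Reply-To: <%s>" % in_reply_to)
    return "\n".join(lines) + "\n\nbody\n"


def parent_map(raw_messages):
    """Map each archived Message-ID to the ID it replies to (or None)."""
    parents = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        parent = (msg["In-Reply-To"] or "").strip() or None
        parents[msg["Message-ID"].strip()] = parent
    return parents


def thread_root(message_id, parents):
    """Walk up the reply chain; stop at the top or at an unknown message."""
    seen = set()
    while parents.get(message_id) and message_id not in seen:
        seen.add(message_id)
        message_id = parents[message_id]
    return message_id


mailbox = [
    make_raw("a@example.com"),
    make_raw("b@example.com", in_reply_to="a@example.com"),
    make_raw("c@example.com", in_reply_to="b@example.com"),
    make_raw("d@example.com", in_reply_to="x@example.com"),  # parent not archived
]
parents = parent_map(mailbox)

root_of_c = thread_root("<c@example.com>", parents)
root_of_d = thread_root("<d@example.com>", parents)
print(root_of_c, root_of_c in parents)  # root is part of the archive
print(root_of_d, root_of_d in parents)  # root is missing from the archive
```

The last message shows the gap described above: its conversation starts with an e-mail that a single archived mailbox does not contain.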
Furthermore, such an approach is problematic because the Microsoft technology most commonly used in business processes relies on its own proprietary format (MSG). Although this format is documented, it is subject to frequent changes. Sometimes the programs do not even insert content into the body of the e-mail, but send it as a “Winmail.dat” attachment, which can only be interpreted and displayed by suitably prepared clients on the recipient side. For these reasons alone, it seems essential to convert e-mails into a standard format suitable for archiving. The problem becomes even more daunting when attachments are taken into consideration, because any file format imaginable can turn up there. It is therefore impossible to guarantee that an application capable of displaying the attachments will still be available in years, or even decades. This is one of the reasons why PDF/A was developed and became established so quickly.
PDF/A for secure archiving
To break free from this dependency, system-independent archiving of all e-mails and attachments in PDF/A is recommended. The format has long been established for general archiving purposes. Recently, the PDF/A-4f conformance level has become available as the successor to PDF/A-3; both allow arbitrary files to be embedded. On this basis, at least the question of format for e-mail archiving can be answered satisfactorily. Most e-mail systems offer an export function to PDF. Unfortunately, this approach often falls short, because usually only the e-mail body is taken into account and not the header or any attachments.
If e-mails are to be archived in PDF in their entirety, the header data should be saved as XMP metadata in the PDF file. This metadata can then be used as the basis for a targeted search for e-mails. The e-mail body is ideally converted on the basis of the body variant (plain ASCII, formatted text, HTML) that most comprehensively reflects the content. Links or referenced images in HTML must then also be integrated. The greatest flexibility in the use of archived e-mails is available if the original e-mail file in EML or MSG format and the attachments are also embedded in the PDF, which is possible with PDF/A-3 or PDF/A-4f. Experience has shown, however, that embedding is not the only reason why e-mails archived as PDF/A are almost always larger than the original files. Another factor is that the PDF/A standard requires the embedding of fonts and ICC color profiles in order to ensure the reproducibility of e-mails over the years. On the other hand, file size can be reduced via the compression methods built into PDF, an option that does not exist in “e-mail formats”.
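As a rough illustration of the first step, the sketch below turns a few header fields into an XMP-style RDF/XML fragment using only Python's standard library. The mapping chosen here (Subject to `dc:title`, From to `dc:creator`, Date to `xmp:CreateDate`) is an assumption made for this example, and the sample message is invented. Wrapping the fragment in an XMP packet, embedding it in a PDF and meeting the PDF/A requirements are not shown and would need a dedicated PDF tool.

```python
import xml.etree.ElementTree as ET
from email import message_from_string
from email.utils import parsedate_to_datetime

NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "xmp": "http://ns.adobe.com/xap/1.0/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)

# Invented sample message; all values are placeholders.
RAW = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Subject: Re: Invoice 4711
Date: Mon, 05 Jul 2021 09:30:00 +0200

Thanks, the invoice has been paid.
"""


def q(prefix, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (NS[prefix], tag)


def headers_to_xmp(raw):
    """Map Subject, From and Date to an XMP-style fragment (assumed mapping)."""
    msg = message_from_string(raw)
    root = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(root, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # dc:title is a language alternative, dc:creator an ordered list.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = msg["Subject"]
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    ET.SubElement(seq, q("rdf", "li")).text = msg["From"]

    # Convert the e-mail date into an ISO 8601 timestamp.
    created = parsedate_to_datetime(msg["Date"]).isoformat()
    ET.SubElement(desc, q("xmp", "CreateDate")).text = created
    return ET.tostring(root, encoding="unicode")


xmp_fragment = headers_to_xmp(RAW)
print(xmp_fragment)
```

Once header fields are stored in this structured form, an archive can search for sender, subject or date without having to open and interpret the original e-mail file.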
If you have made it this far in this rather long blog post, I would like to share some more helpful information. As callas is a PDF Association member, I will be presenting on this topic in a session titled ‘Archiving email – as PDF?’ at the upcoming PDF Days Europe 2021, for which you can find the full agenda here. I will summarise, with practical demonstrations, exactly what needs to be done in order to include as much information as possible in e-mail archiving and to be able to retrieve and use it in the future. If you would like to take part in this event and receive a discounted ticket, available until the end of July 2021, please write to us at email@example.com.