PDF’s popularity online
Excerpt: According to CommonCrawl’s detected MIME type PDF is the 3rd most popular file-format on the web (behind HTML and XHTML); more popular than JPEG, PNG or GIF files.
About the author: Duff Johnson is a veteran of the electronic document technology marketplace. He has founded or led several software and services businesses in the electronic document industry since 1996.
We know that PDF is popular and ubiquitous worldwide. But unless you are Google, or are otherwise privy to telemetry from internet-scale PDF processing software, it’s really hard to know just how popular PDF is, whether measured in absolute terms, relative to other digital document formats, or in terms of relevance to end users.
But some data is available.
According to the detected MIME type as captured in the latest (July 2021) CommonCrawl database, PDF is the 3rd most popular file-format on the web (after HTML and XHTML); more popular than JPEG, PNG or GIF files.
There are other ways to understand PDF’s popularity. Google Trends allows us to monitor the relevance of specific word or phrase searches as measured against the proportion of all Google web-searches. This model presents a picture of PDF’s “mind-share” but only in the context of comparing searches for “PDF” against searches for, well, everything else.
Google Trends data implies substantial continuing end-user interest in “PDF” but it doesn’t help us with how often PDF is chosen over other file formats as a means of representing documents on the internet.
From 2011 to 2017 I was able to leverage Google’s Advanced Search facility (e.g., “filetype:pdf”) to keep some sort of an eye on a metric that’s at least relevant to this question: the proportion of PDF files (vs. other digital document formats) Google finds online. I reported these data a few times over the years.
Around 2017, however, Google changed its advanced search facility to block searches that did not include search strings, eliminating my methodology.
Happily, the strings denoting specific file-formats are relatively specific to those formats. On this basis, I decided to revive the original methodology but drop the filetype: parameter and simply count hits for the respective search-strings instead.
Counting hits for these search-strings doesn’t imply specific numbers of files but it does imply – at least at some level – the degree to which these various types of files are posted (and referenced) online. There are many unavoidable distortions in this type of analysis. As we re-run these counts over time, off-topic uses of these strings should become part of the baseline, allowing us to at least imagine that we’re looking at an approximation of actual deployment volumes and trends.
The following chart shows my previous results and new data from 2021 using the new methodology.
I selected the following strings:
- PPTX (modern presentation format)
- XLSX (modern spreadsheet format)
- DOCX (modern word-processor format)
- EPUB (HTML-based document format)
I did not search for alternatives such as ODT and ODS due to (a) very small numbers and (b) significant other uses for these TLAs (Three Letter Acronyms).
Regardless of the noise in this data, the basic reality is clear: PDF continues to predominate among digital document formats.
For accessibility and related purposes the chart’s data is also provided in tabular form.
Searches were conducted using search strings for the following file-types: PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX. All searches were run on Mac OS / Chrome from Winchester, Massachusetts. Of course, your mileage may vary: I’ve noticed different results in different countries and, indeed, on different days of a given week.
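For readers who want to repeat the comparison with their own counts, here is a minimal Python sketch of how raw hit counts can be turned into each format's share of the total. The function name `format_shares` is hypothetical, and the numbers in `example_counts` are placeholders chosen for easy arithmetic. They are not the search results discussed in this article.

```python
def format_shares(hit_counts):
    """Return each format's share of all hits, as a percentage."""
    total = sum(hit_counts.values())
    if total == 0:
        raise ValueError("no hits recorded")
    return {fmt: 100.0 * count / total for fmt, count in hit_counts.items()}


# Placeholder values for illustration only; NOT measured search results.
example_counts = {
    "PDF": 800,
    "DOC": 50,
    "DOCX": 50,
    "PPT": 25,
    "PPTX": 25,
    "XLS": 25,
    "XLSX": 25,
}

shares = format_shares(example_counts)

# Print formats from largest share to smallest.
for fmt, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{fmt}: {pct:.1f}%")
```

Re-running the same calculation on counts gathered at different dates gives a rough view of the trend, subject to the distortions described above.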