Linking research and industry
Excerpt: To help both academic and industry researchers achieve high-quality, accurate PDF-oriented research outcomes, the PDF Association is now offering a new free peer-review service.
About the author: Peter Wyatt is the PDF Association’s CTO and an independent technology consultant with deep file format and parsing expertise, who is a developer and researcher actively working on PDF technologies …
PDF technology per se is, of course, not an academic domain. Nonetheless, every year many universities and research organizations publish papers and present research that focuses on the format, drawn from a diverse range of domains that use PDF. Topics range from the more obvious (software engineering, cyber-security, accessibility, data mining, archival studies, and document understanding) to specialized areas such as health informatics, medicine, and education.
A search using Google Scholar, the digital libraries of technical societies (such as IEEE, ACM or TAGA), or an open publishing index (such as arXiv, Semantic Scholar, or ResearchGate) will locate many publications that explicitly mention PDF. Note that advanced searching is often required to exclude alternate expansions of the PDF acronym, such as probability density function, pair distribution function, post-dialysis fatigue, primary dysfunction, etc., and to avoid the generic text about downloading publications in PDF format that commonly appears in search results.
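As a rough illustration of the kind of filtering this involves, the following Python sketch post-filters titles or abstracts that have already been retrieved from a search. It is only a heuristic under stated assumptions: the exclusion phrases are the alternate meanings listed above, the download-text pattern and the function name `is_about_pdf_format` are hypothetical, and no particular search engine's query syntax or API is implied.

```python
import re

# Alternate expansions of "PDF" named in the text above (assumed exclusion list).
ALTERNATE_MEANINGS = [
    "probability density function",
    "pair distribution function",
    "post-dialysis fatigue",
    "primary dysfunction",
]

# Hypothetical pattern for generic download text such as "Download PDF"
# or "View in PDF", which mentions PDF without being about the format.
DOWNLOAD_TEXT = re.compile(
    r"\b(download|view|full[- ]text)\s+(as\s+|in\s+)?pdf\b",
    re.IGNORECASE,
)

# "PDF" as a standalone upper-case token.
PDF_TOKEN = re.compile(r"\bPDF\b")


def is_about_pdf_format(text: str) -> bool:
    """Heuristically decide whether a title or abstract concerns the PDF file format."""
    lowered = text.lower()
    # Reject records that spell out one of the alternate meanings.
    if any(phrase in lowered for phrase in ALTERNATE_MEANINGS):
        return False
    # Strip generic download text, then require a remaining mention of "PDF".
    remaining = DOWNLOAD_TEXT.sub("", text)
    return bool(PDF_TOKEN.search(remaining))


if __name__ == "__main__":
    samples = [
        "PAWLS: PDF Annotation With Labels and Structure",
        "Estimating the probability density function (PDF) of wind speed",
        "A study of reading habits. Download PDF",
    ]
    for sample in samples:
        print(is_about_pdf_format(sample), "-", sample)
```

A filter like this will misclassify some records, for example a paper about the PDF format that also discusses probability density functions, so the results still need manual review.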
A few recently published papers illustrate the diversity of PDF-based research, judging solely by their titles (this is not an endorsement of any specific research publication or institution!):
- “Searching the PDF Haystack: Automated Knowledge Discovery in Scanned EHR Documents” (EHR = electronic health records)
- “PAWLS: PDF Annotation With Labels and Structure”
- “PDF2LaTeX: A Deep Learning System to Convert Mathematical Documents from PDF to LaTeX”
- “Maintaining interoperability in open source software: A case study of the Apache PDFBox project”
- “Exploitation and Sanitization of Hidden Data in PDF Files: Do Security Agencies Sanitize Their PDF files?”
- “A Novel Adversarial Example Detection Method for Malicious PDFs Using Multiple Mutated Classifiers”
- “The Portable Document Format: An Analysis of PDF Accessibility”
- “PDF/X-4 Today and For Tomorrow”
In addition to individual papers across diverse domains, dedicated conferences such as the ACM Symposium on Document Engineering (commonly referred to as “DocEng”) often include many research topics directly and indirectly related to PDF.
Most research publications aim to make unique contributions within their primary domain. In some cases, however, the research is weakened by a lack of PDF knowledge and expertise. This is understandable, especially when the research is conducted by academics who do not have deep experience with PDF. It can result in papers with shortcomings such as:
- misunderstandings about PDF lexical rules, syntax and features;
- referencing out-of-date PDF specifications;
- relying on incorrect information from previously published work;
- being unaware of specialized PDF publications;
- using old or incomplete implementations;
- limitations in the design and selection of PDF-based corpora, and
- confusion between PDF as a file format specification and behavior of specific implementations.
As a consequence, conclusions and proposed areas for future research are often weakened. But this is precisely where PDF experts, such as PDF Association members, can “cross-pollinate” and help researchers create better and more relevant research outcomes for the benefit of everyone.
Potential benefits to industry include:
- New sources of corpora, possibly including established “ground truth” or targeted to a particular problem. The PDF Association has already established the pdf-corpora GitHub repo that lists many such corpora and will gladly accept more;
- Identification of previously unrecognized problems or market needs. Although this may not always be explicitly identified in formal conclusions, many papers discuss areas of potential future research and problems encountered during the research process;
- New approaches to solve existing industry problems;
- New tools, features, applications or product opportunities that might be useful to industry;
- Potential future skilled employees!
Potential benefits to academia include:
- Access to PDF experience and expertise that complements and strengthens the academic domain expertise;
- Ensuring that research methodology and related tooling reflect the current state of the art in the PDF industry;
- Ensuring that research refers to the latest and most relevant PDF or PDF subset specifications, including acknowledged errata;
- Review by PDF experts to improve research outcomes and ensure research work is technically accurate and relevant;
- Directing future research towards real-world problems faced by industry and establishing ongoing relationships with industrial partners, and
- Potential partnership opportunities and employment opportunities for new graduates.
A new free peer-review service
To help both academic and industry researchers achieve high-quality, accurate PDF-oriented research outcomes, the PDF Association is now making available a new free peer-review service. To use it, email email@example.com. The service will link acknowledged experts in the PDF file format with journal editors, academic publishers, conference steering committees and researchers to provide expert peer review of pre-print articles, whitepapers and presentations, specifically with respect to the statements they make about PDF.
I encourage all PDF Association Members who are technically minded to consider becoming more aware of the large body of academic knowledge that utilizes PDF. You never know what new ideas or new approaches you might find!
With the transition to online events, attending academic conferences virtually is not as costly, or as potentially intimidating, as attending in person used to be. Because detailed conference programs and abstracts are published in advance, you can also selectively attend just the presentations of real interest. I also encourage those in industry to seek out academic papers and presentations in domains related to their business, and to reach out to their local research institutions.
As the PDF Association’s CTO, I am always on the lookout for new and interesting PDF-based research contributions, and for indications of where future directions might take the PDF industry, so please get in touch.