Document understanding is a long-standing research topic that has moved to the forefront in recent years with advances in Deep Learning and NLP. The PDF format is by nature unstructured, so extracting and qualifying information from such documents requires sophisticated processing.
In this presentation, we will discuss three ways to address the challenges raised by PDF (understanding layout and text, and recovering the hierarchy and relationships between the different structures):
- Layout analysis,
- Textual content key-value association,
- Natural language processing.
We will then discuss the many applications of these technologies, including OCR, automatic indexing, tagging and labeling, structured layout conversion, and automatic redaction.