Intuitive data extraction from documents: easier than ever with iText pdf2Data 3.1

Excerpt: We are excited to announce the public release of iText pdf2Data 3.1, our document solution for intelligent data extraction.


About the author:
Member News

March 10, 2022

Introduction

We are excited to announce the release of iText pdf2Data 3.1, the latest version of our template-based data extraction solution. iText pdf2Data intelligently recognizes data inside structured and semi-structured PDF documents and extracts it in a structured format.

It enables you to define areas and rules in a template that correspond to the content you want to extract from similar documents. You simply create a template from a sample document with the user-friendly Template Editor and verify that the data is recognized and extracted correctly. From then on, all subsequent documents can be processed automatically by the pdf2Data SDK.

Intuitive data extraction

In recent years, extracting information from business documents to enable end-to-end business process automation has become increasingly important. Intelligent Document Processing (IDP) is a set of technologies for processing documents intelligently, helping businesses extract and store data as simply and efficiently as possible.

PDF is widely used to share and exchange business data, particularly for invoices and other commercial documents. In today’s business world it is a common requirement to be able to access and extract the data contained within such documents. However, getting this data in a usable format can prove challenging. If you’ve ever tried copying a table from a PDF into a spreadsheet, then you’ll recognize how frustrating it can be.

Traditionally, accessing such data would require someone to transfer data from documents manually. Of course, this takes a lot of time and resources, with the risk of input errors or security issues to consider. What if you could automate this process in a reliable and secure way?

Enter iText pdf2Data. Similar to our document generation solution iText DITO, iText pdf2Data allows anyone, not just developers, to leverage iText’s powerful PDF capabilities. Because it extracts data from documents in a structured way, that data can easily be repurposed for analysis, reports, or whatever you want.

Once the iText pdf2Data components have been deployed and integrated into an automated document workflow, it’s simple to create or refine document templates to recognize and automatically extract data, which can then be easily reused by whoever needs it.

Advantages of template-based extraction

A number of IDP solutions use artificial intelligence (AI) technologies such as machine learning (ML) and natural language processing (NLP) to classify and extract data. For reliable results though, extensive training and large data sets are required to learn about the documents to be processed, and documents with content in different languages can be a struggle.

On the other hand, template-based solutions can offer significant benefits over AI-based alternatives. With iText pdf2Data you can begin extracting data with a template created from a single example document. It’s also easy to modify or adapt an existing template for new document types, and it offers excellent built-in language support.

In addition, while AI is particularly useful for handling less structured documents such as emails, for other types of documents it can be like using a sledgehammer to crack a nut. For example, structured documents (official forms, passports, ID cards, etc.) and semi-structured documents (invoices, bank statements, etc.) can be handled more efficiently with a rules-based approach.
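To make the rules-based idea concrete, here is a minimal Python sketch. It does not use the pdf2Data SDK or its template format: the template dictionary, the extract_fields helper, and the sample invoice text are all hypothetical, and the rules are plain regular expressions applied to text that is assumed to have been extracted from a PDF already.

```python
import re

# Hypothetical template: field name -> regex rule with one capture group.
# This is NOT the pdf2Data template format, only an illustration of the idea.
INVOICE_TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*:\s*(\S+)",
    "invoice_date": r"Date\s*:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total\s*:\s*([\d.,]+)",
}


def extract_fields(text, template):
    """Apply each rule to the text; fields with no match map to None."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result


# Made-up sample of text as it might come out of a PDF text-extraction step.
SAMPLE_TEXT = """ACME Corp
Invoice No: INV-0042
Date: 2022-03-10
Total: 1,250.00 EUR
"""

fields = extract_fields(SAMPLE_TEXT, INVOICE_TEMPLATE)
print(fields)
```

Because the rules live in data and not in code, the same function can process every document that shares the layout, and adapting to a new document type means editing the template only.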

Both approaches have their benefits depending on the situation, and we’re holding a webinar on March 24 to explore some real-life use cases and go into more detail about the advantages and disadvantages of each strategy. There will be two live sessions, and we’ll make the recording available to registrants afterwards in case you can’t make it.

What’s new?

First and foremost, iText pdf2Data is now offered as a standalone solution instead of an iText 7 add-on. Everything you need to begin automated document data extraction is included: the browser-based pdf2Data Editor to create and modify templates, and the pdf2Data engine, which parses documents and extracts the data with just a few lines of code. The SDK is available as a Java or .NET library for integration into your workflows, or alternatively can be used as a command-line application.

That’s not to say your workflows won’t benefit from also using iText 7 Core for pre- or post-processing tasks, or any of the add-ons available in the iText 7 Suite. For example, you could speed up mass-processing of documents by using pdfOptimizer to reduce file size. Alternatively, you might want pdfOCR to turn scanned documents and images into PDFs before the data extraction step. That’s entirely up to you though.

Updated Template Editor

We have made considerable improvements to the pdf2Data Editor to make creating and updating templates even easier. There’s an updated user interface incorporating significant user experience enhancements, including inline help for the data field selectors which define how your data is extracted.

We’ve also made deploying the pdf2Data Editor easier by providing it as a Docker container, alongside the standard Apache Tomcat deployment method. This means you only need Docker installed to deploy the pdf2Data Editor.

pdf2Data SDK/CLI

There are no significant changes on the SDK side for this release. It is still available natively as a Java or .NET (C#) library, or as a command-line version if you prefer, and it gives you the same great extraction results as before.

That doesn’t mean we’re resting on our laurels though. A ton of features and improvements have been added since the first version was released, and we have big plans for iText pdf2Data’s functionality and deployment options as we continue to expand our reach into the world of IDP. Keep watching this space!

Want to know more about iText pdf2Data?

If you’re not already an iText pdf2Data customer, we recommend exploring all its features and capabilities with a free 30-day online trial!

Alternatively, check out the product page for a more detailed overview of how iText pdf2Data works. You can also visit our Knowledge Base where we have tutorials and a breakdown of all available pdf2Data selectors, including tips on how to use them effectively.

Finally, don’t miss our live webinar on March 24 to learn more about document data extraction.
Register for the live webinar now: https://itextpdf.com/en/events/live-webinar-data-extraction-documents-template-based-or-ai-based


Original post: https://itextpdf.com/en/blog/itext-news-technical-notes/intuitive-data-extraction-documents-easier-ever-itext-pdf2data-31