Based on the classification of documents, there are two common data extraction methodologies: rule-based and model-based [1]. Rule-based data extraction targets structured documents, while model-based data extraction is used to process semi-structured and unstructured documents [1]. Neither method alone can handle the high variety of layouts that some document types have. A hybrid data extraction approach is therefore recommended: it combines the strengths of both methods and allows for more flexibility and accuracy [2][3]. A hybrid approach can use one or more extractors, such as the RegEx Based Extractor, Form Extractor, Intelligent Form Extractor, Machine Learning Extractor, or FlexiCapture Extractor, depending on the document type and the fields of interest [3]. The Data Extraction Scope activity in UiPath enables the configuration and execution of a hybrid data extraction methodology. It lets the user customize which fields are requested from each extractor, the minimum confidence threshold for a data point extracted by each extractor, the field-level taxonomy mapping between the project taxonomy and the extractor’s internal taxonomy (if any), and how “fall-back” rules for data extraction are implemented [2].
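The four settings described above (requested fields, minimum confidence, taxonomy mapping, and fall-back order) can be pictured with a small Python sketch. This is a conceptual illustration only, not UiPath's API: the Data Extraction Scope is configured in Studio, and every name here (`ExtractorConfig`, `extract`, the two toy extractors, the field names, and the thresholds) is a hypothetical stand-in.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# An extractor returns {internal_field_name: (value, confidence)}.
ExtractorOutput = Dict[str, Tuple[str, float]]


@dataclass
class ExtractorConfig:
    """Hypothetical per-extractor settings, mirroring the four options above."""
    name: str
    run: Callable[[str], ExtractorOutput]
    min_confidence: float
    # Taxonomy mapping: project field name -> extractor's internal field name.
    # Only the fields listed here are requested from this extractor.
    field_map: Dict[str, str]


def regex_extractor(text: str) -> ExtractorOutput:
    """Rule-based toy extractor: a pattern match is treated as fully confident."""
    match = re.search(r"INV-\d+", text)
    return {"invoice_no": (match.group(0), 1.0)} if match else {}


def mock_ml_extractor(text: str) -> ExtractorOutput:
    """Stand-in for a model-based extractor; it returns fixed, made-up values."""
    return {"InvoiceNumber": ("INV-0000", 0.55), "TotalAmount": ("120.50", 0.91)}


def extract(text: str, configs: List[ExtractorConfig]) -> Dict[str, str]:
    """Try extractors in order; the first value meeting its threshold wins."""
    results: Dict[str, str] = {}
    for cfg in configs:
        output = cfg.run(text)
        for project_field, internal_field in cfg.field_map.items():
            if project_field in results:
                continue  # already resolved by an earlier extractor
            hit = output.get(internal_field)
            if hit is not None and hit[1] >= cfg.min_confidence:
                results[project_field] = hit[0]
    return results


# The list order defines the fall-back chain: regex first, then the mock model.
CONFIGS = [
    ExtractorConfig("regex", regex_extractor, 0.9,
                    {"invoice_number": "invoice_no"}),
    ExtractorConfig("ml", mock_ml_extractor, 0.8,
                    {"invoice_number": "InvoiceNumber", "total": "TotalAmount"}),
]
```

With these assumed settings, `extract("Invoice INV-1042 total due 120.50", CONFIGS)` takes the invoice number from the regex extractor and the total from the mock model. If the regex finds nothing, the invoice number falls back to the model, whose 0.55 confidence is below the 0.8 threshold, so the field is left unresolved rather than filled with a low-confidence value.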
References: [1] Document Processing with Improved Data Extraction. [2] Data Extraction Overview. [3] Data Extraction.