Template Creation Guide

Product home page: ByteScout Document Parser SDK

Template specification version: 3.

Templates can be written in YAML or JSON. A template defines one or more keywords that match it to the right document, plus expressions for the fields and tables to extract. A single template file can contain multiple templates: in a YAML file, separate the templates with a --- line; in a JSON file, arrange them as an array [].
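As a minimal sketch of the multi-template layout described above, the following YAML file holds two templates separated by a --- line. The vendor names and keywords are hypothetical placeholders; only keys that appear elsewhere in this guide are used.

```yaml
---
# First template: matches invoices from a hypothetical "Vendor A".
templateVersion: 3
sourceId: Vendor A Invoice
detectionRules:
  keywords:
  - Vendor A Ltd\.
  - Invoice No
fields:
  total:
    type: regex
    expression: TOTAL\s+(\d+\.\d+)
    dataType: decimal
---
# Second template in the same file, introduced by the --- separator.
templateVersion: 3
sourceId: Vendor B Invoice
detectionRules:
  keywords:
  - Vendor B Corp\.
  - Invoice No
fields:
  total:
    type: regex
    expression: AMOUNT DUE\s+(\d+\.\d+)
    dataType: decimal
```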

Sample YAML template showing the main features:

---
templateVersion: 3
templatePriority: 1
sourceId: ACME Inc. Invoice
culture: en-US
detectionRules:
  keywords:
  - ACME Inc\.
  - Invoice No
  - ABN 01 234 567 890
fields:
  companyName:
    type: static
    expression: ACME Inc.
  invoiceNumber:
    type: regex
    expression: 'Invoice No.: {{ABC123+}}'
    pageIndex: 0
  invoiceDate:
    type: regex
    expression: 'Invoice Date: \d{2}\/\d{2}\/\d{4}'
    dataType: date
    dateFormat: MM/dd/yyyy
  billTo:
    type: rectangle
    rectangle:
    - 32.5
    - 64.5
    - 200
    - 100
    pageIndex: 0
  total:
    type: regex
    expression: TOTAL\s+(\d+\.\d+)
    dataType: decimal
tables:
- name: table1
  start:
    expression: Item\s+Quantity\s+Price\s+Total
  end:
    expression: TOTAL
  row:
    # The description is matched lazily so that a multi-digit quantity is captured whole.
    expression: ^\s*(?<description>\w+.*?)\s+(?<quantity>\d+)\s+(?<unitPrice>\d+\.\d{2})\s+(?<itemTotal>\d+\.\d{2})\s*$
  columns:
  - name: description
    type: string
  - name: quantity
    type: integer
  - name: unitPrice
    type: decimal
  - name: itemTotal
    type: decimal
  multipage: true

Template Parameters

TemplatePriority

Templates are sorted and tried in order of templatePriority, then alphabetically. 0 is the highest priority and 999999 is the lowest.
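To illustrate the ordering, here is a sketch with two hypothetical templates whose keywords overlap. The more specific template is given the lower number so that it is tried first; the sourceId values and keywords are invented for illustration, and the fields sections are omitted for brevity.

```yaml
---
# Tried first: 0 is the highest priority.
templateVersion: 3
templatePriority: 0
sourceId: ACME Inc. Credit Note
detectionRules:
  keywords:
  - ACME Inc\.
  - Credit Note
---
# Tried later: a broader fallback for other ACME documents.
templateVersion: 3
templatePriority: 10
sourceId: ACME Inc. Generic
detectionRules:
  keywords:
  - ACME Inc\.
```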

SourceId

A name that identifies the design of the document. It is passed to the parsing result unchanged.

Culture

Template culture that affects the detection of dates and decimal numbers. For example, if en-US culture is set, the parser will expect dates in month-day-year sequence, and decimal numbers with the dot as the decimal symbol and the comma as the digit grouping symbol. For fr-FR culture, the parser will expect dates in day-month-year sequence, and decimal numbers with the comma as the decimal symbol and the space as the digit grouping symbol. You can find the list of culture names at https://msdn.microsoft.com/en-us/library/cc233982.aspx.

Example:

culture: fr-FR

DetectionRules

A few regular expressions (regex) that uniquely identify the document design. To specify a static keyword phrase instead of a regex, escape the symbols +*.[]()\/$ with \ because they are regex special characters.

Example:

detectionRules:
  keywords:
  - ACME Inc\.
  - \[CONFIDENTIAL\]
  - 'Invoice No:\s+\d{6}'
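As a further sketch of the escaping rule, the keywords below are literal phrases in which every regex special character is escaped. The phrases themselves are hypothetical.

```yaml
detectionRules:
  keywords:
  # Literal phrase "Total (USD)": both parentheses are escaped.
  - Total \(USD\)
  # Literal phrase "P.O. Box 123": each dot is escaped.
  - P\.O\. Box 123
  # Literal phrase "Net/Gross": the slash is escaped, as advised above.
  - Net\/Gross
```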

DocumentStart

If your PDF file contains multiple documents to parse, the documentStart regular expression should match the beginning of each new document in the PDF file.

Example:

documentStart: TAX INVOICE
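Because documentStart is a regular expression, it does not have to be a fixed phrase. The sketch below assumes a hypothetical PDF in which every document begins with a line such as "Invoice No: 123456".

```yaml
# Hypothetical pattern: a six-digit invoice number marks the start of each document.
documentStart: 'Invoice No:\s+\d{6}'
```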

Fields

Standalone fields to extract. For example, invoice number, invoice date, etc.

Field parameters:

Note 1: If you come across different number representations in the same document, you can override the template culture by appending the culture name in square brackets to the data type name. Only that field will be parsed according to the specified culture.

Example:

type: decimal[fr-FR]
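For context, a sketch of the override inside a table column definition, where the data type is given by the type key as in the other table examples in this guide. The column names, coordinates, and start/end expressions are hypothetical.

```yaml
culture: en-US
tables:
- name: table1
  start:
    expression: Item\s+Amount
  end:
    expression: TOTAL
  columns:
  - name: description
    x: 0
    type: string
  - name: amount
    x: 200
    # Values such as "1 234,56" are parsed as fr-FR despite the en-US template culture.
    type: decimal[fr-FR]
```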

Note 2: The dateFormat and outputDateFormat parameters can contain a format string that defines the exact date format. The format string syntax is described at https://docs.microsoft.com/en-us/dotnet/api/system.datetime.tryparseexact.

Example:

type: date
dateFormat: MM-dd-yyyy
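A sketch of both parameters on a single field, assuming that outputDateFormat sits alongside dateFormat in the field definition (this guide names the parameter but does not show its placement). The field label and the input format are hypothetical.

```yaml
fields:
  invoiceDate:
    type: regex
    # Hypothetical label; the capture group mirrors the "total" field of the sample template.
    expression: 'Date: (\d{2}\.\d{2}\.\d{4})'
    dataType: date
    # Exact input format: day.month.year, for example 31.01.2020.
    dateFormat: dd.MM.yyyy
    # Assumed placement: the format used for the date in the parsing result.
    outputDateFormat: yyyy-MM-dd
```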

The dateFormat can also contain auto-format strings:

Example:

type: date
dateFormat: auto-DMY

Tables

This section defines the tabular data you need to extract. Tables can be defined by coordinates or by regular expressions that find the table start, the table end, and the rows. The tables section can contain multiple table definitions arranged as an array.

Table parameters:

Example of table parsing:

Description    Interval          Quantity    Amount ($)
Basic Plan     Jan 1 - Jan 31    1           25.00
Basic Plan     Feb 1 - Feb 28    1           25.00
Total in USD: 50.00

The table above can be parsed with regular expressions or with explicitly defined column coordinates.

Regex approach:

tables:
- name: table1
  start:
    expression: Amount \(\$\)
  end:
    expression: Total in USD
  row:
    expression: ^\s*(?<description>\w+.*)(?<interval>[a-zA-Z]{3} \d+ - [a-zA-Z]{3} \d+)\s+(?<quantity>\d+)\s+(?<amount>\d+\.\d\d)
  columns:
  - name: description
    type: string
  - name: interval
    type: string
  - name: quantity
    type: integer
  - name: amount
    type: decimal

If a table is too complicated for the regex approach, you can specify the column coordinates explicitly. To determine the coordinates of a column visually, use the included Template Editor application: it shows the cursor coordinates in the toolbar.

Explicit column coordinates approach:

tables:
- name: table1
  start:
    expression: Description\s+Interval
  end:
    expression: Total in USD
  columns:
  - name: description
    x: 0
    type: string
  - name: interval
    x: 100
    type: string
  - name: quantity
    x: 150
    type: integer
  - name: amount
    x: 200
    type: decimal

Options

Template options.

If the ocrMode option is not specified, the mode is defined by the DocumentParser.OCRMode property. See the Document Parser SDK documentation.

APPENDIX 1

The expression parameter can contain macros or a regular expression (https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference). Do not use both macros and regular expressions in the same expression.

Macros:

Remarks:

If a macro detects a matching character sequence, that sequence is passed to the parsing result for the field. If an expression contains several macros, only the sequence matched by the last one is passed to the result.
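A sketch of the last-macro rule, reusing the {{ABC123+}} macro from the sample template. The macro table is not reproduced in this guide, so the reading of {{ABC123+}} as a run of letters and digits is an assumption, and the labels are hypothetical.

```yaml
fields:
  invoiceNumber:
    type: regex
    # Two macros in one expression: per the remark above, only the text matched
    # by the last macro (the one after "Invoice No.:") reaches the result.
    expression: 'Order No.: {{ABC123+}} Invoice No.: {{ABC123+}}'
```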


Copyright (c) 2018-2020 ByteScout, Inc.

www.bytescout.com