Template Creation Guide

Product home page: ByteScout Document Parser SDK

Template specification version: 3.

Templates can be written in YAML or JSON. A template defines one or more keywords that match the template to a document, plus expressions for the fields and tables to extract. A single template file can contain multiple templates: in a YAML file, separate the templates with a --- line; in a JSON file, arrange them as an array [].

Sample YAML template showing the main features:

---
templateVersion: 3
templatePriority: 1
sourceId: ACME Inc. Invoice
culture: en-US

detectionRules:
  keywords:
    - ACME Inc\.
    - Invoice No
    - ABN 01 234 567 890

fields:
  companyName:
    type: static
    expression: ACME Inc.
  invoiceNumber:
    type: regex
    expression: 'Invoice No.: {{ABC123+}}'
    pageIndex: 0
  invoiceDate:
    type: regex
    expression: 'Invoice Date: (\d{2}\/\d{2}\/\d{4})'
    dataType: date
    dateFormat: MM/dd/yyyy
  billTo:
    type: rectangle
    rectangle:
      - 32.5
      - 64.5
      - 200
      - 100
    pageIndex: 0
  total:
    type: regex
    expression: TOTAL\s+(\d+\.\d+)
    dataType: decimal

tables:
  - name: table1
    start:
      expression: Item\s+Quantity\s+Price\s+Total
    end:
      expression: TOTAL
    row:
      expression: ^\s*(?<description>\w+.*?)\s+(?<quantity>\d+)\s+(?<unitPrice>\d+\.\d{2})\s+(?<itemTotal>\d+\.\d{2})\s*$
    columns:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: unitPrice
        type: decimal
      - name: itemTotal
        type: decimal
    multipage: true

Template Parameters

TemplatePriority

Templates are sorted and tried in order of templatePriority, then alphabetically. 0 is the highest priority and 999999 is the lowest.
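
As a sketch of how priorities might be assigned, the file below holds two templates separated by a --- line: a specific template with the highest priority and a generic fallback with a lower one. The sourceId values and keywords are hypothetical.

```yaml
---
# Specific template: tried first because 0 is the highest priority.
templateVersion: 3
templatePriority: 0
sourceId: ACME Inc. Invoice
detectionRules:
  keywords:
    - 'ACME Inc\.'
    - 'Invoice No'
---
# Generic fallback: tried after the templates with a higher priority.
templateVersion: 3
templatePriority: 100
sourceId: Generic Invoice
detectionRules:
  keywords:
    - 'Invoice'
```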

SourceId

A name that identifies the document design. It is passed to the result unchanged.

Culture

Template culture that affects the detection of dates and decimal numbers. For example, if en-US culture is set, the parser will expect dates in month-day-year sequence, and decimal numbers with the dot as the decimal symbol and the comma as the digit grouping symbol. For fr-FR culture, the parser will expect dates in day-month-year sequence, and decimal numbers with the comma as the decimal symbol and the space as the digit grouping symbol. You can find the list of culture names at https://msdn.microsoft.com/en-us/library/cc233982.aspx.

Example:

culture: fr-FR
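
To illustrate, here is a minimal sketch of a fr-FR template in which an amount such as 1 234,56 would be read with the comma as the decimal symbol and the space as the digit grouping symbol. The sourceId, keyword, field name, and label text are hypothetical.

```yaml
templateVersion: 3
sourceId: Facture ACME
culture: fr-FR
detectionRules:
  keywords:
    - 'Facture'
fields:
  total:
    type: regex
    # {{DECIMAL}} takes its separators from the template culture (see Appendix 1).
    expression: 'Total TTC {{DECIMAL}}'
    dataType: decimal
```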

DetectionRules

A few regular expressions (Regex) that uniquely identify the document design. To specify a static keyword phrase instead of a regex, escape the symbols +*.[]()\/$ with \ because they are Regex special characters (the same applies to ?, ^, |, { and }).

Example:

detectionRules:
  keywords:
    - ACME Inc\.
    - \[CONFIDENTIAL\]
    - 'Invoice No:\s+\d{6}'
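
For instance, static phrases that contain special characters could be escaped as in the sketch below. The phrases themselves are hypothetical.

```yaml
detectionRules:
  keywords:
    # Literal phrase: Total (USD)
    - 'Total \(USD\)'
    # Literal phrase: www.acme.com
    - 'www\.acme\.com'
    # Literal phrase: Price* [net]
    - 'Price\* \[net\]'
```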

DocumentStart

If your PDF file contains multiple documents to parse, the documentStart regular expression should indicate the beginning of each new document in the PDF file.

Example:

documentStart: TAX INVOICE

Fields

Standalone fields to extract. For example, invoice number, invoice date, etc.

Field parameters:

type - [optional] Type of the field.

Valid values:
- regex - (default) A field that contains a regular expression (Regex) or macros (see Appendix 1).
- rectangle - Rectangular area of the document page to extract text from. The rectangle coordinates are defined in rectangle parameter. If used without the expression parameter, it will simply return the text extracted from the rectangle. If used with the expression parameter, the regex will only search within the text extracted from the rectangle.
- static - Static text that will be passed to the result without changes.
- structure - Virtual table structure field. The parser tries to reconstruct the tabular structure of the document page and lets you specify the coordinates of the desired field in this structure. Use the pageIndex and structureCoordinates parameters to specify the coordinates. Use Template Editor to select structure coordinates visually.
- direction - Directional field. Finds a fixed keyword phrase and takes the Nth phrase from it as the value. Use the expression parameter to specify the keyword phrase, and the keywordOrdinalNumber, valueDirection, and valueOrdinalNumber parameters to specify the criteria.
Examples of fields of different types:
```
fields:

  # regex field
  total:
    type: regex
    expression: TOTAL\s+(\d+\.\d+)
    dataType: decimal

  # rectangle field
  billTo:
    type: rectangle
    rectangle:
      - 32.5
      - 64.5
      - 200
      - 100
    expression: '(?s)Bill to:(?<value>.*)'
    pageIndex: 0

  # static field
  companyName:
    type: static
    expression: ACME Inc.

  # structure field
  structureField1:
    type: structure
    pageIndex: 0
    structureCoordinates:
      x: 2
      y: 4

  # directional field
  userSsn:
    type: direction
    expression: SSN
    keywordOrdinalNumber: 2
    valueDirection: right
    valueOrdinalNumber: 1
```
expression - Contains macros (see Appendix 1) or a regular expression (Regex) defining the data to be searched and retrieved from the document.

Remarks:
- If you used several macros in one expression, only the value of the last one will be passed to the result.
- If regex doesn't contain capturing groups, the entire match will go to result. With groups, the last group or the named <value> group will go to the result.
- Do not use both macros and regular expressions in the same expression.
Special case: the expression can also contain the name of a special function. Currently available special functions:
- $$funcFindCompany - searches the document for the company name from a predefined list of known companies.
- $$funcFindCompanyNext - searches the document for the company name from a predefined list of known companies.
- $$funcFindMaxDate - searches the document for the maximum date.
- $$funcFindMinDate - searches the document for the minimum date.
- $$funcFindMaxNumber - searches the document for the maximum number.
Examples of expression parameter:
```
# Macro expression. The last macro will go to the result.
expression: 'Invoice No.: {{ABC123}}'

# The entire match will go to the result
expression: \w{6}-\d{5}

# The last capturing group will go to the result
expression: 'Account number:\s+(\d+)'

# Only the value of <value> named group will go to the result
expression: 'Total\s+(?:USD|€|\$|£|¥)?\s*(?<value>(\d+,?)+\.\d\d)'

# Special function
expression: $$funcFindCompany
```
rectangle - [optional] Coordinates of the extraction area for fields of the 'rectangle' type. The coordinates are specified as top, left, width, and height in PDF Points (1 Point = 1/72 inch).

Example:
```
fields:
  billTo:
    type: rectangle
    rectangle:
      - 10
      - 10
      - 200
      - 100
```
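
Since 1 inch equals 72 Points, measurements taken in inches can be converted by multiplying by 72. The sketch below assumes a hypothetical area that starts 1 inch from the top and 0.5 inch from the left of the page, and is 3 inches wide and 1.5 inches high. The values follow the top, left, width, height order stated above.

```yaml
fields:
  shipTo:
    type: rectangle
    rectangle:
      - 72    # top:    1 in   x 72 = 72 pt
      - 36    # left:   0.5 in x 72 = 36 pt
      - 216   # width:  3 in   x 72 = 216 pt
      - 108   # height: 1.5 in x 72 = 108 pt
    pageIndex: 0
```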
pageIndex - [optional] Zero-based page index to search the field in. Default is -1 (any page).
dataType - [optional] The expected datatype of the parsed value.

Possible values:
- string - used by default if the type is not specified; the matched Regex value will be passed to the result unchanged.
- integer - the parser will try to convert the retrieved text to an integer number according to the template culture.
- decimal - the parser will try to convert the retrieved text to a decimal number according to the template culture. See Note 1 below.
- date - the retrieved text will be parsed as a date according to specified dateFormat or the template culture. See Note 2 below.
- table - the special type used in conjunction with 'rectangle' field type. The data from the rectangle area will be extracted preserving the table structure.
dateFormat - [optional] The format string to parse the date. See Note 2 below.
outputDateFormat - [optional] Output date format. By default, successfully parsed date will be passed to the result in ISO 8601 format, e.g. 2018-01-04T00:00:00, but you can specify your own output format, e.g. yyyy-MM-dd.
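
As a sketch, the field below reads dates such as 01/04/2018 in month-day-year order and outputs them as yyyy-MM-dd instead of the default ISO 8601 form. The field name and label text are hypothetical.

```yaml
fields:
  invoiceDate:
    type: regex
    # The capturing group isolates the date from the label.
    expression: 'Invoice Date:\s+(\d{2}\/\d{2}\/\d{4})'
    dataType: date
    dateFormat: MM/dd/yyyy
    # Without this line the result would be in ISO 8601 form,
    # e.g. 2018-01-04T00:00:00.
    outputDateFormat: yyyy-MM-dd
```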
rowMergingRule - [optional] defines a rule to merge multiline data in table cells. Used with 'rectangle' field type and 'table' data type. See rowMergingRule description in tables section.
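
A minimal sketch of a rectangle field that returns tabular data and merges wrapped lines. The field name and the coordinates are hypothetical.

```yaml
fields:
  lineItems:
    type: rectangle
    rectangle:
      - 250
      - 30
      - 540
      - 300
    pageIndex: 0
    # Keep the table structure of the text found in the rectangle.
    dataType: table
    # Join single-cell rows to the row above (tables without row borders).
    rowMergingRule: hangingRows
```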
coalesceWith - name of another field to coalesce with. If the specified field is not parsed, the current field will replace it. This is useful if you need to create two parsing criteria for some varying data and get them as a single field in the result. If the first field fails, the second will be used.

Example: if field1 is not successfully parsed, field1a is used in its place in the result:
```
fields:
  field1:
    type: rectangle
    rectangle:
      - 10
      - 10
      - 100
      - 25
  field1a:
    type: rectangle
    rectangle:
      - 10
      - 50
      - 100
      - 25
    coalesceWith: field1
```

structureCoordinates - X and Y coordinates in the virtual table structure. You can use Template Editor to select structure coordinates visually.

Example:

  # structure field
  structureField1:
    type: structure
    pageIndex: 0
    structureCoordinates:
      x: 2
      y: 4

keywordOrdinalNumber - For direction type fields. Ordinal number of the keyword phrase occurrence.
valueOrdinalNumber - For direction type fields. Ordinal number of the sentence to return as the result. A sentence is a sequence of words separated by single spaces.

Note 1: If different number representations occur in the same document, you can override the template culture for a single field by appending a culture name in square brackets to the data type name. Only that field will be parsed according to the specified culture.

Example:

dataType: decimal[fr-FR]
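
A sketch of the override in context: the template culture is en-US, but one hypothetical field holds an amount written in the French style. The field names, labels, and expressions are illustrative only.

```yaml
culture: en-US
fields:
  totalUsd:
    type: regex
    expression: 'Total USD\s+(\d+\.\d{2})'
    # Parsed with the template culture (en-US).
    dataType: decimal
  totalEur:
    type: regex
    # Matches amounts such as 1 234,56
    expression: 'Total EUR\s+([\d ]+,\d{2})'
    # Parsed with fr-FR rules instead of the template culture.
    dataType: decimal[fr-FR]
```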

Note 2: The dateFormat and outputDateFormat parameters can contain a format string that defines the exact date format. The format string syntax is described at https://docs.microsoft.com/en-us/dotnet/api/system.datetime.tryparseexact.

Example:

dataType: date
dateFormat: MM-dd-yyyy

The dateFormat can also contain auto-format strings:

auto-MDY - the parser will try to detect the date format automatically, assuming the date is in month-day-year sequence.
auto-DMY - the parser will try to detect the date format automatically, assuming the date is in day-month-year sequence.
auto-YMD - the parser will try to detect the date format automatically, assuming the date is in year-month-day sequence.
auto - the parser will try to detect the format automatically, taking the date parts sequence from the template culture.

Example:

dataType: date
dateFormat: auto-DMY

Tables

This section defines the tabular data you need to extract. Tables can be defined by coordinates or by regular expressions that find the table start, the table end, and the rows. The tables section can contain multiple table definitions arranged as an array.

Table parameters:

name - table name to distinguish different tables in the result.
start - group of parameters that define the start of the table:
- expression - regular expression to find the start of the table, or
- y - the top coordinate of the table.
- pageIndex - index of the page containing the y coordinate.
end - group of parameters that define the end of the table:
- expression - regular expression to find the end of the table, or
- y - the bottom coordinate of the table.
subItemStart - [optional] parameters that define the start of the table sub-item. Sub-items are used for tables with complex multiline rows:
- expression - regular expression to find the start of the sub-item.
subItemEnd - [optional] parameters that define the end of the table sub-item:
- expression - regular expression to find the end of the sub-item.
introduction - Parameters to parse values from sub-headers. Values parsed from the introduction expression will be repeated at the beginning of every row.
- expression - regular expression to parse introduction items.
row - [optional] group of parameters that define table rows:
- expression - the main regular expression to find a row.
- subExpression1, subExpression2, subExpression3, subExpression4, subExpression5 - additional expressions to parse some remaining parts of row data which the main expression cannot parse in one pass. Sub-expressions are executed after the main expression for the text chunks between matches of the main expression. Can be used to parse hanging rows (wrapped multiline cells).
columns - [optional] array that defines column properties. Names of columns should correspond to the names of the capturing groups of the row expression.

Column properties:
- name - column name.
- x - [optional] X coordinate of the left column edge in PDF Points.
- type - [optional] Column data type: 'string', 'integer', 'date', or 'decimal' (see also the type descriptions in fields section).
- dateFormat - [optional] See dateFormat description in fields section.
- outputDateFormat - [optional] See outputDateFormat description in fields section.
- coalesceWith - [optional] Name of column to merge the parsed value with.
Example:
```
  columns:
    - name: exam
      x: 0
      type: string
    - name: examDate
      x: 100
      type: date
      dateFormat: auto-MDY
```
rowMergingRule - [optional] For the fields of rectangle type and table data type. Defines the rule to merge multiline data in table cells.

Valid values:
- none - default, no rule.
- byBorders - combine lines within a table cell framed by border lines.
- hangingRows - join a table row that contains only a single cell to the previous row if there is no separating line between them. Useful for tables without borders between rows.
Example:
```
rowMergingRule: byBorders
```
multipage - [optional] defines whether the table may continue on further pages.

Example:
```
multipage: true
```
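
The sketch below shows how the row parameters described above might fit together for a table whose descriptions wrap onto a second line. It assumes that sub-expressions use the same named-group convention as the main row expression, which this guide does not state explicitly. The table layout and all names are hypothetical.

```yaml
tables:
  - name: items
    start:
      expression: 'Description\s+Quantity\s+Amount'
    end:
      expression: 'Total in USD'
    row:
      # Main pass: a complete row with description, quantity, and amount.
      expression: '^\s*(?<description>\w.*?)\s+(?<quantity>\d+)\s+(?<amount>\d+\.\d\d)\s*$'
      # Second pass over the text left between main matches:
      # a line without digits is treated as a wrapped description.
      subExpression1: '^\s*(?<description>[^\d\r\n]+)$'
    columns:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: decimal
    multipage: true
```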

Example of table parsing:

Description	Interval	Quantity	Amount ($)
Basic Plan	Jan 1 - Jan 31	1	25.00
Basic Plan	Feb 1 - Feb 28	1	25.00
		Total in USD:	50.00

The table above can be parsed with regular expressions or with explicitly defined column coordinates.

Regex approach:

tables:
  - name: table1
    start:
      expression: Amount \(\$\)
    end:
      expression: Total in USD
    row:
      expression: ^\s*(?<description>\w+.*)(?<interval>[a-zA-Z]{3} \d+ - [a-zA-Z]{3} \d+)\s+(?<quantity>\d+)\s+(?<amount>\d+\.\d\d)
    columns:
      - name: description
        type: string
      - name: interval
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: decimal

If the regex approach is impossible for some complicated table, you can specify column coordinates explicitly. To visually determine the coordinates of a column, you can use the included Template Editor application: it shows the cursor coordinates in the toolbar.

Explicit column coordinates approach:

tables:
  - name: table1
    start:
      expression: Description\s+Interval
    end:
      expression: Total in USD
    columns:
      - name: description
        x: 0
        type: string
      - name: interval
        x: 100
        type: string
      - name: quantity
        x: 150
        type: integer
      - name: amount
        x: 200
        type: decimal
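
The start and end of a table can likewise be given as coordinates instead of expressions, using the y and pageIndex parameters listed under Table parameters. A minimal sketch with hypothetical coordinates follows; the y values are assumed to be in PDF Points, like the column x values.

```yaml
tables:
  - name: table1
    start:
      y: 120        # top coordinate of the table
      pageIndex: 0  # page that contains the y coordinate
    end:
      y: 400        # bottom coordinate of the table
    columns:
      - name: description
        x: 0
        type: string
      - name: interval
        x: 100
        type: string
      - name: quantity
        x: 150
        type: integer
      - name: amount
        x: 200
        type: decimal
```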

Options

Template options.

ocrLanguage - The language for Optical Character Recognition (OCR). Document Parser SDK is shipped with 5 language files, but you can download more languages at https://github.com/bytescout/ocrdata.

Valid values:
- eng - English (default)
- deu - German
- fra - French
- spa - Spanish
- nld - Dutch
Example:
```
ocrLanguage: nld
```
ocrMode - The mode of the Optical Character Recognition (OCR):
- auto - OCR will be used only if a PDF document page contains no text, only raster images.
- forced - Force OCR to extract text from both images and fonts. Useful for PDF documents with mixed content (when a portion of the document text is drawn as an image).
- repairFonts - Some PDF documents use embedded fonts with a customized charset, which makes text extraction impossible. This mode will render the entire document and extract the text using OCR.
Example:
```
ocrMode: forced
```

If the ocrMode option is not specified, the mode is defined by the DocumentParser.OCRMode property. See the Document Parser SDK documentation.
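
A sketch of both options used together. It assumes the options are grouped under an options key at the top level of the template, by analogy with the fields and tables sections; this guide does not show the enclosing key explicitly. The sourceId and keyword are hypothetical.

```yaml
templateVersion: 3
sourceId: Scanned Invoice
detectionRules:
  keywords:
    - 'Rechnung'
# Assumption: OCR settings live under a top-level "options" key.
options:
  ocrLanguage: deu
  ocrMode: auto
```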

APPENDIX 1

The expression parameter can contain macros or a regular expression (https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference). Do not use both macros and regular expressions in the same expression.

Macros:

{{ABC}} - Detects continuous sequences of letters and _ character.
{{ABC+}} - Detects continuous sequences of letters and _-+=/ characters.
{{ABC123}} - Detects continuous sequences of letters, digits, and _ character.
{{ABC123+}} - Detects continuous sequences of letters, digits, and _-+=/ characters.
{{123}} - Detects continuous sequences of digits.
{{123+}} - Detects continuous sequences of digits and _-+=/ characters.
{{DATE}} - Detects short date patterns like the following: 12/31/2019, 31.12.19, 2019-12-31.
{{DATE+}} - Detects long date patterns like the following: Sep 23, 2019, 22 décembre 2010.
{{DECIMAL}} - Detects decimal numbers like the following: 12.34, -123,456.78, 123.456. The decimal separator and group separator are automatically taken from the template culture.
{{MONEY}} - Detects decimal numbers with currency symbol like the following: USD 12.34, $123,456.78, 123.45 €. The decimal separator and group separator are automatically taken from the template culture.
{{ANY}} - Sequence of any characters, including spaces and new lines.

Remarks:

If the macro successfully detected an appropriate character sequence, it will be passed to the parsing result for this field. If you used several macros in one expression, only the last one will be passed to the result.
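
A few macro-based fields, as a sketch. The field names and label texts are hypothetical; the type parameter is omitted because regex is the default field type.

```yaml
fields:
  # Two macros in one expression: only the value matched by the last one,
  # {{123}}, goes to the result.
  orderNumber:
    expression: 'Order {{ABC}} No {{123}}'
  # {{DATE}} detects a short date such as 12/31/2019.
  orderDate:
    expression: 'Order Date {{DATE}}'
    dataType: date
  # {{MONEY}} detects an amount with a currency symbol, such as $123,456.78.
  amountDue:
    expression: 'Amount Due {{MONEY}}'
```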