Product home page: ByteScout Document Parser SDK
Template specification version: 3.
Templates can be written in YAML or JSON format. A template defines one or more keywords that match it to a document, plus expressions for the fields and tables to extract.
A single template file can contain multiple templates. In a YAML file, separate templates with a --- line. In a JSON file, arrange templates as an array [].
Sample YAML template showing the main features:
---
templateVersion: 3
templatePriority: 1
sourceId: ACME Inc. Invoice
culture: en-US
detectionRules:
  keywords:
    - ACME Inc\.
    - Invoice No
    - ABN 01 234 567 890
fields:
  companyName:
    type: static
    expression: ACME Inc.
  invoiceNumber:
    type: regex
    expression: 'Invoice No.: {{ABC123+}}'
    pageIndex: 0
  invoiceDate:
    type: regex
    expression: 'Invoice Date: \d{2}\/\d{2}\/\d{4}'
    dataType: date
    dateFormat: MM/dd/yyyy
  billTo:
    type: rectangle
    rectangle:
      - 32.5
      - 64.5
      - 200
      - 100
    pageIndex: 0
  total:
    type: regex
    expression: TOTAL\s+(\d+\.\d+)
    dataType: decimal
tables:
  - name: table1
    start:
      expression: Item\s+Quantity\s+Price\s+Total
    end:
      expression: TOTAL
    row:
      expression: ^\s*(?<description>\w+.*)(?<quantity>\d+)\s+(?<unitPrice>\d+\.\d{2})\s+(?<itemTotal>\d+\.\d{2})\s*$
    columns:
      - name: description
        type: string
      - name: quantity
        type: integer
      - name: unitPrice
        type: decimal
      - name: itemTotal
        type: decimal
    multipage: true
templatePriority - Templates are sorted and tried by templatePriority, then alphabetically. 0 is the highest priority, 999999 is the lowest.
sourceId - A name that identifies the design of the document. It is passed to the result unchanged.
culture - Template culture that affects the detection of dates and decimal numbers.
For example, if en-US culture is set, the parser will expect dates in month-day-year sequence, and decimal numbers with the dot as the decimal symbol and the comma as the digit grouping symbol.
For fr-FR culture, the parser will expect dates in day-month-year sequence, and decimal numbers with the comma as the decimal symbol and the space as the digit grouping symbol.
You can find the list of culture names at https://msdn.microsoft.com/en-us/library/cc233982.aspx.
Example:
culture: fr-FR
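The effect of the culture on number parsing can be illustrated with a small Python sketch. It is not part of the SDK: parse_decimal and the SEPARATORS table are hypothetical names, and only the two cultures described above are covered, using the separators stated in the text.

```python
from decimal import Decimal

# Decimal and grouping symbols for the two cultures described above.
# Hypothetical lookup table for illustration only.
SEPARATORS = {
    "en-US": {"decimal": ".", "group": ","},
    "fr-FR": {"decimal": ",", "group": " "},
}


def parse_decimal(text, culture):
    """Parse a culture-formatted number string into a Decimal."""
    sep = SEPARATORS[culture]
    cleaned = text.strip().replace(sep["group"], "")
    cleaned = cleaned.replace(sep["decimal"], ".")
    return Decimal(cleaned)


print(parse_decimal("1,234.56", "en-US"))
print(parse_decimal("1 234,56", "fr-FR"))
```

Both calls produce the same value, 1234.56, which is why a wrong culture setting typically shows up as misread or rejected amounts.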
detectionRules - The keywords list holds a few regular expressions (Regex) that uniquely identify the document design.
Note: to specify a static keyword phrase instead of a regex, escape the symbols +*.[]()\/$ with \ because they are Regex special characters.
Example:
detectionRules:
  keywords:
    - ACME Inc\.
    - \[CONFIDENTIAL\]
    - 'Invoice No:\s+\d{6}'
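If keywords are generated by a script, the escaping rule from the note can be automated. The following Python sketch uses a hypothetical helper, escape_keyword, that escapes exactly the symbols listed in the note. Other regex metacharacters such as ? ^ { } | also need escaping when they appear in a phrase, and this sketch does not handle them.

```python
import re

# The symbols named in the note above: + * . [ ] ( ) \ / $
_SPECIAL = re.compile(r"([+*.\[\]()\\/$])")


def escape_keyword(phrase):
    """Prefix each listed special symbol with a backslash."""
    return _SPECIAL.sub(r"\\\1", phrase)


print(escape_keyword("ACME Inc."))
print(escape_keyword("[CONFIDENTIAL]"))
```

The two calls reproduce the first two keywords of the example above.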
documentStart - If your PDF file contains multiple documents to parse, the documentStart regular expression should indicate the beginning of each new document in the PDF file.
Example:
documentStart: TAX INVOICE
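Conceptually, documentStart cuts the extracted text at every match. A rough Python sketch of that idea follows. The function split_documents is hypothetical, and how the SDK treats text before the first match is not stated in this specification, so dropping it here is an assumption.

```python
import re


def split_documents(text, document_start):
    """Split text into chunks, each beginning at a documentStart match."""
    starts = [m.start() for m in re.finditer(document_start, text)]
    if not starts:
        return [text]  # no match: treat the whole text as one document
    bounds = starts + [len(text)]
    # Assumption: anything before the first match is ignored.
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


pdf_text = "TAX INVOICE\nNo 1\nTAX INVOICE\nNo 2\n"
for doc in split_documents(pdf_text, r"TAX INVOICE"):
    print(repr(doc))
```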
fields - Standalone fields to extract, for example, invoice number or invoice date.
Field parameters:
type - [optional] Type of the field.
Valid values:
regex - (default) A field that contains a regular expression (Regex) or macros (see Appendix 1).
rectangle - Rectangular area of the document page to extract text from. The rectangle coordinates are defined in the rectangle parameter. If used without the expression parameter, it simply returns the text extracted from the rectangle. If used with the expression parameter, the regex searches only within the text extracted from the rectangle.
static - Static text that will be passed to the result without changes.
structure - Virtual table structure field. The parser tries to reconstruct a tabular structure of the document page and lets you specify the coordinates of the desired field in this structure. Use the pageIndex and structureCoordinates parameters to specify the coordinates. Use Template Editor to select structure coordinates visually.
direction - Directional field. Finds a fixed keyword phrase and takes the Nth phrase from it as the value. Use the expression parameter to specify the keyword phrase, and the keywordOrdinalNumber and valueOrdinalNumber parameters to specify the criteria.
Examples of fields of different types:
fields:
  # regex field
  total:
    type: regex
    expression: TOTAL\s+(\d+\.\d+)
    dataType: decimal
  # rectangle field
  billTo:
    type: rectangle
    rectangle:
      - 32.5
      - 64.5
      - 200
      - 100
    expression: '(?s)Bill to:(?<value>.*)'
    pageIndex: 0
  # static field
  companyName:
    type: static
    expression: ACME Inc.
  # structure field
  structureField1:
    type: structure
    pageIndex: 0
    structureCoordinates:
      x: 2
      y: 4
  # directional field
  userSsn:
    type: direction
    expression: SSN
    keywordOrdinalNumber: 2
    valueDirection: right
    valueOrdinalNumber: 1
expression - Contains macros (see Appendix 1) or a regular expression (Regex) defining the data to be searched and retrieved from the document.
Remarks:
If the expression contains a named group <value>, only the value of that group will go to the result.
Special case: the expression can also contain the name of a special function, for example $$funcFindCompany.
Examples of expression parameter:
# Macro expression. The last macro will go to the result.
expression: 'Invoice No.: {{ABC123}}'
# The entire match will go to the result
expression: \w{6}-\d{5}
# The last capturing group will go to the result
expression: 'Account number:\s+(\d+)'
# Only the value of <value> named group will go to the result
expression: 'Total\s+(?:USD|€|\$|£|¥)?\s*(?<value>(\d+,?)+\.\d\d)'
# Special function
expression: $$funcFindCompany
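The rules in the comments above (named <value> group, else last capturing group, else entire match) can be sketched in Python as follows. This is an illustration only: extract_value is a hypothetical helper, and Python writes named groups as (?P<value>...) where the template syntax uses (?<value>...).

```python
import re


def extract_value(pattern, text):
    """Return the part of the first match that would go to the result."""
    m = re.search(pattern, text)
    if m is None:
        return None
    if "value" in m.groupdict():
        return m.group("value")      # named <value> group wins
    if m.re.groups:
        return m.group(m.re.groups)  # otherwise the last capturing group
    return m.group(0)                # otherwise the entire match


print(extract_value(r"\w{6}-\d{5}", "Ref ABCDEF-12345"))
print(extract_value(r"Account number:\s+(\d+)", "Account number:   98765"))
print(extract_value(
    r"Total\s+(?:USD|€|\$|£|¥)?\s*(?P<value>(\d+,?)+\.\d\d)",
    "Total USD 1,234.56",
))
```

Note that the third pattern contains an inner capturing group after <value>, which is why the named group has to take precedence over the "last capturing group" rule.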
rectangle - [optional] coordinates of the extraction area for fields of the 'rectangle' type. The coordinates are specified as top, left, width, and height in PDF units Points (1 Point = 1/72").
Example:
fields:
  billTo:
    type: rectangle
    rectangle:
      - 10
      - 10
      - 200
      - 100
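Because rectangle values are expressed in points, measurements taken in inches or millimetres have to be converted first. A minimal Python sketch, with hypothetical helper names, based only on the 1 point = 1/72 inch definition above:

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def inches_to_points(inches):
    """Convert inches to PDF points (1 point = 1/72 inch)."""
    return inches * POINTS_PER_INCH


def mm_to_points(mm):
    """Convert millimetres to PDF points."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


# A4 paper is 210 mm wide, which is roughly 595 points.
print(inches_to_points(1))
print(round(mm_to_points(210), 2))
```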
pageIndex - [optional] Zero-based page index to search the field in. Default is -1 (any page).
dataType - [optional] The expected datatype of the parsed value.
Possible values:
dateFormat - [optional] The format string to parse the date. See Note 2 below.
outputDateFormat - [optional] Output date format. By default, successfully parsed date will be passed to the result in ISO 8601 format, e.g. 2018-01-04T00:00:00, but you can specify your own output format, e.g. yyyy-MM-dd.
rowMergingRule - [optional] defines a rule to merge multiline data in table cells. Used with 'rectangle' field type and 'table' data type. See rowMergingRule description in tables section.
coalesceWith - Name of another field to coalesce with. If the specified field is not parsed, the current field will replace it. This is useful when you need two parsing criteria for varying data but want a single field in the result: if the first field fails, the second is used.
Example: if field1 is not successfully parsed, field1a will replace field1 in the result:
fields:
  field1:
    type: rectangle
    rectangle:
      - 10
      - 10
      - 100
      - 25
  field1a:
    type: rectangle
    rectangle:
      - 10
      - 50
      - 100
      - 25
    coalesceWith: field1
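The fallback logic can be pictured with the Python sketch below. The helper apply_coalesce and its data layout are hypothetical, and whether the helper field (field1a) also remains in the SDK's result is not stated here, so removing it is an assumption.

```python
def apply_coalesce(parsed, coalesce_with):
    """Fill unparsed target fields from their fallback fields.

    parsed:        field name -> parsed value, or None if parsing failed
    coalesce_with: fallback field name -> target field name
    """
    result = dict(parsed)
    for source, target in coalesce_with.items():
        fallback = result.pop(source, None)  # assumption: helper field is dropped
        if result.get(target) is None:
            result[target] = fallback
    return result


rules = {"field1a": "field1"}
print(apply_coalesce({"field1": None, "field1a": "ACME"}, rules))
print(apply_coalesce({"field1": "Main", "field1a": "ACME"}, rules))
```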
structureCoordinates - X and Y coordinates in the virtual table structure. You can use Template Editor to select structure coordinates visually.
Example:
# structure field
structureField1:
  type: structure
  pageIndex: 0
  structureCoordinates:
    x: 2
    y: 4
keywordOrdinalNumber - For direction type fields. Ordinal number of the keyword phrase occurrence.
valueOrdinalNumber - For direction type fields. Ordinal number of the sentence to return as the result. A sentence is a sequence of words separated by single spaces.
Note 1: If you come across different number representations in the same document, you can override the template culture by appending new culture to the data type name. This single field will be parsed according to the specified culture.
Example:
type: decimal[fr-FR]
Note 2: The dateFormat and outputDateFormat can contain a format string defining the exact date format. Find the format string description here: https://docs.microsoft.com/en-us/dotnet/api/system.datetime.tryparseexact.
Example:
type: date
dateFormat: MM-dd-yyyy
The dateFormat can also contain auto-format strings:
auto-MDY - the parser will try to detect the date format automatically, assuming the date is in month-day-year sequence.
auto-DMY - the parser will try to detect the date format automatically, assuming the date is in day-month-year sequence.
auto-YMD - the parser will try to detect the date format automatically, assuming the date is in year-month-day sequence.
auto - the parser will try to detect the format automatically, taking the date parts sequence from the template culture.
Example:
type: date
dateFormat: auto-DMY
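To show how an exact dateFormat and an outputDateFormat interact, here is a Python sketch that maps a small subset of the format tokens (yyyy, yy, MM, dd) to strptime directives. The helper names are hypothetical, the auto-* modes are not covered, and the real format strings support many more tokens.

```python
from datetime import datetime

# Subset of format tokens and their strptime equivalents.
# "yyyy" is listed before "yy" so it is not consumed as two short years.
TOKENS = [("yyyy", "%Y"), ("yy", "%y"), ("MM", "%m"), ("dd", "%d")]


def to_strptime(date_format):
    """Translate a format such as MM-dd-yyyy into %m-%d-%Y."""
    for token, directive in TOKENS:
        date_format = date_format.replace(token, directive)
    return date_format


def parse_date(text, date_format):
    """Parse text using an exact format string."""
    return datetime.strptime(text, to_strptime(date_format))


parsed = parse_date("01-04-2018", "MM-dd-yyyy")
print(parsed.isoformat())                          # default ISO 8601 output
print(parsed.strftime(to_strptime("yyyy-MM-dd")))  # custom output format
```

The first line printed is 2018-01-04T00:00:00, the default ISO 8601 form mentioned in the outputDateFormat description; the second is 2018-01-04.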
tables - This section defines the tabular data you need to extract. Tables can be defined by coordinates or by regular expressions that find the table start, the table end, and the rows. The tables section can contain multiple table definitions arranged as an array.
Table parameters:
name - table name to distinguish different tables in the result.
start - group of parameters that define the start of the table:
  expression - regular expression to find the start of the table, or
  y - the top coordinate of the table.
  pageIndex - index of the page containing the y coordinate.
end - group of parameters that define the end of the table:
  expression - regular expression to find the end of the table, or
  y - the bottom coordinate of the table.
subItemStart - [optional] parameters that define the start of the table sub-item. Sub-items are used for tables with complex multiline rows:
  expression - regular expression to find the start of the sub-item.
subItemEnd - [optional] parameters that define the end of the table sub-item:
  expression - regular expression to find the end of the sub-item.
introduction - Parameters to parse values from sub-headers. Values parsed from the introduction expression will be repeated in the beginning of every row.
  expression - regular expression to parse introduction items.
row - [optional] group of parameters that define table rows:
  expression - the main regular expression to find a row.
  subExpression1, subExpression2, subExpression3, subExpression4, subExpression5 - additional expressions to parse some remaining parts of row data which the main expression cannot parse in one pass. Sub-expressions are executed after the main expression for the text chunks between matches of the main expression. Can be used to parse hanging rows (wrapped multiline cells).
columns - [optional] array that defines column properties. Names of columns should correspond to the names of the capturing groups of the row expression.
Column properties:
name - column name.
x - [optional] X coordinate of the left column edge in PDF Points.
type - [optional] Column data type: 'string', 'integer', 'date', or 'decimal' (see also the type descriptions in fields section).
dateFormat - [optional] See dateFormat description in fields section.
outputDateFormat - [optional] See outputDateFormat description in fields section.
coalesceWith - [optional] Name of column to merge the parsed value with.
Example:
columns:
  - name: exam
    x: 0
    type: string
  - name: examDate
    x: 100
    type: date
    dateFormat: auto-MDY
rowMergingRule - [optional] For the fields of rectangle type and table data type. Defines the rule to merge multiline data in table cells.
Valid values:
none - default, no rule.
byBorders - combine lines within a table cell framed by border lines.
hangingRows - join table row that contains only a single cell up to the previous row if there is no separating line between them. Useful for tables without borders between rows.
Example:
rowMergingRule: byBorders
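The hangingRows idea can be sketched in Python as below. This is a simplified illustration with a hypothetical function and data layout (rows as lists of cell strings); it ignores the "no separating line" condition, which requires page geometry that is not modelled here.

```python
def merge_hanging_rows(rows):
    """Join rows that contain a single non-empty cell up into the previous row."""
    merged = []
    for row in rows:
        filled = [i for i, cell in enumerate(row) if cell.strip()]
        if merged and len(filled) == 1:
            i = filled[0]
            merged[-1][i] = (merged[-1][i] + " " + row[i]).strip()
        else:
            merged.append(list(row))
    return merged


rows = [
    ["Basic Plan", "1", "25.00"],
    ["with support", "", ""],   # wrapped continuation of the first cell
    ["Pro Plan", "2", "80.00"],
]
for row in merge_hanging_rows(rows):
    print(row)
```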
multipage - [optional] defines whether the table may continue on further pages.
Example:
multipage: true
Example of table parsing:
| Description | Interval | Quantity | Amount ($) |
|---|---|---|---|
| Basic Plan | Jan 1 - Jan 31 | 1 | 25.00 |
| Basic Plan | Feb 1 - Feb 28 | 1 | 25.00 |
| Total in USD: | | | 50.00 |
The table above can be parsed with regular expressions or with explicitly defined column coordinates.
Regex approach:
tables:
  - name: table1
    start:
      expression: Amount \(\$\)
    end:
      expression: Total in USD
    row:
      expression: ^\s*(?<description>\w+.*)(?<interval>[a-zA-Z]{3} \d+ - [a-zA-Z]{3} \d+)\s+(?<quantity>\d+)\s+(?<amount>\d+\.\d\d)
    columns:
      - name: description
        type: string
      - name: interval
        type: string
      - name: quantity
        type: integer
      - name: amount
        type: decimal
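To check a row expression before putting it into a template, it can help to run it against the plain text of the table. The Python sketch below applies the start, end, and row expressions from the regex approach to a text rendering of the sample table. It is an approximation: parse_table is a hypothetical helper, the text layout is assumed, and Python spells named groups (?P<name>...) rather than (?<name>...).

```python
import re
from decimal import Decimal

START = re.compile(r"Amount \(\$\)")
END = re.compile(r"Total in USD")
ROW = re.compile(
    r"^\s*(?P<description>\w+.*)"
    r"(?P<interval>[a-zA-Z]{3} \d+ - [a-zA-Z]{3} \d+)\s+"
    r"(?P<quantity>\d+)\s+(?P<amount>\d+\.\d\d)"
)


def parse_table(text):
    """Collect rows found between the start and end expressions."""
    rows, inside = [], False
    for line in text.splitlines():
        if not inside:
            inside = START.search(line) is not None
            continue
        if END.search(line):
            break
        m = ROW.search(line)
        if m:
            rows.append({
                # the greedy description group keeps trailing spaces
                "description": m["description"].strip(),
                "interval": m["interval"],
                "quantity": int(m["quantity"]),
                "amount": Decimal(m["amount"]),
            })
    return rows


SAMPLE = """Description   Interval         Quantity  Amount ($)
Basic Plan    Jan 1 - Jan 31   1         25.00
Basic Plan    Feb 1 - Feb 28   1         25.00
Total in USD:                            50.00
"""
for row in parse_table(SAMPLE):
    print(row)
```

The description group is greedy, so it captures the padding before the interval; the sketch strips it explicitly.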
If the regex approach is impossible for some complicated table, you can specify column coordinates explicitly. To visually determine the coordinates of a column, you can use the included Template Editor application: it shows the cursor coordinates in the toolbar.
Explicit column coordinates approach:
tables:
  - name: table1
    start:
      expression: Description\s+Interval
    end:
      expression: Total in USD
    columns:
      - name: description
        x: 0
        type: string
      - name: interval
        x: 100
        type: string
      - name: quantity
        x: 150
        type: integer
      - name: amount
        x: 200
        type: decimal
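With explicit coordinates, each piece of text is assigned to the column whose left edge is the closest one at or before its X position. A Python sketch of that assignment, using the column edges from the example above; the function, the (x, text) chunk format, and the sample X positions are all hypothetical:

```python
from bisect import bisect_right

# (name, left edge in points), sorted by edge, as in the template above.
COLUMNS = [("description", 0), ("interval", 100), ("quantity", 150), ("amount", 200)]


def assign_columns(chunks, columns=COLUMNS):
    """Place (x, text) chunks of one row into columns by left edge."""
    edges = [edge for _, edge in columns]
    row = {name: "" for name, _ in columns}
    for x, text in chunks:
        index = max(bisect_right(edges, x) - 1, 0)
        name = columns[index][0]
        row[name] = (row[name] + " " + text).strip()
    return row


line = [(5, "Basic Plan"), (102, "Jan 1 - Jan 31"), (160, "1"), (210, "25.00")]
print(assign_columns(line))
```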
Template options.
ocrLanguage - The language for Optical Character Recognition (OCR). Document Parser SDK is shipped with 5 language files, but you can download more languages at https://github.com/bytescout/ocrdata.
Valid values:
eng - English (default)
deu - German
fra - French
spa - Spanish
nld - Dutch
Example:
ocrLanguage: nld
ocrMode - The mode of the Optical Character Recognition (OCR):
auto - OCR will be used only if there is no text on the PDF document page but only raster images.
forced - Force OCR to extract text from both images and fonts. Useful for PDF documents with mixed content (when a portion of the document text is drawn as an image).
repairFonts - Some PDF documents use embedded fonts with a customized charset, making text extraction impossible. This mode will render the entire document and extract the text using OCR.
Example:
ocrMode: forced
If the ocrMode option is not specified, the mode will be defined by the DocumentParser.OCRMode property. See the documentation of Document Parser SDK.
Appendix 1. The expression parameter can contain macros or a regular expression (https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-language-quick-reference). Do not use both macros and regular expressions in the same expression.
Macros:
{{ABC}} - Detects continuous sequences of letters and _ character.
{{ABC+}} - Detects continuous sequences of letters and _-+=/ characters.
{{ABC123}} - Detects continuous sequences of letters, digits, and _ character.
{{ABC123+}} - Detects continuous sequences of letters, digits, and _-+=/ characters.
{{123}} - Detects continuous sequences of digits.
{{123+}} - Detects continuous sequences of digits and _-+=/ characters.
{{DATE}} - Detects short date patterns like the following: 12/31/2019, 31.12.19, 2019-12-31.
{{DATE+}} - Detects long date patterns like the following: Sep 23, 2019, 22 décembre 2010.
{{DECIMAL}} - Detects decimal numbers like the following: 12.34, -123,456.78, 123.456. The decimal separator and group separator are automatically taken from the template culture.
{{MONEY}} - Detects decimal numbers with currency symbol like the following: USD 12.34, $123,456.78, 123.45 €. The decimal separator and group separator are automatically taken from the template culture.
{{ANY}} - Sequence of any characters, including spaces and new lines.
Remarks:
If a macro successfully detects an appropriate character sequence, that sequence is passed to the parsing result for this field. If you use several macros in one expression, only the last one is passed to the result.
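One way to reason about the simpler macros is to think of them as shorthand for character classes. The Python sketch below expands the six letter and digit macros into rough ASCII-only equivalents and returns the last macro's match. The mapping and both helper functions are assumptions for illustration, not the SDK's actual definitions, and the text around the macros is treated as literal.

```python
import re

# Approximate ASCII-only equivalents of the letter/digit macros (assumed).
MACROS = {
    "ABC": r"[A-Za-z_]+",
    "ABC+": r"[A-Za-z_\-+=/]+",
    "ABC123": r"[A-Za-z0-9_]+",
    "ABC123+": r"[A-Za-z0-9_\-+=/]+",
    "123": r"[0-9]+",
    "123+": r"[0-9_\-+=/]+",
}


def macro_to_regex(expression):
    """Expand {{MACRO}} placeholders into capturing groups."""
    parts = re.split(r"\{\{(.+?)\}\}", expression)
    out = []
    for i, part in enumerate(parts):
        if i % 2:  # odd positions hold macro names
            out.append("(" + MACROS[part] + ")")
        else:      # even positions hold literal text
            out.append(re.escape(part))
    return "".join(out)


def extract_with_macros(expression, text):
    """Return the text matched by the last macro, or None."""
    m = re.search(macro_to_regex(expression), text)
    return m.group(m.re.groups) if m else None


text = "Invoice No.: INV-2020/07 due"
print(extract_with_macros("Invoice No.: {{ABC123+}}", text))
print(extract_with_macros("Invoice No.: {{ABC123}}", text))
```

The first call returns INV-2020/07, while the second stops at the hyphen and returns INV, which shows why the + variants are useful for identifiers containing separators.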
Copyright (c) 2018-2020 ByteScout, Inc.