The HTMLExtractor type exposes the following members.
Controls whether HTML output adds font style information to text objects. True by default; set to False to output objects as plain text with no font size or style defined.
Defines whether to respect permissions set by the document owner. If True, the extractor throws an exception when extraction is prohibited. (Inherited from BaseExtractor.)
Column detection mode.
Controls whether form text controls are rendered as plain text objects. False by default; set to True to display controls as text.
Controls whether URL links are detected and output as clickable links. True by default.
DetectLinesInsteadOfParagraphs: Obsolete. Tries to detect single lines instead of multiple lines.
Table column detection option: defines the minimum space between columns required to detect text as a new column.
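As a rough illustration of gap-based column detection, the following Python sketch splits the words of one text line into columns wherever the horizontal gap reaches a threshold. It is not the extractor's actual algorithm; the function name `split_into_columns`, the `(x_start, x_end, text)` word tuples, and the `min_gap` parameter are all assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical word representation: (x_start, x_end, text), in arbitrary units.
Word = Tuple[float, float, str]


def split_into_columns(words: List[Word], min_gap: float) -> List[str]:
    """Group words into columns; a gap of at least min_gap starts a new column."""
    columns: List[str] = []
    current: List[str] = []
    prev_end = None
    for x_start, x_end, text in sorted(words):  # left-to-right order
        if prev_end is not None and x_start - prev_end >= min_gap:
            # Gap is wide enough: close the current column and start a new one.
            columns.append(" ".join(current))
            current = []
        current.append(text)
        prev_end = x_end
    if current:
        columns.append(" ".join(current))
    return columns


words = [(0, 30, "Item"), (34, 60, "name"), (120, 150, "12.50")]
print(split_into_columns(words, min_gap=20))
```

A smaller threshold splits text into more columns; a larger one merges neighboring cells.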
Gets or sets a value indicating whether to extract text from annotation objects. Default is true.
Gets or sets a value indicating whether to extract text column by column or to follow the visual layout of the text. False by default. If you are processing PDF newspapers with text columns, set this property to True to get the text column by column instead of line by line.
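The difference between the two reading orders can be sketched in Python as follows. This is only a conceptual model under simplifying assumptions (two columns, a known boundary); the `(x, y, text)` chunk tuples, both function names, and the `column_boundary` parameter are hypothetical and not part of the extractor's API.

```python
from typing import List, Tuple

# Hypothetical text chunk: (x, y, text); y grows downward on the page.
Chunk = Tuple[float, float, str]


def line_by_line(chunks: List[Chunk]) -> List[str]:
    """Visual layout order: top to bottom, then left to right."""
    return [text for _, _, text in sorted(chunks, key=lambda c: (c[1], c[0]))]


def column_by_column(chunks: List[Chunk], column_boundary: float) -> List[str]:
    """Read the whole left column first, then the right column."""
    def key(chunk: Chunk):
        x, y, _ = chunk
        column = 0 if x < column_boundary else 1
        return (column, y, x)
    return [text for _, _, text in sorted(chunks, key=key)]


chunks = [(0, 0, "A1"), (300, 0, "B1"), (0, 20, "A2"), (300, 20, "B2")]
print(line_by_line(chunks))           # mixes the two columns
print(column_by_column(chunks, 150))  # keeps each column together
```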
Gets or sets a value indicating whether to extract invisible text from the PDF document.
Extraction mode: plain HTML or formatted HTML with CSS.
Gets or sets a value indicating whether to include characters used to create a "shadow" effect (the same character repeated with a small offset) in the PDF document. True by default (includes all encoded characters regardless of their actual appearance).
Map to substitute fonts. You can add new mappings to match a font to another font in output HTML code.
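Conceptually, a font substitution map is a lookup from a source font name to the name written into the output. The Python sketch below shows that idea only; the font names, the `substitute_font` helper, and the fall-through behavior for unmapped fonts are assumptions for illustration, not the extractor's documented behavior.

```python
from typing import Dict


def substitute_font(name: str, font_map: Dict[str, str]) -> str:
    """Return the mapped font name, or the original name if no mapping exists."""
    return font_map.get(name, name)


# Hypothetical mappings: PDF font name -> font name to use in the HTML output.
font_map = {
    "ABCDEF+Helvetica": "Arial",
    "Times-Roman": "Times New Roman",
}

print(substitute_font("ABCDEF+Helvetica", font_map))
print(substitute_font("Courier", font_map))  # no mapping, name is unchanged
```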
Gets or sets a value indicating whether to use the high precision text positioning.
By default, HTMLExtractor replaces the names of embedded fonts with standard (or "descendant") fonts that have similar metrics and typeface. This is because embedded fonts may differ from the fonts installed on your system or may be absent from it entirely. Set this property to True if you want to keep the original font names.
Sets how lines are grouped into paragraphs. Default: no line grouping is performed.
Gets or sets a value indicating whether to optimize images. True by default.
Defines the format for output images. Default is PNG (with transparency). If you do not need transparency support and want smaller images (so the page loads faster), set this property to OutputImageFormat.JPEG.
Gets or sets the width (in pixels) of the output pages rendered into HTML. The default output width is 1024; the height is calculated from the aspect ratio of the original PDF pages.
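The ratio-based height calculation can be illustrated with a short Python sketch. The function name, the rounding to whole pixels, and the US Letter page size used in the example are assumptions; the extractor's exact rounding is not stated here.

```python
def output_height(output_width_px: int, page_width_pt: float, page_height_pt: float) -> int:
    """Scale the page height so the output keeps the original page's aspect ratio."""
    return round(output_width_px * page_height_pt / page_width_pt)


# Example: a US Letter page (612 x 792 points) rendered at the default width of 1024 px.
print(output_height(1024, 612, 792))
```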
Controls page data caching behavior.
PDF document owner password. (Inherited from BaseExtractor.)
Gets or sets a value indicating whether to preserve the text formatting during extraction.
Registration key. (Inherited from BaseExtractor.)
Registration name. (Inherited from BaseExtractor.)
Gets or sets a value indicating whether to automatically remove hyphenation at the ends of lines (works when Unwrap is True).
Gets or sets the image handling mode (skip, embed, or save to an external file).
Gets or sets a value indicating whether to remove leading and trailing spaces from table cell values.
Gets or sets a value indicating whether to unwrap wrapped lines into single lines (especially useful in column layout mode; see the ExtractColumnByColumn property). Default is False.
Gets the component version number. (Inherited from BaseExtractor.)
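To show what unwrapping and end-of-line hyphenation removal mean in practice, here is a minimal Python sketch. It is a naive model, not the extractor's implementation: the `unwrap` function and its `remove_hyphenation` flag are hypothetical, and it treats every trailing hyphen as a line-break hyphen, so legitimately hyphenated words split across lines would also be joined.

```python
from typing import List


def unwrap(lines: List[str], remove_hyphenation: bool = False) -> str:
    """Join wrapped lines into one line, optionally removing end-of-line hyphens."""
    result = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if not result:
            result = line
        elif remove_hyphenation and result.endswith("-"):
            # Drop the hyphen and join the two word halves without a space.
            result = result[:-1] + line
        else:
            result += " " + line
    return result


lines = ["The extrac-", "tor reads", "the page."]
print(unwrap(lines))                           # hyphen is kept
print(unwrap(lines, remove_hyphenation=True))  # word halves are rejoined
```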