HTMLExtractor Class
ByteScout PDF To HTML SDK
Extracts text and images from a PDF document and creates a formatted HTML page from the extracted data.
Inheritance Hierarchy

System.Object
  Bytescout.PDF2HTML.BaseExtractor
    Bytescout.PDF2HTML.HTMLExtractor

Namespace:  Bytescout.PDF2HTML
Assembly:  Bytescout.PDF2HTML (in Bytescout.PDF2HTML.dll) Version: 9.2.0.3276-local
Syntax

public class HTMLExtractor : BaseExtractor, 
	IHTMLExtractor

The HTMLExtractor type exposes the following members.

Constructors

  Name  Description
Public method HTMLExtractor
Initializes a new instance of the HTMLExtractor class.
Public method HTMLExtractor(String, String)
Initializes a new instance of the HTMLExtractor class.
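Example (a minimal sketch, not taken from the SDK samples): creating an instance with the parameterless constructor and releasing it afterwards. It assumes the project references Bytescout.PDF2HTML.dll and that RegistrationName and RegistrationKey accept strings; the "demo" values are placeholders, not real credentials.

```csharp
using System;
using Bytescout.PDF2HTML;

class CreateExtractorExample
{
    static void Main()
    {
        // Parameterless constructor from the Constructors list.
        HTMLExtractor extractor = new HTMLExtractor();
        try
        {
            // Placeholder registration values (assumption: both properties are strings).
            extractor.RegistrationName = "demo";
            extractor.RegistrationKey = "demo";

            // Version is inherited from BaseExtractor; concatenation works whatever its type.
            Console.WriteLine("Component version: " + extractor.Version);
        }
        finally
        {
            // Dispose releases the unmanaged resources held by the instance.
            extractor.Dispose();
        }
    }
}
```

The try/finally block calls Dispose explicitly, so the sketch does not depend on whether BaseExtractor can be used in a using statement.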
Properties

  Name  Description
Public property AddFontStyleHTMLTagsToText
Controls whether the HTML output adds font style information to text objects. True by default; set to False to output objects as plain text without font size and style defined.
Public property CheckPermissions
Defines whether to respect the permissions set by the document owner. If True, the extractor throws an exception when extraction is prohibited.
(Inherited from BaseExtractor.)
Public property ColumnDetectionMode
Column detection mode.
Public property ControlsAsText
Controls whether form text controls are rendered as plain text objects. False by default; set to True to display controls as text.
Public property DetectHyperLinks
Controls whether URLs are detected and output as clickable links. True by default.
Public property DetectLinesInsteadOfParagraphs (Obsolete)
Tries to detect single lines instead of multiple lines.
Public property DetectNewColumnBySpacesRatio
Table column detection option: defines the space between columns at which text is treated as a new column.
Public property ExtractAnnotations
Gets or sets a value indicating whether to extract text from annotation objects. Default is true.
Public property ExtractColumnByColumn
Gets or sets a value indicating whether to extract text column by column or to follow the visual layout of the text while extracting. False by default. If you are processing PDF newspapers with text columns, set this property to True to get the text column by column instead of line by line.
Public property ExtractInvisibleText
Gets or sets a value indicating whether to extract invisible text from the PDF document.
Public property ExtractionMode
Extraction mode: plain HTML or formatted HTML with CSS.
Public property ExtractShadowLikeText
Gets or sets a value indicating whether to include characters used to create a "shadow" effect (when the same character appears again with some offset) when extracting from the PDF document. True by default (includes all encoded characters regardless of their real appearance).
Public property FontSubstitutionMap
Map used to substitute fonts. You can add new mappings to match a font to another font in the output HTML code.
Public property HighPrecisionTextPositioning
Gets or sets a value indicating whether to use high precision text positioning.
Public property KeepOriginalFontNames
By default, HTMLExtractor replaces the names of embedded fonts with standard (or "descendant") fonts that are similar in metrics and typeface. This is because embedded fonts may differ from the fonts installed on your system or may not be installed at all. Set this property to True to keep the original font names.
Public property LineGroupingMode
Sets how lines are grouped into paragraphs. Default: no line grouping is performed.
Public property OptimizeImages
Gets or sets a value indicating whether to optimize images. True by default.
Public property OutputImageFormat
Defines the format of output images. Default is PNG (with transparency). If you do not need transparency support and want smaller images (so the page loads faster), set this property to OutputImageFormat.JPEG.
Public property OutputPageWidth
Gets or sets the width (in pixels) of the output pages rendered into HTML. The default output width is 1024; the height is calculated from the aspect ratio of the original PDF pages.
Public property PageDataCaching
Controls page data caching behavior.
Public property Password
PDF document owner password.
(Inherited from BaseExtractor.)
Public property PreserveFormattingOnTextExtraction
Gets or sets a value indicating whether to preserve the text formatting on extraction.
Public property RegistrationKey
Registration key.
(Inherited from BaseExtractor.)
Public property RegistrationName
Registration name.
(Inherited from BaseExtractor.)
Public property RemoveHyphenation
Gets or sets a value indicating whether to automatically remove hyphenation at the ends of lines (works when Unwrap is True).
Public property SaveImages
Gets or sets the image handling mode (skip, embed, or save to an external file).
Public property TrimSpaces
Gets or sets a value indicating whether to remove leading and trailing spaces from table cell values.
Public property Unwrap
Gets or sets a value indicating whether to unwrap lines into single lines (especially useful in the column layout mode; see the ExtractColumnByColumn property). Default is False.
Public property Version
Gets the component version number.
(Inherited from BaseExtractor.)
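Example (a hedged sketch): configuring an extractor for a multi-column document such as a PDF newspaper, using only properties from the list above. It assumes the boolean properties take true/false, that OutputPageWidth is an integer number of pixels, and that the OutputImageFormat enumeration lives in the Bytescout.PDF2HTML namespace; the width of 800 is an arbitrary illustrative value.

```csharp
using Bytescout.PDF2HTML;

class ConfigureExtractorExample
{
    static void Main()
    {
        HTMLExtractor extractor = new HTMLExtractor();
        try
        {
            // Read multi-column pages column by column instead of line by line.
            extractor.ExtractColumnByColumn = true;

            // Join wrapped lines into single lines; RemoveHyphenation works when Unwrap is true.
            extractor.Unwrap = true;
            extractor.RemoveHyphenation = true;

            // Trade transparency support for smaller image files.
            extractor.OutputImageFormat = OutputImageFormat.JPEG;

            // Assumption: the width is an integer pixel count (the documented default is 1024).
            extractor.OutputPageWidth = 800;
        }
        finally
        {
            // Release the resources held by the instance.
            extractor.Dispose();
        }
    }
}
```

Properties left untouched keep their documented defaults, for example DetectHyperLinks stays True and Unwrap would otherwise stay False.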
Methods

  Name  Description
Public method Dispose
Releases the unmanaged resources used by the instance and optionally releases the managed resources.
(Inherited from BaseExtractor.)
Public method DisposePage
Disposes the page object. Use this method carefully: it destroys a page object that should not be used afterwards. Useful for freeing allocated memory when processing huge PDF documents.
Public method Equals
Determines whether the specified Object is equal to the current Object.
(Inherited from Object.)
Protected method Finalize
Allows an object to try to free resources and perform other cleanup operations before it is reclaimed by garbage collection.
(Inherited from Object.)
Protected method FireParsingError
(Inherited from BaseExtractor.)
Public method GetHashCode
Serves as a hash function for a particular type.
(Inherited from Object.)
Public method GetHTML
Extracts HTML from the whole document.
Public method GetHTML(Int32, Int32)
Extracts HTML text from the specified page range.
Public method GetHTMLPage
Extracts HTML from the specified document page.
Public method GetOutputHTMLPageHeight
Gets the height of the output page rendered in HTML format.
Public method GetPageCount
Returns the document page count.
(Inherited from BaseExtractor.)
Public method GetPageHeight
Gets the height of the original PDF page (in PDF units).
Public method GetPageRect_Height
Gets the specified page height.
(Inherited from BaseExtractor.)
Public method GetPageRect_Left
Gets the specified page left coordinate.
(Inherited from BaseExtractor.)
Public method GetPageRect_Top
Gets the specified page top coordinate.
(Inherited from BaseExtractor.)
Public method GetPageRect_Width
Gets the specified page width.
(Inherited from BaseExtractor.)
Public method GetPageRectangle
Gets the page rectangle.
(Inherited from BaseExtractor.)
Public method GetPageWidth
Gets the width of the original PDF page (in PDF units).
Public method GetType
Gets the Type of the current instance.
(Inherited from Object.)
Public method LoadDocumentFromFile
Loads a document from a file. Supported formats: PDF, PNG, JPEG, BMP and TIFF (single-page). Call the Reset() method before loading another file into the same instance so that the lock on the current file is released.
(Inherited from BaseExtractor.)
Public method LoadDocumentFromStream
Loads a document from a stream. Supported formats: PDF, PNG, JPEG, BMP and TIFF (single-page).
(Inherited from BaseExtractor.)
Protected method MemberwiseClone
Creates a shallow copy of the current Object.
(Inherited from Object.)
Public method Reset
Resets the instance and disposes internal resources. Also automatically invoked by Dispose.
(Overrides BaseExtractor.Reset.)
Public method ResetExtractionArea
Resets the extraction area to the full page.
(Inherited from BaseExtractor.)
Public method SaveHtmlPageToFile
Saves HTML from the specified page to a file.
Public method SaveHtmlPageToStream
Saves HTML from the specified page to a stream.
Public method SaveHtmlToFile(String)
Saves HTML text to a file.
Public method SaveHtmlToFile(Int32, Int32, String)
Saves HTML text from the specified page range to a file.
Public method SaveHtmlToStream(Stream)
Saves HTML text to a stream.
Public method SaveHtmlToStream(Int32, Int32, Stream)
Saves HTML text from the specified page range to a stream.
Public method SetExtractionArea(RectangleF)
Sets the extraction area by rectangle.
(Inherited from BaseExtractor.)
Public method SetExtractionArea(Single, Single, Single, Single)
Sets the extraction area by coordinates and dimensions.
(Inherited from BaseExtractor.)
Public method ToString
Returns a string that represents the current object.
(Inherited from Object.)
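Example (a sketch under stated assumptions): a typical load, convert, and reset sequence built from the methods listed above. The file names are hypothetical, LoadDocumentFromFile is assumed to take a file path string, GetPageCount is assumed to return an Int32, and the parameterless GetHTML is assumed to return the HTML as a string.

```csharp
using System;
using Bytescout.PDF2HTML;

class ConvertDocumentExample
{
    static void Main()
    {
        HTMLExtractor extractor = new HTMLExtractor();
        try
        {
            // "input.pdf" is a hypothetical path used for illustration.
            extractor.LoadDocumentFromFile("input.pdf");

            // Assumption: GetPageCount returns an Int32.
            int pageCount = extractor.GetPageCount();
            Console.WriteLine("Pages: " + pageCount);

            // Convert the whole document into a single HTML file.
            extractor.SaveHtmlToFile("output.html");

            // Reset releases the lock on input.pdf before another file is loaded.
            extractor.Reset();
            extractor.LoadDocumentFromFile("second.pdf");

            // Assumption: GetHTML() returns the document HTML as a string.
            string html = extractor.GetHTML();
            Console.WriteLine("HTML length: " + html.Length);
        }
        finally
        {
            // Dispose also invokes Reset, so no separate call is needed here.
            extractor.Dispose();
        }
    }
}
```

The page-range overloads, such as SaveHtmlToFile(Int32, Int32, String), are left out because this page does not state whether page indexes start at 0 or 1.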
Events

  Name  Description
Public event ParsingError
Raised on PDF document parsing errors. This usually indicates a damaged document.
(Inherited from BaseExtractor.)
Public event PasswordRequired
Occurs when a password is required to decrypt the document.
(Inherited from BaseExtractor.)