Text Extraction

To extract text from an entire PDF, call the GetText method of the PdfDocument class. To extract text from a single page, call the GetText method of the PdfPage class. Both methods return the extracted text as a string. Examples of both are provided below.

Keep the following points in mind when using either GetText method to extract text from a PDF.

  • Text contained in an image, a form field, or a note/comment is not extracted.
  • Text is extracted in the order in which the PDF operators occur in the existing PDF, which may differ from the visual reading order.
  • In evaluation mode, text extraction is limited to 256 characters.

Extracting from Document

The following example illustrates extracting text from an existing PDF document.

// Create the PDF document object
PdfDocument pdfA = new PdfDocument(pdfFilePath);

// Call the GetText method from PDF document object to get the text from the document
string extractedText = pdfA.GetText();

'Create the PDF document object
Dim pdfA As PdfDocument = New PdfDocument(pdfFilePath)

'Call the GetText method from PDF document object to get the text from the document
Dim extractedText As String = pdfA.GetText()    
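
Because GetText returns an ordinary string, the result can be processed with standard .NET string handling. The following is a minimal sketch and not part of the DynamicPDF API: the ExtractedTextHelper class and its GetNonEmptyLines method are hypothetical names, and the handling of line breaks is an assumption about what the extracted text may contain.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of DynamicPDF). It works on any string,
// such as the value returned by pdfA.GetText().
public static class ExtractedTextHelper
{
    // Splits extracted text into trimmed, non-empty lines.
    public static List<string> GetNonEmptyLines(string extractedText)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(extractedText))
        {
            return lines;
        }

        // Assumption: line breaks may appear as \r\n, \n, or \r.
        string[] parts = extractedText.Split(
            new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);

        foreach (string part in parts)
        {
            string trimmed = part.Trim();
            if (trimmed.Length > 0)
            {
                lines.Add(trimmed);
            }
        }
        return lines;
    }
}
```

With the document example above, a call might look like ExtractedTextHelper.GetNonEmptyLines(extractedText).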

Extracting from Page

To extract text from a specific page, index the Pages property of the PdfDocument instance to get the corresponding PdfPage, then call that page's GetText method. The following code extracts the text from one page of a PDF and adds it to a new page appended to another document.

MergeDocument document = new("DocumentA.pdf");
PdfDocument pdfA = new PdfDocument("doc-text.pdf");
string extractedText = pdfA.Pages[1].GetText();
Page page = new Page(PageSize.Letter);
page.Elements.Add(new TextArea(extractedText, 0, 0, 612, 792));
document.Pages.Add(page);
document.Draw(outputPath);

Dim document As New MergeDocument("DocumentA.pdf")
Dim pdfA As New PdfDocument("doc-text.pdf")
Dim extractedText As String = pdfA.Pages(1).GetText()
Dim page As New Page(PageSize.Letter)
page.Elements.Add(New TextArea(extractedText, 0, 0, 612, 792))
document.Pages.Add(page)
document.Draw(outputPath)
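
The same pattern can be repeated for every page when the text is needed page by page rather than as one string. The following is a sketch under stated assumptions: it assumes that PdfDocument lives in the ceTe.DynamicPDF.Merger namespace and that the Pages property exposes a Count property and a zero-based indexer. The PageTextExample class and PrintTextByPage method are hypothetical names chosen for illustration.

```csharp
using System;
using ceTe.DynamicPDF.Merger; // Assumption: namespace that contains PdfDocument

// Hypothetical example class, not part of DynamicPDF.
public static class PageTextExample
{
    // Prints the extracted text of each page, one page at a time.
    public static void PrintTextByPage(string pdfFilePath)
    {
        PdfDocument pdfA = new PdfDocument(pdfFilePath);

        // Assumption: Pages has a Count property and a zero-based indexer.
        for (int i = 0; i < pdfA.Pages.Count; i++)
        {
            string pageText = pdfA.Pages[i].GetText();
            Console.WriteLine($"--- Page {i + 1} ---");
            Console.WriteLine(pageText);
        }
    }
}
```

Extracting per page keeps the page boundaries, which the document-level GetText call does not expose in its single returned string.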

Extracting from Area on Page

The GetText method of the PdfPage class also has an overload that extracts text from a specific area of a page. The following code illustrates its use.

MergeDocument document = new("DocumentA.pdf");
PdfDocument pdfA = new PdfDocument("doc-text.pdf");
string extractedText = pdfA.Pages[1].GetText(0, 0, 100, 400);
Page page = new Page(PageSize.Letter);
page.Elements.Add(new TextArea(extractedText, 0, 0, 612, 792));
document.Pages.Add(page);
document.Draw(outputPath);

Dim document As New MergeDocument("DocumentA.pdf")
Dim pdfA As New PdfDocument("doc-text.pdf")
Dim extractedText As String = pdfA.Pages(1).GetText(0, 0, 100, 400)
Dim page As New Page(PageSize.Letter)
page.Elements.Add(New TextArea(extractedText, 0, 0, 612, 792))
document.Pages.Add(page)
document.Draw(outputPath)

Figure 1. Extracting text from a page and specifying an area.
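
The area in the example above is given as four numbers. The following sketch assumes that those values are x, y, width, and height measured in points (1/72 inch). That reading is consistent with the 612 by 792 TextArea used for the Letter-size page, but it is not stated in this topic. The PdfUnits class is a hypothetical helper, not part of DynamicPDF, for computing such values from physical measurements.

```csharp
// Hypothetical helper (not part of DynamicPDF) for converting physical
// measurements to points, assuming 72 points per inch.
public static class PdfUnits
{
    public const float PointsPerInch = 72f;

    // Converts inches to points.
    public static float InchesToPoints(float inches)
    {
        return inches * PointsPerInch;
    }

    // Converts millimeters to points (25.4 mm per inch).
    public static float MillimetersToPoints(float millimeters)
    {
        return millimeters / 25.4f * PointsPerInch;
    }
}
```

For example, PdfUnits.InchesToPoints(8.5f) and PdfUnits.InchesToPoints(11f) give 612 and 792, the Letter-size dimensions used in the TextArea above. Whether the area overload of GetText accepts floating-point arguments is an assumption to verify against the API reference.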
