Content Extraction

To extract existing content from a PDF document, use the PdfDocument and PdfPage classes.

Refer to the Extracting Text documentation page for more information on extracting text from a PDF.

You can extract content from an existing PDF by loading it into a PdfDocument instance and then examining or extracting its content. The PdfDocument class has methods and properties for retrieving images, attachments, document metadata, text, fonts, and more.

The PdfDocument class also has a Pages property that contains the document's pages as PdfPage instances.

Extracting from Document

The following example extracts an attached file and a bookmark (outline) from an existing PDF document, then adds them, along with one of the document's pages, to a MergeDocument instance to create a new PDF document.

// Load the source PDF, then extract an attachment and an outline (bookmark).
PdfDocument pdfDoc = new PdfDocument("DocumentB.pdf");
Attachment attachment = pdfDoc.GetAttachments()[0];
EmbeddedFile embFile = new(attachment.GetData(), attachment.Filename, DateTime.Now);
PdfOutline outline = pdfDoc.Outlines[1];

// Build a new document from an imported page plus the extracted attachment and outline.
MergeDocument document = new MergeDocument();
document.Pages.Add(new ImportedPage(pdfDoc.Pages[1]));
document.EmbeddedFiles.Add(embFile);
document.Outlines.Add(outline);
document.Draw(outputPath);

' Load the source PDF, then extract an attachment and an outline (bookmark).
Dim pdfDoc As New PdfDocument("DocumentB.pdf")
Dim attachment As Attachment = pdfDoc.GetAttachments()(0)
Dim embFile As New EmbeddedFile(attachment.GetData(), attachment.Filename, DateTime.Now)
Dim outline As PdfOutline = pdfDoc.Outlines(1)

' Build a new document from an imported page plus the extracted attachment and outline.
Dim document As New MergeDocument()
document.Pages.Add(New ImportedPage(pdfDoc.Pages(1)))
document.EmbeddedFiles.Add(embFile)
document.Outlines.Add(outline)
document.Draw(outputPath)

ExtractingExistingContent.cs

ExtractingExistingContent.vb

Note that in the example above, the MergeDocument cannot add a PdfPage instance directly; the example first constructs an ImportedPage instance from the PdfPage instance.
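
Extracted attachments do not have to go into a new PDF. As a minimal sketch, the following writes every attachment in a document to disk using only the GetAttachments, GetData, and Filename members shown above. It assumes that GetAttachments returns an enumerable collection, that GetData returns a byte array, and that the classes live in the ceTe.DynamicPDF.Merger namespace; the outputDirectory name is hypothetical.

```csharp
using System.IO;
using ceTe.DynamicPDF.Merger; // assumed namespace for PdfDocument and Attachment

// Hypothetical output folder for the extracted files.
string outputDirectory = "attachments";
Directory.CreateDirectory(outputDirectory);

PdfDocument pdfDoc = new PdfDocument("DocumentB.pdf");

// Assumes GetAttachments() returns an enumerable collection of Attachment.
foreach (Attachment attachment in pdfDoc.GetAttachments())
{
    // Keep only the file name so a stored path cannot escape the output folder.
    string target = Path.Combine(outputDirectory, Path.GetFileName(attachment.Filename));

    // Assumes GetData() returns the attachment's raw bytes (byte[]).
    File.WriteAllBytes(target, attachment.GetData());
}
```

If two attachments share a file name, the later one overwrites the earlier one in this sketch.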

Extracting from Page

To extract content from a specific page, use the PdfPage class. The following code extracts an image from a PdfPage and then adds the image to a page in a MergeDocument instance.

// Load the source PDFs and create a blank page to hold the extracted image.
MergeDocument document = new();
PdfDocument pdfDoc = new PdfDocument("DocumentB.pdf");
PdfDocument pdfDoc2 = new("DocumentA.pdf");
Page page = new Page(PageSize.Letter);

// Take the first image reported for the page and wrap its data in an Image element.
PdfPage pdfPage = pdfDoc.GetPage(1);
ImageInformation imageInfo = pdfPage.GetImages()[0];
Image image = new Image(imageInfo.GetImage().Data, 0, 0, .5F);

// Add a caption, then assemble the output: the new page followed by DocumentA.pdf.
Label lbl = new Label("Extracted Image", 10, 400, 600, 0);
lbl.FontSize = 24;
lbl.TextColor = RgbColor.Navy;
page.Elements.Add(image);
page.Elements.Add(lbl);
document.Pages.Add(page);
document.Append(pdfDoc2);
document.Draw(outputPath);

' Load the source PDFs and create a blank page to hold the extracted image.
Dim document As New MergeDocument()
Dim pdfDoc As New PdfDocument("DocumentB.pdf")
Dim pdfDoc2 As New PdfDocument("DocumentA.pdf")
Dim page As New Page(PageSize.Letter)

' Take the first image reported for the page and wrap its data in an Image element.
Dim pdfPage As PdfPage = pdfDoc.GetPage(1)
Dim imageInfo As ImageInformation = pdfPage.GetImages()(0)
Dim image As New Image(imageInfo.GetImage().Data, 0, 0, 0.5F)

' Add a caption, then assemble the output: the new page followed by DocumentA.pdf.
Dim lbl As New Label("Extracted Image", 10, 400, 600, 0)
lbl.FontSize = 24
lbl.TextColor = RgbColor.Navy
page.Elements.Add(image)
page.Elements.Add(lbl)
document.Pages.Add(page)
document.Append(pdfDoc2)
document.Draw(outputPath)

Note that PdfPage.GetImages returns ImageInformation instances, not Image instances. The example above constructs an Image from the data returned by ImageInformation.GetImage().
