com.asprise.util.pdf
Class PDFReader

java.lang.Object
  extended by com.asprise.util.pdf.PDFReader

public class PDFReader
extends java.lang.Object

Represents a PDF reader with image rendering and text extraction features. Basic flow:

 PDFReader reader = new PDFReader(new File("my.pdf"));
 reader.open(); // open the file.
 int pages = reader.getNumberOfPages();

 for (int i = 0; i < pages; i++) {
     String text = reader.extractTextFromPage(i);
     System.out.println("Page " + i + ": " + text);
 }

 ... // perform other operations on pages.

 reader.close(); // finally, close the file.
 
Main features:
  1. getPageAsImage(pageIndex) - returns the specified page as a BufferedImage
  2. savePageAsImageFile(pageIndex, format, file) - saves an individual page to an image file
  3. extractTextFromPage(pageIndex) - extracts all the text content of a given page
  4. getPageSize(pageIndex) - returns the dimensions of a page
  5. getNumberOfPages() - returns the total number of pages
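
The basic flow above leaves the file open if text extraction throws partway through. A minimal sketch of the same loop with close() moved into a finally block follows. So that it compiles without the Asprise jar, it is written against a hypothetical PageSource interface that mirrors the four documented PDFReader methods it uses; PageSource, PageTextDump and readAllPages are illustrative names, not part of the library.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class PageTextDump {

    /** Hypothetical stand-in mirroring the documented PDFReader methods used below. */
    public interface PageSource {
        void open() throws IOException;
        int getNumberOfPages();
        String extractTextFromPage(int pageIndex) throws IOException;
        void close() throws IOException;
    }

    /** Opens the source, collects the text of every page, and always closes it. */
    public static List<String> readAllPages(PageSource source) throws IOException {
        List<String> pages = new ArrayList<>();
        source.open();
        try {
            int count = source.getNumberOfPages();
            for (int i = 0; i < count; i++) {   // page indexes are zero-based
                pages.add(source.extractTextFromPage(i));
            }
        } finally {
            source.close();   // runs even if extraction fails midway
        }
        return pages;
    }
}
```

With the real class, the same try/finally shape can be applied directly to a PDFReader instance in place of the stand-in.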


Constructor Summary
PDFReader(java.io.File pdfFile)
          Creates a new PDF reader for the given PDF file.
PDFReader(java.io.InputStream pdfStream)
          Creates a new PDF reader with the specified stream as the input.
 
Method Summary
 void close()
          Closes the PDF and releases resources used.
 java.lang.String extractTextFromPage(int pageIndex)
          Extracts text from the specified page.
 int getNumberOfPages()
          Returns the total number of pages in the PDF.
 java.awt.image.BufferedImage getPageAsImage(int pageIndex)
          Renders the specified page as a buffered image.
 java.awt.Rectangle getPageSize(int pageIndex)
          Returns the page size of the specified page.
 PDFSecurityObject getSecurityObject()
          Returns the security object.
 static void main(java.lang.String[] args)
          A utility that extracts text from a PDF file.
 void open()
          Opens and parses the PDF content.
 void savePageAsImageFile(int pageIndex, java.lang.String formatName, java.io.File output)
          Saves the specified page as an image file with the given format.
 void setSecurityObject(PDFSecurityObject securityObject)
          If the PDF is encrypted, you need to supply a security object to 'unlock' the PDF before open().
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PDFReader

public PDFReader(java.io.InputStream pdfStream)
Creates a new PDF reader with the specified stream as the input.

Parameters:
pdfStream - the input stream to read the PDF from

PDFReader

public PDFReader(java.io.File pdfFile)
          throws java.io.FileNotFoundException
Creates a new PDF reader for the given PDF file.

Parameters:
pdfFile - the PDF file to read
Throws:
java.io.FileNotFoundException

Method Detail

open

public void open()
          throws java.io.IOException
Opens and parses the PDF content.

This method may throw an exception when the PDF page is too complex to rasterize (for example, one that uses a Type 0 font). In that case, you can use this free utility.

Throws:
java.io.IOException

close

public void close()
           throws java.io.IOException
Closes the PDF and releases resources used.

Throws:
java.io.IOException

getNumberOfPages

public int getNumberOfPages()
Returns the total number of pages in the PDF.

Returns:
the total number of pages in the PDF.

getPageAsImage

public java.awt.image.BufferedImage getPageAsImage(int pageIndex)
                                            throws java.io.IOException
Renders the specified page as a buffered image.

This method may throw an exception when the PDF page is too complex to rasterize (for example, one that uses a Type 0 font). In that case, you can use this free utility.

Parameters:
pageIndex - zero-based page index, i.e., the first page is page 0.
Returns:
the rendered page as a BufferedImage.
Throws:
java.io.IOException
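
Because getPageAsImage returns a standard java.awt.image.BufferedImage, the JDK's javax.imageio.ImageIO can encode it when bytes are needed rather than a file on disk. A minimal sketch follows; PageImageEncoder and toPngBytes are illustrative names, and any BufferedImage can stand in for a rendered page.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import javax.imageio.ImageIO;

public class PageImageEncoder {

    /** Encodes an image (for example one returned by getPageAsImage) as PNG bytes. */
    public static byte[] toPngBytes(BufferedImage image) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // ImageIO.write returns false when no writer is registered for the format.
        if (!ImageIO.write(image, "png", out)) {
            throw new IOException("No PNG writer available");
        }
        return out.toByteArray();
    }
}
```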

savePageAsImageFile

public void savePageAsImageFile(int pageIndex,
                                java.lang.String formatName,
                                java.io.File output)
                         throws java.io.IOException
Saves the specified page as an image file with the given format.

This method may throw an exception when the PDF page is too complex to rasterize (for example, one that uses a Type 0 font). In that case, you can use this free utility.

Parameters:
pageIndex - zero-based page index, i.e., the first page is page 0.
formatName - the image format; valid values are "gif", "jpeg", "png"
output - the file to write the image to
Throws:
java.io.IOException
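
Since formatName accepts only "gif", "jpeg" and "png", callers that start from an output file name may want to derive the format from its extension. A minimal sketch follows; ImageFormatNames and formatNameFor are illustrative names, and treating ".jpg" as "jpeg" is an assumption of this sketch rather than documented library behavior.

```java
import java.io.File;
import java.util.Locale;

public class ImageFormatNames {

    /** Maps an output file's extension to one of the documented format names. */
    public static String formatNameFor(File output) {
        String name = output.getName().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        String ext = dot < 0 ? "" : name.substring(dot + 1);
        switch (ext) {
            case "gif":
                return "gif";
            case "png":
                return "png";
            case "jpg":   // assumption: common alias for the documented "jpeg"
            case "jpeg":
                return "jpeg";
            default:
                throw new IllegalArgumentException(
                        "Unsupported image extension: " + output.getName());
        }
    }
}
```

The result can then be passed as the second argument, e.g. reader.savePageAsImageFile(0, ImageFormatNames.formatNameFor(out), out).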

extractTextFromPage

public java.lang.String extractTextFromPage(int pageIndex)
                                     throws java.io.IOException
Extracts text from the specified page. If the text content is stored in image objects in the PDF, text extraction may fail for such pages. In that case, use getPageAsImage() and feed the resulting image to the Asprise OCR engine.

Parameters:
pageIndex - zero-based page index, i.e., the first page is page 0.
Returns:
the text content extracted.
Throws:
java.io.IOException
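
One way to decide which pages to route to OCR is to check whether extraction produced any visible text. The sketch below assumes that an image-only page yields null, empty or whitespace-only text; this page does not state what extractTextFromPage returns in that case, so treat the rule as a heuristic. OcrFallback and pagesNeedingOcr are illustrative names.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrFallback {

    /**
     * Returns the zero-based indexes of pages whose extracted text is null,
     * empty, or whitespace only (assumed here to indicate image-only pages).
     */
    public static List<Integer> pagesNeedingOcr(List<String> pageTexts) {
        List<Integer> indexes = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            String text = pageTexts.get(i);
            if (text == null || text.trim().isEmpty()) {
                indexes.add(i);
            }
        }
        return indexes;
    }
}
```

Each returned index can be passed to getPageAsImage() to obtain the image to hand to the OCR engine.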

getPageSize

public java.awt.Rectangle getPageSize(int pageIndex)
Returns the page size of the specified page.

Parameters:
pageIndex - zero-based page index, i.e., the first page is page 0.
Returns:
the size of the page.
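
The returned java.awt.Rectangle can be used to compute a target size for thumbnails while keeping the page's aspect ratio. A minimal sketch follows; it uses only the rectangle's width and height and makes no assumption about the unit they are expressed in, which this page does not state. PageSizes and fitWithin are illustrative names.

```java
import java.awt.Dimension;
import java.awt.Rectangle;

public class PageSizes {

    /** Scales a page size so its longer side equals maxSide, preserving aspect ratio. */
    public static Dimension fitWithin(Rectangle pageSize, int maxSide) {
        if (pageSize.width <= 0 || pageSize.height <= 0 || maxSide <= 0) {
            throw new IllegalArgumentException("Sizes must be positive");
        }
        double scale = (double) maxSide / Math.max(pageSize.width, pageSize.height);
        // Clamp to at least one pixel so extreme aspect ratios never round to zero.
        int width = Math.max(1, (int) Math.round(pageSize.width * scale));
        int height = Math.max(1, (int) Math.round(pageSize.height * scale));
        return new Dimension(width, height);
    }
}
```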

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception
A utility that extracts text from a PDF file. Usage: java com.asprise.util.pdf.PDFReader [pdf file]

Parameters:
args - command-line arguments; the path of the PDF file to extract text from
Throws:
java.lang.Exception

getSecurityObject

public PDFSecurityObject getSecurityObject()
Returns the security object.

Returns:
the security object.

setSecurityObject

public void setSecurityObject(PDFSecurityObject securityObject)
If the PDF is encrypted, you need to supply a security object to 'unlock' the PDF before calling open().

Parameters:
securityObject - the security object used to unlock the PDF