PDFReader (Asprise PDF Reader/Writer Library)

com.asprise.util.pdf
Class PDFReader

java.lang.Object
  com.asprise.util.pdf.PDFReader

public class PDFReader
extends java.lang.Object

Represents a PDF reader with image rendering and text extraction features. Basic flow:

PDFReader reader = new PDFReader(new File("my.pdf"));
reader.open(); // open the file
int pages = reader.getNumberOfPages();

for (int i = 0; i < pages; i++) {
    String text = reader.extractTextFromPage(i);
    System.out.println("Page " + i + ": " + text);
}

... // perform other operations on pages

reader.close(); // finally, close the file

Main features:

getPageAsImage(pageIndex) - returns the specified page as a BufferedImage
savePageAsImageFile(pageIndex, format, file) - saves an individual page to an image file
extractTextFromPage(pageIndex) - extracts all the text content on a given page
getPageSize(pageIndex) - returns the dimensions of a page
getNumberOfPages() - returns the total number of pages

Constructor Summary
`PDFReader(java.io.File pdfFile)` Creates a new PDF reader for the given PDF file.
`PDFReader(java.io.InputStream pdfStream)` Creates a new PDF reader with the specified stream as the input.

Method Summary
`void`	`close()` Closes the PDF and releases resources used.
`java.lang.String`	`extractTextFromPage(int pageIndex)` Extracts text from the specified page.
`int`	`getNumberOfPages()` Returns the total number of pages in the PDF.
`java.awt.image.BufferedImage`	`getPageAsImage(int pageIndex)` Renders the specified page as a buffered image.
`java.awt.Rectangle`	`getPageSize(int pageIndex)` Returns the page size of the specified page.
`PDFSecurityObject`	`getSecurityObject()` Returns the security object.
`static void`	`main(java.lang.String[] args)` A utility that extracts text from a PDF file.
`void`	`open()` Opens and parses the PDF content.
`void`	`savePageAsImageFile(int pageIndex, java.lang.String formatName, java.io.File output)` Saves the specified page as an image file with the given format.
`void`	`setSecurityObject(PDFSecurityObject securityObject)` If the PDF is encrypted, you need to supply a security object to 'unlock' the PDF before open().

Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail

PDFReader

public PDFReader(java.io.InputStream pdfStream)

Creates a new PDF reader with the specified stream as the input.
Parameters: pdfStream - the input stream to read the PDF content from.

PDFReader

public PDFReader(java.io.File pdfFile)
          throws java.io.FileNotFoundException

Creates a new PDF reader for the given PDF file.
Parameters: pdfFile - the PDF file to read.
Throws: java.io.FileNotFoundException

Method Detail

open

public void open()
          throws java.io.IOException

Opens and parses the PDF content.

This method may throw an exception when a PDF page is too complex to rasterize (for example, when it uses Type 0 fonts). In that case, you can use this free utility.

Throws: java.io.IOException

close

public void close()
           throws java.io.IOException

Closes the PDF and releases resources used.

Throws: java.io.IOException
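
Because close() releases the resources held by the reader, it is usually worth guaranteeing that it runs even when page processing fails. The following is a minimal sketch of that pattern, not part of the library: `PdfHandle`, `CloseGuard` and `Work` are hypothetical names introduced here, and `PdfHandle` only mirrors the documented open() and close() signatures so that the example compiles without the Asprise JAR.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in mirroring the documented open()/close() methods of PDFReader. */
interface PdfHandle {
    void open() throws IOException;
    void close() throws IOException;
}

class CloseGuard {
    /** Work to perform while the PDF is open. */
    interface Work {
        void run() throws IOException;
    }

    /** Opens the handle, runs the work, and always calls close(), even if the work throws. */
    static void withOpen(PdfHandle handle, Work work) throws IOException {
        handle.open();
        try {
            work.run();
        } finally {
            handle.close();
        }
    }

    /** Runs withOpen against an in-memory fake and returns the order of calls. */
    static List<String> demo(boolean fail) {
        List<String> log = new ArrayList<>();
        PdfHandle fake = new PdfHandle() {
            public void open() { log.add("open"); }
            public void close() { log.add("close"); }
        };
        try {
            withOpen(fake, () -> {
                log.add("work");
                if (fail) {
                    throw new IOException("simulated failure");
                }
            });
        } catch (IOException e) {
            log.add("caught");
        }
        return log;
    }
}
```

With a real reader, the same try/finally shape applies directly: call reader.open(), do the page work inside `try`, and call reader.close() inside `finally`.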

getNumberOfPages

public int getNumberOfPages()

Returns the total number of pages in the PDF.

Returns: the total number of pages in the PDF.

getPageAsImage

public java.awt.image.BufferedImage getPageAsImage(int pageIndex)
                                            throws java.io.IOException

Renders the specified page as a buffered image.

This method may throw an exception when the PDF page is too complex to rasterize (for example, when it uses Type 0 fonts). In that case, you can use this free utility.

Parameters: pageIndex - zero-based page index, i.e., the first page is page 0.
Returns: the rendered page as a BufferedImage.
Throws: java.io.IOException
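
Since the page comes back as a standard BufferedImage, it can be post-processed with ordinary Java 2D calls. As an illustration, the sketch below scales an image to a target width while keeping its aspect ratio, which is a common way to produce page thumbnails. `PageThumbnails` and `scaleToWidth` are hypothetical helper names, and only JDK classes are used.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

class PageThumbnails {
    /** Scales an image to the given width, preserving its aspect ratio. */
    static BufferedImage scaleToWidth(BufferedImage page, int targetWidth) {
        if (targetWidth <= 0) {
            throw new IllegalArgumentException("targetWidth must be positive");
        }
        // Derive the height from the source aspect ratio; never go below 1 pixel.
        double scale = (double) targetWidth / page.getWidth();
        int targetHeight = Math.max(1, (int) Math.round(page.getHeight() * scale));

        BufferedImage out = new BufferedImage(targetWidth, targetHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = out.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                               RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(page, 0, 0, targetWidth, targetHeight, null);
        } finally {
            g.dispose(); // release the graphics context
        }
        return out;
    }
}
```

A typical call would be `PageThumbnails.scaleToWidth(reader.getPageAsImage(0), 200)`.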

savePageAsImageFile

public void savePageAsImageFile(int pageIndex,
                                java.lang.String formatName,
                                java.io.File output)
                         throws java.io.IOException

Saves the specified page as an image file with the given format.

This method may throw an exception when the PDF page is too complex to rasterize (for example, when it uses Type 0 fonts). In that case, you can use this free utility.

Parameters:
    pageIndex - zero-based page index, i.e., the first page is page 0.
    formatName - valid values are "gif", "jpeg", "png".
    output - the file to write the image to.
Throws: java.io.IOException
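
Only three format names are documented as valid, so callers that take a format from user input or a file extension may want to normalize it first. The sketch below is one possible way to do that. `ImageFormats` and `normalize` are hypothetical names, and treating "jpg" as an alias of "jpeg" is an assumption of this example rather than documented library behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class ImageFormats {
    /** Format names listed as valid in the savePageAsImageFile documentation. */
    static final List<String> DOCUMENTED = Arrays.asList("gif", "jpeg", "png");

    /**
     * Maps a user-supplied name or file extension to a documented format name.
     * The "jpg" alias is an assumption of this sketch.
     */
    static String normalize(String name) {
        if (name == null) {
            throw new IllegalArgumentException("format name is null");
        }
        String n = name.trim().toLowerCase(Locale.ROOT);
        if (n.startsWith(".")) {
            n = n.substring(1); // accept ".png" style extensions
        }
        if (n.equals("jpg")) {
            n = "jpeg";
        }
        if (!DOCUMENTED.contains(n)) {
            throw new IllegalArgumentException("Unsupported format: " + name);
        }
        return n;
    }
}
```

The normalized value can then be passed as the formatName argument, for example `reader.savePageAsImageFile(0, ImageFormats.normalize(".JPG"), new File("page0.jpeg"))`.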

extractTextFromPage

public java.lang.String extractTextFromPage(int pageIndex)
                                     throws java.io.IOException

Extracts text from the specified page. If the page's text is stored as image objects in the PDF, text extraction may fail for that page. In that case, use getPageAsImage() and feed the resulting image to the Asprise OCR engine.

Parameters: pageIndex - zero-based page index, i.e., the first page is page 0.
Returns: the text content extracted.
Throws: java.io.IOException
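
To decide which pages need the OCR fallback, one approach is to scan all pages and collect those that yield no usable text. The sketch below assumes that an image-only page produces null or blank text; the documentation does not state what is actually returned in that case, so treat this check as a guess to verify. `PageTextSource` and `OcrCandidates` are hypothetical names, and `PageTextSource` only mirrors the documented extractTextFromPage(int) signature.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the documented extractTextFromPage(int) method. */
interface PageTextSource {
    String extractTextFromPage(int pageIndex) throws IOException;
}

class OcrCandidates {
    /** Returns the zero-based indices of pages whose extracted text is null or blank. */
    static List<Integer> pagesWithoutText(PageTextSource source, int pageCount) throws IOException {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageCount; i++) {
            String text = source.extractTextFromPage(i);
            if (text == null || text.trim().isEmpty()) {
                result.add(i); // candidate for getPageAsImage() plus OCR
            }
        }
        return result;
    }
}
```

With a real reader this could be called as `OcrCandidates.pagesWithoutText(reader::extractTextFromPage, reader.getNumberOfPages())`.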

getPageSize

public java.awt.Rectangle getPageSize(int pageIndex)

Returns the page size of the specified page.

Parameters: pageIndex - zero-based page index, i.e., the first page is page 0.
Returns: the size of the page.
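
The returned java.awt.Rectangle can be used to reason about page layout before rendering. The sketch below derives orientation and aspect ratio from a page size. `PageGeometry` is a hypothetical helper, and because the documentation does not state the units of the rectangle, the example relies only on the ratio of width to height.

```java
import java.awt.Rectangle;

class PageGeometry {
    enum Orientation { PORTRAIT, LANDSCAPE, SQUARE }

    /** Classifies a page by comparing its width and height. */
    static Orientation orientationOf(Rectangle size) {
        if (size.width == size.height) {
            return Orientation.SQUARE;
        }
        return size.width > size.height ? Orientation.LANDSCAPE : Orientation.PORTRAIT;
    }

    /** Returns width divided by height. */
    static double aspectRatio(Rectangle size) {
        if (size.height <= 0) {
            throw new IllegalArgumentException("height must be positive");
        }
        return (double) size.width / size.height;
    }
}
```

For example, `PageGeometry.orientationOf(reader.getPageSize(0))` classifies the first page.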

main

public static void main(java.lang.String[] args)
                 throws java.lang.Exception

A utility that extracts text from a PDF file. Usage: java com.asprise.util.pdf.PDFReader [pdf file]

Parameters: args - the command-line arguments (the PDF file, per the usage above).
Throws: java.lang.Exception

getSecurityObject

public PDFSecurityObject getSecurityObject()

Returns the security object.

Returns: the security object.

setSecurityObject

public void setSecurityObject(PDFSecurityObject securityObject)

If the PDF is encrypted, you need to supply a security object to 'unlock' the PDF before calling open().

Parameters: securityObject - the security object used to unlock the PDF.
