PHP Classes

PHP PDFBox: Extract text from PDF documents using PDFBox tool

Version: pdfbox 1.0.1
License: BSD License
PHP version: 5.3
Categories: PHP 5, Files and Folders, Text proces...
Description 

This package can extract text from PDF documents using the PDFBox tool.

It can read a PDF document from a file or an open stream, and it calls the PDFBox Java tool to extract the text from the PDF document.

The extracted text can be returned as plain text, as HTML, or as a DOM object. The output can also be saved to a given file.

Author: Fabian Schmengler

Documentation

PdfBox

A PHP interface for the PdfBox ExtractText utility, useful for unit-testing the contents of generated PDFs.

Requirements

  • Java Runtime Environment
  • PdfBox JAR file - Download: http://pdfbox.apache.org/downloads.html - Tested with 1.6.0, 1.7.0 and 1.8.6
  • PHP needs permissions for shell execution
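
Shell execution is sometimes disabled through the `disable_functions` setting, especially on shared hosts. A minimal sketch of a pre-flight check follows. It assumes the wrapper relies on `exec()`, which the documentation above does not confirm, and `canRunShellFunction` is a hypothetical helper that is not part of the package:

```php
<?php
// Hypothetical helper, not part of the package: reports whether a given
// shell-execution function is defined and not listed in disable_functions.
function canRunShellFunction($name)
{
    $disabled = array_map('trim', explode(',', (string) ini_get('disable_functions')));

    return function_exists($name) && !in_array($name, $disabled, true);
}

// Assumption: the wrapper uses exec(); substitute whichever function applies.
if (!canRunShellFunction('exec')) {
    echo "exec() is unavailable; the PdfBox JAR cannot be called from PHP.\n";
}
```

Running a check like this before the first conversion gives a clearer failure message than an empty result from the converter.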

Install

To install with composer:

composer require sgh/pdfbox

Basic Usage

use SGH\PdfBox; // if the class lives inside the SGH\PdfBox namespace, import SGH\PdfBox\PdfBox instead

//$pdf = GENERATED_PDF;
$converter = new PdfBox;
$converter->setPathToPdfBox('/usr/bin/pdfbox-app-1.7.0.jar');
$text = $converter->textFromPdfStream($pdf);
$html = $converter->htmlFromPdfStream($pdf);
$dom  = $converter->domFromPdfStream($pdf);

If the source PDF is a file, use xxxFromPdfFile() instead of xxxFromPdfStream(), with the source path as the parameter.

If you want to save the converted output to a file, specify the destination path as the second parameter of the xxxFromPdfxxx() methods.
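
For unit tests, the DOM result can be inspected with PHP's standard DOM tools. The following is only a sketch: it assumes that domFromPdfStream() returns a `DOMDocument`, and it uses an invented HTML string as a stand-in for real converter output, so that it runs without Java or the package installed:

```php
<?php
// Stand-in for the converter's HTML output; this markup is invented for
// illustration and is not actual PdfBox output.
$html = '<html><body><p>Invoice 2024-001</p><p>Total: 42.00 EUR</p></body></html>';

// Assumption: domFromPdfStream() yields a DOMDocument comparable to this one.
$dom = new DOMDocument();
$dom->loadHTML($html);

// Collect the trimmed text of every paragraph with XPath.
$xpath = new DOMXPath($dom);
$paragraphs = array();
foreach ($xpath->query('//p') as $node) {
    $paragraphs[] = trim($node->textContent);
}

// A unit test could then assert that an expected line is present.
$containsTotal = in_array('Total: 42.00 EUR', $paragraphs, true);
```

The actual element structure produced by PdfBox may differ, so the XPath expression should be adapted to the real output.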

Advanced Usage

Convert a range of pages instead of the full document:

$converter->getOptions()
    ->setStartPage(2)
    ->setEndPage(5);

Ignore corrupt objects in the PDF:

$converter->getOptions()
    ->setForce(true);

Sort text:

$converter->getOptions()
    ->setSort(true);
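
The options above appear to correspond to flags of the PDFBox 1.x ExtractText command-line tool (`-startPage`, `-endPage`, `-force`, `-sort`). How the package builds its command internally is not shown here, so the following is only a sketch of an equivalent invocation. `buildExtractTextCommand` is a hypothetical helper and `/tmp/source.pdf` is a placeholder path:

```php
<?php
// Hypothetical helper, not part of the package: builds an ExtractText command
// line from an options array. Flag names follow the PDFBox 1.x command-line tool.
function buildExtractTextCommand($jar, $pdfFile, array $options = array())
{
    $parts = array('java', '-jar', escapeshellarg($jar), 'ExtractText');

    if (isset($options['startPage'])) {
        $parts[] = '-startPage ' . (int) $options['startPage'];
    }
    if (isset($options['endPage'])) {
        $parts[] = '-endPage ' . (int) $options['endPage'];
    }
    if (!empty($options['force'])) {
        $parts[] = '-force'; // ignore corrupt objects
    }
    if (!empty($options['sort'])) {
        $parts[] = '-sort';  // sort text before writing
    }

    // The input file comes last; escaping guards against spaces in the path.
    $parts[] = escapeshellarg($pdfFile);

    return implode(' ', $parts);
}

$command = buildExtractTextCommand(
    '/usr/bin/pdfbox-app-1.7.0.jar',
    '/tmp/source.pdf', // placeholder path
    array('startPage' => 2, 'endPage' => 5, 'sort' => true)
);
```

Printing `$command` and running it by hand can help when diagnosing an unexpected conversion result.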

PHPUnit tests

To run the unit tests, set the environment variable PDFBOX_JAR to the full path of your PdfBox JAR file. See phpunit.xml.dist.
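
When writing your own tests in the same style, the JAR path can be read from that variable before a converter is created. A minimal sketch, assuming the variable is visible through `getenv()`; `resolvePdfBoxJar` is a hypothetical helper, not part of the package:

```php
<?php
// Hypothetical helper, not part of the package: returns the JAR path from the
// PDFBOX_JAR environment variable, or null when the variable is unset, empty,
// or does not point to a readable file.
function resolvePdfBoxJar()
{
    $path = getenv('PDFBOX_JAR');

    if ($path === false || $path === '' || !is_file($path) || !is_readable($path)) {
        return null;
    }

    return $path;
}
```

A test's setUp() method could call PHPUnit's markTestSkipped() when this returns null, so that a missing JAR produces skipped tests rather than failures.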

