Caution: This documentation is for eZ Publish legacy, from version 3.x to 6.x.
For 5.x documentation covering Platform, see the eZ Documentation Center; for the differences between legacy and Platform, see the 5.x Architecture overview.

TextExtractionTool

Summary

Sets the text extraction tool to use to extract text content from PDF files.

Usage

TextExtractionTool=<name of text extraction tool>

Description

This setting sets the text extraction tool to use to extract text content from PDF files.
The text extraction tool is an external program which returns the content of the PDF file as plain text. The pstotext and pdftotext programs have been tested for compatibility. Other programs should also work, such as Apache Tika used together with the eZ Tika extension. This setting is case sensitive.
By default pstotext is used. The default settings for using this program are:

[HandlerSettings]
MetaDataExtractor[application/pdf]=ezpdf
 
[PDFHandlerSettings]
TextExtractionTool=pstotext

Please note that the pstotext binaries do not come with eZ Publish by default and need to be installed manually on your web server.
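
Before relying on either tool, it can help to confirm that the binary is actually reachable. The following is a minimal sketch and not part of eZ Publish: the function name tool_available is hypothetical, and the check only covers the PATH of the shell it runs in, which may differ from the environment of the web server user.

```shell
#!/bin/sh
# Hypothetical helper (not part of eZ Publish): succeeds when the named
# command can be found on the PATH of the current shell.
tool_available() {
    command -v "$1" >/dev/null 2>&1
}

# Report on the two tools mentioned on this page.
for tool in pstotext pdftotext; do
    if tool_available "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not found on PATH"
    fi
done
```

Running the check as the same system user as the web server gives the most meaningful result, since that is the account that will execute the extraction tool.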

Examples

1. pstotext

TextExtractionTool=pstotext

2. pdftotext

As an alternative, the pdftotext program can be used. pdftotext is a command-line utility that converts PDF files to plain text; it originates from the Xpdf project, and depending on the Linux distribution in use it might come bundled in different packages, e.g. poppler.
Since pdftotext requires command-line options when it is executed (a trailing "-" makes it write the text to standard output instead of a file), a workaround is needed to make it work correctly as the text extraction tool for eZ Publish.

As an example, you can create a shell script to execute pdftotext with the needed options. These are the contents of the script:

#!/bin/sh
# $1 is the path of the PDF file; "-" writes the extracted text to standard output
/usr/bin/pdftotext "$1" -

In this case you will have to define your shell script as text extraction tool:

TextExtractionTool=/path/to/your/script.sh
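
A slightly more defensive variant of the wrapper script above could validate its argument before calling pdftotext. This is only a sketch under stated assumptions: as the script above implies, the PDF path is assumed to arrive as the first argument and the text is assumed to be read from standard output. The function name extract_text and the PDFTOTEXT override variable are hypothetical and exist only for this example.

```shell
#!/bin/sh
# Sketch of a more defensive wrapper around pdftotext.
# Assumption: the PDF path is passed as the first argument and the
# extracted text is read from standard output.

# PDFTOTEXT is a hypothetical override for this sketch; it defaults to
# the path used in the example above.
PDFTOTEXT="${PDFTOTEXT:-/usr/bin/pdftotext}"

extract_text() {
    # Require exactly one readable file; report problems on stderr.
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: $0 <readable PDF file>" >&2
        return 2
    fi
    # Quoting keeps paths with spaces intact; "-" sends the text to stdout.
    "$PDFTOTEXT" "$1" -
}

# Run only when called with arguments, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    extract_text "$@"
fi
```

The script must be executable by the web server user (for example via chmod +x) before it is referenced in TextExtractionTool.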

3. Tika

TextExtractionTool=path/to/eztika

Ester Heylen (08/10/2009 11:39 am)

Ricardo Correia (01/04/2013 1:28 pm)

Ester Heylen, Ricardo Correia

