TextExtractionTool
Summary
Sets the text extraction tool to use to extract text content from PDF files.
Usage
TextExtractionTool=<name of text extraction tool>
Description
This setting specifies the text extraction tool used to extract text content from PDF files.
The text extraction tool is an external program which returns the content of the PDF file as plain text. The pstotext and pdftotext programs have been tested for compatibility. Other programs should also work, such as Apache Tika used together with the eZ Tika extension. This setting is case sensitive.
By default pstotext is used. The default settings for using this program are:
[HandlerSettings]
MetaDataExtractor[application/pdf]=ezpdf

[PDFHandlerSettings]
TextExtractionTool=pstotext
Please note that the pstotext binary does not ship with eZ Publish by default and must be installed manually on your web server.
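Before changing the setting, it can help to check which extraction tools are actually installed. The following is a minimal sketch, not part of eZ Publish; the helper name tool_available is hypothetical and chosen only for this example. It relies solely on the standard "command -v" shell builtin.

```shell
#!/bin/sh
# Report whether the text extraction tools can be found on the PATH.
# tool_available is a hypothetical helper name, not part of eZ Publish.
tool_available() {
    command -v "$1" >/dev/null 2>&1
}

for tool in pstotext pdftotext; do
    if tool_available "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: not installed"
    fi
done
```

The PATH seen by the web server may differ from that of an interactive shell, so it is safest to run the check as the user the web server runs as.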
Examples
1. pstotext
TextExtractionTool=pstotext
2. pdftotext
As an alternative, the pdftotext program can be used. pdftotext is a command-line utility that converts PDF files to plain text. Note that, depending on the Linux distribution in use, it might come bundled in different packages, e.g. poppler.
Since pdftotext requires command-line options when it is executed, an additional workaround is needed to make it work correctly as the text extraction tool for eZ Publish.
As an example, you can create a shell script to execute pdftotext with the needed options. These are the contents of the script:
#!/bin/sh
/usr/bin/pdftotext "$1" -
In this case you will have to define your shell script as the text extraction tool. The script must be executable by the web server user:
TextExtractionTool=/path/to/your/script.sh
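A slightly more defensive variant of the wrapper script might look like the sketch below. It assumes, as the script above does, that eZ Publish passes the PDF path as the first argument and reads the extracted text from standard output. The names PDFTOTEXT_BIN and extract_text are hypothetical and introduced only for this example; the pdftotext invocation itself is unchanged.

```shell
#!/bin/sh
# Sketch of a more defensive pdftotext wrapper.
# PDFTOTEXT_BIN and extract_text are hypothetical names for this example.
PDFTOTEXT_BIN="${PDFTOTEXT_BIN:-/usr/bin/pdftotext}"

extract_text() {
    # Refuse to run without a readable input file.
    if [ "$#" -lt 1 ] || [ ! -r "$1" ]; then
        echo "usage: script.sh <file.pdf>" >&2
        return 1
    fi
    # "-" makes pdftotext write the extracted text to standard output.
    "$PDFTOTEXT_BIN" "$1" -
}

# Run the extraction only when an argument is given; the PDF path is
# assumed to arrive as the first argument.
if [ "$#" -gt 0 ]; then
    extract_text "$@"
fi
```

The wrapper can be tried by hand before it is configured, for example with /path/to/your/script.sh /path/to/some.pdf, which should print the text of the PDF to the terminal.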
3. Tika
TextExtractionTool=path/to/eztika
Ester Heylen (08/10/2009 11:39 am)
Ricardo Correia (01/04/2013 1:28 pm)