Tags

StringifierContrib

Helper library to stringify binary document formats

Description

This extension serves as a helper to stringify documents in a way that they can be indexed by full-text search engines such as Foswiki:Extensions/KinoSearchContrib or Foswiki:Extensions/SolrPlugin.

It supports all major office document formats, i.e. MS-Office and OpenDocument formats.

StringifierContrib is organized in plugins to serialization a document format by delegating it to according backends. For some formats there are alternative backends to chose from. For example a DOC file can be serialized by any of abiword, antiword, catdoc, soffice or wvWare. Use the one that serves best your needs and is available on your platform. For instance soffice is a very good choice to serve as a document converter. However using it is rather performance demanding. The more simpler ones suffice most of the time but may have an inferior quality of text being extracted.

Backends for Word Documents

To index Word Documents (.doc) you will need to install one of the following:

  • antiword
  • abiword
  • catdoc
  • soffice
  • wvWare

Backend for PDF

To index .pdf files you need to install poppler-utils.

Backend for PPT

To index .ppt files you may select one of the following:

  • catdoc
  • pphtml
  • soffice

Backend for HTML

There are a couple of options to index .html files:

* w3m * lynx * links * html2text

Some of the other backends generate html temporarily which is then converted to plain text using the html stringifier.

Backends for DOCX, PPTX, XLSX

To index these file types, you will need to install the following tools:

Then set the command path to these tools in configure.

Backend for OpenDocument and Staroffice documents

To index these file types you need to install odt2txt.

Backed fo E-books

To index common ebook file formats you need to install calibre

Installing the Contrib

You do not need to install anything in the browser to use this extension. The following instructions are for the administrator who installs the extension on the server.

Open configure, and open the "Extensions" section. "Extensions Operation and Maintenance" Tab -> "Install, Update or Remove extensions" Tab. Click the "Search for Extensions" button. Enter part of the extension name or description and press search. Select the desired extension(s) and click install. If an extension is already installed, it will not show up in the search results.

You can also install from the shell by running the extension installer as the web server user: (Be sure to run as the webserver user, not as root!)
cd /path/to/foswiki
perl tools/extension_installer <NameOfExtension> install

If you have any problems, or if the extension isn't available in configure, then you can still install manually from the command-line. See https://foswiki.org/Support/ManuallyInstallingExtensions for more help.

Configuration

There are a number of settings that need to be set in configure before you can use the Contrib.

Test of the Installation

  • Test if the installation was successful:
    • Check that antiword, abiword or wvHtml is in place: Type antiword, abiword or wvHtml on the prompt and check that the command exists.
    • Check that pdftotext is in place: Type pdftotext on the prompt and check that the command exists.
    • Check that ppthtml is in place: Type ppthtml on the prompt and check that the command exists.
    • stringify some files (see below)

Test of Stringification with stringify

Some users report problems with the stringification: The stringifier scipts fails, takes too long on attachments. Some times this may result from installation errors, especially of the installation of the backends for the stringification.

stringify give you the opportunity to test the stringification in advance.

Usage: stringify file_name

In the result you see, which stringifier is used and the result of the stringification.

Example:

stringify /path/to/foswiki/StringifierContrib/test/unit/StringifierContrib/attachement_examples/Simple_example.doc

Simple example  

Keyword: dummy  

Umlaute: Grober, Uberschall, Anderung

Further Development

In this extension, a plug-in mechanism is implemented, so that additional stringifiers can be added without changing the existing code. All stringifier plugins are stored in the directory lib/Foswiki/Contrib/Stringifier/Plugins.

You can add new stringifier plugins by just adding new files here. The minimum things to be implemented are:

  • The plugin must inherit from Foswiki::Contrib::StringififierContrib::Base
  • The plugin must register itself by __PACKAGE__->register_handler($application, $file_extension);
  • The plugin must implement the method $text = stringForFile ($filename)

All the stringifiers have unit tests associated with them, and we would encourage you to provide unit tests for any you wish to contribute. See Foswiki:Development/UnitTests for more information on unit testing.

Dependencies

NameVersionDescription
abiword>0One of antiword, abiword, soffice or wvWare is required for .doc files
antiword>0One of antiword, abiword, soffice or wvWare is required for .doc files
Archive::Zip>=0Required
calibre>3Required for indexing ebooks such as .mobi and .epub files
catdoc>0Optional
Encode>0Required
Error>0Required
File::Which>0Required
html2text>0One of html2text or lynx for indexing html files
lynx>0One of html2text or lynx for indexing html files
Module::Pluggable>0Required
odt2txt>0Required for indexing OpenDocument and StarOffice documents
pdftotext>0Required for indexing .pdf. Part of poppler-utils
ppthtml>0Required
soffice>0One of antiword, abiword, soffice or wvWare is required for .doc and .docx0 files
Spreadsheet::ParseExcel>0Required for .xls files
Spreadsheet::XLSX>0One of Spreadsheet::ParseXLSX or xlsx2csv is required for .xlsx files
tesseract>0OCR for tiff files,Optional
wvWare>0One of antiword, abiword, soffice or wvWare is required for .doc files
xlsx2csv>0One of Spreadsheet::ParseXLSX or xlsx2csv is required for .xlsx files
XML::Twig>=3REquired for indexing PPTX files

Change History

17 Jan 2024: (7.00) reworked configuration settings
21 Oct 2019: (6.00) performance fixes; new api canStringify(); added support for common ebook file formats
16 Aug 2018: (5.20) register more mime types to the text stringifier
09 Jan 2018: (5.10) added support for tiff documents using tesseract
18 Sep 2017: (5.00) make html-to-text converter pluggable
31 Jan 2017: (4.40) improved XLSX stringifier
23 Jan 2017: (4.30) added stringifier to index XLS using soffice
18 Oct 2015: (4.20) removed dependency on File::MMagic; now using extension-based mime detection
01 Oct 2015: (4.10) don't default to pass-through for non-supported document types; fixed unit tests
29 Sep 2015: (4.00) added unicode support with Foswiki > 2.0
22 Jul 2015: (3.10) added support for stringification of ppt using catdoc as ppthtml isn't available on some distros
29 Aug 2014: (3.00) added support for stringification using open/libreoffice
07 May 2012: (2.20) added configuration parameter to specify the encoding of the output of each external helper in use
17 Oct 2011: (2.10) using wvText instead of wvHtml now; encoding stringified files to the site's charset now; fixed unit tests to use utf8 exclusively
05 Sep 2011: (2.00) added OpenDocument serializer; removed dependency left-over on Text::Iconv; added dependency on odt2txt; fixed defaults for wv serializer
01 Dec 2010: (1.20) moved core from StringifierContrib to Stringifier not to disturb configure
12 Nov 2010: (1.14) Foswiki:Main.PadraigLennon - Foswikitask:Item9311
23 Oct 2010: (1.12) made system fault-tolerant in case of missing dependencies for a given file type; doc cleanup -- Foswiki:Main.WillNorris
12 Feb 2010: robust parsing of password protected XLS files
02 Oct 2009: extracted from Foswiki:Extensions/KinoSearchContrib (MD)
Topic revision: r1 - 09 Jan 2018, ProjectContributor
This site is powered by FoswikiCopyright © by the contributing authors. All material on this site is the property of the contributing authors.
Ideas, requests, problems regarding CLASSE Wiki? Send feedback