DocumentReaderService provides API to retrieve DocumentReader by mimetype. DocumentReader lets the user fetch content of document as String or, in case of TikaDocumentReader, as Reader.
Basically, DocumentReaderService is a container for all registered DocumentReaders. So, you can register DocumentReader (method addDocumentReader(ComponentPlugin reader)) and fetch DocumentReader by mimeType (method getDocumentReader(String mimeType)).
TikaDocumentReaderServiceImpl extends DocumentReaderService with a simple goal - reading Tika configuration and lazy-registering each Tika Parser as TikaDocumentReader.
By default, all Tika Parsers are not registered in readers <mimetype, DocumentReader> map. When a user tries to fetch a DocumentReader by unknown mimetype, TikaDocumentReaderService checks the Tika configuration and registers a new <mimetype, DocumentReader> map.
The configuration of TikaDocumentReaderServiceImpl looks like:
<component>
<key>org.exoplatform.services.document.DocumentReaderService</key>
<type>org.exoplatform.services.document.impl.tika.TikaDocumentReaderServiceImpl</type>
<!-- Old-style document readers -->
<component-plugins>
<component-plugin>
<name>pdf.document.reader</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.PDFDocumentReader</type>
<description>to read the pdf inputstream</description>
</component-plugin>
<component-plugin>
<name>document.readerMSWord</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.MSWordDocumentReader</type>
<description>to read the ms word inputstream</description>
</component-plugin>
<component-plugin>
<name>document.readerMSXWord</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.MSXWordDocumentReader</type>
<description>to read the ms word inputstream</description>
</component-plugin>
<component-plugin>
<name>document.readerMSExcel</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.MSExcelDocumentReader</type>
<description>to read the ms excel inputstream</description>
</component-plugin>
<component-plugin>
<name>document.readerMSXExcel</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.MSXExcelDocumentReader</type>
<description>to read the ms excel inputstream</description>
</component-plugin>
<component-plugin>
<name>document.readerMSOutlook</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.MSOutlookDocumentReader</type>
<description>to read the ms outlook inputstream</description>
</component-plugin>
<component-plugin>
<name>PPTdocument.reader</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.PPTDocumentReader</type>
<description>to read the ms ppt inputstream</description>
</component-plugin>
<component-plugin>
<name>MSXPPTdocument.reader</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.MSXPPTDocumentReader</type>
<description>to read the ms pptx inputstream</description>
</component-plugin>
<component-plugin>
<name>document.readerHTML</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.HTMLDocumentReader</type>
<description>to read the html inputstream</description>
</component-plugin>
<component-plugin>
<name>document.readerXML</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.XMLDocumentReader</type>
<description>to read the xml inputstream</description>
</component-plugin>
<component-plugin>
<name>TPdocument.reader</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.TextPlainDocumentReader</type>
<description>to read the plain text inputstream</description>
<init-params>
<!--
values-param> <name>defaultEncoding</name> <description>description</description> <value>UTF-8</value>
</values-param
-->
</init-params>
</component-plugin>
<component-plugin>
<name>document.readerOO</name>
<set-method>addDocumentReader</set-method>
<type>org.exoplatform.services.document.impl.OpenOfficeDocumentReader</type>
<description>to read the OO inputstream</description>
</component-plugin>
</component-plugins>
<init-params>
<value-param>
<name>tika-configuration</name>
<value>jar:/conf/portal/tika-config.xml</value>
</value-param>
</init-params>
</component>
</configuration>
tika-configuration
: This parameter refers to the path of the Tika configuration
file to use. By default, it uses the default configuration of Tika
available from TikaConfig.getDefaultConfig()
.
The example of tika-config.xml
is:
<properties>
<mimeTypeRepository magic="false"/>
<parsers>
<parser name="parse-dcxml" class="org.apache.tika.parser.xml.DcXMLParser">
<mime>application/xml</mime>
<mime>image/svg+xml</mime>
<mime>text/xml</mime>
<mime>application/x-google-gadget</mime>
</parser>
<parser name="parse-office" class="org.apache.tika.parser.microsoft.OfficeParser">
<mime>application/excel</mime>
<mime>application/xls</mime>
<mime>application/msworddoc</mime>
<mime>application/msworddot</mime>
<mime>application/powerpoint</mime>
<mime>application/ppt</mime>
<mime>application/x-tika-msoffice</mime>
<mime>application/msword</mime>
<mime>application/vnd.ms-excel</mime>
<mime>application/vnd.ms-excel.sheet.binary.macroenabled.12</mime>
<mime>application/vnd.ms-powerpoint</mime>
<mime>application/vnd.visio</mime>
<mime>application/vnd.ms-outlook</mime>
</parser>
<parser name="parse-ooxml" class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
<mime>application/x-tika-ooxml</mime>
<mime>application/vnd.openxmlformats-package.core-properties+xml</mime>
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.sheet</mime>
<mime>application/vnd.openxmlformats-officedocument.spreadsheetml.template</mime>
<mime>application/vnd.ms-excel.sheet.macroenabled.12</mime>
<mime>application/vnd.ms-excel.template.macroenabled.12</mime>
<mime>application/vnd.ms-excel.addin.macroenabled.12</mime>
<mime>application/vnd.openxmlformats-officedocument.presentationml.presentation</mime>
<mime>application/vnd.openxmlformats-officedocument.presentationml.template</mime>
<mime>application/vnd.openxmlformats-officedocument.presentationml.slideshow</mime>
<mime>application/vnd.ms-powerpoint.presentation.macroenabled.12</mime>
<mime>application/vnd.ms-powerpoint.slideshow.macroenabled.12</mime>
<mime>application/vnd.ms-powerpoint.addin.macroenabled.12</mime>
<mime>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mime>
<mime>application/vnd.openxmlformats-officedocument.wordprocessingml.template</mime>
<mime>application/vnd.ms-word.document.macroenabled.12</mime>
<mime>application/vnd.ms-word.template.macroenabled.12</mime>
</parser>
<parser name="parse-html" class="org.apache.tika.parser.html.HtmlParser">
<mime>text/html</mime>
</parser>
<parser mame="parse-rtf" class="org.apache.tika.parser.rtf.RTFParser">
<mime>application/rtf</mime>
</parser>
<parser name="parse-pdf" class="org.apache.tika.parser.pdf.PDFParser">
<mime>application/pdf</mime>
</parser>
<parser name="parse-txt" class="org.apache.tika.parser.txt.TXTParser">
<mime>text/plain</mime>
<mime>script/groovy</mime>
<mime>application/x-groovy</mime>
<mime>application/x-javascript</mime>
<mime>application/javascript</mime>
<mime>text/javascript</mime>
</parser>
<parser name="parse-openoffice" class="org.apache.tika.parser.opendocument.OpenOfficeParser">
<mime>application/vnd.oasis.opendocument.database</mime>
<mime>application/vnd.sun.xml.writer</mime>
<mime>application/vnd.oasis.opendocument.text</mime>
<mime>application/vnd.oasis.opendocument.graphics</mime>
<mime>application/vnd.oasis.opendocument.presentation</mime>
<mime>application/vnd.oasis.opendocument.spreadsheet</mime>
<mime>application/vnd.oasis.opendocument.chart</mime>
<mime>application/vnd.oasis.opendocument.image</mime>
<mime>application/vnd.oasis.opendocument.formula</mime>
<mime>application/vnd.oasis.opendocument.text-master</mime>
<mime>application/vnd.oasis.opendocument.text-web</mime>
<mime>application/vnd.oasis.opendocument.text-template</mime>
<mime>application/vnd.oasis.opendocument.graphics-template</mime>
<mime>application/vnd.oasis.opendocument.presentation-template</mime>
<mime>application/vnd.oasis.opendocument.spreadsheet-template</mime>
<mime>application/vnd.oasis.opendocument.chart-template</mime>
<mime>application/vnd.oasis.opendocument.image-template</mime>
<mime>application/vnd.oasis.opendocument.formula-template</mime>
<mime>application/x-vnd.oasis.opendocument.text</mime>
<mime>application/x-vnd.oasis.opendocument.graphics</mime>
<mime>application/x-vnd.oasis.opendocument.presentation</mime>
<mime>application/x-vnd.oasis.opendocument.spreadsheet</mime>
<mime>application/x-vnd.oasis.opendocument.chart</mime>
<mime>application/x-vnd.oasis.opendocument.image</mime>
<mime>application/x-vnd.oasis.opendocument.formula</mime>
<mime>application/x-vnd.oasis.opendocument.text-master</mime>
<mime>application/x-vnd.oasis.opendocument.text-web</mime>
<mime>application/x-vnd.oasis.opendocument.text-template</mime>
<mime>application/x-vnd.oasis.opendocument.graphics-template</mime>
<mime>application/x-vnd.oasis.opendocument.presentation-template</mime>
<mime>application/x-vnd.oasis.opendocument.spreadsheet-template</mime>
<mime>application/x-vnd.oasis.opendocument.chart-template</mime>
<mime>application/x-vnd.oasis.opendocument.image-template</mime>
<mime>application/x-vnd.oasis.opendocument.formula-template</mime>
</parser>
<parser name="parse-image" class="org.apache.tika.parser.image.ImageParser">
<mime>image/bmp</mime>
<mime>image/gif</mime>
<mime>image/jpeg</mime>
<mime>image/png</mime>
<mime>image/tiff</mime>
<mime>image/vnd.wap.wbmp</mime>
<mime>image/x-icon</mime>
<mime>image/x-psd</mime>
<mime>image/x-xcf</mime>
</parser>
<parser name="parse-class" class="org.apache.tika.parser.asm.ClassParser">
<mime>application/x-tika-java-class</mime>
</parser>
<parser name="parse-mp3" class="org.apache.tika.parser.mp3.Mp3Parser">
<mime>audio/mpeg</mime>
</parser>
<parser name="parse-midi" class="org.apache.tika.parser.audio.MidiParser">
<mime>application/x-midi</mime>
<mime>audio/midi</mime>
</parser>
<parser name="parse-audio" class="org.apache.tika.parser.audio.AudioParser">
<mime>audio/basic</mime>
<mime>audio/x-wav</mime>
<mime>audio/x-aiff</mime>
</parser>
</parsers>
</properties>
As you see the configuration above, there are both old-style DocumentReaders and new Tika parsers registered.
Both MSWordDocumentReader and org.apache.tika.parser.microsoft.OfficeParser refer to the same application/msword mimetype. However, only one DocumentReader will be fetched.
Old-style DocumentReader registered in configuration becomes registered into DocumentReaderService. So, mimetype that is supported by those DocumentReaders will have a registered pair, and user will always fetch this DocumentReaders with the getDocumentReader(..) method. The Tika configuration will be checked for Parsers only if there is no already registered DocumentReader.
Registering your own DocumentReader
You can make you own DocumentReader in two ways below:
By using Old-Style Document Reader
Extend BaseDocumentReader.
public class MyDocumentReader extends BaseDocumentReader
{
public String[] getMimeTypes()
{
return new String[]{"mymimetype"};
}
...
}
Register it as a component-plugin.
<component-plugin>
<name>my.DocumentReader</name>
<set-method>addDocumentReader</set-method>
<type>com.mycompany.document.MyDocumentReader</type>
<description>to read my own file format</description>
</component-plugin>
By using Tika Parser
Implement the Tika Parser.
public class MyParser implements Parser
{
...
}
Register it in tika-config.xml
.
<parser name="parse-mydocument" class="com.mycompany.document.MyParser">
<mime>mymimetype</mime>
</parser>
TikaDocumentReader can return document content as Reader object, but Old-Style DocumentReader cannot.
TikaDocumentReader does not detect document mimetype. You will get the exact parser as configured in tika-config.xml
.
All readers methods close InputStream at final.