2.2.1.1.1. Customized search indexes and analyzers

By default, JCR uses the Lucene StandardAnalyzer to index content. This analyzer applies a few standard filters in the method that analyzes the content:

public TokenStream tokenStream(String fieldName, Reader reader) {
    StandardTokenizer tokenStream = new StandardTokenizer(reader, replaceInvalidAcronym);
    tokenStream.setMaxTokenLength(maxTokenLength);
    TokenStream result = new StandardFilter(tokenStream);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopSet);
    return result;
}

The first one (StandardFilter) removes 's (as in "Peter's") from the end of words and removes dots from acronyms.
The second one (LowerCaseFilter) normalizes token text to lower case.
The last one (StopFilter) removes stop words from the token stream. The stop set is defined in the analyzer.
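
To make the order of the three filters concrete, here is a minimal plain-Java sketch that imitates the chain without using Lucene. The class name, the whitespace tokenization and the small stop set are assumptions made for illustration only. Unlike Lucene, which removes dots only from tokens recognized as acronyms, this sketch removes every dot.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class FilterChainSketch {

    // Hypothetical stop set for this sketch; the real one is defined in the analyzer.
    static final Set<String> STOP_WORDS = Set.of("a", "and", "of", "on", "the");

    // Imitates StandardFilter: drops a trailing 's and removes dots.
    static String standardStep(String token) {
        if (token.endsWith("'s") || token.endsWith("'S")) {
            token = token.substring(0, token.length() - 2);
        }
        return token.replace(".", "");
    }

    // Applies the three steps in the same order as the tokenStream method.
    static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        // Simplification: split on whitespace instead of using StandardTokenizer.
        for (String raw : text.trim().split("\\s+")) {
            String token = standardStep(raw);        // StandardFilter step
            token = token.toLowerCase(Locale.ROOT);  // LowerCaseFilter step
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);                   // StopFilter step keeps non-stop words
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [peter, report, ibm, merger]
        System.out.println(analyze("Peter's report on THE I.B.M. merger"));
    }
}
```

Lower-casing happens before the stop-word check, which is why "THE" is dropped even though the stop set only contains lower-case entries.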

For specific cases, you may wish to use additional filters such as ISOLatin1AccentFilter, which replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) with their unaccented equivalents.
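
The effect of replacing accented characters can be approximated in plain Java with java.text.Normalizer. The sketch below is not the Lucene filter itself, and the class and method names are hypothetical. Characters that have no canonical decomposition, such as ligatures, pass through unchanged here.

```java
import java.text.Normalizer;

public class AccentFoldingSketch {

    // Decomposes accented characters (NFD), then deletes the combining marks.
    static String stripAccents(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "cafe" with an acute accent on the final e; prints cafe
        System.out.println(stripAccents("caf\u00e9"));
    }
}
```

With a filter of this kind in the analyzer, a search for "cafe" can match content that was written with the accent, because both forms are indexed as the same token.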

To use a different filter, you have to create a new analyzer and a new search index that uses this analyzer. Package both in a JAR that is deployed with your application.

Creating a filter

The ISOLatin1AccentFilter is not present in the Lucene version currently used by eXo. You can use the attached file, or create your own filter by implementing the following method:

public final Token next(final Token reusableToken) throws java.io.IOException

This method defines how the filter reads and transforms the characters of each token.

Creating an analyzer

The analyzer has to extend org.apache.lucene.analysis.standard.StandardAnalyzer and override the following method to add your own filters.

public TokenStream tokenStream(String fieldName, Reader reader)

For reference, see the example analyzer attached to this article.

Configuring Platform to use your analyzer

In repository-configuration.xml, which can be found in various locations, you have to add the analyzer property to each query-handler configuration:

<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
    <properties>
      ...
      <property name="analyzer" value="org.exoplatform.services.jcr.impl.core.MyAnalyzer"/>
      ...
    </properties>
</query-handler>

When you start eXo, your SearchIndex will start to index content with the specified filters.

Creating a search index

Now that you have the analyzer, you need to write the SearchIndex that will use it. Your class has to extend org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex. Write a constructor that sets the right analyzer, and the following method to return it.

public Analyzer getAnalyzer() {
    // "analyzer" stands for the MyAnalyzer instance set in the constructor (illustrative field name)
    return analyzer;
}

For reference, see the attached SearchIndex.

Note

You can set the analyzer directly in your configuration, so creating a new SearchIndex only to use a new analyzer is redundant.

Configuring Platform to use your SearchIndex

In repository-configuration.xml, which can be found in various locations, you have to replace each occurrence of:

<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">

with your own class:

<query-handler class="mypackage.indexation.MySearchIndex">