By default, JCR uses the Lucene StandardAnalyzer to index content. This analyzer applies a few standard filters in the method that analyzes the content:
public TokenStream tokenStream(String fieldName, Reader reader) {
  StandardTokenizer tokenStream = new StandardTokenizer(reader, replaceInvalidAcronym);
  tokenStream.setMaxTokenLength(maxTokenLength);
  TokenStream result = new StandardFilter(tokenStream);
  result = new LowerCaseFilter(result);
  result = new StopFilter(result, stopSet);
  return result;
}
The first one (StandardFilter) removes the possessive 's (as in "Peter's") from the end of words and removes dots from acronyms.
The second one (LowerCaseFilter) normalizes token text to lower case.
The last one (StopFilter) removes stop words from a token stream. The stop set is defined in the analyzer.
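As a rough illustration of the three steps above, here is a plain-Java approximation that does not use Lucene at all. The class name FilterChainSketch, the small stop set, the acronym pattern and the whitespace tokenization are all assumptions made for this example; the real StandardTokenizer and filters handle many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Plain-Java approximation of the StandardFilter / LowerCaseFilter / StopFilter chain. */
final class FilterChainSketch {

    // Hypothetical stop set; in Lucene the stop set is defined in the analyzer.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("the", "a", "is", "of"));

    /** Step 1: strip a trailing 's and remove dots from simple acronyms such as "I.B.M.". */
    static String standardStep(String token) {
        if (token.endsWith("'s") || token.endsWith("'S")) {
            return token.substring(0, token.length() - 2);
        }
        if (token.matches("(\\p{L}\\.)+")) {
            return token.replace(".", "");
        }
        return token;
    }

    /** Runs the three steps on whitespace-separated tokens. */
    static List<String> analyze(String text) {
        List<String> result = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            // Step 2: normalize to lower case.
            String token = standardStep(raw).toLowerCase(Locale.ROOT);
            // Step 3: drop stop words (and empty tokens).
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ibm, report, peter]
        System.out.println(analyze("The I.B.M. report is Peter's"));
    }
}
```

The order matters in the same way as in the analyzer: lower-casing happens before the stop-word lookup, so "The" is matched against the lower-case stop set.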
For specific cases, you may wish to use additional filters like ISOLatin1AccentFilter, which replaces accented characters in the ISO Latin 1 character set (ISO-8859-1) by their unaccented equivalents.
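To give an idea of what such accent folding does to a token, the following sketch uses only the JDK (java.text.Normalizer) rather than Lucene. It is an approximation and not the ISOLatin1AccentFilter implementation: it decomposes characters and drops the combining marks, so it will not expand characters that have no decomposition, such as ligatures. The class name AccentFoldingSketch is hypothetical.

```java
import java.text.Normalizer;

/** JDK-only approximation of accent folding; not the Lucene ISOLatin1AccentFilter. */
final class AccentFoldingSketch {

    /** Decomposes accented characters (NFD) and removes the combining marks. */
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00e9" is e-acute and "\u00e0" is a-grave; prints: deja vu
        System.out.println(fold("d\u00e9j\u00e0 vu"));
    }
}
```

With folding applied at index time and at query time, a search for "deja" and a search for the accented spelling would match the same indexed token.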
To use a different filter, you have to create a new analyzer and a new search index that uses this analyzer. Package both in a JAR that is deployed with your application.
The ISOLatin1AccentFilter is not present in the Lucene version currently used by eXo. You can use the attached file, or create your own filter by implementing the following method:
public final Token next(final Token reusableToken) throws java.io.IOException
This method defines how characters are read and used by the filter.
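The pull-based pattern behind such a method can be modelled in plain Java: each filter wraps another stream, asks it for the next token, and rewrites that token before handing it on. The sketch below is only a model of that idea; SimpleTokenStream, ReplaceCharFilter and of are hypothetical names, and none of this is the Lucene TokenFilter API.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Plain-Java model of a pull-based token filter; NOT the Lucene API. */
final class PullFilterSketch {

    /** Hypothetical stand-in for a token stream: returns the next token, or null at the end. */
    interface SimpleTokenStream {
        String next();
    }

    /** Builds a source stream from a fixed list of tokens. */
    static SimpleTokenStream of(String... tokens) {
        Iterator<String> it = Arrays.asList(tokens).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Wraps another stream and replaces one character in every token pulled through it. */
    static final class ReplaceCharFilter implements SimpleTokenStream {
        private final SimpleTokenStream input;
        private final char from;
        private final char to;

        ReplaceCharFilter(SimpleTokenStream input, char from, char to) {
            this.input = input;
            this.from = from;
            this.to = to;
        }

        @Override
        public String next() {
            String token = input.next();
            // Pass the end-of-stream marker through unchanged.
            return token == null ? null : token.replace(from, to);
        }
    }

    public static void main(String[] args) {
        // '\u00e9' is e-acute; each token is rewritten as it is pulled.
        SimpleTokenStream stream =
            new ReplaceCharFilter(of("\u00e9t\u00e9", "nu"), '\u00e9', 'e');
        for (String t = stream.next(); t != null; t = stream.next()) {
            System.out.println(t);
        }
    }
}
```

Because every filter exposes the same next() contract as its input, filters can be stacked in any order, which is what the analyzer's tokenStream method does with the Lucene filters.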
The analyzer has to extend org.apache.lucene.analysis.standard.StandardAnalyzer and override the following method to plug in your own filters:
public TokenStream tokenStream(String fieldName, Reader reader)
You can have a glance at the example analyzer attached to this article.
Configuring Platform to use your analyzer
In repository-configuration.xml, which can be found in various locations, you have to add the analyzer parameter to each query-handler config:
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
  <properties>
    ...
    <property name="analyzer" value="org.exoplatform.services.jcr.impl.core.MyAnalyzer"/>
    ...
  </properties>
</query-handler>
When you start eXo, your SearchIndex will start to index content with the specified filters.
Now that you have the analyzer, you need to write the SearchIndex that will use it. You have to extend org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex, write a constructor that sets the right analyzer, and implement the following method to return your analyzer.
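// Created once and reused; MyAnalyzer is your custom analyzer class.
private final Analyzer analyzer = new MyAnalyzer();

public Analyzer getAnalyzer() {
  return analyzer;
}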
You can see the attached SearchIndex.
Note that you can set the analyzer directly in your configuration, so creating a new SearchIndex only to plug in a new analyzer is redundant.
Configuring Platform to use your SearchIndex
In repository-configuration.xml, which can be found in various locations, you have to replace each:
<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
with your own class:
<query-handler class="mypackage.indexation.MySearchIndex">