2.2.3.1.3. Local index


The clustered implementation with local indexes is built on the same strategy: a volatile in-memory index buffer combined with delayed flushing to persistent storage.

Because this implementation is designed for a clustered environment, it has additional mechanisms for delivering data within the cluster. Text extraction jobs run on the same node that performs the content operation (for example, a write operation). The prepared "documents" (a Lucene term for a block of data ready for indexing) are replicated to the cluster nodes and processed by their local indexes, so each cluster instance holds the same index content. When a new node joins the cluster, it has no initial index, so one must be created. There are several supported ways of doing this. The simplest is to copy the index manually, but this is not the intended approach. If no initial index is found, JCR uses an automated scenario controlled via configuration (see the index-recovery-mode parameter): either full re-indexing from the database or copying the index from another cluster node.

To use the cluster-ready strategy based on local indexes, where each node has its own copy of the index on its local file system, apply the following configuration. The indexing directory must point to a folder on the local file system, and "changesfilter-class" must be set to "org.exoplatform.services.jcr.impl.core.query.ispn.LocalIndexChangesFilter".




<query-handler class="org.exoplatform.services.jcr.impl.core.query.lucene.SearchIndex">
   <properties>
      <property name="index-dir" value="/mnt/nfs_drive/index/db1/ws" />
      <property name="changesfilter-class"
         value="org.exoplatform.services.jcr.impl.core.query.ispn.LocalIndexChangesFilter" />
      <property name="infinispan-configuration" value="infinispan-indexer.xml" />
      <property name="jgroups-configuration" value="udp-mux.xml" />
      <property name="infinispan-cluster-name" value="JCR-cluster" />
      <property name="max-volatile-time" value="60" />
      <property name="rdbms-reindexing" value="true" />
      <property name="reindexing-page-size" value="1000" />
      <property name="index-recovery-mode" value="from-coordinator" />
   </properties>
</query-handler>

Local index recovery filters

A common use case for cluster-ready applications is the hot joining and leaving of processing units. Every node that joins the cluster, whether for the first time or after some downtime, must be brought into a synchronized state.

With shared value storages, databases and indexes, cluster nodes are always synchronized. With the local index strategy, however, this becomes an issue. If a new node joins the cluster without an index, the index is retrieved or recreated. A node can also be restarted, in which case its index is not empty. An existing index is usually assumed to be up to date, but it may be outdated.

JCR offers a mechanism called RecoveryFilters that automatically retrieves the index for a joining node on startup. The feature is a set of filters that can be defined in the QueryHandler configuration:


<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter"  />

The number of filters is not limited, so they can be combined:


<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter" />
<property name="index-recovery-filter" value="org.exoplatform.services.jcr.impl.core.query.lucene.SystemPropertyRecoveryFilter" />

If any one of them fires, the index is re-synchronized. Note that DocNumberRecoveryFilter is used when no filter is configured. So, to block re-synchronization or to strictly require it on start, use ConfigurationPropertyRecoveryFilter.
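The combination rule can be pictured with a small Python model. This is only an illustrative sketch, not eXo JCR code: the real filters are Java classes, and every name below (`needs_recovery`, the dict-based context, the two stand-in filters) is hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a recovery filter: takes a context and returns
# True when the local index should be re-synchronized.
RecoveryFilter = Callable[[Dict[str, object]], bool]


def doc_number_filter(ctx: Dict[str, object]) -> bool:
    """Fires when local and coordinator document counts differ."""
    return ctx["local_docs"] != ctx["coordinator_docs"]


def system_property_filter(ctx: Dict[str, object]) -> bool:
    """Fires when the (modelled) force-reindexing flag is set."""
    return bool(ctx.get("force_reindexing", False))


def needs_recovery(filters: List[RecoveryFilter], ctx: Dict[str, object]) -> bool:
    # With no filter configured, the text says DocNumberRecoveryFilter is used.
    active = filters or [doc_number_filter]
    # The index is re-synchronized as soon as any one filter fires.
    return any(f(ctx) for f in active)
```

In this model, an empty filter list behaves exactly like a list containing only the document-count check, which is why an explicit configuration-driven filter is needed to force or suppress recovery.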

This feature uses the standard index recovery mode defined by the index-recovery-mode parameter described earlier, which can be "from-indexing" or "from-coordinator" (the default value).


<property name="index-recovery-mode" value="from-coordinator" />

Several filter implementations are available:

org.exoplatform.services.jcr.impl.core.query.lucene.DummyRecoveryFilter: Always returns true, for cases where the index must be forcibly re-synchronized (recovered) each time;
org.exoplatform.services.jcr.impl.core.query.lucene.SystemPropertyRecoveryFilter: Returns the value of the system property "org.exoplatform.jcr.recoveryfilter.forcereindexing", so index recovery can be controlled from the top, using system properties, without changing the configuration;
org.exoplatform.services.jcr.impl.core.query.lucene.ConfigurationPropertyRecoveryFilter: Returns the value of the QueryHandler configuration property "index-recovery-filter-forcereindexing", so index recovery can be controlled from the configuration separately for each workspace. For example:
```
<property name="index-recovery-filter"
          value="org.exoplatform.services.jcr.impl.core.query.lucene.ConfigurationPropertyRecoveryFilter" />
<property name="index-recovery-filter-forcereindexing" value="true" />
```
org.exoplatform.services.jcr.impl.core.query.lucene.DocNumberRecoveryFilter: Checks the number of documents in the index on the coordinator side and on the local side, and returns true if they differ. The advantage of this filter over the others is that it skips re-indexing for workspaces whose index was not modified. For example, suppose there are 10 repositories with 3 workspaces each, and only one workspace is heavily used in the cluster: frontend/production. With this filter, only the workspaces that really changed are re-indexed, leaving the other indexes untouched and greatly reducing startup time.
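The effect of the document-count comparison in the 10-repository example can be sketched as follows. This is a simplified Python model under stated assumptions, not the filter's actual implementation: the function name, the repository and workspace names other than frontend/production, and all document counts are made up for illustration.

```python
from typing import Dict, List, Tuple

Workspace = Tuple[str, str]  # (repository name, workspace name)


def workspaces_to_recover(
    local: Dict[Workspace, int], coordinator: Dict[Workspace, int]
) -> List[Workspace]:
    """Workspaces whose local document count differs from the coordinator's."""
    return sorted(ws for ws, n in coordinator.items() if local.get(ws) != n)


# 10 repositories with 3 workspaces each; names and counts are hypothetical.
repositories = ["frontend"] + [f"repo{i}" for i in range(1, 10)]
coordinator = {
    (r, w): 1000 for r in repositories for w in ("production", "staging", "system")
}
local = dict(coordinator)  # this node was in sync when it went down
coordinator[("frontend", "production")] = 5200  # only this index changed since

print(workspaces_to_recover(local, coordinator))
```

Only one of the 30 workspaces is selected, so the other 29 indexes would be left alone on startup.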

Local index recovery strategy

Recovering the local index with the copy-from-coordinator strategy takes a long time to re-synchronize the index when a new cluster node starts. The RSync copy strategy solves this problem while keeping the speed advantages of the local file system.

Note

This strategy can be used only on Linux-based operating systems.

By default, index recovery from the coordinator uses the "copy" strategy. A new strategy that recovers the index using RSync has been added.

System requirement

The RSync copy strategy requires an installed and properly configured RSync utility. It must be callable as "rsync" without specifying its full path.
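One way to verify this requirement before starting a node is to check whether the executable resolves through PATH. The helper below is a hypothetical pre-flight check written for this page, not part of eXo JCR. It relies only on Python's standard `shutil.which`.

```python
import shutil


def rsync_on_path(executable: str = "rsync") -> bool:
    """True when the executable can be resolved through PATH, without a full path."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    print("rsync found on PATH" if rsync_on_path() else "rsync is NOT on PATH")
```

The check should be run as the same operating system user that runs the application server, since PATH can differ between users.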

In addition, each cluster node should run an RSync Server that supports the "rsync://" protocol. For more details, refer to the documentation of the RSync Server in use. An RSync Server configuration example is shown below.

There are also some additional limitations:

The parent index folder for each workspace must be the same across the cluster, for example, "/var/data/index/<repository-name>/<workspace-name>".
The RSync Server configuration must share one of the index's parent folders, for example, "/var/data/index". In other words, the index is stored inside a folder shared by the RSync Server. Configuration details are given below.

Configuration

Configuring the JCR index to use the "rsync" strategy requires some additional parameters on top of the RSync options.

Enable the index recovery profile by adding recovery-index-rsync to EXO_PROFILES:
```
EXO_PROFILES="${EXO_PROFILES},recovery-index-rsync"
```

Configure the RSync Server parameters in exo.properties:

# The folder name to replicate using RSync (the value must be the same on all cluster nodes)
exo.jcr.index.rsync-entry-name=index
# Must equal the absolute path of the index folder configured in the RSync Server configuration (the path can differ between cluster nodes)
exo.jcr.index.rsync-entry-path=/var/data/index
# RSync Server port (must be the same on all cluster nodes)
exo.jcr.index.rsync-port=8085
# rsync-user and rsync-password are optional and can be skipped
# if the RSync Server is configured to accept anonymous identity.
# exo.jcr.index.rsync-user=
# exo.jcr.index.rsync-password=
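To see how these three values relate to each other, it can help to build the kind of "rsync://" URL they imply. The sketch below uses the standard rsync daemon URL shape (`rsync://[user@]host:port/module/path`). How eXo JCR actually composes its source URL is an assumption here, and the function name, host name and index path are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Optional


def rsync_source_url(
    host: str,
    port: int,
    entry_name: str,
    entry_path: str,
    index_dir: str,
    user: Optional[str] = None,
) -> str:
    """Build an rsync:// URL for index_dir, which must live under entry_path."""
    # relative_to raises ValueError when index_dir is outside the shared folder.
    relative = PurePosixPath(index_dir).relative_to(PurePosixPath(entry_path))
    auth = f"{user}@" if user else ""
    return f"rsync://{auth}{host}:{port}/{entry_name}/{relative}/"


# Hypothetical coordinator host and workspace index folder.
print(
    rsync_source_url(
        "node1.example.org", 8085, "index",
        "/var/data/index", "/var/data/index/repository/production",
    )
)
```

The `ValueError` branch mirrors the limitation that the index must be stored inside the folder shared by the RSync Server.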

The RSync Server (rsyncd) can be configured as in the following example:

uid = nobody
gid = nobody
use chroot = no
port = 8085
log file = rsyncd.log
pid file = rsyncd.pid
[index]
path = /var/data/index
comment = indexes
read only = true
auth users = rsyncexo
secrets file = rsyncd.secrets
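Because the port, module name and shared path must agree between exo.properties and the RSync Server configuration, a small cross-check can catch mismatches early. The following Python sketch is a hypothetical helper, not an eXo JCR tool: it parses both files with the standard library and applies only the rules stated in this section. The `[__global__]` section name is an artifact of the sketch.

```python
import configparser
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props


def parse_rsyncd(text: str) -> configparser.ConfigParser:
    """Parse rsyncd.conf-style text; global parameters go under [__global__]."""
    # Strip indentation first: configparser treats indented lines as continuations.
    body = "\n".join(line.strip() for line in text.splitlines())
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string("[__global__]\n" + body)
    return parser


def check_consistency(props: Dict[str, str], rsyncd: configparser.ConfigParser) -> List[str]:
    """Return human-readable mismatches between exo.properties and rsyncd.conf."""
    problems = []
    module = props.get("exo.jcr.index.rsync-entry-name", "")
    if rsyncd.get("__global__", "port", fallback=None) != props.get("exo.jcr.index.rsync-port"):
        problems.append("rsync-port differs from the rsyncd port")
    if not rsyncd.has_section(module):
        problems.append(f"no [{module}] module in rsyncd.conf")
        return problems
    if rsyncd.get(module, "path", fallback=None) != props.get("exo.jcr.index.rsync-entry-path"):
        problems.append("rsync-entry-path differs from the module path")
    if rsyncd.get(module, "auth users", fallback="") and not props.get("exo.jcr.index.rsync-user"):
        problems.append("module sets 'auth users' but rsync-user is not configured")
    return problems
```

Applied to the two samples in this section as printed, this sketch would report only the last problem: the rsyncd example restricts the module to `auth users = rsyncexo`, while the rsync-user and rsync-password properties are commented out.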

Configure the rsync synchronization mode via the system property exo.jcr.index.rsync-strategy:
- rsync-with-delete: forcibly deletes the slave's index folder at each startup before retrieving the indexes from the coordinator (the master). This is the default value.
- rsync: synchronizes index data from the coordinator without removing the old index.
```
# Optional setting. The default value is "rsync-with-delete". If you do not want the index folder deleted at each slave startup, switch this to "rsync".
exo.jcr.index.rsync-strategy=rsync
```
On the coordinator (the master), configure whether the index is online or offline during slave startup. This is done via the system property exo.jcr.index.rsync-offline. The default value is true (that is, the coordinator node's index is set offline).
```
exo.jcr.index.rsync-offline=true
```