CRX Clustering

CRX provides built-in clustering capability that enables multiple repository instances running on separate machines to be joined together to act as a single repository. The repository instances within a cluster are referred to as cluster nodes (not to be confused with the unrelated concept of the JCR node, which is the unit of storage within a repository).

Each node of the cluster is a complete, fully functioning CRX instance with a distinct address on the network. Any operation (read or write) can be addressed to any cluster node. When a write operation updates the data in one cluster node, that change is immediately and automatically applied to all other nodes in the cluster. This guarantees that any subsequent read from any cluster node will return the same updated result.

How Clustering Works

A CRX cluster consists of one master instance and some number of slave instances. A single standalone CRX instance is simply a master instance with zero slave instances. Typically, each instance runs on a distinct physical machine. The instances must all be able to communicate directly with each other over TCP/IP.

Each cluster instance retains an independent identity on the network, meaning that each can be written to and read from independently. However, whenever a write operation is received by a slave instance, it is redirected to the master instance, which makes the change to its own content while also guaranteeing that the same change is made to all slave instances, ensuring synchronization of content.

In contrast, when a read request is received by any cluster instance (master or slave) that instance serves the request immediately and directly. Since the content of each instance is synchronized by the clustering mechanism, the results will be consistent regardless of which particular instance is read from. The usual way to take advantage of this is to have a load-balancing webserver (for example, the Dispatcher) mediate read requests and distribute them across the cluster instances, thus increasing read responsiveness.

Architecture and Configuration

When setting up a clustered installation, there are two main deployment decisions to be made:

• How the cluster fits into the larger architecture of the installation
• How the cluster is configured internally

The cluster architecture concerns how the cluster is used within the installation. The two main options are:

• Active/active clustering
• Active/passive clustering

The cluster configuration defines which internal storage mechanisms are used and how these are shared or synchronized across cluster nodes. The three most commonly used configurations are:

• Shared nothing
• Shared data store
• Shared external database

First we will look at cluster architecture. Cluster configuration will be covered later on.

CQ without Clustering

Because CQ is built on top of the CRX repository, CRX clustering can be employed in CQ installations to improve performance. To understand the various ways that CRX clustering can fit into the larger CQ architecture we will first take a look at two common, non-clustered CQ architectures: single publish and multiple publish.


Single Publish

A CQ installation consists of two separate environments: author and publish. The author environment is used for adding, deleting and editing pages of the website. When a page is ready to go live, it is activated, causing the system to replicate the page to the publish environment from where it is served to the viewing audience.

In the simplest case, the author environment consists of a single CQ instance running on a single server machine and the publish environment consists of another single CQ instance running on another machine. In addition, a Dispatcher is usually installed between the publish server and the web for caching.

Multiple Publish

A common variation on the installation described above is to install multiple publish instances (usually on separate machines). When a change made on the author instance is ready to go live, it is replicated simultaneously to all the publish instances.

As long as all activations and deactivations of pages are performed identically on all publish instances, the content of the publish instances will stay in sync. Depending on the configuration of the front-end server, requests from web surfers are dealt with in one of two ways:

• The incoming requests are distributed among the publish instances by the load-balancing feature of the Dispatcher.
• The incoming requests are all forwarded to a primary publish instance until it fails, at which point all requests are forwarded to the secondary publish instance (in this arrangement there are usually only two instances).

NOTE

The Dispatcher is an additional piece of software provided by Adobe in conjunction with CQ that can be installed as a module within any of the major web servers (Microsoft IIS, Apache, etc.). Load-balancing is a feature of the Dispatcher itself, and its configuration is described here. The configuration of failover, on the other hand, is typically a feature of the web server within which the Dispatcher is installed. For information on configuring failover, therefore, please consult the documentation specific to your web server.


Is Multiple Publish a Form of Clustering?

NOTE

The architecture shown above describes two possible configurations for a multiple publish system: load balancing and failover. The first setup is sometimes described as a form of active/active clustering and the second as a form of active/passive clustering. However, these architectures are not considered true CRX clustering because:

• This solution is specific to the CQ system, since the concept of activation and replication of pages is itself specific to CQ. So this is not a generic solution for all CRX applications.
• Synchronization of the publish instances is dependent on an external process (the replication process configured in the author instance), so the publish instances are not in fact acting as a single system.
• If content is written to a publish instance from the external web (as is the case with user-generated content such as forum comments) CQ uses reverse-replication to copy the content to the author instance, from where it is replicated back to all the publish instances. While this system is sufficient in many cases, it lacks the robustness of content synchronization under true clustering.

For these reasons the multiple publish arrangement is not a true clustering solution as the term is usually employed.

Clustering Architecture

Clustering architectures usually fall into one of two categories: active/active or active/passive.

Active/Active

In an active/active setup the processing load is divided equally among all nodes in the cluster using a load balancer such as the CQ Dispatcher. In normal circumstances, all nodes in the cluster are active and running at the same time. When a node fails, the load balancer redirects the requests from that node across the remaining nodes.

Active/Passive

In an active/passive setup, a front-end server passes all requests back to only one of the cluster nodes (called the primary) which then serves all of these requests itself. A secondary node (there are usually just two) runs on standby. Because it is part of the cluster the secondary node remains in sync with the primary; it just does not actually serve any requests itself. However, if the primary node fails then the front-end server detects this and a "failover" occurs where the server then redirects all requests to the secondary node instead.

Hot Backup

The active/passive setup described above can also be thought of and used as a "hot backup" system. The secondary server (the slave) functions as the backup for the primary server. Because the clustering system automatically synchronizes the two servers, no manual backing-up is required. In the case of the failure of the primary server, the backup can be immediately deployed by activating the secondary connection.

CQ with Clustering

Above we discussed two common CQ architectures that do not use CRX clustering and then gave two general examples of how CRX clustering can work (active/active and active/passive). Now we will put these together and see how CRX clustering can be used within a CQ installation.

Publish Clustering

There are a number of options for using true CRX clustering within a CQ installation. The first one we will look at is publish clustering.


In the publish clustering arrangement, the publish instances of the CQ installation are combined into a single cluster. The front-end server (including the Dispatcher) is then configured either for active/active behavior (where load is equally distributed across cluster nodes) or active/passive behavior (where load is only redirected on failover). The following diagram illustrates a publish cluster arrangement:

NOTE

Hot Backup

A variation on publish clustering with failover is to have the secondary publish server completely disconnected from the web, functioning instead as a continually updated and synchronized backup server. If at any time the backup server needs to be put online, this could be done manually by reconfiguring the front-end server.

Author Clustering

Clustering can also be employed to improve the performance of the CQ author environment. An arrangement using both publish and author clustering is shown below:


Other Variations

It is also possible to set up other variations on the author clustering theme by pairing the author cluster with either multiple (non-clustered) publish instances or with a single publish instance. However, such variants are rarely used in practice.

Clustering and Performance

When faced with a performance problem in a single-instance CQ system (at either the publish or author level), clustering may provide a solution. However, the extent of the improvement, if any, depends upon where the performance bottleneck is located.

Under CQ/CRX clustering, read performance scales linearly with the number of nodes in the cluster. However, additional cluster nodes will not increase write performance, since all writes are serialized through the master node.

Clearly, the increase in read performance will benefit the publish environment, since it is primarily a read system. Perhaps surprisingly, clustering can also benefit the author environment, because even in the author environment the vast majority of interactions with the repository are reads. In the usual case 97% of repository requests in an author environment are reads, while only 3% are writes.

This means that despite the fact that CQ clustering does not scale writes, clustering can still be a very effective way of improving performance both at the publish and author level.

However, while increasing the number of cluster nodes will increase read performance, a bottleneck will still be reached when the frequency of requests gets to the point where the 3% of requests that are writes overwhelm the capabilities of the single master node.

In addition, while the percentage of writes under normal authoring conditions is about 3%, this can rise in situations where the authoring system handles a large number of more complex processes like workflows and multi-site management.

In cases where a write bottleneck is reached, additional cluster nodes will not improve performance. In such circumstances the correct strategy is to increase the hardware performance through improved CPU speed and increased memory.


CRX Storage Overview

Clustering in CRX can be configured in a number of ways, depending on the implementation chosen for each element of the storage system. To understand the options available, a brief overview of CRX storage may therefore be helpful.

Storage Elements

Storage in CRX is made up of the following major elements:

• Persistence store
• Data store
• Journal
• Version storage
• Other file-based storage

Persistence Store

The repository's primary content store holds the hierarchy of JCR nodes and properties. This storage is handled by a persistence manager (PM) and is therefore commonly called the persistence store (despite the fact that all the storage elements are in fact persistent). It is also sometimes referred to as the workspace store because it is configured per workspace (see below).

While a number of different PMs are available, each using a different underlying storage mechanism, the default PM is the Tar PM, a high-performance database built into CRX and designed specifically for JCR content. It derives its name from the fact that it stores its data in the form of standard Unix-style tar files in the file system of the server on which you are running CRX. Other PMs, such as the MySQL PM and the Oracle PM, store data in conventional relational databases, which must be installed and configured separately.

All non-binary property values, and all binary property values under a certain (configurable) size, are stored directly by the PM in the content hierarchy, in the manner specific to that PM. However, binary property values above the threshold size are redirected to the data store (DS, see below), and a reference to the value in the data store is stored by the PM.

Depending on the cluster configuration, each cluster node may have its own separate PM storage (the shared nothing and shared data store configurations) or it may share the same PM storage as the other cluster nodes (the shared external database arrangement). In the case where each cluster node has its own separate PM storage, these are kept synchronized across instances through the journal (see below). For more details on persistence managers, see Persistence Managers.

NOTE

The CRX repository actually supports multiple content hierarchies, which are called workspaces in the terminology of JCR. In theory, different workspaces can be configured with different PMs. However, since multiple workspaces are rarely used in practice and do not have a role in the architecture of CQ, this will not typically be an issue.

Data Store

The data store (DS) holds binary property values over a given, configurable, size. On write, these values are streamed directly to the DS and only a reference to the value is written by the PM to the persistence store. By providing this level of indirection, the DS ensures that large binaries are only stored once, even if they appear in multiple locations within a workspace. In a clustered environment the data store can be either per repository instance (per cluster node) or shared among cluster nodes in a commonly accessible network file system directory. For more details on the data store, see Data Store.

Journal

Whenever the repository writes data it first records the intended change in the journal. Maintaining the journal helps ensure data consistency and helps the system to recover quickly from crashes. In a clustered environment the journal plays the critical role of synchronizing content across cluster instances.


Version Storage

The version storage is held in a single, per-repository, hidden workspace only accessible through the versioning API. Though the version storage is reflected into each normal workspace under /jcr:system/jcr:versionStorage, the actual version information resides in the hidden workspace.

Since the version information is stored in the same way as content in a normal workspace, any changes to it in one cluster instance are propagated to the other instances in the same way as changes in a normal workspace.

The PM used for version storage is configured separately from those used for normal workspaces. By default the Tar PM is used.

Other File-Based Storage

In addition to the above elements, CRX also stores some repository state information in plain files in the file system. This includes the search indices, namespace registry, node type registry and access control settings. In a cluster, this information is automatically synchronized across cluster nodes.

Common Configurations

As mentioned above, there are three commonly used cluster configurations, which differ according to how the various storage elements are configured. The configuration parameters are found in the file /crx-quickstart/repository/repository.xml.

Shared Nothing

This is the default configuration. In this configuration all elements of CRX storage are held per cluster node and synchronized over the network. No shared storage is used. The TarPersistenceManager is used for the persistence store and version storage, the TarJournal is used for the journal and the ClusterDataStore is used for the data store.

In most cases you will not need to manually configure the repository.xml file for a shared nothing cluster, since the GUI clustering setup automatically takes care of it (see GUI Setup of Shared Nothing Clustering).
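The configuration listing from the original document is not reproduced in this transcript. Purely as an illustrative sketch, the relevant repository.xml elements in a shared nothing setup might look roughly like the following; the class names shown are assumptions based on typical CRX 2.x defaults and should be verified against the repository.xml shipped with your installation:

<!-- Sketch only: element nesting simplified, class names assumed -->
<Workspace name="${wsp.name}">
  <!-- per-node persistence store -->
  <PersistenceManager class="com.day.crx.persistence.tar.TarPersistenceManager"/>
</Workspace>
<DataStore class="com.day.crx.core.data.ClusterDataStore">
  <!-- per-node data store, synchronized by the clustering mechanism -->
  <param name="minRecordLength" value="4096"/>
</DataStore>
<Cluster>
  <!-- per-node journal, synchronized over TCP/IP -->
  <Journal class="com.day.crx.core.journal.TarJournal"/>
</Cluster>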

Shared Data Store

In this configuration the workspace stores and the journal are maintained per cluster node as above, using the TarPersistenceManager and TarJournal, but the DataStore element is configured with the class FileDataStore (instead of ClusterDataStore as above). The parameter path points to the location on a shared file system where the data store will be held. Every node in the cluster must be configured to point to the same shared location.
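As an illustrative sketch only (the shared path /shared/datastore is a placeholder, and the class shown is the standard Jackrabbit FileDataStore), the DataStore element on every cluster node might then read:

<DataStore class="org.apache.jackrabbit.core.data.FileDataStore">
  <!-- every cluster node must point at the same shared location -->
  <param name="path" value="/shared/datastore"/>
  <param name="minRecordLength" value="100"/>
</DataStore>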

Shared Data Store and Journal

In this configuration the workspace stores are maintained per cluster node as above, using the TarPersistenceManager, but the data store uses the class FileDataStore (instead of ClusterDataStore as above) and the journal uses the class FileJournal (instead of TarJournal). The resulting configuration for the data store is the same as immediately above. The configuration for the journal is as follows (a sketch of the resulting element structure follows the parameter list below):

• syncDelay: By default, a cluster instance reads the journal and updates its state (including its cache) every 5000 milliseconds (5 seconds). To use a different value, set the attribute syncDelay in the Cluster element. Note: This attribute belongs directly in the Cluster element. It is not a param element within the contained Journal element.


• sharedPath: Mandatory argument specifying the shared directory.
• maximumSize: Maximum size of a single journal file before it is rotated. Default: 104857600 bytes (100 MB).
• maximumAge: Age specified as a duration in ISO 8601 or plain format. Journal files exceeding this age will be removed. If this parameter is not specified, this option is disabled (age is not checked). If the parameter is set to an empty string (""), only one journal log file is kept. The default in CRX is "P1M", which means files older than one month are deleted.
• maximumFiles: Number of rotated journal files kept around. Files exceeding this limit will be removed. A value of 0 disables this option. Default: 0.
• portList: The list of ports to use on a master node. By default, any free port is used. When using a firewall, open ports must be listed. A list of ports or ranges is supported, for example: 9100-9110 or 9100-9110,9210-9220. Default: 0 (any port).
• bindAddress: Set this parameter if synchronization among cluster nodes should be done over a specific network interface. By default, all network interfaces are used. Default: empty (use all interfaces).
• connectTimeout: Timeout in milliseconds that a client will wait when initially connecting to a server instance before aborting. Default: 2000.
• socketTimeout: Socket timeout in milliseconds for both reading and writing over a connection from a client to a server. After this timeout period has elapsed with no activity, the connection is dropped. Default: 60000 (1 minute). If you have very long transactions, use 600000 (10 minutes).
• operationTimeout: Operation timeout in milliseconds. After a client or server has locked the journal and this timeout has passed with no record being appended, the lock on the journal is automatically released. Default: 30000 (30 seconds). If you have very long transactions, use 600000 (10 minutes).
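For illustration, the resulting element structure might look like this (directory and revision values are placeholders; parameter names vary between journal implementations, so verify them against your repository.xml; the FileJournal class shown is the standard Jackrabbit implementation):

<Cluster syncDelay="5000">
  <Journal class="org.apache.jackrabbit.core.journal.FileJournal">
    <!-- local revision counter file for this cluster node -->
    <param name="revision" value="${rep.home}/revision.log"/>
    <!-- shared directory visible to every cluster node -->
    <param name="directory" value="/shared/journal"/>
  </Journal>
</Cluster>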

Shared External Database

In this configuration all cluster nodes share a common persistence store, journal and data store, which are all held in a single external RDBMS.

In addition, the external database storage is sometimes used to store the version storage as well (again, a single store shared across all cluster nodes).

When configuring CRX to use an external database, typically a single backend database system is used to store all the elements (workspaces, data store, journal and version storage) for all instances in the cluster. Note that nothing prevents this backend database system from itself being a clustered system.
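As a rough sketch only (JDBC URL, credentials and driver are placeholders; class names are taken from the Jackrabbit project and may be packaged differently in a CRX installation), the journal portion of such a configuration might look like this, with the workspace PersistenceManager and DataStore elements pointing at the same database using matching database classes (examples appear in the detailed sections below):

<Cluster>
  <Journal class="org.apache.jackrabbit.core.journal.DatabaseJournal">
    <param name="driver" value="com.mysql.jdbc.Driver"/>
    <param name="url" value="jdbc:mysql://dbhost:3306/crx"/>
    <param name="user" value="crx"/>
    <param name="password" value="secret"/>
    <param name="databaseType" value="mysql"/>
  </Journal>
</Cluster>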

Configuration Options in Detail

This section describes the details of cluster configuration and serves as a general guide to all possible variations on clustering. Not all possible combinations are recommended or commonly used; see the Common Configurations section above.

The configuration parameters for clustering are found primarily in the file crx-quickstart/repository/repository.xml. We will examine those sections of the file most relevant to cluster configuration below.

Data Store

The data store is used to store property values above a given configurable size threshold. The persistence manager directly stores only the node/property hierarchy and the smaller property values, while using references to the data store to represent the larger property values in the hierarchy. The data store has the following features:

• It is space saving. Only one copy per unique object is kept. For example, if two or more identical large binaries appear in the content hierarchy, only one copy will actually be stored in the data store and referenced from multiple locations.


• Copying is fast. Only the identifier is copied.
• Storing and reading does not block others.
• Multiple repositories can use the same data store.
• Objects in the data store are immutable.
• Garbage collection is used to purge unused objects.
• Hot backup is supported.

ClusterDataStore

The ClusterDataStore is used with the TarPersistenceManager and the TarJournal to form a Shared Nothing cluster. Each cluster node has its own separate data store, persistence store and journal.

• minRecordLength: The minimum size for an object to be stored in the data store, as opposed to inline within the regular PM. The default is 4096 bytes. The maximum supported value is 32000.

FileDataStore

The FileDataStore is usually used with the TarPersistenceManager and the TarJournal to configure a Shared Data Store cluster. In this configuration the persistence store and journal are still per cluster node, but the data store is shared across the nodes.

This data store places each binary in a separate file. The file name is the hash code of the content. When reading, the data is streamed directly from the file. No local or temporary copy of the file is created. New content is first stored in a temporary file, and later renamed and moved to the right place.

Because the data store is append-only, the FileDataStore is guaranteed to be consistent after a crash. It is usually faster than the DbDataStore, and the preferred choice unless you have strict operational reasons to put everything into a database (see below).

• path (optional): The name of the directory where this data store keeps the files. The default used if this parameter is missing is crx-quickstart/repository/repository/datastore. When configuring a Shared Data Store cluster each cluster node must be configured with this parameter pointing to a common shared filesystem location. This will typically be the mount point on the local filesystem of each machine that is bound to the common NAS storage, for example, /shared/datastore.

• minRecordLength: The minimum object length. The default is 100 bytes; smaller objects are stored inline (not in the data store). Using a low value means more objects are kept in the data store (which may result in a smaller repository, if the same object is used in many places). Using a high value results in fewer objects being stored in the data store (which may result in better performance, because less data store access is required). The maximum value is approximately 32000.

DbDataStore

The DbDataStore is usually used with a database persistence manager (for example, MySqlPersistenceManager) and the DatabaseJournal to configure a Shared External Database cluster.

The DbDataStore stores its data in a relational database using JDBC. All content is stored in one table, whose unique key is the hash code of the content. When reading, the data may be first copied to a temporary file on the server, or streamed directly from the database (depending on the copyWhenReading setting). New content is first stored in the table under a unique temporary identifier, and later the key is updated to the hash of the content.

• url: The database URL (required).
• user: The database user name (required).
• password: The database password (required).
• databaseType: The database type. If not set, the sub-protocol of the JDBC database URL is used. It must match the resource file [databaseType].properties. Example: mysql.properties. Currently supported are: db2, derby, h2, mssql, mysql, oracle, sqlserver.


• driver: The JDBC driver class name. By default the default driver of the configured database type is used.

• maxConnections: Set the maximum number of concurrent connections in the pool. At least 3 connections are required if the garbage collection process is used.

• copyWhenReading: Enabled (true) by default. If enabled, a stream is always copied to a temporary file when reading a stream, so that reads can be concurrent. If disabled, reads are serialized.

• tablePrefix: The table name prefix. The default is empty. Can be used to select a non-default schema or catalog. The table name is constructed as follows: ${tablePrefix}${schemaObjectPrefix}${tableName}.

• schemaObjectPrefix: The schema object prefix. The default is empty.
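As a sketch of how these parameters might be combined (all values are illustrative placeholders; the class shown is the standard Jackrabbit DbDataStore):

<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
  <param name="url" value="jdbc:mysql://dbhost:3306/crx"/>
  <param name="user" value="crx"/>
  <param name="password" value="secret"/>
  <param name="databaseType" value="mysql"/>
  <!-- optional tuning parameters -->
  <param name="maxConnections" value="3"/>
  <param name="copyWhenReading" value="true"/>
  <param name="minRecordLength" value="100"/>
</DataStore>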

NOTE

When adding a record, the stream is first copied to a temporary file. The exception "Cannot insert new record java.io.IOException: No space left on device" indicates that the temporary directory is too small.

Workspaces Element

A CRX repository is composed of one or more workspaces, each of which holds a tree of nodes and properties. These are configured by the elements Workspaces and Workspace (see below). The Workspaces element specifies where the workspace data is to be stored and which of the workspaces will be the default workspace:

• rootPath: The native file system directory for workspaces. A subdirectory is automatically created for each workspace, and the path of that subdirectory can be used in the workspace configuration as the ${wsp.path} variable.

• defaultWorkspace: Name of the default workspace. This workspace is automatically created when the repository is first started.

• maxIdleTime: (Optional) By default CRX only releases resources associated with an opened workspace when the entire repository is closed. This option, if specified, sets the maximum number of seconds that a workspace can remain unused before the workspace is automatically closed. Changing this parameter is not recommended.

• configRootPath: (Optional) By default the configuration of each workspace is stored in a workspace.xml file within the workspace directory within the rootPath directory. If this option is specified, then the workspace configuration files are stored within the specified path in the virtual file system specified in the FileSystem element within the repository.xml file. Changing this parameter is not recommended.
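A minimal sketch of the element (the workspace name crx.default is the usual CRX default, but verify it against your own repository.xml):

<Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="crx.default"/>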

Workspace Element

The Workspace element serves as a template for the creation of individual workspace.xml files for each workspace created in the rootPath configured above, in the Workspaces element. The workspace configuration template and all workspace.xml configuration files have the following structure (a sketch follows the parameter list below):

• name: The name of the workspace. This will be automatically filled in when the actual workspace.xml file is created from this template.

• simpleLocking: A boolean indicating whether simple locking is used on this workspace. The default is true.
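As an illustrative sketch (whether simpleLocking appears as an attribute or a child element should be verified against your own repository.xml; the Tar PM class name is assumed from typical CRX 2.x defaults):

<Workspace name="${wsp.name}" simpleLocking="true">
  <!-- for cluster configuration only the PersistenceManager element is relevant -->
  <PersistenceManager class="com.day.crx.persistence.tar.TarPersistenceManager"/>
</Workspace>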

NOTE

For the purposes of cluster configuration, only the element PersistenceManager is relevant.


PersistenceManager Element

The persistence manager is the main storage mechanism of the repository. It stores the node and property hierarchy and the values of each property (with the exception of large binary values, which are referenced from within the persistence storage and actually stored in the Data Store as discussed above).

TarPersistenceManager

The default persistence manager (and the one used for Shared Nothing clustering) is the TarPersistenceManager:
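The configuration snippet itself is not reproduced in this transcript. As a sketch (the class name is the usual CRX 2.x Tar PM, but confirm it against your repository.xml; optional tuning parameters are omitted):

<PersistenceManager class="com.day.crx.persistence.tar.TarPersistenceManager"/>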

Database Persistence Managers

The persistence store can also be housed in an external database (as is done in the Shared External Database configuration of clustering, for example). There are a number of database persistence managers available. For information about acquiring these persistence managers see the Jackrabbit project. The classes in question are all subclasses of BundleDbPersistenceManager. They are:

• DerbyPersistenceManager
• H2PersistenceManager
• MSSqlPersistenceManager
• MySqlPersistenceManager
• OraclePersistenceManager
• PostgreSQLPersistenceManager
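As an illustrative sketch of a workspace configured with one of these classes (the fully qualified package shown is the Jackrabbit pool package and may differ in a CRX bundle; URL and credentials are placeholders):

<Workspace name="${wsp.name}">
  <PersistenceManager class="org.apache.jackrabbit.core.persistence.pool.MySqlPersistenceManager">
    <param name="driver" value="com.mysql.jdbc.Driver"/>
    <param name="url" value="jdbc:mysql://dbhost:3306/crx"/>
    <param name="user" value="crx"/>
    <param name="password" value="secret"/>
    <param name="schemaObjectPrefix" value="${wsp.name}_"/>
  </PersistenceManager>
</Workspace>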

Cluster Element

The element <Cluster> holds the parameters that govern the journal.

TarJournal

The default journal implementation is the TarJournal. This journal is used for both the Shared Nothing and Shared Data Store configurations.

• syncDelay: Events that were issued by other cluster nodes are processed after at most this many milliseconds. Optional, the default is 5000 (5 seconds).
• bindAddress: Used if the synchronization between cluster nodes should be done over a specific network interface. Default: empty (meaning all network interfaces are used).
• maxFileSize: The maximum file size per journal tar file. If the current data file grows larger than this number (in megabytes), a new data file is created (if the last entry in a file is very big, a data file can actually be bigger, as entries are not split among files). The maximum file size is 1024 (1 GB). Data files are kept open at runtime. The default is 256 (256 MB).
• maximumAge: Age specified as a duration in ISO 8601 or plain format. Journal files that are older than the configured age are automatically deleted. The default is "P1M", which means files older than one month are deleted.
• portList: The list of listener ports to use by this cluster node. When using a firewall, the open ports must be listed. A list of ports or ranges is supported, for example: 9100-9110 or 9100-9110,9210-9220. By default, the following port list is used: 8088-8093.
• becomeMasterOnTimeout: Flag indicating whether a slave may become master when the network connection to the current master times out. Default: false. If this parameter is set to true, then the slave will attempt to contact the master up to 5 times, where the timeout for each attempt is defined by the system property socket.receiveTimeout. By default this system property is set to 60000 milliseconds (1 minute), so the default total timeout for a slave to become master is 5 minutes.

FileJournal

NOTE


This option is maintained for backward compatibility. It is not recommended for new installations.

The FileJournal stores the journal data in a plain file, usually in a location shared by all cluster nodes (through a directory bound to a NAS, for example), in a manner similar to the FileDataStore (above). This class is used in the Shared Data Store and Journal configuration.

• revision: The name of this cluster node's revision file; this is a required property with no default value.
• directory: The directory where the journal file as well as the rotated files are kept; this is a required property with no default value.
• basename: The basename of the journal files; the default value is DEFAULT_BASENAME. It is recommended that this setting not be changed.
• maximumSize: The maximum size of an active journal file before rotating it; the default value is DEFAULT_MAXSIZE, which is equal to 1048576. It is recommended that this setting not be changed.

Database Journals

The journal can also be stored in an external database. A number of database journal implementations are available. These classes are used in the Shared External Database configuration.

• With an Oracle database use OracleDatabaseJournal.
• With an MS SQL database use MSSqlDatabaseJournal.
• For other generic databases (accessible through JNDI) use the base DatabaseJournal class. Do not use the JNDIDatabaseJournal; it has been deprecated.

Cluster Properties and Cluster Node ID

The file crx-quickstart/repository/cluster_node.id contains the cluster node ID. Each instance within a cluster must have a unique ID. This file is automatically created by the system. By default it contains a randomly generated UUID, but it can be any string. When copying a cluster node, this file should be copied. If two nodes (instances) within a cluster contain the same cluster node ID, only one of them will be able to connect. Example file:

08d434b1-5eaf-4b1c-b32f-e9abedf05f23

The file crx-quickstart/repository/cluster.properties contains cluster configuration properties. The file is automatically updated by the system if the cluster configuration is changed in the GUI. Example file:

#Cluster properties
#Wed May 23 16:02:06 CEST 2012
cluster_id=86cab8df-3aeb-4985-8eb5-dcc1dffb8e10
addresses=10.0.2.2,10.0.2.3
members=08d434b1-5eaf-4b1c-b32f-e9abedf05f23,fd11448b-a78d-4ad1-b1ae-ec967847ce94

The cluster_id property contains the identifier for this cluster; each node in the cluster must have the same cluster ID. By default this is a randomly generated UUID, but it can be any string.

The addresses property contains a comma-separated list of the IP addresses of all nodes in this cluster. This list is used at the startup of each cluster node to connect to the other nodes that are already running. The list is not needed if all cluster nodes are running on the same computer (which may be the case in certain circumstances, such as testing).

The members property contains a comma-separated list of the cluster node IDs that participate in the cluster. This property is not required for the cluster to work; it is for informational purposes only.

System Properties

The following system properties affect the cluster behavior:

socket.connectTimeout: The maximum number of milliseconds to wait for the master to respond (default: 1000). A timeout of zero means infinite (block until connection established or an error occurs).

socket.receiveTimeout: The maximum number of milliseconds to wait for a reply from the master or slave (SO_TIMEOUT; default: 60000). A timeout of zero means infinite.


com.day.crx.core.cluster.DisableReverseHostLookup: Disable the reverse lookup from the master to the slave when connecting (default: false). If not set, the master checks if the slave is reachable using InetAddress.isReachable(connectTimeout).

Clustering Setup

In this section we describe the process of setting up a CRX cluster.

Clustering Requirements

The following requirements should be observed when setting up clustering:

• Each cluster node (CRX instance) should be on its own dedicated machine. During development and testing one can install multiple instances on a single machine, but for a production environment this is not recommended.

• For shared-nothing clustering, the primary infrastructure requirement is that the network connecting the cluster nodes have high reliability, high availability and low latency.

• For shared data store and shared external database clustering, the shared storage (be it file-based or RDBMS-based) should be hosted on a high-reliability, high-availability storage system. For file storage the recommended technologies include enterprise-class SAN systems, NFS servers and CIFS servers. For database storage a high-availability setup running either Oracle or MySQL is recommended. Note that since, in either case, the data is ultimately stored on a shared system, the reliability and availability of the cluster in general depends greatly on the reliability and availability of this shared system.

GUI Setup of Shared Nothing Clustering

By default a freshly installed CRX instance runs as a single-master, zero-slave, shared-nothing cluster. Additional slave nodes can be added to the master easily through the cluster configuration GUI.

If you wish to deploy a shared data store or shared external database cluster, or if you wish to tweak the settings of the default shared-nothing cluster, you will have to perform a manual configuration. In this section we describe the GUI deployment of a shared-nothing cluster. Manual configuration is covered in the next section.

1. Install two or more CRX instances. In a production environment, each would be installed on a dedicated server. For development and testing, you may install multiple instances on the same machine.

2. Ensure that all the instance machines are networked together and visible to each other over TCP/IP.

CAUTION

Cluster instances communicate with each other through port 8088. Consequently this port must be open on all servers within a cluster. If one or more of the servers is behind a firewall, that firewall must be configured to open port 8088. If you need to use another port (for example, due to firewall constraints), use the Manual Cluster Setup approach. You can configure the cluster communication port to another port number, in which case that port must be visible through any firewall that may be in place. To configure the cluster communications port, change the portList parameter in the <Journal> element of repository.xml as described here.

3. Decide which instance will be the master instance. Note the host name and port of this instance. For example, if you are running the master instance on your local machine, its address might be localhost:4502.

4. Every instance other than the master will be a slave instance. You will need to connect each slave instance to the master by going to their respective cluster configuration pages here: http://<slave-address>/libs/granite/cluster/content/admin.html


5. In the Cluster Configuration page enter the address of the master instance in the field marked Master URL, as follows: http://<master-address>/ For example, if both your slave and master are on your local machine you might enter http://localhost:4502/ Once you have filled in the Master URL, enter your Username and Password on the master instance and click Join. You must have administrator access to set up a cluster.

6. Joining the cluster may take a few minutes. Allow some time before refreshing the master and slave UIs in the browser. Once the slave is properly connected to the master you should see something similar to the following on the master and slave cluster UIs:


NOTE

In some cases, a restart of the slave instance might be required to avoid stale sessions.

NOTE

When configuring file-based persistence managers such as the Tar PM (as opposed to database-based PMs), the file system location specified (for example, the location where the Tar PM is configured to store its tar files) should be true local storage, not network storage. The use of network storage in such cases will degrade performance.

Manual Cluster Setup

In some cases a user may wish to set up a cluster without using the GUI. There are two ways to do this: manual slave join and manual slave clone.

The first method, manual slave join, is the same as the standard GUI procedure except that it is done without the GUI. Using this method, when a slave is added, the content of the master is copied over to it and a new search index on the slave is built from scratch. In cases where a pre-existing instance with a large amount of content is being "clusterized" this process can take a long time.

In such cases it is recommended to use the second method, manual slave clone. In this method the master instance is copied to a new location either at the file system level (i.e., the crx-quickstart directory is copied over) or using the online backup feature, and the new instance is then adjusted to play the role of slave. This avoids the rebuilding of the index and for large repositories can save a lot of time.

Manual Slave Join

The following steps are similar to joining a cluster node using the GUI. That means the data is copied over the network, and the search index is re-created (which may take some time).

If a quickstart deployment is being used (i.e., stand-alone, without an application server) then:

On the master

• Copy the files crx-quickstart-*.jar and license.properties to the desired directory.
• Start the instance:

java -Xmx512m -jar *.jar


• Verify that the instance is up and running, then stop the instance.
• In the file crx-quickstart/repository/cluster.properties, add the IP address of the slave instance you are adding below (if that slave is on a different machine, otherwise this step is not needed).
• Stop the instance.
• If a shared data store is to be used:
  • Change crx-quickstart/repository/repository.xml, as described in Shared Data Store and Data Store, above.
  • Assuming you have configured <shared>/datastore/ as the shared location, copy the contents of crx-quickstart/repository/repository/datastore to the directory <shared>/datastore.
• Start the instance, and verify that it still works.

On the slave

• Copy the files crx-quickstart-*.jar and license.properties to the desired directory (usually on a different machine from the master, unless you are just testing).
• Unpack the JAR file:

java -Xmx512m -jar crx-quickstart-*.jar -unpack

• Copy the files repository.xml and cluster.properties from the master:

cp ../n1/crx-quickstart/repository/repository.xml crx-quickstart/repository/
cp ../n1/crx-quickstart/repository/cluster.properties crx-quickstart/repository/

• Copy the namespaces and node types from the master:

cp -r ../n1/crx-quickstart/repository/repository/namespaces/ crx-quickstart/repository/repository/
cp -r ../n1/crx-quickstart/repository/repository/nodetypes/ crx-quickstart/repository/repository/

• If this new slave is on a different machine from the master, append the IP address of the master to the cluster.properties file of the slave:

echo "addresses=x.x.x.x" >> crx-quickstart/repository/cluster.properties

where x.x.x.x is replaced by the correct address. As mentioned above, the IP address of the slave should be added to the master's cluster.properties file as well.
• Start the slave instance:

java -Xmx512m -jar crx-quickstart-*.jar

If an application server deployment is being used (i.e., the CQ or CRX war file is being deployed) then:

On the master

• Deploy the war file into the application server. See Installing CRX in an Application Server and Installing CQ in an application server.
• Stop the application server.
• In the file crx-quickstart/repository/cluster.properties, add the IP address of the slave instance you are adding below (if that slave is on a different machine, otherwise this step is not needed).
• If a shared data store is to be used, change crx-quickstart/repository/repository.xml, as described in Shared Data Store and Data Store, above.
• Move the datastore directory to the required place.
• Start the application server, and verify that it still works.

On the slave

• Deploy the war file into the application server.
• Stop the application server.
• Copy the files repository.xml and cluster.properties from the master:

cp ../n1/crx-quickstart/repository/repository.xml crx-quickstart/repository/
cp ../n1/crx-quickstart/repository/cluster.properties crx-quickstart/repository/

• Copy the namespaces and node types from the master:

cp -r ../n1/crx-quickstart/repository/repository/namespaces/ crx-quickstart/repository/repository/
cp -r ../n1/crx-quickstart/repository/repository/nodetypes/ crx-quickstart/repository/repository/

• If this new slave is on a different machine from the master, append the IP address of the master to the cluster.properties file of the slave:

echo "addresses=x.x.x.x" >> crx-quickstart/repository/cluster.properties

where x.x.x.x is replaced by the correct address. As mentioned above, the IP address of the slave should be added to the master's cluster.properties file as well.

• Start the application server.


Manual Slave Cloning

The following steps clone the master instance and change that clone into a slave, preserving the existing search index.

Master

Your existing repository will be the master instance.

If it is feasible to stop the master instance:

• Stop the master instance either through the GUI switch, the command line stop script or, in the case of an application server deployment, by stopping the application server.
• In the case of a quickstart deployment, copy the crx-quickstart directory of the master over to the location where you want the slave installed, using a normal filesystem copy (cp, for example).
• In the case of an application server installation, copy the exploded war file from the master to the same location in the slave application server (see Installing CRX in an Application Server and Installing CQ in an application server).
• Restart the master.

If it is not feasible to stop the master instance:

• Do an online backup of the instance to the new slave location. The online backup tool can be made to write the copy directly into another directory or to a zip file which you can then unpack in the new location. See here for details. The process can be automated using curl or wget. For example:

curl -c login.txt "http://localhost:7402/crx/login.jsp?UserId=admin&Password=xyz&Workspace=crx.default"
curl -b login.txt -f -o progress.txt "http://localhost:7402/crx/config/backup.jsp?action=add&&zipFileName=&targetDir=<targetDir>"

Slave

In the new slave instance directory (or, in the application server case, the exploded war file directory of the slave):

• Modify the file crx-quickstart/repository/cluster_node.id so that it contains a unique cluster node ID. This ID must differ from the IDs of all other nodes in the cluster.
• Add the node IDs of the master instance and all other slave nodes (apart from this one), separated by commas, to the file crx-quickstart/repository/cluster.properties. For example, you could use something like the following command (with the capitalized items replaced with the actual IDs used):

echo "members=MASTER_NODE_ID,SLAVE_NODE_1_ID" >> crx-quickstart/repository/cluster.properties

• Add the master instance IP address and the IP addresses of all other slave instances (apart from this one) to the file crx-quickstart/repository/cluster.properties. For example, you could use something like the following command (with the capitalized items replaced with the actual addresses used):

echo "addresses=MASTER_IP_ADDRESS,SLAVE_1_IP_ADDRESS" >> crx-quickstart/repository/cluster.properties

• Remove the file sling.id.file, which stores the instance ID by which nodes in a cluster are distinguished from each other. This file will get re-generated with a new ID if missing, so deleting it is sufficient. The file is in different places depending on the installation, so it needs to be found:

rm -i $(find -type f -name sling.id.file)

• Start the slave instance. It will join the cluster without re-indexing. Note: Once the slave is started, the master's cluster.properties file will automatically be updated by appending the node ID and IP address of the slave.


Troubleshooting

In some cases, when the master instance is stopped while the other cluster instances are still running, the master instance cannot re-join the cluster after being restarted.

This can occur in cases where a write operation was in progress at the moment that the master node was stopped, or where a write operation occurred a few seconds before the master instance was stopped. In these cases, the slave instance may not receive all changes from the master instance. When the master is then re-started, CRX will detect that it is out of sync with the remaining cluster instances and the repository will not start. Instead, an error message is written to the server.log saying the repository is not available, and the following or a similar error message appears in the files crx-quickstart/logs/crx/error.log and crx-quickstart/logs/stdout.log:

ClusterTarSet: Could not open (ClusterTarSet.java, line 710)
java.io.IOException: This cluster node and the master are out of sync. Operation stopped.
Please ensure the repository is configured correctly.
To continue anyway, please delete the index and data tar files on this cluster node and restart.
Please note the Lucene index may still be out of sync unless it is also deleted.
...
java.io.IOException: Init failed
...
RepositoryImpl: failed to start Repository: Cannot instantiate persistence manager ...
RepositoryStartupServlet: RepositoryStartupServlet initializing failed

Avoiding Out-of-Sync Cluster Instances

To avoid this problem, ensure that the slave cluster instances are always stopped before the master is stopped. If you are not sure which cluster instance is currently the master, open the page http://localhost:port/crx/config/cluster.jsp. The master ID listed there will match the contents of the file crx-quickstart/repository/cluster_node.id of the master cluster instance.

Recovering an Out-of-Sync Cluster Instance

To re-join a cluster instance that is out of sync, there are a number of solutions:

• Create a new repository and join the cluster node as normal.
• Use the Online Backup feature to create a cluster node. In many cases this is the fastest way to add a cluster node.
• Restore an existing backup of the cluster instance node and start it.
• As described in the error message, delete the index and data tar files that are out of sync on this cluster node and restart. Note that the Lucene search index may still be out of sync unless it is also deleted. This procedure is discouraged as it requires more knowledge of the repository, and may be slower than using the online backup feature (especially if the Lucene index needs to be re-built).


Time-to-Sync

The time that it takes to synchronize an out-of-sync cluster node depends on two factors:

• The length of time that the node has been disconnected from the cluster.
• The rate of change of information in the live node.

Combined, these factors determine how much data has changed during the time that the node has been disconnected, and therefore the amount of data that needs to be transferred and written to that node to re-synchronize it.

Since the time taken depends on these variables, it will differ from installation to installation. However, by making some realistic worst-case scenario projections we can estimate an example sync-time:

• Assume that a cluster node will not be out of sync for any more than 24 hours, since in most cases this is sufficient time for manual intervention to occur.
• Also assume that the website in question is the main corporate website for an enterprise of approximately 10,000 employees. 24 hours of work on a website in such an organization is projected as follows:
  • import 250 2MB images
  • create 2400 pages
  • perform 10000 page modifications
  • activate 2400 pages

Internal testing at Adobe has shown that given the above assumptions the results for a CQ 5 installation using CRX clustering are:

• Synchronization took 13 minutes and 38 seconds.
• The author response time increased from 400ms to 700ms during the synchronization (including network latency).
• The initial crx-quickstart folder size was 1736MB.
• The crx-quickstart folder size after the simulated 24h was 2588MB.
• Therefore a total transfer of 852MB was needed to perform the synchronization.

Locking in Clusters

Active clusters do not support session-scoped locks. Open-scoped locks or application-side solutions for synchronizing write operations should be used instead.

Tar PM Optimization

The Tar PM stores its data in standard Unix-style tar files. Occasionally, you may want to optimize these storage files to increase the speed of data access. Optimizing a Tar PM clustered system is essentially identical to optimizing a stand-alone Tar PM instance (see Optimizing Tar Files).

The only difference is that to optimize a cluster, you must run the optimization process on the master instance. If the optimization is started on the master instance, the shared data as well as the local cache of the other cluster instances is optimized automatically. There is a small delay (a few seconds) before the changes are propagated. If one cluster instance is not running while optimizing, the tar files of that cluster instance are automatically optimized the next time the instance is started.

Manual Failover

In an active/passive environment with two nodes, under normal operating conditions, all incoming requests are served by the master, while the slave maintains content synchronization with the master but does not itself serve any requests. Ideally, if the master node fails, the slave, having identical content to the master, would automatically jump in and start serving requests, with no noticeable downtime.

Of course, there are many cases where this ideal behavior is not possible, and a manual intervention by an administrator is required. The reason for this is that, in general, the slave cannot know with certainty what type of failure has occurred. And the type of failure (and there are many types) dictates the appropriate response. Hence, intelligent intervention is usually required.

For example, imagine a two-node active/passive cluster with master A and slave B. A and B keep track of each other's state through a "heartbeat" signal. That is, they periodically ping each other and wait for a response.


Now, if B finds that its pings are going unanswered for some relatively long period, there are, generally speaking, two possible reasons:

1. A is not responding because it is inoperable.
2. A is still operating normally and the lack of response is due to some other reason, most likely a failure of the network connection between the two nodes.

If (1) is the case, then the logical thing is for B to become the master and for requests to be redirected to and served by B. In this situation, if B does not take on A's former role, the service will be down; an emergency situation.

But if (2) is the case, then the logical thing to do is for B to simply wait and continue pinging until the network connection is restored, at which point it will resynchronize with the master, applying all changes that occurred on the master during the downtime. In this situation, if B instead assumes that A is down and takes over, there would be two functioning master nodes, in other words a "split-brain" situation in which the two nodes may become desynchronized.

Because these two scenarios call for conflicting responses and the slave cannot distinguish between them, the default setting in CRX is that, upon repeated failures to reconnect to a non-responsive master, the slave does not become the master. However, see below for a case where you may wish to alter this default behavior.

Assuming this default behavior, it is at this point that manual intervention is needed.

Your first priority is to ensure that at least one node is up and serving requests. Once you have accomplished this, you can worry about starting the other nodes, synchronizing them, and so forth. Here is the procedure to follow:

1. First, determine which of the scenarios above applies.
2. If the master node A is still responding and serving requests, then you have scenario (2) and the problem lies in the connection between master and slave.
   • In this case no emergency measures need to be taken; since the service is still functional, you must simply reestablish the cluster.
   • First, stop the slave B.
   • Ensure that the network problem has been solved.
   • Restart B. It should automatically rejoin the cluster. If it is out of sync, you can troubleshoot it according to the tips given in Recovering an Out-Of-Sync Cluster Node.
3. On the other hand, if A is down then you have scenario (1). In this case your first priority should be to get a functioning instance up and running and serving requests as soon as possible.
   • You have two choices: restart A and keep it as master, or redirect requests to B and make it the new master. Which one you choose should depend on which can be achieved most quickly and with the highest likelihood of success.
   • If the problem with A is easy to fix, then restart it and ensure that it is functioning properly. Then restart B and ensure that it properly rejoins the cluster.
   • If the problem with the master looks more involved, it may be easier to redirect incoming requests to the slave and restart the slave, making it the new master.
   • To do this you must first stop B and remove the file crx-quickstart/repository/clustered.txt (a shell sketch of these steps follows this list). This is a flag file that CRX creates to keep track of whether a restarted system should regard itself as master or slave. The presence of the file indicates to the system that before the restart it was a slave, so it will attempt to automatically rejoin its cluster. The absence of the file indicates that the instance was, before the restart, a master (or a lone un-clustered instance), in which case it does not attempt to rejoin any cluster.
   • Now restart B.
   • Once you have confirmed that B is up and running and serving requests, you can work on fixing A. When A is in working order you can join it to B, except this time A will be the slave and B the master. Alternatively, you may switch back to the original arrangement. Just don't forget about the clustered.txt file!
   • It may be that A now reports that it is out of sync with cluster node B. See Recovering an Out-Of-Sync Cluster Node for information on fixing this.

Enabling Automatic Failover

As discussed above, the default failover behavior in a CRX cluster requires manual intervention because, in the typical case, a slave cannot distinguish between an inoperable master instance and an inoperable network connection to the master, and it is only in the former case that an actual failover should be performed.

In some cases, however, it may make sense to enable automatic failover. This is only recommended if the slave and master have two or more distinct but redundant network connections to one another.


In such a case, a slave will only detect a heartbeat timeout from the master if either the master is down or all the redundant network connections are down.

Since the chance of all redundant connections being down at the same time is lower than that of a single failed connection, we can have more confidence that a heartbeat timeout really does mean that the master is down. Obviously, this depends on the redundant connections being truly independent, to the greatest extent possible. Likewise, having more than two such redundant, independent connections would further increase our confidence in the reliability of the heartbeat monitor.

The level of confidence in the heartbeat monitor is necessarily a judgement call on the part of the administrator of the cluster. But if it is determined that the level of confidence is high enough, then the default failover behavior can be changed by setting the parameter becomeMasterOnTimeout to true in the <Cluster> element of repository.xml.
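
As a sketch only, the relevant fragment of repository.xml might then look like the following. The exact form of the <Cluster> element (whether the setting is an attribute or a <param> child, and what other journal and synchronization settings it already contains) depends on your CRX version and existing configuration, so compare this against your own repository.xml rather than copying it verbatim:

    <!-- Fragment of repository.xml (assumed form; keep your existing Cluster settings). -->
    <Cluster>
      <!-- Let a slave promote itself to master after a heartbeat timeout. -->
      <param name="becomeMasterOnTimeout" value="true"/>
      <!-- ... existing journal and synchronization parameters remain unchanged ... -->
    </Cluster>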

Cluster Solution Scenarios

Apart from the general guidelines above, there are a number of very specific actions to be taken with respect to particular types of cluster failure. The tables below summarize these scenarios. This specific information should be used in concert with the procedure above.

Emergency

• Scenario: Master shutdown with high load on both nodes
  Expected behavior: Shutdown within 2 min; failover; possibly out-of-sync
  Restore: Restart old master. Master may be out of sync and may need to be restored.

• Scenario: Slave shutdown with load on both nodes
  Expected behavior: Shutdown within 1 min
  Restore: Restart slave.

• Scenario: Master process dies (kill -9)
  Expected behavior: Failover
  Restore: Restart old master. Master may be out of sync and may need to be restored.

• Scenario: Complete hardware failure on master (e.g. power failure) with default config
  Expected behavior: Slave may be blocked or become master
  Restore: Restart slave. Master may be out of sync and may need to be restored.

• Scenario: Heavy load on slave node (DOS behavior)
  Expected behavior: Should still work, slow
  Restore: Remove load, restart slave.

• Scenario: Heavy load on master node (DOS behavior)
  Expected behavior: Should still work, slow
  Restore: Remove load, restart master.

• Scenario: Master runs out of disk space
  Expected behavior: Master should close/kill itself
  Restore: Free space, restart master.

• Scenario: Master runs out of memory
  Expected behavior: Master should close/kill itself
  Restore: Restart master.

Network Problems

• Scenario: Network failure on master/slave, restore network within 5 min
  Expected behavior: Slave should re-join
  Restore: Not required


• Scenario: Network failure on master/slave, restore network after 7 min
  Expected behavior: Slave should re-join
  Restore: Not required

• Scenario: Set "becomeMasterOnTimeout", stop network for 6 min
  Expected behavior: Slave should become master (split brain)
  Restore: Restore previous master or slave from the other node

• Scenario: Set "becomeMasterOnTimeout", set system property "socket.receiveTimeout" to "10000" (10 seconds); stop network for 1 min
  Expected behavior: Slave should become master (split brain)

• Scenario: Persistent network failure on master
  Expected behavior: Master continues, slave blocked
  Restore: Restore network

• Scenario: Network failure on master/slave with load on both nodes
  Expected behavior: Master continues, slave blocked
  Restore: Restore network

• Scenario: Slow connection between the nodes
  Expected behavior: Should still work, slow
  Restore: Improve connection
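
For the scenario above that mentions the socket.receiveTimeout system property, such a property would typically be passed as a JVM argument when the CRX instance is started. The line below is purely illustrative: the jar name is a placeholder for whatever your installation actually launches, and the 10-second value is shown only because it is the value used in the test scenario, not as a recommendation.

    # Illustrative only: pass the system property to the JVM that runs the instance.
    java -Dsocket.receiveTimeout=10000 -jar cq-quickstart.jar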

Maintenance Operations

• Scenario: Slave shutdown without load
  Expected behavior: Shutdown within 1 min
  Restore: Restart old slave

• Scenario: Master shutdown without load
  Expected behavior: Shutdown within 1 min; failover
  Restore: Restart old master

Project QA with Clustering

When testing a clustered CRX-based project (a website built on CQ, for example), the following QA tests should be passed at least once before content entry (with final code but not yet all content) and at least once after content entry and before the project goes live:

• Cluster slave shutdown and restart: Verify rejoin to cluster and basic smoke test (sanity check).
• Cluster master shutdown and restart: Verify rejoin to cluster and basic smoke test (sanity check).
• Cluster slave kill -9 or pull plug: Verify rejoin to cluster and basic smoke test (sanity check).
• Cluster master kill -9 or pull plug: Verify rejoin to cluster and basic smoke test (sanity check).
• Recover whole cluster from backup (disaster recovery).

Cluster Synchronization Time

