This week I've been working on getting clustering set up for a client. Initially we were using CF10 with the latest patches. Ideally we wanted non-sticky load balancing with session replication: really high availability, with the option to reboot a server at any time without waiting for session draining or losing customers if a node goes down. Adam Cameron points out in "Problem with session replication with CF10 clustering" (Adam Cameron's CFML Blog) that CF10 has no option to turn on session replication. After trying various fixes I still could not get the session to replicate, so we moved to CF11, which resolves that issue. There is a bug open for CF10 with some weird responses, but I never saw any sort of fix for it.
CF11, as noted, solves this odd issue, so I thought we were in the clear. Following the limited cluster setup guides found online, there is some manual configuration to do on the remote instance. First, I am not sure whether the default cfusion instance simply can't be used as a member of a cluster, but I never managed to get it to work. So both the local and remote instances are new CF11 instances created from within the Instance Manager. The Adobe ColdFusion 10 instructions, "Enabling clustering for load balancing and failover", are mostly correct in that you have to copy the <Cluster> node to the remote instance. One issue pointed out in a few places is that the <Cluster> block has to actually go IN the <Host> node and not after it. CF10, CF11 and maybe even CF9 put the block after the </Host> tag (and the documentation suggests putting it there), which, in my experience, does not work.
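To make the placement concrete, here is a rough sketch of where the block ended up for me. The Engine and Host attributes shown are assumptions based on a typical Tomcat server.xml and will differ on your instance; only the nesting matters here.

```xml
<!-- Sketch only: Engine/Host attribute values are assumed, not copied from a CF11 install -->
<Engine name="Catalina" defaultHost="localhost">
  <Host name="localhost" appBase="webapps">
    <!-- existing Host children stay as they are -->

    <!-- the Cluster block goes here, INSIDE Host -->
    <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster">
      <!-- Manager, Channel, Valve and ClusterListener children go here -->
    </Cluster>
  </Host>
  <!-- not here, after the closing Host tag, which is where CF writes it -->
</Engine>
```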
After everything was configured and I started up my test, I could not get the remote node to respond at all. Looking in the CF error log, I constantly saw this line:
INFO: Manager [/]: skipping state transfer. No members active in cluster group.
Digging into the Tomcat clustering discussions, this basically means the cluster couldn't find the remote instance. By default CF uses the multicast cluster support in Tomcat and doesn't offer an option to do anything different. It turns out that AWS does not support broadcast or multicast in EC2. Further research showed how Tomcat can be configured with static cluster membership, so I modified the server.xml files to match and voilà: a cluster with session replication. On the AWS ELB we have sticky sessions disabled (basically round-robin style requests), and the requests bounce evenly between the instance members. The session IDs, however, stay the same on each page load even though the request is going to a different host.
So here is what the cluster node of the server.xml looks like:
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8" channelStartOptions="3">
  <Manager notifyListenersOnReplication="true" expireSessionsOnShutdown="false" className="org.apache.catalina.ha.session.DeltaManager"/>
  <Channel className="org.apache.catalina.tribes.group.GroupChannel">
    <!--<Membership port="45564" dropTime="3000" address="228.0.0.4" className="org.apache.catalina.tribes.membership.McastService" frequency="500"/>-->
    <Receiver port="4001" autoBind="100" address="auto" selectorTimeout="5000" maxThreads="6" className="org.apache.catalina.tribes.transport.nio.NioReceiver"/>
    <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
      <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender"/>
    </Sender>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpPingInterceptor"/>
    <!-- ADDED -->
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector"/>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor"/>
    <Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
      <Member className="org.apache.catalina.tribes.membership.StaticMember" port="4002" host="172.31.33.220" domain="delta-static" uniqueId="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}" />
    </Interceptor>
  </Channel>
  <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter=""/>
  <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve"/>
  <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener"/>
  <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener"/>
</Cluster>
You can see the <Membership> node is commented out (this is the multicast function). The TcpPingInterceptor and the StaticMembershipInterceptor are added. The receiver port on this instance is 4001 and on the remote instance it is 4002, so the static <Member> on this instance uses port 4002 to contact the remote host, and vice versa. In other words, the remote instance uses the same <Cluster> node with the ports switched and the host IP address changed on the static member. The uniqueId then rotates on each member, going from {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15} to {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0}.
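As a sketch of that mirror image, these are the only two elements that should differ on the remote instance (the one at 172.31.33.220). The host value 172.31.33.219 is a made-up placeholder for the first instance's private IP, which I haven't listed here; substitute your own.

```xml
<!-- On the remote instance: it listens on 4002 -->
<Receiver port="4002" autoBind="100" address="auto" selectorTimeout="5000" maxThreads="6" className="org.apache.catalina.tribes.transport.nio.NioReceiver"/>

<!-- ...and points back at the first instance on 4001.
     host is a PLACEHOLDER (assumed), uniqueId is the rotated value. -->
<Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
  <Member className="org.apache.catalina.tribes.membership.StaticMember" port="4001" host="172.31.33.219" domain="delta-static" uniqueId="{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0}" />
</Interceptor>
```

Everything else in the <Cluster> block stays identical on both sides.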
Of course, each additional cluster member means manual changes to every existing member (adding another static <Member> entry for it), but that seems a small price to pay to not have to move our entire environment off AWS.
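I have only run this with two nodes, so treat the following as an untested sketch. As I understand Tomcat's static membership, a third node would show up on the first instance as a second <Member> inside the existing StaticMembershipInterceptor. The third node's port (4003), host (172.31.33.221) and uniqueId are all hypothetical values chosen for illustration.

```xml
<Interceptor className="org.apache.catalina.tribes.group.interceptors.StaticMembershipInterceptor">
  <!-- existing remote node -->
  <Member className="org.apache.catalina.tribes.membership.StaticMember" port="4002" host="172.31.33.220" domain="delta-static" uniqueId="{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}" />
  <!-- hypothetical third node: port, host and uniqueId are assumptions -->
  <Member className="org.apache.catalina.tribes.membership.StaticMember" port="4003" host="172.31.33.221" domain="delta-static" uniqueId="{2,3,4,5,6,7,8,9,10,11,12,13,14,15,0,1}" />
</Interceptor>
```

Each of the other nodes would need the equivalent pair of entries for the two nodes that are remote to it, with every node keeping one uniqueId that no other node shares.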