Hi,
We have tried to create a HA cluster with requests being distributed round robin to N instances of coldfusion, we are NOT using sticky sessions as we are replication session state to all cf instances. What we are seing is that all is fine with low to moderate load, however under heavy load and at random times the replication fails and leads to things in session scope not working. This manifests in users not being able to login to our application (we store a token in session scope to store logged in status).
Again key point, under low to moderate load it all works fine, users are directed to random nodes in the cluster and their session is picked up fine as the session is distributed to all nodes,so pretty confident config is right.
Linux servers using CF10 with update 12 applied. Also running is fusion reactor 5.04 on all instances. Each instance has a 64GB heap, Java 7.0.15 (latest certified).
Firstly apache setup.
workers.properties
worker.list=balancer, jkstatus
worker.jkstatus.type=status
worker.balancer.type=lb
worker.balancer.balance_workers=cfusion_master,cfusion_slave2,cfusion_slave1
worker.balancer.method=R
worker.balancer.sticky_session=False
worker.balancer.ping_mode=A
worker.cfusion_master.type=ajp13
worker.cfusion_master.host=localhost
worker.cfusion_master.port=8012
worker.cfusion_master.max_reuse_connections=250
worker.cfusion_master.lbfactor=100
worker.cfusion_slave2.reference=worker.cfusion_master
worker.cfusion_slave2.port=8014
worker.cfusion_slave1.reference=worker.cfusion_master
worker.cfusion_slave1.port=8013
Now the server.xml from 2 nodes (as an example if I run a 2 node cluster)
One of the configs from a server in the cluster
<Server port="8007" shutdown="SHUTDOWN">
<Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on">
</Listener>
<Listener className="org.apache.catalina.core.JasperListener">
</Listener>
<Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener">
</Listener>
<Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener">
</Listener>
<GlobalNamingResources>
<Resource description="User database that can be updated and saved" name="UserDatabase" pathname="conf/tomcat-users.xml" factory="org.apache.catalina.users.MemoryUserDatabaseFactory" type="org.apache.catalina.UserDatabase" auth="Container">
</Resource>
</GlobalNamingResources>
<Service name="Catalina">
<Executor name="tomcatThreadPool" minSpareThreads="4" maxThreads="150" namePrefix="catalina-exec-">
</Executor>
<Connector port="8012" protocol="AJP/1.3" connectionTimeout="600000" redirectPort="8445" tomcatAuthentication="false">
</Connector>
<Engine jvmRoute="cfusion" name="Catalina" defaultHost="localhost">
<Realm className="org.apache.catalina.realm.LockOutRealm">
<Realm className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase">
</Realm>
</Realm>
<Host name="localhost" autoDeploy="false" unpackWARs="true" appBase="webapps">
<Valve pattern="%h %l %u %t "%r" %s %b" directory="logs" prefix="localhost_access_log." className="org.apache.catalina.valves.AccessLogValve" suffix=".txt" resolveHosts="false">
</Valve>
</Host>
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8">
<Manager notifyListenersOnReplication="true" expireSessionsOnShutdown="false" className="org.apache.catalina.ha.session.DeltaManager">
</Manager>
<Channel className="org.apache.catalina.tribes.group.GroupChannel">
<Membership port="45564" dropTime="3000" address="228.0.0.4" className="org.apache.catalina.tribes.membership.McastService" frequency="500">
</Membership>
<Receiver port="4001" autoBind="100" address="auto" selectorTimeout="5000" maxThreads="6" className="org.apache.catalina.tribes.transport.nio.NioReceiver">
</Receiver>
<Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
<Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender">
</Transport>
</Sender>
<Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector">
</Interceptor>
<Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor">
</Interceptor>
</Channel>
<Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter="">
</Valve>
<Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve">
</Valve>
<ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener">
</ClusterListener>
<ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener">
</ClusterListener>
</Cluster>
</Engine>
<Connector port="8499" protocol="org.apache.coyote.http11.Http11NioProtocol" connectionTimeout="20000" redirectPort="8443" executor="tomcatThreadPool">
</Connector>
</Service>
</Server>
Config from one of the other nodes
<Server port="8008" shutdown="SHUTDOWN">
<Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on">
</Listener>
<Listener className="org.apache.catalina.core.JasperListener">
</Listener>
<Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener">
</Listener>
<Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener">
</Listener>
<GlobalNamingResources>
<Resource description="User database that can be updated and saved" name="UserDatabase" pathname="conf/tomcat-users.xml" factory="org.apache.catalina.users.MemoryUserDatabaseFactory" type="org.apache.catalina.UserDatabase" auth="Container">
</Resource>
</GlobalNamingResources>
<Service name="Catalina">
<Executor name="tomcatThreadPool" minSpareThreads="4" maxThreads="150" namePrefix="catalina-exec-">
</Executor>
<Connector port="8013" protocol="AJP/1.3" connectionTimeout="600000" redirectPort="8446" tomcatAuthentication="false">
</Connector>
<Engine jvmRoute="cfusion" name="Catalina" defaultHost="localhost">
<Realm className="org.apache.catalina.realm.LockOutRealm">
<Realm className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase">
</Realm>
</Realm>
<Host name="localhost" autoDeploy="false" unpackWARs="true" appBase="webapps">
<Valve pattern="%h %l %u %t "%r" %s %b" directory="logs" prefix="localhost_access_log." className="org.apache.catalina.valves.AccessLogValve" suffix=".txt" resolveHosts="false">
</Valve>
</Host>
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8">
<Manager notifyListenersOnReplication="true" expireSessionsOnShutdown="false" className="org.apache.catalina.ha.session.DeltaManager">
</Manager>
<Channel className="org.apache.catalina.tribes.group.GroupChannel">
<Membership port="45564" dropTime="3000" address="228.0.0.4" className="org.apache.catalina.tribes.membership.McastService" frequency="500">
</Membership>
<Receiver port="4002" autoBind="100" address="auto" selectorTimeout="5000" maxThreads="6" className="org.apache.catalina.tribes.transport.nio.NioReceiver">
</Receiver>
<Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">
<Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender">
</Transport>
</Sender>
<Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector">
</Interceptor>
<Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor">
</Interceptor>
</Channel>
<Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter="">
</Valve>
<Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve">
</Valve>
<ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener">
</ClusterListener>
<ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener">
</ClusterListener>
</Cluster>
</Engine>
<Connector port="8500" protocol="org.apache.coyote.http11.Http11NioProtocol" connectionTimeout="20000" redirectPort="8443" executor="tomcatThreadPool">
</Connector>
</Service>
</Server>
So what do i see in the logs?. Well sometimes I see exceptions like this
Mar 05, 2014 9:55:19 PM org.apache.catalina.ha.session.DeltaManager messageReceived
SEVERE: Manager [localhost#/]: Unable to receive message through TCP channel
java.lang.IllegalStateException: removeAttribute: Session already invalidated
at org.apache.catalina.ha.session.DeltaSession.removeAttribute(DeltaSession.java:617)
at org.apache.catalina.ha.session.DeltaRequest.execute(DeltaRequest.java:171)
at org.apache.catalina.ha.session.DeltaManager.handleSESSION_DELTA(DeltaManager.java:1347)
at org.apache.catalina.ha.session.DeltaManager.messageReceived(DeltaManager.java:1293)
at org.apache.catalina.ha.session.DeltaManager.messageDataReceived(DeltaManager.java:1014)
at org.apache.catalina.ha.session.ClusterSessionListener.messageReceived(ClusterSessionListe ner.java:92)
at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:897)
at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:878)
at org.apache.catalina.tribes.group.GroupChannel.messageReceived(GroupChannel.java:278)
at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelIntercepto rBase.java:84)
at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.messageReceived(TcpFailu reDetector.java:113)
at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelIntercepto rBase.java:84)
at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelIntercepto rBase.java:84)
at org.apache.catalina.tribes.group.ChannelCoordinator.messageReceived(ChannelCoordinator.ja va:253)
at org.apache.catalina.tribes.transport.ReceiverBase.messageDataReceived(ReceiverBase.java:2 87)
at org.apache.catalina.tribes.transport.nio.NioReplicationTask.drainChannel(NioReplicationTa sk.java:212)
at org.apache.catalina.tribes.transport.nio.NioReplicationTask.run(NioReplicationTask.java:1 01)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
I'm unsure why this happens as tribes uses certified mesaging so it should have resent right?, in any case I believe I can change it so messages are not sent asynchronously, should sort this out.
I see (good) messages like this
Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager startInternal
INFO: Register manager localhost#/ to cluster element Engine with name Catalina
Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager startInternal
INFO: Starting clustering manager at localhost#/
Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager getAllClusterSessions
INFO: Manager [localhost#/], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=68824148, securePort=-1, UDP Port=-1, id={123 126 89 39 96 -59 69 8 -113 79 51 122 25 108 -11 -110 }, payload={}, command={}, domain={}, ]. This operation will timeout if no session state has been received within 60 seconds.
Mar 05, 2014 9:42:20 PM org.apache.catalina.ha.session.DeltaManager waitForSendAllSessions
INFO: Manager [localhost#/]; session state send at 3/5/14 9:42 PM received in 929 ms.
Mar 05, 2014 9:42:20 PM org.apache.catalina.ha.session.JvmRouteBinderValve startInternal
INFO: JvmRouteBinderValve started
So session state dies appear to be flying around the cluster, I do nightly restarts of some of the nodes due to another issue I have with an ever growing heap (separate issue), interestingly I also see nodes leave and join the cluster, again this is good (shows the multicast is working, and also that replication should be working).
Mar 05, 2014 2:30:16 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared
INFO: Verification complete. Member disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=18629101, securePort=-1, UDP Port=-1, id={-2 65 10 -79 53 -75 76 52 -99 63 -90 -120 34 -89 -14 100 }, payload={}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={}, ]]
Mar 05, 2014 2:30:16 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberDisappeared
INFO: Received member disappeared:org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=18629101, securePort=-1, UDP Port=-1, id={-2 65 10 -79 53 -75 76 52 -99 63 -90 -120 34 -89 -14 100 }, payload={}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={}, ]
Mar 05, 2014 2:35:16 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded
INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=1083, securePort=-1, UDP Port=-1, id={123 126 89 39 96 -59 69 8 -113 79 51 122 25 108 -11 -110 }, payload={}, command={}, domain={}, ]
So stuck now on how to proceed, to establish why at random times the replication fails, leading to cluster collapse. Could it be the size of the session?, I have a few CFCs stuffed into session scope, but perhaps when the load is high there is too many?. Things fail even with a cluster of 2 on one server, initially I had a 8 node cluster on 2 separate machines but when it failed it rolled it back to a cluster of 2 instances on the one server to see if that was stable (its not 100% which is what I need).
Any advice, points gratefully received.