Channel: Adobe Community : Popular Discussions - ColdFusion Server Administration

Unstable session replication in a HA cluster (CF10)


Hi,

 

We have tried to create an HA cluster with requests distributed round robin to N instances of ColdFusion. We are NOT using sticky sessions, as we are replicating session state to all CF instances. What we are seeing is that all is fine under low to moderate load; under heavy load, however, replication fails at random times and things in session scope stop working. This manifests as users not being able to log in to our application (we store a token in session scope to record logged-in status).

 

Again, the key point: under low to moderate load it all works fine. Users are directed to random nodes in the cluster and their session is picked up correctly because it is replicated to all nodes, so I am pretty confident the config is right.

 

The servers run Linux with CF10, update 12 applied. FusionReactor 5.04 is also running on all instances. Each instance has a 64GB heap and Java 7.0.15 (latest certified).

 

First, the Apache setup.

 

workers.properties


worker.list=balancer, jkstatus

worker.jkstatus.type=status

worker.balancer.type=lb

worker.balancer.balance_workers=cfusion_master,cfusion_slave2,cfusion_slave1

worker.balancer.method=R

worker.balancer.sticky_session=False

worker.balancer.ping_mode=A

worker.cfusion_master.type=ajp13

worker.cfusion_master.host=localhost

worker.cfusion_master.port=8012

worker.cfusion_master.max_reuse_connections=250

worker.cfusion_master.lbfactor=100

worker.cfusion_slave2.reference=worker.cfusion_master

worker.cfusion_slave2.port=8014

worker.cfusion_slave1.reference=worker.cfusion_master

worker.cfusion_slave1.port=8013

 

 

Now the server.xml from 2 nodes (as an example of running a 2-node cluster).

 

One of the configs from a server in the cluster

 

<Server port="8007" shutdown="SHUTDOWN">

  <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on">

  </Listener>

  <Listener className="org.apache.catalina.core.JasperListener">

  </Listener>

  <Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener">

  </Listener>

  <Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener">

  </Listener>

  <GlobalNamingResources>

    <Resource description="User database that can be updated and saved" name="UserDatabase" pathname="conf/tomcat-users.xml" factory="org.apache.catalina.users.MemoryUserDatabaseFactory" type="org.apache.catalina.UserDatabase" auth="Container">

    </Resource>

  </GlobalNamingResources>

  <Service name="Catalina">

    <Executor name="tomcatThreadPool" minSpareThreads="4" maxThreads="150" namePrefix="catalina-exec-">

    </Executor>

    <Connector port="8012" protocol="AJP/1.3" connectionTimeout="600000" redirectPort="8445" tomcatAuthentication="false">

    </Connector>

    <Engine jvmRoute="cfusion" name="Catalina" defaultHost="localhost">

      <Realm className="org.apache.catalina.realm.LockOutRealm">

        <Realm className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase">

        </Realm>

      </Realm>

      <Host name="localhost" autoDeploy="false" unpackWARs="true" appBase="webapps">

        <Valve pattern="%h %l %u %t &quot;%r&quot; %s %b" directory="logs" prefix="localhost_access_log." className="org.apache.catalina.valves.AccessLogValve" suffix=".txt" resolveHosts="false">

        </Valve>

      </Host>

      <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8">

        <Manager notifyListenersOnReplication="true" expireSessionsOnShutdown="false" className="org.apache.catalina.ha.session.DeltaManager">

        </Manager>

        <Channel className="org.apache.catalina.tribes.group.GroupChannel">

          <Membership port="45564" dropTime="3000" address="228.0.0.4" className="org.apache.catalina.tribes.membership.McastService" frequency="500">

          </Membership>

          <Receiver port="4001" autoBind="100" address="auto" selectorTimeout="5000" maxThreads="6" className="org.apache.catalina.tribes.transport.nio.NioReceiver">

          </Receiver>

          <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">

            <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender">

            </Transport>

          </Sender>

          <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector">

          </Interceptor>

          <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor">

          </Interceptor>

        </Channel>

        <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter="">

        </Valve>

        <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve">

        </Valve>

        <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener">

        </ClusterListener>

        <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener">

        </ClusterListener>

      </Cluster>

    </Engine>

    <Connector port="8499" protocol="org.apache.coyote.http11.Http11NioProtocol" connectionTimeout="20000" redirectPort="8443" executor="tomcatThreadPool">

    </Connector>

  </Service>

</Server>

 

Config from one of the other nodes

 

<Server port="8008" shutdown="SHUTDOWN">

  <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on">

  </Listener>

  <Listener className="org.apache.catalina.core.JasperListener">

  </Listener>

  <Listener className="org.apache.catalina.core.JreMemoryLeakPreventionListener">

  </Listener>

  <Listener className="org.apache.catalina.mbeans.GlobalResourcesLifecycleListener">

  </Listener>

  <GlobalNamingResources>

    <Resource description="User database that can be updated and saved" name="UserDatabase" pathname="conf/tomcat-users.xml" factory="org.apache.catalina.users.MemoryUserDatabaseFactory" type="org.apache.catalina.UserDatabase" auth="Container">

    </Resource>

  </GlobalNamingResources>

  <Service name="Catalina">

    <Executor name="tomcatThreadPool" minSpareThreads="4" maxThreads="150" namePrefix="catalina-exec-">

    </Executor>

    <Connector port="8013" protocol="AJP/1.3" connectionTimeout="600000" redirectPort="8446" tomcatAuthentication="false">

    </Connector>

    <Engine jvmRoute="cfusion" name="Catalina" defaultHost="localhost">

      <Realm className="org.apache.catalina.realm.LockOutRealm">

        <Realm className="org.apache.catalina.realm.UserDatabaseRealm" resourceName="UserDatabase">

        </Realm>

      </Realm>

      <Host name="localhost" autoDeploy="false" unpackWARs="true" appBase="webapps">

        <Valve pattern="%h %l %u %t &quot;%r&quot; %s %b" directory="logs" prefix="localhost_access_log." className="org.apache.catalina.valves.AccessLogValve" suffix=".txt" resolveHosts="false">

        </Valve>

      </Host>

      <Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="8">

        <Manager notifyListenersOnReplication="true" expireSessionsOnShutdown="false" className="org.apache.catalina.ha.session.DeltaManager">

        </Manager>

        <Channel className="org.apache.catalina.tribes.group.GroupChannel">

          <Membership port="45564" dropTime="3000" address="228.0.0.4" className="org.apache.catalina.tribes.membership.McastService" frequency="500">

          </Membership>

          <Receiver port="4002" autoBind="100" address="auto" selectorTimeout="5000" maxThreads="6" className="org.apache.catalina.tribes.transport.nio.NioReceiver">

          </Receiver>

          <Sender className="org.apache.catalina.tribes.transport.ReplicationTransmitter">

            <Transport className="org.apache.catalina.tribes.transport.nio.PooledParallelSender">

            </Transport>

          </Sender>

          <Interceptor className="org.apache.catalina.tribes.group.interceptors.TcpFailureDetector">

          </Interceptor>

          <Interceptor className="org.apache.catalina.tribes.group.interceptors.MessageDispatch15Interceptor">

          </Interceptor>

        </Channel>

        <Valve className="org.apache.catalina.ha.tcp.ReplicationValve" filter="">

        </Valve>

        <Valve className="org.apache.catalina.ha.session.JvmRouteBinderValve">

        </Valve>

        <ClusterListener className="org.apache.catalina.ha.session.JvmRouteSessionIDBinderListener">

        </ClusterListener>

        <ClusterListener className="org.apache.catalina.ha.session.ClusterSessionListener">

        </ClusterListener>

      </Cluster>

    </Engine>

    <Connector port="8500" protocol="org.apache.coyote.http11.Http11NioProtocol" connectionTimeout="20000" redirectPort="8443" executor="tomcatThreadPool">

    </Connector>

  </Service>

</Server>

 

So what do I see in the logs? Well, sometimes I see exceptions like this:

 

Mar 05, 2014 9:55:19 PM org.apache.catalina.ha.session.DeltaManager messageReceived

SEVERE: Manager [localhost#/]: Unable to receive message through TCP channel

java.lang.IllegalStateException: removeAttribute: Session already invalidated

          at org.apache.catalina.ha.session.DeltaSession.removeAttribute(DeltaSession.java:617)

          at org.apache.catalina.ha.session.DeltaRequest.execute(DeltaRequest.java:171)

          at org.apache.catalina.ha.session.DeltaManager.handleSESSION_DELTA(DeltaManager.java:1347)

          at org.apache.catalina.ha.session.DeltaManager.messageReceived(DeltaManager.java:1293)

          at org.apache.catalina.ha.session.DeltaManager.messageDataReceived(DeltaManager.java:1014)

          at org.apache.catalina.ha.session.ClusterSessionListener.messageReceived(ClusterSessionListener.java:92)

          at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:897)

          at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:878)

          at org.apache.catalina.tribes.group.GroupChannel.messageReceived(GroupChannel.java:278)

          at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:84)

          at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.messageReceived(TcpFailureDetector.java:113)

          at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:84)

          at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:84)

          at org.apache.catalina.tribes.group.ChannelCoordinator.messageReceived(ChannelCoordinator.java:253)

          at org.apache.catalina.tribes.transport.ReceiverBase.messageDataReceived(ReceiverBase.java:287)

          at org.apache.catalina.tribes.transport.nio.NioReplicationTask.drainChannel(NioReplicationTask.java:212)

          at org.apache.catalina.tribes.transport.nio.NioReplicationTask.run(NioReplicationTask.java:101)

          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

          at java.lang.Thread.run(Thread.java:722)

 

I'm unsure why this happens: Tribes uses certified messaging, so it should have resent the message, right? In any case, I believe I can change the config so that messages are not sent asynchronously, which should sort this out.
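
For reference, here is a rough sketch of what that change might look like. It assumes the Tribes send-option flags as documented for Tomcat 7 (2 = use ack, 4 = synchronized ack, 8 = asynchronous), so 6 would mean synchronous sending with acknowledgement instead of the 8 in the configs above. I have not tested this value on these servers, and I do not know whether it cures the exception.

```xml
<!-- Untested sketch: channelSendOptions changed from 8 (asynchronous) -->
<!-- to 6 (use ack + synchronized ack), per the Tomcat 7 flag values. -->
<Cluster className="org.apache.catalina.ha.tcp.SimpleTcpCluster" channelSendOptions="6">
  <!-- Manager, Channel, Valve and ClusterListener children unchanged from the configs above -->
</Cluster>
```

The trade-off, as I understand it, is that each request would then wait for the other nodes to acknowledge the session delta, so response times under load may get worse.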

 

I see (good) messages like this

 

Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager startInternal

INFO: Register manager localhost#/ to cluster element Engine with name Catalina

Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager startInternal

INFO: Starting clustering manager at localhost#/

Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager getAllClusterSessions

INFO: Manager [localhost#/], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=68824148, securePort=-1, UDP Port=-1, id={123 126 89 39 96 -59 69 8 -113 79 51 122 25 108 -11 -110 }, payload={}, command={}, domain={}, ]. This operation will timeout if no session state has been received within 60 seconds.

Mar 05, 2014 9:42:20 PM org.apache.catalina.ha.session.DeltaManager waitForSendAllSessions

INFO: Manager [localhost#/]; session state send at 3/5/14 9:42 PM received in 929 ms.

Mar 05, 2014 9:42:20 PM org.apache.catalina.ha.session.JvmRouteBinderValve startInternal

INFO: JvmRouteBinderValve started

 

So session state does appear to be flying around the cluster. I do nightly restarts of some of the nodes because of another issue with an ever-growing heap (a separate issue). Interestingly, I also see nodes leave and join the cluster, which again is good: it shows that multicast is working, and that replication should be working too.

 

Mar 05, 2014 2:30:16 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared

INFO: Verification complete. Member disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=18629101, securePort=-1, UDP Port=-1, id={-2 65 10 -79 53 -75 76 52 -99 63 -90 -120 34 -89 -14 100 }, payload={}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={}, ]]

Mar 05, 2014 2:30:16 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberDisappeared

INFO: Received member disappeared:org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=18629101, securePort=-1, UDP Port=-1, id={-2 65 10 -79 53 -75 76 52 -99 63 -90 -120 34 -89 -14 100 }, payload={}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={}, ]

Mar 05, 2014 2:35:16 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded

INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=1083, securePort=-1, UDP Port=-1, id={123 126 89 39 96 -59 69 8 -113 79 51 122 25 108 -11 -110 }, payload={}, command={}, domain={}, ]

 

So I am now stuck on how to proceed and establish why replication fails at random times, leading to cluster collapse. Could it be the size of the session? I have a few CFCs stuffed into session scope; perhaps under high load there are too many? Things fail even with a cluster of 2 on one server. Initially I had an 8-node cluster on 2 separate machines, but when it failed I rolled it back to a cluster of 2 instances on one server to see whether that was stable (it's not 100% stable, which is what I need).
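
One knob I have not touched yet is the empty filter on the ReplicationValve. If I read the Tomcat 7 documentation correctly, the filter is a single java.util.regex pattern, and requests that match it are assumed not to have changed the session, which could reduce replication work under load. A sketch of what I might try follows. The extension list is purely illustrative, and it may make no difference if Apache already serves these files without passing them to the CF instances.

```xml
<!-- Untested sketch: requests matching the filter are treated as not modifying the session. -->
<!-- The extension list is illustrative and not taken from our application. -->
<Valve className="org.apache.catalina.ha.tcp.ReplicationValve"
       filter=".*\.gif|.*\.js|.*\.jpeg|.*\.jpg|.*\.png|.*\.css|.*\.txt">
</Valve>
```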

 

Any advice or pointers gratefully received.

