Unstable session replication in a HA cluster (CF10)

Hi,

We have tried to create an HA cluster with requests distributed round-robin to N instances of ColdFusion. We are NOT using sticky sessions, as we are replicating session state to all CF instances. What we are seeing is that all is fine under low to moderate load; however, under heavy load, and at random times, the replication fails and things in session scope stop working. This manifests as users not being able to log in to our application (we store a token in session scope to record logged-in status).

Again, the key point: under low to moderate load it all works fine. Users are directed to random nodes in the cluster and their session is picked up fine because the session is distributed to all nodes, so I'm pretty confident the config is right.

Linux servers running CF10 with Update 12 applied. FusionReactor 5.04 is also running on all instances. Each instance has a 64GB heap and Java 7.0.15 (latest certified).

Firstly, the Apache setup.

workers.properties

worker.list=balancer, jkstatus

worker.jkstatus.type=status

worker.balancer.type=lb

worker.balancer.balance_workers=cfusion_master,cfusion_slave2,cfusion_slave1

worker.balancer.method=R

worker.balancer.sticky_session=False

worker.balancer.ping_mode=A

worker.cfusion_master.type=ajp13

worker.cfusion_master.host=localhost

worker.cfusion_master.port=8012

worker.cfusion_master.max_reuse_connections=250

worker.cfusion_master.lbfactor=100

worker.cfusion_slave2.reference=worker.cfusion_master

worker.cfusion_slave2.port=8014

worker.cfusion_slave1.reference=worker.cfusion_master

worker.cfusion_slave1.port=8013

Now the server.xml from two nodes (as an example, if I run a 2-node cluster).

One of the configs from a server in the cluster

</Listener>

</Listener>

</Listener>

</Listener>

</Resource>

</GlobalNamingResources>

</Executor>

</Connector>

</Realm>

</Valve>

</Host>

</Manager>

</Membership>

</Receiver>

</Transport>

</Sender>

</Interceptor>

</Interceptor>

</Channel>

</Valve>

</Valve>

</ClusterListener>

</ClusterListener>

</Cluster>

</Engine>

</Connector>

</Service>

</Server>

Config from one of the other nodes

</Listener>

</Listener>

</Listener>

</Listener>

</Resource>

</GlobalNamingResources>

</Executor>

</Connector>

</Realm>

</Valve>

</Host>

</Manager>

</Membership>

</Receiver>

</Transport>

</Sender>

</Interceptor>

</Interceptor>

</Channel>

</Valve>

</Valve>

</ClusterListener>

</ClusterListener>

</Cluster>

</Engine>

</Connector>

</Service>

</Server>

So what do I see in the logs? Well, sometimes I see exceptions like this:

Mar 05, 2014 9:55:19 PM org.apache.catalina.ha.session.DeltaManager messageReceived

SEVERE: Manager [localhost#/]: Unable to receive message through TCP channel

java.lang.IllegalStateException: removeAttribute: Session already invalidated

at org.apache.catalina.ha.session.DeltaSession.removeAttribute(DeltaSession.java:617)

at org.apache.catalina.ha.session.DeltaRequest.execute(DeltaRequest.java:171)

at org.apache.catalina.ha.session.DeltaManager.handleSESSION_DELTA(DeltaManager.java:1347)

at org.apache.catalina.ha.session.DeltaManager.messageReceived(DeltaManager.java:1293)

at org.apache.catalina.ha.session.DeltaManager.messageDataReceived(DeltaManager.java:1014)

at org.apache.catalina.ha.session.ClusterSessionListener.messageReceived(ClusterSessionListener.java:92)

at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:897)

at org.apache.catalina.ha.tcp.SimpleTcpCluster.messageReceived(SimpleTcpCluster.java:878)

at org.apache.catalina.tribes.group.GroupChannel.messageReceived(GroupChannel.java:278)

at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:84)

at org.apache.catalina.tribes.group.interceptors.TcpFailureDetector.messageReceived(TcpFailureDetector.java:113)

at org.apache.catalina.tribes.group.ChannelInterceptorBase.messageReceived(ChannelInterceptorBase.java:84)

at org.apache.catalina.tribes.group.ChannelCoordinator.messageReceived(ChannelCoordinator.java:253)

at org.apache.catalina.tribes.transport.ReceiverBase.messageDataReceived(ReceiverBase.java:287)

at org.apache.catalina.tribes.transport.nio.NioReplicationTask.drainChannel(NioReplicationTask.java:212)

at org.apache.catalina.tribes.transport.nio.NioReplicationTask.run(NioReplicationTask.java:101)

at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:722)

I'm unsure why this happens, as Tribes uses certified messaging, so it should have resent, right? In any case, I believe I can change it so messages are not sent asynchronously, which should sort this out.
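
For reference, a minimal sketch of how the Tribes send options combine into a single number. The flag values mirror the constants in org.apache.catalina.tribes.Channel; in stock Tomcat 7 the combined value goes in the channelSendOptions attribute of the Cluster element in server.xml, and I am assuming CF10's bundled Tomcat behaves the same way. The class and method names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of Tomcat: decodes a channelSendOptions value.
class SendOptionsSketch {
    // Values copied from org.apache.catalina.tribes.Channel.
    static final int USE_ACK = 0x0002;          // ack once the receiver has the message
    static final int SYNCHRONIZED_ACK = 0x0004; // ack once the receiver has processed it
    static final int ASYNCHRONOUS = 0x0008;     // queue the message and return at once

    // Lists the names of the flags set in the given options value.
    static List<String> describe(int options) {
        List<String> flags = new ArrayList<>();
        if ((options & USE_ACK) != 0) flags.add("USE_ACK");
        if ((options & SYNCHRONIZED_ACK) != 0) flags.add("SYNCHRONIZED_ACK");
        if ((options & ASYNCHRONOUS) != 0) flags.add("ASYNCHRONOUS");
        return flags;
    }

    public static void main(String[] args) {
        int syncWithAck = USE_ACK | SYNCHRONIZED_ACK;
        System.out.println(ASYNCHRONOUS + " -> " + describe(ASYNCHRONOUS));
        System.out.println(syncWithAck + " -> " + describe(syncWithAck));
    }
}
```

So an asynchronous setup would be 8, and a synchronous setup that waits for the receiver to process each message would be 6. Whether switching actually cures the exception above is untested.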

I also see (good) messages like this:

Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager startInternal

INFO: Register manager localhost#/ to cluster element Engine with name Catalina

Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager startInternal

INFO: Starting clustering manager at localhost#/

Mar 05, 2014 9:42:19 PM org.apache.catalina.ha.session.DeltaManager getAllClusterSessions

INFO: Manager [localhost#/], requesting session state from org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=68824148, securePort=-1, UDP Port=-1, id={123 126 89 39 96 -59 69 8 -113 79 51 122 25 108 -11 -110 }, payload={}, command={}, domain={}, ]. This operation will timeout if no session state has been received within 60 seconds.

Mar 05, 2014 9:42:20 PM org.apache.catalina.ha.session.DeltaManager waitForSendAllSessions

INFO: Manager [localhost#/]; session state send at 3/5/14 9:42 PM received in 929 ms.

Mar 05, 2014 9:42:20 PM org.apache.catalina.ha.session.JvmRouteBinderValve startInternal

INFO: JvmRouteBinderValve started

So session state does appear to be flying around the cluster. I do nightly restarts of some of the nodes due to another issue I have with an ever-growing heap (separate issue). Interestingly, I also see nodes leave and join the cluster; again, this is good (it shows that multicast is working, and also that replication should be working).

Mar 05, 2014 2:30:16 AM org.apache.catalina.tribes.group.interceptors.TcpFailureDetector memberDisappeared

INFO: Verification complete. Member disappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=18629101, securePort=-1, UDP Port=-1, id={-2 65 10 -79 53 -75 76 52 -99 63 -90 -120 34 -89 -14 100 }, payload={}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={}, ]]

Mar 05, 2014 2:30:16 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberDisappeared

INFO: Received member disappeared:org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=18629101, securePort=-1, UDP Port=-1, id={-2 65 10 -79 53 -75 76 52 -99 63 -90 -120 34 -89 -14 100 }, payload={}, command={66 65 66 89 45 65 76 69 88 ...(9)}, domain={}, ]

Mar 05, 2014 2:35:16 AM org.apache.catalina.ha.tcp.SimpleTcpCluster memberAdded

INFO: Replication member added:org.apache.catalina.tribes.membership.MemberImpl[tcp://{192, 168, 128, 50}:4001,{192, 168, 128, 50},4001, alive=1083, securePort=-1, UDP Port=-1, id={123 126 89 39 96 -59 69 8 -113 79 51 122 25 108 -11 -110 }, payload={}, command={}, domain={}, ]

So I'm now stuck on how to proceed in establishing why the replication fails at random times, leading to cluster collapse. Could it be the size of the session? I have a few CFCs stuffed into session scope, but perhaps when the load is high there are too many? Things fail even with a cluster of 2 on one server. Initially I had an 8-node cluster on 2 separate machines, but when it failed I rolled it back to a cluster of 2 instances on the one server to see if that was stable (it's not 100%, which is what I need).
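
One way to test the session-size theory would be to measure how many bytes each session value takes when serialized, since Tomcat's DeltaManager replicates session attributes using Java serialization. The sketch below is only that: the class name, the fake session map and its keys are invented for illustration, and how ColdFusion actually wraps CFCs in session scope is not something I have confirmed.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Illustrative helper: reports the Java-serialized size of session values.
class SessionSizeSketch {
    // Returns the number of bytes Java serialization produces for the value.
    static int serializedSize(Serializable value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new IllegalStateException("Value could not be serialized", e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        // Stand-in for real session contents (hypothetical keys and values).
        Map<String, Serializable> fakeSession = new HashMap<>();
        fakeSession.put("loginToken", "abc123");
        fakeSession.put("bigObject", new int[10000]);
        for (Map.Entry<String, Serializable> entry : fakeSession.entrySet()) {
            System.out.println(entry.getKey() + ": "
                    + serializedSize(entry.getValue()) + " bytes");
        }
    }
}
```

If the per-session figure turns out to be large, multiplying it by the request rate under heavy load would give a rough idea of the replication traffic each node has to absorb.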

Any advice or pointers gratefully received.
