[corosync] issue about two nodes could not be merged into one ring

jason huzhijiang at gmail.com
Wed Nov 6 14:16:18 UTC 2013


Hi All,
I recently encountered a problem where two nodes could not be merged into
one ring.
Initially, there were three nodes in a ring, say A, B and C. Then, after
killing C, I found that A and B could never be merged again (I waited at
least 4 hours), unless I restarted at least one of them.
By analyzing the blackbox log, I found that both A and B were stuck in a
dead loop doing the following:
1. Form a single node ring.
2. The ring is broken by a JOIN message from peer.
3. Try to form a two-node ring but consensus timeout.
4. Go to 1.

I checked the network using omping and it was OK. Besides, I used the
default corosync.conf.example, and the corosync version is 1.4.6.

To dig deeper, I used tcpdump to inspect the messages exchanged between
the two nodes, and found the following strange things:
1. Every 50ms (I think this is the join timeout):
    Node A sends join message with proclist:A,B,C. faillist:B.
    Node B sends join message with proclist:A,B,C. faillist:A.

2. Every 1250ms (the consensus timeout):
    Node A sends join message with proclist:A,B,C. faillist:B,C.
    Node B sends join message with proclist:A,B,C. faillist:A,C.

This should be because A and B each treated the other as failed, so the
two-node ring could never form and the single-node rings were always
broken by join messages.

I am not sure of the original reason why A and B each put the other in the
fail list of their join messages. From analyzing the code, the most likely
cause is a network partition. So I made the following assumption about
what happened:

1. Initially, ring(A,B,C).
2. The network between A and B partitions and, at the same time, C goes down.
3. Node A sends a join message with proclist:A,B,C. faillist:NULL. Node B
sends a join message with proclist:A,B,C. faillist:NULL.
4. Both A and B hit the consensus timeout due to the network partition.
5. The network between A and B remerges.
6. Node A sends a join message with proclist:A,B,C. faillist:B,C and
creates ring(A). Node B sends a join message with proclist:A,B,C.
faillist:A,C and creates ring(B).
7. The join message with proclist:A,B,C. faillist:A,C sent by node B is
received by node A, because the network has remerged.
8. Node A shifts to the gather state and sends out a modified join message
with proclist:A,B,C. faillist:B. Such join messages prevent both A and B
from merging.
9. Node A hits the consensus timeout (caused by waiting for node C) and
sends a join message with proclist:A,B,C. faillist:B,C again.

The same thing happens on node B, so A and B will dead-loop forever
through steps 7, 8 and 9.

If my assumption and analysis are right, then I think it is step 8 that
does the wrong thing. The paper I found at
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.4028&rep=rep1&type=pdf
says: “if a processor receives a join message in the operational state and
if the receiver’s identifier is in the join message’s fail list, … then it
ignores the join message.”

So I created a patch that applies the above rule, to try to solve the problem:

--- ./corosync-1.4.6-orig/exec/totemsrp.c	Wed May 29 14:33:27 2013 UTC
+++ ./corosync-1.4.6/exec/totemsrp.c	Wed Nov 6 13:12:30 2013 UTC
@@ -4274,6 +4274,36 @@
 	srp_addr_copy_endian_convert (&out->system_from, &in->system_from);
 }
 
+static int ignore_join_under_operational (
+	struct totemsrp_instance *instance,
+	const struct memb_join *memb_join)
+{
+	struct srp_addr *proc_list;
+	struct srp_addr *failed_list;
+	unsigned long long ring_seq;
+
+	proc_list = (struct srp_addr *)memb_join->end_of_memb_join;
+	failed_list = proc_list + memb_join->proc_list_entries;
+	ring_seq = memb_join->ring_seq;
+
+	if (memb_set_subset (&instance->my_id, 1,
+	    failed_list, memb_join->failed_list_entries)) {
+		return 1;
+	}
+
+	/* In operational state, my_proc_list is exactly the same as
+	 * my_memb_list. */
+
+	if ((memb_set_subset (&memb_join->system_from, 1,
+	    instance->my_memb_list,
+	    instance->my_memb_entries)) &&
+	    (ring_seq < instance->my_ring_id.seq)) {
+		return 1;
+	}
+
+	return 0;
+}
+
 static int message_handler_memb_join (
 	struct totemsrp_instance *instance,
 	const void *msg,
@@ -4304,7 +4334,9 @@
 	}
 	switch (instance->memb_state) {
 	case MEMB_STATE_OPERATIONAL:
-		memb_join_process (instance, memb_join);
+		if (0 == ignore_join_under_operational (instance, memb_join)) {
+			memb_join_process (instance, memb_join);
+		}
 		break;
 
 	case MEMB_STATE_GATHER:
Currently, I haven’t reproduced the problem in a 3-node cluster, but I have
reproduced the “a processor receives a join message in the operational
state and the receiver’s identifier is in the join message’s fail list”
circumstance in a two-node environment, using the following steps:
1. iptables -A INPUT -i eth0 -p udp ! --sport domain -j DROP
2. usleep 2126000
3. iptables -D INPUT -i eth0 -p udp ! --sport domain -j DROP

In the two-node environment there is no dead-loop issue as in the 3-node
one, because there is no consensus timeout caused by the third, dead node
(step 9). But it can still be used to prove the patch works.

Please take a look at this issue. Thanks!



-- 
Yours,
Jason

