[corosync] One reason corosync runs into the FAILED TO RECEIVE state
sdake at redhat.com
Tue Nov 29 14:44:35 GMT 2011
I'll get a patch in the tree shortly.
On 11/29/2011 07:27 AM, Yunkai Zhang wrote:
> Hi Steven Dake:
> Today I observed one reason why corosync runs into the
> FAILED TO RECEIVE state.
> There were five nodes (A, B, C, D, E) in my test, and I limited the rate
> of incoming UDP traffic on node C with these iptables commands:
> iptables -A INPUT -i eth0 -p udp -m limit --limit 10000/s --limit-burst 1 -j ACCEPT
> iptables -A INPUT -i eth0 -p udp -j DROP
> About an hour later, node C had missed some MCAST messages.
> Its state was as follows:
> ==state of C node==
> => received MCAST message with seq:806 from node B
> => enter *message_handler_mcast*
> => add this message to regular_sort_queue
> => enter *update_aru* function
> => range = (my_high_seq_received - my_aru)
>          = (0xC2C - 0x805)
>          = 1063
> => if range > 1024, do nothing and return directly.
> By this logic, once (my_high_seq_received - my_aru) > 1024, my_aru
> is no longer updated, even though corosync can still receive MCAST
> messages retransmitted by other nodes.
> But at that time, my_aru_count was only 7. So corosync on node C
> would stay in this state until my_aru_count reached
> fail_to_recv_const (the default value is 2500). That is a long time
> for corosync, and it is wasted.
> To solve this issue, maybe we can enlarge the range condition in
> the update_aru function? Or we could just drop the range check;
> that seems harmless, because we already use fail_to_recv_const to
> handle this case.
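
For readers following the quoted trace, the stall can be reproduced with a
small stand-alone model. This is only a sketch under stated assumptions, not
the corosync source: the struct, the WINDOW size, the check_range switch and
the name model_update_aru are all hypothetical, and the only values taken
from the message are the 1024 threshold and the 0x805 / 0xC2C sequence
numbers. It assumes the retransmitted message fills the slot right after
my_aru.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW 4096u        /* hypothetical size of the toy receive window */
#define RANGE_LIMIT 1024u   /* threshold quoted in the message */

/* Hypothetical, simplified stand-in for the per-node receive state. */
struct node_state {
    uint32_t my_aru;                /* highest seq received with no gap below it */
    uint32_t my_high_seq_received;  /* highest seq seen so far */
    bool received[WINDOW];          /* received[seq % WINDOW]: message is stored */
};

/*
 * Advance my_aru over every consecutively received message and return how
 * far it moved. With check_range set, return early when the gap exceeds
 * RANGE_LIMIT, which is the behaviour described in the quoted trace.
 * The toy window only works while the gap stays below WINDOW.
 */
static uint32_t model_update_aru(struct node_state *s, bool check_range)
{
    uint32_t start = s->my_aru;
    uint32_t range = s->my_high_seq_received - start;

    if (check_range && range > RANGE_LIMIT) {
        return 0;   /* gap too large: my_aru is left untouched */
    }
    for (uint32_t i = 1; i <= range; i++) {
        if (!s->received[(start + i) % WINDOW]) {
            break;  /* first missing message stops the advance */
        }
        s->my_aru = start + i;
    }
    return s->my_aru - start;
}
```

With my_aru = 0x805 and my_high_seq_received = 0xC2C the gap is 1063, so the
checked variant leaves my_aru alone even after the next message has arrived,
while the unchecked variant advances it by one.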