[corosync] joined and failed list seem to be wrong after a node failure in a large ring
jfriesse at redhat.com
Wed Jun 20 12:57:59 GMT 2012
What version are you using? If it's flatiron, can you please try applying the three "flatiron cpg: *" patches? Can you also please send a log with debug information enabled?
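For reference, debug output can be turned on in the logging stanza of corosync.conf; a minimal sketch (the logfile path is an assumption, adjust to your setup):

```
logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        debug: on
        timestamp: on
}
```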
Li, Qiuping (Qiuping) wrote:
> I have created a 32 node Corosync ring and did some node failure and recovery tests. I observed the following:
> When I fail one or more nodes, corosync seems to report more node failures than actually occurred, so extra, spurious configuration-change callbacks are invoked. For example, when I reset just one node, I got the following logs, which indicate that 31 nodes left the ring:
> Jun 07 15:37:26 corosync [TOTEM ] A processor failed, forming new configuration.
> Jun 07 15:37:30 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jun 07 15:37:30 corosync [CPG ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:32 left:31) --> it says 31 nodes left at once.
> This happens more often on a busier system, and more often when several nodes are reset at the same time.
> Note that we are using the default configuration suggested by Corosync.
> Could this be a bug or a system configuration problem?
> Thanks very much for the help.
> Qiuping Li