[corosync] Corosync 1.3.x/1.4.x: Random redundant ring instabilities

Jan Friesse jfriesse at redhat.com
Mon Jun 11 09:46:56 GMT 2012


Steven Dake wrote:
> On 06/07/2012 02:04 AM, Jan Friesse wrote:
>> Jerome,
>> I believe the first and second behaviors are the same as described in
>> https://bugzilla.redhat.com/show_bug.cgi?id=820821 by Andrew. I'm not
>> yet entirely sure WHY it is happening.
>>
>> The third one, flushing, is very important. Without the flush, the
>> buffer may start to overload, which causes really bad behavior (there
>> was a BZ with this problem).
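>>
>> Conceptually, the flush just drains whatever is already queued on the
>> socket before we continue; an illustrative sketch (not the actual
>> totemrrp code) would be:
>>
>> #include <sys/types.h>
>> #include <sys/socket.h>
>>
>> /* Illustrative only: drain every datagram already queued on the
>>  * socket so the kernel receive buffer cannot fill up. */
>> static void recv_flush_sketch(int fd,
>>         void (*deliver)(const void *msg, size_t len))
>> {
>>         char buf[65536];
>>         ssize_t len;
>>
>>         for (;;) {
>>                 /* MSG_DONTWAIT: return as soon as the queue is empty */
>>                 len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
>>                 if (len < 0)
>>                         break;  /* queue drained (EAGAIN) or error */
>>                 deliver(buf, (size_t)len);
>>         }
>> }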
>>
>> I would like Steve to review your patch, but it looks ok to me.
>>
>> Regards,
>>    Honza
>>
>
> I looked at the patch, and it should be fine. Unfortunately, as I was in
> the process of applying it, the email client ate the message.
>
> Honza, if you still have a copy can you merge that patch?

Pushed

>
> Consider it
>
> Reviewed-by: Steven Dake <sdake at redhat.com>
>
>> Jerome FLESCH wrote:
>>> Hello,
>>>
>>> When upgrading from Corosync 1.2.8 to Corosync 1.4.2/1.4.3, some nasty
>>> bugs appeared on our clusters. I observed the following bad behaviors:
>>> 1) A process connected to Corosync with CPG wasn't correctly informed
>>> that there were other processes connected on other processors. It also
>>> didn't get their messages.
>>> 2) A process sending messages with CPG never received copies of its
>>> own messages.
>>> 3) One ring out of two went down and back up quite often.
>>>
>>> Behaviors 1 and 2 are very hard for us to reproduce, but we are
>>> able to trigger behavior 3 quite easily.
>>>
>>> The simplest setup we found to reproduce it is the following:
>>> - 2 VirtualBox VMs, connected by 2 network interfaces (vboxnet0 and
>>> vboxnet1; one for each ring)
>>> - OS: Linux (Debian stable)
>>> - On one of the VMs, a test program sending some CPG messages (see the
>>> script "test_corosync.sh" attached to this mail; a sketch follows below)
>>>
>>> Here are the Corosync logs we get with this setup:
>>>
>>> Jun 06 16:23:40 corosync [TOTEM ] A processor joined or left the
>>> membership and a new membership was formed.
>>> Jun 06 16:23:40 corosync [CPG   ] chosen downlist: sender r(0)
>>> ip(192.168.56.104) r(1) ip(192.168.57.104) ; members(old:1 left:0)
>>> Jun 06 16:23:40 corosync [MAIN  ] Completed service synchronization,
>>> ready to provide service.
>>> Jun 06 16:24:37 corosync [TOTEM ] Marking ringid 1 interface
>>> 192.168.57.105 FAULTY
>>> Jun 06 16:24:38 corosync [TOTEM ] Automatically recovered ring 1
>>> Jun 06 16:25:33 corosync [TOTEM ] Marking ringid 1 interface
>>> 192.168.57.105 FAULTY
>>> Jun 06 16:25:34 corosync [TOTEM ] Automatically recovered ring 1
>>> Jun 06 16:26:35 corosync [TOTEM ] Marking ringid 1 interface
>>> 192.168.57.105 FAULTY
>>> Jun 06 16:26:36 corosync [TOTEM ] Automatically recovered ring 1
>>> (...)
>>>
>>> The second ring goes down about every 2 minutes and automatically
>>> comes back up right after.
>>>
>>> We spent some time looking for the commit that introduced this bug,
>>> and it appears to be due to the following one:
>>> Corosync 1.3.3 -> 1.3.4: e27a58d93d0d3795beb550f87b660c9c04f11386
>>> Corosync 1.4.1 -> 1.4.2: be608c050247e5f9c8266b8a0f9803cc0a3dc881
>>> Commit message: Ignore memb_join messages during flush operations
>>>
>>> I had a look at this commit, and it seems to me it's dropping too many
>>> packets: because of it, while totemrrp_recv_flush() is running,
>>> Corosync drops not only memb_join packets but also ORF tokens. In the
>>> end, it seems that sometimes we drop so many of them that Corosync
>>> marks the ring as faulty.
>>>
>>> To fix that, I've made the patch attached to this mail
>>> (corosync-fix-token-drop.patch).
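>>>
>>> The idea behind it, roughly (a simplified sketch, not the exact diff;
>>> the real message header in totemsrp.c carries more fields):
>>>
>>> #include <stddef.h>
>>>
>>> /* Peek at the totemsrp message type during a flush and drop only
>>>  * memb_join messages; ORF tokens must keep flowing, otherwise the
>>>  * ring is eventually marked FAULTY from token loss. */
>>> struct srp_header_sketch {
>>>         char type;                      /* message_type, as in totemsrp.c */
>>>         char encapsulated;
>>>         unsigned short endian_detector;
>>> } __attribute__((packed));
>>>
>>> enum {
>>>         MESSAGE_TYPE_ORF_TOKEN = 0,     /* values as in totemsrp.c */
>>>         MESSAGE_TYPE_MEMB_JOIN = 3
>>> };
>>>
>>> static int drop_during_flush(const void *msg, size_t msg_len)
>>> {
>>>         const struct srp_header_sketch *hdr = msg;
>>>
>>>         if (msg_len < sizeof(*hdr))
>>>                 return 1;       /* runt packet: drop */
>>>         return (hdr->type == MESSAGE_TYPE_MEMB_JOIN);
>>> }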
>>>
>>> However, I wonder why this packet dropping is done at such a low layer.
>>> Wouldn't it be more appropriate to do it in totemsrp.c?
>>> Moreover, it seems to me that totemrrp_recv_flush() is called every
>>> time Corosync gets an ORF token (in message_handler_orf_token()). That
>>> seems weird to me, because the commit message says the packets should
>>> only be dropped when we are in the gather state, to avoid switching
>>> suddenly to the recovery state.
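>>>
>>> In other words, I'd have expected the decision to live in totemsrp.c
>>> and look more like this (hypothetical sketch; the state values mirror
>>> totemsrp.c's enum memb_state):
>>>
>>> enum memb_state_sketch {
>>>         MEMB_STATE_OPERATIONAL = 1,
>>>         MEMB_STATE_GATHER      = 2,
>>>         MEMB_STATE_COMMIT      = 3,
>>>         MEMB_STATE_RECOVERY    = 4
>>> };
>>>
>>> /* Only ignore memb_join while gathering, which is the case the
>>>  * commit message actually names; never drop tokens. */
>>> static int ignore_memb_join(int state, char msg_type)
>>> {
>>>         return (state == MEMB_STATE_GATHER &&
>>>                 msg_type == 3 /* MESSAGE_TYPE_MEMB_JOIN */);
>>> }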
>>>
>>> Also, could you tell me if this packet dropping could explain the two
>>> other behaviors I observed?
>>>
>>> Thanks in advance,
>>>
>>> Regards,
>>>
>


