[corosync] [PATCH] Ignore the commit_token retransmitted by upstream node on the first rotation

Yunkai Zhang qiushu.zyk at taobao.com
Mon Dec 12 09:48:06 GMT 2011


I will continue to track this issue.

It seems that this patch can't prevent corosync from crashing, but it does
ignore the retransmitted commit_token correctly.

I will update the description of this patch, and submit a patch v2 later.

On Mon, Dec 12, 2011 at 5:13 PM, Yunkai Zhang <qiushu.zyk at taobao.com> wrote:
> On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake at redhat.com> wrote:
>> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake at redhat.com> wrote:
>>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>>> After the representative of the new ring entered the RECOVERY state,
>>>>> it was waiting for the commit_token on the second rotation. But when
>>>>> it got a commit_token, it didn't check the status of this token,
>>>>> which could have been retransmitted by the upstream node on the first
>>>>> rotation, and so the representative node would send the initial ORF
>>>>> token prematurely.
>>>>>
>>>>> If other nodes which have not yet entered the RECOVERY state receive this
>>>>> incorrect initial orf_token (whose seq is 0), these nodes may crash
>>>>> on the assert instruction in the *orf_token_rtr* function:
>>>>>   ...
>>>>
>>>> Yunkai,
>>>>
>>>> First off, thanks for digging into Totem.  I know the protocol is hard to
>>>> understand and it's nice to have more people to bounce ideas off of.
>>>>
>>>> Do you have a test case for this?
>>>>
>>>> The patch is a clever solution, but the stated root cause seems incorrect.
>>>> I'd like to get to the RCA of your bug.  Just thinking about some of the problems
>>>> you have reported in the past (13 seconds to execute a recovery copy per
>>>> node), it could be possible a retransmission of the commit token from 5
>>>> to 1 (kicking off a new orf token) in the below example could occur
>>>> while in recovery.  That should be ok though, since retransmitted orf
>>>> tokens are discarded (totemsrp:3564).  I don't see any immediate way to
>>>> have a premature token transmission.
>>>>
>>>> Let's examine 5 processors:
>>>>
>>>> When in gather, a membership is built.  Once this membership is agreed
>>>> upon, a commit token is originated by the ring rep creating the first
>>>> transmission around the ring.
>>>>
>>>> totemsrp.c:4044
>>>>
>>>> Example with 5 processors:
>>>> 1,2,3,4,5 in gather
>>>> 1 originates commit token
>>>>
>>>>
>>>> Once a commit token is received in the gather state, the processor
>>>> enters commit (first token rotation)
>>>>
>>>> totemsrp.c:4348
>>>>
>>>> Example:
>>>> 1 goes to commit
>>>> 2 goes to commit
>>>> 3 goes to commit
>>>> 4 goes to commit
>>>> 5 goes to commit
>>>>
>>>> Once a commit token is received in the commit state, the processor
>>>> enters the recovery state only if this is a new commit token
>>>>
>>>> totemsrp.c:4360
>>>>
>>>> 1 goes to recovery
>>>> 2 goes to recovery
>>>> 3 goes to recovery
>>>> 4 goes to recovery
>>>> 5 goes to recovery
>>>> 1 originates orf token (this is the start of the third rotation)
>>>>
>>>> totemsrp.c:4374
>>>>
>>>>
>>>> In the above failure-free scenario, we see that we will never get a
>>>> premature orf token.
>>>>
>>>> Perhaps the problem you're running into is that you're mixing
>>>> partitioning/merging while forming a new ring.  This could result in
>>>> my_id.addr[0] and the ring rep matching, but *old* commit tokens not
>>>> being discarded by the ring rep in the recovery state.
>>>>
>>>> I believe the code would be better served by something like:
>>>>
>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>    memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>>> ...
>>>> }
>>>>
>>>> Note it may be safer to compare the entire ring_id rather than only the seq.
>>>>
>>>> This will cause any commit token not from the current agreed processors
>>>> to be rejected.  The ring seq increases by 4 for each new ring and it is
>>>> not possible to have a matching seq + rep in two separate rings (this is
>>>> why we store the ring seq to stable storage).
>>>>
>>>> Could you test that or tell me your test case or both?
>>>>
>>>
>>> Let me discuss my test case based on your example mentioned above:
>>>
>>> 1,2,3,4,5 in gather
>>>
>>> 1 originates commit token
>>>   => enter commit state
>>>     => send commit_token(token_seq:0) to 2
>>>       => receive commit_token(token_seq:4) from 5
>>>         => enter RECOVERY state
>>>           => receive commit_token(token_seq:4) retransmitted by 5
>>>             => enter OPERATIONAL state *prematurely*!!
>>>               => send initial ORF token to 2 *prematurely*!!
>>>
>>> 2 receive commit_token(token_seq:0) from 1
>>>   =>enter commit state
>>>     => send commit_token(token_seq:1) to 3
>>>       => commit token retransmit timeout occurs!!
>>>         => retransmit commit_token(token_seq:1) ...
>>>         => retransmit commit_token(token_seq:1) ...
>>>         => ...
>>>           => receive JoinMSG from 3
>>>             => enter GATHER state, ...
>>>               => receive initial ORF token from 1
>>>                 => crash in *orf_token_rtr* function!!
>
> Today, I checked my logging file carefully again; the description
> above was not entirely accurate.  Here is the corrected sequence:
>
> 2 receive commit_token(token_seq:0) from 1
>   => enter commit state
>     => send commit_token(token_seq:1) to 3
>       => receive commit_token(token_seq:6) from 1
>         => enter RECOVERY state, **set orf_token_discard=0**
>           => commit token retransmit timeout occurs!!
>             => retransmit commit_token(token_seq:1) ...
>             => retransmit commit_token(token_seq:1) ...
>             => ...
>               => receive JoinMSG from 3 in RECOVERY state
>                 => call memb_recovery_state_token_loss
>                   => old_ring_state_restore, set instance->my_aru=39
>                   => memb_state_gather_enter (instance, 5), **set orf_token_discard=1**
>                     => enter GATHER state
>                 => call memb_state_gather_enter (instance, 14), **set orf_token_discard=1**
>                   => enter GATHER state
>                     => receive initial ORF token from 1
>                       => crash in *orf_token_rtr* function!!
>
> I can confirm that the value of instance->memb_state was
> MEMB_STATE_GATHER(=2) when 2 received the initial ORF token.  I used the
> following code for logging in the *message_handler_orf_token* function:
>
> =code=
> log_printf (LOG_DEBUG, "recv orf_token from:%s, token_seq:%d, my_token_seq:%d, aru:%x, my_aru:%x, seq:%x, state:%d",
>         nodeid_to_ip(token->header.nodeid, buf),
>         token->token_seq, instance->my_token_seq,
>         token->aru, instance->my_aru,
>         token->seq, instance->memb_state);
> =log output=
> Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff, my_aru:39, seq:0, state:2
>
> It was the last log record before corosync exited.
>
> But I can't believe that corosync did not discard the ORF token when
> instance->memb_state equaled MEMB_STATE_GATHER (as soon as corosync
> enters the GATHER state, orf_token_discard should be set to 1).
>
> A reasonable guess is that the value of orf_token_discard was cleared
> for some unknown reason -- maybe there is memory corruption.
>
>>>
>>> 3 receive commit_token(token_seq:1) from 2
>>>   => enter commit state
>>>     => send commit_token(token_seq:2) to 4
>>>       => commit token retransmit timeout occurs!!
>>>         => retransmit commit_token(token_seq:2) ...
>>>         => retransmit commit_token(token_seq:2) ...
>>>         => ...
>>>           => commit token lost timeout occur!!
>>>             => enter GATHER state, multicast JoinMSG ...
>>>
>>> 4 receive commit_token(token_seq:2) from 3
>>>   => enter commit state
>>>     => send commit_token(token_seq:3) to 5
>>>       => commit token retransmit timeout occurs!!
>>>         => retransmit commit_token(token_seq:3) to 5...
>>>         => retransmit commit_token(token_seq:3) to 5...
>>>         => ...
>>>
>>> 5 receive commit_token(token_seq:3) from 4
>>>   => enter commit state
>>>     => send commit_token(token_seq:4) to 1
>>>       => commit token retransmit timeout occurs!!
>>>         => retransmit commit_token(token_seq:4) to 1...
>>>         => retransmit commit_token(token_seq:4) to 1...
>>>         => ...
>>>
>>> The key point in my test case is that one of the processors (3 in this
>>> case) runs into a commit token timeout, while 1 receives the commit
>>> token retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
>> Yunkai,
>>
>> I took a look at the code with a fresh set of eyes this morning for 3
>> hours and don't see how it is possible for processor 2 to crash in
>> orf_token_rtr with the above scenario.  The orf token will be discarded
>> in all cases in GATHER and COMMIT before executing orf_token_rtr.
>> orf_token_rtr should *never* run in GATHER or COMMIT states of the
>> processor.
>>
>> (flatiron-1.4 line numbers):
>>
>> On startup:
>> totemsrp.c:657  - memb_state = OPERATIONAL
>> totemsrp.c:4467 - when the network interface is detected, gather_enter is
>>        called: state = gather, orf_token_discard = 1
>> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>>
>> we can proceed to enter OPERATIONAL here or go directly to COMMIT
>>
>> COMMIT path:
>> totemsrp.c:4331 - COMMIT token received in GATHER state
>> (orf_token_discard still equals 1)
>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>> (orf_token_discard still equals 1)
>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>> equals 1)
>>
>> The only code that resets orf_token_discard to 0 is
>> totemsrp.c:2008 - entered recovery state
>>
>> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
>> COMMIT:
>>
>> totemsrp.c:4276 - memb_state_gather_enter called
>> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>> (orf_token_discard still equals 1)
>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>> equals 1)
>>
>> Given that the orf token would always be discarded by processor 2 in the
>> above test case you gave, I don't see a scenario in the code where
>> orf_token_rtr could crash in that test case.  If you can show me how
>> this happens, we could fix this up.  One possibility is some sort of
>> thread race.  Could you determine that orf_token_discard is equal to 0
>> when your failure occurs?
>>
>> As for entering operational prematurely:
>> If this is a problem, it is a separate problem from the above analysis.
>>  Let's get the above analysis fixed if it is a problem.
>>
>> Really looking forward to hearing back from you so we can fix the
>> problem you see.
>>
>> Regards
>> -steve
>>
>>>> Thanks
>>>> -steve
>>>>
>>>>
>>>>>   range = orf_token->seq - instance->my_aru;
>>>>>   assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>>   ...
>>>>>
>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk at taobao.com>
>>>>> ---
>>>>>  exec/totemsrp.c |    3 ++-
>>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>> index 5a78962..22717ac 100644
>>>>> --- a/exec/totemsrp.c
>>>>> +++ b/exec/totemsrp.c
>>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>>                       break;
>>>>>
>>>>>               case MEMB_STATE_RECOVERY:
>>>>> -                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>>> +                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>> +                             memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>>                               log_printf (instance->totemsrp_log_level_debug,
>>>>>                                       "Sending initial ORF token\n");
>>>>>
>>>>
>>>> _______________________________________________
>>>> discuss mailing list
>>>> discuss at corosync.org
>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>
>>>
>>>
>>
>
>
>
> --
> Yunkai Zhang
> Work at Taobao



-- 
Yunkai Zhang
Work at Taobao

