[corosync] Has anyone used corosync with both big & little endian systems in a single cluster?
thompa26 at gmail.com
Mon Nov 18 04:39:28 UTC 2013
On Fri, Nov 15, 2013 at 2:29 AM, Steven Dake <sdake at redhat.com> wrote:
> On 11/14/2013 02:22 AM, Christine Caulfield wrote:
>> On 14/11/13 05:01, John Thompson wrote:
>>> I am using corosync in a cluster that includes both big and little
>>> endian systems and am coming
>>> across crashes when there are retransmits in the cluster.
>>> I wondered therefore if others had tried this previously?
>>> As part of this I have identified that totempg_deliver_fn modifies the
>>> mcast msg in place to
>>> convert for endian purposes, even though it might still be on a sort
>>> queue and used for retransmission.
>>> This means that if there are different endian systems operating and a
>>> retransmission of the msg
>>> is performed, it will have been endian converted in-place and so what
>>> the node receives is a message that has some endian converted fields.
>>> I will submit a patch for this.
> Endian conversion happens on receipt of the message and is based upon a
> field in the message indicating which endian the message was originated
> with. If a message is changed in a retransmit queue, I would expect it's
> endian field is also modified, resulting in newly transmitted messages
> being correctly decoded by the receivers.
> When totem was originally written in Corosync, we had ppc, arm, and x86_64
> as all major platforms for Corosync. But corosync hasn't been tried in
> years on these platforms. It did work grand at one point ;) Most of the
> world has moved to x86_64 so the need hasn't presented itself to focus on
> this area of the code base lately.
>> I suspect it hasn't been tried for a very long time! if you have a patch
>> that fixes the bug it will be gratefully received :-)
Thanks for the responses.
I was trying out corosync in a cluster with a big endian and 4 little endian
systems. When there was a degree of packet loss, which led to
retransmissions, a crash would occur. I worked out that the crash was in
totempg_deliver_fn, where the mcast->msg_count field was VERY high. When I
checked the number, it looked to be endian swapped. So I tried
endian swapping into a local variable in this function, and the
totempg_deliver_fn crashes no longer occur.
I have looked into it further and believe this is because
totemsrp.c:messages_deliver_to_app (which ends up calling
totempg_deliver_fn) delivers the msg while it remains on the
regular_sort_queue, which can be used for retransmission. This means
that if msg_count gets endian swapped in place and the message then has
to be retransmitted, the node that requested the retransmission gets a
message whose msg_count has already been endian swapped.
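
To illustrate the difference, here is a minimal sketch of the two approaches.
It is not the corosync code: struct demo_mcast, swab16, deliver_in_place and
deliver_with_local are made-up names for illustration, and the header is
reduced to the single msg_count field.

```c
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the mcast header (assumption:
 * only msg_count matters for this illustration). */
struct demo_mcast {
    uint16_t msg_count;
};

/* Swap the two bytes of a 16-bit value. */
static uint16_t swab16(uint16_t x)
{
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Problem pattern: the conversion is written back into the buffer,
 * which may still sit on a queue used for retransmission. */
static uint16_t deliver_in_place(struct demo_mcast *queued,
                                 int endian_conversion_required)
{
    if (endian_conversion_required) {
        queued->msg_count = swab16(queued->msg_count);
    }
    return queued->msg_count;
}

/* Workaround pattern: convert into a local variable and leave the
 * queued buffer exactly as it was received. */
static uint16_t deliver_with_local(const struct demo_mcast *queued,
                                   int endian_conversion_required)
{
    uint16_t msg_count = queued->msg_count;

    if (endian_conversion_required) {
        msg_count = swab16(msg_count);
    }
    return msg_count;
}
```

With a queued msg_count of 0x0300 (the value 3 in the other byte order),
both functions return 3, but only deliver_in_place changes the queued copy,
so a later retransmit of that buffer would carry the already swapped field.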
I have sent in a patch that resolves this problem. The only concern I have
with it is what I changed around the fragmentation case. I think I
have this wrong and am preparing an updated patch to get this right.