- Jul 25, 2019
-
-
Tuong Lien authored
This commit along with the next one are to resolve the issues with the link changeover mechanism. See that commit for details. Basically, for the link synching, from now on, we will send only one single ("dummy") SYNCH message to peer. The SYNCH message does not contain any data, just a header conveying the synch point to the peer. A new node capability flag ("TIPC_TUNNEL_ENHANCED") is introduced for backward compatible! Acked-by:
Ying Xue <ying.xue@windriver.com> Acked-by:
Jon Maloy <jon.maloy@ericsson.com> Suggested-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Apr 05, 2019
-
-
Tuong Lien authored
During unicast link transmission, it's observed very often that because of one or a few lost/dis-ordered packets, the sending side will fastly reach the send window limit and must wait for the packets to be arrived at the receiving side or in the worst case, a retransmission must be done first. The sending side cannot release a lot of subsequent packets in its transmq even though all of them might have already been received by the receiving side. That is, one or two packets dis-ordered/lost and dozens of packets have to wait, this obviously reduces the overall throughput! This commit introduces an algorithm to overcome this by using "Gap ACK blocks". Basically, a Gap ACK block will consist of <ack, gap> numbers that describes the link deferdq where packets have been got by the receiving side but with gaps, for example: link deferdq: [1 2 3 4 10 11 13 14 15 20] --> Gap ACK blocks: <4, 5>, <11, 1>, <15, 4>, <20, 0> The Gap ACK blocks will be sent to the sending side along with the traditional ACK or NACK message. Immediately when receiving the message the sending side will now not only release from its transmq the packets ack-ed by the ACK but also by the Gap ACK blocks! So, more packets can be enqueued and transmitted. In addition, the sending side can now do "multi-retransmissions" according to the Gaps reported in the Gap ACK blocks. The new algorithm as verified helps greatly improve the TIPC throughput especially under packet loss condition. So far, a maximum of 32 blocks is quite enough without any "Too few Gap ACK blocks" reports with a 5.0% packet loss rate, however this number can be increased in the furture if needed. Also, the patch is backward compatible. Acked-by:
Ying Xue <ying.xue@windriver.com> Acked-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 19, 2019
-
-
Hoang Le authored
As a preparation for introducing a smooth switching between replicast and broadcast method for multicast message, We have to introduce a new capability flag TIPC_MCAST_RBCTL to handle this new feature. During a cluster upgrade a node can come back with this new capabilities which also must be reflected in the cluster capabilities field. The new feature is only applicable if all node in the cluster supports this new capability. Acked-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Dec 19, 2018
-
-
Tuong Lien authored
As for the sake of debugging/tracing, the commit enables tracepoints in TIPC along with some general trace_events as shown below. It also defines some 'tipc_*_dump()' functions that allow to dump TIPC object data whenever needed, that is, for general debug purposes, ie. not just for the trace_events. The following trace_events are now available: - trace_tipc_skb_dump(): allows to trace and dump TIPC msg & skb data, e.g. message type, user, droppable, skb truesize, cloned skb, etc. - trace_tipc_list_dump(): allows to trace and dump any TIPC buffers or queues, e.g. TIPC link transmq, socket receive queue, etc. - trace_tipc_sk_dump(): allows to trace and dump TIPC socket data, e.g. sk state, sk type, connection type, rmem_alloc, socket queues, etc. - trace_tipc_link_dump(): allows to trace and dump TIPC link data, e.g. link state, silent_intv_cnt, gap, bc_gap, link queues, etc. - trace_tipc_node_dump(): allows to trace and dump TIPC node data, e.g. node state, active links, capabilities, link entries, etc. How to use: Put the trace functions at any places where we want to dump TIPC data or events. Note: a) The dump functions will generate raw data only, that is, to offload the trace event's processing, it can require a tool or script to parse the data but this should be simple. b) The trace_tipc_*_dump() should be reserved for a failure cases only (e.g. the retransmission failure case) or where we do not expect to happen too often, then we can consider enabling these events by default since they will almost not take any effects under normal conditions, but once the rare condition or failure occurs, we get the dumped data fully for post-analysis. For other trace purposes, we can reuse these trace classes as template but different events. c) A trace_event is only effective when we enable it. To enable the TIPC trace_events, echo 1 to 'enable' files in the events/tipc/ directory in the 'debugfs' file system. Normally, they are located at: /sys/kernel/debug/tracing/events/tipc/ For example: To enable the tipc_link_dump event: echo 1 > /sys/kernel/debug/tracing/events/tipc/tipc_link_dump/enable To enable all the TIPC trace_events: echo 1 > /sys/kernel/debug/tracing/events/tipc/enable To collect the trace data: cat trace or cat trace_pipe > /trace.out & To disable all the TIPC trace_events: echo 0 > /sys/kernel/debug/tracing/events/tipc/enable To clear the trace buffer: echo > trace d) Like the other trace_events, the feature like 'filter' or 'trigger' is also usable for the tipc trace_events. For more details, have a look at: Documentation/trace/ftrace.txt MAINTAINERS | add two new files 'trace.h' & 'trace.c' in tipc Acked-by:
Ying Xue <ying.xue@windriver.com> Tested-by:
Ying Xue <ying.xue@windriver.com> Acked-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Sep 29, 2018
-
-
Jon Maloy authored
Messages intended for intitating a connection are currently indistinguishable from regular datagram messages. The TIPC protocol specification defines bit 17 in word 0 as a SYN bit to allow sanity check of such messages in the listening socket, but this has so far never been implemented. We do that in this commit. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jul 12, 2018
-
-
Jon Maloy authored
Some switch infrastructures produce huge amounts of packet duplicates. This becomes a problem if those messages are STATE/NACK protocol messages, causing unnecessary retransmissions of already accepted packets. We now introduce a unique sequence number per STATE protocol message so that duplicates can be identified and ignored. This will also be useful when tracing such cases, and to avert replay attacks when TIPC is encrypted. For compatibility reasons we have to introduce a new capability flag TIPC_LINK_PROTO_SEQNO to handle this new feature. Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Apr 27, 2018
-
-
Jon Maloy authored
After the introduction of a 128-bit node identity it may be difficult for a user to correlate between this identity and the generated node hash address. We now try to make this easier by introducing a new ioctl() call for fetching a node identity by using the hash value as key. This will be particularly useful when we extend some of the commands in the 'tipc' tool, but we also expect regular user applications to need this feature. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Apr 20, 2018
-
-
GhantaKrishnamurthy MohanKrishna authored
Currently, we have option to configure MTU of UDP media. The configured MTU takes effect on the links going up after that moment. I.e, a user has to reset bearer to have new value applied across its links. This is confusing and disturbing on a running cluster. We now introduce the functionality to change the default UDP bearer MTU in struct tipc_bearer. Additionally, the links are updated dynamically, without any need for a reset, when bearer value is changed. We leverage the existing per-link functionality and the design being symetrical to the confguration of link tolerance. Acked-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
GhantaKrishnamurthy MohanKrishna <mohan.krishna.ghanta.krishnamurthy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Mar 23, 2018
-
-
Jon Maloy authored
When a 32-bit node address is generated from a 128-bit identifier, there is a risk of collisions which must be discovered and handled. We do this as follows: - We don't apply the generated address immediately to the node, but do instead initiate a 1 sec trial period to allow other cluster members to discover and handle such collisions. - During the trial period the node periodically sends out a new type of message, DSC_TRIAL_MSG, using broadcast or emulated broadcast, to all the other nodes in the cluster. - When a node is receiving such a message, it must check that the presented 32-bit identifier either is unused, or was used by the very same peer in a previous session. In both cases it accepts the request by not responding to it. - If it finds that the same node has been up before using a different address, it responds with a DSC_TRIAL_FAIL_MSG containing that address. - If it finds that the address has already been taken by some other node, it generates a new, unused address and returns it to the requester. - During the trial period the requesting node must always be prepared to accept a failure message, i.e., a message where a peer suggests a different (or equal) address to the one tried. In those cases it must apply the suggested value as trial address and restart the trial period. This algorithm ensures that in the vast majority of cases a node will have the same address before and after a reboot. If a legacy user configures the address explicitly, there will be no trial period and messages, so this protocol addition is completely backwards compatible. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Maloy authored
We add a 128-bit node identity, as an alternative to the currently used 32-bit node address. For the sake of compatibility and to minimize message header changes we retain the existing 32-bit address field. When not set explicitly by the user, this field will be filled with a hash value generated from the much longer node identity, and be used as a shorthand value for the latter. We permit either the address or the identity to be set by configuration, but not both, so when the address value is set by a legacy user the corresponding 128-bit node identity is generated based on the that value. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Maloy authored
Nominally, TIPC organizes network nodes into a three-level network hierarchy consisting of the levels 'zone', 'cluster' and 'node'. This hierarchy is reflected in the node address format, - it is sub-divided into an 8-bit zone id, and 12 bit cluster id, and a 12-bit node id. However, the 'zone' and 'cluster' levels have in reality never been fully implemented,and never will be. The result of this has been that the first 20 bits the node identity structure have been wasted, and the usable node identity range within a cluster has been limited to 12 bits. This is starting to become a problem. In the following commits, we will need to be able to connect between nodes which are using the whole 32-bit value space of the node address. We therefore remove the restrictions on which values can be assigned to node identity, -it is from now on only a 32-bit integer with no assumed internal structure. Isolation between clusters is now achieved only by setting different values for the 'network id' field used during neighbor discovery, in practice leading to the latter becoming the new cluster identity. The rules for accepting discovery requests/responses from neighboring nodes now become: - If the user is using legacy address format on both peers, reception of discovery messages is subject to the legacy lookup domain check in addition to the cluster id check. - Otherwise, the discovery request/response is always accepted, provided both peers have the same network id. This secures backwards compatibility for users who have been using zone or cluster identities as cluster separators, instead of the intended 'network id'. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Feb 14, 2018
-
-
Jon Maloy authored
Currently, the default link tolerance set in struct tipc_bearer only has effect on links going up after that moment. I.e., a user has to reset all the node's links across that bearer to have the new value applied. This is too limiting and disturbing on a running cluster to be useful. We now change this so that also already existing links are updated dynamically, without any need for a reset, when the bearer value is changed. We leverage the already existing per-link functionality for this to achieve the wanted effect. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Oct 13, 2017
-
-
Jon Maloy authored
As a preparation for introducing flow control for multicast and datagram messaging we need a more strictly defined framework than we have now. A socket must be able keep track of exactly how many and which other sockets it is allowed to communicate with at any moment, and keep the necessary state for those. We therefore introduce a new concept we have named Communication Group. Sockets can join a group via a new setsockopt() call TIPC_GROUP_JOIN. The call takes four parameters: 'type' serves as group identifier, 'instance' serves as an logical member identifier, and 'scope' indicates the visibility of the group (node/cluster/zone). Finally, 'flags' makes it possible to set certain properties for the member. For now, there is only one flag, indicating if the creator of the socket wants to receive a copy of broadcast or multicast messages it is sending via the socket, and if wants to be eligible as destination for its own anycasts. A group is closed, i.e., sockets which have not joined a group will not be able to send messages to or receive messages from members of the group, and vice versa. Any member of a group can send multicast ('group broadcast') messages to all group members, optionally including itself, using the primitive send(). The messages are received via the recvmsg() primitive. A socket can only be member of one group at a time. Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Maloy authored
We see an increasing need to send multiple single-buffer messages of TIPC_SYSTEM_IMPORTANCE to different individual destination nodes. Instead of looping over the send queue and sending each buffer individually, as we do now, we add a new help function tipc_node_distr_xmit() to do this. Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Maloy authored
In the coming commits, functions at the socket level will need the ability to read the availability status of a given node. We therefore introduce a new function for this purpose, while renaming the existing static function currently having the wanted name. Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jan 20, 2017
-
-
Jon Paul Maloy authored
If the bearer carrying multicast messages supports broadcast, those messages will be sent to all cluster nodes, irrespective of whether these nodes host any actual destinations socket or not. This is clearly wasteful if the cluster is large and there are only a few real destinations for the message being sent. In this commit we extend the eligibility of the newly introduced "replicast" transmit option. We now make it possible for a user to select which method he wants to be used, either as a mandatory setting via setsockopt(), or as a relative setting where we let the broadcast layer decide which method to use based on the ratio between cluster size and the message's actual number of destination nodes. In the latter case, a sending socket must stick to a previously selected method until it enters an idle period of at least 5 seconds. This eliminates the risk of message reordering caused by method change, i.e., when changes to cluster size or number of destinations would otherwise mandate a new method to be used. Reviewed-by:
Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Sep 03, 2016
-
-
Jon Paul Maloy authored
When we send broadcasts in clusters of more 70-80 nodes, we sometimes see the broadcast link resetting because of an excessive number of retransmissions. This is caused by a combination of two factors: 1) A 'NACK crunch", where loss of broadcast packets is discovered and NACK'ed by several nodes simultaneously, leading to multiple redundant broadcast retransmissions. 2) The fact that the NACKS as such also are sent as broadcast, leading to excessive load and packet loss on the transmitting switch/bridge. This commit deals with the latter problem, by moving sending of broadcast nacks from the dedicated BCAST_PROTOCOL/NACK message type to regular unicast LINK_PROTOCOL/STATE messages. We allocate 10 unused bits in word 8 of the said message for this purpose, and introduce a new capability bit, TIPC_BCAST_STATE_NACK in order to keep the change backwards compatible. Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Aug 19, 2016
-
-
Richard Alpe authored
Add TIPC_NL_PEER_REMOVE netlink command. This command can remove an offline peer node from the internal data structures. This will be supported by the tipc user space tool in iproute2. Signed-off-by:
Richard Alpe <richard.alpe@ericsson.com> Reviewed-by:
Jon Maloy <jon.maloy@ericsson.com> Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jul 26, 2016
-
-
Parthasarathy Bhuvaragan authored
In this commit, we dump the monitor attributes when queried. The link monitor attributes are separated into two kinds: 1. general attributes per bearer 2. specific attributes per node/peer This style resembles the socket attributes and the nametable publications per socket. Reviewed-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Parthasarathy Bhuvaragan authored
In this commit, we add support to fetch the configured cluster monitoring threshold. Reviewed-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Parthasarathy Bhuvaragan authored
In this commit, we introduce support to configure the minimum threshold to activate the new link monitoring algorithm. Reviewed-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- May 03, 2016
-
-
Jon Paul Maloy authored
There are two flow control mechanisms in TIPC; one at link level that handles network congestion, burst control, and retransmission, and one at connection level which' only remaining task is to prevent overflow in the receiving socket buffer. In TIPC, the latter task has to be solved end-to-end because messages can not be thrown away once they have been accepted and delivered upwards from the link layer, i.e, we can never permit the receive buffer to overflow. Currently, this algorithm is message based. A counter in the receiving socket keeps track of number of consumed messages, and sends a dedicated acknowledge message back to the sender for each 256 consumed message. A counter at the sending end keeps track of the sent, not yet acknowledged messages, and blocks the sender if this number ever reaches 512 unacknowledged messages. When the missing acknowledge arrives, the socket is then woken up for renewed transmission. This works well for keeping the message flow running, as it almost never happens that a sender socket is blocked this way. A problem with the current mechanism is that it potentially is very memory consuming. Since we don't distinguish between small and large messages, we have to dimension the socket receive buffer according to a worst-case of both. I.e., the window size must be chosen large enough to sustain a reasonable throughput even for the smallest messages, while we must still consider a scenario where all messages are of maximum size. Hence, the current fix window size of 512 messages and a maximum message size of 66k results in a receive buffer of 66 MB when truesize(66k) = 131k is taken into account. It is possible to do much better. This commit introduces an algorithm where we instead use 1024-byte blocks as base unit. This unit, always rounded upwards from the actual message size, is used when we advertise windows as well as when we count and acknowledge transmitted data. The advertised window is based on the configured receive buffer size in such a way that even the worst-case truesize/msgsize ratio always is covered. Since the smallest possible message size (from a flow control viewpoint) now is 1024 bytes, we can safely assume this ratio to be less than four, which is the value we are now using. This way, we have been able to reduce the default receive buffer size from 66 MB to 2 MB with maintained performance. In order to keep this solution backwards compatible, we introduce a new capability bit in the discovery protocol, and use this throughout the message sending/reception path to always select the right unit. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
During neighbor discovery, nodes advertise their capabilities as a bit map in a dedicated 16-bit field in the discovery message header. This bit map has so far only be stored in the node structure on the peer nodes, but we now see the need to keep a copy even in the socket structure. This commit adds this functionality. Acked-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Nov 20, 2015
-
-
Jon Paul Maloy authored
We move the definition of struct tipc_link from link.h to link.c in order to minimize its exposure to the rest of the code. When needed, we define new functions to make it possible for external entities to access and set data in the link. Apart from the above, there are no functional changes. Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
In our effort to have less code and include dependencies between entities such as node, link and bearer, we try to narrow down the exposed interface towards the node as much as possible. In this commit, we move the definition of struct tipc_node, along with many of its associated function declarations, from node.h to node.c. We also move some function definitions from link.c and name_distr.c to node.c, since they access fields in struct tipc_node that should not be externally visible. The moved functions are renamed according to new location, and made static whenever possible. There are no functional changes in this commit. Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
According to the node FSM a node in state SELF_UP_PEER_UP cannot change state inside a lock context, except when a TUNNEL_PROTOCOL (SYNCH or FAILOVER) packet arrives. However, the node's individual links may still change state. Since each link now is protected by its own spinlock, we finally have the conditions in place to convert the node spinlock to an rwlock_t. If the node state and arriving packet type are rigth, we can let the link directly receive the packet under protection of its own spinlock and the node lock in read mode. In all other cases we use the node lock in write mode. This enables full concurrent execution between parallel links during steady-state traffic situations, i.e., 99+ % of the time. This commit implements this change. Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
As a preparation to allow parallel links to work more independently from each other we introduce a per-link spinlock, to be stored in the struct nodes's link entry area. Since the node lock still is a regular spinlock there is no increase in parallellism at this stage. Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
The file name_distr.c currently contains three functions, named_cluster_distribute(), tipc_publ_subcscribe() and tipc_publ_unsubscribe() that all directly access fields in struct tipc_node. We want to eliminate such dependencies, so we move those functions to the file node.c and rename them to tipc_node_broadcast(), tipc_node_subscribe() and tipc_node_unsubscribe() respectively. Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Oct 24, 2015
-
-
Jon Paul Maloy authored
After the previous changes in this series, we can now remove some unused code and structures, both in the broadcast, link aggregation and link code. There are no functional changes in this commit. Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
The code path for receiving broadcast packets is currently distinct from the unicast path. This leads to unnecessary code and data duplication, something that can be avoided with some effort. We now introduce separate per-peer tipc_link instances for handling broadcast packet reception. Each receive link keeps a pointer to the common, single, broadcast link instance, and can hence handle release and retransmission of send buffers as if they belonged to the own instance. Furthermore, we let each unicast link instance keep a reference to both the pertaining broadcast receive link, and to the common send link. This makes it possible for the unicast links to easily access data for broadcast link synchronization, as well as for carrying acknowledges for received broadcast packets. Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
Until now, we have tried to support both the newer, dedicated broadcast synchronization mechanism along with the older, less safe, RESET_MSG/ ACTIVATE_MSG based one. The latter method has turned out to be a hazard in a highly dynamic cluster, so we find it safer to disable it completely when we find that the former mechanism is supported by the peer node. For this purpose, we now introduce a new capabability bit, TIPC_BCAST_SYNCH, to inform any peer nodes that dedicated broadcast syncronization is supported by the present node. The new bit is conveyed between peers in the 'capabilities' field of neighbor discovery messages. Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jul 31, 2015
-
-
Jon Paul Maloy authored
After the most recent changes, all access calls to a link which may entail addition of messages to the link's input queue are postpended by an explicit call to tipc_sk_rcv(), using a reference to the correct queue. This means that the potentially hazardous implicit delivery, using tipc_node_unlock() in combination with a binary flag and a cached queue pointer, now has become redundant. This commit removes this implicit delivery mechanism both for regular data messages and for binding table update messages. Tested-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
The node lock is currently grabbed and and released in the function tipc_disc_rcv() in the file discover.c. As a preparation for the next commits, we need to move this node lock handling, along with the code area it is covering, to node.c. This commit introduces this change. Tested-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
Link failover and synchronization have until now been handled by the links themselves, forcing them to have knowledge about and to access parallel links in order to make the two algorithms work correctly. In this commit, we move the control part of this functionality to the link aggregation level in node.c, which is the right location for this. As a result, the two algorithms become easier to follow, and the link implementation becomes simpler. Tested-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
In the next commit, we will move link synch/failover orchestration to the link aggregation level. In order to do this, we first need to extend the node FSM with two more states, NODE_SYNCHING and NODE_FAILINGOVER, plus four new events to enter and leave those states. This commit introduces this change, without yet making use of it. The node FSM now looks as follows: +-----------------------------------------+ | PEER_DOWN_EVT| | | +------------------------+----------------+ | |SELF_DOWN_EVT | | | | | | | | +-----------+ +-----------+ | | |NODE_ | |NODE_ | | | +----------|FAILINGOVER|<---------|SYNCHING |------------+ | | |SELF_ +-----------+ FAILOVER_+-----------+ PEER_ | | | |DOWN_EVT | A BEGIN_EVT A | DOWN_EVT| | | | | | | | | | | | | | | | | | | | |FAILOVER_|FAILOVER_ |SYNCH_ |SYNCH_ | | | | |END_EVT |BEGIN_EVT |BEGIN_EVT|END_EVT | | | | | | | | | | | | | | | | | | | | | +--------------+ | | | | | +------->| SELF_UP_ |<-------+ | | | | +----------------| PEER_UP |------------------+ | | | | |SELF_DOWN_EVT +--------------+ PEER_DOWN_EVT| | | | | | A A | | | | | | | | | | | | | | PEER_UP_EVT| |SELF_UP_EVT | | | | | | | | | | | V V V | | V V V +------------+ +-----------+ +-----------+ +------------+ |SELF_DOWN_ | |SELF_UP_ | |PEER_UP_ | |PEER_DOWN | |PEER_LEAVING|<------|PEER_COMING| |SELF_COMING|------>|SELF_LEAVING| +------------+ SELF_ +-----------+ +-----------+ PEER_ +------------+ | DOWN_EVT A A DOWN_EVT | | | | | | | | | | SELF_UP_EVT| |PEER_UP_EVT | | | | | | | | | |PEER_DOWN_EVT +--------------+ SELF_DOWN_EVT| +------------------->| SELF_DOWN_ |<--------------------+ | PEER_DOWN | +--------------+ Tested-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
In line with our effort to let the node level have full control over its links, we want to move all link reset calls from link.c to node.c. Some of the calls can be moved by simply moving the calling function, when this is the right thing to do. For the remaining calls we use the now established technique of returning a TIPC_LINK_DOWN_EVT flag from tipc_link_rcv(), whereafter we perform the reset call when the call returns. This change serves as a preparation for the coming commits. Tested-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
- Jul 21, 2015
-
-
Jon Paul Maloy authored
We convert packet/message reception according to the same principle we have been using for message sending and timeout handling: We move the function tipc_rcv() to node.c, hence handling the initial packet reception at the link aggregation level. The function grabs the node lock, selects the receiving link, and accesses it via a new call tipc_link_rcv(). This function appends buffers to the input queue for delivery upwards, but it may also append outgoing packets to the xmit queue, just as we do during regular message sending. The latter will happen when buffers are forwarded from the link backlog, or when retransmission is requested. Upon return of this function, and after having released the node lock, tipc_rcv() delivers/tranmsits the contents of those queues, but it may also perform actions such as link activation or reset, as indicated by the return flags from the link. This reduces the number of cpu cycles spent inside the node spinlock, and reduces contention on that lock. Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
The logics for determining when a node is permitted to establish and maintain contact with its peer node becomes non-trivial in the presence of multiple parallel links that may come and go independently. A known failure scenario is that one endpoint registers both its links to the peer lost, cleans up it binding table, and prepares for a table update once contact is re-establihed, while the other endpoint may see its links reset and re-established one by one, hence seeing no need to re-synchronize the binding table. To avoid this, a node must not allow re-establishing contact until it has confirmation that even the peer has lost both links. Currently, the mechanism for handling this consists of setting and resetting two state flags from different locations in the code. This solution is hard to understand and maintain. A closer analysis even reveals that it is not completely safe. In this commit we do instead introduce an FSM that keeps track of the conditions for when the node can establish and maintain links. It has six states and four events, and is strictly based on explicit knowledge about the own node's and the peer node's contact states. Only events leading to state change are shown as edges in the figure below. +--------------+ | SELF_UP/ | +---------------->| PEER_COMING |-----------------+ SELF_ | +--------------+ |PEER_ ESTBL_ | | |ESTBL_ CONTACT| SELF_LOST_CONTACT | |CONTACT | v | | +--------------+ | | PEER_ | SELF_DOWN/ | SELF_ | | LOST_ +--| PEER_LEAVING |<--+ LOST_ v +-------------+ CONTACT | +--------------+ | CONTACT +-----------+ | SELF_DOWN/ |<----------+ +----------| SELF_UP/ | | PEER_DOWN |<----------+ +----------| PEER_UP | +-------------+ SELF_ | +--------------+ | PEER_ +-----------+ | LOST_ +--| SELF_LEAVING/|<--+ LOST_ A | CONTACT | PEER_DOWN | CONTACT | | +--------------+ | | A | PEER_ | PEER_LOST_CONTACT | |SELF_ ESTBL_ | | |ESTBL_ CONTACT| +--------------+ |CONTACT +---------------->| PEER_UP/ |-----------------+ | SELF_COMING | +--------------+ Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
In our effort to move control of the links to the link aggregation layer, we move the perodic link supervision timer to struct tipc_node. The new timer is shared between all links belonging to the node, thus saving resources, while still kicking the FSM on both its pertaining links at each expiration. The current link timer and corresponding functions are removed. Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-
Jon Paul Maloy authored
Currently, message sending is performed through a deep call chain, where the node spinlock is grabbed and held during a significant part of the transmission time. This is clearly detrimental to overall throughput performance; it would be better if we could send the message after the spinlock has been released. In this commit, we do instead let the call revert on the stack after the buffer chain has been added to the transmission queue, whereafter clones of the buffers are transmitted to the device layer outside the spinlock scope. As a further step in our effort to separate the roles of the node and link entities we also move the function tipc_link_xmit() to node.c, and rename it to tipc_node_xmit(). Reviewed-by:
Ying Xue <ying.xue@windriver.com> Signed-off-by:
Jon Maloy <jon.maloy@ericsson.com> Signed-off-by:
David S. Miller <davem@davemloft.net>
-