Age | Commit message (Collapse) | Author |
|
In test situations with many nodes and a heavily stressed system we have
observed that the transmission broadcast link may fail due to an
excessive number of retransmissions of the same packet. In such
situations we need to reset all unicast links to all peers, in order to
reset and re-synchronize the broadcast link.
In this commit, we add a new function tipc_bearer_reset_all() to be used
in such situations. The function scans across all bearers and resets all
their pertaining links.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
After a new receiver peer has been added to the broadcast transmission
link, we allow immediate transmission of new broadcast packets, trusting
that the new peer will not accept the packets until it has received the
previously sent unicast broadcast initialiation message. In the same
way, the sender must not accept any acknowledges until it has itself
received the broadcast initialization from the peer, as well as
confirmation of the reception of its own initialization message.
Furthermore, when a receiver peer goes down, the sender has to produce
the missing acknowledges from the lost peer locally, in order ensure
correct release of the buffers that were expected to be acknowledged by
the said peer.
In a highly stressed system we have observed that contact with a peer
may come up and be lost before the above mentioned broadcast initial-
ization and confirmation have been received. This leads to the locally
produced acknowledges being rejected, and the non-acknowledged buffers
to linger in the broadcast link transmission queue until it fills up
and the link goes into permanent congestion.
In this commit, we remedy this by temporarily setting the corresponding
broadcast receive link state to ESTABLISHED and the 'bc_peer_is_up'
state to true before we issue the local acknowledges. This ensures that
those acknowledges will always be accepted. The mentioned state values
are restored immediately afterwards when the link is reset.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
At first contact between two nodes, an endpoint might sometimes have
time to send out a LINK_PROTOCOL/STATE packet before it has received
the broadcast initialization packet from the peer, i.e., before it has
received a valid broadcast packet number to add to the 'bc_ack' field
of the protocol message.
This means that the peer endpoint will receive a protocol packet with an
invalid broadcast acknowledge value of 0. Under unlucky circumstances
this may lead to the original, already received acknowledge value being
overwritten, so that the whole broadcast link goes stale after a while.
We fix this by delaying the setting of the link field 'bc_peer_is_up'
until we know that the peer really has received our own broadcast
initialization message. The latter is always sent out as the first
unicast message on a link, and always with seqeunce number 1. Because
of this, we only need to look for a non-zero unicast acknowledge value
in the arriving STATE messages, and once that is confirmed we know we
are safe and can set the mentioned field. Before this moment, we must
ignore all broadcast acknowledges from the peer.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Fix incorrect use of nla_strlcpy() where the first NLA_HDRLEN bytes
of the link name where left out.
Making the output of tipc-config -ls look something like:
Link statistics:
dcast-link
1:data0-1.1.2:data0
1:data0-1.1.3:data0
Also, for the record, the patch that introduce this regression
claims "Sending the whole object out can cause a leak". Which isn't
very likely as this is a compat layer, where the data we are parsing
is generated by us and we know the string to be NULL terminated. But
you can of course never be to secure.
Fixes: 5d2be1422e02 (tipc: fix an infoleak in tipc_nl_compat_link_dump)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When extracting an individual message from a received "bundle" buffer,
we just create a clone of the base buffer, and adjust it to point into
the right position of the linearized data area of the latter. This works
well for regular message reception, but during periods of extremely high
load it may happen that an extracted buffer, e.g, a connection probe, is
reversed and forwarded through an external interface while the preceding
extracted message is still unhandled. When this happens, the header or
data area of the preceding message will be partially overwritten by a
MAC header, leading to unpredicatable consequences, such as a link
reset.
We now fix this by ensuring that the msg_reverse() function never
returns a cloned buffer, and that the returned buffer always contains
sufficient valid head and tail room to be forwarded.
Reported-by: Erik Hugne <erik.hugne@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We sometimes observe a 'deadly embrace' type deadlock occurring
between mutually connected sockets on the same node. This happens
when the one-hour peer supervision timers happen to expire
simultaneously in both sockets.
The scenario is as follows:
CPU 1: CPU 2:
-------- --------
tipc_sk_timeout(sk1) tipc_sk_timeout(sk2)
lock(sk1.slock) lock(sk2.slock)
msg_create(probe) msg_create(probe)
unlock(sk1.slock) unlock(sk2.slock)
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk2) tipc_sk_rcv(sk1)
lock(sk2.slock) lock((sk1.slock)
filter_rcv() filter_rcv()
tipc_sk_proto_rcv() tipc_sk_proto_rcv()
msg_create(probe_rsp) msg_create(probe_rsp)
tipc_sk_respond() tipc_sk_respond()
tipc_node_xmit_skb() tipc_node_xmit_skb()
tipc_node_xmit() tipc_node_xmit()
tipc_sk_rcv(sk1) tipc_sk_rcv(sk2)
lock((sk1.slock) lock((sk2.slock)
===> DEADLOCK ===> DEADLOCK
Further analysis reveals that there are three different locations in the
socket code where tipc_sk_respond() is called within the context of the
socket lock, with ensuing risk of similar deadlocks.
We now solve this by passing a buffer queue along with all upcalls where
sk_lock.slock may potentially be held. Response or rejected message
buffers are accumulated into this queue instead of being sent out
directly, and only sent once we know we are safely outside the slock
context.
Reported-by: GUNA <gbalasun@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
net/tipc/link.c: In function ‘tipc_link_timeout’:
net/tipc/link.c:744:28: warning: ‘mtyp’ may be used uninitialized in this function [-Wuninitialized]
Fixes: 42b18f605fea ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When run tipcTS&tipcTC test suite, the following complaint appears:
[ 56.926168] ===============================
[ 56.926169] [ INFO: suspicious RCU usage. ]
[ 56.926171] 4.7.0-rc1+ #160 Not tainted
[ 56.926173] -------------------------------
[ 56.926174] net/tipc/bearer.c:408 suspicious rcu_dereference_protected() usage!
[ 56.926175]
[ 56.926175] other info that might help us debug this:
[ 56.926175]
[ 56.926177]
[ 56.926177] rcu_scheduler_active = 1, debug_locks = 1
[ 56.926179] 3 locks held by swapper/4/0:
[ 56.926180] #0: (((&req->timer))){+.-...}, at: [<ffffffff810e79b5>] call_timer_fn+0x5/0x340
[ 56.926203] #1: (&(&req->lock)->rlock){+.-...}, at: [<ffffffffa000c29b>] disc_timeout+0x1b/0xd0 [tipc]
[ 56.926212] #2: (rcu_read_lock){......}, at: [<ffffffffa00055e0>] tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc]
[ 56.926218]
[ 56.926218] stack backtrace:
[ 56.926221] CPU: 4 PID: 0 Comm: swapper/4 Not tainted 4.7.0-rc1+ #160
[ 56.926222] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2007
[ 56.926224] 0000000000000000 ffff880016803d28 ffffffff813c4423 ffff8800154252c0
[ 56.926227] 0000000000000001 ffff880016803d58 ffffffff810b7512 ffff8800124d8120
[ 56.926230] ffff880013f8a160 ffff8800132b5ccc ffff8800124d8120 ffff880016803d88
[ 56.926234] Call Trace:
[ 56.926235] <IRQ> [<ffffffff813c4423>] dump_stack+0x67/0x94
[ 56.926250] [<ffffffff810b7512>] lockdep_rcu_suspicious+0xe2/0x120
[ 56.926256] [<ffffffffa00051f1>] tipc_l2_send_msg+0x131/0x1c0 [tipc]
[ 56.926261] [<ffffffffa000567c>] tipc_bearer_xmit_skb+0x14c/0x2e0 [tipc]
[ 56.926266] [<ffffffffa00055e0>] ? tipc_bearer_xmit_skb+0xb0/0x2e0 [tipc]
[ 56.926273] [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
[ 56.926278] [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
[ 56.926283] [<ffffffffa000c2d6>] disc_timeout+0x56/0xd0 [tipc]
[ 56.926288] [<ffffffff810e7a68>] call_timer_fn+0xb8/0x340
[ 56.926291] [<ffffffff810e79b5>] ? call_timer_fn+0x5/0x340
[ 56.926296] [<ffffffffa000c280>] ? tipc_disc_init_msg+0x1f0/0x1f0 [tipc]
[ 56.926300] [<ffffffff810e8f4a>] run_timer_softirq+0x23a/0x390
[ 56.926306] [<ffffffff810f89ff>] ? clockevents_program_event+0x7f/0x130
[ 56.926316] [<ffffffff819727c3>] __do_softirq+0xc3/0x4a2
[ 56.926323] [<ffffffff8106ba5a>] irq_exit+0x8a/0xb0
[ 56.926327] [<ffffffff81972456>] smp_apic_timer_interrupt+0x46/0x60
[ 56.926331] [<ffffffff81970a49>] apic_timer_interrupt+0x89/0x90
[ 56.926333] <EOI> [<ffffffff81027fda>] ? default_idle+0x2a/0x1a0
[ 56.926340] [<ffffffff81027fd8>] ? default_idle+0x28/0x1a0
[ 56.926342] [<ffffffff810289cf>] arch_cpu_idle+0xf/0x20
[ 56.926345] [<ffffffff810adf0f>] default_idle_call+0x2f/0x50
[ 56.926347] [<ffffffff810ae145>] cpu_startup_entry+0x215/0x3e0
[ 56.926353] [<ffffffff81040ad9>] start_secondary+0xf9/0x100
The warning appears as rtnl_dereference() is wrongly used in
tipc_l2_send_msg() under RCU read lock protection. Instead the proper
usage should be that rcu_dereference_rtnl() is called here.
Fixes: 5b7066c3dd24 ("tipc: stricter filtering of packets in bearer layer")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
link_info.str is a char array of size 60. Memory after the NULL
byte is not initialized. Sending the whole object out can cause
a leak.
Signed-off-by: Kangjie Lu <kjlu@gatech.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Before calling the nla_parse_nested function, make sure the pointer to the
attribute is not null. This patch fixes several potential null pointer
dereference vulnerabilities in the tipc netlink functions.
Signed-off-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCP stack can now run from process context.
Use read_lock_bh(&sk->sk_callback_lock) variant to restore previous
assumption.
Fixes: 5413d1babe8f ("net: do not block BH while processing socket backlog")
Fixes: d41a69f1d390 ("tcp: make tcp_sendmsg() aware of socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
Pull trivial tree updates from Jiri Kosina.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (21 commits)
gitignore: fix wording
mfd: ab8500-debugfs: fix "between" in printk
memstick: trivial fix of spelling mistake on management
cpupowerutils: bench: fix "average"
treewide: Fix typos in printk
IB/mlx4: printk fix
pinctrl: sirf/atlas7: fix printk spelling
serial: mctrl_gpio: Grammar s/lines GPIOs/line GPIOs/, /sets/set/
w1: comment spelling s/minmum/minimum/
Blackfin: comment spelling s/divsor/divisor/
metag: Fix misspellings in comments.
ia64: Fix misspellings in comments.
hexagon: Fix misspellings in comments.
tools/perf: Fix misspellings in comments.
cris: Fix misspellings in comments.
c6x: Fix misspellings in comments.
blackfin: Fix misspelling of 'register' in comment.
avr32: Fix misspelling of 'definitions' in comment.
treewide: Fix typos in printk
Doc: treewide : Fix typos in DocBook/filesystem.xml
...
|
|
The publication field of the old netlink API should contain the
publication key and not the publication reference.
Fixes: 44a8ae94fd55 (tipc: convert legacy nl name table dump to nl compat)
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure the socket for which the user is listing publication exists
before parsing the socket netlink attributes.
Prior to this patch a call without any socket caused a NULL pointer
dereference in tipc_nl_publ_dump().
Tested-and-reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.cm>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When an ACTIVATE or data packet is received in a link in state
ESTABLISHING, the link does not immediately change state to
ESTABLISHED, but does instead return a LINK_UP event to the caller,
which will execute the state change in a different lock context.
This non-atomic approach incurs a low risk that we may have two
LINK_UP events pending simultaneously for the same link, resulting
in the final part of the setup procedure being executed twice. The
only potential harm caused by this it that we may see two LINK_UP
events issued to subsribers of the topology server, something that
may cause confusion.
This commit eliminates this risk by checking if the link is already
up before proceeding with the second half of the setup.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts:
net/ipv4/ip_gre.c
Minor conflicts between tunnel bug fixes in net and
ipv6 tunnel cleanups in net-next.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There are two flow control mechanisms in TIPC; one at link level that
handles network congestion, burst control, and retransmission, and one
at connection level which' only remaining task is to prevent overflow
in the receiving socket buffer. In TIPC, the latter task has to be
solved end-to-end because messages can not be thrown away once they
have been accepted and delivered upwards from the link layer, i.e, we
can never permit the receive buffer to overflow.
Currently, this algorithm is message based. A counter in the receiving
socket keeps track of number of consumed messages, and sends a dedicated
acknowledge message back to the sender for each 256 consumed message.
A counter at the sending end keeps track of the sent, not yet
acknowledged messages, and blocks the sender if this number ever reaches
512 unacknowledged messages. When the missing acknowledge arrives, the
socket is then woken up for renewed transmission. This works well for
keeping the message flow running, as it almost never happens that a
sender socket is blocked this way.
A problem with the current mechanism is that it potentially is very
memory consuming. Since we don't distinguish between small and large
messages, we have to dimension the socket receive buffer according
to a worst-case of both. I.e., the window size must be chosen large
enough to sustain a reasonable throughput even for the smallest
messages, while we must still consider a scenario where all messages
are of maximum size. Hence, the current fix window size of 512 messages
and a maximum message size of 66k results in a receive buffer of 66 MB
when truesize(66k) = 131k is taken into account. It is possible to do
much better.
This commit introduces an algorithm where we instead use 1024-byte
blocks as base unit. This unit, always rounded upwards from the
actual message size, is used when we advertise windows as well as when
we count and acknowledge transmitted data. The advertised window is
based on the configured receive buffer size in such a way that even
the worst-case truesize/msgsize ratio always is covered. Since the
smallest possible message size (from a flow control viewpoint) now is
1024 bytes, we can safely assume this ratio to be less than four, which
is the value we are now using.
This way, we have been able to reduce the default receive buffer size
from 66 MB to 2 MB with maintained performance.
In order to keep this solution backwards compatible, we introduce a
new capability bit in the discovery protocol, and use this throughout
the message sending/reception path to always select the right unit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
During neighbor discovery, nodes advertise their capabilities as a bit
map in a dedicated 16-bit field in the discovery message header. This
bit map has so far only be stored in the node structure on the peer
nodes, but we now see the need to keep a copy even in the socket
structure.
This commit adds this functionality.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In the refactoring commit d570d86497ee ("tipc: enqueue arrived buffers
in socket in separate function") we did by accident replace the test
if (sk->sk_backlog.len == 0)
atomic_set(&tsk->dupl_rcvcnt, 0);
with
if (sk->sk_backlog.len)
atomic_set(&tsk->dupl_rcvcnt, 0);
This effectively disables the compensation we have for the double
receive buffer accounting that occurs temporarily when buffers are
moved from the backlog to the socket receive queue. Until now, this
has gone unnoticed because of the large receive buffer limits we are
applying, but becomes indispensable when we reduce this buffer limit
later in this series.
We now fix this by inverting the mentioned condition.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We have observed complete lock up of broadcast-link transmission due to
unacknowledged packets never being removed from the 'transmq' queue. This
is traced to nodes having their ack field set beyond the sequence number
of packets that have actually been transmitted to them.
Consider an example where node 1 has sent 10 packets to node 2 on a
link and node 3 has sent 20 packets to node 2 on another link. We
see examples of an ack from node 2 destined for node 3 being treated as
an ack from node 2 at node 1. This leads to the ack on the node 1 to node
2 link being increased to 20 even though we have only sent 10 packets.
When node 1 does get around to sending further packets, none of the
packets with sequence numbers less than 21 are actually removed from the
transmq.
To resolve this we reinstate some code lost in commit d999297c3dbb ("tipc:
reduce locking scope during packet reception") which ensures that only
messages destined for the receiving node are processed by that node. This
prevents the sequence numbers from getting out of sync and resolves the
packet leakage, thereby resolving the broadcast-link transmission
lock-ups we observed.
While we are aware that this change only patches over a root problem that
we still haven't identified, this is a sanity test that it is always
legitimate to do. It will remain in the code even after we identify and
fix the real problem.
Reviewed-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: John Thompson <john.thompson@alliedtelesis.co.nz>
Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When we are displaying statistics for the first link established between
two peers, it will always be presented as STANDBY although it in reality
is ACTIVE.
This happens because we forget to set the 'active' flag in the link
instance at the moment it is established. Although this is a bug, it only
has impact on the presentation view of the link, not on its actual
functionality.
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This is never called with a NULL "buf" and anyway, we dereference 's' on
the lines before so it would Oops before we reach the check.
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 42b18f605fea ("tipc: refactor function tipc_link_timeout()"),
introduced a bug which prevents sending of probe messages during
link synchronization phase. This leads to hanging links, if the
bearer is disabled/enabled after links are up.
In this commit, we send the probe messages correctly.
Fixes: 42b18f605fea ("tipc: refactor function tipc_link_timeout()")
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Conflicts were two cases of simple overlapping changes,
nothing serious.
In the UDP case, we need to add a hlist_add_tail_rcu()
to linux/rculist.h, because we've moved UDP socket handling
away from using nulls lists.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch fix spelling typos found in printk
within various part of the kernel sources.
Signed-off-by: Masanari Iida <standby24x7@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
|
|
According to the link FSM, a received traffic packet can take a link
from state ESTABLISHING to ESTABLISHED, but the link can still not be
fully set up in one atomic operation. This means that even if the the
very first packet on the link is a traffic packet with sequence number
1 (one), it has to be dropped and retransmitted.
This can be avoided if we let the mentioned packet be preceded by a
LINK_PROTOCOL/STATE message, which takes up the endpoint before the
arrival of the traffic.
We add this small feature in this commit.
This is a fully compatible change.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In some link establishment scenarios we see that packet #2 may be sent
out before packet #1, forcing the receiver to demand retransmission of
the missing packet. This is harmless, but may cause confusion among
people tracing the packet flow.
Since this is extremely easy to fix, we do so by adding en extra send
call to the bearer immediately after the link has come up.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The function tipc_link_timeout() is unnecessary complex, and can
easily be made more readable.
We do that with this commit. The only functional change is that we
remove a redundant test for whether the broadcast link is up or not.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a link is down, it will continuously try to re-establish contact
with the peer by sending out a RESET or an ACTIVATE message at each
timeout interval. The default value for this interval is currently
375 ms. This is wasteful, and may become a problem in very large
clusters with dozens or hundreds of nodes being down simultaneously.
We now introduce a simple backoff algorithm for these cases. The
first five messages are sent at default rate; thereafter a message
is sent only each 16th timer interval.
This will cover the vast majority of link recycling cases, since the
endpoint starting last will transmit at the higher speed, and the link
should normally be established well be before the rate needs to be
reduced.
The only case where we will see a degradation of link re-establishment
times is when the endpoints remain intact, and a glitch in the
transmission media is causing the link reset. We will then experience
a worst-case re-establishing time of 6 seconds, something we deem
acceptable.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a link endpoint is going down locally, e.g., because its interface
is being stopped, it will spontaneously send out a RESET message to
its peer, informing it about this fact. This saves the peer from
detecting the failure via probing, and hence gives both speedier and
less resource consuming failure detection on the peer side.
According to the link FSM, a receiver of a RESET message, ignoring the
reason for it, must now consider the sender ready to come back up, and
starts periodically sending out ACTIVATE messages to the peer in order
to re-establish the link. Also, according to the FSM, the receiver of
an ACTIVATE message can now go directly to state ESTABLISHED and start
sending regular traffic packets. This is a well-proven and robust FSM.
However, in the case of a reboot, there is a small possibilty that link
endpoint on the rebooted node may have been re-created with a new bearer
identity between the moment it sent its (pre-boot) RESET and the moment
it receives the ACTIVATE from the peer. The new bearer identity cannot
be known by the peer according to this scenario, since traffic headers
don't convey such information. This is a problem, because both endpoints
need to know the correct value of the peer's bearer id at any moment in
time in order to be able to produce correct link events for their users.
The only way to guarantee this is to enforce a full setup message
exchange (RESET + ACTIVATE) even after the reboot, since those messages
carry the bearer idientity in their header.
In this commit we do this by introducing and setting a "stopping" bit in
the header of the spontaneously generated RESET messages, informing the
peer that the sender will not be immediately ready to re-establish the
link. A receiver seeing this bit must act as if this were a locally
detected connectivity failure, and hence has to go through a full two-
way setup message exchange before any link can be re-established.
Although never reported, this problem seems to have always been around.
This protocol addition is fully backwards compatible.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Until now, the requests sent to topology server are queued
to a workqueue by the generic server framework.
These messages are processed by worker threads and trigger the
registered callbacks.
To reduce latency on uniprocessor systems, explicit rescheduling
is performed using cond_resched() after MAX_RECV_MSG_COUNT(25)
messages.
This implementation on SMP systems leads to an subscriber refcnt
error as described below:
When a worker thread yields by calling cond_resched() in a SMP
system, a new worker is created on another CPU to process the
pending workitem. Sometimes the sleeping thread wakes up before
the new thread finishes execution.
This breaks the assumption on ordering and being single threaded.
The fault is more frequent when MAX_RECV_MSG_COUNT is lowered.
If the first thread was processing subscription create and the
second thread processing close(), the close request will free
the subscriber and the create request oops as follows:
[31.224137] WARNING: CPU: 2 PID: 266 at include/linux/kref.h:46 tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228143] CPU: 2 PID: 266 Comm: kworker/u8:1 Not tainted 4.5.0+ #97
[31.228377] Workqueue: tipc_rcv tipc_recv_work [tipc]
[...]
[31.228377] Call Trace:
[31.228377] [<ffffffff812fbb6b>] dump_stack+0x4d/0x72
[31.228377] [<ffffffff8105a311>] __warn+0xd1/0xf0
[31.228377] [<ffffffff8105a3fd>] warn_slowpath_null+0x1d/0x20
[31.228377] [<ffffffffa0098067>] tipc_subscrb_rcv_cb+0x317/0x380 [tipc]
[31.228377] [<ffffffffa00a4984>] tipc_receive_from_sock+0xd4/0x130 [tipc]
[31.228377] [<ffffffffa00a439b>] tipc_recv_work+0x2b/0x50 [tipc]
[31.228377] [<ffffffff81071925>] process_one_work+0x145/0x3d0
[31.246554] ---[ end trace c3882c9baa05a4fd ]---
[31.248327] BUG: spinlock bad magic on CPU#2, kworker/u8:1/266
[31.249119] BUG: unable to handle kernel NULL pointer dereference at 0000000000000428
[31.249323] IP: [<ffffffff81099d0c>] spin_dump+0x5c/0xe0
[31.249323] PGD 0
[31.249323] Oops: 0000 [#1] SMP
In this commit, we
- rename tipc_conn_shutdown() to tipc_conn_release().
- move connection release callback execution from tipc_close_conn()
to a new function tipc_sock_release(), which is executed before
we free the connection.
Thus we release the subscriber during connection release procedure
rather than connection shutdown procedure.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We remove a couple of leftover fields in struct tipc_bearer. Those
were used by the old broadcast implementation, and are not needed
any longer. There is no functional changes in this commit.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If a peer node becomes unavailable, in addition to removing the
nametable entries from this node we also need to purge all deferred
updates associated with this node.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Nametable updates received from the network that cannot be applied
immediately are placed on a defer queue. This queue is global to the
TIPC module, which might cause problems when using TIPC in containers.
To prevent nametable updates from escaping into the wrong namespace,
we make the queue pernet instead.
Signed-off-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Resetting a bearer/interface, with the consequence of resetting all its
pertaining links, is not an atomic action. This becomes particularly
evident in very large clusters, where a lot of traffic may happen on the
remaining links while we are busy shutting them down. In extreme cases,
we may even see links being re-created and re-established before we are
finished with the job.
To solve this, we now introduce a solution where we temporarily detach
the bearer from the interface when the bearer is reset. This inhibits
all packet reception, while sending still is possible. For the latter,
we use the fact that the device's user pointer now is zero to filter out
which packets can be sent during this situation; i.e., outgoing RESET
messages only. This filtering serves to speed up the neighbors'
detection of the loss event, and saves us from unnecessary probing.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When enabling a bearer we create a 'neigbor discoverer' instance by
calling the function tipc_disc_create() before the bearer is actually
registered in the list of enabled bearers. Because of this, the very
first discovery broadcast message, created by the mentioned function,
is lost, since it cannot find any valid bearer to use. Furthermore,
the used send function, tipc_bearer_xmit_skb() does not free the given
buffer when it cannot find a bearer, resulting in the leak of exactly
one send buffer each time a bearer is enabled.
This commit fixes this problem by introducing two changes:
1) Instead of attemting to send the discovery message directly, we let
tipc_disc_create() return the discovery buffer to the calling
function, tipc_enable_bearer(), so that the latter can send it
when the enabling sequence is finished.
2) In tipc_bearer_xmit_skb(), as well as in the two other transmit
functions at the bearer layer, we now free the indicated buffer or
buffer chain when a valid bearer cannot be found.
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Expand headroom further in order to be able to fit the larger IPv6
header. Prior to this patch this caused a skb under panic for certain
tipc packets when using IPv6 UDP bearer(s).
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This patch extends udp_tunnel6_xmit_skb() to pass in the IPv6 flow label
from call sites. Currently, there's no such option and it's always set to
zero when writing ip6_flow_hdr(). Add a label member to ip_tunnel_key, so
that flow-based tunnels via collect metadata frontends can make use of it.
vxlan and geneve will be converted to add flow label support separately.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Several cases of overlapping changes, as well as one instance
(vxlan) of a bug fix in 'net' overlapping with code movement
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make the c files less cluttered and enable netlink attributes to be
shared between files.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Reviewed-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Until now, we have kept a pre-allocated protocol message header
aggregated into struct tipc_link. Apart from adding unnecessary
footprint to the link instances, this requires extra code both to
initialize and re-initialize it.
We now remove this sub-optimization. This change also makes it
possible to clean up the function tipc_build_proto_msg() and remove
a couple of small functions that were accessing the mentioned header.
In particular, we can replace all occurrences of the local function
call link_own_addr(link) with the generic tipc_own_addr(net).
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
commit 4d5cfcba2f6e ('tipc: fix connection abort during subscription
cancel'), removes the check for a valid subscription before calling
tipc_nametbl_subscribe().
This will lead to a nullptr exception when we process a
subscription cancel request. For a cancel request, a null
subscription is passed to tipc_nametbl_subscribe() resulting
in exception.
In this commit, we call tipc_nametbl_subscribe() only for
a valid subscription.
Fixes: 4d5cfcba2f6e ('tipc: fix connection abort during subscription cancel')
Reported-by: Anders Widell <anders.widell@ericsson.com>
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure the user has provided a scope for multicast and link local
addresses used locally by a UDP bearer.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
The netlink policy for TIPC_NLA_UDP_LOCAL and TIPC_NLA_UDP_REMOTE
is of type binary with a defined length. This causes the policy
framework to threat the defined length as maximum length.
There is however no protection against a user sending a smaller
amount of data. Prior to this patch this wasn't handled which could
result in a partially incomplete sockaddr_storage struct containing
uninitialized data.
In this patch we use nla_memcpy() when copying the user data. This
ensures a potential gap at the end is cleared out properly.
This was found by Julia with Coccinelle tool.
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Make sure we have a link before checking if it has been reset or not.
Prior to this patch tipc_link_is_reset() could be called with a non
existing link, resulting in a null pointer dereference.
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Prior to this patch enabling a IPv4 UDP bearer caused a null pointer
dereference in iptunnel_xmit_stats(), when it tried to dereference the
net device from the skb. To resolve this we now point the skb device
to the net device resolved from the routing table.
Fixes: 039f50629b7f (ip_tunnel: Move stats update to iptunnel_xmit())
Signed-off-by: Richard Alpe <richard.alpe@ericsson.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Reviewed-by: Erik Hugne <erik.hugne@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
reverts commit 94153e36e709e ("tipc: use existing sk_write_queue for
outgoing packet chain")
In Commit 94153e36e709e, we assume that we fill & empty the socket's
sk_write_queue within the same lock_sock() session.
This is not true if the link is congested. During congestion, the
socket lock is released while we wait for the congestion to cease.
This implementation causes a nullptr exception, if the user space
program has several threads accessing the same socket descriptor.
Consider two threads of the same program performing the following:
Thread1 Thread2
-------------------- ----------------------
Enter tipc_sendmsg() Enter tipc_sendmsg()
lock_sock() lock_sock()
Enter tipc_link_xmit(), ret=ELINKCONG spin on socket lock..
sk_wait_event() :
release_sock() grab socket lock
: Enter tipc_link_xmit(), ret=0
: release_sock()
Wakeup after congestion
lock_sock()
skb = skb_peek(pktchain);
!! TIPC_SKB_CB(skb)->wakeup_pending = tsk->link_cong;
In this case, the second thread transmits the buffers belonging to
both thread1 and thread2 successfully. When the first thread wakeup
after the congestion it assumes that the pktchain is intact and
operates on the skb's in it, which leads to the following exception:
[2102.439969] BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
[2102.440074] IP: [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
[2102.440074] PGD 3fa3f067 PUD 3fa6b067 PMD 0
[2102.440074] Oops: 0000 [#1] SMP
[2102.440074] CPU: 2 PID: 244 Comm: sender Not tainted 3.12.28 #1
[2102.440074] RIP: 0010:[<ffffffffa005f330>] [<ffffffffa005f330>] __tipc_link_xmit+0x2b0/0x4d0 [tipc]
[...]
[2102.440074] Call Trace:
[2102.440074] [<ffffffff8163f0b9>] ? schedule+0x29/0x70
[2102.440074] [<ffffffffa006a756>] ? tipc_node_unlock+0x46/0x170 [tipc]
[2102.440074] [<ffffffffa005f761>] tipc_link_xmit+0x51/0xf0 [tipc]
[2102.440074] [<ffffffffa006d8ae>] tipc_send_stream+0x11e/0x4f0 [tipc]
[2102.440074] [<ffffffff8106b150>] ? __wake_up_sync+0x20/0x20
[2102.440074] [<ffffffffa006dc9c>] tipc_send_packet+0x1c/0x20 [tipc]
[2102.440074] [<ffffffff81502478>] sock_sendmsg+0xa8/0xd0
[2102.440074] [<ffffffff81507895>] ? release_sock+0x145/0x170
[2102.440074] [<ffffffff815030d8>] ___sys_sendmsg+0x3d8/0x3e0
[2102.440074] [<ffffffff816426ae>] ? _raw_spin_unlock+0xe/0x10
[2102.440074] [<ffffffff81115c2a>] ? handle_mm_fault+0x6ca/0x9d0
[2102.440074] [<ffffffff8107dd65>] ? set_next_entity+0x85/0xa0
[2102.440074] [<ffffffff816426de>] ? _raw_spin_unlock_irq+0xe/0x20
[2102.440074] [<ffffffff8107463c>] ? finish_task_switch+0x5c/0xc0
[2102.440074] [<ffffffff8163ea8c>] ? __schedule+0x34c/0x950
[2102.440074] [<ffffffff81504e12>] __sys_sendmsg+0x42/0x80
[2102.440074] [<ffffffff81504e62>] SyS_sendmsg+0x12/0x20
[2102.440074] [<ffffffff8164aed2>] system_call_fastpath+0x16/0x1b
In this commit, we maintain the skb list always in the stack.
Signed-off-by: Parthasarathy Bhuvaragan <parthasarathy.bhuvaragan@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
msg.dst_sk needs to be set up with a valid socket because some callbacks
later derive the netns from it.
Fixes: 263ea09084d172d ("Revert "genl: Add genlmsg_new_unicast() for unicast message allocation")
Reported-by: Jon Maloy <maloy@donjonn.com>
Bisected-by: Jon Maloy <maloy@donjonn.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When the TIPC module is unloaded, we have identified a race condition
that allows a node reference counter to go to zero and the node instance
being freed before the node timer is finished with accessing it. This
leads to occasional crashes, especially in multi-namespace environments.
The scenario goes as follows:
CPU0:(node_stop) CPU1:(node_timeout) // ref == 2
1: if(!mod_timer())
2: if (del_timer())
3: tipc_node_put() // ref -> 1
4: tipc_node_put() // ref -> 0
5: kfree_rcu(node);
6: tipc_node_get(node)
7: // BOOM!
We now clean up this functionality as follows:
1) We remove the node pointer from the node lookup table before we
attempt deactivating the timer. This way, we reduce the risk that
tipc_node_find() may obtain a valid pointer to an instance marked
for deletion; a harmless but undesirable situation.
2) We use del_timer_sync() instead of del_timer() to safely deactivate
the node timer without any risk that it might be reactivated by the
timeout handler. There is no risk of deadlock here, since the two
functions never touch the same spinlocks.
3: We remove a pointless tipc_node_get() + tipc_node_put() from the
timeout handler.
Reported-by: Zhijiang Hu <huzhijiang@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Although we have never seen it happen, we have identified the
following problematic scenario when nodes are stopped and deleted:
CPU0: CPU1:
tipc_node_xxx() //ref == 1
tipc_node_put() //ref -> 0
tipc_node_find() // node still in table
tipc_node_delete()
list_del_rcu(n. list)
tipc_node_get() //ref -> 1, bad
kfree_rcu()
tipc_node_put() //ref to 0 again.
kfree_rcu() // BOOM!
We fix this by introducing use of the conditional kref_get_if_not_zero()
instead of kref_get() in the function tipc_node_find(). This eliminates
any risk of post-mortem access.
Reported-by: Zhijiang Hu <huzhijiang@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|