aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2019-03-21net: sched: flower: set unlocked flag for flower proto opsVlad Buslov
Set TCF_PROTO_OPS_DOIT_UNLOCKED for flower classifier to indicate that its ops callbacks don't require caller to hold rtnl lock. Don't take rtnl lock in fl_destroy_filter_work() that is executed on workqueue instead of being called by cls API and is not affected by setting TCF_PROTO_OPS_DOIT_UNLOCKED. Rtnl mutex is still manually taken by flower classifier before calling hardware offloads API that has not been updated for unlocked execution. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: track rtnl lock stateVlad Buslov
Use 'rtnl_held' flag to track if caller holds rtnl lock. Propagate the flag to internal functions that need to know rtnl lock state. Take rtnl lock before calling tcf APIs that require it (hw offload, bind filter, etc.). Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: protect flower classifier state with spinlockVlad Buslov
struct tcf_proto was extended with spinlock to be used by classifiers instead of global rtnl lock. Use it to protect shared flower classifier data structures (handle_idr, mask hashtable and list) and fields of individual filters that can be accessed concurrently. This patch set uses tcf_proto->lock as per instance lock that protects all filters on tcf_proto. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: handle concurrent tcf proto deletionVlad Buslov
Without rtnl lock protection tcf proto can be deleted concurrently. Check tcf proto 'deleting' flag after taking tcf spinlock to verify that no concurrent deletion is in progress. Return EAGAIN error if concurrent deletion detected, which will cause caller to retry and possibly create new instance of tcf proto. Retry mechanism is a result of fine-grained locking approach used in this and previous changes in series and is necessary to allow concurrent updates on same chain instance. Alternative approach would be to lock the whole chain while updating filters on any of child tp's, adding and removing classifier instances from the chain. However, since most CPU-intensive parts of filter update code are specifically in classifier code and its dependencies (extensions and hw offloads), such approach would negate most of the gains introduced by this change and previous changes in the series when updating same chain instance. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: handle concurrent filter insertion in fl_changeVlad Buslov
Check if user specified a handle and another filter with the same handle was inserted concurrently. Return EAGAIN to retry filter processing (in case it is an overwrite request). Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: protect masks list with spinlockVlad Buslov
Protect modifications of flower masks list with spinlock to remove dependency on rtnl lock and allow concurrent access. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: handle concurrent mask insertionVlad Buslov
Without rtnl lock protection masks with same key can be inserted concurrently. Insert temporary mask with reference count zero to masks hashtable. This will cause any concurrent modifications to retry. Wait for rcu grace period to complete after removing temporary mask from masks hashtable to accommodate concurrent readers. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Suggested-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: add reference counter to flower maskVlad Buslov
Extend fl_flow_mask structure with reference counter to allow parallel modification without relying on rtnl lock. Use rcu read lock to safely lookup mask and increment reference counter in order to accommodate concurrent deletes. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: track filter deletion with flagVlad Buslov
In order to prevent double deletion of filter by concurrent tasks when rtnl lock is not used for synchronization, add 'deleted' filter field. Check value of this field when modifying filters and return error if concurrent deletion is detected. Refactor __fl_delete() to accept pointer to 'last' boolean as argument, and return error code as function return value instead. This is necessary to signal concurrent filter delete to caller. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: introduce reference counting for filtersVlad Buslov
Extend flower filters with reference counting in order to remove dependency on rtnl lock in flower ops and allow to modify filters concurrently. Reference to flower filter can be taken/released concurrently as soon as it is marked as 'unlocked' by last patch in this series. Use atomic reference counter type to make concurrent modifications safe. Always take reference to flower filter while working with it: - Modify fl_get() to take reference to filter. - Implement tp->put() callback as fl_put() function to allow cls API to release reference taken by fl_get(). - Modify fl_change() to assume that caller holds reference to fold and take reference to fnew. - Take reference to filter while using it in fl_walk(). Implement helper functions to get/put filter reference counter. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: refactor fl_changeVlad Buslov
As a preparation for using classifier spinlock instead of relying on external rtnl lock, rearrange code in fl_change. The goal is to group the code which changes classifier state in single block in order to allow following commits in this set to protect it from parallel modification with tp->lock. Data structures that require tp->lock protection are mask hashtable and filters list, and classifier handle_idr. fl_hw_replace_filter() is a sleeping function and cannot be called while holding a spinlock. In order to execute all sequence of changes to shared classifier data structures atomically, call fl_hw_replace_filter() before modifying them. Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: sched: flower: don't check for rtnl on head dereferenceVlad Buslov
Flower classifier only changes root pointer during init and destroy. Cls API implements reference counting for tcf_proto, so there is no danger of concurrent access to tp when it is being destroyed, even without protection provided by rtnl lock. Implement new function fl_head_dereference() to dereference tp->root without checking for rtnl lock. Use it in all flower function that obtain head pointer instead of rtnl_dereference(). Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net: dst: remove gc leftoversJulian Wiedmann
Get rid of some obsolete gc-related documentation and macros that were missed in commit 5b7c9a8ff828 ("net: remove dst gc related code"). CC: Wei Wang <weiwan@google.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21ipv6: Remove fallback argument from ip6_hold_safeDavid Ahern
net and null_fallback are redundant. Remove null_fallback in favor of !net check. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21ipv4: Allow amount of dirty memory from fib resizing to be controllableDavid Ahern
fib_trie implementation calls synchronize_rcu when a certain amount of pages are dirty from freed entries. The number of pages was determined experimentally in 2009 (commit c3059477fce2d). At the current setting, synchronize_rcu is called often -- 51 times in a second in one test with an average of an 8 msec delay adding a fib entry. The total impact is a lot of slow down modifying the fib. This is seen in the output of 'time' - the difference between real time and sys+user. For example, using 720,022 single path routes and 'ip -batch'[1]: $ time ./ip -batch ipv4/routes-1-hops real 0m14.214s user 0m2.513s sys 0m6.783s So roughly 35% of the actual time to install the routes is from the ip command getting scheduled out, most notably due to synchronize_rcu (this is observed using 'perf sched timehist'). This patch makes the amount of dirty memory configurable between 64k where the synchronize_rcu is called often (small, low end systems that are memory sensitive) to 64M where synchronize_rcu is called rarely during a large FIB change (for high end systems with lots of memory). The default is 512kB which corresponds to the current setting of 128 pages with a 4kB page size. As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu in a second blocking for up to 30 msec in a single instance, and a total of almost 100 msec across the 4 calls in the second. The trade off is allowing FIB entries to consume more memory in a given time window but but with much better fib insertion rates (~30% increase in prefixes/sec). With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch file runs in: $ time ./ip -batch ipv4/routes-1-hops real 0m9.692s user 0m2.491s sys 0m6.769s So the dead time is reduced to about 1/2 second or <5% of the real time. [1] 'ip' modified to not request ACK messages which improves route insertion times by about 20% Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21ipv6: Change addrconf_f6i_alloc to use ip6_route_info_createDavid Ahern
Change addrconf_f6i_alloc to generate a fib6_config and call ip6_route_info_create. addrconf_f6i_alloc is the last caller to fib6_info_alloc besides ip6_route_info_create, and there is no reason for it to do its own initialization on a fib6_info. Host routes need to be created even if the device is down, so add a new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init to not error out if device is not up. Notes on the conversion: - ip_fib_metrics_init is the same as fib6_config has fc_mx set to NULL and fc_mx_len set to 0 - dst_nocount is handled by the RTF_ADDRCONF flag - dst_host is handled by fc_dst_len = 128 nh_gw does not get set after the conversion to ip6_route_info_create but it should not be set in addrconf_f6i_alloc since this is a host route not a gateway route. Everything else is a straight forward map between fib6_info and fib6_config. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21ipv6: Move setting default metric for routesDavid Ahern
ip6_route_info_create is a low level function for ensuring fc_metric is set. Move the check and default setting to the 2 locations that do not already set fc_metric before calling ip6_route_info_create. This is required for the next patch which moves addrconf allocations to ip6_route_info_create and want the metric for host routes to be 0. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21net/tls: Replace kfree_skb() with consume_skb()Vakul Garg
To free the skb in normal course of processing, consume_skb() should be used. Only for failure paths, skb_free() is intended to be used. https://www.kernel.org/doc/htmldocs/networking/API-consume-skb.html Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21tipc: fix a null pointer derefHoang Le
In commit c55c8edafa91 ("tipc: smooth change between replicast and broadcast") we introduced new method to eliminate the risk of message reordering that happen in between different nodes. Unfortunately, we forgot checking at receiving side to ignore intra node. We fix this by checking and returning if arrived message from intra node. syzbot report: ================================================================== kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 7820 Comm: syz-executor418 Not tainted 5.0.0+ #61 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:tipc_mcast_filter_msg+0x21b/0x13d0 net/tipc/bcast.c:782 Code: 45 c0 0f 84 39 06 00 00 48 89 5d 98 e8 ce ab a5 fa 49 8d bc 24 c8 00 00 00 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80> 3c 08 00 0f 85 9a 0e 00 00 49 8b 9c 24 c8 00 00 00 48 be 00 00 RSP: 0018:ffff8880959defc8 EFLAGS: 00010202 RAX: 0000000000000019 RBX: ffff888081258a48 RCX: dffffc0000000000 RDX: 0000000000000000 RSI: ffffffff86cab862 RDI: 00000000000000c8 RBP: ffff8880959df030 R08: ffff8880813d0200 R09: ffffed1015d05bc8 R10: ffffed1015d05bc7 R11: ffff8880ae82de3b R12: 0000000000000000 R13: 000000000000002c R14: 0000000000000000 R15: ffff888081258a48 FS: 000000000106a880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020001cc0 CR3: 0000000094a20000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: tipc_sk_filter_rcv+0x182d/0x34f0 net/tipc/socket.c:2168 tipc_sk_enqueue net/tipc/socket.c:2254 [inline] tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305 tipc_sk_mcast_rcv+0x724/0x1020 net/tipc/socket.c:1209 tipc_mcast_xmit+0x7fe/0x1200 net/tipc/bcast.c:410 tipc_sendmcast+0xb36/0xfc0 net/tipc/socket.c:820 __tipc_sendmsg+0x10df/0x18d0 net/tipc/socket.c:1358 tipc_sendmsg+0x53/0x80 net/tipc/socket.c:1291 sock_sendmsg_nosec net/socket.c:651 [inline] sock_sendmsg+0xdd/0x130 net/socket.c:661 ___sys_sendmsg+0x806/0x930 net/socket.c:2260 __sys_sendmsg+0x105/0x1d0 net/socket.c:2298 __do_sys_sendmsg net/socket.c:2307 [inline] __se_sys_sendmsg net/socket.c:2305 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2305 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4401c9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffd887fa9d8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401c9 RDX: 0000000000000000 RSI: 0000000020002140 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a50 R13: 0000000000401ae0 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: ---[ end trace ba79875754e1708f ]--- Reported-by: syzbot+be4bdf2cc3e85e952c50@syzkaller.appspotmail.com Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21tipc: fix use-after-free in tipc_sk_filter_rcvHoang Le
skb free-ed in: 1/ condition 1: tipc_sk_filter_rcv -> tipc_sk_proto_rcv 2/ condition 2: tipc_sk_filter_rcv -> tipc_group_filter_msg This leads to a "use-after-free" access in the next condition. We fix this by intializing the variable at declaration, then it is safe to check this variable to continue processing if condition matches. syzbot report: ================================================================== BUG: KASAN: use-after-free in tipc_sk_filter_rcv+0x2166/0x34f0 net/tipc/socket.c:2167 Read of size 4 at addr ffff88808ea58534 by task kworker/u4:0/7 CPU: 0 PID: 7 Comm: kworker/u4:0 Not tainted 5.0.0+ #61 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Workqueue: tipc_send tipc_conn_send_work Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:131 tipc_sk_filter_rcv+0x2166/0x34f0 net/tipc/socket.c:2167 tipc_sk_enqueue net/tipc/socket.c:2254 [inline] tipc_sk_rcv+0xc45/0x25a0 net/tipc/socket.c:2305 tipc_topsrv_kern_evt+0x3b7/0x580 net/tipc/topsrv.c:610 tipc_conn_send_to_sock+0x43e/0x5f0 net/tipc/topsrv.c:283 tipc_conn_send_work+0x65/0x80 net/tipc/topsrv.c:303 process_one_work+0x98e/0x1790 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x357/0x430 kernel/kthread.c:253 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352 Reported-by: syzbot+e863893591cc7a622e40@syzkaller.appspotmail.com Fixes: c55c8eda ("tipc: smooth change between replicast and broadcast") Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20ipv6: Add icmp_echo_ignore_anycast for ICMPv6Stephen Suryaputra
In addition to icmp_echo_ignore_multicast, there is a need to also prevent responding to pings to anycast addresses for security. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20net: remove 'fallback' argument from dev->ndo_select_queue()Paolo Abeni
After the previous patch, all the callers of ndo_select_queue() provide as a 'fallback' argument netdev_pick_tx. The only exceptions are nested calls to ndo_select_queue(), which pass down the 'fallback' available in the current scope - still netdev_pick_tx. We can drop such argument and replace fallback() invocation with netdev_pick_tx(). This avoids an indirect call per xmit packet in some scenarios (TCP syn, UDP unconnected, XDP generic, pktgen) with device drivers implementing such ndo. It also clean the code a bit. Tested with ixgbe and CONFIG_FCOE=m With pktgen using queue xmit: threads vanilla patched (kpps) (kpps) 1 2334 2428 2 4166 4278 4 7895 8100 v1 -> v2: - rebased after helper's name change Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20packet: rework packet_pick_tx_queue() to use common code selectionPaolo Abeni
Currently packet_pick_tx_queue() is the only caller of ndo_select_queue() using a fallback argument other than netdev_pick_tx. Leveraging rx queue, we can obtain a similar queue selection behavior using core helpers. After this change, ndo_select_queue() is always invoked with netdev_pick_tx() as fallback. We can change ndo_select_queue() signature in a followup patch, dropping an indirect call per transmitted packet in some scenarios (e.g. TCP syn and XDP generic xmit) This changes slightly how af packet queue selection happens when PACKET_QDISC_BYPASS is set. It's now more similar to plan dev_queue_xmit() tacking in account both XPS and TC mapping. v1 -> v2: - rebased after helper name change RFC -> v1: - initialize sender_cpu to the expected value Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20net: dev: rename queue selection helpers.Paolo Abeni
With the following patches, we are going to use __netdev_pick_tx() in many modules. Rename it to netdev_pick_tx(), to make it clear is a public API. Also rename the existing netdev_pick_tx() to netdev_core_pick_tx(), to avoid name clashes. Suggested-by: Eric Dumazet <edumazet@google.com> Suggested-by: David Miller <davem@davemloft.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20net: bridge: use eth_broadcast_addr() to assign broadcast addressMao Wenan
This patch is to use eth_broadcast_addr() to assign broadcast address insetad of memset(). Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20net/tls: Add support of AES128-CCM based ciphersVakul Garg
Added support for AES128-CCM based record encryption. AES128-CCM is similar to AES128-GCM. Both of them have same salt/iv/mac size. The notable difference between the two is that while invoking AES128-CCM operation, the salt||nonce (which is passed as IV) has to be prefixed with a hardcoded value '2'. Further, CCM implementation in kernel requires IV passed in crypto_aead_request() to be full '16' bytes. Therefore, the record structure 'struct tls_rec' has been modified to reserve '16' bytes for IV. This works for both GCM and CCM based cipher. Signed-off-by: Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19ipv6: Add icmp_echo_ignore_multicast support for ICMPv6Stephen Suryaputra
IPv4 has icmp_echo_ignore_broadcast to prevent responding to broadcast pings. IPv6 needs a similar mechanism. v1->v2: - Remove NET_IPV6_ICMP_ECHO_IGNORE_MULTICAST. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19tcp: free request sock directly upon TFO or syncookies errorGuillaume Nault
Since the request socket is created locally, it'd make more sense to use reqsk_free() instead of reqsk_put() in TFO and syncookies' error path. However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the socket; tcp_conn_request() may also have non-null ->rsk_refcnt because of tcp_try_fastopen(). In both cases 'req' hasn't been exposed to the outside world and is safe to free immediately, but that'd trigger the WARN_ON_ONCE in reqsk_free(). Define __reqsk_free() for these situations where we know nobody's referencing the socket, even though ->rsk_refcnt might be non-null. Now we can consolidate the error path of tcp_get_cookie_sock() and tcp_conn_request(). Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19datagram: Make __skb_datagram_iter staticYueHaibing
Fix sparse warning: net/core/datagram.c:411:5: warning: symbol '__skb_datagram_iter' was not declared. Should it be static? Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19tcp: add tcp_inet6_sk() helperEric Dumazet
TCP ipv6 fast path dereferences a pointer to get to the inet6 part of a tcp socket, but given the fixed memory placement, we can do better and avoid a possible cache line miss. This also reduces register pressure, since we let the compiler know about this memory placement. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19tipc: smooth change between replicast and broadcastHoang Le
Currently, a multicast stream may start out using replicast, because there are few destinations, and then it should ideally switch to L2/broadcast IGMP/multicast when the number of destinations grows beyond a certain limit. The opposite should happen when the number decreases below the limit. To eliminate the risk of message reordering caused by method change, a sending socket must stick to a previously selected method until it enters an idle period of 5 seconds. Means there is a 5 seconds pause in the traffic from the sender socket. If the sender never makes such a pause, the method will never change, and transmission may become very inefficient as the cluster grows. With this commit, we allow such a switch between replicast and broadcast without any need for a traffic pause. Solution is to send a dummy message with only the header, also with the SYN bit set, via broadcast or replicast. For the data message, the SYN bit is set and sending via replicast or broadcast (inverse method with dummy). Then, at receiving side any messages follow first SYN bit message (data or dummy message), they will be held in deferred queue until another pair (dummy or data message) arrived in other link. v2: reverse christmas tree declaration Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19tipc: introduce new capability flag for clusterHoang Le
As a preparation for introducing a smooth switching between replicast and broadcast method for multicast message, We have to introduce a new capability flag TIPC_MCAST_RBCTL to handle this new feature. During a cluster upgrade a node can come back with this new capabilities which also must be reflected in the cluster capabilities field. The new feature is only applicable if all node in the cluster supports this new capability. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19tipc: support broadcast/replicast configurable for bc-linkHoang Le
Currently, a multicast stream uses either broadcast or replicast as transmission method, based on the ratio between number of actual destinations nodes and cluster size. However, when an L2 interface (e.g., VXLAN) provides pseudo broadcast support, this becomes very inefficient, as it blindly replicates multicast packets to all cluster/subnet nodes, irrespective of whether they host actual target sockets or not. The TIPC multicast algorithm is able to distinguish real destination nodes from other nodes, and hence provides a smarter and more efficient method for transferring multicast messages than pseudo broadcast can do. Because of this, we now make it possible for users to force the broadcast link to permanently switch to using replicast, irrespective of which capabilities the bearer provides, or pretend to provide. Conversely, we also make it possible to force the broadcast link to always use true broadcast. While maybe less useful in deployed systems, this may at least be useful for testing the broadcast algorithm in small clusters. We retain the current AUTOSELECT ability, i.e., to let the broadcast link automatically select which algorithm to use, and to switch back and forth between broadcast and replicast as the ratio between destination node number and cluster size changes. This remains the default method. Furthermore, we make it possible to configure the threshold ratio for such switches. The default ratio is now set to 10%, down from 25% in the earlier implementation. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_STREAM_SCHEDULER ↵Xin Long
sockopt A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_STREAM_SCHEDULER sockopt. Fixes: 7efba10d6bd2 ("sctp: add SCTP_FUTURE_ASOC and SCTP_CURRENT_ASSOC for SCTP_STREAM_SCHEDULER sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_EVENT sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_EVENT sockopt. Fixes: d251f05e3ba2 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_EVENT sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_ENABLE_STREAM_RESET ↵Xin Long
sockopt A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_ENABLE_STREAM_RESET sockopt. Fixes: 99a62135e127 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_ENABLE_STREAM_RESET sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_PRINFO sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_DEFAULT_PRINFO sockopt. Fixes: 3a583059d187 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_DEFAULT_PRINFO sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_AUTH_DEACTIVATE_KEY ↵Xin Long
sockopt A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_AUTH_DEACTIVATE_KEY sockopt. Fixes: 2af66ff3edc7 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_AUTH_DEACTIVATE_KEY sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_AUTH_DELETE_KEY sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_AUTH_DELETE_KEY sockopt. Fixes: 3adcc300603e ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_AUTH_DELETE_KEY sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_AUTH_ACTIVE_KEY sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_AUTH_ACTIVE_KEY sockopt. Fixes: bf9fb6ad4f29 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_AUTH_ACTIVE_KEY sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_AUTH_KEY sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_AUTH_KEY sockopt. Fixes: 7fb3be13a236 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_AUTH_KEY sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_MAX_BURST sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_MAX_BURST sockopt. Fixes: e0651a0dc877 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_MAX_BURST sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_CONTEXT sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_CONTEXT sockopt. Fixes: 49b037acca8c ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_CONTEXT sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SNDINFO sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_DEFAULT_SNDINFO sockopt. Fixes: 92fc3bd928c9 ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_DEFAULT_SNDINFO sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DELAYED_SACK sockoptXin Long
A similar fix as Patch "sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM sockopt" on SCTP_DELAYED_SACK sockopt. Fixes: 9c5829e1c49e ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_DELAYED_SACK sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: fix ignoring asoc_id for tcp-style sockets on SCTP_DEFAULT_SEND_PARAM ↵Marcelo Ricardo Leitner
sockopt Currently if the user pass an invalid asoc_id to SCTP_DEFAULT_SEND_PARAM on a TCP-style socket, it will silently ignore the new parameters. That's because after not finding an asoc, it is checking asoc_id against the known values of CURRENT/FUTURE/ALL values and that fails to match. IOW, if the user supplies an invalid asoc id or not, it should either match the current asoc or the socket itself so that it will inherit these later. Fixes it by forcing asoc_id to SCTP_FUTURE_ASSOC in case it is a TCP-style socket without an asoc, so that the values get set on the socket. Fixes: 707e45b3dc5a ("sctp: use SCTP_FUTURE_ASSOC and add SCTP_CURRENT_ASSOC for SCTP_DEFAULT_SEND_PARAM sockopt") Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18sctp: not copy sctp_sock pd_lobby in sctp_copy_descendantXin Long
Now sctp_copy_descendant() copies pd_lobby from old sctp scok to new sctp sock. If sctp_sock_migrate() returns error, it will panic when releasing new sock and trying to purge pd_lobby due to the incorrect pointers in pd_lobby. [ 120.485116] kasan: CONFIG_KASAN_INLINE enabled [ 120.486270] kasan: GPF could be caused by NULL-ptr deref or user [ 120.509901] Call Trace: [ 120.510443] sctp_ulpevent_free+0x1e8/0x490 [sctp] [ 120.511438] sctp_queue_purge_ulpevents+0x97/0xe0 [sctp] [ 120.512535] sctp_close+0x13a/0x700 [sctp] [ 120.517483] inet_release+0xdc/0x1c0 [ 120.518215] __sock_release+0x1d2/0x2a0 [ 120.519025] sctp_do_peeloff+0x30f/0x3c0 [sctp] We fix it by not copying sctp_sock pd_lobby in sctp_copy_descendan(), and skb_queue_head_init() can also be removed in sctp_sock_migrate(). Reported-by: syzbot+85e0b422ff140b03672a@syzkaller.appspotmail.com Fixes: 89664c623617 ("sctp: sctp_sock_migrate() returns error if sctp_bind_addr_dup() fails") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18af_packet: fix the tx skb protocol in raw sockets with ETH_P_ALLYoshiki Komachi
I am using "protocol ip" filters in TC to manipulate TC flower classifiers, which are only available with "protocol ip". However, I faced an issue that packets sent via raw sockets with ETH_P_ALL did not match the ip filters even if they did satisfy the condition (e.g., DHCP offer from dhcpd). I have determined that the behavior was caused by an unexpected value stored in skb->protocol, namely, ETH_P_ALL instead of ETH_P_IP, when packets were sent via raw sockets with ETH_P_ALL set. IMHO, storing ETH_P_ALL in skb->protocol is not appropriate for packets sent via raw sockets because ETH_P_ALL is not a real ether type used on wire, but a virtual one. This patch fixes the tx protocol selection in cases of transmission via raw sockets created with ETH_P_ALL so that it asks the driver to extract protocol from the Ethernet header. Fixes: 75c65772c3 ("net/packet: Ask driver for protocol if not provided by user") Signed-off-by: Yoshiki Komachi <komachi.yoshiki@lab.ntt.co.jp> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18packets: Always register packet sk in the same orderMaxime Chevallier
When using fanouts with AF_PACKET, the demux functions such as fanout_demux_cpu will return an index in the fanout socket array, which corresponds to the selected socket. The ordering of this array depends on the order the sockets were added to a given fanout group, so for FANOUT_CPU this means sockets are bound to cpus in the order they are configured, which is OK. However, when stopping then restarting the interface these sockets are bound to, the sockets are reassigned to the fanout group in the reverse order, due to the fact that they were inserted at the head of the interface's AF_PACKET socket list. This means that traffic that was directed to the first socket in the fanout group is now directed to the last one after an interface restart. In the case of FANOUT_CPU, traffic from CPU0 will be directed to the socket that used to receive traffic from the last CPU after an interface restart. This commit introduces a helper to add a socket at the tail of a list, then uses it to register AF_PACKET sockets. Note that this changes the order in which sockets are listed in /proc and with sock_diag. Fixes: dc99f600698d ("packet: Add fanout support") Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18net: rose: fix a possible stack overflowEric Dumazet
rose_write_internal() uses a temp buffer of 100 bytes, but a manual inspection showed that given arbitrary input, rose_create_facilities() can fill up to 110 bytes. Lets use a tailroom of 256 bytes for peace of mind, and remove the bounce buffer : we can simply allocate a big enough skb and adjust its length as needed. syzbot report : BUG: KASAN: stack-out-of-bounds in memcpy include/linux/string.h:352 [inline] BUG: KASAN: stack-out-of-bounds in rose_create_facilities net/rose/rose_subr.c:521 [inline] BUG: KASAN: stack-out-of-bounds in rose_write_internal+0x597/0x15d0 net/rose/rose_subr.c:116 Write of size 7 at addr ffff88808b1ffbef by task syz-executor.0/24854 CPU: 0 PID: 24854 Comm: syz-executor.0 Not tainted 5.0.0+ #97 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 check_memory_region_inline mm/kasan/generic.c:185 [inline] check_memory_region+0x123/0x190 mm/kasan/generic.c:191 memcpy+0x38/0x50 mm/kasan/common.c:131 memcpy include/linux/string.h:352 [inline] rose_create_facilities net/rose/rose_subr.c:521 [inline] rose_write_internal+0x597/0x15d0 net/rose/rose_subr.c:116 rose_connect+0x7cb/0x1510 net/rose/af_rose.c:826 __sys_connect+0x266/0x330 net/socket.c:1685 __do_sys_connect net/socket.c:1696 [inline] __se_sys_connect net/socket.c:1693 [inline] __x64_sys_connect+0x73/0xb0 net/socket.c:1693 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x458079 Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f47b8d9dc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458079 RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000004 RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f47b8d9e6d4 R13: 00000000004be4a4 R14: 00000000004ceca8 R15: 00000000ffffffff The buggy address belongs to the page: page:ffffea00022c7fc0 count:0 mapcount:0 mapping:0000000000000000 index:0x0 flags: 0x1fffc0000000000() raw: 01fffc0000000000 0000000000000000 ffffffff022c0101 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88808b1ffa80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88808b1ffb00: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00 00 03 >ffff88808b1ffb80: f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 04 f3 ^ ffff88808b1ffc00: f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 00 00 ffff88808b1ffc80: 00 00 00 00 00 00 00 f1 f1 f1 f1 f1 f1 01 f2 01 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>