aboutsummaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)Author
2019-09-06Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2019-09-05 1) Several xfrm interface fixes from Nicolas Dichtel: - Avoid an interface ID corruption on changelink. - Fix wrong intterface names in the logs. - Fix a list corruption when changing network namespaces. - Fix unregistation of the underying phydev. 2) Fix a potential warning when merging xfrm_plocy nodes. From Florian Westphal. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05net: Properly update v4 routes with v6 nexthopDonald Sharp
When creating a v4 route that uses a v6 nexthop from a nexthop group. Allow the kernel to properly send the nexthop as v6 via the RTA_VIA attribute. Broken behavior: $ ip nexthop add via fe80::9 dev eth0 $ ip nexthop show id 1 via fe80::9 dev eth0 scope link $ ip route add 4.5.6.7/32 nhid 1 $ ip route show default via 10.0.2.2 dev eth0 4.5.6.7 nhid 1 via 254.128.0.0 dev eth0 10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 $ Fixed behavior: $ ip nexthop add via fe80::9 dev eth0 $ ip nexthop show id 1 via fe80::9 dev eth0 scope link $ ip route add 4.5.6.7/32 nhid 1 $ ip route show default via 10.0.2.2 dev eth0 4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0 10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 $ v2, v3: Addresses code review comments from David Ahern Fixes: dcb1ecb50edf (“ipv4: Prepare for fib6_nh from a nexthop object”) Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-28net: sched: act_sample: fix psample group handling on overwriteVlad Buslov
Action sample doesn't properly handle psample_group pointer in overwrite case. Following issues need to be fixed: - In tcf_sample_init() function RCU_INIT_POINTER() is used to set s->psample_group, even though we neither setting the pointer to NULL, nor preventing concurrent readers from accessing the pointer in some way. Use rcu_swap_protected() instead to safely reset the pointer. - Old value of s->psample_group is not released or deallocated in any way, which results resource leak. Use psample_group_put() on non-NULL value obtained with rcu_swap_protected(). - The function psample_group_put() that released reference to struct psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu grace period when deallocating it. Extend struct psample_group with rcu head and use kfree_rcu when freeing it. Fixes: 5c5670fae430 ("net/sched: Introduce sample tc action") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27net_sched: fix a NULL pointer deref in ipt actionCong Wang
The net pointer in struct xt_tgdtor_param is not explicitly initialized therefore is still NULL when dereferencing it. So we have to find a way to pass the correct net pointer to ipt_destroy_target(). The best way I find is just saving the net pointer inside the per netns struct tcf_idrinfo, which could make this patch smaller. Fixes: 0c66dc1ea3f0 ("netfilter: conntrack: register hooks in netns when needed by ruleset") Reported-and-tested-by: itugrok@yahoo.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-25nexthop: Fix nexthop_num_path for blackhole nexthopsDavid Ahern
Donald reported this sequence: ip next add id 1 blackhole ip next add id 2 blackhole ip ro add 1.1.1.1/32 nhid 1 ip ro add 1.1.1.2/32 nhid 2 would cause a crash. Backtrace is: [ 151.302790] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 151.304043] CPU: 1 PID: 277 Comm: ip Not tainted 5.3.0-rc5+ #37 [ 151.305078] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 [ 151.306526] RIP: 0010:fib_add_nexthop+0x8b/0x2aa [ 151.307343] Code: 35 f7 81 48 8d 14 01 c7 02 f1 f1 f1 f1 c7 42 04 01 f4 f4 f4 48 89 f2 48 c1 ea 03 65 48 8b 0c 25 28 00 00 00 48 89 4d d0 31 c9 <80> 3c 02 00 74 08 48 89 f7 e8 1a e8 53 ff be 08 00 00 00 4c 89 e7 [ 151.310549] RSP: 0018:ffff888116c27340 EFLAGS: 00010246 [ 151.311469] RAX: dffffc0000000000 RBX: ffff8881154ece00 RCX: 0000000000000000 [ 151.312713] RDX: 0000000000000004 RSI: 0000000000000020 RDI: ffff888115649b40 [ 151.313968] RBP: ffff888116c273d8 R08: ffffed10221e3757 R09: ffff888110f1bab8 [ 151.315212] R10: 0000000000000001 R11: ffff888110f1bab3 R12: ffff888115649b40 [ 151.316456] R13: 0000000000000020 R14: ffff888116c273b0 R15: ffff888115649b40 [ 151.317707] FS: 00007f60b4d8d800(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000 [ 151.319113] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 151.320119] CR2: 0000555671ffdc00 CR3: 00000001136ba005 CR4: 0000000000020ee0 [ 151.321367] Call Trace: [ 151.321820] ? fib_nexthop_info+0x635/0x635 [ 151.322572] fib_dump_info+0xaa4/0xde0 [ 151.323247] ? fib_create_info+0x2431/0x2431 [ 151.324008] ? napi_alloc_frag+0x2a/0x2a [ 151.324711] rtmsg_fib+0x2c4/0x3be [ 151.325339] fib_table_insert+0xe2f/0xeee ... fib_dump_info incorrectly has nhs = 0 for blackhole nexthops, so it believes the nexthop object is a multipath group (nhs != 1) and ends up down the nexthop_mpath_fill_node() path which is wrong for a blackhole. The blackhole check in nexthop_num_path is leftover from early days of the blackhole implementation which did not initialize the device. In the end the design was simpler (fewer special case checks) to set the device to loopback in nh_info, so the check in nexthop_num_path should have been removed. Fixes: 430a049190de ("nexthop: Add support for nexthop groups") Reported-by: Donald Sharp <sharpd@cumulusnetworks.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24net: route dump netlink NLM_F_MULTI flag missingJohn Fastabend
An excerpt from netlink(7) man page, In multipart messages (multiple nlmsghdr headers with associated payload in one byte stream) the first and all following headers have the NLM_F_MULTI flag set, except for the last header which has the type NLMSG_DONE. but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a FIB dump. The result is user space applications following above man page excerpt may get confused and may stop parsing msg believing something went wrong. In the golang netlink lib [0] the library logic stops parsing believing the message is not a multipart message. Found this running Cilium[1] against net-next while adding a feature to auto-detect routes. I noticed with multiple route tables we no longer could detect the default routes on net tree kernels because the library logic was not returning them. Fix this by handling the fib_dump_info_fnhe() case the same way the fib_dump_info() handles it by passing the flags argument through the call chain and adding a flags argument to rt_fill_info(). Tested with Cilium stack and auto-detection of routes works again. Also annotated libs to dump netlink msgs and inspected NLM_F_MULTI and NLMSG_DONE flags look correct after this. Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same as is done for fib_dump_info() so this looks correct to me. [0] https://github.com/vishvananda/netlink/ [1] https://github.com/cilium/ Fixes: ee28906fd7a14 ("ipv4: Dump route exceptions if requested") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-21trivial: netns: fix typo in 'struct net.passive' descriptionMike Rapoport
Replace 'decided' with 'decide' so that comment would be /* To decide when the network namespace should be freed. */ Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19ipv6: Fix return value of ipv6_mc_may_pull() for malformed packetsStefano Brivio
Commit ba5ea614622d ("bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls") replaces direct calls to pskb_may_pull() in br_ipv6_multicast_mld2_report() with calls to ipv6_mc_may_pull(), that returns -EINVAL on buffers too short to be valid IPv6 packets, while maintaining the previous handling of the return code. This leads to the direct opposite of the intended effect: if the packet is malformed, -EINVAL evaluates as true, and we'll happily proceed with the processing. Return 0 if the packet is too short, in the same way as this was fixed for IPv4 by commit 083b78a9ed64 ("ip: fix ip_mc_may_pull() return value"). I don't have a reproducer for this, unlike the one referred to by the IPv4 commit, but this is clearly broken. Fixes: ba5ea614622d ("bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls") Signed-off-by: Stefano Brivio <sbrivio@redhat.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18netfilter: nf_tables: map basechain priority to hardware priorityPablo Neira Ayuso
This patch adds initial support for offloading basechains using the priority range from 1 to 65535. This is restricting the netfilter priority range to 16-bit integer since this is what most drivers assume so far from tc. It should be possible to extend this range of supported priorities later on once drivers are updated to support for 32-bit integer priorities. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18net: sched: use major priority number as hardware priorityPablo Neira Ayuso
tc transparently maps the software priority number to hardware. Update it to pass the major priority which is what most drivers expect. Update drivers too so they do not need to lshift the priority field of the flow_cls_common_offload object. The stmmac driver is an exception, since this code assumes the tc software priority is fine, therefore, lshift it just to be conservative. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17Merge branch 'for-upstream' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Johan Hedberg says: ==================== pull request: bluetooth 2019-08-17 Here's a set of Bluetooth fixes for the 5.3-rc series: - Multiple fixes for Qualcomm (btqca & hci_qca) drivers - Minimum encryption key size debugfs setting (this is required for Bluetooth Qualification) - Fix hidp_send_message() to have a meaningful return value ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17Bluetooth: Add debug setting for changing minimum encryption key sizeMarcel Holtmann
For testing and qualification purposes it is useful to allow changing the minimum encryption key size value that the host stack is going to enforce. This adds a new debugfs setting min_encrypt_key_size to achieve this functionality. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2019-08-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net This patchset contains Netfilter fixes for net: 1) Extend selftest to cover flowtable with ipsec, from Florian Westphal. 2) Fix interaction of ipsec with flowtable, also from Florian. 3) User-after-free with bound set to rule that fails to load. 4) Adjust state and timeout for flows that expire. 5) Timeout update race with flows in teardown state. 6) Ensure conntrack id hash calculation use invariants as input, from Dirk Morris. 7) Do not push flows into flowtable for TCP fin/rst packets. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-13netlink: Fix nlmsg_parse as a wrapper for strict message parsingDavid Ahern
Eric reported a syzbot warning: BUG: KMSAN: uninit-value in nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510 CPU: 0 PID: 11812 Comm: syz-executor444 Not tainted 5.3.0-rc3+ #17 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294 nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510 rtm_del_nexthop+0x1b1/0x610 net/ipv4/nexthop.c:1543 rtnetlink_rcv_msg+0x115a/0x1580 net/core/rtnetlink.c:5223 netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5241 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0xf6c/0x1050 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x110f/0x1330 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:637 [inline] sock_sendmsg net/socket.c:657 [inline] ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311 __sys_sendmmsg+0x53a/0xae0 net/socket.c:2413 __do_sys_sendmmsg net/socket.c:2442 [inline] __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2439 __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2439 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:297 entry_SYSCALL_64_after_hwframe+0x63/0xe7 The root cause is nlmsg_parse calling __nla_parse which means the header struct size is not checked. nlmsg_parse should be a wrapper around __nlmsg_parse with NL_VALIDATE_STRICT for the validate argument very much like nlmsg_parse_deprecated is for NL_VALIDATE_LIBERAL. Fixes: 3de6440354465 ("netlink: re-add parse/validate functions in strict mode") Reported-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-09sock: make cookie generation global instead of per netnsDaniel Borkmann
Generating and retrieving socket cookies are a useful feature that is exposed to BPF for various program types through bpf_get_socket_cookie() helper. The fact that the cookie counter is per netns is quite a limitation for BPF in practice in particular for programs in host namespace that use socket cookies as part of a map lookup key since they will be causing socket cookie collisions e.g. when attached to BPF cgroup hooks or cls_bpf on tc egress in host namespace handling container traffic from veth or ipvlan devices with peer in different netns. Change the counter to be global instead. Socket cookie consumers must assume the value as opqaue in any case. Not every socket must have a cookie generated and knowledge of the counter value itself does not provide much value either way hence conversion to global is fine. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Martynas Pumputis <m@lambda.lt> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09netfilter: nf_tables: use-after-free in failing rule with bound setPablo Neira Ayuso
If a rule that has already a bound anonymous set fails to be added, the preparation phase releases the rule and the bound set. However, the transaction object from the abort path still has a reference to the set object that is stale, leading to a use-after-free when checking for the set->bound field. Add a new field to the transaction that specifies if the set is bound, so the abort path can skip releasing it since the rule command owns it and it takes care of releasing it. After this update, the set->bound field is removed. [ 24.649883] Unable to handle kernel paging request at virtual address 0000000000040434 [ 24.657858] Mem abort info: [ 24.660686] ESR = 0x96000004 [ 24.663769] Exception class = DABT (current EL), IL = 32 bits [ 24.669725] SET = 0, FnV = 0 [ 24.672804] EA = 0, S1PTW = 0 [ 24.675975] Data abort info: [ 24.678880] ISV = 0, ISS = 0x00000004 [ 24.682743] CM = 0, WnR = 0 [ 24.685723] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000428952000 [ 24.692207] [0000000000040434] pgd=0000000000000000 [ 24.697119] Internal error: Oops: 96000004 [#1] SMP [...] [ 24.889414] Call trace: [ 24.891870] __nf_tables_abort+0x3f0/0x7a0 [ 24.895984] nf_tables_abort+0x20/0x40 [ 24.899750] nfnetlink_rcv_batch+0x17c/0x588 [ 24.904037] nfnetlink_rcv+0x13c/0x190 [ 24.907803] netlink_unicast+0x18c/0x208 [ 24.911742] netlink_sendmsg+0x1b0/0x350 [ 24.915682] sock_sendmsg+0x4c/0x68 [ 24.919185] ___sys_sendmsg+0x288/0x2c8 [ 24.923037] __sys_sendmsg+0x7c/0xd0 [ 24.926628] __arm64_sys_sendmsg+0x2c/0x38 [ 24.930744] el0_svc_common.constprop.0+0x94/0x158 [ 24.935556] el0_svc_handler+0x34/0x90 [ 24.939322] el0_svc+0x8/0xc [ 24.942216] Code: 37280300 f9404023 91014262 aa1703e0 (f9401863) [ 24.948336] ---[ end trace cebbb9dcbed3b56f ]--- Fixes: f6ac85858976 ("netfilter: nf_tables: unbind set in rule from commit path") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-08net/tls: prevent skb_orphan() from leaking TLS plain text with offloadJakub Kicinski
sk_validate_xmit_skb() and drivers depend on the sk member of struct sk_buff to identify segments requiring encryption. Any operation which removes or does not preserve the original TLS socket such as skb_orphan() or skb_clone() will cause clear text leaks. Make the TCP socket underlying an offloaded TLS connection mark all skbs as decrypted, if TLS TX is in offload mode. Then in sk_validate_xmit_skb() catch skbs which have no socket (or a socket with no validation) and decrypted flag set. Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and sk->sk_validate_xmit_skb are slightly interchangeable right now, they all imply TLS offload. The new checks are guarded by CONFIG_TLS_DEVICE because that's the option guarding the sk_buff->decrypted member. Second, smaller issue with orphaning is that it breaks the guarantee that packets will be delivered to device queues in-order. All TLS offload drivers depend on that scheduling property. This means skb_orphan_partial()'s trick of preserving partial socket references will cause issues in the drivers. We need a full orphan, and as a result netem delay/throttling will cause all TLS offload skbs to be dropped. Reusing the sk_buff->decrypted flag also protects from leaking clear text when incoming, decrypted skb is redirected (e.g. by TC). See commit 0608c69c9a80 ("bpf: sk_msg, sock{map|hash} redirect through ULP") for justification why the internal flag is safe. The only location which could leak the flag in is tcp_bpf_sendmsg(), which is taken care of by clearing the previously unused bit. v2: - remove superfluous decrypted mark copy (Willem); - remove the stale doc entry (Boris); - rely entirely on EOR marking to prevent coalescing (Boris); - use an internal sendpages flag instead of marking the socket (Boris). v3 (Willem): - reorganize the can_skb_orphan_partial() condition; - fix the flag leak-in through tcp_bpf_sendmsg. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08inet: frags: re-introduce skb coalescing for local deliveryGuillaume Nault
Before commit d4289fcc9b16 ("net: IP6 defrag: use rbtrees for IPv6 defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus generating many fragments) and running over an IPsec tunnel, reported more than 6Gbps throughput. After that patch, the same test gets only 9Mbps when receiving on a be2net nic (driver can make a big difference here, for example, ixgbe doesn't seem to be affected). By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing (IPv4 fragment coalescing was dropped by commit 14fe22e33462 ("Revert "ipv4: use skb coalescing in defragmentation"")). Without fragment coalescing, be2net runs out of Rx ring entries and starts to drop frames (ethtool reports rx_drops_no_frags errors). Since the netperf traffic is only composed of UDP fragments, any lost packet prevents reassembly of the full datagram. Therefore, fragments which have no possibility to ever get reassembled pile up in the reassembly queue, until the memory accounting exeeds the threshold. At that point no fragment is accepted anymore, which effectively discards all netperf traffic. When reassembly timeout expires, some stale fragments are removed from the reassembly queue, so a few packets can be received, reassembled and delivered to the netperf receiver. But the nic still drops frames and soon the reassembly queue gets filled again with stale fragments. These long time frames where no datagram can be received explain why the performance drop is so significant. Re-introducing fragment coalescing is enough to get the initial performances again (6.6Gbps with be2net): driver doesn't drop frames anymore (no more rx_drops_no_frags errors) and the reassembly engine works at full speed. This patch is quite conservative and only coalesces skbs for local IPv4 and IPv6 delivery (in order to avoid changing skb geometry when forwarding). Coalescing could be extended in the future if need be, as more scenarios would probably benefit from it. [0]: Test configuration Sender: ip xfrm policy flush ip xfrm state flush ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1 ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1 ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow netserver -D -L fc00:2::1 Receiver: ip xfrm policy flush ip xfrm state flush ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1 ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1 ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6 Signed-off-by: Guillaume Nault <gnault@redhat.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06net: sched: sample: allow accessing psample_group with rtnlVlad Buslov
Recently implemented support for sample action in flow_offload infra leads to following rcu usage warning: [ 1938.234856] ============================= [ 1938.234858] WARNING: suspicious RCU usage [ 1938.234863] 5.3.0-rc1+ #574 Not tainted [ 1938.234866] ----------------------------- [ 1938.234869] include/net/tc_act/tc_sample.h:47 suspicious rcu_dereference_check() usage! [ 1938.234872] other info that might help us debug this: [ 1938.234875] rcu_scheduler_active = 2, debug_locks = 1 [ 1938.234879] 1 lock held by tc/19540: [ 1938.234881] #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970 [ 1938.234900] stack backtrace: [ 1938.234905] CPU: 2 PID: 19540 Comm: tc Not tainted 5.3.0-rc1+ #574 [ 1938.234908] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 1938.234911] Call Trace: [ 1938.234922] dump_stack+0x85/0xc0 [ 1938.234930] tc_setup_flow_action+0xed5/0x2040 [ 1938.234944] fl_hw_replace_filter+0x11f/0x2e0 [cls_flower] [ 1938.234965] fl_change+0xd24/0x1b30 [cls_flower] [ 1938.234990] tc_new_tfilter+0x3e0/0x970 [ 1938.235021] ? tc_del_tfilter+0x720/0x720 [ 1938.235028] rtnetlink_rcv_msg+0x389/0x4b0 [ 1938.235038] ? netlink_deliver_tap+0x95/0x400 [ 1938.235044] ? rtnl_dellink+0x2d0/0x2d0 [ 1938.235053] netlink_rcv_skb+0x49/0x110 [ 1938.235063] netlink_unicast+0x171/0x200 [ 1938.235073] netlink_sendmsg+0x224/0x3f0 [ 1938.235091] sock_sendmsg+0x5e/0x60 [ 1938.235097] ___sys_sendmsg+0x2ae/0x330 [ 1938.235111] ? __handle_mm_fault+0x12cd/0x19e0 [ 1938.235125] ? __handle_mm_fault+0x12cd/0x19e0 [ 1938.235138] ? find_held_lock+0x2b/0x80 [ 1938.235147] ? do_user_addr_fault+0x22d/0x490 [ 1938.235160] __sys_sendmsg+0x59/0xa0 [ 1938.235178] do_syscall_64+0x5c/0xb0 [ 1938.235187] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1938.235192] RIP: 0033:0x7ff9a4d597b8 [ 1938.235197] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54 [ 1938.235200] RSP: 002b:00007ffcfe381c48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 1938.235205] RAX: ffffffffffffffda RBX: 000000005d4497f9 RCX: 00007ff9a4d597b8 [ 1938.235208] RDX: 0000000000000000 RSI: 00007ffcfe381cb0 RDI: 0000000000000003 [ 1938.235211] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006 [ 1938.235214] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001 [ 1938.235217] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001 Change tcf_sample_psample_group() helper to allow using it from both rtnl and rcu protected contexts. Fixes: a7a7be6087b0 ("net/sched: add sample action to the hardware intermediate representation") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06net: sched: police: allow accessing police->params with rtnlVlad Buslov
Recently implemented support for police action in flow_offload infra leads to following rcu usage warning: [ 1925.881092] ============================= [ 1925.881094] WARNING: suspicious RCU usage [ 1925.881098] 5.3.0-rc1+ #574 Not tainted [ 1925.881100] ----------------------------- [ 1925.881104] include/net/tc_act/tc_police.h:57 suspicious rcu_dereference_check() usage! [ 1925.881106] other info that might help us debug this: [ 1925.881109] rcu_scheduler_active = 2, debug_locks = 1 [ 1925.881112] 1 lock held by tc/18591: [ 1925.881115] #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970 [ 1925.881124] stack backtrace: [ 1925.881127] CPU: 2 PID: 18591 Comm: tc Not tainted 5.3.0-rc1+ #574 [ 1925.881130] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 1925.881132] Call Trace: [ 1925.881138] dump_stack+0x85/0xc0 [ 1925.881145] tc_setup_flow_action+0x1771/0x2040 [ 1925.881155] fl_hw_replace_filter+0x11f/0x2e0 [cls_flower] [ 1925.881175] fl_change+0xd24/0x1b30 [cls_flower] [ 1925.881200] tc_new_tfilter+0x3e0/0x970 [ 1925.881231] ? tc_del_tfilter+0x720/0x720 [ 1925.881243] rtnetlink_rcv_msg+0x389/0x4b0 [ 1925.881250] ? netlink_deliver_tap+0x95/0x400 [ 1925.881257] ? rtnl_dellink+0x2d0/0x2d0 [ 1925.881264] netlink_rcv_skb+0x49/0x110 [ 1925.881275] netlink_unicast+0x171/0x200 [ 1925.881284] netlink_sendmsg+0x224/0x3f0 [ 1925.881299] sock_sendmsg+0x5e/0x60 [ 1925.881305] ___sys_sendmsg+0x2ae/0x330 [ 1925.881309] ? task_work_add+0x43/0x50 [ 1925.881314] ? fput_many+0x45/0x80 [ 1925.881329] ? __lock_acquire+0x248/0x1930 [ 1925.881342] ? find_held_lock+0x2b/0x80 [ 1925.881347] ? task_work_run+0x7b/0xd0 [ 1925.881359] __sys_sendmsg+0x59/0xa0 [ 1925.881375] do_syscall_64+0x5c/0xb0 [ 1925.881381] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1925.881384] RIP: 0033:0x7feb245047b8 [ 1925.881388] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83 ec 28 89 54 [ 1925.881391] RSP: 002b:00007ffc2d2a5788 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 1925.881395] RAX: ffffffffffffffda RBX: 000000005d4497ed RCX: 00007feb245047b8 [ 1925.881398] RDX: 0000000000000000 RSI: 00007ffc2d2a57f0 RDI: 0000000000000003 [ 1925.881400] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006 [ 1925.881403] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001 [ 1925.881406] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001 Change tcf_police_rate_bytes_ps() and tcf_police_tcfp_burst() helpers to allow using them from both rtnl and rcu protected contexts. Fixes: 8c8cfc6ed274 ("net/sched: add police action to the hardware intermediate representation") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-05net/tls: partially revert fix transition through disconnect with closeJakub Kicinski
Looks like we were slightly overzealous with the shutdown() cleanup. Even though the sock->sk_state can reach CLOSED again, socket->state will not got back to SS_UNCONNECTED once connections is ESTABLISHED. Meaning we will see EISCONN if we try to reconnect, and EINVAL if we try to listen. Only listen sockets can be shutdown() and reused, but since ESTABLISHED sockets can never be re-connected() or used for listen() we don't need to try to clean up the ULP state early. Fixes: 32857cf57f92 ("net/tls: fix transition through disconnect with close") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-31Merge tag 'mac80211-for-davem-2019-07-31' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Just a few fixes: * revert NETIF_F_LLTX usage as it caused problems * avoid warning on WMM parameters from AP that are too short * fix possible null-ptr dereference in hwsim * fix interface combinations with 4-addr and crypto control ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-26{nl,mac}80211: fix interface combinations on crypto controlled devicesManikanta Pubbisetty
Commit 33d915d9e8ce ("{nl,mac}80211: allow 4addr AP operation on crypto controlled devices") has introduced a change which allows 4addr operation on crypto controlled devices (ex: ath10k). This change has inadvertently impacted the interface combinations logic on such devices. General rule is that software interfaces like AP/VLAN should not be listed under supported interface combinations and should not be considered during validation of these combinations; because of the aforementioned change, AP/VLAN interfaces(if present) will be checked against interfaces supported by the device and blocks valid interface combinations. Consider a case where an AP and AP/VLAN are up and running; when a second AP device is brought up on the same physical device, this AP will be checked against the AP/VLAN interface (which will not be part of supported interface combinations of the device) and blocks second AP to come up. Add a new API cfg80211_iftype_allowed() to fix the problem, this API works for all devices with/without SW crypto control. Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org> Fixes: 33d915d9e8ce ("{nl,mac}80211: allow 4addr AP operation on crypto controlled devices") Link: https://lore.kernel.org/r/1563779690-9716-1-git-send-email-mpubbise@codeaurora.org Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Alexei Starovoitov says: ==================== pull-request: bpf 2019-07-25 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) fix segfault in libbpf, from Andrii. 2) fix gso_segs access, from Eric. 3) tls/sockmap fixes, from Jakub and John. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22bpf: sockmap/tls, close can race with map freeJohn Fastabend
When a map free is called and in parallel a socket is closed we have two paths that can potentially reset the socket prot ops, the bpf close() path and the map free path. This creates a problem with which prot ops should be used from the socket closed side. If the map_free side completes first then we want to call the original lowest level ops. However, if the tls path runs first we want to call the sockmap ops. Additionally there was no locking around prot updates in TLS code paths so the prot ops could be changed multiple times once from TLS path and again from sockmap side potentially leaving ops pointed at either TLS or sockmap when psock and/or tls context have already been destroyed. To fix this race first only update ops inside callback lock so that TLS, sockmap and lowest level all agree on prot state. Second and a ULP callback update() so that lower layers can inform the upper layer when they are being removed allowing the upper layer to reset prot ops. This gets us close to allowing sockmap and tls to be stacked in arbitrary order but will save that patch for *next trees. v4: - make sure we don't free things for device; - remove the checks which swap the callbacks back only if TLS is at the top. Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com Fixes: 02c558b2d5d6 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22net/tls: fix transition through disconnect with closeJohn Fastabend
It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE state via tcp_disconnect() without actually calling tcp_close which would then call the tls close callback. Because of this a user could disconnect a socket then put it in a LISTEN state which would break our assumptions about sockets always being ESTABLISHED state. More directly because close() can call unhash() and unhash is implemented by sockmap if a sockmap socket has TLS enabled we can incorrectly destroy the psock from unhash() and then call its close handler again. But because the psock (sockmap socket representation) is already destroyed we call close handler in sk->prot. However, in some cases (TLS BASE/BASE case) this will still point at the sockmap close handler resulting in a circular call and crash reported by syzbot. To fix both above issues implement the unhash() routine for TLS. v4: - add note about tls offload still needing the fix; - move sk_proto to the cold cache line; - split TX context free into "release" and "free", otherwise the GC work itself is in already freed memory; - more TX before RX for consistency; - reuse tls_ctx_free(); - schedule the GC work after we're done with context to avoid UAF; - don't set the unhash in all modes, all modes "inherit" TLS_BASE's callbacks anyway; - disable the unhash hook for TLS_HW. Fixes: 3c4d7559159bf ("tls: kernel TLS support") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22net/tls: remove sock unlock/lock around strp_done()John Fastabend
The tls close() callback currently drops the sock lock to call strp_done(). Split up the RX cleanup into stopping the strparser and releasing most resources, syncing strparser and finally freeing the context. To avoid the need for a strp_done() call on the cleanup path of device offload make sure we don't arm the strparser until we are sure init will be successful. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22net/tls: remove close callback sock unlock/lock around TX work flushJohn Fastabend
The tls close() callback currently drops the sock lock, makes a cancel_delayed_work_sync() call, and then relocks the sock. By restructuring the code we can avoid droping lock and then reclaiming it. To simplify this we do the following, tls_sk_proto_close set_bit(CLOSING) set_bit(SCHEDULE) cancel_delay_work_sync() <- cancel workqueue lock_sock(sk) ... release_sock(sk) strp_done() Setting the CLOSING bit prevents the SCHEDULE bit from being cleared by any workqueue items e.g. if one happens to be scheduled and run between when we set SCHEDULE bit and cancel work. Then because SCHEDULE bit is set now no new work will be scheduled. Tested with net selftests and bpf selftests. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22net/tls: don't arm strparser immediately in tls_set_sw_offload()Jakub Kicinski
In tls_set_device_offload_rx() we prepare the software context for RX fallback and proceed to add the connection to the device. Unfortunately, software context prep includes arming strparser so in case of a later error we have to release the socket lock to call strp_done(). In preparation for not releasing the socket lock half way through callbacks move arming strparser into a separate function. Following patches will make use of that. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-21tcp: be more careful in tcp_fragment()Eric Dumazet
Some applications set tiny SO_SNDBUF values and expect TCP to just work. Recent patches to address CVE-2019-11478 broke them in case of losses, since retransmits might be prevented. We should allow these flows to make progress. This patch allows the first and last skb in retransmit queue to be split even if memory limits are hit. It also adds the some room due to the fact that tcp_sendmsg() and tcp_sendpage() might overshoot sk_wmem_queued by about one full TSO skb (64KB size). Note this allowance was already present in stable backports for kernels < 4.15 Note for < 4.15 backports : tcp_rtx_queue_tail() will probably look like : static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk) { struct sk_buff *skb = tcp_send_head(sk); return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk); } Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Andrew Prout <aprout@ll.mit.edu> Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com> Tested-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Christoph Paasch <cpaasch@apple.com> Cc: Jonathan Looney <jtl@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-20nl80211: fix VENDOR_CMD_RAW_DATAJohannes Berg
Since ERR_PTR() is an inline, not a macro, just open-code it here so it's usable as an initializer, fixing the build in brcmfmac. Reported-by: Arend Van Spriel <arend.vanspriel@broadcom.com> Fixes: 901bb9891855 ("nl80211: require and validate vendor command policy") Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-19net: flow_offload: add flow_block structure and use itPablo Neira Ayuso
This object stores the flow block callbacks that are attached to this block. Update flow_block_cb_lookup() to take this new object. This patch restores the block sharing feature. Fixes: da3eeb904ff4 ("net: flow_offload: add list handling functions") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19net: flow_offload: rename tc_setup_cb_t to flow_setup_cb_tPablo Neira Ayuso
Rename this type definition and adapt users. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19net: flow_offload: remove netns parameter from flow_block_cb_alloc()Pablo Neira Ayuso
No need to annotate the netns on the flow block callback object, flow_block_cb_is_busy() already checks for used blocks. Fixes: d63db30c8537 ("net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix a deadlock when module is requested via netlink_bind() in nfnetlink, from Florian Westphal. 2) Fix ipt_rpfilter and ip6t_rpfilter with VRF, from Miaohe Lin. 3) Skip master comparison in SIP helper to fix expectation clash under two valid scenarios, from xiao ruizhu. 4) Remove obsolete comments in nf_conntrack codebase, from Yonatan Goldschmidt. 5) Fix redirect extension module autoload, from Christian Hesse. 6) Fix incorrect mssg option sent to client in synproxy, from Fernando Fernandez. 7) Fix incorrect window calculations in TCP conntrack, from Florian Westphal. 8) Don't bail out when updating basechain policy due to recent offload works, also from Florian. 9) Allow symhash to use modulus 1 as other hash extensions do, from Laura.Garcia. 10) Missing NAT chain module autoload for the inet family, from Phil Sutter. 11) Fix missing adjustment of TCP RST packet in synproxy, from Fernando Fernandez. 12) Skip EAGAIN path when nft_meta_bridge is built-in or not selected. 13) Conntrack bridge does not depend on nf_tables_bridge. 14) Turn NF_TABLES_BRIDGE into tristate to fix possible link break of nft_meta_bridge, from Arnd Bergmann. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-18tcp: fix tcp_set_congestion_control() use from bpf hookEric Dumazet
Neal reported incorrect use of ns_capable() from bpf hook. bpf_setsockopt(...TCP_CONGESTION...) -> tcp_set_congestion_control() -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN) -> ns_capable_common() -> current_cred() -> rcu_dereference_protected(current->cred, 1) Accessing 'current' in bpf context makes no sense, since packets are processed from softirq context. As Neal stated : The capability check in tcp_set_congestion_control() was written assuming a system call context, and then was reused from a BPF call site. The fix is to add a new parameter to tcp_set_congestion_control(), so that the ns_capable() call is only performed under the right context. Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Lawrence Brakmo <brakmo@fb.com> Reported-by: Neal Cardwell <ncardwell@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Lawrence Brakmo <brakmo@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-17xfrm interface: fix management of phydevNicolas Dichtel
With the current implementation, phydev cannot be removed: $ ip link add dummy type dummy $ ip link add xfrm1 type xfrm dev dummy if_id 1 $ ip l d dummy kernel:[77938.465445] unregister_netdevice: waiting for dummy to become free. Usage count = 1 Manage it like in ip tunnels, ie just keep the ifindex. Not that the side effect, is that the phydev is now optional. Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Tested-by: Julien Floret <julien.floret@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-07-17xfrm interface: ifname may be wrong in logsNicolas Dichtel
The ifname is copied when the interface is created, but is never updated later. In fact, this property is used only in one error message, where the netdevice pointer is available, thus let's use it. Fixes: f203b76d7809 ("xfrm: Add virtual xfrm interfaces") Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-07-16netfilter: synproxy: fix erroneous tcp mss optionFernando Fernandez Mancera
Now synproxy sends the mss value set by the user on client syn-ack packet instead of the mss value that client announced. Fixes: 48b1de4c110a ("netfilter: add SYNPROXY core/target") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-16netfilter: nf_conntrack_sip: fix expectation clashxiao ruizhu
When conntracks change during a dialog, SDP messages may be sent from different conntracks to establish expects with identical tuples. In this case expects conflict may be detected for the 2nd SDP message and end up with a process failure. The fixing here is to reuse an existing expect who has the same tuple for a different conntrack if any. Here are two scenarios for the case. 1) SERVER CPE | INVITE SDP | 5060 |<----------------------|5060 | 100 Trying | 5060 |---------------------->|5060 | 183 SDP | 5060 |---------------------->|5060 ===> Conntrack 1 | PRACK | 50601 |<----------------------|5060 | 200 OK (PRACK) | 50601 |---------------------->|5060 | 200 OK (INVITE) | 5060 |---------------------->|5060 | ACK | 50601 |<----------------------|5060 | | |<--- RTP stream ------>| | | | INVITE SDP (t38) | 50601 |---------------------->|5060 ===> Conntrack 2 With a certain configuration in the CPE, SIP messages "183 with SDP" and "re-INVITE with SDP t38" will go through the sip helper to create expects for RTP and RTCP. It is okay to create RTP and RTCP expects for "183", whose master connection source port is 5060, and destination port is 5060. In the "183" message, port in Contact header changes to 50601 (from the original 5060). So the following requests e.g. PRACK and ACK are sent to port 50601. It is a different conntrack (let call Conntrack 2) from the original INVITE (let call Conntrack 1) due to the port difference. In this example, after the call is established, there is RTP stream but no RTCP stream for Conntrack 1, so the RTP expect created upon "183" is cleared, and RTCP expect created for Conntrack 1 retains. When "re-INVITE with SDP t38" arrives to create RTP&RTCP expects, current ALG implementation will call nf_ct_expect_related() for RTP and RTCP. The expects tuples are identical to those for Conntrack 1. RTP expect for Conntrack 2 succeeds in creation as the one for Conntrack 1 has been removed. RTCP expect for Conntrack 2 fails in creation because it has idential tuples and 'conflict' with the one retained for Conntrack 1. And then result in a failure in processing of the re-INVITE. 2) SERVER A CPE | REGISTER | 5060 |<------------------| 5060 ==> CT1 | 200 | 5060 |------------------>| 5060 | | | INVITE SDP(1) | 5060 |<------------------| 5060 | 300(multi choice) | 5060 |------------------>| 5060 SERVER B | ACK | 5060 |<------------------| 5060 | INVITE SDP(2) | 5060 |-------------------->| 5060 ==> CT2 | 100 | 5060 |<--------------------| 5060 | 200(contact changes)| 5060 |<--------------------| 5060 | ACK | 5060 |-------------------->| 50601 ==> CT3 | | |<--- RTP stream ---->| | | | BYE | 5060 |<--------------------| 50601 | 200 | 5060 |-------------------->| 50601 | INVITE SDP(3) | 5060 |<------------------| 5060 ==> CT1 CPE sends an INVITE request(1) to Server A, and creates a RTP&RTCP expect pair for this Conntrack 1 (CT1). Server A responds 300 to redirect to Server B. The RTP&RTCP expect pairs created on CT1 are removed upon 300 response. CPE sends the INVITE request(2) to Server B, and creates an expect pair for the new conntrack (due to destination address difference), let call CT2. Server B changes the port to 50601 in 200 OK response, and the following requests ACK and BYE from CPE are sent to 50601. The call is established. There is RTP stream and no RTCP stream. So RTP expect is removed and RTCP expect for CT2 retains. As BYE request is sent from port 50601, it is another conntrack, let call CT3, different from CT2 due to the port difference. So the BYE request will not remove the RTCP expect for CT2. Then another outgoing call is made, with the same RTP port being used (not definitely but possibly). CPE firstly sends the INVITE request(3) to Server A, and tries to create a RTP&RTCP expect pairs for this CT1. In current ALG implementation, the RTCP expect for CT1 fails in creation because it 'conflicts' with the residual one for CT2. As a result the INVITE request fails to send. Signed-off-by: xiao ruizhu <katrina.xiaorz@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-12net: sched: Fix NULL-pointer dereference in tc_indr_block_ing_cmd()Vlad Buslov
After recent refactoring of block offlads infrastructure, indr_dev->block pointer is dereferenced before it is verified to be non-NULL. Example stack trace where this behavior leads to NULL-pointer dereference error when creating vxlan dev on system with mlx5 NIC with offloads enabled: [ 1157.852938] ================================================================== [ 1157.866877] BUG: KASAN: null-ptr-deref in tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1157.880877] Read of size 4 at addr 0000000000000090 by task ip/3829 [ 1157.901637] CPU: 22 PID: 3829 Comm: ip Not tainted 5.2.0-rc6+ #488 [ 1157.914438] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017 [ 1157.929031] Call Trace: [ 1157.938318] dump_stack+0x9a/0xeb [ 1157.948362] ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1157.960262] ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1157.972082] __kasan_report+0x176/0x192 [ 1157.982513] ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1157.994348] kasan_report+0xe/0x20 [ 1158.004324] tc_indr_block_ing_cmd.isra.41+0x9c/0x160 [ 1158.015950] ? tcf_block_setup+0x430/0x430 [ 1158.026558] ? kasan_unpoison_shadow+0x30/0x40 [ 1158.037464] __tc_indr_block_cb_register+0x5f5/0xf20 [ 1158.049288] ? mlx5e_rep_indr_tc_block_unbind+0xa0/0xa0 [mlx5_core] [ 1158.062344] ? tc_indr_block_dev_put.part.47+0x5c0/0x5c0 [ 1158.074498] ? rdma_roce_rescan_device+0x20/0x20 [ib_core] [ 1158.086580] ? br_device_event+0x98/0x480 [bridge] [ 1158.097870] ? strcmp+0x30/0x50 [ 1158.107578] mlx5e_nic_rep_netdevice_event+0xdd/0x180 [mlx5_core] [ 1158.120212] notifier_call_chain+0x6d/0xa0 [ 1158.130753] register_netdevice+0x6fc/0x7e0 [ 1158.141322] ? netdev_change_features+0xa0/0xa0 [ 1158.152218] ? vxlan_config_apply+0x210/0x310 [vxlan] [ 1158.163593] __vxlan_dev_create+0x2ad/0x520 [vxlan] [ 1158.174770] ? vxlan_changelink+0x490/0x490 [vxlan] [ 1158.185870] ? rcu_read_unlock+0x60/0x60 [vxlan] [ 1158.196798] vxlan_newlink+0x99/0xf0 [vxlan] [ 1158.207303] ? __vxlan_dev_create+0x520/0x520 [vxlan] [ 1158.218601] ? rtnl_create_link+0x3d0/0x450 [ 1158.228900] __rtnl_newlink+0x8a7/0xb00 [ 1158.238701] ? stack_access_ok+0x35/0x80 [ 1158.248450] ? rtnl_link_unregister+0x1a0/0x1a0 [ 1158.258735] ? find_held_lock+0x6d/0xd0 [ 1158.268379] ? is_bpf_text_address+0x67/0xf0 [ 1158.278330] ? lock_acquire+0xc1/0x1f0 [ 1158.287686] ? is_bpf_text_address+0x5/0xf0 [ 1158.297449] ? is_bpf_text_address+0x86/0xf0 [ 1158.307310] ? kernel_text_address+0xec/0x100 [ 1158.317155] ? arch_stack_walk+0x92/0xe0 [ 1158.326497] ? __kernel_text_address+0xe/0x30 [ 1158.336213] ? unwind_get_return_address+0x2f/0x50 [ 1158.346267] ? create_prof_cpu_mask+0x20/0x20 [ 1158.355936] ? arch_stack_walk+0x92/0xe0 [ 1158.365117] ? stack_trace_save+0x8a/0xb0 [ 1158.374272] ? stack_trace_consume_entry+0x80/0x80 [ 1158.384226] ? match_held_lock+0x33/0x210 [ 1158.393216] ? kasan_unpoison_shadow+0x30/0x40 [ 1158.402593] rtnl_newlink+0x53/0x80 [ 1158.410925] rtnetlink_rcv_msg+0x3a5/0x600 [ 1158.419777] ? validate_linkmsg+0x400/0x400 [ 1158.428620] ? find_held_lock+0x6d/0xd0 [ 1158.437117] ? match_held_lock+0x1b/0x210 [ 1158.445760] ? validate_linkmsg+0x400/0x400 [ 1158.454642] netlink_rcv_skb+0xc7/0x1f0 [ 1158.463150] ? netlink_ack+0x470/0x470 [ 1158.471538] ? netlink_deliver_tap+0x1f3/0x5a0 [ 1158.480607] netlink_unicast+0x2ae/0x350 [ 1158.489099] ? netlink_attachskb+0x340/0x340 [ 1158.497935] ? _copy_from_iter_full+0xde/0x3b0 [ 1158.506945] ? __virt_addr_valid+0xb6/0xf0 [ 1158.515578] ? __check_object_size+0x159/0x240 [ 1158.524515] netlink_sendmsg+0x4d3/0x630 [ 1158.532879] ? netlink_unicast+0x350/0x350 [ 1158.541400] ? netlink_unicast+0x350/0x350 [ 1158.549805] sock_sendmsg+0x94/0xa0 [ 1158.557561] ___sys_sendmsg+0x49d/0x570 [ 1158.565625] ? copy_msghdr_from_user+0x210/0x210 [ 1158.574457] ? __fput+0x1e2/0x330 [ 1158.581948] ? __kasan_slab_free+0x130/0x180 [ 1158.590407] ? kmem_cache_free+0xb6/0x2d0 [ 1158.598574] ? mark_lock+0xc7/0x790 [ 1158.606177] ? task_work_run+0xcf/0x100 [ 1158.614165] ? exit_to_usermode_loop+0x102/0x110 [ 1158.622954] ? __lock_acquire+0x963/0x1ee0 [ 1158.631199] ? lockdep_hardirqs_on+0x260/0x260 [ 1158.639777] ? match_held_lock+0x1b/0x210 [ 1158.647918] ? lockdep_hardirqs_on+0x260/0x260 [ 1158.656501] ? match_held_lock+0x1b/0x210 [ 1158.664643] ? __fget_light+0xa6/0xe0 [ 1158.672423] ? __sys_sendmsg+0xd2/0x150 [ 1158.680334] __sys_sendmsg+0xd2/0x150 [ 1158.688063] ? __ia32_sys_shutdown+0x30/0x30 [ 1158.696435] ? lock_downgrade+0x2e0/0x2e0 [ 1158.704541] ? mark_held_locks+0x1a/0x90 [ 1158.712611] ? mark_held_locks+0x1a/0x90 [ 1158.720619] ? do_syscall_64+0x1e/0x2c0 [ 1158.728530] do_syscall_64+0x78/0x2c0 [ 1158.736254] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 1158.745414] RIP: 0033:0x7f62d505cb87 [ 1158.753070] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00[87/1817] 48 89 f3 48 [ 1158.780924] RSP: 002b:00007fffd9832268 EFLAGS: 00000246 ORIG_RAX: 000000000000002e [ 1158.793204] RAX: ffffffffffffffda RBX: 000000005d26048f RCX: 00007f62d505cb87 [ 1158.805111] RDX: 0000000000000000 RSI: 00007fffd98322d0 RDI: 0000000000000003 [ 1158.817055] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006 [ 1158.828987] R10: 00007f62d50ce260 R11: 0000000000000246 R12: 0000000000000001 [ 1158.840909] R13: 000000000067e540 R14: 0000000000000000 R15: 000000000067ed20 [ 1158.852873] ================================================================== Introduce new function tcf_block_non_null_shared() that verifies block pointer before dereferencing it to obtain index. Use the function in tc_indr_block_ing_cmd() to prevent NULL pointer dereference. Fixes: 955bcb6ea0df ("drivers: net: use flow block API") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-11net: fib_rules: do not flow dissect local packetsPetar Penkov
Rules matching on loopback iif do not need early flow dissection as the packet originates from the host. Stop counting such rules in fib_rule_requires_fldissect Signed-off-by: Petar Penkov <ppenkov@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-11Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Some highlights from this development cycle: 1) Big refactoring of ipv6 route and neigh handling to support nexthop objects configurable as units from userspace. From David Ahern. 2) Convert explored_states in BPF verifier into a hash table, significantly decreased state held for programs with bpf2bpf calls, from Alexei Starovoitov. 3) Implement bpf_send_signal() helper, from Yonghong Song. 4) Various classifier enhancements to mvpp2 driver, from Maxime Chevallier. 5) Add aRFS support to hns3 driver, from Jian Shen. 6) Fix use after free in inet frags by allocating fqdirs dynamically and reworking how rhashtable dismantle occurs, from Eric Dumazet. 7) Add act_ctinfo packet classifier action, from Kevin Darbyshire-Bryant. 8) Add TFO key backup infrastructure, from Jason Baron. 9) Remove several old and unused ISDN drivers, from Arnd Bergmann. 10) Add devlink notifications for flash update status to mlxsw driver, from Jiri Pirko. 11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski. 12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes. 13) Various enhancements to ipv6 flow label handling, from Eric Dumazet and Willem de Bruijn. 14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van der Merwe, and others. 15) Various improvements to axienet driver including converting it to phylink, from Robert Hancock. 16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean. 17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana Radulescu. 18) Add devlink health reporting to mlx5, from Moshe Shemesh. 19) Convert stmmac over to phylink, from Jose Abreu. 20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from Shalom Toledo. 21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera. 22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel. 23) Track spill/fill of constants in BPF verifier, from Alexei Starovoitov. 24) Support bounded loops in BPF, from Alexei Starovoitov. 25) Various page_pool API fixes and improvements, from Jesper Dangaard Brouer. 26) Just like ipv4, support ref-countless ipv6 route handling. From Wei Wang. 27) Support VLAN offloading in aquantia driver, from Igor Russkikh. 28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy. 29) Add flower GRE encap/decap support to nfp driver, from Pieter Jansen van Vuuren. 30) Protect against stack overflow when using act_mirred, from John Hurley. 31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen. 32) Use page_pool API in netsec driver, Ilias Apalodimas. 33) Add Google gve network driver, from Catherine Sullivan. 34) More indirect call avoidance, from Paolo Abeni. 35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan. 36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek. 37) Add MPLS manipulation actions to TC, from John Hurley. 38) Add sending a packet to connection tracking from TC actions, and then allow flower classifier matching on conntrack state. From Paul Blakey. 39) Netfilter hw offload support, from Pablo Neira Ayuso" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits) net/mlx5e: Return in default case statement in tx_post_resync_params mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync(). net: dsa: add support for BRIDGE_MROUTER attribute pkt_sched: Include const.h net: netsec: remove static declaration for netsec_set_tx_de() net: netsec: remove superfluous if statement netfilter: nf_tables: add hardware offload support net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload net: flow_offload: add flow_block_cb_is_busy() and use it net: sched: remove tcf block API drivers: net: use flow block API net: sched: use flow block API net: flow_offload: add flow_block_cb_{priv, incref, decref}() net: flow_offload: add list handling functions net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free() net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_* net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND net: flow_offload: add flow_block_cb_setup_simple() net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC ...
2019-07-09netfilter: nf_tables: add hardware offload supportPablo Neira Ayuso
This patch adds hardware offload support for nftables through the existing netdev_ops->ndo_setup_tc() interface, the TC_SETUP_CLSFLOWER classifier and the flow rule API. This hardware offload support is available for the NFPROTO_NETDEV family and the ingress hook. Each nftables expression has a new ->offload interface, that is used to populate the flow rule object that is attached to the transaction object. There is a new per-table NFT_TABLE_F_HW flag, that is set on to offload an entire table, including all of its chains. This patch supports for basic metadata (layer 3 and 4 protocol numbers), 5-tuple payload matching and the accept/drop actions; this also includes basechain hardware offload only. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09net: flow_offload: rename tc_cls_flower_offload to flow_cls_offloadPablo Neira Ayuso
And any other existing fields in this structure that refer to tc. Specifically: * tc_cls_flower_offload_flow_rule() to flow_cls_offload_flow_rule(). * TC_CLSFLOWER_* to FLOW_CLS_*. * tc_cls_common_offload to tc_cls_common_offload. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09net: flow_offload: add flow_block_cb_is_busy() and use itPablo Neira Ayuso
This patch adds a function to check if flow block callback is already in use. Call this new function from flow_block_cb_setup_simple() and from drivers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09net: sched: remove tcf block APIPablo Neira Ayuso
Unused, now replaced by flow block API. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09drivers: net: use flow block APIPablo Neira Ayuso
This patch updates flow_block_cb_setup_simple() to use the flow block API. Several drivers are also adjusted to use it. This patch introduces the per-driver list of flow blocks to account for blocks that are already in use. Remove tc_block_offload alias. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09net: flow_offload: add flow_block_cb_{priv, incref, decref}()Pablo Neira Ayuso
This patch completes the flow block API to introduce: * flow_block_cb_priv() to access callback private data. * flow_block_cb_incref() to bump reference counter on this flow block. * flow_block_cb_decref() to decrement the reference counter. These functions are taken from the existing tcf_block_cb_priv(), tcf_block_cb_incref() and tcf_block_cb_decref(). Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09net: flow_offload: add list handling functionsPablo Neira Ayuso
This patch adds the list handling functions for the flow block API: * flow_block_cb_lookup() allows drivers to look up for existing flow blocks. * flow_block_cb_add() adds a flow block to the per driver list to be registered by the core. * flow_block_cb_remove() to remove a flow block from the list of existing flow blocks per driver and to request the core to unregister this. The flow block API also annotates the netns this flow block belongs to. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>