aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2020-03-27bpf: Allow to retrieve cgroup v1 classid from v2 hooksDaniel Borkmann
Today, Kubernetes is still operating on cgroups v1, however, it is possible to retrieve the task's classid based on 'current' out of connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators which attach to the root cgroup v2 hook in a mixed env like in case of Cilium, for example, in order to then correlate certain pod traffic and use it as part of the key for BPF map lookups. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Add netns cookie and enable it for bpf cgroup hooksDaniel Borkmann
In Cilium we're mainly using BPF cgroup hooks today in order to implement kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*), ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic between Cilium managed nodes. While this works in its current shape and avoids packet-level NAT for inter Cilium managed node traffic, there is one major limitation we're facing today, that is, lack of netns awareness. In Kubernetes, the concept of Pods (which hold one or multiple containers) has been built around network namespaces, so while we can use the global scope of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing NodePort ports on loopback addresses), we also have the need to differentiate between initial network namespaces and non-initial one. For example, ExternalIP services mandate that non-local service IPs are not to be translated from the host (initial) network namespace as one example. Right now, we have an ugly work-around in place where non-local service IPs for ExternalIP services are not xlated from connect() and friends BPF hooks but instead via less efficient packet-level NAT on the veth tc ingress hook for Pod traffic. On top of determining whether we're in initial or non-initial network namespace we also have a need for a socket-cookie like mechanism for network namespaces scope. Socket cookies have the nice property that they can be combined as part of the key structure e.g. for BPF LRU maps without having to worry that the cookie could be recycled. We are planning to use this for our sessionAffinity implementation for services. Therefore, add a new bpf_get_netns_cookie() helper which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would provide the cookie for the initial network namespace while passing the context instead of NULL would provide the cookie from the application's network namespace. We're using a hole, so no size increase; the assignment happens only once. Therefore this allows for a comparison on initial namespace as well as regular cookie usage as we have today with socket cookies. We could later on enable this helper for other program types as well as we would see need. (*) Both externalTrafficPolicy={Local|Cluster} types [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable perf event rb output for bpf cgroup progsDaniel Borkmann
Currently, connect(), sendmsg(), recvmsg() and bind-related hooks are all lacking perf event rb output in order to push notifications or monitoring events up to user space. Back in commit a5a3a828cd00 ("bpf: add perf event notificaton support for sock_ops"), I've worked with Sowmini to enable them for sock_ops where the context part is not used (as opposed to skbs for example where the packet data can be appended). Make the bpf_sockopt_event_output() helper generic and enable it for mentioned hooks. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/69c39daf87e076b31e52473c902e9bfd37559124.1585323121.git.daniel@iogearbox.net
2020-03-27bpf: Enable retrieval of socket cookie for bind/post-bind hookDaniel Borkmann
We currently make heavy use of the socket cookie in BPF's connect(), sendmsg() and recvmsg() hooks for load-balancing decisions. However, it is currently not enabled/implemented in BPF {post-}bind hooks where it can later be used in combination for correlation in the tc egress path, for example. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/e9d71f310715332f12d238cc650c1edc5be55119.1585323121.git.daniel@iogearbox.net
2020-03-23bpf: Add bpf_sk_storage support to bpf_tcp_caMartin KaFai Lau
This patch adds bpf_sk_storage_get() and bpf_sk_storage_delete() helper to the bpf_tcp_ca's struct_ops. That would allow bpf-tcp-cc to: 1) share sk private data with other bpf progs. 2) use bpf_sk_storage as a private storage for a bpf-tcp-cc if the existing icsk_ca_priv is not big enough. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320152101.2169498-1-kafai@fb.com
2020-03-20bpf, tcp: Make tcp_bpf_recvmsg staticYueHaibing
After commit f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP"), tcp_bpf_recvmsg() is not used out of tcp_bpf.c, so make it static and remove it from tcp.h. Also move it to BPF_STREAM_PARSER #ifdef to fix unused function warnings. Fixes: f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP") Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320023426.60684-3-yuehaibing@huawei.com
2020-03-20bpf, tcp: Fix unused function warningsYueHaibing
If BPF_STREAM_PARSER is not set, gcc warns: net/ipv4/tcp_bpf.c:483:12: warning: 'tcp_bpf_sendpage' defined but not used [-Wunused-function] net/ipv4/tcp_bpf.c:395:12: warning: 'tcp_bpf_sendmsg' defined but not used [-Wunused-function] net/ipv4/tcp_bpf.c:13:13: warning: 'tcp_bpf_stream_read' defined but not used [-Wunused-function] Moves the unused functions into the #ifdef CONFIG_BPF_STREAM_PARSER. Fixes: f747632b608f ("bpf: sockmap: Move generic sockmap hooks from BPF TCP") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Lorenz Bauer <lmb@cloudflare.com> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320023426.60684-2-yuehaibing@huawei.com
2020-03-16net: kcm: kcmproc.c: Fix RCU list suspicious usage warningMadhuparna Bhowmik
This path fixes the suspicious RCU usage warning reported by kernel test robot. net/kcm/kcmproc.c:#RCU-list_traversed_in_non-reader_section There is no need to use list_for_each_entry_rcu() in kcm_stats_seq_show() as the list is always traversed under knet->mutex held. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16net: sched: set the hw_stats_type in pedit loopJiri Pirko
For a single pedit action, multiple offload entries may be used. Set the hw_stats_type to all of them. Fixes: 44f865801741 ("sched: act: allow user to specify type of HW stats for a filter") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15net: dsa: warn if phylink_mac_link_state returns errorRussell King
Issue a warning to the kernel log if phylink_mac_link_state() returns an error. This should not occur, but let's make it visible. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15mptcp: drop unneeded checksPaolo Abeni
After the previous patch subflow->conn is always != NULL and is never changed. We can drop a bunch of now unneeded checks. v1 -> v2: - rebased on top of commit 2398e3991bda ("mptcp: always include dack if possible.") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15mptcp: create msk earlyPaolo Abeni
This change moves the mptcp socket allocation from mptcp_accept() to subflow_syn_recv_sock(), so that subflow->conn is now always set for the non fallback scenario. It allows cleaning up a bit mptcp_accept() reducing the additional locking and will allow fourther cleanup in the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15tipc: add NULL pointer check to prevent kernel oopsHoang Le
Calling: tipc_node_link_down()-> - tipc_node_write_unlock()->tipc_mon_peer_down() - tipc_mon_peer_down() just after disabling bearer could be caused kernel oops. Fix this by adding a sanity check to make sure valid memory access. Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15tipc: simplify trivial boolean returnHoang Le
Checking and returning 'true' boolean is useless as it will be returning at end of function Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Acked-by: Ying Xue <ying.xue@windriver.com> Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14net: sched: RED: Introduce an ECN nodrop modePetr Machata
When the RED Qdisc is currently configured to enable ECN, the RED algorithm is used to decide whether a certain SKB should be marked. If that SKB is not ECN-capable, it is early-dropped. It is also possible to keep all traffic in the queue, and just mark the ECN-capable subset of it, as appropriate under the RED algorithm. Some switches support this mode, and some installations make use of it. To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is configured with this flag, non-ECT traffic is enqueued instead of being early-dropped. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14net: sched: Allow extending set of supported RED flagsPetr Machata
The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool of global RED flags. These are passed in tc_red_qopt.flags. However none of these qdiscs validate the flag field, and just copy it over wholesale to internal structures, and later dump it back. (An exception is GRED, which does validate for VQs -- however not for the main setup.) A broken userspace can therefore configure a qdisc with arbitrary unsupported flags, and later expect to see the flags on qdisc dump. The current ABI therefore allows storage of several bits of custom data to qdisc instances of the types mentioned above. How many bits, depends on which flags are meaningful for the qdisc in question. E.g. SFQ recognizes flags ECN and HARDDROP, and the rest is not interpreted. If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it, and at the same time it needs to retain the possibility to store 6 bits of uninterpreted data. Likewise RED, which adds a new flag later in this patchset. To that end, this patch adds a new function, red_get_flags(), to split the passed flags of RED-like qdiscs to flags and user bits, and red_validate_flags() to validate the resulting configuration. It further adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags. Signed-off-by: Petr Machata <petrm@mellanox.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2020-03-13 The following pull-request contains BPF updates for your *net-next* tree. We've added 86 non-merge commits during the last 12 day(s) which contain a total of 107 files changed, 5771 insertions(+), 1700 deletions(-). The main changes are: 1) Add modify_return attach type which allows to attach to a function via BPF trampoline and is run after the fentry and before the fexit programs and can pass a return code to the original caller, from KP Singh. 2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher objects to be visible in /proc/kallsyms so they can be annotated in stack traces, from Jiri Olsa. 3) Extend BPF sockmap to allow for UDP next to existing TCP support in order in order to enable this for BPF based socket dispatch, from Lorenz Bauer. 4) Introduce a new bpftool 'prog profile' command which attaches to existing BPF programs via fentry and fexit hooks and reads out hardware counters during that period, from Song Liu. Example usage: bpftool prog profile id 337 duration 3 cycles instructions llc_misses 4228 run_cnt 3403698 cycles (84.08%) 3525294 instructions # 1.04 insn per cycle (84.05%) 13 llc_misses # 3.69 LLC misses per million isns (83.50%) 5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition of a new bpf_link abstraction to keep in particular BPF tracing programs attached even when the applicaion owning them exits, from Andrii Nakryiko. 6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering and which returns the PID as seen by the init namespace, from Carlos Neira. 7) Refactor of RISC-V JIT code to move out common pieces and addition of a new RV32G BPF JIT compiler, from Luke Nelson. 8) Add gso_size context member to __sk_buff in order to be able to know whether a given skb is GSO or not, from Willem de Bruijn. 9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output implementation but can be called from tracepoint programs, from Eelco Chaudron. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13bpf: Add bpf_trampoline_ name prefix for DECLARE_BPF_DISPATCHERBjörn Töpel
Adding bpf_trampoline_ name prefix for DECLARE_BPF_DISPATCHER, so all the dispatchers have the common name prefix. And also a small '_' cleanup for bpf_dispatcher_nopfunc function name. Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200312195610.346362-3-jolsa@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-13ethtool: fix spelling mistake "exceeeds" -> "exceeds"Colin Ian King
There are a couple of spelling mistakes in NL_SET_ERR_MSG_ATTR messages. Fix these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller
Minor overlapping changes, nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12bpf: Add bpf_xdp_output() helperEelco Chaudron
Introduce new helper that reuses existing xdp perf_event output implementation, but can be called from raw_tracepoint programs that receive 'struct xdp_buff *' as a tracepoint argument. Signed-off-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial
2020-03-12inet: Use fallthrough;Joe Perches
Convert the various uses of fallthrough comments to fallthrough; Done via script Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/ And by hand: net/ipv6/ip6_fib.c has a fallthrough comment outside of an #ifdef block that causes gcc to emit a warning if converted in-place. So move the new fallthrough; inside the containing #ifdef/#endif too. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: add CHANNELS_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_CHANNELS_NTF notification whenever channel counts of a network device are modified using ETHTOOL_MSG_CHANNELS_SET netlink message or ETHTOOL_SCHANNELS ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: set device channel counts with CHANNELS_SET requestMichal Kubecek
Implement CHANNELS_SET netlink request to set channel counts of a network device. These are traditionally set with ETHTOOL_SCHANNELS ioctl request. Like the ioctl implementation, the generic ethtool code checks if supplied values do not exceed driver defined limits; if they do, first offending attribute is reported using extack. Checks preventing removing channels used for RX indirection table or zerocopy AF_XDP socket are also implemented. Move ethtool_get_max_rxfh_channel() helper into common.c so that it can be used by both ioctl and netlink code. v2: - fix netdev reference leak in error path (found by Jakub Kicinsky) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: provide channel counts with CHANNELS_GET requestMichal Kubecek
Implement CHANNELS_GET request to get channel counts of a network device. These are traditionally available via ETHTOOL_GCHANNELS ioctl request. Omit attributes for channel types which are not supported by driver or device (zero reported for maximum). v2: (all suggested by Jakub Kicinski) - minor cleanup in channels_prepare_data() - more descriptive channels_reply_size() - omit attributes with zero max count Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: add RINGS_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_RINGS_NTF notification whenever ring sizes of a network device are modified using ETHTOOL_MSG_RINGS_SET netlink message or ETHTOOL_SRINGPARAM ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: set device ring sizes with RINGS_SET requestMichal Kubecek
Implement RINGS_SET netlink request to set ring sizes of a network device. These are traditionally set with ETHTOOL_SRINGPARAM ioctl request. Like the ioctl implementation, the generic ethtool code checks if supplied values do not exceed driver defined limits; if they do, first offending attribute is reported using extack. v2: - fix netdev reference leak in error path (found by Jakub Kicinsky) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: provide ring sizes with RINGS_GET requestMichal Kubecek
Implement RINGS_GET request to get ring sizes of a network device. These are traditionally available via ETHTOOL_GRINGPARAM ioctl request. Omit attributes for ring types which are not supported by driver or device (zero reported for maximum). v2: (all suggested by Jakub Kicinski) - minor cleanup in rings_prepare_data() - more descriptive rings_reply_size() - omit attributes with zero max size Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: add PRIVFLAGS_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_PRIVFLAGS_NTF notification whenever private flags of a network device are modified using ETHTOOL_MSG_PRIVFLAGS_SET netlink message or ETHTOOL_SPFLAGS ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: set device private flags with PRIVFLAGS_SET requestMichal Kubecek
Implement PRIVFLAGS_SET netlink request to set private flags of a network device. These are traditionally set with ETHTOOL_SPFLAGS ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: provide private flags with PRIVFLAGS_GET requestMichal Kubecek
Implement PRIVFLAGS_GET request to get private flags for a network device. These are traditionally available via ETHTOOL_GPFLAGS ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: add FEATURES_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_FEATURES_NTF notification whenever network device features are modified using ETHTOOL_MSG_FEATURES_SET netlink message, ethtool ioctl request or any other way resulting in call to netdev_update_features() or netdev_change_features() Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: set netdev features with FEATURES_SET requestMichal Kubecek
Implement FEATURES_SET netlink request to set network device features. These are traditionally set using ETHTOOL_SFEATURES ioctl request. Actual change is subject to netdev_change_features() sanity checks so that it can differ from what was requested. Unlike with most other SET requests, in addition to error code and optional extack, kernel provides an optional reply message (ETHTOOL_MSG_FEATURES_SET_REPLY) in the same format but with different semantics: information about difference between user request and actual result and difference between old and new state of dev->features. This reply message can be suppressed by setting ETHTOOL_FLAG_OMIT_REPLY flag in request header. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: add ethnl_parse_bitset() helperMichal Kubecek
Unlike other SET type commands, modifying netdev features is required to provide a reply telling userspace what was actually changed, compared to what was requested. For that purpose, the "modified" flag provided by ethnl_update_bitset() is not sufficient, we need full information which bits were requested to change. Therefore provide ethnl_parse_bitset() returning effective value and mask bitmaps equivalent to the contents of a bitset nested attribute. v2: use non-atomic __set_bit() (suggested by David Miller) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: provide netdev features with FEATURES_GET requestMichal Kubecek
Implement FEATURES_GET request to get network device features. These are traditionally available via ETHTOOL_GFEATURES ioctl request. v2: - style cleanup suggested by Jakub Kicinski Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: update mapping of features to legacy ioctl requestsMichal Kubecek
Legacy ioctl request like ETHTOOL_GTXCSUM are still used by ethtool utility to get values of legacy flags (which rather work as feature groups). These are calculated from values of actual features and request to set them is implemented as an attempt to set all features mapping to them but there are two inconsistencies: - tx-checksum-fcoe-crc is shown under tx-checksumming but NETIF_F_FCOE_CRC is not included in ETHTOOL_GTXCSUM/ETHTOOL_STXCSUM - tx-scatter-gather-fraglist is shown under scatter-gather but NETIF_F_FRAGLIST is not included in ETHTOOL_GSG/ETHTOOL_SSG As the mapping in ethtool output is more correct from logical point of view, fix ethtool_get_feature_mask() to match it. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12ethtool: rename ethnl_parse_header() to ethnl_parse_header_dev_get()Michal Kubecek
Andrew Lunn pointed out that even if it's documented that ethnl_parse_header() takes reference to network device if it fills it into the target structure, its name doesn't make it apparent so that corresponding dev_put() looks like mismatched. Rename the function ethnl_parse_header_dev_get() to indicate that it takes a reference. Suggested-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12flow_offload: Add flow_match_ct to get rule ct matchPaul Blakey
Add relevant getter for ct info dissector. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12net/sched: act_ct: Enable hardware offload of flow table entiresPaul Blakey
Pass the zone's flow table instance on the flow action to the drivers. Thus, allowing drivers to register FT add/del/stats callbacks. Finally, enable hardware offload on the flow table instance. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12net/sched: act_ct: Support refreshing the flow table entriesPaul Blakey
If driver deleted an FT entry, a FT failed to offload, or registered to the flow table after flows were already added, we still get packets in software. For those packets, while restoring the ct state from the flow table entry, refresh it's hardware offload. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12net/sched: act_ct: Support restoring conntrack info on skbsPaul Blakey
Provide an API to restore the ct state pointer. This may be used by drivers to restore the ct state if they miss in tc chain after they already did the hardware connection tracking action (ct_metadata action). For example, consider the following rule on chain 0 that is in_hw, however chain 1 is not_in_hw: $ tc filter add dev ... chain 0 ... \ flower ... action ct pipe action goto chain 1 Packets of a flow offloaded (via nf flow table offload) by the driver hit this rule in hardware, will be marked with the ct metadata action (mark, label, zone) that does the equivalent of the software ct action, and when the packet jumps to hardware chain 1, there would be a miss. CT was already processed in hardware. Therefore, the driver's miss handling should restore the ct state on the skb, using the provided API, and continue the packet processing in chain 1. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12net/sched: act_ct: Instantiate flow table entry actionsPaul Blakey
NF flow table API associate 5-tuple rule with an action list by calling the flow table type action() CB to fill the rule's actions. In action CB of act_ct, populate the ct offload entry actions with a new ct_metadata action. Initialize the ct_metadata with the ct mark, label and zone information. If ct nat was performed, then also append the relevant packet mangle actions (e.g. ipv4/ipv6/tcp/udp header rewrites). Drivers that offload the ft entries may match on the 5-tuple and perform the action list. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12netfilter: flowtable: Add API for registering to flow table eventsPaul Blakey
Let drivers to add their cb allowing them to receive flow offload events of type TC_SETUP_CLSFLOWER (REPLACE/DEL/STATS) for flows managed by the flow table. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12Merge branch 'ct-offload' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
2020-03-12tcp: Forbid to bind more than one sockets haveing SO_REUSEADDR and ↵Kuniyuki Iwashima
SO_REUSEPORT per EUID. If there is no TCP_LISTEN socket on a ephemeral port, we can bind multiple sockets having SO_REUSEADDR to the same port. Then if all sockets bound to the port have also SO_REUSEPORT enabled and have the same EUID, all of them can be listened. This is not safe. Let's say, an application has root privilege and binds sockets to an ephemeral port with both of SO_REUSEADDR and SO_REUSEPORT. When none of sockets is not listened yet, a malicious user can use sudo, exhaust ephemeral ports, and bind sockets to the same ephemeral port, so he or she can call listen and steal the port. To prevent this issue, we must not bind more than one sockets that have the same EUID and both of SO_REUSEADDR and SO_REUSEPORT. On the other hand, if the sockets have different EUIDs, the issue above does not occur. After sockets with different EUIDs are bound to the same port and one of them is listened, no more socket can be listened. This is because the condition below is evaluated true and listen() for the second socket fails. } else if (!reuseport_ok || !reuseport || !sk2->sk_reuseport || rcu_access_pointer(sk->sk_reuseport_cb) || (sk2->sk_state != TCP_TIME_WAIT && !uid_eq(uid, sock_i_uid(sk2)))) { if (inet_rcv_saddr_equal(sk, sk2, true)) break; } Therefore, on the same port, we cannot do listen() for multiple sockets with different EUIDs and any other listen syscalls fail, so the problem does not happen. In this case, we can still call connect() for other sockets that cannot be listened, so we have to succeed to call bind() in order to fully utilize 4-tuples. Summarizing the above, we should be able to bind only one socket having SO_REUSEADDR and SO_REUSEPORT per EUID. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are ↵Kuniyuki Iwashima
exhausted. Commit aacd9289af8b82f5fb01bcdd53d0e3406d1333c7 ("tcp: bind() use stronger condition for bind_conflict") introduced a restriction to forbid to bind SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to assign ports dispersedly so that we can connect to the same remote host. The change results in accelerating port depletion so that we fail to bind sockets to the same local port even if we want to connect to the different remote hosts. You can reproduce this issue by following instructions below. 1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768" 2. set SO_REUSEADDR to two sockets. 3. bind two sockets to (localhost, 0) and the latter fails. Therefore, when ephemeral ports are exhausted, bind(0) should fallback to the legacy behaviour to enable the SO_REUSEADDR option and make it possible to connect to different remote (addr, port) tuples. This patch allows us to bind SO_REUSEADDR enabled sockets to the same (addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all ephemeral ports are exhausted. This also allows connect() and listen() to share ports in the following way and may break some applications. So the ip_autobind_reuse is 0 by default and disables the feature. 1. setsockopt(sk1, SO_REUSEADDR) 2. setsockopt(sk2, SO_REUSEADDR) 3. bind(sk1, saddr, 0) 4. bind(sk2, saddr, 0) 5. connect(sk1, daddr) 6. listen(sk2) If it is set 1, we can fully utilize the 4-tuples, but we should use IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible. The notable thing is that if all sockets bound to the same port have both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an ephemeral port and also do listen(). Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12tcp: Remove unnecessary conditions in inet_csk_bind_conflict().Kuniyuki Iwashima
When we get an ephemeral port, the relax is false, so the SO_REUSEADDR conditions may be evaluated twice. We do not need to check the conditions again. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12taprio: Fix sending packets without dequeueing themVinicius Costa Gomes
There was a bug that was causing packets to be sent to the driver without first calling dequeue() on the "child" qdisc. And the KASAN report below shows that sending a packet without calling dequeue() leads to bad results. The problem is that when checking the last qdisc "child" we do not set the returned skb to NULL, which can cause it to be sent to the driver, and so after the skb is sent, it may be freed, and in some situations a reference to it may still be in the child qdisc, because it was never dequeued. The crash log looks like this: [ 19.937538] ================================================================== [ 19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780 [ 19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0 [ 19.939612] [ 19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97 [ 19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4 [ 19.941523] Call Trace: [ 19.941774] <IRQ> [ 19.941985] dump_stack+0x97/0xe0 [ 19.942323] print_address_description.constprop.0+0x3b/0x60 [ 19.942884] ? taprio_dequeue_soft+0x620/0x780 [ 19.943325] ? taprio_dequeue_soft+0x620/0x780 [ 19.943767] __kasan_report.cold+0x1a/0x32 [ 19.944173] ? taprio_dequeue_soft+0x620/0x780 [ 19.944612] kasan_report+0xe/0x20 [ 19.944954] taprio_dequeue_soft+0x620/0x780 [ 19.945380] __qdisc_run+0x164/0x18d0 [ 19.945749] net_tx_action+0x2c4/0x730 [ 19.946124] __do_softirq+0x268/0x7bc [ 19.946491] irq_exit+0x17d/0x1b0 [ 19.946824] smp_apic_timer_interrupt+0xeb/0x380 [ 19.947280] apic_timer_interrupt+0xf/0x20 [ 19.947687] </IRQ> [ 19.947912] RIP: 0010:default_idle+0x2d/0x2d0 [ 19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3 [ 19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13 [ 19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e [ 19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f [ 19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1 [ 19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001 [ 19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000 [ 19.954408] ? default_idle_call+0x2e/0x70 [ 19.954816] ? default_idle+0x1f/0x2d0 [ 19.955192] default_idle_call+0x5e/0x70 [ 19.955584] do_idle+0x3d4/0x500 [ 19.955909] ? arch_cpu_idle_exit+0x40/0x40 [ 19.956325] ? _raw_spin_unlock_irqrestore+0x23/0x30 [ 19.956829] ? trace_hardirqs_on+0x30/0x160 [ 19.957242] cpu_startup_entry+0x19/0x20 [ 19.957633] start_secondary+0x2a6/0x380 [ 19.958026] ? set_cpu_sibling_map+0x18b0/0x18b0 [ 19.958486] secondary_startup_64+0xa4/0xb0 [ 19.958921] [ 19.959078] Allocated by task 33: [ 19.959412] save_stack+0x1b/0x80 [ 19.959747] __kasan_kmalloc.constprop.0+0xc2/0xd0 [ 19.960222] kmem_cache_alloc+0xe4/0x230 [ 19.960617] __alloc_skb+0x91/0x510 [ 19.960967] ndisc_alloc_skb+0x133/0x330 [ 19.961358] ndisc_send_ns+0x134/0x810 [ 19.961735] addrconf_dad_work+0xad5/0xf80 [ 19.962144] process_one_work+0x78e/0x13a0 [ 19.962551] worker_thread+0x8f/0xfa0 [ 19.962919] kthread+0x2ba/0x3b0 [ 19.963242] ret_from_fork+0x3a/0x50 [ 19.963596] [ 19.963753] Freed by task 33: [ 19.964055] save_stack+0x1b/0x80 [ 19.964386] __kasan_slab_free+0x12f/0x180 [ 19.964830] kmem_cache_free+0x80/0x290 [ 19.965231] ip6_mc_input+0x38a/0x4d0 [ 19.965617] ipv6_rcv+0x1a4/0x1d0 [ 19.965948] __netif_receive_skb_one_core+0xf2/0x180 [ 19.966437] netif_receive_skb+0x8c/0x3c0 [ 19.966846] br_handle_frame_finish+0x779/0x1310 [ 19.967302] br_handle_frame+0x42a/0x830 [ 19.967694] __netif_receive_skb_core+0xf0e/0x2a90 [ 19.968167] __netif_receive_skb_one_core+0x96/0x180 [ 19.968658] process_backlog+0x198/0x650 [ 19.969047] net_rx_action+0x2fa/0xaa0 [ 19.969420] __do_softirq+0x268/0x7bc [ 19.969785] [ 19.969940] The buggy address belongs to the object at ffff888112862840 [ 19.969940] which belongs to the cache skbuff_head_cache of size 224 [ 19.971202] The buggy address is located 140 bytes inside of [ 19.971202] 224-byte region [ffff888112862840, ffff888112862920) [ 19.972344] The buggy address belongs to the page: [ 19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0 [ 19.973930] flags: 0x8000000000010200(slab|head) [ 19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0 [ 19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000 [ 19.975915] page dumped because: kasan: bad access detected [ 19.976461] page_owner tracks the page as allocated [ 19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO) [ 19.978332] prep_new_page+0x24b/0x330 [ 19.978707] get_page_from_freelist+0x2057/0x2c90 [ 19.979170] __alloc_pages_nodemask+0x218/0x590 [ 19.979619] new_slab+0x9d/0x300 [ 19.979948] ___slab_alloc.constprop.0+0x2f9/0x6f0 [ 19.980421] __slab_alloc.constprop.0+0x30/0x60 [ 19.980870] kmem_cache_alloc+0x201/0x230 [ 19.981269] __alloc_skb+0x91/0x510 [ 19.981620] alloc_skb_with_frags+0x78/0x4a0 [ 19.982043] sock_alloc_send_pskb+0x5eb/0x750 [ 19.982476] unix_stream_sendmsg+0x399/0x7f0 [ 19.982904] sock_sendmsg+0xe2/0x110 [ 19.983262] ____sys_sendmsg+0x4de/0x6d0 [ 19.983660] ___sys_sendmsg+0xe4/0x160 [ 19.984032] __sys_sendmsg+0xab/0x130 [ 19.984396] do_syscall_64+0xe7/0xae0 [ 19.984761] page last free stack trace: [ 19.985142] __free_pages_ok+0x432/0xbc0 [ 19.985533] qlist_free_all+0x56/0xc0 [ 19.985907] quarantine_reduce+0x149/0x170 [ 19.986315] __kasan_kmalloc.constprop.0+0x9e/0xd0 [ 19.986791] kmem_cache_alloc+0xe4/0x230 [ 19.987182] prepare_creds+0x24/0x440 [ 19.987548] do_faccessat+0x80/0x590 [ 19.987906] do_syscall_64+0xe7/0xae0 [ 19.988276] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 19.988775] [ 19.988930] Memory state around the buggy address: [ 19.989402] ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 19.990111] ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb [ 19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 19.991529] ^ [ 19.992081] ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc [ 19.992796] ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler") Reported-by: Michael Schmidt <michael.schmidt@eti.uni-siegen.de> Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Acked-by: Andre Guedes <andre.guedes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12Revert "net: sched: make newly activated qdiscs visible"Julian Wiedmann
This reverts commit 4cda75275f9f89f9485b0ca4d6950c95258a9bce from net-next. Brown bag time. Michal noticed that this change doesn't work at all when netif_set_real_num_tx_queues() gets called prior to an initial dev_activate(), as for instance igb does. Doing so dies with: [ 40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400 [ 40.586922] #PF: supervisor read access in kernel mode [ 40.592668] #PF: error_code(0x0000) - not-present page [ 40.598405] PGD 0 P4D 0 [ 40.601234] Oops: 0000 [#1] PREEMPT SMP PTI [ 40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G E 5.6.0-rc3-ethnl.50-default #1 [ 40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013 [ 40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90 [ 40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a [ 40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203 [ 40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000 [ 40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0 [ 40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208 [ 40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000 [ 40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000 [ 40.699750] FS: 00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000 [ 40.708782] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0 [ 40.723156] Call Trace: [ 40.725888] dev_qdisc_set_real_num_tx_queues+0x61/0x90 [ 40.731725] netif_set_real_num_tx_queues+0x94/0x1d0 [ 40.737286] __igb_open+0x19a/0x5d0 [igb] [ 40.741767] __dev_open+0xbb/0x150 [ 40.745567] __dev_change_flags+0x157/0x1a0 [ 40.750240] dev_change_flags+0x23/0x60 [...] Fixes: 4cda75275f9f ("net: sched: make newly activated qdiscs visible") Reported-by: Michal Kubecek <mkubecek@suse.cz> CC: Michal Kubecek <mkubecek@suse.cz> CC: Eric Dumazet <edumazet@google.com> CC: Jamal Hadi Salim <jhs@mojatatu.com> CC: Cong Wang <xiyou.wangcong@gmail.com> CC: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11net: mptcp: don't hang before sending 'MP capable with data'Davide Caratti
the following packetdrill script socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3 fcntl(3, F_GETFL) = 0x2 (flags O_RDWR) fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 flags[flag_h] nokey> < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscale 8,mpcapable v1 flags[flag_h] key[skey=2]> > . 1:1(0) ack 1 win 256 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h] key[ckey,skey]> getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 fcntl(3, F_SETFL, O_RDWR) = 0 write(3, ..., 1000) = 1000 doesn't transmit 1KB data packet after a successful three-way-handshake, using mp_capable with data as required by protocol v1, and write() hangs forever: PID: 973 TASK: ffff97dd399cae80 CPU: 1 COMMAND: "packetdrill" #0 [ffffa9b94062fb78] __schedule at ffffffff9c90a000 #1 [ffffa9b94062fc08] schedule at ffffffff9c90a4a0 #2 [ffffa9b94062fc18] schedule_timeout at ffffffff9c90e00d #3 [ffffa9b94062fc90] wait_woken at ffffffff9c120184 #4 [ffffa9b94062fcb0] sk_stream_wait_connect at ffffffff9c75b064 #5 [ffffa9b94062fd20] mptcp_sendmsg at ffffffff9c8e801c #6 [ffffa9b94062fdc0] sock_sendmsg at ffffffff9c747324 #7 [ffffa9b94062fdd8] sock_write_iter at ffffffff9c7473c7 #8 [ffffa9b94062fe48] new_sync_write at ffffffff9c302976 #9 [ffffa9b94062fed0] vfs_write at ffffffff9c305685 #10 [ffffa9b94062ff00] ksys_write at ffffffff9c305985 #11 [ffffa9b94062ff38] do_syscall_64 at ffffffff9c004475 #12 [ffffa9b94062ff50] entry_SYSCALL_64_after_hwframe at ffffffff9ca0008c RIP: 00007f959407eaf7 RSP: 00007ffe9e95a910 RFLAGS: 00000293 RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007f959407eaf7 RDX: 00000000000003e8 RSI: 0000000001785fe0 RDI: 0000000000000008 RBP: 0000000001785fe0 R8: 0000000000000000 R9: 0000000000000003 R10: 0000000000000007 R11: 0000000000000293 R12: 00000000000003e8 R13: 00007ffe9e95ae30 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b Fix it ensuring that socket state is TCP_ESTABLISHED on reception of the third ack. Fixes: 1954b86016cf ("mptcp: Check connection state before attempting send") Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>