aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2020-03-30net: sched: expose HW stats types per action used by driversJiri Pirko
It may be up to the driver (in case ANY HW stats is passed) to select which type of HW stats he is going to use. Add an infrastructure to expose this information to user. $ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop $ tc -s filter show dev enp3s0np1 ingress filter protocol ip pref 1 flower chain 0 filter protocol ip pref 1 flower chain 0 handle 0x1 eth_type ipv4 dst_ip 192.168.1.1 in_hw in_hw_count 2 action order 1: gact action drop random type none pass val 0 index 1 ref 1 bind 1 installed 10 sec used 10 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats immediate <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: introduce nla_put_bitfield32() helper and use itJiri Pirko
Introduce a helper to pass value and selector to. The helper packs them into struct and puts them into netlink message. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2020-03-28 1) Use kmem_cache_zalloc() instead of kmem_cache_alloc() in xfrm_state_alloc(). From Huang Zijiang. 2) esp_output_fill_trailer() is the same in IPv4 and IPv6, so share this function to avoide code duplcation. From Raed Salem. 3) Add offload support for esp beet mode. From Xin Long. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: dsa: Simplify 'dsa_tag_protocol_to_str()'Christophe JAILLET
There is no point in preparing the module name in a buffer. The format string can be passed diectly to 'request_module()'. This axes a few lines of code and cleans a few things: - max len for a driver name is MODULE_NAME_LEN wich is ~ 60 chars, not 128. It would be down-sized in 'request_module()' - we should pass the total size of the buffer to 'snprintf()', not the size minus 1 Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: ena: Make some functions staticYueHaibing
Fix sparse warnings: drivers/net/ethernet/amazon/ena/ena_netdev.c:460:6: warning: symbol 'ena_xdp_exchange_program_rx_in_range' was not declared. Should it be static? drivers/net/ethernet/amazon/ena/ena_netdev.c:481:6: warning: symbol 'ena_xdp_exchange_program' was not declared. Should it be static? drivers/net/ethernet/amazon/ena/ena_netdev.c:1555:5: warning: symbol 'ena_xdp_handle_buff' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30dpaa_eth: Make dpaa_a050385_wa staticYueHaibing
Fix sparse warning: drivers/net/ethernet/freescale/dpaa/dpaa_eth.c:2065:5: warning: symbol 'dpaa_a050385_wa' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30crypto/chtls: Fix chtls crash in connection cleanupRohit Maheshwari
There is a possibility that cdev is removed before CPL_ABORT_REQ_RSS is fully processed, so it's better to save it in skb. Added checks in handling the flow correctly, which suggests connection reset request is sent to HW, wait for HW to respond. Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30crypto/chcr: fix incorrect ipv6 packet lengthRohit Maheshwari
IPv6 header's payload length field shouldn't include IPv6 header length. Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: stmmac: Add support for VLAN Rx filteringWong Vee Khee
Add support for VLAN ID-based filtering by the MAC controller for MAC drivers that support it. Only the 12-bit VID field is used. Signed-off-by: Chuah Kim Tatt <kim.tatt.chuah@intel.com> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Signed-off-by: Wong Vee Khee <vee.khee.wong@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30Merge branch 'crypto-chelsio-Fixes-issues-during-chcr-driver-registration'David S. Miller
Ayush Sawal says: ==================== crypto: chelsio: Fixes issues during chcr driver registration Patch 1: Avoid the accessing of wrong u_ctx pointer. Patch 2: Fixes a deadlock between rtnl_lock and uld_mutex. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30Crypto: chelsio - Fixes a deadlock between rtnl_lock and uld_mutexAyush Sawal
The locks are taken in this order during driver registration (uld_mutex), at: cxgb4_register_uld.part.14+0x49/0xd60 [cxgb4] (rtnl_mutex), at: rtnetlink_rcv_msg+0x2db/0x400 (uld_mutex), at: cxgb_up+0x3a/0x7b0 [cxgb4] (rtnl_mutex), at: chcr_add_xfrmops+0x83/0xa0 [chcr](stucked here) To avoid this now the netdev features are updated after the cxgb4_register_uld function is completed. Fixes: 6dad4e8ab3ec6 ("chcr: Add support for Inline IPSec"). Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30Crypto: chelsio - Fixes a hang issue during driver registrationAyush Sawal
This issue occurs only when multiadapters are present. Hang happens because assign_chcr_device returns u_ctx pointer of adapter which is not yet initialized as for this adapter cxgb_up is not been called yet. The last_dev pointer is used to determine u_ctx pointer and it is initialized two times in chcr_uld_add in chcr_dev_add respectively. The fix here is don't initialize the last_dev pointer during chcr_uld_add. Only assign to value to it when the adapter's initialization is completed i.e in chcr_dev_add. Fixes: fef4912b66d62 ("crypto: chelsio - Handle PCI shutdown event"). Signed-off-by: Ayush Sawal <ayush.sawal@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30selftests:mptcp: fix failure due to whitespace damageMatthieu Baerts
'pm_nl_ctl' was adding a trailing whitespace after having printed the IP. But at the end, the IP element is currently always the last one. The bash script launching 'pm_nl_ctl' had trailing whitespaces in the expected result on purpose. But these whitespaces have been removed when the patch has been applied upstream. To avoid trailing whitespaces in the bash code, 'pm_nl_ctl' and expected results have now been adapted. The MPTCP PM selftest can now pass again. Fixes: eedbc685321b (selftests: add PM netlink functional tests) Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30net: ethernet: ti: fix spelling mistake "rundom" -> "random"Colin Ian King
There is a spelling mistake in a dev_err error message. Fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30Merge tag 'mlx5-updates-2020-03-29' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2020-03-29 1) mlx5 core level updates for mkey APIs + migrate some code to mlx5_ib 2) Use a separate work queue for fib event handling 3) Move new eswitch chains files to a new directory, to prepare for upcoming E-Switch and offloads features. 4) Support indr block setup (TC_SETUP_FT) in Flow Table mode. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net/mlx5e: add mlx5e_rep_indr_setup_ft_cb supportwenxu
Add mlx5e_rep_indr_setup_ft_cb to support indr block setup in FT mode. Both tc rules and flow table rules are of the same format, It can re-use tc parsing for that, and move the flow table rules to their steering domain(the specific chain_index), the indr block offload in FT also follow this scenario. Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-29net/mlx5e: refactor indr setup blockwenxu
Refactor indr setup block for support ft indr setup in the next patch. The function mlx5e_rep_indr_offload exposes 'flags' in order set additional flag for FT in next patch. Rename mlx5e_rep_indr_setup_tc_block to mlx5e_rep_indr_setup_block and add flow_setup_cb_t callback parameters in order set the specific callback for FT in next patch. Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-29net/mlx5: E-Switch: Move eswitch chains to a new directorySaeed Mahameed
eswitch_offloads_chains.{c,h} were just introduced this kernel release cycle, eswitch is in high development demand right now and many features are planned to be added to it. eswitch deserves its own directory and here we move these new files to there, in preparation for upcoming eswitch features and new files. Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com>
2020-03-29net/mlx5: Use a separate work queue for fib event handlingMark Zhang
In VF lag mode when remove the bonding module without bring down the bond device first, we could potentially have circular dependency when we unload IB devices and also handle fib events: 1. The bond work starts first; 2. The "modprobe -rv bonding" process tries to release the bond device, with the "pernet_ops_rwsem" lock hold; 3. The bond work blocks in unregister_netdevice_notifier() and waits for the lock because fib event came right before; 4. The kernel fib module tries to free all the fib entries by broadcasting the "FIB_EVENT_NH_DEL" event; 5. Upon the fib event this lag_mp module holds the fib lock and queue a fib work. So: bond work -> modprobe task -> kernel fib module -> lag_mp -> bond work Today we either reload IB devices in roce lag in nic mode or either handle fib events in switchdev mode, but a new feature could change that we'll need to reload IB devices also in switchdev mode so this is a future proof fix as one may not notice this later. Signed-off-by: Mark Zhang <markz@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-29Merge branch 'mlx5-next' of ↵Saeed Mahameed
git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux * 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux: mlx5: Remove uninitialized use of key in mlx5_core_create_mkey {IB,net}/mlx5: Move asynchronous mkey creation to mlx5_ib {IB,net}/mlx5: Assign mkey variant in mlx5_ib only {IB,net}/mlx5: Setup mkey variant before mr create command invocation Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-29Merge branch 'ethtool-netlink-interface-part-4'David S. Miller
Michal Kubecek says: ==================== ethtool netlink interface, part 4 Implementation of more netlink request types: - coalescing (ethtool -c/-C, patches 2-4) - pause parameters (ethtool -a/-A, patches 5-7) - EEE settings (--show-eee / --set-eee, patches 8-10) - timestamping info (-T, patches 11-12) Patch 1 is a fix for netdev reference leak similar to commit 2f599ec422ad ("ethtool: fix reference leak in some *_SET handlers") but fixing a code Changes in v3 - change "one-step-*" Tx type names to "onestep-*", (patch 11, suggested by Richard Cochran - use "TSINFO" rather than "TIMESTAMP" for timestamping information constants and adjust symbol names (patch 12, suggested by Richard Cochran) Changes in v2: - fix compiler warning in net_hwtstamp_validate() (patch 11) - fix follow-up lines alignment (whitespace only, patches 3 and 8) which is only in net-next tree at the moment. ==================== Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: provide timestamping information with TSINFO_GET requestMichal Kubecek
Implement TSINFO_GET request to get timestamping information for a network device. This is traditionally available via ETHTOOL_GET_TS_INFO ioctl request. Move part of ethtool_get_ts_info() into common.c so that ioctl and netlink code use the same logic to get timestamping information from the device. v3: use "TSINFO" rather than "TIMESTAMP", suggested by Richard Cochran Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: add timestamping related string setsMichal Kubecek
Add three string sets related to timestamping information: ETH_SS_SOF_TIMESTAMPING: SOF_TIMESTAMPING_* flags ETH_SS_TS_TX_TYPES: timestamping Tx types ETH_SS_TS_RX_FILTERS: timestamping Rx filters These will be used for TIMESTAMP_GET request. v2: avoid compiler warning ("enumeration value not handled in switch") in net_hwtstamp_validate() v3: omit dash in Tx type names ("one-step-*" -> "onestep-*"), suggested by Richard Cochran Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: add EEE_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_EEE_NTF notification whenever EEE settings of a network device are modified using ETHTOOL_MSG_EEE_SET netlink message or ETHTOOL_SEEE ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: set EEE settings with EEE_SET requestMichal Kubecek
Implement EEE_SET netlink request to set EEE settings of a network device. These are traditionally set with ETHTOOL_SEEE ioctl request. The netlink interface allows setting the EEE status for all link modes supported by kernel but only first 32 link modes can be set at the moment as only those are supported by the ethtool_ops callback. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: provide EEE settings with EEE_GET requestMichal Kubecek
Implement EEE_GET request to get EEE settings of a network device. These are traditionally available via ETHTOOL_GEEE ioctl request. The netlink interface allows reporting EEE status for all link modes supported by kernel but only first 32 link modes are provided at the moment as only those are reported by the ethtool_ops callback and drivers. v2: fix alignment (whitespace only) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: add PAUSE_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_PAUSE_NTF notification whenever pause parameters of a network device are modified using ETHTOOL_MSG_PAUSE_SET netlink message or ETHTOOL_SPAUSEPARAM ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: set pause parameters with PAUSE_SET requestMichal Kubecek
Implement PAUSE_SET netlink request to set pause parameters of a network device. Thease are traditionally set with ETHTOOL_SPAUSEPARAM ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: provide pause parameters with PAUSE_GET requestMichal Kubecek
Implement PAUSE_GET request to get pause parameters of a network device. These are traditionally available via ETHTOOL_GPAUSEPARAM ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: add COALESCE_NTF notificationMichal Kubecek
Send ETHTOOL_MSG_COALESCE_NTF notification whenever coalescing parameters of a network device are modified using ETHTOOL_MSG_COALESCE_SET netlink message or ETHTOOL_SCOALESCE ioctl request. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: set coalescing parameters with COALESCE_SET requestMichal Kubecek
Implement COALESCE_SET netlink request to set coalescing parameters of a network device. These are traditionally set with ETHTOOL_SCOALESCE ioctl request. This commit adds only support for device coalescing parameters, not per queue coalescing parameters. Like the ioctl implementation, the generic ethtool code checks if only supported parameters are modified; if not, first offending attribute is reported using extack. v2: fix alignment (whitespace only) Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: provide coalescing parameters with COALESCE_GET requestMichal Kubecek
Implement COALESCE_GET request to get coalescing parameters of a network device. These are traditionally available via ETHTOOL_GCOALESCE ioctl request. This commit adds only support for device coalescing parameters, not per queue coalescing parameters. Omit attributes with zero values unless they are declared as supported (i.e. the corresponding bit in ethtool_ops::supported_coalesce_params is set). Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29ethtool: fix reference leak in ethnl_set_privflags()Michal Kubecek
Andrew noticed that some handlers for *_SET commands leak a netdev reference if required ethtool_ops callbacks do not exist. One of them is ethnl_set_privflags(), a simple reproducer would be e.g. ip link add veth1 type veth peer name veth2 ethtool --set-priv-flags veth1 foo on ip link del veth1 Make sure dev_put() is called when ethtool_ops check fails. Fixes: f265d799596a ("ethtool: set device private flags with PRIVFLAGS_SET request") Reported-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29Merge branch 'ipv6-add-rpl-source-routing'David S. Miller
Alexander Aring says: ==================== net: ipv6: add rpl source routing This patch series will add handling for RPL source routing handling and insertion (implement as lwtunnel)! I did an example prototype implementation in rpld for using this implementation in non-storing mode: https://github.com/linux-wpan/rpld/tree/nonstoring_mode I will also present a talk at netdev about it: https://netdevconf.info/0x14/session.html?talk-extend-segment-routing-for-RPL In receive handling I add handling for IPIP encapsulation as RFC6554 describes it as possible. For reasons I didn't implemented it yet for generating such packets because I am not really sure how/when this should happen. So far I understand there exists a draft yet which describes the cases (inclusive a Hop-by-Hop option which we also not support yet). https://tools.ietf.org/html/draft-ietf-roll-useofrplinfo-35 This is just the beginning to start implementation everything for yet, step by step. It works for my use cases yet to have it running on a 6LOWPAN _only_ network. I have some patches for iproute2 as well. A sidenote: I check on local addresses if they are part of segment routes, this is just to avoid stupid settings. A use can add addresses afterwards what I cannot control anymore but then it's users fault to make such thing. The receive handling checks for this as well which is required by RFC6554, so the next hops or when it comes back should drop it anyway. To make this possible I added functionality to pass the net structure to the build_state of lwtunnel (I hope I caught all lwtunnels). Another sidenote: I set the headroom value to 0 as I figured out it will break on interfaces with IPv6 min mtu if set to non zero for tunnels on L3. - Alex changes since v3: - use parse_nested which isn't deprecated - Thanks David Ahern - change to return -1 instead errno in exthdr handling to unify error code - change function name from ipv6_rpl_srh_decompress_size to ipv6_rpl_srh_size changes since v2: - add additional segdata length in lwtunnel build_state - fix build_state patch by not catching one inline noop function if LWTUNNEL is disabled Alexander Aring (5): include: uapi: linux: add rpl sr header definition addrconf: add functionality to check on rpl requirements net: ipv6: add support for rpl sr exthdr net: add net available in build_state net: ipv6: add rpl sr tunnel ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: ipv6: add rpl sr tunnelAlexander Aring
This patch adds functionality to configure routes for RPL source routing functionality. There is no IPIP functionality yet implemented which can be added later when the cases when to use IPv6 encapuslation comes more clear. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: add net available in build_stateAlexander Aring
The build_state callback of lwtunnel doesn't contain the net namespace structure yet. This patch will add it so we can check on specific address configuration at creation time of rpl source routes. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29net: ipv6: add support for rpl sr exthdrAlexander Aring
This patch adds rpl source routing receive handling. Everything works only if sysconf "rpl_seg_enabled" and source routing is enabled. Mostly the same behaviour as IPv6 segmentation routing. To handle compression and uncompression a rpl.c file is created which contains the necessary functionality. The receive handling will also care about IPv6 encapsulated so far it's specified as possible nexthdr in RFC 6554. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29addrconf: add functionality to check on rpl requirementsAlexander Aring
This patch adds a functionality to addrconf to check on a specific RPL address configuration. According to RFC 6554: To detect loops in the SRH, a router MUST determine if the SRH includes multiple addresses assigned to any interface on that router. If such addresses appear more than once and are separated by at least one address not assigned to that router. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29include: uapi: linux: add rpl sr header definitionAlexander Aring
This patch adds a uapi header for rpl struct definition. The segments data can be accessed over rpl_segaddr or rpl_segdata macros. In case of compri and compre is zero the segment data is not compressed and can be accessed by rpl_segaddr. In the other case the compressed data can be accessed by rpl_segdata and interpreted as byte array. Signed-off-by: Alexander Aring <alex.aring@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29Merge branch 'mptcp-multiple-subflows-path-management'David S. Miller
Mat Martineau says: ==================== Multipath TCP part 3: Multiple subflows and path management v2 -> v3: Remove 'inline' in .c files, fix uapi bit macros, and rebase. v1 -> v2: Rebase on current net-next, fix for netlink limit setting, and update .gitignore for selftest. This patch set allows more than one TCP subflow to be established and used for a multipath TCP connection. Subflows are added to an existing connection using the MP_JOIN option during the 3-way handshake. With multiple TCP subflows available, sent data is now stored in the MPTCP socket so it may be retransmitted on any TCP subflow if there is no DATA_ACK before a timeout. If an MPTCP-level timeout occurs, data is retransmitted using an available subflow. Storing this sent data requires the addition of memory accounting at the MPTCP level, which was previously delegated to the single subflow. Incoming DATA_ACKs now free data from the MPTCP-level retransmit buffer. IP addresses available for new subflow connections can now be advertised and received with the ADD_ADDR option, and the corresponding REMOVE_ADDR option likewise advertises that an address is no longer available. The MPTCP path manager netlink interface has commands to set in-kernel limits for the number of concurrent subflows and control the advertisement of IP addresses between peers. To track and debug MPTCP connections there are new MPTCP MIB counters, and subflow context can be requested using inet_diag. The MPTCP self-tests now validate multiple-subflow operation and the netlink path manager interface. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29selftests: add test-cases for MPTCP MP_JOINPaolo Abeni
Use the pm netlink to configure the creation of several subflows, and verify that via MIB counters. Update the mptcp_connect program to allow reliable MP_JOIN handshake even on small data file Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29selftests: add PM netlink functional testsPaolo Abeni
This introduces basic self-tests for the PM netlink, checking the basic APIs and possible exceptional values. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: add netlink-based PMPaolo Abeni
Expose a new netlink family to userspace to control the PM, setting: - list of local addresses to be signalled. - list of local addresses used to created subflows. - maximum number of add_addr option to react When the msk is fully established, the PM netlink attempts to announce the 'signal' list via the ADD_ADDR option. Since we currently lack the ADD_ADDR echo (and related event) only the first addr is sent. After exhausting the 'announce' list, the PM tries to create subflow for each addr in 'local' list, waiting for each connection to be completed before attempting the next one. Idea is to add an additional PM hook for ADD_ADDR echo, to allow the PM netlink announcing multiple addresses, in sequence. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: add and use MIB counter infrastructureFlorian Westphal
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s" or "nstat" will show them automatically. The MPTCP MIB counters are allocated in a distinct pcpu area in order to avoid bloating/wasting TCP pcpu memory. Counters are allocated once the first MPTCP socket is created in a network namespace and free'd on exit. If no sockets have been allocated, all-zero mptcp counters are shown. The MIB counter list is taken from the multipath-tcp.org kernel, but only a few counters have been picked up so far. The counter list can be increased at any time later on. v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: allow dumping subflow context to userspaceDavide Caratti
add ulp-specific diagnostic functions, so that subflow information can be dumped to userspace programs like 'ss'. v2 -> v3: - uapi: use bit macros appropriate for userspace Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: implement and use MPTCP-level retransmissionPaolo Abeni
On timeout event, schedule a work queue to do the retransmission. Retransmission code closely resembles the sendmsg() implementation and re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags' sake - and peeking the relevant dfrag from the rtx head. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: rework mptcp_sendmsg_frag to accept optional dfragPaolo Abeni
This will simplify mptcp-level retransmission implementation in the next patch. If dfrag is provided by the caller, skip kernel space memory allocation and use data and metadata provided by the dfrag itself. Because a peer could ack data at TCP level but refrain from sending mptcp-level ACKs, we could grow the mptcp socket backlog indefinitely. We should thus block mptcp_sendmsg until the peer has acked some of the sent data. In order to be able to do so, increment the mptcp socket wmem_queued counter on memory allocation and decrement it when releasing the memory on mptcp-level ack reception. Because TCP performns sndbuf auto-tuning up to tcp_wmem_max[2], make this the mptcp sk_sndbuf limit. In the future we could add experiment with autotuning as TCP does in tcp_sndbuf_expand(). v2 -> v3: - remove 'inline' in foo.c files (David S. Miller) Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: allow partial cleaning of rtx head dfragFlorian Westphal
After adding wmem accounting for the mptcp socket we could get into a situation where the mptcp socket can't transmit more data, and mptcp_clean_una doesn't reduce wmem even if snd_una has advanced because it currently will only remove entire dfrags. Allow advancing the dfrag head sequence and reduce wmem, even though this isn't correct (as we can't release the page). Because we will soon block on mptcp sk in case wmem is too large, call sk_stream_write_space() in case we reduced the backlog so userspace task blocked in sendmsg or poll will be woken up. This isn't an issue if the send buffer is large, but it is when SO_SNDBUF is used to reduce it to a lower value. Note we can still get a deadlock for low SO_SNDBUF values in case both sides of the connection write to the socket: both could be blocked due to wmem being too small -- and current mptcp stack will only increment mptcp ack_seq on recv. This doesn't happen with the selftest as it uses poll() and will always call recv if there is data to read. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: implement memory accounting for mptcp rtx queuePaolo Abeni
Charge the data on the rtx queue to the master MPTCP socket, too. Such memory in uncharged when the data is acked/dequeued. Also account mptcp sockets inuse via a protocol specific pcpu counter. Co-developed-by: Florian Westphal <fw@strlen.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29mptcp: introduce MPTCP retransmission timerPaolo Abeni
The timer will be used to schedule retransmission. It's frequency is based on the current subflow RTO estimation and is reset on every una_seq update The timer is clearer for good by __mptcp_clear_xmit() Also clean MPTCP rtx queue before each transmission. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>