aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-01-07mptcp: avoid atomic bit manipulation when possiblePaolo Abeni
Currently the msk->flags bitmask carries both state for the mptcp_release_cb() - mostly touched under the mptcp data lock - and others state info touched even outside such lock scope. As a consequence, msk->flags is always manipulated with atomic operations. This change splits such bitmask in two separate fields, so that we use plain bit operations when touching the cb-related info. The MPTCP_PUSH_PENDING bit needs additional care, as it is the only CB related field currently accessed either under the mptcp data lock or the mptcp socket lock. Let's add another mask just for such bit's sake. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: cleanup MPJ subflow list handlingPaolo Abeni
We can simplify the join list handling leveraging the mptcp_release_cb(): if we can acquire the msk socket lock at mptcp_finish_join time, move the new subflow directly into the conn_list, otherwise place it on join_list and let the release_cb process such list. Since pending MPJ connection are now always processed in a timely way, we can avoid flushing the join list every time we have to process all the current subflows. Additionally we can now use the mptcp data lock to protect the join_list, removing the additional spin lock. Finally, the MPJ handshake is now always finalized under the msk socket lock, we can drop the additional synchronization between mptcp_finish_join() and mptcp_close(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07selftests: mptcp: add tests for subflow creation failurePaolo Abeni
Verify that, when multiple endpoints are available, subflows creation proceed even when the first additional subflow creation fails - due to packet drop on the relevant link Co-developed-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: do not block subflows creation on errorsPaolo Abeni
If the MPTCP configuration allows for multiple subflows creation, and the first additional subflows never reach the fully established status - e.g. due to packets drop or reset - the in kernel path manager do not move to the next subflow. This patch introduces a new PM helper to cope with MPJ subflow creation failure and delay and hook it where appropriate. Such helper triggers additional subflow creation, as needed and updates the PM subflow counter, if the current one is closing. Additionally start all the needed additional subflows as soon as the MPTCP socket is fully established, so we don't have to cope with slow MPJ handshake blocking the next subflow creation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: keep track of local endpoint still available for each mskPaolo Abeni
Include into the path manager status a bitmap tracking the list of local endpoints still available - not yet used - for the relevant mptcp socket. Keep such map updated at endpoint creation/deletion time, so that we can easily skip already used endpoint at local address selection time. The endpoint used by the initial subflow is lazyly accounted at subflow creation time: the usage bitmap is be up2date before endpoint selection and we avoid such unneeded task in some relevant scenarios - e.g. busy servers accepting incoming subflows but not creating any additional ones nor annuncing additional addresses. Overall this allows for fair local endpoints usage in case of subflow failure. As a side effect, this patch also enforces that each endpoint is used at most once for each mptcp connection. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: clean-up MPJ option writingPaolo Abeni
Check for all MPJ variant at once, this reduces the number of conditionals traversed on average and will simplify the next patch. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: fix per socket endpoint accountingPaolo Abeni
Since full-mesh endpoint support, the reception of a single ADD_ADDR option can cause multiple subflows creation. When such option is accepted we increment 'add_addr_accepted' by one. When we received a paired RM_ADDR option, we deleted all the relevant subflows, decrementing 'add_addr_accepted' by one for each of them. We have a similar issue for 'local_addr_used' Fix them moving the pm endpoint accounting outside the subflow traversal. Fixes: 1a0d6136c5f0 ("mptcp: local addresses fullmesh") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07selftests: mptcp: add disconnect testsPaolo Abeni
Performs several disconnect/reconnect on the same socket, ensuring the overall transfer is succesful. The new test leverages ioctl(SIOCOUTQ) to ensure all the pending data is acked before disconnecting. Additionally order alphabetically the test program arguments list for better maintainability. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: implement support for user-space disconnectPaolo Abeni
Handle explicitly AF_UNSPEC in mptcp_stream_connnect() to allow user-space to disconnect established MPTCP connections Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: cleanup accept and pollPaolo Abeni
After the previous patch, msk->subflow will never be deleted during the whole msk lifetime. We don't need anymore to acquire references to it in mptcp_stream_accept() and we can use the listener subflow accept queue to simplify mptcp_poll() for listener socket. Overall this removes a lock pair and 4 more atomic operations per accept(). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: full disconnect implementationPaolo Abeni
The current mptcp_disconnect() implementation lacks several steps, we additionally need to reset the msk socket state and flush the subflow list. Factor out the needed helper to avoid code duplication. Additionally ensure that the initial subflow is disposed only after mptcp_close(), just reset it at disconnect time. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: implement fastclose xmit pathPaolo Abeni
Allow the MPTCP xmit path to add MP_FASTCLOSE suboption on RST egress packets. Additionally reorder related options writing to reduce the number of conditionals required in the fast path. Co-developed-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07mptcp: keep snd_una updated for fallback socketPaolo Abeni
After shutdown, for fallback MPTCP sockets, we always have write_seq == snd_una+1 The above will foul OUTQ ioctl(). Keep snd_una in sync with write_seq even after shutdown. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-07Merge tag 'mlx5-updates-2022-01-06' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2022-01-06 1) Expose FEC per lane block counters via ethtool 2) Trivial fixes/updates/cleanup to mlx5e netdev driver 3) Fix htmldoc build warning 4) Spread mlx5 SFs (sub-functions) to all available CPU cores: Commits 1..5 Shay Drory Says: ================ Before this patchset, mlx5 subfunction shared the same IRQs (MSI-X) with their peers subfunctions, causing them to use same CPU cores. In large scale, this is very undesirable, SFs use small number of cpu cores and all of them will be packed on the same CPU cores, not utilizing all CPU cores in the system. In this patchset we want to achieve two things. a) Spread IRQs used by SFs to all cpu cores b) Pack less SFs in the same IRQ, will result in multiple IRQs per core. In this patchset, we spread SFs over all online cpus available to mlx5 irqs in Round-Robin manner. e.g.: Whenever a SF is created, pick the next CPU core with least number of SF IRQs bound to it, SFs will share IRQs on the same core until a certain limit, when such limit is reached, we request a new IRQ and add it to that CPU core IRQ pool, when out of IRQs, pick any IRQ with least number of SF users. This enhancement is done in order to achieve a better distribution of the SFs over all the available CPUs, which reduces application latency, as shown bellow. Machine details: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 cores. PCI Express 3 with BW of 126 Gb/s. ConnectX-5 Ex; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0 x16. Base line test description: Single SF on the system. One instance of netperf is running on-top the SF. Numbers: latency = 15.136 usec, CPU Util = 35% Test description: There are 250 SFs on the system. There are 3 instances of netperf running, on-top three different SFs, in parallel. Perf numbers: # netperf SFs latency(usec) latency CPU utilization affinity affinity (lower is better) increase % 1 cpu=0 cpu={0} ~23 (app 1-3) 35% 75% 2 cpu=0,2,4 cpu={0} app 1: 21.625 30% 68% (CPU 0) app 2-3: 16.5 9% 15% (CPU 2,4) 3 cpu=0 cpu={0,2,4} app 1: ~16 7% 84% (CPU 0) app 2-3: ~17.9 14% 22% (CPU 2,4) 4 cpu=0,2,4 cpu={0,2,4} 15.2 (app 1-3) 0% 33% (CPU 0,2,4) - The first two entries (#1 and #2) show current state. e.g.: SFs are using the same CPU. The last two entries (#3 and #4) shows the latency reduction improvement of this patch. e.g.: SFs are on different CPUs. - Whenever we use several CPUs, in case there is a different CPU utilization, write the utilization of each CPU separately. - Whenever the latency result of the netperf instances were different, write the latency of each netperf instances separately. Commands: - for netperf CPU=0: $ for i in {1..3}; do taskset -c 0 netperf -H 1${i}.1.1.1 -t TCP_RR -- \ -o RT_LATENCY -r8 & done - for netperf CPU=0,2,4 $ for i in {1..3}; do taskset -c $(( ($i - 1) * 2 )) netperf -H \ 1${i}.1.1.1 -t TCP_RR -- -o RT_LATENCY -r8 & done ================ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-01-06Merge branch '100GbE' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== 100GbE Intel Wired LAN Driver Updates 2022-01-06 Victor adds restoring of advanced rules after reset. Wojciech improves usage of switchdev control VSI by utilizing the device's advanced rules for forwarding. Christophe Jaillet removes some unneeded calls to zero bitmaps, changes some bitmap operations that don't need to be atomic, and converts a kfree() to a more appropriate bitmap_free(). * '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: Use bitmap_free() to free bitmap ice: Optimize a few bitmap operations ice: Slightly simply ice_find_free_recp_res_idx ice: improve switchdev's slow-path ice: replay advanced rules after reset ==================== Link: https://lore.kernel.org/r/20220106183013.3777622-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06Merge branch 'mlxsw-add-spectrum-4-support'Jakub Kicinski
Ido Schimmel says: ==================== mlxsw: Add Spectrum-4 support This patchset adds Spectrum-4 support in mlxsw. It builds on top of a previous patchset merged in commit 10184da91666 ("Merge branch 'mlxsw-Spectrum-4-prep'") and makes two additional changes before adding Spectrum-4 support. Patchset overview: Patches #1-#2 add a few Spectrum-4 specific variants of existing ACL keys. The new variants are needed because the size of certain key elements (e.g., local port) was increased in Spectrum-4. Patches #3-#6 are preparations. Patch #7 implements the Spectrum-4 variant of the Bloom filter hash function. The Bloom filter is used to optimize ACL lookups by potentially skipping certain lookups if they are guaranteed not to match. See additional info in merge commit ae6750e0a5ef ("Merge branch 'mlxsw-spectrum_acl-Add-Bloom-filter-support'"). Patch #8 finally adds Spectrum-4 support. ==================== Link: https://lore.kernel.org/r/20220106160652.821176-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06mlxsw: spectrum: Extend to support Spectrum-4 ASICAmit Cohen
Extend existing driver for Spectrum, Spectrum-2 and Spectrum-3 ASICs to support Spectrum-4 ASIC as well. Currently there is no released firmware version for Spectrum-4, so the driver is not enforcing a minimum version. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06mlxsw: spectrum_acl_bloom_filter: Add support for Spectrum-4 calculationAmit Cohen
Spectrum-4 will calculate hash function for bloom filter differently from the existing ASICs. First, two hash functions will be used to calculate 16 bits result. The final result will be combination of the two results - 6 bits which are result of CRC-6 will be used as MSB and 10 bits which are result of CRC-10 will be used as LSB. Second, while in Spectrum{2,3}, there is a padding in each chunk, so the chunks use a sequence of whole bytes, in Spectrum-4 there is no padding, so each chunk use 20 bytes minus 2 bits, so it is necessary to align the chunks to be without holes. Add dedicated 'mlxsw_sp_acl_bf_ops' for Spectrum-4 and add the required tables for CRC calculations. All the details are documented as part of the code for future use. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06mlxsw: Add operations structure for bloom filter calculationAmit Cohen
Spectrum-4 will calculate hash function for bloom filter differently from the existing ASICs. There are two changes: 1. Instead of using one hash function to calculate 16 bits output (CRC-16), two functions will be used. 2. The chunks will be built differently, without padding. As preparation for support of Spectrum-4 bloom filter, add 'ops' structure to allow handling different calculation for different ASICs. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06mlxsw: spectrum_acl_bloom_filter: Rename Spectrum-2 specific objects for ↵Amit Cohen
future use Spectrum-4 will calculate hash function for bloom filter differently from the existing ASICs. There are two changes: 1. Instead of using one hash function to calculate 16 bits output (CRC-16), two functions will be used. 2. The chunks will be built differently, without padding. As preparation for support of Spectrum-4 bloom filter, rename CRC table to include "sp2" prefix and "crc16", as next patch will add two additional tables. In addition, rename all the dedicated functions and defines for Spectrum-{2,3} to include "sp2" prefix. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06mlxsw: spectrum_acl_bloom_filter: Make mlxsw_sp_acl_bf_key_encode() more ↵Amit Cohen
flexible Spectrum-4 will calculate hash function for bloom filter differently from the existing ASICs. One of the changes is related to the way that the chunks will be build - without padding. As preparation for support of Spectrum-4 bloom filter, make mlxsw_sp_acl_bf_key_encode() more flexible, so it will be able to use it for Spectrum-4 as well. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06mlxsw: spectrum_acl_bloom_filter: Reorder functions to make the code more ↵Amit Cohen
aesthetic Currently, mlxsw_sp_acl_bf_rule_count_index_get() is implemented before mlxsw_sp_acl_bf_index_get() but is used after it. Adding a new function for Spectrum-4 would make them further apart still. Fix by moving them around. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06mlxsw: Introduce flex key elements for Spectrum-4Amit Cohen
Spectrum-4 ASIC will support more virtual routers and local ports compared to the existing ASICs. Therefore, the virtual router and local port ACL key elements need to be increased. Introduce new key elements for Spectrum-4 to be aligned with the elements used already for other Spectrum ASICs. The key blocks layout is the same for Spectrum-4, so use the existing code for encode_block() and clear_block(), just create separate blocks. Note that size of `VIRT_ROUTER_MSB` is 4 bits in Spectrum-4, therefore declare it using `MLXSW_AFK_ELEMENT_INST_U32()`, in order to be able to set `.avoid_size_check` to true. Otherwise, `mlxsw_afk_blocks_check()` will fail and warn. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06mlxsw: Rename virtual router flex key elementAmit Cohen
In Spectrum-4, the size of the virtual router ACL key element increased from 11 bits to 12 bits. In order to reuse the existing virtual router ACL key element enumerators for Spectrum-4, rename 'VIRT_ROUTER_8_10' and 'VIRT_ROUTER_0_7' to 'VIRT_ROUTER_MSB' and 'VIRT_ROUTER_LSB', respectively. No functional changes intended. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06Merge branch 'dpaa2-eth-small-cleanup'Jakub Kicinski
Ioana Ciornei says: ==================== dpaa2-eth: small cleanup These 3 patches are just part of a small cleanup on the dpaa2-eth and the dpaa2-switch drivers. In case we are hitting a case in which the fwnode of the root dprc device we initiate a deferred probe. On the dpaa2-switch side, if we are on the remove path, make sure that we check for a non-NULL pointer before accessing the port private structure. ==================== Link: https://lore.kernel.org/r/20220106135905.81923-1-ioana.ciornei@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06dpaa2-switch: check if the port priv is validIoana Ciornei
Before accessing the port private structure make sure that there is still a non-NULL pointer there. A NULL pointer access can happen when we are on the remove path, some switch ports are unregistered and some are in the process of unregistering. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06dpaa2-mac: return -EPROBE_DEFER from dpaa2_mac_open in case the fwnode is ↵Ioana Ciornei
not set We could get into a situation when the fwnode of the parent device is not yet set because its probe didn't yet finish. When this happens, any caller of the dpaa2_mac_open() will not have the fwnode available, thus cause problems at the PHY connect time. Avoid this by just returning -EPROBE_DEFER from the dpaa2_mac_open when this happens. Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06dpaa2-mac: bail if the dpmacs fwnode is not foundRobert-Ionut Alexa
The parent pointer node handler must be declared with a NULL initializer. Before using it, a check must be performed to make sure that a valid address has been assigned to it. Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-01-06 We've added 41 non-merge commits during the last 2 day(s) which contain a total of 36 files changed, 1214 insertions(+), 368 deletions(-). The main changes are: 1) Various fixes in the verifier, from Kris and Daniel. 2) Fixes in sockmap, from John. 3) bpf_getsockopt fix, from Kuniyuki. 4) INET_POST_BIND fix, from Menglong. 5) arm64 JIT fix for bpf pseudo funcs, from Hou. 6) BPF ISA doc improvements, from Christoph. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits) bpf: selftests: Add bind retry for post_bind{4, 6} bpf: selftests: Use C99 initializers in test_sock.c net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() bpf/selftests: Test bpf_d_path on rdonly_mem. libbpf: Add documentation for bpf_map batch operations selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3 xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames xdp: Move conversion to xdp_frame out of map functions page_pool: Store the XDP mem id page_pool: Add callback to init pages when they are allocated xdp: Allow registering memory model without rxq reference samples/bpf: xdpsock: Add timestamp for Tx-only operation samples/bpf: xdpsock: Add time-out for cleaning Tx samples/bpf: xdpsock: Add sched policy and priority support samples/bpf: xdpsock: Add cyclic TX operation capability samples/bpf: xdpsock: Add clockid selection support samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation samples/bpf: xdpsock: Add VLAN support for Tx-only operation libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API libbpf 1.0: Deprecate bpf_map__is_offload_neutral() ... ==================== Link: https://lore.kernel.org/r/20220107013626.53943-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-01-06Merge branch 'net: bpf: handle return value of post_bind{4,6} and add ↵Alexei Starovoitov
selftests for it' Menglong Dong says: ==================== From: Menglong Dong <imagedong@tencent.com> The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. The second patch use C99 initializers in test_sock.c The third patch is the selftests for this modification. Changes since v4: - use C99 initializers in test_sock.c before adding the test case Changes since v3: - add the third patch which use C99 initializers in test_sock.c Changes since v2: - NULL check for sk->sk_prot->put_port Changes since v1: - introduce 'put_port' field for 'struct proto' - add selftests for it ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2022-01-06bpf: selftests: Add bind retry for post_bind{4, 6}Menglong Dong
With previous patch, kernel is able to 'put_port' after sys_bind() fails. Add the test for that case: rebind another port after sys_bind() fails. If the bind success, it means previous bind operation is already undoed. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-4-imagedong@tencent.com
2022-01-06bpf: selftests: Use C99 initializers in test_sock.cMenglong Dong
Use C99 initializers for the initialization of 'tests' in test_sock.c. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-3-imagedong@tencent.com
2022-01-06net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()Menglong Dong
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in __inet_bind() is not handled properly. While the return value is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and exit: err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk); if (err) { inet->inet_saddr = inet->inet_rcv_saddr = 0; goto out_release_sock; } Let's take UDP for example and see what will happen. For UDP socket, it will be added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port() called success. If 'inet->inet_rcv_saddr' is specified here, then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong to (because inet_saddr is changed to 0), and UDP packet received will not be passed to this sock. If 'inet->inet_rcv_saddr' is not specified here, the sock will work fine, as it can receive packet properly, which is wired, as the 'bind()' is already failed. To undo the get_port() operation, introduce the 'put_port' field for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP proto, it is udp_lib_unhash(); For icmp proto, it is ping_unhash(). Therefore, after sys_bind() fail caused by BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which means that it can try to be binded to another port. Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
2022-01-06Documentation: devlink: mlx5.rst: Fix htmldoc build warningSaeed Mahameed
Fix the following build warning: Documentation/networking/devlink/mlx5.rst:13: WARNING: Error parsing content block for the "list-table" directive: +uniform two-level bullet list expected, but row 2 does not contain the same number of items as row 1 (2 vs 3). ... Add the missing item in the first row. Fixes: 0844fa5f7b89 ("net/mlx5: Let user configure io_eq_size param") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Add recovery flow in case of error CQEGal Pressman
The rep legacy RQ completion handling was missing the appropriate handling of error CQEs (dump the CQE and queue a recover work), fix it by calling trigger_report() when needed. Since all CQE handling flows do the exact same error CQE handling, extract it to a common helper function. Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: TC, Remove redundant error loggingRoi Dayan
Remove redundant and trivial error logging when trying to offload mirred device with unsupported devices. Using OVS could hit those a lot and the errors are still logged in extack. Signed-off-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Maor Dickman <maord@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Refactor set_pflag_cqe_based_moderSaeed Mahameed
Rearrange the code and use cqe_mode_to_period_mode() helper. Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Move HW-GRO and CQE compression check to fix features flowGal Pressman
Feature dependencies should be resolved in fix features rather than in set features flow. Move the check that disables HW-GRO in case CQE compression is enabled from set_feature_hw_gro() to mlx5e_fix_features(). Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Fix feature check per profileAya Levin
Remove redundant space when constructing the feature's enum. Validate against the indented enum value. Fixes: 6c72cb05d4b8 ("net/mlx5e: Use bitmap field for profile features") Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Unblock setting vid 0 for VF in case PF isn't eswitch managerMaor Dickman
When using libvirt to passthrough VF to VM it will always set the VF vlan to 0 even if user didn’t request it, this will cause libvirt to fail to boot in case the PF isn't eswitch owner. Example of such case is the DPU host PF which isn't eswitch manager, so any attempt to passthrough VF of it using libvirt will fail. Fix it by not returning error in case set VF vlan is called with vid 0. Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5e: Expose FEC counters via ethtoolLama Kayal
Add FEC counters' statistics of corrected_blocks and uncorrectable_blocks, along with their lanes via ethtool. HW supports corrected_blocks and uncorrectable_blocks counters both for RS-FEC mode and FC-FEC mode. In FC mode these counters are accumulated per lane, while in RS mode the correction method crosses lanes, thus only total corrected_blocks and uncorrectable_blocks are reported in this mode. Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Update log_max_qp value to FW max capabilityMaher Sanalla
log_max_qp in driver's default profile #2 was set to 18, but FW actually supports 17 at the most - a situation that led to the concerning print when the driver is loaded: "log_max_qp value in current profile is 18, changing to HCA capabaility limit (17)" The expected behavior from mlx5_profile #2 is to match the maximum FW capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is initialized to a defined static value (0xff) - which basically means that when loading this profile, log_max_qp value will be what the currently installed FW supports at most. Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: SF, Use all available cpu for setting cpu affinityShay Drory
Currently all SFs are using the same CPUs. Spreading SF over CPUs, in round-robin manner, in order to achieve better distribution of the SFs over available CPUs. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Introduce API for bulk request and release of IRQsShay Drory
Currently IRQs are requested one by one. To balance spreading IRQs among cpus using such scheme requires remembering cpu mask for the cpus used for a given device. This complicates the IRQ allocation scheme in subsequent patch. Hence, prepare the code for bulk IRQs allocation. This enables spreading IRQs among cpus in subsequent patch. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Split irq_pool_affinity logic to new fileShay Drory
The downstream patches add more functionality to irq_pool_affinity. Move the irq_pool_affinity logic to a new file in order to ease the coding and maintenance of it. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Move affinity assignment into irq_requestShay Drory
Move affinity binding of the IRQ to irq_request function in order to bind the IRQ before inserting it to the xarray. After this change, the IRQ is ready for use when inserted to the xarray. Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: Introduce control IRQ request APIShay Drory
Currently, IRQ layer have a separate flow for ctrl and comp IRQs, and the distinction between ctrl and comp IRQs is done in the IRQ layer. In order to ease the coding and maintenance of the IRQ layer, introduce a new API for requesting control IRQs - mlx5_ctrl_irq_request(struct mlx5_core_dev *dev). Signed-off-by: Shay Drory <shayd@nvidia.com> Reviewed-by: Moshe Shemesh <moshe@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06net/mlx5: mlx5e_hv_vhca_stats_create return type to voidSaeed Mahameed
Callers of this functions ignore its return value, as reported by Wang Qing, in one of the return paths, it returns positive values. Since return value is ignored anyways, void out the return type of the function. Reported-by: Wang Qing <wangqing@vivo.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
2022-01-06bpf/selftests: Test bpf_d_path on rdonly_mem.Hao Luo
The second parameter of bpf_d_path() can only accept writable memories. Rdonly_mem obtained from bpf_per_cpu_ptr() can not be passed into bpf_d_path for modification. This patch adds a selftest to verify this behavior. Signed-off-by: Hao Luo <haoluo@google.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220106205525.2116218-1-haoluo@google.com
2022-01-06libbpf: Add documentation for bpf_map batch operationsGrant Seltzer
This adds documention for: - bpf_map_delete_batch() - bpf_map_lookup_batch() - bpf_map_lookup_and_delete_batch() - bpf_map_update_batch() This also updates the public API for the `keys` parameter of `bpf_map_delete_batch()`, and both the `keys` and `values` parameters of `bpf_map_update_batch()` to be constants. Signed-off-by: Grant Seltzer <grantseltzer@gmail.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20220106201304.112675-1-grantseltzer@gmail.com