aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2022-05-16ipv6/gro: insert temporary HBH/jumbo headerEric Dumazet
Following patch will add GRO_IPV6_MAX_SIZE, allowing gro to build BIG TCP ipv6 packets (bigger than 64K). This patch changes ipv6_gro_complete() to insert a HBH/jumbo header so that resulting packet can go through IPv6/TCP stacks. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6/gso: remove temporary HBH/jumbo headerEric Dumazet
ipv6 tcp and gro stacks will soon be able to build big TCP packets, with an added temporary Hop By Hop header. If GSO is involved for these large packets, we need to remove the temporary HBH header before segmentation happens. v2: perform HBH removal from ipv6_gso_segment() instead of skb_segment() (Alexander feedback) Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ipv6: add struct hop_jumbo_hdr definitionEric Dumazet
Following patches will need to add and remove local IPv6 jumbogram options to enable BIG TCP. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16tcp_cubic: make hystart_ack_delay() aware of BIG TCPEric Dumazet
hystart_ack_delay() had the assumption that a TSO packet would not be bigger than GSO_MAX_SIZE. This will no longer be true. We should use sk->sk_gso_max_size instead. This reduces chances of spurious Hystart ACK train detections. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: limit GSO_MAX_SIZE to 524280 bytesEric Dumazet
Make sure we will not overflow shinfo->gso_segs Minimal TCP MSS size is 8 bytes, and shinfo->gso_segs is a 16bit field. TCP_MIN_GSO_SIZE is currently defined in include/net/tcp.h, it seems cleaner to not bring tcp details into include/linux/netdevice.h Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: allow gso_max_size to exceed 65536Alexander Duyck
The code for gso_max_size was added originally to allow for debugging and workaround of buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was because that was the existing limits of IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP we can remove this limit and allow the value to potentially go up to UINT_MAX and instead be limited by the tso_max_size value. So in order to support this we need to go through and clean up the remaining users of the gso_max_size value so that the values will cap at 64K for non-TCPv6 flows. In addition we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX will now be the upper limit for GSO_MAX_SIZE. v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16net: add IFLA_TSO_{MAX_SIZE|SEGS} attributesEric Dumazet
New netlink attributes IFLA_TSO_MAX_SIZE and IFLA_TSO_MAX_SEGS are used to report to user-space the device TSO limits. ip -d link sh dev eth1 ... tso_max_size 65536 tso_max_segs 65535 Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16Merge branch 'Renesas-RSZ-V2M-support'David S. Miller
Phil Edworthy says: ==================== Add Renesas RZ/V2M Ethernet support The RZ/V2M Ethernet is very similar to R-Car Gen3 Ethernet-AVB, though some small parts are the same as R-Car Gen2. Other differences are: * It has separate data (DI), error (Line 1) and management (Line 2) irqs rather than one irq for all three. * Instead of using the High-speed peripheral bus clock for gPTP, it has a separate gPTP reference clock. v4: * Add clk_disable_unprepare() for gptp ref clk v3: * Really renamed irq_en_dis_regs to irq_en_dis this time * Modified ravb_ptp_extts() to use irq_en_dis * Added Reviewed-by tags v2: * Just net patches in this series * Instead of reusing ch22 and ch24 interrupt names, use the proper names * Renamed irq_en_dis_regs to irq_en_dis * Squashed use of GIC reg versus GIE/GID and got rid of separate gptp_ptm_gic feature. * Move err_mgmt_irqs code under multi_irqs * Minor editing of the commit msgs ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ravb: Add support for RZ/V2MPhil Edworthy
RZ/V2M Ethernet is very similar to R-Car Gen3 Ethernet-AVB, though some small parts are the same as R-Car Gen2. Other differences to R-Car Gen3 and Gen2 are: * It has separate data (DI), error (Line 1) and management (Line 2) irqs rather than one irq for all three. * Instead of using the High-speed peripheral bus clock for gPTP, it has a separate gPTP reference clock. Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ravb: Use separate clock for gPTPPhil Edworthy
RZ/V2M has a separate gPTP reference clock that is used when the AVB-DMAC Mode Register (CCC) gPTP Clock Select (CSEL) bits are set to "01: High-speed peripheral bus clock". Therefore, add a feature that allows this clock to be used for gPTP. Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ravb: Support separate Line0 (Desc), Line1 (Err) and Line2 (Mgmt) irqsPhil Edworthy
R-Car has a combined interrupt line, ch22 = Line0_DiA | Line1_A | Line2_A. RZ/V2M has separate interrupt lines for each of these, so add a feature that allows the driver to get these interrupts and call the common handler. Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16ravb: Separate handling of irq enable/disable regs into featurePhil Edworthy
Currently, when the HW has a single interrupt, the driver uses the GIC, TIC, RIC0 registers to enable and disable interrupts. When the HW has multiple interrupts, it uses the GIE, GID, TIE, TID, RIE0, RID0 registers. However, other devices, e.g. RZ/V2M, have multiple irqs and only have the GIC, TIC, RIC0 registers. Therefore, split this into a separate feature. Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16dt-bindings: net: renesas,etheravb: Document RZ/V2M SoCPhil Edworthy
Document the Ethernet AVB IP found on RZ/V2M SoC. It includes the Ethernet controller (E-MAC) and Dedicated Direct memory access controller (DMAC) for transferring transmitted Ethernet frames to and received Ethernet frames from respective storage areas in the RAM at high speed. The AVB-DMAC is compliant with IEEE 802.1BA, IEEE 802.1AS timing and synchronization protocol, IEEE 802.1Qav real-time transfer, and the IEEE 802.1Qat stream reservation protocol. R-Car has a pair of combined interrupt lines: ch22 = Line0_DiA | Line1_A | Line2_A ch23 = Line0_DiB | Line1_B | Line2_B Line0 for descriptor interrupts (which we call dia and dib). Line1 for error related interrupts (which we call err_a and err_b). Line2 for management and gPTP related interrupts (mgmt_a and mgmt_b). RZ/V2M hardware has separate interrupt lines for each of these. It has 3 clocks; the main AXI clock, the AMBA CHI (Coherent Hub Interface) clock and a gPTP reference clock. Signed-off-by: Phil Edworthy <phil.edworthy@renesas.com> Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller
Pablo Neira Ayuso says: ==================== Netfilter updates for net-next This is v2 including deadlock fix in conntrack ecache rework reported by Jakub Kicinski. The following patchset contains Netfilter updates for net-next, mostly updates to conntrack from Florian Westphal. 1) Add a dedicated list for conntrack event redelivery. 2) Include event redelivery list in conntrack dumps of dying type. 3) Remove per-cpu dying list for event redelivery, not used anymore. 4) Add netns .pre_exit to cttimeout to zap timeout objects before synchronize_rcu() call. 5) Remove nf_ct_unconfirmed_destroy. 6) Add generation id for conntrack extensions for conntrack timeout and helpers. 7) Detach timeout policy from conntrack on cttimeout module removal. 8) Remove __nf_ct_unconfirmed_destroy. 9) Remove unconfirmed list. 10) Remove unconditional local_bh_disable in init_conntrack(). 11) Consolidate conntrack iterator nf_ct_iterate_cleanup(). 12) Detect if ctnetlink listeners exist to short-circuit event path early. 13) Un-inline nf_ct_ecache_ext_add(). 14) Add nf_conntrack_events autodetect ctnetlink listener mode and make it default. 15) Add nf_ct_ecache_exist() to check for event cache extension. 16) Extend flowtable reverse route lookup to include source, iif, tos and mark, from Sven Auhagen. 17) Do not verify zero checksum UDP packets in nf_reject, from Kevin Mitchell. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13ice: Expose RSS indirection tables for queue groups via ethtoolSridhar Samudrala
When ADQ queue groups (TCs) are created via tc mqprio command, RSS contexts and associated RSS indirection tables are configured automatically per TC based on the queue ranges specified for each traffic class. For ex: tc qdisc add dev enp175s0f0 root mqprio num_tc 3 map 0 1 2 \ queues 2@0 8@2 4@10 hw 1 mode channel will create 3 queue groups (TC 0-2) with queue ranges 2, 8 and 4 in 3 queue groups. Each queue group is associated with its own RSS context and RSS indirection table. Add support to expose RSS indirection tables for all ADQ queue groups using ethtool RSS contexts interface. ethtool -x enp175s0f0 context <tc-num> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com> Tested-by: Bharathi Sreenivas <bharathi.sreenivas@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20220512213249.3747424-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13ixgbe: add xdp frags support to ndo_xdp_xmitLorenzo Bianconi
Add the capability to map non-linear xdp frames in XDP_TX and ndo_xdp_xmit callback. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20220512212621.3746140-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13eth: sfc: remove remnants of the out-of-tree napi_weight module paramJakub Kicinski
Remove napi_weight statics which are set to 64 and never modified, remnants of the out-of-tree napi_weight module param. Acked-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://lore.kernel.org/r/20220512205603.1536771-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13Merge branch 'master' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-05-13 1) Cleanups for the code behind the XFRM offload API. This is a preparation for the extension of the API for policy offload. From Leon Romanovsky. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: drop not needed flags variable in XFRM offload struct net/mlx5e: Use XFRM state direction instead of flags netdevsim: rely on XFRM state direction instead of flags ixgbe: propagate XFRM offload state direction instead of flags xfrm: store and rely on direction to construct offload flags xfrm: rename xfrm_state_offload struct to allow reuse xfrm: delete not used number of external headers xfrm: free not used XFRM_ESP_NO_TRAILER flag ==================== Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13sfc: siena: Fix Kconfig dependenciesRen Zhijie
If CONFIG_PTP_1588_CLOCK=m and CONFIG_SFC_SIENA=y, the siena driver will fail to link: drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_remove_channel': ptp.c:(.text+0xa28): undefined reference to `ptp_clock_unregister' drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_probe_channel': ptp.c:(.text+0x13a0): undefined reference to `ptp_clock_register' ptp.c:(.text+0x1470): undefined reference to `ptp_clock_unregister' drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_ptp_pps_worker': ptp.c:(.text+0x1d29): undefined reference to `ptp_clock_event' drivers/net/ethernet/sfc/siena/ptp.o: In function `efx_siena_ptp_get_ts_info': ptp.c:(.text+0x301b): undefined reference to `ptp_clock_index' To fix this build error, make SFC_SIENA depends on PTP_1588_CLOCK. Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: d48523cb88e0 ("sfc: Copy shared files needed for Siena (part 2)") Signed-off-by: Ren Zhijie <renzhijie2@huawei.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Link: https://lore.kernel.org/r/20220513012721.140871-1-renzhijie2@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-05-13netfilter: conntrack: skip verification of zero UDP checksumKevin Mitchell
The checksum is optional for UDP packets. However nf_reject would previously require a valid checksum to elicit a response such as ICMP_DEST_UNREACH. Add some logic to nf_reject_verify_csum to determine if a UDP packet has a zero checksum and should therefore not be verified. Signed-off-by: Kevin Mitchell <kevmitch@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: flowtable: nft_flow_route use more data for reverse routeSven Auhagen
When creating a flow table entry, the reverse route is looked up based on the current packet. There can be scenarios where the user creates a custom ip rule to route the traffic differently. In order to support those scenarios, the lookup needs to add more information based on the current packet. The patch adds multiple new information to the route lookup. Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: prefer extension check to pointer checkFlorian Westphal
The pointer check usually results in a 'false positive': its likely that the ctnetlink module is loaded but no event monitoring is enabled. After recent change to autodetect ctnetlink usage and only allocate the ecache extension if a listener is active, check if the extension is present on a given conntrack. If its not there, there is nothing to report and calls to the notification framework can be elided. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: add nf_conntrack_events autodetect modeFlorian Westphal
This adds the new nf_conntrack_events=2 mode and makes it the default. This leverages the earlier flag in struct net to allow to avoid the event extension as long as no event listener is active in the namespace. This avoids, for most cases, allocation of ct->ext area. A followup patch will take further advantage of this by avoiding calls down into the event framework if the extension isn't present. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: un-inline nf_ct_ecache_ext_addFlorian Westphal
Only called when new ct is allocated or the extension isn't present. This function will be extended, place this in the conntrack module instead of inlining. The callers already depend on nf_conntrack module. Return value is changed to bool, noone used the returned pointer. Make sure that the core drops the newly allocated conntrack if the extension is requested but can't be added. This makes it necessary to ifdef the section, as the stub always returns false we'd drop every new conntrack if the the ecache extension is disabled in kconfig. Add from data path (xt_CT, nft_ct) is unchanged. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: nfnetlink: allow to detect if ctnetlink listeners existFlorian Westphal
At this time, every new conntrack gets the 'event cache extension' enabled for it. This is because the 'net.netfilter.nf_conntrack_events' sysctl defaults to 1. Changing the default to 0 means that commands that rely on the event notification extension, e.g. 'conntrack -E' or conntrackd, stop working. We COULD detect if there is a listener by means of 'nfnetlink_has_listeners()' and only add the extension if this is true. The downside is a dependency from conntrack module to nfnetlink module. This adds a different way: inc/dec a counter whenever a ctnetlink group is being (un)subscribed and toggle a flag in struct net. Next patches will take advantage of this and will only add the event extension if the flag is set. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: add nf_ct_iter_data object for nf_ct_iterate_cleanup*()Pablo Neira Ayuso
This patch adds a structure to collect all the context data that is passed to the cleanup iterator. struct nf_ct_iter_data { struct net *net; void *data; u32 portid; int report; }; There is a netns field that allows to clean up conntrack entries specifically owned by the specified netns. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: avoid unconditional local_bh_disableFlorian Westphal
Now that the conntrack entry isn't placed on the pcpu list anymore the bh only needs to be disabled in the 'expectation present' case. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: remove unconfirmed listFlorian Westphal
It has no function anymore and can be removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: remove __nf_ct_unconfirmed_destroyFlorian Westphal
Its not needed anymore: A. If entry is totally new, then the rcu-protected resource must already have been removed from global visibility before call to nf_ct_iterate_destroy. B. If entry was allocated before, but is not yet in the hash table (uncofirmed case), genid gets incremented and synchronize_rcu() call makes sure access has completed. C. Next attempt to peek at extension area will fail for unconfirmed conntracks, because ext->genid != genid. D. Conntracks in the hash are iterated as before. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: cttimeout: decouple unlink and free on netns destructionFlorian Westphal
Increment the extid on module removal; this makes sure that even in extreme cases any old uncofirmed entry that happened to be kept e.g. on nfnetlink_queue list will not trip over a stale timeout reference. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: extensions: introduce extension genid countFlorian Westphal
Multiple netfilter extensions store pointers to external data in their extension area struct. Examples: 1. Timeout policies 2. Connection tracking helpers. No references are taken for these. When a helper or timeout policy is removed, the conntrack table gets traversed and affected extensions are cleared. Conntrack entries not yet in the hashtable are referenced via a special list, the unconfirmed list. On removal of a policy or connection tracking helper, the unconfirmed list gets traversed an all entries are marked as dying, this prevents them from getting committed to the table at insertion time: core checks for dying bit, if set, the conntrack entry gets destroyed at confirm time. The disadvantage is that each new conntrack has to be added to the percpu unconfirmed list, and each insertion needs to remove it from this list. The list is only ever needed when a policy or helper is removed -- a rare occurrence. Add a generation ID count: Instead of adding to the list and then traversing that list on policy/helper removal, increment a counter that is stored in the extension area. For unconfirmed conntracks, the extension has the genid valid at ct allocation time. Removal of a helper/policy etc. increments the counter. At confirmation time, validate that ext->genid == global_id. If the stored number is not the same, do not allow the conntrack insertion, just like as if a confirmed-list traversal would have flagged the entry as dying. After insertion, the genid is no longer relevant (conntrack entries are now reachable via the conntrack table iterators and is set to 0. This allows removal of the percpu unconfirmed list. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: remove nf_ct_unconfirmed_destroy helperFlorian Westphal
This helper tags connections not yet in the conntrack table as dying. These nf_conn entries will be dropped instead when the core attempts to insert them from the input or postrouting 'confirm' hook. After the previous change, the entries get unlinked from the list earlier, so that by the time the actual exit hook runs, new connections no longer have a timeout policy assigned. Its enough to walk the hashtable instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: cttimeout: decouple unlink and free on netns destructionFlorian Westphal
Make it so netns pre_exit unlinks the objects from the pernet list, so they cannot be found anymore. netns core issues a synchronize_rcu() before calling the exit hooks so any the time the exit hooks run unconfirmed nf_conn entries have been free'd or they have been committed to the hashtable. The exit hook still tags unconfirmed entries as dying, this can now be removed in a followup change. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: remove the percpu dying listFlorian Westphal
Its no longer needed. Entries that need event redelivery are placed on the new pernet dying list. The advantage is that there is no need to take additional spinlock on conntrack removal unless event redelivery failed or the conntrack entry was never added to the table in the first place (confirmed bit not set). The IPS_CONFIRMED bit now needs to be set as soon as the entry has been unlinked from the unconfirmed list, else the destroy function may attempt to unlink it a second time. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: conntrack: include ecache dying list in dumpsFlorian Westphal
The new pernet dying list includes conntrack entries that await delivery of the 'destroy' event via ctnetlink. The old percpu dying list will be removed soon. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13netfilter: ecache: use dedicated list for event redeliveryFlorian Westphal
This disentangles event redelivery and the percpu dying list. Because entries are now stored on a dedicated list, all entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries still have confirmed bit set -- the reference count is at least 1. The 'struct net' back-pointer can be removed as well. The pcpu dying list will be removed eventually, it has no functionality. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2022-05-13Merge branch 'bnxt_en-next'David S. Miller
Michael Chan says: ==================== bnxt_en: Updates for net-next This small patchset updates the firmware interface, adds timestamping support for all receive packets, and adds revised NVRAM package error messages for ethtool and devlink. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13bnxt_en: parse and report result field when NVRAM package install failsKalesh AP
Instead of always returning -ENOPKG, decode the firmware error code further when the HWRM_NVM_INSTALL_UPDATE firmware call fails. Return a more suitable error code to userspace and log an error in dmesg. This is version 2 of the earlier patch that was reverted: 02acd399533e ("bnxt_en: parse result field when NVRAM package install fails") In this new version, if the call is made through devlink instead of ethtool, we'll also set the error message in extack. Link: https://lore.kernel.org/netdev/20220307141358.4d52462e@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/ Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com> Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13bnxt_en: Enable packet timestamping for all RX packetsPavan Chebbi
Add driver support to enable timestamping on all RX packets that are received by the NIC. This capability can be requested by the applications using SIOCSHWTSTAMP ioctl with filter type HWTSTAMP_FILTER_ALL. Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13bnxt_en: Configure ptp filters during bnxt openPavan Chebbi
For correctness, we need to configure the packet filters for timestamping during bnxt_open. This way they are always configured after firmware reset or chip reset. We should not assume that the filters will always be retained across resets. This patch modifies the ioctl handler and always configures the PTP filters in the bnxt_open() path. Cc: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13bnxt_en: Update firmware interface to 1.10.2.95Michael Chan
The main changes are timestamp support for all RX packets and new PCIe statistics. Signed-off-by: Michael Chan <michael.chan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13net: axienet: Use NAPI for TX completion pathRobert Hancock
This driver was using the TX IRQ handler to perform all TX completion tasks. Under heavy TX network load, this can cause significant irqs-off latencies (found to be in the hundreds of microseconds using ftrace). This can cause other issues, such as overrunning serial UART FIFOs when using high baud rates with limited UART FIFO sizes. Switch to using a NAPI poll handler to perform the TX completion work to get this out of hard IRQ context and avoid the IRQ latency impact. A separate poll handler is used for TX and RX since they have separate IRQs on this controller, so that the completion work for each of them stays on the same CPU as the interrupt. Testing on a Xilinx MPSoC ZU9EG platform using iperf3 from a Linux PC through a switch at 1G link speed showed no significant change in TX or RX throughput, with approximately 941 Mbps before and after. Hard IRQ time in the TX throughput test was significantly reduced from 12% to below 1% on the CPU handling TX interrupts, with total hard+soft IRQ CPU usage dropping from about 56% down to 48%. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13net: axienet: Be more careful about updating tx_bd_tailRobert Hancock
The axienet_start_xmit function was updating the tx_bd_tail variable multiple times, with potential rollbacks on error or invalid intermediate positions, even though this variable is also used in the TX completion path. Use READ_ONCE where this variable is read and WRITE_ONCE where it is written to make this update more atomic, and move the write before the MMIO write to start the transfer, so it is protected by that implicit write barrier. Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()Eric Dumazet
INET_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET_MATCH() an inline function to ease our changes. v2: We remove the 32bit version of it, as modern compilers should generate the same code really, no need to try to be smarter. Also make 'struct net *net' the first argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13selftests: fib_nexthops: Make the test more robustAmit Cohen
Rarely some of the test cases fail. Make the test more robust by increasing the timeout of ping commands to 5 seconds. Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13Merge branch 'lan95xx-no-polling'David S. Miller
Lukas Wunner says: ==================== Polling be gone on LAN95xx Do away with link status polling on LAN95xx USB Ethernet and rely on interrupts instead, thereby reducing bus traffic, CPU overhead and improving interface bringup latency. Link to v2: https://lore.kernel.org/netdev/cover.1651574194.git.lukas@wunner.de/ Only change since v2: * Patch [5/7]: * Drop call to __irq_enter_raw() which worked around a warning in generic_handle_domain_irq(). That warning is gone since 792ea6a074ae (queued on tip.git/irq/urgent). (Marc Zyngier, Thomas Gleixner) ====================
2022-05-13net: phy: smsc: Cope with hot-removal in interrupt handlerLukas Wunner
If reading the Interrupt Source Flag register fails with -ENODEV, then the PHY has been hot-removed and the correct response is to bail out instead of throwing a WARN splat and attempting to suspend the PHY. The PHY should be stopped in due course anyway as the kernel asynchronously tears down the device. Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500 Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514 Signed-off-by: Lukas Wunner <lukas@wunner.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13net: phy: smsc: Cache interrupt maskLukas Wunner
Cache the interrupt mask to avoid re-reading it from the PHY upon every interrupt. This will simplify a subsequent commit which detects hot-removal in the interrupt handler and bails out. Analyzing and debugging PHY transactions also becomes simpler if such redundant reads are avoided. Last not least, interrupt overhead and latency is slightly improved. Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500 Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514 Signed-off-by: Lukas Wunner <lukas@wunner.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13usbnet: smsc95xx: Forward PHY interrupts to PHY driver to avoid pollingLukas Wunner
Link status of SMSC LAN95xx chips is polled once per second, even though they're capable of signaling PHY interrupts through the MAC layer. Forward those interrupts to the PHY driver to avoid polling. Benefits are reduced bus traffic, reduced CPU overhead and quicker interface bringup. Polling was introduced in 2016 by commit d69d16949346 ("usbnet: smsc95xx: fix link detection for disabled autonegotiation"). Back then, the LAN95xx driver neglected to enable the ENERGYON interrupt, hence couldn't detect link-up events when auto-negotiation was disabled. The proper solution would have been to enable the ENERGYON interrupt instead of polling. Since then, PHY handling was moved from the LAN95xx driver to the SMSC PHY driver with commit 05b35e7eb9a1 ("smsc95xx: add phylib support"). That PHY driver is capable of link detection with auto-negotiation disabled because it enables the ENERGYON interrupt. Note that signaling interrupts through the MAC layer not only works with the integrated PHY, but also with an external PHY, provided its interrupt pin is attached to LAN95xx's nPHY_INT pin. In the unlikely event that the interrupt pin of an external PHY is attached to a GPIO of the SoC (or not connected at all), the driver can be amended to retrieve the irq from the PHY's of_node. To forward PHY interrupts to phylib, it is not sufficient to call phy_mac_interrupt(). Instead, the PHY's interrupt handler needs to run so that PHY interrupts are cleared. That's because according to page 119 of the LAN950x datasheet, "The source of this interrupt is a level. The interrupt persists until it is cleared in the PHY." https://www.microchip.com/content/dam/mchp/documents/UNG/ProductDocuments/DataSheets/LAN950x-Data-Sheet-DS00001875D.pdf Therefore, create an IRQ domain with a single IRQ for the PHY. In the future, the IRQ domain may be extended to support the 11 GPIOs on the LAN95xx. Normally the PHY interrupt should be masked until the PHY driver has cleared it. However masking requires a (sleeping) USB transaction and interrupts are received in (non-sleepable) softirq context. I decided not to mask the interrupt at all (by using the dummy_irq_chip's noop ->irq_mask() callback): The USB interrupt endpoint is polled in 1 msec intervals and normally that's sufficient to wake the PHY driver's IRQ thread and have it clear the interrupt. If it does take longer, worst thing that can happen is the IRQ thread is woken again. No big deal. Because PHY interrupts are now perpetually enabled, there's no need to selectively enable them on suspend. So remove all invocations of smsc95xx_enable_phy_wakeup_interrupts(). In smsc95xx_resume(), move the call of phy_init_hw() before usbnet_resume() (which restarts the status URB) to ensure that the PHY is fully initialized when an interrupt is handled. Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500 Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514 Signed-off-by: Lukas Wunner <lukas@wunner.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> # from a PHY perspective Cc: Andre Edich <andre.edich@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2022-05-13usbnet: smsc95xx: Avoid link settings race on interrupt receptionLukas Wunner
When a PHY interrupt is signaled, the SMSC LAN95xx driver updates the MAC full duplex mode and PHY flow control registers based on cached data in struct phy_device: smsc95xx_status() # raises EVENT_LINK_RESET usbnet_deferred_kevent() smsc95xx_link_reset() # uses cached data in phydev Simultaneously, phylib polls link status once per second and updates that cached data: phy_state_machine() phy_check_link_status() phy_read_status() lan87xx_read_status() genphy_read_status() # updates cached data in phydev If smsc95xx_link_reset() wins the race against genphy_read_status(), the registers may be updated based on stale data. E.g. if the link was previously down, phydev->duplex is set to DUPLEX_UNKNOWN and that's what smsc95xx_link_reset() will use, even though genphy_read_status() may update it to DUPLEX_FULL afterwards. PHY interrupts are currently only enabled on suspend to trigger wakeup, so the impact of the race is limited, but we're about to enable them perpetually. Avoid the race by delaying execution of smsc95xx_link_reset() until phy_state_machine() has done its job and calls back via smsc95xx_handle_link_change(). Signaling EVENT_LINK_RESET on wakeup is not necessary because phylib picks up link status changes through polling. So drop the declaration of a ->link_reset() callback. Note that the semicolon on a line by itself added in smsc95xx_status() is a placeholder for a function call which will be added in a subsequent commit. That function call will actually handle the INT_ENP_PHY_INT_ interrupt. Tested-by: Oleksij Rempel <o.rempel@pengutronix.de> # LAN9514/9512/9500 Tested-by: Ferry Toth <fntoth@gmail.com> # LAN9514 Signed-off-by: Lukas Wunner <lukas@wunner.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>