aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2021-08-05net/ipv4/ipv6: Replace one-element arraya with flexible-array membersGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. Use an anonymous union with a couple of anonymous structs in order to keep userspace unchanged and refactor the related code accordingly: $ pahole -C group_filter net/ipv4/ip_sockglue.o struct group_filter { union { struct { __u32 gf_interface_aux; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct __kernel_sockaddr_storage gf_group_aux; /* 8 128 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ __u32 gf_fmode_aux; /* 136 4 */ __u32 gf_numsrc_aux; /* 140 4 */ struct __kernel_sockaddr_storage gf_slist[1]; /* 144 128 */ }; /* 0 272 */ struct { __u32 gf_interface; /* 0 4 */ /* XXX 4 bytes hole, try to pack */ struct __kernel_sockaddr_storage gf_group; /* 8 128 */ /* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */ __u32 gf_fmode; /* 136 4 */ __u32 gf_numsrc; /* 140 4 */ struct __kernel_sockaddr_storage gf_slist_flex[0]; /* 144 0 */ }; /* 0 144 */ }; /* 0 272 */ /* size: 272, cachelines: 5, members: 1 */ /* last cacheline: 16 bytes */ }; $ pahole -C compat_group_filter net/ipv4/ip_sockglue.o struct compat_group_filter { union { struct { __u32 gf_interface_aux; /* 0 4 */ struct __kernel_sockaddr_storage gf_group_aux __attribute__((__aligned__(4))); /* 4 128 */ /* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */ __u32 gf_fmode_aux; /* 132 4 */ __u32 gf_numsrc_aux; /* 136 4 */ struct __kernel_sockaddr_storage gf_slist[1] __attribute__((__aligned__(4))); /* 140 128 */ } __attribute__((__packed__)) __attribute__((__aligned__(4))); /* 0 268 */ struct { __u32 gf_interface; /* 0 4 */ struct __kernel_sockaddr_storage gf_group __attribute__((__aligned__(4))); /* 4 128 */ /* --- cacheline 2 boundary (128 bytes) was 4 bytes ago --- */ __u32 gf_fmode; /* 132 4 */ __u32 gf_numsrc; /* 136 4 */ struct __kernel_sockaddr_storage gf_slist_flex[0] __attribute__((__aligned__(4))); /* 140 0 */ } __attribute__((__packed__)) __attribute__((__aligned__(4))); /* 0 140 */ } __attribute__((__aligned__(1))); /* 0 268 */ /* size: 268, cachelines: 5, members: 1 */ /* forced alignments: 1 */ /* last cacheline: 12 bytes */ } __attribute__((__packed__)); This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/79 Link: https://github.com/KSPP/linux/issues/109 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05Merge branch 'bridge-ioctl-fixes'David S. Miller
Nikolay Aleksandrov says: ==================== net: bridge: fix recent ioctl changes These are three fixes for the recent bridge removal of ndo_do_ioctl done by commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl"). Patch 01 fixes a deadlock of the new bridge ioctl hook lock and rtnl by taking a netdev reference and always taking the bridge ioctl lock first then rtnl from within the bridge hook. Patch 02 fixes old_deviceless() bridge calls device name argument, and patch 03 checks in dev_ifsioc()'s SIOCBRADD/DELIF cases if the netdevice is actually a bridge before interpreting its private ptr as net_bridge. Patch 01 was tested by running old bridge-utils commands with lockdep enabled. Patch 02 was tested again by using bridge-utils and using the respective ioctl calls on a "up" bridge device. Patch 03 was tested by using the addif ioctl on a non-bridge device (e.g. loopback). ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: core: don't call SIOCBRADD/DELIF for non-bridge devicesNikolay Aleksandrov
Commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") changed SIOCBRADD/DELIF to use bridge's ioctl hook (br_ioctl_hook) without checking if the target netdevice is actually a bridge which can cause crashes and generally interpreting other devices' private pointers as net_bridge pointers. Crash example (lo - loopback): $ brctl addif lo ens16 BUG: kernel NULL pointer dereference, address: 000000000000059898 #PF: supervisor read access in kernel modede #PF: error_code(0x0000) - not-present pagege PGD 0 P4D 0 ^Ac Oops: 0000 [#1] SMP NOPTI CPU: 2 PID: 1376 Comm: brctl Kdump: loaded Tainted: G W 5.14.0-rc3+ #405 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014 RIP: 0010:add_del_if+0x1f/0x7c [bridge] Code: 80 bf 1b a0 41 5c e9 c0 3c 03 e1 0f 1f 44 00 00 41 55 41 54 41 89 f4 be 0c 00 00 00 55 48 89 fd 53 48 8b 87 88 00 00 00 89 d3 <4c> 8b a8 98 05 00 00 49 8b bd d0 00 00 00 e8 17 d7 f3 e0 84 c0 74 RSP: 0018:ffff888109d97cb0 EFLAGS: 00010202^Ac RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: ffff888101239bc0 RBP: ffff888101239bc0 R08: 0000000000000001 R09: 0000000000000000 R10: ffff888109d97cd8 R11: 00000000000000a3 R12: 0000000000000012 R13: 0000000000000000 R14: ffff888101239bc0 R15: ffff888109d97e10 FS: 00007fc1e365b540(0000) GS:ffff88822be80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000598 CR3: 0000000106506000 CR4: 00000000000006e0 Call Trace: br_ioctl_stub+0x7c/0x441 [bridge] br_ioctl_call+0x6d/0x8a dev_ifsioc+0x325/0x4e8 dev_ioctl+0x46b/0x4e1 sock_do_ioctl+0x7b/0xad sock_ioctl+0x2de/0x2f2 vfs_ioctl+0x1e/0x2b __do_sys_ioctl+0x63/0x86 do_syscall_64+0xcb/0xf2 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc1e3589427 Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc8d501d38 EFLAGS: 00000202 ORIG_RAX: 000000000000001010 RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007fc1e3589427 RDX: 00007ffc8d501d60 RSI: 00000000000089a3 RDI: 0000000000000003 RBP: 00007ffc8d501d60 R08: 0000000000000000 R09: fefefeff77686d74 R10: fffffffffffff8f9 R11: 0000000000000202 R12: 00007ffc8d502e06 R13: 00007ffc8d502e06 R14: 0000000000000000 R15: 0000000000000000 Modules linked in: bridge stp llc bonding ipv6 virtio_net [last unloaded: llc]^Ac CR2: 0000000000000598 Reported-by: syzbot+79f4a8692e267bdb7227@syzkaller.appspotmail.com Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: bridge: fix ioctl old_deviceless bridge argumentNikolay Aleksandrov
Commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") changed the source of the argument copy in bridge's old_deviceless() from args[1] (user ptr to device name) to uarg (ptr to ioctl arguments) causing wrong device name to be used. Example (broken, bridge exists but is up): $ brctl delbr bridge bridge bridge doesn't exist; can't delete it Example (working): $ brctl delbr bridge bridge bridge is still up; can't delete it Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: bridge: fix ioctl lockingNikolay Aleksandrov
Before commit ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") the bridge ioctl calls were divided in two parts: one was deviceless called by sock_ioctl and didn't expect rtnl to be held, the other was with a device called by dev_ifsioc() and expected rtnl to be held. After the commit above they were united in a single ioctl stub, but it didn't take care of the locking expectations. For sock_ioctl now we acquire (1) br_ioctl_mutex, (2) rtnl and for dev_ifsioc we acquire (1) rtnl, (2) br_ioctl_mutex The fix is to get a refcnt on the netdev for dev_ifsioc calls and drop rtnl then to reacquire it in the bridge ioctl stub after br_ioctl_mutex has been acquired. That will avoid playing locking games and make the rules straight-forward: we always take br_ioctl_mutex first, and then rtnl. Reported-by: syzbot+34fe5894623c4ab1b379@syzkaller.appspotmail.com Fixes: ad2f99aedf8f ("net: bridge: move bridge ioctls out of .ndo_do_ioctl") Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net/ipv4: Revert use of struct_size() helperGustavo A. R. Silva
Revert the use of structr_size() and stay with IP_MSFILTER_SIZE() for now, as in this case, the size of struct ip_msfilter didn't change with the addition of the flexible array imsf_slist_flex[]. So, if we use struct_size() we will be allocating and calculating the size of struct ip_msfilter with one too many items for imsf_slist_flex[]. We might use struct_size() in the future, but for now let's stay with IP_MSFILTER_SIZE(). Fixes: 2d3e5caf96b9 ("net/ipv4: Replace one-element array with flexible-array member") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: fix GRO skb truesize updatePaolo Abeni
commit 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference") introduces a serious regression at the GRO layer setting the wrong truesize for stolen-head skbs. Restore the correct truesize: SKB_DATA_ALIGN(...) instead of SKB_TRUESIZE(...) Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Fixes: 5e10da5385d2 ("skbuff: allow 'slow_gro' for skb carring sock reference") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Tested-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05Merge branch 'ipa-runtime-pm'David S. Miller
Alex Elder says: ==================== net: ipa: more work toward runtime PM The first two patches in this series are basically bug fixes, but in practice I don't think we've seen the problems they might cause. The third patch moves clock and interconnect related error messages around a bit, reporting better information and doing so in the functions where they are enabled or disabled (rather than those functions' callers). The last three patches move power-related code into "ipa_clock.c", as a step toward generalizing the purpose of that source file. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: ipa: move IPA flags fieldAlex Elder
The ipa->flags field is only ever used in "ipa_clock.c", related to suspend/resume activity. Move the definition of the ipa_flag enumerated type to "ipa_clock.c". And move the flags field from the ipa structure and to the ipa_clock structure. Rename the type and its values to include "power" or "POWER" in the name. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: ipa: move ipa_suspend_handler()Alex Elder
Move ipa_suspend_handler() into "ipa_clock.c" from "ipa_main.c", to group with the reset of the suspend/resume code. This IPA interrupt is triggered if an IPA RX endpoint is suspended but has a packet to be delivered. Introduce ipa_power_setup() and ipa_power_teardown() to add and remove the handler for the IPA SUSPEND interrupt at the same place as before, while allowing the handler to remain private. The "power" naming convention will be adopted elsewhere in this file as well (soon). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: ipa: move IPA power operations to ipa_clock.cAlex Elder
Move ipa_suspend() and ipa_resume(), as well as the definition of the ipa_pm_ops structure into "ipa_clock.c". Make ipa_pm_ops public and declare it as extern in "ipa_clock.h". This is part of centralizing IPA power management functionality into "ipa_clock.c" (the file will eventually get a name change). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: ipa: improve IPA clock error messagesAlex Elder
Rearrange messages reported when errors occur in the IPA clock code, so that the specific interconnect is identified when an error occurs enabling or disabling it, or the core clock is indicated when an error occurs enabling it. Have ipa_interconnect_disable() return zero or the negative error value returned by the first interconnect that produced an error when disabled. For now, the callers ignore the returned value. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: ipa: reorder netdev pointer assignmentsAlex Elder
Assign the ipa->modem_netdev and endpoint->netdev pointers *before* registering the network device. As soon as the device is registered it can be opened, and by that time we'll want those pointers valid. Similarly, don't make those pointers NULL until *after* the modem network device is unregistered in ipa_modem_stop(). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: ipa: don't suspend/resume modem if not upAlex Elder
The modem network device is set up by ipa_modem_start(). But its TX queue is not actually started and endpoints enabled until it is opened. So avoid stopping the modem network device TX queue and disabling endpoints on suspend or stop unless the netdev is marked UP. And skip attempting to resume unless it is UP. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05Merge branch 'sja1105-H'David S. Miller
Vladimir Oltean says: ==================== NXP SJA1105 driver support for "H" switch topologies Changes in v3: Preserve the behavior of dsa_tree_setup_default_cpu() which is to pick the first CPU port and not the last. Changes in v2: Send as non-RFC, drop the patches for discarding DSA-tagged packets on user ports and DSA-untagged packets on DSA and CPU ports for now. NXP builds boards like the Bluebox 3 where there are multiple SJA1110 switches connected to an LX2160A, but they are also connected to each other. I call this topology an "H" tree because of the lateral connection between switches. A piece extracted from a non-upstream device tree looks like this: &spi_bridge { /* SW1 */ ethernet-switch@0 { compatible = "nxp,sja1110a"; reg = <0>; dsa,member = <0 0>; ethernet-ports { #address-cells = <1>; #size-cells = <0>; /* SW1_P1 */ port@1 { reg = <1>; label = "con_2x20"; phy-mode = "sgmii"; fixed-link { speed = <1000>; full-duplex; }; }; port@2 { reg = <2>; ethernet = <&dpmac17>; phy-mode = "rgmii-id"; fixed-link { speed = <1000>; full-duplex; }; }; port@3 { reg = <3>; label = "1ge_p1"; phy-mode = "rgmii-id"; phy-handle = <&sw1_mii3_phy>; }; sw1p4: port@4 { reg = <4>; link = <&sw2p1>; phy-mode = "sgmii"; fixed-link { speed = <1000>; full-duplex; }; }; port@5 { reg = <5>; label = "trx1"; phy-mode = "internal"; phy-handle = <&sw1_port5_base_t1_phy>; }; port@6 { reg = <6>; label = "trx2"; phy-mode = "internal"; phy-handle = <&sw1_port6_base_t1_phy>; }; port@7 { reg = <7>; label = "trx3"; phy-mode = "internal"; phy-handle = <&sw1_port7_base_t1_phy>; }; port@8 { reg = <8>; label = "trx4"; phy-mode = "internal"; phy-handle = <&sw1_port8_base_t1_phy>; }; port@9 { reg = <9>; label = "trx5"; phy-mode = "internal"; phy-handle = <&sw1_port9_base_t1_phy>; }; port@a { reg = <10>; label = "trx6"; phy-mode = "internal"; phy-handle = <&sw1_port10_base_t1_phy>; }; }; }; /* SW2 */ ethernet-switch@2 { compatible = "nxp,sja1110a"; reg = <2>; dsa,member = <0 1>; ethernet-ports { #address-cells = <1>; #size-cells = <0>; sw2p1: port@1 { reg = <1>; link = <&sw1p4>; phy-mode = "sgmii"; fixed-link { speed = <1000>; full-duplex; }; }; port@2 { reg = <2>; ethernet = <&dpmac18>; phy-mode = "rgmii-id"; fixed-link { speed = <1000>; full-duplex; }; }; port@3 { reg = <3>; label = "1ge_p2"; phy-mode = "rgmii-id"; phy-handle = <&sw2_mii3_phy>; }; port@4 { reg = <4>; label = "to_sw3"; phy-mode = "2500base-x"; fixed-link { speed = <2500>; full-duplex; }; }; port@5 { reg = <5>; label = "trx7"; phy-mode = "internal"; phy-handle = <&sw2_port5_base_t1_phy>; }; port@6 { reg = <6>; label = "trx8"; phy-mode = "internal"; phy-handle = <&sw2_port6_base_t1_phy>; }; port@7 { reg = <7>; label = "trx9"; phy-mode = "internal"; phy-handle = <&sw2_port7_base_t1_phy>; }; port@8 { reg = <8>; label = "trx10"; phy-mode = "internal"; phy-handle = <&sw2_port8_base_t1_phy>; }; port@9 { reg = <9>; label = "trx11"; phy-mode = "internal"; phy-handle = <&sw2_port9_base_t1_phy>; }; port@a { reg = <10>; label = "trx12"; phy-mode = "internal"; phy-handle = <&sw2_port10_base_t1_phy>; }; }; }; }; Basically it is a single DSA tree with 2 "ethernet" properties, i.e. a multi-CPU-port system. There is also a DSA link between the switches, but it is not a daisy chain topology, i.e. there is no "upstream" and "downstream" switch, the DSA link is only to be used for the bridge data plane (autonomous forwarding between switches, between the RJ-45 ports and the automotive Ethernet ports), otherwise all traffic that should reach the host should do so through the dedicated CPU port of the switch. Of course, plain forwarding in this topology is bound to create packet loops. I have thought long and hard about strategies to cut forwarding in such a way as to prevent loops but also not impede normal operation of the network on such a system, and I believe I have found a solution that does work as expected. This relies heavily on DSA's recent ability to perform RX filtering towards the host by installing MAC addresses as static FDB entries. Since we have 2 distinct DSA masters, we have 2 distinct MAC addresses, and if the bridge is configured to have its own MAC address that makes it 3 distinct MAC addresses. The bridge core, plus the switchdev_handle_fdb_add_to_device() extension, handle each MAC address by replicating it to each port of the DSA switch tree. So the end result is that both switch 1 and switch 2 will have static FDB entries towards their respective CPU ports for the 3 MAC addresses corresponding to the DSA masters and to the bridge net device (and of course, towards any station learned on a foreign interface). So I think the basic design works, and it is basically just as fragile as any other multi-CPU-port system is bound to be in terms of reliance on static FDB entries towards the host (if hardware address learning on the CPU port is to be used, MAC addresses would randomly bounce between one CPU port and the other otherwise). In fact, I think it is even better to start DSA's support of multi-CPU-port systems with something small like the NXP Bluebox 3, because we allow some time for the code paths like dsa_switch_host_address_match(), which were specifically designed for it, to break in, and this board needs no user space configuration of CPU ports, like static assignments between user and CPU ports, or bonding between the CPU ports/DSA masters. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: dsa: sja1105: enable address learning on cascade portsVladimir Oltean
Right now, address learning is disabled on DSA ports, which means that a packet received over a DSA port from a cross-chip switch will be flooded to unrelated ports. It is desirable to eliminate that, but for that we need a breakdown of the possibilities for the sja1105 driver. A DSA port can be: - a downstream-facing cascade port. This is simple because it will always receive packets from a downstream switch, and there should be no other route to reach that downstream switch in the first place, which means it should be safe to learn that MAC address towards that switch. - an upstream-facing cascade port. This receives packets either: * autonomously forwarded by an upstream switch (and therefore these packets belong to the data plane of a bridge, so address learning should be ok), or * injected from the CPU. This deserves further discussion, as normally, an upstream-facing cascade port is no different than the CPU port itself. But with "H" topologies (a DSA link towards a switch that has its own CPU port), these are more "laterally-facing" cascade ports than they are "upstream-facing". Here, there is a risk that the port might learn the host addresses on the wrong port (on the DSA port instead of on its own CPU port), but this is solved by DSA's RX filtering infrastructure, which installs the host addresses as static FDB entries on the CPU port of all switches in a "H" tree. So even if there will be an attempt from the switch to migrate the FDB entry from the CPU port to the laterally-facing cascade port, it will fail to do that, because the FDB entry that already exists is static and cannot migrate. So address learning should be safe for this configuration too. Ok, so what about other MAC addresses coming from the host, not necessarily the bridge local FDB entries? What about MAC addresses dynamically learned on foreign interfaces, isn't there a risk that cascade ports will learn these entries dynamically when they are supposed to be delivered towards the CPU port? Well, that is correct, and this is why we also need to enable the assisted learning feature, to snoop for these addresses and write them to hardware as static FDB entries towards the CPU, to make the switch's learning process on the cascade ports ineffective for them. With assisted learning enabled, the hardware learning on the CPU port must be disabled. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: dsa: sja1105: suppress TX packets from looping back in "H" topologiesVladimir Oltean
H topologies like this one have a problem: eth0 eth1 | | CPU port CPU port | DSA link | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0 | | | | | | user user user user user user port port port port port port Basically any packet sent by the eth0 DSA master can be flooded on the interconnecting DSA link sw0p4 <-> sw1p4 and it will be received by the eth1 DSA master too. Basically we are talking to ourselves. In VLAN-unaware mode, these packets are encoded using a tag_8021q TX VLAN, which dsa_8021q_rcv() rightfully cannot decode and complains. Whereas in VLAN-aware mode, the packets are encoded with a bridge VLAN which _can_ be decoded by the tagger running on eth1, so it will attempt to reinject that packet into the network stack (the bridge, if there is any port under eth1 that is under a bridge). In the case where the ports under eth1 are under the same cross-chip bridge as the ports under eth0, the TX packets will even be learned as RX packets. The only thing that will prevent loops with the software bridging path, and therefore disaster, is that the source port and the destination port are in the same hardware domain, and the bridge will receive packets from the driver with skb->offload_fwd_mark = true and will not forward between the two. The proper solution to this problem is to detect H topologies and enforce that all packets are received through the local switch and we do not attempt to receive packets on our CPU port from switches that have their own. This is a viable solution which works thanks to the fact that MAC addresses which should be filtered towards the host are installed by DSA as static MAC addresses towards the CPU port of each switch. TX from a CPU port towards the DSA port continues to be allowed, this is because sja1105 supports bridge TX forwarding offload, and the skb->dev used initially for xmit does not have any direct correlation with where the station that will respond to that packet is connected. It may very well happen that when we send a ping through a br0 interface that spans all switch ports, the xmit packet will exit the system through a DSA switch interface under eth1 (say sw1p2), but the destination station is connected to a switch port under eth0, like sw0p0. So the switch under eth1 needs to communicate on TX with the switch under eth0. The response, however, will not follow the same path, but instead, this patch enforces that the response is sent by the first switch directly to its DSA master which is eth0. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: dsa: sja1105: increase MTU to account for VLAN header on DSA portsVladimir Oltean
Since all packets are transmitted as VLAN-tagged over a DSA link (this VLAN tag represents the tag_8021q header), we need to increase the MTU of these interfaces to account for the possibility that we are already transporting a user-visible VLAN header. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: dsa: sja1105: manage VLANs on cascade portsVladimir Oltean
Since commit ed040abca4c1 ("net: dsa: sja1105: use 4095 as the private VLAN for untagged traffic"), this driver uses a reserved value as pvid for the host port (DSA CPU port). Control packets which are sent as untagged get classified to this VLAN, and all ports are members of it (this is to be expected for control packets). Manage all cascade ports in the same way and allow control packets to egress everywhere. Also, all VLANs need to be sent as egress-tagged on all cascade ports. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: dsa: sja1105: manage the forwarding domain towards DSA portsVladimir Oltean
Manage DSA links towards other switches, be they host ports or cascade ports, the same as the CPU port, i.e. allow forwarding and flooding unconditionally from all user ports. We send packets as always VLAN-tagged on a DSA port, and we rely on the cross-chip notifiers from tag_8021q to install the RX VLAN of a switch port only on the proper remote ports of another switch (the ports that are in the same bridging domain). So if there is no cross-chip bridging in the system, the flooded packets will be sent on the DSA ports too, but they will be dropped by the remote switches due to either (a) a lack of the RX VLAN in the VLAN table of the ingress DSA port, or (b) a lack of valid destinations for those packets, due to a lack of the RX VLAN on the user ports of the switch Note that switches which only transport packets in a cross-chip bridge, but have no user ports of their own as part of that bridge, such as switch 1 in this case: DSA link DSA link sw0p0 sw0p1 sw0p2 -------- sw1p0 sw1p2 sw1p3 -------- sw2p0 sw2p2 sw2p3 ip link set sw0p0 master br0 ip link set sw2p3 master br0 will still work, because the tag_8021q cross-chip notifiers keep the RX VLANs installed on all DSA ports. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: dsa: sja1105: configure the cascade ports based on topologyVladimir Oltean
The sja1105 switch family has a feature called "cascade ports" which can be used in topologies where multiple SJA1105/SJA1110 switches are daisy chained. Upstream switches set this bit for the DSA link towards the downstream switches. This is used when the upstream switch receives a control packet (PTP, STP) from a downstream switch, because if the source port for a control packet is marked as a cascade port, then the source port, switch ID and RX timestamp will not be taken again on the upstream switch, it is assumed that this has already been done by the downstream switch (the leaf port in the tree) and that the CPU has everything it needs to decode the information from this packet. We need to distinguish between an upstream-facing DSA link and a downstream-facing DSA link, because the upstream-facing DSA links are "host ports" for the SJA1105/SJA1110 switches, and the downstream-facing DSA links are "cascade ports". Note that SJA1105 supports a single cascade port, so only daisy chain topologies work. With SJA1110, there can be more complex topologies such as: eth0 | host port | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 | | | | cascade cascade user user port port port port | | | | | | | host | port | | | sw1p0 sw1p1 sw1p2 sw1p3 sw1p4 | | | | | | user user user user host port port port port port | sw2p0 sw2p1 sw2p2 sw2p3 sw2p4 | | | | user user user user port port port port Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: dsa: give preference to local CPU portsVladimir Oltean
Be there an "H" switch topology, where there are 2 switches connected as follows: eth0 eth1 | | CPU port CPU port | DSA link | sw0p0 sw0p1 sw0p2 sw0p3 sw0p4 -------- sw1p4 sw1p3 sw1p2 sw1p1 sw1p0 | | | | | | user user user user user user port port port port port port basically one where each switch has its own CPU port for termination, but there is also a DSA link in case packets need to be forwarded in hardware between one switch and another. DSA insists to see this as a daisy chain topology, basically registering all network interfaces as sw0p0@eth0, ... sw1p0@eth0 and disregarding eth1 as a valid DSA master. This is only half the story, since when asked using dsa_port_is_cpu(), DSA will respond that sw1p1 is a CPU port, however one which has no dp->cpu_dp pointing to it. So sw1p1 is enabled, but not used. Furthermore, be there a driver for switches which support only one upstream port. This driver iterates through its ports and checks using dsa_is_upstream_port() whether the current port is an upstream one. For switch 1, two ports pass the "is upstream port" checks: - sw1p4 is an upstream port because it is a routing port towards the dedicated CPU port assigned using dsa_tree_setup_default_cpu() - sw1p1 is also an upstream port because it is a CPU port, albeit one that is disabled. This is because dsa_upstream_port() returns: if (!cpu_dp) return port; which means that if @dp does not have a ->cpu_dp pointer (which is a characteristic of CPU ports themselves as well as unused ports), then @dp is its own upstream port. So the driver for switch 1 rightfully says: I have two upstream ports, but I don't support multiple upstream ports! So let me error out, I don't know which one to choose and what to do with the other one. Generally I am against enforcing any default policy in the kernel in terms of user to CPU port assignment (like round robin or such) but this case is different. To solve the conundrum, one would have to: - Disable sw1p1 in the device tree or mark it as "not a CPU port" in order to comply with DSA's view of this topology as a daisy chain, where the termination traffic from switch 1 must pass through switch 0. This is counter-productive because it wastes 1Gbps of termination throughput in switch 1. - Disable the DSA link between sw0p4 and sw1p4 and do software forwarding between switch 0 and 1, and basically treat the switches as part of disjoint switch trees. This is counter-productive because it wastes 1Gbps of autonomous forwarding throughput between switch 0 and 1. - Treat sw0p4 and sw1p4 as user ports instead of DSA links. This could work, but it makes cross-chip bridging impossible. In this setup we would need to have 2 separate bridges, br0 spanning the ports of switch 0, and br1 spanning the ports of switch 1, and the "DSA links treated as user ports" sw0p4 (part of br0) and sw1p4 (part of br1) are the gateway ports between one bridge and another. This is hard to manage from a user's perspective, who wants to have a unified view of the switching fabric and the ability to transparently add ports to the same bridge. VLANs would also need to be explicitly managed by the user on these gateway ports. So it seems that the only reasonable thing to do is to make DSA prefer CPU ports that are local to the switch. Meaning that by default, the user and DSA ports of switch 0 will get assigned to the CPU port from switch 0 (sw0p1) and the user and DSA ports of switch 1 will get assigned to the CPU port from switch 1. The way this solves the problem is that sw1p4 is no longer an upstream port as far as switch 1 is concerned (it no longer views sw0p1 as its dedicated CPU port). So here we are, the first multi-CPU port that DSA supports is also perhaps the most uneventful one: the individual switches don't support multiple CPUs, however the DSA switch tree as a whole does have multiple CPU ports. No user space assignment of user ports to CPU ports is desirable, necessary, or possible. Ports that do not have a local CPU port (say there was an extra switch hanging off of sw0p0) default to the standard implementation of getting assigned to the first CPU port of the DSA switch tree. Is that good enough? Probably not (if the downstream switch was hanging off of switch 1, we would most certainly prefer its CPU port to be sw1p1), but in order to support that use case too, we would need to traverse the dst->rtable in search of an optimum dedicated CPU port, one that has the smallest number of hops between dp->ds and dp->cpu_dp->ds. At the moment, the DSA routing table structure does not keep the number of hops between dl->dp and dl->link_dp, and while it is probably deducible, there is zero justification to write that code now. Let's hope DSA will never have to support that use case. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: dsa: rename teardown_default_cpu to teardown_cpu_portsVladimir Oltean
There is nothing specific to having a default CPU port to what dsa_tree_teardown_default_cpu() does. Even with multiple CPU ports, it would do the same thing: iterate through the ports of this switch tree and reset the ->cpu_dp pointer to NULL. So rename it accordingly. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05net: ipa: fix IPA v4.9 interconnectsAlex Elder
Three interconnects are defined for IPA version 4.9, but there should only be two. They should also use names that match what's used for other platforms (and specified in the Device Tree binding). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-05mctp: remove duplicated assignment of pointer hdrColin Ian King
The pointer hdr is being initialized and also re-assigned with the same value from the call to function mctp_hdr. Static analysis reports that the initializated value is unused. The second assignment is duplicated and can be removed. Addresses-Coverity: ("Unused value"). Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: Replace deprecated CPU-hotplug functions.Sebastian Andrzej Siewior
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-04virtio_net: Replace deprecated CPU-hotplug functions.Sebastian Andrzej Siewior
The functions get_online_cpus() and put_online_cpus() have been deprecated during the CPU hotplug rework. They map directly to cpus_read_lock() and cpus_read_unlock(). Replace deprecated CPU-hotplug functions with the official version. The behavior remains unchanged. Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-04pktgen: Remove redundant clone_skb overrideNick Richardson
When the netif_receive xmit_mode is set, a line is supposed to set clone_skb to a default 0 value. This line is made redundant due to a preceding line that checks if clone_skb is more than zero and returns -ENOTSUPP. Overriding clone_skb to 0 does not make any difference to the behavior because if it was positive we return error. So it can be either 0 or negative, and in both cases the behavior is the same. Remove redundant line that sets clone_skb to zero. Signed-off-by: Nick Richardson <richardsonnick@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04ptp: ocp: Expose various resources on the timecard.Jonathan Lemon
The OpenCompute timecard driver has additional functionality besides a clock. Make the following resources available: - The external timestamp channels (ts0/ts1) - devlink support for flashing and health reporting - GPS and MAC serial ports - board serial number (obtained from i2c device) Also add watchdog functionality for when GNSS goes into holdover. The resources are collected under a timecard class directory: [jlemon@timecard ~]$ ls -g /sys/class/timecard/ocp1/ total 0 -r--r--r--. 1 root 4096 Aug 3 19:49 available_clock_sources -rw-r--r--. 1 root 4096 Aug 3 19:49 clock_source lrwxrwxrwx. 1 root 0 Aug 3 19:49 device -> ../../../0000:04:00.0/ -r--r--r--. 1 root 4096 Aug 3 19:49 gps_sync lrwxrwxrwx. 1 root 0 Aug 3 19:49 i2c -> ../../xiic-i2c.1024/i2c-2/ drwxr-xr-x. 2 root 0 Aug 3 19:49 power/ lrwxrwxrwx. 1 root 0 Aug 3 19:49 pps -> ../../../../../virtual/pps/pps1/ lrwxrwxrwx. 1 root 0 Aug 3 19:49 ptp -> ../../ptp/ptp2/ -r--r--r--. 1 root 4096 Aug 3 19:49 serialnum lrwxrwxrwx. 1 root 0 Aug 3 19:49 subsystem -> ../../../../../../class/timecard/ lrwxrwxrwx. 1 root 0 Aug 3 19:49 ttyGPS -> ../../tty/ttyS7/ lrwxrwxrwx. 1 root 0 Aug 3 19:49 ttyMAC -> ../../tty/ttyS8/ -rw-r--r--. 1 root 4096 Aug 3 19:39 uevent The labeling is needed at the minimum, in order to tell the serial devices apart. Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04sock: allow reading and changing sk_userlocks with setsockoptPavel Tikhomirov
SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags disable automatic socket buffers adjustment done by kernel (see tcp_fixup_rcvbuf() and tcp_sndbuf_expand()). If we've just created a new socket this adjustment is enabled on it, but if one changes the socket buffer size by setsockopt(SO_{SND,RCV}BUF*) it becomes disabled. CRIU needs to call setsockopt(SO_{SND,RCV}BUF*) on each socket on restore as it first needs to increase buffer sizes for packet queues restore and second it needs to restore back original buffer sizes. So after CRIU restore all sockets become non-auto-adjustable, which can decrease network performance of restored applications significantly. CRIU need to be able to restore sockets with enabled/disabled adjustment to the same state it was before dump, so let's add special setsockopt for it. Let's also export SOCK_SNDBUF_LOCK and SOCK_RCVBUF_LOCK flags to uAPI so that using these interface one can reenable automatic socket buffer adjustment on their sockets. Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04tc-testing: Add control-plane selftests for sch_mqPeilin Ye
Recently we added multi-queue support to netdevsim in commit d4861fc6be58 ("netdevsim: Add multi-queue support"); add a few control-plane selftests for sch_mq using this new feature. Use nsPlugin.py to avoid network interface name collisions. Reviewed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04Revert "net: build all switchdev drivers as modules when the bridge is a module"Vladimir Oltean
This reverts commit b0e81817629a496854ff1799f6cbd89597db65fd. Explicit driver dependency on the bridge is no longer needed since switchdev_bridge_port_{,un}offload() is no longer implemented by the bridge driver but by switchdev. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: make switchdev_bridge_port_{,unoffload} loosely coupled with the bridgeVladimir Oltean
With the introduction of explicit offloading API in switchdev in commit 2f5dc00f7a3e ("net: bridge: switchdev: let drivers inform which bridge ports are offloaded"), we started having Ethernet switch drivers calling directly into a function exported by net/bridge/br_switchdev.c, which is a function exported by the bridge driver. This means that drivers that did not have an explicit dependency on the bridge before, like cpsw and am65-cpsw, now do - otherwise it is not possible to call a symbol exported by a driver that can be built as module unless you are a module too. There was an attempt to solve the dependency issue in the form of commit b0e81817629a ("net: build all switchdev drivers as modules when the bridge is a module"). Grygorii Strashko, however, says about it: | In my opinion, the problem is a bit bigger here than just fixing the | build :( | | In case, of ^cpsw the switchdev mode is kinda optional and in many | cases (especially for testing purposes, NFS) the multi-mac mode is | still preferable mode. | | There were no such tight dependency between switchdev drivers and | bridge core before and switchdev serviced as independent, notification | based layer between them, so ^cpsw still can be "Y" and bridge can be | "M". Now for mostly every kernel build configuration the CONFIG_BRIDGE | will need to be set as "Y", or we will have to update drivers to | support build with BRIDGE=n and maintain separate builds for | networking vs non-networking testing. But is this enough? Wouldn't | it cause 'chain reaction' required to add more and more "Y" options | (like CONFIG_VLAN_8021Q)? | | PS. Just to be sure we on the same page - ARM builds will be forced | (with this patch) to have CONFIG_TI_CPSW_SWITCHDEV=m and so all our | automation testing will just fail with omap2plus_defconfig. In the light of this, it would be desirable for some configurations to avoid dependencies between switchdev drivers and the bridge, and have the switchdev mode as completely optional within the driver. Arnd Bergmann also tried to write a patch which better expressed the build time dependency for Ethernet switch drivers where the switchdev support is optional, like cpsw/am65-cpsw, and this made the drivers follow the bridge (compile as module if the bridge is a module) only if the optional switchdev support in the driver was enabled in the first place: https://patchwork.kernel.org/project/netdevbpf/patch/20210802144813.1152762-1-arnd@kernel.org/ but this still did not solve the fact that cpsw and am65-cpsw now must be built as modules when the bridge is a module - it just expressed correctly that optional dependency. But the new behavior is an apparent regression from Grygorii's perspective. So to support the use case where the Ethernet driver is built-in, NET_SWITCHDEV (a bool option) is enabled, and the bridge is a module, we need a framework that can handle the possible absence of the bridge from the running system, i.e. runtime bloatware as opposed to build-time bloatware. Luckily we already have this framework, since switchdev has been using it extensively. Events from the bridge side are transmitted to the driver side using notifier chains - this was originally done so that unrelated drivers could snoop for events emitted by the bridge towards ports that are implemented by other drivers (think of a switch driver with LAG offload that listens for switchdev events on a bonding/team interface that it offloads). There are also events which are transmitted from the driver side to the bridge side, which again are modeled using notifiers. SWITCHDEV_FDB_ADD_TO_BRIDGE is an example of this, and deals with notifying the bridge that a MAC address has been dynamically learned. So there is a precedent we can use for modeling the new framework. The difference compared to SWITCHDEV_FDB_ADD_TO_BRIDGE is that the work that the bridge needs to do when a port becomes offloaded is blocking in its nature: replay VLANs, MDBs etc. The calling context is indeed blocking (we are under rtnl_mutex), but the existing switchdev notification chain that the bridge is subscribed to is only the atomic one. So we need to subscribe the bridge to the blocking switchdev notification chain too. This patch: - keeps the driver-side perception of the switchdev_bridge_port_{,un}offload unchanged - moves the implementation of switchdev_bridge_port_{,un}offload from the bridge module into the switchdev module. - makes everybody that is subscribed to the switchdev blocking notifier chain "hear" offload & unoffload events - makes the bridge driver subscribe and handle those events - moves the bridge driver's handling of those events into 2 new functions called br_switchdev_port_{,un}offload. These functions contain in fact the core of the logic that was previously in switchdev_bridge_port_{,un}offload, just that now we go through an extra indirection layer to reach them. Unlike all the other switchdev notification structures, the structure used to carry the bridge port information, struct switchdev_notifier_brport_info, does not contain a "bool handled". This is because in the current usage pattern, we always know that a switchdev bridge port offloading event will be handled by the bridge, because the switchdev_bridge_port_offload() call was initiated by a NETDEV_CHANGEUPPER event in the first place, where info->upper_dev is a bridge. So if the bridge wasn't loaded, then the CHANGEUPPER event couldn't have happened. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04Merge tag 'linux-can-next-for-5.15-20210804' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next Marc Kleine-Budde says: ==================== pull-request: can-next 2021-08-04 this is a pull request of 5 patches for net-next/master. The first patch is by me and fixes a typo in a comment in the CAN J1939 protocol. The next 2 patches are by Oleksij Rempel and update the CAN J1939 protocol to send RX status updates via the error queue mechanism. The next patch is by me and adds a missing variable initialization to the flexcan driver (the problem was introduced in the current net-next cycle). The last patch is by Aswath Govindraju and adds power-domains to the Bosch m_can DT binding documentation. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04dt-bindings: net: can: Document power-domains propertyAswath Govindraju
Document power-domains property for adding the Power domain provider. Link: https://lore.kernel.org/r/20210802091822.16407-1-a-govindraju@ti.com Signed-off-by: Aswath Govindraju <a-govindraju@ti.com> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-04can: flexcan: flexcan_clks_enable(): add missing variable initializationMarc Kleine-Budde
This patch adds the missing initialization of the "err" variable in the flexcan_clks_enable() function. Fixes: d9cead75b1c6 ("can: flexcan: add mcf5441x support") Link: https://lore.kernel.org/r/20210728075428.1493568-1-mkl@pengutronix.de Reported-by: kernel test robot <lkp@intel.com> Cc: Angelo Dureghello <angelo@kernel-space.org> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-04can: j1939: extend UAPI to notify about RX statusOleksij Rempel
To be able to create applications with user friendly feedback, we need be able to provide receive status information. Typical ETP transfer may take seconds or even hours. To give user some clue or show a progress bar, the stack should push status updates. Same as for the TX information, the socket error queue will be used with following new signals: - J1939_EE_INFO_RX_RTS - received and accepted request to send signal. - J1939_EE_INFO_RX_DPO - received data package offset signal - J1939_EE_INFO_RX_ABORT - RX session was aborted Instead of completion signal, user will get data package. To activate this signals, application should set SOF_TIMESTAMPING_RX_SOFTWARE to the SO_TIMESTAMPING socket option. This will avoid unpredictable application behavior for the old software. Link: https://lore.kernel.org/r/20210707094854.30781-3-o.rempel@pengutronix.de Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-04can: j1939: rename J1939_ERRQUEUE_* to J1939_ERRQUEUE_TX_*Oleksij Rempel
Prepare the world for the J1939_ERRQUEUE_RX_ version Link: https://lore.kernel.org/r/20210707094854.30781-2-o.rempel@pengutronix.de Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2021-08-04ipv6: exthdrs: get rid of indirect calls in ip6_parse_tlv()Eric Dumazet
As presented last month in our "BIG TCP" talk at netdev 0x15, we plan using IPv6 jumbograms. One of the minor problem we talked about is the fact that ip6_parse_tlv() is currently using tables to list known tlvs, thus using potentially expensive indirect calls. While we could mitigate this cost using macros from indirect_call_wrapper.h, we also can get rid of the tables and let the compiler emit optimized code. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Justin Iurman <justin.iurman@uliege.be> Cc: Coco Li <lixiaoyan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04Merge branch 'm7530-sw-fallback'David S. Miller
DENG Qingfang says: ==================== mt7530 software fallback bridging fix DSA core has gained software fallback support since commit 2f5dc00f7a3e, but it does not work properly on mt7530. This patch series fixes the issues. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: dsa: mt7530: always install FDB entries with IVL and FID 1DENG Qingfang
This reverts commit 7e777021780e ("mt7530 mt7530_fdb_write only set ivl bit vid larger than 1"). Before this series, the default value of all ports' PVID is 1, which is copied into the FDB entry, even if the ports are VLAN unaware. So `bridge fdb show` will show entries like `dev swp0 vlan 1 self` even on a VLAN-unaware bridge. The blamed commit does not solve that issue completely, instead it may cause a new issue that FDB is inaccessible in a VLAN-aware bridge with PVID 1. This series sets PVID to 0 on VLAN-unaware ports, so `bridge fdb show` will no longer print `vlan 1` on VLAN-unaware bridges, and that special case in fdb_write is not required anymore. Set FDB entries' filter ID to 1 to match the VLAN table. Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: dsa: mt7530: set STP state on filter ID 1DENG Qingfang
As filter ID 1 is the only one used for bridges, set STP state on it. Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: dsa: mt7530: use independent VLAN learning on VLAN-unaware bridgesDENG Qingfang
Consider the following bridge configuration, where bond0 is not offloaded: +-- br0 --+ / / | \ / / | \ / | | bond0 / | | / \ swp0 swp1 swp2 swp3 swp4 . . . . . . A B C Ideally, when the switch receives a packet from swp3 or swp4, it should forward the packet to the CPU, according to the port matrix and unknown unicast flood settings. But packet loss will happen if the destination address is at one of the offloaded ports (swp0~2). For example, when client C sends a packet to A, the FDB lookup will indicate that it should be forwarded to swp0, but the port matrix of swp3 and swp4 is configured to only allow the CPU to be its destination, so it is dropped. However, this issue does not happen if the bridge is VLAN-aware. That is because VLAN-aware bridges use independent VLAN learning, i.e. use VID for FDB lookup, on offloaded ports. As swp3 and swp4 are not offloaded, shared VLAN learning with default filter ID of 0 is used instead. So the lookup for A with filter ID 0 never hits and the packet can be forwarded to the CPU. In the current code, only two combinations were used to toggle user ports' VLAN awareness: one is PCR.PORT_VLAN set to port matrix mode with PVC.VLAN_ATTR set to transparent port, the other is PCR.PORT_VLAN set to security mode with PVC.VLAN_ATTR set to user port. It turns out that only PVC.VLAN_ATTR contributes to VLAN awareness, and port matrix mode just skips the VLAN table lookup. The reference manual is somehow misleading when describing PORT_VLAN modes. It states that PORT_MEM (VLAN port member) is used for destination if the VLAN table lookup hits, but actually **PORT_MEM & PORT_MATRIX** (bitwise AND of VLAN port member and port matrix) is used instead, which means we can have two or more separate VLAN-aware bridges with the same PVID and traffic won't leak between them. Therefore, to solve this, enable independent VLAN learning with PVID 0 on VLAN-unaware bridges, by setting their PCR.PORT_VLAN to fallback mode, while leaving standalone ports in port matrix mode. The CPU port is always set to fallback mode to serve those bridges. During testing, it is found that FDB lookup with filter ID of 0 will also hit entries with VID 0 even with independent VLAN learning. To avoid that, install all VLANs with filter ID of 1. Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: dsa: mt7530: enable assisted learning on CPU portDENG Qingfang
Consider the following bridge configuration, where bond0 is not offloaded: +-- br0 --+ / / | \ / / | \ / | | bond0 / | | / \ swp0 swp1 swp2 swp3 swp4 . . . . . . A B C Address learning is enabled on offloaded ports (swp0~2) and the CPU port, so when client A sends a packet to C, the following will happen: 1. The switch learns that client A can be reached at swp0. 2. The switch probably already knows that client C can be reached at the CPU port, so it forwards the packet to the CPU. 3. The bridge core knows client C can be reached at bond0, so it forwards the packet back to the switch. 4. The switch learns that client A can be reached at the CPU port. 5. The switch forwards the packet to either swp3 or swp4, according to the packet's tag. That makes client A's MAC address flap between swp0 and the CPU port. If client B sends a packet to A, it is possible that the packet is forwarded to the CPU. With offload_fwd_mark = 1, the bridge core won't forward it back to the switch, resulting in packet loss. As we have the assisted_learning_on_cpu_port in DSA core now, enable that and disable hardware learning on the CPU port. Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <oltean@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04Merge branch 'ipa-pm-irqs'David S. Miller
Alex Elder says: ==================== net: ipa: prepare GSI interrupts for runtime PM The last patch in this series arranges for GSI interrupts to be disabled when the IPA hardware is suspended. This ensures the clock is always operational when a GSI interrupt fires. Leading up to that are patches that rearrange the code a bit to allow this to be done. The first two patches aren't *directly* related. They remove some flag arguments to some GSI suspend/resume related functions, using the version field now present in the GSI structure. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: ipa: disable GSI interrupts while suspendedAlex Elder
Introduce new functions gsi_suspend() and gsi_resume(), which will disable the GSI interrupt handler after all endpoints are suspended and re-enable it before endpoints are resumed. This will ensure no GSI interrupt handler will fire when the hardware is suspended. Here's a little further explanation. There are seven GSI interrupt types, and most are disabled except when needed. - These two are not used (never enabled): GSI_INTER_EE_CH_CTRL GSI_INTER_EE_EV_CTRL - These two are only used to implement channel and event ring commands, and are only enabled while a command is underway: GSI_CH_CTRL GSI_EV_CTRL - The IEOB interrupt signals I/O completion. It will not fire when a channel is stopped (or "suspended"). GSI_IEOB - This interrupt is used to allocate or halt modem channels, and is only enabled while such a command is underway. GSI_GLOB_EE However it also is used to signal certain errors, and this could occur at any time. - The general interrupt signals general errors, and could occur at any time. GSI_GENERAL The purpose for this change is to ensure no global or general interrupts fire due to errors while the hardware is suspended. We enable the clock on resume, and at that time we can "handle" (at least report) these error conditions. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: ipa: move gsi_irq_init() code into setupAlex Elder
The GSI IRQ handler could be triggered as soon as it is registered with request_irq(). The handler function, gsi_isr(), touches hardware, meaning the IPA clock must be operational. The IPA clock is not operating when the handler is registered (in gsi_irq_init()), so this is a problem. Move the call to request_irq() for the GSI interrupt handler into gsi_irq_setup(), which is called when the IPA clock is known to be operational (and furthermore, the GSI firmware will have been loaded). Request the IRQ at the end of that function, after all interrupt types have been disabled and masked. Move the matching free_irq() call into gsi_irq_teardown(), and get rid of the now empty gsi_irq_exit(), Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: ipa: have gsi_irq_setup() return an error codeAlex Elder
Change gsi_irq_setup() so it returns an error value, and introduce gsi_irq_teardown() as its inverse. Set the interrupt type (IRQ rather than MSI) in gsi_irq_setup(). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: ipa: move some GSI setup functionsAlex Elder
Move gsi_irq_setup() and gsi_ring_setup() so they're defined right above gsi_setup() where they're called. This is a trivial movement of code to prepare for upcoming patches. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-04net: ipa: move version check for channel suspend/resumeAlex Elder
Change the Boolean flags passed to __gsi_channel_start() and __gsi_channel_stop() so they represent whether the request is being made to implement suspend (versus stop) or resume (versus start). Then stop or start the channel for suspend/resume requests only if the hardware version indicates it should be done. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>