aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2021-08-03ethtool: runtime-resume netdev parent in ethnl_ops_beginHeiner Kallweit
If a network device is runtime-suspended then: - network device may be flagged as detached and all ethtool ops (even if not accessing the device) will fail because netif_device_present() returns false - ethtool ops may fail because device is not accessible (e.g. because being in D3 in case of a PCI device) It may not be desirable that userspace can't use even simple ethtool ops that not access the device if interface or link is down. To be more friendly to userspace let's ensure that device is runtime-resumed when executing the respective ethtool op in kernel. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03ethtool: move netif_device_present check from ethnl_parse_header_dev_get to ↵Heiner Kallweit
ethnl_ops_begin If device is runtime-suspended and not accessible then it may be flagged as not present. If checking whether device is present is done too early then we may bail out before we have the chance to runtime-resume the device. Therefore move this check to ethnl_ops_begin(). This is in preparation of a follow-up patch that tries to runtime-resume the device before executing ethtool ops. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03ethtool: move implementation of ethnl_ops_begin/complete to netlink.cHeiner Kallweit
In preparation of subsequent extensions to both functions move the implementations from netlink.h to netlink.c. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03ethtool: runtime-resume netdev parent before ethtool ioctl opsHeiner Kallweit
If a network device is runtime-suspended then: - network device may be flagged as detached and all ethtool ops (even if not accessing the device) will fail because netif_device_present() returns false - ethtool ops may fail because device is not accessible (e.g. because being in D3 in case of a PCI device) It may not be desirable that userspace can't use even simple ethtool ops that not access the device if interface or link is down. To be more friendly to userspace let's ensure that device is runtime-resumed when executing the respective ethtool op in kernel. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03net: Keep vertical alignmentYajun Deng
Those files under /proc/net/stat/ don't have vertical alignment, it looks very difficult. Modify the seq_printf statement, keep vertical alignment. v2: - Use seq_puts() and seq_printf() correctly. Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03bpf: use skb_expand_head in bpf_out_neigh_v4/6Vasily Averin
Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate a new skb if possible. Additionally this patch replaces commonly used dereferencing with variables. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03ax25: use skb_expand_headVasily Averin
Use skb_expand_head() in ax25_transmit_buffer and ax25_rt_build_path. Unlike skb_realloc_headroom, new helper does not allocate a new skb if possible. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03ipv4: use skb_expand_head in ip_finish_output2Vasily Averin
Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate a new skb if possible. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03ipv6: use skb_expand_head in ip6_xmitVasily Averin
Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate a new skb if possible. Additionally this patch replaces commonly used dereferencing with variables. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03ipv6: use skb_expand_head in ip6_finish_output2Vasily Averin
Unlike skb_realloc_headroom, new helper skb_expand_head does not allocate a new skb if possible. Additionally this patch replaces commonly used dereferencing with variables. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-03skbuff: introduce skb_expand_head()Vasily Averin
Like skb_realloc_headroom(), new helper increases headroom of specified skb. Unlike skb_realloc_headroom(), it does not allocate a new skb if possible; copies skb->sk on new skb when as needed and frees original skb in case of failures. This helps to simplify ip[6]_finish_output2() and a few other similar cases. Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-02ipv4: Fix refcount warning for new fib_infoDavid Ahern
Ioana reported a refcount warning when booting over NFS: [ 5.042532] ------------[ cut here ]------------ [ 5.047184] refcount_t: addition on 0; use-after-free. [ 5.052324] WARNING: CPU: 7 PID: 1 at lib/refcount.c:25 refcount_warn_saturate+0xa4/0x150 ... [ 5.167201] Call trace: [ 5.169635] refcount_warn_saturate+0xa4/0x150 [ 5.174067] fib_create_info+0xc00/0xc90 [ 5.177982] fib_table_insert+0x8c/0x620 [ 5.181893] fib_magic.isra.0+0x110/0x11c [ 5.185891] fib_add_ifaddr+0xb8/0x190 [ 5.189629] fib_inetaddr_event+0x8c/0x140 fib_treeref needs to be set after kzalloc. The old code had a ++ which led to the confusion when the int was replaced by a refcount_t. Fixes: 79976892f7ea ("net: convert fib_treeref from int to refcount_t") Signed-off-by: David Ahern <dsahern@kernel.org> Reported-by: Ioana Ciornei <ciorneiioana@gmail.com> Cc: Yajun Deng <yajun.deng@linux.dev> Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20210802160221.27263-1-dsahern@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-08-02net/ipv4: Replace one-element array with flexible-array memberGustavo A. R. Silva
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. Use an anonymous union with a couple of anonymous structs in order to keep userspace unchanged: $ pahole -C ip_msfilter net/ipv4/ip_sockglue.o struct ip_msfilter { union { struct { __be32 imsf_multiaddr_aux; /* 0 4 */ __be32 imsf_interface_aux; /* 4 4 */ __u32 imsf_fmode_aux; /* 8 4 */ __u32 imsf_numsrc_aux; /* 12 4 */ __be32 imsf_slist[1]; /* 16 4 */ }; /* 0 20 */ struct { __be32 imsf_multiaddr; /* 0 4 */ __be32 imsf_interface; /* 4 4 */ __u32 imsf_fmode; /* 8 4 */ __u32 imsf_numsrc; /* 12 4 */ __be32 imsf_slist_flex[0]; /* 16 0 */ }; /* 0 16 */ }; /* 0 20 */ /* size: 20, cachelines: 1, members: 1 */ /* last cacheline: 20 bytes */ }; Also, refactor the code accordingly and make use of the struct_size() and flex_array_size() helpers. This helps with the ongoing efforts to globally enable -Warray-bounds and get us closer to being able to tighten the FORTIFY_SOURCE routines on memcpy(). [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.10/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/79 Link: https://github.com/KSPP/linux/issues/109 Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-02net: dsa: remove the struct packet_type argument from dsa_device_ops::rcv()Vladimir Oltean
No tagging driver uses this. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-02nfc: hci: pass callback data param as pointer in nci_request()Krzysztof Kozlowski
The nci_request() receives a callback function and unsigned long data argument "opt" which is passed to the callback. Almost all of the nci_request() callers pass pointer to a stack variable as data argument. Only few pass scalar value (e.g. u8). All such callbacks do not modify passed data argument and in previous commit they were made as const. However passing pointers via unsigned long removes the const annotation. The callback could simply cast unsigned long to a pointer to writeable memory. Use "const void *" as type of this "opt" argument to solve this and prevent modifying the pointed contents. This is also consistent with generic pattern of passing data arguments - via "void *". In few places which pass scalar values, use casts via "unsigned long" to suppress any warnings. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-08-02net_sched: refactor TC action init APICong Wang
TC action ->init() API has 10 parameters, it becomes harder to read. Some of them are just boolean and can be replaced by flags. Similarly for the internal API tcf_action_init() and tcf_exts_validate(). This patch converts them to flags and fold them into the upper 16 bits of "flags", whose lower 16 bits are still reserved for user-space. More specifically, the following kernel flags are introduced: TCA_ACT_FLAGS_POLICE replace 'name' in a few contexts, to distinguish whether it is compatible with policer. TCA_ACT_FLAGS_BIND replaces 'bind', to indicate whether this action is bound to a filter. TCA_ACT_FLAGS_REPLACE replaces 'ovr' in most contexts, means we are replacing an existing action. TCA_ACT_FLAGS_NO_RTNL replaces 'rtnl_held' but has the opposite meaning, because we still hold RTNL in most cases. The only user-space flag TCA_ACT_FLAGS_NO_PERCPU_STATS is untouched and still stored as before. I have tested this patch with tdc and I do not see any failure related to this patch. Tested-by: Vlad Buslov <vladbu@nvidia.com> Acked-by: Jamal Hadi Salim<jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-31Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski
Andrii Nakryiko says: ==================== bpf-next 2021-07-30 We've added 64 non-merge commits during the last 15 day(s) which contain a total of 83 files changed, 5027 insertions(+), 1808 deletions(-). The main changes are: 1) BTF-guided binary data dumping libbpf API, from Alan. 2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei. 3) Ambient BPF run context and cgroup storage cleanup, from Andrii. 4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi. 5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri. 6) bpf_{get,set}sockopt() support in BPF iterators, from Martin. 7) BPF map pinning improvements in libbpf, from Martynas. 8) Improved module BTF support in libbpf and bpftool, from Quentin. 9) Bpftool cleanups and documentation improvements, from Quentin. 10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi. 11) Increased maximum cgroup storage size, from Stanislav. 12) Small fixes and improvements to BPF tests and samples, from various folks. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits) tools: bpftool: Complete metrics list in "bpftool prog profile" doc tools: bpftool: Document and add bash completion for -L, -B options selftests/bpf: Update bpftool's consistency script for checking options tools: bpftool: Update and synchronise option list in doc and help msg tools: bpftool: Complete and synchronise attach or map types selftests/bpf: Check consistency between bpftool source, doc, completion tools: bpftool: Slightly ease bash completion updates unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg() libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf tools: bpftool: Support dumping split BTF by id libbpf: Add split BTF support for btf__load_from_kernel_by_id() tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id() tools: Free BTF objects at various locations libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id() libbpf: Rename btf__load() as btf__load_into_kernel() libbpf: Return non-null error on failures in libbpf_find_prog_btf_id() bpf: Emit better log message if bpf_iter ctx arg btf_id == 0 tools/resolve_btfids: Emit warnings and patch zero id for missing symbols bpf: Increase supported cgroup storage value size libbpf: Fix race when pinning maps in parallel ... ==================== Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-31Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Conflicting commits, all resolutions pretty trivial: drivers/bus/mhi/pci_generic.c 5c2c85315948 ("bus: mhi: pci-generic: configurable network interface MRU") 56f6f4c4eb2a ("bus: mhi: pci_generic: Apply no-op for wake using sideband wake boolean") drivers/nfc/s3fwrn5/firmware.c a0302ff5906a ("nfc: s3fwrn5: remove unnecessary label") 46573e3ab08f ("nfc: s3fwrn5: fix undefined parameter values in dev_err()") 801e541c79bb ("nfc: s3fwrn5: fix undefined parameter values in dev_err()") MAINTAINERS 7d901a1e878a ("net: phy: add Maxlinear GPY115/21x/24x driver") 8a7b46fa7902 ("MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30Merge tag 'net-5.14-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi (mac80211) and netfilter trees. Current release - regressions: - mac80211: fix starting aggregation sessions on mesh interfaces Current release - new code bugs: - sctp: send pmtu probe only if packet loss in Search Complete state - bnxt_en: add missing periodic PHC overflow check - devlink: fix phys_port_name of virtual port and merge error - hns3: change the method of obtaining default ptp cycle - can: mcba_usb_start(): add missing urb->transfer_dma initialization Previous releases - regressions: - set true network header for ECN decapsulation - mlx5e: RX, avoid possible data corruption w/ relaxed ordering and LRO - phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811 PHY - sctp: fix return value check in __sctp_rcv_asconf_lookup Previous releases - always broken: - bpf: - more spectre corner case fixes, introduce a BPF nospec instruction for mitigating Spectre v4 - fix OOB read when printing XDP link fdinfo - sockmap: fix cleanup related races - mac80211: fix enabling 4-address mode on a sta vif after assoc - can: - raw: raw_setsockopt(): fix raw_rcv panic for sock UAF - j1939: j1939_session_deactivate(): clarify lifetime of session object, avoid UAF - fix number of identical memory leaks in USB drivers - tipc: - do not blindly write skb_shinfo frags when doing decryption - fix sleeping in tipc accept routine" * tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits) gve: Update MAINTAINERS list can: esd_usb2: fix memory leak can: ems_usb: fix memory leak can: usb_8dev: fix memory leak can: mcba_usb_start(): add missing urb->transfer_dma initialization can: hi311x: fix a signedness bug in hi3110_cmd() MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver bpf: Fix leakage due to insufficient speculative store bypass mitigation bpf: Introduce BPF nospec instruction for mitigating Spectre v4 sis900: Fix missing pci_disable_device() in probe and remove net: let flow have same hash in two directions nfc: nfcsim: fix use after free during module unload tulip: windbond-840: Fix missing pci_disable_device() in probe and remove sctp: fix return value check in __sctp_rcv_asconf_lookup nfc: s3fwrn5: fix undefined parameter values in dev_err() net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32 net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev() net/mlx5: Unload device upon firmware fatal error net/mlx5e: Fix page allocation failure for ptp-RQ over SF net/mlx5e: Fix page allocation failure for trap-RQ over SF ...
2021-07-30devlink: Allocate devlink directly in requested net namespaceLeon Romanovsky
There is no need in extra call indirection and check from impossible flow where someone tries to set namespace without prior call to devlink_alloc(). Instead of this extra logic and additional EXPORT_SYMBOL, use specialized devlink allocation function that receives net namespace as an argument. Such specialized API allows clear view when devlink initialized in wrong net namespace and/or kernel users don't try to change devlink namespace under the hood. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30devlink: Break parameter notification sequence to be before/after ↵Leon Romanovsky
unload/load driver The change of namespaces during devlink reload calls to driver unload before it accesses devlink parameters. The commands below causes to use-after-free bug when trying to get flow steering mode. * ip netns add n1 * devlink dev reload pci/0000:00:09.0 netns n1 ================================================================== BUG: KASAN: use-after-free in mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] Read of size 4 at addr ffff888009d04308 by task devlink/275 CPU: 6 PID: 275 Comm: devlink Not tainted 5.12.0-rc2+ #2853 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x93/0xc2 print_address_description.constprop.0+0x18/0x140 ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] kasan_report.cold+0x7c/0xd8 ? mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] mlx5_devlink_fs_mode_get+0x96/0xa0 [mlx5_core] devlink_nl_param_fill+0x1c8/0xe80 ? __free_pages_ok+0x37a/0x8a0 ? devlink_flash_update_timeout_notify+0xd0/0xd0 ? lock_acquire+0x1a9/0x6d0 ? fs_reclaim_acquire+0xb7/0x160 ? lock_is_held_type+0x98/0x110 ? 0xffffffff81000000 ? lock_release+0x1f9/0x6c0 ? fs_reclaim_release+0xa1/0xf0 ? lock_downgrade+0x6d0/0x6d0 ? lock_is_held_type+0x98/0x110 ? lock_is_held_type+0x98/0x110 ? memset+0x20/0x40 ? __build_skb_around+0x1f8/0x2b0 devlink_param_notify+0x6d/0x180 devlink_reload+0x1c3/0x520 ? devlink_remote_reload_actions_performed+0x30/0x30 ? mutex_trylock+0x24b/0x2d0 ? devlink_nl_cmd_reload+0x62b/0x1070 devlink_nl_cmd_reload+0x66d/0x1070 ? devlink_reload+0x520/0x520 ? devlink_get_from_attrs+0x1bc/0x260 ? devlink_nl_pre_doit+0x64/0x4d0 genl_family_rcv_msg_doit+0x1e9/0x2f0 ? mutex_lock_io_nested+0x1130/0x1130 ? genl_family_rcv_msg_attrs_parse.constprop.0+0x240/0x240 ? security_capable+0x51/0x90 genl_rcv_msg+0x27f/0x4a0 ? genl_get_cmd+0x3c0/0x3c0 ? lock_acquire+0x1a9/0x6d0 ? devlink_reload+0x520/0x520 ? lock_release+0x6c0/0x6c0 netlink_rcv_skb+0x11d/0x340 ? genl_get_cmd+0x3c0/0x3c0 ? netlink_ack+0x9f0/0x9f0 ? lock_release+0x1f9/0x6c0 genl_rcv+0x24/0x40 netlink_unicast+0x433/0x700 ? netlink_attachskb+0x730/0x730 ? _copy_from_iter_full+0x178/0x650 ? __alloc_skb+0x113/0x2b0 netlink_sendmsg+0x6f1/0xbd0 ? netlink_unicast+0x700/0x700 ? lock_is_held_type+0x98/0x110 ? netlink_unicast+0x700/0x700 sock_sendmsg+0xb0/0xe0 __sys_sendto+0x193/0x240 ? __x64_sys_getpeername+0xb0/0xb0 ? do_sys_openat2+0x10b/0x370 ? __up_read+0x1a1/0x7b0 ? do_user_addr_fault+0x219/0xdc0 ? __x64_sys_openat+0x120/0x1d0 ? __x64_sys_open+0x1a0/0x1a0 __x64_sys_sendto+0xdd/0x1b0 ? syscall_enter_from_user_mode+0x1d/0x50 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7fc69d0af14a Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b8 0f 1f 00 f3 0f 1e fa 41 89 ca 64 8b 04 25 18 00 00 00 85 c0 75 15 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 76 c3 0f 1f 44 00 00 55 48 83 ec 30 44 89 4c RSP: 002b:00007ffc1d8292f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fc69d0af14a RDX: 0000000000000038 RSI: 0000555f57c56440 RDI: 0000000000000003 RBP: 0000555f57c56410 R08: 00007fc69d17b200 R09: 000000000000000c R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 Allocated by task 146: kasan_save_stack+0x1b/0x40 __kasan_kmalloc+0x99/0xc0 mlx5_init_fs+0xf0/0x1c50 [mlx5_core] mlx5_load+0xd2/0x180 [mlx5_core] mlx5_init_one+0x2f6/0x450 [mlx5_core] probe_one+0x47d/0x6e0 [mlx5_core] pci_device_probe+0x2a0/0x4a0 really_probe+0x20a/0xc90 driver_probe_device+0xd8/0x380 device_driver_attach+0x1df/0x250 __driver_attach+0xff/0x240 bus_for_each_dev+0x11e/0x1a0 bus_add_driver+0x309/0x570 driver_register+0x1ee/0x380 0xffffffffa06b8062 do_one_initcall+0xd5/0x410 do_init_module+0x1c8/0x760 load_module+0x6d8b/0x9650 __do_sys_finit_module+0x118/0x1b0 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae Freed by task 275: kasan_save_stack+0x1b/0x40 kasan_set_track+0x1c/0x30 kasan_set_free_info+0x20/0x30 __kasan_slab_free+0x102/0x140 slab_free_freelist_hook+0x74/0x1b0 kfree+0xd7/0x2a0 mlx5_unload+0x16/0xb0 [mlx5_core] mlx5_unload_one+0xae/0x120 [mlx5_core] mlx5_devlink_reload_down+0x1bc/0x380 [mlx5_core] devlink_reload+0x141/0x520 devlink_nl_cmd_reload+0x66d/0x1070 genl_family_rcv_msg_doit+0x1e9/0x2f0 genl_rcv_msg+0x27f/0x4a0 netlink_rcv_skb+0x11d/0x340 genl_rcv+0x24/0x40 netlink_unicast+0x433/0x700 netlink_sendmsg+0x6f1/0xbd0 sock_sendmsg+0xb0/0xe0 __sys_sendto+0x193/0x240 __x64_sys_sendto+0xdd/0x1b0 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae The buggy address belongs to the object at ffff888009d04300 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 8 bytes inside of 128-byte region [ffff888009d04300, ffff888009d04380) The buggy address belongs to the page: page:0000000086a64ecc refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888009d04000 pfn:0x9d04 head:0000000086a64ecc order:1 compound_mapcount:0 flags: 0x4000000000010200(slab|head) raw: 4000000000010200 ffffea0000203980 0000000200000002 ffff8880050428c0 raw: ffff888009d04000 000000008020001d 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888009d04200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888009d04280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888009d04300: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888009d04380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888009d04400: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The right solution to devlink reload is to notify about deletion of parameters, unload driver, change net namespaces, load driver and notify about addition of parameters. Fixes: 070c63f20f6c ("net: devlink: allow to change namespaces during reload") Reviewed-by: Parav Pandit <parav@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()Cong Wang
As Eric noticed, __unix_dgram_recvmsg() may acquire u->iolock too, so we have to release it before calling this function. Fixes: 9825d866ce0d ("af_unix: Implement unix_dgram_bpf_recvmsg()") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jakub Sitnicki <jakub@cloudflare.com> Acked-by: John Fastabend <john.fastabend@gmail.com>
2021-07-30sk_buff: avoid potentially clearing 'slow_gro' fieldPaolo Abeni
If skb_dst_set_noref() is invoked with a NULL dst, the 'slow_gro' field is cleared, too. That could lead to wrong behavior if the skb later enters the GRO stage. Fix the potential issue replacing preserving a non-zero value of the 'slow_gro' field. Additionally, fix a comment typo. Reported-by: Sabrina Dubroca <sd@queasysnail.net> Reported-by: Jakub Kicinski <kuba@kernel.org> Fixes: 8a886b142bd0 ("sk_buff: track dst status in slow_gro") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Link: https://lore.kernel.org/r/aa42529252dc8bb02bd42e8629427040d1058537.1627662501.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30net: netlink: Remove unused functionYajun Deng
lockdep_genl_is_held() and its caller arm not used now, just remove them. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Link: https://lore.kernel.org/r/20210729074854.8968-1-yajun.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30nfc: hci: cleanup unneeded spacesKrzysztof Kozlowski
No need for multiple spaces in variable declaration (the code does not use them in other places). No functional change. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30nfc: nci: constify several pointers to u8, sk_buff and other structsKrzysztof Kozlowski
Several functions receive pointers to u8, sk_buff or other structs but do not modify the contents so make them const. This allows doing the same for local variables and in total makes the code a little bit safer. This makes const also data passed as "unsigned long opt" argument to nci_request() function. Usual flow for such functions is: 1. Receive "u8 *" and store it (the pointer) in a structure allocated on stack (e.g. struct nci_set_config_param), 2. Call nci_request() or __nci_request() passing a callback function an the pointer to the structure via an "unsigned long opt", 3. nci_request() calls the callback which dereferences "unsigned long opt" in a read-only way. This converts all above paths to use proper pointer to const data, so entire flow is safer. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30nfc: constify local pointer variablesKrzysztof Kozlowski
Few pointers to struct nfc_target and struct nfc_se can be made const. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30nfc: constify several pointers to u8, char and sk_buffKrzysztof Kozlowski
Several functions receive pointers to u8, char or sk_buff but do not modify the contents so make them const. This allows doing the same for local variables and in total makes the code a little bit safer. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30nfc: hci: annotate nfc_llc_init() as __initKrzysztof Kozlowski
The nfc_llc_init() is used only in other __init annotated context. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30nfc: annotate af_nfc_exit() as __exitKrzysztof Kozlowski
The af_nfc_exit() is used only in other __exit annotated context (nfc_exit()). Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-30net: convert fib_treeref from int to refcount_tYajun Deng
refcount_t type should be used instead of int when fib_treeref is used as a reference counter,and avoid use-after-free risks. Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210729071350.28919-1-yajun.deng@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2021-07-29net: dsa: don't set skb->offload_fwd_mark when not offloading the bridgeVladimir Oltean
DSA has gained the recent ability to deal gracefully with upper interfaces it cannot offload, such as the bridge, bonding or team drivers. When such uppers exist, the ports are still in standalone mode as far as the hardware is concerned. But when we deliver packets to the software bridge in order for that to do the forwarding, there is an unpleasant surprise in that the bridge will refuse to forward them. This is because we unconditionally set skb->offload_fwd_mark = true, meaning that the bridge thinks the frames were already forwarded in hardware by us. Since dp->bridge_dev is populated only when there is hardware offload for it, but not in the software fallback case, let's introduce a new helper that can be called from the tagger data path which sets the skb->offload_fwd_mark accordingly to zero when there is no hardware offload for bridging. This lets the bridge forward packets back to other interfaces of our switch, if needed. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29net/sched: store the last executed chain also for clsact egressDavide Caratti
currently, only 'ingress' and 'clsact ingress' qdiscs store the tc 'chain id' in the skb extension. However, userspace programs (like ovs) are able to setup egress rules, and datapath gets confused in case it doesn't find the 'chain id' for a packet that's "recirculated" by tc. Change tcf_classify() to have the same semantic as tcf_classify_ingress() so that a single function can be called in ingress / egress, using the tc ingress / egress block respectively. Suggested-by: Alaa Hleilel <alaa@nvidia.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29net: dsa: tag_sja1105: fix control packets on SJA1110 being received on an ↵Vladimir Oltean
imprecise port On RX, a control packet with SJA1110 will have: - an in-band control extension (DSA tag) composed of a header and an optional trailer (if it is a timestamp frame). We can (and do) deduce the source port and switch id from this. - a VLAN header, which can either be the tag_8021q RX VLAN (pvid) or the bridge VLAN. The sja1105_vlan_rcv() function attempts to deduce the source port and switch id a second time from this. The basic idea is that even though we don't need the source port information from the tag_8021q header if it's a control packet, we do need to strip that header before we pass it on to the network stack. The problem is that we call sja1105_vlan_rcv for ports under VLAN-aware bridges, and that function tells us it couldn't identify a tag_8021q header, so we need to perform imprecise RX by VID. Well, we don't, because we already know the source port and switch ID. This patch drops the return value from sja1105_vlan_rcv and we just look at the source_port and switch_id values from sja1105_rcv and sja1110_rcv which were initialized to -1. If they are still -1 it means we need to perform imprecise RX. Fixes: 884be12f8566 ("net: dsa: sja1105: add support for imprecise RX") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Allow per-netns default networksMatt Johnston
Currently we have a compile-time default network (MCTP_INITIAL_DEFAULT_NET). This change introduces a default_net field on the net namespace, allowing future configuration for new interfaces. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add dest neighbour lladdr to route outputMatt Johnston
Now that we have a neighbour implementation, hook it up to the output path to set the dest hardware address for outgoing packets. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Implement message fragmentation & reassemblyJeremy Kerr
This change implements MCTP fragmentation (based on route & device MTU), and corresponding reassembly. The MCTP specification only allows for fragmentation on the originating message endpoint, and reassembly on the destination endpoint - intermediate nodes do not need to reassemble/refragment. Consequently, we only fragment in the local transmit path, and reassemble locally-bound packets. Messages are required to be in-order, so we simply cancel reassembly on out-of-order or missing packets. In the fragmentation path, we just break up the message into MTU-sized fragments; the skb structure is a simple copy for now, which we can later improve with a shared data implementation. For reassembly, we keep track of incoming message fragments using the existing tag infrastructure, allocating a key on the (src,dest,tag) tuple, and reassembles matching fragments into a skb->frag_list. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Populate socket implementationJeremy Kerr
Start filling-out the socket syscalls: bind, sendmsg & recvmsg. This requires an input route implementation, so we add to mctp_route_input, allowing lookups on binds & message tags. This just handles single-packet messages at present, we will add fragmentation in a future change. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add neighbour netlink interfaceMatt Johnston
This change adds the netlink interfaces for manipulating the MCTP neighbour table. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add neighbour implementationMatt Johnston
Add an initial neighbour table implementation, to be used in the route output path. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add netlink route managementMatt Johnston
This change adds RTM_GETROUTE, RTM_NEWROUTE & RTM_DELROUTE handlers, allowing management of the MCTP route table. Includes changes from Jeremy Kerr <jk@codeconstruct.com.au>. Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add initial routing frameworkJeremy Kerr
Add a simple routing table, and a couple of route output handlers, and the mctp packet_type & handler. Includes changes from Matt Johnston <matt@codeconstruct.com.au>. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add device handling and netlink interfaceJeremy Kerr
This change adds the infrastructure for managing MCTP netdevices; we add a pointer to the AF_MCTP-specific data to struct netdevice, and hook up the rtnetlink operations for adding and removing addresses. Includes changes from Matt Johnston <matt@codeconstruct.com.au>. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add base socket/protocol definitionsJeremy Kerr
Add an empty socket implementation, plus initialisation/destruction handlers. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29mctp: Add MCTP baseJeremy Kerr
Add basic Kconfig, an initial (empty) af_mctp source object, and {AF,PF}_MCTP definitions, and the required definitions for a new protocol type. Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29skbuff: allow 'slow_gro' for skb carring sock referencePaolo Abeni
This change leverages the infrastructure introduced by the previous patches to allow soft devices passing to the GRO engine owned skbs without impacting the fast-path. It's up to the GRO caller ensuring the slow_gro bit validity before invoking the GRO engine. The new helper skb_prepare_for_gro() is introduced for that goal. On slow_gro, skbs are aggregated only with equal sk. Additionally, skb truesize on GRO recycle and free is correctly updated so that sk wmem is not changed by the GRO processing. rfc-> v1: - fixed bad truesize on dev_gro_receive NAPI_FREE - use the existing state bit Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29net: optimize GRO for the common case.Paolo Abeni
After the previous patches, at GRO time, skb->slow_gro is usually 0, unless the packets comes from some H/W offload slowpath or tunnel. We can optimize the GRO code assuming !skb->slow_gro is likely. This remove multiple conditionals in the most common path, at the price of an additional one when we hit the above "slow-paths". Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29sk_buff: track extension status in slow_groPaolo Abeni
Similar to the previous one, but tracking the active_extensions field status. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2021-07-29 The following pull-request contains BPF updates for your *net* tree. We've added 9 non-merge commits during the last 14 day(s) which contain a total of 20 files changed, 446 insertions(+), 138 deletions(-). The main changes are: 1) Fix UBSAN out-of-bounds splat for showing XDP link fdinfo, from Lorenz Bauer. 2) Fix insufficient Spectre v4 mitigation in BPF runtime, from Daniel Borkmann, Piotr Krysiuk and Benedict Schlueter. 3) Batch of fixes for BPF sockmap found under stress testing, from John Fastabend. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2021-07-28net: bridge: switchdev: treat local FDBs the same as entries towards the bridgeVladimir Oltean
Currently the following script: 1. ip link add br0 type bridge vlan_filtering 1 && ip link set br0 up 2. ip link set swp2 up && ip link set swp2 master br0 3. ip link set swp3 up && ip link set swp3 master br0 4. ip link set swp4 up && ip link set swp4 master br0 5. bridge vlan del dev swp2 vid 1 6. bridge vlan del dev swp3 vid 1 7. ip link set swp4 nomaster 8. ip link set swp3 nomaster produces the following output: [ 641.010738] sja1105 spi0.1: port 2 failed to delete 00:1f:7b:63:02:48 vid 1 from fdb: -2 [ swp2, swp3 and br0 all have the same MAC address, the one listed above ] In short, this happens because the number of FDB entry additions notified to switchdev is unbalanced with the number of deletions. At step 1, the bridge has a random MAC address. At step 2, the br_fdb_replay of swp2 receives this initial MAC address. Then the bridge inherits the MAC address of swp2 via br_fdb_change_mac_address(), and it notifies switchdev (only swp2 at this point) of the deletion of the random MAC address and the addition of 00:1f:7b:63:02:48 as a local FDB entry with fdb->dst == swp2, in VLANs 0 and the default_pvid (1). During step 7: del_nbp -> br_fdb_delete_by_port(br, p, vid=0, do_all=1); -> fdb_delete_local(br, p, f); br_fdb_delete_by_port() deletes all entries towards the ports, regardless of vid, because do_all is 1. fdb_delete_local() has logic to migrate local FDB entries deleted from one port to another port which shares the same MAC address and is in the same VLAN, or to the bridge device itself. This migration happens without notifying switchdev of the deletion on the old port and the addition on the new one, just fdb->dst is changed and the added_by_user flag is cleared. In the example above, the del_nbp(swp4) causes the "addr 00:1f:7b:63:02:48 vid 1" local FDB entry with fdb->dst == swp4 that existed up until then to be migrated directly towards the bridge (fdb->dst == NULL). This is because it cannot be migrated to any of the other ports (swp2 and swp3 are not in VLAN 1). After the migration to br0 takes place, swp4 requests a deletion replay of all FDB entries. Since the "addr 00:1f:7b:63:02:48 vid 1" entry now point towards the bridge, a deletion of it is replayed. There was just a prior addition of this address, so the switchdev driver deletes this entry. Then, the del_nbp(swp3) at step 8 triggers another br_fdb_replay, and switchdev is notified again to delete "addr 00:1f:7b:63:02:48 vid 1". But it can't because it no longer has it, so it returns -ENOENT. There are other possibilities to trigger this issue, but this is by far the simplest to explain. To fix this, we must avoid the situation where the addition of an FDB entry is notified to switchdev as a local entry on a port, and the deletion is notified on the bridge itself. Considering that the 2 types of FDB entries are completely equivalent and we cannot have the same MAC address as a local entry on 2 bridge ports, or on a bridge port and pointing towards the bridge at the same time, it makes sense to hide away from switchdev completely the fact that a local FDB entry is associated with a given bridge port at all. Just say that it points towards the bridge, it should make no difference whatsoever to the switchdev driver and should even lead to a simpler overall implementation, will less cases to handle. This also avoids any modification at all to the core bridge driver, just what is reported to switchdev changes. With the local/permanent entries on bridge ports being already reported to user space, it is hard to believe that the bridge behavior can change in any backwards-incompatible way such as making all local FDB entries point towards the bridge. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>