aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2018-01-10bpf: export function to write into verifier log bufferQuentin Monnet
Rename the BPF verifier `verbose()` to `bpf_verifier_log_write()` and export it, so that other components (in particular, drivers for BPF offload) can reuse the user buffer log to dump error messages at verification time. Renaming `verbose()` was necessary in order to avoid a name so generic to be exported to the global namespace. However to prevent too much pain for backports, the calls to `verbose()` in the kernel BPF verifier were not changed. Instead, use function aliasing to make `verbose` point to `bpf_verifier_log_write`. Another solution could consist in making a wrapper around `verbose()`, but since it is a variadic function, I don't see a clean way without creating two identical wrappers, one for the verifier and one to export. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: add signed jump insnsNic Viljoen
This patch adds signed jump instructions (jsgt, jsge, jslt, jsle) to the nfp jit. As well as adding the additional required raw assembler branch mask to nfp_asm.h Signed-off-by: Nic Viljoen <nick.viljoen@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: hand over to BPF offload app at coarser granularityJakub Kicinski
Instead of having an app callback per message type hand off all offload-related handling to apps with one "rest of ndo_bpf" callback. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: use a large constant in unresolved branchesJakub Kicinski
To make absolute relocated branches (branches which will be completely rewritten with br_set_offset()) distinguishable in user space dumps from normal jumps add a large offset to them. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: don't depend on high order allocations for program imageJakub Kicinski
The translator pre-allocates a buffer of maximal program size. Due to HW/FW limitations the program buffer can't currently be longer than 128Kb, so we used to kmalloc() it, and then map for DMA directly. Now that the late branch resolution is copying the program image anyway, we can just kvmalloc() the buffer. While at it, after translation reallocate the buffer to save space. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: relocate jump targets just before the loadJakub Kicinski
Don't translate the program assuming it will be loaded at a given address. This will be required for sharing programs between ports of the same NIC, tail calls and subprograms. It will also make the jump targets easier to understand when dumping the program to user space. Translate the program as if it was going to be loaded at address zero. When load happens add the load offset in and set addresses of special branches. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: add helpers for modifying branch addressesJakub Kicinski
In preparation for better handling of relocations move existing helper for setting branch offset to nfp_asm.c and add two more. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: move jump resolution to jit.cJakub Kicinski
Jump target resolution should be in jit.c not offload.c. No functional changes. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: allow disabling TC offloads when XDP activeJakub Kicinski
TC BPF offload was added first, so we used to assume that the ethtool TC HW offload flag cannot be touched whenever any BPF program is loaded on the NIC. This unncessarily limits changes to the TC flag when offloaded program is XDP. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: don't allow changing MTU above BPF offload limit when activeJakub Kicinski
When BPF offload is active we need may need to restrict the MTU changes more than just to the limitation of the kernel XDP datapath. Allow the BPF code to veto a MTU change. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: bpf: round up the size of the stackJakub Kicinski
Kernel enforces the alignment of the bottom of the stack, NFP deals with positive offsets better so we should align the top of the stack. Round the stack size to NFP word size (4B). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: fix incumbent kdoc warningsJakub Kicinski
We should use % instead of @ for documenting preprocessor defines. Add missing documentation of __NFP_REPR_TYPE_MAX. This gets rid of all remaining kdoc warnings in the driver. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10nfp: don't try to register XDP rxq structures on control queuesJakub Kicinski
Some RX rings are used for control messages, those will not have a netdev pointer in dp. Skip XDP rxq handling on those rings. Fixes: 7f1c684a8966 ("nfp: setup xdp_rxq_info") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10Merge branch 'bpf-xdp-rxq-fixes'Daniel Borkmann
Jakub Kicinski says: ==================== Two more trivial fixes to the recent XDP RXQ series. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10net: free RX queue structuresJakub Kicinski
Looks like commit e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info") replaced kvfree(dev->_rx) in free_netdev() with a call to netif_free_rx_queues() which doesn't actually free the rings? While at it remove the unnecessary temporary variable. Fixes: e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-10net: use the right variant of kfreeJakub Kicinski
kvzalloc'ed memory should be kvfree'd. Fixes: e817f85652c1 ("xdp: generic XDP handling of xdp_rxq_info") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Acked-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-08bpf: fix verifier GPF in kmalloc failure pathAlexei Starovoitov
syzbot reported the following panic in the verifier triggered by kmalloc error injection: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:copy_func_state kernel/bpf/verifier.c:403 [inline] RIP: 0010:copy_verifier_state+0x364/0x590 kernel/bpf/verifier.c:431 Call Trace: pop_stack+0x8c/0x270 kernel/bpf/verifier.c:449 push_stack kernel/bpf/verifier.c:491 [inline] check_cond_jmp_op kernel/bpf/verifier.c:3598 [inline] do_check+0x4b60/0xa050 kernel/bpf/verifier.c:4731 bpf_check+0x3296/0x58c0 kernel/bpf/verifier.c:5489 bpf_prog_load+0xa2a/0x1b00 kernel/bpf/syscall.c:1198 SYSC_bpf kernel/bpf/syscall.c:1807 [inline] SyS_bpf+0x1044/0x4420 kernel/bpf/syscall.c:1769 when copy_verifier_state() aborts in the middle due to kmalloc failure some of the frames could have been partially copied while current free_verifier_state() loop for (i = 0; i <= state->curframe; i++) assumed that all frames are non-null. Simply fix it by adding 'if (!state)' to free_func_state(). Also avoid stressing copy frame logic more if kzalloc fails in push_stack() free env->cur_state right away. Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)") Reported-by: syzbot+32ac5a3e473f2e01cfc7@syzkaller.appspotmail.com Reported-by: syzbot+fa99e24f3c29d269a7d5@syzkaller.appspotmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-07Merge branch 'ipv6-ipv4-nexthop-align'David S. Miller
Ido Schimmel says: ==================== ipv6: Align nexthop behaviour with IPv4 This set tries to eliminate some differences between IPv4's and IPv6's treatment of nexthops. These differences are most likely a side effect of IPv6's data structures (specifically 'rt6_info') that incorporate both the route and the nexthop and the late addition of ECMP support in commit 51ebd3181572 ("ipv6: add support of equal cost multipath (ECMP)"). IPv4 and IPv6 do not react the same to certain netdev events. For example, upon carrier change affected IPv4 nexthops are marked using the RTNH_F_LINKDOWN flag and the nexthop group is rebalanced accordingly. IPv6 on the other hand, does nothing which forces us to perform a carrier check during route lookup and dump. This makes it difficult to introduce features such as non-equal-cost multipath that are built on top of this set [1]. In addition, when a netdev is put administratively down IPv4 nexthops are marked using the RTNH_F_DEAD flag, whereas IPv6 simply flushes all the routes using these nexthops. To be consistent with IPv4, multipath routes should only be flushed when all nexthops in the group are considered dead. The first 12 patches introduce non-functional changes that store the RTNH_F_DEAD and RTNH_F_LINKDOWN flags in IPv6 routes based on netdev events, in a similar fashion to IPv4. This allows us to remove the carrier check performed during route lookup and dump. The next three patches make sure we only flush a multipath route when all of its nexthops are dead. Last three patches add test cases for IPv4/IPv6 FIB. These verify that both address families react similarly to netdev events. Finally, this series also serves as a good first step towards David Ahern's goal of treating nexthops as standalone objects [2], as it makes the code more in line with IPv4 where the nexthop and the nexthop group are separate objects from the route itself. 1. https://github.com/idosch/linux/tree/ipv6-nexthops 2. http://vger.kernel.org/netconf2017_files/nexthop-objects.pdf Changes since RFC (feedback from David Ahern): * Remove redundant declaration of rt6_ifdown() in patch 4 and adjust comment referencing it accordingly * Drop patch to flush multipath routes upon NETDEV_UNREGISTER. Reword cover letter accordingly * Use a temporary variable to make code more readable in patch 15 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07selftests: fib_tests: Add test cases for netdev carrier changeIdo Schimmel
Check that IPv4 and IPv6 react the same when the carrier of a netdev is toggled. Local routes should not be affected by this, whereas unicast routes should. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07selftests: fib_tests: Add test cases for netdev downIdo Schimmel
Check that IPv4 and IPv6 react the same when a netdev is being put administratively down. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07selftests: fib_tests: Add test cases for IPv4/IPv6 FIBIdo Schimmel
Add test cases to check that IPv4 and IPv6 react to a netdev being unregistered as expected. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Flush multipath routes when all siblings are deadIdo Schimmel
By default, IPv6 deletes nexthops from a multipath route when the nexthop device is put administratively down. This differs from IPv4 where the nexthops are kept, but marked with the RTNH_F_DEAD flag. A multipath route is flushed when all of its nexthops become dead. Align IPv6 with IPv4 and have it conform to the same guidelines. In case the multipath route needs to be flushed, its siblings are flushed one by one. Otherwise, the nexthops are marked with the appropriate flags and the tree walker is instructed to skip all the siblings. As explained in previous patches, care is taken to update the sernum of the affected tree nodes, so as to prevent the use of wrong dst entries. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Take table lock outside of sernum update functionIdo Schimmel
The next patch is going to allow dead routes to remain in the FIB tree in certain situations. When this happens we need to be sure to bump the sernum of the nodes where these are stored so that potential copies cached in sockets are invalidated. The function that performs this update assumes the table lock is not taken when it is invoked, but that will not be the case when it is invoked by the tree walker. Have the function assume the lock is taken and make the single caller take the lock itself. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Export sernum update functionIdo Schimmel
We are going to allow dead routes to stay in the FIB tree (e.g., when they are part of a multipath route, directly connected route with no carrier) and revive them when their nexthop device gains carrier or when it is put administratively up. This is equivalent to the addition of the route to the FIB tree and we should therefore take care of updating the sernum of all the parent nodes of the node where the route is stored. Otherwise, we risk sockets caching and using sub-optimal dst entries. Export the function that performs the above, so that it could be invoked from fib6_ifup() later on. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Teach tree walker to skip multipath routesIdo Schimmel
As explained in previous patch, fib6_ifdown() needs to consider the state of all the sibling routes when a multipath route is traversed. This is done by evaluating all the siblings when the first sibling in a multipath route is traversed. If the multipath route does not need to be flushed (e.g., not all siblings are dead), then we should just skip the multipath route as our work is done. Have the tree walker jump to the last sibling when it is determined that the multipath route needs to be skipped. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Add explicit flush indication to routesIdo Schimmel
When routes that are a part of a multipath route are evaluated by fib6_ifdown() in response to NETDEV_DOWN and NETDEV_UNREGISTER events the state of their sibling routes is not considered. This will change in subsequent patches in order to align IPv6 with IPv4's behavior. For example, when the last sibling in a multipath route becomes dead, the entire multipath route needs to be removed. To prevent the tree walker from re-evaluating all the sibling routes each time, we can simply evaluate them once - when the first sibling is traversed. If we determine the entire multipath route needs to be removed, then the 'should_flush' bit is set in all the siblings, which will cause the walker to flush them when it traverses them. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Report dead flag during route dumpIdo Schimmel
Up until now the RTNH_F_DEAD flag was only reported in route dump when the 'ignore_routes_with_linkdown' sysctl was set. This is expected as dead routes were flushed otherwise. The reliance on this sysctl is going to be removed, so we need to report the flag regardless of the sysctl's value. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Ignore dead routes during lookupIdo Schimmel
Currently, dead routes are only present in the routing tables in case the 'ignore_routes_with_linkdown' sysctl is set. Otherwise, they are flushed. Subsequent patches are going to remove the reliance on this sysctl and make IPv6 more consistent with IPv4. Before this is done, we need to make sure dead routes are skipped during route lookup, so as to not cause packet loss. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Check nexthop flags in route dump instead of carrierIdo Schimmel
Similar to previous patch, there is no need to check for the carrier of the nexthop device when dumping the route and we can instead check for the presence of the RTNH_F_LINKDOWN flag. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Check nexthop flags during route lookup instead of carrierIdo Schimmel
Now that the RTNH_F_LINKDOWN flag is set in nexthops, we can avoid the need to dereference the nexthop device and check its carrier and instead check for the presence of the flag. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Set nexthop flags during route creationIdo Schimmel
It is valid to install routes with a nexthop device that does not have a carrier, so we need to make sure they're marked accordingly. As explained in the previous patch, host and anycast routes are never marked with the 'linkdown' flag. Note that reject routes are unaffected, as these use the loopback device which always has a carrier. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Set nexthop flags upon carrier changeIdo Schimmel
Similar to IPv4, when the carrier of a netdev changes we should toggle the 'linkdown' flag on all the nexthops using it as their nexthop device. This will later allow us to test for the presence of this flag during route lookup and dump. Up until commit 4832c30d5458 ("net: ipv6: put host and anycast routes on device with address") host and anycast routes used the loopback netdev as their nexthop device and thus were not marked with the 'linkdown' flag. The patch preserves this behavior and allows one to ping the local address even when the nexthop device does not have a carrier and the 'ignore_routes_with_linkdown' sysctl is set. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Prepare to handle multiple netdev eventsIdo Schimmel
To make IPv6 more in line with IPv4 we need to be able to respond differently to different netdev events. For example, when a netdev is unregistered all the routes using it as their nexthop device should be flushed, whereas when the netdev's carrier changes only the 'linkdown' flag should be toggled. Currently, this is not possible, as the function that traverses the routing tables is not aware of the triggering event. Propagate the triggering event down, so that it could be used in later patches. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Clear nexthop flags upon netdev upIdo Schimmel
Previous patch marked nexthops with the 'dead' and 'linkdown' flags. Clear these flags when the netdev comes back up. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Mark dead nexthops with appropriate flagsIdo Schimmel
When a netdev is put administratively down or unregistered all the nexthops using it as their nexthop device should be marked with the 'dead' and 'linkdown' flags. Currently, when a route is dumped its nexthop device is tested and the flags are set accordingly. A similar check is performed during route lookup. Instead, we can simply mark the nexthops based on netdev events and avoid checking the netdev's state during route dump and lookup. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07ipv6: Remove redundant route flushing during namespace dismantleIdo Schimmel
By the time fib6_net_exit() is executed all the netdevs in the namespace have been either unregistered or pushed back to the default namespace. That is because pernet subsys operations are always ordered before pernet device operations and therefore invoked after them during namespace dismantle. Thus, all the routing tables in the namespace are empty by the time fib6_net_exit() is invoked and the call to rt6_ifdown() can be removed. This allows us to simplify the condition in fib6_ifdown() as it's only ever called with an actual netdev. Signed-off-by: Ido Schimmel <idosch@mellanox.com> Acked-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf-next 2018-01-07 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Add a start of a framework for extending struct xdp_buff without having the overhead of populating every data at runtime. Idea is to have a new per-queue struct xdp_rxq_info that holds read mostly data (currently that is, queue number and a pointer to the corresponding netdev) which is set up during rxqueue config time. When a XDP program is invoked, struct xdp_buff holds a pointer to struct xdp_rxq_info that the BPF program can then walk. The user facing BPF program that uses struct xdp_md for context can use these members directly, and the verifier rewrites context access transparently by walking the xdp_rxq_info and net_device pointers to load the data, from Jesper. 2) Redo the reporting of offload device information to user space such that it works in combination with network namespaces. The latter is reported through a device/inode tuple as similarly done in other subsystems as well (e.g. perf) in order to identify the namespace. For this to work, ns_get_path() has been generalized such that the namespace can be retrieved not only from a specific task (perf case), but also from a callback where we deduce the netns (ns_common) from a netdevice. bpftool support using the new uapi info and extensive test cases for test_offload.py in BPF selftests have been added as well, from Jakub. 3) Add two bpftool improvements: i) properly report the bpftool version such that it corresponds to the version from the kernel source tree. So pick the right linux/version.h from the source tree instead of the installed one. ii) fix bpftool and also bpf_jit_disasm build with bintutils >= 2.9. The reason for the build breakage is that binutils library changed the function signature to select the disassembler. Given this is needed in multiple tools, add a proper feature detection to the tools/build/features infrastructure, from Roman. 4) Implement the BPF syscall command BPF_MAP_GET_NEXT_KEY for the stacktrace map. It is currently unimplemented, but there are use cases where user space needs to walk all stacktrace map entries e.g. for dumping or deleting map entries w/o having to close and recreate the map. Add BPF selftests along with it, from Yonghong. 5) Few follow-up cleanups for the bpftool cgroup code: i) rename the cgroup 'list' command into 'show' as we have it for other subcommands as well, ii) then alias the 'show' command such that 'list' is accepted which is also common practice in iproute2, and iii) remove couple of newlines from error messages using p_err(), from Jakub. 6) Two follow-up cleanups to sockmap code: i) remove the unused bpf_compute_data_end_sk_skb() function and ii) only build the sockmap infrastructure when CONFIG_INET is enabled since it's only aware of TCP sockets at this time, from John. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2018-01-06Merge branch 'bpf-stacktrace-map-next-key-support'Daniel Borkmann
Yonghong Song says: ==================== The patch set implements bpf syscall command BPF_MAP_GET_NEXT_KEY for stacktrace map. Patch #1 is the core implementation and Patch #2 implements a bpf test at tools/testing/selftests/bpf directory. Please see individual patch comments for details. Changelog: v1 -> v2: - For invalid key (key pointer is non-NULL), sets next_key to be the first valid key. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-06tools/bpf: add a bpf selftest for stacktraceYonghong Song
Added a bpf selftest in test_progs at tools directory for stacktrace. The test will populate a hashtable map and a stacktrace map at the same time with the same key, stackid. The user space will compare both maps, using BPF_MAP_LOOKUP_ELEM command and BPF_MAP_GET_NEXT_KEY command, to ensure that both have the same set of keys. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-06bpf: implement syscall command BPF_MAP_GET_NEXT_KEY for stacktrace mapYonghong Song
Currently, bpf syscall command BPF_MAP_GET_NEXT_KEY is not supported for stacktrace map. However, there are use cases where user space wants to enumerate all stacktrace map entries where BPF_MAP_GET_NEXT_KEY command will be really helpful. In addition, if user space wants to delete all map entries in order to save memory and does not want to close the map file descriptor, BPF_MAP_GET_NEXT_KEY may help improve performance if map entries are sparsely populated. The implementation has similar behavior for BPF_MAP_GET_NEXT_KEY implementation in hashtab. If user provides a NULL key pointer or an invalid key, the first key is returned. Otherwise, the first valid key after the input parameter "key" is returned, or -ENOENT if no valid key can be found. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-01-05Merge branch 'xdp_rxq_info'Alexei Starovoitov
Jesper Dangaard Brouer says: ==================== V4: * Added reviewers/acks to patches * Fix patch desc in i40e that got out-of-sync with code * Add SPDX license headers for the two new files added in patch 14 V3: * Fixed bug in virtio_net driver * Removed export of xdp_rxq_info_init() V2: * Changed API exposed to drivers - Removed invocation of "init" in drivers, and only call "reg" (Suggested by Saeed) - Allow "reg" to fail and handle this in drivers (Suggested by David Ahern) * Removed the SINKQ qtype, instead allow to register as "unused" * Also fixed some drivers during testing on actual HW (noted in patches) There is a need for XDP to know more about the RX-queue a given XDP frames have arrived on. For both the XDP bpf-prog and kernel side. Instead of extending struct xdp_buff each time new info is needed, this patchset takes a different approach. Struct xdp_buff is only extended with a pointer to a struct xdp_rxq_info (allowing for easier extending this later). This xdp_rxq_info contains information related to how the driver have setup the individual RX-queue's. This is read-mostly information, and all xdp_buff frames (in drivers napi_poll) point to the same xdp_rxq_info (per RX-queue). We stress this data/cache-line is for read-mostly info. This is NOT for dynamic per packet info, use the data_meta for such use-cases. This patchset start out small, and only expose ingress_ifindex and the RX-queue index to the XDP/BPF program. Access to tangible info like the ingress ifindex and RX queue index, is fairly easy to comprehent. The other future use-cases could allow XDP frames to be recycled back to the originating device driver, by providing info on RX device and queue number. As XDP doesn't have driver feature flags, and eBPF code due to bpf-tail-calls cannot determine that XDP driver invoke it, this patchset have to update every driver that support XDP. For driver developers (review individual driver patches!): The xdp_rxq_info is tied to the drivers RX-ring(s). Whenever a RX-ring modification require (temporary) stopping RX frames, then the xdp_rxq_info should (likely) also be unregistred and re-registered, especially if reallocating the pages in the ring. Make sure ethtool set_channels does the right thing. When replacing XDP prog, if and only if RX-ring need to be changed, then also re-register the xdp_rxq_info. I'm Cc'ing the individual driver patches to the registered maintainers. Testing: I've only tested the NIC drivers I have hardware for. The general test procedure is to (DUT = Device Under Test): (1) run pktgen script pktgen_sample04_many_flows.sh (against DUT) (2) run samples/bpf program xdp_rxq_info --dev $DEV (on DUT) (3) runtime modify number of NIC queues via ethtool -L (on DUT) (4) runtime modify number of NIC ring-size via ethtool -G (on DUT) Patch based on git tree bpf-next (at commit fb982666e380c1632a): https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/ ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05samples/bpf: program demonstrating access to xdp_rxq_infoJesper Dangaard Brouer
This sample program can be used for monitoring and reporting how many packets per sec (pps) are received per NIC RX queue index and which CPU processed the packet. In itself it is a useful tool for quickly identifying RSS imbalance issues, see below. The default XDP action is XDP_PASS in-order to provide a monitor mode. For benchmarking purposes it is possible to specify other XDP actions on the cmdline --action. Output below shows an imbalance RSS case where most RXQ's deliver to CPU-0 while CPU-2 only get packets from a single RXQ. Looking at things from a CPU level the two CPUs are processing approx the same amount, BUT looking at the rx_queue_index levels it is clear that RXQ-2 receive much better service, than other RXQs which all share CPU-0. Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS XDP stats CPU pps issue-pps XDP-RX CPU 0 900,473 0 XDP-RX CPU 2 906,921 0 XDP-RX CPU total 1,807,395 RXQ stats RXQ:CPU pps issue-pps rx_queue_index 0:0 180,098 0 rx_queue_index 0:sum 180,098 rx_queue_index 1:0 180,098 0 rx_queue_index 1:sum 180,098 rx_queue_index 2:2 906,921 0 rx_queue_index 2:sum 906,921 rx_queue_index 3:0 180,098 0 rx_queue_index 3:sum 180,098 rx_queue_index 4:0 180,082 0 rx_queue_index 4:sum 180,082 rx_queue_index 5:0 180,093 0 rx_queue_index 5:sum 180,093 Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05bpf: finally expose xdp_rxq_info to XDP bpf-programsJesper Dangaard Brouer
Now all XDP driver have been updated to setup xdp_rxq_info and assign this to xdp_buff->rxq. Thus, it is now safe to enable access to some of the xdp_rxq_info struct members. This patch extend xdp_md and expose UAPI to userspace for ingress_ifindex and rx_queue_index. Access happens via bpf instruction rewrite, that load data directly from struct xdp_rxq_info. * ingress_ifindex map to xdp_rxq_info->dev->ifindex * rx_queue_index map to xdp_rxq_info->queue_index Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05xdp: generic XDP handling of xdp_rxq_infoJesper Dangaard Brouer
Hook points for xdp_rxq_info: * reg : netif_alloc_rx_queues * unreg: netif_free_rx_queues The net_device have some members (num_rx_queues + real_num_rx_queues) and data-area (dev->_rx with struct netdev_rx_queue's) that were primarily used for exporting information about RPS (CONFIG_RPS) queues to sysfs (CONFIG_SYSFS). For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info, and remove some of the CONFIG_SYSFS ifdefs. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05virtio_net: setup xdp_rxq_infoJesper Dangaard Brouer
The virtio_net driver doesn't dynamically change the RX-ring queue layout and backing pages, but instead reject XDP setup if all the conditions for XDP is not meet. Thus, the xdp_rxq_info also remains fairly static. This allow us to simply add the reg/unreg to net_device open/close functions. Driver hook points for xdp_rxq_info: * reg : virtnet_open * unreg: virtnet_close V3: - bugfix, also setup xdp.rxq in receive_mergeable() - Tested bpf-sample prog inside guest on a virtio_net device Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: virtualization@lists.linux-foundation.org Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05tun: setup xdp_rxq_infoJesper Dangaard Brouer
Driver hook points for xdp_rxq_info: * reg : tun_attach * unreg: __tun_detach I've done some manual testing of this tun driver, but I would appriciate good review and someone else running their use-case tests, as I'm not 100% sure I understand the tfile->detached semantics. V2: Removed the skb_array_cleanup() call from V1 by request from Jason Wang. Cc: Jason Wang <jasowang@redhat.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05thunderx: setup xdp_rxq_infoJesper Dangaard Brouer
This driver uses a bool scheme for "enable"/"disable" when setting up different resources. Thus, the hook points for xdp_rxq_info is done in the same function call nicvf_rcv_queue_config(). This is activated through enable/disable via nicvf_config_data_transfer(), which is tied into nicvf_stop()/nicvf_open(). Extending driver packet handler call-path nicvf_rcv_pkt_handler() with a pointer to the given struct rcv_queue, in-order to access the xdp_rxq_info data area (in nicvf_xdp_rx()). V2: Driver have no proper error path for failed XDP RX-queue info reg, as nicvf_rcv_queue_config is a void function. Cc: linux-arm-kernel@lists.infradead.org Cc: Sunil Goutham <sgoutham@cavium.com> Cc: Robert Richter <rric@kernel.org> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05nfp: setup xdp_rxq_infoJesper Dangaard Brouer
Driver hook points for xdp_rxq_info: * reg : nfp_net_rx_ring_alloc * unreg: nfp_net_rx_ring_free In struct nfp_net_rx_ring moved member @size into a hole on 64-bit. Thus, the size remaines the same after adding member @xdp_rxq. Cc: oss-drivers@netronome.com Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Simon Horman <simon.horman@netronome.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05bnxt_en: setup xdp_rxq_infoJesper Dangaard Brouer
Driver hook points for xdp_rxq_info: * reg : bnxt_alloc_rx_rings * unreg: bnxt_free_rx_rings This driver should be updated to re-register when changing allocation mode of RX rings. Tested on actual hardware. Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Michael Chan <michael.chan@broadcom.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-01-05mlx4: setup xdp_rxq_infoJesper Dangaard Brouer
Driver hook points for xdp_rxq_info: * reg : mlx4_en_create_rx_ring * unreg: mlx4_en_destroy_rx_ring Tested on actual hardware. Cc: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>