246 files changed, 13816 insertions, 3089 deletions
diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst index a26aa1b9b259..75a0dca5f295 100644 --- a/Documentation/bpf/bpf_devel_QA.rst +++ b/Documentation/bpf/bpf_devel_QA.rst @@ -149,7 +149,7 @@ In case the patch or patch series has to be reworked and sent out again in a second or later revision, it is also required to add a version number (``v2``, ``v3``, ...) into the subject prefix:: - git format-patch --subject-prefix='PATCH net-next v2' start..finish + git format-patch --subject-prefix='PATCH bpf-next v2' start..finish When changes have been requested to the patch series, always send the whole patch series again with the feedback incorporated (never send @@ -479,17 +479,18 @@ LLVM's static compiler lists the supported targets through $ llc --version LLVM (http://llvm.org/): - LLVM version 6.0.0svn + LLVM version 10.0.0 Optimized build. Default target: x86_64-unknown-linux-gnu Host CPU: skylake Registered Targets: - bpf - BPF (host endian) - bpfeb - BPF (big endian) - bpfel - BPF (little endian) - x86 - 32-bit X86: Pentium-Pro and above - x86-64 - 64-bit X86: EM64T and AMD64 + aarch64 - AArch64 (little endian) + bpf - BPF (host endian) + bpfeb - BPF (big endian) + bpfel - BPF (little endian) + x86 - 32-bit X86: Pentium-Pro and above + x86-64 - 64-bit X86: EM64T and AMD64 For developers in order to utilize the latest features added to LLVM's BPF back end, it is advisable to run the latest LLVM releases. Support @@ -517,6 +518,10 @@ from the git repositories:: The built binaries can then be found in the build/bin/ directory, where you can point the PATH variable to. +Set ``-DLLVM_TARGETS_TO_BUILD`` equal to the target you wish to build, you +will find a full list of targets within the llvm-project/llvm/lib/Target +directory. + Q: Reporting LLVM BPF issues ---------------------------- Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst index b5361b8621c9..44dc789de2b4 100644 --- a/Documentation/bpf/btf.rst +++ b/Documentation/bpf/btf.rst @@ -724,6 +724,31 @@ want to define unused entry in BTF_ID_LIST, like:: BTF_ID_UNUSED BTF_ID(struct, task_struct) +The ``BTF_SET_START/END`` macros pair defines sorted list of BTF ID values +and their count, with following syntax:: + + BTF_SET_START(set) + BTF_ID(type1, name1) + BTF_ID(type2, name2) + BTF_SET_END(set) + +resulting in following layout in .BTF_ids section:: + + __BTF_ID__set__set: + .zero 4 + __BTF_ID__type1__name1__3: + .zero 4 + __BTF_ID__type2__name2__4: + .zero 4 + +The ``struct btf_id_set set;`` variable is defined to access the list. + +The ``typeX`` name can be one of following:: + + struct, union, typedef, func + +and is used as a filter when resolving the BTF ID value. + All the BTF ID lists and sets are compiled in the .BTF_ids section and resolved during the linking phase of kernel build by ``resolve_btfids`` tool. diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index 7df2465fd108..4f2874b729c3 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -52,6 +52,7 @@ Program types prog_cgroup_sysctl prog_flow_dissector bpf_lsm + prog_sk_lookup Map types diff --git a/Documentation/bpf/prog_sk_lookup.rst b/Documentation/bpf/prog_sk_lookup.rst new file mode 100644 index 000000000000..85a305c19bcd --- /dev/null +++ b/Documentation/bpf/prog_sk_lookup.rst @@ -0,0 +1,98 @@ +.. 
SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause) + +===================== +BPF sk_lookup program +===================== + +BPF sk_lookup program type (``BPF_PROG_TYPE_SK_LOOKUP``) introduces programmability +into the socket lookup performed by the transport layer when a packet is to be +delivered locally. + +When invoked, the BPF sk_lookup program can select a socket that will receive the +incoming packet by calling the ``bpf_sk_assign()`` BPF helper function. + +Hooks for a common attach point (``BPF_SK_LOOKUP``) exist for both TCP and UDP. + +Motivation +========== + +BPF sk_lookup program type was introduced to address setup scenarios where +binding sockets to an address with ``bind()`` socket call is impractical, such +as: + +1. receiving connections on a range of IP addresses, e.g. 192.0.2.0/24, when + binding to a wildcard address ``INADDR_ANY`` is not possible due to a port + conflict, +2. receiving connections on all or a wide range of ports, i.e. an L7 proxy use + case. + +Such setups would require creating and ``bind()``'ing one socket to each of the +IP address/port pairs in the range, leading to resource consumption and potential +latency spikes during socket lookup. + +Attachment +========== + +BPF sk_lookup program can be attached to a network namespace with +``bpf(BPF_LINK_CREATE, ...)`` syscall using the ``BPF_SK_LOOKUP`` attach type and a +netns FD as attachment ``target_fd``. + +Multiple programs can be attached to one network namespace. Programs will be +invoked in the same order as they were attached. + +Hooks +===== + +The attached BPF sk_lookup programs run whenever the transport layer needs to +find a listening (TCP) or an unconnected (UDP) socket for an incoming packet. + +Incoming traffic to established (TCP) and connected (UDP) sockets is delivered +as usual without triggering the BPF sk_lookup hook. + +The attached BPF programs must return with either ``SK_PASS`` or ``SK_DROP`` +verdict code. As for other BPF program types that are network filters, +``SK_PASS`` signifies that the socket lookup should continue on to regular +hashtable-based lookup, while ``SK_DROP`` causes the transport layer to drop the +packet. + +A BPF sk_lookup program can also select a socket to receive the packet by +calling ``bpf_sk_assign()`` BPF helper. Typically, the program looks up a socket +in a map holding sockets, such as ``SOCKMAP`` or ``SOCKHASH``, and passes a +``struct bpf_sock *`` to ``bpf_sk_assign()`` helper to record the +selection. Selecting a socket only takes effect if the program has terminated +with ``SK_PASS`` code. + +When multiple programs are attached, the end result is determined from return +codes of all the programs according to the following rules: + +1. If any program returned ``SK_PASS`` and selected a valid socket, the socket + is used as the result of the socket lookup. +2. If more than one program returned ``SK_PASS`` and selected a socket, the last + selection takes effect. +3. If any program returned ``SK_DROP``, and no program returned ``SK_PASS`` and + selected a socket, socket lookup fails. +4. If all programs returned ``SK_PASS`` and none of them selected a socket, + socket lookup continues on. + +API +=== + +In its context, an instance of ``struct bpf_sk_lookup``, the BPF sk_lookup program +receives information about the packet that triggered the socket lookup. 
Namely: + +* IP version (``AF_INET`` or ``AF_INET6``), +* L4 protocol identifier (``IPPROTO_TCP`` or ``IPPROTO_UDP``), +* source and destination IP address, +* source and destination L4 port, +* the socket that has been selected with ``bpf_sk_assign()``. + +Refer to ``struct bpf_sk_lookup`` declaration in ``linux/bpf.h`` user API +header, and `bpf-helpers(7) +<https://man7.org/linux/man-pages/man7/bpf-helpers.7.html>`_ man-page section +for ``bpf_sk_assign()`` for details. + +Example +======= + +See ``tools/testing/selftests/bpf/prog_tests/sk_lookup.c`` for the reference +implementation. diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index 5bc55a4e3bce..2ccc5644cc98 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -258,14 +258,21 @@ socket into zero-copy mode or fail. XDP_SHARED_UMEM bind flag ------------------------- -This flag enables you to bind multiple sockets to the same UMEM, but -only if they share the same queue id. In this mode, each socket has -their own RX and TX rings, but the UMEM (tied to the fist socket -created) only has a single FILL ring and a single COMPLETION -ring. To use this mode, create the first socket and bind it in the normal -way. Create a second socket and create an RX and a TX ring, or at -least one of them, but no FILL or COMPLETION rings as the ones from -the first socket will be used. In the bind call, set he +This flag enables you to bind multiple sockets to the same UMEM. It +works on the same queue id, between queue ids and between +netdevs/devices. In this mode, each socket has their own RX and TX +rings as usual, but you are going to have one or more FILL and +COMPLETION ring pairs. You have to create one of these pairs per +unique netdev and queue id tuple that you bind to. + +Starting with the case where we would like to share a UMEM between +sockets bound to the same netdev and queue id. The UMEM (tied to the +first socket created) will only have a single FILL ring and a single +COMPLETION ring as there is only one unique netdev,queue_id tuple that +we have bound to. To use this mode, create the first socket and bind +it in the normal way. Create a second socket and create an RX and a TX +ring, or at least one of them, but no FILL or COMPLETION rings as the +ones from the first socket will be used. In the bind call, set the +XDP_SHARED_UMEM option and provide the initial socket's fd in the +sxdp_shared_umem_fd field. You can attach an arbitrary number of extra +sockets this way. @@ -305,11 +312,41 @@ concurrently. There are no synchronization primitives in the libbpf code that protects multiple users at this point in time. Libbpf uses this mode if you create more than one socket tied to the -same umem. However, note that you need to supply the +same UMEM. However, note that you need to supply the XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the xsk_socket__create calls and load your own XDP program as there is no built in one in libbpf that will route the traffic for you. +The second case is when you share a UMEM between sockets that are +bound to different queue ids and/or netdevs. In this case you have to +create one FILL ring and one COMPLETION ring for each unique +netdev,queue_id pair. Let us say you want to create two sockets bound +to two different queue ids on the same netdev. Create the first socket +and bind it in the normal way. 
Create a second socket and create an RX +and a TX ring, or at least one of them, and then one FILL and +COMPLETION ring for this socket. Then in the bind call, set the +XDP_SHARED_UMEM option and provide the initial socket's fd in the +sxdp_shared_umem_fd field as you registered the UMEM on that +socket. These two sockets will now share one and the same UMEM. + +There is no need to supply an XDP program like the one in the previous +case where sockets were bound to the same queue id and +device. Instead, use the NIC's packet steering capabilities to steer +the packets to the right queue. In the previous example, there is only +one queue shared among sockets, so the NIC cannot do this steering. It +can only steer between queues. + +In libbpf, you need to use the xsk_socket__create_shared() API as it +takes a reference to a FILL ring and a COMPLETION ring that will be +created for you and bound to the shared UMEM. You can use this +function for all the sockets you create, or you can use it for the +second and following ones and use xsk_socket__create() for the first +one. Both methods yield the same result. + +Note that a UMEM can be shared between sockets on the same queue id +and device, as well as between queues on the same device and between +devices at the same time. + XDP_USE_NEED_WAKEUP bind flag ----------------------------- @@ -364,7 +401,7 @@ resources by only setting up one of them. Both the FILL ring and the COMPLETION ring are mandatory as you need to have a UMEM tied to your socket. But if the XDP_SHARED_UMEM flag is used, any socket after the first one does not have a UMEM and should in that case not have any -FILL or COMPLETION rings created as the ones from the shared umem will +FILL or COMPLETION rings created as the ones from the shared UMEM will be used. Note, that the rings are single-producer single-consumer, so do not try to access them from multiple processes at the same time. See the XDP_SHARED_UMEM section. @@ -567,6 +604,17 @@ A: The short answer is no, that is not supported at the moment. The switch, or other distribution mechanism, in your NIC to direct traffic to the correct queue id and socket. +Q: My packets are sometimes corrupted. What is wrong? + +A: Care has to be taken not to feed the same buffer in the UMEM into + more than one ring at the same time. If you for example feed the + same buffer into the FILL ring and the TX ring at the same time, the + NIC might receive data into the buffer at the same time it is + sending it. This will cause some packets to become corrupted. Same + thing goes for feeding the same buffer into the FILL rings + belonging to different queue ids or netdevs bound with the + XDP_SHARED_UMEM flag. 
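To make the XDP_SHARED_UMEM flow described above concrete, here is a minimal sketch of sharing one UMEM between two sockets bound to queue ids 0 and 1 of the same netdev. It assumes libbpf's xsk.h helpers (xsk_umem__create(), xsk_socket__create() and xsk_socket__create_shared()) with default UMEM and socket configurations; the device name "eth0" and the helper function name are illustrative, and allocation of the UMEM area as well as all cleanup are omitted::

   #include <bpf/xsk.h>

   /* Minimal sketch: share one UMEM between (eth0, queue 0) and
    * (eth0, queue 1). "umem_area" must be a page-aligned buffer of
    * "size" bytes that the caller has already allocated.
    */
   static int shared_umem_two_queues(void *umem_area, __u64 size)
   {
           struct xsk_ring_prod fill0, fill1, tx0, tx1;
           struct xsk_ring_cons comp0, comp1, rx0, rx1;
           struct xsk_socket *xsk0, *xsk1;
           struct xsk_umem *umem;
           int err;

           /* Register the UMEM; fill0/comp0 become the FILL and
            * COMPLETION rings of the first netdev,queue_id tuple.
            */
           err = xsk_umem__create(&umem, umem_area, size, &fill0, &comp0,
                                  NULL);
           if (err)
                   return err;

           /* First socket is created and bound the normal way. */
           err = xsk_socket__create(&xsk0, "eth0", 0, umem, &rx0, &tx0,
                                    NULL);
           if (err)
                   return err;

           /* Second socket is bound to a different queue id, so a new
            * FILL/COMPLETION pair is supplied for that tuple.
            */
           return xsk_socket__create_shared(&xsk1, "eth0", 1, umem,
                                            &rx1, &tx1, &fill1, &comp1,
                                            NULL);
   }

As noted above, sockets sharing the same netdev and queue id would instead reuse the first socket's FILL and COMPLETION rings rather than supplying a new pair.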
+ Credits ======= diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 42b6709e6dc7..7d9ea7b41c71 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -1379,10 +1379,15 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, u8 *prog = *pprog; int cnt = 0; - if (emit_call(&prog, __bpf_prog_enter, prog)) - return -EINVAL; - /* remember prog start time returned by __bpf_prog_enter */ - emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); + if (p->aux->sleepable) { + if (emit_call(&prog, __bpf_prog_enter_sleepable, prog)) + return -EINVAL; + } else { + if (emit_call(&prog, __bpf_prog_enter, prog)) + return -EINVAL; + /* remember prog start time returned by __bpf_prog_enter */ + emit_mov_reg(&prog, true, BPF_REG_6, BPF_REG_0); + } /* arg1: lea rdi, [rbp - stack_size] */ EMIT4(0x48, 0x8D, 0x7D, -stack_size); @@ -1402,13 +1407,18 @@ static int invoke_bpf_prog(const struct btf_func_model *m, u8 **pprog, if (mod_ret) emit_stx(&prog, BPF_DW, BPF_REG_FP, BPF_REG_0, -8); - /* arg1: mov rdi, progs[i] */ - emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, - (u32) (long) p); - /* arg2: mov rsi, rbx <- start time in nsec */ - emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); - if (emit_call(&prog, __bpf_prog_exit, prog)) - return -EINVAL; + if (p->aux->sleepable) { + if (emit_call(&prog, __bpf_prog_exit_sleepable, prog)) + return -EINVAL; + } else { + /* arg1: mov rdi, progs[i] */ + emit_mov_imm64(&prog, BPF_REG_1, (long) p >> 32, + (u32) (long) p); + /* arg2: mov rsi, rbx <- start time in nsec */ + emit_mov_reg(&prog, true, BPF_REG_2, BPF_REG_6); + if (emit_call(&prog, __bpf_prog_exit, prog)) + return -EINVAL; + } *pprog = prog; return 0; diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 825c104ecba1..dc1577156bb6 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -1967,7 +1967,7 @@ static int i40e_set_ringparam(struct net_device *netdev, (new_rx_count == vsi->rx_rings[0]->count)) return 0; - /* If there is a AF_XDP UMEM attached to any of Rx rings, + /* If there is a AF_XDP page pool attached to any of Rx rings, * disallow changing the number of descriptors -- regardless * if the netdev is running or not. */ diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 2e433fdbf2c3..05c6d3ea11e6 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -3122,12 +3122,12 @@ static void i40e_config_xps_tx_ring(struct i40e_ring *ring) } /** - * i40e_xsk_umem - Retrieve the AF_XDP ZC if XDP and ZC is enabled + * i40e_xsk_pool - Retrieve the AF_XDP buffer pool if XDP and ZC is enabled * @ring: The Tx or Rx ring * - * Returns the UMEM or NULL. + * Returns the AF_XDP buffer pool or NULL. 
**/ -static struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring) +static struct xsk_buff_pool *i40e_xsk_pool(struct i40e_ring *ring) { bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi); int qid = ring->queue_index; @@ -3138,7 +3138,7 @@ static struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring) if (!xdp_on || !test_bit(qid, ring->vsi->af_xdp_zc_qps)) return NULL; - return xdp_get_umem_from_qid(ring->vsi->netdev, qid); + return xsk_get_pool_from_qid(ring->vsi->netdev, qid); } /** @@ -3157,7 +3157,7 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring) u32 qtx_ctl = 0; if (ring_is_xdp(ring)) - ring->xsk_umem = i40e_xsk_umem(ring); + ring->xsk_pool = i40e_xsk_pool(ring); /* some ATR related tx ring init */ if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) { @@ -3280,12 +3280,13 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring) xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq); kfree(ring->rx_bi); - ring->xsk_umem = i40e_xsk_umem(ring); - if (ring->xsk_umem) { + ring->xsk_pool = i40e_xsk_pool(ring); + if (ring->xsk_pool) { ret = i40e_alloc_rx_bi_zc(ring); if (ret) return ret; - ring->rx_buf_len = xsk_umem_get_rx_frame_size(ring->xsk_umem); + ring->rx_buf_len = + xsk_pool_get_rx_frame_size(ring->xsk_pool); /* For AF_XDP ZC, we disallow packets to span on * multiple buffers, thus letting us skip that * handling in the fast-path. @@ -3368,8 +3369,8 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring) ring->tail = hw->hw_addr + I40E_QRX_TAIL(pf_q); writel(0, ring->tail); - if (ring->xsk_umem) { - xsk_buff_set_rxq_info(ring->xsk_umem, &ring->xdp_rxq); + if (ring->xsk_pool) { + xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq); ok = i40e_alloc_rx_buffers_zc(ring, I40E_DESC_UNUSED(ring)); } else { ok = !i40e_alloc_rx_buffers(ring, I40E_DESC_UNUSED(ring)); @@ -3380,7 +3381,7 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring) */ dev_info(&vsi->back->pdev->dev, "Failed to allocate some buffers on %sRx ring %d (pf_q %d)\n", - ring->xsk_umem ? "UMEM enabled " : "", + ring->xsk_pool ? 
"AF_XDP ZC enabled " : "", ring->queue_index, pf_q); } @@ -12644,7 +12645,7 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi, */ if (need_reset && prog) for (i = 0; i < vsi->num_queue_pairs; i++) - if (vsi->xdp_rings[i]->xsk_umem) + if (vsi->xdp_rings[i]->xsk_pool) (void)i40e_xsk_wakeup(vsi->netdev, i, XDP_WAKEUP_RX); @@ -12923,8 +12924,8 @@ static int i40e_xdp(struct net_device *dev, switch (xdp->command) { case XDP_SETUP_PROG: return i40e_xdp_setup(vsi, xdp->prog); - case XDP_SETUP_XSK_UMEM: - return i40e_xsk_umem_setup(vsi, xdp->xsk.umem, + case XDP_SETUP_XSK_POOL: + return i40e_xsk_pool_setup(vsi, xdp->xsk.pool, xdp->xsk.queue_id); default: return -EINVAL; diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index 432a984ac335..91ab824926b9 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -636,7 +636,7 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring) unsigned long bi_size; u16 i; - if (ring_is_xdp(tx_ring) && tx_ring->xsk_umem) { + if (ring_is_xdp(tx_ring) && tx_ring->xsk_pool) { i40e_xsk_clean_tx_ring(tx_ring); } else { /* ring already cleared, nothing to do */ @@ -1335,7 +1335,7 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring) rx_ring->skb = NULL; } - if (rx_ring->xsk_umem) { + if (rx_ring->xsk_pool) { i40e_xsk_clean_rx_ring(rx_ring); goto skip_free; } @@ -1369,7 +1369,7 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring) } skip_free: - if (rx_ring->xsk_umem) + if (rx_ring->xsk_pool) i40e_clear_rx_bi_zc(rx_ring); else i40e_clear_rx_bi(rx_ring); @@ -2575,7 +2575,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget) * budget and be more aggressive about cleaning up the Tx descriptors. */ i40e_for_each_ring(ring, q_vector->tx) { - bool wd = ring->xsk_umem ? + bool wd = ring->xsk_pool ? i40e_clean_xdp_tx_irq(vsi, ring) : i40e_clean_tx_irq(vsi, ring, budget); @@ -2603,7 +2603,7 @@ int i40e_napi_poll(struct napi_struct *napi, int budget) budget_per_ring = budget; i40e_for_each_ring(ring, q_vector->rx) { - int cleaned = ring->xsk_umem ? + int cleaned = ring->xsk_pool ? 
i40e_clean_rx_irq_zc(ring, budget_per_ring) : i40e_clean_rx_irq(ring, budget_per_ring); diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h index 4036893d6825..703b644fd71f 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h @@ -388,7 +388,7 @@ struct i40e_ring { struct i40e_channel *ch; struct xdp_rxq_info xdp_rxq; - struct xdp_umem *xsk_umem; + struct xsk_buff_pool *xsk_pool; } ____cacheline_internodealigned_in_smp; static inline bool ring_uses_build_skb(struct i40e_ring *ring) diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c index 8ce57b507a21..2a1153d8957b 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c @@ -29,14 +29,16 @@ static struct xdp_buff **i40e_rx_bi(struct i40e_ring *rx_ring, u32 idx) } /** - * i40e_xsk_umem_enable - Enable/associate a UMEM to a certain ring/qid + * i40e_xsk_pool_enable - Enable/associate an AF_XDP buffer pool to a + * certain ring/qid * @vsi: Current VSI - * @umem: UMEM - * @qid: Rx ring to associate UMEM to + * @pool: buffer pool + * @qid: Rx ring to associate buffer pool with * * Returns 0 on success, <0 on failure **/ -static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem, +static int i40e_xsk_pool_enable(struct i40e_vsi *vsi, + struct xsk_buff_pool *pool, u16 qid) { struct net_device *netdev = vsi->netdev; @@ -53,7 +55,7 @@ static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem, qid >= netdev->real_num_tx_queues) return -EINVAL; - err = xsk_buff_dma_map(umem, &vsi->back->pdev->dev, I40E_RX_DMA_ATTR); + err = xsk_pool_dma_map(pool, &vsi->back->pdev->dev, I40E_RX_DMA_ATTR); if (err) return err; @@ -80,21 +82,22 @@ static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem, } /** - * i40e_xsk_umem_disable - Disassociate a UMEM from a certain ring/qid + * i40e_xsk_pool_disable - Disassociate an AF_XDP buffer pool from a + * certain ring/qid * @vsi: Current VSI - * @qid: Rx ring to associate UMEM to + * @qid: Rx ring to associate buffer pool with * * Returns 0 on success, <0 on failure **/ -static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid) +static int i40e_xsk_pool_disable(struct i40e_vsi *vsi, u16 qid) { struct net_device *netdev = vsi->netdev; - struct xdp_umem *umem; + struct xsk_buff_pool *pool; bool if_running; int err; - umem = xdp_get_umem_from_qid(netdev, qid); - if (!umem) + pool = xsk_get_pool_from_qid(netdev, qid); + if (!pool) return -EINVAL; if_running = netif_running(vsi->netdev) && i40e_enabled_xdp_vsi(vsi); @@ -106,7 +109,7 @@ static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid) } clear_bit(qid, vsi->af_xdp_zc_qps); - xsk_buff_dma_unmap(umem, I40E_RX_DMA_ATTR); + xsk_pool_dma_unmap(pool, I40E_RX_DMA_ATTR); if (if_running) { err = i40e_queue_pair_enable(vsi, qid); @@ -118,20 +121,21 @@ static int i40e_xsk_umem_disable(struct i40e_vsi *vsi, u16 qid) } /** - * i40e_xsk_umem_setup - Enable/disassociate a UMEM to/from a ring/qid + * i40e_xsk_pool_setup - Enable/disassociate an AF_XDP buffer pool to/from + * a ring/qid * @vsi: Current VSI - * @umem: UMEM to enable/associate to a ring, or NULL to disable - * @qid: Rx ring to (dis)associate UMEM (from)to + * @pool: Buffer pool to enable/associate to a ring, or NULL to disable + * @qid: Rx ring to (dis)associate buffer pool (from)to * - * This function enables or disables a UMEM to a certain ring. 
+ * This function enables or disables a buffer pool to a certain ring. * * Returns 0 on success, <0 on failure **/ -int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem, +int i40e_xsk_pool_setup(struct i40e_vsi *vsi, struct xsk_buff_pool *pool, u16 qid) { - return umem ? i40e_xsk_umem_enable(vsi, umem, qid) : - i40e_xsk_umem_disable(vsi, qid); + return pool ? i40e_xsk_pool_enable(vsi, pool, qid) : + i40e_xsk_pool_disable(vsi, qid); } /** @@ -191,7 +195,7 @@ bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count) rx_desc = I40E_RX_DESC(rx_ring, ntu); bi = i40e_rx_bi(rx_ring, ntu); do { - xdp = xsk_buff_alloc(rx_ring->xsk_umem); + xdp = xsk_buff_alloc(rx_ring->xsk_pool); if (!xdp) { ok = false; goto no_buffers; @@ -310,7 +314,7 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget) bi = i40e_rx_bi(rx_ring, rx_ring->next_to_clean); (*bi)->data_end = (*bi)->data + size; - xsk_buff_dma_sync_for_cpu(*bi); + xsk_buff_dma_sync_for_cpu(*bi, rx_ring->xsk_pool); xdp_res = i40e_run_xdp_zc(rx_ring, *bi); if (xdp_res) { @@ -358,11 +362,11 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget) i40e_finalize_xdp_rx(rx_ring, xdp_xmit); i40e_update_rx_stats(rx_ring, total_rx_bytes, total_rx_packets); - if (xsk_umem_uses_need_wakeup(rx_ring->xsk_umem)) { + if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) { if (failure || rx_ring->next_to_clean == rx_ring->next_to_use) - xsk_set_rx_need_wakeup(rx_ring->xsk_umem); + xsk_set_rx_need_wakeup(rx_ring->xsk_pool); else - xsk_clear_rx_need_wakeup(rx_ring->xsk_umem); + xsk_clear_rx_need_wakeup(rx_ring->xsk_pool); return (int)total_rx_packets; } @@ -385,11 +389,11 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget) dma_addr_t dma; while (budget-- > 0) { - if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc)) + if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc)) break; - dma = xsk_buff_raw_get_dma(xdp_ring->xsk_umem, desc.addr); - xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_umem, dma, + dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr); + xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc.len); tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use]; @@ -416,7 +420,7 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget) I40E_TXD_QW1_CMD_SHIFT); i40e_xdp_ring_update_tail(xdp_ring); - xsk_umem_consume_tx_done(xdp_ring->xsk_umem); + xsk_tx_release(xdp_ring->xsk_pool); i40e_update_tx_stats(xdp_ring, sent_frames, total_bytes); } @@ -448,7 +452,7 @@ static void i40e_clean_xdp_tx_buffer(struct i40e_ring *tx_ring, **/ bool i40e_clean_xdp_tx_irq(struct i40e_vsi *vsi, struct i40e_ring *tx_ring) { - struct xdp_umem *umem = tx_ring->xsk_umem; + struct xsk_buff_pool *bp = tx_ring->xsk_pool; u32 i, completed_frames, xsk_frames = 0; u32 head_idx = i40e_get_head(tx_ring); struct i40e_tx_buffer *tx_bi; @@ -488,13 +492,13 @@ skip: tx_ring->next_to_clean -= tx_ring->count; if (xsk_frames) - xsk_umem_complete_tx(umem, xsk_frames); + xsk_tx_completed(bp, xsk_frames); i40e_arm_wb(tx_ring, vsi, completed_frames); out_xmit: - if (xsk_umem_uses_need_wakeup(tx_ring->xsk_umem)) - xsk_set_tx_need_wakeup(tx_ring->xsk_umem); + if (xsk_uses_need_wakeup(tx_ring->xsk_pool)) + xsk_set_tx_need_wakeup(tx_ring->xsk_pool); return i40e_xmit_zc(tx_ring, I40E_DESC_UNUSED(tx_ring)); } @@ -526,7 +530,7 @@ int i40e_xsk_wakeup(struct net_device *dev, u32 queue_id, u32 flags) if (queue_id >= vsi->num_queue_pairs) return -ENXIO; - if (!vsi->xdp_rings[queue_id]->xsk_umem) + if 
(!vsi->xdp_rings[queue_id]->xsk_pool) return -ENXIO; ring = vsi->xdp_rings[queue_id]; @@ -565,7 +569,7 @@ void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring) void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring) { u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use; - struct xdp_umem *umem = tx_ring->xsk_umem; + struct xsk_buff_pool *bp = tx_ring->xsk_pool; struct i40e_tx_buffer *tx_bi; u32 xsk_frames = 0; @@ -585,14 +589,15 @@ void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring) } if (xsk_frames) - xsk_umem_complete_tx(umem, xsk_frames); + xsk_tx_completed(bp, xsk_frames); } /** - * i40e_xsk_any_rx_ring_enabled - Checks if Rx rings have AF_XDP UMEM attached + * i40e_xsk_any_rx_ring_enabled - Checks if Rx rings have an AF_XDP + * buffer pool attached * @vsi: vsi * - * Returns true if any of the Rx rings has an AF_XDP UMEM attached + * Returns true if any of the Rx rings has an AF_XDP buffer pool attached **/ bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi) { @@ -600,7 +605,7 @@ bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi) int i; for (i = 0; i < vsi->num_queue_pairs; i++) { - if (xdp_get_umem_from_qid(netdev, i)) + if (xsk_get_pool_from_qid(netdev, i)) return true; } diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.h b/drivers/net/ethernet/intel/i40e/i40e_xsk.h index c524c142127f..7adfd8539247 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.h +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.h @@ -5,12 +5,12 @@ #define _I40E_XSK_H_ struct i40e_vsi; -struct xdp_umem; +struct xsk_buff_pool; struct zero_copy_allocator; int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair); int i40e_queue_pair_enable(struct i40e_vsi *vsi, int queue_pair); -int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem, +int i40e_xsk_pool_setup(struct i40e_vsi *vsi, struct xsk_buff_pool *pool, u16 qid); bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 cleaned_count); int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget); diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h index fe140ff38f74..65583f0a1797 100644 --- a/drivers/net/ethernet/intel/ice/ice.h +++ b/drivers/net/ethernet/intel/ice/ice.h @@ -321,9 +321,9 @@ struct ice_vsi { struct ice_ring **xdp_rings; /* XDP ring array */ u16 num_xdp_txq; /* Used XDP queues */ u8 xdp_mapping_mode; /* ICE_MAP_MODE_[CONTIG|SCATTER] */ - struct xdp_umem **xsk_umems; - u16 num_xsk_umems_used; - u16 num_xsk_umems; + struct xsk_buff_pool **xsk_pools; + u16 num_xsk_pools_used; + u16 num_xsk_pools; } ____cacheline_internodealigned_in_smp; /* struct that defines an interrupt vector */ @@ -507,25 +507,25 @@ static inline void ice_set_ring_xdp(struct ice_ring *ring) } /** - * ice_xsk_umem - get XDP UMEM bound to a ring + * ice_xsk_pool - get XSK buffer pool bound to a ring * @ring - ring to use * - * Returns a pointer to xdp_umem structure if there is an UMEM present, + * Returns a pointer to xdp_umem structure if there is a buffer pool present, * NULL otherwise. 
*/ -static inline struct xdp_umem *ice_xsk_umem(struct ice_ring *ring) +static inline struct xsk_buff_pool *ice_xsk_pool(struct ice_ring *ring) { - struct xdp_umem **umems = ring->vsi->xsk_umems; + struct xsk_buff_pool **pools = ring->vsi->xsk_pools; u16 qid = ring->q_index; if (ice_ring_is_xdp(ring)) qid -= ring->vsi->num_xdp_txq; - if (qid >= ring->vsi->num_xsk_umems || !umems || !umems[qid] || + if (qid >= ring->vsi->num_xsk_pools || !pools || !pools[qid] || !ice_is_xdp_ena_vsi(ring->vsi)) return NULL; - return umems[qid]; + return pools[qid]; } /** diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c index 87008476d8fe..fe4320e2d1f2 100644 --- a/drivers/net/ethernet/intel/ice/ice_base.c +++ b/drivers/net/ethernet/intel/ice/ice_base.c @@ -308,12 +308,12 @@ int ice_setup_rx_ctx(struct ice_ring *ring) xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev, ring->q_index); - ring->xsk_umem = ice_xsk_umem(ring); - if (ring->xsk_umem) { + ring->xsk_pool = ice_xsk_pool(ring); + if (ring->xsk_pool) { xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq); ring->rx_buf_len = - xsk_umem_get_rx_frame_size(ring->xsk_umem); + xsk_pool_get_rx_frame_size(ring->xsk_pool); /* For AF_XDP ZC, we disallow packets to span on * multiple buffers, thus letting us skip that * handling in the fast-path. @@ -324,7 +324,7 @@ int ice_setup_rx_ctx(struct ice_ring *ring) NULL); if (err) return err; - xsk_buff_set_rxq_info(ring->xsk_umem, &ring->xdp_rxq); + xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq); dev_info(dev, "Registered XDP mem model MEM_TYPE_XSK_BUFF_POOL on Rx ring %d\n", ring->q_index); @@ -417,9 +417,9 @@ int ice_setup_rx_ctx(struct ice_ring *ring) ring->tail = hw->hw_addr + QRX_TAIL(pf_q); writel(0, ring->tail); - if (ring->xsk_umem) { - if (!xsk_buff_can_alloc(ring->xsk_umem, num_bufs)) { - dev_warn(dev, "UMEM does not provide enough addresses to fill %d buffers on Rx ring %d\n", + if (ring->xsk_pool) { + if (!xsk_buff_can_alloc(ring->xsk_pool, num_bufs)) { + dev_warn(dev, "XSK buffer pool does not provide enough addresses to fill %d buffers on Rx ring %d\n", num_bufs, ring->q_index); dev_warn(dev, "Change Rx ring/fill queue size to avoid performance issues\n"); @@ -428,7 +428,7 @@ int ice_setup_rx_ctx(struct ice_ring *ring) err = ice_alloc_rx_bufs_zc(ring, num_bufs); if (err) - dev_info(dev, "Failed to allocate some buffers on UMEM enabled Rx ring %d (pf_q %d)\n", + dev_info(dev, "Failed to allocate some buffers on XSK buffer pool enabled Rx ring %d (pf_q %d)\n", ring->q_index, pf_q); return 0; } diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c index f2682776f8c8..feeb5cdccdc5 100644 --- a/drivers/net/ethernet/intel/ice/ice_lib.c +++ b/drivers/net/ethernet/intel/ice/ice_lib.c @@ -1743,7 +1743,7 @@ int ice_vsi_cfg_xdp_txqs(struct ice_vsi *vsi) return ret; for (i = 0; i < vsi->num_xdp_txq; i++) - vsi->xdp_rings[i]->xsk_umem = ice_xsk_umem(vsi->xdp_rings[i]); + vsi->xdp_rings[i]->xsk_pool = ice_xsk_pool(vsi->xdp_rings[i]); return ret; } diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c index 4634b48949bb..2297ee7dba26 100644 --- a/drivers/net/ethernet/intel/ice/ice_main.c +++ b/drivers/net/ethernet/intel/ice/ice_main.c @@ -2273,7 +2273,7 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi) if (ice_setup_tx_ring(xdp_ring)) goto free_xdp_rings; ice_set_ring_xdp(xdp_ring); - xdp_ring->xsk_umem = ice_xsk_umem(xdp_ring); + xdp_ring->xsk_pool = ice_xsk_pool(xdp_ring); } 
return 0; @@ -2517,13 +2517,13 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog, if (if_running) ret = ice_up(vsi); - if (!ret && prog && vsi->xsk_umems) { + if (!ret && prog && vsi->xsk_pools) { int i; ice_for_each_rxq(vsi, i) { struct ice_ring *rx_ring = vsi->rx_rings[i]; - if (rx_ring->xsk_umem) + if (rx_ring->xsk_pool) napi_schedule(&rx_ring->q_vector->napi); } } @@ -2549,8 +2549,8 @@ static int ice_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return ice_xdp_setup_prog(vsi, xdp->prog, xdp->extack); - case XDP_SETUP_XSK_UMEM: - return ice_xsk_umem_setup(vsi, xdp->xsk.umem, + case XDP_SETUP_XSK_POOL: + return ice_xsk_pool_setup(vsi, xdp->xsk.pool, xdp->xsk.queue_id); default: return -EINVAL; diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c index d2fca4a52f51..eae75260fe20 100644 --- a/drivers/net/ethernet/intel/ice/ice_txrx.c +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c @@ -145,7 +145,7 @@ void ice_clean_tx_ring(struct ice_ring *tx_ring) { u16 i; - if (ice_ring_is_xdp(tx_ring) && tx_ring->xsk_umem) { + if (ice_ring_is_xdp(tx_ring) && tx_ring->xsk_pool) { ice_xsk_clean_xdp_ring(tx_ring); goto tx_skip_free; } @@ -375,7 +375,7 @@ void ice_clean_rx_ring(struct ice_ring *rx_ring) if (!rx_ring->rx_buf) return; - if (rx_ring->xsk_umem) { + if (rx_ring->xsk_pool) { ice_xsk_clean_rx_ring(rx_ring); goto rx_skip_free; } @@ -1610,7 +1610,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget) * budget and be more aggressive about cleaning up the Tx descriptors. */ ice_for_each_ring(ring, q_vector->tx) { - bool wd = ring->xsk_umem ? + bool wd = ring->xsk_pool ? ice_clean_tx_irq_zc(ring, budget) : ice_clean_tx_irq(ring, budget); @@ -1640,7 +1640,7 @@ int ice_napi_poll(struct napi_struct *napi, int budget) * comparison in the irq context instead of many inside the * ice_clean_rx_irq function and makes the codebase cleaner. */ - cleaned = ring->xsk_umem ? + cleaned = ring->xsk_pool ? 
ice_clean_rx_irq_zc(ring, budget_per_ring) : ice_clean_rx_irq(ring, budget_per_ring); work_done += cleaned; diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h index 51b4df7a59d2..e9f60d550fcb 100644 --- a/drivers/net/ethernet/intel/ice/ice_txrx.h +++ b/drivers/net/ethernet/intel/ice/ice_txrx.h @@ -295,7 +295,7 @@ struct ice_ring { struct rcu_head rcu; /* to avoid race on free */ struct bpf_prog *xdp_prog; - struct xdp_umem *xsk_umem; + struct xsk_buff_pool *xsk_pool; /* CL3 - 3rd cacheline starts here */ struct xdp_rxq_info xdp_rxq; /* CLX - the below items are only accessed infrequently and should be diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c index 20ac5fca68c6..797886524054 100644 --- a/drivers/net/ethernet/intel/ice/ice_xsk.c +++ b/drivers/net/ethernet/intel/ice/ice_xsk.c @@ -236,7 +236,7 @@ static int ice_qp_ena(struct ice_vsi *vsi, u16 q_idx) if (err) goto free_buf; ice_set_ring_xdp(xdp_ring); - xdp_ring->xsk_umem = ice_xsk_umem(xdp_ring); + xdp_ring->xsk_pool = ice_xsk_pool(xdp_ring); } err = ice_setup_rx_ctx(rx_ring); @@ -260,21 +260,21 @@ free_buf: } /** - * ice_xsk_alloc_umems - allocate a UMEM region for an XDP socket - * @vsi: VSI to allocate the UMEM on + * ice_xsk_alloc_pools - allocate a buffer pool for an XDP socket + * @vsi: VSI to allocate the buffer pool on * * Returns 0 on success, negative on error */ -static int ice_xsk_alloc_umems(struct ice_vsi *vsi) +static int ice_xsk_alloc_pools(struct ice_vsi *vsi) { - if (vsi->xsk_umems) + if (vsi->xsk_pools) return 0; - vsi->xsk_umems = kcalloc(vsi->num_xsk_umems, sizeof(*vsi->xsk_umems), + vsi->xsk_pools = kcalloc(vsi->num_xsk_pools, sizeof(*vsi->xsk_pools), GFP_KERNEL); - if (!vsi->xsk_umems) { - vsi->num_xsk_umems = 0; + if (!vsi->xsk_pools) { + vsi->num_xsk_pools = 0; return -ENOMEM; } @@ -282,73 +282,73 @@ static int ice_xsk_alloc_umems(struct ice_vsi *vsi) } /** - * ice_xsk_remove_umem - Remove an UMEM for a certain ring/qid + * ice_xsk_remove_pool - Remove an buffer pool for a certain ring/qid * @vsi: VSI from which the VSI will be removed - * @qid: Ring/qid associated with the UMEM + * @qid: Ring/qid associated with the buffer pool */ -static void ice_xsk_remove_umem(struct ice_vsi *vsi, u16 qid) +static void ice_xsk_remove_pool(struct ice_vsi *vsi, u16 qid) { - vsi->xsk_umems[qid] = NULL; - vsi->num_xsk_umems_used--; + vsi->xsk_pools[qid] = NULL; + vsi->num_xsk_pools_used--; - if (vsi->num_xsk_umems_used == 0) { - kfree(vsi->xsk_umems); - vsi->xsk_umems = NULL; - vsi->num_xsk_umems = 0; + if (vsi->num_xsk_pools_used == 0) { + kfree(vsi->xsk_pools); + vsi->xsk_pools = NULL; + vsi->num_xsk_pools = 0; } } /** - * ice_xsk_umem_disable - disable a UMEM region + * ice_xsk_pool_disable - disable a buffer pool region * @vsi: Current VSI * @qid: queue ID * * Returns 0 on success, negative on failure */ -static int ice_xsk_umem_disable(struct ice_vsi *vsi, u16 qid) +static int ice_xsk_pool_disable(struct ice_vsi *vsi, u16 qid) { - if (!vsi->xsk_umems || qid >= vsi->num_xsk_umems || - !vsi->xsk_umems[qid]) + if (!vsi->xsk_pools || qid >= vsi->num_xsk_pools || + !vsi->xsk_pools[qid]) return -EINVAL; - xsk_buff_dma_unmap(vsi->xsk_umems[qid], ICE_RX_DMA_ATTR); - ice_xsk_remove_umem(vsi, qid); + xsk_pool_dma_unmap(vsi->xsk_pools[qid], ICE_RX_DMA_ATTR); + ice_xsk_remove_pool(vsi, qid); return 0; } /** - * ice_xsk_umem_enable - enable a UMEM region + * ice_xsk_pool_enable - enable a buffer pool region * @vsi: Current VSI - * @umem: 
pointer to a requested UMEM region + * @pool: pointer to a requested buffer pool region * @qid: queue ID * * Returns 0 on success, negative on failure */ static int -ice_xsk_umem_enable(struct ice_vsi *vsi, struct xdp_umem *umem, u16 qid) +ice_xsk_pool_enable(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid) { int err; if (vsi->type != ICE_VSI_PF) return -EINVAL; - if (!vsi->num_xsk_umems) - vsi->num_xsk_umems = min_t(u16, vsi->num_rxq, vsi->num_txq); - if (qid >= vsi->num_xsk_umems) + if (!vsi->num_xsk_pools) + vsi->num_xsk_pools = min_t(u16, vsi->num_rxq, vsi->num_txq); + if (qid >= vsi->num_xsk_pools) return -EINVAL; - err = ice_xsk_alloc_umems(vsi); + err = ice_xsk_alloc_pools(vsi); if (err) return err; - if (vsi->xsk_umems && vsi->xsk_umems[qid]) + if (vsi->xsk_pools && vsi->xsk_pools[qid]) return -EBUSY; - vsi->xsk_umems[qid] = umem; - vsi->num_xsk_umems_used++; + vsi->xsk_pools[qid] = pool; + vsi->num_xsk_pools_used++; - err = xsk_buff_dma_map(vsi->xsk_umems[qid], ice_pf_to_dev(vsi->back), + err = xsk_pool_dma_map(vsi->xsk_pools[qid], ice_pf_to_dev(vsi->back), ICE_RX_DMA_ATTR); if (err) return err; @@ -357,17 +357,17 @@ ice_xsk_umem_enable(struct ice_vsi *vsi, struct xdp_umem *umem, u16 qid) } /** - * ice_xsk_umem_setup - enable/disable a UMEM region depending on its state + * ice_xsk_pool_setup - enable/disable a buffer pool region depending on its state * @vsi: Current VSI - * @umem: UMEM to enable/associate to a ring, NULL to disable + * @pool: buffer pool to enable/associate to a ring, NULL to disable * @qid: queue ID * * Returns 0 on success, negative on failure */ -int ice_xsk_umem_setup(struct ice_vsi *vsi, struct xdp_umem *umem, u16 qid) +int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, u16 qid) { - bool if_running, umem_present = !!umem; - int ret = 0, umem_failure = 0; + bool if_running, pool_present = !!pool; + int ret = 0, pool_failure = 0; if_running = netif_running(vsi->netdev) && ice_is_xdp_ena_vsi(vsi); @@ -375,26 +375,26 @@ int ice_xsk_umem_setup(struct ice_vsi *vsi, struct xdp_umem *umem, u16 qid) ret = ice_qp_dis(vsi, qid); if (ret) { netdev_err(vsi->netdev, "ice_qp_dis error = %d\n", ret); - goto xsk_umem_if_up; + goto xsk_pool_if_up; } } - umem_failure = umem_present ? ice_xsk_umem_enable(vsi, umem, qid) : - ice_xsk_umem_disable(vsi, qid); + pool_failure = pool_present ? ice_xsk_pool_enable(vsi, pool, qid) : + ice_xsk_pool_disable(vsi, qid); -xsk_umem_if_up: +xsk_pool_if_up: if (if_running) { ret = ice_qp_ena(vsi, qid); - if (!ret && umem_present) + if (!ret && pool_present) napi_schedule(&vsi->xdp_rings[qid]->q_vector->napi); else if (ret) netdev_err(vsi->netdev, "ice_qp_ena error = %d\n", ret); } - if (umem_failure) { - netdev_err(vsi->netdev, "Could not %sable UMEM, error = %d\n", - umem_present ? "en" : "dis", umem_failure); - return umem_failure; + if (pool_failure) { + netdev_err(vsi->netdev, "Could not %sable buffer pool, error = %d\n", + pool_present ? 
"en" : "dis", pool_failure); + return pool_failure; } return ret; @@ -425,7 +425,7 @@ bool ice_alloc_rx_bufs_zc(struct ice_ring *rx_ring, u16 count) rx_buf = &rx_ring->rx_buf[ntu]; do { - rx_buf->xdp = xsk_buff_alloc(rx_ring->xsk_umem); + rx_buf->xdp = xsk_buff_alloc(rx_ring->xsk_pool); if (!rx_buf->xdp) { ret = true; break; @@ -595,7 +595,7 @@ int ice_clean_rx_irq_zc(struct ice_ring *rx_ring, int budget) rx_buf = &rx_ring->rx_buf[rx_ring->next_to_clean]; rx_buf->xdp->data_end = rx_buf->xdp->data + size; - xsk_buff_dma_sync_for_cpu(rx_buf->xdp); + xsk_buff_dma_sync_for_cpu(rx_buf->xdp, rx_ring->xsk_pool); xdp_res = ice_run_xdp_zc(rx_ring, rx_buf->xdp); if (xdp_res) { @@ -645,11 +645,11 @@ int ice_clean_rx_irq_zc(struct ice_ring *rx_ring, int budget) ice_finalize_xdp_rx(rx_ring, xdp_xmit); ice_update_rx_ring_stats(rx_ring, total_rx_packets, total_rx_bytes); - if (xsk_umem_uses_need_wakeup(rx_ring->xsk_umem)) { + if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) { if (failure || rx_ring->next_to_clean == rx_ring->next_to_use) - xsk_set_rx_need_wakeup(rx_ring->xsk_umem); + xsk_set_rx_need_wakeup(rx_ring->xsk_pool); else - xsk_clear_rx_need_wakeup(rx_ring->xsk_umem); + xsk_clear_rx_need_wakeup(rx_ring->xsk_pool); return (int)total_rx_packets; } @@ -682,11 +682,11 @@ static bool ice_xmit_zc(struct ice_ring *xdp_ring, int budget) tx_buf = &xdp_ring->tx_buf[xdp_ring->next_to_use]; - if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc)) + if (!xsk_tx_peek_desc(xdp_ring->xsk_pool, &desc)) break; - dma = xsk_buff_raw_get_dma(xdp_ring->xsk_umem, desc.addr); - xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_umem, dma, + dma = xsk_buff_raw_get_dma(xdp_ring->xsk_pool, desc.addr); + xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_pool, dma, desc.len); tx_buf->bytecount = desc.len; @@ -703,7 +703,7 @@ static bool ice_xmit_zc(struct ice_ring *xdp_ring, int budget) if (tx_desc) { ice_xdp_ring_update_tail(xdp_ring); - xsk_umem_consume_tx_done(xdp_ring->xsk_umem); + xsk_tx_release(xdp_ring->xsk_pool); } return budget > 0 && work_done; @@ -777,10 +777,10 @@ bool ice_clean_tx_irq_zc(struct ice_ring *xdp_ring, int budget) xdp_ring->next_to_clean = ntc; if (xsk_frames) - xsk_umem_complete_tx(xdp_ring->xsk_umem, xsk_frames); + xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames); - if (xsk_umem_uses_need_wakeup(xdp_ring->xsk_umem)) - xsk_set_tx_need_wakeup(xdp_ring->xsk_umem); + if (xsk_uses_need_wakeup(xdp_ring->xsk_pool)) + xsk_set_tx_need_wakeup(xdp_ring->xsk_pool); ice_update_tx_ring_stats(xdp_ring, total_packets, total_bytes); xmit_done = ice_xmit_zc(xdp_ring, ICE_DFLT_IRQ_WORK); @@ -814,7 +814,7 @@ ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, if (queue_id >= vsi->num_txq) return -ENXIO; - if (!vsi->xdp_rings[queue_id]->xsk_umem) + if (!vsi->xdp_rings[queue_id]->xsk_pool) return -ENXIO; ring = vsi->xdp_rings[queue_id]; @@ -833,20 +833,20 @@ ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, } /** - * ice_xsk_any_rx_ring_ena - Checks if Rx rings have AF_XDP UMEM attached + * ice_xsk_any_rx_ring_ena - Checks if Rx rings have AF_XDP buff pool attached * @vsi: VSI to be checked * - * Returns true if any of the Rx rings has an AF_XDP UMEM attached + * Returns true if any of the Rx rings has an AF_XDP buff pool attached */ bool ice_xsk_any_rx_ring_ena(struct ice_vsi *vsi) { int i; - if (!vsi->xsk_umems) + if (!vsi->xsk_pools) return false; - for (i = 0; i < vsi->num_xsk_umems; i++) { - if (vsi->xsk_umems[i]) + for (i = 0; i < vsi->num_xsk_pools; i++) { + if (vsi->xsk_pools[i]) return true; } @@ -854,7 +854,7 @@ 
bool ice_xsk_any_rx_ring_ena(struct ice_vsi *vsi) } /** - * ice_xsk_clean_rx_ring - clean UMEM queues connected to a given Rx ring + * ice_xsk_clean_rx_ring - clean buffer pool queues connected to a given Rx ring * @rx_ring: ring to be cleaned */ void ice_xsk_clean_rx_ring(struct ice_ring *rx_ring) @@ -872,7 +872,7 @@ void ice_xsk_clean_rx_ring(struct ice_ring *rx_ring) } /** - * ice_xsk_clean_xdp_ring - Clean the XDP Tx ring and its UMEM queues + * ice_xsk_clean_xdp_ring - Clean the XDP Tx ring and its buffer pool queues * @xdp_ring: XDP_Tx ring */ void ice_xsk_clean_xdp_ring(struct ice_ring *xdp_ring) @@ -896,5 +896,5 @@ void ice_xsk_clean_xdp_ring(struct ice_ring *xdp_ring) } if (xsk_frames) - xsk_umem_complete_tx(xdp_ring->xsk_umem, xsk_frames); + xsk_tx_completed(xdp_ring->xsk_pool, xsk_frames); } diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.h b/drivers/net/ethernet/intel/ice/ice_xsk.h index fc1a06b4df36..fad783690134 100644 --- a/drivers/net/ethernet/intel/ice/ice_xsk.h +++ b/drivers/net/ethernet/intel/ice/ice_xsk.h @@ -9,7 +9,8 @@ struct ice_vsi; #ifdef CONFIG_XDP_SOCKETS -int ice_xsk_umem_setup(struct ice_vsi *vsi, struct xdp_umem *umem, u16 qid); +int ice_xsk_pool_setup(struct ice_vsi *vsi, struct xsk_buff_pool *pool, + u16 qid); int ice_clean_rx_irq_zc(struct ice_ring *rx_ring, int budget); bool ice_clean_tx_irq_zc(struct ice_ring *xdp_ring, int budget); int ice_xsk_wakeup(struct net_device *netdev, u32 queue_id, u32 flags); @@ -19,8 +20,8 @@ void ice_xsk_clean_rx_ring(struct ice_ring *rx_ring); void ice_xsk_clean_xdp_ring(struct ice_ring *xdp_ring); #else static inline int -ice_xsk_umem_setup(struct ice_vsi __always_unused *vsi, - struct xdp_umem __always_unused *umem, +ice_xsk_pool_setup(struct ice_vsi __always_unused *vsi, + struct xsk_buff_pool __always_unused *pool, u16 __always_unused qid) { return -EOPNOTSUPP; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index 1e8a809233a0..de0fc6ecf491 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -350,7 +350,7 @@ struct ixgbe_ring { struct ixgbe_rx_queue_stats rx_stats; }; struct xdp_rxq_info xdp_rxq; - struct xdp_umem *xsk_umem; + struct xsk_buff_pool *xsk_pool; u16 ring_idx; /* {rx,tx,xdp}_ring back reference idx */ u16 rx_buf_len; } ____cacheline_internodealigned_in_smp; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index f4f2198f388b..0b675c34ce49 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -3151,7 +3151,7 @@ int ixgbe_poll(struct napi_struct *napi, int budget) #endif ixgbe_for_each_ring(ring, q_vector->tx) { - bool wd = ring->xsk_umem ? + bool wd = ring->xsk_pool ? ixgbe_clean_xdp_tx_irq(q_vector, ring, budget) : ixgbe_clean_tx_irq(q_vector, ring, budget); @@ -3171,7 +3171,7 @@ int ixgbe_poll(struct napi_struct *napi, int budget) per_ring_budget = budget; ixgbe_for_each_ring(ring, q_vector->rx) { - int cleaned = ring->xsk_umem ? + int cleaned = ring->xsk_pool ? 
ixgbe_clean_rx_irq_zc(q_vector, ring, per_ring_budget) : ixgbe_clean_rx_irq(q_vector, ring, @@ -3466,9 +3466,9 @@ void ixgbe_configure_tx_ring(struct ixgbe_adapter *adapter, u32 txdctl = IXGBE_TXDCTL_ENABLE; u8 reg_idx = ring->reg_idx; - ring->xsk_umem = NULL; + ring->xsk_pool = NULL; if (ring_is_xdp(ring)) - ring->xsk_umem = ixgbe_xsk_umem(adapter, ring); + ring->xsk_pool = ixgbe_xsk_pool(adapter, ring); /* disable queue to avoid issues while updating state */ IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(reg_idx), 0); @@ -3708,8 +3708,8 @@ static void ixgbe_configure_srrctl(struct ixgbe_adapter *adapter, srrctl = IXGBE_RX_HDR_SIZE << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT; /* configure the packet buffer length */ - if (rx_ring->xsk_umem) { - u32 xsk_buf_len = xsk_umem_get_rx_frame_size(rx_ring->xsk_umem); + if (rx_ring->xsk_pool) { + u32 xsk_buf_len = xsk_pool_get_rx_frame_size(rx_ring->xsk_pool); /* If the MAC support setting RXDCTL.RLPML, the * SRRCTL[n].BSIZEPKT is set to PAGE_SIZE and @@ -4054,12 +4054,12 @@ void ixgbe_configure_rx_ring(struct ixgbe_adapter *adapter, u8 reg_idx = ring->reg_idx; xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq); - ring->xsk_umem = ixgbe_xsk_umem(adapter, ring); - if (ring->xsk_umem) { + ring->xsk_pool = ixgbe_xsk_pool(adapter, ring); + if (ring->xsk_pool) { WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq, MEM_TYPE_XSK_BUFF_POOL, NULL)); - xsk_buff_set_rxq_info(ring->xsk_umem, &ring->xdp_rxq); + xsk_pool_set_rxq_info(ring->xsk_pool, &ring->xdp_rxq); } else { WARN_ON(xdp_rxq_info_reg_mem_model(&ring->xdp_rxq, MEM_TYPE_PAGE_SHARED, NULL)); @@ -4114,8 +4114,8 @@ void ixgbe_configure_rx_ring(struct ixgbe_adapter *adapter, #endif } - if (ring->xsk_umem && hw->mac.type != ixgbe_mac_82599EB) { - u32 xsk_buf_len = xsk_umem_get_rx_frame_size(ring->xsk_umem); + if (ring->xsk_pool && hw->mac.type != ixgbe_mac_82599EB) { + u32 xsk_buf_len = xsk_pool_get_rx_frame_size(ring->xsk_pool); rxdctl &= ~(IXGBE_RXDCTL_RLPMLMASK | IXGBE_RXDCTL_RLPML_EN); @@ -4137,7 +4137,7 @@ void ixgbe_configure_rx_ring(struct ixgbe_adapter *adapter, IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(reg_idx), rxdctl); ixgbe_rx_desc_queue_enable(adapter, ring); - if (ring->xsk_umem) + if (ring->xsk_pool) ixgbe_alloc_rx_buffers_zc(ring, ixgbe_desc_unused(ring)); else ixgbe_alloc_rx_buffers(ring, ixgbe_desc_unused(ring)); @@ -5287,7 +5287,7 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring *rx_ring) u16 i = rx_ring->next_to_clean; struct ixgbe_rx_buffer *rx_buffer = &rx_ring->rx_buffer_info[i]; - if (rx_ring->xsk_umem) { + if (rx_ring->xsk_pool) { ixgbe_xsk_clean_rx_ring(rx_ring); goto skip_free; } @@ -5979,7 +5979,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring) u16 i = tx_ring->next_to_clean; struct ixgbe_tx_buffer *tx_buffer = &tx_ring->tx_buffer_info[i]; - if (tx_ring->xsk_umem) { + if (tx_ring->xsk_pool) { ixgbe_xsk_clean_tx_ring(tx_ring); goto out; } @@ -10141,7 +10141,7 @@ static int ixgbe_xdp_setup(struct net_device *dev, struct bpf_prog *prog) */ if (need_reset && prog) for (i = 0; i < adapter->num_rx_queues; i++) - if (adapter->xdp_ring[i]->xsk_umem) + if (adapter->xdp_ring[i]->xsk_pool) (void)ixgbe_xsk_wakeup(adapter->netdev, i, XDP_WAKEUP_RX); @@ -10155,8 +10155,8 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return ixgbe_xdp_setup(dev, xdp->prog); - case XDP_SETUP_XSK_UMEM: - return ixgbe_xsk_umem_setup(adapter, xdp->xsk.umem, + case XDP_SETUP_XSK_POOL: + return ixgbe_xsk_pool_setup(adapter, xdp->xsk.pool, xdp->xsk.queue_id); 
default: diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h index 7887ae4aaf4f..2aeec78029bc 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h @@ -28,9 +28,10 @@ void ixgbe_irq_rearm_queues(struct ixgbe_adapter *adapter, u64 qmask); void ixgbe_txrx_ring_disable(struct ixgbe_adapter *adapter, int ring); void ixgbe_txrx_ring_enable(struct ixgbe_adapter *adapter, int ring); -struct xdp_umem *ixgbe_xsk_umem(struct ixgbe_adapter *adapter, - struct ixgbe_ring *ring); -int ixgbe_xsk_umem_setup(struct ixgbe_adapter *adapter, struct xdp_umem *umem, +struct xsk_buff_pool *ixgbe_xsk_pool(struct ixgbe_adapter *adapter, + struct ixgbe_ring *ring); +int ixgbe_xsk_pool_setup(struct ixgbe_adapter *adapter, + struct xsk_buff_pool *pool, u16 qid); void ixgbe_zca_free(struct zero_copy_allocator *alloc, unsigned long handle); diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c index ec7121f352e2..3771857cf887 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c @@ -8,8 +8,8 @@ #include "ixgbe.h" #include "ixgbe_txrx_common.h" -struct xdp_umem *ixgbe_xsk_umem(struct ixgbe_adapter *adapter, - struct ixgbe_ring *ring) +struct xsk_buff_pool *ixgbe_xsk_pool(struct ixgbe_adapter *adapter, + struct ixgbe_ring *ring) { bool xdp_on = READ_ONCE(adapter->xdp_prog); int qid = ring->ring_idx; @@ -17,11 +17,11 @@ struct xdp_umem *ixgbe_xsk_umem(struct ixgbe_adapter *adapter, if (!xdp_on || !test_bit(qid, adapter->af_xdp_zc_qps)) return NULL; - return xdp_get_umem_from_qid(adapter->netdev, qid); + return xsk_get_pool_from_qid(adapter->netdev, qid); } -static int ixgbe_xsk_umem_enable(struct ixgbe_adapter *adapter, - struct xdp_umem *umem, +static int ixgbe_xsk_pool_enable(struct ixgbe_adapter *adapter, + struct xsk_buff_pool *pool, u16 qid) { struct net_device *netdev = adapter->netdev; @@ -35,7 +35,7 @@ static int ixgbe_xsk_umem_enable(struct ixgbe_adapter *adapter, qid >= netdev->real_num_tx_queues) return -EINVAL; - err = xsk_buff_dma_map(umem, &adapter->pdev->dev, IXGBE_RX_DMA_ATTR); + err = xsk_pool_dma_map(pool, &adapter->pdev->dev, IXGBE_RX_DMA_ATTR); if (err) return err; @@ -59,13 +59,13 @@ static int ixgbe_xsk_umem_enable(struct ixgbe_adapter *adapter, return 0; } -static int ixgbe_xsk_umem_disable(struct ixgbe_adapter *adapter, u16 qid) +static int ixgbe_xsk_pool_disable(struct ixgbe_adapter *adapter, u16 qid) { - struct xdp_umem *umem; + struct xsk_buff_pool *pool; bool if_running; - umem = xdp_get_umem_from_qid(adapter->netdev, qid); - if (!umem) + pool = xsk_get_pool_from_qid(adapter->netdev, qid); + if (!pool) return -EINVAL; if_running = netif_running(adapter->netdev) && @@ -75,7 +75,7 @@ static int ixgbe_xsk_umem_disable(struct ixgbe_adapter *adapter, u16 qid) ixgbe_txrx_ring_disable(adapter, qid); clear_bit(qid, adapter->af_xdp_zc_qps); - xsk_buff_dma_unmap(umem, IXGBE_RX_DMA_ATTR); + xsk_pool_dma_unmap(pool, IXGBE_RX_DMA_ATTR); if (if_running) ixgbe_txrx_ring_enable(adapter, qid); @@ -83,11 +83,12 @@ static int ixgbe_xsk_umem_disable(struct ixgbe_adapter *adapter, u16 qid) return 0; } -int ixgbe_xsk_umem_setup(struct ixgbe_adapter *adapter, struct xdp_umem *umem, +int ixgbe_xsk_pool_setup(struct ixgbe_adapter *adapter, + struct xsk_buff_pool *pool, u16 qid) { - return umem ? 
ixgbe_xsk_umem_enable(adapter, umem, qid) : - ixgbe_xsk_umem_disable(adapter, qid); + return pool ? ixgbe_xsk_pool_enable(adapter, pool, qid) : + ixgbe_xsk_pool_disable(adapter, qid); } static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter, @@ -149,7 +150,7 @@ bool ixgbe_alloc_rx_buffers_zc(struct ixgbe_ring *rx_ring, u16 count) i -= rx_ring->count; do { - bi->xdp = xsk_buff_alloc(rx_ring->xsk_umem); + bi->xdp = xsk_buff_alloc(rx_ring->xsk_pool); if (!bi->xdp) { ok = false; break; @@ -286,7 +287,7 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector, } bi->xdp->data_end = bi->xdp->data + size; - xsk_buff_dma_sync_for_cpu(bi->xdp); + xsk_buff_dma_sync_for_cpu(bi->xdp, rx_ring->xsk_pool); xdp_res = ixgbe_run_xdp_zc(adapter, rx_ring, bi->xdp); if (xdp_res) { @@ -344,11 +345,11 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector, q_vector->rx.total_packets += total_rx_packets; q_vector->rx.total_bytes += total_rx_bytes; - if (xsk_umem_uses_need_wakeup(rx_ring->xsk_umem)) { + if (xsk_uses_need_wakeup(rx_ring->xsk_pool)) { if (failure || rx_ring->next_to_clean == rx_ring->next_to_use) - xsk_set_rx_need_wakeup(rx_ring->xsk_umem); + xsk_set_rx_need_wakeup(rx_ring->xsk_pool); else - xsk_clear_rx_need_wakeup(rx_ring->xsk_umem); + xsk_clear_rx_need_wakeup(rx_ring->xsk_pool); return (int)total_rx_packets; } @@ -373,6 +374,7 @@ void ixgbe_xsk_clean_rx_ring(struct ixgbe_ring *rx_ring) static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget) { + struct xsk_buff_pool *pool = xdp_ring->xsk_pool; union ixgbe_adv_tx_desc *tx_desc = NULL; struct ixgbe_tx_buffer *tx_bi; bool work_done = true; @@ -387,12 +389,11 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget) break; } - if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &desc)) + if (!xsk_tx_peek_desc(pool, &desc)) break; - dma = xsk_buff_raw_get_dma(xdp_ring->xsk_umem, desc.addr); - xsk_buff_raw_dma_sync_for_device(xdp_ring->xsk_umem, dma, - desc.len); + dma = xsk_buff_raw_get_dma(pool, desc.addr); + xsk_buff_raw_dma_sync_for_device(pool, dma, desc.len); tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use]; tx_bi->bytecount = desc.len; @@ -418,7 +419,7 @@ static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget) if (tx_desc) { ixgbe_xdp_ring_update_tail(xdp_ring); - xsk_umem_consume_tx_done(xdp_ring->xsk_umem); + xsk_tx_release(pool); } return !!budget && work_done; @@ -439,7 +440,7 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector, { u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use; unsigned int total_packets = 0, total_bytes = 0; - struct xdp_umem *umem = tx_ring->xsk_umem; + struct xsk_buff_pool *pool = tx_ring->xsk_pool; union ixgbe_adv_tx_desc *tx_desc; struct ixgbe_tx_buffer *tx_bi; u32 xsk_frames = 0; @@ -484,10 +485,10 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector, q_vector->tx.total_packets += total_packets; if (xsk_frames) - xsk_umem_complete_tx(umem, xsk_frames); + xsk_tx_completed(pool, xsk_frames); - if (xsk_umem_uses_need_wakeup(tx_ring->xsk_umem)) - xsk_set_tx_need_wakeup(tx_ring->xsk_umem); + if (xsk_uses_need_wakeup(pool)) + xsk_set_tx_need_wakeup(pool); return ixgbe_xmit_zc(tx_ring, q_vector->tx.work_limit); } @@ -511,7 +512,7 @@ int ixgbe_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags) if (test_bit(__IXGBE_TX_DISABLED, &ring->state)) return -ENETDOWN; - if (!ring->xsk_umem) + if (!ring->xsk_pool) return -ENXIO; if (!napi_if_scheduled_mark_missed(&ring->q_vector->napi)) { @@ -526,7 +527,7 @@ int 
ixgbe_xsk_wakeup(struct net_device *dev, u32 qid, u32 flags) void ixgbe_xsk_clean_tx_ring(struct ixgbe_ring *tx_ring) { u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use; - struct xdp_umem *umem = tx_ring->xsk_umem; + struct xsk_buff_pool *pool = tx_ring->xsk_pool; struct ixgbe_tx_buffer *tx_bi; u32 xsk_frames = 0; @@ -546,5 +547,5 @@ void ixgbe_xsk_clean_tx_ring(struct ixgbe_ring *tx_ring) } if (xsk_frames) - xsk_umem_complete_tx(umem, xsk_frames); + xsk_tx_completed(pool, xsk_frames); } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile b/drivers/net/ethernet/mellanox/mlx5/core/Makefile index 10e6886c96ba..0b3eaa102751 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile +++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile @@ -24,7 +24,7 @@ mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \ mlx5_core-$(CONFIG_MLX5_CORE_EN) += en_main.o en_common.o en_fs.o en_ethtool.o \ en_tx.o en_rx.o en_dim.o en_txrx.o en/xdp.o en_stats.o \ en_selftest.o en/port.o en/monitor_stats.o en/health.o \ - en/reporter_tx.o en/reporter_rx.o en/params.o en/xsk/umem.o \ + en/reporter_tx.o en/reporter_rx.o en/params.o en/xsk/pool.o \ en/xsk/setup.o en/xsk/rx.o en/xsk/tx.o en/devlink.o # diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 0cc2080fd847..4f33658da25a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -442,7 +442,7 @@ struct mlx5e_xdpsq { struct mlx5e_cq cq; /* read only */ - struct xdp_umem *umem; + struct xsk_buff_pool *xsk_pool; struct mlx5_wq_cyc wq; struct mlx5e_xdpsq_stats *stats; mlx5e_fp_xmit_xdp_frame_check xmit_xdp_frame_check; @@ -606,7 +606,7 @@ struct mlx5e_rq { struct page_pool *page_pool; /* AF_XDP zero-copy */ - struct xdp_umem *umem; + struct xsk_buff_pool *xsk_pool; struct work_struct recover_work; @@ -729,12 +729,13 @@ struct mlx5e_hv_vhca_stats_agent { #endif struct mlx5e_xsk { - /* UMEMs are stored separately from channels, because we don't want to - * lose them when channels are recreated. The kernel also stores UMEMs, - * but it doesn't distinguish between zero-copy and non-zero-copy UMEMs, - * so rely on our mechanism. + /* XSK buffer pools are stored separately from channels, + * because we don't want to lose them when channels are + * recreated. The kernel also stores buffer pool, but it doesn't + * distinguish between zero-copy and non-zero-copy UMEMs, so + * rely on our mechanism. 
*/ - struct xdp_umem **umems; + struct xsk_buff_pool **pools; u16 refcnt; bool ever_used; }; @@ -893,7 +894,7 @@ struct mlx5e_xsk_param; struct mlx5e_rq_param; int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_params *params, struct mlx5e_rq_param *param, struct mlx5e_xsk_param *xsk, - struct xdp_umem *umem, struct mlx5e_rq *rq); + struct xsk_buff_pool *xsk_pool, struct mlx5e_rq *rq); int mlx5e_wait_for_min_rx_wqes(struct mlx5e_rq *rq, int wait_time); void mlx5e_deactivate_rq(struct mlx5e_rq *rq); void mlx5e_close_rq(struct mlx5e_rq *rq); @@ -903,7 +904,7 @@ int mlx5e_open_icosq(struct mlx5e_channel *c, struct mlx5e_params *params, struct mlx5e_sq_param *param, struct mlx5e_icosq *sq); void mlx5e_close_icosq(struct mlx5e_icosq *sq); int mlx5e_open_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params, - struct mlx5e_sq_param *param, struct xdp_umem *umem, + struct mlx5e_sq_param *param, struct xsk_buff_pool *xsk_pool, struct mlx5e_xdpsq *sq, bool is_redirect); void mlx5e_close_xdpsq(struct mlx5e_xdpsq *sq); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c index 4bcb73a5522f..145592788de5 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c @@ -445,7 +445,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq) } while ((++i < MLX5E_TX_CQ_POLL_BUDGET) && (cqe = mlx5_cqwq_get_cqe(&cq->wq))); if (xsk_frames) - xsk_umem_complete_tx(sq->umem, xsk_frames); + xsk_tx_completed(sq->xsk_pool, xsk_frames); sq->stats->cqes += i; @@ -475,7 +475,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq) } if (xsk_frames) - xsk_umem_complete_tx(sq->umem, xsk_frames); + xsk_tx_completed(sq->xsk_pool, xsk_frames); } int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames, @@ -563,4 +563,3 @@ void mlx5e_set_xmit_fp(struct mlx5e_xdpsq *sq, bool is_mpw) sq->xmit_xdp_frame = is_mpw ? mlx5e_xmit_xdp_frame_mpwqe : mlx5e_xmit_xdp_frame; } - diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c index 331ca2b0f8a4..3503e7711178 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.c @@ -1,31 +1,31 @@ // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB -/* Copyright (c) 2019 Mellanox Technologies. */ +/* Copyright (c) 2019-2020, Mellanox Technologies inc. All rights reserved. 
*/ #include <net/xdp_sock_drv.h> -#include "umem.h" +#include "pool.h" #include "setup.h" #include "en/params.h" -static int mlx5e_xsk_map_umem(struct mlx5e_priv *priv, - struct xdp_umem *umem) +static int mlx5e_xsk_map_pool(struct mlx5e_priv *priv, + struct xsk_buff_pool *pool) { struct device *dev = priv->mdev->device; - return xsk_buff_dma_map(umem, dev, 0); + return xsk_pool_dma_map(pool, dev, 0); } -static void mlx5e_xsk_unmap_umem(struct mlx5e_priv *priv, - struct xdp_umem *umem) +static void mlx5e_xsk_unmap_pool(struct mlx5e_priv *priv, + struct xsk_buff_pool *pool) { - return xsk_buff_dma_unmap(umem, 0); + return xsk_pool_dma_unmap(pool, 0); } -static int mlx5e_xsk_get_umems(struct mlx5e_xsk *xsk) +static int mlx5e_xsk_get_pools(struct mlx5e_xsk *xsk) { - if (!xsk->umems) { - xsk->umems = kcalloc(MLX5E_MAX_NUM_CHANNELS, - sizeof(*xsk->umems), GFP_KERNEL); - if (unlikely(!xsk->umems)) + if (!xsk->pools) { + xsk->pools = kcalloc(MLX5E_MAX_NUM_CHANNELS, + sizeof(*xsk->pools), GFP_KERNEL); + if (unlikely(!xsk->pools)) return -ENOMEM; } @@ -35,68 +35,68 @@ static int mlx5e_xsk_get_umems(struct mlx5e_xsk *xsk) return 0; } -static void mlx5e_xsk_put_umems(struct mlx5e_xsk *xsk) +static void mlx5e_xsk_put_pools(struct mlx5e_xsk *xsk) { if (!--xsk->refcnt) { - kfree(xsk->umems); - xsk->umems = NULL; + kfree(xsk->pools); + xsk->pools = NULL; } } -static int mlx5e_xsk_add_umem(struct mlx5e_xsk *xsk, struct xdp_umem *umem, u16 ix) +static int mlx5e_xsk_add_pool(struct mlx5e_xsk *xsk, struct xsk_buff_pool *pool, u16 ix) { int err; - err = mlx5e_xsk_get_umems(xsk); + err = mlx5e_xsk_get_pools(xsk); if (unlikely(err)) return err; - xsk->umems[ix] = umem; + xsk->pools[ix] = pool; return 0; } -static void mlx5e_xsk_remove_umem(struct mlx5e_xsk *xsk, u16 ix) +static void mlx5e_xsk_remove_pool(struct mlx5e_xsk *xsk, u16 ix) { - xsk->umems[ix] = NULL; + xsk->pools[ix] = NULL; - mlx5e_xsk_put_umems(xsk); + mlx5e_xsk_put_pools(xsk); } -static bool mlx5e_xsk_is_umem_sane(struct xdp_umem *umem) +static bool mlx5e_xsk_is_pool_sane(struct xsk_buff_pool *pool) { - return xsk_umem_get_headroom(umem) <= 0xffff && - xsk_umem_get_chunk_size(umem) <= 0xffff; + return xsk_pool_get_headroom(pool) <= 0xffff && + xsk_pool_get_chunk_size(pool) <= 0xffff; } -void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk) +void mlx5e_build_xsk_param(struct xsk_buff_pool *pool, struct mlx5e_xsk_param *xsk) { - xsk->headroom = xsk_umem_get_headroom(umem); - xsk->chunk_size = xsk_umem_get_chunk_size(umem); + xsk->headroom = xsk_pool_get_headroom(pool); + xsk->chunk_size = xsk_pool_get_chunk_size(pool); } static int mlx5e_xsk_enable_locked(struct mlx5e_priv *priv, - struct xdp_umem *umem, u16 ix) + struct xsk_buff_pool *pool, u16 ix) { struct mlx5e_params *params = &priv->channels.params; struct mlx5e_xsk_param xsk; struct mlx5e_channel *c; int err; - if (unlikely(mlx5e_xsk_get_umem(&priv->channels.params, &priv->xsk, ix))) + if (unlikely(mlx5e_xsk_get_pool(&priv->channels.params, &priv->xsk, ix))) return -EBUSY; - if (unlikely(!mlx5e_xsk_is_umem_sane(umem))) + if (unlikely(!mlx5e_xsk_is_pool_sane(pool))) return -EINVAL; - err = mlx5e_xsk_map_umem(priv, umem); + err = mlx5e_xsk_map_pool(priv, pool); if (unlikely(err)) return err; - err = mlx5e_xsk_add_umem(&priv->xsk, umem, ix); + err = mlx5e_xsk_add_pool(&priv->xsk, pool, ix); if (unlikely(err)) - goto err_unmap_umem; + goto err_unmap_pool; - mlx5e_build_xsk_param(umem, &xsk); + mlx5e_build_xsk_param(pool, &xsk); if (!test_bit(MLX5E_STATE_OPENED, 
&priv->state)) { /* XSK objects will be created on open. */ @@ -112,9 +112,9 @@ static int mlx5e_xsk_enable_locked(struct mlx5e_priv *priv, c = priv->channels.c[ix]; - err = mlx5e_open_xsk(priv, params, &xsk, umem, c); + err = mlx5e_open_xsk(priv, params, &xsk, pool, c); if (unlikely(err)) - goto err_remove_umem; + goto err_remove_pool; mlx5e_activate_xsk(c); @@ -132,11 +132,11 @@ err_deactivate: mlx5e_deactivate_xsk(c); mlx5e_close_xsk(c); -err_remove_umem: - mlx5e_xsk_remove_umem(&priv->xsk, ix); +err_remove_pool: + mlx5e_xsk_remove_pool(&priv->xsk, ix); -err_unmap_umem: - mlx5e_xsk_unmap_umem(priv, umem); +err_unmap_pool: + mlx5e_xsk_unmap_pool(priv, pool); return err; @@ -146,7 +146,7 @@ validate_closed: */ if (!mlx5e_validate_xsk_param(params, &xsk, priv->mdev)) { err = -EINVAL; - goto err_remove_umem; + goto err_remove_pool; } return 0; @@ -154,45 +154,45 @@ validate_closed: static int mlx5e_xsk_disable_locked(struct mlx5e_priv *priv, u16 ix) { - struct xdp_umem *umem = mlx5e_xsk_get_umem(&priv->channels.params, + struct xsk_buff_pool *pool = mlx5e_xsk_get_pool(&priv->channels.params, &priv->xsk, ix); struct mlx5e_channel *c; - if (unlikely(!umem)) + if (unlikely(!pool)) return -EINVAL; if (!test_bit(MLX5E_STATE_OPENED, &priv->state)) - goto remove_umem; + goto remove_pool; /* XSK RQ and SQ are only created if XDP program is set. */ if (!priv->channels.params.xdp_prog) - goto remove_umem; + goto remove_pool; c = priv->channels.c[ix]; mlx5e_xsk_redirect_rqt_to_drop(priv, ix); mlx5e_deactivate_xsk(c); mlx5e_close_xsk(c); -remove_umem: - mlx5e_xsk_remove_umem(&priv->xsk, ix); - mlx5e_xsk_unmap_umem(priv, umem); +remove_pool: + mlx5e_xsk_remove_pool(&priv->xsk, ix); + mlx5e_xsk_unmap_pool(priv, pool); return 0; } -static int mlx5e_xsk_enable_umem(struct mlx5e_priv *priv, struct xdp_umem *umem, +static int mlx5e_xsk_enable_pool(struct mlx5e_priv *priv, struct xsk_buff_pool *pool, u16 ix) { int err; mutex_lock(&priv->state_lock); - err = mlx5e_xsk_enable_locked(priv, umem, ix); + err = mlx5e_xsk_enable_locked(priv, pool, ix); mutex_unlock(&priv->state_lock); return err; } -static int mlx5e_xsk_disable_umem(struct mlx5e_priv *priv, u16 ix) +static int mlx5e_xsk_disable_pool(struct mlx5e_priv *priv, u16 ix) { int err; @@ -203,7 +203,7 @@ static int mlx5e_xsk_disable_umem(struct mlx5e_priv *priv, u16 ix) return err; } -int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid) +int mlx5e_xsk_setup_pool(struct net_device *dev, struct xsk_buff_pool *pool, u16 qid) { struct mlx5e_priv *priv = netdev_priv(dev); struct mlx5e_params *params = &priv->channels.params; @@ -212,6 +212,6 @@ int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid) if (unlikely(!mlx5e_qid_get_ch_if_in_group(params, qid, MLX5E_RQ_GROUP_XSK, &ix))) return -EINVAL; - return umem ? mlx5e_xsk_enable_umem(priv, umem, ix) : - mlx5e_xsk_disable_umem(priv, ix); + return pool ? mlx5e_xsk_enable_pool(priv, pool, ix) : + mlx5e_xsk_disable_pool(priv, ix); } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.h new file mode 100644 index 000000000000..dca0010a0866 --- /dev/null +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/pool.h @@ -0,0 +1,27 @@ +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ +/* Copyright (c) 2019-2020, Mellanox Technologies inc. All rights reserved. 
*/ + +#ifndef __MLX5_EN_XSK_POOL_H__ +#define __MLX5_EN_XSK_POOL_H__ + +#include "en.h" + +static inline struct xsk_buff_pool *mlx5e_xsk_get_pool(struct mlx5e_params *params, + struct mlx5e_xsk *xsk, u16 ix) +{ + if (!xsk || !xsk->pools) + return NULL; + + if (unlikely(ix >= params->num_channels)) + return NULL; + + return xsk->pools[ix]; +} + +struct mlx5e_xsk_param; +void mlx5e_build_xsk_param(struct xsk_buff_pool *pool, struct mlx5e_xsk_param *xsk); + +/* .ndo_bpf callback. */ +int mlx5e_xsk_setup_pool(struct net_device *dev, struct xsk_buff_pool *pool, u16 qid); + +#endif /* __MLX5_EN_XSK_POOL_H__ */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c index 786fedf52436..bb6669d2a916 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c @@ -48,7 +48,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, xdp->data_end = xdp->data + cqe_bcnt32; xdp_set_data_meta_invalid(xdp); - xsk_buff_dma_sync_for_cpu(xdp); + xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool); net_prefetch(xdp->data); rcu_read_lock(); @@ -99,7 +99,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq, xdp->data_end = xdp->data + cqe_bcnt; xdp_set_data_meta_invalid(xdp); - xsk_buff_dma_sync_for_cpu(xdp); + xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool); net_prefetch(xdp->data); if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_RESP_SEND)) { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h index d147b2f13b54..7f88ccf67fdd 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h @@ -19,10 +19,10 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi, u32 cqe_bcnt); -static inline int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq, +static inline int mlx5e_xsk_page_alloc_pool(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info) { - dma_info->xsk = xsk_buff_alloc(rq->umem); + dma_info->xsk = xsk_buff_alloc(rq->xsk_pool); if (!dma_info->xsk) return -ENOMEM; @@ -38,13 +38,13 @@ static inline int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq, static inline bool mlx5e_xsk_update_rx_wakeup(struct mlx5e_rq *rq, bool alloc_err) { - if (!xsk_umem_uses_need_wakeup(rq->umem)) + if (!xsk_uses_need_wakeup(rq->xsk_pool)) return alloc_err; if (unlikely(alloc_err)) - xsk_set_rx_need_wakeup(rq->umem); + xsk_set_rx_need_wakeup(rq->xsk_pool); else - xsk_clear_rx_need_wakeup(rq->umem); + xsk_clear_rx_need_wakeup(rq->xsk_pool); return false; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c index dd9df519d383..662a1dafeaad 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c @@ -45,7 +45,7 @@ static void mlx5e_build_xsk_cparam(struct mlx5e_priv *priv, } int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params, - struct mlx5e_xsk_param *xsk, struct xdp_umem *umem, + struct mlx5e_xsk_param *xsk, struct xsk_buff_pool *pool, struct mlx5e_channel *c) { struct mlx5e_channel_param *cparam; @@ -64,7 +64,7 @@ int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params, if (unlikely(err)) goto err_free_cparam; - err = mlx5e_open_rq(c, params, &cparam->rq, xsk, umem, &c->xskrq); + err = mlx5e_open_rq(c, params, &cparam->rq, xsk, pool, 
&c->xskrq); if (unlikely(err)) goto err_close_rx_cq; @@ -72,13 +72,13 @@ int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params, if (unlikely(err)) goto err_close_rq; - /* Create a separate SQ, so that when the UMEM is disabled, we could + /* Create a separate SQ, so that when the buff pool is disabled, we could * close this SQ safely and stop receiving CQEs. In other case, e.g., if - * the XDPSQ was used instead, we might run into trouble when the UMEM + * the XDPSQ was used instead, we might run into trouble when the buff pool * is disabled and then reenabled, but the SQ continues receiving CQEs - * from the old UMEM. + * from the old buff pool. */ - err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, umem, &c->xsksq, true); + err = mlx5e_open_xdpsq(c, params, &cparam->xdp_sq, pool, &c->xsksq, true); if (unlikely(err)) goto err_close_tx_cq; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h index 0dd11b81c046..ca20f1ff5e39 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.h @@ -12,7 +12,7 @@ bool mlx5e_validate_xsk_param(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk, struct mlx5_core_dev *mdev); int mlx5e_open_xsk(struct mlx5e_priv *priv, struct mlx5e_params *params, - struct mlx5e_xsk_param *xsk, struct xdp_umem *umem, + struct mlx5e_xsk_param *xsk, struct xsk_buff_pool *pool, struct mlx5e_channel *c); void mlx5e_close_xsk(struct mlx5e_channel *c); void mlx5e_activate_xsk(struct mlx5e_channel *c); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c index 4d892f6cecb3..aa91cbdfe969 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c @@ -2,7 +2,7 @@ /* Copyright (c) 2019 Mellanox Technologies. */ #include "tx.h" -#include "umem.h" +#include "pool.h" #include "en/xdp.h" #include "en/params.h" #include <net/xdp_sock_drv.h> @@ -66,7 +66,7 @@ static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq, bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget) { - struct xdp_umem *umem = sq->umem; + struct xsk_buff_pool *pool = sq->xsk_pool; struct mlx5e_xdp_info xdpi; struct mlx5e_xdp_xmit_data xdptxd; bool work_done = true; @@ -87,7 +87,7 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget) break; } - if (!xsk_umem_consume_tx(umem, &desc)) { + if (!xsk_tx_peek_desc(pool, &desc)) { /* TX will get stuck until something wakes it up by * triggering NAPI. 
Currently it's expected that the * application calls sendto() if there are consumed, but @@ -96,11 +96,11 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget) break; } - xdptxd.dma_addr = xsk_buff_raw_get_dma(umem, desc.addr); - xdptxd.data = xsk_buff_raw_get_data(umem, desc.addr); + xdptxd.dma_addr = xsk_buff_raw_get_dma(pool, desc.addr); + xdptxd.data = xsk_buff_raw_get_data(pool, desc.addr); xdptxd.len = desc.len; - xsk_buff_raw_dma_sync_for_device(umem, xdptxd.dma_addr, xdptxd.len); + xsk_buff_raw_dma_sync_for_device(pool, xdptxd.dma_addr, xdptxd.len); ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe, mlx5e_xmit_xdp_frame, sq, &xdptxd, &xdpi, check_result); @@ -119,7 +119,7 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget) mlx5e_xdp_mpwqe_complete(sq); mlx5e_xmit_xdp_doorbell(sq); - xsk_umem_consume_tx_done(umem); + xsk_tx_release(pool); } return !(budget && work_done); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h index 39fa0a705856..a05085035f23 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.h @@ -15,13 +15,13 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget); static inline void mlx5e_xsk_update_tx_wakeup(struct mlx5e_xdpsq *sq) { - if (!xsk_umem_uses_need_wakeup(sq->umem)) + if (!xsk_uses_need_wakeup(sq->xsk_pool)) return; if (sq->pc != sq->cc) - xsk_clear_tx_need_wakeup(sq->umem); + xsk_clear_tx_need_wakeup(sq->xsk_pool); else - xsk_set_tx_need_wakeup(sq->umem); + xsk_set_tx_need_wakeup(sq->xsk_pool); } #endif /* __MLX5_EN_XSK_TX_H__ */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h deleted file mode 100644 index bada94973586..000000000000 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/umem.h +++ /dev/null @@ -1,29 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */ -/* Copyright (c) 2019 Mellanox Technologies. */ - -#ifndef __MLX5_EN_XSK_UMEM_H__ -#define __MLX5_EN_XSK_UMEM_H__ - -#include "en.h" - -static inline struct xdp_umem *mlx5e_xsk_get_umem(struct mlx5e_params *params, - struct mlx5e_xsk *xsk, u16 ix) -{ - if (!xsk || !xsk->umems) - return NULL; - - if (unlikely(ix >= params->num_channels)) - return NULL; - - return xsk->umems[ix]; -} - -struct mlx5e_xsk_param; -void mlx5e_build_xsk_param(struct xdp_umem *umem, struct mlx5e_xsk_param *xsk); - -/* .ndo_bpf callback. 
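/*
 * Editorial sketch, not part of the patch: the zero-copy TX loop now peeks
 * descriptors from the pool, mirroring ixgbe_xmit_zc() and mlx5e_xsk_tx()
 * above. foo_ring and foo_post_tx() are hypothetical driver pieces; the
 * xsk_* calls are the renamed API from this series.
 */
#include <linux/dma-mapping.h>
#include <net/xdp_sock_drv.h>

struct foo_ring {
	struct xsk_buff_pool *xsk_pool;
};

void foo_post_tx(struct foo_ring *ring, dma_addr_t dma, u32 len); /* HW-specific */

static bool foo_xmit_zc(struct foo_ring *ring, unsigned int budget)
{
	struct xsk_buff_pool *pool = ring->xsk_pool;
	struct xdp_desc desc;
	bool sent = false;

	while (budget-- && xsk_tx_peek_desc(pool, &desc)) {
		dma_addr_t dma = xsk_buff_raw_get_dma(pool, desc.addr);

		/* Sync the frame for the device before handing it to HW. */
		xsk_buff_raw_dma_sync_for_device(pool, dma, desc.len);
		foo_post_tx(ring, dma, desc.len);
		sent = true;
	}

	/* Tell the socket that TX descriptors were consumed; completions are
	 * reported later from the clean-IRQ path via xsk_tx_completed().
	 */
	if (sent)
		xsk_tx_release(pool);

	return sent;
}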
*/ -int mlx5e_xsk_setup_umem(struct net_device *dev, struct xdp_umem *umem, u16 qid); - -int mlx5e_xsk_resize_reuseq(struct xdp_umem *umem, u32 nentries); - -#endif /* __MLX5_EN_XSK_UMEM_H__ */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c index 08270987c506..5cb1e4839eb7 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c @@ -32,7 +32,7 @@ #include "en.h" #include "en/port.h" -#include "en/xsk/umem.h" +#include "en/xsk/pool.h" #include "lib/clock.h" void mlx5e_ethtool_get_drvinfo(struct mlx5e_priv *priv, diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c index 83c9b2bbc4af..b416a8ee2eed 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs_ethtool.c @@ -33,7 +33,7 @@ #include <linux/mlx5/fs.h> #include "en.h" #include "en/params.h" -#include "en/xsk/umem.h" +#include "en/xsk/pool.h" struct mlx5e_ethtool_rule { struct list_head list; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index aebcf73f8546..26834625556d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -57,7 +57,7 @@ #include "en/monitor_stats.h" #include "en/health.h" #include "en/params.h" -#include "en/xsk/umem.h" +#include "en/xsk/pool.h" #include "en/xsk/setup.h" #include "en/xsk/rx.h" #include "en/xsk/tx.h" @@ -363,7 +363,7 @@ static void mlx5e_rq_err_cqe_work(struct work_struct *recover_work) static int mlx5e_alloc_rq(struct mlx5e_channel *c, struct mlx5e_params *params, struct mlx5e_xsk_param *xsk, - struct xdp_umem *umem, + struct xsk_buff_pool *xsk_pool, struct mlx5e_rq_param *rqp, struct mlx5e_rq *rq) { @@ -389,9 +389,9 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, rq->mdev = mdev; rq->hw_mtu = MLX5E_SW2HW_MTU(params, params->sw_mtu); rq->xdpsq = &c->rq_xdpsq; - rq->umem = umem; + rq->xsk_pool = xsk_pool; - if (rq->umem) + if (rq->xsk_pool) rq->stats = &c->priv->channel_stats[c->ix].xskrq; else rq->stats = &c->priv->channel_stats[c->ix].rq; @@ -477,7 +477,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c, if (xsk) { err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, MEM_TYPE_XSK_BUFF_POOL, NULL); - xsk_buff_set_rxq_info(rq->umem, &rq->xdp_rxq); + xsk_pool_set_rxq_info(rq->xsk_pool, &rq->xdp_rxq); } else { /* Create a page_pool and register it with rxq */ pp_params.order = 0; @@ -816,11 +816,11 @@ void mlx5e_free_rx_descs(struct mlx5e_rq *rq) int mlx5e_open_rq(struct mlx5e_channel *c, struct mlx5e_params *params, struct mlx5e_rq_param *param, struct mlx5e_xsk_param *xsk, - struct xdp_umem *umem, struct mlx5e_rq *rq) + struct xsk_buff_pool *xsk_pool, struct mlx5e_rq *rq) { int err; - err = mlx5e_alloc_rq(c, params, xsk, umem, param, rq); + err = mlx5e_alloc_rq(c, params, xsk, xsk_pool, param, rq); if (err) return err; @@ -925,7 +925,7 @@ static int mlx5e_alloc_xdpsq_db(struct mlx5e_xdpsq *sq, int numa) static int mlx5e_alloc_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params, - struct xdp_umem *umem, + struct xsk_buff_pool *xsk_pool, struct mlx5e_sq_param *param, struct mlx5e_xdpsq *sq, bool is_redirect) @@ -941,9 +941,9 @@ static int mlx5e_alloc_xdpsq(struct mlx5e_channel *c, sq->uar_map = mdev->mlx5e_res.bfreg.map; sq->min_inline_mode = params->tx_min_inline_mode; sq->hw_mtu = 
MLX5E_SW2HW_MTU(params, params->sw_mtu); - sq->umem = umem; + sq->xsk_pool = xsk_pool; - sq->stats = sq->umem ? + sq->stats = sq->xsk_pool ? &c->priv->channel_stats[c->ix].xsksq : is_redirect ? &c->priv->channel_stats[c->ix].xdpsq : @@ -1408,13 +1408,13 @@ void mlx5e_close_icosq(struct mlx5e_icosq *sq) } int mlx5e_open_xdpsq(struct mlx5e_channel *c, struct mlx5e_params *params, - struct mlx5e_sq_param *param, struct xdp_umem *umem, + struct mlx5e_sq_param *param, struct xsk_buff_pool *xsk_pool, struct mlx5e_xdpsq *sq, bool is_redirect) { struct mlx5e_create_sq_param csp = {}; int err; - err = mlx5e_alloc_xdpsq(c, params, umem, param, sq, is_redirect); + err = mlx5e_alloc_xdpsq(c, params, xsk_pool, param, sq, is_redirect); if (err) return err; @@ -1907,7 +1907,7 @@ static u8 mlx5e_enumerate_lag_port(struct mlx5_core_dev *mdev, int ix) static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix, struct mlx5e_params *params, struct mlx5e_channel_param *cparam, - struct xdp_umem *umem, + struct xsk_buff_pool *xsk_pool, struct mlx5e_channel **cp) { int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix)); @@ -1946,9 +1946,9 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix, if (unlikely(err)) goto err_napi_del; - if (umem) { - mlx5e_build_xsk_param(umem, &xsk); - err = mlx5e_open_xsk(priv, params, &xsk, umem, c); + if (xsk_pool) { + mlx5e_build_xsk_param(xsk_pool, &xsk); + err = mlx5e_open_xsk(priv, params, &xsk, xsk_pool, c); if (unlikely(err)) goto err_close_queues; } @@ -2309,12 +2309,12 @@ int mlx5e_open_channels(struct mlx5e_priv *priv, mlx5e_build_channel_param(priv, &chs->params, cparam); for (i = 0; i < chs->num; i++) { - struct xdp_umem *umem = NULL; + struct xsk_buff_pool *xsk_pool = NULL; if (chs->params.xdp_prog) - umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, i); + xsk_pool = mlx5e_xsk_get_pool(&chs->params, chs->params.xsk, i); - err = mlx5e_open_channel(priv, i, &chs->params, cparam, umem, &chs->c[i]); + err = mlx5e_open_channel(priv, i, &chs->params, cparam, xsk_pool, &chs->c[i]); if (err) goto err_close_channels; } @@ -3892,13 +3892,14 @@ static bool mlx5e_xsk_validate_mtu(struct net_device *netdev, u16 ix; for (ix = 0; ix < chs->params.num_channels; ix++) { - struct xdp_umem *umem = mlx5e_xsk_get_umem(&chs->params, chs->params.xsk, ix); + struct xsk_buff_pool *xsk_pool = + mlx5e_xsk_get_pool(&chs->params, chs->params.xsk, ix); struct mlx5e_xsk_param xsk; - if (!umem) + if (!xsk_pool) continue; - mlx5e_build_xsk_param(umem, &xsk); + mlx5e_build_xsk_param(xsk_pool, &xsk); if (!mlx5e_validate_xsk_param(new_params, &xsk, mdev)) { u32 hr = mlx5e_get_linear_rq_headroom(new_params, &xsk); @@ -4423,8 +4424,8 @@ static int mlx5e_xdp(struct net_device *dev, struct netdev_bpf *xdp) switch (xdp->command) { case XDP_SETUP_PROG: return mlx5e_xdp_set(dev, xdp->prog); - case XDP_SETUP_XSK_UMEM: - return mlx5e_xsk_setup_umem(dev, xdp->xsk.umem, + case XDP_SETUP_XSK_POOL: + return mlx5e_xsk_setup_pool(dev, xdp->xsk.pool, xdp->xsk.queue_id); default: return -EINVAL; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 228fd775bcd5..7aab69e991a5 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -280,8 +280,8 @@ static inline int mlx5e_page_alloc_pool(struct mlx5e_rq *rq, static inline int mlx5e_page_alloc(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info) { - if (rq->umem) - return mlx5e_xsk_page_alloc_umem(rq, 
dma_info); + if (rq->xsk_pool) + return mlx5e_xsk_page_alloc_pool(rq, dma_info); else return mlx5e_page_alloc_pool(rq, dma_info); } @@ -312,7 +312,7 @@ static inline void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info, bool recycle) { - if (rq->umem) + if (rq->xsk_pool) /* The `recycle` parameter is ignored, and the page is always * put into the Reuse Ring, because there is no way to return * the page to the userspace when the interface goes down. @@ -399,14 +399,14 @@ static int mlx5e_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, u8 wqe_bulk) int err; int i; - if (rq->umem) { + if (rq->xsk_pool) { int pages_desired = wqe_bulk << rq->wqe.info.log_num_frags; /* Check in advance that we have enough frames, instead of * allocating one-by-one, failing and moving frames to the * Reuse Ring. */ - if (unlikely(!xsk_buff_can_alloc(rq->umem, pages_desired))) + if (unlikely(!xsk_buff_can_alloc(rq->xsk_pool, pages_desired))) return -ENOMEM; } @@ -504,8 +504,8 @@ static int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix) /* Check in advance that we have enough frames, instead of allocating * one-by-one, failing and moving frames to the Reuse Ring. */ - if (rq->umem && - unlikely(!xsk_buff_can_alloc(rq->umem, MLX5_MPWRQ_PAGES_PER_WQE))) { + if (rq->xsk_pool && + unlikely(!xsk_buff_can_alloc(rq->xsk_pool, MLX5_MPWRQ_PAGES_PER_WQE))) { err = -ENOMEM; goto err; } @@ -753,7 +753,7 @@ INDIRECT_CALLABLE_SCOPE bool mlx5e_post_rx_mpwqes(struct mlx5e_rq *rq) * the driver when it refills the Fill Ring. * 2. Otherwise, busy poll by rescheduling the NAPI poll. */ - if (unlikely(alloc_err == -ENOMEM && rq->umem)) + if (unlikely(alloc_err == -ENOMEM && rq->xsk_pool)) return true; return false; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 3c11a77f5709..efaef83b8897 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -219,24 +219,6 @@ struct veth { __be16 h_vlan_TCI; }; -bool tun_is_xdp_frame(void *ptr) -{ - return (unsigned long)ptr & TUN_XDP_FLAG; -} -EXPORT_SYMBOL(tun_is_xdp_frame); - -void *tun_xdp_to_ptr(void *ptr) -{ - return (void *)((unsigned long)ptr | TUN_XDP_FLAG); -} -EXPORT_SYMBOL(tun_xdp_to_ptr); - -void *tun_ptr_to_xdp(void *ptr) -{ - return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG); -} -EXPORT_SYMBOL(tun_ptr_to_xdp); - static int tun_napi_receive(struct napi_struct *napi, int budget) { struct tun_file *tfile = container_of(napi, struct tun_file, napi); diff --git a/drivers/net/veth.c b/drivers/net/veth.c index e56cd562a664..b80cbffeb88e 100644 --- a/drivers/net/veth.c +++ b/drivers/net/veth.c @@ -234,14 +234,14 @@ static bool veth_is_xdp_frame(void *ptr) return (unsigned long)ptr & VETH_XDP_FLAG; } -static void *veth_ptr_to_xdp(void *ptr) +static struct xdp_frame *veth_ptr_to_xdp(void *ptr) { return (void *)((unsigned long)ptr & ~VETH_XDP_FLAG); } -static void *veth_xdp_to_ptr(void *ptr) +static void *veth_xdp_to_ptr(struct xdp_frame *xdp) { - return (void *)((unsigned long)ptr | VETH_XDP_FLAG); + return (void *)((unsigned long)xdp | VETH_XDP_FLAG); } static void veth_ptr_free(void *ptr) diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h index 64f367044e25..2f98d2fce62e 100644 --- a/include/linux/bpf-cgroup.h +++ b/include/linux/bpf-cgroup.h @@ -279,6 +279,31 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, void *key, #define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) \ BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL) +/* The SOCK_OPS"_SK" macro should be used when sock_ops->sk is not a + * fullsock 
and its parent fullsock cannot be traced by + * sk_to_full_sk(). + * + * e.g. sock_ops->sk is a request_sock and it is under syncookie mode. + * Its listener-sk is not attached to the rsk_listener. + * In this case, the caller holds the listener-sk (unlocked), + * set its sock_ops->sk to req_sk, and call this SOCK_OPS"_SK" with + * the listener-sk such that the cgroup-bpf-progs of the + * listener-sk will be run. + * + * Regardless of syncookie mode or not, + * calling bpf_setsockopt on listener-sk will not make sense anyway, + * so passing 'sock_ops->sk == req_sk' to the bpf prog is appropriate here. + */ +#define BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(sock_ops, sk) \ +({ \ + int __ret = 0; \ + if (cgroup_bpf_enabled) \ + __ret = __cgroup_bpf_run_filter_sock_ops(sk, \ + sock_ops, \ + BPF_CGROUP_SOCK_OPS); \ + __ret; \ +}) + #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) \ ({ \ int __ret = 0; \ diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 55f694b63164..c6d9f2c444f4 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -34,6 +34,8 @@ struct btf_type; struct exception_table_entry; struct seq_operations; struct bpf_iter_aux_info; +struct bpf_local_storage; +struct bpf_local_storage_map; extern struct idr btf_idr; extern spinlock_t btf_idr_lock; @@ -104,6 +106,25 @@ struct bpf_map_ops { __poll_t (*map_poll)(struct bpf_map *map, struct file *filp, struct poll_table_struct *pts); + /* Functions called by bpf_local_storage maps */ + int (*map_local_storage_charge)(struct bpf_local_storage_map *smap, + void *owner, u32 size); + void (*map_local_storage_uncharge)(struct bpf_local_storage_map *smap, + void *owner, u32 size); + struct bpf_local_storage __rcu ** (*map_owner_storage_ptr)(void *owner); + + /* map_meta_equal must be implemented for maps that can be + * used as an inner map. It is a runtime check to ensure + * an inner map can be inserted to an outer map. + * + * Some properties of the inner map has been used during the + * verification time. When inserting an inner map at the runtime, + * map_meta_equal has to ensure the inserting map has the same + * properties that the verifier has used earlier. + */ + bool (*map_meta_equal)(const struct bpf_map *meta0, + const struct bpf_map *meta1); + /* BTF name and id of struct allocated by map_alloc */ const char * const map_btf_name; int *map_btf_id; @@ -227,6 +248,9 @@ int map_check_no_btf(const struct bpf_map *map, const struct btf_type *key_type, const struct btf_type *value_type); +bool bpf_map_meta_equal(const struct bpf_map *meta0, + const struct bpf_map *meta1); + extern const struct bpf_map_ops bpf_map_offload_ops; /* function argument constraints */ @@ -309,6 +333,7 @@ struct bpf_func_proto { * for this argument. */ int *ret_btf_id; /* return value btf_id */ + bool (*allowed)(const struct bpf_prog *prog); }; /* bpf_context is intentionally undefined structure. 
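/*
 * Editorial sketch, not part of the patch: a map type that may be used as an
 * inner map now publishes ->map_meta_equal; most types can point it at the
 * generic bpf_map_meta_equal() declared above, which (roughly) requires the
 * same map type, key/value size and flags as the meta map. The ops struct is
 * illustrative and elides every other callback.
 */
#include <linux/bpf.h>

const struct bpf_map_ops foo_map_ops = {
	.map_meta_equal = bpf_map_meta_equal,
	/* ... map_alloc, map_lookup_elem, etc. unchanged ... */
};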
Pointer to bpf_context is @@ -514,6 +539,8 @@ int arch_prepare_bpf_trampoline(void *image, void *image_end, /* these two functions are called from generated trampoline */ u64 notrace __bpf_prog_enter(void); void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start); +void notrace __bpf_prog_enter_sleepable(void); +void notrace __bpf_prog_exit_sleepable(void); struct bpf_ksym { unsigned long start; @@ -709,6 +736,7 @@ struct bpf_prog_aux { bool offload_requested; bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */ bool func_proto_unreliable; + bool sleepable; enum bpf_tramp_prog_type trampoline_prog_type; struct bpf_trampoline *trampoline; struct hlist_node tramp_hlist; @@ -1218,12 +1246,18 @@ typedef int (*bpf_iter_attach_target_t)(struct bpf_prog *prog, union bpf_iter_link_info *linfo, struct bpf_iter_aux_info *aux); typedef void (*bpf_iter_detach_target_t)(struct bpf_iter_aux_info *aux); +typedef void (*bpf_iter_show_fdinfo_t) (const struct bpf_iter_aux_info *aux, + struct seq_file *seq); +typedef int (*bpf_iter_fill_link_info_t)(const struct bpf_iter_aux_info *aux, + struct bpf_link_info *info); #define BPF_ITER_CTX_ARG_MAX 2 struct bpf_iter_reg { const char *target; bpf_iter_attach_target_t attach_target; bpf_iter_detach_target_t detach_target; + bpf_iter_show_fdinfo_t show_fdinfo; + bpf_iter_fill_link_info_t fill_link_info; u32 ctx_arg_info_size; struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX]; const struct bpf_iter_seq_info *seq_info; @@ -1250,6 +1284,10 @@ int bpf_iter_new_fd(struct bpf_link *link); bool bpf_link_is_iter(struct bpf_link *link); struct bpf_prog *bpf_iter_get_info(struct bpf_iter_meta *meta, bool in_stop); int bpf_iter_run_prog(struct bpf_prog *prog, void *ctx); +void bpf_iter_map_show_fdinfo(const struct bpf_iter_aux_info *aux, + struct seq_file *seq); +int bpf_iter_map_fill_link_info(const struct bpf_iter_aux_info *aux, + struct bpf_link_info *info); int bpf_percpu_hash_copy(struct bpf_map *map, void *key, void *value); int bpf_percpu_array_copy(struct bpf_map *map, void *key, void *value); @@ -1340,6 +1378,8 @@ int btf_struct_access(struct bpf_verifier_log *log, const struct btf_type *t, int off, int size, enum bpf_access_type atype, u32 *next_btf_id); +bool btf_struct_ids_match(struct bpf_verifier_log *log, + int off, u32 id, u32 need_type_id); int btf_resolve_helper_id(struct bpf_verifier_log *log, const struct bpf_func_proto *fn, int); @@ -1358,6 +1398,7 @@ int btf_check_type_match(struct bpf_verifier_env *env, struct bpf_prog *prog, struct btf *btf, const struct btf_type *t); struct bpf_prog *bpf_prog_by_id(u32 id); +struct bpf_link *bpf_link_by_id(u32 id); const struct bpf_func_proto *bpf_base_func_proto(enum bpf_func_id func_id); #else /* !CONFIG_BPF_SYSCALL */ @@ -1637,6 +1678,7 @@ int sock_map_prog_update(struct bpf_map *map, struct bpf_prog *prog, struct bpf_prog *old, u32 which); int sock_map_get_from_fd(const union bpf_attr *attr, struct bpf_prog *prog); int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype); +int sock_map_update_elem_sys(struct bpf_map *map, void *key, void *value, u64 flags); void sock_map_unhash(struct sock *sk); void sock_map_close(struct sock *sk, long timeout); #else @@ -1658,6 +1700,12 @@ static inline int sock_map_prog_detach(const union bpf_attr *attr, { return -EOPNOTSUPP; } + +static inline int sock_map_update_elem_sys(struct bpf_map *map, void *key, void *value, + u64 flags) +{ + return -EOPNOTSUPP; +} #endif /* CONFIG_BPF_STREAM_PARSER */ #if defined(CONFIG_INET) && 
defined(CONFIG_BPF_SYSCALL) @@ -1736,6 +1784,7 @@ extern const struct bpf_func_proto bpf_skc_to_tcp_sock_proto; extern const struct bpf_func_proto bpf_skc_to_tcp_timewait_sock_proto; extern const struct bpf_func_proto bpf_skc_to_tcp_request_sock_proto; extern const struct bpf_func_proto bpf_skc_to_udp6_sock_proto; +extern const struct bpf_func_proto bpf_copy_from_user_proto; const struct bpf_func_proto *bpf_tracing_func_proto( enum bpf_func_id func_id, const struct bpf_prog *prog); @@ -1850,4 +1899,7 @@ enum bpf_text_poke_type { int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t, void *addr1, void *addr2); +struct btf_id_set; +bool btf_id_set_contains(struct btf_id_set *set, u32 id); + #endif /* _LINUX_BPF_H */ diff --git a/include/linux/bpf_local_storage.h b/include/linux/bpf_local_storage.h new file mode 100644 index 000000000000..b2c9463f36a1 --- /dev/null +++ b/include/linux/bpf_local_storage.h @@ -0,0 +1,163 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (c) 2019 Facebook + * Copyright 2020 Google LLC. + */ + +#ifndef _BPF_LOCAL_STORAGE_H +#define _BPF_LOCAL_STORAGE_H + +#include <linux/bpf.h> +#include <linux/rculist.h> +#include <linux/list.h> +#include <linux/hash.h> +#include <linux/types.h> +#include <uapi/linux/btf.h> + +#define BPF_LOCAL_STORAGE_CACHE_SIZE 16 + +struct bpf_local_storage_map_bucket { + struct hlist_head list; + raw_spinlock_t lock; +}; + +/* Thp map is not the primary owner of a bpf_local_storage_elem. + * Instead, the container object (eg. sk->sk_bpf_storage) is. + * + * The map (bpf_local_storage_map) is for two purposes + * 1. Define the size of the "local storage". It is + * the map's value_size. + * + * 2. Maintain a list to keep track of all elems such + * that they can be cleaned up during the map destruction. + * + * When a bpf local storage is being looked up for a + * particular object, the "bpf_map" pointer is actually used + * as the "key" to search in the list of elem in + * the respective bpf_local_storage owned by the object. + * + * e.g. sk->sk_bpf_storage is the mini-map with the "bpf_map" pointer + * as the searching key. + */ +struct bpf_local_storage_map { + struct bpf_map map; + /* Lookup elem does not require accessing the map. + * + * Updating/Deleting requires a bucket lock to + * link/unlink the elem from the map. Having + * multiple buckets to improve contention. + */ + struct bpf_local_storage_map_bucket *buckets; + u32 bucket_log; + u16 elem_size; + u16 cache_idx; +}; + +struct bpf_local_storage_data { + /* smap is used as the searching key when looking up + * from the object's bpf_local_storage. + * + * Put it in the same cacheline as the data to minimize + * the number of cachelines access during the cache hit case. + */ + struct bpf_local_storage_map __rcu *smap; + u8 data[] __aligned(8); +}; + +/* Linked to bpf_local_storage and bpf_local_storage_map */ +struct bpf_local_storage_elem { + struct hlist_node map_node; /* Linked to bpf_local_storage_map */ + struct hlist_node snode; /* Linked to bpf_local_storage */ + struct bpf_local_storage __rcu *local_storage; + struct rcu_head rcu; + /* 8 bytes hole */ + /* The data is stored in aother cacheline to minimize + * the number of cachelines access during a cache hit. 
+ */ + struct bpf_local_storage_data sdata ____cacheline_aligned; +}; + +struct bpf_local_storage { + struct bpf_local_storage_data __rcu *cache[BPF_LOCAL_STORAGE_CACHE_SIZE]; + struct hlist_head list; /* List of bpf_local_storage_elem */ + void *owner; /* The object that owns the above "list" of + * bpf_local_storage_elem. + */ + struct rcu_head rcu; + raw_spinlock_t lock; /* Protect adding/removing from the "list" */ +}; + +/* U16_MAX is much more than enough for sk local storage + * considering a tcp_sock is ~2k. + */ +#define BPF_LOCAL_STORAGE_MAX_VALUE_SIZE \ + min_t(u32, \ + (KMALLOC_MAX_SIZE - MAX_BPF_STACK - \ + sizeof(struct bpf_local_storage_elem)), \ + (U16_MAX - sizeof(struct bpf_local_storage_elem))) + +#define SELEM(_SDATA) \ + container_of((_SDATA), struct bpf_local_storage_elem, sdata) +#define SDATA(_SELEM) (&(_SELEM)->sdata) + +#define BPF_LOCAL_STORAGE_CACHE_SIZE 16 + +struct bpf_local_storage_cache { + spinlock_t idx_lock; + u64 idx_usage_counts[BPF_LOCAL_STORAGE_CACHE_SIZE]; +}; + +#define DEFINE_BPF_STORAGE_CACHE(name) \ +static struct bpf_local_storage_cache name = { \ + .idx_lock = __SPIN_LOCK_UNLOCKED(name.idx_lock), \ +} + +u16 bpf_local_storage_cache_idx_get(struct bpf_local_storage_cache *cache); +void bpf_local_storage_cache_idx_free(struct bpf_local_storage_cache *cache, + u16 idx); + +/* Helper functions for bpf_local_storage */ +int bpf_local_storage_map_alloc_check(union bpf_attr *attr); + +struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr); + +struct bpf_local_storage_data * +bpf_local_storage_lookup(struct bpf_local_storage *local_storage, + struct bpf_local_storage_map *smap, + bool cacheit_lockit); + +void bpf_local_storage_map_free(struct bpf_local_storage_map *smap); + +int bpf_local_storage_map_check_btf(const struct bpf_map *map, + const struct btf *btf, + const struct btf_type *key_type, + const struct btf_type *value_type); + +void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage, + struct bpf_local_storage_elem *selem); + +bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage, + struct bpf_local_storage_elem *selem, + bool uncharge_omem); + +void bpf_selem_unlink(struct bpf_local_storage_elem *selem); + +void bpf_selem_link_map(struct bpf_local_storage_map *smap, + struct bpf_local_storage_elem *selem); + +void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem); + +struct bpf_local_storage_elem * +bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, void *value, + bool charge_mem); + +int +bpf_local_storage_alloc(void *owner, + struct bpf_local_storage_map *smap, + struct bpf_local_storage_elem *first_selem); + +struct bpf_local_storage_data * +bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap, + void *value, u64 map_flags); + +#endif /* _BPF_LOCAL_STORAGE_H */ diff --git a/include/linux/bpf_lsm.h b/include/linux/bpf_lsm.h index af74712af585..aaacb6aafc87 100644 --- a/include/linux/bpf_lsm.h +++ b/include/linux/bpf_lsm.h @@ -17,9 +17,28 @@ #include <linux/lsm_hook_defs.h> #undef LSM_HOOK +struct bpf_storage_blob { + struct bpf_local_storage __rcu *storage; +}; + +extern struct lsm_blob_sizes bpf_lsm_blob_sizes; + int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, const struct bpf_prog *prog); +static inline struct bpf_storage_blob *bpf_inode( + const struct inode *inode) +{ + if (unlikely(!inode->i_security)) + return NULL; + + return inode->i_security + bpf_lsm_blob_sizes.lbs_inode; +} + +extern const struct bpf_func_proto 
bpf_inode_storage_get_proto; +extern const struct bpf_func_proto bpf_inode_storage_delete_proto; +void bpf_inode_storage_free(struct inode *inode); + #else /* !CONFIG_BPF_LSM */ static inline int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, @@ -28,6 +47,16 @@ static inline int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, return -EOPNOTSUPP; } +static inline struct bpf_storage_blob *bpf_inode( + const struct inode *inode) +{ + return NULL; +} + +static inline void bpf_inode_storage_free(struct inode *inode) +{ +} + #endif /* CONFIG_BPF_LSM */ #endif /* _LINUX_BPF_LSM_H */ diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index a52a5688418e..2e6f568377f1 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -107,6 +107,9 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_SK_STORAGE, sk_storage_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops) #endif +#ifdef CONFIG_BPF_LSM +BPF_MAP_TYPE(BPF_MAP_TYPE_INODE_STORAGE, inode_storage_map_ops) +#endif BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops) #if defined(CONFIG_XDP_SOCKETS) BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops) diff --git a/include/linux/btf.h b/include/linux/btf.h index 8b81fbb4497c..a9af5e7a7ece 100644 --- a/include/linux/btf.h +++ b/include/linux/btf.h @@ -64,8 +64,7 @@ const struct btf_type *btf_type_resolve_func_ptr(const struct btf *btf, u32 id, u32 *res_id); const struct btf_type * btf_resolve_size(const struct btf *btf, const struct btf_type *type, - u32 *type_size, const struct btf_type **elem_type, - u32 *total_nelems); + u32 *type_size); #define for_each_member(i, struct_type, member) \ for (i = 0, member = btf_type_member(struct_type); \ diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h index 4867d549e3c1..210b086188a3 100644 --- a/include/linux/btf_ids.h +++ b/include/linux/btf_ids.h @@ -3,6 +3,11 @@ #ifndef _LINUX_BTF_IDS_H #define _LINUX_BTF_IDS_H +struct btf_id_set { + u32 cnt; + u32 ids[]; +}; + #ifdef CONFIG_DEBUG_INFO_BTF #include <linux/compiler.h> /* for __PASTE */ @@ -62,7 +67,7 @@ asm( \ ".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ "." #scope " " #name "; \n" \ #name ":; \n" \ -".popsection; \n"); \ +".popsection; \n"); #define BTF_ID_LIST(name) \ __BTF_ID_LIST(name, local) \ @@ -88,12 +93,56 @@ asm( \ ".zero 4 \n" \ ".popsection; \n"); +/* + * The BTF_SET_START/END macros pair defines sorted list of + * BTF IDs plus its members count, with following layout: + * + * BTF_SET_START(list) + * BTF_ID(type1, name1) + * BTF_ID(type2, name2) + * BTF_SET_END(list) + * + * __BTF_ID__set__list: + * .zero 4 + * list: + * __BTF_ID__type1__name1__3: + * .zero 4 + * __BTF_ID__type2__name2__4: + * .zero 4 + * + */ +#define __BTF_SET_START(name, scope) \ +asm( \ +".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ +"." 
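/*
 * Editorial sketch, not part of the patch: with BPF_MAP_TYPE_INODE_STORAGE a
 * BPF LSM program can hang per-inode data off i_security, analogous to socket
 * local storage. Assumes CONFIG_BPF_LSM, libbpf BTF-defined map syntax, a
 * generated vmlinux.h, and the create flag added alongside this series; map
 * and program names are made up.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_INODE_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u64);
} inode_open_count SEC(".maps");

SEC("lsm/file_open")
int BPF_PROG(count_file_open, struct file *file)
{
	__u64 *val;

	/* Create the per-inode slot on first access. */
	val = bpf_inode_storage_get(&inode_open_count, file->f_inode, 0,
				    BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (val)
		(*val)++;

	return 0;	/* allow the open */
}

char LICENSE[] SEC("license") = "GPL";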
#scope " __BTF_ID__set__" #name "; \n" \ +"__BTF_ID__set__" #name ":; \n" \ +".zero 4 \n" \ +".popsection; \n"); + +#define BTF_SET_START(name) \ +__BTF_ID_LIST(name, local) \ +__BTF_SET_START(name, local) + +#define BTF_SET_START_GLOBAL(name) \ +__BTF_ID_LIST(name, globl) \ +__BTF_SET_START(name, globl) + +#define BTF_SET_END(name) \ +asm( \ +".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ +".size __BTF_ID__set__" #name ", .-" #name " \n" \ +".popsection; \n"); \ +extern struct btf_id_set name; + #else #define BTF_ID_LIST(name) static u32 name[5]; #define BTF_ID(prefix, name) #define BTF_ID_UNUSED #define BTF_ID_LIST_GLOBAL(name) u32 name[1]; +#define BTF_SET_START(name) static struct btf_id_set name = { 0 }; +#define BTF_SET_START_GLOBAL(name) static struct btf_id_set name = { 0 }; +#define BTF_SET_END(name) #endif /* CONFIG_DEBUG_INFO_BTF */ diff --git a/include/linux/filter.h b/include/linux/filter.h index 0a355b005bf4..995625950cc1 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -1236,13 +1236,17 @@ struct bpf_sock_addr_kern { struct bpf_sock_ops_kern { struct sock *sk; - u32 op; union { u32 args[4]; u32 reply; u32 replylong[4]; }; - u32 is_fullsock; + struct sk_buff *syn_skb; + struct sk_buff *skb; + void *skb_data_end; + u8 op; + u8 is_fullsock; + u8 remaining_opt_len; u64 temp; /* temp and everything after is not * initialized to 0 before calling * the BPF program. New fields that diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h index 5bda8cf457b6..2a7660843444 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -27,9 +27,18 @@ struct tun_xdp_hdr { #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE) struct socket *tun_get_socket(struct file *); struct ptr_ring *tun_get_tx_ring(struct file *file); -bool tun_is_xdp_frame(void *ptr); -void *tun_xdp_to_ptr(void *ptr); -void *tun_ptr_to_xdp(void *ptr); +static inline bool tun_is_xdp_frame(void *ptr) +{ + return (unsigned long)ptr & TUN_XDP_FLAG; +} +static inline void *tun_xdp_to_ptr(struct xdp_frame *xdp) +{ + return (void *)((unsigned long)xdp | TUN_XDP_FLAG); +} +static inline struct xdp_frame *tun_ptr_to_xdp(void *ptr) +{ + return (void *)((unsigned long)ptr & ~TUN_XDP_FLAG); +} void tun_ptr_free(void *ptr); #else #include <linux/err.h> @@ -48,11 +57,11 @@ static inline bool tun_is_xdp_frame(void *ptr) { return false; } -static inline void *tun_xdp_to_ptr(void *ptr) +static inline void *tun_xdp_to_ptr(struct xdp_frame *xdp) { return NULL; } -static inline void *tun_ptr_to_xdp(void *ptr) +static inline struct xdp_frame *tun_ptr_to_xdp(void *ptr) { return NULL; } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index c0b512e6a02b..7f9fcfd15942 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -618,7 +618,7 @@ struct netdev_queue { /* Subordinate device that the queue has been assigned to */ struct net_device *sb_dev; #ifdef CONFIG_XDP_SOCKETS - struct xdp_umem *umem; + struct xsk_buff_pool *pool; #endif /* * write-mostly part @@ -755,7 +755,7 @@ struct netdev_rx_queue { struct net_device *dev; struct xdp_rxq_info xdp_rxq; #ifdef CONFIG_XDP_SOCKETS - struct xdp_umem *umem; + struct xsk_buff_pool *pool; #endif } ____cacheline_aligned_in_smp; @@ -883,7 +883,7 @@ enum bpf_netdev_command { /* BPF program for offload callbacks, invoked at program load time. 
*/ BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE, - XDP_SETUP_XSK_UMEM, + XDP_SETUP_XSK_POOL, }; struct bpf_prog_offload_ops; @@ -917,9 +917,9 @@ struct netdev_bpf { struct { struct bpf_offloaded_map *offmap; }; - /* XDP_SETUP_XSK_UMEM */ + /* XDP_SETUP_XSK_POOL */ struct { - struct xdp_umem *umem; + struct xsk_buff_pool *pool; u16 queue_id; } xsk; }; diff --git a/include/linux/rcupdate_trace.h b/include/linux/rcupdate_trace.h index d9015aac78c6..aaaac8ac927c 100644 --- a/include/linux/rcupdate_trace.h +++ b/include/linux/rcupdate_trace.h @@ -82,7 +82,14 @@ static inline void rcu_read_unlock_trace(void) void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func); void synchronize_rcu_tasks_trace(void); void rcu_barrier_tasks_trace(void); - +#else +/* + * The BPF JIT forms these addresses even when it doesn't call these + * functions, so provide definitions that result in runtime errors. + */ +static inline void call_rcu_tasks_trace(struct rcu_head *rhp, rcu_callback_t func) { BUG(); } +static inline void rcu_read_lock_trace(void) { BUG(); } +static inline void rcu_read_unlock_trace(void) { BUG(); } #endif /* #ifdef CONFIG_TASKS_TRACE_RCU */ #endif /* __LINUX_RCUPDATE_TRACE_H */ diff --git a/include/linux/skmsg.h b/include/linux/skmsg.h index 1e9ed840b9fc..3119928fc103 100644 --- a/include/linux/skmsg.h +++ b/include/linux/skmsg.h @@ -340,23 +340,6 @@ static inline void sk_psock_update_proto(struct sock *sk, struct sk_psock *psock, struct proto *ops) { - /* Initialize saved callbacks and original proto only once, since this - * function may be called multiple times for a psock, e.g. when - * psock->progs.msg_parser is updated. - * - * Since we've not installed the new proto, psock is not yet in use and - * we can initialize it without synchronization. - */ - if (!psock->sk_proto) { - struct proto *orig = READ_ONCE(sk->sk_prot); - - psock->saved_unhash = orig->unhash; - psock->saved_close = orig->close; - psock->saved_write_space = sk->sk_write_space; - - psock->sk_proto = orig; - } - /* Pairs with lockless read in sk_clone_lock() */ WRITE_ONCE(sk->sk_prot, ops); } diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 14b62d7df942..56ff2952edaf 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -92,6 +92,8 @@ struct tcp_options_received { smc_ok : 1, /* SMC seen on SYN packet */ snd_wscale : 4, /* Window scaling received from sender */ rcv_wscale : 4; /* Window scaling to send to receiver */ + u8 saw_unknown:1, /* Received unknown option */ + unused:7; u8 num_sacks; /* Number of SACK blocks */ u16 user_mss; /* mss requested by user in ioctl */ u16 mss_clamp; /* Maximal mss, negotiated at connection setup */ @@ -237,14 +239,13 @@ struct tcp_sock { repair : 1, frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */ u8 repair_queue; - u8 syn_data:1, /* SYN includes data */ + u8 save_syn:2, /* Save headers of SYN packet */ + syn_data:1, /* SYN includes data */ syn_fastopen:1, /* SYN includes Fast Open option */ syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */ syn_fastopen_ch:1, /* Active TFO re-enabling probe */ syn_data_acked:1,/* data in SYN is acked by SYN-ACK */ - save_syn:1, /* Save headers of SYN packet */ - is_cwnd_limited:1,/* forward progress limited by snd_cwnd? */ - syn_smc:1; /* SYN includes SMC */ + is_cwnd_limited:1;/* forward progress limited by snd_cwnd? 
*/ u32 tlp_high_seq; /* snd_nxt at the time of TLP */ u32 tcp_tx_delay; /* delay (in usec) added to TX packets */ @@ -391,6 +392,9 @@ struct tcp_sock { #if IS_ENABLED(CONFIG_MPTCP) bool is_mptcp; #endif +#if IS_ENABLED(CONFIG_SMC) + bool syn_smc; /* SYN includes SMC */ +#endif #ifdef CONFIG_TCP_MD5SIG /* TCP AF-Specific parts; only used by MD5 Signature support so far */ @@ -406,7 +410,7 @@ struct tcp_sock { * socket. Used to retransmit SYNACKs etc. */ struct request_sock __rcu *fastopen_rsk; - u32 *saved_syn; + struct saved_syn *saved_syn; }; enum tsq_enum { @@ -484,6 +488,12 @@ static inline void tcp_saved_syn_free(struct tcp_sock *tp) tp->saved_syn = NULL; } +static inline u32 tcp_saved_syn_len(const struct saved_syn *saved_syn) +{ + return saved_syn->mac_hdrlen + saved_syn->network_hdrlen + + saved_syn->tcp_hdrlen; +} + struct sk_buff *tcp_get_timestamping_opt_stats(const struct sock *sk, const struct sk_buff *orig_skb); diff --git a/include/net/bpf_sk_storage.h b/include/net/bpf_sk_storage.h index 5036c94c0503..119f4c9c3a9c 100644 --- a/include/net/bpf_sk_storage.h +++ b/include/net/bpf_sk_storage.h @@ -3,13 +3,27 @@ #ifndef _BPF_SK_STORAGE_H #define _BPF_SK_STORAGE_H +#include <linux/rculist.h> +#include <linux/list.h> +#include <linux/hash.h> +#include <linux/types.h> +#include <linux/spinlock.h> +#include <linux/bpf.h> +#include <net/sock.h> +#include <uapi/linux/sock_diag.h> +#include <uapi/linux/btf.h> +#include <linux/bpf_local_storage.h> + struct sock; void bpf_sk_storage_free(struct sock *sk); extern const struct bpf_func_proto bpf_sk_storage_get_proto; extern const struct bpf_func_proto bpf_sk_storage_delete_proto; +extern const struct bpf_func_proto sk_storage_get_btf_proto; +extern const struct bpf_func_proto sk_storage_delete_btf_proto; +struct bpf_local_storage_elem; struct bpf_sk_storage_diag; struct sk_buff; struct nlattr; diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index aa8893c68c50..c738abeb3265 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -86,6 +86,8 @@ struct inet_connection_sock { struct timer_list icsk_retransmit_timer; struct timer_list icsk_delack_timer; __u32 icsk_rto; + __u32 icsk_rto_min; + __u32 icsk_delack_max; __u32 icsk_pmtu_cookie; const struct tcp_congestion_ops *icsk_ca_ops; const struct inet_connection_sock_af_ops *icsk_af_ops; diff --git a/include/net/request_sock.h b/include/net/request_sock.h index b2eb8b4ba697..29e41ff3ec93 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -41,6 +41,13 @@ struct request_sock_ops { int inet_rtx_syn_ack(const struct sock *parent, struct request_sock *req); +struct saved_syn { + u32 mac_hdrlen; + u32 network_hdrlen; + u32 tcp_hdrlen; + u8 data[]; +}; + /* struct request_sock - mini sock to represent a connection request */ struct request_sock { @@ -60,7 +67,7 @@ struct request_sock { struct timer_list rsk_timer; const struct request_sock_ops *rsk_ops; struct sock *sk; - u32 *saved_syn; + struct saved_syn *saved_syn; u32 secid; u32 peer_secid; }; diff --git a/include/net/sock.h b/include/net/sock.h index b943731fa879..7dd3051551fb 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -246,7 +246,7 @@ struct sock_common { /* public: */ }; -struct bpf_sk_storage; +struct bpf_local_storage; /** * struct sock - network layer representation of sockets @@ -517,7 +517,7 @@ struct sock { void (*sk_destruct)(struct sock *sk); struct sock_reuseport __rcu *sk_reuseport_cb; #ifdef CONFIG_BPF_SYSCALL - struct 
bpf_sk_storage __rcu *sk_bpf_storage; + struct bpf_local_storage __rcu *sk_bpf_storage; #endif struct rcu_head sk_rcu; }; diff --git a/include/net/tcp.h b/include/net/tcp.h index 0d11db6436c8..e85d564446c6 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -394,7 +394,7 @@ void tcp_metrics_init(void); bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst); void tcp_close(struct sock *sk, long timeout); void tcp_init_sock(struct sock *sk); -void tcp_init_transfer(struct sock *sk, int bpf_op); +void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb); __poll_t tcp_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait); int tcp_getsockopt(struct sock *sk, int level, int optname, @@ -455,7 +455,8 @@ enum tcp_synack_type { struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type); + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb); int tcp_disconnect(struct sock *sk, int flags); void tcp_finish_connect(struct sock *sk, struct sk_buff *skb); @@ -699,7 +700,7 @@ static inline void tcp_fast_path_check(struct sock *sk) static inline u32 tcp_rto_min(struct sock *sk) { const struct dst_entry *dst = __sk_dst_get(sk); - u32 rto_min = TCP_RTO_MIN; + u32 rto_min = inet_csk(sk)->icsk_rto_min; if (dst && dst_metric_locked(dst, RTAX_RTO_MIN)) rto_min = dst_metric_rtt(dst, RTAX_RTO_MIN); @@ -2025,7 +2026,8 @@ struct tcp_request_sock_ops { int (*send_synack)(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type); + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb); }; extern const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops; @@ -2223,6 +2225,55 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg, int len, int flags); #endif /* CONFIG_NET_SOCK_MSG */ +#ifdef CONFIG_CGROUP_BPF +/* Copy the listen sk's HDR_OPT_CB flags to its child. + * + * During 3-Way-HandShake, the synack is usually sent from + * the listen sk with the HDR_OPT_CB flags set so that + * bpf-prog will be called to write the BPF hdr option. + * + * In fastopen, the child sk is used to send synack instead + * of the listen sk. Thus, inheriting the HDR_OPT_CB flags + * from the listen sk gives the bpf-prog a chance to write + * BPF hdr option in the synack pkt during fastopen. + * + * Both fastopen and non-fastopen child will inherit the + * HDR_OPT_CB flags to keep the bpf-prog having a consistent + * behavior when deciding to clear this cb flags (or not) + * during the PASSIVE_ESTABLISHED_CB. + * + * In the future, other cb flags could be inherited here also. 
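The cb flags named in the comment above are added to the UAPI further below. As a minimal sketch of the intended usage (program name and attach setup are illustrative, not part of this series), a sock_ops program would set the write flag on the listening socket so that SYNACKs, and via the inheritance described above any fastopen child, start triggering the header-option callbacks::

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("sockops")
    int enable_hdr_opt(struct bpf_sock_ops *skops)
    {
        /* Once set on the listener, SYNACKs (and fastopen children, per
         * the inheritance described above) invoke the HDR_OPT_LEN and
         * WRITE_HDR_OPT callbacks added later in this series.
         */
        if (skops->op == BPF_SOCK_OPS_TCP_LISTEN_CB)
            bpf_sock_ops_cb_flags_set(skops,
                                      BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
        return 1;
    }

    char _license[] SEC("license") = "GPL";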
+ */ +static inline void bpf_skops_init_child(const struct sock *sk, + struct sock *child) +{ + tcp_sk(child)->bpf_sock_ops_cb_flags = + tcp_sk(sk)->bpf_sock_ops_cb_flags & + (BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG | + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG | + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG); +} + +static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops, + struct sk_buff *skb, + unsigned int end_offset) +{ + skops->skb = skb; + skops->skb_data_end = skb->data + end_offset; +} +#else +static inline void bpf_skops_init_child(const struct sock *sk, + struct sock *child) +{ +} + +static inline void bpf_skops_init_skb(struct bpf_sock_ops_kern *skops, + struct sk_buff *skb, + unsigned int end_offset) +{ +} +#endif + /* Call BPF_SOCK_OPS program that returns an int. If the return value * is < 0, then the BPF op failed (for example if the loaded BPF * program does not support the chosen operation or there is no BPF diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index c9d87cc40c11..1a9559c0cbdd 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -18,25 +18,19 @@ struct xsk_queue; struct xdp_buff; struct xdp_umem { - struct xsk_queue *fq; - struct xsk_queue *cq; - struct xsk_buff_pool *pool; + void *addrs; u64 size; u32 headroom; u32 chunk_size; + u32 chunks; + u32 npgs; struct user_struct *user; refcount_t users; - struct work_struct work; - struct page **pgs; - u32 npgs; - u16 queue_id; - u8 need_wakeup; u8 flags; - int id; - struct net_device *dev; bool zc; - spinlock_t xsk_tx_list_lock; - struct list_head xsk_tx_list; + struct page **pgs; + int id; + struct list_head xsk_dma_list; }; struct xsk_map { @@ -48,10 +42,11 @@ struct xsk_map { struct xdp_sock { /* struct sock must be the first member of struct xdp_sock */ struct sock sk; - struct xsk_queue *rx; + struct xsk_queue *rx ____cacheline_aligned_in_smp; struct net_device *dev; struct xdp_umem *umem; struct list_head flush_node; + struct xsk_buff_pool *pool; u16 queue_id; bool zc; enum { @@ -59,10 +54,9 @@ struct xdp_sock { XSK_BOUND, XSK_UNBOUND, } state; - /* Protects multiple processes in the control path */ - struct mutex mutex; + struct xsk_queue *tx ____cacheline_aligned_in_smp; - struct list_head list; + struct list_head tx_list; /* Mutual exclusion of NAPI TX thread and sendmsg error paths * in the SKB destructor callback. 
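For drivers, the xdp_umem to xsk_buff_pool conversion in the hunks below is mostly a mechanical rename of the helpers in xdp_sock_drv.h. A hedged sketch of a zero-copy RX completion path written against the new pool-based API (the function is hypothetical, not taken from any real driver)::

    #include <net/xdp_sock_drv.h>

    static void hypothetical_zc_rx(struct xsk_buff_pool *pool,
                                   struct xdp_buff *xdp, unsigned int len)
    {
        xdp->data_end = xdp->data + len;
        /* the DMA sync helper now takes the pool explicitly */
        xsk_buff_dma_sync_for_cpu(xdp, pool);
        /* ... run the XDP program, xsk_buff_free(xdp) on drop ... */
    }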
*/ @@ -77,6 +71,10 @@ struct xdp_sock { struct list_head map_list; /* Protects map_list */ spinlock_t map_list_lock; + /* Protects multiple processes in the control path */ + struct mutex mutex; + struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */ + struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */ }; #ifdef CONFIG_XDP_SOCKETS diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h index ccf848f7efa4..5b1ee8a9976d 100644 --- a/include/net/xdp_sock_drv.h +++ b/include/net/xdp_sock_drv.h @@ -11,47 +11,50 @@ #ifdef CONFIG_XDP_SOCKETS -void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries); -bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc); -void xsk_umem_consume_tx_done(struct xdp_umem *umem); -struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id); -void xsk_set_rx_need_wakeup(struct xdp_umem *umem); -void xsk_set_tx_need_wakeup(struct xdp_umem *umem); -void xsk_clear_rx_need_wakeup(struct xdp_umem *umem); -void xsk_clear_tx_need_wakeup(struct xdp_umem *umem); -bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem); +void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries); +bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc); +void xsk_tx_release(struct xsk_buff_pool *pool); +struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev, + u16 queue_id); +void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool); +void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool); +void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool); +void xsk_clear_tx_need_wakeup(struct xsk_buff_pool *pool); +bool xsk_uses_need_wakeup(struct xsk_buff_pool *pool); -static inline u32 xsk_umem_get_headroom(struct xdp_umem *umem) +static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool) { - return XDP_PACKET_HEADROOM + umem->headroom; + return XDP_PACKET_HEADROOM + pool->headroom; } -static inline u32 xsk_umem_get_chunk_size(struct xdp_umem *umem) +static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool) { - return umem->chunk_size; + return pool->chunk_size; } -static inline u32 xsk_umem_get_rx_frame_size(struct xdp_umem *umem) +static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool) { - return xsk_umem_get_chunk_size(umem) - xsk_umem_get_headroom(umem); + return xsk_pool_get_chunk_size(pool) - xsk_pool_get_headroom(pool); } -static inline void xsk_buff_set_rxq_info(struct xdp_umem *umem, +static inline void xsk_pool_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq) { - xp_set_rxq_info(umem->pool, rxq); + xp_set_rxq_info(pool, rxq); } -static inline void xsk_buff_dma_unmap(struct xdp_umem *umem, +static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) { - xp_dma_unmap(umem->pool, attrs); + xp_dma_unmap(pool, attrs); } -static inline int xsk_buff_dma_map(struct xdp_umem *umem, struct device *dev, - unsigned long attrs) +static inline int xsk_pool_dma_map(struct xsk_buff_pool *pool, + struct device *dev, unsigned long attrs) { - return xp_dma_map(umem->pool, dev, attrs, umem->pgs, umem->npgs); + struct xdp_umem *umem = pool->umem; + + return xp_dma_map(pool, dev, attrs, umem->pgs, umem->npgs); } static inline dma_addr_t xsk_buff_xdp_get_dma(struct xdp_buff *xdp) @@ -68,14 +71,14 @@ static inline dma_addr_t xsk_buff_xdp_get_frame_dma(struct xdp_buff *xdp) return xp_get_frame_dma(xskb); } -static inline struct xdp_buff *xsk_buff_alloc(struct xdp_umem *umem) +static inline struct xdp_buff 
*xsk_buff_alloc(struct xsk_buff_pool *pool) { - return xp_alloc(umem->pool); + return xp_alloc(pool); } -static inline bool xsk_buff_can_alloc(struct xdp_umem *umem, u32 count) +static inline bool xsk_buff_can_alloc(struct xsk_buff_pool *pool, u32 count) { - return xp_can_alloc(umem->pool, count); + return xp_can_alloc(pool, count); } static inline void xsk_buff_free(struct xdp_buff *xdp) @@ -85,100 +88,104 @@ static inline void xsk_buff_free(struct xdp_buff *xdp) xp_free(xskb); } -static inline dma_addr_t xsk_buff_raw_get_dma(struct xdp_umem *umem, u64 addr) +static inline dma_addr_t xsk_buff_raw_get_dma(struct xsk_buff_pool *pool, + u64 addr) { - return xp_raw_get_dma(umem->pool, addr); + return xp_raw_get_dma(pool, addr); } -static inline void *xsk_buff_raw_get_data(struct xdp_umem *umem, u64 addr) +static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr) { - return xp_raw_get_data(umem->pool, addr); + return xp_raw_get_data(pool, addr); } -static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp) +static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool) { struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp); + if (!pool->dma_need_sync) + return; + xp_dma_sync_for_cpu(xskb); } -static inline void xsk_buff_raw_dma_sync_for_device(struct xdp_umem *umem, +static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool, dma_addr_t dma, size_t size) { - xp_dma_sync_for_device(umem->pool, dma, size); + xp_dma_sync_for_device(pool, dma, size); } #else -static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries) +static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries) { } -static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, - struct xdp_desc *desc) +static inline bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, + struct xdp_desc *desc) { return false; } -static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem) +static inline void xsk_tx_release(struct xsk_buff_pool *pool) { } -static inline struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, - u16 queue_id) +static inline struct xsk_buff_pool * +xsk_get_pool_from_qid(struct net_device *dev, u16 queue_id) { return NULL; } -static inline void xsk_set_rx_need_wakeup(struct xdp_umem *umem) +static inline void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) { } -static inline void xsk_set_tx_need_wakeup(struct xdp_umem *umem) +static inline void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool) { } -static inline void xsk_clear_rx_need_wakeup(struct xdp_umem *umem) +static inline void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool) { } -static inline void xsk_clear_tx_need_wakeup(struct xdp_umem *umem) +static inline void xsk_clear_tx_need_wakeup(struct xsk_buff_pool *pool) { } -static inline bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem) +static inline bool xsk_uses_need_wakeup(struct xsk_buff_pool *pool) { return false; } -static inline u32 xsk_umem_get_headroom(struct xdp_umem *umem) +static inline u32 xsk_pool_get_headroom(struct xsk_buff_pool *pool) { return 0; } -static inline u32 xsk_umem_get_chunk_size(struct xdp_umem *umem) +static inline u32 xsk_pool_get_chunk_size(struct xsk_buff_pool *pool) { return 0; } -static inline u32 xsk_umem_get_rx_frame_size(struct xdp_umem *umem) +static inline u32 xsk_pool_get_rx_frame_size(struct xsk_buff_pool *pool) { return 0; } -static inline void xsk_buff_set_rxq_info(struct xdp_umem *umem, +static inline void 
xsk_pool_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq) { } -static inline void xsk_buff_dma_unmap(struct xdp_umem *umem, +static inline void xsk_pool_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) { } -static inline int xsk_buff_dma_map(struct xdp_umem *umem, struct device *dev, - unsigned long attrs) +static inline int xsk_pool_dma_map(struct xsk_buff_pool *pool, + struct device *dev, unsigned long attrs) { return 0; } @@ -193,12 +200,12 @@ static inline dma_addr_t xsk_buff_xdp_get_frame_dma(struct xdp_buff *xdp) return 0; } -static inline struct xdp_buff *xsk_buff_alloc(struct xdp_umem *umem) +static inline struct xdp_buff *xsk_buff_alloc(struct xsk_buff_pool *pool) { return NULL; } -static inline bool xsk_buff_can_alloc(struct xdp_umem *umem, u32 count) +static inline bool xsk_buff_can_alloc(struct xsk_buff_pool *pool, u32 count) { return false; } @@ -207,21 +214,22 @@ static inline void xsk_buff_free(struct xdp_buff *xdp) { } -static inline dma_addr_t xsk_buff_raw_get_dma(struct xdp_umem *umem, u64 addr) +static inline dma_addr_t xsk_buff_raw_get_dma(struct xsk_buff_pool *pool, + u64 addr) { return 0; } -static inline void *xsk_buff_raw_get_data(struct xdp_umem *umem, u64 addr) +static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr) { return NULL; } -static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp) +static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool) { } -static inline void xsk_buff_raw_dma_sync_for_device(struct xdp_umem *umem, +static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool, dma_addr_t dma, size_t size) { diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h index 6842990e2712..0140d086dc84 100644 --- a/include/net/xsk_buff_pool.h +++ b/include/net/xsk_buff_pool.h @@ -13,6 +13,8 @@ struct xsk_buff_pool; struct xdp_rxq_info; struct xsk_queue; struct xdp_desc; +struct xdp_umem; +struct xdp_sock; struct device; struct page; @@ -26,34 +28,68 @@ struct xdp_buff_xsk { struct list_head free_list_node; }; +struct xsk_dma_map { + dma_addr_t *dma_pages; + struct device *dev; + struct net_device *netdev; + refcount_t users; + struct list_head list; /* Protected by the RTNL_LOCK */ + u32 dma_pages_cnt; + bool dma_need_sync; +}; + struct xsk_buff_pool { - struct xsk_queue *fq; + /* Members only used in the control path first. */ + struct device *dev; + struct net_device *netdev; + struct list_head xsk_tx_list; + /* Protects modifications to the xsk_tx_list */ + spinlock_t xsk_tx_list_lock; + refcount_t users; + struct xdp_umem *umem; + struct work_struct work; struct list_head free_list; + u32 heads_cnt; + u16 queue_id; + + /* Data path members as close to free_heads at the end as possible. */ + struct xsk_queue *fq ____cacheline_aligned_in_smp; + struct xsk_queue *cq; + /* For performance reasons, each buff pool has its own array of dma_pages + * even when they are identical. + */ dma_addr_t *dma_pages; struct xdp_buff_xsk *heads; u64 chunk_mask; u64 addrs_cnt; u32 free_list_cnt; u32 dma_pages_cnt; - u32 heads_cnt; u32 free_heads_cnt; u32 headroom; u32 chunk_size; u32 frame_len; + u8 cached_need_wakeup; + bool uses_need_wakeup; bool dma_need_sync; bool unaligned; void *addrs; - struct device *dev; struct xdp_buff_xsk *free_heads[]; }; /* AF_XDP core. 
*/ -struct xsk_buff_pool *xp_create(struct page **pages, u32 nr_pages, u32 chunks, - u32 chunk_size, u32 headroom, u64 size, - bool unaligned); -void xp_set_fq(struct xsk_buff_pool *pool, struct xsk_queue *fq); +struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs, + struct xdp_umem *umem); +int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *dev, + u16 queue_id, u16 flags); +int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem, + struct net_device *dev, u16 queue_id); void xp_destroy(struct xsk_buff_pool *pool); void xp_release(struct xdp_buff_xsk *xskb); +void xp_get_pool(struct xsk_buff_pool *pool); +void xp_put_pool(struct xsk_buff_pool *pool); +void xp_clear_dev(struct xsk_buff_pool *pool); +void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs); +void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs); /* AF_XDP, and XDP core. */ void xp_free(struct xdp_buff_xsk *xskb); @@ -80,9 +116,6 @@ static inline dma_addr_t xp_get_frame_dma(struct xdp_buff_xsk *xskb) void xp_dma_sync_for_cpu_slow(struct xdp_buff_xsk *xskb); static inline void xp_dma_sync_for_cpu(struct xdp_buff_xsk *xskb) { - if (!xskb->pool->dma_need_sync) - return; - xp_dma_sync_for_cpu_slow(xskb); } diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index b6238b2209b7..8dda13880957 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -155,6 +155,7 @@ enum bpf_map_type { BPF_MAP_TYPE_DEVMAP_HASH, BPF_MAP_TYPE_STRUCT_OPS, BPF_MAP_TYPE_RINGBUF, + BPF_MAP_TYPE_INODE_STORAGE, }; /* Note that tracing related programs such as @@ -345,6 +346,14 @@ enum bpf_link_type { /* The verifier internal test flag. Behavior is undefined */ #define BPF_F_TEST_STATE_FREQ (1U << 3) +/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will + * restrict map and helper usage for such programs. Sleepable BPF programs can + * only be attached to hooks where kernel execution context allows sleeping. + * Such programs are allowed to use helpers that may sleep like + * bpf_copy_from_user(). + */ +#define BPF_F_SLEEPABLE (1U << 4) + /* When BPF ldimm64's insn[0].src_reg != 0 then this can have * two extensions: * @@ -2807,7 +2816,7 @@ union bpf_attr { * * **-ERANGE** if resulting value was out of range. * - * void *bpf_sk_storage_get(struct bpf_map *map, struct bpf_sock *sk, void *value, u64 flags) + * void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags) * Description * Get a bpf-local-storage from a *sk*. * @@ -2823,6 +2832,9 @@ union bpf_attr { * "type". The bpf-local-storage "type" (i.e. the *map*) is * searched against all bpf-local-storages residing at *sk*. * + * *sk* is a kernel **struct sock** pointer for LSM program. + * *sk* is a **struct bpf_sock** pointer for other program types. + * * An optional *flags* (**BPF_SK_STORAGE_GET_F_CREATE**) can be * used such that a new bpf-local-storage will be * created if one does not exist. *value* can be used @@ -2835,7 +2847,7 @@ union bpf_attr { * **NULL** if not found or there was an error in adding * a new bpf-local-storage. * - * long bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk) + * long bpf_sk_storage_delete(struct bpf_map *map, void *sk) * Description * Delete a bpf-local-storage from a *sk*. * Return @@ -3395,6 +3407,175 @@ union bpf_attr { * A non-negative value equal to or less than *size* on success, * or a negative error in case of failure. 
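The new BPF_F_SLEEPABLE flag above pairs with bpf_copy_from_user(), documented further below, which may fault and therefore sleep. A sketch of a sleepable LSM program (the hook choice, the user_buf_addr global written by the loader, and libbpf's "lsm.s" section prefix are assumptions for illustration)::

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    /* user-space address to peek at; filled in by the loader before attach */
    const volatile __u64 user_buf_addr;

    SEC("lsm.s/file_open")
    int BPF_PROG(peek_user_buf, struct file *file)
    {
        char buf[16] = {};

        if (user_buf_addr)
            /* may fault and sleep, hence only allowed in sleepable progs */
            bpf_copy_from_user(buf, sizeof(buf),
                               (const void *)(long)user_buf_addr);
        return 0;
    }

    char _license[] SEC("license") = "GPL";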
* + * long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags) + * Description + * Load header option. Support reading a particular TCP header + * option for bpf program (BPF_PROG_TYPE_SOCK_OPS). + * + * If *flags* is 0, it will search the option from the + * sock_ops->skb_data. The comment in "struct bpf_sock_ops" + * has details on what skb_data contains under different + * sock_ops->op. + * + * The first byte of the *searchby_res* specifies the + * kind that it wants to search. + * + * If the searching kind is an experimental kind + * (i.e. 253 or 254 according to RFC6994). It also + * needs to specify the "magic" which is either + * 2 bytes or 4 bytes. It then also needs to + * specify the size of the magic by using + * the 2nd byte which is "kind-length" of a TCP + * header option and the "kind-length" also + * includes the first 2 bytes "kind" and "kind-length" + * itself as a normal TCP header option also does. + * + * For example, to search experimental kind 254 with + * 2 byte magic 0xeB9F, the searchby_res should be + * [ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ]. + * + * To search for the standard window scale option (3), + * the searchby_res should be [ 3, 0, 0, .... 0 ]. + * Note, kind-length must be 0 for regular option. + * + * Searching for No-Op (0) and End-of-Option-List (1) are + * not supported. + * + * *len* must be at least 2 bytes which is the minimal size + * of a header option. + * + * Supported flags: + * * **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the + * saved_syn packet or the just-received syn packet. + * + * Return + * >0 when found, the header option is copied to *searchby_res*. + * The return value is the total length copied. + * + * **-EINVAL** If param is invalid + * + * **-ENOMSG** The option is not found + * + * **-ENOENT** No syn packet available when + * **BPF_LOAD_HDR_OPT_TCP_SYN** is used + * + * **-ENOSPC** Not enough space. Only *len* number of + * bytes are copied. + * + * **-EFAULT** Cannot parse the header options in the packet + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags) + * Description + * Store header option. The data will be copied + * from buffer *from* with length *len* to the TCP header. + * + * The buffer *from* should have the whole option that + * includes the kind, kind-length, and the actual + * option data. The *len* must be at least kind-length + * long. The kind-length does not have to be 4 byte + * aligned. The kernel will take care of the padding + * and setting the 4 bytes aligned value to th->doff. + * + * This helper will check for duplicated option + * by searching the same option in the outgoing skb. + * + * This helper can only be called during + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * + * Return + * 0 on success, or negative error in case of failure: + * + * **-EINVAL** If param is invalid + * + * **-ENOSPC** Not enough space in the header. + * Nothing has been written + * + * **-EEXIST** The option has already existed + * + * **-EFAULT** Cannot parse the existing header options + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags) + * Description + * Reserve *len* bytes for the bpf header option. The + * space will be used by bpf_store_hdr_opt() later in + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. 
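The two write-side helpers are meant to be used together from the HDR_OPT_LEN and WRITE_HDR_OPT callbacks added later in this hunk, with the corresponding cb flag enabled (for example as in the earlier listener sketch). A sketch that writes the experimental option from the 0xeB9F example above (option layout and payload are illustrative)::

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct tcp_exp_opt {
        __u8   kind;   /* 254: experimental, RFC 6994 */
        __u8   len;    /* kind + len + magic + data = 5 */
        __be16 magic;  /* 0xeB9F */
        __u8   data;
    } __attribute__((packed));

    SEC("sockops")
    int write_exp_opt(struct bpf_sock_ops *skops)
    {
        struct tcp_exp_opt opt = {
            .kind  = 254,
            .len   = 5,
            .magic = bpf_htons(0xeB9F),
            .data  = 0xce,
        };

        switch (skops->op) {
        case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
            /* reserve room; the kernel pads th->doff to a multiple of 4 */
            bpf_reserve_hdr_opt(skops, sizeof(opt), 0);
            break;
        case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
            /* the buffer carries the whole option, kind and kind-len included */
            bpf_store_hdr_opt(skops, &opt, sizeof(opt), 0);
            break;
        }
        return 1;
    }

    char _license[] SEC("license") = "GPL";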
+ * + * If bpf_reserve_hdr_opt() is called multiple times, + * the total number of bytes will be reserved. + * + * This helper can only be called during + * BPF_SOCK_OPS_HDR_OPT_LEN_CB. + * + * Return + * 0 on success, or negative error in case of failure: + * + * **-EINVAL** if param is invalid + * + * **-ENOSPC** Not enough space in the header. + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * void *bpf_inode_storage_get(struct bpf_map *map, void *inode, void *value, u64 flags) + * Description + * Get a bpf_local_storage from an *inode*. + * + * Logically, it could be thought of as getting the value from + * a *map* with *inode* as the **key**. From this + * perspective, the usage is not much different from + * **bpf_map_lookup_elem**\ (*map*, **&**\ *inode*) except this + * helper enforces the key must be an inode and the map must also + * be a **BPF_MAP_TYPE_INODE_STORAGE**. + * + * Underneath, the value is stored locally at *inode* instead of + * the *map*. The *map* is used as the bpf-local-storage + * "type". The bpf-local-storage "type" (i.e. the *map*) is + * searched against all bpf_local_storage residing at *inode*. + * + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be + * used such that a new bpf_local_storage will be + * created if one does not exist. *value* can be used + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify + * the initial value of a bpf_local_storage. If *value* is + * **NULL**, the new bpf_local_storage will be zero initialized. + * Return + * A bpf_local_storage pointer is returned on success. + * + * **NULL** if not found or there was an error in adding + * a new bpf_local_storage. + * + * int bpf_inode_storage_delete(struct bpf_map *map, void *inode) + * Description + * Delete a bpf_local_storage from an *inode*. + * Return + * 0 on success. + * + * **-ENOENT** if the bpf_local_storage cannot be found. + * + * long bpf_d_path(struct path *path, char *buf, u32 sz) + * Description + * Return full path for given 'struct path' object, which + * needs to be the kernel BTF 'path' object. The path is + * returned in the provided buffer 'buf' of size 'sz' and + * is zero terminated. + * + * Return + * On success, the strictly positive length of the string, + * including the trailing NUL character. On error, a negative + * value. + * + * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr) + * Description + * Read *size* bytes from user space address *user_ptr* and store + * the data in *dst*. This is a wrapper of copy_from_user(). + * Return + * 0 on success, or a negative error in case of failure. */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3539,6 +3720,13 @@ union bpf_attr { FN(skc_to_tcp_request_sock), \ FN(skc_to_udp6_sock), \ FN(get_task_stack), \ + FN(load_hdr_opt), \ + FN(store_hdr_opt), \ + FN(reserve_hdr_opt), \ + FN(inode_storage_get), \ + FN(inode_storage_delete), \ + FN(d_path), \ + FN(copy_from_user), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -3648,9 +3836,13 @@ enum { BPF_F_SYSCTL_BASE_NAME = (1ULL << 0), }; -/* BPF_FUNC_sk_storage_get flags */ +/* BPF_FUNC_<kernel_obj>_storage_get flags */ enum { - BPF_SK_STORAGE_GET_F_CREATE = (1ULL << 0), + BPF_LOCAL_STORAGE_GET_F_CREATE = (1ULL << 0), + /* BPF_SK_STORAGE_GET_F_CREATE is only kept for backward compatibility + * and BPF_LOCAL_STORAGE_GET_F_CREATE must be used instead. 
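A sketch of the new inode storage map as seen from a BPF LSM program, using the renamed create flag; the map, hook, and counter here are illustrative only (inode storage maps take an int key and require BPF_F_NO_PREALLOC)::

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    struct {
        __uint(type, BPF_MAP_TYPE_INODE_STORAGE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __type(key, int);
        __type(value, __u64);
    } unlink_cnt SEC(".maps");

    SEC("lsm/inode_unlink")
    int BPF_PROG(count_unlink, struct inode *dir, struct dentry *victim)
    {
        __u64 *cnt;

        /* create on first use, like BPF_SK_STORAGE_GET_F_CREATE before the rename */
        cnt = bpf_inode_storage_get(&unlink_cnt, victim->d_inode, 0,
                                    BPF_LOCAL_STORAGE_GET_F_CREATE);
        if (cnt)
            __sync_fetch_and_add(cnt, 1);
        return 0;
    }

    char _license[] SEC("license") = "GPL";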
+ */ + BPF_SK_STORAGE_GET_F_CREATE = BPF_LOCAL_STORAGE_GET_F_CREATE, }; /* BPF_FUNC_read_branch_records flags. */ @@ -4071,6 +4263,15 @@ struct bpf_link_info { __u64 cgroup_id; __u32 attach_type; } cgroup; + struct { + __aligned_u64 target_name; /* in/out: target_name buffer ptr */ + __u32 target_name_len; /* in/out: target_name buffer len */ + union { + struct { + __u32 map_id; + } map; + }; + } iter; struct { __u32 netns_ino; __u32 attach_type; @@ -4158,6 +4359,36 @@ struct bpf_sock_ops { __u64 bytes_received; __u64 bytes_acked; __bpf_md_ptr(struct bpf_sock *, sk); + /* [skb_data, skb_data_end) covers the whole TCP header. + * + * BPF_SOCK_OPS_PARSE_HDR_OPT_CB: The packet received + * BPF_SOCK_OPS_HDR_OPT_LEN_CB: Not useful because the + * header has not been written. + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB: The header and options have + * been written so far. + * BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: The SYNACK that concludes + * the 3WHS. + * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes + * the 3WHS. + * + * bpf_load_hdr_opt() can also be used to read a particular option. + */ + __bpf_md_ptr(void *, skb_data); + __bpf_md_ptr(void *, skb_data_end); + __u32 skb_len; /* The total length of a packet. + * It includes the header, options, + * and payload. + */ + __u32 skb_tcp_flags; /* tcp_flags of the header. It provides + * an easy way to check for tcp_flags + * without parsing skb_data. + * + * In particular, the skb_tcp_flags + * will still be available in + * BPF_SOCK_OPS_HDR_OPT_LEN even though + * the outgoing header has not + * been written yet. + */ }; /* Definitions for bpf_sock_ops_cb_flags */ @@ -4166,8 +4397,51 @@ enum { BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1), BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2), BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3), + /* Call bpf for all received TCP headers. The bpf prog will be + * called under sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB + * + * Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB + * for the header option related helpers that will be useful + * to the bpf programs. + * + * It could be used at the client/active side (i.e. connect() side) + * when the server told it that the server was in syncookie + * mode and required the active side to resend the bpf-written + * options. The active side can keep writing the bpf-options until + * it received a valid packet from the server side to confirm + * the earlier packet (and options) has been received. The later + * example patch is using it like this at the active side when the + * server is in syncookie mode. + * + * The bpf prog will usually turn this off in the common cases. + */ + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), + /* Call bpf when kernel has received a header option that + * the kernel cannot handle. The bpf prog will be called under + * sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB. + * + * Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB + * for the header option related helpers that will be useful + * to the bpf programs. + */ + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5), + /* Call bpf when the kernel is writing header options for the + * outgoing packet. The bpf prog will first be called + * to reserve space in a skb under + * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB. Then + * the bpf prog will be called to write the header option(s) + * under sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB. 
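On the read side, once the PARSE_ALL or PARSE_UNKNOWN flag above is set, the program is invoked with sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB and can pick a specific option out of the header with bpf_load_hdr_opt(). A sketch that looks for the experimental option written in the earlier example (the global and buffer size are illustrative)::

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    __u8 last_payload;  /* readable from user space via the global-data map */

    SEC("sockops")
    int parse_exp_opt(struct bpf_sock_ops *skops)
    {
        /* search key: kind 254, kind-len 4 (kind + len + 2-byte magic) */
        __u8 opt[8] = { 254, 4, 0xeB, 0x9F };
        int ret;

        if (skops->op != BPF_SOCK_OPS_PARSE_HDR_OPT_CB)
            return 1;

        ret = bpf_load_hdr_opt(skops, opt, sizeof(opt), 0);
        if (ret > 0)                    /* ret bytes copied back into opt[] */
            last_payload = opt[4];      /* first byte after kind/len/magic */
        return 1;
    }

    char _license[] SEC("license") = "GPL";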
+ * + * Please refer to the comment in BPF_SOCK_OPS_HDR_OPT_LEN_CB + * and BPF_SOCK_OPS_WRITE_HDR_OPT_CB for the header option + * related helpers that will be useful to the bpf programs. + * + * The kernel gets its chance to reserve space and write + * options first before the BPF program does. + */ + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6), /* Mask of all currently supported cb flags */ - BPF_SOCK_OPS_ALL_CB_FLAGS = 0xF, + BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F, }; /* List of known BPF sock_ops operators. @@ -4223,6 +4497,63 @@ enum { */ BPF_SOCK_OPS_RTT_CB, /* Called on every RTT. */ + BPF_SOCK_OPS_PARSE_HDR_OPT_CB, /* Parse the header option. + * It will be called to handle + * the packets received at + * an already established + * connection. + * + * sock_ops->skb_data: + * Referring to the received skb. + * It covers the TCP header only. + * + * bpf_load_hdr_opt() can also + * be used to search for a + * particular option. + */ + BPF_SOCK_OPS_HDR_OPT_LEN_CB, /* Reserve space for writing the + * header option later in + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * Arg1: bool want_cookie. (in + * writing SYNACK only) + * + * sock_ops->skb_data: + * Not available because no header has + * been written yet. + * + * sock_ops->skb_tcp_flags: + * The tcp_flags of the + * outgoing skb. (e.g. SYN, ACK, FIN). + * + * bpf_reserve_hdr_opt() should + * be used to reserve space. + */ + BPF_SOCK_OPS_WRITE_HDR_OPT_CB, /* Write the header options + * Arg1: bool want_cookie. (in + * writing SYNACK only) + * + * sock_ops->skb_data: + * Referring to the outgoing skb. + * It covers the TCP header + * that has already been written + * by the kernel and the + * earlier bpf-progs. + * + * sock_ops->skb_tcp_flags: + * The tcp_flags of the outgoing + * skb. (e.g. SYN, ACK, FIN). + * + * bpf_store_hdr_opt() should + * be used to write the + * option. + * + * bpf_load_hdr_opt() can also + * be used to search for a + * particular option that + * has already been written + * by the kernel or the + * earlier bpf-progs. + */ }; /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect @@ -4250,6 +4581,63 @@ enum { enum { TCP_BPF_IW = 1001, /* Set TCP initial congestion window */ TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */ + TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */ + TCP_BPF_RTO_MIN = 1004, /* Min delay ack in usecs */ + /* Copy the SYN pkt to optval + * + * BPF_PROG_TYPE_SOCK_OPS only. It is similar to the + * bpf_getsockopt(TCP_SAVED_SYN) but it does not limit + * to only getting from the saved_syn. It can either get the + * syn packet from: + * + * 1. the just-received SYN packet (only available when writing the + * SYNACK). It will be useful when it is not necessary to + * save the SYN packet for latter use. It is also the only way + * to get the SYN during syncookie mode because the syn + * packet cannot be saved during syncookie. + * + * OR + * + * 2. the earlier saved syn which was done by + * bpf_setsockopt(TCP_SAVE_SYN). + * + * The bpf_getsockopt(TCP_BPF_SYN*) option will hide where the + * SYN packet is obtained. + * + * If the bpf-prog does not need the IP[46] header, the + * bpf-prog can avoid parsing the IP header by using + * TCP_BPF_SYN. Otherwise, the bpf-prog can get both + * IP[46] and TCP header by using TCP_BPF_SYN_IP. + * + * >0: Total number of bytes copied + * -ENOSPC: Not enough space in optval. Only optlen number of + * bytes is copied. + * -ENOENT: The SYN skb is not available now and the earlier SYN pkt + * is not saved by setsockopt(TCP_SAVE_SYN). 
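A sketch of reading the client's SYN back through bpf_getsockopt() with the new optnames listed just below; at PASSIVE_ESTABLISHED time this relies on the SYN having been saved (TCP_SAVE_SYN), otherwise it fails with -ENOENT as described above. SOL_TCP, the buffer size, and the global are spelled out only for illustration::

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    #ifndef SOL_TCP
    #define SOL_TCP 6
    #endif

    __u8 syn_first_byte;

    SEC("sockops")
    int peek_saved_syn(struct bpf_sock_ops *skops)
    {
        __u8 syn[120];  /* worst-case IPv4 + TCP header with options */
        int ret;

        if (skops->op != BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
            return 1;

        /* >0: number of bytes copied, starting at the IP[46] header */
        ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP,
                             syn, sizeof(syn));
        if (ret > 0)
            syn_first_byte = syn[0];
        return 1;
    }

    char _license[] SEC("license") = "GPL";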
+ */ + TCP_BPF_SYN = 1005, /* Copy the TCP header */ + TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ + TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */ +}; + +enum { + BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0), +}; + +/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + */ +enum { + BPF_WRITE_HDR_TCP_CURRENT_MSS = 1, /* Kernel is finding the + * total option spaces + * required for an established + * sk in order to calculate the + * MSS. No skb is actually + * sent. + */ + BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2, /* Kernel is in syncookie mode + * when sending a SYN. + */ }; struct bpf_perf_event_value { diff --git a/init/Kconfig b/init/Kconfig index d6a0b31b13dc..6ecc00e130ff 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1691,6 +1691,7 @@ config BPF_SYSCALL bool "Enable bpf() system call" select BPF select IRQ_WORK + select TASKS_TRACE_RCU default n help Enable the bpf() system call that allows to manipulate eBPF @@ -1710,6 +1711,8 @@ config BPF_JIT_DEFAULT_ON def_bool ARCH_WANT_DEFAULT_BPF_JIT || BPF_JIT_ALWAYS_ON depends on HAVE_EBPF_JIT && BPF_JIT +source "kernel/bpf/preload/Kconfig" + config USERFAULTFD bool "Enable userfaultfd() system call" depends on MMU diff --git a/kernel/Makefile b/kernel/Makefile index 9a20016d4900..22b0760660fc 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -12,7 +12,7 @@ obj-y = fork.o exec_domain.o panic.o \ notifier.o ksysfs.o cred.o reboot.o \ async.o range.o smpboot.o ucount.o regset.o -obj-$(CONFIG_BPFILTER) += usermode_driver.o +obj-$(CONFIG_USERMODE_DRIVER) += usermode_driver.o obj-$(CONFIG_MODULES) += kmod.o obj-$(CONFIG_MULTIUSER) += groups.o diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile index e6eb9c0402da..bdc8cd1b6767 100644 --- a/kernel/bpf/Makefile +++ b/kernel/bpf/Makefile @@ -5,6 +5,7 @@ CFLAGS_core.o += $(call cc-disable-warning, override-init) obj-$(CONFIG_BPF_SYSCALL) += syscall.o verifier.o inode.o helpers.o tnum.o bpf_iter.o map_iter.o task_iter.o prog_iter.o obj-$(CONFIG_BPF_SYSCALL) += hashtab.o arraymap.o percpu_freelist.o bpf_lru_list.o lpm_trie.o map_in_map.o obj-$(CONFIG_BPF_SYSCALL) += local_storage.o queue_stack_maps.o ringbuf.o +obj-${CONFIG_BPF_LSM} += bpf_inode_storage.o obj-$(CONFIG_BPF_SYSCALL) += disasm.o obj-$(CONFIG_BPF_JIT) += trampoline.o obj-$(CONFIG_BPF_SYSCALL) += btf.o @@ -12,6 +13,7 @@ obj-$(CONFIG_BPF_JIT) += dispatcher.o ifeq ($(CONFIG_NET),y) obj-$(CONFIG_BPF_SYSCALL) += devmap.o obj-$(CONFIG_BPF_SYSCALL) += cpumap.o +obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o obj-$(CONFIG_BPF_SYSCALL) += offload.o obj-$(CONFIG_BPF_SYSCALL) += net_namespace.o endif @@ -29,3 +31,4 @@ ifeq ($(CONFIG_BPF_JIT),y) obj-$(CONFIG_BPF_SYSCALL) += bpf_struct_ops.o obj-${CONFIG_BPF_LSM} += bpf_lsm.o endif +obj-$(CONFIG_BPF_PRELOAD) += preload/ diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 8ff419b632a6..e046fb7d17cd 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -10,6 +10,7 @@ #include <linux/filter.h> #include <linux/perf_event.h> #include <uapi/linux/btf.h> +#include <linux/rcupdate_trace.h> #include "map_in_map.h" @@ -487,6 +488,13 @@ static int array_map_mmap(struct bpf_map *map, struct vm_area_struct *vma) vma->vm_pgoff + pgoff); } +static bool array_map_meta_equal(const struct bpf_map *meta0, + const struct bpf_map *meta1) +{ + return meta0->max_entries == meta1->max_entries && + bpf_map_meta_equal(meta0, meta1); +} + struct bpf_iter_seq_array_map_info { struct bpf_map *map; void *percpu_value_buf; @@ -625,6 
+633,7 @@ static const struct bpf_iter_seq_info iter_seq_info = { static int array_map_btf_id; const struct bpf_map_ops array_map_ops = { + .map_meta_equal = array_map_meta_equal, .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = array_map_free, @@ -647,6 +656,7 @@ const struct bpf_map_ops array_map_ops = { static int percpu_array_map_btf_id; const struct bpf_map_ops percpu_array_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = array_map_free, @@ -1003,6 +1013,11 @@ static void prog_array_map_free(struct bpf_map *map) fd_array_map_free(map); } +/* prog_array->aux->{type,jited} is a runtime binding. + * Doing static check alone in the verifier is not enough. + * Thus, prog_array_map cannot be used as an inner_map + * and map_meta_equal is not implemented. + */ static int prog_array_map_btf_id; const struct bpf_map_ops prog_array_map_ops = { .map_alloc_check = fd_array_map_alloc_check, @@ -1101,6 +1116,7 @@ static void perf_event_fd_array_release(struct bpf_map *map, static int perf_event_array_map_btf_id; const struct bpf_map_ops perf_event_array_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = fd_array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = fd_array_map_free, @@ -1137,6 +1153,7 @@ static void cgroup_fd_array_free(struct bpf_map *map) static int cgroup_array_map_btf_id; const struct bpf_map_ops cgroup_array_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = fd_array_map_alloc_check, .map_alloc = array_map_alloc, .map_free = cgroup_fd_array_free, diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c new file mode 100644 index 000000000000..75be02799c0f --- /dev/null +++ b/kernel/bpf/bpf_inode_storage.c @@ -0,0 +1,274 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2019 Facebook + * Copyright 2020 Google LLC. 
+ */ + +#include <linux/rculist.h> +#include <linux/list.h> +#include <linux/hash.h> +#include <linux/types.h> +#include <linux/spinlock.h> +#include <linux/bpf.h> +#include <linux/bpf_local_storage.h> +#include <net/sock.h> +#include <uapi/linux/sock_diag.h> +#include <uapi/linux/btf.h> +#include <linux/bpf_lsm.h> +#include <linux/btf_ids.h> +#include <linux/fdtable.h> + +DEFINE_BPF_STORAGE_CACHE(inode_cache); + +static struct bpf_local_storage __rcu ** +inode_storage_ptr(void *owner) +{ + struct inode *inode = owner; + struct bpf_storage_blob *bsb; + + bsb = bpf_inode(inode); + if (!bsb) + return NULL; + return &bsb->storage; +} + +static struct bpf_local_storage_data *inode_storage_lookup(struct inode *inode, + struct bpf_map *map, + bool cacheit_lockit) +{ + struct bpf_local_storage *inode_storage; + struct bpf_local_storage_map *smap; + struct bpf_storage_blob *bsb; + + bsb = bpf_inode(inode); + if (!bsb) + return NULL; + + inode_storage = rcu_dereference(bsb->storage); + if (!inode_storage) + return NULL; + + smap = (struct bpf_local_storage_map *)map; + return bpf_local_storage_lookup(inode_storage, smap, cacheit_lockit); +} + +void bpf_inode_storage_free(struct inode *inode) +{ + struct bpf_local_storage_elem *selem; + struct bpf_local_storage *local_storage; + bool free_inode_storage = false; + struct bpf_storage_blob *bsb; + struct hlist_node *n; + + bsb = bpf_inode(inode); + if (!bsb) + return; + + rcu_read_lock(); + + local_storage = rcu_dereference(bsb->storage); + if (!local_storage) { + rcu_read_unlock(); + return; + } + + /* Netiher the bpf_prog nor the bpf-map's syscall + * could be modifying the local_storage->list now. + * Thus, no elem can be added-to or deleted-from the + * local_storage->list by the bpf_prog or by the bpf-map's syscall. + * + * It is racing with bpf_local_storage_map_free() alone + * when unlinking elem from the local_storage->list and + * the map's bucket->list. + */ + raw_spin_lock_bh(&local_storage->lock); + hlist_for_each_entry_safe(selem, n, &local_storage->list, snode) { + /* Always unlink from map before unlinking from + * local_storage. + */ + bpf_selem_unlink_map(selem); + free_inode_storage = bpf_selem_unlink_storage_nolock( + local_storage, selem, false); + } + raw_spin_unlock_bh(&local_storage->lock); + rcu_read_unlock(); + + /* free_inoode_storage should always be true as long as + * local_storage->list was non-empty. + */ + if (free_inode_storage) + kfree_rcu(local_storage, rcu); +} + +static void *bpf_fd_inode_storage_lookup_elem(struct bpf_map *map, void *key) +{ + struct bpf_local_storage_data *sdata; + struct file *f; + int fd; + + fd = *(int *)key; + f = fget_raw(fd); + if (!f) + return NULL; + + sdata = inode_storage_lookup(f->f_inode, map, true); + fput(f); + return sdata ? 
sdata->data : NULL; +} + +static int bpf_fd_inode_storage_update_elem(struct bpf_map *map, void *key, + void *value, u64 map_flags) +{ + struct bpf_local_storage_data *sdata; + struct file *f; + int fd; + + fd = *(int *)key; + f = fget_raw(fd); + if (!f || !inode_storage_ptr(f->f_inode)) + return -EBADF; + + sdata = bpf_local_storage_update(f->f_inode, + (struct bpf_local_storage_map *)map, + value, map_flags); + fput(f); + return PTR_ERR_OR_ZERO(sdata); +} + +static int inode_storage_delete(struct inode *inode, struct bpf_map *map) +{ + struct bpf_local_storage_data *sdata; + + sdata = inode_storage_lookup(inode, map, false); + if (!sdata) + return -ENOENT; + + bpf_selem_unlink(SELEM(sdata)); + + return 0; +} + +static int bpf_fd_inode_storage_delete_elem(struct bpf_map *map, void *key) +{ + struct file *f; + int fd, err; + + fd = *(int *)key; + f = fget_raw(fd); + if (!f) + return -EBADF; + + err = inode_storage_delete(f->f_inode, map); + fput(f); + return err; +} + +BPF_CALL_4(bpf_inode_storage_get, struct bpf_map *, map, struct inode *, inode, + void *, value, u64, flags) +{ + struct bpf_local_storage_data *sdata; + + if (flags & ~(BPF_LOCAL_STORAGE_GET_F_CREATE)) + return (unsigned long)NULL; + + /* explicitly check that the inode_storage_ptr is not + * NULL as inode_storage_lookup returns NULL in this case and + * bpf_local_storage_update expects the owner to have a + * valid storage pointer. + */ + if (!inode_storage_ptr(inode)) + return (unsigned long)NULL; + + sdata = inode_storage_lookup(inode, map, true); + if (sdata) + return (unsigned long)sdata->data; + + /* This helper must only called from where the inode is gurranteed + * to have a refcount and cannot be freed. + */ + if (flags & BPF_LOCAL_STORAGE_GET_F_CREATE) { + sdata = bpf_local_storage_update( + inode, (struct bpf_local_storage_map *)map, value, + BPF_NOEXIST); + return IS_ERR(sdata) ? (unsigned long)NULL : + (unsigned long)sdata->data; + } + + return (unsigned long)NULL; +} + +BPF_CALL_2(bpf_inode_storage_delete, + struct bpf_map *, map, struct inode *, inode) +{ + /* This helper must only called from where the inode is gurranteed + * to have a refcount and cannot be freed. 
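From the syscall side (the bpf_fd_inode_storage_*() map ops in this file), the key of an inode storage map is an open file descriptor that the kernel resolves to its f_inode. A user-space sketch with libbpf's low-level wrappers (function name and error handling are illustrative)::

    #include <bpf/bpf.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <linux/types.h>

    static int set_inode_value(int map_fd, const char *path, __u64 val)
    {
        int fd, err;

        fd = open(path, O_RDONLY);
        if (fd < 0)
            return -errno;

        /* the fd itself is the key; the kernel looks up f_inode behind it */
        err = bpf_map_update_elem(map_fd, &fd, &val, BPF_ANY);
        close(fd);
        return err;
    }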
+ */ + return inode_storage_delete(inode, map); +} + +static int notsupp_get_next_key(struct bpf_map *map, void *key, + void *next_key) +{ + return -ENOTSUPP; +} + +static struct bpf_map *inode_storage_map_alloc(union bpf_attr *attr) +{ + struct bpf_local_storage_map *smap; + + smap = bpf_local_storage_map_alloc(attr); + if (IS_ERR(smap)) + return ERR_CAST(smap); + + smap->cache_idx = bpf_local_storage_cache_idx_get(&inode_cache); + return &smap->map; +} + +static void inode_storage_map_free(struct bpf_map *map) +{ + struct bpf_local_storage_map *smap; + + smap = (struct bpf_local_storage_map *)map; + bpf_local_storage_cache_idx_free(&inode_cache, smap->cache_idx); + bpf_local_storage_map_free(smap); +} + +static int inode_storage_map_btf_id; +const struct bpf_map_ops inode_storage_map_ops = { + .map_meta_equal = bpf_map_meta_equal, + .map_alloc_check = bpf_local_storage_map_alloc_check, + .map_alloc = inode_storage_map_alloc, + .map_free = inode_storage_map_free, + .map_get_next_key = notsupp_get_next_key, + .map_lookup_elem = bpf_fd_inode_storage_lookup_elem, + .map_update_elem = bpf_fd_inode_storage_update_elem, + .map_delete_elem = bpf_fd_inode_storage_delete_elem, + .map_check_btf = bpf_local_storage_map_check_btf, + .map_btf_name = "bpf_local_storage_map", + .map_btf_id = &inode_storage_map_btf_id, + .map_owner_storage_ptr = inode_storage_ptr, +}; + +BTF_ID_LIST(bpf_inode_storage_btf_ids) +BTF_ID_UNUSED +BTF_ID(struct, inode) + +const struct bpf_func_proto bpf_inode_storage_get_proto = { + .func = bpf_inode_storage_get, + .gpl_only = false, + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_BTF_ID, + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, + .arg4_type = ARG_ANYTHING, + .btf_id = bpf_inode_storage_btf_ids, +}; + +const struct bpf_func_proto bpf_inode_storage_delete_proto = { + .func = bpf_inode_storage_delete, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_BTF_ID, + .btf_id = bpf_inode_storage_btf_ids, +}; diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c index 8faa2ce89396..30833bbf3019 100644 --- a/kernel/bpf/bpf_iter.c +++ b/kernel/bpf/bpf_iter.c @@ -390,10 +390,68 @@ out_unlock: return ret; } +static void bpf_iter_link_show_fdinfo(const struct bpf_link *link, + struct seq_file *seq) +{ + struct bpf_iter_link *iter_link = + container_of(link, struct bpf_iter_link, link); + bpf_iter_show_fdinfo_t show_fdinfo; + + seq_printf(seq, + "target_name:\t%s\n", + iter_link->tinfo->reg_info->target); + + show_fdinfo = iter_link->tinfo->reg_info->show_fdinfo; + if (show_fdinfo) + show_fdinfo(&iter_link->aux, seq); +} + +static int bpf_iter_link_fill_link_info(const struct bpf_link *link, + struct bpf_link_info *info) +{ + struct bpf_iter_link *iter_link = + container_of(link, struct bpf_iter_link, link); + char __user *ubuf = u64_to_user_ptr(info->iter.target_name); + bpf_iter_fill_link_info_t fill_link_info; + u32 ulen = info->iter.target_name_len; + const char *target_name; + u32 target_len; + + if (!ulen ^ !ubuf) + return -EINVAL; + + target_name = iter_link->tinfo->reg_info->target; + target_len = strlen(target_name); + info->iter.target_name_len = target_len + 1; + + if (ubuf) { + if (ulen >= target_len + 1) { + if (copy_to_user(ubuf, target_name, target_len + 1)) + return -EFAULT; + } else { + char zero = '\0'; + + if (copy_to_user(ubuf, target_name, ulen - 1)) + return -EFAULT; + if (put_user(zero, ubuf + ulen - 1)) + return -EFAULT; + return -ENOSPC; + } + } + 
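User space can retrieve the target name reported here through BPF_OBJ_GET_INFO_BY_FD on the link fd. A sketch with libbpf (buffer size and function name are illustrative; the printed name would be a registered iter target such as "bpf_map_elem")::

    #include <bpf/bpf.h>
    #include <linux/bpf.h>
    #include <stdio.h>
    #include <string.h>

    static int show_iter_target(int link_fd)
    {
        struct bpf_link_info info;
        __u32 len = sizeof(info);
        char name[64];
        int err;

        memset(&info, 0, sizeof(info));
        info.iter.target_name = (__u64)(unsigned long)name;
        info.iter.target_name_len = sizeof(name);

        err = bpf_obj_get_info_by_fd(link_fd, &info, &len);
        if (err)
            return err;

        printf("iter target: %s\n", name);
        return 0;
    }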
+ fill_link_info = iter_link->tinfo->reg_info->fill_link_info; + if (fill_link_info) + return fill_link_info(&iter_link->aux, info); + + return 0; +} + static const struct bpf_link_ops bpf_iter_link_lops = { .release = bpf_iter_link_release, .dealloc = bpf_iter_link_dealloc, .update_prog = bpf_iter_link_replace, + .show_fdinfo = bpf_iter_link_show_fdinfo, + .fill_link_info = bpf_iter_link_fill_link_info, }; bool bpf_link_is_iter(struct bpf_link *link) diff --git a/kernel/bpf/bpf_local_storage.c b/kernel/bpf/bpf_local_storage.c new file mode 100644 index 000000000000..ffa7d11fc2bd --- /dev/null +++ b/kernel/bpf/bpf_local_storage.c @@ -0,0 +1,600 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2019 Facebook */ +#include <linux/rculist.h> +#include <linux/list.h> +#include <linux/hash.h> +#include <linux/types.h> +#include <linux/spinlock.h> +#include <linux/bpf.h> +#include <linux/btf_ids.h> +#include <linux/bpf_local_storage.h> +#include <net/sock.h> +#include <uapi/linux/sock_diag.h> +#include <uapi/linux/btf.h> + +#define BPF_LOCAL_STORAGE_CREATE_FLAG_MASK (BPF_F_NO_PREALLOC | BPF_F_CLONE) + +static struct bpf_local_storage_map_bucket * +select_bucket(struct bpf_local_storage_map *smap, + struct bpf_local_storage_elem *selem) +{ + return &smap->buckets[hash_ptr(selem, smap->bucket_log)]; +} + +static int mem_charge(struct bpf_local_storage_map *smap, void *owner, u32 size) +{ + struct bpf_map *map = &smap->map; + + if (!map->ops->map_local_storage_charge) + return 0; + + return map->ops->map_local_storage_charge(smap, owner, size); +} + +static void mem_uncharge(struct bpf_local_storage_map *smap, void *owner, + u32 size) +{ + struct bpf_map *map = &smap->map; + + if (map->ops->map_local_storage_uncharge) + map->ops->map_local_storage_uncharge(smap, owner, size); +} + +static struct bpf_local_storage __rcu ** +owner_storage(struct bpf_local_storage_map *smap, void *owner) +{ + struct bpf_map *map = &smap->map; + + return map->ops->map_owner_storage_ptr(owner); +} + +static bool selem_linked_to_storage(const struct bpf_local_storage_elem *selem) +{ + return !hlist_unhashed(&selem->snode); +} + +static bool selem_linked_to_map(const struct bpf_local_storage_elem *selem) +{ + return !hlist_unhashed(&selem->map_node); +} + +struct bpf_local_storage_elem * +bpf_selem_alloc(struct bpf_local_storage_map *smap, void *owner, + void *value, bool charge_mem) +{ + struct bpf_local_storage_elem *selem; + + if (charge_mem && mem_charge(smap, owner, smap->elem_size)) + return NULL; + + selem = kzalloc(smap->elem_size, GFP_ATOMIC | __GFP_NOWARN); + if (selem) { + if (value) + memcpy(SDATA(selem)->data, value, smap->map.value_size); + return selem; + } + + if (charge_mem) + mem_uncharge(smap, owner, smap->elem_size); + + return NULL; +} + +/* local_storage->lock must be held and selem->local_storage == local_storage. + * The caller must ensure selem->smap is still valid to be + * dereferenced for its smap->elem_size and smap->cache_idx. + */ +bool bpf_selem_unlink_storage_nolock(struct bpf_local_storage *local_storage, + struct bpf_local_storage_elem *selem, + bool uncharge_mem) +{ + struct bpf_local_storage_map *smap; + bool free_local_storage; + void *owner; + + smap = rcu_dereference(SDATA(selem)->smap); + owner = local_storage->owner; + + /* All uncharging on the owner must be done first. + * The owner may be freed once the last selem is unlinked + * from local_storage. 
+ */ + if (uncharge_mem) + mem_uncharge(smap, owner, smap->elem_size); + + free_local_storage = hlist_is_singular_node(&selem->snode, + &local_storage->list); + if (free_local_storage) { + mem_uncharge(smap, owner, sizeof(struct bpf_local_storage)); + local_storage->owner = NULL; + + /* After this RCU_INIT, owner may be freed and cannot be used */ + RCU_INIT_POINTER(*owner_storage(smap, owner), NULL); + + /* local_storage is not freed now. local_storage->lock is + * still held and raw_spin_unlock_bh(&local_storage->lock) + * will be done by the caller. + * + * Although the unlock will be done under + * rcu_read_lock(), it is more intutivie to + * read if kfree_rcu(local_storage, rcu) is done + * after the raw_spin_unlock_bh(&local_storage->lock). + * + * Hence, a "bool free_local_storage" is returned + * to the caller which then calls the kfree_rcu() + * after unlock. + */ + } + hlist_del_init_rcu(&selem->snode); + if (rcu_access_pointer(local_storage->cache[smap->cache_idx]) == + SDATA(selem)) + RCU_INIT_POINTER(local_storage->cache[smap->cache_idx], NULL); + + kfree_rcu(selem, rcu); + + return free_local_storage; +} + +static void __bpf_selem_unlink_storage(struct bpf_local_storage_elem *selem) +{ + struct bpf_local_storage *local_storage; + bool free_local_storage = false; + + if (unlikely(!selem_linked_to_storage(selem))) + /* selem has already been unlinked from sk */ + return; + + local_storage = rcu_dereference(selem->local_storage); + raw_spin_lock_bh(&local_storage->lock); + if (likely(selem_linked_to_storage(selem))) + free_local_storage = bpf_selem_unlink_storage_nolock( + local_storage, selem, true); + raw_spin_unlock_bh(&local_storage->lock); + + if (free_local_storage) + kfree_rcu(local_storage, rcu); +} + +void bpf_selem_link_storage_nolock(struct bpf_local_storage *local_storage, + struct bpf_local_storage_elem *selem) +{ + RCU_INIT_POINTER(selem->local_storage, local_storage); + hlist_add_head(&selem->snode, &local_storage->list); +} + +void bpf_selem_unlink_map(struct bpf_local_storage_elem *selem) +{ + struct bpf_local_storage_map *smap; + struct bpf_local_storage_map_bucket *b; + + if (unlikely(!selem_linked_to_map(selem))) + /* selem has already be unlinked from smap */ + return; + + smap = rcu_dereference(SDATA(selem)->smap); + b = select_bucket(smap, selem); + raw_spin_lock_bh(&b->lock); + if (likely(selem_linked_to_map(selem))) + hlist_del_init_rcu(&selem->map_node); + raw_spin_unlock_bh(&b->lock); +} + +void bpf_selem_link_map(struct bpf_local_storage_map *smap, + struct bpf_local_storage_elem *selem) +{ + struct bpf_local_storage_map_bucket *b = select_bucket(smap, selem); + + raw_spin_lock_bh(&b->lock); + RCU_INIT_POINTER(SDATA(selem)->smap, smap); + hlist_add_head_rcu(&selem->map_node, &b->list); + raw_spin_unlock_bh(&b->lock); +} + +void bpf_selem_unlink(struct bpf_local_storage_elem *selem) +{ + /* Always unlink from map before unlinking from local_storage + * because selem will be freed after successfully unlinked from + * the local_storage. 
+ */ + bpf_selem_unlink_map(selem); + __bpf_selem_unlink_storage(selem); +} + +struct bpf_local_storage_data * +bpf_local_storage_lookup(struct bpf_local_storage *local_storage, + struct bpf_local_storage_map *smap, + bool cacheit_lockit) +{ + struct bpf_local_storage_data *sdata; + struct bpf_local_storage_elem *selem; + + /* Fast path (cache hit) */ + sdata = rcu_dereference(local_storage->cache[smap->cache_idx]); + if (sdata && rcu_access_pointer(sdata->smap) == smap) + return sdata; + + /* Slow path (cache miss) */ + hlist_for_each_entry_rcu(selem, &local_storage->list, snode) + if (rcu_access_pointer(SDATA(selem)->smap) == smap) + break; + + if (!selem) + return NULL; + + sdata = SDATA(selem); + if (cacheit_lockit) { + /* spinlock is needed to avoid racing with the + * parallel delete. Otherwise, publishing an already + * deleted sdata to the cache will become a use-after-free + * problem in the next bpf_local_storage_lookup(). + */ + raw_spin_lock_bh(&local_storage->lock); + if (selem_linked_to_storage(selem)) + rcu_assign_pointer(local_storage->cache[smap->cache_idx], + sdata); + raw_spin_unlock_bh(&local_storage->lock); + } + + return sdata; +} + +static int check_flags(const struct bpf_local_storage_data *old_sdata, + u64 map_flags) +{ + if (old_sdata && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST) + /* elem already exists */ + return -EEXIST; + + if (!old_sdata && (map_flags & ~BPF_F_LOCK) == BPF_EXIST) + /* elem doesn't exist, cannot update it */ + return -ENOENT; + + return 0; +} + +int bpf_local_storage_alloc(void *owner, + struct bpf_local_storage_map *smap, + struct bpf_local_storage_elem *first_selem) +{ + struct bpf_local_storage *prev_storage, *storage; + struct bpf_local_storage **owner_storage_ptr; + int err; + + err = mem_charge(smap, owner, sizeof(*storage)); + if (err) + return err; + + storage = kzalloc(sizeof(*storage), GFP_ATOMIC | __GFP_NOWARN); + if (!storage) { + err = -ENOMEM; + goto uncharge; + } + + INIT_HLIST_HEAD(&storage->list); + raw_spin_lock_init(&storage->lock); + storage->owner = owner; + + bpf_selem_link_storage_nolock(storage, first_selem); + bpf_selem_link_map(smap, first_selem); + + owner_storage_ptr = + (struct bpf_local_storage **)owner_storage(smap, owner); + /* Publish storage to the owner. + * Instead of using any lock of the kernel object (i.e. owner), + * cmpxchg will work with any kernel object regardless what + * the running context is, bh, irq...etc. + * + * From now on, the owner->storage pointer (e.g. sk->sk_bpf_storage) + * is protected by the storage->lock. Hence, when freeing + * the owner->storage, the storage->lock must be held before + * setting owner->storage ptr to NULL. + */ + prev_storage = cmpxchg(owner_storage_ptr, NULL, storage); + if (unlikely(prev_storage)) { + bpf_selem_unlink_map(first_selem); + err = -EAGAIN; + goto uncharge; + + /* Note that even first_selem was linked to smap's + * bucket->list, first_selem can be freed immediately + * (instead of kfree_rcu) because + * bpf_local_storage_map_free() does a + * synchronize_rcu() before walking the bucket->list. + * Hence, no one is accessing selem from the + * bucket->list under rcu_read_lock(). + */ + } + + return 0; + +uncharge: + kfree(storage); + mem_uncharge(smap, owner, sizeof(*storage)); + return err; +} + +/* sk cannot be going away because it is linking new elem + * to sk->sk_bpf_storage. (i.e. sk->sk_refcnt cannot be 0). + * Otherwise, it will become a leak (and other memory issues + * during map destruction). 
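The owner-specific front ends earlier in this series (sk and inode storage) are all built around the same lookup-then-update sequence on top of this file. A rough sketch of such a "get or create" front end for some hypothetical owner type (foo_storage_lookup() is a stand-in for the per-owner lookup wrapper)::

    static void *foo_storage_get(struct bpf_map *map, void *owner,
                                 void *value, u64 flags)
    {
        struct bpf_local_storage_data *sdata;

        sdata = foo_storage_lookup(owner, map, true);  /* hypothetical */
        if (sdata || !(flags & BPF_LOCAL_STORAGE_GET_F_CREATE))
            return sdata ? sdata->data : NULL;

        /* BPF_NOEXIST lets bpf_local_storage_update() catch a racing creator */
        sdata = bpf_local_storage_update(owner,
                                         (struct bpf_local_storage_map *)map,
                                         value, BPF_NOEXIST);
        return IS_ERR(sdata) ? NULL : sdata->data;
    }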
+ */ +struct bpf_local_storage_data * +bpf_local_storage_update(void *owner, struct bpf_local_storage_map *smap, + void *value, u64 map_flags) +{ + struct bpf_local_storage_data *old_sdata = NULL; + struct bpf_local_storage_elem *selem; + struct bpf_local_storage *local_storage; + int err; + + /* BPF_EXIST and BPF_NOEXIST cannot be both set */ + if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) || + /* BPF_F_LOCK can only be used in a value with spin_lock */ + unlikely((map_flags & BPF_F_LOCK) && + !map_value_has_spin_lock(&smap->map))) + return ERR_PTR(-EINVAL); + + local_storage = rcu_dereference(*owner_storage(smap, owner)); + if (!local_storage || hlist_empty(&local_storage->list)) { + /* Very first elem for the owner */ + err = check_flags(NULL, map_flags); + if (err) + return ERR_PTR(err); + + selem = bpf_selem_alloc(smap, owner, value, true); + if (!selem) + return ERR_PTR(-ENOMEM); + + err = bpf_local_storage_alloc(owner, smap, selem); + if (err) { + kfree(selem); + mem_uncharge(smap, owner, smap->elem_size); + return ERR_PTR(err); + } + + return SDATA(selem); + } + + if ((map_flags & BPF_F_LOCK) && !(map_flags & BPF_NOEXIST)) { + /* Hoping to find an old_sdata to do inline update + * such that it can avoid taking the local_storage->lock + * and changing the lists. + */ + old_sdata = + bpf_local_storage_lookup(local_storage, smap, false); + err = check_flags(old_sdata, map_flags); + if (err) + return ERR_PTR(err); + if (old_sdata && selem_linked_to_storage(SELEM(old_sdata))) { + copy_map_value_locked(&smap->map, old_sdata->data, + value, false); + return old_sdata; + } + } + + raw_spin_lock_bh(&local_storage->lock); + + /* Recheck local_storage->list under local_storage->lock */ + if (unlikely(hlist_empty(&local_storage->list))) { + /* A parallel del is happening and local_storage is going + * away. It has just been checked before, so very + * unlikely. Return instead of retry to keep things + * simple. + */ + err = -EAGAIN; + goto unlock_err; + } + + old_sdata = bpf_local_storage_lookup(local_storage, smap, false); + err = check_flags(old_sdata, map_flags); + if (err) + goto unlock_err; + + if (old_sdata && (map_flags & BPF_F_LOCK)) { + copy_map_value_locked(&smap->map, old_sdata->data, value, + false); + selem = SELEM(old_sdata); + goto unlock; + } + + /* local_storage->lock is held. Hence, we are sure + * we can unlink and uncharge the old_sdata successfully + * later. Hence, instead of charging the new selem now + * and then uncharge the old selem later (which may cause + * a potential but unnecessary charge failure), avoid taking + * a charge at all here (the "!old_sdata" check) and the + * old_sdata will not be uncharged later during + * bpf_selem_unlink_storage_nolock(). 
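The BPF_F_LOCK branch above updates the existing value in place under its embedded bpf_spin_lock instead of allocating a replacement element. Seen from user space, that path is driven roughly like this (a sketch; it assumes the value type really does carry a struct bpf_spin_lock):

#include <bpf/bpf.h>
#include <linux/bpf.h>

struct ex_locked_val {
	struct bpf_spin_lock lock;
	__u64 counter;
};

static int bump_counter(int map_fd, int sock_fd)
{
	struct ex_locked_val val;

	/* Read a consistent snapshot under the kernel-side lock. */
	if (bpf_map_lookup_elem_flags(map_fd, &sock_fd, &val, BPF_F_LOCK))
		return -1;

	val.counter++;

	/* Copy the value back in place; no new element is created. */
	return bpf_map_update_elem(map_fd, &sock_fd, &val, BPF_F_LOCK);
}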
+ */ + selem = bpf_selem_alloc(smap, owner, value, !old_sdata); + if (!selem) { + err = -ENOMEM; + goto unlock_err; + } + + /* First, link the new selem to the map */ + bpf_selem_link_map(smap, selem); + + /* Second, link (and publish) the new selem to local_storage */ + bpf_selem_link_storage_nolock(local_storage, selem); + + /* Third, remove old selem, SELEM(old_sdata) */ + if (old_sdata) { + bpf_selem_unlink_map(SELEM(old_sdata)); + bpf_selem_unlink_storage_nolock(local_storage, SELEM(old_sdata), + false); + } + +unlock: + raw_spin_unlock_bh(&local_storage->lock); + return SDATA(selem); + +unlock_err: + raw_spin_unlock_bh(&local_storage->lock); + return ERR_PTR(err); +} + +u16 bpf_local_storage_cache_idx_get(struct bpf_local_storage_cache *cache) +{ + u64 min_usage = U64_MAX; + u16 i, res = 0; + + spin_lock(&cache->idx_lock); + + for (i = 0; i < BPF_LOCAL_STORAGE_CACHE_SIZE; i++) { + if (cache->idx_usage_counts[i] < min_usage) { + min_usage = cache->idx_usage_counts[i]; + res = i; + + /* Found a free cache_idx */ + if (!min_usage) + break; + } + } + cache->idx_usage_counts[res]++; + + spin_unlock(&cache->idx_lock); + + return res; +} + +void bpf_local_storage_cache_idx_free(struct bpf_local_storage_cache *cache, + u16 idx) +{ + spin_lock(&cache->idx_lock); + cache->idx_usage_counts[idx]--; + spin_unlock(&cache->idx_lock); +} + +void bpf_local_storage_map_free(struct bpf_local_storage_map *smap) +{ + struct bpf_local_storage_elem *selem; + struct bpf_local_storage_map_bucket *b; + unsigned int i; + + /* Note that this map might be concurrently cloned from + * bpf_sk_storage_clone. Wait for any existing bpf_sk_storage_clone + * RCU read section to finish before proceeding. New RCU + * read sections should be prevented via bpf_map_inc_not_zero. + */ + synchronize_rcu(); + + /* bpf prog and the userspace can no longer access this map + * now. No new selem (of this map) can be added + * to the owner->storage or to the map bucket's list. + * + * The elem of this map can be cleaned up here + * or when the storage is freed e.g. + * by bpf_sk_storage_free() during __sk_destruct(). + */ + for (i = 0; i < (1U << smap->bucket_log); i++) { + b = &smap->buckets[i]; + + rcu_read_lock(); + /* No one is adding to b->list now */ + while ((selem = hlist_entry_safe( + rcu_dereference_raw(hlist_first_rcu(&b->list)), + struct bpf_local_storage_elem, map_node))) { + bpf_selem_unlink(selem); + cond_resched_rcu(); + } + rcu_read_unlock(); + } + + /* While freeing the storage we may still need to access the map. + * + * e.g. when bpf_sk_storage_free() has unlinked selem from the map + * which then made the above while((selem = ...)) loop + * exit immediately. + * + * However, while freeing the storage one still needs to access the + * smap->elem_size to do the uncharging in + * bpf_selem_unlink_storage_nolock(). + * + * Hence, wait another rcu grace period for the storage to be freed. 
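bpf_local_storage_cache_idx_get()/_free() above hand out one of the BPF_LOCAL_STORAGE_CACHE_SIZE lookup-cache slots, preferring the least used one. A map type built on this layer is expected to pair them around the generic alloc/free, roughly as below (a sketch only; the cache and callback names are invented, but the shape mirrors how sk/inode storage are meant to use it):

#include <linux/bpf_local_storage.h>

static struct bpf_local_storage_cache ex_cache = {
	.idx_lock = __SPIN_LOCK_UNLOCKED(ex_cache.idx_lock),
};

static struct bpf_map *ex_map_alloc(union bpf_attr *attr)
{
	struct bpf_local_storage_map *smap;

	smap = bpf_local_storage_map_alloc(attr);
	if (IS_ERR(smap))
		return ERR_CAST(smap);

	/* Claim a cache slot; lookups then hit local_storage->cache[cache_idx]. */
	smap->cache_idx = bpf_local_storage_cache_idx_get(&ex_cache);
	return &smap->map;
}

static void ex_map_free(struct bpf_map *map)
{
	struct bpf_local_storage_map *smap =
		(struct bpf_local_storage_map *)map;

	bpf_local_storage_cache_idx_free(&ex_cache, smap->cache_idx);
	bpf_local_storage_map_free(smap);
}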
+ */ + synchronize_rcu(); + + kvfree(smap->buckets); + kfree(smap); +} + +int bpf_local_storage_map_alloc_check(union bpf_attr *attr) +{ + if (attr->map_flags & ~BPF_LOCAL_STORAGE_CREATE_FLAG_MASK || + !(attr->map_flags & BPF_F_NO_PREALLOC) || + attr->max_entries || + attr->key_size != sizeof(int) || !attr->value_size || + /* Enforce BTF for userspace sk dumping */ + !attr->btf_key_type_id || !attr->btf_value_type_id) + return -EINVAL; + + if (!bpf_capable()) + return -EPERM; + + if (attr->value_size > BPF_LOCAL_STORAGE_MAX_VALUE_SIZE) + return -E2BIG; + + return 0; +} + +struct bpf_local_storage_map *bpf_local_storage_map_alloc(union bpf_attr *attr) +{ + struct bpf_local_storage_map *smap; + unsigned int i; + u32 nbuckets; + u64 cost; + int ret; + + smap = kzalloc(sizeof(*smap), GFP_USER | __GFP_NOWARN); + if (!smap) + return ERR_PTR(-ENOMEM); + bpf_map_init_from_attr(&smap->map, attr); + + nbuckets = roundup_pow_of_two(num_possible_cpus()); + /* Use at least 2 buckets, select_bucket() is undefined behavior with 1 bucket */ + nbuckets = max_t(u32, 2, nbuckets); + smap->bucket_log = ilog2(nbuckets); + cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap); + + ret = bpf_map_charge_init(&smap->map.memory, cost); + if (ret < 0) { + kfree(smap); + return ERR_PTR(ret); + } + + smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets, + GFP_USER | __GFP_NOWARN); + if (!smap->buckets) { + bpf_map_charge_finish(&smap->map.memory); + kfree(smap); + return ERR_PTR(-ENOMEM); + } + + for (i = 0; i < nbuckets; i++) { + INIT_HLIST_HEAD(&smap->buckets[i].list); + raw_spin_lock_init(&smap->buckets[i].lock); + } + + smap->elem_size = + sizeof(struct bpf_local_storage_elem) + attr->value_size; + + return smap; +} + +int bpf_local_storage_map_check_btf(const struct bpf_map *map, + const struct btf *btf, + const struct btf_type *key_type, + const struct btf_type *value_type) +{ + u32 int_data; + + if (BTF_INFO_KIND(key_type->info) != BTF_KIND_INT) + return -EINVAL; + + int_data = *(u32 *)(key_type + 1); + if (BTF_INT_BITS(int_data) != 32 || BTF_INT_OFFSET(int_data)) + return -EINVAL; + + return 0; +} diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c index fb278144e9fd..9cd1428c7199 100644 --- a/kernel/bpf/bpf_lsm.c +++ b/kernel/bpf/bpf_lsm.c @@ -11,6 +11,8 @@ #include <linux/bpf_lsm.h> #include <linux/kallsyms.h> #include <linux/bpf_verifier.h> +#include <net/bpf_sk_storage.h> +#include <linux/bpf_local_storage.h> /* For every LSM hook that allows attachment of BPF programs, declare a nop * function where a BPF program can be attached. 
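Together with the storage helpers wired up for LSM programs in the hunk below, the map layer above is what lets an LSM BPF program keep per-inode state. A minimal BPF-side sketch (assuming the usual vmlinux.h plus libbpf build environment and the BPF_LOCAL_STORAGE_GET_F_CREATE flag spelling from this series; the map and struct names are invented):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct ex_inode_state {
	__u64 opens;
};

struct {
	__uint(type, BPF_MAP_TYPE_INODE_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* required by map_alloc_check */
	__type(key, int);			/* key_size must be sizeof(int) */
	__type(value, struct ex_inode_state);	/* BTF-described value */
} inode_state SEC(".maps");

SEC("lsm/file_open")
int BPF_PROG(count_open, struct file *file)
{
	struct ex_inode_state *state;

	state = bpf_inode_storage_get(&inode_state, file->f_inode, 0,
				      BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (state)
		__sync_fetch_and_add(&state->opens, 1);

	return 0;
}

char LICENSE[] SEC("license") = "GPL";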
@@ -45,10 +47,27 @@ int bpf_lsm_verify_prog(struct bpf_verifier_log *vlog, return 0; } +static const struct bpf_func_proto * +bpf_lsm_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) +{ + switch (func_id) { + case BPF_FUNC_inode_storage_get: + return &bpf_inode_storage_get_proto; + case BPF_FUNC_inode_storage_delete: + return &bpf_inode_storage_delete_proto; + case BPF_FUNC_sk_storage_get: + return &sk_storage_get_btf_proto; + case BPF_FUNC_sk_storage_delete: + return &sk_storage_delete_btf_proto; + default: + return tracing_prog_func_proto(func_id, prog); + } +} + const struct bpf_prog_ops lsm_prog_ops = { }; const struct bpf_verifier_ops lsm_verifier_ops = { - .get_func_proto = tracing_prog_func_proto, + .get_func_proto = bpf_lsm_func_proto, .is_valid_access = btf_ctx_access, }; diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c index 969c5d47f81f..4c3b543bb33b 100644 --- a/kernel/bpf/bpf_struct_ops.c +++ b/kernel/bpf/bpf_struct_ops.c @@ -298,8 +298,7 @@ static int check_zero_holes(const struct btf_type *t, void *data) return -EINVAL; mtype = btf_type_by_id(btf_vmlinux, member->type); - mtype = btf_resolve_size(btf_vmlinux, mtype, &msize, - NULL, NULL); + mtype = btf_resolve_size(btf_vmlinux, mtype, &msize); if (IS_ERR(mtype)) return PTR_ERR(mtype); prev_mend = moff + msize; @@ -396,8 +395,7 @@ static int bpf_struct_ops_map_update_elem(struct bpf_map *map, void *key, u32 msize; mtype = btf_type_by_id(btf_vmlinux, member->type); - mtype = btf_resolve_size(btf_vmlinux, mtype, &msize, - NULL, NULL); + mtype = btf_resolve_size(btf_vmlinux, mtype, &msize); if (IS_ERR(mtype)) { err = PTR_ERR(mtype); goto reset_unlock; diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c index 91afdd4c82e3..f9ac6935ab3c 100644 --- a/kernel/bpf/btf.c +++ b/kernel/bpf/btf.c @@ -21,6 +21,8 @@ #include <linux/btf_ids.h> #include <linux/skmsg.h> #include <linux/perf_event.h> +#include <linux/bsearch.h> +#include <linux/btf_ids.h> #include <net/sock.h> /* BTF (BPF Type Format) is the meta data format which describes @@ -1079,23 +1081,27 @@ static const struct resolve_vertex *env_stack_peak(struct btf_verifier_env *env) * *type_size: (x * y * sizeof(u32)). Hence, *type_size always * corresponds to the return type. * *elem_type: u32 + * *elem_id: id of u32 * *total_nelems: (x * y). Hence, individual elem size is * (*type_size / *total_nelems) + * *type_id: id of type if it's changed within the function, 0 if not * * type: is not an array (e.g. 
const struct X) * return type: type "struct X" * *type_size: sizeof(struct X) * *elem_type: same as return type ("struct X") + * *elem_id: 0 * *total_nelems: 1 + * *type_id: id of type if it's changed within the function, 0 if not */ -const struct btf_type * -btf_resolve_size(const struct btf *btf, const struct btf_type *type, - u32 *type_size, const struct btf_type **elem_type, - u32 *total_nelems) +static const struct btf_type * +__btf_resolve_size(const struct btf *btf, const struct btf_type *type, + u32 *type_size, const struct btf_type **elem_type, + u32 *elem_id, u32 *total_nelems, u32 *type_id) { const struct btf_type *array_type = NULL; - const struct btf_array *array; - u32 i, size, nelems = 1; + const struct btf_array *array = NULL; + u32 i, size, nelems = 1, id = 0; for (i = 0; i < MAX_RESOLVE_DEPTH; i++) { switch (BTF_INFO_KIND(type->info)) { @@ -1116,6 +1122,7 @@ btf_resolve_size(const struct btf *btf, const struct btf_type *type, case BTF_KIND_VOLATILE: case BTF_KIND_CONST: case BTF_KIND_RESTRICT: + id = type->type; type = btf_type_by_id(btf, type->type); break; @@ -1146,10 +1153,21 @@ resolved: *total_nelems = nelems; if (elem_type) *elem_type = type; + if (elem_id) + *elem_id = array ? array->type : 0; + if (type_id && id) + *type_id = id; return array_type ? : type; } +const struct btf_type * +btf_resolve_size(const struct btf *btf, const struct btf_type *type, + u32 *type_size) +{ + return __btf_resolve_size(btf, type, type_size, NULL, NULL, NULL, NULL); +} + /* The input param "type_id" must point to a needs_resolve type */ static const struct btf_type *btf_type_id_resolve(const struct btf *btf, u32 *type_id) @@ -3870,16 +3888,22 @@ bool btf_ctx_access(int off, int size, enum bpf_access_type type, return true; } -int btf_struct_access(struct bpf_verifier_log *log, - const struct btf_type *t, int off, int size, - enum bpf_access_type atype, - u32 *next_btf_id) +enum bpf_struct_walk_result { + /* < 0 error */ + WALK_SCALAR = 0, + WALK_PTR, + WALK_STRUCT, +}; + +static int btf_struct_walk(struct bpf_verifier_log *log, + const struct btf_type *t, int off, int size, + u32 *next_btf_id) { u32 i, moff, mtrue_end, msize = 0, total_nelems = 0; const struct btf_type *mtype, *elem_type = NULL; const struct btf_member *member; const char *tname, *mname; - u32 vlen; + u32 vlen, elem_id, mid; again: tname = __btf_name_by_offset(btf_vmlinux, t->name_off); @@ -3915,14 +3939,13 @@ again: /* Only allow structure for now, can be relaxed for * other types later. 
*/ - elem_type = btf_type_skip_modifiers(btf_vmlinux, - array_elem->type, NULL); - if (!btf_type_is_struct(elem_type)) + t = btf_type_skip_modifiers(btf_vmlinux, array_elem->type, + NULL); + if (!btf_type_is_struct(t)) goto error; - off = (off - moff) % elem_type->size; - return btf_struct_access(log, elem_type, off, size, atype, - next_btf_id); + off = (off - moff) % t->size; + goto again; error: bpf_log(log, "access beyond struct %s at off %u size %u\n", @@ -3951,7 +3974,7 @@ error: */ if (off <= moff && BITS_ROUNDUP_BYTES(end_bit) <= off + size) - return SCALAR_VALUE; + return WALK_SCALAR; /* off may be accessing a following member * @@ -3973,11 +3996,13 @@ error: break; /* type of the field */ + mid = member->type; mtype = btf_type_by_id(btf_vmlinux, member->type); mname = __btf_name_by_offset(btf_vmlinux, member->name_off); - mtype = btf_resolve_size(btf_vmlinux, mtype, &msize, - &elem_type, &total_nelems); + mtype = __btf_resolve_size(btf_vmlinux, mtype, &msize, + &elem_type, &elem_id, &total_nelems, + &mid); if (IS_ERR(mtype)) { bpf_log(log, "field %s doesn't have size\n", mname); return -EFAULT; @@ -3991,7 +4016,7 @@ error: if (btf_type_is_array(mtype)) { u32 elem_idx; - /* btf_resolve_size() above helps to + /* __btf_resolve_size() above helps to * linearize a multi-dimensional array. * * The logic here is treating an array @@ -4039,6 +4064,7 @@ error: elem_idx = (off - moff) / msize; moff += elem_idx * msize; mtype = elem_type; + mid = elem_id; } /* the 'off' we're looking for is either equal to start @@ -4048,6 +4074,12 @@ error: /* our field must be inside that union or struct */ t = mtype; + /* return if the offset matches the member offset */ + if (off == moff) { + *next_btf_id = mid; + return WALK_STRUCT; + } + /* adjust offset we're looking for */ off -= moff; goto again; @@ -4063,11 +4095,10 @@ error: mname, moff, tname, off, size); return -EACCES; } - stype = btf_type_skip_modifiers(btf_vmlinux, mtype->type, &id); if (btf_type_is_struct(stype)) { *next_btf_id = id; - return PTR_TO_BTF_ID; + return WALK_PTR; } } @@ -4084,12 +4115,84 @@ error: return -EACCES; } - return SCALAR_VALUE; + return WALK_SCALAR; } bpf_log(log, "struct %s doesn't have field at offset %d\n", tname, off); return -EINVAL; } +int btf_struct_access(struct bpf_verifier_log *log, + const struct btf_type *t, int off, int size, + enum bpf_access_type atype __maybe_unused, + u32 *next_btf_id) +{ + int err; + u32 id; + + do { + err = btf_struct_walk(log, t, off, size, &id); + + switch (err) { + case WALK_PTR: + /* If we found the pointer or scalar on t+off, + * we're done. + */ + *next_btf_id = id; + return PTR_TO_BTF_ID; + case WALK_SCALAR: + return SCALAR_VALUE; + case WALK_STRUCT: + /* We found nested struct, so continue the search + * by diving in it. At this point the offset is + * aligned with the new type, so set it to 0. + */ + t = btf_type_by_id(btf_vmlinux, id); + off = 0; + break; + default: + /* It's either error or unknown return value.. + * scream and leave. + */ + if (WARN_ONCE(err > 0, "unknown btf_struct_walk return value")) + return -EINVAL; + return err; + } + } while (t); + + return -EINVAL; +} + +bool btf_struct_ids_match(struct bpf_verifier_log *log, + int off, u32 id, u32 need_type_id) +{ + const struct btf_type *type; + int err; + + /* Are we already done? 
*/ + if (need_type_id == id && off == 0) + return true; + +again: + type = btf_type_by_id(btf_vmlinux, id); + if (!type) + return false; + err = btf_struct_walk(log, type, off, 1, &id); + if (err != WALK_STRUCT) + return false; + + /* We found nested struct object. If it matches + * the requested ID, we're done. Otherwise let's + * continue the search with offset 0 in the new + * type. + */ + if (need_type_id != id) { + off = 0; + goto again; + } + + return true; +} + int btf_resolve_helper_id(struct bpf_verifier_log *log, const struct bpf_func_proto *fn, int arg) { @@ -4661,3 +4764,15 @@ u32 btf_id(const struct btf *btf) { return btf->id; } + +static int btf_id_cmp_func(const void *a, const void *b) +{ + const int *pa = a, *pb = b; + + return *pa - *pb; +} + +bool btf_id_set_contains(struct btf_id_set *set, u32 id) +{ + return bsearch(&id, set->ids, set->cnt, sizeof(u32), btf_id_cmp_func) != NULL; +} diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c index f1c46529929b..cf548fc88780 100644 --- a/kernel/bpf/cpumap.c +++ b/kernel/bpf/cpumap.c @@ -79,8 +79,6 @@ struct bpf_cpu_map { static DEFINE_PER_CPU(struct list_head, cpu_map_flush_list); -static int bq_flush_to_queue(struct xdp_bulk_queue *bq); - static struct bpf_map *cpu_map_alloc(union bpf_attr *attr) { u32 value_size = attr->value_size; @@ -658,6 +656,7 @@ static int cpu_map_get_next_key(struct bpf_map *map, void *key, void *next_key) static int cpu_map_btf_id; const struct bpf_map_ops cpu_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = cpu_map_alloc, .map_free = cpu_map_free, .map_delete_elem = cpu_map_delete_elem, @@ -669,7 +668,7 @@ const struct bpf_map_ops cpu_map_ops = { .map_btf_id = &cpu_map_btf_id, }; -static int bq_flush_to_queue(struct xdp_bulk_queue *bq) +static void bq_flush_to_queue(struct xdp_bulk_queue *bq) { struct bpf_cpu_map_entry *rcpu = bq->obj; unsigned int processed = 0, drops = 0; @@ -678,7 +677,7 @@ static int bq_flush_to_queue(struct xdp_bulk_queue *bq) int i; if (unlikely(!bq->count)) - return 0; + return; q = rcpu->queue; spin_lock(&q->producer_lock); @@ -701,13 +700,12 @@ static int bq_flush_to_queue(struct xdp_bulk_queue *bq) /* Feedback loop via tracepoints */ trace_xdp_cpumap_enqueue(rcpu->map_id, processed, drops, to_cpu); - return 0; } /* Runs under RCU-read-side, plus in softirq under NAPI protection. * Thus, safe percpu variable access. 
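The new ->map_meta_equal callback (set to the generic bpf_map_meta_equal here for cpumap, and likewise for the other map types touched in this series) is what now decides whether a map may be used as the inner map of a map-in-map. Since the generic helper no longer compares max_entries (see the map_in_map.c hunk further down), a map type whose layout or behaviour depends on attributes beyond type, key/value size and flags can supply a stricter check of its own; an invented example:

#include <linux/bpf.h>

static bool ex_map_meta_equal(const struct bpf_map *meta0,
			      const struct bpf_map *meta1)
{
	/* bpf_map_meta_equal() compares map_type, key_size, value_size and
	 * map_flags; this (hypothetical) map also insists that max_entries
	 * match, e.g. because its value layout is derived from it.
	 */
	return bpf_map_meta_equal(meta0, meta1) &&
	       meta0->max_entries == meta1->max_entries;
}

const struct bpf_map_ops ex_map_ops = {
	.map_meta_equal	= ex_map_meta_equal,
	/* ...the usual map_alloc/map_free/map_lookup_elem callbacks... */
};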
*/ -static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf) +static void bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf) { struct list_head *flush_list = this_cpu_ptr(&cpu_map_flush_list); struct xdp_bulk_queue *bq = this_cpu_ptr(rcpu->bulkq); @@ -728,8 +726,6 @@ static int bq_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_frame *xdpf) if (!bq->flush_node.prev) list_add(&bq->flush_node, flush_list); - - return 0; } int cpu_map_enqueue(struct bpf_cpu_map_entry *rcpu, struct xdp_buff *xdp, diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c index 10abb06065bb..2b5ca93c17de 100644 --- a/kernel/bpf/devmap.c +++ b/kernel/bpf/devmap.c @@ -341,14 +341,14 @@ bool dev_map_can_have_prog(struct bpf_map *map) return false; } -static int bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) +static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags) { struct net_device *dev = bq->dev; int sent = 0, drops = 0, err = 0; int i; if (unlikely(!bq->count)) - return 0; + return; for (i = 0; i < bq->count; i++) { struct xdp_frame *xdpf = bq->q[i]; @@ -369,7 +369,7 @@ out: trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err); bq->dev_rx = NULL; __list_del_clearprev(&bq->flush_node); - return 0; + return; error: /* If ndo_xdp_xmit fails with an errno, no frames have been * xmit'ed and it's our responsibility to free them all. @@ -421,8 +421,8 @@ struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key) /* Runs under RCU-read-side, plus in softirq under NAPI protection. * Thus, safe percpu variable access. */ -static int bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf, - struct net_device *dev_rx) +static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf, + struct net_device *dev_rx) { struct list_head *flush_list = this_cpu_ptr(&dev_flush_list); struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq); @@ -441,8 +441,6 @@ static int bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf, if (!bq->flush_node.prev) list_add(&bq->flush_node, flush_list); - - return 0; } static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp, @@ -462,7 +460,8 @@ static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp, if (unlikely(!xdpf)) return -EOVERFLOW; - return bq_enqueue(dev, xdpf, dev_rx); + bq_enqueue(dev, xdpf, dev_rx); + return 0; } static struct xdp_buff *dev_map_run_prog(struct net_device *dev, @@ -751,6 +750,7 @@ static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value, static int dev_map_btf_id; const struct bpf_map_ops dev_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = dev_map_alloc, .map_free = dev_map_free, .map_get_next_key = dev_map_get_next_key, @@ -764,6 +764,7 @@ const struct bpf_map_ops dev_map_ops = { static int dev_map_hash_map_btf_id; const struct bpf_map_ops dev_map_hash_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = dev_map_alloc, .map_free = dev_map_free, .map_get_next_key = dev_map_hash_get_next_key, diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c index 78dfff6a501b..fe0e06284d33 100644 --- a/kernel/bpf/hashtab.c +++ b/kernel/bpf/hashtab.c @@ -9,6 +9,7 @@ #include <linux/rculist_nulls.h> #include <linux/random.h> #include <uapi/linux/btf.h> +#include <linux/rcupdate_trace.h> #include "percpu_freelist.h" #include "bpf_lru_list.h" #include "map_in_map.h" @@ -577,8 +578,7 @@ static void *__htab_map_lookup_elem(struct bpf_map *map, void *key) struct htab_elem *l; u32 hash, key_size; - /* Must be
called with rcu_read_lock. */ - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); key_size = map->key_size; @@ -941,7 +941,7 @@ static int htab_map_update_elem(struct bpf_map *map, void *key, void *value, /* unknown flags */ return -EINVAL; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); key_size = map->key_size; @@ -1032,7 +1032,7 @@ static int htab_lru_map_update_elem(struct bpf_map *map, void *key, void *value, /* unknown flags */ return -EINVAL; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); key_size = map->key_size; @@ -1220,7 +1220,7 @@ static int htab_map_delete_elem(struct bpf_map *map, void *key) u32 hash, key_size; int ret = -ENOENT; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); key_size = map->key_size; @@ -1252,7 +1252,7 @@ static int htab_lru_map_delete_elem(struct bpf_map *map, void *key) u32 hash, key_size; int ret = -ENOENT; - WARN_ON_ONCE(!rcu_read_lock_held()); + WARN_ON_ONCE(!rcu_read_lock_held() && !rcu_read_lock_trace_held()); key_size = map->key_size; @@ -1810,6 +1810,7 @@ static const struct bpf_iter_seq_info iter_seq_info = { static int htab_map_btf_id; const struct bpf_map_ops htab_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, .map_free = htab_map_free, @@ -1827,6 +1828,7 @@ const struct bpf_map_ops htab_map_ops = { static int htab_lru_map_btf_id; const struct bpf_map_ops htab_lru_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, .map_free = htab_map_free, @@ -1947,6 +1949,7 @@ static void htab_percpu_map_seq_show_elem(struct bpf_map *map, void *key, static int htab_percpu_map_btf_id; const struct bpf_map_ops htab_percpu_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, .map_free = htab_map_free, @@ -1963,6 +1966,7 @@ const struct bpf_map_ops htab_percpu_map_ops = { static int htab_lru_percpu_map_btf_id; const struct bpf_map_ops htab_lru_percpu_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = htab_map_alloc_check, .map_alloc = htab_map_alloc, .map_free = htab_map_free, diff --git a/kernel/bpf/helpers.c b/kernel/bpf/helpers.c index be43ab3e619f..5cc7425ee476 100644 --- a/kernel/bpf/helpers.c +++ b/kernel/bpf/helpers.c @@ -601,6 +601,28 @@ const struct bpf_func_proto bpf_event_output_data_proto = { .arg5_type = ARG_CONST_SIZE_OR_ZERO, }; +BPF_CALL_3(bpf_copy_from_user, void *, dst, u32, size, + const void __user *, user_ptr) +{ + int ret = copy_from_user(dst, user_ptr, size); + + if (unlikely(ret)) { + memset(dst, 0, size); + ret = -EFAULT; + } + + return ret; +} + +const struct bpf_func_proto bpf_copy_from_user_proto = { + .func = bpf_copy_from_user, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_UNINIT_MEM, + .arg2_type = ARG_CONST_SIZE_OR_ZERO, + .arg3_type = ARG_ANYTHING, +}; + const struct bpf_func_proto bpf_get_current_task_proto __weak; const struct bpf_func_proto bpf_probe_read_user_proto __weak; const struct bpf_func_proto bpf_probe_read_user_str_proto __weak; diff --git a/kernel/bpf/inode.c b/kernel/bpf/inode.c index fb878ba3f22f..b48a56f53495 100644 --- a/kernel/bpf/inode.c +++ b/kernel/bpf/inode.c @@ -20,6 +20,7 @@ #include <linux/filter.h> #include 
<linux/bpf.h> #include <linux/bpf_trace.h> +#include "preload/bpf_preload.h" enum bpf_type { BPF_TYPE_UNSPEC = 0, @@ -369,9 +370,10 @@ static struct dentry * bpf_lookup(struct inode *dir, struct dentry *dentry, unsigned flags) { /* Dots in names (e.g. "/sys/fs/bpf/foo.bar") are reserved for future - * extensions. + * extensions. That allows populate_bpffs() to create special files. */ - if (strchr(dentry->d_name.name, '.')) + if ((dir->i_mode & S_IALLUGO) && + strchr(dentry->d_name.name, '.')) return ERR_PTR(-EPERM); return simple_lookup(dir, dentry, flags); @@ -409,6 +411,27 @@ static const struct inode_operations bpf_dir_iops = { .unlink = simple_unlink, }; +/* pin iterator link into bpffs */ +static int bpf_iter_link_pin_kernel(struct dentry *parent, + const char *name, struct bpf_link *link) +{ + umode_t mode = S_IFREG | S_IRUSR; + struct dentry *dentry; + int ret; + + inode_lock(parent->d_inode); + dentry = lookup_one_len(name, parent, strlen(name)); + if (IS_ERR(dentry)) { + inode_unlock(parent->d_inode); + return PTR_ERR(dentry); + } + ret = bpf_mkobj_ops(dentry, mode, link, &bpf_link_iops, + &bpf_iter_fops); + dput(dentry); + inode_unlock(parent->d_inode); + return ret; +} + static int bpf_obj_do_pin(const char __user *pathname, void *raw, enum bpf_type type) { @@ -638,6 +661,91 @@ static int bpf_parse_param(struct fs_context *fc, struct fs_parameter *param) return 0; } +struct bpf_preload_ops *bpf_preload_ops; +EXPORT_SYMBOL_GPL(bpf_preload_ops); + +static bool bpf_preload_mod_get(void) +{ + /* If bpf_preload.ko wasn't loaded earlier then load it now. + * When bpf_preload is built into vmlinux the module's __init + * function will populate it. + */ + if (!bpf_preload_ops) { + request_module("bpf_preload"); + if (!bpf_preload_ops) + return false; + } + /* And grab the reference, so the module doesn't disappear while the + * kernel is interacting with the kernel module and its UMD.
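Once populate_bpffs() further down has pinned the preloaded iterator links, every new bpffs mount comes up with ready-made introspection files (the UMD in this series names them maps.debug and progs.debug). Consuming them needs nothing BPF specific; a plain read loop is enough (a sketch, assuming the default bpffs mount point):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int dump_pinned(const char *path)	/* e.g. "/sys/fs/bpf/progs.debug" */
{
	char buf[4096];
	ssize_t n;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		fwrite(buf, 1, n, stdout);
	close(fd);
	return 0;
}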
+ */ + if (!try_module_get(bpf_preload_ops->owner)) { + pr_err("bpf_preload module get failed.\n"); + return false; + } + return true; +} + +static void bpf_preload_mod_put(void) +{ + if (bpf_preload_ops) + /* now user can "rmmod bpf_preload" if necessary */ + module_put(bpf_preload_ops->owner); +} + +static DEFINE_MUTEX(bpf_preload_lock); + +static int populate_bpffs(struct dentry *parent) +{ + struct bpf_preload_info objs[BPF_PRELOAD_LINKS] = {}; + struct bpf_link *links[BPF_PRELOAD_LINKS] = {}; + int err = 0, i; + + /* grab the mutex to make sure the kernel interactions with bpf_preload + * UMD are serialized + */ + mutex_lock(&bpf_preload_lock); + + /* if bpf_preload.ko wasn't built into vmlinux then load it */ + if (!bpf_preload_mod_get()) + goto out; + + if (!bpf_preload_ops->info.tgid) { + /* preload() will start UMD that will load BPF iterator programs */ + err = bpf_preload_ops->preload(objs); + if (err) + goto out_put; + for (i = 0; i < BPF_PRELOAD_LINKS; i++) { + links[i] = bpf_link_by_id(objs[i].link_id); + if (IS_ERR(links[i])) { + err = PTR_ERR(links[i]); + goto out_put; + } + } + for (i = 0; i < BPF_PRELOAD_LINKS; i++) { + err = bpf_iter_link_pin_kernel(parent, + objs[i].link_name, links[i]); + if (err) + goto out_put; + /* do not unlink successfully pinned links even + * if later link fails to pin + */ + links[i] = NULL; + } + /* finish() will tell UMD process to exit */ + err = bpf_preload_ops->finish(); + if (err) + goto out_put; + } +out_put: + bpf_preload_mod_put(); +out: + mutex_unlock(&bpf_preload_lock); + for (i = 0; i < BPF_PRELOAD_LINKS && err; i++) + if (!IS_ERR_OR_NULL(links[i])) + bpf_link_put(links[i]); + return err; +} + static int bpf_fill_super(struct super_block *sb, struct fs_context *fc) { static const struct tree_descr bpf_rfiles[] = { { "" } }; @@ -654,8 +762,8 @@ static int bpf_fill_super(struct super_block *sb, struct fs_context *fc) inode = sb->s_root->d_inode; inode->i_op = &bpf_dir_iops; inode->i_mode &= ~S_IALLUGO; + populate_bpffs(sb->s_root); inode->i_mode |= S_ISVTX | opts->mode; - return 0; } @@ -705,6 +813,8 @@ static int __init bpf_init(void) { int ret; + mutex_init(&bpf_preload_lock); + ret = sysfs_create_mount_point(fs_kobj, "bpf"); if (ret) return ret; diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c index 44474bf3ab7a..00e32f2ec3e6 100644 --- a/kernel/bpf/lpm_trie.c +++ b/kernel/bpf/lpm_trie.c @@ -732,6 +732,7 @@ static int trie_check_btf(const struct bpf_map *map, static int trie_map_btf_id; const struct bpf_map_ops trie_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = trie_alloc, .map_free = trie_free, .map_get_next_key = trie_get_next_key, diff --git a/kernel/bpf/map_in_map.c b/kernel/bpf/map_in_map.c index 17738c93bec8..39ab0b68cade 100644 --- a/kernel/bpf/map_in_map.c +++ b/kernel/bpf/map_in_map.c @@ -17,23 +17,17 @@ struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd) if (IS_ERR(inner_map)) return inner_map; - /* prog_array->aux->{type,jited} is a runtime binding. - * Doing static check alone in the verifier is not enough. 
- */ - if (inner_map->map_type == BPF_MAP_TYPE_PROG_ARRAY || - inner_map->map_type == BPF_MAP_TYPE_CGROUP_STORAGE || - inner_map->map_type == BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE || - inner_map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { - fdput(f); - return ERR_PTR(-ENOTSUPP); - } - /* Does not support >1 level map-in-map */ if (inner_map->inner_map_meta) { fdput(f); return ERR_PTR(-EINVAL); } + if (!inner_map->ops->map_meta_equal) { + fdput(f); + return ERR_PTR(-ENOTSUPP); + } + if (map_value_has_spin_lock(inner_map)) { fdput(f); return ERR_PTR(-ENOTSUPP); @@ -81,15 +75,14 @@ bool bpf_map_meta_equal(const struct bpf_map *meta0, return meta0->map_type == meta1->map_type && meta0->key_size == meta1->key_size && meta0->value_size == meta1->value_size && - meta0->map_flags == meta1->map_flags && - meta0->max_entries == meta1->max_entries; + meta0->map_flags == meta1->map_flags; } void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file /* not used */, int ufd) { - struct bpf_map *inner_map; + struct bpf_map *inner_map, *inner_map_meta; struct fd f; f = fdget(ufd); @@ -97,7 +90,8 @@ void *bpf_map_fd_get_ptr(struct bpf_map *map, if (IS_ERR(inner_map)) return inner_map; - if (bpf_map_meta_equal(map->inner_map_meta, inner_map)) + inner_map_meta = map->inner_map_meta; + if (inner_map_meta->ops->map_meta_equal(inner_map_meta, inner_map)) bpf_map_inc(inner_map); else inner_map = ERR_PTR(-EINVAL); diff --git a/kernel/bpf/map_in_map.h b/kernel/bpf/map_in_map.h index a507bf6ef8b9..bcb7534afb3c 100644 --- a/kernel/bpf/map_in_map.h +++ b/kernel/bpf/map_in_map.h @@ -11,8 +11,6 @@ struct bpf_map; struct bpf_map *bpf_map_meta_alloc(int inner_map_ufd); void bpf_map_meta_free(struct bpf_map *map_meta); -bool bpf_map_meta_equal(const struct bpf_map *meta0, - const struct bpf_map *meta1); void *bpf_map_fd_get_ptr(struct bpf_map *map, struct file *map_file, int ufd); void bpf_map_fd_put_ptr(void *ptr); diff --git a/kernel/bpf/map_iter.c b/kernel/bpf/map_iter.c index af86048e5afd..6a9542af4212 100644 --- a/kernel/bpf/map_iter.c +++ b/kernel/bpf/map_iter.c @@ -149,6 +149,19 @@ static void bpf_iter_detach_map(struct bpf_iter_aux_info *aux) bpf_map_put_with_uref(aux->map); } +void bpf_iter_map_show_fdinfo(const struct bpf_iter_aux_info *aux, + struct seq_file *seq) +{ + seq_printf(seq, "map_id:\t%u\n", aux->map->id); +} + +int bpf_iter_map_fill_link_info(const struct bpf_iter_aux_info *aux, + struct bpf_link_info *info) +{ + info->iter.map.map_id = aux->map->id; + return 0; +} + DEFINE_BPF_ITER_FUNC(bpf_map_elem, struct bpf_iter_meta *meta, struct bpf_map *map, void *key, void *value) @@ -156,6 +169,8 @@ static const struct bpf_iter_reg bpf_map_elem_reg_info = { .target = "bpf_map_elem", .attach_target = bpf_iter_attach_map, .detach_target = bpf_iter_detach_map, + .show_fdinfo = bpf_iter_map_show_fdinfo, + .fill_link_info = bpf_iter_map_fill_link_info, .ctx_arg_info_size = 2, .ctx_arg_info = { { offsetof(struct bpf_iter__bpf_map_elem, key), diff --git a/kernel/bpf/preload/Kconfig b/kernel/bpf/preload/Kconfig new file mode 100644 index 000000000000..ace49111d3a3 --- /dev/null +++ b/kernel/bpf/preload/Kconfig @@ -0,0 +1,26 @@ +# SPDX-License-Identifier: GPL-2.0-only +config USERMODE_DRIVER + bool + default n + +menuconfig BPF_PRELOAD + bool "Preload BPF file system with kernel specific program and map iterators" + depends on BPF + # The dependency on !COMPILE_TEST prevents it from being enabled + # in allmodconfig or allyesconfig configurations + depends on !COMPILE_TEST + select USERMODE_DRIVER + help + This 
builds kernel module with several embedded BPF programs that are + pinned into BPF FS mount point as human readable files that are + useful in debugging and introspection of BPF programs and maps. + +if BPF_PRELOAD +config BPF_PRELOAD_UMD + tristate "bpf_preload kernel module with user mode driver" + depends on CC_CAN_LINK + depends on m || CC_CAN_LINK_STATIC + default m + help + This builds bpf_preload kernel module with embedded user mode driver. +endif diff --git a/kernel/bpf/preload/Makefile b/kernel/bpf/preload/Makefile new file mode 100644 index 000000000000..12c7b62b9b6e --- /dev/null +++ b/kernel/bpf/preload/Makefile @@ -0,0 +1,23 @@ +# SPDX-License-Identifier: GPL-2.0 + +LIBBPF_SRCS = $(srctree)/tools/lib/bpf/ +LIBBPF_A = $(obj)/libbpf.a +LIBBPF_OUT = $(abspath $(obj)) + +$(LIBBPF_A): + $(Q)$(MAKE) -C $(LIBBPF_SRCS) OUTPUT=$(LIBBPF_OUT)/ $(LIBBPF_OUT)/libbpf.a + +userccflags += -I $(srctree)/tools/include/ -I $(srctree)/tools/include/uapi \ + -I $(srctree)/tools/lib/ -Wno-unused-result + +userprogs := bpf_preload_umd + +bpf_preload_umd-objs := iterators/iterators.o +bpf_preload_umd-userldlibs := $(LIBBPF_A) -lelf -lz + +$(obj)/bpf_preload_umd: $(LIBBPF_A) + +$(obj)/bpf_preload_umd_blob.o: $(obj)/bpf_preload_umd + +obj-$(CONFIG_BPF_PRELOAD_UMD) += bpf_preload.o +bpf_preload-objs += bpf_preload_kern.o bpf_preload_umd_blob.o diff --git a/kernel/bpf/preload/bpf_preload.h b/kernel/bpf/preload/bpf_preload.h new file mode 100644 index 000000000000..2f9932276f2e --- /dev/null +++ b/kernel/bpf/preload/bpf_preload.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _BPF_PRELOAD_H +#define _BPF_PRELOAD_H + +#include <linux/usermode_driver.h> +#include "iterators/bpf_preload_common.h" + +struct bpf_preload_ops { + struct umd_info info; + int (*preload)(struct bpf_preload_info *); + int (*finish)(void); + struct module *owner; +}; +extern struct bpf_preload_ops *bpf_preload_ops; +#define BPF_PRELOAD_LINKS 2 +#endif diff --git a/kernel/bpf/preload/bpf_preload_kern.c b/kernel/bpf/preload/bpf_preload_kern.c new file mode 100644 index 000000000000..79c5772465f1 --- /dev/null +++ b/kernel/bpf/preload/bpf_preload_kern.c @@ -0,0 +1,91 @@ +// SPDX-License-Identifier: GPL-2.0 +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt +#include <linux/init.h> +#include <linux/module.h> +#include <linux/pid.h> +#include <linux/fs.h> +#include <linux/sched/signal.h> +#include "bpf_preload.h" + +extern char bpf_preload_umd_start; +extern char bpf_preload_umd_end; + +static int preload(struct bpf_preload_info *obj); +static int finish(void); + +static struct bpf_preload_ops umd_ops = { + .info.driver_name = "bpf_preload", + .preload = preload, + .finish = finish, + .owner = THIS_MODULE, +}; + +static int preload(struct bpf_preload_info *obj) +{ + int magic = BPF_PRELOAD_START; + loff_t pos = 0; + int i, err; + ssize_t n; + + err = fork_usermode_driver(&umd_ops.info); + if (err) + return err; + + /* send the start magic to let UMD proceed with loading BPF progs */ + n = kernel_write(umd_ops.info.pipe_to_umh, + &magic, sizeof(magic), &pos); + if (n != sizeof(magic)) + return -EPIPE; + + /* receive bpf_link IDs and names from UMD */ + pos = 0; + for (i = 0; i < BPF_PRELOAD_LINKS; i++) { + n = kernel_read(umd_ops.info.pipe_from_umh, + &obj[i], sizeof(*obj), &pos); + if (n != sizeof(*obj)) + return -EPIPE; + } + return 0; +} + +static int finish(void) +{ + int magic = BPF_PRELOAD_END; + struct pid *tgid; + loff_t pos = 0; + ssize_t n; + + /* send the last magic to UMD. It will do a normal exit. 
*/ + n = kernel_write(umd_ops.info.pipe_to_umh, + &magic, sizeof(magic), &pos); + if (n != sizeof(magic)) + return -EPIPE; + tgid = umd_ops.info.tgid; + wait_event(tgid->wait_pidfd, thread_group_exited(tgid)); + umd_ops.info.tgid = NULL; + return 0; +} + +static int __init load_umd(void) +{ + int err; + + err = umd_load_blob(&umd_ops.info, &bpf_preload_umd_start, + &bpf_preload_umd_end - &bpf_preload_umd_start); + if (err) + return err; + bpf_preload_ops = &umd_ops; + return err; +} + +static void __exit fini_umd(void) +{ + bpf_preload_ops = NULL; + /* kill UMD in case it's still there due to earlier error */ + kill_pid(umd_ops.info.tgid, SIGKILL, 1); + umd_ops.info.tgid = NULL; + umd_unload_blob(&umd_ops.info); +} +late_initcall(load_umd); +module_exit(fini_umd); +MODULE_LICENSE("GPL"); diff --git a/kernel/bpf/preload/bpf_preload_umd_blob.S b/kernel/bpf/preload/bpf_preload_umd_blob.S new file mode 100644 index 000000000000..f1f40223b5c3 --- /dev/null +++ b/kernel/bpf/preload/bpf_preload_umd_blob.S @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + .section .init.rodata, "a" + .global bpf_preload_umd_start +bpf_preload_umd_start: + .incbin "kernel/bpf/preload/bpf_preload_umd" + .global bpf_preload_umd_end +bpf_preload_umd_end: diff --git a/kernel/bpf/preload/iterators/.gitignore b/kernel/bpf/preload/iterators/.gitignore new file mode 100644 index 000000000000..ffdb70230c8b --- /dev/null +++ b/kernel/bpf/preload/iterators/.gitignore @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0-only +/.output diff --git a/kernel/bpf/preload/iterators/Makefile b/kernel/bpf/preload/iterators/Makefile new file mode 100644 index 000000000000..28fa8c1440f4 --- /dev/null +++ b/kernel/bpf/preload/iterators/Makefile @@ -0,0 +1,57 @@ +# SPDX-License-Identifier: GPL-2.0 +OUTPUT := .output +CLANG ?= clang +LLC ?= llc +LLVM_STRIP ?= llvm-strip +DEFAULT_BPFTOOL := $(OUTPUT)/sbin/bpftool +BPFTOOL ?= $(DEFAULT_BPFTOOL) +LIBBPF_SRC := $(abspath ../../../../tools/lib/bpf) +BPFOBJ := $(OUTPUT)/libbpf.a +BPF_INCLUDE := $(OUTPUT) +INCLUDES := -I$(OUTPUT) -I$(BPF_INCLUDE) -I$(abspath ../../../../tools/lib) \ + -I$(abspath ../../../../tools/include/uapi) +CFLAGS := -g -Wall + +abs_out := $(abspath $(OUTPUT)) +ifeq ($(V),1) +Q = +msg = +else +Q = @ +msg = @printf ' %-8s %s%s\n' "$(1)" "$(notdir $(2))" "$(if $(3), $(3))"; +MAKEFLAGS += --no-print-directory +submake_extras := feature_display=0 +endif + +.DELETE_ON_ERROR: + +.PHONY: all clean + +all: iterators.skel.h + +clean: + $(call msg,CLEAN) + $(Q)rm -rf $(OUTPUT) iterators + +iterators.skel.h: $(OUTPUT)/iterators.bpf.o | $(BPFTOOL) + $(call msg,GEN-SKEL,$@) + $(Q)$(BPFTOOL) gen skeleton $< > $@ + + +$(OUTPUT)/iterators.bpf.o: iterators.bpf.c $(BPFOBJ) | $(OUTPUT) + $(call msg,BPF,$@) + $(Q)$(CLANG) -g -O2 -target bpf $(INCLUDES) \ + -c $(filter %.c,$^) -o $@ && \ + $(LLVM_STRIP) -g $@ + +$(OUTPUT): + $(call msg,MKDIR,$@) + $(Q)mkdir -p $(OUTPUT) + +$(BPFOBJ): $(wildcard $(LIBBPF_SRC)/*.[ch] $(LIBBPF_SRC)/Makefile) | $(OUTPUT) + $(Q)$(MAKE) $(submake_extras) -C $(LIBBPF_SRC) \ + OUTPUT=$(abspath $(dir $@))/ $(abspath $@) + +$(DEFAULT_BPFTOOL): + $(Q)$(MAKE) $(submake_extras) -C ../../../../tools/bpf/bpftool \ + prefix= OUTPUT=$(abs_out)/ DESTDIR=$(abs_out) install diff --git a/kernel/bpf/preload/iterators/README b/kernel/bpf/preload/iterators/README new file mode 100644 index 000000000000..7fd6d39a9ad2 --- /dev/null +++ b/kernel/bpf/preload/iterators/README @@ -0,0 +1,4 @@ +WARNING: +If you change "iterators.bpf.c" do "make -j" in this directory to rebuild 
"iterators.skel.h". +Make sure to have clang 10 installed. +See Documentation/bpf/bpf_devel_QA.rst diff --git a/kernel/bpf/preload/iterators/bpf_preload_common.h b/kernel/bpf/preload/iterators/bpf_preload_common.h new file mode 100644 index 000000000000..8464d1a48c05 --- /dev/null +++ b/kernel/bpf/preload/iterators/bpf_preload_common.h @@ -0,0 +1,13 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _BPF_PRELOAD_COMMON_H +#define _BPF_PRELOAD_COMMON_H + +#define BPF_PRELOAD_START 0x5555 +#define BPF_PRELOAD_END 0xAAAA + +struct bpf_preload_info { + char link_name[16]; + int link_id; +}; + +#endif diff --git a/kernel/bpf/preload/iterators/iterators.bpf.c b/kernel/bpf/preload/iterators/iterators.bpf.c new file mode 100644 index 000000000000..5ded550b2ed6 --- /dev/null +++ b/kernel/bpf/preload/iterators/iterators.bpf.c @@ -0,0 +1,114 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> +#include <bpf/bpf_core_read.h> + +#pragma clang attribute push (__attribute__((preserve_access_index)), apply_to = record) +struct seq_file; +struct bpf_iter_meta { + struct seq_file *seq; + __u64 session_id; + __u64 seq_num; +}; + +struct bpf_map { + __u32 id; + char name[16]; + __u32 max_entries; +}; + +struct bpf_iter__bpf_map { + struct bpf_iter_meta *meta; + struct bpf_map *map; +}; + +struct btf_type { + __u32 name_off; +}; + +struct btf_header { + __u32 str_len; +}; + +struct btf { + const char *strings; + struct btf_type **types; + struct btf_header hdr; +}; + +struct bpf_prog_aux { + __u32 id; + char name[16]; + const char *attach_func_name; + struct bpf_prog *linked_prog; + struct bpf_func_info *func_info; + struct btf *btf; +}; + +struct bpf_prog { + struct bpf_prog_aux *aux; +}; + +struct bpf_iter__bpf_prog { + struct bpf_iter_meta *meta; + struct bpf_prog *prog; +}; +#pragma clang attribute pop + +static const char *get_name(struct btf *btf, long btf_id, const char *fallback) +{ + struct btf_type **types, *t; + unsigned int name_off; + const char *str; + + if (!btf) + return fallback; + str = btf->strings; + types = btf->types; + bpf_probe_read_kernel(&t, sizeof(t), types + btf_id); + name_off = BPF_CORE_READ(t, name_off); + if (name_off >= btf->hdr.str_len) + return fallback; + return str + name_off; +} + +SEC("iter/bpf_map") +int dump_bpf_map(struct bpf_iter__bpf_map *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + __u64 seq_num = ctx->meta->seq_num; + struct bpf_map *map = ctx->map; + + if (!map) + return 0; + + if (seq_num == 0) + BPF_SEQ_PRINTF(seq, " id name max_entries\n"); + + BPF_SEQ_PRINTF(seq, "%4u %-16s%6d\n", map->id, map->name, map->max_entries); + return 0; +} + +SEC("iter/bpf_prog") +int dump_bpf_prog(struct bpf_iter__bpf_prog *ctx) +{ + struct seq_file *seq = ctx->meta->seq; + __u64 seq_num = ctx->meta->seq_num; + struct bpf_prog *prog = ctx->prog; + struct bpf_prog_aux *aux; + + if (!prog) + return 0; + + aux = prog->aux; + if (seq_num == 0) + BPF_SEQ_PRINTF(seq, " id name attached\n"); + + BPF_SEQ_PRINTF(seq, "%4u %-16s %s %s\n", aux->id, + get_name(aux->btf, aux->func_info[0].type_id, aux->name), + aux->attach_func_name, aux->linked_prog->aux->name); + return 0; +} +char LICENSE[] SEC("license") = "GPL"; diff --git a/kernel/bpf/preload/iterators/iterators.c b/kernel/bpf/preload/iterators/iterators.c new file mode 100644 index 000000000000..b7ff87939172 --- /dev/null +++ b/kernel/bpf/preload/iterators/iterators.c @@ -0,0 +1,94 @@ +// SPDX-License-Identifier: GPL-2.0 +/* 
Copyright (c) 2020 Facebook */ +#include <argp.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/resource.h> +#include <bpf/libbpf.h> +#include <bpf/bpf.h> +#include <sys/mount.h> +#include "iterators.skel.h" +#include "bpf_preload_common.h" + +int to_kernel = -1; +int from_kernel = 0; + +static int send_link_to_kernel(struct bpf_link *link, const char *link_name) +{ + struct bpf_preload_info obj = {}; + struct bpf_link_info info = {}; + __u32 info_len = sizeof(info); + int err; + + err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &info, &info_len); + if (err) + return err; + obj.link_id = info.id; + if (strlen(link_name) >= sizeof(obj.link_name)) + return -E2BIG; + strcpy(obj.link_name, link_name); + if (write(to_kernel, &obj, sizeof(obj)) != sizeof(obj)) + return -EPIPE; + return 0; +} + +int main(int argc, char **argv) +{ + struct rlimit rlim = { RLIM_INFINITY, RLIM_INFINITY }; + struct iterators_bpf *skel; + int err, magic; + int debug_fd; + + debug_fd = open("/dev/console", O_WRONLY | O_NOCTTY | O_CLOEXEC); + if (debug_fd < 0) + return 1; + to_kernel = dup(1); + close(1); + dup(debug_fd); + /* now stdout points to /dev/console */ + + read(from_kernel, &magic, sizeof(magic)); + if (magic != BPF_PRELOAD_START) { + printf("bad start magic %d\n", magic); + return 1; + } + setrlimit(RLIMIT_MEMLOCK, &rlim); + /* libbpf opens BPF object and loads it into the kernel */ + skel = iterators_bpf__open_and_load(); + if (!skel) { + /* iterators.skel.h is little endian. + * libbpf doesn't support automatic little->big conversion + * of BPF bytecode yet. + * The program load will fail in such a case. + */ + printf("Failed load could be due to wrong endianness\n"); + return 1; + } + err = iterators_bpf__attach(skel); + if (err) + goto cleanup; + + /* send two bpf_link IDs with names to the kernel */ + err = send_link_to_kernel(skel->links.dump_bpf_map, "maps.debug"); + if (err) + goto cleanup; + err = send_link_to_kernel(skel->links.dump_bpf_prog, "progs.debug"); + if (err) + goto cleanup; + + /* The kernel will proceed with pinning the links in bpffs. + * UMD will wait on read from pipe. + */ + read(from_kernel, &magic, sizeof(magic)); + if (magic != BPF_PRELOAD_END) { + printf("bad final magic %d\n", magic); + err = -EINVAL; + } +cleanup: + iterators_bpf__destroy(skel); + + return err != 0; +} diff --git a/kernel/bpf/preload/iterators/iterators.skel.h b/kernel/bpf/preload/iterators/iterators.skel.h new file mode 100644 index 000000000000..c3171357dc4f --- /dev/null +++ b/kernel/bpf/preload/iterators/iterators.skel.h @@ -0,0 +1,410 @@ +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */ + +/* THIS FILE IS AUTOGENERATED!
*/ +#ifndef __ITERATORS_BPF_SKEL_H__ +#define __ITERATORS_BPF_SKEL_H__ + +#include <stdlib.h> +#include <bpf/libbpf.h> + +struct iterators_bpf { + struct bpf_object_skeleton *skeleton; + struct bpf_object *obj; + struct { + struct bpf_map *rodata; + } maps; + struct { + struct bpf_program *dump_bpf_map; + struct bpf_program *dump_bpf_prog; + } progs; + struct { + struct bpf_link *dump_bpf_map; + struct bpf_link *dump_bpf_prog; + } links; + struct iterators_bpf__rodata { + char dump_bpf_map____fmt[35]; + char dump_bpf_map____fmt_1[14]; + char dump_bpf_prog____fmt[32]; + char dump_bpf_prog____fmt_2[17]; + } *rodata; +}; + +static void +iterators_bpf__destroy(struct iterators_bpf *obj) +{ + if (!obj) + return; + if (obj->skeleton) + bpf_object__destroy_skeleton(obj->skeleton); + free(obj); +} + +static inline int +iterators_bpf__create_skeleton(struct iterators_bpf *obj); + +static inline struct iterators_bpf * +iterators_bpf__open_opts(const struct bpf_object_open_opts *opts) +{ + struct iterators_bpf *obj; + + obj = (typeof(obj))calloc(1, sizeof(*obj)); + if (!obj) + return NULL; + if (iterators_bpf__create_skeleton(obj)) + goto err; + if (bpf_object__open_skeleton(obj->skeleton, opts)) + goto err; + + return obj; +err: + iterators_bpf__destroy(obj); + return NULL; +} + +static inline struct iterators_bpf * +iterators_bpf__open(void) +{ + return iterators_bpf__open_opts(NULL); +} + +static inline int +iterators_bpf__load(struct iterators_bpf *obj) +{ + return bpf_object__load_skeleton(obj->skeleton); +} + +static inline struct iterators_bpf * +iterators_bpf__open_and_load(void) +{ + struct iterators_bpf *obj; + + obj = iterators_bpf__open(); + if (!obj) + return NULL; + if (iterators_bpf__load(obj)) { + iterators_bpf__destroy(obj); + return NULL; + } + return obj; +} + +static inline int +iterators_bpf__attach(struct iterators_bpf *obj) +{ + return bpf_object__attach_skeleton(obj->skeleton); +} + +static inline void +iterators_bpf__detach(struct iterators_bpf *obj) +{ + return bpf_object__detach_skeleton(obj->skeleton); +} + +static inline int +iterators_bpf__create_skeleton(struct iterators_bpf *obj) +{ + struct bpf_object_skeleton *s; + + s = (typeof(s))calloc(1, sizeof(*s)); + if (!s) + return -1; + obj->skeleton = s; + + s->sz = sizeof(*s); + s->name = "iterators_bpf"; + s->obj = &obj->obj; + + /* maps */ + s->map_cnt = 1; + s->map_skel_sz = sizeof(*s->maps); + s->maps = (typeof(s->maps))calloc(s->map_cnt, s->map_skel_sz); + if (!s->maps) + goto err; + + s->maps[0].name = "iterator.rodata"; + s->maps[0].map = &obj->maps.rodata; + s->maps[0].mmaped = (void **)&obj->rodata; + + /* programs */ + s->prog_cnt = 2; + s->prog_skel_sz = sizeof(*s->progs); + s->progs = (typeof(s->progs))calloc(s->prog_cnt, s->prog_skel_sz); + if (!s->progs) + goto err; + + s->progs[0].name = "dump_bpf_map"; + s->progs[0].prog = &obj->progs.dump_bpf_map; + s->progs[0].link = &obj->links.dump_bpf_map; + + s->progs[1].name = "dump_bpf_prog"; + s->progs[1].prog = &obj->progs.dump_bpf_prog; + s->progs[1].link = &obj->links.dump_bpf_prog; + + s->data_sz = 7128; + s->data = (void *)"\ +\x7f\x45\x4c\x46\x02\x01\x01\0\0\0\0\0\0\0\0\0\x01\0\xf7\0\x01\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\x18\x18\0\0\0\0\0\0\0\0\0\0\x40\0\0\0\0\0\x40\0\x0f\0\ +\x0e\0\x79\x12\0\0\0\0\0\0\x79\x26\0\0\0\0\0\0\x79\x17\x08\0\0\0\0\0\x15\x07\ +\x1a\0\0\0\0\0\x79\x21\x10\0\0\0\0\0\x55\x01\x08\0\0\0\0\0\xbf\xa4\0\0\0\0\0\0\ +\x07\x04\0\0\xe8\xff\xff\xff\xbf\x61\0\0\0\0\0\0\x18\x02\0\0\0\0\0\0\0\0\0\0\0\ 
+\0\0\0\xb7\x03\0\0\x23\0\0\0\xb7\x05\0\0\0\0\0\0\x85\0\0\0\x7e\0\0\0\x61\x71\0\ +\0\0\0\0\0\x7b\x1a\xe8\xff\0\0\0\0\xb7\x01\0\0\x04\0\0\0\xbf\x72\0\0\0\0\0\0\ +\x0f\x12\0\0\0\0\0\0\x7b\x2a\xf0\xff\0\0\0\0\x61\x71\x14\0\0\0\0\0\x7b\x1a\xf8\ +\xff\0\0\0\0\xbf\xa4\0\0\0\0\0\0\x07\x04\0\0\xe8\xff\xff\xff\xbf\x61\0\0\0\0\0\ +\0\x18\x02\0\0\x23\0\0\0\0\0\0\0\0\0\0\0\xb7\x03\0\0\x0e\0\0\0\xb7\x05\0\0\x18\ +\0\0\0\x85\0\0\0\x7e\0\0\0\xb7\0\0\0\0\0\0\0\x95\0\0\0\0\0\0\0\x79\x12\0\0\0\0\ +\0\0\x79\x26\0\0\0\0\0\0\x79\x11\x08\0\0\0\0\0\x15\x01\x3b\0\0\0\0\0\x79\x17\0\ +\0\0\0\0\0\x79\x21\x10\0\0\0\0\0\x55\x01\x08\0\0\0\0\0\xbf\xa4\0\0\0\0\0\0\x07\ +\x04\0\0\xd0\xff\xff\xff\xbf\x61\0\0\0\0\0\0\x18\x02\0\0\x31\0\0\0\0\0\0\0\0\0\ +\0\0\xb7\x03\0\0\x20\0\0\0\xb7\x05\0\0\0\0\0\0\x85\0\0\0\x7e\0\0\0\x7b\x6a\xc8\ +\xff\0\0\0\0\x61\x71\0\0\0\0\0\0\x7b\x1a\xd0\xff\0\0\0\0\xb7\x03\0\0\x04\0\0\0\ +\xbf\x79\0\0\0\0\0\0\x0f\x39\0\0\0\0\0\0\x79\x71\x28\0\0\0\0\0\x79\x78\x30\0\0\ +\0\0\0\x15\x08\x18\0\0\0\0\0\xb7\x02\0\0\0\0\0\0\x0f\x21\0\0\0\0\0\0\x61\x11\ +\x04\0\0\0\0\0\x79\x83\x08\0\0\0\0\0\x67\x01\0\0\x03\0\0\0\x0f\x13\0\0\0\0\0\0\ +\x79\x86\0\0\0\0\0\0\xbf\xa1\0\0\0\0\0\0\x07\x01\0\0\xf8\xff\xff\xff\xb7\x02\0\ +\0\x08\0\0\0\x85\0\0\0\x71\0\0\0\xb7\x01\0\0\0\0\0\0\x79\xa3\xf8\xff\0\0\0\0\ +\x0f\x13\0\0\0\0\0\0\xbf\xa1\0\0\0\0\0\0\x07\x01\0\0\xf4\xff\xff\xff\xb7\x02\0\ +\0\x04\0\0\0\x85\0\0\0\x04\0\0\0\xb7\x03\0\0\x04\0\0\0\x61\xa1\xf4\xff\0\0\0\0\ +\x61\x82\x10\0\0\0\0\0\x3d\x21\x02\0\0\0\0\0\x0f\x16\0\0\0\0\0\0\xbf\x69\0\0\0\ +\0\0\0\x7b\x9a\xd8\xff\0\0\0\0\x79\x71\x18\0\0\0\0\0\x7b\x1a\xe0\xff\0\0\0\0\ +\x79\x71\x20\0\0\0\0\0\x79\x11\0\0\0\0\0\0\x0f\x31\0\0\0\0\0\0\x7b\x1a\xe8\xff\ +\0\0\0\0\xbf\xa4\0\0\0\0\0\0\x07\x04\0\0\xd0\xff\xff\xff\x79\xa1\xc8\xff\0\0\0\ +\0\x18\x02\0\0\x51\0\0\0\0\0\0\0\0\0\0\0\xb7\x03\0\0\x11\0\0\0\xb7\x05\0\0\x20\ +\0\0\0\x85\0\0\0\x7e\0\0\0\xb7\0\0\0\0\0\0\0\x95\0\0\0\0\0\0\0\x20\x20\x69\x64\ +\x20\x6e\x61\x6d\x65\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x6d\ +\x61\x78\x5f\x65\x6e\x74\x72\x69\x65\x73\x0a\0\x25\x34\x75\x20\x25\x2d\x31\x36\ +\x73\x25\x36\x64\x0a\0\x20\x20\x69\x64\x20\x6e\x61\x6d\x65\x20\x20\x20\x20\x20\ +\x20\x20\x20\x20\x20\x20\x20\x20\x61\x74\x74\x61\x63\x68\x65\x64\x0a\0\x25\x34\ +\x75\x20\x25\x2d\x31\x36\x73\x20\x25\x73\x20\x25\x73\x0a\0\x47\x50\x4c\0\x9f\ +\xeb\x01\0\x18\0\0\0\0\0\0\0\x1c\x04\0\0\x1c\x04\0\0\0\x05\0\0\0\0\0\0\0\0\0\ +\x02\x02\0\0\0\x01\0\0\0\x02\0\0\x04\x10\0\0\0\x13\0\0\0\x03\0\0\0\0\0\0\0\x18\ +\0\0\0\x04\0\0\0\x40\0\0\0\0\0\0\0\0\0\0\x02\x08\0\0\0\0\0\0\0\0\0\0\x02\x0d\0\ +\0\0\0\0\0\0\x01\0\0\x0d\x06\0\0\0\x1c\0\0\0\x01\0\0\0\x20\0\0\0\0\0\0\x01\x04\ +\0\0\0\x20\0\0\x01\x24\0\0\0\x01\0\0\x0c\x05\0\0\0\xa3\0\0\0\x03\0\0\x04\x18\0\ +\0\0\xb1\0\0\0\x09\0\0\0\0\0\0\0\xb5\0\0\0\x0b\0\0\0\x40\0\0\0\xc0\0\0\0\x0b\0\ +\0\0\x80\0\0\0\0\0\0\0\0\0\0\x02\x0a\0\0\0\xc8\0\0\0\0\0\0\x07\0\0\0\0\xd1\0\0\ +\0\0\0\0\x08\x0c\0\0\0\xd7\0\0\0\0\0\0\x01\x08\0\0\0\x40\0\0\0\x98\x01\0\0\x03\ +\0\0\x04\x18\0\0\0\xa0\x01\0\0\x0e\0\0\0\0\0\0\0\xa3\x01\0\0\x11\0\0\0\x20\0\0\ +\0\xa8\x01\0\0\x0e\0\0\0\xa0\0\0\0\xb4\x01\0\0\0\0\0\x08\x0f\0\0\0\xba\x01\0\0\ +\0\0\0\x01\x04\0\0\0\x20\0\0\0\xc7\x01\0\0\0\0\0\x01\x01\0\0\0\x08\0\0\x01\0\0\ +\0\0\0\0\0\x03\0\0\0\0\x10\0\0\0\x12\0\0\0\x10\0\0\0\xcc\x01\0\0\0\0\0\x01\x04\ +\0\0\0\x20\0\0\0\0\0\0\0\0\0\0\x02\x14\0\0\0\x30\x02\0\0\x02\0\0\x04\x10\0\0\0\ +\x13\0\0\0\x03\0\0\0\0\0\0\0\x43\x02\0\0\x15\0\0\0\x40\0\0\0\0\0\0\0\0\0\0\x02\ +\x18\0\0\0\0\0\0\0\x01\0\0\x0d\x06\0\0\0\x1c\0\0\0\x13\0\0\0\x48\x02\0\0\x01\0\ 
+\0\x0c\x16\0\0\0\x94\x02\0\0\x01\0\0\x04\x08\0\0\0\x9d\x02\0\0\x19\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\x02\x1a\0\0\0\xee\x02\0\0\x06\0\0\x04\x38\0\0\0\xa0\x01\0\0\ +\x0e\0\0\0\0\0\0\0\xa3\x01\0\0\x11\0\0\0\x20\0\0\0\xfb\x02\0\0\x1b\0\0\0\xc0\0\ +\0\0\x0c\x03\0\0\x15\0\0\0\0\x01\0\0\x18\x03\0\0\x1d\0\0\0\x40\x01\0\0\x22\x03\ +\0\0\x1e\0\0\0\x80\x01\0\0\0\0\0\0\0\0\0\x02\x1c\0\0\0\0\0\0\0\0\0\0\x0a\x10\0\ +\0\0\0\0\0\0\0\0\0\x02\x1f\0\0\0\0\0\0\0\0\0\0\x02\x20\0\0\0\x6c\x03\0\0\x02\0\ +\0\x04\x08\0\0\0\x7a\x03\0\0\x0e\0\0\0\0\0\0\0\x83\x03\0\0\x0e\0\0\0\x20\0\0\0\ +\x22\x03\0\0\x03\0\0\x04\x18\0\0\0\x8d\x03\0\0\x1b\0\0\0\0\0\0\0\x95\x03\0\0\ +\x21\0\0\0\x40\0\0\0\x9b\x03\0\0\x23\0\0\0\x80\0\0\0\0\0\0\0\0\0\0\x02\x22\0\0\ +\0\0\0\0\0\0\0\0\x02\x24\0\0\0\x9f\x03\0\0\x01\0\0\x04\x04\0\0\0\xaa\x03\0\0\ +\x0e\0\0\0\0\0\0\0\x13\x04\0\0\x01\0\0\x04\x04\0\0\0\x1c\x04\0\0\x0e\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x1c\0\0\0\x12\0\0\0\x23\0\0\0\x92\x04\0\0\0\0\0\ +\x0e\x25\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x1c\0\0\0\x12\0\0\0\x0e\0\0\0\ +\xa6\x04\0\0\0\0\0\x0e\x27\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\0\0\0\x1c\0\0\0\ +\x12\0\0\0\x20\0\0\0\xbc\x04\0\0\0\0\0\x0e\x29\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\ +\0\0\0\0\x1c\0\0\0\x12\0\0\0\x11\0\0\0\xd1\x04\0\0\0\0\0\x0e\x2b\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\x03\0\0\0\0\x10\0\0\0\x12\0\0\0\x04\0\0\0\xe8\x04\0\0\0\0\0\x0e\ +\x2d\0\0\0\x01\0\0\0\xf0\x04\0\0\x04\0\0\x0f\0\0\0\0\x26\0\0\0\0\0\0\0\x23\0\0\ +\0\x28\0\0\0\x23\0\0\0\x0e\0\0\0\x2a\0\0\0\x31\0\0\0\x20\0\0\0\x2c\0\0\0\x51\0\ +\0\0\x11\0\0\0\xf8\x04\0\0\x01\0\0\x0f\0\0\0\0\x2e\0\0\0\0\0\0\0\x04\0\0\0\0\ +\x62\x70\x66\x5f\x69\x74\x65\x72\x5f\x5f\x62\x70\x66\x5f\x6d\x61\x70\0\x6d\x65\ +\x74\x61\0\x6d\x61\x70\0\x63\x74\x78\0\x69\x6e\x74\0\x64\x75\x6d\x70\x5f\x62\ +\x70\x66\x5f\x6d\x61\x70\0\x69\x74\x65\x72\x2f\x62\x70\x66\x5f\x6d\x61\x70\0\ +\x30\x3a\x30\0\x2f\x77\x2f\x6e\x65\x74\x2d\x6e\x65\x78\x74\x2f\x6b\x65\x72\x6e\ +\x65\x6c\x2f\x62\x70\x66\x2f\x70\x72\x65\x6c\x6f\x61\x64\x2f\x69\x74\x65\x72\ +\x61\x74\x6f\x72\x73\x2f\x69\x74\x65\x72\x61\x74\x6f\x72\x73\x2e\x62\x70\x66\ +\x2e\x63\0\x09\x73\x74\x72\x75\x63\x74\x20\x73\x65\x71\x5f\x66\x69\x6c\x65\x20\ +\x2a\x73\x65\x71\x20\x3d\x20\x63\x74\x78\x2d\x3e\x6d\x65\x74\x61\x2d\x3e\x73\ +\x65\x71\x3b\0\x62\x70\x66\x5f\x69\x74\x65\x72\x5f\x6d\x65\x74\x61\0\x73\x65\ +\x71\0\x73\x65\x73\x73\x69\x6f\x6e\x5f\x69\x64\0\x73\x65\x71\x5f\x6e\x75\x6d\0\ +\x73\x65\x71\x5f\x66\x69\x6c\x65\0\x5f\x5f\x75\x36\x34\0\x6c\x6f\x6e\x67\x20\ +\x6c\x6f\x6e\x67\x20\x75\x6e\x73\x69\x67\x6e\x65\x64\x20\x69\x6e\x74\0\x30\x3a\ +\x31\0\x09\x73\x74\x72\x75\x63\x74\x20\x62\x70\x66\x5f\x6d\x61\x70\x20\x2a\x6d\ +\x61\x70\x20\x3d\x20\x63\x74\x78\x2d\x3e\x6d\x61\x70\x3b\0\x09\x69\x66\x20\x28\ +\x21\x6d\x61\x70\x29\0\x30\x3a\x32\0\x09\x5f\x5f\x75\x36\x34\x20\x73\x65\x71\ +\x5f\x6e\x75\x6d\x20\x3d\x20\x63\x74\x78\x2d\x3e\x6d\x65\x74\x61\x2d\x3e\x73\ +\x65\x71\x5f\x6e\x75\x6d\x3b\0\x09\x69\x66\x20\x28\x73\x65\x71\x5f\x6e\x75\x6d\ +\x20\x3d\x3d\x20\x30\x29\0\x09\x09\x42\x50\x46\x5f\x53\x45\x51\x5f\x50\x52\x49\ +\x4e\x54\x46\x28\x73\x65\x71\x2c\x20\x22\x20\x20\x69\x64\x20\x6e\x61\x6d\x65\ +\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x6d\x61\x78\x5f\x65\x6e\ +\x74\x72\x69\x65\x73\x5c\x6e\x22\x29\x3b\0\x62\x70\x66\x5f\x6d\x61\x70\0\x69\ +\x64\0\x6e\x61\x6d\x65\0\x6d\x61\x78\x5f\x65\x6e\x74\x72\x69\x65\x73\0\x5f\x5f\ +\x75\x33\x32\0\x75\x6e\x73\x69\x67\x6e\x65\x64\x20\x69\x6e\x74\0\x63\x68\x61\ +\x72\0\x5f\x5f\x41\x52\x52\x41\x59\x5f\x53\x49\x5a\x45\x5f\x54\x59\x50\x45\x5f\ 
+\x5f\0\x09\x42\x50\x46\x5f\x53\x45\x51\x5f\x50\x52\x49\x4e\x54\x46\x28\x73\x65\ +\x71\x2c\x20\x22\x25\x34\x75\x20\x25\x2d\x31\x36\x73\x25\x36\x64\x5c\x6e\x22\ +\x2c\x20\x6d\x61\x70\x2d\x3e\x69\x64\x2c\x20\x6d\x61\x70\x2d\x3e\x6e\x61\x6d\ +\x65\x2c\x20\x6d\x61\x70\x2d\x3e\x6d\x61\x78\x5f\x65\x6e\x74\x72\x69\x65\x73\ +\x29\x3b\0\x7d\0\x62\x70\x66\x5f\x69\x74\x65\x72\x5f\x5f\x62\x70\x66\x5f\x70\ +\x72\x6f\x67\0\x70\x72\x6f\x67\0\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x70\x72\ +\x6f\x67\0\x69\x74\x65\x72\x2f\x62\x70\x66\x5f\x70\x72\x6f\x67\0\x09\x73\x74\ +\x72\x75\x63\x74\x20\x62\x70\x66\x5f\x70\x72\x6f\x67\x20\x2a\x70\x72\x6f\x67\ +\x20\x3d\x20\x63\x74\x78\x2d\x3e\x70\x72\x6f\x67\x3b\0\x09\x69\x66\x20\x28\x21\ +\x70\x72\x6f\x67\x29\0\x62\x70\x66\x5f\x70\x72\x6f\x67\0\x61\x75\x78\0\x09\x61\ +\x75\x78\x20\x3d\x20\x70\x72\x6f\x67\x2d\x3e\x61\x75\x78\x3b\0\x09\x09\x42\x50\ +\x46\x5f\x53\x45\x51\x5f\x50\x52\x49\x4e\x54\x46\x28\x73\x65\x71\x2c\x20\x22\ +\x20\x20\x69\x64\x20\x6e\x61\x6d\x65\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\ +\x20\x20\x20\x61\x74\x74\x61\x63\x68\x65\x64\x5c\x6e\x22\x29\x3b\0\x62\x70\x66\ +\x5f\x70\x72\x6f\x67\x5f\x61\x75\x78\0\x61\x74\x74\x61\x63\x68\x5f\x66\x75\x6e\ +\x63\x5f\x6e\x61\x6d\x65\0\x6c\x69\x6e\x6b\x65\x64\x5f\x70\x72\x6f\x67\0\x66\ +\x75\x6e\x63\x5f\x69\x6e\x66\x6f\0\x62\x74\x66\0\x09\x42\x50\x46\x5f\x53\x45\ +\x51\x5f\x50\x52\x49\x4e\x54\x46\x28\x73\x65\x71\x2c\x20\x22\x25\x34\x75\x20\ +\x25\x2d\x31\x36\x73\x20\x25\x73\x20\x25\x73\x5c\x6e\x22\x2c\x20\x61\x75\x78\ +\x2d\x3e\x69\x64\x2c\0\x30\x3a\x34\0\x30\x3a\x35\0\x09\x69\x66\x20\x28\x21\x62\ +\x74\x66\x29\0\x62\x70\x66\x5f\x66\x75\x6e\x63\x5f\x69\x6e\x66\x6f\0\x69\x6e\ +\x73\x6e\x5f\x6f\x66\x66\0\x74\x79\x70\x65\x5f\x69\x64\0\x30\0\x73\x74\x72\x69\ +\x6e\x67\x73\0\x74\x79\x70\x65\x73\0\x68\x64\x72\0\x62\x74\x66\x5f\x68\x65\x61\ +\x64\x65\x72\0\x73\x74\x72\x5f\x6c\x65\x6e\0\x09\x74\x79\x70\x65\x73\x20\x3d\ +\x20\x62\x74\x66\x2d\x3e\x74\x79\x70\x65\x73\x3b\0\x09\x62\x70\x66\x5f\x70\x72\ +\x6f\x62\x65\x5f\x72\x65\x61\x64\x5f\x6b\x65\x72\x6e\x65\x6c\x28\x26\x74\x2c\ +\x20\x73\x69\x7a\x65\x6f\x66\x28\x74\x29\x2c\x20\x74\x79\x70\x65\x73\x20\x2b\ +\x20\x62\x74\x66\x5f\x69\x64\x29\x3b\0\x09\x73\x74\x72\x20\x3d\x20\x62\x74\x66\ +\x2d\x3e\x73\x74\x72\x69\x6e\x67\x73\x3b\0\x62\x74\x66\x5f\x74\x79\x70\x65\0\ +\x6e\x61\x6d\x65\x5f\x6f\x66\x66\0\x09\x6e\x61\x6d\x65\x5f\x6f\x66\x66\x20\x3d\ +\x20\x42\x50\x46\x5f\x43\x4f\x52\x45\x5f\x52\x45\x41\x44\x28\x74\x2c\x20\x6e\ +\x61\x6d\x65\x5f\x6f\x66\x66\x29\x3b\0\x30\x3a\x32\x3a\x30\0\x09\x69\x66\x20\ +\x28\x6e\x61\x6d\x65\x5f\x6f\x66\x66\x20\x3e\x3d\x20\x62\x74\x66\x2d\x3e\x68\ +\x64\x72\x2e\x73\x74\x72\x5f\x6c\x65\x6e\x29\0\x09\x72\x65\x74\x75\x72\x6e\x20\ +\x73\x74\x72\x20\x2b\x20\x6e\x61\x6d\x65\x5f\x6f\x66\x66\x3b\0\x30\x3a\x33\0\ +\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x6d\x61\x70\x2e\x5f\x5f\x5f\x66\x6d\x74\0\ +\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x6d\x61\x70\x2e\x5f\x5f\x5f\x66\x6d\x74\ +\x2e\x31\0\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x70\x72\x6f\x67\x2e\x5f\x5f\x5f\ +\x66\x6d\x74\0\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x70\x72\x6f\x67\x2e\x5f\x5f\ +\x5f\x66\x6d\x74\x2e\x32\0\x4c\x49\x43\x45\x4e\x53\x45\0\x2e\x72\x6f\x64\x61\ +\x74\x61\0\x6c\x69\x63\x65\x6e\x73\x65\0\x9f\xeb\x01\0\x20\0\0\0\0\0\0\0\x24\0\ +\0\0\x24\0\0\0\x44\x02\0\0\x68\x02\0\0\xa4\x01\0\0\x08\0\0\0\x31\0\0\0\x01\0\0\ +\0\0\0\0\0\x07\0\0\0\x56\x02\0\0\x01\0\0\0\0\0\0\0\x17\0\0\0\x10\0\0\0\x31\0\0\ +\0\x09\0\0\0\0\0\0\0\x42\0\0\0\x7b\0\0\0\x1e\x40\x01\0\x08\0\0\0\x42\0\0\0\x7b\ 
+\0\0\0\x24\x40\x01\0\x10\0\0\0\x42\0\0\0\xf2\0\0\0\x1d\x48\x01\0\x18\0\0\0\x42\ +\0\0\0\x13\x01\0\0\x06\x50\x01\0\x20\0\0\0\x42\0\0\0\x22\x01\0\0\x1d\x44\x01\0\ +\x28\0\0\0\x42\0\0\0\x47\x01\0\0\x06\x5c\x01\0\x38\0\0\0\x42\0\0\0\x5a\x01\0\0\ +\x03\x60\x01\0\x70\0\0\0\x42\0\0\0\xe0\x01\0\0\x02\x68\x01\0\xf0\0\0\0\x42\0\0\ +\0\x2e\x02\0\0\x01\x70\x01\0\x56\x02\0\0\x1a\0\0\0\0\0\0\0\x42\0\0\0\x7b\0\0\0\ +\x1e\x84\x01\0\x08\0\0\0\x42\0\0\0\x7b\0\0\0\x24\x84\x01\0\x10\0\0\0\x42\0\0\0\ +\x64\x02\0\0\x1f\x8c\x01\0\x18\0\0\0\x42\0\0\0\x88\x02\0\0\x06\x98\x01\0\x20\0\ +\0\0\x42\0\0\0\xa1\x02\0\0\x0e\xa4\x01\0\x28\0\0\0\x42\0\0\0\x22\x01\0\0\x1d\ +\x88\x01\0\x30\0\0\0\x42\0\0\0\x47\x01\0\0\x06\xa8\x01\0\x40\0\0\0\x42\0\0\0\ +\xb3\x02\0\0\x03\xac\x01\0\x80\0\0\0\x42\0\0\0\x26\x03\0\0\x02\xb4\x01\0\xb8\0\ +\0\0\x42\0\0\0\x61\x03\0\0\x06\x08\x01\0\xd0\0\0\0\x42\0\0\0\0\0\0\0\0\0\0\0\ +\xd8\0\0\0\x42\0\0\0\xb2\x03\0\0\x0f\x14\x01\0\xe0\0\0\0\x42\0\0\0\xc7\x03\0\0\ +\x2d\x18\x01\0\xf0\0\0\0\x42\0\0\0\xfe\x03\0\0\x0d\x10\x01\0\0\x01\0\0\x42\0\0\ +\0\0\0\0\0\0\0\0\0\x08\x01\0\0\x42\0\0\0\xc7\x03\0\0\x02\x18\x01\0\x20\x01\0\0\ +\x42\0\0\0\x25\x04\0\0\x0d\x1c\x01\0\x38\x01\0\0\x42\0\0\0\0\0\0\0\0\0\0\0\x40\ +\x01\0\0\x42\0\0\0\x25\x04\0\0\x0d\x1c\x01\0\x58\x01\0\0\x42\0\0\0\x25\x04\0\0\ +\x0d\x1c\x01\0\x60\x01\0\0\x42\0\0\0\x53\x04\0\0\x1b\x20\x01\0\x68\x01\0\0\x42\ +\0\0\0\x53\x04\0\0\x06\x20\x01\0\x70\x01\0\0\x42\0\0\0\x76\x04\0\0\x0d\x28\x01\ +\0\x78\x01\0\0\x42\0\0\0\0\0\0\0\0\0\0\0\x80\x01\0\0\x42\0\0\0\x26\x03\0\0\x02\ +\xb4\x01\0\xf8\x01\0\0\x42\0\0\0\x2e\x02\0\0\x01\xc4\x01\0\x10\0\0\0\x31\0\0\0\ +\x07\0\0\0\0\0\0\0\x02\0\0\0\x3e\0\0\0\0\0\0\0\x08\0\0\0\x08\0\0\0\x3e\0\0\0\0\ +\0\0\0\x10\0\0\0\x02\0\0\0\xee\0\0\0\0\0\0\0\x20\0\0\0\x08\0\0\0\x1e\x01\0\0\0\ +\0\0\0\x70\0\0\0\x0d\0\0\0\x3e\0\0\0\0\0\0\0\x80\0\0\0\x0d\0\0\0\xee\0\0\0\0\0\ +\0\0\xa0\0\0\0\x0d\0\0\0\x1e\x01\0\0\0\0\0\0\x56\x02\0\0\x12\0\0\0\0\0\0\0\x14\ +\0\0\0\x3e\0\0\0\0\0\0\0\x08\0\0\0\x08\0\0\0\x3e\0\0\0\0\0\0\0\x10\0\0\0\x14\0\ +\0\0\xee\0\0\0\0\0\0\0\x20\0\0\0\x18\0\0\0\x3e\0\0\0\0\0\0\0\x28\0\0\0\x08\0\0\ +\0\x1e\x01\0\0\0\0\0\0\x80\0\0\0\x1a\0\0\0\x3e\0\0\0\0\0\0\0\x90\0\0\0\x1a\0\0\ +\0\xee\0\0\0\0\0\0\0\xa8\0\0\0\x1a\0\0\0\x59\x03\0\0\0\0\0\0\xb0\0\0\0\x1a\0\0\ +\0\x5d\x03\0\0\0\0\0\0\xc0\0\0\0\x1f\0\0\0\x8b\x03\0\0\0\0\0\0\xd8\0\0\0\x20\0\ +\0\0\xee\0\0\0\0\0\0\0\xf0\0\0\0\x20\0\0\0\x3e\0\0\0\0\0\0\0\x18\x01\0\0\x24\0\ +\0\0\x3e\0\0\0\0\0\0\0\x50\x01\0\0\x1a\0\0\0\xee\0\0\0\0\0\0\0\x60\x01\0\0\x20\ +\0\0\0\x4d\x04\0\0\0\0\0\0\x88\x01\0\0\x1a\0\0\0\x1e\x01\0\0\0\0\0\0\x98\x01\0\ +\0\x1a\0\0\0\x8e\x04\0\0\0\0\0\0\xa0\x01\0\0\x18\0\0\0\x3e\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xd6\0\0\0\0\0\x02\0\x70\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\xc8\0\0\0\0\0\x02\0\xf0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\xcf\0\0\0\0\0\x03\0\x78\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xc1\0\0\0\0\0\x03\0\x80\ +\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xba\0\0\0\0\0\x03\0\xf8\x01\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\x14\0\0\0\x01\0\x04\0\0\0\0\0\0\0\0\0\x23\0\0\0\0\0\0\0\xf4\0\0\0\ +\x01\0\x04\0\x23\0\0\0\0\0\0\0\x0e\0\0\0\0\0\0\0\x28\0\0\0\x01\0\x04\0\x31\0\0\ +\0\0\0\0\0\x20\0\0\0\0\0\0\0\xdd\0\0\0\x01\0\x04\0\x51\0\0\0\0\0\0\0\x11\0\0\0\ +\0\0\0\0\0\0\0\0\x03\0\x02\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\x03\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x03\0\x04\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\xb2\0\0\0\x11\0\x05\0\0\0\0\0\0\0\0\0\x04\0\0\0\0\0\0\0\x3d\0\0\0\x12\ +\0\x02\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\x5b\0\0\0\x12\0\x03\0\0\0\0\0\0\0\0\ 
+\0\x08\x02\0\0\0\0\0\0\x48\0\0\0\0\0\0\0\x01\0\0\0\x0c\0\0\0\xc8\0\0\0\0\0\0\0\ +\x01\0\0\0\x0c\0\0\0\x50\0\0\0\0\0\0\0\x01\0\0\0\x0c\0\0\0\xd0\x01\0\0\0\0\0\0\ +\x01\0\0\0\x0c\0\0\0\xf0\x03\0\0\0\0\0\0\x0a\0\0\0\x0c\0\0\0\xfc\x03\0\0\0\0\0\ +\0\x0a\0\0\0\x0c\0\0\0\x08\x04\0\0\0\0\0\0\x0a\0\0\0\x0c\0\0\0\x14\x04\0\0\0\0\ +\0\0\x0a\0\0\0\x0c\0\0\0\x2c\x04\0\0\0\0\0\0\0\0\0\0\x0d\0\0\0\x2c\0\0\0\0\0\0\ +\0\0\0\0\0\x0a\0\0\0\x3c\0\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x50\0\0\0\0\0\0\0\0\0\ +\0\0\x0a\0\0\0\x60\0\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\x70\0\0\0\0\0\0\0\0\0\0\0\ +\x0a\0\0\0\x80\0\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\x90\0\0\0\0\0\0\0\0\0\0\0\x0a\0\ +\0\0\xa0\0\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\xb0\0\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\ +\xc0\0\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\xd0\0\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\xe8\0\ +\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\xf8\0\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x08\x01\0\0\ +\0\0\0\0\0\0\0\0\x0b\0\0\0\x18\x01\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x28\x01\0\0\0\ +\0\0\0\0\0\0\0\x0b\0\0\0\x38\x01\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x48\x01\0\0\0\0\ +\0\0\0\0\0\0\x0b\0\0\0\x58\x01\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x68\x01\0\0\0\0\0\ +\0\0\0\0\0\x0b\0\0\0\x78\x01\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x88\x01\0\0\0\0\0\0\ +\0\0\0\0\x0b\0\0\0\x98\x01\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\xa8\x01\0\0\0\0\0\0\0\ +\0\0\0\x0b\0\0\0\xb8\x01\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\xc8\x01\0\0\0\0\0\0\0\0\ +\0\0\x0b\0\0\0\xd8\x01\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\xe8\x01\0\0\0\0\0\0\0\0\0\ +\0\x0b\0\0\0\xf8\x01\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x08\x02\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x18\x02\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x28\x02\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x38\x02\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x48\x02\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x58\x02\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x68\x02\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x78\x02\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x94\x02\0\0\0\0\0\0\0\0\0\0\ +\x0a\0\0\0\xa4\x02\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\xb4\x02\0\0\0\0\0\0\0\0\0\0\ +\x0a\0\0\0\xc4\x02\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\xd4\x02\0\0\0\0\0\0\0\0\0\0\ +\x0a\0\0\0\xe4\x02\0\0\0\0\0\0\0\0\0\0\x0a\0\0\0\xf4\x02\0\0\0\0\0\0\0\0\0\0\ +\x0a\0\0\0\x0c\x03\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x1c\x03\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x2c\x03\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x3c\x03\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x4c\x03\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x5c\x03\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x6c\x03\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x7c\x03\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x8c\x03\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x9c\x03\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\xac\x03\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\xbc\x03\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\xcc\x03\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\xdc\x03\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\xec\x03\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\xfc\x03\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x0c\x04\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x1c\x04\0\0\0\0\0\0\0\0\0\0\ +\x0b\0\0\0\x4e\x4f\x41\x42\x43\x44\x4d\0\x2e\x74\x65\x78\x74\0\x2e\x72\x65\x6c\ +\x2e\x42\x54\x46\x2e\x65\x78\x74\0\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x6d\x61\ +\x70\x2e\x5f\x5f\x5f\x66\x6d\x74\0\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x70\x72\ +\x6f\x67\x2e\x5f\x5f\x5f\x66\x6d\x74\0\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x6d\ +\x61\x70\0\x2e\x72\x65\x6c\x69\x74\x65\x72\x2f\x62\x70\x66\x5f\x6d\x61\x70\0\ +\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x70\x72\x6f\x67\0\x2e\x72\x65\x6c\x69\x74\ +\x65\x72\x2f\x62\x70\x66\x5f\x70\x72\x6f\x67\0\x2e\x6c\x6c\x76\x6d\x5f\x61\x64\ +\x64\x72\x73\x69\x67\0\x6c\x69\x63\x65\x6e\x73\x65\0\x2e\x73\x74\x72\x74\x61\ +\x62\0\x2e\x73\x79\x6d\x74\x61\x62\0\x2e\x72\x6f\x64\x61\x74\x61\0\x2e\x72\x65\ 
+\x6c\x2e\x42\x54\x46\0\x4c\x49\x43\x45\x4e\x53\x45\0\x4c\x42\x42\x31\x5f\x37\0\ +\x4c\x42\x42\x31\x5f\x36\0\x4c\x42\x42\x30\x5f\x34\0\x4c\x42\x42\x31\x5f\x33\0\ +\x4c\x42\x42\x30\x5f\x33\0\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x70\x72\x6f\x67\ +\x2e\x5f\x5f\x5f\x66\x6d\x74\x2e\x32\0\x64\x75\x6d\x70\x5f\x62\x70\x66\x5f\x6d\ +\x61\x70\x2e\x5f\x5f\x5f\x66\x6d\x74\x2e\x31\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\x01\0\0\0\x06\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\x40\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x04\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\x4e\0\0\0\x01\0\0\0\x06\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x40\0\0\0\ +\0\0\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x08\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\x6d\0\0\0\x01\0\0\0\x06\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x40\x01\0\0\0\0\0\0\x08\ +\x02\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x08\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xa1\0\0\0\ +\x01\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x48\x03\0\0\0\0\0\0\x62\0\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x89\0\0\0\x01\0\0\0\x03\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xaa\x03\0\0\0\0\0\0\x04\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xad\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\0\xae\x03\0\0\0\0\0\0\x34\x09\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\x0b\0\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\xe2\x0c\0\0\0\0\0\0\x2c\x04\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\x99\0\0\0\x02\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x10\x11\0\0\0\ +\0\0\0\x80\x01\0\0\0\0\0\0\x0e\0\0\0\x0d\0\0\0\x08\0\0\0\0\0\0\0\x18\0\0\0\0\0\ +\0\0\x4a\0\0\0\x09\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x90\x12\0\0\0\0\0\0\ +\x20\0\0\0\0\0\0\0\x08\0\0\0\x02\0\0\0\x08\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\x69\ +\0\0\0\x09\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xb0\x12\0\0\0\0\0\0\x20\0\0\0\ +\0\0\0\0\x08\0\0\0\x03\0\0\0\x08\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\xa9\0\0\0\x09\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\xd0\x12\0\0\0\0\0\0\x50\0\0\0\0\0\0\0\ +\x08\0\0\0\x06\0\0\0\x08\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\x07\0\0\0\x09\0\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x20\x13\0\0\0\0\0\0\xe0\x03\0\0\0\0\0\0\x08\0\0\ +\0\x07\0\0\0\x08\0\0\0\0\0\0\0\x10\0\0\0\0\0\0\0\x7b\0\0\0\x03\x4c\xff\x6f\0\0\ +\0\x80\0\0\0\0\0\0\0\0\0\0\0\0\0\x17\0\0\0\0\0\0\x07\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x91\0\0\0\x03\0\0\0\0\0\0\0\0\0\0\0\0\0\ +\0\0\0\0\0\0\x07\x17\0\0\0\0\0\0\x0a\x01\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\ +\0\0\0\0\0\0\0\0\0\0\0\0"; + + return 0; +err: + bpf_object__destroy_skeleton(s); + return -1; +} + +#endif /* __ITERATORS_BPF_SKEL_H__ */ diff --git a/kernel/bpf/queue_stack_maps.c b/kernel/bpf/queue_stack_maps.c index 44184f82916a..0ee2347ba510 100644 --- a/kernel/bpf/queue_stack_maps.c +++ b/kernel/bpf/queue_stack_maps.c @@ -257,6 +257,7 @@ static int queue_stack_map_get_next_key(struct bpf_map *map, void *key, static int queue_map_btf_id; const struct bpf_map_ops queue_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = queue_stack_map_alloc_check, .map_alloc = queue_stack_map_alloc, .map_free = queue_stack_map_free, @@ -273,6 +274,7 @@ const struct bpf_map_ops queue_map_ops = { static int stack_map_btf_id; const struct bpf_map_ops stack_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = queue_stack_map_alloc_check, .map_alloc = queue_stack_map_alloc, .map_free = queue_stack_map_free, diff --git 
a/kernel/bpf/reuseport_array.c b/kernel/bpf/reuseport_array.c index 90b29c5b1da7..5a2ba1182493 100644 --- a/kernel/bpf/reuseport_array.c +++ b/kernel/bpf/reuseport_array.c @@ -351,6 +351,7 @@ static int reuseport_array_get_next_key(struct bpf_map *map, void *key, static int reuseport_array_map_btf_id; const struct bpf_map_ops reuseport_array_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc_check = reuseport_array_alloc_check, .map_alloc = reuseport_array_alloc, .map_free = reuseport_array_free, diff --git a/kernel/bpf/ringbuf.c b/kernel/bpf/ringbuf.c index 002f8a5c9e51..31cb04a4dd2d 100644 --- a/kernel/bpf/ringbuf.c +++ b/kernel/bpf/ringbuf.c @@ -287,6 +287,7 @@ static __poll_t ringbuf_map_poll(struct bpf_map *map, struct file *filp, static int ringbuf_map_btf_id; const struct bpf_map_ops ringbuf_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = ringbuf_map_alloc, .map_free = ringbuf_map_free, .map_mmap = ringbuf_map_mmap, diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c index cfed0ac44d38..a2fa006f430e 100644 --- a/kernel/bpf/stackmap.c +++ b/kernel/bpf/stackmap.c @@ -839,6 +839,7 @@ static void stack_map_free(struct bpf_map *map) static int stack_trace_map_btf_id; const struct bpf_map_ops stack_trace_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = stack_map_alloc, .map_free = stack_map_free, .map_get_next_key = stack_map_get_next_key, diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 86299a292214..4108ef3b828b 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -29,6 +29,7 @@ #include <linux/bpf_lsm.h> #include <linux/poll.h> #include <linux/bpf-netns.h> +#include <linux/rcupdate_trace.h> #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \ (map)->map_type == BPF_MAP_TYPE_CGROUP_ARRAY || \ @@ -90,6 +91,7 @@ int bpf_check_uarg_tail_zero(void __user *uaddr, } const struct bpf_map_ops bpf_map_offload_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = bpf_map_offload_map_alloc, .map_free = bpf_map_offload_map_free, .map_check_btf = map_check_no_btf, @@ -157,10 +159,11 @@ static int bpf_map_update_value(struct bpf_map *map, struct fd f, void *key, if (bpf_map_is_dev_bound(map)) { return bpf_map_offload_update_elem(map, key, value, flags); } else if (map->map_type == BPF_MAP_TYPE_CPUMAP || - map->map_type == BPF_MAP_TYPE_SOCKHASH || - map->map_type == BPF_MAP_TYPE_SOCKMAP || map->map_type == BPF_MAP_TYPE_STRUCT_OPS) { return map->ops->map_update_elem(map, key, value, flags); + } else if (map->map_type == BPF_MAP_TYPE_SOCKHASH || + map->map_type == BPF_MAP_TYPE_SOCKMAP) { + return sock_map_update_elem_sys(map, key, value, flags); } else if (IS_FD_PROG_ARRAY(map)) { return bpf_fd_array_map_update_elem(map, f.file, key, value, flags); @@ -768,7 +771,8 @@ static int map_check_btf(struct bpf_map *map, const struct btf *btf, if (map->map_type != BPF_MAP_TYPE_HASH && map->map_type != BPF_MAP_TYPE_ARRAY && map->map_type != BPF_MAP_TYPE_CGROUP_STORAGE && - map->map_type != BPF_MAP_TYPE_SK_STORAGE) + map->map_type != BPF_MAP_TYPE_SK_STORAGE && + map->map_type != BPF_MAP_TYPE_INODE_STORAGE) return -ENOTSUPP; if (map->spin_lock_off + sizeof(struct bpf_spin_lock) > map->value_size) { @@ -1728,10 +1732,14 @@ static void __bpf_prog_put_noref(struct bpf_prog *prog, bool deferred) btf_put(prog->aux->btf); bpf_prog_free_linfo(prog); - if (deferred) - call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); - else + if (deferred) { + if (prog->aux->sleepable) + call_rcu_tasks_trace(&prog->aux->rcu, 
__bpf_prog_put_rcu); + else + call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu); + } else { __bpf_prog_put_rcu(&prog->aux->rcu); + } } static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock) @@ -2101,6 +2109,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT | BPF_F_TEST_STATE_FREQ | + BPF_F_SLEEPABLE | BPF_F_TEST_RND_HI32)) return -EINVAL; @@ -2156,6 +2165,7 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr) } prog->aux->offload_requested = !!attr->prog_ifindex; + prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE; err = security_bpf_prog_alloc(prog->aux); if (err) @@ -4014,40 +4024,50 @@ static int link_detach(union bpf_attr *attr) return ret; } -static int bpf_link_inc_not_zero(struct bpf_link *link) +static struct bpf_link *bpf_link_inc_not_zero(struct bpf_link *link) { - return atomic64_fetch_add_unless(&link->refcnt, 1, 0) ? 0 : -ENOENT; + return atomic64_fetch_add_unless(&link->refcnt, 1, 0) ? link : ERR_PTR(-ENOENT); } -#define BPF_LINK_GET_FD_BY_ID_LAST_FIELD link_id - -static int bpf_link_get_fd_by_id(const union bpf_attr *attr) +struct bpf_link *bpf_link_by_id(u32 id) { struct bpf_link *link; - u32 id = attr->link_id; - int fd, err; - if (CHECK_ATTR(BPF_LINK_GET_FD_BY_ID)) - return -EINVAL; - - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; + if (!id) + return ERR_PTR(-ENOENT); spin_lock_bh(&link_idr_lock); - link = idr_find(&link_idr, id); /* before link is "settled", ID is 0, pretend it doesn't exist yet */ + link = idr_find(&link_idr, id); if (link) { if (link->id) - err = bpf_link_inc_not_zero(link); + link = bpf_link_inc_not_zero(link); else - err = -EAGAIN; + link = ERR_PTR(-EAGAIN); } else { - err = -ENOENT; + link = ERR_PTR(-ENOENT); } spin_unlock_bh(&link_idr_lock); + return link; +} - if (err) - return err; +#define BPF_LINK_GET_FD_BY_ID_LAST_FIELD link_id + +static int bpf_link_get_fd_by_id(const union bpf_attr *attr) +{ + struct bpf_link *link; + u32 id = attr->link_id; + int fd; + + if (CHECK_ATTR(BPF_LINK_GET_FD_BY_ID)) + return -EINVAL; + + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + link = bpf_link_by_id(id); + if (IS_ERR(link)) + return PTR_ERR(link); fd = bpf_link_new_fd(link); if (fd < 0) diff --git a/kernel/bpf/trampoline.c b/kernel/bpf/trampoline.c index 9be85aa4ec5f..7dd523a7e32d 100644 --- a/kernel/bpf/trampoline.c +++ b/kernel/bpf/trampoline.c @@ -7,6 +7,8 @@ #include <linux/rbtree_latch.h> #include <linux/perf_event.h> #include <linux/btf.h> +#include <linux/rcupdate_trace.h> +#include <linux/rcupdate_wait.h> /* dummy _ops. The verifier will operate on target program's ops. */ const struct bpf_verifier_ops bpf_extension_verifier_ops = { @@ -210,9 +212,12 @@ static int bpf_trampoline_update(struct bpf_trampoline *tr) * updates to trampoline would change the code from underneath the * preempted task. Hence wait for tasks to voluntarily schedule or go * to userspace. + * The same trampoline can hold both sleepable and non-sleepable progs. + * synchronize_rcu_tasks_trace() is needed to make sure all sleepable + * programs finish executing. + * Wait for these two grace periods together. 
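/*
 * [Editor's illustrative sketch, not part of the patch] The refactored
 * bpf_link_get_fd_by_id() above services the pre-existing
 * BPF_LINK_GET_FD_BY_ID command. A minimal user-space caller could look like
 * the following; the helper name and the absence of error handling are
 * assumptions of this sketch, only the command, the link_id attribute and the
 * CAP_SYS_ADMIN requirement come from the kernel code.
 */
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fetch a new fd referring to an already existing BPF link by its id. */
static int link_fd_by_id(__u32 id)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.link_id = id;	/* id 0 is never "settled"; the kernel returns -ENOENT */

	return syscall(__NR_bpf, BPF_LINK_GET_FD_BY_ID, &attr, sizeof(attr));
}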
*/ - - synchronize_rcu_tasks(); + synchronize_rcu_mult(call_rcu_tasks, call_rcu_tasks_trace); err = arch_prepare_bpf_trampoline(new_image, new_image + PAGE_SIZE / 2, &tr->func.model, flags, tprogs, @@ -344,7 +349,14 @@ void bpf_trampoline_put(struct bpf_trampoline *tr) if (WARN_ON_ONCE(!hlist_empty(&tr->progs_hlist[BPF_TRAMP_FEXIT]))) goto out; bpf_image_ksym_del(&tr->ksym); - /* wait for tasks to get out of trampoline before freeing it */ + /* This code will be executed when all bpf progs (both sleepable and + * non-sleepable) went through + * bpf_prog_put()->call_rcu[_tasks_trace]()->bpf_prog_free_deferred(). + * Hence no need for another synchronize_rcu_tasks_trace() here, + * but synchronize_rcu_tasks() is still needed, since trampoline + * may not have had any sleepable programs and we need to wait + * for tasks to get out of trampoline code before freeing it. + */ synchronize_rcu_tasks(); bpf_jit_free_exec(tr->image); hlist_del(&tr->hlist); @@ -394,6 +406,17 @@ void notrace __bpf_prog_exit(struct bpf_prog *prog, u64 start) rcu_read_unlock(); } +void notrace __bpf_prog_enter_sleepable(void) +{ + rcu_read_lock_trace(); + might_fault(); +} + +void notrace __bpf_prog_exit_sleepable(void) +{ + rcu_read_unlock_trace(); +} + int __weak arch_prepare_bpf_trampoline(void *image, void *image_end, const struct btf_func_model *m, u32 flags, diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index ef938f17b944..b4e9c56b8b32 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -21,6 +21,7 @@ #include <linux/ctype.h> #include <linux/error-injection.h> #include <linux/bpf_lsm.h> +#include <linux/btf_ids.h> #include "disasm.h" @@ -2625,11 +2626,19 @@ static int check_map_access(struct bpf_verifier_env *env, u32 regno, #define MAX_PACKET_OFF 0xffff +static enum bpf_prog_type resolve_prog_type(struct bpf_prog *prog) +{ + return prog->aux->linked_prog ? prog->aux->linked_prog->type + : prog->type; +} + static bool may_access_direct_pkt_data(struct bpf_verifier_env *env, const struct bpf_call_arg_meta *meta, enum bpf_access_type t) { - switch (env->prog->type) { + enum bpf_prog_type prog_type = resolve_prog_type(env->prog); + + switch (prog_type) { /* Program types only with direct read access go here! 
*/ case BPF_PROG_TYPE_LWT_IN: case BPF_PROG_TYPE_LWT_OUT: @@ -3872,6 +3881,33 @@ static int int_ptr_type_to_size(enum bpf_arg_type type) return -EINVAL; } +static int resolve_map_arg_type(struct bpf_verifier_env *env, + const struct bpf_call_arg_meta *meta, + enum bpf_arg_type *arg_type) +{ + if (!meta->map_ptr) { + /* kernel subsystem misconfigured verifier */ + verbose(env, "invalid map_ptr to access map->type\n"); + return -EACCES; + } + + switch (meta->map_ptr->map_type) { + case BPF_MAP_TYPE_SOCKMAP: + case BPF_MAP_TYPE_SOCKHASH: + if (*arg_type == ARG_PTR_TO_MAP_VALUE) { + *arg_type = ARG_PTR_TO_SOCKET; + } else { + verbose(env, "invalid arg_type for sockmap/sockhash\n"); + return -EINVAL; + } + break; + + default: + break; + } + return 0; +} + static int check_func_arg(struct bpf_verifier_env *env, u32 arg, struct bpf_call_arg_meta *meta, const struct bpf_func_proto *fn) @@ -3904,6 +3940,14 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg, return -EACCES; } + if (arg_type == ARG_PTR_TO_MAP_VALUE || + arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE || + arg_type == ARG_PTR_TO_MAP_VALUE_OR_NULL) { + err = resolve_map_arg_type(env, meta, &arg_type); + if (err) + return err; + } + if (arg_type == ARG_PTR_TO_MAP_KEY || arg_type == ARG_PTR_TO_MAP_VALUE || arg_type == ARG_PTR_TO_UNINIT_MAP_VALUE || @@ -3960,16 +4004,21 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg, goto err_type; } } else if (arg_type == ARG_PTR_TO_BTF_ID) { + bool ids_match = false; + expected_type = PTR_TO_BTF_ID; if (type != expected_type) goto err_type; if (!fn->check_btf_id) { if (reg->btf_id != meta->btf_id) { - verbose(env, "Helper has type %s got %s in R%d\n", - kernel_type_name(meta->btf_id), - kernel_type_name(reg->btf_id), regno); - - return -EACCES; + ids_match = btf_struct_ids_match(&env->log, reg->off, reg->btf_id, + meta->btf_id); + if (!ids_match) { + verbose(env, "Helper has type %s got %s in R%d\n", + kernel_type_name(meta->btf_id), + kernel_type_name(reg->btf_id), regno); + return -EACCES; + } } } else if (!fn->check_btf_id(reg->btf_id, arg)) { verbose(env, "Helper does not support %s in R%d\n", @@ -3977,7 +4026,7 @@ static int check_func_arg(struct bpf_verifier_env *env, u32 arg, return -EACCES; } - if (!tnum_is_const(reg->var_off) || reg->var_off.value || reg->off) { + if ((reg->off && !ids_match) || !tnum_is_const(reg->var_off) || reg->var_off.value) { verbose(env, "R%d is a pointer to in-kernel struct with non-zero offset\n", regno); return -EACCES; @@ -4143,6 +4192,38 @@ err_type: return -EACCES; } +static bool may_update_sockmap(struct bpf_verifier_env *env, int func_id) +{ + enum bpf_attach_type eatype = env->prog->expected_attach_type; + enum bpf_prog_type type = resolve_prog_type(env->prog); + + if (func_id != BPF_FUNC_map_update_elem) + return false; + + /* It's not possible to get access to a locked struct sock in these + * contexts, so updating is safe. 
+ */ + switch (type) { + case BPF_PROG_TYPE_TRACING: + if (eatype == BPF_TRACE_ITER) + return true; + break; + case BPF_PROG_TYPE_SOCKET_FILTER: + case BPF_PROG_TYPE_SCHED_CLS: + case BPF_PROG_TYPE_SCHED_ACT: + case BPF_PROG_TYPE_XDP: + case BPF_PROG_TYPE_SK_REUSEPORT: + case BPF_PROG_TYPE_FLOW_DISSECTOR: + case BPF_PROG_TYPE_SK_LOOKUP: + return true; + default: + break; + } + + verbose(env, "cannot update sockmap in this context\n"); + return false; +} + static int check_map_func_compatibility(struct bpf_verifier_env *env, struct bpf_map *map, int func_id) { @@ -4214,7 +4295,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_map_delete_elem && func_id != BPF_FUNC_msg_redirect_map && func_id != BPF_FUNC_sk_select_reuseport && - func_id != BPF_FUNC_map_lookup_elem) + func_id != BPF_FUNC_map_lookup_elem && + !may_update_sockmap(env, func_id)) goto error; break; case BPF_MAP_TYPE_SOCKHASH: @@ -4223,7 +4305,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_map_delete_elem && func_id != BPF_FUNC_msg_redirect_hash && func_id != BPF_FUNC_sk_select_reuseport && - func_id != BPF_FUNC_map_lookup_elem) + func_id != BPF_FUNC_map_lookup_elem && + !may_update_sockmap(env, func_id)) goto error; break; case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY: @@ -4242,6 +4325,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, func_id != BPF_FUNC_sk_storage_delete) goto error; break; + case BPF_MAP_TYPE_INODE_STORAGE: + if (func_id != BPF_FUNC_inode_storage_get && + func_id != BPF_FUNC_inode_storage_delete) + goto error; + break; default: break; } @@ -4315,6 +4403,11 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env, if (map->map_type != BPF_MAP_TYPE_SK_STORAGE) goto error; break; + case BPF_FUNC_inode_storage_get: + case BPF_FUNC_inode_storage_delete: + if (map->map_type != BPF_MAP_TYPE_INODE_STORAGE) + goto error; + break; default: break; } @@ -4775,6 +4868,11 @@ static int check_helper_call(struct bpf_verifier_env *env, int func_id, int insn return -EINVAL; } + if (fn->allowed && !fn->allowed(env->prog)) { + verbose(env, "helper call is not allowed in probe\n"); + return -EINVAL; + } + /* With LD_ABS/IND some JITs save/restore skb from r1. */ changes_data = bpf_helper_changes_pkt_data(fn->func); if (changes_data && fn->arg1_type != ARG_PTR_TO_CTX) { @@ -5732,6 +5830,67 @@ static void scalar_min_max_or(struct bpf_reg_state *dst_reg, __update_reg_bounds(dst_reg); } +static void scalar32_min_max_xor(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + bool src_known = tnum_subreg_is_const(src_reg->var_off); + bool dst_known = tnum_subreg_is_const(dst_reg->var_off); + struct tnum var32_off = tnum_subreg(dst_reg->var_off); + s32 smin_val = src_reg->s32_min_value; + + /* Assuming scalar64_min_max_xor will be called so it is safe + * to skip updating register for known case. + */ + if (src_known && dst_known) + return; + + /* We get both minimum and maximum from the var32_off. */ + dst_reg->u32_min_value = var32_off.value; + dst_reg->u32_max_value = var32_off.value | var32_off.mask; + + if (dst_reg->s32_min_value >= 0 && smin_val >= 0) { + /* XORing two positive sign numbers gives a positive, + * so safe to cast u32 result into s32. 
+ */ + dst_reg->s32_min_value = dst_reg->u32_min_value; + dst_reg->s32_max_value = dst_reg->u32_max_value; + } else { + dst_reg->s32_min_value = S32_MIN; + dst_reg->s32_max_value = S32_MAX; + } +} + +static void scalar_min_max_xor(struct bpf_reg_state *dst_reg, + struct bpf_reg_state *src_reg) +{ + bool src_known = tnum_is_const(src_reg->var_off); + bool dst_known = tnum_is_const(dst_reg->var_off); + s64 smin_val = src_reg->smin_value; + + if (src_known && dst_known) { + /* dst_reg->var_off.value has been updated earlier */ + __mark_reg_known(dst_reg, dst_reg->var_off.value); + return; + } + + /* We get both minimum and maximum from the var_off. */ + dst_reg->umin_value = dst_reg->var_off.value; + dst_reg->umax_value = dst_reg->var_off.value | dst_reg->var_off.mask; + + if (dst_reg->smin_value >= 0 && smin_val >= 0) { + /* XORing two positive sign numbers gives a positive, + * so safe to cast u64 result into s64. + */ + dst_reg->smin_value = dst_reg->umin_value; + dst_reg->smax_value = dst_reg->umax_value; + } else { + dst_reg->smin_value = S64_MIN; + dst_reg->smax_value = S64_MAX; + } + + __update_reg_bounds(dst_reg); +} + static void __scalar32_min_max_lsh(struct bpf_reg_state *dst_reg, u64 umin_val, u64 umax_val) { @@ -6040,6 +6199,11 @@ static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env, scalar32_min_max_or(dst_reg, &src_reg); scalar_min_max_or(dst_reg, &src_reg); break; + case BPF_XOR: + dst_reg->var_off = tnum_xor(dst_reg->var_off, src_reg.var_off); + scalar32_min_max_xor(dst_reg, &src_reg); + scalar_min_max_xor(dst_reg, &src_reg); + break; case BPF_LSH: if (umax_val >= insn_bitness) { /* Shifts greater than 31 or 63 are undefined. @@ -7287,7 +7451,7 @@ static int check_ld_abs(struct bpf_verifier_env *env, struct bpf_insn *insn) u8 mode = BPF_MODE(insn->code); int i, err; - if (!may_access_skb(env->prog->type)) { + if (!may_access_skb(resolve_prog_type(env->prog))) { verbose(env, "BPF_LD_[ABS|IND] instructions not allowed for this program type\n"); return -EINVAL; } @@ -7375,11 +7539,12 @@ static int check_return_code(struct bpf_verifier_env *env) const struct bpf_prog *prog = env->prog; struct bpf_reg_state *reg; struct tnum range = tnum_range(0, 1); + enum bpf_prog_type prog_type = resolve_prog_type(env->prog); int err; /* LSM and struct_ops func-ptr's return type could be "void" */ - if ((env->prog->type == BPF_PROG_TYPE_STRUCT_OPS || - env->prog->type == BPF_PROG_TYPE_LSM) && + if ((prog_type == BPF_PROG_TYPE_STRUCT_OPS || + prog_type == BPF_PROG_TYPE_LSM) && !prog->aux->attach_func_proto->type) return 0; @@ -7398,7 +7563,7 @@ static int check_return_code(struct bpf_verifier_env *env) return -EACCES; } - switch (env->prog->type) { + switch (prog_type) { case BPF_PROG_TYPE_CGROUP_SOCK_ADDR: if (env->prog->expected_attach_type == BPF_CGROUP_UDP4_RECVMSG || env->prog->expected_attach_type == BPF_CGROUP_UDP6_RECVMSG || @@ -9154,6 +9319,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, struct bpf_prog *prog) { + enum bpf_prog_type prog_type = resolve_prog_type(prog); /* * Validate that trace type programs use preallocated hash maps. * @@ -9171,8 +9337,8 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, * now, but warnings are emitted so developers are made aware of * the unsafety and can fix their programs before this is enforced. 
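/*
 * [Editor's illustrative sketch, not part of the patch] How the unsigned
 * bounds chosen by scalar_min_max_xor() above fall out of the tnum
 * representation: 'value' holds the bits known to be 1, 'mask' the bits whose
 * value is unknown, so the smallest possible result is 'value' and the
 * largest is 'value | mask'. The struct and helper below are stand-alone
 * user-space re-creations that mirror (but do not copy) kernel/bpf/tnum.c.
 */
#include <stdint.h>
#include <stdio.h>

struct tnum { uint64_t value; uint64_t mask; };

static struct tnum tnum_xor(struct tnum a, struct tnum b)
{
	uint64_t v = a.value ^ b.value;	/* XOR of the known bits */
	uint64_t mu = a.mask | b.mask;	/* an unknown input bit makes the output bit unknown */

	return (struct tnum){ .value = v & ~mu, .mask = mu };
}

int main(void)
{
	struct tnum r1 = { .value = 0x10, .mask = 0x0 };	/* constant 0x10 */
	struct tnum r2 = { .value = 0x20, .mask = 0xf };	/* 0x20..0x2f, low nibble unknown */
	struct tnum res = tnum_xor(r1, r2);

	/* umin/umax exactly as derived in scalar_min_max_xor(): 0x30 and 0x3f */
	printf("umin=%#llx umax=%#llx\n",
	       (unsigned long long)res.value,
	       (unsigned long long)(res.value | res.mask));
	return 0;
}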
*/ - if (is_tracing_prog_type(prog->type) && !is_preallocated_map(map)) { - if (prog->type == BPF_PROG_TYPE_PERF_EVENT) { + if (is_tracing_prog_type(prog_type) && !is_preallocated_map(map)) { + if (prog_type == BPF_PROG_TYPE_PERF_EVENT) { verbose(env, "perf_event programs can only use preallocated hash map\n"); return -EINVAL; } @@ -9184,8 +9350,8 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, verbose(env, "trace type programs with run-time allocated hash maps are unsafe. Switch to preallocated hash maps.\n"); } - if ((is_tracing_prog_type(prog->type) || - prog->type == BPF_PROG_TYPE_SOCKET_FILTER) && + if ((is_tracing_prog_type(prog_type) || + prog_type == BPF_PROG_TYPE_SOCKET_FILTER) && map_value_has_spin_lock(map)) { verbose(env, "tracing progs cannot use bpf_spin_lock yet\n"); return -EINVAL; @@ -9202,6 +9368,23 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env, return -EINVAL; } + if (prog->aux->sleepable) + switch (map->map_type) { + case BPF_MAP_TYPE_HASH: + case BPF_MAP_TYPE_LRU_HASH: + case BPF_MAP_TYPE_ARRAY: + if (!is_preallocated_map(map)) { + verbose(env, + "Sleepable programs can only use preallocated hash maps\n"); + return -EINVAL; + } + break; + default: + verbose(env, + "Sleepable programs can only use array and hash maps\n"); + return -EINVAL; + } + return 0; } @@ -9897,7 +10080,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env) insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE((insn)->code); env->prog->aux->num_exentries++; - } else if (env->prog->type != BPF_PROG_TYPE_STRUCT_OPS) { + } else if (resolve_prog_type(env->prog) != BPF_PROG_TYPE_STRUCT_OPS) { verbose(env, "Writes through BTF pointers are not allowed\n"); return -EINVAL; } @@ -10820,6 +11003,37 @@ static int check_attach_modify_return(struct bpf_prog *prog, unsigned long addr) return -EINVAL; } +/* non exhaustive list of sleepable bpf_lsm_*() functions */ +BTF_SET_START(btf_sleepable_lsm_hooks) +#ifdef CONFIG_BPF_LSM +BTF_ID(func, bpf_lsm_bprm_committed_creds) +#else +BTF_ID_UNUSED +#endif +BTF_SET_END(btf_sleepable_lsm_hooks) + +static int check_sleepable_lsm_hook(u32 btf_id) +{ + return btf_id_set_contains(&btf_sleepable_lsm_hooks, btf_id); +} + +/* list of non-sleepable functions that are otherwise on + * ALLOW_ERROR_INJECTION list + */ +BTF_SET_START(btf_non_sleepable_error_inject) +/* Three functions below can be called from sleepable and non-sleepable context. + * Assume non-sleepable from bpf safety point of view. 
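/*
 * [Editor's illustrative sketch, not part of the patch] A sleepable LSM
 * program as enabled by the BPF_F_SLEEPABLE support above: it attaches to
 * bpf_lsm_bprm_committed_creds (the one hook in btf_sleepable_lsm_hooks) and
 * calls bpf_copy_from_user(), which may fault and is therefore offered only
 * to sleepable programs. Assumptions of this sketch: libbpf's "lsm.s/"
 * section prefix sets BPF_F_SLEEPABLE at load time, and bprm->p still points
 * at the user-space area where the new program's argument strings were
 * copied. Per the map check above, any maps such a program uses must be
 * preallocated hash/array maps; this sketch uses none.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("lsm.s/bprm_committed_creds")
int BPF_PROG(log_exec, struct linux_binprm *bprm)
{
	char buf[64] = {};	/* zero-filled so the string below stays NUL-terminated */
	long err;

	/* bprm->p is a user-space address; reading it can fault, which is
	 * only legal because this program is sleepable.
	 */
	err = bpf_copy_from_user(buf, sizeof(buf) - 1, (const void *)bprm->p);
	if (!err)
		bpf_printk("exec arg area: %s", buf);
	return 0;
}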
+ */ +BTF_ID(func, __add_to_page_cache_locked) +BTF_ID(func, should_fail_alloc_page) +BTF_ID(func, should_failslab) +BTF_SET_END(btf_non_sleepable_error_inject) + +static int check_non_sleepable_error_inject(u32 btf_id) +{ + return btf_id_set_contains(&btf_non_sleepable_error_inject, btf_id); +} + static int check_attach_btf_id(struct bpf_verifier_env *env) { struct bpf_prog *prog = env->prog; @@ -10837,6 +11051,12 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) long addr; u64 key; + if (prog->aux->sleepable && prog->type != BPF_PROG_TYPE_TRACING && + prog->type != BPF_PROG_TYPE_LSM) { + verbose(env, "Only fentry/fexit/fmod_ret and lsm programs can be sleepable\n"); + return -EINVAL; + } + if (prog->type == BPF_PROG_TYPE_STRUCT_OPS) return check_struct_ops_btf_id(env); @@ -11045,13 +11265,36 @@ static int check_attach_btf_id(struct bpf_verifier_env *env) } } - if (prog->expected_attach_type == BPF_MODIFY_RETURN) { + if (prog->aux->sleepable) { + ret = -EINVAL; + switch (prog->type) { + case BPF_PROG_TYPE_TRACING: + /* fentry/fexit/fmod_ret progs can be sleepable only if they are + * attached to ALLOW_ERROR_INJECTION and are not in denylist. + */ + if (!check_non_sleepable_error_inject(btf_id) && + within_error_injection_list(addr)) + ret = 0; + break; + case BPF_PROG_TYPE_LSM: + /* LSM progs check that they are attached to bpf_lsm_*() funcs. + * Only some of them are sleepable. + */ + if (check_sleepable_lsm_hook(btf_id)) + ret = 0; + break; + default: + break; + } + if (ret) + verbose(env, "%s is not sleepable\n", + prog->aux->attach_func_name); + } else if (prog->expected_attach_type == BPF_MODIFY_RETURN) { ret = check_attach_modify_return(prog, addr); if (ret) verbose(env, "%s() is not modifiable\n", prog->aux->attach_func_name); } - if (ret) goto out; tr->func.addr = (void *)addr; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index a8d4f253ed77..b2a5380eb187 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1098,6 +1098,52 @@ static const struct bpf_func_proto bpf_send_signal_thread_proto = { .arg1_type = ARG_ANYTHING, }; +BPF_CALL_3(bpf_d_path, struct path *, path, char *, buf, u32, sz) +{ + long len; + char *p; + + if (!sz) + return 0; + + p = d_path(path, buf, sz); + if (IS_ERR(p)) { + len = PTR_ERR(p); + } else { + len = buf + sz - p; + memmove(buf, p, len); + } + + return len; +} + +BTF_SET_START(btf_allowlist_d_path) +BTF_ID(func, vfs_truncate) +BTF_ID(func, vfs_fallocate) +BTF_ID(func, dentry_open) +BTF_ID(func, vfs_getattr) +BTF_ID(func, filp_close) +BTF_SET_END(btf_allowlist_d_path) + +static bool bpf_d_path_allowed(const struct bpf_prog *prog) +{ + return btf_id_set_contains(&btf_allowlist_d_path, prog->aux->attach_btf_id); +} + +BTF_ID_LIST(bpf_d_path_btf_ids) +BTF_ID(struct, path) + +static const struct bpf_func_proto bpf_d_path_proto = { + .func = bpf_d_path, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_BTF_ID, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE_OR_ZERO, + .btf_id = bpf_d_path_btf_ids, + .allowed = bpf_d_path_allowed, +}; + const struct bpf_func_proto * bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) { @@ -1182,6 +1228,8 @@ bpf_tracing_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return &bpf_jiffies64_proto; case BPF_FUNC_get_task_stack: return &bpf_get_task_stack_proto; + case BPF_FUNC_copy_from_user: + return prog->aux->sleepable ? 
&bpf_copy_from_user_proto : NULL; default: return NULL; } @@ -1579,6 +1627,8 @@ tracing_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) return prog->expected_attach_type == BPF_TRACE_ITER ? &bpf_seq_write_proto : NULL; + case BPF_FUNC_d_path: + return &bpf_d_path_proto; default: return raw_tp_prog_func_proto(func_id, prog); } diff --git a/mm/filemap.c b/mm/filemap.c index 1aaea26556cc..054d93a86f8a 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -827,10 +827,10 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) } EXPORT_SYMBOL_GPL(replace_page_cache_page); -static int __add_to_page_cache_locked(struct page *page, - struct address_space *mapping, - pgoff_t offset, gfp_t gfp_mask, - void **shadowp) +noinline int __add_to_page_cache_locked(struct page *page, + struct address_space *mapping, + pgoff_t offset, gfp_t gfp_mask, + void **shadowp) { XA_STATE(xas, &mapping->i_pages, offset); int huge = PageHuge(page); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index fab5e97dc9ca..0608f7f1236d 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3482,7 +3482,7 @@ static inline bool __should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) #endif /* CONFIG_FAIL_PAGE_ALLOC */ -static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) +noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order) { return __should_fail_alloc_page(gfp_mask, order); } diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig index 73d0b12789f1..8ad0233ce497 100644 --- a/net/bpfilter/Kconfig +++ b/net/bpfilter/Kconfig @@ -2,6 +2,7 @@ menuconfig BPFILTER bool "BPF based packet filtering framework (BPFILTER)" depends on NET && BPF && INET + select USERMODE_DRIVER help This builds experimental bpfilter framework that is aiming to provide netfilter compatible functionality via BPF diff --git a/net/core/bpf_sk_storage.c b/net/core/bpf_sk_storage.c index b988f48153a4..a0d1a3265b71 100644 --- a/net/core/bpf_sk_storage.c +++ b/net/core/bpf_sk_storage.c @@ -7,97 +7,14 @@ #include <linux/spinlock.h> #include <linux/bpf.h> #include <linux/btf_ids.h> +#include <linux/bpf_local_storage.h> #include <net/bpf_sk_storage.h> #include <net/sock.h> #include <uapi/linux/sock_diag.h> #include <uapi/linux/btf.h> +#include <linux/btf_ids.h> -#define SK_STORAGE_CREATE_FLAG_MASK \ - (BPF_F_NO_PREALLOC | BPF_F_CLONE) - -struct bucket { - struct hlist_head list; - raw_spinlock_t lock; -}; - -/* Thp map is not the primary owner of a bpf_sk_storage_elem. - * Instead, the sk->sk_bpf_storage is. - * - * The map (bpf_sk_storage_map) is for two purposes - * 1. Define the size of the "sk local storage". It is - * the map's value_size. - * - * 2. Maintain a list to keep track of all elems such - * that they can be cleaned up during the map destruction. - * - * When a bpf local storage is being looked up for a - * particular sk, the "bpf_map" pointer is actually used - * as the "key" to search in the list of elem in - * sk->sk_bpf_storage. - * - * Hence, consider sk->sk_bpf_storage is the mini-map - * with the "bpf_map" pointer as the searching key. - */ -struct bpf_sk_storage_map { - struct bpf_map map; - /* Lookup elem does not require accessing the map. - * - * Updating/Deleting requires a bucket lock to - * link/unlink the elem from the map. Having - * multiple buckets to improve contention. 
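/*
 * [Editor's illustrative sketch, not part of the patch] Using the new
 * bpf_d_path() helper from one of its allowlisted attach points. filp_close()
 * is in btf_allowlist_d_path, so a plain fentry program attached there passes
 * the fn->allowed (bpf_d_path_allowed) check, and passing &file->f_path works
 * because of the relaxed BTF pointer-offset matching added to check_func_arg()
 * above. Section naming and includes follow the usual libbpf conventions and
 * are assumptions of this sketch.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("fentry/filp_close")
int BPF_PROG(trace_close, struct file *file)
{
	char path[256] = {};	/* initialized: bpf_d_path's buffer is ARG_PTR_TO_MEM */
	long len;

	len = bpf_d_path(&file->f_path, path, sizeof(path));
	if (len > 0)
		bpf_printk("closing %s", path);
	return 0;
}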
- */ - struct bucket *buckets; - u32 bucket_log; - u16 elem_size; - u16 cache_idx; -}; - -struct bpf_sk_storage_data { - /* smap is used as the searching key when looking up - * from sk->sk_bpf_storage. - * - * Put it in the same cacheline as the data to minimize - * the number of cachelines access during the cache hit case. - */ - struct bpf_sk_storage_map __rcu *smap; - u8 data[] __aligned(8); -}; - -/* Linked to bpf_sk_storage and bpf_sk_storage_map */ -struct bpf_sk_storage_elem { - struct hlist_node map_node; /* Linked to bpf_sk_storage_map */ - struct hlist_node snode; /* Linked to bpf_sk_storage */ - struct bpf_sk_storage __rcu *sk_storage; - struct rcu_head rcu; - /* 8 bytes hole */ - /* The data is stored in aother cacheline to minimize - * the number of cachelines access during a cache hit. - */ - struct bpf_sk_storage_data sdata ____cacheline_aligned; -}; - -#define SELEM(_SDATA) container_of((_SDATA), struct bpf_sk_storage_elem, sdata) -#define SDATA(_SELEM) (&(_SELEM)->sdata) -#define BPF_SK_STORAGE_CACHE_SIZE 16 - -static DEFINE_SPINLOCK(cache_idx_lock); -static u64 cache_idx_usage_counts[BPF_SK_STORAGE_CACHE_SIZE]; - -struct bpf_sk_storage { - struct bpf_sk_storage_data __rcu *cache[BPF_SK_STORAGE_CACHE_SIZE]; - struct hlist_head list; /* List of bpf_sk_storage_elem */ - struct sock *sk; /* The sk that owns the the above "list" of - * bpf_sk_storage_elem. - */ - struct rcu_head rcu; - raw_spinlock_t lock; /* Protect adding/removing from the "list" */ -}; - -static struct bucket *select_bucket(struct bpf_sk_storage_map *smap, - struct bpf_sk_storage_elem *selem) -{ - return &smap->buckets[hash_ptr(selem, smap->bucket_log)]; -} +DEFINE_BPF_STORAGE_CACHE(sk_cache); static int omem_charge(struct sock *sk, unsigned int size) { @@ -111,445 +28,38 @@ static int omem_charge(struct sock *sk, unsigned int size) return -ENOMEM; } -static bool selem_linked_to_sk(const struct bpf_sk_storage_elem *selem) -{ - return !hlist_unhashed(&selem->snode); -} - -static bool selem_linked_to_map(const struct bpf_sk_storage_elem *selem) -{ - return !hlist_unhashed(&selem->map_node); -} - -static struct bpf_sk_storage_elem *selem_alloc(struct bpf_sk_storage_map *smap, - struct sock *sk, void *value, - bool charge_omem) -{ - struct bpf_sk_storage_elem *selem; - - if (charge_omem && omem_charge(sk, smap->elem_size)) - return NULL; - - selem = kzalloc(smap->elem_size, GFP_ATOMIC | __GFP_NOWARN); - if (selem) { - if (value) - memcpy(SDATA(selem)->data, value, smap->map.value_size); - return selem; - } - - if (charge_omem) - atomic_sub(smap->elem_size, &sk->sk_omem_alloc); - - return NULL; -} - -/* sk_storage->lock must be held and selem->sk_storage == sk_storage. - * The caller must ensure selem->smap is still valid to be - * dereferenced for its smap->elem_size and smap->cache_idx. - */ -static bool __selem_unlink_sk(struct bpf_sk_storage *sk_storage, - struct bpf_sk_storage_elem *selem, - bool uncharge_omem) -{ - struct bpf_sk_storage_map *smap; - bool free_sk_storage; - struct sock *sk; - - smap = rcu_dereference(SDATA(selem)->smap); - sk = sk_storage->sk; - - /* All uncharging on sk->sk_omem_alloc must be done first. - * sk may be freed once the last selem is unlinked from sk_storage. 
- */ - if (uncharge_omem) - atomic_sub(smap->elem_size, &sk->sk_omem_alloc); - - free_sk_storage = hlist_is_singular_node(&selem->snode, - &sk_storage->list); - if (free_sk_storage) { - atomic_sub(sizeof(struct bpf_sk_storage), &sk->sk_omem_alloc); - sk_storage->sk = NULL; - /* After this RCU_INIT, sk may be freed and cannot be used */ - RCU_INIT_POINTER(sk->sk_bpf_storage, NULL); - - /* sk_storage is not freed now. sk_storage->lock is - * still held and raw_spin_unlock_bh(&sk_storage->lock) - * will be done by the caller. - * - * Although the unlock will be done under - * rcu_read_lock(), it is more intutivie to - * read if kfree_rcu(sk_storage, rcu) is done - * after the raw_spin_unlock_bh(&sk_storage->lock). - * - * Hence, a "bool free_sk_storage" is returned - * to the caller which then calls the kfree_rcu() - * after unlock. - */ - } - hlist_del_init_rcu(&selem->snode); - if (rcu_access_pointer(sk_storage->cache[smap->cache_idx]) == - SDATA(selem)) - RCU_INIT_POINTER(sk_storage->cache[smap->cache_idx], NULL); - - kfree_rcu(selem, rcu); - - return free_sk_storage; -} - -static void selem_unlink_sk(struct bpf_sk_storage_elem *selem) -{ - struct bpf_sk_storage *sk_storage; - bool free_sk_storage = false; - - if (unlikely(!selem_linked_to_sk(selem))) - /* selem has already been unlinked from sk */ - return; - - sk_storage = rcu_dereference(selem->sk_storage); - raw_spin_lock_bh(&sk_storage->lock); - if (likely(selem_linked_to_sk(selem))) - free_sk_storage = __selem_unlink_sk(sk_storage, selem, true); - raw_spin_unlock_bh(&sk_storage->lock); - - if (free_sk_storage) - kfree_rcu(sk_storage, rcu); -} - -static void __selem_link_sk(struct bpf_sk_storage *sk_storage, - struct bpf_sk_storage_elem *selem) -{ - RCU_INIT_POINTER(selem->sk_storage, sk_storage); - hlist_add_head(&selem->snode, &sk_storage->list); -} - -static void selem_unlink_map(struct bpf_sk_storage_elem *selem) -{ - struct bpf_sk_storage_map *smap; - struct bucket *b; - - if (unlikely(!selem_linked_to_map(selem))) - /* selem has already be unlinked from smap */ - return; - - smap = rcu_dereference(SDATA(selem)->smap); - b = select_bucket(smap, selem); - raw_spin_lock_bh(&b->lock); - if (likely(selem_linked_to_map(selem))) - hlist_del_init_rcu(&selem->map_node); - raw_spin_unlock_bh(&b->lock); -} - -static void selem_link_map(struct bpf_sk_storage_map *smap, - struct bpf_sk_storage_elem *selem) -{ - struct bucket *b = select_bucket(smap, selem); - - raw_spin_lock_bh(&b->lock); - RCU_INIT_POINTER(SDATA(selem)->smap, smap); - hlist_add_head_rcu(&selem->map_node, &b->list); - raw_spin_unlock_bh(&b->lock); -} - -static void selem_unlink(struct bpf_sk_storage_elem *selem) -{ - /* Always unlink from map before unlinking from sk_storage - * because selem will be freed after successfully unlinked from - * the sk_storage. 
- */ - selem_unlink_map(selem); - selem_unlink_sk(selem); -} - -static struct bpf_sk_storage_data * -__sk_storage_lookup(struct bpf_sk_storage *sk_storage, - struct bpf_sk_storage_map *smap, - bool cacheit_lockit) -{ - struct bpf_sk_storage_data *sdata; - struct bpf_sk_storage_elem *selem; - - /* Fast path (cache hit) */ - sdata = rcu_dereference(sk_storage->cache[smap->cache_idx]); - if (sdata && rcu_access_pointer(sdata->smap) == smap) - return sdata; - - /* Slow path (cache miss) */ - hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) - if (rcu_access_pointer(SDATA(selem)->smap) == smap) - break; - - if (!selem) - return NULL; - - sdata = SDATA(selem); - if (cacheit_lockit) { - /* spinlock is needed to avoid racing with the - * parallel delete. Otherwise, publishing an already - * deleted sdata to the cache will become a use-after-free - * problem in the next __sk_storage_lookup(). - */ - raw_spin_lock_bh(&sk_storage->lock); - if (selem_linked_to_sk(selem)) - rcu_assign_pointer(sk_storage->cache[smap->cache_idx], - sdata); - raw_spin_unlock_bh(&sk_storage->lock); - } - - return sdata; -} - -static struct bpf_sk_storage_data * +static struct bpf_local_storage_data * sk_storage_lookup(struct sock *sk, struct bpf_map *map, bool cacheit_lockit) { - struct bpf_sk_storage *sk_storage; - struct bpf_sk_storage_map *smap; + struct bpf_local_storage *sk_storage; + struct bpf_local_storage_map *smap; sk_storage = rcu_dereference(sk->sk_bpf_storage); if (!sk_storage) return NULL; - smap = (struct bpf_sk_storage_map *)map; - return __sk_storage_lookup(sk_storage, smap, cacheit_lockit); -} - -static int check_flags(const struct bpf_sk_storage_data *old_sdata, - u64 map_flags) -{ - if (old_sdata && (map_flags & ~BPF_F_LOCK) == BPF_NOEXIST) - /* elem already exists */ - return -EEXIST; - - if (!old_sdata && (map_flags & ~BPF_F_LOCK) == BPF_EXIST) - /* elem doesn't exist, cannot update it */ - return -ENOENT; - - return 0; -} - -static int sk_storage_alloc(struct sock *sk, - struct bpf_sk_storage_map *smap, - struct bpf_sk_storage_elem *first_selem) -{ - struct bpf_sk_storage *prev_sk_storage, *sk_storage; - int err; - - err = omem_charge(sk, sizeof(*sk_storage)); - if (err) - return err; - - sk_storage = kzalloc(sizeof(*sk_storage), GFP_ATOMIC | __GFP_NOWARN); - if (!sk_storage) { - err = -ENOMEM; - goto uncharge; - } - INIT_HLIST_HEAD(&sk_storage->list); - raw_spin_lock_init(&sk_storage->lock); - sk_storage->sk = sk; - - __selem_link_sk(sk_storage, first_selem); - selem_link_map(smap, first_selem); - /* Publish sk_storage to sk. sk->sk_lock cannot be acquired. - * Hence, atomic ops is used to set sk->sk_bpf_storage - * from NULL to the newly allocated sk_storage ptr. - * - * From now on, the sk->sk_bpf_storage pointer is protected - * by the sk_storage->lock. Hence, when freeing - * the sk->sk_bpf_storage, the sk_storage->lock must - * be held before setting sk->sk_bpf_storage to NULL. - */ - prev_sk_storage = cmpxchg((struct bpf_sk_storage **)&sk->sk_bpf_storage, - NULL, sk_storage); - if (unlikely(prev_sk_storage)) { - selem_unlink_map(first_selem); - err = -EAGAIN; - goto uncharge; - - /* Note that even first_selem was linked to smap's - * bucket->list, first_selem can be freed immediately - * (instead of kfree_rcu) because - * bpf_sk_storage_map_free() does a - * synchronize_rcu() before walking the bucket->list. - * Hence, no one is accessing selem from the - * bucket->list under rcu_read_lock(). 
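/*
 * [Editor's illustrative sketch, not part of the patch] The consumer side of
 * the sk-local storage whose internals are being moved into bpf_local_storage
 * here: a per-socket counter created on demand from a cgroup egress program.
 * Map and section names are assumptions of this sketch; the helper and flag
 * are the existing bpf_sk_storage_get() API.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

struct {
	__uint(type, BPF_MAP_TYPE_SK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* required for *_STORAGE maps */
	__type(key, int);
	__type(value, __u64);
} egress_pkts SEC(".maps");

SEC("cgroup_skb/egress")
int count_egress(struct __sk_buff *skb)
{
	struct bpf_sock *sk = skb->sk;
	__u64 *cnt;

	if (!sk)
		return 1;
	sk = bpf_sk_fullsock(sk);	/* the storage helper needs a full socket */
	if (!sk)
		return 1;

	cnt = bpf_sk_storage_get(&egress_pkts, sk, 0, BPF_SK_STORAGE_GET_F_CREATE);
	if (cnt)
		__sync_fetch_and_add(cnt, 1);
	return 1;	/* let the packet through */
}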
- */ - } - - return 0; - -uncharge: - kfree(sk_storage); - atomic_sub(sizeof(*sk_storage), &sk->sk_omem_alloc); - return err; -} - -/* sk cannot be going away because it is linking new elem - * to sk->sk_bpf_storage. (i.e. sk->sk_refcnt cannot be 0). - * Otherwise, it will become a leak (and other memory issues - * during map destruction). - */ -static struct bpf_sk_storage_data *sk_storage_update(struct sock *sk, - struct bpf_map *map, - void *value, - u64 map_flags) -{ - struct bpf_sk_storage_data *old_sdata = NULL; - struct bpf_sk_storage_elem *selem; - struct bpf_sk_storage *sk_storage; - struct bpf_sk_storage_map *smap; - int err; - - /* BPF_EXIST and BPF_NOEXIST cannot be both set */ - if (unlikely((map_flags & ~BPF_F_LOCK) > BPF_EXIST) || - /* BPF_F_LOCK can only be used in a value with spin_lock */ - unlikely((map_flags & BPF_F_LOCK) && !map_value_has_spin_lock(map))) - return ERR_PTR(-EINVAL); - - smap = (struct bpf_sk_storage_map *)map; - sk_storage = rcu_dereference(sk->sk_bpf_storage); - if (!sk_storage || hlist_empty(&sk_storage->list)) { - /* Very first elem for this sk */ - err = check_flags(NULL, map_flags); - if (err) - return ERR_PTR(err); - - selem = selem_alloc(smap, sk, value, true); - if (!selem) - return ERR_PTR(-ENOMEM); - - err = sk_storage_alloc(sk, smap, selem); - if (err) { - kfree(selem); - atomic_sub(smap->elem_size, &sk->sk_omem_alloc); - return ERR_PTR(err); - } - - return SDATA(selem); - } - - if ((map_flags & BPF_F_LOCK) && !(map_flags & BPF_NOEXIST)) { - /* Hoping to find an old_sdata to do inline update - * such that it can avoid taking the sk_storage->lock - * and changing the lists. - */ - old_sdata = __sk_storage_lookup(sk_storage, smap, false); - err = check_flags(old_sdata, map_flags); - if (err) - return ERR_PTR(err); - if (old_sdata && selem_linked_to_sk(SELEM(old_sdata))) { - copy_map_value_locked(map, old_sdata->data, - value, false); - return old_sdata; - } - } - - raw_spin_lock_bh(&sk_storage->lock); - - /* Recheck sk_storage->list under sk_storage->lock */ - if (unlikely(hlist_empty(&sk_storage->list))) { - /* A parallel del is happening and sk_storage is going - * away. It has just been checked before, so very - * unlikely. Return instead of retry to keep things - * simple. - */ - err = -EAGAIN; - goto unlock_err; - } - - old_sdata = __sk_storage_lookup(sk_storage, smap, false); - err = check_flags(old_sdata, map_flags); - if (err) - goto unlock_err; - - if (old_sdata && (map_flags & BPF_F_LOCK)) { - copy_map_value_locked(map, old_sdata->data, value, false); - selem = SELEM(old_sdata); - goto unlock; - } - - /* sk_storage->lock is held. Hence, we are sure - * we can unlink and uncharge the old_sdata successfully - * later. Hence, instead of charging the new selem now - * and then uncharge the old selem later (which may cause - * a potential but unnecessary charge failure), avoid taking - * a charge at all here (the "!old_sdata" check) and the - * old_sdata will not be uncharged later during __selem_unlink_sk(). 
- */ - selem = selem_alloc(smap, sk, value, !old_sdata); - if (!selem) { - err = -ENOMEM; - goto unlock_err; - } - - /* First, link the new selem to the map */ - selem_link_map(smap, selem); - - /* Second, link (and publish) the new selem to sk_storage */ - __selem_link_sk(sk_storage, selem); - - /* Third, remove old selem, SELEM(old_sdata) */ - if (old_sdata) { - selem_unlink_map(SELEM(old_sdata)); - __selem_unlink_sk(sk_storage, SELEM(old_sdata), false); - } - -unlock: - raw_spin_unlock_bh(&sk_storage->lock); - return SDATA(selem); - -unlock_err: - raw_spin_unlock_bh(&sk_storage->lock); - return ERR_PTR(err); + smap = (struct bpf_local_storage_map *)map; + return bpf_local_storage_lookup(sk_storage, smap, cacheit_lockit); } static int sk_storage_delete(struct sock *sk, struct bpf_map *map) { - struct bpf_sk_storage_data *sdata; + struct bpf_local_storage_data *sdata; sdata = sk_storage_lookup(sk, map, false); if (!sdata) return -ENOENT; - selem_unlink(SELEM(sdata)); + bpf_selem_unlink(SELEM(sdata)); return 0; } -static u16 cache_idx_get(void) -{ - u64 min_usage = U64_MAX; - u16 i, res = 0; - - spin_lock(&cache_idx_lock); - - for (i = 0; i < BPF_SK_STORAGE_CACHE_SIZE; i++) { - if (cache_idx_usage_counts[i] < min_usage) { - min_usage = cache_idx_usage_counts[i]; - res = i; - - /* Found a free cache_idx */ - if (!min_usage) - break; - } - } - cache_idx_usage_counts[res]++; - - spin_unlock(&cache_idx_lock); - - return res; -} - -static void cache_idx_free(u16 idx) -{ - spin_lock(&cache_idx_lock); - cache_idx_usage_counts[idx]--; - spin_unlock(&cache_idx_lock); -} - /* Called by __sk_destruct() & bpf_sk_storage_clone() */ void bpf_sk_storage_free(struct sock *sk) { - struct bpf_sk_storage_elem *selem; - struct bpf_sk_storage *sk_storage; + struct bpf_local_storage_elem *selem; + struct bpf_local_storage *sk_storage; bool free_sk_storage = false; struct hlist_node *n; @@ -565,7 +75,7 @@ void bpf_sk_storage_free(struct sock *sk) * Thus, no elem can be added-to or deleted-from the * sk_storage->list by the bpf_prog or by the bpf-map's syscall. * - * It is racing with bpf_sk_storage_map_free() alone + * It is racing with bpf_local_storage_map_free() alone * when unlinking elem from the sk_storage->list and * the map's bucket->list. */ @@ -574,8 +84,9 @@ void bpf_sk_storage_free(struct sock *sk) /* Always unlink from map before unlinking from * sk_storage. */ - selem_unlink_map(selem); - free_sk_storage = __selem_unlink_sk(sk_storage, selem, true); + bpf_selem_unlink_map(selem); + free_sk_storage = bpf_selem_unlink_storage_nolock(sk_storage, + selem, true); } raw_spin_unlock_bh(&sk_storage->lock); rcu_read_unlock(); @@ -584,132 +95,24 @@ void bpf_sk_storage_free(struct sock *sk) kfree_rcu(sk_storage, rcu); } -static void bpf_sk_storage_map_free(struct bpf_map *map) -{ - struct bpf_sk_storage_elem *selem; - struct bpf_sk_storage_map *smap; - struct bucket *b; - unsigned int i; - - smap = (struct bpf_sk_storage_map *)map; - - cache_idx_free(smap->cache_idx); - - /* Note that this map might be concurrently cloned from - * bpf_sk_storage_clone. Wait for any existing bpf_sk_storage_clone - * RCU read section to finish before proceeding. New RCU - * read sections should be prevented via bpf_map_inc_not_zero. - */ - synchronize_rcu(); - - /* bpf prog and the userspace can no longer access this map - * now. No new selem (of this map) can be added - * to the sk->sk_bpf_storage or to the map bucket's list. 
- * - * The elem of this map can be cleaned up here - * or - * by bpf_sk_storage_free() during __sk_destruct(). - */ - for (i = 0; i < (1U << smap->bucket_log); i++) { - b = &smap->buckets[i]; - - rcu_read_lock(); - /* No one is adding to b->list now */ - while ((selem = hlist_entry_safe(rcu_dereference_raw(hlist_first_rcu(&b->list)), - struct bpf_sk_storage_elem, - map_node))) { - selem_unlink(selem); - cond_resched_rcu(); - } - rcu_read_unlock(); - } - - /* bpf_sk_storage_free() may still need to access the map. - * e.g. bpf_sk_storage_free() has unlinked selem from the map - * which then made the above while((selem = ...)) loop - * exited immediately. - * - * However, the bpf_sk_storage_free() still needs to access - * the smap->elem_size to do the uncharging in - * __selem_unlink_sk(). - * - * Hence, wait another rcu grace period for the - * bpf_sk_storage_free() to finish. - */ - synchronize_rcu(); - - kvfree(smap->buckets); - kfree(map); -} - -/* U16_MAX is much more than enough for sk local storage - * considering a tcp_sock is ~2k. - */ -#define MAX_VALUE_SIZE \ - min_t(u32, \ - (KMALLOC_MAX_SIZE - MAX_BPF_STACK - sizeof(struct bpf_sk_storage_elem)), \ - (U16_MAX - sizeof(struct bpf_sk_storage_elem))) - -static int bpf_sk_storage_map_alloc_check(union bpf_attr *attr) +static void sk_storage_map_free(struct bpf_map *map) { - if (attr->map_flags & ~SK_STORAGE_CREATE_FLAG_MASK || - !(attr->map_flags & BPF_F_NO_PREALLOC) || - attr->max_entries || - attr->key_size != sizeof(int) || !attr->value_size || - /* Enforce BTF for userspace sk dumping */ - !attr->btf_key_type_id || !attr->btf_value_type_id) - return -EINVAL; - - if (!bpf_capable()) - return -EPERM; + struct bpf_local_storage_map *smap; - if (attr->value_size > MAX_VALUE_SIZE) - return -E2BIG; - - return 0; + smap = (struct bpf_local_storage_map *)map; + bpf_local_storage_cache_idx_free(&sk_cache, smap->cache_idx); + bpf_local_storage_map_free(smap); } -static struct bpf_map *bpf_sk_storage_map_alloc(union bpf_attr *attr) +static struct bpf_map *sk_storage_map_alloc(union bpf_attr *attr) { - struct bpf_sk_storage_map *smap; - unsigned int i; - u32 nbuckets; - u64 cost; - int ret; - - smap = kzalloc(sizeof(*smap), GFP_USER | __GFP_NOWARN); - if (!smap) - return ERR_PTR(-ENOMEM); - bpf_map_init_from_attr(&smap->map, attr); - - nbuckets = roundup_pow_of_two(num_possible_cpus()); - /* Use at least 2 buckets, select_bucket() is undefined behavior with 1 bucket */ - nbuckets = max_t(u32, 2, nbuckets); - smap->bucket_log = ilog2(nbuckets); - cost = sizeof(*smap->buckets) * nbuckets + sizeof(*smap); - - ret = bpf_map_charge_init(&smap->map.memory, cost); - if (ret < 0) { - kfree(smap); - return ERR_PTR(ret); - } - - smap->buckets = kvcalloc(sizeof(*smap->buckets), nbuckets, - GFP_USER | __GFP_NOWARN); - if (!smap->buckets) { - bpf_map_charge_finish(&smap->map.memory); - kfree(smap); - return ERR_PTR(-ENOMEM); - } + struct bpf_local_storage_map *smap; - for (i = 0; i < nbuckets; i++) { - INIT_HLIST_HEAD(&smap->buckets[i].list); - raw_spin_lock_init(&smap->buckets[i].lock); - } - - smap->elem_size = sizeof(struct bpf_sk_storage_elem) + attr->value_size; - smap->cache_idx = cache_idx_get(); + smap = bpf_local_storage_map_alloc(attr); + if (IS_ERR(smap)) + return ERR_CAST(smap); + smap->cache_idx = bpf_local_storage_cache_idx_get(&sk_cache); return &smap->map; } @@ -719,26 +122,9 @@ static int notsupp_get_next_key(struct bpf_map *map, void *key, return -ENOTSUPP; } -static int bpf_sk_storage_map_check_btf(const struct bpf_map *map, - const 
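The converted map callbacks keep the user-visible BPF_MAP_TYPE_SK_STORAGE behaviour while delegating bucket and cache-index management to the shared bpf_local_storage code. A minimal sketch of how such a map is typically declared and used from a BPF program (libbpf-style map definition; the map name, value struct and the sockops attach point are illustrative)::

  /* Sketch only: per-socket storage created on demand from a
   * sock_ops program.  Compile with clang -target bpf.
   */
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct sk_stg {
          __u64 segs_seen;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_SK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, struct sk_stg);
  } sk_stg_map SEC(".maps");

  SEC("sockops")
  int track_sock(struct bpf_sock_ops *skops)
  {
          struct bpf_sock *sk = skops->sk;
          struct sk_stg *stg;

          if (!sk)
                  return 1;

          /* Creates the per-socket element on first use; later
           * calls return the same element for this socket.
           */
          stg = bpf_sk_storage_get(&sk_stg_map, sk, NULL,
                                   BPF_SK_STORAGE_GET_F_CREATE);
          if (stg)
                  stg->segs_seen++;

          return 1;
  }

  char _license[] SEC("license") = "GPL";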
struct btf *btf, - const struct btf_type *key_type, - const struct btf_type *value_type) -{ - u32 int_data; - - if (BTF_INFO_KIND(key_type->info) != BTF_KIND_INT) - return -EINVAL; - - int_data = *(u32 *)(key_type + 1); - if (BTF_INT_BITS(int_data) != 32 || BTF_INT_OFFSET(int_data)) - return -EINVAL; - - return 0; -} - static void *bpf_fd_sk_storage_lookup_elem(struct bpf_map *map, void *key) { - struct bpf_sk_storage_data *sdata; + struct bpf_local_storage_data *sdata; struct socket *sock; int fd, err; @@ -756,14 +142,16 @@ static void *bpf_fd_sk_storage_lookup_elem(struct bpf_map *map, void *key) static int bpf_fd_sk_storage_update_elem(struct bpf_map *map, void *key, void *value, u64 map_flags) { - struct bpf_sk_storage_data *sdata; + struct bpf_local_storage_data *sdata; struct socket *sock; int fd, err; fd = *(int *)key; sock = sockfd_lookup(fd, &err); if (sock) { - sdata = sk_storage_update(sock->sk, map, value, map_flags); + sdata = bpf_local_storage_update( + sock->sk, (struct bpf_local_storage_map *)map, value, + map_flags); sockfd_put(sock); return PTR_ERR_OR_ZERO(sdata); } @@ -787,14 +175,14 @@ static int bpf_fd_sk_storage_delete_elem(struct bpf_map *map, void *key) return err; } -static struct bpf_sk_storage_elem * +static struct bpf_local_storage_elem * bpf_sk_storage_clone_elem(struct sock *newsk, - struct bpf_sk_storage_map *smap, - struct bpf_sk_storage_elem *selem) + struct bpf_local_storage_map *smap, + struct bpf_local_storage_elem *selem) { - struct bpf_sk_storage_elem *copy_selem; + struct bpf_local_storage_elem *copy_selem; - copy_selem = selem_alloc(smap, newsk, NULL, true); + copy_selem = bpf_selem_alloc(smap, newsk, NULL, true); if (!copy_selem) return NULL; @@ -810,9 +198,9 @@ bpf_sk_storage_clone_elem(struct sock *newsk, int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk) { - struct bpf_sk_storage *new_sk_storage = NULL; - struct bpf_sk_storage *sk_storage; - struct bpf_sk_storage_elem *selem; + struct bpf_local_storage *new_sk_storage = NULL; + struct bpf_local_storage *sk_storage; + struct bpf_local_storage_elem *selem; int ret = 0; RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL); @@ -824,8 +212,8 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk) goto out; hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) { - struct bpf_sk_storage_elem *copy_selem; - struct bpf_sk_storage_map *smap; + struct bpf_local_storage_elem *copy_selem; + struct bpf_local_storage_map *smap; struct bpf_map *map; smap = rcu_dereference(SDATA(selem)->smap); @@ -833,7 +221,7 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk) continue; /* Note that for lockless listeners adding new element - * here can race with cleanup in bpf_sk_storage_map_free. + * here can race with cleanup in bpf_local_storage_map_free. * Try to grab map refcnt to make sure that it's still * alive and prevent concurrent removal. 
*/ @@ -849,10 +237,10 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk) } if (new_sk_storage) { - selem_link_map(smap, copy_selem); - __selem_link_sk(new_sk_storage, copy_selem); + bpf_selem_link_map(smap, copy_selem); + bpf_selem_link_storage_nolock(new_sk_storage, copy_selem); } else { - ret = sk_storage_alloc(newsk, smap, copy_selem); + ret = bpf_local_storage_alloc(newsk, smap, copy_selem); if (ret) { kfree(copy_selem); atomic_sub(smap->elem_size, @@ -861,7 +249,8 @@ int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk) goto out; } - new_sk_storage = rcu_dereference(copy_selem->sk_storage); + new_sk_storage = + rcu_dereference(copy_selem->local_storage); } bpf_map_put(map); } @@ -879,7 +268,7 @@ out: BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk, void *, value, u64, flags) { - struct bpf_sk_storage_data *sdata; + struct bpf_local_storage_data *sdata; if (flags > BPF_SK_STORAGE_GET_F_CREATE) return (unsigned long)NULL; @@ -895,7 +284,9 @@ BPF_CALL_4(bpf_sk_storage_get, struct bpf_map *, map, struct sock *, sk, * destruction). */ refcount_inc_not_zero(&sk->sk_refcnt)) { - sdata = sk_storage_update(sk, map, value, BPF_NOEXIST); + sdata = bpf_local_storage_update( + sk, (struct bpf_local_storage_map *)map, value, + BPF_NOEXIST); /* sk must be a fullsock (guaranteed by verifier), * so sock_gen_put() is unnecessary. */ @@ -920,18 +311,44 @@ BPF_CALL_2(bpf_sk_storage_delete, struct bpf_map *, map, struct sock *, sk) return -ENOENT; } +static int sk_storage_charge(struct bpf_local_storage_map *smap, + void *owner, u32 size) +{ + return omem_charge(owner, size); +} + +static void sk_storage_uncharge(struct bpf_local_storage_map *smap, + void *owner, u32 size) +{ + struct sock *sk = owner; + + atomic_sub(size, &sk->sk_omem_alloc); +} + +static struct bpf_local_storage __rcu ** +sk_storage_ptr(void *owner) +{ + struct sock *sk = owner; + + return &sk->sk_bpf_storage; +} + static int sk_storage_map_btf_id; const struct bpf_map_ops sk_storage_map_ops = { - .map_alloc_check = bpf_sk_storage_map_alloc_check, - .map_alloc = bpf_sk_storage_map_alloc, - .map_free = bpf_sk_storage_map_free, + .map_meta_equal = bpf_map_meta_equal, + .map_alloc_check = bpf_local_storage_map_alloc_check, + .map_alloc = sk_storage_map_alloc, + .map_free = sk_storage_map_free, .map_get_next_key = notsupp_get_next_key, .map_lookup_elem = bpf_fd_sk_storage_lookup_elem, .map_update_elem = bpf_fd_sk_storage_update_elem, .map_delete_elem = bpf_fd_sk_storage_delete_elem, - .map_check_btf = bpf_sk_storage_map_check_btf, - .map_btf_name = "bpf_sk_storage_map", + .map_check_btf = bpf_local_storage_map_check_btf, + .map_btf_name = "bpf_local_storage_map", .map_btf_id = &sk_storage_map_btf_id, + .map_local_storage_charge = sk_storage_charge, + .map_local_storage_uncharge = sk_storage_uncharge, + .map_owner_storage_ptr = sk_storage_ptr, }; const struct bpf_func_proto bpf_sk_storage_get_proto = { @@ -962,6 +379,30 @@ const struct bpf_func_proto bpf_sk_storage_delete_proto = { .arg2_type = ARG_PTR_TO_SOCKET, }; +BTF_ID_LIST(sk_storage_btf_ids) +BTF_ID_UNUSED +BTF_ID(struct, sock) + +const struct bpf_func_proto sk_storage_get_btf_proto = { + .func = bpf_sk_storage_get, + .gpl_only = false, + .ret_type = RET_PTR_TO_MAP_VALUE_OR_NULL, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_BTF_ID, + .arg3_type = ARG_PTR_TO_MAP_VALUE_OR_NULL, + .arg4_type = ARG_ANYTHING, + .btf_id = sk_storage_btf_ids, +}; + +const struct bpf_func_proto sk_storage_delete_btf_proto = { + .func = 
bpf_sk_storage_delete, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_CONST_MAP_PTR, + .arg2_type = ARG_PTR_TO_BTF_ID, + .btf_id = sk_storage_btf_ids, +}; + struct bpf_sk_storage_diag { u32 nr_maps; struct bpf_map *maps[]; @@ -1022,7 +463,7 @@ bpf_sk_storage_diag_alloc(const struct nlattr *nla_stgs) u32 nr_maps = 0; int rem, err; - /* bpf_sk_storage_map is currently limited to CAP_SYS_ADMIN as + /* bpf_local_storage_map is currently limited to CAP_SYS_ADMIN as * the map_alloc_check() side also does. */ if (!bpf_capable()) @@ -1072,13 +513,13 @@ err_free: } EXPORT_SYMBOL_GPL(bpf_sk_storage_diag_alloc); -static int diag_get(struct bpf_sk_storage_data *sdata, struct sk_buff *skb) +static int diag_get(struct bpf_local_storage_data *sdata, struct sk_buff *skb) { struct nlattr *nla_stg, *nla_value; - struct bpf_sk_storage_map *smap; + struct bpf_local_storage_map *smap; /* It cannot exceed max nlattr's payload */ - BUILD_BUG_ON(U16_MAX - NLA_HDRLEN < MAX_VALUE_SIZE); + BUILD_BUG_ON(U16_MAX - NLA_HDRLEN < BPF_LOCAL_STORAGE_MAX_VALUE_SIZE); nla_stg = nla_nest_start(skb, SK_DIAG_BPF_STORAGE); if (!nla_stg) @@ -1114,9 +555,9 @@ static int bpf_sk_storage_diag_put_all(struct sock *sk, struct sk_buff *skb, { /* stg_array_type (e.g. INET_DIAG_BPF_SK_STORAGES) */ unsigned int diag_size = nla_total_size(0); - struct bpf_sk_storage *sk_storage; - struct bpf_sk_storage_elem *selem; - struct bpf_sk_storage_map *smap; + struct bpf_local_storage *sk_storage; + struct bpf_local_storage_elem *selem; + struct bpf_local_storage_map *smap; struct nlattr *nla_stgs; unsigned int saved_len; int err = 0; @@ -1169,8 +610,8 @@ int bpf_sk_storage_diag_put(struct bpf_sk_storage_diag *diag, { /* stg_array_type (e.g. INET_DIAG_BPF_SK_STORAGES) */ unsigned int diag_size = nla_total_size(0); - struct bpf_sk_storage *sk_storage; - struct bpf_sk_storage_data *sdata; + struct bpf_local_storage *sk_storage; + struct bpf_local_storage_data *sdata; struct nlattr *nla_stgs; unsigned int saved_len; int err = 0; @@ -1197,8 +638,8 @@ int bpf_sk_storage_diag_put(struct bpf_sk_storage_diag *diag, saved_len = skb->len; for (i = 0; i < diag->nr_maps; i++) { - sdata = __sk_storage_lookup(sk_storage, - (struct bpf_sk_storage_map *)diag->maps[i], + sdata = bpf_local_storage_lookup(sk_storage, + (struct bpf_local_storage_map *)diag->maps[i], false); if (!sdata) @@ -1235,19 +676,19 @@ struct bpf_iter_seq_sk_storage_map_info { unsigned skip_elems; }; -static struct bpf_sk_storage_elem * +static struct bpf_local_storage_elem * bpf_sk_storage_map_seq_find_next(struct bpf_iter_seq_sk_storage_map_info *info, - struct bpf_sk_storage_elem *prev_selem) + struct bpf_local_storage_elem *prev_selem) { - struct bpf_sk_storage *sk_storage; - struct bpf_sk_storage_elem *selem; + struct bpf_local_storage *sk_storage; + struct bpf_local_storage_elem *selem; u32 skip_elems = info->skip_elems; - struct bpf_sk_storage_map *smap; + struct bpf_local_storage_map *smap; u32 bucket_id = info->bucket_id; u32 i, count, n_buckets; - struct bucket *b; + struct bpf_local_storage_map_bucket *b; - smap = (struct bpf_sk_storage_map *)info->map; + smap = (struct bpf_local_storage_map *)info->map; n_buckets = 1U << smap->bucket_log; if (bucket_id >= n_buckets) return NULL; @@ -1257,7 +698,7 @@ bpf_sk_storage_map_seq_find_next(struct bpf_iter_seq_sk_storage_map_info *info, count = 0; while (selem) { selem = hlist_entry_safe(selem->map_node.next, - struct bpf_sk_storage_elem, map_node); + struct bpf_local_storage_elem, map_node); if (!selem) { /* not found, 
unlock and go to the next bucket */ b = &smap->buckets[bucket_id++]; @@ -1265,7 +706,7 @@ bpf_sk_storage_map_seq_find_next(struct bpf_iter_seq_sk_storage_map_info *info, skip_elems = 0; break; } - sk_storage = rcu_dereference_raw(selem->sk_storage); + sk_storage = rcu_dereference_raw(selem->local_storage); if (sk_storage) { info->skip_elems = skip_elems + count; return selem; @@ -1278,7 +719,7 @@ bpf_sk_storage_map_seq_find_next(struct bpf_iter_seq_sk_storage_map_info *info, raw_spin_lock_bh(&b->lock); count = 0; hlist_for_each_entry(selem, &b->list, map_node) { - sk_storage = rcu_dereference_raw(selem->sk_storage); + sk_storage = rcu_dereference_raw(selem->local_storage); if (sk_storage && count >= skip_elems) { info->bucket_id = i; info->skip_elems = count; @@ -1297,7 +738,7 @@ bpf_sk_storage_map_seq_find_next(struct bpf_iter_seq_sk_storage_map_info *info, static void *bpf_sk_storage_map_seq_start(struct seq_file *seq, loff_t *pos) { - struct bpf_sk_storage_elem *selem; + struct bpf_local_storage_elem *selem; selem = bpf_sk_storage_map_seq_find_next(seq->private, NULL); if (!selem) @@ -1330,11 +771,11 @@ DEFINE_BPF_ITER_FUNC(bpf_sk_storage_map, struct bpf_iter_meta *meta, void *value) static int __bpf_sk_storage_map_seq_show(struct seq_file *seq, - struct bpf_sk_storage_elem *selem) + struct bpf_local_storage_elem *selem) { struct bpf_iter_seq_sk_storage_map_info *info = seq->private; struct bpf_iter__bpf_sk_storage_map ctx = {}; - struct bpf_sk_storage *sk_storage; + struct bpf_local_storage *sk_storage; struct bpf_iter_meta meta; struct bpf_prog *prog; int ret = 0; @@ -1345,8 +786,8 @@ static int __bpf_sk_storage_map_seq_show(struct seq_file *seq, ctx.meta = &meta; ctx.map = info->map; if (selem) { - sk_storage = rcu_dereference_raw(selem->sk_storage); - ctx.sk = sk_storage->sk; + sk_storage = rcu_dereference_raw(selem->local_storage); + ctx.sk = sk_storage->owner; ctx.value = SDATA(selem)->data; } ret = bpf_iter_run_prog(prog, &ctx); @@ -1363,13 +804,13 @@ static int bpf_sk_storage_map_seq_show(struct seq_file *seq, void *v) static void bpf_sk_storage_map_seq_stop(struct seq_file *seq, void *v) { struct bpf_iter_seq_sk_storage_map_info *info = seq->private; - struct bpf_sk_storage_map *smap; - struct bucket *b; + struct bpf_local_storage_map *smap; + struct bpf_local_storage_map_bucket *b; if (!v) { (void)__bpf_sk_storage_map_seq_show(seq, v); } else { - smap = (struct bpf_sk_storage_map *)info->map; + smap = (struct bpf_local_storage_map *)info->map; b = &smap->buckets[info->bucket_id]; raw_spin_unlock_bh(&b->lock); } @@ -1437,6 +878,8 @@ static struct bpf_iter_reg bpf_sk_storage_map_reg_info = { .target = "bpf_sk_storage_map", .attach_target = bpf_iter_attach_map, .detach_target = bpf_iter_detach_map, + .show_fdinfo = bpf_iter_map_show_fdinfo, + .fill_link_info = bpf_iter_map_fill_link_info, .ctx_arg_info_size = 2, .ctx_arg_info = { { offsetof(struct bpf_iter__bpf_sk_storage_map, sk), diff --git a/net/core/filter.c b/net/core/filter.c index b2df52086445..47eef9a0be6a 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4459,6 +4459,7 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname, } else { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); + unsigned long timeout; if (optlen != sizeof(int)) return -EINVAL; @@ -4480,6 +4481,20 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname, tp->snd_ssthresh = val; } break; + case TCP_BPF_DELACK_MAX: + timeout = usecs_to_jiffies(val); + if (timeout > TCP_DELACK_MAX || 
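The iterator registered in bpf_sk_storage_map_reg_info walks one sk-storage map and hands the owning socket and the stored value to a BPF iterator program. A sketch of the BPF side, assuming a bpftool-generated vmlinux.h for the ctx type; the program and variable names are made up::

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* Count the elements of the sk-storage map this iterator is
   * attached to.  The ctx carries meta, map, sk and value, matching
   * DEFINE_BPF_ITER_FUNC(bpf_sk_storage_map, ...) above.
   */
  __u64 nr_elems;

  SEC("iter/bpf_sk_storage_map")
  int count_elems(struct bpf_iter__bpf_sk_storage_map *ctx)
  {
          if (ctx->sk && ctx->value)
                  __sync_fetch_and_add(&nr_elems, 1);

          return 0;
  }

  char _license[] SEC("license") = "GPL";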
+ timeout < TCP_TIMEOUT_MIN) + return -EINVAL; + inet_csk(sk)->icsk_delack_max = timeout; + break; + case TCP_BPF_RTO_MIN: + timeout = usecs_to_jiffies(val); + if (timeout > TCP_RTO_MIN || + timeout < TCP_TIMEOUT_MIN) + return -EINVAL; + inet_csk(sk)->icsk_rto_min = timeout; + break; case TCP_SAVE_SYN: if (val < 0 || val > 1) ret = -EINVAL; @@ -4550,9 +4565,9 @@ static int _bpf_getsockopt(struct sock *sk, int level, int optname, tp = tcp_sk(sk); if (optlen <= 0 || !tp->saved_syn || - optlen > tp->saved_syn[0]) + optlen > tcp_saved_syn_len(tp->saved_syn)) goto err_clear; - memcpy(optval, tp->saved_syn + 1, optlen); + memcpy(optval, tp->saved_syn->data, optlen); break; default: goto err_clear; @@ -4654,9 +4669,99 @@ static const struct bpf_func_proto bpf_sock_ops_setsockopt_proto = { .arg5_type = ARG_CONST_SIZE, }; +static int bpf_sock_ops_get_syn(struct bpf_sock_ops_kern *bpf_sock, + int optname, const u8 **start) +{ + struct sk_buff *syn_skb = bpf_sock->syn_skb; + const u8 *hdr_start; + int ret; + + if (syn_skb) { + /* sk is a request_sock here */ + + if (optname == TCP_BPF_SYN) { + hdr_start = syn_skb->data; + ret = tcp_hdrlen(syn_skb); + } else if (optname == TCP_BPF_SYN_IP) { + hdr_start = skb_network_header(syn_skb); + ret = skb_network_header_len(syn_skb) + + tcp_hdrlen(syn_skb); + } else { + /* optname == TCP_BPF_SYN_MAC */ + hdr_start = skb_mac_header(syn_skb); + ret = skb_mac_header_len(syn_skb) + + skb_network_header_len(syn_skb) + + tcp_hdrlen(syn_skb); + } + } else { + struct sock *sk = bpf_sock->sk; + struct saved_syn *saved_syn; + + if (sk->sk_state == TCP_NEW_SYN_RECV) + /* synack retransmit. bpf_sock->syn_skb will + * not be available. It has to resort to + * saved_syn (if it is saved). + */ + saved_syn = inet_reqsk(sk)->saved_syn; + else + saved_syn = tcp_sk(sk)->saved_syn; + + if (!saved_syn) + return -ENOENT; + + if (optname == TCP_BPF_SYN) { + hdr_start = saved_syn->data + + saved_syn->mac_hdrlen + + saved_syn->network_hdrlen; + ret = saved_syn->tcp_hdrlen; + } else if (optname == TCP_BPF_SYN_IP) { + hdr_start = saved_syn->data + + saved_syn->mac_hdrlen; + ret = saved_syn->network_hdrlen + + saved_syn->tcp_hdrlen; + } else { + /* optname == TCP_BPF_SYN_MAC */ + + /* TCP_SAVE_SYN may not have saved the mac hdr */ + if (!saved_syn->mac_hdrlen) + return -ENOENT; + + hdr_start = saved_syn->data; + ret = saved_syn->mac_hdrlen + + saved_syn->network_hdrlen + + saved_syn->tcp_hdrlen; + } + } + + *start = hdr_start; + return ret; +} + BPF_CALL_5(bpf_sock_ops_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, int, level, int, optname, char *, optval, int, optlen) { + if (IS_ENABLED(CONFIG_INET) && level == SOL_TCP && + optname >= TCP_BPF_SYN && optname <= TCP_BPF_SYN_MAC) { + int ret, copy_len = 0; + const u8 *start; + + ret = bpf_sock_ops_get_syn(bpf_sock, optname, &start); + if (ret > 0) { + copy_len = ret; + if (optlen < copy_len) { + copy_len = optlen; + ret = -ENOSPC; + } + + memcpy(optval, start, copy_len); + } + + /* Zero out unused buffer at the end */ + memset(optval + copy_len, 0, optlen - copy_len); + + return ret; + } + return _bpf_getsockopt(bpf_sock->sk, level, optname, optval, optlen); } @@ -6150,6 +6255,232 @@ static const struct bpf_func_proto bpf_sk_assign_proto = { .arg3_type = ARG_ANYTHING, }; +static const u8 *bpf_search_tcp_opt(const u8 *op, const u8 *opend, + u8 search_kind, const u8 *magic, + u8 magic_len, bool *eol) +{ + u8 kind, kind_len; + + *eol = false; + + while (op < opend) { + kind = op[0]; + + if (kind == TCPOPT_EOL) { + *eol = true; + return 
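On the program side, the new TCP_BPF_DELACK_MAX/TCP_BPF_RTO_MIN knobs and the TCP_BPF_SYN* read path above can be exercised from a sock_ops program roughly as follows. A sketch only: the microsecond values, the buffer size and the chosen callback are arbitrary, and reading the SYN outside of SYN-time callbacks relies on TCP_SAVE_SYN having been enabled on the listener::

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  #ifndef SOL_TCP
  #define SOL_TCP 6
  #endif

  SEC("sockops")
  int tune_conn(struct bpf_sock_ops *skops)
  {
          int rto_min_us = 20000;         /* 20 ms */
          int delack_max_us = 10000;      /* 10 ms */
          char syn[80];
          int ret;

          if (skops->op != BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB)
                  return 1;

          bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN,
                         &rto_min_us, sizeof(rto_min_us));
          bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX,
                         &delack_max_us, sizeof(delack_max_us));

          /* IP + TCP header of the SYN that created this connection.
           * Returns the number of bytes, -ENOENT if the SYN was not
           * saved, or -ENOSPC if syn[] is too small (the helper still
           * copies what fits and zero-fills the rest).
           */
          ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP,
                               syn, sizeof(syn));
          if (ret < 0)
                  return 1;

          return 1;
  }

  char _license[] SEC("license") = "GPL";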
ERR_PTR(-ENOMSG); + } else if (kind == TCPOPT_NOP) { + op++; + continue; + } + + if (opend - op < 2 || opend - op < op[1] || op[1] < 2) + /* Something is wrong in the received header. + * Follow the TCP stack's tcp_parse_options() + * and just bail here. + */ + return ERR_PTR(-EFAULT); + + kind_len = op[1]; + if (search_kind == kind) { + if (!magic_len) + return op; + + if (magic_len > kind_len - 2) + return ERR_PTR(-ENOMSG); + + if (!memcmp(&op[2], magic, magic_len)) + return op; + } + + op += kind_len; + } + + return ERR_PTR(-ENOMSG); +} + +BPF_CALL_4(bpf_sock_ops_load_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock, + void *, search_res, u32, len, u64, flags) +{ + bool eol, load_syn = flags & BPF_LOAD_HDR_OPT_TCP_SYN; + const u8 *op, *opend, *magic, *search = search_res; + u8 search_kind, search_len, copy_len, magic_len; + int ret; + + /* 2 byte is the minimal option len except TCPOPT_NOP and + * TCPOPT_EOL which are useless for the bpf prog to learn + * and this helper disallow loading them also. + */ + if (len < 2 || flags & ~BPF_LOAD_HDR_OPT_TCP_SYN) + return -EINVAL; + + search_kind = search[0]; + search_len = search[1]; + + if (search_len > len || search_kind == TCPOPT_NOP || + search_kind == TCPOPT_EOL) + return -EINVAL; + + if (search_kind == TCPOPT_EXP || search_kind == 253) { + /* 16 or 32 bit magic. +2 for kind and kind length */ + if (search_len != 4 && search_len != 6) + return -EINVAL; + magic = &search[2]; + magic_len = search_len - 2; + } else { + if (search_len) + return -EINVAL; + magic = NULL; + magic_len = 0; + } + + if (load_syn) { + ret = bpf_sock_ops_get_syn(bpf_sock, TCP_BPF_SYN, &op); + if (ret < 0) + return ret; + + opend = op + ret; + op += sizeof(struct tcphdr); + } else { + if (!bpf_sock->skb || + bpf_sock->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB) + /* This bpf_sock->op cannot call this helper */ + return -EPERM; + + opend = bpf_sock->skb_data_end; + op = bpf_sock->skb->data + sizeof(struct tcphdr); + } + + op = bpf_search_tcp_opt(op, opend, search_kind, magic, magic_len, + &eol); + if (IS_ERR(op)) + return PTR_ERR(op); + + copy_len = op[1]; + ret = copy_len; + if (copy_len > len) { + ret = -ENOSPC; + copy_len = len; + } + + memcpy(search_res, op, copy_len); + return ret; +} + +static const struct bpf_func_proto bpf_sock_ops_load_hdr_opt_proto = { + .func = bpf_sock_ops_load_hdr_opt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, +}; + +BPF_CALL_4(bpf_sock_ops_store_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock, + const void *, from, u32, len, u64, flags) +{ + u8 new_kind, new_kind_len, magic_len = 0, *opend; + const u8 *op, *new_op, *magic = NULL; + struct sk_buff *skb; + bool eol; + + if (bpf_sock->op != BPF_SOCK_OPS_WRITE_HDR_OPT_CB) + return -EPERM; + + if (len < 2 || flags) + return -EINVAL; + + new_op = from; + new_kind = new_op[0]; + new_kind_len = new_op[1]; + + if (new_kind_len > len || new_kind == TCPOPT_NOP || + new_kind == TCPOPT_EOL) + return -EINVAL; + + if (new_kind_len > bpf_sock->remaining_opt_len) + return -ENOSPC; + + /* 253 is another experimental kind */ + if (new_kind == TCPOPT_EXP || new_kind == 253) { + if (new_kind_len < 4) + return -EINVAL; + /* Match for the 2 byte magic also. + * RFC 6994: the magic could be 2 or 4 bytes. + * Hence, matching by 2 byte only is on the + * conservative side but it is the right + * thing to do for the 'search-for-duplication' + * purpose. 
+ */ + magic = &new_op[2]; + magic_len = 2; + } + + /* Check for duplication */ + skb = bpf_sock->skb; + op = skb->data + sizeof(struct tcphdr); + opend = bpf_sock->skb_data_end; + + op = bpf_search_tcp_opt(op, opend, new_kind, magic, magic_len, + &eol); + if (!IS_ERR(op)) + return -EEXIST; + + if (PTR_ERR(op) != -ENOMSG) + return PTR_ERR(op); + + if (eol) + /* The option has been ended. Treat it as no more + * header option can be written. + */ + return -ENOSPC; + + /* No duplication found. Store the header option. */ + memcpy(opend, from, new_kind_len); + + bpf_sock->remaining_opt_len -= new_kind_len; + bpf_sock->skb_data_end += new_kind_len; + + return 0; +} + +static const struct bpf_func_proto bpf_sock_ops_store_hdr_opt_proto = { + .func = bpf_sock_ops_store_hdr_opt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_PTR_TO_MEM, + .arg3_type = ARG_CONST_SIZE, + .arg4_type = ARG_ANYTHING, +}; + +BPF_CALL_3(bpf_sock_ops_reserve_hdr_opt, struct bpf_sock_ops_kern *, bpf_sock, + u32, len, u64, flags) +{ + if (bpf_sock->op != BPF_SOCK_OPS_HDR_OPT_LEN_CB) + return -EPERM; + + if (flags || len < 2) + return -EINVAL; + + if (len > bpf_sock->remaining_opt_len) + return -ENOSPC; + + bpf_sock->remaining_opt_len -= len; + + return 0; +} + +static const struct bpf_func_proto bpf_sock_ops_reserve_hdr_opt_proto = { + .func = bpf_sock_ops_reserve_hdr_opt, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, + .arg2_type = ARG_ANYTHING, + .arg3_type = ARG_ANYTHING, +}; + #endif /* CONFIG_INET */ bool bpf_helper_changes_pkt_data(void *func) @@ -6179,6 +6510,9 @@ bool bpf_helper_changes_pkt_data(void *func) func == bpf_lwt_seg6_adjust_srh || func == bpf_lwt_seg6_action || #endif +#ifdef CONFIG_INET + func == bpf_sock_ops_store_hdr_opt || +#endif func == bpf_lwt_in_push_encap || func == bpf_lwt_xmit_push_encap) return true; @@ -6550,6 +6884,12 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) case BPF_FUNC_sk_storage_delete: return &bpf_sk_storage_delete_proto; #ifdef CONFIG_INET + case BPF_FUNC_load_hdr_opt: + return &bpf_sock_ops_load_hdr_opt_proto; + case BPF_FUNC_store_hdr_opt: + return &bpf_sock_ops_store_hdr_opt_proto; + case BPF_FUNC_reserve_hdr_opt: + return &bpf_sock_ops_reserve_hdr_opt_proto; case BPF_FUNC_tcp_sock: return &bpf_tcp_sock_proto; #endif /* CONFIG_INET */ @@ -7349,6 +7689,20 @@ static bool sock_ops_is_valid_access(int off, int size, return false; info->reg_type = PTR_TO_SOCKET_OR_NULL; break; + case offsetof(struct bpf_sock_ops, skb_data): + if (size != sizeof(__u64)) + return false; + info->reg_type = PTR_TO_PACKET; + break; + case offsetof(struct bpf_sock_ops, skb_data_end): + if (size != sizeof(__u64)) + return false; + info->reg_type = PTR_TO_PACKET_END; + break; + case offsetof(struct bpf_sock_ops, skb_tcp_flags): + bpf_ctx_record_field_size(info, size_default); + return bpf_ctx_narrow_access_ok(off, size, + size_default); default: if (size != size_default) return false; @@ -8450,17 +8804,22 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, return insn - insn_buf; switch (si->off) { - case offsetof(struct bpf_sock_ops, op) ... + case offsetof(struct bpf_sock_ops, op): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + op), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, op)); + break; + + case offsetof(struct bpf_sock_ops, replylong[0]) ... 
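Together with the skb_data/skb_tcp_flags ctx fields wired up here, these three helpers let a sock_ops program manage its own TCP option space. A write-side sketch (the experimental option kind, the 2-byte magic and the callbacks chosen are illustrative; errors are ignored for brevity)::

  #include <linux/bpf.h>
  #include <bpf/bpf_endian.h>
  #include <bpf/bpf_helpers.h>

  struct tcp_exp_opt {
          __u8  kind;     /* 254 == TCPOPT_EXP */
          __u8  len;
          __u16 magic;
          __u32 data;
  } __attribute__((packed));

  SEC("sockops")
  int write_exp_opt(struct bpf_sock_ops *skops)
  {
          struct tcp_exp_opt opt = {
                  .kind  = 254,
                  .len   = sizeof(opt),
                  .magic = bpf_htons(0xeB9F),  /* arbitrary example magic */
                  .data  = 1,
          };

          switch (skops->op) {
          case BPF_SOCK_OPS_TCP_CONNECT_CB:
          case BPF_SOCK_OPS_TCP_LISTEN_CB:
          case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
          case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                  /* opt in to the two new callbacks */
                  bpf_sock_ops_cb_flags_set(skops,
                          skops->bpf_sock_ops_cb_flags |
                          BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG);
                  break;
          case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
                  /* reserve room; the kernel rounds it up to 4 bytes */
                  bpf_reserve_hdr_opt(skops, sizeof(opt), 0);
                  break;
          case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
                  /* duplicate-checked copy into the reserved space */
                  bpf_store_hdr_opt(skops, &opt, sizeof(opt), 0);
                  break;
          }

          return 1;
  }

  char _license[] SEC("license") = "GPL";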
offsetof(struct bpf_sock_ops, replylong[3]): - BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, op) != - sizeof_field(struct bpf_sock_ops_kern, op)); BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, reply) != sizeof_field(struct bpf_sock_ops_kern, reply)); BUILD_BUG_ON(sizeof_field(struct bpf_sock_ops, replylong) != sizeof_field(struct bpf_sock_ops_kern, replylong)); off = si->off; - off -= offsetof(struct bpf_sock_ops, op); - off += offsetof(struct bpf_sock_ops_kern, op); + off -= offsetof(struct bpf_sock_ops, replylong[0]); + off += offsetof(struct bpf_sock_ops_kern, replylong[0]); if (type == BPF_WRITE) *insn++ = BPF_STX_MEM(BPF_W, si->dst_reg, si->src_reg, off); @@ -8681,6 +9040,49 @@ static u32 sock_ops_convert_ctx_access(enum bpf_access_type type, case offsetof(struct bpf_sock_ops, sk): SOCK_OPS_GET_SK(); break; + case offsetof(struct bpf_sock_ops, skb_data_end): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + skb_data_end), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + skb_data_end)); + break; + case offsetof(struct bpf_sock_ops, skb_data): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + skb), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + skb)); + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, data), + si->dst_reg, si->dst_reg, + offsetof(struct sk_buff, data)); + break; + case offsetof(struct bpf_sock_ops, skb_len): + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + skb), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + skb)); + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct sk_buff, len), + si->dst_reg, si->dst_reg, + offsetof(struct sk_buff, len)); + break; + case offsetof(struct bpf_sock_ops, skb_tcp_flags): + off = offsetof(struct sk_buff, cb); + off += offsetof(struct tcp_skb_cb, tcp_flags); + *target_size = sizeof_field(struct tcp_skb_cb, tcp_flags); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct bpf_sock_ops_kern, + skb), + si->dst_reg, si->src_reg, + offsetof(struct bpf_sock_ops_kern, + skb)); + *insn++ = BPF_JMP_IMM(BPF_JEQ, si->dst_reg, 0, 1); + *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct tcp_skb_cb, + tcp_flags), + si->dst_reg, si->dst_reg, off); + break; } return insn - insn_buf; } diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 6a32a1fd34f8..1c81caf9630f 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -494,14 +494,34 @@ end: struct sk_psock *sk_psock_init(struct sock *sk, int node) { - struct sk_psock *psock = kzalloc_node(sizeof(*psock), - GFP_ATOMIC | __GFP_NOWARN, - node); - if (!psock) - return NULL; + struct sk_psock *psock; + struct proto *prot; + + write_lock_bh(&sk->sk_callback_lock); + + if (inet_csk_has_ulp(sk)) { + psock = ERR_PTR(-EINVAL); + goto out; + } + + if (sk->sk_user_data) { + psock = ERR_PTR(-EBUSY); + goto out; + } + psock = kzalloc_node(sizeof(*psock), GFP_ATOMIC | __GFP_NOWARN, node); + if (!psock) { + psock = ERR_PTR(-ENOMEM); + goto out; + } + + prot = READ_ONCE(sk->sk_prot); psock->sk = sk; - psock->eval = __SK_NONE; + psock->eval = __SK_NONE; + psock->sk_proto = prot; + psock->saved_unhash = prot->unhash; + psock->saved_close = prot->close; + psock->saved_write_space = sk->sk_write_space; INIT_LIST_HEAD(&psock->link); spin_lock_init(&psock->link_lock); @@ -516,6 +536,8 @@ struct sk_psock *sk_psock_init(struct sock *sk, int node) rcu_assign_sk_user_data_nocopy(sk, psock); sock_hold(sk); +out: + 
write_unlock_bh(&sk->sk_callback_lock); return psock; } EXPORT_SYMBOL_GPL(sk_psock_init); diff --git a/net/core/sock_map.c b/net/core/sock_map.c index 119f52a99dc1..078386d7d9a2 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -184,8 +184,6 @@ static int sock_map_init_proto(struct sock *sk, struct sk_psock *psock) { struct proto *prot; - sock_owned_by_me(sk); - switch (sk->sk_type) { case SOCK_STREAM: prot = tcp_bpf_get_proto(sk, psock); @@ -272,8 +270,8 @@ static int sock_map_link(struct bpf_map *map, struct sk_psock_progs *progs, } } else { psock = sk_psock_init(sk, map->numa_node); - if (!psock) { - ret = -ENOMEM; + if (IS_ERR(psock)) { + ret = PTR_ERR(psock); goto out_progs; } } @@ -322,8 +320,8 @@ static int sock_map_link_no_progs(struct bpf_map *map, struct sock *sk) if (!psock) { psock = sk_psock_init(sk, map->numa_node); - if (!psock) - return -ENOMEM; + if (IS_ERR(psock)) + return PTR_ERR(psock); } ret = sock_map_init_proto(sk, psock); @@ -478,8 +476,6 @@ static int sock_map_update_common(struct bpf_map *map, u32 idx, return -EINVAL; if (unlikely(idx >= map->max_entries)) return -E2BIG; - if (inet_csk_has_ulp(sk)) - return -EINVAL; link = sk_psock_init_link(); if (!link) @@ -563,10 +559,12 @@ static bool sock_map_sk_state_allowed(const struct sock *sk) return false; } -static int sock_map_update_elem(struct bpf_map *map, void *key, - void *value, u64 flags) +static int sock_hash_update_common(struct bpf_map *map, void *key, + struct sock *sk, u64 flags); + +int sock_map_update_elem_sys(struct bpf_map *map, void *key, void *value, + u64 flags) { - u32 idx = *(u32 *)key; struct socket *sock; struct sock *sk; int ret; @@ -595,14 +593,38 @@ static int sock_map_update_elem(struct bpf_map *map, void *key, sock_map_sk_acquire(sk); if (!sock_map_sk_state_allowed(sk)) ret = -EOPNOTSUPP; + else if (map->map_type == BPF_MAP_TYPE_SOCKMAP) + ret = sock_map_update_common(map, *(u32 *)key, sk, flags); else - ret = sock_map_update_common(map, idx, sk, flags); + ret = sock_hash_update_common(map, key, sk, flags); sock_map_sk_release(sk); out: fput(sock->file); return ret; } +static int sock_map_update_elem(struct bpf_map *map, void *key, + void *value, u64 flags) +{ + struct sock *sk = (struct sock *)value; + int ret; + + if (!sock_map_sk_is_suitable(sk)) + return -EOPNOTSUPP; + + local_bh_disable(); + bh_lock_sock(sk); + if (!sock_map_sk_state_allowed(sk)) + ret = -EOPNOTSUPP; + else if (map->map_type == BPF_MAP_TYPE_SOCKMAP) + ret = sock_map_update_common(map, *(u32 *)key, sk, flags); + else + ret = sock_hash_update_common(map, key, sk, flags); + bh_unlock_sock(sk); + local_bh_enable(); + return ret; +} + BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, sops, struct bpf_map *, map, void *, key, u64, flags) { @@ -683,6 +705,7 @@ const struct bpf_func_proto bpf_msg_redirect_map_proto = { static int sock_map_btf_id; const struct bpf_map_ops sock_map_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = sock_map_alloc, .map_free = sock_map_free, .map_get_next_key = sock_map_get_next_key, @@ -855,8 +878,6 @@ static int sock_hash_update_common(struct bpf_map *map, void *key, WARN_ON_ONCE(!rcu_read_lock_held()); if (unlikely(flags > BPF_EXIST)) return -EINVAL; - if (inet_csk_has_ulp(sk)) - return -EINVAL; link = sk_psock_init_link(); if (!link) @@ -915,45 +936,6 @@ out_free: return ret; } -static int sock_hash_update_elem(struct bpf_map *map, void *key, - void *value, u64 flags) -{ - struct socket *sock; - struct sock *sk; - int ret; - u64 ufd; - - if (map->value_size == 
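With sk_psock_init() now reporting errors via ERR_PTR() and the fd-based update path split out into sock_map_update_elem_sys(), the user-space way of populating a SOCKMAP/SOCKHASH is unchanged. A sketch of that path using libbpf (error handling trimmed)::

  #include <linux/bpf.h>
  #include <bpf/bpf.h>

  /* Insert an established TCP socket fd; the kernel resolves the fd
   * to the socket in sock_map_update_elem_sys().
   */
  static int add_sock(int map_fd, __u32 key, int sock_fd)
  {
          return bpf_map_update_elem(map_fd, &key, &sock_fd, BPF_NOEXIST);
  }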
sizeof(u64)) - ufd = *(u64 *)value; - else - ufd = *(u32 *)value; - if (ufd > S32_MAX) - return -EINVAL; - - sock = sockfd_lookup(ufd, &ret); - if (!sock) - return ret; - sk = sock->sk; - if (!sk) { - ret = -EINVAL; - goto out; - } - if (!sock_map_sk_is_suitable(sk)) { - ret = -EOPNOTSUPP; - goto out; - } - - sock_map_sk_acquire(sk); - if (!sock_map_sk_state_allowed(sk)) - ret = -EOPNOTSUPP; - else - ret = sock_hash_update_common(map, key, sk, flags); - sock_map_sk_release(sk); -out: - fput(sock->file); - return ret; -} - static int sock_hash_get_next_key(struct bpf_map *map, void *key, void *key_next) { @@ -1219,10 +1201,11 @@ const struct bpf_func_proto bpf_msg_redirect_hash_proto = { static int sock_hash_map_btf_id; const struct bpf_map_ops sock_hash_ops = { + .map_meta_equal = bpf_map_meta_equal, .map_alloc = sock_hash_alloc, .map_free = sock_hash_free, .map_get_next_key = sock_hash_get_next_key, - .map_update_elem = sock_hash_update_elem, + .map_update_elem = sock_map_update_elem, .map_delete_elem = sock_hash_delete_elem, .map_lookup_elem = sock_hash_lookup, .map_lookup_elem_sys_only = sock_hash_lookup_sys, diff --git a/net/ethtool/channels.c b/net/ethtool/channels.c index 9ef54cdcf662..9ecda09ecb11 100644 --- a/net/ethtool/channels.c +++ b/net/ethtool/channels.c @@ -223,7 +223,7 @@ int ethnl_set_channels(struct sk_buff *skb, struct genl_info *info) from_channel = channels.combined_count + min(channels.rx_count, channels.tx_count); for (i = from_channel; i < old_total; i++) - if (xdp_get_umem_from_qid(dev, i)) { + if (xsk_get_pool_from_qid(dev, i)) { GENL_SET_ERR_MSG(info, "requested channel counts are too low for existing zerocopy AF_XDP sockets"); return -EINVAL; } diff --git a/net/ethtool/ioctl.c b/net/ethtool/ioctl.c index e6f5cf52023c..d497ca064ef7 100644 --- a/net/ethtool/ioctl.c +++ b/net/ethtool/ioctl.c @@ -1706,7 +1706,7 @@ static noinline_for_stack int ethtool_set_channels(struct net_device *dev, min(channels.rx_count, channels.tx_count); to_channel = curr.combined_count + max(curr.rx_count, curr.tx_count); for (i = from_channel; i < to_channel; i++) - if (xdp_get_umem_from_qid(dev, i)) + if (xsk_get_pool_from_qid(dev, i)) return -EINVAL; ret = dev->ethtool_ops->set_channels(dev, &channels); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 31f3b858db81..57a568875539 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -418,6 +418,8 @@ void tcp_init_sock(struct sock *sk) INIT_LIST_HEAD(&tp->tsorted_sent_queue); icsk->icsk_rto = TCP_TIMEOUT_INIT; + icsk->icsk_rto_min = TCP_RTO_MIN; + icsk->icsk_delack_max = TCP_DELACK_MAX; tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT); minmax_reset(&tp->rtt_min, tcp_jiffies32, ~0U); @@ -2685,6 +2687,8 @@ int tcp_disconnect(struct sock *sk, int flags) icsk->icsk_backoff = 0; icsk->icsk_probes_out = 0; icsk->icsk_rto = TCP_TIMEOUT_INIT; + icsk->icsk_rto_min = TCP_RTO_MIN; + icsk->icsk_delack_max = TCP_DELACK_MAX; tp->snd_ssthresh = TCP_INFINITE_SSTHRESH; tp->snd_cwnd = TCP_INIT_CWND; tp->snd_cwnd_cnt = 0; @@ -3207,7 +3211,8 @@ static int do_tcp_setsockopt(struct sock *sk, int level, int optname, break; case TCP_SAVE_SYN: - if (val < 0 || val > 1) + /* 0: disable, 1: enable, 2: start from ether_header */ + if (val < 0 || val > 2) err = -EINVAL; else tp->save_syn = val; @@ -3788,20 +3793,21 @@ static int do_tcp_getsockopt(struct sock *sk, int level, lock_sock(sk); if (tp->saved_syn) { - if (len < tp->saved_syn[0]) { - if (put_user(tp->saved_syn[0], optlen)) { + if (len < tcp_saved_syn_len(tp->saved_syn)) { + if 
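The new save_syn value of 2 makes TCP_SAVE_SYN keep the link-layer header as well, which is what the TCP_BPF_SYN_MAC read path in filter.c depends on. From user space the socket options are used as before; a sketch (constants from linux/tcp.h, buffer handling trimmed)::

  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <linux/tcp.h>

  /* Enable SYN saving on the listener before connections arrive,
   * then read the saved headers from an accepted socket.
   */
  static int fetch_saved_syn(int listen_fd, int accepted_fd,
                             void *buf, socklen_t *len)
  {
          int save = 2;   /* 0: off, 1: from IP header, 2: from mac header */

          if (setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN,
                         &save, sizeof(save)))
                  return -1;

          /* ... accept() the connection as accepted_fd ... */

          return getsockopt(accepted_fd, IPPROTO_TCP, TCP_SAVED_SYN,
                            buf, len);
  }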
(put_user(tcp_saved_syn_len(tp->saved_syn), + optlen)) { release_sock(sk); return -EFAULT; } release_sock(sk); return -EINVAL; } - len = tp->saved_syn[0]; + len = tcp_saved_syn_len(tp->saved_syn); if (put_user(len, optlen)) { release_sock(sk); return -EFAULT; } - if (copy_to_user(optval, tp->saved_syn + 1, len)) { + if (copy_to_user(optval, tp->saved_syn->data, len)) { release_sock(sk); return -EFAULT; } diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 7aa68f4aae6c..37f4cb2bba5c 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -567,10 +567,9 @@ static void tcp_bpf_rebuild_protos(struct proto prot[TCP_BPF_NUM_CFGS], prot[TCP_BPF_TX].sendpage = tcp_bpf_sendpage; } -static void tcp_bpf_check_v6_needs_rebuild(struct sock *sk, struct proto *ops) +static void tcp_bpf_check_v6_needs_rebuild(struct proto *ops) { - if (sk->sk_family == AF_INET6 && - unlikely(ops != smp_load_acquire(&tcpv6_prot_saved))) { + if (unlikely(ops != smp_load_acquire(&tcpv6_prot_saved))) { spin_lock_bh(&tcpv6_prot_lock); if (likely(ops != tcpv6_prot_saved)) { tcp_bpf_rebuild_protos(tcp_bpf_prots[TCP_BPF_IPV6], ops); @@ -603,13 +602,11 @@ struct proto *tcp_bpf_get_proto(struct sock *sk, struct sk_psock *psock) int family = sk->sk_family == AF_INET6 ? TCP_BPF_IPV6 : TCP_BPF_IPV4; int config = psock->progs.msg_parser ? TCP_BPF_TX : TCP_BPF_BASE; - if (!psock->sk_proto) { - struct proto *ops = READ_ONCE(sk->sk_prot); - - if (tcp_bpf_assert_proto_ops(ops)) + if (sk->sk_family == AF_INET6) { + if (tcp_bpf_assert_proto_ops(psock->sk_proto)) return ERR_PTR(-EINVAL); - tcp_bpf_check_v6_needs_rebuild(sk, ops); + tcp_bpf_check_v6_needs_rebuild(psock->sk_proto); } return &tcp_bpf_prots[family][config]; diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c index 09b62de04eea..af2814c9342a 100644 --- a/net/ipv4/tcp_fastopen.c +++ b/net/ipv4/tcp_fastopen.c @@ -295,7 +295,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk, refcount_set(&req->rsk_refcnt, 2); /* Now finish processing the fastopen child socket. */ - tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); + tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, skb); tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 184ea556f50e..4337841faeff 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -138,6 +138,69 @@ void clean_acked_data_flush(void) EXPORT_SYMBOL_GPL(clean_acked_data_flush); #endif +#ifdef CONFIG_CGROUP_BPF +static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb) +{ + bool unknown_opt = tcp_sk(sk)->rx_opt.saw_unknown && + BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG); + bool parse_all_opt = BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG); + struct bpf_sock_ops_kern sock_ops; + + if (likely(!unknown_opt && !parse_all_opt)) + return; + + /* The skb will be handled in the + * bpf_skops_established() or + * bpf_skops_write_hdr_opt(). 
+ */ + switch (sk->sk_state) { + case TCP_SYN_RECV: + case TCP_SYN_SENT: + case TCP_LISTEN: + return; + } + + sock_owned_by_me(sk); + + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + sock_ops.op = BPF_SOCK_OPS_PARSE_HDR_OPT_CB; + sock_ops.is_fullsock = 1; + sock_ops.sk = sk; + bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb)); + + BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); +} + +static void bpf_skops_established(struct sock *sk, int bpf_op, + struct sk_buff *skb) +{ + struct bpf_sock_ops_kern sock_ops; + + sock_owned_by_me(sk); + + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + sock_ops.op = bpf_op; + sock_ops.is_fullsock = 1; + sock_ops.sk = sk; + /* sk with TCP_REPAIR_ON does not have skb in tcp_finish_connect */ + if (skb) + bpf_skops_init_skb(&sock_ops, skb, tcp_hdrlen(skb)); + + BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); +} +#else +static void bpf_skops_parse_hdr(struct sock *sk, struct sk_buff *skb) +{ +} + +static void bpf_skops_established(struct sock *sk, int bpf_op, + struct sk_buff *skb) +{ +} +#endif + static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb, unsigned int len) { @@ -3801,7 +3864,7 @@ static void tcp_parse_fastopen_option(int len, const unsigned char *cookie, foc->exp = exp_opt; } -static void smc_parse_options(const struct tcphdr *th, +static bool smc_parse_options(const struct tcphdr *th, struct tcp_options_received *opt_rx, const unsigned char *ptr, int opsize) @@ -3810,10 +3873,13 @@ static void smc_parse_options(const struct tcphdr *th, if (static_branch_unlikely(&tcp_have_smc)) { if (th->syn && !(opsize & 1) && opsize >= TCPOLEN_EXP_SMC_BASE && - get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC) + get_unaligned_be32(ptr) == TCPOPT_SMC_MAGIC) { opt_rx->smc_ok = 1; + return true; + } } #endif + return false; } /* Try to parse the MSS option from the TCP header. 
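bpf_skops_parse_hdr() is the receive-side hook for the option helpers: with BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG set, an established socket gets a BPF_SOCK_OPS_PARSE_HDR_OPT_CB callback whenever the stack saw an option it does not consume itself. A parser sketch (the kind and magic mirror the write-side sketch earlier and are illustrative)::

  #include <linux/bpf.h>
  #include <bpf/bpf_endian.h>
  #include <bpf/bpf_helpers.h>

  SEC("sockops")
  int parse_exp_opt(struct bpf_sock_ops *skops)
  {
          /* search key: kind 254, search_len 4 => a 2-byte magic follows */
          __u8 search[8] = { 254, 4, };
          __u16 magic = bpf_htons(0xeB9F);
          int ret;

          __builtin_memcpy(&search[2], &magic, sizeof(magic));

          switch (skops->op) {
          case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
          case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                  bpf_sock_ops_cb_flags_set(skops,
                          skops->bpf_sock_ops_cb_flags |
                          BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG);
                  break;
          case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
                  ret = bpf_load_hdr_opt(skops, search, sizeof(search), 0);
                  if (ret < 0)
                          return 1;       /* option not present */
                  /* search[] now holds the complete option (ret bytes) */
                  break;
          }

          return 1;
  }

  char _license[] SEC("license") = "GPL";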
Return 0 on failure, clamped @@ -3874,6 +3940,7 @@ void tcp_parse_options(const struct net *net, ptr = (const unsigned char *)(th + 1); opt_rx->saw_tstamp = 0; + opt_rx->saw_unknown = 0; while (length > 0) { int opcode = *ptr++; @@ -3964,15 +4031,21 @@ void tcp_parse_options(const struct net *net, */ if (opsize >= TCPOLEN_EXP_FASTOPEN_BASE && get_unaligned_be16(ptr) == - TCPOPT_FASTOPEN_MAGIC) + TCPOPT_FASTOPEN_MAGIC) { tcp_parse_fastopen_option(opsize - TCPOLEN_EXP_FASTOPEN_BASE, ptr + 2, th->syn, foc, true); - else - smc_parse_options(th, opt_rx, ptr, - opsize); + break; + } + + if (smc_parse_options(th, opt_rx, ptr, opsize)) + break; + + opt_rx->saw_unknown = 1; break; + default: + opt_rx->saw_unknown = 1; } ptr += opsize-2; length -= opsize; @@ -5590,6 +5663,8 @@ syn_challenge: goto discard; } + bpf_skops_parse_hdr(sk, skb); + return true; discard: @@ -5798,7 +5873,7 @@ discard: } EXPORT_SYMBOL(tcp_rcv_established); -void tcp_init_transfer(struct sock *sk, int bpf_op) +void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb) { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); @@ -5819,7 +5894,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op) tp->snd_cwnd = tcp_init_cwnd(tp, __sk_dst_get(sk)); tp->snd_cwnd_stamp = tcp_jiffies32; - tcp_call_bpf(sk, bpf_op, 0, NULL); + bpf_skops_established(sk, bpf_op, skb); tcp_init_congestion_control(sk); tcp_init_buffer_space(sk); } @@ -5838,7 +5913,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb) sk_mark_napi_id(sk, skb); } - tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB); + tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB, skb); /* Prevent spurious tcp_cwnd_restart() on first data * packet. @@ -6310,7 +6385,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb) } else { tcp_try_undo_spurious_syn(sk); tp->retrans_stamp = 0; - tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB); + tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, + skb); WRITE_ONCE(tp->copied_seq, tp->rcv_nxt); } smp_mb(); @@ -6599,13 +6675,27 @@ static void tcp_reqsk_record_syn(const struct sock *sk, { if (tcp_sk(sk)->save_syn) { u32 len = skb_network_header_len(skb) + tcp_hdrlen(skb); - u32 *copy; + struct saved_syn *saved_syn; + u32 mac_hdrlen; + void *base; + + if (tcp_sk(sk)->save_syn == 2) { /* Save full header. */ + base = skb_mac_header(skb); + mac_hdrlen = skb_mac_header_len(skb); + len += mac_hdrlen; + } else { + base = skb_network_header(skb); + mac_hdrlen = 0; + } - copy = kmalloc(len + sizeof(u32), GFP_ATOMIC); - if (copy) { - copy[0] = len; - memcpy(©[1], skb_network_header(skb), len); - req->saved_syn = copy; + saved_syn = kmalloc(struct_size(saved_syn, data, len), + GFP_ATOMIC); + if (saved_syn) { + saved_syn->mac_hdrlen = mac_hdrlen; + saved_syn->network_hdrlen = skb_network_header_len(skb); + saved_syn->tcp_hdrlen = tcp_hdrlen(skb); + memcpy(saved_syn->data, base, len); + req->saved_syn = saved_syn; } } } @@ -6752,7 +6842,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, } if (fastopen_sk) { af_ops->send_synack(fastopen_sk, dst, &fl, req, - &foc, TCP_SYNACK_FASTOPEN); + &foc, TCP_SYNACK_FASTOPEN, skb); /* Add the child socket directly into the accept queue */ if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) { reqsk_fastopen_remove(fastopen_sk, req, false); @@ -6770,7 +6860,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops, tcp_timeout_init((struct sock *)req)); af_ops->send_synack(sk, dst, &fl, req, &foc, !want_cookie ? 
TCP_SYNACK_NORMAL : - TCP_SYNACK_COOKIE); + TCP_SYNACK_COOKIE, + skb); if (want_cookie) { reqsk_free(req); return 0; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 80b42d5c76fe..af27cfa9d8d3 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -965,7 +965,8 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type) + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb) { const struct inet_request_sock *ireq = inet_rsk(req); struct flowi4 fl4; @@ -976,7 +977,7 @@ static int tcp_v4_send_synack(const struct sock *sk, struct dst_entry *dst, if (!dst && (dst = inet_csk_route_req(sk, &fl4, req)) == NULL) return -1; - skb = tcp_make_synack(sk, dst, req, foc, synack_type); + skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb); if (skb) { __tcp_v4_send_check(skb, ireq->ir_loc_addr, ireq->ir_rmt_addr); diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 495dda2449fe..56c306e3cd2f 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -548,6 +548,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, newtp->fastopen_req = NULL; RCU_INIT_POINTER(newtp->fastopen_rsk, NULL); + bpf_skops_init_child(sk, newsk); tcp_bpf_clone(sk, newsk); __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS); diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 85ff417bda7f..ab79d36ed07f 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -438,6 +438,7 @@ struct tcp_out_options { u8 ws; /* window scale, 0 to disable */ u8 num_sack_blocks; /* number of SACK blocks to include */ u8 hash_size; /* bytes in hash_location */ + u8 bpf_opt_len; /* length of BPF hdr option */ __u8 *hash_location; /* temporary pointer, overloaded */ __u32 tsval, tsecr; /* need to include OPTION_TS */ struct tcp_fastopen_cookie *fastopen_cookie; /* Fast open cookie */ @@ -452,6 +453,145 @@ static void mptcp_options_write(__be32 *ptr, struct tcp_out_options *opts) #endif } +#ifdef CONFIG_CGROUP_BPF +static int bpf_skops_write_hdr_opt_arg0(struct sk_buff *skb, + enum tcp_synack_type synack_type) +{ + if (unlikely(!skb)) + return BPF_WRITE_HDR_TCP_CURRENT_MSS; + + if (unlikely(synack_type == TCP_SYNACK_COOKIE)) + return BPF_WRITE_HDR_TCP_SYNACK_COOKIE; + + return 0; +} + +/* req, syn_skb and synack_type are used when writing synack */ +static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, + struct sk_buff *syn_skb, + enum tcp_synack_type synack_type, + struct tcp_out_options *opts, + unsigned int *remaining) +{ + struct bpf_sock_ops_kern sock_ops; + int err; + + if (likely(!BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)) || + !*remaining) + return; + + /* *remaining has already been aligned to 4 bytes, so *remaining >= 4 */ + + /* init sock_ops */ + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + + sock_ops.op = BPF_SOCK_OPS_HDR_OPT_LEN_CB; + + if (req) { + /* The listen "sk" cannot be passed here because + * it is not locked. It would not make too much + * sense to do bpf_setsockopt(listen_sk) based + * on individual connection request also. + * + * Thus, "req" is passed here and the cgroup-bpf-progs + * of the listen "sk" will be run. + * + * "req" is also used here for fastopen even the "sk" here is + * a fullsock "child" sk. 
It is to keep the behavior + * consistent between fastopen and non-fastopen on + * the bpf programming side. + */ + sock_ops.sk = (struct sock *)req; + sock_ops.syn_skb = syn_skb; + } else { + sock_owned_by_me(sk); + + sock_ops.is_fullsock = 1; + sock_ops.sk = sk; + } + + sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type); + sock_ops.remaining_opt_len = *remaining; + /* tcp_current_mss() does not pass a skb */ + if (skb) + bpf_skops_init_skb(&sock_ops, skb, 0); + + err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk); + + if (err || sock_ops.remaining_opt_len == *remaining) + return; + + opts->bpf_opt_len = *remaining - sock_ops.remaining_opt_len; + /* round up to 4 bytes */ + opts->bpf_opt_len = (opts->bpf_opt_len + 3) & ~3; + + *remaining -= opts->bpf_opt_len; +} + +static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, + struct sk_buff *syn_skb, + enum tcp_synack_type synack_type, + struct tcp_out_options *opts) +{ + u8 first_opt_off, nr_written, max_opt_len = opts->bpf_opt_len; + struct bpf_sock_ops_kern sock_ops; + int err; + + if (likely(!max_opt_len)) + return; + + memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp)); + + sock_ops.op = BPF_SOCK_OPS_WRITE_HDR_OPT_CB; + + if (req) { + sock_ops.sk = (struct sock *)req; + sock_ops.syn_skb = syn_skb; + } else { + sock_owned_by_me(sk); + + sock_ops.is_fullsock = 1; + sock_ops.sk = sk; + } + + sock_ops.args[0] = bpf_skops_write_hdr_opt_arg0(skb, synack_type); + sock_ops.remaining_opt_len = max_opt_len; + first_opt_off = tcp_hdrlen(skb) - max_opt_len; + bpf_skops_init_skb(&sock_ops, skb, first_opt_off); + + err = BPF_CGROUP_RUN_PROG_SOCK_OPS_SK(&sock_ops, sk); + + if (err) + nr_written = 0; + else + nr_written = max_opt_len - sock_ops.remaining_opt_len; + + if (nr_written < max_opt_len) + memset(skb->data + first_opt_off + nr_written, TCPOPT_NOP, + max_opt_len - nr_written); +} +#else +static void bpf_skops_hdr_opt_len(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, + struct sk_buff *syn_skb, + enum tcp_synack_type synack_type, + struct tcp_out_options *opts, + unsigned int *remaining) +{ +} + +static void bpf_skops_write_hdr_opt(struct sock *sk, struct sk_buff *skb, + struct request_sock *req, + struct sk_buff *syn_skb, + enum tcp_synack_type synack_type, + struct tcp_out_options *opts) +{ +} +#endif + /* Write previously computed TCP options to the packet. 
* * Beware: Something in the Internet is very sensitive to the ordering of @@ -691,6 +831,8 @@ static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb, } } + bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining); + return MAX_TCP_OPTION_SPACE - remaining; } @@ -701,7 +843,8 @@ static unsigned int tcp_synack_options(const struct sock *sk, struct tcp_out_options *opts, const struct tcp_md5sig_key *md5, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type) + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb) { struct inet_request_sock *ireq = inet_rsk(req); unsigned int remaining = MAX_TCP_OPTION_SPACE; @@ -758,6 +901,9 @@ static unsigned int tcp_synack_options(const struct sock *sk, smc_set_option_cond(tcp_sk(sk), ireq, opts, &remaining); + bpf_skops_hdr_opt_len((struct sock *)sk, skb, req, syn_skb, + synack_type, opts, &remaining); + return MAX_TCP_OPTION_SPACE - remaining; } @@ -826,6 +972,15 @@ static unsigned int tcp_established_options(struct sock *sk, struct sk_buff *skb opts->num_sack_blocks * TCPOLEN_SACK_PERBLOCK; } + if (unlikely(BPF_SOCK_OPS_TEST_FLAG(tp, + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG))) { + unsigned int remaining = MAX_TCP_OPTION_SPACE - size; + + bpf_skops_hdr_opt_len(sk, skb, NULL, NULL, 0, opts, &remaining); + + size = MAX_TCP_OPTION_SPACE - remaining; + } + return size; } @@ -1213,6 +1368,9 @@ static int __tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, } #endif + /* BPF prog is the last one writing header option */ + bpf_skops_write_hdr_opt(sk, skb, NULL, NULL, 0, &opts); + INDIRECT_CALL_INET(icsk->icsk_af_ops->send_check, tcp_v6_send_check, tcp_v4_send_check, sk, skb); @@ -3336,20 +3494,20 @@ int tcp_send_synack(struct sock *sk) } /** - * tcp_make_synack - Prepare a SYN-ACK. - * sk: listener socket - * dst: dst entry attached to the SYNACK - * req: request_sock pointer - * foc: cookie for tcp fast open - * synack_type: Type of synback to prepare - * - * Allocate one skb and build a SYNACK packet. - * @dst is consumed : Caller should not use it again. + * tcp_make_synack - Allocate one skb and build a SYNACK packet. + * @sk: listener socket + * @dst: dst entry attached to the SYNACK. It is consumed and caller + * should not use it again. + * @req: request_sock pointer + * @foc: cookie for tcp fast open + * @synack_type: Type of synack to prepare + * @syn_skb: SYN packet just received. It could be NULL for rtx case. 
*/ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type) + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb) { struct inet_request_sock *ireq = inet_rsk(req); const struct tcp_sock *tp = tcp_sk(sk); @@ -3408,8 +3566,11 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, md5 = tcp_rsk(req)->af_specific->req_md5_lookup(sk, req_to_sk(req)); #endif skb_set_hash(skb, tcp_rsk(req)->txhash, PKT_HASH_TYPE_L4); + /* bpf program will be interested in the tcp_flags */ + TCP_SKB_CB(skb)->tcp_flags = TCPHDR_SYN | TCPHDR_ACK; tcp_header_size = tcp_synack_options(sk, req, mss, skb, &opts, md5, - foc, synack_type) + sizeof(*th); + foc, synack_type, + syn_skb) + sizeof(*th); skb_push(skb, tcp_header_size); skb_reset_transport_header(skb); @@ -3441,6 +3602,9 @@ struct sk_buff *tcp_make_synack(const struct sock *sk, struct dst_entry *dst, rcu_read_unlock(); #endif + bpf_skops_write_hdr_opt((struct sock *)sk, skb, req, syn_skb, + synack_type, &opts); + skb->skb_mstamp_ns = now; tcp_add_tx_delay(skb, tp); @@ -3741,6 +3905,8 @@ void tcp_send_delayed_ack(struct sock *sk) ato = min(ato, max_ato); } + ato = min_t(u32, ato, inet_csk(sk)->icsk_delack_max); + /* Stay within the limit we were given */ timeout = jiffies + ato; @@ -3934,7 +4100,8 @@ int tcp_rtx_synack(const struct sock *sk, struct request_sock *req) int res; tcp_rsk(req)->txhash = net_tx_rndhash(); - res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL); + res = af_ops->send_synack(sk, NULL, &fl, req, NULL, TCP_SYNACK_NORMAL, + NULL); if (!res) { __TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS); __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); diff --git a/net/ipv4/udp_bpf.c b/net/ipv4/udp_bpf.c index eddd973e6575..7a94791efc1a 100644 --- a/net/ipv4/udp_bpf.c +++ b/net/ipv4/udp_bpf.c @@ -22,10 +22,9 @@ static void udp_bpf_rebuild_protos(struct proto *prot, const struct proto *base) prot->close = sock_map_close; } -static void udp_bpf_check_v6_needs_rebuild(struct sock *sk, struct proto *ops) +static void udp_bpf_check_v6_needs_rebuild(struct proto *ops) { - if (sk->sk_family == AF_INET6 && - unlikely(ops != smp_load_acquire(&udpv6_prot_saved))) { + if (unlikely(ops != smp_load_acquire(&udpv6_prot_saved))) { spin_lock_bh(&udpv6_prot_lock); if (likely(ops != udpv6_prot_saved)) { udp_bpf_rebuild_protos(&udp_bpf_prots[UDP_BPF_IPV6], ops); @@ -46,8 +45,8 @@ struct proto *udp_bpf_get_proto(struct sock *sk, struct sk_psock *psock) { int family = sk->sk_family == AF_INET ? 
UDP_BPF_IPV4 : UDP_BPF_IPV6; - if (!psock->sk_proto) - udp_bpf_check_v6_needs_rebuild(sk, READ_ONCE(sk->sk_prot)); + if (sk->sk_family == AF_INET6) + udp_bpf_check_v6_needs_rebuild(psock->sk_proto); return &udp_bpf_prots[family]; } diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 305870a72352..87a633e1fbef 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -501,7 +501,8 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst, struct flowi *fl, struct request_sock *req, struct tcp_fastopen_cookie *foc, - enum tcp_synack_type synack_type) + enum tcp_synack_type synack_type, + struct sk_buff *syn_skb) { struct inet_request_sock *ireq = inet_rsk(req); struct ipv6_pinfo *np = tcp_inet6_sk(sk); @@ -515,7 +516,7 @@ static int tcp_v6_send_synack(const struct sock *sk, struct dst_entry *dst, IPPROTO_TCP)) == NULL) goto done; - skb = tcp_make_synack(sk, dst, req, foc, synack_type); + skb = tcp_make_synack(sk, dst, req, foc, synack_type, syn_skb); if (skb) { __tcp_v6_send_check(skb, &ireq->ir_v6_loc_addr, diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index e97db37354e4..a7227b447228 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -23,162 +23,6 @@ static DEFINE_IDA(umem_ida); -void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs) -{ - unsigned long flags; - - if (!xs->tx) - return; - - spin_lock_irqsave(&umem->xsk_tx_list_lock, flags); - list_add_rcu(&xs->list, &umem->xsk_tx_list); - spin_unlock_irqrestore(&umem->xsk_tx_list_lock, flags); -} - -void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs) -{ - unsigned long flags; - - if (!xs->tx) - return; - - spin_lock_irqsave(&umem->xsk_tx_list_lock, flags); - list_del_rcu(&xs->list); - spin_unlock_irqrestore(&umem->xsk_tx_list_lock, flags); -} - -/* The umem is stored both in the _rx struct and the _tx struct as we do - * not know if the device has more tx queues than rx, or the opposite. - * This might also change during run time. - */ -static int xdp_reg_umem_at_qid(struct net_device *dev, struct xdp_umem *umem, - u16 queue_id) -{ - if (queue_id >= max_t(unsigned int, - dev->real_num_rx_queues, - dev->real_num_tx_queues)) - return -EINVAL; - - if (queue_id < dev->real_num_rx_queues) - dev->_rx[queue_id].umem = umem; - if (queue_id < dev->real_num_tx_queues) - dev->_tx[queue_id].umem = umem; - - return 0; -} - -struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, - u16 queue_id) -{ - if (queue_id < dev->real_num_rx_queues) - return dev->_rx[queue_id].umem; - if (queue_id < dev->real_num_tx_queues) - return dev->_tx[queue_id].umem; - - return NULL; -} -EXPORT_SYMBOL(xdp_get_umem_from_qid); - -static void xdp_clear_umem_at_qid(struct net_device *dev, u16 queue_id) -{ - if (queue_id < dev->real_num_rx_queues) - dev->_rx[queue_id].umem = NULL; - if (queue_id < dev->real_num_tx_queues) - dev->_tx[queue_id].umem = NULL; -} - -int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev, - u16 queue_id, u16 flags) -{ - bool force_zc, force_copy; - struct netdev_bpf bpf; - int err = 0; - - ASSERT_RTNL(); - - force_zc = flags & XDP_ZEROCOPY; - force_copy = flags & XDP_COPY; - - if (force_zc && force_copy) - return -EINVAL; - - if (xdp_get_umem_from_qid(dev, queue_id)) - return -EBUSY; - - err = xdp_reg_umem_at_qid(dev, umem, queue_id); - if (err) - return err; - - umem->dev = dev; - umem->queue_id = queue_id; - - if (flags & XDP_USE_NEED_WAKEUP) { - umem->flags |= XDP_UMEM_USES_NEED_WAKEUP; - /* Tx needs to be explicitly woken up the first time. 
- * Also for supporting drivers that do not implement this - * feature. They will always have to call sendto(). - */ - xsk_set_tx_need_wakeup(umem); - } - - dev_hold(dev); - - if (force_copy) - /* For copy-mode, we are done. */ - return 0; - - if (!dev->netdev_ops->ndo_bpf || !dev->netdev_ops->ndo_xsk_wakeup) { - err = -EOPNOTSUPP; - goto err_unreg_umem; - } - - bpf.command = XDP_SETUP_XSK_UMEM; - bpf.xsk.umem = umem; - bpf.xsk.queue_id = queue_id; - - err = dev->netdev_ops->ndo_bpf(dev, &bpf); - if (err) - goto err_unreg_umem; - - umem->zc = true; - return 0; - -err_unreg_umem: - if (!force_zc) - err = 0; /* fallback to copy mode */ - if (err) - xdp_clear_umem_at_qid(dev, queue_id); - return err; -} - -void xdp_umem_clear_dev(struct xdp_umem *umem) -{ - struct netdev_bpf bpf; - int err; - - ASSERT_RTNL(); - - if (!umem->dev) - return; - - if (umem->zc) { - bpf.command = XDP_SETUP_XSK_UMEM; - bpf.xsk.umem = NULL; - bpf.xsk.queue_id = umem->queue_id; - - err = umem->dev->netdev_ops->ndo_bpf(umem->dev, &bpf); - - if (err) - WARN(1, "failed to disable umem!\n"); - } - - xdp_clear_umem_at_qid(umem->dev, umem->queue_id); - - dev_put(umem->dev); - umem->dev = NULL; - umem->zc = false; -} - static void xdp_umem_unpin_pages(struct xdp_umem *umem) { unpin_user_pages_dirty_lock(umem->pgs, umem->npgs, true); @@ -195,38 +39,33 @@ static void xdp_umem_unaccount_pages(struct xdp_umem *umem) } } -static void xdp_umem_release(struct xdp_umem *umem) +static void xdp_umem_addr_unmap(struct xdp_umem *umem) { - rtnl_lock(); - xdp_umem_clear_dev(umem); - rtnl_unlock(); - - ida_simple_remove(&umem_ida, umem->id); + vunmap(umem->addrs); + umem->addrs = NULL; +} - if (umem->fq) { - xskq_destroy(umem->fq); - umem->fq = NULL; - } +static int xdp_umem_addr_map(struct xdp_umem *umem, struct page **pages, + u32 nr_pages) +{ + umem->addrs = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); + if (!umem->addrs) + return -ENOMEM; + return 0; +} - if (umem->cq) { - xskq_destroy(umem->cq); - umem->cq = NULL; - } +static void xdp_umem_release(struct xdp_umem *umem) +{ + umem->zc = false; + ida_simple_remove(&umem_ida, umem->id); - xp_destroy(umem->pool); + xdp_umem_addr_unmap(umem); xdp_umem_unpin_pages(umem); xdp_umem_unaccount_pages(umem); kfree(umem); } -static void xdp_umem_release_deferred(struct work_struct *work) -{ - struct xdp_umem *umem = container_of(work, struct xdp_umem, work); - - xdp_umem_release(umem); -} - void xdp_get_umem(struct xdp_umem *umem) { refcount_inc(&umem->users); @@ -237,10 +76,8 @@ void xdp_put_umem(struct xdp_umem *umem) if (!umem) return; - if (refcount_dec_and_test(&umem->users)) { - INIT_WORK(&umem->work, xdp_umem_release_deferred); - schedule_work(&umem->work); - } + if (refcount_dec_and_test(&umem->users)) + xdp_umem_release(umem); } static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address) @@ -319,8 +156,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) return -EINVAL; } - if (mr->flags & ~(XDP_UMEM_UNALIGNED_CHUNK_FLAG | - XDP_UMEM_USES_NEED_WAKEUP)) + if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG) return -EINVAL; if (!unaligned_chunks && !is_power_of_2(chunk_size)) @@ -356,13 +192,13 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) umem->size = size; umem->headroom = headroom; umem->chunk_size = chunk_size; + umem->chunks = chunks; umem->npgs = (u32)npgs; umem->pgs = NULL; umem->user = NULL; umem->flags = mr->flags; - INIT_LIST_HEAD(&umem->xsk_tx_list); - spin_lock_init(&umem->xsk_tx_list_lock); + 
INIT_LIST_HEAD(&umem->xsk_dma_list); refcount_set(&umem->users, 1); err = xdp_umem_account_pages(umem); @@ -373,15 +209,13 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) if (err) goto out_account; - umem->pool = xp_create(umem->pgs, umem->npgs, chunks, chunk_size, - headroom, size, unaligned_chunks); - if (!umem->pool) { - err = -ENOMEM; - goto out_pin; - } + err = xdp_umem_addr_map(umem, umem->pgs, umem->npgs); + if (err) + goto out_unpin; + return 0; -out_pin: +out_unpin: xdp_umem_unpin_pages(umem); out_account: xdp_umem_unaccount_pages(umem); @@ -413,8 +247,3 @@ struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr) return umem; } - -bool xdp_umem_validate_queues(struct xdp_umem *umem) -{ - return umem->fq && umem->cq; -} diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h index 32067fe98f65..181fdda2f2a8 100644 --- a/net/xdp/xdp_umem.h +++ b/net/xdp/xdp_umem.h @@ -8,14 +8,8 @@ #include <net/xdp_sock_drv.h> -int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev, - u16 queue_id, u16 flags); -void xdp_umem_clear_dev(struct xdp_umem *umem); -bool xdp_umem_validate_queues(struct xdp_umem *umem); void xdp_get_umem(struct xdp_umem *umem); void xdp_put_umem(struct xdp_umem *umem); -void xdp_add_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs); -void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs); struct xdp_umem *xdp_umem_create(struct xdp_umem_reg *mr); #endif /* XDP_UMEM_H_ */ diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index c3231620d210..5eb6662f562a 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -36,68 +36,108 @@ static DEFINE_PER_CPU(struct list_head, xskmap_flush_list); bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs) { return READ_ONCE(xs->rx) && READ_ONCE(xs->umem) && - READ_ONCE(xs->umem->fq); + (xs->pool->fq || READ_ONCE(xs->fq_tmp)); } -void xsk_set_rx_need_wakeup(struct xdp_umem *umem) +void xsk_set_rx_need_wakeup(struct xsk_buff_pool *pool) { - if (umem->need_wakeup & XDP_WAKEUP_RX) + if (pool->cached_need_wakeup & XDP_WAKEUP_RX) return; - umem->fq->ring->flags |= XDP_RING_NEED_WAKEUP; - umem->need_wakeup |= XDP_WAKEUP_RX; + pool->fq->ring->flags |= XDP_RING_NEED_WAKEUP; + pool->cached_need_wakeup |= XDP_WAKEUP_RX; } EXPORT_SYMBOL(xsk_set_rx_need_wakeup); -void xsk_set_tx_need_wakeup(struct xdp_umem *umem) +void xsk_set_tx_need_wakeup(struct xsk_buff_pool *pool) { struct xdp_sock *xs; - if (umem->need_wakeup & XDP_WAKEUP_TX) + if (pool->cached_need_wakeup & XDP_WAKEUP_TX) return; rcu_read_lock(); - list_for_each_entry_rcu(xs, &umem->xsk_tx_list, list) { + list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) { xs->tx->ring->flags |= XDP_RING_NEED_WAKEUP; } rcu_read_unlock(); - umem->need_wakeup |= XDP_WAKEUP_TX; + pool->cached_need_wakeup |= XDP_WAKEUP_TX; } EXPORT_SYMBOL(xsk_set_tx_need_wakeup); -void xsk_clear_rx_need_wakeup(struct xdp_umem *umem) +void xsk_clear_rx_need_wakeup(struct xsk_buff_pool *pool) { - if (!(umem->need_wakeup & XDP_WAKEUP_RX)) + if (!(pool->cached_need_wakeup & XDP_WAKEUP_RX)) return; - umem->fq->ring->flags &= ~XDP_RING_NEED_WAKEUP; - umem->need_wakeup &= ~XDP_WAKEUP_RX; + pool->fq->ring->flags &= ~XDP_RING_NEED_WAKEUP; + pool->cached_need_wakeup &= ~XDP_WAKEUP_RX; } EXPORT_SYMBOL(xsk_clear_rx_need_wakeup); -void xsk_clear_tx_need_wakeup(struct xdp_umem *umem) +void xsk_clear_tx_need_wakeup(struct xsk_buff_pool *pool) { struct xdp_sock *xs; - if (!(umem->need_wakeup & XDP_WAKEUP_TX)) + if (!(pool->cached_need_wakeup & XDP_WAKEUP_TX)) return; rcu_read_lock(); - 
list_for_each_entry_rcu(xs, &umem->xsk_tx_list, list) { + list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) { xs->tx->ring->flags &= ~XDP_RING_NEED_WAKEUP; } rcu_read_unlock(); - umem->need_wakeup &= ~XDP_WAKEUP_TX; + pool->cached_need_wakeup &= ~XDP_WAKEUP_TX; } EXPORT_SYMBOL(xsk_clear_tx_need_wakeup); -bool xsk_umem_uses_need_wakeup(struct xdp_umem *umem) +bool xsk_uses_need_wakeup(struct xsk_buff_pool *pool) { - return umem->flags & XDP_UMEM_USES_NEED_WAKEUP; + return pool->uses_need_wakeup; +} +EXPORT_SYMBOL(xsk_uses_need_wakeup); + +struct xsk_buff_pool *xsk_get_pool_from_qid(struct net_device *dev, + u16 queue_id) +{ + if (queue_id < dev->real_num_rx_queues) + return dev->_rx[queue_id].pool; + if (queue_id < dev->real_num_tx_queues) + return dev->_tx[queue_id].pool; + + return NULL; +} +EXPORT_SYMBOL(xsk_get_pool_from_qid); + +void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id) +{ + if (queue_id < dev->real_num_rx_queues) + dev->_rx[queue_id].pool = NULL; + if (queue_id < dev->real_num_tx_queues) + dev->_tx[queue_id].pool = NULL; +} + +/* The buffer pool is stored both in the _rx struct and the _tx struct as we do + * not know if the device has more tx queues than rx, or the opposite. + * This might also change during run time. + */ +int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool, + u16 queue_id) +{ + if (queue_id >= max_t(unsigned int, + dev->real_num_rx_queues, + dev->real_num_tx_queues)) + return -EINVAL; + + if (queue_id < dev->real_num_rx_queues) + dev->_rx[queue_id].pool = pool; + if (queue_id < dev->real_num_tx_queues) + dev->_tx[queue_id].pool = pool; + + return 0; } -EXPORT_SYMBOL(xsk_umem_uses_need_wakeup); void xp_release(struct xdp_buff_xsk *xskb) { @@ -155,12 +195,12 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len, struct xdp_buff *xsk_xdp; int err; - if (len > xsk_umem_get_rx_frame_size(xs->umem)) { + if (len > xsk_pool_get_rx_frame_size(xs->pool)) { xs->rx_dropped++; return -ENOSPC; } - xsk_xdp = xsk_buff_alloc(xs->umem); + xsk_xdp = xsk_buff_alloc(xs->pool); if (!xsk_xdp) { xs->rx_dropped++; return -ENOSPC; @@ -208,7 +248,7 @@ static int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, static void xsk_flush(struct xdp_sock *xs) { xskq_prod_submit(xs->rx); - __xskq_cons_release(xs->umem->fq); + __xskq_cons_release(xs->pool->fq); sock_def_readable(&xs->sk); } @@ -249,32 +289,32 @@ void __xsk_map_flush(void) } } -void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries) +void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries) { - xskq_prod_submit_n(umem->cq, nb_entries); + xskq_prod_submit_n(pool->cq, nb_entries); } -EXPORT_SYMBOL(xsk_umem_complete_tx); +EXPORT_SYMBOL(xsk_tx_completed); -void xsk_umem_consume_tx_done(struct xdp_umem *umem) +void xsk_tx_release(struct xsk_buff_pool *pool) { struct xdp_sock *xs; rcu_read_lock(); - list_for_each_entry_rcu(xs, &umem->xsk_tx_list, list) { + list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) { __xskq_cons_release(xs->tx); xs->sk.sk_write_space(&xs->sk); } rcu_read_unlock(); } -EXPORT_SYMBOL(xsk_umem_consume_tx_done); +EXPORT_SYMBOL(xsk_tx_release); -bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc) +bool xsk_tx_peek_desc(struct xsk_buff_pool *pool, struct xdp_desc *desc) { struct xdp_sock *xs; rcu_read_lock(); - list_for_each_entry_rcu(xs, &umem->xsk_tx_list, list) { - if (!xskq_cons_peek_desc(xs->tx, desc, umem)) { + list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) { + if 
(!xskq_cons_peek_desc(xs->tx, desc, pool)) { xs->tx->queue_empty_descs++; continue; } @@ -284,7 +324,7 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc) * if there is space in it. This avoids having to implement * any buffering in the Tx path. */ - if (xskq_prod_reserve_addr(umem->cq, desc->addr)) + if (xskq_prod_reserve_addr(pool->cq, desc->addr)) goto out; xskq_cons_release(xs->tx); @@ -296,7 +336,7 @@ out: rcu_read_unlock(); return false; } -EXPORT_SYMBOL(xsk_umem_consume_tx); +EXPORT_SYMBOL(xsk_tx_peek_desc); static int xsk_wakeup(struct xdp_sock *xs, u8 flags) { @@ -322,7 +362,7 @@ static void xsk_destruct_skb(struct sk_buff *skb) unsigned long flags; spin_lock_irqsave(&xs->tx_completion_lock, flags); - xskq_prod_submit_addr(xs->umem->cq, addr); + xskq_prod_submit_addr(xs->pool->cq, addr); spin_unlock_irqrestore(&xs->tx_completion_lock, flags); sock_wfree(skb); @@ -342,7 +382,7 @@ static int xsk_generic_xmit(struct sock *sk) if (xs->queue_id >= xs->dev->real_num_tx_queues) goto out; - while (xskq_cons_peek_desc(xs->tx, &desc, xs->umem)) { + while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) { char *buffer; u64 addr; u32 len; @@ -359,14 +399,14 @@ static int xsk_generic_xmit(struct sock *sk) skb_put(skb, len); addr = desc.addr; - buffer = xsk_buff_raw_get_data(xs->umem, addr); + buffer = xsk_buff_raw_get_data(xs->pool, addr); err = skb_store_bits(skb, 0, buffer, len); /* This is the backpressure mechanism for the Tx path. * Reserve space in the completion queue and only proceed * if there is space in it. This avoids having to implement * any buffering in the Tx path. */ - if (unlikely(err) || xskq_prod_reserve(xs->umem->cq)) { + if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) { kfree_skb(skb); goto out; } @@ -431,16 +471,16 @@ static __poll_t xsk_poll(struct file *file, struct socket *sock, __poll_t mask = datagram_poll(file, sock, wait); struct sock *sk = sock->sk; struct xdp_sock *xs = xdp_sk(sk); - struct xdp_umem *umem; + struct xsk_buff_pool *pool; if (unlikely(!xsk_is_bound(xs))) return mask; - umem = xs->umem; + pool = xs->pool; - if (umem->need_wakeup) { + if (pool->cached_need_wakeup) { if (xs->zc) - xsk_wakeup(xs, umem->need_wakeup); + xsk_wakeup(xs, pool->cached_need_wakeup); else /* Poll needs to drive Tx also in copy mode */ __xsk_sendmsg(sk); @@ -481,7 +521,7 @@ static void xsk_unbind_dev(struct xdp_sock *xs) WRITE_ONCE(xs->state, XSK_UNBOUND); /* Wait for driver to stop using the xdp socket. */ - xdp_del_sk_umem(xs->umem, xs); + xp_del_xsk(xs->pool, xs); xs->dev = NULL; synchronize_net(); dev_put(dev); @@ -559,6 +599,8 @@ static int xsk_release(struct socket *sock) xskq_destroy(xs->rx); xskq_destroy(xs->tx); + xskq_destroy(xs->fq_tmp); + xskq_destroy(xs->cq_tmp); sock_orphan(sk); sock->sk = NULL; @@ -586,6 +628,11 @@ static struct socket *xsk_lookup_xsk_from_fd(int fd) return sock; } +static bool xsk_validate_queues(struct xdp_sock *xs) +{ + return xs->fq_tmp && xs->cq_tmp; +} + static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) { struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr; @@ -654,29 +701,64 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) sockfd_put(sock); goto out_unlock; } - if (umem_xs->dev != dev || umem_xs->queue_id != qid) { - err = -EINVAL; - sockfd_put(sock); - goto out_unlock; + + if (umem_xs->queue_id != qid || umem_xs->dev != dev) { + /* Share the umem with another socket on another qid + * and/or device. 
+ */ + xs->pool = xp_create_and_assign_umem(xs, + umem_xs->umem); + if (!xs->pool) { + sockfd_put(sock); + goto out_unlock; + } + + err = xp_assign_dev_shared(xs->pool, umem_xs->umem, + dev, qid); + if (err) { + xp_destroy(xs->pool); + sockfd_put(sock); + goto out_unlock; + } + } else { + /* Share the buffer pool with the other socket. */ + if (xs->fq_tmp || xs->cq_tmp) { + /* Do not allow setting your own fq or cq. */ + err = -EINVAL; + sockfd_put(sock); + goto out_unlock; + } + + xp_get_pool(umem_xs->pool); + xs->pool = umem_xs->pool; } xdp_get_umem(umem_xs->umem); WRITE_ONCE(xs->umem, umem_xs->umem); sockfd_put(sock); - } else if (!xs->umem || !xdp_umem_validate_queues(xs->umem)) { + } else if (!xs->umem || !xsk_validate_queues(xs)) { err = -EINVAL; goto out_unlock; } else { /* This xsk has its own umem. */ - err = xdp_umem_assign_dev(xs->umem, dev, qid, flags); - if (err) + xs->pool = xp_create_and_assign_umem(xs, xs->umem); + if (!xs->pool) { + err = -ENOMEM; + goto out_unlock; + } + + err = xp_assign_dev(xs->pool, dev, qid, flags); + if (err) { + xp_destroy(xs->pool); + xs->pool = NULL; goto out_unlock; + } } xs->dev = dev; xs->zc = xs->umem->zc; xs->queue_id = qid; - xdp_add_sk_umem(xs->umem, xs); + xp_add_xsk(xs->pool, xs); out_unlock: if (err) { @@ -782,16 +864,10 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname, mutex_unlock(&xs->mutex); return -EBUSY; } - if (!xs->umem) { - mutex_unlock(&xs->mutex); - return -EINVAL; - } - q = (optname == XDP_UMEM_FILL_RING) ? &xs->umem->fq : - &xs->umem->cq; + q = (optname == XDP_UMEM_FILL_RING) ? &xs->fq_tmp : + &xs->cq_tmp; err = xsk_init_queue(entries, q, true); - if (optname == XDP_UMEM_FILL_RING) - xp_set_fq(xs->umem->pool, *q); mutex_unlock(&xs->mutex); return err; } @@ -858,7 +934,7 @@ static int xsk_getsockopt(struct socket *sock, int level, int optname, if (extra_stats) { stats.rx_ring_full = xs->rx_queue_full; stats.rx_fill_ring_empty_descs = - xs->umem ? xskq_nb_queue_empty_descs(xs->umem->fq) : 0; + xs->pool ? xskq_nb_queue_empty_descs(xs->pool->fq) : 0; stats.tx_ring_empty_descs = xskq_nb_queue_empty_descs(xs->tx); } else { stats.rx_dropped += xs->rx_queue_full; @@ -960,7 +1036,6 @@ static int xsk_mmap(struct file *file, struct socket *sock, unsigned long size = vma->vm_end - vma->vm_start; struct xdp_sock *xs = xdp_sk(sock->sk); struct xsk_queue *q = NULL; - struct xdp_umem *umem; unsigned long pfn; struct page *qpg; @@ -972,16 +1047,12 @@ static int xsk_mmap(struct file *file, struct socket *sock, } else if (offset == XDP_PGOFF_TX_RING) { q = READ_ONCE(xs->tx); } else { - umem = READ_ONCE(xs->umem); - if (!umem) - return -EINVAL; - /* Matches the smp_wmb() in XDP_UMEM_REG */ smp_rmb(); if (offset == XDP_UMEM_PGOFF_FILL_RING) - q = READ_ONCE(umem->fq); + q = READ_ONCE(xs->fq_tmp); else if (offset == XDP_UMEM_PGOFF_COMPLETION_RING) - q = READ_ONCE(umem->cq); + q = READ_ONCE(xs->cq_tmp); } if (!q) @@ -1019,8 +1090,8 @@ static int xsk_notifier(struct notifier_block *this, xsk_unbind_dev(xs); - /* Clear device references in umem. */ - xdp_umem_clear_dev(xs->umem); + /* Clear device references. 
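The shared-umem branch added to xsk_bind() above changes the user-space contract: a socket that shares another socket's umem but binds to a different queue or netdev must register its own fill and completion rings (they are parked in fq_tmp/cq_tmp and consumed by xp_assign_dev_shared(), which otherwise fails with -EINVAL), while a socket sharing the same queue must not. A rough user-space sketch of the first case follows; it is illustrative only, with error handling trimmed, an arbitrary ring size, and the assumption that an RX and/or TX ring was already registered on the new socket.

#include <sys/socket.h>
#include <linux/if_xdp.h>

/* Bind second_fd to (ifindex, queue_id), sharing first_fd's umem. */
static int xsk_bind_shared(int second_fd, int first_fd,
			   unsigned int ifindex, __u32 queue_id)
{
	struct sockaddr_xdp sxdp = {};
	int ring_sz = 2048;

	/* Each extra queue id needs its own fill/completion rings,
	 * registered on the new socket (they become fq_tmp/cq_tmp). */
	if (setsockopt(second_fd, SOL_XDP, XDP_UMEM_FILL_RING,
		       &ring_sz, sizeof(ring_sz)) ||
	    setsockopt(second_fd, SOL_XDP, XDP_UMEM_COMPLETION_RING,
		       &ring_sz, sizeof(ring_sz)))
		return -1;

	sxdp.sxdp_family = AF_XDP;
	sxdp.sxdp_ifindex = ifindex;
	sxdp.sxdp_queue_id = queue_id;
	sxdp.sxdp_flags = XDP_SHARED_UMEM;
	sxdp.sxdp_shared_umem_fd = first_fd;

	return bind(second_fd, (struct sockaddr *)&sxdp, sizeof(sxdp));
}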
*/ + xp_clear_dev(xs->pool); } mutex_unlock(&xs->mutex); } @@ -1064,7 +1135,7 @@ static void xsk_destruct(struct sock *sk) if (!sock_flag(sk, SOCK_DEAD)) return; - xdp_put_umem(xs->umem); + xp_put_pool(xs->pool); sk_refcnt_debug_dec(sk); } @@ -1072,8 +1143,8 @@ static void xsk_destruct(struct sock *sk) static int xsk_create(struct net *net, struct socket *sock, int protocol, int kern) { - struct sock *sk; struct xdp_sock *xs; + struct sock *sk; if (!ns_capable(net->user_ns, CAP_NET_RAW)) return -EPERM; diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h index 455ddd480f3d..da1f73e43924 100644 --- a/net/xdp/xsk.h +++ b/net/xdp/xsk.h @@ -11,13 +11,6 @@ #define XSK_NEXT_PG_CONTIG_SHIFT 0 #define XSK_NEXT_PG_CONTIG_MASK BIT_ULL(XSK_NEXT_PG_CONTIG_SHIFT) -/* Flags for the umem flags field. - * - * The NEED_WAKEUP flag is 1 due to the reuse of the flags field for public - * flags. See inlude/uapi/include/linux/if_xdp.h. - */ -#define XDP_UMEM_USES_NEED_WAKEUP BIT(1) - struct xdp_ring_offset_v1 { __u64 producer; __u64 consumer; @@ -51,5 +44,8 @@ void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs, struct xdp_sock **map_entry); int xsk_map_inc(struct xsk_map *map); void xsk_map_put(struct xsk_map *map); +void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id); +int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool, + u16 queue_id); #endif /* XSK_H_ */ diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c index a2044c245215..795d7c81c0ca 100644 --- a/net/xdp/xsk_buff_pool.c +++ b/net/xdp/xsk_buff_pool.c @@ -2,21 +2,37 @@ #include <net/xsk_buff_pool.h> #include <net/xdp_sock.h> +#include <net/xdp_sock_drv.h> +#include <linux/dma-direct.h> +#include <linux/dma-noncoherent.h> +#include <linux/swiotlb.h> #include "xsk_queue.h" +#include "xdp_umem.h" +#include "xsk.h" -static void xp_addr_unmap(struct xsk_buff_pool *pool) +void xp_add_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs) { - vunmap(pool->addrs); + unsigned long flags; + + if (!xs->tx) + return; + + spin_lock_irqsave(&pool->xsk_tx_list_lock, flags); + list_add_rcu(&xs->tx_list, &pool->xsk_tx_list); + spin_unlock_irqrestore(&pool->xsk_tx_list_lock, flags); } -static int xp_addr_map(struct xsk_buff_pool *pool, - struct page **pages, u32 nr_pages) +void xp_del_xsk(struct xsk_buff_pool *pool, struct xdp_sock *xs) { - pool->addrs = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); - if (!pool->addrs) - return -ENOMEM; - return 0; + unsigned long flags; + + if (!xs->tx) + return; + + spin_lock_irqsave(&pool->xsk_tx_list_lock, flags); + list_del_rcu(&xs->tx_list); + spin_unlock_irqrestore(&pool->xsk_tx_list_lock, flags); } void xp_destroy(struct xsk_buff_pool *pool) @@ -24,59 +40,61 @@ void xp_destroy(struct xsk_buff_pool *pool) if (!pool) return; - xp_addr_unmap(pool); kvfree(pool->heads); kvfree(pool); } -struct xsk_buff_pool *xp_create(struct page **pages, u32 nr_pages, u32 chunks, - u32 chunk_size, u32 headroom, u64 size, - bool unaligned) +struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs, + struct xdp_umem *umem) { struct xsk_buff_pool *pool; struct xdp_buff_xsk *xskb; - int err; u32 i; - pool = kvzalloc(struct_size(pool, free_heads, chunks), GFP_KERNEL); + pool = kvzalloc(struct_size(pool, free_heads, umem->chunks), + GFP_KERNEL); if (!pool) goto out; - pool->heads = kvcalloc(chunks, sizeof(*pool->heads), GFP_KERNEL); + pool->heads = kvcalloc(umem->chunks, sizeof(*pool->heads), GFP_KERNEL); if (!pool->heads) goto out; - pool->chunk_mask = ~((u64)chunk_size - 1); - pool->addrs_cnt 
= size; - pool->heads_cnt = chunks; - pool->free_heads_cnt = chunks; - pool->headroom = headroom; - pool->chunk_size = chunk_size; - pool->unaligned = unaligned; - pool->frame_len = chunk_size - headroom - XDP_PACKET_HEADROOM; + pool->chunk_mask = ~((u64)umem->chunk_size - 1); + pool->addrs_cnt = umem->size; + pool->heads_cnt = umem->chunks; + pool->free_heads_cnt = umem->chunks; + pool->headroom = umem->headroom; + pool->chunk_size = umem->chunk_size; + pool->unaligned = umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG; + pool->frame_len = umem->chunk_size - umem->headroom - + XDP_PACKET_HEADROOM; + pool->umem = umem; + pool->addrs = umem->addrs; INIT_LIST_HEAD(&pool->free_list); + INIT_LIST_HEAD(&pool->xsk_tx_list); + spin_lock_init(&pool->xsk_tx_list_lock); + refcount_set(&pool->users, 1); + + pool->fq = xs->fq_tmp; + pool->cq = xs->cq_tmp; + xs->fq_tmp = NULL; + xs->cq_tmp = NULL; for (i = 0; i < pool->free_heads_cnt; i++) { xskb = &pool->heads[i]; xskb->pool = pool; - xskb->xdp.frame_sz = chunk_size - headroom; + xskb->xdp.frame_sz = umem->chunk_size - umem->headroom; pool->free_heads[i] = xskb; } - err = xp_addr_map(pool, pages, nr_pages); - if (!err) - return pool; + return pool; out: xp_destroy(pool); return NULL; } -void xp_set_fq(struct xsk_buff_pool *pool, struct xsk_queue *fq) -{ - pool->fq = fq; -} - void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq) { u32 i; @@ -86,70 +104,320 @@ void xp_set_rxq_info(struct xsk_buff_pool *pool, struct xdp_rxq_info *rxq) } EXPORT_SYMBOL(xp_set_rxq_info); -void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) +static void xp_disable_drv_zc(struct xsk_buff_pool *pool) { - dma_addr_t *dma; - u32 i; + struct netdev_bpf bpf; + int err; - if (pool->dma_pages_cnt == 0) + ASSERT_RTNL(); + + if (pool->umem->zc) { + bpf.command = XDP_SETUP_XSK_POOL; + bpf.xsk.pool = NULL; + bpf.xsk.queue_id = pool->queue_id; + + err = pool->netdev->netdev_ops->ndo_bpf(pool->netdev, &bpf); + + if (err) + WARN(1, "Failed to disable zero-copy!\n"); + } +} + +static int __xp_assign_dev(struct xsk_buff_pool *pool, + struct net_device *netdev, u16 queue_id, u16 flags) +{ + bool force_zc, force_copy; + struct netdev_bpf bpf; + int err = 0; + + ASSERT_RTNL(); + + force_zc = flags & XDP_ZEROCOPY; + force_copy = flags & XDP_COPY; + + if (force_zc && force_copy) + return -EINVAL; + + if (xsk_get_pool_from_qid(netdev, queue_id)) + return -EBUSY; + + pool->netdev = netdev; + pool->queue_id = queue_id; + err = xsk_reg_pool_at_qid(netdev, pool, queue_id); + if (err) + return err; + + if (flags & XDP_USE_NEED_WAKEUP) { + pool->uses_need_wakeup = true; + /* Tx needs to be explicitly woken up the first time. + * Also for supporting drivers that do not implement this + * feature. They will always have to call sendto(). + */ + pool->cached_need_wakeup = XDP_WAKEUP_TX; + } + + dev_hold(netdev); + + if (force_copy) + /* For copy-mode, we are done. 
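For drivers, the net effect of the refactor above is that the zero-copy API becomes pool-based: xsk_umem_complete_tx(), xsk_umem_consume_tx(), xsk_umem_consume_tx_done() and xsk_umem_uses_need_wakeup() are replaced by xsk_tx_completed(), xsk_tx_peek_desc(), xsk_tx_release() and xsk_uses_need_wakeup(), all taking a struct xsk_buff_pool. A minimal, hypothetical Tx-completion handler using the renamed helpers is sketched below (the mydrv_ prefix is a placeholder, not a real driver).

#include <net/xdp_sock_drv.h>

static void mydrv_clean_xsk_tx(struct xsk_buff_pool *pool, u32 completed)
{
	if (completed)
		/* Hand the sent descriptors back via the completion ring. */
		xsk_tx_completed(pool, completed);

	/* With the need_wakeup feature on, tell user space to kick Tx
	 * (e.g. via sendto()) for the next batch. */
	if (xsk_uses_need_wakeup(pool))
		xsk_set_tx_need_wakeup(pool);
}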
*/ + return 0; + + if (!netdev->netdev_ops->ndo_bpf || + !netdev->netdev_ops->ndo_xsk_wakeup) { + err = -EOPNOTSUPP; + goto err_unreg_pool; + } + + bpf.command = XDP_SETUP_XSK_POOL; + bpf.xsk.pool = pool; + bpf.xsk.queue_id = queue_id; + + err = netdev->netdev_ops->ndo_bpf(netdev, &bpf); + if (err) + goto err_unreg_pool; + + if (!pool->dma_pages) { + WARN(1, "Driver did not DMA map zero-copy buffers"); + goto err_unreg_xsk; + } + pool->umem->zc = true; + return 0; + +err_unreg_xsk: + xp_disable_drv_zc(pool); +err_unreg_pool: + if (!force_zc) + err = 0; /* fallback to copy mode */ + if (err) + xsk_clear_pool_at_qid(netdev, queue_id); + return err; +} + +int xp_assign_dev(struct xsk_buff_pool *pool, struct net_device *dev, + u16 queue_id, u16 flags) +{ + return __xp_assign_dev(pool, dev, queue_id, flags); +} + +int xp_assign_dev_shared(struct xsk_buff_pool *pool, struct xdp_umem *umem, + struct net_device *dev, u16 queue_id) +{ + u16 flags; + + /* One fill and completion ring required for each queue id. */ + if (!pool->fq || !pool->cq) + return -EINVAL; + + flags = umem->zc ? XDP_ZEROCOPY : XDP_COPY; + if (pool->uses_need_wakeup) + flags |= XDP_USE_NEED_WAKEUP; + + return __xp_assign_dev(pool, dev, queue_id, flags); +} + +void xp_clear_dev(struct xsk_buff_pool *pool) +{ + if (!pool->netdev) return; - for (i = 0; i < pool->dma_pages_cnt; i++) { - dma = &pool->dma_pages[i]; + xp_disable_drv_zc(pool); + xsk_clear_pool_at_qid(pool->netdev, pool->queue_id); + dev_put(pool->netdev); + pool->netdev = NULL; +} + +static void xp_release_deferred(struct work_struct *work) +{ + struct xsk_buff_pool *pool = container_of(work, struct xsk_buff_pool, + work); + + rtnl_lock(); + xp_clear_dev(pool); + rtnl_unlock(); + + if (pool->fq) { + xskq_destroy(pool->fq); + pool->fq = NULL; + } + + if (pool->cq) { + xskq_destroy(pool->cq); + pool->cq = NULL; + } + + xdp_put_umem(pool->umem); + xp_destroy(pool); +} + +void xp_get_pool(struct xsk_buff_pool *pool) +{ + refcount_inc(&pool->users); +} + +void xp_put_pool(struct xsk_buff_pool *pool) +{ + if (!pool) + return; + + if (refcount_dec_and_test(&pool->users)) { + INIT_WORK(&pool->work, xp_release_deferred); + schedule_work(&pool->work); + } +} + +static struct xsk_dma_map *xp_find_dma_map(struct xsk_buff_pool *pool) +{ + struct xsk_dma_map *dma_map; + + list_for_each_entry(dma_map, &pool->umem->xsk_dma_list, list) { + if (dma_map->netdev == pool->netdev) + return dma_map; + } + + return NULL; +} + +static struct xsk_dma_map *xp_create_dma_map(struct device *dev, struct net_device *netdev, + u32 nr_pages, struct xdp_umem *umem) +{ + struct xsk_dma_map *dma_map; + + dma_map = kzalloc(sizeof(*dma_map), GFP_KERNEL); + if (!dma_map) + return NULL; + + dma_map->dma_pages = kvcalloc(nr_pages, sizeof(*dma_map->dma_pages), GFP_KERNEL); + if (!dma_map) { + kfree(dma_map); + return NULL; + } + + dma_map->netdev = netdev; + dma_map->dev = dev; + dma_map->dma_need_sync = false; + dma_map->dma_pages_cnt = nr_pages; + refcount_set(&dma_map->users, 0); + list_add(&dma_map->list, &umem->xsk_dma_list); + return dma_map; +} + +static void xp_destroy_dma_map(struct xsk_dma_map *dma_map) +{ + list_del(&dma_map->list); + kvfree(dma_map->dma_pages); + kfree(dma_map); +} + +static void __xp_dma_unmap(struct xsk_dma_map *dma_map, unsigned long attrs) +{ + dma_addr_t *dma; + u32 i; + + for (i = 0; i < dma_map->dma_pages_cnt; i++) { + dma = &dma_map->dma_pages[i]; if (*dma) { - dma_unmap_page_attrs(pool->dev, *dma, PAGE_SIZE, + dma_unmap_page_attrs(dma_map->dev, *dma, PAGE_SIZE, 
DMA_BIDIRECTIONAL, attrs); *dma = 0; } } + xp_destroy_dma_map(dma_map); +} + +void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs) +{ + struct xsk_dma_map *dma_map; + + if (pool->dma_pages_cnt == 0) + return; + + dma_map = xp_find_dma_map(pool); + if (!dma_map) { + WARN(1, "Could not find dma_map for device"); + return; + } + + if (!refcount_dec_and_test(&dma_map->users)) + return; + + __xp_dma_unmap(dma_map, attrs); kvfree(pool->dma_pages); pool->dma_pages_cnt = 0; pool->dev = NULL; } EXPORT_SYMBOL(xp_dma_unmap); -static void xp_check_dma_contiguity(struct xsk_buff_pool *pool) +static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map) { u32 i; - for (i = 0; i < pool->dma_pages_cnt - 1; i++) { - if (pool->dma_pages[i] + PAGE_SIZE == pool->dma_pages[i + 1]) - pool->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK; + for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) { + if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1]) + dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK; else - pool->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK; + dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK; } } +static int xp_init_dma_info(struct xsk_buff_pool *pool, struct xsk_dma_map *dma_map) +{ + pool->dma_pages = kvcalloc(dma_map->dma_pages_cnt, sizeof(*pool->dma_pages), GFP_KERNEL); + if (!pool->dma_pages) + return -ENOMEM; + + pool->dev = dma_map->dev; + pool->dma_pages_cnt = dma_map->dma_pages_cnt; + pool->dma_need_sync = dma_map->dma_need_sync; + refcount_inc(&dma_map->users); + memcpy(pool->dma_pages, dma_map->dma_pages, + pool->dma_pages_cnt * sizeof(*pool->dma_pages)); + + return 0; +} + int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev, unsigned long attrs, struct page **pages, u32 nr_pages) { + struct xsk_dma_map *dma_map; dma_addr_t dma; + int err; u32 i; - pool->dma_pages = kvcalloc(nr_pages, sizeof(*pool->dma_pages), - GFP_KERNEL); - if (!pool->dma_pages) - return -ENOMEM; + dma_map = xp_find_dma_map(pool); + if (dma_map) { + err = xp_init_dma_info(pool, dma_map); + if (err) + return err; + + return 0; + } - pool->dev = dev; - pool->dma_pages_cnt = nr_pages; - pool->dma_need_sync = false; + dma_map = xp_create_dma_map(dev, pool->netdev, nr_pages, pool->umem); + if (!dma_map) + return -ENOMEM; - for (i = 0; i < pool->dma_pages_cnt; i++) { + for (i = 0; i < dma_map->dma_pages_cnt; i++) { dma = dma_map_page_attrs(dev, pages[i], 0, PAGE_SIZE, DMA_BIDIRECTIONAL, attrs); if (dma_mapping_error(dev, dma)) { - xp_dma_unmap(pool, attrs); + __xp_dma_unmap(dma_map, attrs); return -ENOMEM; } if (dma_need_sync(dev, dma)) - pool->dma_need_sync = true; - pool->dma_pages[i] = dma; + dma_map->dma_need_sync = true; + dma_map->dma_pages[i] = dma; } if (pool->unaligned) - xp_check_dma_contiguity(pool); + xp_check_dma_contiguity(dma_map); + + err = xp_init_dma_info(pool, dma_map); + if (err) { + __xp_dma_unmap(dma_map, attrs); + return err; + } + return 0; } EXPORT_SYMBOL(xp_dma_map); diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c index 21e9c2d123ee..5bd8ea9d206a 100644 --- a/net/xdp/xsk_diag.c +++ b/net/xdp/xsk_diag.c @@ -46,6 +46,7 @@ static int xsk_diag_put_rings_cfg(const struct xdp_sock *xs, static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb) { + struct xsk_buff_pool *pool = xs->pool; struct xdp_umem *umem = xs->umem; struct xdp_diag_umem du = {}; int err; @@ -58,8 +59,8 @@ static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb) du.num_pages = umem->npgs; du.chunk_size = umem->chunk_size; du.headroom = umem->headroom; - 
du.ifindex = umem->dev ? umem->dev->ifindex : 0; - du.queue_id = umem->queue_id; + du.ifindex = pool->netdev ? pool->netdev->ifindex : 0; + du.queue_id = pool->queue_id; du.flags = 0; if (umem->zc) du.flags |= XDP_DU_F_ZEROCOPY; @@ -67,10 +68,11 @@ static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb) err = nla_put(nlskb, XDP_DIAG_UMEM, sizeof(du), &du); - if (!err && umem->fq) - err = xsk_diag_put_ring(umem->fq, XDP_DIAG_UMEM_FILL_RING, nlskb); - if (!err && umem->cq) { - err = xsk_diag_put_ring(umem->cq, XDP_DIAG_UMEM_COMPLETION_RING, + if (!err && pool->fq) + err = xsk_diag_put_ring(pool->fq, + XDP_DIAG_UMEM_FILL_RING, nlskb); + if (!err && pool->cq) { + err = xsk_diag_put_ring(pool->cq, XDP_DIAG_UMEM_COMPLETION_RING, nlskb); } return err; @@ -83,7 +85,7 @@ static int xsk_diag_put_stats(const struct xdp_sock *xs, struct sk_buff *nlskb) du.n_rx_dropped = xs->rx_dropped; du.n_rx_invalid = xskq_nb_invalid_descs(xs->rx); du.n_rx_full = xs->rx_queue_full; - du.n_fill_ring_empty = xs->umem ? xskq_nb_queue_empty_descs(xs->umem->fq) : 0; + du.n_fill_ring_empty = xs->pool ? xskq_nb_queue_empty_descs(xs->pool->fq) : 0; du.n_tx_invalid = xskq_nb_invalid_descs(xs->tx); du.n_tx_ring_empty = xskq_nb_queue_empty_descs(xs->tx); return nla_put(nlskb, XDP_DIAG_STATS, sizeof(du), &du); diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h index bf42cfd74b89..2d883f631c85 100644 --- a/net/xdp/xsk_queue.h +++ b/net/xdp/xsk_queue.h @@ -166,9 +166,9 @@ static inline bool xp_validate_desc(struct xsk_buff_pool *pool, static inline bool xskq_cons_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d, - struct xdp_umem *umem) + struct xsk_buff_pool *pool) { - if (!xp_validate_desc(umem->pool, d)) { + if (!xp_validate_desc(pool, d)) { q->invalid_descs++; return false; } @@ -177,14 +177,14 @@ static inline bool xskq_cons_is_valid_desc(struct xsk_queue *q, static inline bool xskq_cons_read_desc(struct xsk_queue *q, struct xdp_desc *desc, - struct xdp_umem *umem) + struct xsk_buff_pool *pool) { while (q->cached_cons != q->cached_prod) { struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring; u32 idx = q->cached_cons & q->ring_mask; *desc = ring->desc[idx]; - if (xskq_cons_is_valid_desc(q, desc, umem)) + if (xskq_cons_is_valid_desc(q, desc, pool)) return true; q->cached_cons++; @@ -236,11 +236,11 @@ static inline bool xskq_cons_peek_addr_unchecked(struct xsk_queue *q, u64 *addr) static inline bool xskq_cons_peek_desc(struct xsk_queue *q, struct xdp_desc *desc, - struct xdp_umem *umem) + struct xsk_buff_pool *pool) { if (q->cached_prod == q->cached_cons) xskq_cons_get_entries(q); - return xskq_cons_read_desc(q, desc, umem); + return xskq_cons_read_desc(q, desc, pool); } static inline void xskq_cons_release(struct xsk_queue *q) diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c index 8367adbbe9df..2a4fd6677155 100644 --- a/net/xdp/xskmap.c +++ b/net/xdp/xskmap.c @@ -254,8 +254,16 @@ void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs, spin_unlock_bh(&map->lock); } +static bool xsk_map_meta_equal(const struct bpf_map *meta0, + const struct bpf_map *meta1) +{ + return meta0->max_entries == meta1->max_entries && + bpf_map_meta_equal(meta0, meta1); +} + static int xsk_map_btf_id; const struct bpf_map_ops xsk_map_ops = { + .map_meta_equal = xsk_map_meta_equal, .map_alloc = xsk_map_alloc, .map_free = xsk_map_free, .map_get_next_key = xsk_map_get_next_key, diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index f87ee02073ba..4f1ed0e3cf9f 100644 --- a/samples/bpf/Makefile 
+++ b/samples/bpf/Makefile @@ -48,6 +48,7 @@ tprogs-y += syscall_tp tprogs-y += cpustat tprogs-y += xdp_adjust_tail tprogs-y += xdpsock +tprogs-y += xsk_fwd tprogs-y += xdp_fwd tprogs-y += task_fd_query tprogs-y += xdp_sample_pkts @@ -71,12 +72,12 @@ tracex4-objs := tracex4_user.o tracex5-objs := tracex5_user.o $(TRACE_HELPERS) tracex6-objs := tracex6_user.o tracex7-objs := tracex7_user.o -test_probe_write_user-objs := bpf_load.o test_probe_write_user_user.o -trace_output-objs := bpf_load.o trace_output_user.o $(TRACE_HELPERS) -lathist-objs := bpf_load.o lathist_user.o -offwaketime-objs := bpf_load.o offwaketime_user.o $(TRACE_HELPERS) -spintest-objs := bpf_load.o spintest_user.o $(TRACE_HELPERS) -map_perf_test-objs := bpf_load.o map_perf_test_user.o +test_probe_write_user-objs := test_probe_write_user_user.o +trace_output-objs := trace_output_user.o $(TRACE_HELPERS) +lathist-objs := lathist_user.o +offwaketime-objs := offwaketime_user.o $(TRACE_HELPERS) +spintest-objs := spintest_user.o $(TRACE_HELPERS) +map_perf_test-objs := map_perf_test_user.o test_overhead-objs := bpf_load.o test_overhead_user.o test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o test_cgrp2_attach-objs := test_cgrp2_attach.o @@ -86,7 +87,7 @@ xdp1-objs := xdp1_user.o # reuse xdp1 source intentionally xdp2-objs := xdp1_user.o xdp_router_ipv4-objs := xdp_router_ipv4_user.o -test_current_task_under_cgroup-objs := bpf_load.o $(CGROUP_HELPERS) \ +test_current_task_under_cgroup-objs := $(CGROUP_HELPERS) \ test_current_task_under_cgroup_user.o trace_event-objs := trace_event_user.o $(TRACE_HELPERS) sampleip-objs := sampleip_user.o $(TRACE_HELPERS) @@ -100,10 +101,11 @@ xdp_redirect_map-objs := xdp_redirect_map_user.o xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o xdp_monitor-objs := bpf_load.o xdp_monitor_user.o xdp_rxq_info-objs := xdp_rxq_info_user.o -syscall_tp-objs := bpf_load.o syscall_tp_user.o -cpustat-objs := bpf_load.o cpustat_user.o +syscall_tp-objs := syscall_tp_user.o +cpustat-objs := cpustat_user.o xdp_adjust_tail-objs := xdp_adjust_tail_user.o xdpsock-objs := xdpsock_user.o +xsk_fwd-objs := xsk_fwd.o xdp_fwd-objs := xdp_fwd_user.o task_fd_query-objs := bpf_load.o task_fd_query_user.o $(TRACE_HELPERS) xdp_sample_pkts-objs := xdp_sample_pkts_user.o $(TRACE_HELPERS) @@ -203,6 +205,7 @@ TPROGLDLIBS_trace_output += -lrt TPROGLDLIBS_map_perf_test += -lrt TPROGLDLIBS_test_overhead += -lrt TPROGLDLIBS_xdpsock += -pthread +TPROGLDLIBS_xsk_fwd += -pthread # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline: # make M=samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang diff --git a/samples/bpf/cpustat_kern.c b/samples/bpf/cpustat_kern.c index a86a19d5f033..5aefd19cdfa1 100644 --- a/samples/bpf/cpustat_kern.c +++ b/samples/bpf/cpustat_kern.c @@ -51,28 +51,28 @@ static int cpu_opps[] = { 208000, 432000, 729000, 960000, 1200000 }; #define MAP_OFF_PSTATE_IDX 3 #define MAP_OFF_NUM 4 -struct bpf_map_def SEC("maps") my_map = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(u32), - .value_size = sizeof(u64), - .max_entries = MAX_CPU * MAP_OFF_NUM, -}; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, u32); + __type(value, u64); + __uint(max_entries, MAX_CPU * MAP_OFF_NUM); +} my_map SEC(".maps"); /* cstate_duration records duration time for every idle state per CPU */ -struct bpf_map_def SEC("maps") cstate_duration = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(u32), - .value_size = sizeof(u64), - .max_entries = MAX_CPU * MAX_CSTATE_ENTRIES, 
-}; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, u32); + __type(value, u64); + __uint(max_entries, MAX_CPU * MAX_CSTATE_ENTRIES); +} cstate_duration SEC(".maps"); /* pstate_duration records duration time for every operating point per CPU */ -struct bpf_map_def SEC("maps") pstate_duration = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(u32), - .value_size = sizeof(u64), - .max_entries = MAX_CPU * MAX_PSTATE_ENTRIES, -}; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, u32); + __type(value, u64); + __uint(max_entries, MAX_CPU * MAX_PSTATE_ENTRIES); +} pstate_duration SEC(".maps"); /* * The trace events for cpu_idle and cpu_frequency are taken from: diff --git a/samples/bpf/cpustat_user.c b/samples/bpf/cpustat_user.c index 869a99406dbf..96675985e9e0 100644 --- a/samples/bpf/cpustat_user.c +++ b/samples/bpf/cpustat_user.c @@ -9,7 +9,6 @@ #include <string.h> #include <unistd.h> #include <fcntl.h> -#include <linux/bpf.h> #include <locale.h> #include <sys/types.h> #include <sys/stat.h> @@ -18,7 +17,9 @@ #include <sys/wait.h> #include <bpf/bpf.h> -#include "bpf_load.h" +#include <bpf/libbpf.h> + +static int cstate_map_fd, pstate_map_fd; #define MAX_CPU 8 #define MAX_PSTATE_ENTRIES 5 @@ -181,21 +182,50 @@ static void int_exit(int sig) { cpu_stat_inject_cpu_idle_event(); cpu_stat_inject_cpu_frequency_event(); - cpu_stat_update(map_fd[1], map_fd[2]); + cpu_stat_update(cstate_map_fd, pstate_map_fd); cpu_stat_print(); exit(0); } int main(int argc, char **argv) { + struct bpf_link *link = NULL; + struct bpf_program *prog; + struct bpf_object *obj; char filename[256]; int ret; snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + obj = bpf_object__open_file(filename, NULL); + if (libbpf_get_error(obj)) { + fprintf(stderr, "ERROR: opening BPF object file failed\n"); + return 0; + } - if (load_bpf_file(filename)) { - printf("%s", bpf_log_buf); - return 1; + prog = bpf_object__find_program_by_name(obj, "bpf_prog1"); + if (!prog) { + printf("finding a prog in obj file failed\n"); + goto cleanup; + } + + /* load BPF program */ + if (bpf_object__load(obj)) { + fprintf(stderr, "ERROR: loading BPF object file failed\n"); + goto cleanup; + } + + cstate_map_fd = bpf_object__find_map_fd_by_name(obj, "cstate_duration"); + pstate_map_fd = bpf_object__find_map_fd_by_name(obj, "pstate_duration"); + if (cstate_map_fd < 0 || pstate_map_fd < 0) { + fprintf(stderr, "ERROR: finding a map in obj file failed\n"); + goto cleanup; + } + + link = bpf_program__attach(prog); + if (libbpf_get_error(link)) { + fprintf(stderr, "ERROR: bpf_program__attach failed\n"); + link = NULL; + goto cleanup; } ret = cpu_stat_inject_cpu_idle_event(); @@ -210,10 +240,13 @@ int main(int argc, char **argv) signal(SIGTERM, int_exit); while (1) { - cpu_stat_update(map_fd[1], map_fd[2]); + cpu_stat_update(cstate_map_fd, pstate_map_fd); cpu_stat_print(); sleep(5); } +cleanup: + bpf_link__destroy(link); + bpf_object__close(obj); return 0; } diff --git a/samples/bpf/lathist_kern.c b/samples/bpf/lathist_kern.c index ca9c2e4e69aa..4adfcbbe6ef4 100644 --- a/samples/bpf/lathist_kern.c +++ b/samples/bpf/lathist_kern.c @@ -18,12 +18,12 @@ * trace_preempt_[on|off] tracepoints hooks is not supported. 
*/ -struct bpf_map_def SEC("maps") my_map = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(int), - .value_size = sizeof(u64), - .max_entries = MAX_CPU, -}; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, int); + __type(value, u64); + __uint(max_entries, MAX_CPU); +} my_map SEC(".maps"); SEC("kprobe/trace_preempt_off") int bpf_prog1(struct pt_regs *ctx) @@ -61,12 +61,12 @@ static unsigned int log2l(unsigned long v) return log2(v); } -struct bpf_map_def SEC("maps") my_lat = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(int), - .value_size = sizeof(long), - .max_entries = MAX_CPU * MAX_ENTRIES, -}; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, int); + __type(value, long); + __uint(max_entries, MAX_CPU * MAX_ENTRIES); +} my_lat SEC(".maps"); SEC("kprobe/trace_preempt_on") int bpf_prog2(struct pt_regs *ctx) diff --git a/samples/bpf/lathist_user.c b/samples/bpf/lathist_user.c index 2ff2839a52d5..7d8ff2418303 100644 --- a/samples/bpf/lathist_user.c +++ b/samples/bpf/lathist_user.c @@ -6,9 +6,8 @@ #include <unistd.h> #include <stdlib.h> #include <signal.h> -#include <linux/bpf.h> +#include <bpf/libbpf.h> #include <bpf/bpf.h> -#include "bpf_load.h" #define MAX_ENTRIES 20 #define MAX_CPU 4 @@ -81,20 +80,51 @@ static void get_data(int fd) int main(int argc, char **argv) { + struct bpf_link *links[2]; + struct bpf_program *prog; + struct bpf_object *obj; char filename[256]; + int map_fd, i = 0; snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + obj = bpf_object__open_file(filename, NULL); + if (libbpf_get_error(obj)) { + fprintf(stderr, "ERROR: opening BPF object file failed\n"); + return 0; + } + + /* load BPF program */ + if (bpf_object__load(obj)) { + fprintf(stderr, "ERROR: loading BPF object file failed\n"); + goto cleanup; + } - if (load_bpf_file(filename)) { - printf("%s", bpf_log_buf); - return 1; + map_fd = bpf_object__find_map_fd_by_name(obj, "my_lat"); + if (map_fd < 0) { + fprintf(stderr, "ERROR: finding a map in obj file failed\n"); + goto cleanup; + } + + bpf_object__for_each_program(prog, obj) { + links[i] = bpf_program__attach(prog); + if (libbpf_get_error(links[i])) { + fprintf(stderr, "ERROR: bpf_program__attach failed\n"); + links[i] = NULL; + goto cleanup; + } + i++; } while (1) { - get_data(map_fd[1]); + get_data(map_fd); print_hist(); sleep(5); } +cleanup: + for (i--; i >= 0; i--) + bpf_link__destroy(links[i]); + + bpf_object__close(obj); return 0; } diff --git a/samples/bpf/offwaketime_kern.c b/samples/bpf/offwaketime_kern.c index e74ee1cd4b9c..14b792915a9c 100644 --- a/samples/bpf/offwaketime_kern.c +++ b/samples/bpf/offwaketime_kern.c @@ -28,38 +28,38 @@ struct key_t { u32 tret; }; -struct bpf_map_def SEC("maps") counts = { - .type = BPF_MAP_TYPE_HASH, - .key_size = sizeof(struct key_t), - .value_size = sizeof(u64), - .max_entries = 10000, -}; - -struct bpf_map_def SEC("maps") start = { - .type = BPF_MAP_TYPE_HASH, - .key_size = sizeof(u32), - .value_size = sizeof(u64), - .max_entries = 10000, -}; +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, struct key_t); + __type(value, u64); + __uint(max_entries, 10000); +} counts SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, u32); + __type(value, u64); + __uint(max_entries, 10000); +} start SEC(".maps"); struct wokeby_t { char name[TASK_COMM_LEN]; u32 ret; }; -struct bpf_map_def SEC("maps") wokeby = { - .type = BPF_MAP_TYPE_HASH, - .key_size = sizeof(u32), - .value_size = sizeof(struct wokeby_t), - .max_entries = 10000, -}; - -struct bpf_map_def 
SEC("maps") stackmap = { - .type = BPF_MAP_TYPE_STACK_TRACE, - .key_size = sizeof(u32), - .value_size = PERF_MAX_STACK_DEPTH * sizeof(u64), - .max_entries = 10000, -}; +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, u32); + __type(value, struct wokeby_t); + __uint(max_entries, 10000); +} wokeby SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_STACK_TRACE); + __uint(key_size, sizeof(u32)); + __uint(value_size, PERF_MAX_STACK_DEPTH * sizeof(u64)); + __uint(max_entries, 10000); +} stackmap SEC(".maps"); #define STACKID_FLAGS (0 | BPF_F_FAST_STACK_CMP) diff --git a/samples/bpf/offwaketime_user.c b/samples/bpf/offwaketime_user.c index 51c7da5341cc..5734cfdaaacb 100644 --- a/samples/bpf/offwaketime_user.c +++ b/samples/bpf/offwaketime_user.c @@ -5,19 +5,19 @@ #include <unistd.h> #include <stdlib.h> #include <signal.h> -#include <linux/bpf.h> -#include <string.h> #include <linux/perf_event.h> #include <errno.h> -#include <assert.h> #include <stdbool.h> #include <sys/resource.h> #include <bpf/libbpf.h> -#include "bpf_load.h" +#include <bpf/bpf.h> #include "trace_helpers.h" #define PRINT_RAW_ADDR 0 +/* counts, stackmap */ +static int map_fd[2]; + static void print_ksym(__u64 addr) { struct ksym *sym; @@ -52,14 +52,14 @@ static void print_stack(struct key_t *key, __u64 count) int i; printf("%s;", key->target); - if (bpf_map_lookup_elem(map_fd[3], &key->tret, ip) != 0) { + if (bpf_map_lookup_elem(map_fd[1], &key->tret, ip) != 0) { printf("---;"); } else { for (i = PERF_MAX_STACK_DEPTH - 1; i >= 0; i--) print_ksym(ip[i]); } printf("-;"); - if (bpf_map_lookup_elem(map_fd[3], &key->wret, ip) != 0) { + if (bpf_map_lookup_elem(map_fd[1], &key->wret, ip) != 0) { printf("---;"); } else { for (i = 0; i < PERF_MAX_STACK_DEPTH; i++) @@ -96,23 +96,54 @@ static void int_exit(int sig) int main(int argc, char **argv) { struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + struct bpf_object *obj = NULL; + struct bpf_link *links[2]; + struct bpf_program *prog; + int delay = 1, i = 0; char filename[256]; - int delay = 1; - - snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); - setrlimit(RLIMIT_MEMLOCK, &r); - signal(SIGINT, int_exit); - signal(SIGTERM, int_exit); + if (setrlimit(RLIMIT_MEMLOCK, &r)) { + perror("setrlimit(RLIMIT_MEMLOCK)"); + return 1; + } if (load_kallsyms()) { printf("failed to process /proc/kallsyms\n"); return 2; } - if (load_bpf_file(filename)) { - printf("%s", bpf_log_buf); - return 1; + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + obj = bpf_object__open_file(filename, NULL); + if (libbpf_get_error(obj)) { + fprintf(stderr, "ERROR: opening BPF object file failed\n"); + obj = NULL; + goto cleanup; + } + + /* load BPF program */ + if (bpf_object__load(obj)) { + fprintf(stderr, "ERROR: loading BPF object file failed\n"); + goto cleanup; + } + + map_fd[0] = bpf_object__find_map_fd_by_name(obj, "counts"); + map_fd[1] = bpf_object__find_map_fd_by_name(obj, "stackmap"); + if (map_fd[0] < 0 || map_fd[1] < 0) { + fprintf(stderr, "ERROR: finding a map in obj file failed\n"); + goto cleanup; + } + + signal(SIGINT, int_exit); + signal(SIGTERM, int_exit); + + bpf_object__for_each_program(prog, obj) { + links[i] = bpf_program__attach(prog); + if (libbpf_get_error(links[i])) { + fprintf(stderr, "ERROR: bpf_program__attach failed\n"); + links[i] = NULL; + goto cleanup; + } + i++; } if (argc > 1) @@ -120,5 +151,10 @@ int main(int argc, char **argv) sleep(delay); print_stacks(map_fd[0]); +cleanup: + for (i--; i >= 0; i--) + bpf_link__destroy(links[i]); + + 
bpf_object__close(obj); return 0; } diff --git a/samples/bpf/spintest_kern.c b/samples/bpf/spintest_kern.c index f508af357251..455da77319d9 100644 --- a/samples/bpf/spintest_kern.c +++ b/samples/bpf/spintest_kern.c @@ -12,25 +12,25 @@ #include <bpf/bpf_helpers.h> #include <bpf/bpf_tracing.h> -struct bpf_map_def SEC("maps") my_map = { - .type = BPF_MAP_TYPE_HASH, - .key_size = sizeof(long), - .value_size = sizeof(long), - .max_entries = 1024, -}; -struct bpf_map_def SEC("maps") my_map2 = { - .type = BPF_MAP_TYPE_PERCPU_HASH, - .key_size = sizeof(long), - .value_size = sizeof(long), - .max_entries = 1024, -}; +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, long); + __type(value, long); + __uint(max_entries, 1024); +} my_map SEC(".maps"); +struct { + __uint(type, BPF_MAP_TYPE_PERCPU_HASH); + __uint(key_size, sizeof(long)); + __uint(value_size, sizeof(long)); + __uint(max_entries, 1024); +} my_map2 SEC(".maps"); -struct bpf_map_def SEC("maps") stackmap = { - .type = BPF_MAP_TYPE_STACK_TRACE, - .key_size = sizeof(u32), - .value_size = PERF_MAX_STACK_DEPTH * sizeof(u64), - .max_entries = 10000, -}; +struct { + __uint(type, BPF_MAP_TYPE_STACK_TRACE); + __uint(key_size, sizeof(u32)); + __uint(value_size, PERF_MAX_STACK_DEPTH * sizeof(u64)); + __uint(max_entries, 10000); +} stackmap SEC(".maps"); #define PROG(foo) \ int foo(struct pt_regs *ctx) \ diff --git a/samples/bpf/spintest_user.c b/samples/bpf/spintest_user.c index fb430ea2ef51..847da9284fa8 100644 --- a/samples/bpf/spintest_user.c +++ b/samples/bpf/spintest_user.c @@ -1,40 +1,77 @@ // SPDX-License-Identifier: GPL-2.0 #include <stdio.h> #include <unistd.h> -#include <linux/bpf.h> #include <string.h> #include <assert.h> #include <sys/resource.h> #include <bpf/libbpf.h> -#include "bpf_load.h" +#include <bpf/bpf.h> #include "trace_helpers.h" int main(int ac, char **argv) { struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + char filename[256], symbol[256]; + struct bpf_object *obj = NULL; + struct bpf_link *links[20]; long key, next_key, value; - char filename[256]; + struct bpf_program *prog; + int map_fd, i, j = 0; + const char *title; struct ksym *sym; - int i; - snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); - setrlimit(RLIMIT_MEMLOCK, &r); + if (setrlimit(RLIMIT_MEMLOCK, &r)) { + perror("setrlimit(RLIMIT_MEMLOCK)"); + return 1; + } if (load_kallsyms()) { printf("failed to process /proc/kallsyms\n"); return 2; } - if (load_bpf_file(filename)) { - printf("%s", bpf_log_buf); - return 1; + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + obj = bpf_object__open_file(filename, NULL); + if (libbpf_get_error(obj)) { + fprintf(stderr, "ERROR: opening BPF object file failed\n"); + obj = NULL; + goto cleanup; + } + + /* load BPF program */ + if (bpf_object__load(obj)) { + fprintf(stderr, "ERROR: loading BPF object file failed\n"); + goto cleanup; + } + + map_fd = bpf_object__find_map_fd_by_name(obj, "my_map"); + if (map_fd < 0) { + fprintf(stderr, "ERROR: finding a map in obj file failed\n"); + goto cleanup; + } + + bpf_object__for_each_program(prog, obj) { + title = bpf_program__title(prog, false); + if (sscanf(title, "kprobe/%s", symbol) != 1) + continue; + + /* Attach prog only when symbol exists */ + if (ksym_get_addr(symbol)) { + links[j] = bpf_program__attach(prog); + if (libbpf_get_error(links[j])) { + fprintf(stderr, "bpf_program__attach failed\n"); + links[j] = NULL; + goto cleanup; + } + j++; + } } for (i = 0; i < 5; i++) { key = 0; printf("kprobing funcs:"); - while (bpf_map_get_next_key(map_fd[0], 
&key, &next_key) == 0) { - bpf_map_lookup_elem(map_fd[0], &next_key, &value); + while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) { + bpf_map_lookup_elem(map_fd, &next_key, &value); assert(next_key == value); sym = ksym_search(value); key = next_key; @@ -48,10 +85,15 @@ int main(int ac, char **argv) if (key) printf("\n"); key = 0; - while (bpf_map_get_next_key(map_fd[0], &key, &next_key) == 0) - bpf_map_delete_elem(map_fd[0], &next_key); + while (bpf_map_get_next_key(map_fd, &key, &next_key) == 0) + bpf_map_delete_elem(map_fd, &next_key); sleep(1); } +cleanup: + for (j--; j >= 0; j--) + bpf_link__destroy(links[j]); + + bpf_object__close(obj); return 0; } diff --git a/samples/bpf/syscall_tp_kern.c b/samples/bpf/syscall_tp_kern.c index 5a62b03b1f88..50231c2eff9c 100644 --- a/samples/bpf/syscall_tp_kern.c +++ b/samples/bpf/syscall_tp_kern.c @@ -18,19 +18,19 @@ struct syscalls_exit_open_args { long ret; }; -struct bpf_map_def SEC("maps") enter_open_map = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(u32), - .value_size = sizeof(u32), - .max_entries = 1, -}; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, u32); + __type(value, u32); + __uint(max_entries, 1); +} enter_open_map SEC(".maps"); -struct bpf_map_def SEC("maps") exit_open_map = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(u32), - .value_size = sizeof(u32), - .max_entries = 1, -}; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, u32); + __type(value, u32); + __uint(max_entries, 1); +} exit_open_map SEC(".maps"); static __always_inline void count(void *map) { diff --git a/samples/bpf/syscall_tp_user.c b/samples/bpf/syscall_tp_user.c index 57014bab7cbe..76a1d00128fb 100644 --- a/samples/bpf/syscall_tp_user.c +++ b/samples/bpf/syscall_tp_user.c @@ -5,16 +5,12 @@ #include <unistd.h> #include <fcntl.h> #include <stdlib.h> -#include <signal.h> -#include <linux/bpf.h> #include <string.h> #include <linux/perf_event.h> #include <errno.h> -#include <assert.h> -#include <stdbool.h> #include <sys/resource.h> +#include <bpf/libbpf.h> #include <bpf/bpf.h> -#include "bpf_load.h" /* This program verifies bpf attachment to tracepoint sys_enter_* and sys_exit_*. * This requires kernel CONFIG_FTRACE_SYSCALLS to be set. 
@@ -49,16 +45,44 @@ static void verify_map(int map_id) static int test(char *filename, int num_progs) { - int i, fd, map0_fds[num_progs], map1_fds[num_progs]; + int map0_fds[num_progs], map1_fds[num_progs], fd, i, j = 0; + struct bpf_link *links[num_progs * 4]; + struct bpf_object *objs[num_progs]; + struct bpf_program *prog; for (i = 0; i < num_progs; i++) { - if (load_bpf_file(filename)) { - fprintf(stderr, "%s", bpf_log_buf); - return 1; + objs[i] = bpf_object__open_file(filename, NULL); + if (libbpf_get_error(objs[i])) { + fprintf(stderr, "opening BPF object file failed\n"); + objs[i] = NULL; + goto cleanup; } - printf("prog #%d: map ids %d %d\n", i, map_fd[0], map_fd[1]); - map0_fds[i] = map_fd[0]; - map1_fds[i] = map_fd[1]; + + /* load BPF program */ + if (bpf_object__load(objs[i])) { + fprintf(stderr, "loading BPF object file failed\n"); + goto cleanup; + } + + map0_fds[i] = bpf_object__find_map_fd_by_name(objs[i], + "enter_open_map"); + map1_fds[i] = bpf_object__find_map_fd_by_name(objs[i], + "exit_open_map"); + if (map0_fds[i] < 0 || map1_fds[i] < 0) { + fprintf(stderr, "finding a map in obj file failed\n"); + goto cleanup; + } + + bpf_object__for_each_program(prog, objs[i]) { + links[j] = bpf_program__attach(prog); + if (libbpf_get_error(links[j])) { + fprintf(stderr, "bpf_program__attach failed\n"); + links[j] = NULL; + goto cleanup; + } + j++; + } + printf("prog #%d: map ids %d %d\n", i, map0_fds[i], map1_fds[i]); } /* current load_bpf_file has perf_event_open default pid = -1 @@ -80,6 +104,12 @@ static int test(char *filename, int num_progs) verify_map(map1_fds[i]); } +cleanup: + for (j--; j >= 0; j--) + bpf_link__destroy(links[j]); + + for (i--; i >= 0; i--) + bpf_object__close(objs[i]); return 0; } diff --git a/samples/bpf/task_fd_query_kern.c b/samples/bpf/task_fd_query_kern.c index 278ade5427c8..c821294e1774 100644 --- a/samples/bpf/task_fd_query_kern.c +++ b/samples/bpf/task_fd_query_kern.c @@ -10,7 +10,7 @@ int bpf_prog1(struct pt_regs *ctx) return 0; } -SEC("kretprobe/blk_account_io_completion") +SEC("kretprobe/blk_account_io_done") int bpf_prog2(struct pt_regs *ctx) { return 0; diff --git a/samples/bpf/task_fd_query_user.c b/samples/bpf/task_fd_query_user.c index ff2e9c1c7266..4a74531dc403 100644 --- a/samples/bpf/task_fd_query_user.c +++ b/samples/bpf/task_fd_query_user.c @@ -314,7 +314,7 @@ int main(int argc, char **argv) /* test two functions in the corresponding *_kern.c file */ CHECK_AND_RET(test_debug_fs_kprobe(0, "blk_mq_start_request", BPF_FD_TYPE_KPROBE)); - CHECK_AND_RET(test_debug_fs_kprobe(1, "blk_account_io_completion", + CHECK_AND_RET(test_debug_fs_kprobe(1, "blk_account_io_done", BPF_FD_TYPE_KRETPROBE)); /* test nondebug fs kprobe */ diff --git a/samples/bpf/test_current_task_under_cgroup_kern.c b/samples/bpf/test_current_task_under_cgroup_kern.c index 6dc4f41bb6cb..fbd43e2bb4d3 100644 --- a/samples/bpf/test_current_task_under_cgroup_kern.c +++ b/samples/bpf/test_current_task_under_cgroup_kern.c @@ -10,23 +10,24 @@ #include <linux/version.h> #include <bpf/bpf_helpers.h> #include <uapi/linux/utsname.h> +#include "trace_common.h" -struct bpf_map_def SEC("maps") cgroup_map = { - .type = BPF_MAP_TYPE_CGROUP_ARRAY, - .key_size = sizeof(u32), - .value_size = sizeof(u32), - .max_entries = 1, -}; +struct { + __uint(type, BPF_MAP_TYPE_CGROUP_ARRAY); + __uint(key_size, sizeof(u32)); + __uint(value_size, sizeof(u32)); + __uint(max_entries, 1); +} cgroup_map SEC(".maps"); -struct bpf_map_def SEC("maps") perf_map = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = 
sizeof(u32), - .value_size = sizeof(u64), - .max_entries = 1, -}; +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, u32); + __type(value, u64); + __uint(max_entries, 1); +} perf_map SEC(".maps"); /* Writes the last PID that called sync to a map at index 0 */ -SEC("kprobe/sys_sync") +SEC("kprobe/" SYSCALL(sys_sync)) int bpf_prog1(struct pt_regs *ctx) { u64 pid = bpf_get_current_pid_tgid(); diff --git a/samples/bpf/test_current_task_under_cgroup_user.c b/samples/bpf/test_current_task_under_cgroup_user.c index 06e9f8ce42e2..ac251a417f45 100644 --- a/samples/bpf/test_current_task_under_cgroup_user.c +++ b/samples/bpf/test_current_task_under_cgroup_user.c @@ -4,10 +4,9 @@ #define _GNU_SOURCE #include <stdio.h> -#include <linux/bpf.h> #include <unistd.h> #include <bpf/bpf.h> -#include "bpf_load.h" +#include <bpf/libbpf.h> #include "cgroup_helpers.h" #define CGROUP_PATH "/my-cgroup" @@ -15,13 +14,44 @@ int main(int argc, char **argv) { pid_t remote_pid, local_pid = getpid(); - int cg2, idx = 0, rc = 0; + struct bpf_link *link = NULL; + struct bpf_program *prog; + int cg2, idx = 0, rc = 1; + struct bpf_object *obj; char filename[256]; + int map_fd[2]; snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); - if (load_bpf_file(filename)) { - printf("%s", bpf_log_buf); - return 1; + obj = bpf_object__open_file(filename, NULL); + if (libbpf_get_error(obj)) { + fprintf(stderr, "ERROR: opening BPF object file failed\n"); + return 0; + } + + prog = bpf_object__find_program_by_name(obj, "bpf_prog1"); + if (!prog) { + printf("finding a prog in obj file failed\n"); + goto cleanup; + } + + /* load BPF program */ + if (bpf_object__load(obj)) { + fprintf(stderr, "ERROR: loading BPF object file failed\n"); + goto cleanup; + } + + map_fd[0] = bpf_object__find_map_fd_by_name(obj, "cgroup_map"); + map_fd[1] = bpf_object__find_map_fd_by_name(obj, "perf_map"); + if (map_fd[0] < 0 || map_fd[1] < 0) { + fprintf(stderr, "ERROR: finding a map in obj file failed\n"); + goto cleanup; + } + + link = bpf_program__attach(prog); + if (libbpf_get_error(link)) { + fprintf(stderr, "ERROR: bpf_program__attach failed\n"); + link = NULL; + goto cleanup; } if (setup_cgroup_environment()) @@ -70,12 +100,14 @@ int main(int argc, char **argv) goto err; } - goto out; -err: - rc = 1; + rc = 0; -out: +err: close(cg2); cleanup_cgroup_environment(); + +cleanup: + bpf_link__destroy(link); + bpf_object__close(obj); return rc; } diff --git a/samples/bpf/test_probe_write_user_kern.c b/samples/bpf/test_probe_write_user_kern.c index fd651a65281e..220a96438d75 100644 --- a/samples/bpf/test_probe_write_user_kern.c +++ b/samples/bpf/test_probe_write_user_kern.c @@ -13,12 +13,12 @@ #include <bpf/bpf_core_read.h> #include "trace_common.h" -struct bpf_map_def SEC("maps") dnat_map = { - .type = BPF_MAP_TYPE_HASH, - .key_size = sizeof(struct sockaddr_in), - .value_size = sizeof(struct sockaddr_in), - .max_entries = 256, -}; +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __type(key, struct sockaddr_in); + __type(value, struct sockaddr_in); + __uint(max_entries, 256); +} dnat_map SEC(".maps"); /* kprobe is NOT a stable ABI * kernel functions can be removed, renamed or completely change semantics. 
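Because of that instability, some of the converted samples only attach a kprobe program when its target symbol is actually present on the running kernel, as spintest_user.c above does. A condensed sketch of that guard, reusing load_kallsyms()/ksym_get_addr() from the samples' trace_helpers.h (the function name and link-array handling here are illustrative), is::

    /* Attach only the kprobe programs whose target symbol exists in
     * /proc/kallsyms; condensed from spintest_user.c, with simplified
     * error handling.
     */
    #include <stdio.h>
    #include <bpf/libbpf.h>
    #include "trace_helpers.h"

    static int attach_present_kprobes(struct bpf_object *obj,
                                      struct bpf_link **links, int max_links)
    {
            struct bpf_program *prog;
            char symbol[256];
            int n = 0;

            if (load_kallsyms())
                    return -1;

            bpf_object__for_each_program(prog, obj) {
                    const char *title = bpf_program__title(prog, false);

                    if (sscanf(title, "kprobe/%s", symbol) != 1)
                            continue;
                    if (!ksym_get_addr(symbol))     /* symbol not in this kernel */
                            continue;
                    if (n == max_links)
                            break;
                    links[n] = bpf_program__attach(prog);
                    if (libbpf_get_error(links[n])) {
                            links[n] = NULL;
                            return -1;
                    }
                    n++;
            }
            return n;
    }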
diff --git a/samples/bpf/test_probe_write_user_user.c b/samples/bpf/test_probe_write_user_user.c index 045eb5e30f54..00ccfb834e45 100644 --- a/samples/bpf/test_probe_write_user_user.c +++ b/samples/bpf/test_probe_write_user_user.c @@ -1,21 +1,22 @@ // SPDX-License-Identifier: GPL-2.0 #include <stdio.h> #include <assert.h> -#include <linux/bpf.h> #include <unistd.h> #include <bpf/bpf.h> -#include "bpf_load.h" +#include <bpf/libbpf.h> #include <sys/socket.h> -#include <string.h> #include <netinet/in.h> #include <arpa/inet.h> int main(int ac, char **argv) { - int serverfd, serverconnfd, clientfd; - socklen_t sockaddr_len; - struct sockaddr serv_addr, mapped_addr, tmp_addr; struct sockaddr_in *serv_addr_in, *mapped_addr_in, *tmp_addr_in; + struct sockaddr serv_addr, mapped_addr, tmp_addr; + int serverfd, serverconnfd, clientfd, map_fd; + struct bpf_link *link = NULL; + struct bpf_program *prog; + struct bpf_object *obj; + socklen_t sockaddr_len; char filename[256]; char *ip; @@ -24,10 +25,35 @@ int main(int ac, char **argv) tmp_addr_in = (struct sockaddr_in *)&tmp_addr; snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + obj = bpf_object__open_file(filename, NULL); + if (libbpf_get_error(obj)) { + fprintf(stderr, "ERROR: opening BPF object file failed\n"); + return 0; + } + + prog = bpf_object__find_program_by_name(obj, "bpf_prog1"); + if (libbpf_get_error(prog)) { + fprintf(stderr, "ERROR: finding a prog in obj file failed\n"); + goto cleanup; + } + + /* load BPF program */ + if (bpf_object__load(obj)) { + fprintf(stderr, "ERROR: loading BPF object file failed\n"); + goto cleanup; + } + + map_fd = bpf_object__find_map_fd_by_name(obj, "dnat_map"); + if (map_fd < 0) { + fprintf(stderr, "ERROR: finding a map in obj file failed\n"); + goto cleanup; + } - if (load_bpf_file(filename)) { - printf("%s", bpf_log_buf); - return 1; + link = bpf_program__attach(prog); + if (libbpf_get_error(link)) { + fprintf(stderr, "ERROR: bpf_program__attach failed\n"); + link = NULL; + goto cleanup; } assert((serverfd = socket(AF_INET, SOCK_STREAM, 0)) > 0); @@ -51,7 +77,7 @@ int main(int ac, char **argv) mapped_addr_in->sin_port = htons(5555); mapped_addr_in->sin_addr.s_addr = inet_addr("255.255.255.255"); - assert(!bpf_map_update_elem(map_fd[0], &mapped_addr, &serv_addr, BPF_ANY)); + assert(!bpf_map_update_elem(map_fd, &mapped_addr, &serv_addr, BPF_ANY)); assert(listen(serverfd, 5) == 0); @@ -75,5 +101,8 @@ int main(int ac, char **argv) /* Is the server's getsockname = the socket getpeername */ assert(memcmp(&serv_addr, &tmp_addr, sizeof(struct sockaddr_in)) == 0); +cleanup: + bpf_link__destroy(link); + bpf_object__close(obj); return 0; } diff --git a/samples/bpf/trace_output_kern.c b/samples/bpf/trace_output_kern.c index 1d7d422cae6f..b64815af0943 100644 --- a/samples/bpf/trace_output_kern.c +++ b/samples/bpf/trace_output_kern.c @@ -2,15 +2,16 @@ #include <linux/version.h> #include <uapi/linux/bpf.h> #include <bpf/bpf_helpers.h> +#include "trace_common.h" -struct bpf_map_def SEC("maps") my_map = { - .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY, - .key_size = sizeof(int), - .value_size = sizeof(u32), - .max_entries = 2, -}; +struct { + __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY); + __uint(key_size, sizeof(int)); + __uint(value_size, sizeof(u32)); + __uint(max_entries, 2); +} my_map SEC(".maps"); -SEC("kprobe/sys_write") +SEC("kprobe/" SYSCALL(sys_write)) int bpf_prog1(struct pt_regs *ctx) { struct S { diff --git a/samples/bpf/trace_output_user.c b/samples/bpf/trace_output_user.c index 60a17dd05345..364b98764d54 
100644 --- a/samples/bpf/trace_output_user.c +++ b/samples/bpf/trace_output_user.c @@ -1,23 +1,10 @@ // SPDX-License-Identifier: GPL-2.0-only #include <stdio.h> -#include <unistd.h> -#include <stdlib.h> -#include <stdbool.h> -#include <string.h> #include <fcntl.h> #include <poll.h> -#include <linux/perf_event.h> -#include <linux/bpf.h> -#include <errno.h> -#include <assert.h> -#include <sys/syscall.h> -#include <sys/ioctl.h> -#include <sys/mman.h> #include <time.h> #include <signal.h> #include <bpf/libbpf.h> -#include "bpf_load.h" -#include "perf-sys.h" static __u64 time_get_ns(void) { @@ -57,20 +44,48 @@ static void print_bpf_output(void *ctx, int cpu, void *data, __u32 size) int main(int argc, char **argv) { struct perf_buffer_opts pb_opts = {}; + struct bpf_link *link = NULL; + struct bpf_program *prog; struct perf_buffer *pb; + struct bpf_object *obj; + int map_fd, ret = 0; char filename[256]; FILE *f; - int ret; snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + obj = bpf_object__open_file(filename, NULL); + if (libbpf_get_error(obj)) { + fprintf(stderr, "ERROR: opening BPF object file failed\n"); + return 0; + } - if (load_bpf_file(filename)) { - printf("%s", bpf_log_buf); - return 1; + /* load BPF program */ + if (bpf_object__load(obj)) { + fprintf(stderr, "ERROR: loading BPF object file failed\n"); + goto cleanup; + } + + map_fd = bpf_object__find_map_fd_by_name(obj, "my_map"); + if (map_fd < 0) { + fprintf(stderr, "ERROR: finding a map in obj file failed\n"); + goto cleanup; + } + + prog = bpf_object__find_program_by_name(obj, "bpf_prog1"); + if (libbpf_get_error(prog)) { + fprintf(stderr, "ERROR: finding a prog in obj file failed\n"); + goto cleanup; + } + + link = bpf_program__attach(prog); + if (libbpf_get_error(link)) { + fprintf(stderr, "ERROR: bpf_program__attach failed\n"); + link = NULL; + goto cleanup; } pb_opts.sample_cb = print_bpf_output; - pb = perf_buffer__new(map_fd[0], 8, &pb_opts); + pb = perf_buffer__new(map_fd, 8, &pb_opts); ret = libbpf_get_error(pb); if (ret) { printf("failed to setup perf_buffer: %d\n", ret); @@ -84,5 +99,9 @@ int main(int argc, char **argv) while ((ret = perf_buffer__poll(pb, 1000)) >= 0 && cnt < MAX_CNT) { } kill(0, SIGINT); + +cleanup: + bpf_link__destroy(link); + bpf_object__close(obj); return ret; } diff --git a/samples/bpf/tracex3_kern.c b/samples/bpf/tracex3_kern.c index 659613c19a82..710a4410b2fb 100644 --- a/samples/bpf/tracex3_kern.c +++ b/samples/bpf/tracex3_kern.c @@ -49,7 +49,7 @@ struct { __uint(max_entries, SLOTS); } lat_map SEC(".maps"); -SEC("kprobe/blk_account_io_completion") +SEC("kprobe/blk_account_io_done") int bpf_prog2(struct pt_regs *ctx) { long rq = PT_REGS_PARM1(ctx); diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c index 19c679456a0e..4cead341ae57 100644 --- a/samples/bpf/xdpsock_user.c +++ b/samples/bpf/xdpsock_user.c @@ -613,7 +613,16 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size) { struct xsk_umem_info *umem; struct xsk_umem_config cfg = { - .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, + /* We recommend that you set the fill ring size >= HW RX ring size + + * AF_XDP RX ring size. Make sure you fill up the fill ring + * with buffers at regular intervals, and you will with this setting + * avoid allocation failures in the driver. These are usually quite + * expensive since drivers have not been written to assume that + * allocation failures are common. 
For regular sockets, kernel + * allocated memory is used that only runs out in OOM situations + * that should be rare. + */ + .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS * 2, .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, .frame_size = opt_xsk_frame_size, .frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM, @@ -640,13 +649,13 @@ static void xsk_populate_fill_ring(struct xsk_umem_info *umem) u32 idx; ret = xsk_ring_prod__reserve(&umem->fq, - XSK_RING_PROD__DEFAULT_NUM_DESCS, &idx); - if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS) + XSK_RING_PROD__DEFAULT_NUM_DESCS * 2, &idx); + if (ret != XSK_RING_PROD__DEFAULT_NUM_DESCS * 2) exit_with_error(-ret); - for (i = 0; i < XSK_RING_PROD__DEFAULT_NUM_DESCS; i++) + for (i = 0; i < XSK_RING_PROD__DEFAULT_NUM_DESCS * 2; i++) *xsk_ring_prod__fill_addr(&umem->fq, idx++) = i * opt_xsk_frame_size; - xsk_ring_prod__submit(&umem->fq, XSK_RING_PROD__DEFAULT_NUM_DESCS); + xsk_ring_prod__submit(&umem->fq, XSK_RING_PROD__DEFAULT_NUM_DESCS * 2); } static struct xsk_socket_info *xsk_configure_socket(struct xsk_umem_info *umem, @@ -888,9 +897,6 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk, if (!xsk->outstanding_tx) return; - if (!opt_need_wakeup || xsk_ring_prod__needs_wakeup(&xsk->tx)) - kick_tx(xsk); - ndescs = (xsk->outstanding_tx > opt_batch_size) ? opt_batch_size : xsk->outstanding_tx; @@ -1004,7 +1010,7 @@ static void rx_drop_all(void) } } -static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb, int batch_size) +static void tx_only(struct xsk_socket_info *xsk, u32 *frame_nb, int batch_size) { u32 idx; unsigned int i; @@ -1017,14 +1023,14 @@ static void tx_only(struct xsk_socket_info *xsk, u32 frame_nb, int batch_size) for (i = 0; i < batch_size; i++) { struct xdp_desc *tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx + i); - tx_desc->addr = (frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT; + tx_desc->addr = (*frame_nb + i) << XSK_UMEM__DEFAULT_FRAME_SHIFT; tx_desc->len = PKT_SIZE; } xsk_ring_prod__submit(&xsk->tx, batch_size); xsk->outstanding_tx += batch_size; - frame_nb += batch_size; - frame_nb %= NUM_FRAMES; + *frame_nb += batch_size; + *frame_nb %= NUM_FRAMES; complete_tx_only(xsk, batch_size); } @@ -1080,7 +1086,7 @@ static void tx_only_all(void) } for (i = 0; i < num_socks; i++) - tx_only(xsks[i], frame_nb[i], batch_size); + tx_only(xsks[i], &frame_nb[i], batch_size); pkt_cnt += batch_size; diff --git a/samples/bpf/xsk_fwd.c b/samples/bpf/xsk_fwd.c new file mode 100644 index 000000000000..1cd97c84c337 --- /dev/null +++ b/samples/bpf/xsk_fwd.c @@ -0,0 +1,1085 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright(c) 2020 Intel Corporation. */ + +#define _GNU_SOURCE +#include <poll.h> +#include <pthread.h> +#include <signal.h> +#include <sched.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <sys/resource.h> +#include <sys/socket.h> +#include <sys/types.h> +#include <time.h> +#include <unistd.h> +#include <getopt.h> +#include <netinet/ether.h> +#include <net/if.h> + +#include <linux/bpf.h> +#include <linux/if_link.h> +#include <linux/if_xdp.h> + +#include <bpf/libbpf.h> +#include <bpf/xsk.h> +#include <bpf/bpf.h> + +#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0])) + +typedef __u64 u64; +typedef __u32 u32; +typedef __u16 u16; +typedef __u8 u8; + +/* This program illustrates the packet forwarding between multiple AF_XDP + * sockets in multi-threaded environment. All threads are sharing a common + * buffer pool, with each socket having its own private buffer cache. 
+ * + * Example 1: Single thread handling two sockets. The packets received by socket + * A (interface IFA, queue QA) are forwarded to socket B (interface IFB, queue + * QB), while the packets received by socket B are forwarded to socket A. The + * thread is running on CPU core X: + * + * ./xsk_fwd -i IFA -q QA -i IFB -q QB -c X + * + * Example 2: Two threads, each handling two sockets. The thread running on CPU + * core X forwards all the packets received by socket A to socket B, and all the + * packets received by socket B to socket A. The thread running on CPU core Y is + * performing the same packet forwarding between sockets C and D: + * + * ./xsk_fwd -i IFA -q QA -i IFB -q QB -i IFC -q QC -i IFD -q QD + * -c CX -c CY + */ + +/* + * Buffer pool and buffer cache + * + * For packet forwarding, the packet buffers are typically allocated from the + * pool for packet reception and freed back to the pool for further reuse once + * the packet transmission is completed. + * + * The buffer pool is shared between multiple threads. In order to minimize the + * access latency to the shared buffer pool, each thread creates one (or + * several) buffer caches, which, unlike the buffer pool, are private to the + * thread that creates them and therefore cannot be shared with other threads. + * The access to the shared pool is only needed either (A) when the cache gets + * empty due to repeated buffer allocations and it needs to be replenished from + * the pool, or (B) when the cache gets full due to repeated buffer free and it + * needs to be flushed back to the pool. + * + * In a packet forwarding system, a packet received on any input port can + * potentially be transmitted on any output port, depending on the forwarding + * configuration. For AF_XDP sockets, for this to work with zero-copy of the + * packet buffers, it is required that the buffer pool memory fits into the + * UMEM area shared by all the sockets. + */ + +struct bpool_params { + u32 n_buffers; + u32 buffer_size; + int mmap_flags; + + u32 n_users_max; + u32 n_buffers_per_slab; +}; + +/* This buffer pool implementation organizes the buffers into equally sized + * slabs of *n_buffers_per_slab*. Initially, there are *n_slabs* slabs in the + * pool that are completely filled with buffer pointers (full slabs). + * + * Each buffer cache has a slab for buffer allocation and a slab for buffer + * free, with both of these slabs initially empty. When the cache's allocation + * slab goes empty, it is swapped with one of the available full slabs from the + * pool, if any is available. When the cache's free slab goes full, it is + * swapped for one of the empty slabs from the pool, which is guaranteed to + * succeed. + * + * Partially filled slabs never get traded between the cache and the pool + * (except when the cache itself is destroyed), which enables fast operation + * through pointer swapping.
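+ * + * As a concrete example, the default parameters used below (64 * 1024 buffers, 2 * XSK_RING_PROD__DEFAULT_NUM_DESCS buffers per slab, XSK_UMEM__DEFAULT_FRAME_SIZE bytes per buffer) result in 16 full slabs of 4096 buffers each, backed by a 256 MB UMEM area.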
+ */ +struct bpool { + struct bpool_params params; + pthread_mutex_t lock; + void *addr; + + u64 **slabs; + u64 **slabs_reserved; + u64 *buffers; + u64 *buffers_reserved; + + u64 n_slabs; + u64 n_slabs_reserved; + u64 n_buffers; + + u64 n_slabs_available; + u64 n_slabs_reserved_available; + + struct xsk_umem_config umem_cfg; + struct xsk_ring_prod umem_fq; + struct xsk_ring_cons umem_cq; + struct xsk_umem *umem; +}; + +static struct bpool * +bpool_init(struct bpool_params *params, + struct xsk_umem_config *umem_cfg) +{ + struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY}; + u64 n_slabs, n_slabs_reserved, n_buffers, n_buffers_reserved; + u64 slabs_size, slabs_reserved_size; + u64 buffers_size, buffers_reserved_size; + u64 total_size, i; + struct bpool *bp; + u8 *p; + int status; + + /* mmap prep. */ + if (setrlimit(RLIMIT_MEMLOCK, &r)) + return NULL; + + /* bpool internals dimensioning. */ + n_slabs = (params->n_buffers + params->n_buffers_per_slab - 1) / + params->n_buffers_per_slab; + n_slabs_reserved = params->n_users_max * 2; + n_buffers = n_slabs * params->n_buffers_per_slab; + n_buffers_reserved = n_slabs_reserved * params->n_buffers_per_slab; + + slabs_size = n_slabs * sizeof(u64 *); + slabs_reserved_size = n_slabs_reserved * sizeof(u64 *); + buffers_size = n_buffers * sizeof(u64); + buffers_reserved_size = n_buffers_reserved * sizeof(u64); + + total_size = sizeof(struct bpool) + + slabs_size + slabs_reserved_size + + buffers_size + buffers_reserved_size; + + /* bpool memory allocation. */ + p = calloc(total_size, sizeof(u8)); + if (!p) + return NULL; + + /* bpool memory initialization. */ + bp = (struct bpool *)p; + memcpy(&bp->params, params, sizeof(*params)); + bp->params.n_buffers = n_buffers; + + bp->slabs = (u64 **)&p[sizeof(struct bpool)]; + bp->slabs_reserved = (u64 **)&p[sizeof(struct bpool) + + slabs_size]; + bp->buffers = (u64 *)&p[sizeof(struct bpool) + + slabs_size + slabs_reserved_size]; + bp->buffers_reserved = (u64 *)&p[sizeof(struct bpool) + + slabs_size + slabs_reserved_size + buffers_size]; + + bp->n_slabs = n_slabs; + bp->n_slabs_reserved = n_slabs_reserved; + bp->n_buffers = n_buffers; + + for (i = 0; i < n_slabs; i++) + bp->slabs[i] = &bp->buffers[i * params->n_buffers_per_slab]; + bp->n_slabs_available = n_slabs; + + for (i = 0; i < n_slabs_reserved; i++) + bp->slabs_reserved[i] = &bp->buffers_reserved[i * + params->n_buffers_per_slab]; + bp->n_slabs_reserved_available = n_slabs_reserved; + + for (i = 0; i < n_buffers; i++) + bp->buffers[i] = i * params->buffer_size; + + /* lock. */ + status = pthread_mutex_init(&bp->lock, NULL); + if (status) { + free(p); + return NULL; + } + + /* mmap. */ + bp->addr = mmap(NULL, + n_buffers * params->buffer_size, + PROT_READ | PROT_WRITE, + MAP_PRIVATE | MAP_ANONYMOUS | params->mmap_flags, + -1, + 0); + if (bp->addr == MAP_FAILED) { + pthread_mutex_destroy(&bp->lock); + free(p); + return NULL; + } + + /* umem. 
*/ + status = xsk_umem__create(&bp->umem, + bp->addr, + bp->params.n_buffers * bp->params.buffer_size, + &bp->umem_fq, + &bp->umem_cq, + umem_cfg); + if (status) { + munmap(bp->addr, bp->params.n_buffers * bp->params.buffer_size); + pthread_mutex_destroy(&bp->lock); + free(p); + return NULL; + } + memcpy(&bp->umem_cfg, umem_cfg, sizeof(*umem_cfg)); + + return bp; +} + +static void +bpool_free(struct bpool *bp) +{ + if (!bp) + return; + + xsk_umem__delete(bp->umem); + munmap(bp->addr, bp->params.n_buffers * bp->params.buffer_size); + pthread_mutex_destroy(&bp->lock); + free(bp); +} + +struct bcache { + struct bpool *bp; + + u64 *slab_cons; + u64 *slab_prod; + + u64 n_buffers_cons; + u64 n_buffers_prod; +}; + +static u32 +bcache_slab_size(struct bcache *bc) +{ + struct bpool *bp = bc->bp; + + return bp->params.n_buffers_per_slab; +} + +static struct bcache * +bcache_init(struct bpool *bp) +{ + struct bcache *bc; + + bc = calloc(1, sizeof(struct bcache)); + if (!bc) + return NULL; + + bc->bp = bp; + bc->n_buffers_cons = 0; + bc->n_buffers_prod = 0; + + pthread_mutex_lock(&bp->lock); + if (bp->n_slabs_reserved_available == 0) { + pthread_mutex_unlock(&bp->lock); + free(bc); + return NULL; + } + + bc->slab_cons = bp->slabs_reserved[bp->n_slabs_reserved_available - 1]; + bc->slab_prod = bp->slabs_reserved[bp->n_slabs_reserved_available - 2]; + bp->n_slabs_reserved_available -= 2; + pthread_mutex_unlock(&bp->lock); + + return bc; +} + +static void +bcache_free(struct bcache *bc) +{ + struct bpool *bp; + + if (!bc) + return; + + /* In order to keep this example simple, the case of freeing any + * existing buffers from the cache back to the pool is ignored. + */ + + bp = bc->bp; + pthread_mutex_lock(&bp->lock); + bp->slabs_reserved[bp->n_slabs_reserved_available] = bc->slab_prod; + bp->slabs_reserved[bp->n_slabs_reserved_available + 1] = bc->slab_cons; + bp->n_slabs_reserved_available += 2; + pthread_mutex_unlock(&bp->lock); + + free(bc); +} + +/* To work correctly, the implementation requires that the *n_buffers* input + * argument is never greater than the buffer pool's *n_buffers_per_slab*. This + * is typically the case, with one exception taking place when a large number of + * buffers are allocated at init time (e.g. for the UMEM fill queue setup). + */ +static inline u32 +bcache_cons_check(struct bcache *bc, u32 n_buffers) +{ + struct bpool *bp = bc->bp; + u64 n_buffers_per_slab = bp->params.n_buffers_per_slab; + u64 n_buffers_cons = bc->n_buffers_cons; + u64 n_slabs_available; + u64 *slab_full; + + /* + * Consumer slab is not empty: Use what's available locally. Do not + * look for more buffers from the pool when the ask can only be + * partially satisfied. + */ + if (n_buffers_cons) + return (n_buffers_cons < n_buffers) ? + n_buffers_cons : + n_buffers; + + /* + * Consumer slab is empty: look to trade the current consumer slab + * (empty) for a full slab from the pool, if any is available.
+ */ + pthread_mutex_lock(&bp->lock); + n_slabs_available = bp->n_slabs_available; + if (!n_slabs_available) { + pthread_mutex_unlock(&bp->lock); + return 0; + } + + n_slabs_available--; + slab_full = bp->slabs[n_slabs_available]; + bp->slabs[n_slabs_available] = bc->slab_cons; + bp->n_slabs_available = n_slabs_available; + pthread_mutex_unlock(&bp->lock); + + bc->slab_cons = slab_full; + bc->n_buffers_cons = n_buffers_per_slab; + return n_buffers; +} + +static inline u64 +bcache_cons(struct bcache *bc) +{ + u64 n_buffers_cons = bc->n_buffers_cons - 1; + u64 buffer; + + buffer = bc->slab_cons[n_buffers_cons]; + bc->n_buffers_cons = n_buffers_cons; + return buffer; +} + +static inline void +bcache_prod(struct bcache *bc, u64 buffer) +{ + struct bpool *bp = bc->bp; + u64 n_buffers_per_slab = bp->params.n_buffers_per_slab; + u64 n_buffers_prod = bc->n_buffers_prod; + u64 n_slabs_available; + u64 *slab_empty; + + /* + * Producer slab is not yet full: store the current buffer to it. + */ + if (n_buffers_prod < n_buffers_per_slab) { + bc->slab_prod[n_buffers_prod] = buffer; + bc->n_buffers_prod = n_buffers_prod + 1; + return; + } + + /* + * Producer slab is full: trade the cache's current producer slab + * (full) for an empty slab from the pool, then store the current + * buffer to the new producer slab. As one full slab exists in the + * cache, it is guaranteed that there is at least one empty slab + * available in the pool. + */ + pthread_mutex_lock(&bp->lock); + n_slabs_available = bp->n_slabs_available; + slab_empty = bp->slabs[n_slabs_available]; + bp->slabs[n_slabs_available] = bc->slab_prod; + bp->n_slabs_available = n_slabs_available + 1; + pthread_mutex_unlock(&bp->lock); + + slab_empty[0] = buffer; + bc->slab_prod = slab_empty; + bc->n_buffers_prod = 1; +} + +/* + * Port + * + * Each of the forwarding ports sits on top of an AF_XDP socket. In order for + * packet forwarding to happen with no packet buffer copy, all the sockets need + * to share the same UMEM area, which is used as the buffer pool memory. + */ +#ifndef MAX_BURST_RX +#define MAX_BURST_RX 64 +#endif + +#ifndef MAX_BURST_TX +#define MAX_BURST_TX 64 +#endif + +struct burst_rx { + u64 addr[MAX_BURST_RX]; + u32 len[MAX_BURST_RX]; +}; + +struct burst_tx { + u64 addr[MAX_BURST_TX]; + u32 len[MAX_BURST_TX]; + u32 n_pkts; +}; + +struct port_params { + struct xsk_socket_config xsk_cfg; + struct bpool *bp; + const char *iface; + u32 iface_queue; +}; + +struct port { + struct port_params params; + + struct bcache *bc; + + struct xsk_ring_cons rxq; + struct xsk_ring_prod txq; + struct xsk_ring_prod umem_fq; + struct xsk_ring_cons umem_cq; + struct xsk_socket *xsk; + int umem_fq_initialized; + + u64 n_pkts_rx; + u64 n_pkts_tx; +}; + +static void +port_free(struct port *p) +{ + if (!p) + return; + + /* To keep this example simple, the code to free the buffers from the + * socket's receive and transmit queues, as well as from the UMEM fill + * and completion queues, is not included. + */ + + if (p->xsk) + xsk_socket__delete(p->xsk); + + bcache_free(p->bc); + + free(p); +} + +static struct port * +port_init(struct port_params *params) +{ + struct port *p; + u32 umem_fq_size, pos = 0; + int status, i; + + /* Memory allocation and initialization. */ + p = calloc(sizeof(struct port), 1); + if (!p) + return NULL; + + memcpy(&p->params, params, sizeof(p->params)); + umem_fq_size = params->bp->umem_cfg.fill_size; + + /* bcache. 
*/ + p->bc = bcache_init(params->bp); + if (!p->bc || + (bcache_slab_size(p->bc) < umem_fq_size) || + (bcache_cons_check(p->bc, umem_fq_size) < umem_fq_size)) { + port_free(p); + return NULL; + } + + /* xsk socket. */ + status = xsk_socket__create_shared(&p->xsk, + params->iface, + params->iface_queue, + params->bp->umem, + &p->rxq, + &p->txq, + &p->umem_fq, + &p->umem_cq, + ¶ms->xsk_cfg); + if (status) { + port_free(p); + return NULL; + } + + /* umem fq. */ + xsk_ring_prod__reserve(&p->umem_fq, umem_fq_size, &pos); + + for (i = 0; i < umem_fq_size; i++) + *xsk_ring_prod__fill_addr(&p->umem_fq, pos + i) = + bcache_cons(p->bc); + + xsk_ring_prod__submit(&p->umem_fq, umem_fq_size); + p->umem_fq_initialized = 1; + + return p; +} + +static inline u32 +port_rx_burst(struct port *p, struct burst_rx *b) +{ + u32 n_pkts, pos, i; + + /* Free buffers for FQ replenish. */ + n_pkts = ARRAY_SIZE(b->addr); + + n_pkts = bcache_cons_check(p->bc, n_pkts); + if (!n_pkts) + return 0; + + /* RXQ. */ + n_pkts = xsk_ring_cons__peek(&p->rxq, n_pkts, &pos); + if (!n_pkts) { + if (xsk_ring_prod__needs_wakeup(&p->umem_fq)) { + struct pollfd pollfd = { + .fd = xsk_socket__fd(p->xsk), + .events = POLLIN, + }; + + poll(&pollfd, 1, 0); + } + return 0; + } + + for (i = 0; i < n_pkts; i++) { + b->addr[i] = xsk_ring_cons__rx_desc(&p->rxq, pos + i)->addr; + b->len[i] = xsk_ring_cons__rx_desc(&p->rxq, pos + i)->len; + } + + xsk_ring_cons__release(&p->rxq, n_pkts); + p->n_pkts_rx += n_pkts; + + /* UMEM FQ. */ + for ( ; ; ) { + int status; + + status = xsk_ring_prod__reserve(&p->umem_fq, n_pkts, &pos); + if (status == n_pkts) + break; + + if (xsk_ring_prod__needs_wakeup(&p->umem_fq)) { + struct pollfd pollfd = { + .fd = xsk_socket__fd(p->xsk), + .events = POLLIN, + }; + + poll(&pollfd, 1, 0); + } + } + + for (i = 0; i < n_pkts; i++) + *xsk_ring_prod__fill_addr(&p->umem_fq, pos + i) = + bcache_cons(p->bc); + + xsk_ring_prod__submit(&p->umem_fq, n_pkts); + + return n_pkts; +} + +static inline void +port_tx_burst(struct port *p, struct burst_tx *b) +{ + u32 n_pkts, pos, i; + int status; + + /* UMEM CQ. */ + n_pkts = p->params.bp->umem_cfg.comp_size; + + n_pkts = xsk_ring_cons__peek(&p->umem_cq, n_pkts, &pos); + + for (i = 0; i < n_pkts; i++) { + u64 addr = *xsk_ring_cons__comp_addr(&p->umem_cq, pos + i); + + bcache_prod(p->bc, addr); + } + + xsk_ring_cons__release(&p->umem_cq, n_pkts); + + /* TXQ. */ + n_pkts = b->n_pkts; + + for ( ; ; ) { + status = xsk_ring_prod__reserve(&p->txq, n_pkts, &pos); + if (status == n_pkts) + break; + + if (xsk_ring_prod__needs_wakeup(&p->txq)) + sendto(xsk_socket__fd(p->xsk), NULL, 0, MSG_DONTWAIT, + NULL, 0); + } + + for (i = 0; i < n_pkts; i++) { + xsk_ring_prod__tx_desc(&p->txq, pos + i)->addr = b->addr[i]; + xsk_ring_prod__tx_desc(&p->txq, pos + i)->len = b->len[i]; + } + + xsk_ring_prod__submit(&p->txq, n_pkts); + if (xsk_ring_prod__needs_wakeup(&p->txq)) + sendto(xsk_socket__fd(p->xsk), NULL, 0, MSG_DONTWAIT, NULL, 0); + p->n_pkts_tx += n_pkts; +} + +/* + * Thread + * + * Packet forwarding threads. 
+ */ +#ifndef MAX_PORTS_PER_THREAD +#define MAX_PORTS_PER_THREAD 16 +#endif + +struct thread_data { + struct port *ports_rx[MAX_PORTS_PER_THREAD]; + struct port *ports_tx[MAX_PORTS_PER_THREAD]; + u32 n_ports_rx; + struct burst_rx burst_rx; + struct burst_tx burst_tx[MAX_PORTS_PER_THREAD]; + u32 cpu_core_id; + int quit; +}; + +static void swap_mac_addresses(void *data) +{ + struct ether_header *eth = (struct ether_header *)data; + struct ether_addr *src_addr = (struct ether_addr *)ð->ether_shost; + struct ether_addr *dst_addr = (struct ether_addr *)ð->ether_dhost; + struct ether_addr tmp; + + tmp = *src_addr; + *src_addr = *dst_addr; + *dst_addr = tmp; +} + +static void * +thread_func(void *arg) +{ + struct thread_data *t = arg; + cpu_set_t cpu_cores; + u32 i; + + CPU_ZERO(&cpu_cores); + CPU_SET(t->cpu_core_id, &cpu_cores); + pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpu_cores); + + for (i = 0; !t->quit; i = (i + 1) & (t->n_ports_rx - 1)) { + struct port *port_rx = t->ports_rx[i]; + struct port *port_tx = t->ports_tx[i]; + struct burst_rx *brx = &t->burst_rx; + struct burst_tx *btx = &t->burst_tx[i]; + u32 n_pkts, j; + + /* RX. */ + n_pkts = port_rx_burst(port_rx, brx); + if (!n_pkts) + continue; + + /* Process & TX. */ + for (j = 0; j < n_pkts; j++) { + u64 addr = xsk_umem__add_offset_to_addr(brx->addr[j]); + u8 *pkt = xsk_umem__get_data(port_rx->params.bp->addr, + addr); + + swap_mac_addresses(pkt); + + btx->addr[btx->n_pkts] = brx->addr[j]; + btx->len[btx->n_pkts] = brx->len[j]; + btx->n_pkts++; + + if (btx->n_pkts == MAX_BURST_TX) { + port_tx_burst(port_tx, btx); + btx->n_pkts = 0; + } + } + } + + return NULL; +} + +/* + * Process + */ +static const struct bpool_params bpool_params_default = { + .n_buffers = 64 * 1024, + .buffer_size = XSK_UMEM__DEFAULT_FRAME_SIZE, + .mmap_flags = 0, + + .n_users_max = 16, + .n_buffers_per_slab = XSK_RING_PROD__DEFAULT_NUM_DESCS * 2, +}; + +static const struct xsk_umem_config umem_cfg_default = { + .fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS * 2, + .comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, + .frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE, + .frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM, + .flags = 0, +}; + +static const struct port_params port_params_default = { + .xsk_cfg = { + .rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS, + .tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS, + .libbpf_flags = 0, + .xdp_flags = XDP_FLAGS_DRV_MODE, + .bind_flags = XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY, + }, + + .bp = NULL, + .iface = NULL, + .iface_queue = 0, +}; + +#ifndef MAX_PORTS +#define MAX_PORTS 64 +#endif + +#ifndef MAX_THREADS +#define MAX_THREADS 64 +#endif + +static struct bpool_params bpool_params; +static struct xsk_umem_config umem_cfg; +static struct bpool *bp; + +static struct port_params port_params[MAX_PORTS]; +static struct port *ports[MAX_PORTS]; +static u64 n_pkts_rx[MAX_PORTS]; +static u64 n_pkts_tx[MAX_PORTS]; +static int n_ports; + +static pthread_t threads[MAX_THREADS]; +static struct thread_data thread_data[MAX_THREADS]; +static int n_threads; + +static void +print_usage(char *prog_name) +{ + const char *usage = + "Usage:\n" + "\t%s [ -b SIZE ] -c CORE -i INTERFACE [ -q QUEUE ]\n" + "\n" + "-c CORE CPU core to run a packet forwarding thread\n" + " on. May be invoked multiple times.\n" + "\n" + "-b SIZE Number of buffers in the buffer pool shared\n" + " by all the forwarding threads. Default: %u.\n" + "\n" + "-i INTERFACE Network interface. Each (INTERFACE, QUEUE)\n" + " pair specifies one forwarding port. 
May be\n" + " invoked multiple times.\n" + "\n" + "-q QUEUE Network interface queue for RX and TX. Each\n" + " (INTERFACE, QUEUE) pair specified one\n" + " forwarding port. Default: %u. May be invoked\n" + " multiple times.\n" + "\n"; + printf(usage, + prog_name, + bpool_params_default.n_buffers, + port_params_default.iface_queue); +} + +static int +parse_args(int argc, char **argv) +{ + struct option lgopts[] = { + { NULL, 0, 0, 0 } + }; + int opt, option_index; + + /* Parse the input arguments. */ + for ( ; ;) { + opt = getopt_long(argc, argv, "c:i:q:", lgopts, &option_index); + if (opt == EOF) + break; + + switch (opt) { + case 'b': + bpool_params.n_buffers = atoi(optarg); + break; + + case 'c': + if (n_threads == MAX_THREADS) { + printf("Max number of threads (%d) reached.\n", + MAX_THREADS); + return -1; + } + + thread_data[n_threads].cpu_core_id = atoi(optarg); + n_threads++; + break; + + case 'i': + if (n_ports == MAX_PORTS) { + printf("Max number of ports (%d) reached.\n", + MAX_PORTS); + return -1; + } + + port_params[n_ports].iface = optarg; + port_params[n_ports].iface_queue = 0; + n_ports++; + break; + + case 'q': + if (n_ports == 0) { + printf("No port specified for queue.\n"); + return -1; + } + port_params[n_ports - 1].iface_queue = atoi(optarg); + break; + + default: + printf("Illegal argument.\n"); + return -1; + } + } + + optind = 1; /* reset getopt lib */ + + /* Check the input arguments. */ + if (!n_ports) { + printf("No ports specified.\n"); + return -1; + } + + if (!n_threads) { + printf("No threads specified.\n"); + return -1; + } + + if (n_ports % n_threads) { + printf("Ports cannot be evenly distributed to threads.\n"); + return -1; + } + + return 0; +} + +static void +print_port(u32 port_id) +{ + struct port *port = ports[port_id]; + + printf("Port %u: interface = %s, queue = %u\n", + port_id, port->params.iface, port->params.iface_queue); +} + +static void +print_thread(u32 thread_id) +{ + struct thread_data *t = &thread_data[thread_id]; + u32 i; + + printf("Thread %u (CPU core %u): ", + thread_id, t->cpu_core_id); + + for (i = 0; i < t->n_ports_rx; i++) { + struct port *port_rx = t->ports_rx[i]; + struct port *port_tx = t->ports_tx[i]; + + printf("(%s, %u) -> (%s, %u), ", + port_rx->params.iface, + port_rx->params.iface_queue, + port_tx->params.iface, + port_tx->params.iface_queue); + } + + printf("\n"); +} + +static void +print_port_stats_separator(void) +{ + printf("+-%4s-+-%12s-+-%13s-+-%12s-+-%13s-+\n", + "----", + "------------", + "-------------", + "------------", + "-------------"); +} + +static void +print_port_stats_header(void) +{ + print_port_stats_separator(); + printf("| %4s | %12s | %13s | %12s | %13s |\n", + "Port", + "RX packets", + "RX rate (pps)", + "TX packets", + "TX_rate (pps)"); + print_port_stats_separator(); +} + +static void +print_port_stats_trailer(void) +{ + print_port_stats_separator(); + printf("\n"); +} + +static void +print_port_stats(int port_id, u64 ns_diff) +{ + struct port *p = ports[port_id]; + double rx_pps, tx_pps; + + rx_pps = (p->n_pkts_rx - n_pkts_rx[port_id]) * 1000000000. / ns_diff; + tx_pps = (p->n_pkts_tx - n_pkts_tx[port_id]) * 1000000000. 
/ ns_diff; + + printf("| %4d | %12llu | %13.0f | %12llu | %13.0f |\n", + port_id, + p->n_pkts_rx, + rx_pps, + p->n_pkts_tx, + tx_pps); + + n_pkts_rx[port_id] = p->n_pkts_rx; + n_pkts_tx[port_id] = p->n_pkts_tx; +} + +static void +print_port_stats_all(u64 ns_diff) +{ + int i; + + print_port_stats_header(); + for (i = 0; i < n_ports; i++) + print_port_stats(i, ns_diff); + print_port_stats_trailer(); +} + +static int quit; + +static void +signal_handler(int sig) +{ + quit = 1; +} + +static void remove_xdp_program(void) +{ + int i; + + for (i = 0 ; i < n_ports; i++) + bpf_set_link_xdp_fd(if_nametoindex(port_params[i].iface), -1, + port_params[i].xsk_cfg.xdp_flags); +} + +int main(int argc, char **argv) +{ + struct timespec time; + u64 ns0; + int i; + + /* Parse args. */ + memcpy(&bpool_params, &bpool_params_default, + sizeof(struct bpool_params)); + memcpy(&umem_cfg, &umem_cfg_default, + sizeof(struct xsk_umem_config)); + for (i = 0; i < MAX_PORTS; i++) + memcpy(&port_params[i], &port_params_default, + sizeof(struct port_params)); + + if (parse_args(argc, argv)) { + print_usage(argv[0]); + return -1; + } + + /* Buffer pool initialization. */ + bp = bpool_init(&bpool_params, &umem_cfg); + if (!bp) { + printf("Buffer pool initialization failed.\n"); + return -1; + } + printf("Buffer pool created successfully.\n"); + + /* Ports initialization. */ + for (i = 0; i < MAX_PORTS; i++) + port_params[i].bp = bp; + + for (i = 0; i < n_ports; i++) { + ports[i] = port_init(&port_params[i]); + if (!ports[i]) { + printf("Port %d initialization failed.\n", i); + return -1; + } + print_port(i); + } + printf("All ports created successfully.\n"); + + /* Threads. */ + for (i = 0; i < n_threads; i++) { + struct thread_data *t = &thread_data[i]; + u32 n_ports_per_thread = n_ports / n_threads, j; + + for (j = 0; j < n_ports_per_thread; j++) { + t->ports_rx[j] = ports[i * n_ports_per_thread + j]; + t->ports_tx[j] = ports[i * n_ports_per_thread + + (j + 1) % n_ports_per_thread]; + } + + t->n_ports_rx = n_ports_per_thread; + + print_thread(i); + } + + for (i = 0; i < n_threads; i++) { + int status; + + status = pthread_create(&threads[i], + NULL, + thread_func, + &thread_data[i]); + if (status) { + printf("Thread %d creation failed.\n", i); + return -1; + } + } + printf("All threads created successfully.\n"); + + /* Print statistics. */ + signal(SIGINT, signal_handler); + signal(SIGTERM, signal_handler); + signal(SIGABRT, signal_handler); + + clock_gettime(CLOCK_MONOTONIC, &time); + ns0 = time.tv_sec * 1000000000UL + time.tv_nsec; + for ( ; !quit; ) { + u64 ns1, ns_diff; + + sleep(1); + clock_gettime(CLOCK_MONOTONIC, &time); + ns1 = time.tv_sec * 1000000000UL + time.tv_nsec; + ns_diff = ns1 - ns0; + ns0 = ns1; + + print_port_stats_all(ns_diff); + } + + /* Threads completion. 
*/ + printf("Quit.\n"); + for (i = 0; i < n_threads; i++) + thread_data[i].quit = 1; + + for (i = 0; i < n_threads; i++) + pthread_join(threads[i], NULL); + + for (i = 0; i < n_ports; i++) + port_free(ports[i]); + + bpool_free(bp); + + remove_xdp_program(); + + return 0; +} diff --git a/scripts/bpf_helpers_doc.py b/scripts/bpf_helpers_doc.py index 5bfa448b4704..08388173973f 100755 --- a/scripts/bpf_helpers_doc.py +++ b/scripts/bpf_helpers_doc.py @@ -432,6 +432,7 @@ class PrinterHelpers(Printer): 'struct __sk_buff', 'struct sk_msg_md', 'struct xdp_md', + 'struct path', ] known_types = { '...', @@ -472,6 +473,7 @@ class PrinterHelpers(Printer): 'struct tcp_request_sock', 'struct udp6_sock', 'struct task_struct', + 'struct path', } mapped_types = { 'u8': '__u8', diff --git a/security/bpf/hooks.c b/security/bpf/hooks.c index 32d32d485451..788667d582ae 100644 --- a/security/bpf/hooks.c +++ b/security/bpf/hooks.c @@ -11,6 +11,7 @@ static struct security_hook_list bpf_lsm_hooks[] __lsm_ro_after_init = { LSM_HOOK_INIT(NAME, bpf_lsm_##NAME), #include <linux/lsm_hook_defs.h> #undef LSM_HOOK + LSM_HOOK_INIT(inode_free_security, bpf_inode_storage_free), }; static int __init bpf_lsm_init(void) @@ -20,7 +21,12 @@ static int __init bpf_lsm_init(void) return 0; } +struct lsm_blob_sizes bpf_lsm_blob_sizes __lsm_ro_after_init = { + .lbs_inode = sizeof(struct bpf_storage_blob), +}; + DEFINE_LSM(bpf) = { .name = "bpf", .init = bpf_lsm_init, + .blobs = &bpf_lsm_blob_sizes }; diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst index 41e2a74252d0..083db6c2fc67 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst @@ -49,7 +49,7 @@ MAP COMMANDS | | **lru_percpu_hash** | **lpm_trie** | **array_of_maps** | **hash_of_maps** | | **devmap** | **devmap_hash** | **sockmap** | **cpumap** | **xskmap** | **sockhash** | | **cgroup_storage** | **reuseport_sockarray** | **percpu_cgroup_storage** -| | **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** } +| | **queue** | **stack** | **sk_storage** | **struct_ops** | **ringbuf** | **inode_storage** } DESCRIPTION =========== diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile index 8462690a039b..02c99bc95c69 100644 --- a/tools/bpf/bpftool/Makefile +++ b/tools/bpf/bpftool/Makefile @@ -176,7 +176,11 @@ $(OUTPUT)bpftool: $(OBJS) $(LIBBPF) $(OUTPUT)%.o: %.c $(QUIET_CC)$(CC) $(CFLAGS) -c -MMD -o $@ $< -clean: $(LIBBPF)-clean +feature-detect-clean: + $(call QUIET_CLEAN, feature-detect) + $(Q)$(MAKE) -C $(srctree)/tools/build/feature/ clean >/dev/null + +clean: $(LIBBPF)-clean feature-detect-clean $(call QUIET_CLEAN, bpftool) $(Q)$(RM) -- $(OUTPUT)bpftool $(OUTPUT)*.o $(OUTPUT)*.d $(Q)$(RM) -- $(BPFTOOL_BOOTSTRAP) $(OUTPUT)*.skel.h $(OUTPUT)vmlinux.h diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index f53ed2f1a4aa..7b68e3c0a5fb 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -704,7 +704,8 @@ _bpftool() lru_percpu_hash lpm_trie array_of_maps \ hash_of_maps devmap devmap_hash sockmap cpumap \ xskmap sockhash cgroup_storage reuseport_sockarray \ - percpu_cgroup_storage queue stack' -- \ + percpu_cgroup_storage queue stack sk_storage \ + struct_ops inode_storage' -- \ "$cur" ) ) return 0 ;; diff --git a/tools/bpf/bpftool/gen.c b/tools/bpf/bpftool/gen.c index f61184653633..4033c46d83e7 100644 --- a/tools/bpf/bpftool/gen.c +++ 
b/tools/bpf/bpftool/gen.c @@ -19,11 +19,9 @@ #include <sys/mman.h> #include <bpf/btf.h> -#include "bpf/libbpf_internal.h" #include "json_writer.h" #include "main.h" - #define MAX_OBJ_NAME_LEN 64 static void sanitize_identifier(char *name) diff --git a/tools/bpf/bpftool/link.c b/tools/bpf/bpftool/link.c index a89f09e3c848..e77e1525d20a 100644 --- a/tools/bpf/bpftool/link.c +++ b/tools/bpf/bpftool/link.c @@ -77,6 +77,22 @@ static void show_link_attach_type_json(__u32 attach_type, json_writer_t *wtr) jsonw_uint_field(wtr, "attach_type", attach_type); } +static bool is_iter_map_target(const char *target_name) +{ + return strcmp(target_name, "bpf_map_elem") == 0 || + strcmp(target_name, "bpf_sk_storage_map") == 0; +} + +static void show_iter_json(struct bpf_link_info *info, json_writer_t *wtr) +{ + const char *target_name = u64_to_ptr(info->iter.target_name); + + jsonw_string_field(wtr, "target_name", target_name); + + if (is_iter_map_target(target_name)) + jsonw_uint_field(wtr, "map_id", info->iter.map.map_id); +} + static int get_prog_info(int prog_id, struct bpf_prog_info *info) { __u32 len = sizeof(*info); @@ -128,6 +144,9 @@ static int show_link_close_json(int fd, struct bpf_link_info *info) info->cgroup.cgroup_id); show_link_attach_type_json(info->cgroup.attach_type, json_wtr); break; + case BPF_LINK_TYPE_ITER: + show_iter_json(info, json_wtr); + break; case BPF_LINK_TYPE_NETNS: jsonw_uint_field(json_wtr, "netns_ino", info->netns.netns_ino); @@ -175,6 +194,16 @@ static void show_link_attach_type_plain(__u32 attach_type) printf("attach_type %u ", attach_type); } +static void show_iter_plain(struct bpf_link_info *info) +{ + const char *target_name = u64_to_ptr(info->iter.target_name); + + printf("target_name %s ", target_name); + + if (is_iter_map_target(target_name)) + printf("map_id %u ", info->iter.map.map_id); +} + static int show_link_close_plain(int fd, struct bpf_link_info *info) { struct bpf_prog_info prog_info; @@ -204,6 +233,9 @@ static int show_link_close_plain(int fd, struct bpf_link_info *info) printf("\n\tcgroup_id %zu ", (size_t)info->cgroup.cgroup_id); show_link_attach_type_plain(info->cgroup.attach_type); break; + case BPF_LINK_TYPE_ITER: + show_iter_plain(info); + break; case BPF_LINK_TYPE_NETNS: printf("\n\tnetns_ino %u ", info->netns.netns_ino); show_link_attach_type_plain(info->netns.attach_type); @@ -231,7 +263,7 @@ static int do_show_link(int fd) { struct bpf_link_info info; __u32 len = sizeof(info); - char raw_tp_name[256]; + char buf[256]; int err; memset(&info, 0, sizeof(info)); @@ -245,8 +277,14 @@ again: } if (info.type == BPF_LINK_TYPE_RAW_TRACEPOINT && !info.raw_tracepoint.tp_name) { - info.raw_tracepoint.tp_name = (unsigned long)&raw_tp_name; - info.raw_tracepoint.tp_name_len = sizeof(raw_tp_name); + info.raw_tracepoint.tp_name = (unsigned long)&buf; + info.raw_tracepoint.tp_name_len = sizeof(buf); + goto again; + } + if (info.type == BPF_LINK_TYPE_ITER && + !info.iter.target_name) { + info.iter.target_name = (unsigned long)&buf; + info.iter.target_name_len = sizeof(buf); goto again; } diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c index 3a27d31a1856..bc0071228f88 100644 --- a/tools/bpf/bpftool/map.c +++ b/tools/bpf/bpftool/map.c @@ -50,6 +50,7 @@ const char * const map_type_name[] = { [BPF_MAP_TYPE_SK_STORAGE] = "sk_storage", [BPF_MAP_TYPE_STRUCT_OPS] = "struct_ops", [BPF_MAP_TYPE_RINGBUF] = "ringbuf", + [BPF_MAP_TYPE_INODE_STORAGE] = "inode_storage", }; const size_t map_type_name_size = ARRAY_SIZE(map_type_name); @@ -1442,7 +1443,7 @@ static 
int do_help(int argc, char **argv) " lru_percpu_hash | lpm_trie | array_of_maps | hash_of_maps |\n" " devmap | devmap_hash | sockmap | cpumap | xskmap | sockhash |\n" " cgroup_storage | reuseport_sockarray | percpu_cgroup_storage |\n" - " queue | stack | sk_storage | struct_ops | ringbuf }\n" + " queue | stack | sk_storage | struct_ops | ringbuf | inode_storage }\n" " " HELP_SPEC_OPTIONS "\n" "", bin_name, argv[-2]); diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c index 56c3a2bae3ef..910e7bac6e9e 100644 --- a/tools/bpf/bpftool/net.c +++ b/tools/bpf/bpftool/net.c @@ -6,22 +6,27 @@ #include <fcntl.h> #include <stdlib.h> #include <string.h> +#include <time.h> #include <unistd.h> #include <bpf/bpf.h> #include <bpf/libbpf.h> #include <net/if.h> #include <linux/if.h> #include <linux/rtnetlink.h> +#include <linux/socket.h> #include <linux/tc_act/tc_bpf.h> #include <sys/socket.h> #include <sys/stat.h> #include <sys/types.h> #include "bpf/nlattr.h" -#include "bpf/libbpf_internal.h" #include "main.h" #include "netlink_dumper.h" +#ifndef SOL_NETLINK +#define SOL_NETLINK 270 +#endif + struct ip_devname_ifindex { char devname[64]; int ifindex; @@ -85,6 +90,266 @@ static enum net_attach_type parse_attach_type(const char *str) return net_attach_type_size; } +typedef int (*dump_nlmsg_t)(void *cookie, void *msg, struct nlattr **tb); + +typedef int (*__dump_nlmsg_t)(struct nlmsghdr *nlmsg, dump_nlmsg_t, void *cookie); + +static int netlink_open(__u32 *nl_pid) +{ + struct sockaddr_nl sa; + socklen_t addrlen; + int one = 1, ret; + int sock; + + memset(&sa, 0, sizeof(sa)); + sa.nl_family = AF_NETLINK; + + sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE); + if (sock < 0) + return -errno; + + if (setsockopt(sock, SOL_NETLINK, NETLINK_EXT_ACK, + &one, sizeof(one)) < 0) { + p_err("Netlink error reporting not supported"); + } + + if (bind(sock, (struct sockaddr *)&sa, sizeof(sa)) < 0) { + ret = -errno; + goto cleanup; + } + + addrlen = sizeof(sa); + if (getsockname(sock, (struct sockaddr *)&sa, &addrlen) < 0) { + ret = -errno; + goto cleanup; + } + + if (addrlen != sizeof(sa)) { + ret = -LIBBPF_ERRNO__INTERNAL; + goto cleanup; + } + + *nl_pid = sa.nl_pid; + return sock; + +cleanup: + close(sock); + return ret; +} + +static int netlink_recv(int sock, __u32 nl_pid, __u32 seq, + __dump_nlmsg_t _fn, dump_nlmsg_t fn, + void *cookie) +{ + bool multipart = true; + struct nlmsgerr *err; + struct nlmsghdr *nh; + char buf[4096]; + int len, ret; + + while (multipart) { + multipart = false; + len = recv(sock, buf, sizeof(buf), 0); + if (len < 0) { + ret = -errno; + goto done; + } + + if (len == 0) + break; + + for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len); + nh = NLMSG_NEXT(nh, len)) { + if (nh->nlmsg_pid != nl_pid) { + ret = -LIBBPF_ERRNO__WRNGPID; + goto done; + } + if (nh->nlmsg_seq != seq) { + ret = -LIBBPF_ERRNO__INVSEQ; + goto done; + } + if (nh->nlmsg_flags & NLM_F_MULTI) + multipart = true; + switch (nh->nlmsg_type) { + case NLMSG_ERROR: + err = (struct nlmsgerr *)NLMSG_DATA(nh); + if (!err->error) + continue; + ret = err->error; + libbpf_nla_dump_errormsg(nh); + goto done; + case NLMSG_DONE: + return 0; + default: + break; + } + if (_fn) { + ret = _fn(nh, fn, cookie); + if (ret) + return ret; + } + } + } + ret = 0; +done: + return ret; +} + +static int __dump_class_nlmsg(struct nlmsghdr *nlh, + dump_nlmsg_t dump_class_nlmsg, + void *cookie) +{ + struct nlattr *tb[TCA_MAX + 1], *attr; + struct tcmsg *t = NLMSG_DATA(nlh); + int len; + + len = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*t)); + attr = 
(struct nlattr *) ((void *) t + NLMSG_ALIGN(sizeof(*t))); + if (libbpf_nla_parse(tb, TCA_MAX, attr, len, NULL) != 0) + return -LIBBPF_ERRNO__NLPARSE; + + return dump_class_nlmsg(cookie, t, tb); +} + +static int netlink_get_class(int sock, unsigned int nl_pid, int ifindex, + dump_nlmsg_t dump_class_nlmsg, void *cookie) +{ + struct { + struct nlmsghdr nlh; + struct tcmsg t; + } req = { + .nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), + .nlh.nlmsg_type = RTM_GETTCLASS, + .nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, + .t.tcm_family = AF_UNSPEC, + .t.tcm_ifindex = ifindex, + }; + int seq = time(NULL); + + req.nlh.nlmsg_seq = seq; + if (send(sock, &req, req.nlh.nlmsg_len, 0) < 0) + return -errno; + + return netlink_recv(sock, nl_pid, seq, __dump_class_nlmsg, + dump_class_nlmsg, cookie); +} + +static int __dump_qdisc_nlmsg(struct nlmsghdr *nlh, + dump_nlmsg_t dump_qdisc_nlmsg, + void *cookie) +{ + struct nlattr *tb[TCA_MAX + 1], *attr; + struct tcmsg *t = NLMSG_DATA(nlh); + int len; + + len = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*t)); + attr = (struct nlattr *) ((void *) t + NLMSG_ALIGN(sizeof(*t))); + if (libbpf_nla_parse(tb, TCA_MAX, attr, len, NULL) != 0) + return -LIBBPF_ERRNO__NLPARSE; + + return dump_qdisc_nlmsg(cookie, t, tb); +} + +static int netlink_get_qdisc(int sock, unsigned int nl_pid, int ifindex, + dump_nlmsg_t dump_qdisc_nlmsg, void *cookie) +{ + struct { + struct nlmsghdr nlh; + struct tcmsg t; + } req = { + .nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), + .nlh.nlmsg_type = RTM_GETQDISC, + .nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, + .t.tcm_family = AF_UNSPEC, + .t.tcm_ifindex = ifindex, + }; + int seq = time(NULL); + + req.nlh.nlmsg_seq = seq; + if (send(sock, &req, req.nlh.nlmsg_len, 0) < 0) + return -errno; + + return netlink_recv(sock, nl_pid, seq, __dump_qdisc_nlmsg, + dump_qdisc_nlmsg, cookie); +} + +static int __dump_filter_nlmsg(struct nlmsghdr *nlh, + dump_nlmsg_t dump_filter_nlmsg, + void *cookie) +{ + struct nlattr *tb[TCA_MAX + 1], *attr; + struct tcmsg *t = NLMSG_DATA(nlh); + int len; + + len = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*t)); + attr = (struct nlattr *) ((void *) t + NLMSG_ALIGN(sizeof(*t))); + if (libbpf_nla_parse(tb, TCA_MAX, attr, len, NULL) != 0) + return -LIBBPF_ERRNO__NLPARSE; + + return dump_filter_nlmsg(cookie, t, tb); +} + +static int netlink_get_filter(int sock, unsigned int nl_pid, int ifindex, int handle, + dump_nlmsg_t dump_filter_nlmsg, void *cookie) +{ + struct { + struct nlmsghdr nlh; + struct tcmsg t; + } req = { + .nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), + .nlh.nlmsg_type = RTM_GETTFILTER, + .nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, + .t.tcm_family = AF_UNSPEC, + .t.tcm_ifindex = ifindex, + .t.tcm_parent = handle, + }; + int seq = time(NULL); + + req.nlh.nlmsg_seq = seq; + if (send(sock, &req, req.nlh.nlmsg_len, 0) < 0) + return -errno; + + return netlink_recv(sock, nl_pid, seq, __dump_filter_nlmsg, + dump_filter_nlmsg, cookie); +} + +static int __dump_link_nlmsg(struct nlmsghdr *nlh, + dump_nlmsg_t dump_link_nlmsg, void *cookie) +{ + struct nlattr *tb[IFLA_MAX + 1], *attr; + struct ifinfomsg *ifi = NLMSG_DATA(nlh); + int len; + + len = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + attr = (struct nlattr *) ((void *) ifi + NLMSG_ALIGN(sizeof(*ifi))); + if (libbpf_nla_parse(tb, IFLA_MAX, attr, len, NULL) != 0) + return -LIBBPF_ERRNO__NLPARSE; + + return dump_link_nlmsg(cookie, ifi, tb); +} + +static int netlink_get_link(int sock, unsigned int nl_pid, + dump_nlmsg_t dump_link_nlmsg, void *cookie) +{ + 
struct { + struct nlmsghdr nlh; + struct ifinfomsg ifm; + } req = { + .nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), + .nlh.nlmsg_type = RTM_GETLINK, + .nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, + .ifm.ifi_family = AF_PACKET, + }; + int seq = time(NULL); + + req.nlh.nlmsg_seq = seq; + if (send(sock, &req, req.nlh.nlmsg_len, 0) < 0) + return -errno; + + return netlink_recv(sock, nl_pid, seq, __dump_link_nlmsg, + dump_link_nlmsg, cookie); +} + static int dump_link_nlmsg(void *cookie, void *msg, struct nlattr **tb) { struct bpf_netdev_t *netinfo = cookie; @@ -168,14 +433,14 @@ static int show_dev_tc_bpf(int sock, unsigned int nl_pid, tcinfo.array_len = 0; tcinfo.is_qdisc = false; - ret = libbpf_nl_get_class(sock, nl_pid, dev->ifindex, - dump_class_qdisc_nlmsg, &tcinfo); + ret = netlink_get_class(sock, nl_pid, dev->ifindex, + dump_class_qdisc_nlmsg, &tcinfo); if (ret) goto out; tcinfo.is_qdisc = true; - ret = libbpf_nl_get_qdisc(sock, nl_pid, dev->ifindex, - dump_class_qdisc_nlmsg, &tcinfo); + ret = netlink_get_qdisc(sock, nl_pid, dev->ifindex, + dump_class_qdisc_nlmsg, &tcinfo); if (ret) goto out; @@ -183,9 +448,9 @@ static int show_dev_tc_bpf(int sock, unsigned int nl_pid, filter_info.ifindex = dev->ifindex; for (i = 0; i < tcinfo.used_len; i++) { filter_info.kind = tcinfo.handle_array[i].kind; - ret = libbpf_nl_get_filter(sock, nl_pid, dev->ifindex, - tcinfo.handle_array[i].handle, - dump_filter_nlmsg, &filter_info); + ret = netlink_get_filter(sock, nl_pid, dev->ifindex, + tcinfo.handle_array[i].handle, + dump_filter_nlmsg, &filter_info); if (ret) goto out; } @@ -193,22 +458,22 @@ static int show_dev_tc_bpf(int sock, unsigned int nl_pid, /* root, ingress and egress handle */ handle = TC_H_ROOT; filter_info.kind = "root"; - ret = libbpf_nl_get_filter(sock, nl_pid, dev->ifindex, handle, - dump_filter_nlmsg, &filter_info); + ret = netlink_get_filter(sock, nl_pid, dev->ifindex, handle, + dump_filter_nlmsg, &filter_info); if (ret) goto out; handle = TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_INGRESS); filter_info.kind = "clsact/ingress"; - ret = libbpf_nl_get_filter(sock, nl_pid, dev->ifindex, handle, - dump_filter_nlmsg, &filter_info); + ret = netlink_get_filter(sock, nl_pid, dev->ifindex, handle, + dump_filter_nlmsg, &filter_info); if (ret) goto out; handle = TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_EGRESS); filter_info.kind = "clsact/egress"; - ret = libbpf_nl_get_filter(sock, nl_pid, dev->ifindex, handle, - dump_filter_nlmsg, &filter_info); + ret = netlink_get_filter(sock, nl_pid, dev->ifindex, handle, + dump_filter_nlmsg, &filter_info); if (ret) goto out; @@ -386,7 +651,7 @@ static int do_show(int argc, char **argv) struct bpf_attach_info attach_info = {}; int i, sock, ret, filter_idx = -1; struct bpf_netdev_t dev_array; - unsigned int nl_pid; + unsigned int nl_pid = 0; char err_buf[256]; if (argc == 2) { @@ -401,7 +666,7 @@ static int do_show(int argc, char **argv) if (ret) return -1; - sock = libbpf_netlink_open(&nl_pid); + sock = netlink_open(&nl_pid); if (sock < 0) { fprintf(stderr, "failed to open netlink sock\n"); return -1; @@ -416,7 +681,7 @@ static int do_show(int argc, char **argv) jsonw_start_array(json_wtr); NET_START_OBJECT; NET_START_ARRAY("xdp", "%s:\n"); - ret = libbpf_nl_get_link(sock, nl_pid, dump_link_nlmsg, &dev_array); + ret = netlink_get_link(sock, nl_pid, dump_link_nlmsg, &dev_array); NET_END_ARRAY("\n"); if (!ret) { diff --git a/tools/bpf/resolve_btfids/main.c b/tools/bpf/resolve_btfids/main.c index 0def0bb1f783..dfa540d8a02d 100644 --- a/tools/bpf/resolve_btfids/main.c 
+++ b/tools/bpf/resolve_btfids/main.c @@ -199,9 +199,16 @@ static char *get_id(const char *prefix_end) /* * __BTF_ID__func__vfs_truncate__0 * prefix_end = ^ + * pos = ^ */ - char *p, *id = strdup(prefix_end + sizeof("__") - 1); + int len = strlen(prefix_end); + int pos = sizeof("__") - 1; + char *p, *id; + if (pos >= len) + return NULL; + + id = strdup(prefix_end + pos); if (id) { /* * __BTF_ID__func__vfs_truncate__0 @@ -220,6 +227,24 @@ static char *get_id(const char *prefix_end) return id; } +static struct btf_id *add_set(struct object *obj, char *name) +{ + /* + * __BTF_ID__set__name + * name = ^ + * id = ^ + */ + char *id = name + sizeof(BTF_SET "__") - 1; + int len = strlen(name); + + if (id >= name + len) { + pr_err("FAILED to parse set name: %s\n", name); + return NULL; + } + + return btf_id__add(&obj->sets, id, true); +} + static struct btf_id *add_symbol(struct rb_root *root, char *name, size_t size) { char *id; @@ -412,7 +437,7 @@ static int symbols_collect(struct object *obj) id = add_symbol(&obj->funcs, prefix, sizeof(BTF_FUNC) - 1); /* set */ } else if (!strncmp(prefix, BTF_SET, sizeof(BTF_SET) - 1)) { - id = add_symbol(&obj->sets, prefix, sizeof(BTF_SET) - 1); + id = add_set(obj, prefix); /* * SET objects store list's count, which is encoded * in symbol's size, together with 'cnt' field hence diff --git a/tools/build/Makefile b/tools/build/Makefile index 727050c40f09..722f1700d96a 100644 --- a/tools/build/Makefile +++ b/tools/build/Makefile @@ -38,6 +38,8 @@ clean: $(call QUIET_CLEAN, fixdep) $(Q)find $(if $(OUTPUT),$(OUTPUT),.) -name '*.o' -delete -o -name '\.*.cmd' -delete -o -name '\.*.d' -delete $(Q)rm -f $(OUTPUT)fixdep + $(call QUIET_CLEAN, feature-detect) + $(Q)$(MAKE) -C feature/ clean >/dev/null $(OUTPUT)fixdep-in.o: FORCE $(Q)$(MAKE) $(build)=fixdep diff --git a/tools/build/Makefile.feature b/tools/build/Makefile.feature index c1daf4d57518..38415d251075 100644 --- a/tools/build/Makefile.feature +++ b/tools/build/Makefile.feature @@ -46,7 +46,6 @@ FEATURE_TESTS_BASIC := \ libelf-getphdrnum \ libelf-gelf_getnote \ libelf-getshdrstrndx \ - libelf-mmap \ libnuma \ numa_num_possible_cpus \ libperl \ diff --git a/tools/build/feature/Makefile b/tools/build/feature/Makefile index d220fe952747..b2a2347c67ed 100644 --- a/tools/build/feature/Makefile +++ b/tools/build/feature/Makefile @@ -25,7 +25,6 @@ FILES= \ test-libelf-getphdrnum.bin \ test-libelf-gelf_getnote.bin \ test-libelf-getshdrstrndx.bin \ - test-libelf-mmap.bin \ test-libdebuginfod.bin \ test-libnuma.bin \ test-numa_num_possible_cpus.bin \ @@ -146,9 +145,6 @@ $(OUTPUT)test-dwarf.bin: $(OUTPUT)test-dwarf_getlocations.bin: $(BUILD) $(DWARFLIBS) -$(OUTPUT)test-libelf-mmap.bin: - $(BUILD) -lelf - $(OUTPUT)test-libelf-getphdrnum.bin: $(BUILD) -lelf diff --git a/tools/build/feature/test-all.c b/tools/build/feature/test-all.c index 5479e543b194..5284e6e9c756 100644 --- a/tools/build/feature/test-all.c +++ b/tools/build/feature/test-all.c @@ -30,10 +30,6 @@ # include "test-libelf.c" #undef main -#define main main_test_libelf_mmap -# include "test-libelf-mmap.c" -#undef main - #define main main_test_get_current_dir_name # include "test-get_current_dir_name.c" #undef main diff --git a/tools/build/feature/test-libelf-mmap.c b/tools/build/feature/test-libelf-mmap.c deleted file mode 100644 index 2c3ef81affe2..000000000000 --- a/tools/build/feature/test-libelf-mmap.c +++ /dev/null @@ -1,9 +0,0 @@ -// SPDX-License-Identifier: GPL-2.0 -#include <libelf.h> - -int main(void) -{ - Elf *elf = elf_begin(0, ELF_C_READ_MMAP, 0); - - 
return (long)elf; -} diff --git a/tools/include/linux/btf_ids.h b/tools/include/linux/btf_ids.h index 4867d549e3c1..210b086188a3 100644 --- a/tools/include/linux/btf_ids.h +++ b/tools/include/linux/btf_ids.h @@ -3,6 +3,11 @@ #ifndef _LINUX_BTF_IDS_H #define _LINUX_BTF_IDS_H +struct btf_id_set { + u32 cnt; + u32 ids[]; +}; + #ifdef CONFIG_DEBUG_INFO_BTF #include <linux/compiler.h> /* for __PASTE */ @@ -62,7 +67,7 @@ asm( \ ".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ "." #scope " " #name "; \n" \ #name ":; \n" \ -".popsection; \n"); \ +".popsection; \n"); #define BTF_ID_LIST(name) \ __BTF_ID_LIST(name, local) \ @@ -88,12 +93,56 @@ asm( \ ".zero 4 \n" \ ".popsection; \n"); +/* + * The BTF_SET_START/END macros pair defines sorted list of + * BTF IDs plus its members count, with following layout: + * + * BTF_SET_START(list) + * BTF_ID(type1, name1) + * BTF_ID(type2, name2) + * BTF_SET_END(list) + * + * __BTF_ID__set__list: + * .zero 4 + * list: + * __BTF_ID__type1__name1__3: + * .zero 4 + * __BTF_ID__type2__name2__4: + * .zero 4 + * + */ +#define __BTF_SET_START(name, scope) \ +asm( \ +".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ +"." #scope " __BTF_ID__set__" #name "; \n" \ +"__BTF_ID__set__" #name ":; \n" \ +".zero 4 \n" \ +".popsection; \n"); + +#define BTF_SET_START(name) \ +__BTF_ID_LIST(name, local) \ +__BTF_SET_START(name, local) + +#define BTF_SET_START_GLOBAL(name) \ +__BTF_ID_LIST(name, globl) \ +__BTF_SET_START(name, globl) + +#define BTF_SET_END(name) \ +asm( \ +".pushsection " BTF_IDS_SECTION ",\"a\"; \n" \ +".size __BTF_ID__set__" #name ", .-" #name " \n" \ +".popsection; \n"); \ +extern struct btf_id_set name; + #else #define BTF_ID_LIST(name) static u32 name[5]; #define BTF_ID(prefix, name) #define BTF_ID_UNUSED #define BTF_ID_LIST_GLOBAL(name) u32 name[1]; +#define BTF_SET_START(name) static struct btf_id_set name = { 0 }; +#define BTF_SET_START_GLOBAL(name) static struct btf_id_set name = { 0 }; +#define BTF_SET_END(name) #endif /* CONFIG_DEBUG_INFO_BTF */ diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index b6238b2209b7..8dda13880957 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -155,6 +155,7 @@ enum bpf_map_type { BPF_MAP_TYPE_DEVMAP_HASH, BPF_MAP_TYPE_STRUCT_OPS, BPF_MAP_TYPE_RINGBUF, + BPF_MAP_TYPE_INODE_STORAGE, }; /* Note that tracing related programs such as @@ -345,6 +346,14 @@ enum bpf_link_type { /* The verifier internal test flag. Behavior is undefined */ #define BPF_F_TEST_STATE_FREQ (1U << 3) +/* If BPF_F_SLEEPABLE is used in BPF_PROG_LOAD command, the verifier will + * restrict map and helper usage for such programs. Sleepable BPF programs can + * only be attached to hooks where kernel execution context allows sleeping. + * Such programs are allowed to use helpers that may sleep like + * bpf_copy_from_user(). + */ +#define BPF_F_SLEEPABLE (1U << 4) + /* When BPF ldimm64's insn[0].src_reg != 0 then this can have * two extensions: * @@ -2807,7 +2816,7 @@ union bpf_attr { * * **-ERANGE** if resulting value was out of range. * - * void *bpf_sk_storage_get(struct bpf_map *map, struct bpf_sock *sk, void *value, u64 flags) + * void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags) * Description * Get a bpf-local-storage from a *sk*. * @@ -2823,6 +2832,9 @@ union bpf_attr { * "type". The bpf-local-storage "type" (i.e. the *map*) is * searched against all bpf-local-storages residing at *sk*. * + * *sk* is a kernel **struct sock** pointer for LSM program. 
+ * *sk* is a **struct bpf_sock** pointer for other program types. + * * An optional *flags* (**BPF_SK_STORAGE_GET_F_CREATE**) can be * used such that a new bpf-local-storage will be * created if one does not exist. *value* can be used @@ -2835,7 +2847,7 @@ union bpf_attr { * **NULL** if not found or there was an error in adding * a new bpf-local-storage. * - * long bpf_sk_storage_delete(struct bpf_map *map, struct bpf_sock *sk) + * long bpf_sk_storage_delete(struct bpf_map *map, void *sk) * Description * Delete a bpf-local-storage from a *sk*. * Return @@ -3395,6 +3407,175 @@ union bpf_attr { * A non-negative value equal to or less than *size* on success, * or a negative error in case of failure. * + * long bpf_load_hdr_opt(struct bpf_sock_ops *skops, void *searchby_res, u32 len, u64 flags) + * Description + * Load header option. Support reading a particular TCP header + * option for bpf program (BPF_PROG_TYPE_SOCK_OPS). + * + * If *flags* is 0, it will search the option from the + * sock_ops->skb_data. The comment in "struct bpf_sock_ops" + * has details on what skb_data contains under different + * sock_ops->op. + * + * The first byte of the *searchby_res* specifies the + * kind that it wants to search. + * + * If the searching kind is an experimental kind + * (i.e. 253 or 254 according to RFC6994). It also + * needs to specify the "magic" which is either + * 2 bytes or 4 bytes. It then also needs to + * specify the size of the magic by using + * the 2nd byte which is "kind-length" of a TCP + * header option and the "kind-length" also + * includes the first 2 bytes "kind" and "kind-length" + * itself as a normal TCP header option also does. + * + * For example, to search experimental kind 254 with + * 2 byte magic 0xeB9F, the searchby_res should be + * [ 254, 4, 0xeB, 0x9F, 0, 0, .... 0 ]. + * + * To search for the standard window scale option (3), + * the searchby_res should be [ 3, 0, 0, .... 0 ]. + * Note, kind-length must be 0 for regular option. + * + * Searching for No-Op (0) and End-of-Option-List (1) are + * not supported. + * + * *len* must be at least 2 bytes which is the minimal size + * of a header option. + * + * Supported flags: + * * **BPF_LOAD_HDR_OPT_TCP_SYN** to search from the + * saved_syn packet or the just-received syn packet. + * + * Return + * >0 when found, the header option is copied to *searchby_res*. + * The return value is the total length copied. + * + * **-EINVAL** If param is invalid + * + * **-ENOMSG** The option is not found + * + * **-ENOENT** No syn packet available when + * **BPF_LOAD_HDR_OPT_TCP_SYN** is used + * + * **-ENOSPC** Not enough space. Only *len* number of + * bytes are copied. + * + * **-EFAULT** Cannot parse the header options in the packet + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * long bpf_store_hdr_opt(struct bpf_sock_ops *skops, const void *from, u32 len, u64 flags) + * Description + * Store header option. The data will be copied + * from buffer *from* with length *len* to the TCP header. + * + * The buffer *from* should have the whole option that + * includes the kind, kind-length, and the actual + * option data. The *len* must be at least kind-length + * long. The kind-length does not have to be 4 byte + * aligned. The kernel will take care of the padding + * and setting the 4 bytes aligned value to th->doff. + * + * This helper will check for duplicated option + * by searching the same option in the outgoing skb. 
+ * + * This helper can only be called during + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * + * Return + * 0 on success, or negative error in case of failure: + * + * **-EINVAL** If param is invalid + * + * **-ENOSPC** Not enough space in the header. + * Nothing has been written + * + * **-EEXIST** The option has already existed + * + * **-EFAULT** Cannot parse the existing header options + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * long bpf_reserve_hdr_opt(struct bpf_sock_ops *skops, u32 len, u64 flags) + * Description + * Reserve *len* bytes for the bpf header option. The + * space will be used by bpf_store_hdr_opt() later in + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * + * If bpf_reserve_hdr_opt() is called multiple times, + * the total number of bytes will be reserved. + * + * This helper can only be called during + * BPF_SOCK_OPS_HDR_OPT_LEN_CB. + * + * Return + * 0 on success, or negative error in case of failure: + * + * **-EINVAL** if param is invalid + * + * **-ENOSPC** Not enough space in the header. + * + * **-EPERM** This helper cannot be used under the + * current sock_ops->op. + * + * void *bpf_inode_storage_get(struct bpf_map *map, void *inode, void *value, u64 flags) + * Description + * Get a bpf_local_storage from an *inode*. + * + * Logically, it could be thought of as getting the value from + * a *map* with *inode* as the **key**. From this + * perspective, the usage is not much different from + * **bpf_map_lookup_elem**\ (*map*, **&**\ *inode*) except this + * helper enforces the key must be an inode and the map must also + * be a **BPF_MAP_TYPE_INODE_STORAGE**. + * + * Underneath, the value is stored locally at *inode* instead of + * the *map*. The *map* is used as the bpf-local-storage + * "type". The bpf-local-storage "type" (i.e. the *map*) is + * searched against all bpf_local_storage residing at *inode*. + * + * An optional *flags* (**BPF_LOCAL_STORAGE_GET_F_CREATE**) can be + * used such that a new bpf_local_storage will be + * created if one does not exist. *value* can be used + * together with **BPF_LOCAL_STORAGE_GET_F_CREATE** to specify + * the initial value of a bpf_local_storage. If *value* is + * **NULL**, the new bpf_local_storage will be zero initialized. + * Return + * A bpf_local_storage pointer is returned on success. + * + * **NULL** if not found or there was an error in adding + * a new bpf_local_storage. + * + * int bpf_inode_storage_delete(struct bpf_map *map, void *inode) + * Description + * Delete a bpf_local_storage from an *inode*. + * Return + * 0 on success. + * + * **-ENOENT** if the bpf_local_storage cannot be found. + * + * long bpf_d_path(struct path *path, char *buf, u32 sz) + * Description + * Return full path for given 'struct path' object, which + * needs to be the kernel BTF 'path' object. The path is + * returned in the provided buffer 'buf' of size 'sz' and + * is zero terminated. + * + * Return + * On success, the strictly positive length of the string, + * including the trailing NUL character. On error, a negative + * value. + * + * long bpf_copy_from_user(void *dst, u32 size, const void *user_ptr) + * Description + * Read *size* bytes from user space address *user_ptr* and store + * the data in *dst*. This is a wrapper of copy_from_user(). + * Return + * 0 on success, or a negative error in case of failure. 
*/ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -3539,6 +3720,13 @@ union bpf_attr { FN(skc_to_tcp_request_sock), \ FN(skc_to_udp6_sock), \ FN(get_task_stack), \ + FN(load_hdr_opt), \ + FN(store_hdr_opt), \ + FN(reserve_hdr_opt), \ + FN(inode_storage_get), \ + FN(inode_storage_delete), \ + FN(d_path), \ + FN(copy_from_user), \ /* */ /* integer value in 'imm' field of BPF_CALL instruction selects which helper @@ -3648,9 +3836,13 @@ enum { BPF_F_SYSCTL_BASE_NAME = (1ULL << 0), }; -/* BPF_FUNC_sk_storage_get flags */ +/* BPF_FUNC_<kernel_obj>_storage_get flags */ enum { - BPF_SK_STORAGE_GET_F_CREATE = (1ULL << 0), + BPF_LOCAL_STORAGE_GET_F_CREATE = (1ULL << 0), + /* BPF_SK_STORAGE_GET_F_CREATE is only kept for backward compatibility + * and BPF_LOCAL_STORAGE_GET_F_CREATE must be used instead. + */ + BPF_SK_STORAGE_GET_F_CREATE = BPF_LOCAL_STORAGE_GET_F_CREATE, }; /* BPF_FUNC_read_branch_records flags. */ @@ -4071,6 +4263,15 @@ struct bpf_link_info { __u64 cgroup_id; __u32 attach_type; } cgroup; + struct { + __aligned_u64 target_name; /* in/out: target_name buffer ptr */ + __u32 target_name_len; /* in/out: target_name buffer len */ + union { + struct { + __u32 map_id; + } map; + }; + } iter; struct { __u32 netns_ino; __u32 attach_type; @@ -4158,6 +4359,36 @@ struct bpf_sock_ops { __u64 bytes_received; __u64 bytes_acked; __bpf_md_ptr(struct bpf_sock *, sk); + /* [skb_data, skb_data_end) covers the whole TCP header. + * + * BPF_SOCK_OPS_PARSE_HDR_OPT_CB: The packet received + * BPF_SOCK_OPS_HDR_OPT_LEN_CB: Not useful because the + * header has not been written. + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB: The header and options have + * been written so far. + * BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: The SYNACK that concludes + * the 3WHS. + * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes + * the 3WHS. + * + * bpf_load_hdr_opt() can also be used to read a particular option. + */ + __bpf_md_ptr(void *, skb_data); + __bpf_md_ptr(void *, skb_data_end); + __u32 skb_len; /* The total length of a packet. + * It includes the header, options, + * and payload. + */ + __u32 skb_tcp_flags; /* tcp_flags of the header. It provides + * an easy way to check for tcp_flags + * without parsing skb_data. + * + * In particular, the skb_tcp_flags + * will still be available in + * BPF_SOCK_OPS_HDR_OPT_LEN even though + * the outgoing header has not + * been written yet. + */ }; /* Definitions for bpf_sock_ops_cb_flags */ @@ -4166,8 +4397,51 @@ enum { BPF_SOCK_OPS_RETRANS_CB_FLAG = (1<<1), BPF_SOCK_OPS_STATE_CB_FLAG = (1<<2), BPF_SOCK_OPS_RTT_CB_FLAG = (1<<3), + /* Call bpf for all received TCP headers. The bpf prog will be + * called under sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB + * + * Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB + * for the header option related helpers that will be useful + * to the bpf programs. + * + * It could be used at the client/active side (i.e. connect() side) + * when the server told it that the server was in syncookie + * mode and required the active side to resend the bpf-written + * options. The active side can keep writing the bpf-options until + * it received a valid packet from the server side to confirm + * the earlier packet (and options) has been received. The later + * example patch is using it like this at the active side when the + * server is in syncookie mode. + * + * The bpf prog will usually turn this off in the common cases. 
+ */ + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG = (1<<4), + /* Call bpf when kernel has received a header option that + * the kernel cannot handle. The bpf prog will be called under + * sock_ops->op == BPF_SOCK_OPS_PARSE_HDR_OPT_CB. + * + * Please refer to the comment in BPF_SOCK_OPS_PARSE_HDR_OPT_CB + * for the header option related helpers that will be useful + * to the bpf programs. + */ + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG = (1<<5), + /* Call bpf when the kernel is writing header options for the + * outgoing packet. The bpf prog will first be called + * to reserve space in a skb under + * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB. Then + * the bpf prog will be called to write the header option(s) + * under sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * + * Please refer to the comment in BPF_SOCK_OPS_HDR_OPT_LEN_CB + * and BPF_SOCK_OPS_WRITE_HDR_OPT_CB for the header option + * related helpers that will be useful to the bpf programs. + * + * The kernel gets its chance to reserve space and write + * options first before the BPF program does. + */ + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6), /* Mask of all currently supported cb flags */ - BPF_SOCK_OPS_ALL_CB_FLAGS = 0xF, + BPF_SOCK_OPS_ALL_CB_FLAGS = 0x7F, }; /* List of known BPF sock_ops operators. @@ -4223,6 +4497,63 @@ enum { */ BPF_SOCK_OPS_RTT_CB, /* Called on every RTT. */ + BPF_SOCK_OPS_PARSE_HDR_OPT_CB, /* Parse the header option. + * It will be called to handle + * the packets received at + * an already established + * connection. + * + * sock_ops->skb_data: + * Referring to the received skb. + * It covers the TCP header only. + * + * bpf_load_hdr_opt() can also + * be used to search for a + * particular option. + */ + BPF_SOCK_OPS_HDR_OPT_LEN_CB, /* Reserve space for writing the + * header option later in + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + * Arg1: bool want_cookie. (in + * writing SYNACK only) + * + * sock_ops->skb_data: + * Not available because no header has + * been written yet. + * + * sock_ops->skb_tcp_flags: + * The tcp_flags of the + * outgoing skb. (e.g. SYN, ACK, FIN). + * + * bpf_reserve_hdr_opt() should + * be used to reserve space. + */ + BPF_SOCK_OPS_WRITE_HDR_OPT_CB, /* Write the header options + * Arg1: bool want_cookie. (in + * writing SYNACK only) + * + * sock_ops->skb_data: + * Referring to the outgoing skb. + * It covers the TCP header + * that has already been written + * by the kernel and the + * earlier bpf-progs. + * + * sock_ops->skb_tcp_flags: + * The tcp_flags of the outgoing + * skb. (e.g. SYN, ACK, FIN). + * + * bpf_store_hdr_opt() should + * be used to write the + * option. + * + * bpf_load_hdr_opt() can also + * be used to search for a + * particular option that + * has already been written + * by the kernel or the + * earlier bpf-progs. + */ }; /* List of TCP states. There is a build check in net/ipv4/tcp.c to detect @@ -4250,6 +4581,63 @@ enum { enum { TCP_BPF_IW = 1001, /* Set TCP initial congestion window */ TCP_BPF_SNDCWND_CLAMP = 1002, /* Set sndcwnd_clamp */ + TCP_BPF_DELACK_MAX = 1003, /* Max delay ack in usecs */ + TCP_BPF_RTO_MIN = 1004, /* Min delay ack in usecs */ + /* Copy the SYN pkt to optval + * + * BPF_PROG_TYPE_SOCK_OPS only. It is similar to the + * bpf_getsockopt(TCP_SAVED_SYN) but it does not limit + * to only getting from the saved_syn. It can either get the + * syn packet from: + * + * 1. the just-received SYN packet (only available when writing the + * SYNACK). It will be useful when it is not necessary to + * save the SYN packet for latter use. 
It is also the only way + * to get the SYN during syncookie mode because the syn + * packet cannot be saved during syncookie. + * + * OR + * + * 2. the earlier saved syn which was done by + * bpf_setsockopt(TCP_SAVE_SYN). + * + * The bpf_getsockopt(TCP_BPF_SYN*) option will hide where the + * SYN packet is obtained. + * + * If the bpf-prog does not need the IP[46] header, the + * bpf-prog can avoid parsing the IP header by using + * TCP_BPF_SYN. Otherwise, the bpf-prog can get both + * IP[46] and TCP header by using TCP_BPF_SYN_IP. + * + * >0: Total number of bytes copied + * -ENOSPC: Not enough space in optval. Only optlen number of + * bytes is copied. + * -ENOENT: The SYN skb is not available now and the earlier SYN pkt + * is not saved by setsockopt(TCP_SAVE_SYN). + */ + TCP_BPF_SYN = 1005, /* Copy the TCP header */ + TCP_BPF_SYN_IP = 1006, /* Copy the IP[46] and TCP header */ + TCP_BPF_SYN_MAC = 1007, /* Copy the MAC, IP[46], and TCP header */ +}; + +enum { + BPF_LOAD_HDR_OPT_TCP_SYN = (1ULL << 0), +}; + +/* args[0] value during BPF_SOCK_OPS_HDR_OPT_LEN_CB and + * BPF_SOCK_OPS_WRITE_HDR_OPT_CB. + */ +enum { + BPF_WRITE_HDR_TCP_CURRENT_MSS = 1, /* Kernel is finding the + * total option spaces + * required for an established + * sk in order to calculate the + * MSS. No skb is actually + * sent. + */ + BPF_WRITE_HDR_TCP_SYNACK_COOKIE = 2, /* Kernel is in syncookie mode + * when sending a SYN. + */ }; struct bpf_perf_event_value { diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile index bf8ed134cb8a..adbe994610f2 100644 --- a/tools/lib/bpf/Makefile +++ b/tools/lib/bpf/Makefile @@ -1,6 +1,9 @@ # SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) # Most of this file is copied from tools/lib/traceevent/Makefile +RM ?= rm +srctree = $(abs_srctree) + LIBBPF_VERSION := $(shell \ grep -oE '^LIBBPF_([0-9.]+)' libbpf.map | \ sort -rV | head -n1 | cut -d'_' -f2) @@ -56,7 +59,7 @@ ifndef VERBOSE endif FEATURE_USER = .libbpf -FEATURE_TESTS = libelf libelf-mmap zlib bpf reallocarray +FEATURE_TESTS = libelf zlib bpf FEATURE_DISPLAY = libelf zlib bpf INCLUDES = -I. 
-I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi @@ -98,16 +101,8 @@ else CFLAGS := -g -Wall endif -ifeq ($(feature-libelf-mmap), 1) - override CFLAGS += -DHAVE_LIBELF_MMAP_SUPPORT -endif - -ifeq ($(feature-reallocarray), 0) - override CFLAGS += -DCOMPAT_NEED_REALLOCARRAY -endif - # Append required CFLAGS -override CFLAGS += $(EXTRA_WARNINGS) +override CFLAGS += $(EXTRA_WARNINGS) -Wno-switch-enum override CFLAGS += -Werror -Wall override CFLAGS += -fPIC override CFLAGS += $(INCLUDES) @@ -196,7 +191,7 @@ $(OUTPUT)libbpf.so.$(LIBBPF_VERSION): $(BPF_IN_SHARED) @ln -sf $(@F) $(OUTPUT)libbpf.so.$(LIBBPF_MAJOR_VERSION) $(OUTPUT)libbpf.a: $(BPF_IN_STATIC) - $(QUIET_LINK)$(RM) $@; $(AR) rcs $@ $^ + $(QUIET_LINK)$(RM) -f $@; $(AR) rcs $@ $^ $(OUTPUT)libbpf.pc: $(QUIET_GEN)sed -e "s|@PREFIX@|$(prefix)|" \ @@ -269,10 +264,10 @@ install: install_lib install_pkgconfig install_headers ### Cleaning rules config-clean: - $(call QUIET_CLEAN, config) + $(call QUIET_CLEAN, feature-detect) $(Q)$(MAKE) -C $(srctree)/tools/build/feature/ clean >/dev/null -clean: +clean: config-clean $(call QUIET_CLEAN, libbpf) $(RM) -rf $(CMD_TARGETS) \ *~ .*.d .*.cmd LIBBPF-CFLAGS $(BPF_HELPER_DEFS) \ $(SHARED_OBJDIR) $(STATIC_OBJDIR) \ @@ -299,7 +294,7 @@ cscope: cscope -b -q -I $(srctree)/include -f cscope.out tags: - rm -f TAGS tags + $(RM) -f TAGS tags ls *.c *.h | xargs $(TAGS_PROG) -a # Declare the contents of the .PHONY variable as phony. We keep that diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c index 0750681057c2..82b983ff6569 100644 --- a/tools/lib/bpf/bpf.c +++ b/tools/lib/bpf/bpf.c @@ -32,9 +32,6 @@ #include "libbpf.h" #include "libbpf_internal.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - /* * When building perf, unistd.h is overridden. __NR_bpf is * required to be defined explicitly. 
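The TCP header option helpers documented in the UAPI hunk above (bpf_load_hdr_opt(), bpf_store_hdr_opt(), bpf_reserve_hdr_opt()) are meant to be used together from a single BPF_PROG_TYPE_SOCK_OPS program, driven by the new BPF_SOCK_OPS_HDR_OPT_LEN_CB / BPF_SOCK_OPS_WRITE_HDR_OPT_CB / BPF_SOCK_OPS_PARSE_HDR_OPT_CB callbacks and their cb flags. The sketch below is not part of the diff; it is a minimal illustration that reuses the experimental option kind 254 with 2-byte magic 0xeB9F from the helper documentation, while the struct toy_opt layout and the program name are made-up assumptions.

    /* Minimal sketch only: write and read back a 4-byte experimental TCP
     * option (kind 254, magic 0xeB9F) on an established connection.  The
     * option layout and names are illustrative, not from the diff.
     */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct toy_opt {
    	__u8  kind;	/* 254: experimental kind (RFC 6994) */
    	__u8  len;	/* kind-length: kind + len + magic = 4 bytes */
    	__u16 magic;	/* 0xeB9F, stored in network byte order */
    };

    SEC("sockops")
    int toy_hdr_opt(struct bpf_sock_ops *skops)
    {
    	struct toy_opt opt = {
    		.kind  = 254,
    		.len   = 4,
    		.magic = bpf_htons(0xeB9F),
    	};

    	switch (skops->op) {
    	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
    		/* Opt in to the header-option callbacks for this socket. */
    		bpf_sock_ops_cb_flags_set(skops,
    					  skops->bpf_sock_ops_cb_flags |
    					  BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG |
    					  BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG);
    		break;
    	case BPF_SOCK_OPS_HDR_OPT_LEN_CB:
    		/* Reserve room before the outgoing header is written. */
    		bpf_reserve_hdr_opt(skops, sizeof(opt), 0);
    		break;
    	case BPF_SOCK_OPS_WRITE_HDR_OPT_CB:
    		/* Kernel pads and fixes th->doff; duplicates return -EEXIST. */
    		bpf_store_hdr_opt(skops, &opt, sizeof(opt), 0);
    		break;
    	case BPF_SOCK_OPS_PARSE_HDR_OPT_CB:
    		/* searchby_res = [kind, kind-len, magic]; option copied back
    		 * into 'opt' on success (return value > 0).
    		 */
    		bpf_load_hdr_opt(skops, &opt, sizeof(opt), 0);
    		break;
    	}
    	return 1;
    }

    char _license[] SEC("license") = "GPL";

The write/parse callbacks are only invoked once the corresponding BPF_SOCK_OPS_*_CB_FLAGs are set, which is why the sketch enables them from BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB; writing options into the SYN itself would require setting the flags earlier in the connection lifetime.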
diff --git a/tools/lib/bpf/bpf_core_read.h b/tools/lib/bpf/bpf_core_read.h index eae5cccff761..bbcefb3ff5a5 100644 --- a/tools/lib/bpf/bpf_core_read.h +++ b/tools/lib/bpf/bpf_core_read.h @@ -19,32 +19,52 @@ enum bpf_field_info_kind { BPF_FIELD_RSHIFT_U64 = 5, }; +/* second argument to __builtin_btf_type_id() built-in */ +enum bpf_type_id_kind { + BPF_TYPE_ID_LOCAL = 0, /* BTF type ID in local program */ + BPF_TYPE_ID_TARGET = 1, /* BTF type ID in target kernel */ +}; + +/* second argument to __builtin_preserve_type_info() built-in */ +enum bpf_type_info_kind { + BPF_TYPE_EXISTS = 0, /* type existence in target kernel */ + BPF_TYPE_SIZE = 1, /* type size in target kernel */ +}; + +/* second argument to __builtin_preserve_enum_value() built-in */ +enum bpf_enum_value_kind { + BPF_ENUMVAL_EXISTS = 0, /* enum value existence in kernel */ + BPF_ENUMVAL_VALUE = 1, /* enum value value relocation */ +}; + #define __CORE_RELO(src, field, info) \ __builtin_preserve_field_info((src)->field, BPF_FIELD_##info) #if __BYTE_ORDER == __LITTLE_ENDIAN #define __CORE_BITFIELD_PROBE_READ(dst, src, fld) \ - bpf_probe_read((void *)dst, \ - __CORE_RELO(src, fld, BYTE_SIZE), \ - (const void *)src + __CORE_RELO(src, fld, BYTE_OFFSET)) + bpf_probe_read_kernel( \ + (void *)dst, \ + __CORE_RELO(src, fld, BYTE_SIZE), \ + (const void *)src + __CORE_RELO(src, fld, BYTE_OFFSET)) #else /* semantics of LSHIFT_64 assumes loading values into low-ordered bytes, so * for big-endian we need to adjust destination pointer accordingly, based on * field byte size */ #define __CORE_BITFIELD_PROBE_READ(dst, src, fld) \ - bpf_probe_read((void *)dst + (8 - __CORE_RELO(src, fld, BYTE_SIZE)), \ - __CORE_RELO(src, fld, BYTE_SIZE), \ - (const void *)src + __CORE_RELO(src, fld, BYTE_OFFSET)) + bpf_probe_read_kernel( \ + (void *)dst + (8 - __CORE_RELO(src, fld, BYTE_SIZE)), \ + __CORE_RELO(src, fld, BYTE_SIZE), \ + (const void *)src + __CORE_RELO(src, fld, BYTE_OFFSET)) #endif /* * Extract bitfield, identified by s->field, and return its value as u64. * All this is done in relocatable manner, so bitfield changes such as * signedness, bit size, offset changes, this will be handled automatically. - * This version of macro is using bpf_probe_read() to read underlying integer - * storage. Macro functions as an expression and its return type is - * bpf_probe_read()'s return value: 0, on success, <0 on error. + * This version of macro is using bpf_probe_read_kernel() to read underlying + * integer storage. Macro functions as an expression and its return type is + * bpf_probe_read_kernel()'s return value: 0, on success, <0 on error. */ #define BPF_CORE_READ_BITFIELD_PROBED(s, field) ({ \ unsigned long long val = 0; \ @@ -92,15 +112,75 @@ enum bpf_field_info_kind { __builtin_preserve_field_info(field, BPF_FIELD_EXISTS) /* - * Convenience macro to get byte size of a field. Works for integers, + * Convenience macro to get the byte size of a field. Works for integers, * struct/unions, pointers, arrays, and enums. */ #define bpf_core_field_size(field) \ __builtin_preserve_field_info(field, BPF_FIELD_BYTE_SIZE) /* - * bpf_core_read() abstracts away bpf_probe_read() call and captures offset - * relocation for source address using __builtin_preserve_access_index() + * Convenience macro to get BTF type ID of a specified type, using a local BTF + * information. Return 32-bit unsigned integer with type ID from program's own + * BTF. Always succeeds. 
+ */ +#define bpf_core_type_id_local(type) \ + __builtin_btf_type_id(*(typeof(type) *)0, BPF_TYPE_ID_LOCAL) + +/* + * Convenience macro to get BTF type ID of a target kernel's type that matches + * specified local type. + * Returns: + * - valid 32-bit unsigned type ID in kernel BTF; + * - 0, if no matching type was found in a target kernel BTF. + */ +#define bpf_core_type_id_kernel(type) \ + __builtin_btf_type_id(*(typeof(type) *)0, BPF_TYPE_ID_TARGET) + +/* + * Convenience macro to check that provided named type + * (struct/union/enum/typedef) exists in a target kernel. + * Returns: + * 1, if such type is present in target kernel's BTF; + * 0, if no matching type is found. + */ +#define bpf_core_type_exists(type) \ + __builtin_preserve_type_info(*(typeof(type) *)0, BPF_TYPE_EXISTS) + +/* + * Convenience macro to get the byte size of a provided named type + * (struct/union/enum/typedef) in a target kernel. + * Returns: + * >= 0 size (in bytes), if type is present in target kernel's BTF; + * 0, if no matching type is found. + */ +#define bpf_core_type_size(type) \ + __builtin_preserve_type_info(*(typeof(type) *)0, BPF_TYPE_SIZE) + +/* + * Convenience macro to check that provided enumerator value is defined in + * a target kernel. + * Returns: + * 1, if specified enum type and its enumerator value are present in target + * kernel's BTF; + * 0, if no matching enum and/or enum value within that enum is found. + */ +#define bpf_core_enum_value_exists(enum_type, enum_value) \ + __builtin_preserve_enum_value(*(typeof(enum_type) *)enum_value, BPF_ENUMVAL_EXISTS) + +/* + * Convenience macro to get the integer value of an enumerator value in + * a target kernel. + * Returns: + * 64-bit value, if specified enum type and its enumerator value are + * present in target kernel's BTF; + * 0, if no matching enum and/or enum value within that enum is found. + */ +#define bpf_core_enum_value(enum_type, enum_value) \ + __builtin_preserve_enum_value(*(typeof(enum_type) *)enum_value, BPF_ENUMVAL_VALUE) + +/* + * bpf_core_read() abstracts away bpf_probe_read_kernel() call and captures + * offset relocation for source address using __builtin_preserve_access_index() * built-in, provided by Clang. * * __builtin_preserve_access_index() takes as an argument an expression of @@ -115,8 +195,8 @@ enum bpf_field_info_kind { * (local) BTF, used to record relocation. */ #define bpf_core_read(dst, sz, src) \ - bpf_probe_read(dst, sz, \ - (const void *)__builtin_preserve_access_index(src)) + bpf_probe_read_kernel(dst, sz, \ + (const void *)__builtin_preserve_access_index(src)) /* * bpf_core_read_str() is a thin wrapper around bpf_probe_read_str() @@ -124,8 +204,8 @@ enum bpf_field_info_kind { * argument. */ #define bpf_core_read_str(dst, sz, src) \ - bpf_probe_read_str(dst, sz, \ - (const void *)__builtin_preserve_access_index(src)) + bpf_probe_read_kernel_str(dst, sz, \ + (const void *)__builtin_preserve_access_index(src)) #define ___concat(a, b) a ## b #define ___apply(fn, n) ___concat(fn, n) @@ -239,15 +319,17 @@ enum bpf_field_info_kind { * int x = BPF_CORE_READ(s, a.b.c, d.e, f, g); * * BPF_CORE_READ will decompose above statement into 4 bpf_core_read (BPF - * CO-RE relocatable bpf_probe_read() wrapper) calls, logically equivalent to: + * CO-RE relocatable bpf_probe_read_kernel() wrapper) calls, logically + * equivalent to: * 1. const void *__t = s->a.b.c; * 2. __t = __t->d.e; * 3. __t = __t->f; * 4. 
return __t->g; * * Equivalence is logical, because there is a heavy type casting/preservation - * involved, as well as all the reads are happening through bpf_probe_read() - * calls using __builtin_preserve_access_index() to emit CO-RE relocations. + * involved, as well as all the reads are happening through + * bpf_probe_read_kernel() calls using __builtin_preserve_access_index() to + * emit CO-RE relocations. * * N.B. Only up to 9 "field accessors" are supported, which should be more * than enough for any practical purpose. diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h index e9a4ecddb7a5..1106777df00b 100644 --- a/tools/lib/bpf/bpf_helpers.h +++ b/tools/lib/bpf/bpf_helpers.h @@ -32,6 +32,9 @@ #ifndef __always_inline #define __always_inline __attribute__((always_inline)) #endif +#ifndef __noinline +#define __noinline __attribute__((noinline)) +#endif #ifndef __weak #define __weak __attribute__((weak)) #endif diff --git a/tools/lib/bpf/bpf_prog_linfo.c b/tools/lib/bpf/bpf_prog_linfo.c index bafca49cb1e6..3ed1a27b5f7c 100644 --- a/tools/lib/bpf/bpf_prog_linfo.c +++ b/tools/lib/bpf/bpf_prog_linfo.c @@ -8,9 +8,6 @@ #include "libbpf.h" #include "libbpf_internal.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - struct bpf_prog_linfo { void *raw_linfo; void *raw_jited_linfo; diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h index eebf020cbe3e..f9ef37707888 100644 --- a/tools/lib/bpf/bpf_tracing.h +++ b/tools/lib/bpf/bpf_tracing.h @@ -289,9 +289,9 @@ struct pt_regs; #define BPF_KRETPROBE_READ_RET_IP BPF_KPROBE_READ_RET_IP #else #define BPF_KPROBE_READ_RET_IP(ip, ctx) \ - ({ bpf_probe_read(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)); }) + ({ bpf_probe_read_kernel(&(ip), sizeof(ip), (void *)PT_REGS_RET(ctx)); }) #define BPF_KRETPROBE_READ_RET_IP(ip, ctx) \ - ({ bpf_probe_read(&(ip), sizeof(ip), \ + ({ bpf_probe_read_kernel(&(ip), sizeof(ip), \ (void *)(PT_REGS_FP(ctx) + sizeof(ip))); }) #endif diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c index 7dfca7016aaa..a3d259e614b0 100644 --- a/tools/lib/bpf/btf.c +++ b/tools/lib/bpf/btf.c @@ -21,9 +21,6 @@ #include "libbpf_internal.h" #include "hashmap.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - #define BTF_MAX_NR_TYPES 0x7fffffffU #define BTF_MAX_STR_OFFSET 0x7fffffffU @@ -61,7 +58,7 @@ static int btf_add_type(struct btf *btf, struct btf_type *t) expand_by = max(btf->types_size >> 2, 16U); new_size = min(BTF_MAX_NR_TYPES, btf->types_size + expand_by); - new_types = realloc(btf->types, sizeof(*new_types) * new_size); + new_types = libbpf_reallocarray(btf->types, new_size, sizeof(*new_types)); if (!new_types) return -ENOMEM; @@ -1131,14 +1128,14 @@ static int btf_ext_setup_line_info(struct btf_ext *btf_ext) return btf_ext_setup_info(btf_ext, ¶m); } -static int btf_ext_setup_field_reloc(struct btf_ext *btf_ext) +static int btf_ext_setup_core_relos(struct btf_ext *btf_ext) { struct btf_ext_sec_setup_param param = { - .off = btf_ext->hdr->field_reloc_off, - .len = btf_ext->hdr->field_reloc_len, - .min_rec_size = sizeof(struct bpf_field_reloc), - .ext_info = &btf_ext->field_reloc_info, - .desc = "field_reloc", + .off = btf_ext->hdr->core_relo_off, + .len = btf_ext->hdr->core_relo_len, + .min_rec_size = sizeof(struct bpf_core_relo), + .ext_info = &btf_ext->core_relo_info, + .desc = "core_relo", }; return btf_ext_setup_info(btf_ext, ¶m); @@ -1217,10 +1214,9 @@ struct 
btf_ext *btf_ext__new(__u8 *data, __u32 size) if (err) goto done; - if (btf_ext->hdr->hdr_len < - offsetofend(struct btf_ext_header, field_reloc_len)) + if (btf_ext->hdr->hdr_len < offsetofend(struct btf_ext_header, core_relo_len)) goto done; - err = btf_ext_setup_field_reloc(btf_ext); + err = btf_ext_setup_core_relos(btf_ext); if (err) goto done; @@ -1575,7 +1571,7 @@ static int btf_dedup_hypot_map_add(struct btf_dedup *d, __u32 *new_list; d->hypot_cap += max((size_t)16, d->hypot_cap / 2); - new_list = realloc(d->hypot_list, sizeof(__u32) * d->hypot_cap); + new_list = libbpf_reallocarray(d->hypot_list, d->hypot_cap, sizeof(__u32)); if (!new_list) return -ENOMEM; d->hypot_list = new_list; @@ -1871,8 +1867,7 @@ static int btf_dedup_strings(struct btf_dedup *d) struct btf_str_ptr *new_ptrs; strs.cap += max(strs.cnt / 2, 16U); - new_ptrs = realloc(strs.ptrs, - sizeof(strs.ptrs[0]) * strs.cap); + new_ptrs = libbpf_reallocarray(strs.ptrs, strs.cap, sizeof(strs.ptrs[0])); if (!new_ptrs) { err = -ENOMEM; goto done; @@ -2957,8 +2952,8 @@ static int btf_dedup_compact_types(struct btf_dedup *d) d->btf->nr_types = next_type_id - 1; d->btf->types_size = d->btf->nr_types; d->btf->hdr->type_len = p - types_start; - new_types = realloc(d->btf->types, - (1 + d->btf->nr_types) * sizeof(struct btf_type *)); + new_types = libbpf_reallocarray(d->btf->types, (1 + d->btf->nr_types), + sizeof(struct btf_type *)); if (!new_types) return -ENOMEM; d->btf->types = new_types; diff --git a/tools/lib/bpf/btf.h b/tools/lib/bpf/btf.h index 1ca14448df4c..91f0ad0e0325 100644 --- a/tools/lib/bpf/btf.h +++ b/tools/lib/bpf/btf.h @@ -24,44 +24,6 @@ struct btf_type; struct bpf_object; -/* - * The .BTF.ext ELF section layout defined as - * struct btf_ext_header - * func_info subsection - * - * The func_info subsection layout: - * record size for struct bpf_func_info in the func_info subsection - * struct btf_sec_func_info for section #1 - * a list of bpf_func_info records for section #1 - * where struct bpf_func_info mimics one in include/uapi/linux/bpf.h - * but may not be identical - * struct btf_sec_func_info for section #2 - * a list of bpf_func_info records for section #2 - * ...... - * - * Note that the bpf_func_info record size in .BTF.ext may not - * be the same as the one defined in include/uapi/linux/bpf.h. - * The loader should ensure that record_size meets minimum - * requirement and pass the record as is to the kernel. The - * kernel will handle the func_info properly based on its contents. 
- */ -struct btf_ext_header { - __u16 magic; - __u8 version; - __u8 flags; - __u32 hdr_len; - - /* All offsets are in bytes relative to the end of this header */ - __u32 func_info_off; - __u32 func_info_len; - __u32 line_info_off; - __u32 line_info_len; - - /* optional part of .BTF.ext header */ - __u32 field_reloc_off; - __u32 field_reloc_len; -}; - LIBBPF_API void btf__free(struct btf *btf); LIBBPF_API struct btf *btf__new(const void *data, __u32 size); LIBBPF_API struct btf *btf__parse(const char *path, struct btf_ext **btf_ext); diff --git a/tools/lib/bpf/btf_dump.c b/tools/lib/bpf/btf_dump.c index 57c00fa63932..6c079b3c8679 100644 --- a/tools/lib/bpf/btf_dump.c +++ b/tools/lib/bpf/btf_dump.c @@ -19,9 +19,6 @@ #include "libbpf.h" #include "libbpf_internal.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - static const char PREFIXES[] = "\t\t\t\t\t\t\t\t\t\t\t\t\t"; static const size_t PREFIX_CNT = sizeof(PREFIXES) - 1; @@ -323,8 +320,7 @@ static int btf_dump_add_emit_queue_id(struct btf_dump *d, __u32 id) if (d->emit_queue_cnt >= d->emit_queue_cap) { new_cap = max(16, d->emit_queue_cap * 3 / 2); - new_queue = realloc(d->emit_queue, - new_cap * sizeof(new_queue[0])); + new_queue = libbpf_reallocarray(d->emit_queue, new_cap, sizeof(new_queue[0])); if (!new_queue) return -ENOMEM; d->emit_queue = new_queue; @@ -1003,8 +999,7 @@ static int btf_dump_push_decl_stack_id(struct btf_dump *d, __u32 id) if (d->decl_stack_cnt >= d->decl_stack_cap) { new_cap = max(16, d->decl_stack_cap * 3 / 2); - new_stack = realloc(d->decl_stack, - new_cap * sizeof(new_stack[0])); + new_stack = libbpf_reallocarray(d->decl_stack, new_cap, sizeof(new_stack[0])); if (!new_stack) return -ENOMEM; d->decl_stack = new_stack; diff --git a/tools/lib/bpf/hashmap.c b/tools/lib/bpf/hashmap.c index a405dad068f5..3c20b126d60d 100644 --- a/tools/lib/bpf/hashmap.c +++ b/tools/lib/bpf/hashmap.c @@ -15,6 +15,9 @@ /* make sure libbpf doesn't use kernel-only integer typedefs */ #pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 +/* prevent accidental re-addition of reallocarray() */ +#pragma GCC poison reallocarray + /* start with 4 buckets */ #define HASHMAP_MIN_CAP_BITS 2 diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 0ad0b0491e1f..b688aadf09c5 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -44,7 +44,6 @@ #include <sys/vfs.h> #include <sys/utsname.h> #include <sys/resource.h> -#include <tools/libc_compat.h> #include <libelf.h> #include <gelf.h> #include <zlib.h> @@ -56,9 +55,6 @@ #include "libbpf_internal.h" #include "hashmap.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - #ifndef EM_BPF #define EM_BPF 247 #endif @@ -67,6 +63,8 @@ #define BPF_FS_MAGIC 0xcafe4a11 #endif +#define BPF_INSN_SZ (sizeof(struct bpf_insn)) + /* vsprintf() in __base_pr() uses nonliteral format string. It may break * compilation if user enables corresponding warning. Disable it explicitly. */ @@ -154,34 +152,35 @@ static void pr_perm_msg(int err) ___err; }) #endif -#ifdef HAVE_LIBELF_MMAP_SUPPORT -# define LIBBPF_ELF_C_READ_MMAP ELF_C_READ_MMAP -#else -# define LIBBPF_ELF_C_READ_MMAP ELF_C_READ -#endif - static inline __u64 ptr_to_u64(const void *ptr) { return (__u64) (unsigned long) ptr; } -struct bpf_capabilities { +enum kern_feature_id { /* v4.14: kernel support for program & map names. */ - __u32 name:1; + FEAT_PROG_NAME, /* v5.2: kernel support for global data sections. 
*/ - __u32 global_data:1; + FEAT_GLOBAL_DATA, + /* BTF support */ + FEAT_BTF, /* BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO support */ - __u32 btf_func:1; + FEAT_BTF_FUNC, /* BTF_KIND_VAR and BTF_KIND_DATASEC support */ - __u32 btf_datasec:1; - /* BPF_F_MMAPABLE is supported for arrays */ - __u32 array_mmap:1; + FEAT_BTF_DATASEC, /* BTF_FUNC_GLOBAL is supported */ - __u32 btf_func_global:1; + FEAT_BTF_GLOBAL_FUNC, + /* BPF_F_MMAPABLE is supported for arrays */ + FEAT_ARRAY_MMAP, /* kernel support for expected_attach_type in BPF_PROG_LOAD */ - __u32 exp_attach_type:1; + FEAT_EXP_ATTACH_TYPE, + /* bpf_probe_read_{kernel,user}[_str] helpers */ + FEAT_PROBE_READ_KERN, + __FEAT_CNT, }; +static bool kernel_supports(enum kern_feature_id feat_id); + enum reloc_type { RELO_LD64, RELO_CALL, @@ -209,6 +208,7 @@ struct bpf_sec_def { bool is_exp_attach_type_optional; bool is_attachable; bool is_attach_btf; + bool is_sleepable; attach_fn_t attach_fn; }; @@ -253,8 +253,6 @@ struct bpf_program { __u32 func_info_rec_size; __u32 func_info_cnt; - struct bpf_capabilities *caps; - void *line_info; __u32 line_info_rec_size; __u32 line_info_cnt; @@ -403,6 +401,7 @@ struct bpf_object { Elf_Data *rodata; Elf_Data *bss; Elf_Data *st_ops_data; + size_t shstrndx; /* section index for section name strings */ size_t strtabidx; struct { GElf_Shdr shdr; @@ -436,12 +435,18 @@ struct bpf_object { void *priv; bpf_object_clear_priv_t clear_priv; - struct bpf_capabilities caps; - char path[]; }; #define obj_elf_valid(o) ((o)->efile.elf) +static const char *elf_sym_str(const struct bpf_object *obj, size_t off); +static const char *elf_sec_str(const struct bpf_object *obj, size_t off); +static Elf_Scn *elf_sec_by_idx(const struct bpf_object *obj, size_t idx); +static Elf_Scn *elf_sec_by_name(const struct bpf_object *obj, const char *name); +static int elf_sec_hdr(const struct bpf_object *obj, Elf_Scn *scn, GElf_Shdr *hdr); +static const char *elf_sec_name(const struct bpf_object *obj, Elf_Scn *scn); +static Elf_Data *elf_sec_data(const struct bpf_object *obj, Elf_Scn *scn); + void bpf_program__unload(struct bpf_program *prog) { int i; @@ -503,7 +508,7 @@ static char *__bpf_program__pin_name(struct bpf_program *prog) } static int -bpf_program__init(void *data, size_t size, char *section_name, int idx, +bpf_program__init(void *data, size_t size, const char *section_name, int idx, struct bpf_program *prog) { const size_t bpf_insn_sz = sizeof(struct bpf_insn); @@ -552,7 +557,7 @@ errout: static int bpf_object__add_program(struct bpf_object *obj, void *data, size_t size, - char *section_name, int idx) + const char *section_name, int idx) { struct bpf_program prog, *progs; int nr_progs, err; @@ -561,11 +566,10 @@ bpf_object__add_program(struct bpf_object *obj, void *data, size_t size, if (err) return err; - prog.caps = &obj->caps; progs = obj->programs; nr_progs = obj->nr_programs; - progs = reallocarray(progs, nr_progs + 1, sizeof(progs[0])); + progs = libbpf_reallocarray(progs, nr_progs + 1, sizeof(progs[0])); if (!progs) { /* * In this case the original obj->programs @@ -578,7 +582,7 @@ bpf_object__add_program(struct bpf_object *obj, void *data, size_t size, return -ENOMEM; } - pr_debug("found program %s\n", prog.section_name); + pr_debug("elf: found program '%s'\n", prog.section_name); obj->programs = progs; obj->nr_programs = nr_progs + 1; prog.obj = obj; @@ -598,8 +602,7 @@ bpf_object__init_prog_names(struct bpf_object *obj) prog = &obj->programs[pi]; - for (si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name; - si++) { + for 
(si = 0; si < symbols->d_size / sizeof(GElf_Sym) && !name; si++) { GElf_Sym sym; if (!gelf_getsym(symbols, si, &sym)) @@ -609,11 +612,9 @@ bpf_object__init_prog_names(struct bpf_object *obj) if (GELF_ST_BIND(sym.st_info) != STB_GLOBAL) continue; - name = elf_strptr(obj->efile.elf, - obj->efile.strtabidx, - sym.st_name); + name = elf_sym_str(obj, sym.st_name); if (!name) { - pr_warn("failed to get sym name string for prog %s\n", + pr_warn("prog '%s': failed to get symbol name\n", prog->section_name); return -LIBBPF_ERRNO__LIBELF; } @@ -623,17 +624,14 @@ bpf_object__init_prog_names(struct bpf_object *obj) name = ".text"; if (!name) { - pr_warn("failed to find sym for prog %s\n", + pr_warn("prog '%s': failed to find program symbol\n", prog->section_name); return -EINVAL; } prog->name = strdup(name); - if (!prog->name) { - pr_warn("failed to allocate memory for prog sym %s\n", - name); + if (!prog->name) return -ENOMEM; - } } return 0; @@ -1066,13 +1064,18 @@ static void bpf_object__elf_finish(struct bpf_object *obj) obj->efile.obj_buf_sz = 0; } +/* if libelf is old and doesn't support mmap(), fall back to read() */ +#ifndef ELF_C_READ_MMAP +#define ELF_C_READ_MMAP ELF_C_READ +#endif + static int bpf_object__elf_init(struct bpf_object *obj) { int err = 0; GElf_Ehdr *ep; if (obj_elf_valid(obj)) { - pr_warn("elf init: internal error\n"); + pr_warn("elf: init internal error\n"); return -LIBBPF_ERRNO__LIBELF; } @@ -1090,31 +1093,44 @@ static int bpf_object__elf_init(struct bpf_object *obj) err = -errno; cp = libbpf_strerror_r(err, errmsg, sizeof(errmsg)); - pr_warn("failed to open %s: %s\n", obj->path, cp); + pr_warn("elf: failed to open %s: %s\n", obj->path, cp); return err; } - obj->efile.elf = elf_begin(obj->efile.fd, - LIBBPF_ELF_C_READ_MMAP, NULL); + obj->efile.elf = elf_begin(obj->efile.fd, ELF_C_READ_MMAP, NULL); } if (!obj->efile.elf) { - pr_warn("failed to open %s as ELF file\n", obj->path); + pr_warn("elf: failed to open %s as ELF file: %s\n", obj->path, elf_errmsg(-1)); err = -LIBBPF_ERRNO__LIBELF; goto errout; } if (!gelf_getehdr(obj->efile.elf, &obj->efile.ehdr)) { - pr_warn("failed to get EHDR from %s\n", obj->path); + pr_warn("elf: failed to get ELF header from %s: %s\n", obj->path, elf_errmsg(-1)); err = -LIBBPF_ERRNO__FORMAT; goto errout; } ep = &obj->efile.ehdr; + if (elf_getshdrstrndx(obj->efile.elf, &obj->efile.shstrndx)) { + pr_warn("elf: failed to get section names section index for %s: %s\n", + obj->path, elf_errmsg(-1)); + err = -LIBBPF_ERRNO__FORMAT; + goto errout; + } + + /* Elf is corrupted/truncated, avoid calling elf_strptr. 
*/ + if (!elf_rawdata(elf_getscn(obj->efile.elf, obj->efile.shstrndx), NULL)) { + pr_warn("elf: failed to get section names strings from %s: %s\n", + obj->path, elf_errmsg(-1)); + return -LIBBPF_ERRNO__FORMAT; + } + /* Old LLVM set e_machine to EM_NONE */ if (ep->e_type != ET_REL || (ep->e_machine && ep->e_machine != EM_BPF)) { - pr_warn("%s is not an eBPF object file\n", obj->path); + pr_warn("elf: %s is not a valid eBPF object file\n", obj->path); err = -LIBBPF_ERRNO__FORMAT; goto errout; } @@ -1136,7 +1152,7 @@ static int bpf_object__check_endianness(struct bpf_object *obj) #else # error "Unrecognized __BYTE_ORDER__" #endif - pr_warn("endianness mismatch.\n"); + pr_warn("elf: endianness mismatch in %s.\n", obj->path); return -LIBBPF_ERRNO__ENDIAN; } @@ -1171,55 +1187,10 @@ static bool bpf_map_type__is_map_in_map(enum bpf_map_type type) return false; } -static int bpf_object_search_section_size(const struct bpf_object *obj, - const char *name, size_t *d_size) -{ - const GElf_Ehdr *ep = &obj->efile.ehdr; - Elf *elf = obj->efile.elf; - Elf_Scn *scn = NULL; - int idx = 0; - - while ((scn = elf_nextscn(elf, scn)) != NULL) { - const char *sec_name; - Elf_Data *data; - GElf_Shdr sh; - - idx++; - if (gelf_getshdr(scn, &sh) != &sh) { - pr_warn("failed to get section(%d) header from %s\n", - idx, obj->path); - return -EIO; - } - - sec_name = elf_strptr(elf, ep->e_shstrndx, sh.sh_name); - if (!sec_name) { - pr_warn("failed to get section(%d) name from %s\n", - idx, obj->path); - return -EIO; - } - - if (strcmp(name, sec_name)) - continue; - - data = elf_getdata(scn, 0); - if (!data) { - pr_warn("failed to get section(%d) data from %s(%s)\n", - idx, name, obj->path); - return -EIO; - } - - *d_size = data->d_size; - return 0; - } - - return -ENOENT; -} - int bpf_object__section_size(const struct bpf_object *obj, const char *name, __u32 *size) { int ret = -ENOENT; - size_t d_size; *size = 0; if (!name) { @@ -1237,9 +1208,13 @@ int bpf_object__section_size(const struct bpf_object *obj, const char *name, if (obj->efile.st_ops_data) *size = obj->efile.st_ops_data->d_size; } else { - ret = bpf_object_search_section_size(obj, name, &d_size); - if (!ret) - *size = d_size; + Elf_Scn *scn = elf_sec_by_name(obj, name); + Elf_Data *data = elf_sec_data(obj, scn); + + if (data) { + ret = 0; /* found it */ + *size = data->d_size; + } } return *size ? 
0 : ret; @@ -1264,8 +1239,7 @@ int bpf_object__variable_offset(const struct bpf_object *obj, const char *name, GELF_ST_TYPE(sym.st_info) != STT_OBJECT) continue; - sname = elf_strptr(obj->efile.elf, obj->efile.strtabidx, - sym.st_name); + sname = elf_sym_str(obj, sym.st_name); if (!sname) { pr_warn("failed to get sym name string for var %s\n", name); @@ -1290,7 +1264,7 @@ static struct bpf_map *bpf_object__add_map(struct bpf_object *obj) return &obj->maps[obj->nr_maps++]; new_cap = max((size_t)4, obj->maps_cap * 3 / 2); - new_maps = realloc(obj->maps, new_cap * sizeof(*obj->maps)); + new_maps = libbpf_reallocarray(obj->maps, new_cap, sizeof(*obj->maps)); if (!new_maps) { pr_warn("alloc maps for object failed\n"); return ERR_PTR(-ENOMEM); @@ -1742,12 +1716,12 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict) if (!symbols) return -EINVAL; - scn = elf_getscn(obj->efile.elf, obj->efile.maps_shndx); - if (scn) - data = elf_getdata(scn, NULL); + + scn = elf_sec_by_idx(obj, obj->efile.maps_shndx); + data = elf_sec_data(obj, scn); if (!scn || !data) { - pr_warn("failed to get Elf_Data from map section %d\n", - obj->efile.maps_shndx); + pr_warn("elf: failed to get legacy map definitions for %s\n", + obj->path); return -EINVAL; } @@ -1769,12 +1743,12 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict) nr_maps++; } /* Assume equally sized map definitions */ - pr_debug("maps in %s: %d maps in %zd bytes\n", - obj->path, nr_maps, data->d_size); + pr_debug("elf: found %d legacy map definitions (%zd bytes) in %s\n", + nr_maps, data->d_size, obj->path); if (!data->d_size || nr_maps == 0 || (data->d_size % nr_maps) != 0) { - pr_warn("unable to determine map definition size section %s, %d maps in %zd bytes\n", - obj->path, nr_maps, data->d_size); + pr_warn("elf: unable to determine legacy map definition size in %s\n", + obj->path); return -EINVAL; } map_def_sz = data->d_size / nr_maps; @@ -1795,8 +1769,7 @@ static int bpf_object__init_user_maps(struct bpf_object *obj, bool strict) if (IS_ERR(map)) return PTR_ERR(map); - map_name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, - sym.st_name); + map_name = elf_sym_str(obj, sym.st_name); if (!map_name) { pr_warn("failed to get map #%d name sym string for obj %s\n", i, obj->path); @@ -1884,6 +1857,29 @@ resolve_func_ptr(const struct btf *btf, __u32 id, __u32 *res_id) return btf_is_func_proto(t) ? t : NULL; } +static const char *btf_kind_str(const struct btf_type *t) +{ + switch (btf_kind(t)) { + case BTF_KIND_UNKN: return "void"; + case BTF_KIND_INT: return "int"; + case BTF_KIND_PTR: return "ptr"; + case BTF_KIND_ARRAY: return "array"; + case BTF_KIND_STRUCT: return "struct"; + case BTF_KIND_UNION: return "union"; + case BTF_KIND_ENUM: return "enum"; + case BTF_KIND_FWD: return "fwd"; + case BTF_KIND_TYPEDEF: return "typedef"; + case BTF_KIND_VOLATILE: return "volatile"; + case BTF_KIND_CONST: return "const"; + case BTF_KIND_RESTRICT: return "restrict"; + case BTF_KIND_FUNC: return "func"; + case BTF_KIND_FUNC_PROTO: return "func_proto"; + case BTF_KIND_VAR: return "var"; + case BTF_KIND_DATASEC: return "datasec"; + default: return "unknown"; + } +} + /* * Fetch integer attribute of BTF map definition. 
Such attributes are * represented using a pointer to an array, in which dimensionality of array @@ -1900,8 +1896,8 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf, const struct btf_type *arr_t; if (!btf_is_ptr(t)) { - pr_warn("map '%s': attr '%s': expected PTR, got %u.\n", - map_name, name, btf_kind(t)); + pr_warn("map '%s': attr '%s': expected PTR, got %s.\n", + map_name, name, btf_kind_str(t)); return false; } @@ -1912,8 +1908,8 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf, return false; } if (!btf_is_array(arr_t)) { - pr_warn("map '%s': attr '%s': expected ARRAY, got %u.\n", - map_name, name, btf_kind(arr_t)); + pr_warn("map '%s': attr '%s': expected ARRAY, got %s.\n", + map_name, name, btf_kind_str(arr_t)); return false; } arr_info = btf_array(arr_t); @@ -1924,7 +1920,7 @@ static bool get_map_field_int(const char *map_name, const struct btf *btf, static int build_map_pin_path(struct bpf_map *map, const char *path) { char buf[PATH_MAX]; - int err, len; + int len; if (!path) path = "/sys/fs/bpf"; @@ -1935,11 +1931,7 @@ static int build_map_pin_path(struct bpf_map *map, const char *path) else if (len >= PATH_MAX) return -ENAMETOOLONG; - err = bpf_map__set_pin_path(map, buf); - if (err) - return err; - - return 0; + return bpf_map__set_pin_path(map, buf); } @@ -2007,8 +1999,8 @@ static int parse_btf_map_def(struct bpf_object *obj, return -EINVAL; } if (!btf_is_ptr(t)) { - pr_warn("map '%s': key spec is not PTR: %u.\n", - map->name, btf_kind(t)); + pr_warn("map '%s': key spec is not PTR: %s.\n", + map->name, btf_kind_str(t)); return -EINVAL; } sz = btf__resolve_size(obj->btf, t->type); @@ -2049,8 +2041,8 @@ static int parse_btf_map_def(struct bpf_object *obj, return -EINVAL; } if (!btf_is_ptr(t)) { - pr_warn("map '%s': value spec is not PTR: %u.\n", - map->name, btf_kind(t)); + pr_warn("map '%s': value spec is not PTR: %s.\n", + map->name, btf_kind_str(t)); return -EINVAL; } sz = btf__resolve_size(obj->btf, t->type); @@ -2107,14 +2099,14 @@ static int parse_btf_map_def(struct bpf_object *obj, t = skip_mods_and_typedefs(obj->btf, btf_array(t)->type, NULL); if (!btf_is_ptr(t)) { - pr_warn("map '%s': map-in-map inner def is of unexpected kind %u.\n", - map->name, btf_kind(t)); + pr_warn("map '%s': map-in-map inner def is of unexpected kind %s.\n", + map->name, btf_kind_str(t)); return -EINVAL; } t = skip_mods_and_typedefs(obj->btf, t->type, NULL); if (!btf_is_struct(t)) { - pr_warn("map '%s': map-in-map inner def is of unexpected kind %u.\n", - map->name, btf_kind(t)); + pr_warn("map '%s': map-in-map inner def is of unexpected kind %s.\n", + map->name, btf_kind_str(t)); return -EINVAL; } @@ -2205,8 +2197,8 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj, return -EINVAL; } if (!btf_is_var(var)) { - pr_warn("map '%s': unexpected var kind %u.\n", - map_name, btf_kind(var)); + pr_warn("map '%s': unexpected var kind %s.\n", + map_name, btf_kind_str(var)); return -EINVAL; } if (var_extra->linkage != BTF_VAR_GLOBAL_ALLOCATED && @@ -2218,8 +2210,8 @@ static int bpf_object__init_user_btf_map(struct bpf_object *obj, def = skip_mods_and_typedefs(obj->btf, var->type, NULL); if (!btf_is_struct(def)) { - pr_warn("map '%s': unexpected def kind %u.\n", - map_name, btf_kind(var)); + pr_warn("map '%s': unexpected def kind %s.\n", + map_name, btf_kind_str(var)); return -EINVAL; } if (def->size > vi->size) { @@ -2259,12 +2251,11 @@ static int bpf_object__init_user_btf_maps(struct bpf_object *obj, bool strict, if 
(obj->efile.btf_maps_shndx < 0) return 0; - scn = elf_getscn(obj->efile.elf, obj->efile.btf_maps_shndx); - if (scn) - data = elf_getdata(scn, NULL); + scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx); + data = elf_sec_data(obj, scn); if (!scn || !data) { - pr_warn("failed to get Elf_Data from map section %d (%s)\n", - obj->efile.btf_maps_shndx, MAPS_ELF_SEC); + pr_warn("elf: failed to get %s map definitions for %s\n", + MAPS_ELF_SEC, obj->path); return -EINVAL; } @@ -2322,36 +2313,28 @@ static int bpf_object__init_maps(struct bpf_object *obj, static bool section_have_execinstr(struct bpf_object *obj, int idx) { - Elf_Scn *scn; GElf_Shdr sh; - scn = elf_getscn(obj->efile.elf, idx); - if (!scn) - return false; - - if (gelf_getshdr(scn, &sh) != &sh) + if (elf_sec_hdr(obj, elf_sec_by_idx(obj, idx), &sh)) return false; - if (sh.sh_flags & SHF_EXECINSTR) - return true; - - return false; + return sh.sh_flags & SHF_EXECINSTR; } static bool btf_needs_sanitization(struct bpf_object *obj) { - bool has_func_global = obj->caps.btf_func_global; - bool has_datasec = obj->caps.btf_datasec; - bool has_func = obj->caps.btf_func; + bool has_func_global = kernel_supports(FEAT_BTF_GLOBAL_FUNC); + bool has_datasec = kernel_supports(FEAT_BTF_DATASEC); + bool has_func = kernel_supports(FEAT_BTF_FUNC); return !has_func || !has_datasec || !has_func_global; } static void bpf_object__sanitize_btf(struct bpf_object *obj, struct btf *btf) { - bool has_func_global = obj->caps.btf_func_global; - bool has_datasec = obj->caps.btf_datasec; - bool has_func = obj->caps.btf_func; + bool has_func_global = kernel_supports(FEAT_BTF_GLOBAL_FUNC); + bool has_datasec = kernel_supports(FEAT_BTF_DATASEC); + bool has_func = kernel_supports(FEAT_BTF_FUNC); struct btf_type *t; int i, j, vlen; @@ -2499,7 +2482,7 @@ static int bpf_object__load_vmlinux_btf(struct bpf_object *obj) int err; /* CO-RE relocations need kernel BTF */ - if (obj->btf_ext && obj->btf_ext->field_reloc_info.len) + if (obj->btf_ext && obj->btf_ext->core_relo_info.len) need_vmlinux_btf = true; bpf_object__for_each_program(prog, obj) { @@ -2533,6 +2516,15 @@ static int bpf_object__sanitize_and_load_btf(struct bpf_object *obj) if (!obj->btf) return 0; + if (!kernel_supports(FEAT_BTF)) { + if (kernel_needs_btf(obj)) { + err = -EOPNOTSUPP; + goto report; + } + pr_debug("Kernel doesn't support BTF, skipping uploading it.\n"); + return 0; + } + sanitize = btf_needs_sanitization(obj); if (sanitize) { const void *raw_data; @@ -2558,6 +2550,7 @@ static int bpf_object__sanitize_and_load_btf(struct bpf_object *obj) } btf__free(kern_btf); } +report: if (err) { btf_mandatory = kernel_needs_btf(obj); pr_warn("Error loading .BTF into kernel: %d. 
%s\n", err, @@ -2569,61 +2562,199 @@ static int bpf_object__sanitize_and_load_btf(struct bpf_object *obj) return err; } +static const char *elf_sym_str(const struct bpf_object *obj, size_t off) +{ + const char *name; + + name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, off); + if (!name) { + pr_warn("elf: failed to get section name string at offset %zu from %s: %s\n", + off, obj->path, elf_errmsg(-1)); + return NULL; + } + + return name; +} + +static const char *elf_sec_str(const struct bpf_object *obj, size_t off) +{ + const char *name; + + name = elf_strptr(obj->efile.elf, obj->efile.shstrndx, off); + if (!name) { + pr_warn("elf: failed to get section name string at offset %zu from %s: %s\n", + off, obj->path, elf_errmsg(-1)); + return NULL; + } + + return name; +} + +static Elf_Scn *elf_sec_by_idx(const struct bpf_object *obj, size_t idx) +{ + Elf_Scn *scn; + + scn = elf_getscn(obj->efile.elf, idx); + if (!scn) { + pr_warn("elf: failed to get section(%zu) from %s: %s\n", + idx, obj->path, elf_errmsg(-1)); + return NULL; + } + return scn; +} + +static Elf_Scn *elf_sec_by_name(const struct bpf_object *obj, const char *name) +{ + Elf_Scn *scn = NULL; + Elf *elf = obj->efile.elf; + const char *sec_name; + + while ((scn = elf_nextscn(elf, scn)) != NULL) { + sec_name = elf_sec_name(obj, scn); + if (!sec_name) + return NULL; + + if (strcmp(sec_name, name) != 0) + continue; + + return scn; + } + return NULL; +} + +static int elf_sec_hdr(const struct bpf_object *obj, Elf_Scn *scn, GElf_Shdr *hdr) +{ + if (!scn) + return -EINVAL; + + if (gelf_getshdr(scn, hdr) != hdr) { + pr_warn("elf: failed to get section(%zu) header from %s: %s\n", + elf_ndxscn(scn), obj->path, elf_errmsg(-1)); + return -EINVAL; + } + + return 0; +} + +static const char *elf_sec_name(const struct bpf_object *obj, Elf_Scn *scn) +{ + const char *name; + GElf_Shdr sh; + + if (!scn) + return NULL; + + if (elf_sec_hdr(obj, scn, &sh)) + return NULL; + + name = elf_sec_str(obj, sh.sh_name); + if (!name) { + pr_warn("elf: failed to get section(%zu) name from %s: %s\n", + elf_ndxscn(scn), obj->path, elf_errmsg(-1)); + return NULL; + } + + return name; +} + +static Elf_Data *elf_sec_data(const struct bpf_object *obj, Elf_Scn *scn) +{ + Elf_Data *data; + + if (!scn) + return NULL; + + data = elf_getdata(scn, 0); + if (!data) { + pr_warn("elf: failed to get section(%zu) %s data from %s: %s\n", + elf_ndxscn(scn), elf_sec_name(obj, scn) ?: "<?>", + obj->path, elf_errmsg(-1)); + return NULL; + } + + return data; +} + +static bool is_sec_name_dwarf(const char *name) +{ + /* approximation, but the actual list is too long */ + return strncmp(name, ".debug_", sizeof(".debug_") - 1) == 0; +} + +static bool ignore_elf_section(GElf_Shdr *hdr, const char *name) +{ + /* no special handling of .strtab */ + if (hdr->sh_type == SHT_STRTAB) + return true; + + /* ignore .llvm_addrsig section as well */ + if (hdr->sh_type == 0x6FFF4C03 /* SHT_LLVM_ADDRSIG */) + return true; + + /* no subprograms will lead to an empty .text section, ignore it */ + if (hdr->sh_type == SHT_PROGBITS && hdr->sh_size == 0 && + strcmp(name, ".text") == 0) + return true; + + /* DWARF sections */ + if (is_sec_name_dwarf(name)) + return true; + + if (strncmp(name, ".rel", sizeof(".rel") - 1) == 0) { + name += sizeof(".rel") - 1; + /* DWARF section relocations */ + if (is_sec_name_dwarf(name)) + return true; + + /* .BTF and .BTF.ext don't need relocations */ + if (strcmp(name, BTF_ELF_SEC) == 0 || + strcmp(name, BTF_EXT_ELF_SEC) == 0) + return true; + } + + return false; +} + 
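The new elf_sym_str()/elf_sec_*() wrappers fold libelf's repetitive error handling and reporting into one place, and ignore_elf_section() lets the collector below silently skip sections a BPF loader has no use for: string tables, .llvm_addrsig, DWARF data and its relocations, and an empty .text. The standalone sketch below mirrors that iteration pattern with plain libelf; it is illustrative only and not part of this patch, and the file handling, the gcc invocation and the reduced skip rules are assumptions.

/* Illustrative only: walk an ELF object's sections the way the collector
 * below does, skipping the kinds of sections ignore_elf_section() filters.
 * Build (assumed): gcc demo.c -lelf
 */
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <gelf.h>

static bool skip_section(const GElf_Shdr *sh, const char *name)
{
	if (sh->sh_type == SHT_STRTAB)				/* string tables */
		return true;
	if (sh->sh_type == 0x6FFF4C03)				/* SHT_LLVM_ADDRSIG */
		return true;
	if (strncmp(name, ".debug_", sizeof(".debug_") - 1) == 0)	/* DWARF */
		return true;
	return false;
}

int main(int argc, char **argv)
{
	Elf_Scn *scn = NULL;
	size_t shstrndx;
	GElf_Shdr sh;
	Elf *elf;
	int fd;

	if (argc < 2 || elf_version(EV_CURRENT) == EV_NONE)
		return 1;

	fd = open(argv[1], O_RDONLY);
	if (fd < 0)
		return 1;
	elf = elf_begin(fd, ELF_C_READ, NULL);
	if (!elf || elf_getshdrstrndx(elf, &shstrndx))
		return 1;

	while ((scn = elf_nextscn(elf, scn)) != NULL) {
		const char *name;

		if (gelf_getshdr(scn, &sh) != &sh)
			continue;
		name = elf_strptr(elf, shstrndx, sh.sh_name) ?: "<?>";
		printf("%-24s %s\n", name,
		       skip_section(&sh, name) ? "skipped" : "collected");
	}

	elf_end(elf);
	return 0;
}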
static int bpf_object__elf_collect(struct bpf_object *obj) { Elf *elf = obj->efile.elf; - GElf_Ehdr *ep = &obj->efile.ehdr; Elf_Data *btf_ext_data = NULL; Elf_Data *btf_data = NULL; Elf_Scn *scn = NULL; int idx = 0, err = 0; - /* Elf is corrupted/truncated, avoid calling elf_strptr. */ - if (!elf_rawdata(elf_getscn(elf, ep->e_shstrndx), NULL)) { - pr_warn("failed to get e_shstrndx from %s\n", obj->path); - return -LIBBPF_ERRNO__FORMAT; - } - while ((scn = elf_nextscn(elf, scn)) != NULL) { - char *name; + const char *name; GElf_Shdr sh; Elf_Data *data; idx++; - if (gelf_getshdr(scn, &sh) != &sh) { - pr_warn("failed to get section(%d) header from %s\n", - idx, obj->path); + + if (elf_sec_hdr(obj, scn, &sh)) return -LIBBPF_ERRNO__FORMAT; - } - name = elf_strptr(elf, ep->e_shstrndx, sh.sh_name); - if (!name) { - pr_warn("failed to get section(%d) name from %s\n", - idx, obj->path); + name = elf_sec_str(obj, sh.sh_name); + if (!name) return -LIBBPF_ERRNO__FORMAT; - } - data = elf_getdata(scn, 0); - if (!data) { - pr_warn("failed to get section(%d) data from %s(%s)\n", - idx, name, obj->path); + if (ignore_elf_section(&sh, name)) + continue; + + data = elf_sec_data(obj, scn); + if (!data) return -LIBBPF_ERRNO__FORMAT; - } - pr_debug("section(%d) %s, size %ld, link %d, flags %lx, type=%d\n", + + pr_debug("elf: section(%d) %s, size %ld, link %d, flags %lx, type=%d\n", idx, name, (unsigned long)data->d_size, (int)sh.sh_link, (unsigned long)sh.sh_flags, (int)sh.sh_type); if (strcmp(name, "license") == 0) { - err = bpf_object__init_license(obj, - data->d_buf, - data->d_size); + err = bpf_object__init_license(obj, data->d_buf, data->d_size); if (err) return err; } else if (strcmp(name, "version") == 0) { - err = bpf_object__init_kversion(obj, - data->d_buf, - data->d_size); + err = bpf_object__init_kversion(obj, data->d_buf, data->d_size); if (err) return err; } else if (strcmp(name, "maps") == 0) { @@ -2636,8 +2767,7 @@ static int bpf_object__elf_collect(struct bpf_object *obj) btf_ext_data = data; } else if (sh.sh_type == SHT_SYMTAB) { if (obj->efile.symbols) { - pr_warn("bpf: multiple SYMTAB in %s\n", - obj->path); + pr_warn("elf: multiple symbol tables in %s\n", obj->path); return -LIBBPF_ERRNO__FORMAT; } obj->efile.symbols = data; @@ -2650,16 +2780,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj) err = bpf_object__add_program(obj, data->d_buf, data->d_size, name, idx); - if (err) { - char errmsg[STRERR_BUFSIZE]; - char *cp; - - cp = libbpf_strerror_r(-err, errmsg, - sizeof(errmsg)); - pr_warn("failed to alloc program %s (%s): %s", - name, obj->path, cp); + if (err) return err; - } } else if (strcmp(name, DATA_SEC) == 0) { obj->efile.data = data; obj->efile.data_shndx = idx; @@ -2670,7 +2792,8 @@ static int bpf_object__elf_collect(struct bpf_object *obj) obj->efile.st_ops_data = data; obj->efile.st_ops_shndx = idx; } else { - pr_debug("skip section(%d) %s\n", idx, name); + pr_info("elf: skipping unrecognized data section(%d) %s\n", + idx, name); } } else if (sh.sh_type == SHT_REL) { int nr_sects = obj->efile.nr_reloc_sects; @@ -2681,34 +2804,33 @@ static int bpf_object__elf_collect(struct bpf_object *obj) if (!section_have_execinstr(obj, sec) && strcmp(name, ".rel" STRUCT_OPS_SEC) && strcmp(name, ".rel" MAPS_ELF_SEC)) { - pr_debug("skip relo %s(%d) for section(%d)\n", - name, idx, sec); + pr_info("elf: skipping relo section(%d) %s for section(%d) %s\n", + idx, name, sec, + elf_sec_name(obj, elf_sec_by_idx(obj, sec)) ?: "<?>"); continue; } - sects = reallocarray(sects, nr_sects + 1, 
- sizeof(*obj->efile.reloc_sects)); - if (!sects) { - pr_warn("reloc_sects realloc failed\n"); + sects = libbpf_reallocarray(sects, nr_sects + 1, + sizeof(*obj->efile.reloc_sects)); + if (!sects) return -ENOMEM; - } obj->efile.reloc_sects = sects; obj->efile.nr_reloc_sects++; obj->efile.reloc_sects[nr_sects].shdr = sh; obj->efile.reloc_sects[nr_sects].data = data; - } else if (sh.sh_type == SHT_NOBITS && - strcmp(name, BSS_SEC) == 0) { + } else if (sh.sh_type == SHT_NOBITS && strcmp(name, BSS_SEC) == 0) { obj->efile.bss = data; obj->efile.bss_shndx = idx; } else { - pr_debug("skip section(%d) %s\n", idx, name); + pr_info("elf: skipping section(%d) %s (size %zu)\n", idx, name, + (size_t)sh.sh_size); } } if (!obj->efile.strtabidx || obj->efile.strtabidx > idx) { - pr_warn("Corrupted ELF file: index of strtab invalid\n"); + pr_warn("elf: symbol strings section missing or invalid in %s\n", obj->path); return -LIBBPF_ERRNO__FORMAT; } return bpf_object__init_btf(obj, btf_data, btf_ext_data); @@ -2869,14 +2991,13 @@ static int bpf_object__collect_externs(struct bpf_object *obj) if (!obj->efile.symbols) return 0; - scn = elf_getscn(obj->efile.elf, obj->efile.symbols_shndx); - if (!scn) + scn = elf_sec_by_idx(obj, obj->efile.symbols_shndx); + if (elf_sec_hdr(obj, scn, &sh)) return -LIBBPF_ERRNO__FORMAT; - if (gelf_getshdr(scn, &sh) != &sh) - return -LIBBPF_ERRNO__FORMAT; - n = sh.sh_size / sh.sh_entsize; + n = sh.sh_size / sh.sh_entsize; pr_debug("looking for externs among %d symbols...\n", n); + for (i = 0; i < n; i++) { GElf_Sym sym; @@ -2884,13 +3005,12 @@ static int bpf_object__collect_externs(struct bpf_object *obj) return -LIBBPF_ERRNO__FORMAT; if (!sym_is_extern(&sym)) continue; - ext_name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, - sym.st_name); + ext_name = elf_sym_str(obj, sym.st_name); if (!ext_name || !ext_name[0]) continue; ext = obj->externs; - ext = reallocarray(ext, obj->nr_extern + 1, sizeof(*ext)); + ext = libbpf_reallocarray(ext, obj->nr_extern + 1, sizeof(*ext)); if (!ext) return -ENOMEM; obj->externs = ext; @@ -3109,7 +3229,7 @@ bpf_object__section_to_libbpf_map_type(const struct bpf_object *obj, int shndx) static int bpf_program__record_reloc(struct bpf_program *prog, struct reloc_desc *reloc_desc, - __u32 insn_idx, const char *name, + __u32 insn_idx, const char *sym_name, const GElf_Sym *sym, const GElf_Rel *rel) { struct bpf_insn *insn = &prog->insns[insn_idx]; @@ -3117,22 +3237,25 @@ static int bpf_program__record_reloc(struct bpf_program *prog, struct bpf_object *obj = prog->obj; __u32 shdr_idx = sym->st_shndx; enum libbpf_map_type type; + const char *sym_sec_name; struct bpf_map *map; /* sub-program call relocation */ if (insn->code == (BPF_JMP | BPF_CALL)) { if (insn->src_reg != BPF_PSEUDO_CALL) { - pr_warn("incorrect bpf_call opcode\n"); + pr_warn("prog '%s': incorrect bpf_call opcode\n", prog->name); return -LIBBPF_ERRNO__RELOC; } /* text_shndx can be 0, if no default "main" program exists */ if (!shdr_idx || shdr_idx != obj->efile.text_shndx) { - pr_warn("bad call relo against section %u\n", shdr_idx); + sym_sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, shdr_idx)); + pr_warn("prog '%s': bad call relo against '%s' in section '%s'\n", + prog->name, sym_name, sym_sec_name); return -LIBBPF_ERRNO__RELOC; } - if (sym->st_value % 8) { - pr_warn("bad call relo offset: %zu\n", - (size_t)sym->st_value); + if (sym->st_value % BPF_INSN_SZ) { + pr_warn("prog '%s': bad call relo against '%s' at offset %zu\n", + prog->name, sym_name, (size_t)sym->st_value); return 
-LIBBPF_ERRNO__RELOC; } reloc_desc->type = RELO_CALL; @@ -3143,8 +3266,8 @@ static int bpf_program__record_reloc(struct bpf_program *prog, } if (insn->code != (BPF_LD | BPF_IMM | BPF_DW)) { - pr_warn("invalid relo for insns[%d].code 0x%x\n", - insn_idx, insn->code); + pr_warn("prog '%s': invalid relo against '%s' for insns[%d].code 0x%x\n", + prog->name, sym_name, insn_idx, insn->code); return -LIBBPF_ERRNO__RELOC; } @@ -3159,12 +3282,12 @@ static int bpf_program__record_reloc(struct bpf_program *prog, break; } if (i >= n) { - pr_warn("extern relo failed to find extern for sym %d\n", - sym_idx); + pr_warn("prog '%s': extern relo failed to find extern for '%s' (%d)\n", + prog->name, sym_name, sym_idx); return -LIBBPF_ERRNO__RELOC; } - pr_debug("found extern #%d '%s' (sym %d) for insn %u\n", - i, ext->name, ext->sym_idx, insn_idx); + pr_debug("prog '%s': found extern #%d '%s' (sym %d) for insn #%u\n", + prog->name, i, ext->name, ext->sym_idx, insn_idx); reloc_desc->type = RELO_EXTERN; reloc_desc->insn_idx = insn_idx; reloc_desc->sym_off = i; /* sym_off stores extern index */ @@ -3172,18 +3295,19 @@ static int bpf_program__record_reloc(struct bpf_program *prog, } if (!shdr_idx || shdr_idx >= SHN_LORESERVE) { - pr_warn("invalid relo for \'%s\' in special section 0x%x; forgot to initialize global var?..\n", - name, shdr_idx); + pr_warn("prog '%s': invalid relo against '%s' in special section 0x%x; forgot to initialize global var?..\n", + prog->name, sym_name, shdr_idx); return -LIBBPF_ERRNO__RELOC; } type = bpf_object__section_to_libbpf_map_type(obj, shdr_idx); + sym_sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, shdr_idx)); /* generic map reference relocation */ if (type == LIBBPF_MAP_UNSPEC) { if (!bpf_object__shndx_is_maps(obj, shdr_idx)) { - pr_warn("bad map relo against section %u\n", - shdr_idx); + pr_warn("prog '%s': bad map relo against '%s' in section '%s'\n", + prog->name, sym_name, sym_sec_name); return -LIBBPF_ERRNO__RELOC; } for (map_idx = 0; map_idx < nr_maps; map_idx++) { @@ -3192,14 +3316,14 @@ static int bpf_program__record_reloc(struct bpf_program *prog, map->sec_idx != sym->st_shndx || map->sec_offset != sym->st_value) continue; - pr_debug("found map %zd (%s, sec %d, off %zu) for insn %u\n", - map_idx, map->name, map->sec_idx, + pr_debug("prog '%s': found map %zd (%s, sec %d, off %zu) for insn #%u\n", + prog->name, map_idx, map->name, map->sec_idx, map->sec_offset, insn_idx); break; } if (map_idx >= nr_maps) { - pr_warn("map relo failed to find map for sec %u, off %zu\n", - shdr_idx, (size_t)sym->st_value); + pr_warn("prog '%s': map relo failed to find map for section '%s', off %zu\n", + prog->name, sym_sec_name, (size_t)sym->st_value); return -LIBBPF_ERRNO__RELOC; } reloc_desc->type = RELO_LD64; @@ -3211,21 +3335,22 @@ static int bpf_program__record_reloc(struct bpf_program *prog, /* global data map relocation */ if (!bpf_object__shndx_is_data(obj, shdr_idx)) { - pr_warn("bad data relo against section %u\n", shdr_idx); + pr_warn("prog '%s': bad data relo against section '%s'\n", + prog->name, sym_sec_name); return -LIBBPF_ERRNO__RELOC; } for (map_idx = 0; map_idx < nr_maps; map_idx++) { map = &obj->maps[map_idx]; if (map->libbpf_type != type) continue; - pr_debug("found data map %zd (%s, sec %d, off %zu) for insn %u\n", - map_idx, map->name, map->sec_idx, map->sec_offset, - insn_idx); + pr_debug("prog '%s': found data map %zd (%s, sec %d, off %zu) for insn %u\n", + prog->name, map_idx, map->name, map->sec_idx, + map->sec_offset, insn_idx); break; } if (map_idx >= nr_maps) 
{ - pr_warn("data relo failed to find map for sec %u\n", - shdr_idx); + pr_warn("prog '%s': data relo failed to find map for section '%s'\n", + prog->name, sym_sec_name); return -LIBBPF_ERRNO__RELOC; } @@ -3241,9 +3366,17 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr, Elf_Data *data, struct bpf_object *obj) { Elf_Data *symbols = obj->efile.symbols; + const char *relo_sec_name, *sec_name; + size_t sec_idx = shdr->sh_info; int err, i, nrels; - pr_debug("collecting relocating info for: '%s'\n", prog->section_name); + relo_sec_name = elf_sec_str(obj, shdr->sh_name); + sec_name = elf_sec_name(obj, elf_sec_by_idx(obj, sec_idx)); + if (!relo_sec_name || !sec_name) + return -EINVAL; + + pr_debug("sec '%s': collecting relocation for section(%zu) '%s'\n", + relo_sec_name, sec_idx, sec_name); nrels = shdr->sh_size / shdr->sh_entsize; prog->reloc_desc = malloc(sizeof(*prog->reloc_desc) * nrels); @@ -3254,35 +3387,34 @@ bpf_program__collect_reloc(struct bpf_program *prog, GElf_Shdr *shdr, prog->nr_reloc = nrels; for (i = 0; i < nrels; i++) { - const char *name; + const char *sym_name; __u32 insn_idx; GElf_Sym sym; GElf_Rel rel; if (!gelf_getrel(data, i, &rel)) { - pr_warn("relocation: failed to get %d reloc\n", i); + pr_warn("sec '%s': failed to get relo #%d\n", relo_sec_name, i); return -LIBBPF_ERRNO__FORMAT; } if (!gelf_getsym(symbols, GELF_R_SYM(rel.r_info), &sym)) { - pr_warn("relocation: symbol %"PRIx64" not found\n", - GELF_R_SYM(rel.r_info)); + pr_warn("sec '%s': symbol 0x%zx not found for relo #%d\n", + relo_sec_name, (size_t)GELF_R_SYM(rel.r_info), i); return -LIBBPF_ERRNO__FORMAT; } - if (rel.r_offset % sizeof(struct bpf_insn)) + if (rel.r_offset % BPF_INSN_SZ) { + pr_warn("sec '%s': invalid offset 0x%zx for relo #%d\n", + relo_sec_name, (size_t)GELF_R_SYM(rel.r_info), i); return -LIBBPF_ERRNO__FORMAT; + } - insn_idx = rel.r_offset / sizeof(struct bpf_insn); - name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, - sym.st_name) ? 
: "<?>"; + insn_idx = rel.r_offset / BPF_INSN_SZ; + sym_name = elf_sym_str(obj, sym.st_name) ?: "<?>"; - pr_debug("relo for shdr %u, symb %zu, value %zu, type %d, bind %d, name %d (\'%s\'), insn %u\n", - (__u32)sym.st_shndx, (size_t)GELF_R_SYM(rel.r_info), - (size_t)sym.st_value, GELF_ST_TYPE(sym.st_info), - GELF_ST_BIND(sym.st_info), sym.st_name, name, - insn_idx); + pr_debug("sec '%s': relo #%d: insn #%u against '%s'\n", + relo_sec_name, i, insn_idx, sym_name); err = bpf_program__record_reloc(prog, &prog->reloc_desc[i], - insn_idx, name, &sym, &rel); + insn_idx, sym_name, &sym, &rel); if (err) return err; } @@ -3433,8 +3565,14 @@ bpf_object__probe_loading(struct bpf_object *obj) return 0; } -static int -bpf_object__probe_name(struct bpf_object *obj) +static int probe_fd(int fd) +{ + if (fd >= 0) + close(fd); + return fd >= 0; +} + +static int probe_kern_prog_name(void) { struct bpf_load_program_attr attr; struct bpf_insn insns[] = { @@ -3452,16 +3590,10 @@ bpf_object__probe_name(struct bpf_object *obj) attr.license = "GPL"; attr.name = "test"; ret = bpf_load_program_xattr(&attr, NULL, 0); - if (ret >= 0) { - obj->caps.name = 1; - close(ret); - } - - return 0; + return probe_fd(ret); } -static int -bpf_object__probe_global_data(struct bpf_object *obj) +static int probe_kern_global_data(void) { struct bpf_load_program_attr prg_attr; struct bpf_create_map_attr map_attr; @@ -3498,16 +3630,23 @@ bpf_object__probe_global_data(struct bpf_object *obj) prg_attr.license = "GPL"; ret = bpf_load_program_xattr(&prg_attr, NULL, 0); - if (ret >= 0) { - obj->caps.global_data = 1; - close(ret); - } - close(map); - return 0; + return probe_fd(ret); +} + +static int probe_kern_btf(void) +{ + static const char strs[] = "\0int"; + __u32 types[] = { + /* int */ + BTF_TYPE_INT_ENC(1, BTF_INT_SIGNED, 0, 32, 4), + }; + + return probe_fd(libbpf__load_raw_btf((char *)types, sizeof(types), + strs, sizeof(strs))); } -static int bpf_object__probe_btf_func(struct bpf_object *obj) +static int probe_kern_btf_func(void) { static const char strs[] = "\0int\0x\0a"; /* void x(int a) {} */ @@ -3520,20 +3659,12 @@ static int bpf_object__probe_btf_func(struct bpf_object *obj) /* FUNC x */ /* [3] */ BTF_TYPE_ENC(5, BTF_INFO_ENC(BTF_KIND_FUNC, 0, 0), 2), }; - int btf_fd; - btf_fd = libbpf__load_raw_btf((char *)types, sizeof(types), - strs, sizeof(strs)); - if (btf_fd >= 0) { - obj->caps.btf_func = 1; - close(btf_fd); - return 1; - } - - return 0; + return probe_fd(libbpf__load_raw_btf((char *)types, sizeof(types), + strs, sizeof(strs))); } -static int bpf_object__probe_btf_func_global(struct bpf_object *obj) +static int probe_kern_btf_func_global(void) { static const char strs[] = "\0int\0x\0a"; /* static void x(int a) {} */ @@ -3546,20 +3677,12 @@ static int bpf_object__probe_btf_func_global(struct bpf_object *obj) /* FUNC x BTF_FUNC_GLOBAL */ /* [3] */ BTF_TYPE_ENC(5, BTF_INFO_ENC(BTF_KIND_FUNC, 0, BTF_FUNC_GLOBAL), 2), }; - int btf_fd; - btf_fd = libbpf__load_raw_btf((char *)types, sizeof(types), - strs, sizeof(strs)); - if (btf_fd >= 0) { - obj->caps.btf_func_global = 1; - close(btf_fd); - return 1; - } - - return 0; + return probe_fd(libbpf__load_raw_btf((char *)types, sizeof(types), + strs, sizeof(strs))); } -static int bpf_object__probe_btf_datasec(struct bpf_object *obj) +static int probe_kern_btf_datasec(void) { static const char strs[] = "\0x\0.data"; /* static int a; */ @@ -3573,20 +3696,12 @@ static int bpf_object__probe_btf_datasec(struct bpf_object *obj) BTF_TYPE_ENC(3, BTF_INFO_ENC(BTF_KIND_DATASEC, 0, 1), 4), 
BTF_VAR_SECINFO_ENC(2, 0, 4), }; - int btf_fd; - - btf_fd = libbpf__load_raw_btf((char *)types, sizeof(types), - strs, sizeof(strs)); - if (btf_fd >= 0) { - obj->caps.btf_datasec = 1; - close(btf_fd); - return 1; - } - return 0; + return probe_fd(libbpf__load_raw_btf((char *)types, sizeof(types), + strs, sizeof(strs))); } -static int bpf_object__probe_array_mmap(struct bpf_object *obj) +static int probe_kern_array_mmap(void) { struct bpf_create_map_attr attr = { .map_type = BPF_MAP_TYPE_ARRAY, @@ -3595,27 +3710,17 @@ static int bpf_object__probe_array_mmap(struct bpf_object *obj) .value_size = sizeof(int), .max_entries = 1, }; - int fd; - fd = bpf_create_map_xattr(&attr); - if (fd >= 0) { - obj->caps.array_mmap = 1; - close(fd); - return 1; - } - - return 0; + return probe_fd(bpf_create_map_xattr(&attr)); } -static int -bpf_object__probe_exp_attach_type(struct bpf_object *obj) +static int probe_kern_exp_attach_type(void) { struct bpf_load_program_attr attr; struct bpf_insn insns[] = { BPF_MOV64_IMM(BPF_REG_0, 0), BPF_EXIT_INSN(), }; - int fd; memset(&attr, 0, sizeof(attr)); /* use any valid combination of program type and (optional) @@ -3629,36 +3734,91 @@ bpf_object__probe_exp_attach_type(struct bpf_object *obj) attr.insns_cnt = ARRAY_SIZE(insns); attr.license = "GPL"; - fd = bpf_load_program_xattr(&attr, NULL, 0); - if (fd >= 0) { - obj->caps.exp_attach_type = 1; - close(fd); - return 1; - } - return 0; + return probe_fd(bpf_load_program_xattr(&attr, NULL, 0)); } -static int -bpf_object__probe_caps(struct bpf_object *obj) -{ - int (*probe_fn[])(struct bpf_object *obj) = { - bpf_object__probe_name, - bpf_object__probe_global_data, - bpf_object__probe_btf_func, - bpf_object__probe_btf_func_global, - bpf_object__probe_btf_datasec, - bpf_object__probe_array_mmap, - bpf_object__probe_exp_attach_type, +static int probe_kern_probe_read_kernel(void) +{ + struct bpf_load_program_attr attr; + struct bpf_insn insns[] = { + BPF_MOV64_REG(BPF_REG_1, BPF_REG_10), /* r1 = r10 (fp) */ + BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -8), /* r1 += -8 */ + BPF_MOV64_IMM(BPF_REG_2, 8), /* r2 = 8 */ + BPF_MOV64_IMM(BPF_REG_3, 0), /* r3 = 0 */ + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_probe_read_kernel), + BPF_EXIT_INSN(), }; - int i, ret; - for (i = 0; i < ARRAY_SIZE(probe_fn); i++) { - ret = probe_fn[i](obj); - if (ret < 0) - pr_debug("Probe #%d failed with %d.\n", i, ret); + memset(&attr, 0, sizeof(attr)); + attr.prog_type = BPF_PROG_TYPE_KPROBE; + attr.insns = insns; + attr.insns_cnt = ARRAY_SIZE(insns); + attr.license = "GPL"; + + return probe_fd(bpf_load_program_xattr(&attr, NULL, 0)); +} + +enum kern_feature_result { + FEAT_UNKNOWN = 0, + FEAT_SUPPORTED = 1, + FEAT_MISSING = 2, +}; + +typedef int (*feature_probe_fn)(void); + +static struct kern_feature_desc { + const char *desc; + feature_probe_fn probe; + enum kern_feature_result res; +} feature_probes[__FEAT_CNT] = { + [FEAT_PROG_NAME] = { + "BPF program name", probe_kern_prog_name, + }, + [FEAT_GLOBAL_DATA] = { + "global variables", probe_kern_global_data, + }, + [FEAT_BTF] = { + "minimal BTF", probe_kern_btf, + }, + [FEAT_BTF_FUNC] = { + "BTF functions", probe_kern_btf_func, + }, + [FEAT_BTF_GLOBAL_FUNC] = { + "BTF global function", probe_kern_btf_func_global, + }, + [FEAT_BTF_DATASEC] = { + "BTF data section and variable", probe_kern_btf_datasec, + }, + [FEAT_ARRAY_MMAP] = { + "ARRAY map mmap()", probe_kern_array_mmap, + }, + [FEAT_EXP_ATTACH_TYPE] = { + "BPF_PROG_LOAD expected_attach_type attribute", + probe_kern_exp_attach_type, + }, + 
[FEAT_PROBE_READ_KERN] = { + "bpf_probe_read_kernel() helper", probe_kern_probe_read_kernel, } +}; - return 0; +static bool kernel_supports(enum kern_feature_id feat_id) +{ + struct kern_feature_desc *feat = &feature_probes[feat_id]; + int ret; + + if (READ_ONCE(feat->res) == FEAT_UNKNOWN) { + ret = feat->probe(); + if (ret > 0) { + WRITE_ONCE(feat->res, FEAT_SUPPORTED); + } else if (ret == 0) { + WRITE_ONCE(feat->res, FEAT_MISSING); + } else { + pr_warn("Detection of kernel %s support failed: %d\n", feat->desc, ret); + WRITE_ONCE(feat->res, FEAT_MISSING); + } + } + + return READ_ONCE(feat->res) == FEAT_SUPPORTED; } static bool map_is_reuse_compat(const struct bpf_map *map, int map_fd) @@ -3760,7 +3920,7 @@ static int bpf_object__create_map(struct bpf_object *obj, struct bpf_map *map) memset(&create_attr, 0, sizeof(create_attr)); - if (obj->caps.name) + if (kernel_supports(FEAT_PROG_NAME)) create_attr.name = map->name; create_attr.map_ifindex = map->map_ifindex; create_attr.map_type = def->type; @@ -4011,6 +4171,10 @@ struct bpf_core_spec { const struct btf *btf; /* high-level spec: named fields and array indices only */ struct bpf_core_accessor spec[BPF_CORE_SPEC_MAX_LEN]; + /* original unresolved (no skip_mods_or_typedefs) root type ID */ + __u32 root_type_id; + /* CO-RE relocation kind */ + enum bpf_core_relo_kind relo_kind; /* high-level spec length */ int len; /* raw, low-level spec: 1-to-1 with accessor spec string */ @@ -4041,8 +4205,66 @@ static bool is_flex_arr(const struct btf *btf, return acc->idx == btf_vlen(t) - 1; } +static const char *core_relo_kind_str(enum bpf_core_relo_kind kind) +{ + switch (kind) { + case BPF_FIELD_BYTE_OFFSET: return "byte_off"; + case BPF_FIELD_BYTE_SIZE: return "byte_sz"; + case BPF_FIELD_EXISTS: return "field_exists"; + case BPF_FIELD_SIGNED: return "signed"; + case BPF_FIELD_LSHIFT_U64: return "lshift_u64"; + case BPF_FIELD_RSHIFT_U64: return "rshift_u64"; + case BPF_TYPE_ID_LOCAL: return "local_type_id"; + case BPF_TYPE_ID_TARGET: return "target_type_id"; + case BPF_TYPE_EXISTS: return "type_exists"; + case BPF_TYPE_SIZE: return "type_size"; + case BPF_ENUMVAL_EXISTS: return "enumval_exists"; + case BPF_ENUMVAL_VALUE: return "enumval_value"; + default: return "unknown"; + } +} + +static bool core_relo_is_field_based(enum bpf_core_relo_kind kind) +{ + switch (kind) { + case BPF_FIELD_BYTE_OFFSET: + case BPF_FIELD_BYTE_SIZE: + case BPF_FIELD_EXISTS: + case BPF_FIELD_SIGNED: + case BPF_FIELD_LSHIFT_U64: + case BPF_FIELD_RSHIFT_U64: + return true; + default: + return false; + } +} + +static bool core_relo_is_type_based(enum bpf_core_relo_kind kind) +{ + switch (kind) { + case BPF_TYPE_ID_LOCAL: + case BPF_TYPE_ID_TARGET: + case BPF_TYPE_EXISTS: + case BPF_TYPE_SIZE: + return true; + default: + return false; + } +} + +static bool core_relo_is_enumval_based(enum bpf_core_relo_kind kind) +{ + switch (kind) { + case BPF_ENUMVAL_EXISTS: + case BPF_ENUMVAL_VALUE: + return true; + default: + return false; + } +} + /* - * Turn bpf_field_reloc into a low- and high-level spec representation, + * Turn bpf_core_relo into a low- and high-level spec representation, * validating correctness along the way, as well as calculating resulting * field bit offset, specified by accessor string. 
Low-level spec captures * every single level of nestedness, including traversing anonymous @@ -4071,10 +4293,17 @@ static bool is_flex_arr(const struct btf *btf, * - field 'a' access (corresponds to '2' in low-level spec); * - array element #3 access (corresponds to '3' in low-level spec). * + * Type-based relocations (TYPE_EXISTS/TYPE_SIZE, + * TYPE_ID_LOCAL/TYPE_ID_TARGET) don't capture any field information. Their + * spec and raw_spec are kept empty. + * + * Enum value-based relocations (ENUMVAL_EXISTS/ENUMVAL_VALUE) use access + * string to specify enumerator's value index that need to be relocated. */ -static int bpf_core_spec_parse(const struct btf *btf, +static int bpf_core_parse_spec(const struct btf *btf, __u32 type_id, const char *spec_str, + enum bpf_core_relo_kind relo_kind, struct bpf_core_spec *spec) { int access_idx, parsed_len, i; @@ -4089,6 +4318,15 @@ static int bpf_core_spec_parse(const struct btf *btf, memset(spec, 0, sizeof(*spec)); spec->btf = btf; + spec->root_type_id = type_id; + spec->relo_kind = relo_kind; + + /* type-based relocations don't have a field access string */ + if (core_relo_is_type_based(relo_kind)) { + if (strcmp(spec_str, "0")) + return -EINVAL; + return 0; + } /* parse spec_str="0:1:2:3:4" into array raw_spec=[0, 1, 2, 3, 4] */ while (*spec_str) { @@ -4105,16 +4343,28 @@ static int bpf_core_spec_parse(const struct btf *btf, if (spec->raw_len == 0) return -EINVAL; - /* first spec value is always reloc type array index */ t = skip_mods_and_typedefs(btf, type_id, &id); if (!t) return -EINVAL; access_idx = spec->raw_spec[0]; - spec->spec[0].type_id = id; - spec->spec[0].idx = access_idx; + acc = &spec->spec[0]; + acc->type_id = id; + acc->idx = access_idx; spec->len++; + if (core_relo_is_enumval_based(relo_kind)) { + if (!btf_is_enum(t) || spec->raw_len > 1 || access_idx >= btf_vlen(t)) + return -EINVAL; + + /* record enumerator name in a first accessor */ + acc->name = btf__name_by_offset(btf, btf_enum(t)[access_idx].name_off); + return 0; + } + + if (!core_relo_is_field_based(relo_kind)) + return -EINVAL; + sz = btf__resolve_size(btf, id); if (sz < 0) return sz; @@ -4172,8 +4422,8 @@ static int bpf_core_spec_parse(const struct btf *btf, return sz; spec->bit_offset += access_idx * sz * 8; } else { - pr_warn("relo for [%u] %s (at idx %d) captures type [%d] of unexpected kind %d\n", - type_id, spec_str, i, id, btf_kind(t)); + pr_warn("relo for [%u] %s (at idx %d) captures type [%d] of unexpected kind %s\n", + type_id, spec_str, i, id, btf_kind_str(t)); return -EINVAL; } } @@ -4223,16 +4473,16 @@ static struct ids_vec *bpf_core_find_cands(const struct btf *local_btf, { size_t local_essent_len, targ_essent_len; const char *local_name, *targ_name; - const struct btf_type *t; + const struct btf_type *t, *local_t; struct ids_vec *cand_ids; __u32 *new_ids; int i, err, n; - t = btf__type_by_id(local_btf, local_type_id); - if (!t) + local_t = btf__type_by_id(local_btf, local_type_id); + if (!local_t) return ERR_PTR(-EINVAL); - local_name = btf__name_by_offset(local_btf, t->name_off); + local_name = btf__name_by_offset(local_btf, local_t->name_off); if (str_is_empty(local_name)) return ERR_PTR(-EINVAL); local_essent_len = bpf_core_essential_name_len(local_name); @@ -4244,12 +4494,11 @@ static struct ids_vec *bpf_core_find_cands(const struct btf *local_btf, n = btf__get_nr_types(targ_btf); for (i = 1; i <= n; i++) { t = btf__type_by_id(targ_btf, i); - targ_name = btf__name_by_offset(targ_btf, t->name_off); - if (str_is_empty(targ_name)) + if (btf_kind(t) != 
btf_kind(local_t)) continue; - t = skip_mods_and_typedefs(targ_btf, i, NULL); - if (!btf_is_composite(t) && !btf_is_array(t)) + targ_name = btf__name_by_offset(targ_btf, t->name_off); + if (str_is_empty(targ_name)) continue; targ_essent_len = bpf_core_essential_name_len(targ_name); @@ -4257,11 +4506,12 @@ static struct ids_vec *bpf_core_find_cands(const struct btf *local_btf, continue; if (strncmp(local_name, targ_name, local_essent_len) == 0) { - pr_debug("[%d] %s: found candidate [%d] %s\n", - local_type_id, local_name, i, targ_name); - new_ids = reallocarray(cand_ids->data, - cand_ids->len + 1, - sizeof(*cand_ids->data)); + pr_debug("CO-RE relocating [%d] %s %s: found target candidate [%d] %s %s\n", + local_type_id, btf_kind_str(local_t), + local_name, i, btf_kind_str(t), targ_name); + new_ids = libbpf_reallocarray(cand_ids->data, + cand_ids->len + 1, + sizeof(*cand_ids->data)); if (!new_ids) { err = -ENOMEM; goto err_out; @@ -4276,8 +4526,9 @@ err_out: return ERR_PTR(err); } -/* Check two types for compatibility, skipping const/volatile/restrict and - * typedefs, to ensure we are relocating compatible entities: +/* Check two types for compatibility for the purpose of field access + * relocation. const/volatile/restrict and typedefs are skipped to ensure we + * are relocating semantically compatible entities: * - any two STRUCTs/UNIONs are compatible and can be mixed; * - any two FWDs are compatible, if their names match (modulo flavor suffix); * - any two PTRs are always compatible; @@ -4432,6 +4683,100 @@ static int bpf_core_match_member(const struct btf *local_btf, return 0; } +/* Check local and target types for compatibility. This check is used for + * type-based CO-RE relocations and follow slightly different rules than + * field-based relocations. This function assumes that root types were already + * checked for name match. Beyond that initial root-level name check, names + * are completely ignored. Compatibility rules are as follows: + * - any two STRUCTs/UNIONs/FWDs/ENUMs/INTs are considered compatible, but + * kind should match for local and target types (i.e., STRUCT is not + * compatible with UNION); + * - for ENUMs, the size is ignored; + * - for INT, size and signedness are ignored; + * - for ARRAY, dimensionality is ignored, element types are checked for + * compatibility recursively; + * - CONST/VOLATILE/RESTRICT modifiers are ignored; + * - TYPEDEFs/PTRs are compatible if types they pointing to are compatible; + * - FUNC_PROTOs are compatible if they have compatible signature: same + * number of input args and compatible return and argument types. + * These rules are not set in stone and probably will be adjusted as we get + * more experience with using BPF CO-RE relocations. 
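Candidate search in bpf_core_find_cands() now also requires the target type's BTF kind to match the local one, on top of the essential-name comparison: bpf_core_essential_name_len() cuts a type name at its '___' suffix, which is the convention that lets one BPF object carry several locally flavored variants of the same kernel type. A hypothetical flavored definition (the struct below is invented for illustration and is not part of this patch) looks like this:

/* Hypothetical local flavor of a kernel struct. For candidate matching the
 * '___pre_5_5' suffix is dropped, so this definition is matched against the
 * kernel's own 'struct task_struct'; the suffix only keeps the local
 * definitions distinct from each other and from vmlinux.h.
 */
struct task_struct___pre_5_5 {
	long state;
} __attribute__((preserve_access_index));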
+ */ +static int bpf_core_types_are_compat(const struct btf *local_btf, __u32 local_id, + const struct btf *targ_btf, __u32 targ_id) +{ + const struct btf_type *local_type, *targ_type; + int depth = 32; /* max recursion depth */ + + /* caller made sure that names match (ignoring flavor suffix) */ + local_type = btf__type_by_id(local_btf, local_id); + targ_type = btf__type_by_id(targ_btf, targ_id); + if (btf_kind(local_type) != btf_kind(targ_type)) + return 0; + +recur: + depth--; + if (depth < 0) + return -EINVAL; + + local_type = skip_mods_and_typedefs(local_btf, local_id, &local_id); + targ_type = skip_mods_and_typedefs(targ_btf, targ_id, &targ_id); + if (!local_type || !targ_type) + return -EINVAL; + + if (btf_kind(local_type) != btf_kind(targ_type)) + return 0; + + switch (btf_kind(local_type)) { + case BTF_KIND_UNKN: + case BTF_KIND_STRUCT: + case BTF_KIND_UNION: + case BTF_KIND_ENUM: + case BTF_KIND_FWD: + return 1; + case BTF_KIND_INT: + /* just reject deprecated bitfield-like integers; all other + * integers are by default compatible between each other + */ + return btf_int_offset(local_type) == 0 && btf_int_offset(targ_type) == 0; + case BTF_KIND_PTR: + local_id = local_type->type; + targ_id = targ_type->type; + goto recur; + case BTF_KIND_ARRAY: + local_id = btf_array(local_type)->type; + targ_id = btf_array(targ_type)->type; + goto recur; + case BTF_KIND_FUNC_PROTO: { + struct btf_param *local_p = btf_params(local_type); + struct btf_param *targ_p = btf_params(targ_type); + __u16 local_vlen = btf_vlen(local_type); + __u16 targ_vlen = btf_vlen(targ_type); + int i, err; + + if (local_vlen != targ_vlen) + return 0; + + for (i = 0; i < local_vlen; i++, local_p++, targ_p++) { + skip_mods_and_typedefs(local_btf, local_p->type, &local_id); + skip_mods_and_typedefs(targ_btf, targ_p->type, &targ_id); + err = bpf_core_types_are_compat(local_btf, local_id, targ_btf, targ_id); + if (err <= 0) + return err; + } + + /* tail recurse for return type check */ + skip_mods_and_typedefs(local_btf, local_type->type, &local_id); + skip_mods_and_typedefs(targ_btf, targ_type->type, &targ_id); + goto recur; + } + default: + pr_warn("unexpected kind %s relocated, local [%d], target [%d]\n", + btf_kind_str(local_type), local_id, targ_id); + return 0; + } +} + /* * Try to match local spec to a target type and, if successful, produce full * target spec (high-level, low-level + bit offset). 
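bpf_core_types_are_compat() enforces the rules spelled out in the comment above it. A worked illustration is below; every type name in it is invented, and the verdicts follow from the listed rules rather than from anything else in this patch.

/* Sample local/target pairs and the verdict bpf_core_types_are_compat()
 * would reach for them. Member lists, enumerator values and names below the
 * root type never matter; only BTF kinds and, recursively, the element,
 * pointee and signature types do:
 *
 *   local BTF                     target BTF                verdict
 *   struct s { int x; };          struct s { long a[4]; };  compatible (both STRUCT)
 *   enum e { E_A };               enum e { E_X = 7 };       compatible (ENUM size/values ignored)
 *   typedef int vec_t[8];         typedef short vec_t[2];   compatible (ARRAY dims ignored, INT elements)
 *   struct u { int x; };          union u { int x; };       not compatible (STRUCT vs UNION)
 *   int (*cb)(int, long);         int (*cb)(int);           not compatible (argument counts differ)
 */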
@@ -4447,10 +4792,51 @@ static int bpf_core_spec_match(struct bpf_core_spec *local_spec, memset(targ_spec, 0, sizeof(*targ_spec)); targ_spec->btf = targ_btf; + targ_spec->root_type_id = targ_id; + targ_spec->relo_kind = local_spec->relo_kind; + + if (core_relo_is_type_based(local_spec->relo_kind)) { + return bpf_core_types_are_compat(local_spec->btf, + local_spec->root_type_id, + targ_btf, targ_id); + } local_acc = &local_spec->spec[0]; targ_acc = &targ_spec->spec[0]; + if (core_relo_is_enumval_based(local_spec->relo_kind)) { + size_t local_essent_len, targ_essent_len; + const struct btf_enum *e; + const char *targ_name; + + /* has to resolve to an enum */ + targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id, &targ_id); + if (!btf_is_enum(targ_type)) + return 0; + + local_essent_len = bpf_core_essential_name_len(local_acc->name); + + for (i = 0, e = btf_enum(targ_type); i < btf_vlen(targ_type); i++, e++) { + targ_name = btf__name_by_offset(targ_spec->btf, e->name_off); + targ_essent_len = bpf_core_essential_name_len(targ_name); + if (targ_essent_len != local_essent_len) + continue; + if (strncmp(local_acc->name, targ_name, local_essent_len) == 0) { + targ_acc->type_id = targ_id; + targ_acc->idx = i; + targ_acc->name = targ_name; + targ_spec->len++; + targ_spec->raw_spec[targ_spec->raw_len] = targ_acc->idx; + targ_spec->raw_len++; + return 1; + } + } + return 0; + } + + if (!core_relo_is_field_based(local_spec->relo_kind)) + return -EINVAL; + for (i = 0; i < local_spec->len; i++, local_acc++, targ_acc++) { targ_type = skip_mods_and_typedefs(targ_spec->btf, targ_id, &targ_id); @@ -4507,18 +4893,29 @@ static int bpf_core_spec_match(struct bpf_core_spec *local_spec, } static int bpf_core_calc_field_relo(const struct bpf_program *prog, - const struct bpf_field_reloc *relo, + const struct bpf_core_relo *relo, const struct bpf_core_spec *spec, __u32 *val, bool *validate) { - const struct bpf_core_accessor *acc = &spec->spec[spec->len - 1]; - const struct btf_type *t = btf__type_by_id(spec->btf, acc->type_id); + const struct bpf_core_accessor *acc; + const struct btf_type *t; __u32 byte_off, byte_sz, bit_off, bit_sz; const struct btf_member *m; const struct btf_type *mt; bool bitfield; __s64 sz; + if (relo->kind == BPF_FIELD_EXISTS) { + *val = spec ? 
1 : 0; + return 0; + } + + if (!spec) + return -EUCLEAN; /* request instruction poisoning */ + + acc = &spec->spec[spec->len - 1]; + t = btf__type_by_id(spec->btf, acc->type_id); + /* a[n] accessor needs special handling */ if (!acc->name) { if (relo->kind == BPF_FIELD_BYTE_OFFSET) { @@ -4604,21 +5001,158 @@ static int bpf_core_calc_field_relo(const struct bpf_program *prog, break; case BPF_FIELD_EXISTS: default: - pr_warn("prog '%s': unknown relo %d at insn #%d\n", - bpf_program__title(prog, false), - relo->kind, relo->insn_off / 8); - return -EINVAL; + return -EOPNOTSUPP; + } + + return 0; +} + +static int bpf_core_calc_type_relo(const struct bpf_core_relo *relo, + const struct bpf_core_spec *spec, + __u32 *val) +{ + __s64 sz; + + /* type-based relos return zero when target type is not found */ + if (!spec) { + *val = 0; + return 0; + } + + switch (relo->kind) { + case BPF_TYPE_ID_TARGET: + *val = spec->root_type_id; + break; + case BPF_TYPE_EXISTS: + *val = 1; + break; + case BPF_TYPE_SIZE: + sz = btf__resolve_size(spec->btf, spec->root_type_id); + if (sz < 0) + return -EINVAL; + *val = sz; + break; + case BPF_TYPE_ID_LOCAL: + /* BPF_TYPE_ID_LOCAL is handled specially and shouldn't get here */ + default: + return -EOPNOTSUPP; + } + + return 0; +} + +static int bpf_core_calc_enumval_relo(const struct bpf_core_relo *relo, + const struct bpf_core_spec *spec, + __u32 *val) +{ + const struct btf_type *t; + const struct btf_enum *e; + + switch (relo->kind) { + case BPF_ENUMVAL_EXISTS: + *val = spec ? 1 : 0; + break; + case BPF_ENUMVAL_VALUE: + if (!spec) + return -EUCLEAN; /* request instruction poisoning */ + t = btf__type_by_id(spec->btf, spec->spec[0].type_id); + e = btf_enum(t) + spec->spec[0].idx; + *val = e->val; + break; + default: + return -EOPNOTSUPP; } return 0; } +struct bpf_core_relo_res +{ + /* expected value in the instruction, unless validate == false */ + __u32 orig_val; + /* new value that needs to be patched up to */ + __u32 new_val; + /* relocation unsuccessful, poison instruction, but don't fail load */ + bool poison; + /* some relocations can't be validated against orig_val */ + bool validate; +}; + +/* Calculate original and target relocation values, given local and target + * specs and relocation kind. These values are calculated for each candidate. + * If there are multiple candidates, resulting values should all be consistent + * with each other. Otherwise, libbpf will refuse to proceed due to ambiguity. + * If instruction has to be poisoned, *poison will be set to true. 
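bpf_core_calc_enumval_relo() above is the loader-side half of enum value relocations; the program-side half is a Clang builtin normally reached through the convenience macros in bpf_core_read.h. A minimal sketch follows; it assumes the bpf_core_enum_value_exists()/bpf_core_enum_value() macros from bpf_core_read.h, a vmlinux.h generated by bpftool, and a Clang recent enough to emit BPF_ENUMVAL_* relocations; the attach point and the enum chosen are arbitrary.

/* Minimal sketch, not from this patch. The enumerator's value is resolved
 * against the *target* kernel's BTF at load time, not against the copy of
 * enum pid_type compiled into this object.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("tp_btf/sched_switch")
int probe_enum(void *ctx)
{
	__u64 pidtype_max = 0;

	if (bpf_core_enum_value_exists(enum pid_type, PIDTYPE_MAX))	/* BPF_ENUMVAL_EXISTS */
		pidtype_max = bpf_core_enum_value(enum pid_type, PIDTYPE_MAX); /* BPF_ENUMVAL_VALUE */

	bpf_printk("PIDTYPE_MAX on this kernel = %llu", pidtype_max);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";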
+ */ +static int bpf_core_calc_relo(const struct bpf_program *prog, + const struct bpf_core_relo *relo, + int relo_idx, + const struct bpf_core_spec *local_spec, + const struct bpf_core_spec *targ_spec, + struct bpf_core_relo_res *res) +{ + int err = -EOPNOTSUPP; + + res->orig_val = 0; + res->new_val = 0; + res->poison = false; + res->validate = true; + + if (core_relo_is_field_based(relo->kind)) { + err = bpf_core_calc_field_relo(prog, relo, local_spec, &res->orig_val, &res->validate); + err = err ?: bpf_core_calc_field_relo(prog, relo, targ_spec, &res->new_val, NULL); + } else if (core_relo_is_type_based(relo->kind)) { + err = bpf_core_calc_type_relo(relo, local_spec, &res->orig_val); + err = err ?: bpf_core_calc_type_relo(relo, targ_spec, &res->new_val); + } else if (core_relo_is_enumval_based(relo->kind)) { + err = bpf_core_calc_enumval_relo(relo, local_spec, &res->orig_val); + err = err ?: bpf_core_calc_enumval_relo(relo, targ_spec, &res->new_val); + } + + if (err == -EUCLEAN) { + /* EUCLEAN is used to signal instruction poisoning request */ + res->poison = true; + err = 0; + } else if (err == -EOPNOTSUPP) { + /* EOPNOTSUPP means unknown/unsupported relocation */ + pr_warn("prog '%s': relo #%d: unrecognized CO-RE relocation %s (%d) at insn #%d\n", + bpf_program__title(prog, false), relo_idx, + core_relo_kind_str(relo->kind), relo->kind, relo->insn_off / 8); + } + + return err; +} + +/* + * Turn instruction for which CO_RE relocation failed into invalid one with + * distinct signature. + */ +static void bpf_core_poison_insn(struct bpf_program *prog, int relo_idx, + int insn_idx, struct bpf_insn *insn) +{ + pr_debug("prog '%s': relo #%d: substituting insn #%d w/ invalid insn\n", + bpf_program__title(prog, false), relo_idx, insn_idx); + insn->code = BPF_JMP | BPF_CALL; + insn->dst_reg = 0; + insn->src_reg = 0; + insn->off = 0; + /* if this instruction is reachable (not a dead code), + * verifier will complain with the following message: + * invalid func unknown#195896080 + */ + insn->imm = 195896080; /* => 0xbad2310 => "bad relo" */ +} + +static bool is_ldimm64(struct bpf_insn *insn) +{ + return insn->code == (BPF_LD | BPF_IMM | BPF_DW); +} + /* * Patch relocatable BPF instruction. * * Patched value is determined by relocation kind and target specification. - * For field existence relocation target spec will be NULL if field is not - * found. + * For existence relocations target spec will be NULL if field/type is not found. * Expected insn->imm value is determined using relocation kind and local * spec, and is checked before patching instruction. If actual insn->imm value * is wrong, bail out with error. @@ -4626,58 +5160,43 @@ static int bpf_core_calc_field_relo(const struct bpf_program *prog, * Currently three kinds of BPF instructions are supported: * 1. rX = <imm> (assignment with immediate operand); * 2. rX += <imm> (arithmetic operations with immediate operand); + * 3. rX = <imm64> (load with 64-bit immediate value). 
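bpf_core_poison_insn() above is what makes a failed relocation survivable: the instruction is replaced with a call to the bogus helper id 195896080 (0xbad2310), so the object still loads as long as that instruction turns out to be dead code. The usual way to guarantee that on the program side is to guard the access with a field-existence check; a hedged sketch follows, where the flavored struct, its fancy_feature member and the attach point are invented for illustration, and the bpf_core_field_exists()/BPF_CORE_READ() macros from bpf_core_read.h are assumed.

/* Sketch only, not from this patch. If the running kernel's task_struct has
 * no 'fancy_feature' member, the BPF_CORE_READ() below is poisoned with the
 * invalid helper call, but the bpf_core_field_exists() guard makes that
 * branch dead code, so the verifier drops it and the program still loads.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct task_struct___with_fancy {
	int fancy_feature;
} __attribute__((preserve_access_index));

SEC("kprobe/do_nanosleep")
int probe_field(void *ctx)
{
	struct task_struct___with_fancy *task;
	int val = -1;

	task = (void *)bpf_get_current_task();
	if (bpf_core_field_exists(task->fancy_feature))		/* BPF_FIELD_EXISTS */
		val = BPF_CORE_READ(task, fancy_feature);	/* BPF_FIELD_BYTE_OFFSET */

	bpf_printk("fancy_feature = %d", val);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";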
*/ -static int bpf_core_reloc_insn(struct bpf_program *prog, - const struct bpf_field_reloc *relo, +static int bpf_core_patch_insn(struct bpf_program *prog, + const struct bpf_core_relo *relo, int relo_idx, - const struct bpf_core_spec *local_spec, - const struct bpf_core_spec *targ_spec) + const struct bpf_core_relo_res *res) { __u32 orig_val, new_val; struct bpf_insn *insn; - bool validate = true; - int insn_idx, err; + int insn_idx; __u8 class; - if (relo->insn_off % sizeof(struct bpf_insn)) + if (relo->insn_off % BPF_INSN_SZ) return -EINVAL; - insn_idx = relo->insn_off / sizeof(struct bpf_insn); + insn_idx = relo->insn_off / BPF_INSN_SZ; insn = &prog->insns[insn_idx]; class = BPF_CLASS(insn->code); - if (relo->kind == BPF_FIELD_EXISTS) { - orig_val = 1; /* can't generate EXISTS relo w/o local field */ - new_val = targ_spec ? 1 : 0; - } else if (!targ_spec) { - pr_debug("prog '%s': relo #%d: substituting insn #%d w/ invalid insn\n", - bpf_program__title(prog, false), relo_idx, insn_idx); - insn->code = BPF_JMP | BPF_CALL; - insn->dst_reg = 0; - insn->src_reg = 0; - insn->off = 0; - /* if this instruction is reachable (not a dead code), - * verifier will complain with the following message: - * invalid func unknown#195896080 + if (res->poison) { + /* poison second part of ldimm64 to avoid confusing error from + * verifier about "unknown opcode 00" */ - insn->imm = 195896080; /* => 0xbad2310 => "bad relo" */ + if (is_ldimm64(insn)) + bpf_core_poison_insn(prog, relo_idx, insn_idx + 1, insn + 1); + bpf_core_poison_insn(prog, relo_idx, insn_idx, insn); return 0; - } else { - err = bpf_core_calc_field_relo(prog, relo, local_spec, - &orig_val, &validate); - if (err) - return err; - err = bpf_core_calc_field_relo(prog, relo, targ_spec, - &new_val, NULL); - if (err) - return err; } + orig_val = res->orig_val; + new_val = res->new_val; + switch (class) { case BPF_ALU: case BPF_ALU64: if (BPF_SRC(insn->code) != BPF_K) return -EINVAL; - if (validate && insn->imm != orig_val) { + if (res->validate && insn->imm != orig_val) { pr_warn("prog '%s': relo #%d: unexpected insn #%d (ALU/ALU64) value: got %u, exp %u -> %u\n", bpf_program__title(prog, false), relo_idx, insn_idx, insn->imm, orig_val, new_val); @@ -4692,8 +5211,8 @@ static int bpf_core_reloc_insn(struct bpf_program *prog, case BPF_LDX: case BPF_ST: case BPF_STX: - if (validate && insn->off != orig_val) { - pr_warn("prog '%s': relo #%d: unexpected insn #%d (LD/LDX/ST/STX) value: got %u, exp %u -> %u\n", + if (res->validate && insn->off != orig_val) { + pr_warn("prog '%s': relo #%d: unexpected insn #%d (LDX/ST/STX) value: got %u, exp %u -> %u\n", bpf_program__title(prog, false), relo_idx, insn_idx, insn->off, orig_val, new_val); return -EINVAL; @@ -4710,8 +5229,37 @@ static int bpf_core_reloc_insn(struct bpf_program *prog, bpf_program__title(prog, false), relo_idx, insn_idx, orig_val, new_val); break; + case BPF_LD: { + __u64 imm; + + if (!is_ldimm64(insn) || + insn[0].src_reg != 0 || insn[0].off != 0 || + insn_idx + 1 >= prog->insns_cnt || + insn[1].code != 0 || insn[1].dst_reg != 0 || + insn[1].src_reg != 0 || insn[1].off != 0) { + pr_warn("prog '%s': relo #%d: insn #%d (LDIMM64) has unexpected form\n", + bpf_program__title(prog, false), relo_idx, insn_idx); + return -EINVAL; + } + + imm = insn[0].imm + ((__u64)insn[1].imm << 32); + if (res->validate && imm != orig_val) { + pr_warn("prog '%s': relo #%d: unexpected insn #%d (LDIMM64) value: got %llu, exp %u -> %u\n", + bpf_program__title(prog, false), relo_idx, + insn_idx, (unsigned long 
long)imm, + orig_val, new_val); + return -EINVAL; + } + + insn[0].imm = new_val; + insn[1].imm = 0; /* currently only 32-bit values are supported */ + pr_debug("prog '%s': relo #%d: patched insn #%d (LDIMM64) imm64 %llu -> %u\n", + bpf_program__title(prog, false), relo_idx, insn_idx, + (unsigned long long)imm, new_val); + break; + } default: - pr_warn("prog '%s': relo #%d: trying to relocate unrecognized insn #%d, code:%x, src:%x, dst:%x, off:%x, imm:%x\n", + pr_warn("prog '%s': relo #%d: trying to relocate unrecognized insn #%d, code:0x%x, src:0x%x, dst:0x%x, off:0x%x, imm:0x%x\n", bpf_program__title(prog, false), relo_idx, insn_idx, insn->code, insn->src_reg, insn->dst_reg, insn->off, insn->imm); @@ -4728,29 +5276,48 @@ static int bpf_core_reloc_insn(struct bpf_program *prog, static void bpf_core_dump_spec(int level, const struct bpf_core_spec *spec) { const struct btf_type *t; + const struct btf_enum *e; const char *s; __u32 type_id; int i; - type_id = spec->spec[0].type_id; + type_id = spec->root_type_id; t = btf__type_by_id(spec->btf, type_id); s = btf__name_by_offset(spec->btf, t->name_off); - libbpf_print(level, "[%u] %s + ", type_id, s); - for (i = 0; i < spec->raw_len; i++) - libbpf_print(level, "%d%s", spec->raw_spec[i], - i == spec->raw_len - 1 ? " => " : ":"); + libbpf_print(level, "[%u] %s %s", type_id, btf_kind_str(t), str_is_empty(s) ? "<anon>" : s); - libbpf_print(level, "%u.%u @ &x", - spec->bit_offset / 8, spec->bit_offset % 8); + if (core_relo_is_type_based(spec->relo_kind)) + return; - for (i = 0; i < spec->len; i++) { - if (spec->spec[i].name) - libbpf_print(level, ".%s", spec->spec[i].name); - else - libbpf_print(level, "[%u]", spec->spec[i].idx); + if (core_relo_is_enumval_based(spec->relo_kind)) { + t = skip_mods_and_typedefs(spec->btf, type_id, NULL); + e = btf_enum(t) + spec->raw_spec[0]; + s = btf__name_by_offset(spec->btf, e->name_off); + + libbpf_print(level, "::%s = %u", s, e->val); + return; } + if (core_relo_is_field_based(spec->relo_kind)) { + for (i = 0; i < spec->len; i++) { + if (spec->spec[i].name) + libbpf_print(level, ".%s", spec->spec[i].name); + else if (i > 0 || spec->spec[i].idx > 0) + libbpf_print(level, "[%u]", spec->spec[i].idx); + } + + libbpf_print(level, " ("); + for (i = 0; i < spec->raw_len; i++) + libbpf_print(level, "%s%d", i == 0 ? "" : ":", spec->raw_spec[i]); + + if (spec->bit_offset % 8) + libbpf_print(level, " @ offset %u.%u)", + spec->bit_offset / 8, spec->bit_offset % 8); + else + libbpf_print(level, " @ offset %u)", spec->bit_offset / 8); + return; + } } static size_t bpf_core_hash_fn(const void *key, void *ctx) @@ -4814,22 +5381,23 @@ static void *u32_as_hash_key(__u32 x) * CPU-wise compared to prebuilding a map from all local type names to * a list of candidate type names. It's also sped up by caching resolved * list of matching candidates per each local "root" type ID, that has at - * least one bpf_field_reloc associated with it. This list is shared + * least one bpf_core_relo associated with it. This list is shared * between multiple relocations for the same type ID and is updated as some * of the candidates are pruned due to structural incompatibility. 
*/ -static int bpf_core_reloc_field(struct bpf_program *prog, - const struct bpf_field_reloc *relo, - int relo_idx, - const struct btf *local_btf, - const struct btf *targ_btf, - struct hashmap *cand_cache) +static int bpf_core_apply_relo(struct bpf_program *prog, + const struct bpf_core_relo *relo, + int relo_idx, + const struct btf *local_btf, + const struct btf *targ_btf, + struct hashmap *cand_cache) { const char *prog_name = bpf_program__title(prog, false); - struct bpf_core_spec local_spec, cand_spec, targ_spec; + struct bpf_core_spec local_spec, cand_spec, targ_spec = {}; const void *type_key = u32_as_hash_key(relo->type_id); - const struct btf_type *local_type, *cand_type; - const char *local_name, *cand_name; + struct bpf_core_relo_res cand_res, targ_res; + const struct btf_type *local_type; + const char *local_name; struct ids_vec *cand_ids; __u32 local_id, cand_id; const char *spec_str; @@ -4841,32 +5409,49 @@ static int bpf_core_reloc_field(struct bpf_program *prog, return -EINVAL; local_name = btf__name_by_offset(local_btf, local_type->name_off); - if (str_is_empty(local_name)) + if (!local_name) return -EINVAL; spec_str = btf__name_by_offset(local_btf, relo->access_str_off); if (str_is_empty(spec_str)) return -EINVAL; - err = bpf_core_spec_parse(local_btf, local_id, spec_str, &local_spec); + err = bpf_core_parse_spec(local_btf, local_id, spec_str, relo->kind, &local_spec); if (err) { - pr_warn("prog '%s': relo #%d: parsing [%d] %s + %s failed: %d\n", - prog_name, relo_idx, local_id, local_name, spec_str, - err); + pr_warn("prog '%s': relo #%d: parsing [%d] %s %s + %s failed: %d\n", + prog_name, relo_idx, local_id, btf_kind_str(local_type), + str_is_empty(local_name) ? "<anon>" : local_name, + spec_str, err); return -EINVAL; } - pr_debug("prog '%s': relo #%d: kind %d, spec is ", prog_name, relo_idx, - relo->kind); + pr_debug("prog '%s': relo #%d: kind <%s> (%d), spec is ", prog_name, + relo_idx, core_relo_kind_str(relo->kind), relo->kind); bpf_core_dump_spec(LIBBPF_DEBUG, &local_spec); libbpf_print(LIBBPF_DEBUG, "\n"); + /* TYPE_ID_LOCAL relo is special and doesn't need candidate search */ + if (relo->kind == BPF_TYPE_ID_LOCAL) { + targ_res.validate = true; + targ_res.poison = false; + targ_res.orig_val = local_spec.root_type_id; + targ_res.new_val = local_spec.root_type_id; + goto patch_insn; + } + + /* libbpf doesn't support candidate search for anonymous types */ + if (str_is_empty(spec_str)) { + pr_warn("prog '%s': relo #%d: <%s> (%d) relocation doesn't support anonymous types\n", + prog_name, relo_idx, core_relo_kind_str(relo->kind), relo->kind); + return -EOPNOTSUPP; + } + if (!hashmap__find(cand_cache, type_key, (void **)&cand_ids)) { cand_ids = bpf_core_find_cands(local_btf, local_id, targ_btf); if (IS_ERR(cand_ids)) { - pr_warn("prog '%s': relo #%d: target candidate search failed for [%d] %s: %ld", - prog_name, relo_idx, local_id, local_name, - PTR_ERR(cand_ids)); + pr_warn("prog '%s': relo #%d: target candidate search failed for [%d] %s %s: %ld", + prog_name, relo_idx, local_id, btf_kind_str(local_type), + local_name, PTR_ERR(cand_ids)); return PTR_ERR(cand_ids); } err = hashmap__set(cand_cache, type_key, cand_ids, NULL, NULL); @@ -4878,36 +5463,51 @@ static int bpf_core_reloc_field(struct bpf_program *prog, for (i = 0, j = 0; i < cand_ids->len; i++) { cand_id = cand_ids->data[i]; - cand_type = btf__type_by_id(targ_btf, cand_id); - cand_name = btf__name_by_offset(targ_btf, cand_type->name_off); - - err = bpf_core_spec_match(&local_spec, targ_btf, - cand_id, 
&cand_spec); - pr_debug("prog '%s': relo #%d: matching candidate #%d %s against spec ", - prog_name, relo_idx, i, cand_name); - bpf_core_dump_spec(LIBBPF_DEBUG, &cand_spec); - libbpf_print(LIBBPF_DEBUG, ": %d\n", err); + err = bpf_core_spec_match(&local_spec, targ_btf, cand_id, &cand_spec); if (err < 0) { - pr_warn("prog '%s': relo #%d: matching error: %d\n", - prog_name, relo_idx, err); + pr_warn("prog '%s': relo #%d: error matching candidate #%d ", + prog_name, relo_idx, i); + bpf_core_dump_spec(LIBBPF_WARN, &cand_spec); + libbpf_print(LIBBPF_WARN, ": %d\n", err); return err; } + + pr_debug("prog '%s': relo #%d: %s candidate #%d ", prog_name, + relo_idx, err == 0 ? "non-matching" : "matching", i); + bpf_core_dump_spec(LIBBPF_DEBUG, &cand_spec); + libbpf_print(LIBBPF_DEBUG, "\n"); + if (err == 0) continue; + err = bpf_core_calc_relo(prog, relo, relo_idx, &local_spec, &cand_spec, &cand_res); + if (err) + return err; + if (j == 0) { + targ_res = cand_res; targ_spec = cand_spec; } else if (cand_spec.bit_offset != targ_spec.bit_offset) { - /* if there are many candidates, they should all - * resolve to the same bit offset + /* if there are many field relo candidates, they + * should all resolve to the same bit offset */ - pr_warn("prog '%s': relo #%d: offset ambiguity: %u != %u\n", + pr_warn("prog '%s': relo #%d: field offset ambiguity: %u != %u\n", prog_name, relo_idx, cand_spec.bit_offset, targ_spec.bit_offset); return -EINVAL; + } else if (cand_res.poison != targ_res.poison || cand_res.new_val != targ_res.new_val) { + /* all candidates should result in the same relocation + * decision and value, otherwise it's dangerous to + * proceed due to ambiguity + */ + pr_warn("prog '%s': relo #%d: relocation decision ambiguity: %s %u != %s %u\n", + prog_name, relo_idx, + cand_res.poison ? "failure" : "success", cand_res.new_val, + targ_res.poison ? "failure" : "success", targ_res.new_val); + return -EINVAL; } - cand_ids->data[j++] = cand_spec.spec[0].type_id; + cand_ids->data[j++] = cand_spec.root_type_id; } /* @@ -4926,19 +5526,25 @@ static int bpf_core_reloc_field(struct bpf_program *prog, * as well as expected case, depending whether instruction w/ * relocation is guarded in some way that makes it unreachable (dead * code) if relocation can't be resolved. This is handled in - * bpf_core_reloc_insn() uniformly by replacing that instruction with + * bpf_core_patch_insn() uniformly by replacing that instruction with * BPF helper call insn (using invalid helper ID). If that instruction * is indeed unreachable, then it will be ignored and eliminated by * verifier. If it was an error, then verifier will complain and point * to a specific instruction number in its log. */ - if (j == 0) - pr_debug("prog '%s': relo #%d: no matching targets found for [%d] %s + %s\n", - prog_name, relo_idx, local_id, local_name, spec_str); + if (j == 0) { + pr_debug("prog '%s': relo #%d: no matching targets found\n", + prog_name, relo_idx); - /* bpf_core_reloc_insn should know how to handle missing targ_spec */ - err = bpf_core_reloc_insn(prog, relo, relo_idx, &local_spec, - j ? 
&targ_spec : NULL); + /* calculate single target relo result explicitly */ + err = bpf_core_calc_relo(prog, relo, relo_idx, &local_spec, NULL, &targ_res); + if (err) + return err; + } + +patch_insn: + /* bpf_core_patch_insn() should know how to handle missing targ_spec */ + err = bpf_core_patch_insn(prog, relo, relo_idx, &targ_res); if (err) { pr_warn("prog '%s': relo #%d: failed to patch insn at offset %d: %d\n", prog_name, relo_idx, relo->insn_off, err); @@ -4949,10 +5555,10 @@ static int bpf_core_reloc_field(struct bpf_program *prog, } static int -bpf_core_reloc_fields(struct bpf_object *obj, const char *targ_btf_path) +bpf_object__relocate_core(struct bpf_object *obj, const char *targ_btf_path) { const struct btf_ext_info_sec *sec; - const struct bpf_field_reloc *rec; + const struct bpf_core_relo *rec; const struct btf_ext_info *seg; struct hashmap_entry *entry; struct hashmap *cand_cache = NULL; @@ -4961,6 +5567,9 @@ bpf_core_reloc_fields(struct bpf_object *obj, const char *targ_btf_path) const char *sec_name; int i, err = 0; + if (obj->btf_ext->core_relo_info.len == 0) + return 0; + if (targ_btf_path) targ_btf = btf__parse_elf(targ_btf_path, NULL); else @@ -4976,7 +5585,7 @@ bpf_core_reloc_fields(struct bpf_object *obj, const char *targ_btf_path) goto out; } - seg = &obj->btf_ext->field_reloc_info; + seg = &obj->btf_ext->core_relo_info; for_each_btf_ext_sec(seg, sec) { sec_name = btf__name_by_offset(obj->btf, sec->sec_name_off); if (str_is_empty(sec_name)) { @@ -4997,15 +5606,15 @@ bpf_core_reloc_fields(struct bpf_object *obj, const char *targ_btf_path) goto out; } - pr_debug("prog '%s': performing %d CO-RE offset relocs\n", + pr_debug("sec '%s': found %d CO-RE relocations\n", sec_name, sec->num_info); for_each_btf_ext_rec(seg, sec, i, rec) { - err = bpf_core_reloc_field(prog, rec, i, obj->btf, - targ_btf, cand_cache); + err = bpf_core_apply_relo(prog, rec, i, obj->btf, + targ_btf, cand_cache); if (err) { pr_warn("prog '%s': relo #%d: failed to relocate: %d\n", - sec_name, i, err); + prog->name, i, err); goto out; } } @@ -5025,17 +5634,6 @@ out: } static int -bpf_object__relocate_core(struct bpf_object *obj, const char *targ_btf_path) -{ - int err = 0; - - if (obj->btf_ext->field_reloc_info.len) - err = bpf_core_reloc_fields(obj, targ_btf_path); - - return err; -} - -static int bpf_program__reloc_text(struct bpf_program *prog, struct bpf_object *obj, struct reloc_desc *relo) { @@ -5051,7 +5649,7 @@ bpf_program__reloc_text(struct bpf_program *prog, struct bpf_object *obj, return -LIBBPF_ERRNO__RELOC; } new_cnt = prog->insns_cnt + text->insns_cnt; - new_insn = reallocarray(prog->insns, new_cnt, sizeof(*insn)); + new_insn = libbpf_reallocarray(prog->insns, new_cnt, sizeof(*insn)); if (!new_insn) { pr_warn("oom in prog realloc\n"); return -ENOMEM; @@ -5136,7 +5734,8 @@ bpf_program__relocate(struct bpf_program *prog, struct bpf_object *obj) return err; break; default: - pr_warn("relo #%d: bad relo type %d\n", i, relo->type); + pr_warn("prog '%s': relo #%d: bad relo type %d\n", + prog->name, i, relo->type); return -EINVAL; } } @@ -5171,7 +5770,8 @@ bpf_object__relocate(struct bpf_object *obj, const char *targ_btf_path) err = bpf_program__relocate(prog, obj); if (err) { - pr_warn("failed to relocate '%s'\n", prog->section_name); + pr_warn("prog '%s': failed to relocate data references: %d\n", + prog->name, err); return err; } break; @@ -5186,7 +5786,8 @@ bpf_object__relocate(struct bpf_object *obj, const char *targ_btf_path) err = bpf_program__relocate(prog, obj); if (err) { - 
pr_warn("failed to relocate '%s'\n", prog->section_name); + pr_warn("prog '%s': failed to relocate calls: %d\n", + prog->name, err); return err; } } @@ -5230,8 +5831,7 @@ static int bpf_object__collect_map_relos(struct bpf_object *obj, i, (size_t)GELF_R_SYM(rel.r_info)); return -LIBBPF_ERRNO__FORMAT; } - name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, - sym.st_name) ? : "<?>"; + name = elf_sym_str(obj, sym.st_name) ?: "<?>"; if (sym.st_shndx != obj->efile.btf_maps_shndx) { pr_warn(".maps relo #%d: '%s' isn't a BTF-defined map\n", i, name); @@ -5293,7 +5893,7 @@ static int bpf_object__collect_map_relos(struct bpf_object *obj, moff /= bpf_ptr_sz; if (moff >= map->init_slots_sz) { new_sz = moff + 1; - tmp = realloc(map->init_slots, new_sz * host_ptr_sz); + tmp = libbpf_reallocarray(map->init_slots, new_sz, host_ptr_sz); if (!tmp) return -ENOMEM; map->init_slots = tmp; @@ -5348,6 +5948,51 @@ static int bpf_object__collect_reloc(struct bpf_object *obj) return 0; } +static bool insn_is_helper_call(struct bpf_insn *insn, enum bpf_func_id *func_id) +{ + if (BPF_CLASS(insn->code) == BPF_JMP && + BPF_OP(insn->code) == BPF_CALL && + BPF_SRC(insn->code) == BPF_K && + insn->src_reg == 0 && + insn->dst_reg == 0) { + *func_id = insn->imm; + return true; + } + return false; +} + +static int bpf_object__sanitize_prog(struct bpf_object* obj, struct bpf_program *prog) +{ + struct bpf_insn *insn = prog->insns; + enum bpf_func_id func_id; + int i; + + for (i = 0; i < prog->insns_cnt; i++, insn++) { + if (!insn_is_helper_call(insn, &func_id)) + continue; + + /* on kernels that don't yet support + * bpf_probe_read_{kernel,user}[_str] helpers, fall back + * to bpf_probe_read() which works well for old kernels + */ + switch (func_id) { + case BPF_FUNC_probe_read_kernel: + case BPF_FUNC_probe_read_user: + if (!kernel_supports(FEAT_PROBE_READ_KERN)) + insn->imm = BPF_FUNC_probe_read; + break; + case BPF_FUNC_probe_read_kernel_str: + case BPF_FUNC_probe_read_user_str: + if (!kernel_supports(FEAT_PROBE_READ_KERN)) + insn->imm = BPF_FUNC_probe_read_str; + break; + default: + break; + } + } + return 0; +} + static int load_program(struct bpf_program *prog, struct bpf_insn *insns, int insns_cnt, char *license, __u32 kern_version, int *pfd) @@ -5364,12 +6009,12 @@ load_program(struct bpf_program *prog, struct bpf_insn *insns, int insns_cnt, memset(&load_attr, 0, sizeof(struct bpf_load_program_attr)); load_attr.prog_type = prog->type; /* old kernels might not support specifying expected_attach_type */ - if (!prog->caps->exp_attach_type && prog->sec_def && + if (!kernel_supports(FEAT_EXP_ATTACH_TYPE) && prog->sec_def && prog->sec_def->is_exp_attach_type_optional) load_attr.expected_attach_type = 0; else load_attr.expected_attach_type = prog->expected_attach_type; - if (prog->caps->name) + if (kernel_supports(FEAT_PROG_NAME)) load_attr.name = prog->name; load_attr.insns = insns; load_attr.insns_cnt = insns_cnt; @@ -5387,7 +6032,7 @@ load_program(struct bpf_program *prog, struct bpf_insn *insns, int insns_cnt, } /* specify func_info/line_info only if kernel supports them */ btf_fd = bpf_object__btf_fd(prog->obj); - if (btf_fd >= 0 && prog->obj->caps.btf_func) { + if (btf_fd >= 0 && kernel_supports(FEAT_BTF_FUNC)) { load_attr.prog_btf_fd = btf_fd; load_attr.func_info = prog->func_info; load_attr.func_info_rec_size = prog->func_info_rec_size; @@ -5425,7 +6070,7 @@ retry_load: free(log_buf); goto retry_load; } - ret = -errno; + ret = errno ? 
-errno : -LIBBPF_ERRNO__LOAD; cp = libbpf_strerror_r(errno, errmsg, sizeof(errmsg)); pr_warn("load bpf program failed: %s\n", cp); pr_perm_msg(ret); @@ -5564,11 +6209,17 @@ bpf_object__load_progs(struct bpf_object *obj, int log_level) for (i = 0; i < obj->nr_programs; i++) { prog = &obj->programs[i]; + err = bpf_object__sanitize_prog(obj, prog); + if (err) + return err; + } + + for (i = 0; i < obj->nr_programs; i++) { + prog = &obj->programs[i]; if (bpf_program__is_function_storage(prog, obj)) continue; if (!prog->load) { - pr_debug("prog '%s'('%s'): skipped loading\n", - prog->name, prog->section_name); + pr_debug("prog '%s': skipped loading\n", prog->name); continue; } prog->log_level |= log_level; @@ -5641,6 +6292,8 @@ __bpf_object__open(const char *path, const void *obj_buf, size_t obj_buf_sz, /* couldn't guess, but user might manually specify */ continue; + if (prog->sec_def->is_sleepable) + prog->prog_flags |= BPF_F_SLEEPABLE; bpf_program__set_type(prog, prog->sec_def->prog_type); bpf_program__set_expected_attach_type(prog, prog->sec_def->expected_attach_type); @@ -5750,11 +6403,11 @@ static int bpf_object__sanitize_maps(struct bpf_object *obj) bpf_object__for_each_map(m, obj) { if (!bpf_map__is_internal(m)) continue; - if (!obj->caps.global_data) { + if (!kernel_supports(FEAT_GLOBAL_DATA)) { pr_warn("kernel doesn't support global data\n"); return -ENOTSUP; } - if (!obj->caps.array_mmap) + if (!kernel_supports(FEAT_ARRAY_MMAP)) m->def.map_flags ^= BPF_F_MMAPABLE; } @@ -5904,7 +6557,6 @@ int bpf_object__load_xattr(struct bpf_object_load_attr *attr) } err = bpf_object__probe_loading(obj); - err = err ? : bpf_object__probe_caps(obj); err = err ? : bpf_object__resolve_externs(obj, obj->kconfig); err = err ? : bpf_object__sanitize_and_load_btf(obj); err = err ? : bpf_object__sanitize_maps(obj); @@ -6713,7 +7365,7 @@ int bpf_program__fd(const struct bpf_program *prog) size_t bpf_program__size(const struct bpf_program *prog) { - return prog->insns_cnt * sizeof(struct bpf_insn); + return prog->insns_cnt * BPF_INSN_SZ; } int bpf_program__set_prep(struct bpf_program *prog, int nr_instances, @@ -6910,6 +7562,21 @@ static const struct bpf_sec_def section_defs[] = { .expected_attach_type = BPF_TRACE_FEXIT, .is_attach_btf = true, .attach_fn = attach_trace), + SEC_DEF("fentry.s/", TRACING, + .expected_attach_type = BPF_TRACE_FENTRY, + .is_attach_btf = true, + .is_sleepable = true, + .attach_fn = attach_trace), + SEC_DEF("fmod_ret.s/", TRACING, + .expected_attach_type = BPF_MODIFY_RETURN, + .is_attach_btf = true, + .is_sleepable = true, + .attach_fn = attach_trace), + SEC_DEF("fexit.s/", TRACING, + .expected_attach_type = BPF_TRACE_FEXIT, + .is_attach_btf = true, + .is_sleepable = true, + .attach_fn = attach_trace), SEC_DEF("freplace/", EXT, .is_attach_btf = true, .attach_fn = attach_trace), @@ -6917,6 +7584,11 @@ static const struct bpf_sec_def section_defs[] = { .is_attach_btf = true, .expected_attach_type = BPF_LSM_MAC, .attach_fn = attach_lsm), + SEC_DEF("lsm.s/", LSM, + .is_attach_btf = true, + .is_sleepable = true, + .expected_attach_type = BPF_LSM_MAC, + .attach_fn = attach_lsm), SEC_DEF("iter/", TRACING, .expected_attach_type = BPF_TRACE_ITER, .is_attach_btf = true, @@ -7122,8 +7794,7 @@ static int bpf_object__collect_st_ops_relos(struct bpf_object *obj, return -LIBBPF_ERRNO__FORMAT; } - name = elf_strptr(obj->efile.elf, obj->efile.strtabidx, - sym.st_name) ? 
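On the program side, the new sleepable section definitions mean that opting into BPF_F_SLEEPABLE is just a matter of the ".s" suffix in the SEC() name. A sketch of a sleepable fentry program (the attach target and program name are arbitrary examples):

    // SPDX-License-Identifier: GPL-2.0
    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    char LICENSE[] SEC("license") = "GPL";

    /* "fentry.s/" (note the ".s") makes libbpf set expected_attach_type to
     * BPF_TRACE_FENTRY and add BPF_F_SLEEPABLE to prog_flags, per the
     * SEC_DEFs above. */
    SEC("fentry.s/__x64_sys_getpgid")
    int BPF_PROG(sleepable_example)
    {
        return 0;
    }
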
: "<?>"; + name = elf_sym_str(obj, sym.st_name) ?: "<?>"; map = find_struct_ops_map_by_offset(obj, rel.r_offset); if (!map) { pr_warn("struct_ops reloc: cannot find map at rel.r_offset %zu\n", @@ -7640,7 +8311,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr, prog->prog_ifindex = attr->ifindex; prog->log_level = attr->log_level; - prog->prog_flags = attr->prog_flags; + prog->prog_flags |= attr->prog_flags; if (!first_prog) first_prog = prog; } @@ -8594,7 +9265,7 @@ struct perf_buffer *perf_buffer__new(int map_fd, size_t page_cnt, struct perf_buffer_params p = {}; struct perf_event_attr attr = { 0, }; - attr.config = PERF_COUNT_SW_BPF_OUTPUT, + attr.config = PERF_COUNT_SW_BPF_OUTPUT; attr.type = PERF_TYPE_SOFTWARE; attr.sample_type = PERF_SAMPLE_RAW; attr.sample_period = 1; @@ -8832,6 +9503,11 @@ static int perf_buffer__process_records(struct perf_buffer *pb, return 0; } +int perf_buffer__epoll_fd(const struct perf_buffer *pb) +{ + return pb->epoll_fd; +} + int perf_buffer__poll(struct perf_buffer *pb, int timeout_ms) { int i, cnt, err; @@ -8849,6 +9525,55 @@ int perf_buffer__poll(struct perf_buffer *pb, int timeout_ms) return cnt < 0 ? -errno : cnt; } +/* Return number of PERF_EVENT_ARRAY map slots set up by this perf_buffer + * manager. + */ +size_t perf_buffer__buffer_cnt(const struct perf_buffer *pb) +{ + return pb->cpu_cnt; +} + +/* + * Return perf_event FD of a ring buffer in *buf_idx* slot of + * PERF_EVENT_ARRAY BPF map. This FD can be polled for new data using + * select()/poll()/epoll() Linux syscalls. + */ +int perf_buffer__buffer_fd(const struct perf_buffer *pb, size_t buf_idx) +{ + struct perf_cpu_buf *cpu_buf; + + if (buf_idx >= pb->cpu_cnt) + return -EINVAL; + + cpu_buf = pb->cpu_bufs[buf_idx]; + if (!cpu_buf) + return -ENOENT; + + return cpu_buf->fd; +} + +/* + * Consume data from perf ring buffer corresponding to slot *buf_idx* in + * PERF_EVENT_ARRAY BPF map without waiting/polling. If there is no data to + * consume, do nothing and return success. + * Returns: + * - 0 on success; + * - <0 on failure. 
+ */ +int perf_buffer__consume_buffer(struct perf_buffer *pb, size_t buf_idx) +{ + struct perf_cpu_buf *cpu_buf; + + if (buf_idx >= pb->cpu_cnt) + return -EINVAL; + + cpu_buf = pb->cpu_bufs[buf_idx]; + if (!cpu_buf) + return -ENOENT; + + return perf_buffer__process_records(pb, cpu_buf); +} + int perf_buffer__consume(struct perf_buffer *pb) { int i, err; @@ -8861,7 +9586,7 @@ int perf_buffer__consume(struct perf_buffer *pb) err = perf_buffer__process_records(pb, cpu_buf); if (err) { - pr_warn("error while processing records: %d\n", err); + pr_warn("perf_buffer: failed to process records in buffer #%d: %d\n", i, err); return err; } } diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h index 5ecb4069a9f0..308e0ded8f14 100644 --- a/tools/lib/bpf/libbpf.h +++ b/tools/lib/bpf/libbpf.h @@ -588,8 +588,12 @@ perf_buffer__new_raw(int map_fd, size_t page_cnt, const struct perf_buffer_raw_opts *opts); LIBBPF_API void perf_buffer__free(struct perf_buffer *pb); +LIBBPF_API int perf_buffer__epoll_fd(const struct perf_buffer *pb); LIBBPF_API int perf_buffer__poll(struct perf_buffer *pb, int timeout_ms); LIBBPF_API int perf_buffer__consume(struct perf_buffer *pb); +LIBBPF_API int perf_buffer__consume_buffer(struct perf_buffer *pb, size_t buf_idx); +LIBBPF_API size_t perf_buffer__buffer_cnt(const struct perf_buffer *pb); +LIBBPF_API int perf_buffer__buffer_fd(const struct perf_buffer *pb, size_t buf_idx); typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(struct perf_event_header *hdr, diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map index e35bd6cdbdbf..3fedcdc4ae2f 100644 --- a/tools/lib/bpf/libbpf.map +++ b/tools/lib/bpf/libbpf.map @@ -299,3 +299,12 @@ LIBBPF_0.1.0 { btf__set_fd; btf__set_pointer_size; } LIBBPF_0.0.9; + +LIBBPF_0.2.0 { + global: + perf_buffer__buffer_cnt; + perf_buffer__buffer_fd; + perf_buffer__epoll_fd; + perf_buffer__consume_buffer; + xsk_socket__create_shared; +} LIBBPF_0.1.0; diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h index 50d70e90d5f1..4d1c366fca2c 100644 --- a/tools/lib/bpf/libbpf_internal.h +++ b/tools/lib/bpf/libbpf_internal.h @@ -9,6 +9,15 @@ #ifndef __LIBBPF_LIBBPF_INTERNAL_H #define __LIBBPF_LIBBPF_INTERNAL_H +#include <stdlib.h> +#include <limits.h> + +/* make sure libbpf doesn't use kernel-only integer typedefs */ +#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 + +/* prevent accidental re-addition of reallocarray() */ +#pragma GCC poison reallocarray + #include "libbpf.h" #define BTF_INFO_ENC(kind, kind_flag, vlen) \ @@ -23,6 +32,12 @@ #define BTF_PARAM_ENC(name, type) (name), (type) #define BTF_VAR_SECINFO_ENC(type, offset, size) (type), (offset), (size) +#ifndef likely +#define likely(x) __builtin_expect(!!(x), 1) +#endif +#ifndef unlikely +#define unlikely(x) __builtin_expect(!!(x), 0) +#endif #ifndef min # define min(x, y) ((x) < (y) ? (x) : (y)) #endif @@ -63,6 +78,33 @@ do { \ #define pr_info(fmt, ...) __pr(LIBBPF_INFO, fmt, ##__VA_ARGS__) #define pr_debug(fmt, ...) __pr(LIBBPF_DEBUG, fmt, ##__VA_ARGS__) +#ifndef __has_builtin +#define __has_builtin(x) 0 +#endif +/* + * Re-implement glibc's reallocarray() for libbpf internal-only use. + * reallocarray(), unfortunately, is not available in all versions of glibc, + * so requires extra feature detection and using reallocarray() stub from + * <tools/libc_compat.h> and COMPAT_NEED_REALLOCARRAY. All this complicates + * build of libbpf unnecessarily and is just a maintenance burden. 
Instead, + * it's trivial to implement libbpf-specific internal version and use it + * throughout libbpf. + */ +static inline void *libbpf_reallocarray(void *ptr, size_t nmemb, size_t size) +{ + size_t total; + +#if __has_builtin(__builtin_mul_overflow) + if (unlikely(__builtin_mul_overflow(nmemb, size, &total))) + return NULL; +#else + if (size == 0 || nmemb > ULONG_MAX / size) + return NULL; + total = nmemb * size; +#endif + return realloc(ptr, total); +} + static inline bool libbpf_validate_opts(const char *opts, size_t opts_sz, size_t user_sz, const char *type_name) @@ -105,18 +147,6 @@ int bpf_object__section_size(const struct bpf_object *obj, const char *name, int bpf_object__variable_offset(const struct bpf_object *obj, const char *name, __u32 *off); -struct nlattr; -typedef int (*libbpf_dump_nlmsg_t)(void *cookie, void *msg, struct nlattr **tb); -int libbpf_netlink_open(unsigned int *nl_pid); -int libbpf_nl_get_link(int sock, unsigned int nl_pid, - libbpf_dump_nlmsg_t dump_link_nlmsg, void *cookie); -int libbpf_nl_get_class(int sock, unsigned int nl_pid, int ifindex, - libbpf_dump_nlmsg_t dump_class_nlmsg, void *cookie); -int libbpf_nl_get_qdisc(int sock, unsigned int nl_pid, int ifindex, - libbpf_dump_nlmsg_t dump_qdisc_nlmsg, void *cookie); -int libbpf_nl_get_filter(int sock, unsigned int nl_pid, int ifindex, int handle, - libbpf_dump_nlmsg_t dump_filter_nlmsg, void *cookie); - struct btf_ext_info { /* * info points to the individual info section (e.g. func_info and @@ -138,6 +168,44 @@ struct btf_ext_info { i < (sec)->num_info; \ i++, rec = (void *)rec + (seg)->rec_size) +/* + * The .BTF.ext ELF section layout defined as + * struct btf_ext_header + * func_info subsection + * + * The func_info subsection layout: + * record size for struct bpf_func_info in the func_info subsection + * struct btf_sec_func_info for section #1 + * a list of bpf_func_info records for section #1 + * where struct bpf_func_info mimics one in include/uapi/linux/bpf.h + * but may not be identical + * struct btf_sec_func_info for section #2 + * a list of bpf_func_info records for section #2 + * ...... + * + * Note that the bpf_func_info record size in .BTF.ext may not + * be the same as the one defined in include/uapi/linux/bpf.h. + * The loader should ensure that record_size meets minimum + * requirement and pass the record as is to the kernel. The + * kernel will handle the func_info properly based on its contents. + */ +struct btf_ext_header { + __u16 magic; + __u8 version; + __u8 flags; + __u32 hdr_len; + + /* All offsets are in bytes relative to the end of this header */ + __u32 func_info_off; + __u32 func_info_len; + __u32 line_info_off; + __u32 line_info_len; + + /* optional part of .BTF.ext header */ + __u32 core_relo_off; + __u32 core_relo_len; +}; + struct btf_ext { union { struct btf_ext_header *hdr; @@ -145,7 +213,7 @@ struct btf_ext { }; struct btf_ext_info func_info; struct btf_ext_info line_info; - struct btf_ext_info field_reloc_info; + struct btf_ext_info core_relo_info; __u32 data_size; }; @@ -170,32 +238,40 @@ struct bpf_line_info_min { __u32 line_col; }; -/* bpf_field_info_kind encodes which aspect of captured field has to be - * adjusted by relocations. Currently supported values are: - * - BPF_FIELD_BYTE_OFFSET: field offset (in bytes); - * - BPF_FIELD_EXISTS: field existence (1, if field exists; 0, otherwise); +/* bpf_core_relo_kind encodes which aspect of captured field/type/enum value + * has to be adjusted by relocations. 
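As a usage note for libbpf_reallocarray() above: it is realloc() plus an overflow-checked multiplication, returning NULL on overflow as well as on allocation failure, so growth paths stay short. A minimal sketch (ensure_int_cap() is a hypothetical helper, not part of libbpf):

    #include <errno.h>
    #include <stdlib.h>
    #include "libbpf_internal.h" /* libbpf_reallocarray() */

    /* Grow *arr to hold at least `need` ints; returns 0 or -ENOMEM. */
    static int ensure_int_cap(int **arr, size_t *cap, size_t need)
    {
        int *tmp;

        if (need <= *cap)
            return 0;
        tmp = libbpf_reallocarray(*arr, need, sizeof(**arr));
        if (!tmp)
            return -ENOMEM; /* overflow or out of memory */
        *arr = tmp;
        *cap = need;
        return 0;
    }
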
*/ -enum bpf_field_info_kind { +enum bpf_core_relo_kind { BPF_FIELD_BYTE_OFFSET = 0, /* field byte offset */ - BPF_FIELD_BYTE_SIZE = 1, + BPF_FIELD_BYTE_SIZE = 1, /* field size in bytes */ BPF_FIELD_EXISTS = 2, /* field existence in target kernel */ - BPF_FIELD_SIGNED = 3, - BPF_FIELD_LSHIFT_U64 = 4, - BPF_FIELD_RSHIFT_U64 = 5, + BPF_FIELD_SIGNED = 3, /* field signedness (0 - unsigned, 1 - signed) */ + BPF_FIELD_LSHIFT_U64 = 4, /* bitfield-specific left bitshift */ + BPF_FIELD_RSHIFT_U64 = 5, /* bitfield-specific right bitshift */ + BPF_TYPE_ID_LOCAL = 6, /* type ID in local BPF object */ + BPF_TYPE_ID_TARGET = 7, /* type ID in target kernel */ + BPF_TYPE_EXISTS = 8, /* type existence in target kernel */ + BPF_TYPE_SIZE = 9, /* type size in bytes */ + BPF_ENUMVAL_EXISTS = 10, /* enum value existence in target kernel */ + BPF_ENUMVAL_VALUE = 11, /* enum value integer value */ }; -/* The minimum bpf_field_reloc checked by the loader +/* The minimum bpf_core_relo checked by the loader * - * Field relocation captures the following data: + * CO-RE relocation captures the following data: * - insn_off - instruction offset (in bytes) within a BPF program that needs * its insn->imm field to be relocated with actual field info; * - type_id - BTF type ID of the "root" (containing) entity of a relocatable - * field; + * type or field; * - access_str_off - offset into corresponding .BTF string section. String - * itself encodes an accessed field using a sequence of field and array - * indicies, separated by colon (:). It's conceptually very close to LLVM's - * getelementptr ([0]) instruction's arguments for identifying offset to - * a field. + * interpretation depends on specific relocation kind: + * - for field-based relocations, string encodes an accessed field using + * a sequence of field and array indices, separated by colon (:). It's + * conceptually very close to LLVM's getelementptr ([0]) instruction's + * arguments for identifying offset to a field. + * - for type-based relocations, strings is expected to be just "0"; + * - for enum value-based relocations, string contains an index of enum + * value within its enum type; * * Example to provide a better feel. 
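The worked example this comment refers to lives in unchanged context and is elided from the hunk. As an illustration of the field-access encoding described above (types and accesses are hypothetical):

    struct sample {
        int a;
        struct {
            int b[10];
        };
    };

    struct sample *s;

    /* &s->a    -> access string "0:0"     (index 0 into the root pointer,
     *                                      then field #0)                     */
    /* &s->b[5] -> access string "0:1:0:5" (field #1 is the anonymous struct,
     *                                      b is its field #0, element #5)     */
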
* @@ -226,11 +302,11 @@ enum bpf_field_info_kind { * * [0] https://llvm.org/docs/LangRef.html#getelementptr-instruction */ -struct bpf_field_reloc { +struct bpf_core_relo { __u32 insn_off; __u32 type_id; __u32 access_str_off; - enum bpf_field_info_kind kind; + enum bpf_core_relo_kind kind; }; #endif /* __LIBBPF_LIBBPF_INTERNAL_H */ diff --git a/tools/lib/bpf/libbpf_probes.c b/tools/lib/bpf/libbpf_probes.c index 5a3d3f078408..5482a9b7ae2d 100644 --- a/tools/lib/bpf/libbpf_probes.c +++ b/tools/lib/bpf/libbpf_probes.c @@ -17,9 +17,6 @@ #include "libbpf.h" #include "libbpf_internal.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - static bool grep(const char *buffer, const char *pattern) { return !!strstr(buffer, pattern); @@ -173,7 +170,7 @@ int libbpf__load_raw_btf(const char *raw_types, size_t types_len, return btf_fd; } -static int load_sk_storage_btf(void) +static int load_local_storage_btf(void) { const char strs[] = "\0bpf_spin_lock\0val\0cnt\0l"; /* struct bpf_spin_lock { @@ -232,12 +229,13 @@ bool bpf_probe_map_type(enum bpf_map_type map_type, __u32 ifindex) key_size = 0; break; case BPF_MAP_TYPE_SK_STORAGE: + case BPF_MAP_TYPE_INODE_STORAGE: btf_key_type_id = 1; btf_value_type_id = 3; value_size = 8; max_entries = 0; map_flags = BPF_F_NO_PREALLOC; - btf_fd = load_sk_storage_btf(); + btf_fd = load_local_storage_btf(); if (btf_fd < 0) return false; break; diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c index 312f887570b2..4dd73de00b6f 100644 --- a/tools/lib/bpf/netlink.c +++ b/tools/lib/bpf/netlink.c @@ -15,13 +15,12 @@ #include "libbpf_internal.h" #include "nlattr.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - #ifndef SOL_NETLINK #define SOL_NETLINK 270 #endif +typedef int (*libbpf_dump_nlmsg_t)(void *cookie, void *msg, struct nlattr **tb); + typedef int (*__dump_nlmsg_t)(struct nlmsghdr *nlmsg, libbpf_dump_nlmsg_t, void *cookie); @@ -31,7 +30,7 @@ struct xdp_id_md { struct xdp_link_info info; }; -int libbpf_netlink_open(__u32 *nl_pid) +static int libbpf_netlink_open(__u32 *nl_pid) { struct sockaddr_nl sa; socklen_t addrlen; @@ -283,6 +282,9 @@ static int get_xdp_info(void *cookie, void *msg, struct nlattr **tb) return 0; } +static int libbpf_nl_get_link(int sock, unsigned int nl_pid, + libbpf_dump_nlmsg_t dump_link_nlmsg, void *cookie); + int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info, size_t info_size, __u32 flags) { @@ -368,121 +370,3 @@ int libbpf_nl_get_link(int sock, unsigned int nl_pid, return bpf_netlink_recv(sock, nl_pid, seq, __dump_link_nlmsg, dump_link_nlmsg, cookie); } - -static int __dump_class_nlmsg(struct nlmsghdr *nlh, - libbpf_dump_nlmsg_t dump_class_nlmsg, - void *cookie) -{ - struct nlattr *tb[TCA_MAX + 1], *attr; - struct tcmsg *t = NLMSG_DATA(nlh); - int len; - - len = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*t)); - attr = (struct nlattr *) ((void *) t + NLMSG_ALIGN(sizeof(*t))); - if (libbpf_nla_parse(tb, TCA_MAX, attr, len, NULL) != 0) - return -LIBBPF_ERRNO__NLPARSE; - - return dump_class_nlmsg(cookie, t, tb); -} - -int libbpf_nl_get_class(int sock, unsigned int nl_pid, int ifindex, - libbpf_dump_nlmsg_t dump_class_nlmsg, void *cookie) -{ - struct { - struct nlmsghdr nlh; - struct tcmsg t; - } req = { - .nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), - .nlh.nlmsg_type = RTM_GETTCLASS, - .nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, - .t.tcm_family = AF_UNSPEC, - .t.tcm_ifindex = 
ifindex, - }; - int seq = time(NULL); - - req.nlh.nlmsg_seq = seq; - if (send(sock, &req, req.nlh.nlmsg_len, 0) < 0) - return -errno; - - return bpf_netlink_recv(sock, nl_pid, seq, __dump_class_nlmsg, - dump_class_nlmsg, cookie); -} - -static int __dump_qdisc_nlmsg(struct nlmsghdr *nlh, - libbpf_dump_nlmsg_t dump_qdisc_nlmsg, - void *cookie) -{ - struct nlattr *tb[TCA_MAX + 1], *attr; - struct tcmsg *t = NLMSG_DATA(nlh); - int len; - - len = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*t)); - attr = (struct nlattr *) ((void *) t + NLMSG_ALIGN(sizeof(*t))); - if (libbpf_nla_parse(tb, TCA_MAX, attr, len, NULL) != 0) - return -LIBBPF_ERRNO__NLPARSE; - - return dump_qdisc_nlmsg(cookie, t, tb); -} - -int libbpf_nl_get_qdisc(int sock, unsigned int nl_pid, int ifindex, - libbpf_dump_nlmsg_t dump_qdisc_nlmsg, void *cookie) -{ - struct { - struct nlmsghdr nlh; - struct tcmsg t; - } req = { - .nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), - .nlh.nlmsg_type = RTM_GETQDISC, - .nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, - .t.tcm_family = AF_UNSPEC, - .t.tcm_ifindex = ifindex, - }; - int seq = time(NULL); - - req.nlh.nlmsg_seq = seq; - if (send(sock, &req, req.nlh.nlmsg_len, 0) < 0) - return -errno; - - return bpf_netlink_recv(sock, nl_pid, seq, __dump_qdisc_nlmsg, - dump_qdisc_nlmsg, cookie); -} - -static int __dump_filter_nlmsg(struct nlmsghdr *nlh, - libbpf_dump_nlmsg_t dump_filter_nlmsg, - void *cookie) -{ - struct nlattr *tb[TCA_MAX + 1], *attr; - struct tcmsg *t = NLMSG_DATA(nlh); - int len; - - len = nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*t)); - attr = (struct nlattr *) ((void *) t + NLMSG_ALIGN(sizeof(*t))); - if (libbpf_nla_parse(tb, TCA_MAX, attr, len, NULL) != 0) - return -LIBBPF_ERRNO__NLPARSE; - - return dump_filter_nlmsg(cookie, t, tb); -} - -int libbpf_nl_get_filter(int sock, unsigned int nl_pid, int ifindex, int handle, - libbpf_dump_nlmsg_t dump_filter_nlmsg, void *cookie) -{ - struct { - struct nlmsghdr nlh; - struct tcmsg t; - } req = { - .nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct tcmsg)), - .nlh.nlmsg_type = RTM_GETTFILTER, - .nlh.nlmsg_flags = NLM_F_DUMP | NLM_F_REQUEST, - .t.tcm_family = AF_UNSPEC, - .t.tcm_ifindex = ifindex, - .t.tcm_parent = handle, - }; - int seq = time(NULL); - - req.nlh.nlmsg_seq = seq; - if (send(sock, &req, req.nlh.nlmsg_len, 0) < 0) - return -errno; - - return bpf_netlink_recv(sock, nl_pid, seq, __dump_filter_nlmsg, - dump_filter_nlmsg, cookie); -} diff --git a/tools/lib/bpf/nlattr.c b/tools/lib/bpf/nlattr.c index 0ad41dfea8eb..b607fa9852b1 100644 --- a/tools/lib/bpf/nlattr.c +++ b/tools/lib/bpf/nlattr.c @@ -7,14 +7,11 @@ */ #include <errno.h> -#include "nlattr.h" -#include "libbpf_internal.h" -#include <linux/rtnetlink.h> #include <string.h> #include <stdio.h> - -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 +#include <linux/rtnetlink.h> +#include "nlattr.h" +#include "libbpf_internal.h" static uint16_t nla_attr_minlen[LIBBPF_NLA_TYPE_MAX+1] = { [LIBBPF_NLA_U8] = sizeof(uint8_t), diff --git a/tools/lib/bpf/ringbuf.c b/tools/lib/bpf/ringbuf.c index 4fc6c6cbb4eb..5c6522c89af1 100644 --- a/tools/lib/bpf/ringbuf.c +++ b/tools/lib/bpf/ringbuf.c @@ -16,15 +16,11 @@ #include <asm/barrier.h> #include <sys/mman.h> #include <sys/epoll.h> -#include <tools/libc_compat.h> #include "libbpf.h" #include "libbpf_internal.h" #include "bpf.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - struct ring { ring_buffer_sample_fn sample_cb; 
void *ctx; @@ -82,12 +78,12 @@ int ring_buffer__add(struct ring_buffer *rb, int map_fd, return -EINVAL; } - tmp = reallocarray(rb->rings, rb->ring_cnt + 1, sizeof(*rb->rings)); + tmp = libbpf_reallocarray(rb->rings, rb->ring_cnt + 1, sizeof(*rb->rings)); if (!tmp) return -ENOMEM; rb->rings = tmp; - tmp = reallocarray(rb->events, rb->ring_cnt + 1, sizeof(*rb->events)); + tmp = libbpf_reallocarray(rb->events, rb->ring_cnt + 1, sizeof(*rb->events)); if (!tmp) return -ENOMEM; rb->events = tmp; diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c index f7f4efb70a4c..49c324594792 100644 --- a/tools/lib/bpf/xsk.c +++ b/tools/lib/bpf/xsk.c @@ -20,6 +20,7 @@ #include <linux/if_ether.h> #include <linux/if_packet.h> #include <linux/if_xdp.h> +#include <linux/list.h> #include <linux/sockios.h> #include <net/if.h> #include <sys/ioctl.h> @@ -32,9 +33,6 @@ #include "libbpf_internal.h" #include "xsk.h" -/* make sure libbpf doesn't use kernel-only integer typedefs */ -#pragma GCC poison u8 u16 u32 u64 s8 s16 s32 s64 - #ifndef SOL_XDP #define SOL_XDP 283 #endif @@ -48,26 +46,35 @@ #endif struct xsk_umem { - struct xsk_ring_prod *fill; - struct xsk_ring_cons *comp; + struct xsk_ring_prod *fill_save; + struct xsk_ring_cons *comp_save; char *umem_area; struct xsk_umem_config config; int fd; int refcount; + struct list_head ctx_list; +}; + +struct xsk_ctx { + struct xsk_ring_prod *fill; + struct xsk_ring_cons *comp; + __u32 queue_id; + struct xsk_umem *umem; + int refcount; + int ifindex; + struct list_head list; + int prog_fd; + int xsks_map_fd; + char ifname[IFNAMSIZ]; }; struct xsk_socket { struct xsk_ring_cons *rx; struct xsk_ring_prod *tx; __u64 outstanding_tx; - struct xsk_umem *umem; + struct xsk_ctx *ctx; struct xsk_socket_config config; int fd; - int ifindex; - int prog_fd; - int xsks_map_fd; - __u32 queue_id; - char ifname[IFNAMSIZ]; }; struct xsk_nl_info { @@ -203,15 +210,73 @@ static int xsk_get_mmap_offsets(int fd, struct xdp_mmap_offsets *off) return -EINVAL; } +static int xsk_create_umem_rings(struct xsk_umem *umem, int fd, + struct xsk_ring_prod *fill, + struct xsk_ring_cons *comp) +{ + struct xdp_mmap_offsets off; + void *map; + int err; + + err = setsockopt(fd, SOL_XDP, XDP_UMEM_FILL_RING, + &umem->config.fill_size, + sizeof(umem->config.fill_size)); + if (err) + return -errno; + + err = setsockopt(fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, + &umem->config.comp_size, + sizeof(umem->config.comp_size)); + if (err) + return -errno; + + err = xsk_get_mmap_offsets(fd, &off); + if (err) + return -errno; + + map = mmap(NULL, off.fr.desc + umem->config.fill_size * sizeof(__u64), + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, + XDP_UMEM_PGOFF_FILL_RING); + if (map == MAP_FAILED) + return -errno; + + fill->mask = umem->config.fill_size - 1; + fill->size = umem->config.fill_size; + fill->producer = map + off.fr.producer; + fill->consumer = map + off.fr.consumer; + fill->flags = map + off.fr.flags; + fill->ring = map + off.fr.desc; + fill->cached_cons = umem->config.fill_size; + + map = mmap(NULL, off.cr.desc + umem->config.comp_size * sizeof(__u64), + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, + XDP_UMEM_PGOFF_COMPLETION_RING); + if (map == MAP_FAILED) { + err = -errno; + goto out_mmap; + } + + comp->mask = umem->config.comp_size - 1; + comp->size = umem->config.comp_size; + comp->producer = map + off.cr.producer; + comp->consumer = map + off.cr.consumer; + comp->flags = map + off.cr.flags; + comp->ring = map + off.cr.desc; + + return 0; + +out_mmap: + munmap(map, off.fr.desc + 
umem->config.fill_size * sizeof(__u64)); + return err; +} + int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area, __u64 size, struct xsk_ring_prod *fill, struct xsk_ring_cons *comp, const struct xsk_umem_config *usr_config) { - struct xdp_mmap_offsets off; struct xdp_umem_reg mr; struct xsk_umem *umem; - void *map; int err; if (!umem_area || !umem_ptr || !fill || !comp) @@ -230,6 +295,7 @@ int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area, } umem->umem_area = umem_area; + INIT_LIST_HEAD(&umem->ctx_list); xsk_set_umem_config(&umem->config, usr_config); memset(&mr, 0, sizeof(mr)); @@ -244,71 +310,16 @@ int xsk_umem__create_v0_0_4(struct xsk_umem **umem_ptr, void *umem_area, err = -errno; goto out_socket; } - err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_FILL_RING, - &umem->config.fill_size, - sizeof(umem->config.fill_size)); - if (err) { - err = -errno; - goto out_socket; - } - err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_COMPLETION_RING, - &umem->config.comp_size, - sizeof(umem->config.comp_size)); - if (err) { - err = -errno; - goto out_socket; - } - err = xsk_get_mmap_offsets(umem->fd, &off); - if (err) { - err = -errno; - goto out_socket; - } - - map = mmap(NULL, off.fr.desc + umem->config.fill_size * sizeof(__u64), - PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, umem->fd, - XDP_UMEM_PGOFF_FILL_RING); - if (map == MAP_FAILED) { - err = -errno; + err = xsk_create_umem_rings(umem, umem->fd, fill, comp); + if (err) goto out_socket; - } - - umem->fill = fill; - fill->mask = umem->config.fill_size - 1; - fill->size = umem->config.fill_size; - fill->producer = map + off.fr.producer; - fill->consumer = map + off.fr.consumer; - fill->flags = map + off.fr.flags; - fill->ring = map + off.fr.desc; - fill->cached_prod = *fill->producer; - /* cached_cons is "size" bigger than the real consumer pointer - * See xsk_prod_nb_free - */ - fill->cached_cons = *fill->consumer + umem->config.fill_size; - - map = mmap(NULL, off.cr.desc + umem->config.comp_size * sizeof(__u64), - PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, umem->fd, - XDP_UMEM_PGOFF_COMPLETION_RING); - if (map == MAP_FAILED) { - err = -errno; - goto out_mmap; - } - - umem->comp = comp; - comp->mask = umem->config.comp_size - 1; - comp->size = umem->config.comp_size; - comp->producer = map + off.cr.producer; - comp->consumer = map + off.cr.consumer; - comp->flags = map + off.cr.flags; - comp->ring = map + off.cr.desc; - comp->cached_prod = *comp->producer; - comp->cached_cons = *comp->consumer; + umem->fill_save = fill; + umem->comp_save = comp; *umem_ptr = umem; return 0; -out_mmap: - munmap(map, off.fr.desc + umem->config.fill_size * sizeof(__u64)); out_socket: close(umem->fd); out_umem_alloc: @@ -342,6 +353,7 @@ DEFAULT_VERSION(xsk_umem__create_v0_0_4, xsk_umem__create, LIBBPF_0.0.4) static int xsk_load_xdp_prog(struct xsk_socket *xsk) { static const int log_buf_size = 16 * 1024; + struct xsk_ctx *ctx = xsk->ctx; char log_buf[log_buf_size]; int err, prog_fd; @@ -369,7 +381,7 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk) /* *(u32 *)(r10 - 4) = r2 */ BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -4), /* r1 = xskmap[] */ - BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd), + BPF_LD_MAP_FD(BPF_REG_1, ctx->xsks_map_fd), /* r3 = XDP_PASS */ BPF_MOV64_IMM(BPF_REG_3, 2), /* call bpf_redirect_map */ @@ -381,7 +393,7 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk) /* r2 += -4 */ BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r1 = xskmap[] */ - BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd), + 
BPF_LD_MAP_FD(BPF_REG_1, ctx->xsks_map_fd), /* call bpf_map_lookup_elem */ BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem), /* r1 = r0 */ @@ -393,7 +405,7 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk) /* r2 = *(u32 *)(r10 - 4) */ BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_10, -4), /* r1 = xskmap[] */ - BPF_LD_MAP_FD(BPF_REG_1, xsk->xsks_map_fd), + BPF_LD_MAP_FD(BPF_REG_1, ctx->xsks_map_fd), /* r3 = 0 */ BPF_MOV64_IMM(BPF_REG_3, 0), /* call bpf_redirect_map */ @@ -411,19 +423,21 @@ static int xsk_load_xdp_prog(struct xsk_socket *xsk) return prog_fd; } - err = bpf_set_link_xdp_fd(xsk->ifindex, prog_fd, xsk->config.xdp_flags); + err = bpf_set_link_xdp_fd(xsk->ctx->ifindex, prog_fd, + xsk->config.xdp_flags); if (err) { close(prog_fd); return err; } - xsk->prog_fd = prog_fd; + ctx->prog_fd = prog_fd; return 0; } static int xsk_get_max_queues(struct xsk_socket *xsk) { struct ethtool_channels channels = { .cmd = ETHTOOL_GCHANNELS }; + struct xsk_ctx *ctx = xsk->ctx; struct ifreq ifr = {}; int fd, err, ret; @@ -432,7 +446,7 @@ static int xsk_get_max_queues(struct xsk_socket *xsk) return -errno; ifr.ifr_data = (void *)&channels; - memcpy(ifr.ifr_name, xsk->ifname, IFNAMSIZ - 1); + memcpy(ifr.ifr_name, ctx->ifname, IFNAMSIZ - 1); ifr.ifr_name[IFNAMSIZ - 1] = '\0'; err = ioctl(fd, SIOCETHTOOL, &ifr); if (err && errno != EOPNOTSUPP) { @@ -460,6 +474,7 @@ out: static int xsk_create_bpf_maps(struct xsk_socket *xsk) { + struct xsk_ctx *ctx = xsk->ctx; int max_queues; int fd; @@ -472,15 +487,17 @@ static int xsk_create_bpf_maps(struct xsk_socket *xsk) if (fd < 0) return fd; - xsk->xsks_map_fd = fd; + ctx->xsks_map_fd = fd; return 0; } static void xsk_delete_bpf_maps(struct xsk_socket *xsk) { - bpf_map_delete_elem(xsk->xsks_map_fd, &xsk->queue_id); - close(xsk->xsks_map_fd); + struct xsk_ctx *ctx = xsk->ctx; + + bpf_map_delete_elem(ctx->xsks_map_fd, &ctx->queue_id); + close(ctx->xsks_map_fd); } static int xsk_lookup_bpf_maps(struct xsk_socket *xsk) @@ -488,10 +505,11 @@ static int xsk_lookup_bpf_maps(struct xsk_socket *xsk) __u32 i, *map_ids, num_maps, prog_len = sizeof(struct bpf_prog_info); __u32 map_len = sizeof(struct bpf_map_info); struct bpf_prog_info prog_info = {}; + struct xsk_ctx *ctx = xsk->ctx; struct bpf_map_info map_info; int fd, err; - err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len); + err = bpf_obj_get_info_by_fd(ctx->prog_fd, &prog_info, &prog_len); if (err) return err; @@ -505,11 +523,11 @@ static int xsk_lookup_bpf_maps(struct xsk_socket *xsk) prog_info.nr_map_ids = num_maps; prog_info.map_ids = (__u64)(unsigned long)map_ids; - err = bpf_obj_get_info_by_fd(xsk->prog_fd, &prog_info, &prog_len); + err = bpf_obj_get_info_by_fd(ctx->prog_fd, &prog_info, &prog_len); if (err) goto out_map_ids; - xsk->xsks_map_fd = -1; + ctx->xsks_map_fd = -1; for (i = 0; i < prog_info.nr_map_ids; i++) { fd = bpf_map_get_fd_by_id(map_ids[i]); @@ -523,7 +541,7 @@ static int xsk_lookup_bpf_maps(struct xsk_socket *xsk) } if (!strcmp(map_info.name, "xsks_map")) { - xsk->xsks_map_fd = fd; + ctx->xsks_map_fd = fd; continue; } @@ -531,7 +549,7 @@ static int xsk_lookup_bpf_maps(struct xsk_socket *xsk) } err = 0; - if (xsk->xsks_map_fd == -1) + if (ctx->xsks_map_fd == -1) err = -ENOENT; out_map_ids: @@ -541,16 +559,19 @@ out_map_ids: static int xsk_set_bpf_maps(struct xsk_socket *xsk) { - return bpf_map_update_elem(xsk->xsks_map_fd, &xsk->queue_id, + struct xsk_ctx *ctx = xsk->ctx; + + return bpf_map_update_elem(ctx->xsks_map_fd, &ctx->queue_id, &xsk->fd, 0); } static int xsk_setup_xdp_prog(struct 
xsk_socket *xsk) { + struct xsk_ctx *ctx = xsk->ctx; __u32 prog_id = 0; int err; - err = bpf_get_link_xdp_id(xsk->ifindex, &prog_id, + err = bpf_get_link_xdp_id(ctx->ifindex, &prog_id, xsk->config.xdp_flags); if (err) return err; @@ -566,12 +587,12 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk) return err; } } else { - xsk->prog_fd = bpf_prog_get_fd_by_id(prog_id); - if (xsk->prog_fd < 0) + ctx->prog_fd = bpf_prog_get_fd_by_id(prog_id); + if (ctx->prog_fd < 0) return -errno; err = xsk_lookup_bpf_maps(xsk); if (err) { - close(xsk->prog_fd); + close(ctx->prog_fd); return err; } } @@ -580,25 +601,110 @@ static int xsk_setup_xdp_prog(struct xsk_socket *xsk) err = xsk_set_bpf_maps(xsk); if (err) { xsk_delete_bpf_maps(xsk); - close(xsk->prog_fd); + close(ctx->prog_fd); return err; } return 0; } -int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, - __u32 queue_id, struct xsk_umem *umem, - struct xsk_ring_cons *rx, struct xsk_ring_prod *tx, - const struct xsk_socket_config *usr_config) +static struct xsk_ctx *xsk_get_ctx(struct xsk_umem *umem, int ifindex, + __u32 queue_id) +{ + struct xsk_ctx *ctx; + + if (list_empty(&umem->ctx_list)) + return NULL; + + list_for_each_entry(ctx, &umem->ctx_list, list) { + if (ctx->ifindex == ifindex && ctx->queue_id == queue_id) { + ctx->refcount++; + return ctx; + } + } + + return NULL; +} + +static void xsk_put_ctx(struct xsk_ctx *ctx) +{ + struct xsk_umem *umem = ctx->umem; + struct xdp_mmap_offsets off; + int err; + + if (--ctx->refcount == 0) { + err = xsk_get_mmap_offsets(umem->fd, &off); + if (!err) { + munmap(ctx->fill->ring - off.fr.desc, + off.fr.desc + umem->config.fill_size * + sizeof(__u64)); + munmap(ctx->comp->ring - off.cr.desc, + off.cr.desc + umem->config.comp_size * + sizeof(__u64)); + } + + list_del(&ctx->list); + free(ctx); + } +} + +static struct xsk_ctx *xsk_create_ctx(struct xsk_socket *xsk, + struct xsk_umem *umem, int ifindex, + const char *ifname, __u32 queue_id, + struct xsk_ring_prod *fill, + struct xsk_ring_cons *comp) +{ + struct xsk_ctx *ctx; + int err; + + ctx = calloc(1, sizeof(*ctx)); + if (!ctx) + return NULL; + + if (!umem->fill_save) { + err = xsk_create_umem_rings(umem, xsk->fd, fill, comp); + if (err) { + free(ctx); + return NULL; + } + } else if (umem->fill_save != fill || umem->comp_save != comp) { + /* Copy over rings to new structs. 
*/ + memcpy(fill, umem->fill_save, sizeof(*fill)); + memcpy(comp, umem->comp_save, sizeof(*comp)); + } + + ctx->ifindex = ifindex; + ctx->refcount = 1; + ctx->umem = umem; + ctx->queue_id = queue_id; + memcpy(ctx->ifname, ifname, IFNAMSIZ - 1); + ctx->ifname[IFNAMSIZ - 1] = '\0'; + + umem->fill_save = NULL; + umem->comp_save = NULL; + ctx->fill = fill; + ctx->comp = comp; + list_add(&ctx->list, &umem->ctx_list); + return ctx; +} + +int xsk_socket__create_shared(struct xsk_socket **xsk_ptr, + const char *ifname, + __u32 queue_id, struct xsk_umem *umem, + struct xsk_ring_cons *rx, + struct xsk_ring_prod *tx, + struct xsk_ring_prod *fill, + struct xsk_ring_cons *comp, + const struct xsk_socket_config *usr_config) { void *rx_map = NULL, *tx_map = NULL; struct sockaddr_xdp sxdp = {}; struct xdp_mmap_offsets off; struct xsk_socket *xsk; - int err; + struct xsk_ctx *ctx; + int err, ifindex; - if (!umem || !xsk_ptr || !(rx || tx)) + if (!umem || !xsk_ptr || !(rx || tx) || !fill || !comp) return -EFAULT; xsk = calloc(1, sizeof(*xsk)); @@ -609,10 +715,10 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, if (err) goto out_xsk_alloc; - if (umem->refcount && - !(xsk->config.libbpf_flags & XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)) { - pr_warn("Error: shared umems not supported by libbpf supplied XDP program.\n"); - err = -EBUSY; + xsk->outstanding_tx = 0; + ifindex = if_nametoindex(ifname); + if (!ifindex) { + err = -errno; goto out_xsk_alloc; } @@ -626,16 +732,16 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, xsk->fd = umem->fd; } - xsk->outstanding_tx = 0; - xsk->queue_id = queue_id; - xsk->umem = umem; - xsk->ifindex = if_nametoindex(ifname); - if (!xsk->ifindex) { - err = -errno; - goto out_socket; + ctx = xsk_get_ctx(umem, ifindex, queue_id); + if (!ctx) { + ctx = xsk_create_ctx(xsk, umem, ifindex, ifname, queue_id, + fill, comp); + if (!ctx) { + err = -ENOMEM; + goto out_socket; + } } - memcpy(xsk->ifname, ifname, IFNAMSIZ - 1); - xsk->ifname[IFNAMSIZ - 1] = '\0'; + xsk->ctx = ctx; if (rx) { err = setsockopt(xsk->fd, SOL_XDP, XDP_RX_RING, @@ -643,7 +749,7 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, sizeof(xsk->config.rx_size)); if (err) { err = -errno; - goto out_socket; + goto out_put_ctx; } } if (tx) { @@ -652,14 +758,14 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, sizeof(xsk->config.tx_size)); if (err) { err = -errno; - goto out_socket; + goto out_put_ctx; } } err = xsk_get_mmap_offsets(xsk->fd, &off); if (err) { err = -errno; - goto out_socket; + goto out_put_ctx; } if (rx) { @@ -669,7 +775,7 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, xsk->fd, XDP_PGOFF_RX_RING); if (rx_map == MAP_FAILED) { err = -errno; - goto out_socket; + goto out_put_ctx; } rx->mask = xsk->config.rx_size - 1; @@ -708,10 +814,10 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, xsk->tx = tx; sxdp.sxdp_family = PF_XDP; - sxdp.sxdp_ifindex = xsk->ifindex; - sxdp.sxdp_queue_id = xsk->queue_id; + sxdp.sxdp_ifindex = ctx->ifindex; + sxdp.sxdp_queue_id = ctx->queue_id; if (umem->refcount > 1) { - sxdp.sxdp_flags = XDP_SHARED_UMEM; + sxdp.sxdp_flags |= XDP_SHARED_UMEM; sxdp.sxdp_shared_umem_fd = umem->fd; } else { sxdp.sxdp_flags = xsk->config.bind_flags; @@ -723,7 +829,7 @@ int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, goto out_mmap_tx; } - xsk->prog_fd = -1; + ctx->prog_fd = -1; if (!(xsk->config.libbpf_flags & 
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD)) { err = xsk_setup_xdp_prog(xsk); @@ -742,6 +848,8 @@ out_mmap_rx: if (rx) munmap(rx_map, off.rx.desc + xsk->config.rx_size * sizeof(struct xdp_desc)); +out_put_ctx: + xsk_put_ctx(ctx); out_socket: if (--umem->refcount) close(xsk->fd); @@ -750,25 +858,24 @@ out_xsk_alloc: return err; } -int xsk_umem__delete(struct xsk_umem *umem) +int xsk_socket__create(struct xsk_socket **xsk_ptr, const char *ifname, + __u32 queue_id, struct xsk_umem *umem, + struct xsk_ring_cons *rx, struct xsk_ring_prod *tx, + const struct xsk_socket_config *usr_config) { - struct xdp_mmap_offsets off; - int err; + return xsk_socket__create_shared(xsk_ptr, ifname, queue_id, umem, + rx, tx, umem->fill_save, + umem->comp_save, usr_config); +} +int xsk_umem__delete(struct xsk_umem *umem) +{ if (!umem) return 0; if (umem->refcount) return -EBUSY; - err = xsk_get_mmap_offsets(umem->fd, &off); - if (!err) { - munmap(umem->fill->ring - off.fr.desc, - off.fr.desc + umem->config.fill_size * sizeof(__u64)); - munmap(umem->comp->ring - off.cr.desc, - off.cr.desc + umem->config.comp_size * sizeof(__u64)); - } - close(umem->fd); free(umem); @@ -778,15 +885,16 @@ int xsk_umem__delete(struct xsk_umem *umem) void xsk_socket__delete(struct xsk_socket *xsk) { size_t desc_sz = sizeof(struct xdp_desc); + struct xsk_ctx *ctx = xsk->ctx; struct xdp_mmap_offsets off; int err; if (!xsk) return; - if (xsk->prog_fd != -1) { + if (ctx->prog_fd != -1) { xsk_delete_bpf_maps(xsk); - close(xsk->prog_fd); + close(ctx->prog_fd); } err = xsk_get_mmap_offsets(xsk->fd, &off); @@ -799,14 +907,15 @@ void xsk_socket__delete(struct xsk_socket *xsk) munmap(xsk->tx->ring - off.tx.desc, off.tx.desc + xsk->config.tx_size * desc_sz); } - } - xsk->umem->refcount--; + xsk_put_ctx(ctx); + + ctx->umem->refcount--; /* Do not close an fd that also has an associated umem connected * to it. */ - if (xsk->fd != xsk->umem->fd) + if (xsk->fd != ctx->umem->fd) close(xsk->fd); free(xsk); } diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h index 584f6820a639..1069c46364ff 100644 --- a/tools/lib/bpf/xsk.h +++ b/tools/lib/bpf/xsk.h @@ -234,6 +234,15 @@ LIBBPF_API int xsk_socket__create(struct xsk_socket **xsk, struct xsk_ring_cons *rx, struct xsk_ring_prod *tx, const struct xsk_socket_config *config); +LIBBPF_API int +xsk_socket__create_shared(struct xsk_socket **xsk_ptr, + const char *ifname, + __u32 queue_id, struct xsk_umem *umem, + struct xsk_ring_cons *rx, + struct xsk_ring_prod *tx, + struct xsk_ring_prod *fill, + struct xsk_ring_cons *comp, + const struct xsk_socket_config *config); /* Returns 0 for success and -EBUSY if the umem is still in use. 
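With the xsk_ctx refactor above, sharing one umem across queues comes down to giving every additional socket its own fill and completion rings via xsk_socket__create_shared(). A minimal sketch with error handling trimmed ("eth0" and the queue IDs are arbitrary; the rings are static only so they outlive the call):

    #include <bpf/xsk.h>

    static int open_two_queues(struct xsk_umem *umem,
                               struct xsk_socket **xsk0, struct xsk_socket **xsk1)
    {
        static struct xsk_ring_cons rx0, rx1, comp1;
        static struct xsk_ring_prod tx0, tx1, fill1;
        int err;

        /* First socket reuses the fill/comp rings registered at
         * xsk_umem__create() time (fill_save/comp_save above). */
        err = xsk_socket__create(xsk0, "eth0", 0 /* queue_id */, umem,
                                 &rx0, &tx0, NULL);
        if (err)
            return err;

        /* Second socket shares the umem but brings dedicated fill/comp rings;
         * libbpf binds it with XDP_SHARED_UMEM since umem->refcount > 1. */
        return xsk_socket__create_shared(xsk1, "eth0", 1 /* queue_id */, umem,
                                         &rx1, &tx1, &fill1, &comp1, NULL);
    }
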
*/ LIBBPF_API int xsk_umem__delete(struct xsk_umem *umem); diff --git a/tools/perf/Makefile.config b/tools/perf/Makefile.config index 190be4fa5c21..81bb099f6f06 100644 --- a/tools/perf/Makefile.config +++ b/tools/perf/Makefile.config @@ -483,10 +483,6 @@ ifndef NO_LIBELF EXTLIBS += -lelf $(call detected,CONFIG_LIBELF) - ifeq ($(feature-libelf-mmap), 1) - CFLAGS += -DHAVE_LIBELF_MMAP_SUPPORT - endif - ifeq ($(feature-libelf-getphdrnum), 1) CFLAGS += -DHAVE_ELF_GETPHDRNUM_SUPPORT endif diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h index ff4f4c47e148..03e264a27cd3 100644 --- a/tools/perf/util/symbol.h +++ b/tools/perf/util/symbol.h @@ -28,7 +28,7 @@ struct option; * libelf 0.8.x and earlier do not support ELF_C_READ_MMAP; * for newer versions we can use mmap to reduce memory usage: */ -#ifdef HAVE_LIBELF_MMAP_SUPPORT +#ifdef ELF_C_READ_MMAP # define PERF_ELF_C_READ_MMAP ELF_C_READ_MMAP #else # define PERF_ELF_C_READ_MMAP ELF_C_READ diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index fc946b7ac288..65d3d9aaeb31 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -316,7 +316,7 @@ $(TRUNNER_BPF_OBJS): $(TRUNNER_OUTPUT)/%.o: \ $(TRUNNER_BPF_PROGS_DIR)/%.c \ $(TRUNNER_BPF_PROGS_DIR)/*.h \ $$(INCLUDE_DIR)/vmlinux.h \ - $$(BPFOBJ) | $(TRUNNER_OUTPUT) + $(wildcard $(BPFDIR)/bpf_*.h) | $(TRUNNER_OUTPUT) $$(call $(TRUNNER_BPF_BUILD_RULE),$$<,$$@, \ $(TRUNNER_BPF_CFLAGS), \ $(TRUNNER_BPF_LDFLAGS)) diff --git a/tools/testing/selftests/bpf/README.rst b/tools/testing/selftests/bpf/README.rst index e885d351595f..66acfcf15ff2 100644 --- a/tools/testing/selftests/bpf/README.rst +++ b/tools/testing/selftests/bpf/README.rst @@ -43,3 +43,24 @@ This is due to a llvm BPF backend bug. The fix https://reviews.llvm.org/D78466 has been pushed to llvm 10.x release branch and will be available in 10.0.1. The fix is available in llvm 11.0.0 trunk. + +BPF CO-RE-based tests and Clang version +======================================= + +A set of selftests use BPF target-specific built-ins, which might require +bleeding-edge Clang versions (Clang 12 nightly at this time). + +Few sub-tests of core_reloc test suit (part of test_progs test runner) require +the following built-ins, listed with corresponding Clang diffs introducing +them to Clang/LLVM. These sub-tests are going to be skipped if Clang is too +old to support them, they shouldn't cause build failures or runtime test +failures: + + - __builtin_btf_type_id() ([0], [1], [2]); + - __builtin_preserve_type_info(), __builtin_preserve_enum_value() ([3], [4]). 
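In practice these built-ins are reached through the wrapper macros in libbpf's bpf_core_read.h rather than invoked directly. A rough sketch of the BPF-side pattern such sub-tests rely on (the type and enum choices here are illustrative)::

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    char LICENSE[] SEC("license") = "GPL";

    SEC("raw_tp/sys_enter")
    int core_builtins_example(void *ctx)
    {
        /* __builtin_preserve_type_info() under the hood */
        int exists = bpf_core_type_exists(struct task_struct);
        int sz = bpf_core_type_size(struct task_struct);

        /* __builtin_btf_type_id() under the hood */
        int local_id = bpf_core_type_id_local(struct task_struct);

        /* __builtin_preserve_enum_value() under the hood */
        long val = bpf_core_enum_value(enum bpf_func_id, BPF_FUNC_map_lookup_elem);

        return exists + sz + local_id + (int)val;
    }
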
+ + [0] https://reviews.llvm.org/D74572 + [1] https://reviews.llvm.org/D74668 + [2] https://reviews.llvm.org/D85174 + [3] https://reviews.llvm.org/D83878 + [4] https://reviews.llvm.org/D83242 diff --git a/tools/testing/selftests/bpf/bench.c b/tools/testing/selftests/bpf/bench.c index 944ad4721c83..1a427685a8a8 100644 --- a/tools/testing/selftests/bpf/bench.c +++ b/tools/testing/selftests/bpf/bench.c @@ -317,6 +317,7 @@ extern const struct bench bench_trig_tp; extern const struct bench bench_trig_rawtp; extern const struct bench bench_trig_kprobe; extern const struct bench bench_trig_fentry; +extern const struct bench bench_trig_fentry_sleep; extern const struct bench bench_trig_fmodret; extern const struct bench bench_rb_libbpf; extern const struct bench bench_rb_custom; @@ -338,6 +339,7 @@ static const struct bench *benchs[] = { &bench_trig_rawtp, &bench_trig_kprobe, &bench_trig_fentry, + &bench_trig_fentry_sleep, &bench_trig_fmodret, &bench_rb_libbpf, &bench_rb_custom, diff --git a/tools/testing/selftests/bpf/benchs/bench_trigger.c b/tools/testing/selftests/bpf/benchs/bench_trigger.c index 49c22832f216..2a0b6c9885a4 100644 --- a/tools/testing/selftests/bpf/benchs/bench_trigger.c +++ b/tools/testing/selftests/bpf/benchs/bench_trigger.c @@ -90,6 +90,12 @@ static void trigger_fentry_setup() attach_bpf(ctx.skel->progs.bench_trigger_fentry); } +static void trigger_fentry_sleep_setup() +{ + setup_ctx(); + attach_bpf(ctx.skel->progs.bench_trigger_fentry_sleep); +} + static void trigger_fmodret_setup() { setup_ctx(); @@ -155,6 +161,17 @@ const struct bench bench_trig_fentry = { .report_final = hits_drops_report_final, }; +const struct bench bench_trig_fentry_sleep = { + .name = "trig-fentry-sleep", + .validate = trigger_validate, + .setup = trigger_fentry_sleep_setup, + .producer_thread = trigger_producer, + .consumer_thread = trigger_consumer, + .measure = trigger_measure, + .report_progress = hits_drops_report_progress, + .report_final = hits_drops_report_final, +}; + const struct bench bench_trig_fmodret = { .name = "trig-fmodret", .validate = trigger_validate, diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c index f56655690f9b..12ee40284da0 100644 --- a/tools/testing/selftests/bpf/network_helpers.c +++ b/tools/testing/selftests/bpf/network_helpers.c @@ -104,6 +104,43 @@ error_close: return -1; } +int fastopen_connect(int server_fd, const char *data, unsigned int data_len, + int timeout_ms) +{ + struct sockaddr_storage addr; + socklen_t addrlen = sizeof(addr); + struct sockaddr_in *addr_in; + int fd, ret; + + if (getsockname(server_fd, (struct sockaddr *)&addr, &addrlen)) { + log_err("Failed to get server addr"); + return -1; + } + + addr_in = (struct sockaddr_in *)&addr; + fd = socket(addr_in->sin_family, SOCK_STREAM, 0); + if (fd < 0) { + log_err("Failed to create client socket"); + return -1; + } + + if (settimeo(fd, timeout_ms)) + goto error_close; + + ret = sendto(fd, data, data_len, MSG_FASTOPEN, (struct sockaddr *)&addr, + addrlen); + if (ret != data_len) { + log_err("sendto(data, %u) != %d\n", data_len, ret); + goto error_close; + } + + return fd; + +error_close: + save_errno_close(fd); + return -1; +} + static int connect_fd_to_addr(int fd, const struct sockaddr_storage *addr, socklen_t addrlen) diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h index c3728f6667e4..7205f8afdba1 100644 --- a/tools/testing/selftests/bpf/network_helpers.h +++ 
b/tools/testing/selftests/bpf/network_helpers.h @@ -37,6 +37,8 @@ int start_server(int family, int type, const char *addr, __u16 port, int timeout_ms); int connect_to_fd(int server_fd, int timeout_ms); int connect_fd_to_fd(int client_fd, int server_fd, int timeout_ms); +int fastopen_connect(int server_fd, const char *data, unsigned int data_len, + int timeout_ms); int make_sockaddr(int family, const char *addr_str, __u16 port, struct sockaddr_storage *addr, socklen_t *len); diff --git a/tools/testing/selftests/bpf/prog_tests/btf_map_in_map.c b/tools/testing/selftests/bpf/prog_tests/btf_map_in_map.c index 6ccecbd39476..540fea4c91a5 100644 --- a/tools/testing/selftests/bpf/prog_tests/btf_map_in_map.c +++ b/tools/testing/selftests/bpf/prog_tests/btf_map_in_map.c @@ -53,7 +53,7 @@ static int kern_sync_rcu(void) return err; } -void test_btf_map_in_map(void) +static void test_lookup_update(void) { int err, key = 0, val, i; struct test_btf_map_in_map *skel; @@ -143,3 +143,36 @@ void test_btf_map_in_map(void) cleanup: test_btf_map_in_map__destroy(skel); } + +static void test_diff_size(void) +{ + struct test_btf_map_in_map *skel; + int err, inner_map_fd, zero = 0; + + skel = test_btf_map_in_map__open_and_load(); + if (CHECK(!skel, "skel_open", "failed to open&load skeleton\n")) + return; + + inner_map_fd = bpf_map__fd(skel->maps.sockarr_sz2); + err = bpf_map_update_elem(bpf_map__fd(skel->maps.outer_sockarr), &zero, + &inner_map_fd, 0); + CHECK(err, "outer_sockarr inner map size check", + "cannot use a different size inner_map\n"); + + inner_map_fd = bpf_map__fd(skel->maps.inner_map_sz2); + err = bpf_map_update_elem(bpf_map__fd(skel->maps.outer_arr), &zero, + &inner_map_fd, 0); + CHECK(!err, "outer_arr inner map size check", + "incorrectly updated with a different size inner_map\n"); + + test_btf_map_in_map__destroy(skel); +} + +void test_btf_map_in_map(void) +{ + if (test__start_subtest("lookup_update")) + test_lookup_update(); + + if (test__start_subtest("diff_size")) + test_diff_size(); +} diff --git a/tools/testing/selftests/bpf/prog_tests/core_reloc.c b/tools/testing/selftests/bpf/prog_tests/core_reloc.c index a54eafc5e4b3..30e40ff4b0d8 100644 --- a/tools/testing/selftests/bpf/prog_tests/core_reloc.c +++ b/tools/testing/selftests/bpf/prog_tests/core_reloc.c @@ -3,6 +3,9 @@ #include "progs/core_reloc_types.h" #include <sys/mman.h> #include <sys/syscall.h> +#include <bpf/btf.h> + +static int duration = 0; #define STRUCT_TO_CHAR_PTR(struct_name) (const char *)&(struct struct_name) @@ -177,14 +180,13 @@ .fails = true, \ } -#define EXISTENCE_CASE_COMMON(name) \ +#define FIELD_EXISTS_CASE_COMMON(name) \ .case_name = #name, \ .bpf_obj_file = "test_core_reloc_existence.o", \ - .btf_src_file = "btf__core_reloc_" #name ".o", \ - .relaxed_core_relocs = true + .btf_src_file = "btf__core_reloc_" #name ".o" \ -#define EXISTENCE_ERR_CASE(name) { \ - EXISTENCE_CASE_COMMON(name), \ +#define FIELD_EXISTS_ERR_CASE(name) { \ + FIELD_EXISTS_CASE_COMMON(name), \ .fails = true, \ } @@ -253,6 +255,61 @@ .fails = true, \ } +#define TYPE_BASED_CASE_COMMON(name) \ + .case_name = #name, \ + .bpf_obj_file = "test_core_reloc_type_based.o", \ + .btf_src_file = "btf__core_reloc_" #name ".o" \ + +#define TYPE_BASED_CASE(name, ...) 
{ \ + TYPE_BASED_CASE_COMMON(name), \ + .output = STRUCT_TO_CHAR_PTR(core_reloc_type_based_output) \ + __VA_ARGS__, \ + .output_len = sizeof(struct core_reloc_type_based_output), \ +} + +#define TYPE_BASED_ERR_CASE(name) { \ + TYPE_BASED_CASE_COMMON(name), \ + .fails = true, \ +} + +#define TYPE_ID_CASE_COMMON(name) \ + .case_name = #name, \ + .bpf_obj_file = "test_core_reloc_type_id.o", \ + .btf_src_file = "btf__core_reloc_" #name ".o" \ + +#define TYPE_ID_CASE(name, setup_fn) { \ + TYPE_ID_CASE_COMMON(name), \ + .output = STRUCT_TO_CHAR_PTR(core_reloc_type_id_output) {}, \ + .output_len = sizeof(struct core_reloc_type_id_output), \ + .setup = setup_fn, \ +} + +#define TYPE_ID_ERR_CASE(name) { \ + TYPE_ID_CASE_COMMON(name), \ + .fails = true, \ +} + +#define ENUMVAL_CASE_COMMON(name) \ + .case_name = #name, \ + .bpf_obj_file = "test_core_reloc_enumval.o", \ + .btf_src_file = "btf__core_reloc_" #name ".o" \ + +#define ENUMVAL_CASE(name, ...) { \ + ENUMVAL_CASE_COMMON(name), \ + .output = STRUCT_TO_CHAR_PTR(core_reloc_enumval_output) \ + __VA_ARGS__, \ + .output_len = sizeof(struct core_reloc_enumval_output), \ +} + +#define ENUMVAL_ERR_CASE(name) { \ + ENUMVAL_CASE_COMMON(name), \ + .fails = true, \ +} + +struct core_reloc_test_case; + +typedef int (*setup_test_fn)(struct core_reloc_test_case *test); + struct core_reloc_test_case { const char *case_name; const char *bpf_obj_file; @@ -264,8 +321,136 @@ struct core_reloc_test_case { bool fails; bool relaxed_core_relocs; bool direct_raw_tp; + setup_test_fn setup; }; +static int find_btf_type(const struct btf *btf, const char *name, __u32 kind) +{ + int id; + + id = btf__find_by_name_kind(btf, name, kind); + if (CHECK(id <= 0, "find_type_id", "failed to find '%s', kind %d: %d\n", name, kind, id)) + return -1; + + return id; +} + +static int setup_type_id_case_local(struct core_reloc_test_case *test) +{ + struct core_reloc_type_id_output *exp = (void *)test->output; + struct btf *local_btf = btf__parse(test->bpf_obj_file, NULL); + struct btf *targ_btf = btf__parse(test->btf_src_file, NULL); + const struct btf_type *t; + const char *name; + int i; + + if (CHECK(IS_ERR(local_btf), "local_btf", "failed: %ld\n", PTR_ERR(local_btf)) || + CHECK(IS_ERR(targ_btf), "targ_btf", "failed: %ld\n", PTR_ERR(targ_btf))) { + btf__free(local_btf); + btf__free(targ_btf); + return -EINVAL; + } + + exp->local_anon_struct = -1; + exp->local_anon_union = -1; + exp->local_anon_enum = -1; + exp->local_anon_func_proto_ptr = -1; + exp->local_anon_void_ptr = -1; + exp->local_anon_arr = -1; + + for (i = 1; i <= btf__get_nr_types(local_btf); i++) + { + t = btf__type_by_id(local_btf, i); + /* we are interested only in anonymous types */ + if (t->name_off) + continue; + + if (btf_is_struct(t) && btf_vlen(t) && + (name = btf__name_by_offset(local_btf, btf_members(t)[0].name_off)) && + strcmp(name, "marker_field") == 0) { + exp->local_anon_struct = i; + } else if (btf_is_union(t) && btf_vlen(t) && + (name = btf__name_by_offset(local_btf, btf_members(t)[0].name_off)) && + strcmp(name, "marker_field") == 0) { + exp->local_anon_union = i; + } else if (btf_is_enum(t) && btf_vlen(t) && + (name = btf__name_by_offset(local_btf, btf_enum(t)[0].name_off)) && + strcmp(name, "MARKER_ENUM_VAL") == 0) { + exp->local_anon_enum = i; + } else if (btf_is_ptr(t) && (t = btf__type_by_id(local_btf, t->type))) { + if (btf_is_func_proto(t) && (t = btf__type_by_id(local_btf, t->type)) && + btf_is_int(t) && (name = btf__name_by_offset(local_btf, t->name_off)) && + strcmp(name, "_Bool") == 0) { + /* 
ptr -> func_proto -> _Bool */ + exp->local_anon_func_proto_ptr = i; + } else if (btf_is_void(t)) { + /* ptr -> void */ + exp->local_anon_void_ptr = i; + } + } else if (btf_is_array(t) && (t = btf__type_by_id(local_btf, btf_array(t)->type)) && + btf_is_int(t) && (name = btf__name_by_offset(local_btf, t->name_off)) && + strcmp(name, "_Bool") == 0) { + /* _Bool[] */ + exp->local_anon_arr = i; + } + } + + exp->local_struct = find_btf_type(local_btf, "a_struct", BTF_KIND_STRUCT); + exp->local_union = find_btf_type(local_btf, "a_union", BTF_KIND_UNION); + exp->local_enum = find_btf_type(local_btf, "an_enum", BTF_KIND_ENUM); + exp->local_int = find_btf_type(local_btf, "int", BTF_KIND_INT); + exp->local_struct_typedef = find_btf_type(local_btf, "named_struct_typedef", BTF_KIND_TYPEDEF); + exp->local_func_proto_typedef = find_btf_type(local_btf, "func_proto_typedef", BTF_KIND_TYPEDEF); + exp->local_arr_typedef = find_btf_type(local_btf, "arr_typedef", BTF_KIND_TYPEDEF); + + btf__free(local_btf); + btf__free(targ_btf); + return 0; +} + +static int setup_type_id_case_success(struct core_reloc_test_case *test) { + struct core_reloc_type_id_output *exp = (void *)test->output; + struct btf *targ_btf = btf__parse(test->btf_src_file, NULL); + int err; + + err = setup_type_id_case_local(test); + if (err) + return err; + + targ_btf = btf__parse(test->btf_src_file, NULL); + + exp->targ_struct = find_btf_type(targ_btf, "a_struct", BTF_KIND_STRUCT); + exp->targ_union = find_btf_type(targ_btf, "a_union", BTF_KIND_UNION); + exp->targ_enum = find_btf_type(targ_btf, "an_enum", BTF_KIND_ENUM); + exp->targ_int = find_btf_type(targ_btf, "int", BTF_KIND_INT); + exp->targ_struct_typedef = find_btf_type(targ_btf, "named_struct_typedef", BTF_KIND_TYPEDEF); + exp->targ_func_proto_typedef = find_btf_type(targ_btf, "func_proto_typedef", BTF_KIND_TYPEDEF); + exp->targ_arr_typedef = find_btf_type(targ_btf, "arr_typedef", BTF_KIND_TYPEDEF); + + btf__free(targ_btf); + return 0; +} + +static int setup_type_id_case_failure(struct core_reloc_test_case *test) +{ + struct core_reloc_type_id_output *exp = (void *)test->output; + int err; + + err = setup_type_id_case_local(test); + if (err) + return err; + + exp->targ_struct = 0; + exp->targ_union = 0; + exp->targ_enum = 0; + exp->targ_int = 0; + exp->targ_struct_typedef = 0; + exp->targ_func_proto_typedef = 0; + exp->targ_arr_typedef = 0; + + return 0; +} + static struct core_reloc_test_case test_cases[] = { /* validate we can find kernel image and use its BTF for relocs */ { @@ -364,7 +549,7 @@ static struct core_reloc_test_case test_cases[] = { /* validate field existence checks */ { - EXISTENCE_CASE_COMMON(existence), + FIELD_EXISTS_CASE_COMMON(existence), .input = STRUCT_TO_CHAR_PTR(core_reloc_existence) { .a = 1, .b = 2, @@ -388,7 +573,7 @@ static struct core_reloc_test_case test_cases[] = { .output_len = sizeof(struct core_reloc_existence_output), }, { - EXISTENCE_CASE_COMMON(existence___minimal), + FIELD_EXISTS_CASE_COMMON(existence___minimal), .input = STRUCT_TO_CHAR_PTR(core_reloc_existence___minimal) { .a = 42, }, @@ -408,12 +593,12 @@ static struct core_reloc_test_case test_cases[] = { .output_len = sizeof(struct core_reloc_existence_output), }, - EXISTENCE_ERR_CASE(existence__err_int_sz), - EXISTENCE_ERR_CASE(existence__err_int_type), - EXISTENCE_ERR_CASE(existence__err_int_kind), - EXISTENCE_ERR_CASE(existence__err_arr_kind), - EXISTENCE_ERR_CASE(existence__err_arr_value_type), - EXISTENCE_ERR_CASE(existence__err_struct_type), + 
FIELD_EXISTS_ERR_CASE(existence__err_int_sz), + FIELD_EXISTS_ERR_CASE(existence__err_int_type), + FIELD_EXISTS_ERR_CASE(existence__err_int_kind), + FIELD_EXISTS_ERR_CASE(existence__err_arr_kind), + FIELD_EXISTS_ERR_CASE(existence__err_arr_value_type), + FIELD_EXISTS_ERR_CASE(existence__err_struct_type), /* bitfield relocation checks */ BITFIELDS_CASE(bitfields, { @@ -452,11 +637,117 @@ static struct core_reloc_test_case test_cases[] = { /* size relocation checks */ SIZE_CASE(size), SIZE_CASE(size___diff_sz), + SIZE_ERR_CASE(size___err_ambiguous), + + /* validate type existence and size relocations */ + TYPE_BASED_CASE(type_based, { + .struct_exists = 1, + .union_exists = 1, + .enum_exists = 1, + .typedef_named_struct_exists = 1, + .typedef_anon_struct_exists = 1, + .typedef_struct_ptr_exists = 1, + .typedef_int_exists = 1, + .typedef_enum_exists = 1, + .typedef_void_ptr_exists = 1, + .typedef_func_proto_exists = 1, + .typedef_arr_exists = 1, + .struct_sz = sizeof(struct a_struct), + .union_sz = sizeof(union a_union), + .enum_sz = sizeof(enum an_enum), + .typedef_named_struct_sz = sizeof(named_struct_typedef), + .typedef_anon_struct_sz = sizeof(anon_struct_typedef), + .typedef_struct_ptr_sz = sizeof(struct_ptr_typedef), + .typedef_int_sz = sizeof(int_typedef), + .typedef_enum_sz = sizeof(enum_typedef), + .typedef_void_ptr_sz = sizeof(void_ptr_typedef), + .typedef_func_proto_sz = sizeof(func_proto_typedef), + .typedef_arr_sz = sizeof(arr_typedef), + }), + TYPE_BASED_CASE(type_based___all_missing, { + /* all zeros */ + }), + TYPE_BASED_CASE(type_based___diff_sz, { + .struct_exists = 1, + .union_exists = 1, + .enum_exists = 1, + .typedef_named_struct_exists = 1, + .typedef_anon_struct_exists = 1, + .typedef_struct_ptr_exists = 1, + .typedef_int_exists = 1, + .typedef_enum_exists = 1, + .typedef_void_ptr_exists = 1, + .typedef_func_proto_exists = 1, + .typedef_arr_exists = 1, + .struct_sz = sizeof(struct a_struct___diff_sz), + .union_sz = sizeof(union a_union___diff_sz), + .enum_sz = sizeof(enum an_enum___diff_sz), + .typedef_named_struct_sz = sizeof(named_struct_typedef___diff_sz), + .typedef_anon_struct_sz = sizeof(anon_struct_typedef___diff_sz), + .typedef_struct_ptr_sz = sizeof(struct_ptr_typedef___diff_sz), + .typedef_int_sz = sizeof(int_typedef___diff_sz), + .typedef_enum_sz = sizeof(enum_typedef___diff_sz), + .typedef_void_ptr_sz = sizeof(void_ptr_typedef___diff_sz), + .typedef_func_proto_sz = sizeof(func_proto_typedef___diff_sz), + .typedef_arr_sz = sizeof(arr_typedef___diff_sz), + }), + TYPE_BASED_CASE(type_based___incompat, { + .enum_exists = 1, + .enum_sz = sizeof(enum an_enum), + }), + TYPE_BASED_CASE(type_based___fn_wrong_args, { + .struct_exists = 1, + .struct_sz = sizeof(struct a_struct), + }), + + /* BTF_TYPE_ID_LOCAL/BTF_TYPE_ID_TARGET tests */ + TYPE_ID_CASE(type_id, setup_type_id_case_success), + TYPE_ID_CASE(type_id___missing_targets, setup_type_id_case_failure), + + /* Enumerator value existence and value relocations */ + ENUMVAL_CASE(enumval, { + .named_val1_exists = true, + .named_val2_exists = true, + .named_val3_exists = true, + .anon_val1_exists = true, + .anon_val2_exists = true, + .anon_val3_exists = true, + .named_val1 = 1, + .named_val2 = 2, + .anon_val1 = 0x10, + .anon_val2 = 0x20, + }), + ENUMVAL_CASE(enumval___diff, { + .named_val1_exists = true, + .named_val2_exists = true, + .named_val3_exists = true, + .anon_val1_exists = true, + .anon_val2_exists = true, + .anon_val3_exists = true, + .named_val1 = 101, + .named_val2 = 202, + .anon_val1 = 0x11, + .anon_val2 = 
0x22, + }), + ENUMVAL_CASE(enumval___val3_missing, { + .named_val1_exists = true, + .named_val2_exists = true, + .named_val3_exists = false, + .anon_val1_exists = true, + .anon_val2_exists = true, + .anon_val3_exists = false, + .named_val1 = 111, + .named_val2 = 222, + .anon_val1 = 0x111, + .anon_val2 = 0x222, + }), + ENUMVAL_ERR_CASE(enumval___err_missing), }; struct data { char in[256]; char out[256]; + bool skip; uint64_t my_pid_tgid; }; @@ -472,7 +763,7 @@ void test_core_reloc(void) struct bpf_object_load_attr load_attr = {}; struct core_reloc_test_case *test_case; const char *tp_name, *probe_name; - int err, duration = 0, i, equal; + int err, i, equal; struct bpf_link *link = NULL; struct bpf_map *data_map; struct bpf_program *prog; @@ -488,11 +779,13 @@ void test_core_reloc(void) if (!test__start_subtest(test_case->case_name)) continue; - DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts, - .relaxed_core_relocs = test_case->relaxed_core_relocs, - ); + if (test_case->setup) { + err = test_case->setup(test_case); + if (CHECK(err, "test_setup", "test #%d setup failed: %d\n", i, err)) + continue; + } - obj = bpf_object__open_file(test_case->bpf_obj_file, &opts); + obj = bpf_object__open_file(test_case->bpf_obj_file, NULL); if (CHECK(IS_ERR(obj), "obj_open", "failed to open '%s': %ld\n", test_case->bpf_obj_file, PTR_ERR(obj))) continue; @@ -515,15 +808,10 @@ void test_core_reloc(void) load_attr.log_level = 0; load_attr.target_btf_path = test_case->btf_src_file; err = bpf_object__load_xattr(&load_attr); - if (test_case->fails) { - CHECK(!err, "obj_load_fail", - "should fail to load prog '%s'\n", probe_name); + if (err) { + if (!test_case->fails) + CHECK(false, "obj_load", "failed to load prog '%s': %d\n", probe_name, err); goto cleanup; - } else { - if (CHECK(err, "obj_load", - "failed to load prog '%s': %d\n", - probe_name, err)) - goto cleanup; } data_map = bpf_object__find_map_by_name(obj, "test_cor.bss"); @@ -551,6 +839,16 @@ void test_core_reloc(void) /* trigger test run */ usleep(1); + if (data->skip) { + test__skip(); + goto cleanup; + } + + if (test_case->fails) { + CHECK(false, "obj_load_fail", "should fail to load prog '%s'\n", probe_name); + goto cleanup; + } + equal = memcmp(data->out, test_case->output, test_case->output_len) == 0; if (CHECK(!equal, "check_result", diff --git a/tools/testing/selftests/bpf/prog_tests/d_path.c b/tools/testing/selftests/bpf/prog_tests/d_path.c new file mode 100644 index 000000000000..fc12e0d445ff --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/d_path.c @@ -0,0 +1,147 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#include <test_progs.h> +#include <sys/stat.h> +#include <linux/sched.h> +#include <sys/syscall.h> + +#define MAX_PATH_LEN 128 +#define MAX_FILES 7 + +#include "test_d_path.skel.h" + +static int duration; + +static struct { + __u32 cnt; + char paths[MAX_FILES][MAX_PATH_LEN]; +} src; + +static int set_pathname(int fd, pid_t pid) +{ + char buf[MAX_PATH_LEN]; + + snprintf(buf, MAX_PATH_LEN, "/proc/%d/fd/%d", pid, fd); + return readlink(buf, src.paths[src.cnt++], MAX_PATH_LEN); +} + +static int trigger_fstat_events(pid_t pid) +{ + int sockfd = -1, procfd = -1, devfd = -1; + int localfd = -1, indicatorfd = -1; + int pipefd[2] = { -1, -1 }; + struct stat fileStat; + int ret = -1; + + /* unmountable pseudo-filesystems */ + if (CHECK(pipe(pipefd) < 0, "trigger", "pipe failed\n")) + return ret; + /* unmountable pseudo-filesystems */ + sockfd = socket(AF_INET, SOCK_STREAM, 0); + if (CHECK(sockfd < 0, "trigger", "socket 
failed\n"))
+		goto out_close;
+	/* mountable pseudo-filesystems */
+	procfd = open("/proc/self/comm", O_RDONLY);
+	if (CHECK(procfd < 0, "trigger", "open /proc/self/comm failed\n"))
+		goto out_close;
+	devfd = open("/dev/urandom", O_RDONLY);
+	if (CHECK(devfd < 0, "trigger", "open /dev/urandom failed\n"))
+		goto out_close;
+	localfd = open("/tmp/d_path_loadgen.txt", O_CREAT | O_RDONLY, 0644);
+	if (CHECK(localfd < 0, "trigger", "open /tmp/d_path_loadgen.txt failed\n"))
+		goto out_close;
+	/* bpf_d_path will return path with (deleted) */
+	remove("/tmp/d_path_loadgen.txt");
+	indicatorfd = open("/tmp/", O_PATH);
+	if (CHECK(indicatorfd < 0, "trigger", "open /tmp/ failed\n"))
+		goto out_close;
+
+	ret = set_pathname(pipefd[0], pid);
+	if (CHECK(ret < 0, "trigger", "set_pathname failed for pipe[0]\n"))
+		goto out_close;
+	ret = set_pathname(pipefd[1], pid);
+	if (CHECK(ret < 0, "trigger", "set_pathname failed for pipe[1]\n"))
+		goto out_close;
+	ret = set_pathname(sockfd, pid);
+	if (CHECK(ret < 0, "trigger", "set_pathname failed for socket\n"))
+		goto out_close;
+	ret = set_pathname(procfd, pid);
+	if (CHECK(ret < 0, "trigger", "set_pathname failed for proc\n"))
+		goto out_close;
+	ret = set_pathname(devfd, pid);
+	if (CHECK(ret < 0, "trigger", "set_pathname failed for dev\n"))
+		goto out_close;
+	ret = set_pathname(localfd, pid);
+	if (CHECK(ret < 0, "trigger", "set_pathname failed for file\n"))
+		goto out_close;
+	ret = set_pathname(indicatorfd, pid);
+	if (CHECK(ret < 0, "trigger", "set_pathname failed for dir\n"))
+		goto out_close;
+
+	/* triggers vfs_getattr */
+	fstat(pipefd[0], &fileStat);
+	fstat(pipefd[1], &fileStat);
+	fstat(sockfd, &fileStat);
+	fstat(procfd, &fileStat);
+	fstat(devfd, &fileStat);
+	fstat(localfd, &fileStat);
+	fstat(indicatorfd, &fileStat);
+
+out_close:
+	/* triggers filp_close */
+	close(pipefd[0]);
+	close(pipefd[1]);
+	close(sockfd);
+	close(procfd);
+	close(devfd);
+	close(localfd);
+	close(indicatorfd);
+	return ret;
+}
+
+void test_d_path(void)
+{
+	struct test_d_path__bss *bss;
+	struct test_d_path *skel;
+	int err;
+
+	skel = test_d_path__open_and_load();
+	if (CHECK(!skel, "setup", "d_path skeleton failed\n"))
+		goto cleanup;
+
+	err = test_d_path__attach(skel);
+	if (CHECK(err, "setup", "attach failed: %d\n", err))
+		goto cleanup;
+
+	bss = skel->bss;
+	bss->my_pid = getpid();
+
+	err = trigger_fstat_events(bss->my_pid);
+	if (err < 0)
+		goto cleanup;
+
+	for (int i = 0; i < MAX_FILES; i++) {
+		CHECK(strncmp(src.paths[i], bss->paths_stat[i], MAX_PATH_LEN),
+		      "check",
+		      "failed to get stat path[%d]: %s vs %s\n",
+		      i, src.paths[i], bss->paths_stat[i]);
+		CHECK(strncmp(src.paths[i], bss->paths_close[i], MAX_PATH_LEN),
+		      "check",
+		      "failed to get close path[%d]: %s vs %s\n",
+		      i, src.paths[i], bss->paths_close[i]);
+		/* The d_path helper returns size plus NUL char, hence + 1 */
+		CHECK(bss->rets_stat[i] != strlen(bss->paths_stat[i]) + 1,
+		      "check",
+		      "failed to match stat return [%d]: %d vs %zd [%s]\n",
+		      i, bss->rets_stat[i], strlen(bss->paths_stat[i]) + 1,
+		      bss->paths_stat[i]);
+		CHECK(bss->rets_close[i] != strlen(bss->paths_close[i]) + 1,
+		      "check",
+		      "failed to match close return [%d]: %d vs %zd [%s]\n",
+		      i, bss->rets_close[i], strlen(bss->paths_close[i]) + 1,
+		      bss->paths_close[i]);
+	}
+
+cleanup:
+	test_d_path__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c b/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c
index 197d0d217b56..a550dab9ba7a 100644
--- a/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c
+++ b/tools/testing/selftests/bpf/prog_tests/fexit_bpf2bpf.c @@ -123,6 +123,7 @@ static void test_func_replace(void) "freplace/get_skb_len", "freplace/get_skb_ifindex", "freplace/get_constant", + "freplace/test_pkt_write_access_subprog", }; test_fexit_bpf2bpf_common("./fexit_bpf2bpf.o", "./test_pkt_access.o", @@ -141,10 +142,77 @@ static void test_func_replace_verify(void) prog_name, false); } +static void test_func_sockmap_update(void) +{ + const char *prog_name[] = { + "freplace/cls_redirect", + }; + test_fexit_bpf2bpf_common("./freplace_cls_redirect.o", + "./test_cls_redirect.o", + ARRAY_SIZE(prog_name), + prog_name, false); +} + +static void test_obj_load_failure_common(const char *obj_file, + const char *target_obj_file) + +{ + /* + * standalone test that asserts failure to load freplace prog + * because of invalid return code. + */ + struct bpf_object *obj = NULL, *pkt_obj; + int err, pkt_fd; + __u32 duration = 0; + + err = bpf_prog_load(target_obj_file, BPF_PROG_TYPE_UNSPEC, + &pkt_obj, &pkt_fd); + /* the target prog should load fine */ + if (CHECK(err, "tgt_prog_load", "file %s err %d errno %d\n", + target_obj_file, err, errno)) + return; + DECLARE_LIBBPF_OPTS(bpf_object_open_opts, opts, + .attach_prog_fd = pkt_fd, + ); + + obj = bpf_object__open_file(obj_file, &opts); + if (CHECK(IS_ERR_OR_NULL(obj), "obj_open", + "failed to open %s: %ld\n", obj_file, + PTR_ERR(obj))) + goto close_prog; + + /* It should fail to load the program */ + err = bpf_object__load(obj); + if (CHECK(!err, "bpf_obj_load should fail", "err %d\n", err)) + goto close_prog; + +close_prog: + if (!IS_ERR_OR_NULL(obj)) + bpf_object__close(obj); + bpf_object__close(pkt_obj); +} + +static void test_func_replace_return_code(void) +{ + /* test invalid return code in the replaced program */ + test_obj_load_failure_common("./freplace_connect_v4_prog.o", + "./connect4_prog.o"); +} + +static void test_func_map_prog_compatibility(void) +{ + /* test with spin lock map value in the replaced program */ + test_obj_load_failure_common("./freplace_attach_probe.o", + "./test_attach_probe.o"); +} + void test_fexit_bpf2bpf(void) { test_target_no_callees(); test_target_yes_callees(); test_func_replace(); test_func_replace_verify(); + test_func_sockmap_update(); + test_func_replace_return_code(); + test_func_map_prog_compatibility(); } diff --git a/tools/testing/selftests/bpf/prog_tests/perf_buffer.c b/tools/testing/selftests/bpf/prog_tests/perf_buffer.c index c33ec180b3f2..ca9f0895ec84 100644 --- a/tools/testing/selftests/bpf/prog_tests/perf_buffer.c +++ b/tools/testing/selftests/bpf/prog_tests/perf_buffer.c @@ -7,6 +7,8 @@ #include "test_perf_buffer.skel.h" #include "bpf/libbpf_internal.h" +static int duration; + /* AddressSanitizer sometimes crashes due to data dereference below, due to * this being mmap()'ed memory. 
Disable instrumentation with * no_sanitize_address attribute @@ -24,13 +26,31 @@ static void on_sample(void *ctx, int cpu, void *data, __u32 size) CPU_SET(cpu, cpu_seen); } +int trigger_on_cpu(int cpu) +{ + cpu_set_t cpu_set; + int err; + + CPU_ZERO(&cpu_set); + CPU_SET(cpu, &cpu_set); + + err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), &cpu_set); + if (err && CHECK(err, "set_affinity", "cpu #%d, err %d\n", cpu, err)) + return err; + + usleep(1); + + return 0; +} + void test_perf_buffer(void) { - int err, on_len, nr_on_cpus = 0, nr_cpus, i, duration = 0; + int err, on_len, nr_on_cpus = 0, nr_cpus, i; struct perf_buffer_opts pb_opts = {}; struct test_perf_buffer *skel; - cpu_set_t cpu_set, cpu_seen; + cpu_set_t cpu_seen; struct perf_buffer *pb; + int last_fd = -1, fd; bool *online; nr_cpus = libbpf_num_possible_cpus(); @@ -63,6 +83,9 @@ void test_perf_buffer(void) if (CHECK(IS_ERR(pb), "perf_buf__new", "err %ld\n", PTR_ERR(pb))) goto out_close; + CHECK(perf_buffer__epoll_fd(pb) < 0, "epoll_fd", + "bad fd: %d\n", perf_buffer__epoll_fd(pb)); + /* trigger kprobe on every CPU */ CPU_ZERO(&cpu_seen); for (i = 0; i < nr_cpus; i++) { @@ -71,16 +94,8 @@ void test_perf_buffer(void) continue; } - CPU_ZERO(&cpu_set); - CPU_SET(i, &cpu_set); - - err = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set), - &cpu_set); - if (err && CHECK(err, "set_affinity", "cpu #%d, err %d\n", - i, err)) + if (trigger_on_cpu(i)) goto out_close; - - usleep(1); } /* read perf buffer */ @@ -92,6 +107,34 @@ void test_perf_buffer(void) "expect %d, seen %d\n", nr_on_cpus, CPU_COUNT(&cpu_seen))) goto out_free_pb; + if (CHECK(perf_buffer__buffer_cnt(pb) != nr_cpus, "buf_cnt", + "got %zu, expected %d\n", perf_buffer__buffer_cnt(pb), nr_cpus)) + goto out_close; + + for (i = 0; i < nr_cpus; i++) { + if (i >= on_len || !online[i]) + continue; + + fd = perf_buffer__buffer_fd(pb, i); + CHECK(fd < 0 || last_fd == fd, "fd_check", "last fd %d == fd %d\n", last_fd, fd); + last_fd = fd; + + err = perf_buffer__consume_buffer(pb, i); + if (CHECK(err, "drain_buf", "cpu %d, err %d\n", i, err)) + goto out_close; + + CPU_CLR(i, &cpu_seen); + if (trigger_on_cpu(i)) + goto out_close; + + err = perf_buffer__consume_buffer(pb, i); + if (CHECK(err, "consume_buf", "cpu %d, err %d\n", i, err)) + goto out_close; + + if (CHECK(!CPU_ISSET(i, &cpu_seen), "cpu_seen", "cpu %d not seen\n", i)) + goto out_close; + } + out_free_pb: perf_buffer__free(pb); out_close: diff --git a/tools/testing/selftests/bpf/prog_tests/resolve_btfids.c b/tools/testing/selftests/bpf/prog_tests/resolve_btfids.c index 3b127cab4864..8826c652adad 100644 --- a/tools/testing/selftests/bpf/prog_tests/resolve_btfids.c +++ b/tools/testing/selftests/bpf/prog_tests/resolve_btfids.c @@ -47,6 +47,15 @@ BTF_ID(struct, S) BTF_ID(union, U) BTF_ID(func, func) +BTF_SET_START(test_set) +BTF_ID(typedef, S) +BTF_ID(typedef, T) +BTF_ID(typedef, U) +BTF_ID(struct, S) +BTF_ID(union, U) +BTF_ID(func, func) +BTF_SET_END(test_set) + static int __resolve_symbol(struct btf *btf, int type_id) { @@ -116,12 +125,40 @@ int test_resolve_btfids(void) */ for (j = 0; j < ARRAY_SIZE(test_lists); j++) { test_list = test_lists[j]; - for (i = 0; i < ARRAY_SIZE(test_symbols) && !ret; i++) { + for (i = 0; i < ARRAY_SIZE(test_symbols); i++) { ret = CHECK(test_list[i] != test_symbols[i].id, "id_check", "wrong ID for %s (%d != %d)\n", test_symbols[i].name, test_list[i], test_symbols[i].id); + if (ret) + return ret; + } + } + + /* Check BTF_SET_START(test_set) IDs */ + for (i = 0; i < test_set.cnt; i++) { + 
bool found = false; + + for (j = 0; j < ARRAY_SIZE(test_symbols); j++) { + if (test_symbols[j].id != test_set.ids[i]) + continue; + found = true; + break; + } + + ret = CHECK(!found, "id_check", + "ID %d not found in test_symbols\n", + test_set.ids[i]); + if (ret) + break; + + if (i > 0) { + ret = CHECK(test_set.ids[i - 1] > test_set.ids[i], + "sort_check", + "test_set is not sorted\n"); + if (ret) + break; } } diff --git a/tools/testing/selftests/bpf/prog_tests/sk_assign.c b/tools/testing/selftests/bpf/prog_tests/sk_assign.c index 47fa04adc147..a49a26f95a8b 100644 --- a/tools/testing/selftests/bpf/prog_tests/sk_assign.c +++ b/tools/testing/selftests/bpf/prog_tests/sk_assign.c @@ -49,7 +49,7 @@ configure_stack(void) sprintf(tc_cmd, "%s %s %s %s", "tc filter add dev lo ingress bpf", "direct-action object-file ./test_sk_assign.o", "section classifier/sk_assign_test", - (env.verbosity < VERBOSE_VERY) ? " 2>/dev/null" : ""); + (env.verbosity < VERBOSE_VERY) ? " 2>/dev/null" : "verbose"); if (CHECK(system(tc_cmd), "BPF load failed;", "run with -vv for more info\n")) return false; @@ -268,6 +268,7 @@ void test_sk_assign(void) int server = -1; int server_map; int self_net; + int i; self_net = open(NS_SELF, O_RDONLY); if (CHECK_FAIL(self_net < 0)) { @@ -286,7 +287,7 @@ void test_sk_assign(void) goto cleanup; } - for (int i = 0; i < ARRAY_SIZE(tests) && !READ_ONCE(stop); i++) { + for (i = 0; i < ARRAY_SIZE(tests) && !READ_ONCE(stop); i++) { struct test_sk_cfg *test = &tests[i]; const struct sockaddr *addr; const int zero = 0; diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c index 96e7b7f84c65..0b79d78b98db 100644 --- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c +++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c @@ -4,6 +4,8 @@ #include "test_progs.h" #include "test_skmsg_load_helpers.skel.h" +#include "test_sockmap_update.skel.h" +#include "test_sockmap_invalid_update.skel.h" #define TCP_REPAIR 19 /* TCP sock is under repair right now */ @@ -101,6 +103,74 @@ out: test_skmsg_load_helpers__destroy(skel); } +static void test_sockmap_update(enum bpf_map_type map_type) +{ + struct bpf_prog_test_run_attr tattr; + int err, prog, src, dst, duration = 0; + struct test_sockmap_update *skel; + __u64 src_cookie, dst_cookie; + const __u32 zero = 0; + char dummy[14] = {0}; + __s64 sk; + + sk = connected_socket_v4(); + if (CHECK(sk == -1, "connected_socket_v4", "cannot connect\n")) + return; + + skel = test_sockmap_update__open_and_load(); + if (CHECK(!skel, "open_and_load", "cannot load skeleton\n")) + goto close_sk; + + prog = bpf_program__fd(skel->progs.copy_sock_map); + src = bpf_map__fd(skel->maps.src); + if (map_type == BPF_MAP_TYPE_SOCKMAP) + dst = bpf_map__fd(skel->maps.dst_sock_map); + else + dst = bpf_map__fd(skel->maps.dst_sock_hash); + + err = bpf_map_update_elem(src, &zero, &sk, BPF_NOEXIST); + if (CHECK(err, "update_elem(src)", "errno=%u\n", errno)) + goto out; + + err = bpf_map_lookup_elem(src, &zero, &src_cookie); + if (CHECK(err, "lookup_elem(src, cookie)", "errno=%u\n", errno)) + goto out; + + tattr = (struct bpf_prog_test_run_attr){ + .prog_fd = prog, + .repeat = 1, + .data_in = dummy, + .data_size_in = sizeof(dummy), + }; + + err = bpf_prog_test_run_xattr(&tattr); + if (CHECK_ATTR(err || !tattr.retval, "bpf_prog_test_run", + "errno=%u retval=%u\n", errno, tattr.retval)) + goto out; + + err = bpf_map_lookup_elem(dst, &zero, &dst_cookie); + if (CHECK(err, "lookup_elem(dst, cookie)", "errno=%u\n", 
errno)) + goto out; + + CHECK(dst_cookie != src_cookie, "cookie mismatch", "%llu != %llu\n", + dst_cookie, src_cookie); + +out: + test_sockmap_update__destroy(skel); +close_sk: + close(sk); +} + +static void test_sockmap_invalid_update(void) +{ + struct test_sockmap_invalid_update *skel; + int duration = 0; + + skel = test_sockmap_invalid_update__open_and_load(); + if (CHECK(skel, "open_and_load", "verifier accepted map_update\n")) + test_sockmap_invalid_update__destroy(skel); +} + void test_sockmap_basic(void) { if (test__start_subtest("sockmap create_update_free")) @@ -111,4 +181,10 @@ void test_sockmap_basic(void) test_skmsg_helpers(BPF_MAP_TYPE_SOCKMAP); if (test__start_subtest("sockhash sk_msg load helpers")) test_skmsg_helpers(BPF_MAP_TYPE_SOCKHASH); + if (test__start_subtest("sockmap update")) + test_sockmap_update(BPF_MAP_TYPE_SOCKMAP); + if (test__start_subtest("sockhash update")) + test_sockmap_update(BPF_MAP_TYPE_SOCKHASH); + if (test__start_subtest("sockmap update in unsafe context")) + test_sockmap_invalid_update(); } diff --git a/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c new file mode 100644 index 000000000000..24ba0d21b641 --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/tcp_hdr_options.c @@ -0,0 +1,622 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ + +#define _GNU_SOURCE +#include <sched.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/socket.h> +#include <linux/compiler.h> + +#include "test_progs.h" +#include "cgroup_helpers.h" +#include "network_helpers.h" +#include "test_tcp_hdr_options.h" +#include "test_tcp_hdr_options.skel.h" +#include "test_misc_tcp_hdr_options.skel.h" + +#define LO_ADDR6 "::eB9F" +#define CG_NAME "/tcpbpf-hdr-opt-test" + +struct bpf_test_option exp_passive_estab_in; +struct bpf_test_option exp_active_estab_in; +struct bpf_test_option exp_passive_fin_in; +struct bpf_test_option exp_active_fin_in; +struct hdr_stg exp_passive_hdr_stg; +struct hdr_stg exp_active_hdr_stg = { .active = true, }; + +static struct test_misc_tcp_hdr_options *misc_skel; +static struct test_tcp_hdr_options *skel; +static int lport_linum_map_fd; +static int hdr_stg_map_fd; +static __u32 duration; +static int cg_fd; + +struct sk_fds { + int srv_fd; + int passive_fd; + int active_fd; + int passive_lport; + int active_lport; +}; + +static int add_lo_addr(void) +{ + char ip_addr_cmd[256]; + int cmdlen; + + cmdlen = snprintf(ip_addr_cmd, sizeof(ip_addr_cmd), + "ip -6 addr add %s/128 dev lo scope host", + LO_ADDR6); + + if (CHECK(cmdlen >= sizeof(ip_addr_cmd), "compile ip cmd", + "failed to add host addr %s to lo. 
ip cmdlen is too long\n", + LO_ADDR6)) + return -1; + + if (CHECK(system(ip_addr_cmd), "run ip cmd", + "failed to add host addr %s to lo\n", LO_ADDR6)) + return -1; + + return 0; +} + +static int create_netns(void) +{ + if (CHECK(unshare(CLONE_NEWNET), "create netns", + "unshare(CLONE_NEWNET): %s (%d)", + strerror(errno), errno)) + return -1; + + if (CHECK(system("ip link set dev lo up"), "run ip cmd", + "failed to bring lo link up\n")) + return -1; + + if (add_lo_addr()) + return -1; + + return 0; +} + +static int write_sysctl(const char *sysctl, const char *value) +{ + int fd, err, len; + + fd = open(sysctl, O_WRONLY); + if (CHECK(fd == -1, "open sysctl", "open(%s): %s (%d)\n", + sysctl, strerror(errno), errno)) + return -1; + + len = strlen(value); + err = write(fd, value, len); + close(fd); + if (CHECK(err != len, "write sysctl", + "write(%s, %s): err:%d %s (%d)\n", + sysctl, value, err, strerror(errno), errno)) + return -1; + + return 0; +} + +static void print_hdr_stg(const struct hdr_stg *hdr_stg, const char *prefix) +{ + fprintf(stderr, "%s{active:%u, resend_syn:%u, syncookie:%u, fastopen:%u}\n", + prefix ? : "", hdr_stg->active, hdr_stg->resend_syn, + hdr_stg->syncookie, hdr_stg->fastopen); +} + +static void print_option(const struct bpf_test_option *opt, const char *prefix) +{ + fprintf(stderr, "%s{flags:0x%x, max_delack_ms:%u, rand:0x%x}\n", + prefix ? : "", opt->flags, opt->max_delack_ms, opt->rand); +} + +static void sk_fds_close(struct sk_fds *sk_fds) +{ + close(sk_fds->srv_fd); + close(sk_fds->passive_fd); + close(sk_fds->active_fd); +} + +static int sk_fds_shutdown(struct sk_fds *sk_fds) +{ + int ret, abyte; + + shutdown(sk_fds->active_fd, SHUT_WR); + ret = read(sk_fds->passive_fd, &abyte, sizeof(abyte)); + if (CHECK(ret != 0, "read-after-shutdown(passive_fd):", + "ret:%d %s (%d)\n", + ret, strerror(errno), errno)) + return -1; + + shutdown(sk_fds->passive_fd, SHUT_WR); + ret = read(sk_fds->active_fd, &abyte, sizeof(abyte)); + if (CHECK(ret != 0, "read-after-shutdown(active_fd):", + "ret:%d %s (%d)\n", + ret, strerror(errno), errno)) + return -1; + + return 0; +} + +static int sk_fds_connect(struct sk_fds *sk_fds, bool fast_open) +{ + const char fast[] = "FAST!!!"; + struct sockaddr_in6 addr6; + socklen_t len; + + sk_fds->srv_fd = start_server(AF_INET6, SOCK_STREAM, LO_ADDR6, 0, 0); + if (CHECK(sk_fds->srv_fd == -1, "start_server", "%s (%d)\n", + strerror(errno), errno)) + goto error; + + if (fast_open) + sk_fds->active_fd = fastopen_connect(sk_fds->srv_fd, fast, + sizeof(fast), 0); + else + sk_fds->active_fd = connect_to_fd(sk_fds->srv_fd, 0); + + if (CHECK_FAIL(sk_fds->active_fd == -1)) { + close(sk_fds->srv_fd); + goto error; + } + + len = sizeof(addr6); + if (CHECK(getsockname(sk_fds->srv_fd, (struct sockaddr *)&addr6, + &len), "getsockname(srv_fd)", "%s (%d)\n", + strerror(errno), errno)) + goto error_close; + sk_fds->passive_lport = ntohs(addr6.sin6_port); + + len = sizeof(addr6); + if (CHECK(getsockname(sk_fds->active_fd, (struct sockaddr *)&addr6, + &len), "getsockname(active_fd)", "%s (%d)\n", + strerror(errno), errno)) + goto error_close; + sk_fds->active_lport = ntohs(addr6.sin6_port); + + sk_fds->passive_fd = accept(sk_fds->srv_fd, NULL, 0); + if (CHECK(sk_fds->passive_fd == -1, "accept(srv_fd)", "%s (%d)\n", + strerror(errno), errno)) + goto error_close; + + if (fast_open) { + char bytes_in[sizeof(fast)]; + int ret; + + ret = read(sk_fds->passive_fd, bytes_in, sizeof(bytes_in)); + if (CHECK(ret != sizeof(fast), "read fastopen syn data", + "expected=%lu 
actual=%d\n", sizeof(fast), ret)) { + close(sk_fds->passive_fd); + goto error_close; + } + } + + return 0; + +error_close: + close(sk_fds->active_fd); + close(sk_fds->srv_fd); + +error: + memset(sk_fds, -1, sizeof(*sk_fds)); + return -1; +} + +static int check_hdr_opt(const struct bpf_test_option *exp, + const struct bpf_test_option *act, + const char *hdr_desc) +{ + if (CHECK(memcmp(exp, act, sizeof(*exp)), + "expected-vs-actual", "unexpected %s\n", hdr_desc)) { + print_option(exp, "expected: "); + print_option(act, " actual: "); + return -1; + } + + return 0; +} + +static int check_hdr_stg(const struct hdr_stg *exp, int fd, + const char *stg_desc) +{ + struct hdr_stg act; + + if (CHECK(bpf_map_lookup_elem(hdr_stg_map_fd, &fd, &act), + "map_lookup(hdr_stg_map_fd)", "%s %s (%d)\n", + stg_desc, strerror(errno), errno)) + return -1; + + if (CHECK(memcmp(exp, &act, sizeof(*exp)), + "expected-vs-actual", "unexpected %s\n", stg_desc)) { + print_hdr_stg(exp, "expected: "); + print_hdr_stg(&act, " actual: "); + return -1; + } + + return 0; +} + +static int check_error_linum(const struct sk_fds *sk_fds) +{ + unsigned int nr_errors = 0; + struct linum_err linum_err; + int lport; + + lport = sk_fds->passive_lport; + if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum_err)) { + fprintf(stderr, + "bpf prog error out at lport:passive(%d), linum:%u err:%d\n", + lport, linum_err.linum, linum_err.err); + nr_errors++; + } + + lport = sk_fds->active_lport; + if (!bpf_map_lookup_elem(lport_linum_map_fd, &lport, &linum_err)) { + fprintf(stderr, + "bpf prog error out at lport:active(%d), linum:%u err:%d\n", + lport, linum_err.linum, linum_err.err); + nr_errors++; + } + + return nr_errors; +} + +static void check_hdr_and_close_fds(struct sk_fds *sk_fds) +{ + if (sk_fds_shutdown(sk_fds)) + goto check_linum; + + if (check_hdr_stg(&exp_passive_hdr_stg, sk_fds->passive_fd, + "passive_hdr_stg")) + goto check_linum; + + if (check_hdr_stg(&exp_active_hdr_stg, sk_fds->active_fd, + "active_hdr_stg")) + goto check_linum; + + if (check_hdr_opt(&exp_passive_estab_in, &skel->bss->passive_estab_in, + "passive_estab_in")) + goto check_linum; + + if (check_hdr_opt(&exp_active_estab_in, &skel->bss->active_estab_in, + "active_estab_in")) + goto check_linum; + + if (check_hdr_opt(&exp_passive_fin_in, &skel->bss->passive_fin_in, + "passive_fin_in")) + goto check_linum; + + check_hdr_opt(&exp_active_fin_in, &skel->bss->active_fin_in, + "active_fin_in"); + +check_linum: + CHECK_FAIL(check_error_linum(sk_fds)); + sk_fds_close(sk_fds); +} + +static void prepare_out(void) +{ + skel->bss->active_syn_out = exp_passive_estab_in; + skel->bss->passive_synack_out = exp_active_estab_in; + + skel->bss->active_fin_out = exp_passive_fin_in; + skel->bss->passive_fin_out = exp_active_fin_in; +} + +static void reset_test(void) +{ + size_t optsize = sizeof(struct bpf_test_option); + int lport, err; + + memset(&skel->bss->passive_synack_out, 0, optsize); + memset(&skel->bss->passive_fin_out, 0, optsize); + + memset(&skel->bss->passive_estab_in, 0, optsize); + memset(&skel->bss->passive_fin_in, 0, optsize); + + memset(&skel->bss->active_syn_out, 0, optsize); + memset(&skel->bss->active_fin_out, 0, optsize); + + memset(&skel->bss->active_estab_in, 0, optsize); + memset(&skel->bss->active_fin_in, 0, optsize); + + skel->data->test_kind = TCPOPT_EXP; + skel->data->test_magic = 0xeB9F; + + memset(&exp_passive_estab_in, 0, optsize); + memset(&exp_active_estab_in, 0, optsize); + memset(&exp_passive_fin_in, 0, optsize); + memset(&exp_active_fin_in, 0, 
optsize); + + memset(&exp_passive_hdr_stg, 0, sizeof(exp_passive_hdr_stg)); + memset(&exp_active_hdr_stg, 0, sizeof(exp_active_hdr_stg)); + exp_active_hdr_stg.active = true; + + err = bpf_map_get_next_key(lport_linum_map_fd, NULL, &lport); + while (!err) { + bpf_map_delete_elem(lport_linum_map_fd, &lport); + err = bpf_map_get_next_key(lport_linum_map_fd, &lport, &lport); + } +} + +static void fastopen_estab(void) +{ + struct bpf_link *link; + struct sk_fds sk_fds; + + hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map); + lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map); + + exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_passive_estab_in.rand = 0xfa; + exp_passive_estab_in.max_delack_ms = 11; + + exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_active_estab_in.rand = 0xce; + exp_active_estab_in.max_delack_ms = 22; + + exp_passive_hdr_stg.fastopen = true; + + prepare_out(); + + /* Allow fastopen without fastopen cookie */ + if (write_sysctl("/proc/sys/net/ipv4/tcp_fastopen", "1543")) + return; + + link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, true)) { + bpf_link__destroy(link); + return; + } + + check_hdr_and_close_fds(&sk_fds); + bpf_link__destroy(link); +} + +static void syncookie_estab(void) +{ + struct bpf_link *link; + struct sk_fds sk_fds; + + hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map); + lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map); + + exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_passive_estab_in.rand = 0xfa; + exp_passive_estab_in.max_delack_ms = 11; + + exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS | + OPTION_F_RESEND; + exp_active_estab_in.rand = 0xce; + exp_active_estab_in.max_delack_ms = 22; + + exp_passive_hdr_stg.syncookie = true; + exp_active_hdr_stg.resend_syn = true, + + prepare_out(); + + /* Clear the RESEND to ensure the bpf prog can learn + * want_cookie and set the RESEND by itself. 
+ */ + skel->bss->passive_synack_out.flags &= ~OPTION_F_RESEND; + + /* Enforce syncookie mode */ + if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "2")) + return; + + link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, false)) { + bpf_link__destroy(link); + return; + } + + check_hdr_and_close_fds(&sk_fds); + bpf_link__destroy(link); +} + +static void fin(void) +{ + struct bpf_link *link; + struct sk_fds sk_fds; + + hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map); + lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map); + + exp_passive_fin_in.flags = OPTION_F_RAND; + exp_passive_fin_in.rand = 0xfa; + + exp_active_fin_in.flags = OPTION_F_RAND; + exp_active_fin_in.rand = 0xce; + + prepare_out(); + + if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1")) + return; + + link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, false)) { + bpf_link__destroy(link); + return; + } + + check_hdr_and_close_fds(&sk_fds); + bpf_link__destroy(link); +} + +static void __simple_estab(bool exprm) +{ + struct bpf_link *link; + struct sk_fds sk_fds; + + hdr_stg_map_fd = bpf_map__fd(skel->maps.hdr_stg_map); + lport_linum_map_fd = bpf_map__fd(skel->maps.lport_linum_map); + + exp_passive_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_passive_estab_in.rand = 0xfa; + exp_passive_estab_in.max_delack_ms = 11; + + exp_active_estab_in.flags = OPTION_F_RAND | OPTION_F_MAX_DELACK_MS; + exp_active_estab_in.rand = 0xce; + exp_active_estab_in.max_delack_ms = 22; + + prepare_out(); + + if (!exprm) { + skel->data->test_kind = 0xB9; + skel->data->test_magic = 0; + } + + if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1")) + return; + + link = bpf_program__attach_cgroup(skel->progs.estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, false)) { + bpf_link__destroy(link); + return; + } + + check_hdr_and_close_fds(&sk_fds); + bpf_link__destroy(link); +} + +static void no_exprm_estab(void) +{ + __simple_estab(false); +} + +static void simple_estab(void) +{ + __simple_estab(true); +} + +static void misc(void) +{ + const char send_msg[] = "MISC!!!"; + char recv_msg[sizeof(send_msg)]; + const unsigned int nr_data = 2; + struct bpf_link *link; + struct sk_fds sk_fds; + int i, ret; + + lport_linum_map_fd = bpf_map__fd(misc_skel->maps.lport_linum_map); + + if (write_sysctl("/proc/sys/net/ipv4/tcp_syncookies", "1")) + return; + + link = bpf_program__attach_cgroup(misc_skel->progs.misc_estab, cg_fd); + if (CHECK(IS_ERR(link), "attach_cgroup(misc_estab)", "err: %ld\n", + PTR_ERR(link))) + return; + + if (sk_fds_connect(&sk_fds, false)) { + bpf_link__destroy(link); + return; + } + + for (i = 0; i < nr_data; i++) { + /* MSG_EOR to ensure skb will not be combined */ + ret = send(sk_fds.active_fd, send_msg, sizeof(send_msg), + MSG_EOR); + if (CHECK(ret != sizeof(send_msg), "send(msg)", "ret:%d\n", + ret)) + goto check_linum; + + ret = read(sk_fds.passive_fd, recv_msg, sizeof(recv_msg)); + if (CHECK(ret != sizeof(send_msg), "read(msg)", "ret:%d\n", + ret)) + goto check_linum; + } + + if (sk_fds_shutdown(&sk_fds)) + goto check_linum; + + CHECK(misc_skel->bss->nr_syn != 1, "unexpected nr_syn", + "expected (1) != actual (%u)\n", + misc_skel->bss->nr_syn); + + 
CHECK(misc_skel->bss->nr_data != nr_data, "unexpected nr_data", + "expected (%u) != actual (%u)\n", + nr_data, misc_skel->bss->nr_data); + + /* The last ACK may have been delayed, so it is either 1 or 2. */ + CHECK(misc_skel->bss->nr_pure_ack != 1 && + misc_skel->bss->nr_pure_ack != 2, + "unexpected nr_pure_ack", + "expected (1 or 2) != actual (%u)\n", + misc_skel->bss->nr_pure_ack); + + CHECK(misc_skel->bss->nr_fin != 1, "unexpected nr_fin", + "expected (1) != actual (%u)\n", + misc_skel->bss->nr_fin); + +check_linum: + CHECK_FAIL(check_error_linum(&sk_fds)); + sk_fds_close(&sk_fds); + bpf_link__destroy(link); +} + +struct test { + const char *desc; + void (*run)(void); +}; + +#define DEF_TEST(name) { #name, name } +static struct test tests[] = { + DEF_TEST(simple_estab), + DEF_TEST(no_exprm_estab), + DEF_TEST(syncookie_estab), + DEF_TEST(fastopen_estab), + DEF_TEST(fin), + DEF_TEST(misc), +}; + +void test_tcp_hdr_options(void) +{ + int i; + + skel = test_tcp_hdr_options__open_and_load(); + if (CHECK(!skel, "open and load skel", "failed")) + return; + + misc_skel = test_misc_tcp_hdr_options__open_and_load(); + if (CHECK(!misc_skel, "open and load misc test skel", "failed")) + goto skel_destroy; + + cg_fd = test__join_cgroup(CG_NAME); + if (CHECK_FAIL(cg_fd < 0)) + goto skel_destroy; + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + if (!test__start_subtest(tests[i].desc)) + continue; + + if (create_netns()) + break; + + tests[i].run(); + + reset_test(); + } + + close(cg_fd); +skel_destroy: + test_misc_tcp_hdr_options__destroy(misc_skel); + test_tcp_hdr_options__destroy(skel); +} diff --git a/tools/testing/selftests/bpf/prog_tests/test_bpffs.c b/tools/testing/selftests/bpf/prog_tests/test_bpffs.c new file mode 100644 index 000000000000..172c999e523c --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/test_bpffs.c @@ -0,0 +1,94 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ +#define _GNU_SOURCE +#include <sched.h> +#include <sys/mount.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <test_progs.h> + +#define TDIR "/sys/kernel/debug" + +static int read_iter(char *file) +{ + /* 1024 should be enough to get contiguous 4 "iter" letters at some point */ + char buf[1024]; + int fd, len; + + fd = open(file, 0); + if (fd < 0) + return -1; + while ((len = read(fd, buf, sizeof(buf))) > 0) + if (strstr(buf, "iter")) { + close(fd); + return 0; + } + close(fd); + return -1; +} + +static int fn(void) +{ + int err, duration = 0; + + err = unshare(CLONE_NEWNS); + if (CHECK(err, "unshare", "failed: %d\n", errno)) + goto out; + + err = mount("", "/", "", MS_REC | MS_PRIVATE, NULL); + if (CHECK(err, "mount /", "failed: %d\n", errno)) + goto out; + + err = umount(TDIR); + if (CHECK(err, "umount " TDIR, "failed: %d\n", errno)) + goto out; + + err = mount("none", TDIR, "tmpfs", 0, NULL); + if (CHECK(err, "mount", "mount root failed: %d\n", errno)) + goto out; + + err = mkdir(TDIR "/fs1", 0777); + if (CHECK(err, "mkdir "TDIR"/fs1", "failed: %d\n", errno)) + goto out; + err = mkdir(TDIR "/fs2", 0777); + if (CHECK(err, "mkdir "TDIR"/fs2", "failed: %d\n", errno)) + goto out; + + err = mount("bpf", TDIR "/fs1", "bpf", 0, NULL); + if (CHECK(err, "mount bpffs "TDIR"/fs1", "failed: %d\n", errno)) + goto out; + err = mount("bpf", TDIR "/fs2", "bpf", 0, NULL); + if (CHECK(err, "mount bpffs " TDIR "/fs2", "failed: %d\n", errno)) + goto out; + + err = read_iter(TDIR "/fs1/maps.debug"); + if (CHECK(err, "reading " TDIR "/fs1/maps.debug", "failed\n")) + goto out; + err = 
read_iter(TDIR "/fs2/progs.debug");
+	if (CHECK(err, "reading " TDIR "/fs2/progs.debug", "failed\n"))
+		goto out;
+out:
+	umount(TDIR "/fs1");
+	umount(TDIR "/fs2");
+	rmdir(TDIR "/fs1");
+	rmdir(TDIR "/fs2");
+	umount(TDIR);
+	exit(err);
+}
+
+void test_test_bpffs(void)
+{
+	int err, duration = 0, status = 0;
+	pid_t pid;
+
+	pid = fork();
+	if (CHECK(pid == -1, "fork", "fork failed %d", errno))
+		return;
+	if (pid == 0)
+		fn();
+	err = waitpid(pid, &status, 0);
+	if (CHECK(err == -1 && errno != ECHILD, "waitpid", "failed %d", errno))
+		return;
+	if (CHECK(WEXITSTATUS(status), "bpffs test ", "failed %d", WEXITSTATUS(status)))
+		return;
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/test_local_storage.c b/tools/testing/selftests/bpf/prog_tests/test_local_storage.c
new file mode 100644
index 000000000000..91cd6f357246
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_local_storage.c
@@ -0,0 +1,60 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Copyright (C) 2020 Google LLC.
+ */
+
+#include <test_progs.h>
+#include <linux/limits.h>
+
+#include "local_storage.skel.h"
+#include "network_helpers.h"
+
+int create_and_unlink_file(void)
+{
+	char fname[PATH_MAX] = "/tmp/fileXXXXXX";
+	int fd;
+
+	fd = mkstemp(fname);
+	if (fd < 0)
+		return fd;
+
+	close(fd);
+	unlink(fname);
+	return 0;
+}
+
+void test_test_local_storage(void)
+{
+	struct local_storage *skel = NULL;
+	int err, duration = 0, serv_sk = -1;
+
+	skel = local_storage__open_and_load();
+	if (CHECK(!skel, "skel_load", "lsm skeleton failed\n"))
+		goto close_prog;
+
+	err = local_storage__attach(skel);
+	if (CHECK(err, "attach", "lsm attach failed: %d\n", err))
+		goto close_prog;
+
+	skel->bss->monitored_pid = getpid();
+
+	err = create_and_unlink_file();
+	if (CHECK(err < 0, "exec_cmd", "err %d errno %d\n", err, errno))
+		goto close_prog;
+
+	CHECK(skel->data->inode_storage_result != 0, "inode_storage_result",
+	      "inode_local_storage not set\n");
+
+	serv_sk = start_server(AF_INET6, SOCK_STREAM, NULL, 0, 0);
+	if (CHECK(serv_sk < 0, "start_server", "failed to start server\n"))
+		goto close_prog;
+
+	CHECK(skel->data->sk_storage_result != 0, "sk_storage_result",
+	      "sk_local_storage not set\n");
+
+	close(serv_sk);
+
+close_prog:
+	local_storage__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/test_lsm.c b/tools/testing/selftests/bpf/prog_tests/test_lsm.c
index b17eb2045c1d..6ab29226c99b 100644
--- a/tools/testing/selftests/bpf/prog_tests/test_lsm.c
+++ b/tools/testing/selftests/bpf/prog_tests/test_lsm.c
@@ -10,6 +10,7 @@
 #include <unistd.h>
 #include <malloc.h>
 #include <stdlib.h>
+#include <unistd.h>
 
 #include "lsm.skel.h"
 
@@ -55,6 +56,7 @@ void test_test_lsm(void)
 {
 	struct lsm *skel = NULL;
 	int err, duration = 0;
+	int buf = 1234;
 
 	skel = lsm__open_and_load();
 	if (CHECK(!skel, "skel_load", "lsm skeleton failed\n"))
@@ -81,6 +83,13 @@ void test_test_lsm(void)
 	CHECK(skel->bss->mprotect_count != 1, "mprotect_count",
 	      "mprotect_count = %d\n", skel->bss->mprotect_count);
 
+	syscall(__NR_setdomainname, &buf, -2L);
+	syscall(__NR_setdomainname, 0, -3L);
+	syscall(__NR_setdomainname, ~0L, -4L);
+
+	CHECK(skel->bss->copy_test != 3, "copy_test",
+	      "copy_test = %d\n", skel->bss->copy_test);
+
 close_prog:
 	lsm__destroy(skel);
 }
diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval.c
new file mode 100644
index 000000000000..48e62f3f074f
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval.c
@@ -0,0 +1,3 @@
+#include "core_reloc_types.h" + +void f(struct core_reloc_enumval x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___diff.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___diff.c new file mode 100644 index 000000000000..53e5e5a76888 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___diff.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_enumval___diff x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___err_missing.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___err_missing.c new file mode 100644 index 000000000000..d024fb2ac06e --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___err_missing.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_enumval___err_missing x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___val3_missing.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___val3_missing.c new file mode 100644 index 000000000000..9de6595d250c --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_enumval___val3_missing.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_enumval___val3_missing x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_size___err_ambiguous.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_size___err_ambiguous.c new file mode 100644 index 000000000000..f3e9904df9c2 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_size___err_ambiguous.c @@ -0,0 +1,4 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_size___err_ambiguous1 x, + struct core_reloc_size___err_ambiguous2 y) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based.c new file mode 100644 index 000000000000..fc3f69e58c71 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_type_based x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___all_missing.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___all_missing.c new file mode 100644 index 000000000000..51511648b4ec --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___all_missing.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_type_based___all_missing x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___diff_sz.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___diff_sz.c new file mode 100644 index 000000000000..67db3dceb279 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___diff_sz.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_type_based___diff_sz x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___fn_wrong_args.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___fn_wrong_args.c new file mode 100644 index 000000000000..b357fc65431d --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___fn_wrong_args.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_type_based___fn_wrong_args x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___incompat.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___incompat.c new 
file mode 100644 index 000000000000..8ddf20d33d9e --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_based___incompat.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_type_based___incompat x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_type_id.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_id.c new file mode 100644 index 000000000000..abbe5bddcefd --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_id.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_type_id x) {} diff --git a/tools/testing/selftests/bpf/progs/btf__core_reloc_type_id___missing_targets.c b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_id___missing_targets.c new file mode 100644 index 000000000000..24e7caf4f013 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/btf__core_reloc_type_id___missing_targets.c @@ -0,0 +1,3 @@ +#include "core_reloc_types.h" + +void f(struct core_reloc_type_id___missing_targets x) {} diff --git a/tools/testing/selftests/bpf/progs/core_reloc_types.h b/tools/testing/selftests/bpf/progs/core_reloc_types.h index 69139ed66216..e6e616cb7bc9 100644 --- a/tools/testing/selftests/bpf/progs/core_reloc_types.h +++ b/tools/testing/selftests/bpf/progs/core_reloc_types.h @@ -652,7 +652,7 @@ struct core_reloc_misc_extensible { }; /* - * EXISTENCE + * FIELD EXISTENCE */ struct core_reloc_existence_output { int a_exists; @@ -809,3 +809,353 @@ struct core_reloc_size___diff_sz { void *ptr_field; enum { OTHER_VALUE = 0xFFFFFFFFFFFFFFFF } enum_field; }; + +/* Error case of two candidates with the fields (int_field) at the same + * offset, but with differing final relocation values: size 4 vs size 1 + */ +struct core_reloc_size___err_ambiguous1 { + /* int at offset 0 */ + int int_field; + + struct { int x; } struct_field; + union { int x; } union_field; + int arr_field[4]; + void *ptr_field; + enum { VALUE___1 = 123 } enum_field; +}; + +struct core_reloc_size___err_ambiguous2 { + /* char at offset 0 */ + char int_field; + + struct { int x; } struct_field; + union { int x; } union_field; + int arr_field[4]; + void *ptr_field; + enum { VALUE___2 = 123 } enum_field; +}; + +/* + * TYPE EXISTENCE & SIZE + */ +struct core_reloc_type_based_output { + bool struct_exists; + bool union_exists; + bool enum_exists; + bool typedef_named_struct_exists; + bool typedef_anon_struct_exists; + bool typedef_struct_ptr_exists; + bool typedef_int_exists; + bool typedef_enum_exists; + bool typedef_void_ptr_exists; + bool typedef_func_proto_exists; + bool typedef_arr_exists; + + int struct_sz; + int union_sz; + int enum_sz; + int typedef_named_struct_sz; + int typedef_anon_struct_sz; + int typedef_struct_ptr_sz; + int typedef_int_sz; + int typedef_enum_sz; + int typedef_void_ptr_sz; + int typedef_func_proto_sz; + int typedef_arr_sz; +}; + +struct a_struct { + int x; +}; + +union a_union { + int y; + int z; +}; + +typedef struct a_struct named_struct_typedef; + +typedef struct { int x, y, z; } anon_struct_typedef; + +typedef struct { + int a, b, c; +} *struct_ptr_typedef; + +enum an_enum { + AN_ENUM_VAL1 = 1, + AN_ENUM_VAL2 = 2, + AN_ENUM_VAL3 = 3, +}; + +typedef int int_typedef; + +typedef enum { TYPEDEF_ENUM_VAL1, TYPEDEF_ENUM_VAL2 } enum_typedef; + +typedef void *void_ptr_typedef; + +typedef int (*func_proto_typedef)(long); + +typedef char arr_typedef[20]; + +struct core_reloc_type_based { + struct a_struct f1; + union a_union f2; + enum an_enum f3; + named_struct_typedef f4; + anon_struct_typedef 
f5; + struct_ptr_typedef f6; + int_typedef f7; + enum_typedef f8; + void_ptr_typedef f9; + func_proto_typedef f10; + arr_typedef f11; +}; + +/* no types in target */ +struct core_reloc_type_based___all_missing { +}; + +/* different type sizes, extra modifiers, anon vs named enums, etc */ +struct a_struct___diff_sz { + long x; + int y; + char z; +}; + +union a_union___diff_sz { + char yy; + char zz; +}; + +typedef struct a_struct___diff_sz named_struct_typedef___diff_sz; + +typedef struct { long xx, yy, zzz; } anon_struct_typedef___diff_sz; + +typedef struct { + char aa[1], bb[2], cc[3]; +} *struct_ptr_typedef___diff_sz; + +enum an_enum___diff_sz { + AN_ENUM_VAL1___diff_sz = 0x123412341234, + AN_ENUM_VAL2___diff_sz = 2, +}; + +typedef unsigned long int_typedef___diff_sz; + +typedef enum an_enum___diff_sz enum_typedef___diff_sz; + +typedef const void * const void_ptr_typedef___diff_sz; + +typedef int_typedef___diff_sz (*func_proto_typedef___diff_sz)(char); + +typedef int arr_typedef___diff_sz[2]; + +struct core_reloc_type_based___diff_sz { + struct a_struct___diff_sz f1; + union a_union___diff_sz f2; + enum an_enum___diff_sz f3; + named_struct_typedef___diff_sz f4; + anon_struct_typedef___diff_sz f5; + struct_ptr_typedef___diff_sz f6; + int_typedef___diff_sz f7; + enum_typedef___diff_sz f8; + void_ptr_typedef___diff_sz f9; + func_proto_typedef___diff_sz f10; + arr_typedef___diff_sz f11; +}; + +/* incompatibilities between target and local types */ +union a_struct___incompat { /* union instead of struct */ + int x; +}; + +struct a_union___incompat { /* struct instead of union */ + int y; + int z; +}; + +/* typedef to union, not to struct */ +typedef union a_struct___incompat named_struct_typedef___incompat; + +/* typedef to void pointer, instead of struct */ +typedef void *anon_struct_typedef___incompat; + +/* extra pointer indirection */ +typedef struct { + int a, b, c; +} **struct_ptr_typedef___incompat; + +/* typedef of a struct with int, instead of int */ +typedef struct { int x; } int_typedef___incompat; + +/* typedef to func_proto, instead of enum */ +typedef int (*enum_typedef___incompat)(void); + +/* pointer to char instead of void */ +typedef char *void_ptr_typedef___incompat; + +/* void return type instead of int */ +typedef void (*func_proto_typedef___incompat)(long); + +/* multi-dimensional array instead of a single-dimensional */ +typedef int arr_typedef___incompat[20][2]; + +struct core_reloc_type_based___incompat { + union a_struct___incompat f1; + struct a_union___incompat f2; + /* the only valid one is enum, to check that something still succeeds */ + enum an_enum f3; + named_struct_typedef___incompat f4; + anon_struct_typedef___incompat f5; + struct_ptr_typedef___incompat f6; + int_typedef___incompat f7; + enum_typedef___incompat f8; + void_ptr_typedef___incompat f9; + func_proto_typedef___incompat f10; + arr_typedef___incompat f11; +}; + +/* func_proto with incompatible signature */ +typedef void (*func_proto_typedef___fn_wrong_ret1)(long); +typedef int * (*func_proto_typedef___fn_wrong_ret2)(long); +typedef struct { int x; } int_struct_typedef; +typedef int_struct_typedef (*func_proto_typedef___fn_wrong_ret3)(long); +typedef int (*func_proto_typedef___fn_wrong_arg)(void *); +typedef int (*func_proto_typedef___fn_wrong_arg_cnt1)(long, long); +typedef int (*func_proto_typedef___fn_wrong_arg_cnt2)(void); + +struct core_reloc_type_based___fn_wrong_args { + /* one valid type to make sure relos still work */ + struct a_struct f1; + func_proto_typedef___fn_wrong_ret1 f2; + 
func_proto_typedef___fn_wrong_ret2 f3; + func_proto_typedef___fn_wrong_ret3 f4; + func_proto_typedef___fn_wrong_arg f5; + func_proto_typedef___fn_wrong_arg_cnt1 f6; + func_proto_typedef___fn_wrong_arg_cnt2 f7; +}; + +/* + * TYPE ID MAPPING (LOCAL AND TARGET) + */ +struct core_reloc_type_id_output { + int local_anon_struct; + int local_anon_union; + int local_anon_enum; + int local_anon_func_proto_ptr; + int local_anon_void_ptr; + int local_anon_arr; + + int local_struct; + int local_union; + int local_enum; + int local_int; + int local_struct_typedef; + int local_func_proto_typedef; + int local_arr_typedef; + + int targ_struct; + int targ_union; + int targ_enum; + int targ_int; + int targ_struct_typedef; + int targ_func_proto_typedef; + int targ_arr_typedef; +}; + +struct core_reloc_type_id { + struct a_struct f1; + union a_union f2; + enum an_enum f3; + named_struct_typedef f4; + func_proto_typedef f5; + arr_typedef f6; +}; + +struct core_reloc_type_id___missing_targets { + /* nothing */ +}; + +/* + * ENUMERATOR VALUE EXISTENCE AND VALUE RELOCATION + */ +struct core_reloc_enumval_output { + bool named_val1_exists; + bool named_val2_exists; + bool named_val3_exists; + bool anon_val1_exists; + bool anon_val2_exists; + bool anon_val3_exists; + + int named_val1; + int named_val2; + int anon_val1; + int anon_val2; +}; + +enum named_enum { + NAMED_ENUM_VAL1 = 1, + NAMED_ENUM_VAL2 = 2, + NAMED_ENUM_VAL3 = 3, +}; + +typedef enum { + ANON_ENUM_VAL1 = 0x10, + ANON_ENUM_VAL2 = 0x20, + ANON_ENUM_VAL3 = 0x30, +} anon_enum; + +struct core_reloc_enumval { + enum named_enum f1; + anon_enum f2; +}; + +/* differing enumerator values */ +enum named_enum___diff { + NAMED_ENUM_VAL1___diff = 101, + NAMED_ENUM_VAL2___diff = 202, + NAMED_ENUM_VAL3___diff = 303, +}; + +typedef enum { + ANON_ENUM_VAL1___diff = 0x11, + ANON_ENUM_VAL2___diff = 0x22, + ANON_ENUM_VAL3___diff = 0x33, +} anon_enum___diff; + +struct core_reloc_enumval___diff { + enum named_enum___diff f1; + anon_enum___diff f2; +}; + +/* missing (optional) third enum value */ +enum named_enum___val3_missing { + NAMED_ENUM_VAL1___val3_missing = 111, + NAMED_ENUM_VAL2___val3_missing = 222, +}; + +typedef enum { + ANON_ENUM_VAL1___val3_missing = 0x111, + ANON_ENUM_VAL2___val3_missing = 0x222, +} anon_enum___val3_missing; + +struct core_reloc_enumval___val3_missing { + enum named_enum___val3_missing f1; + anon_enum___val3_missing f2; +}; + +/* missing (mandatory) second enum value, should fail */ +enum named_enum___err_missing { + NAMED_ENUM_VAL1___err_missing = 1, + NAMED_ENUM_VAL3___err_missing = 3, +}; + +typedef enum { + ANON_ENUM_VAL1___err_missing = 0x111, + ANON_ENUM_VAL3___err_missing = 0x222, +} anon_enum___err_missing; + +struct core_reloc_enumval___err_missing { + enum named_enum___err_missing f1; + anon_enum___err_missing f2; +}; diff --git a/tools/testing/selftests/bpf/progs/fexit_bpf2bpf.c b/tools/testing/selftests/bpf/progs/fexit_bpf2bpf.c index 98e1efe14549..49a84a3a2306 100644 --- a/tools/testing/selftests/bpf/progs/fexit_bpf2bpf.c +++ b/tools/testing/selftests/bpf/progs/fexit_bpf2bpf.c @@ -1,8 +1,10 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright (c) 2019 Facebook */ #include <linux/stddef.h> +#include <linux/if_ether.h> #include <linux/ipv6.h> #include <linux/bpf.h> +#include <linux/tcp.h> #include <bpf/bpf_helpers.h> #include <bpf/bpf_endian.h> #include <bpf/bpf_tracing.h> @@ -151,4 +153,29 @@ int new_get_constant(long val) test_get_constant = 1; return test_get_constant; /* original get_constant() returns val - 122 */ } + +__u64 
test_pkt_write_access_subprog = 0; +SEC("freplace/test_pkt_write_access_subprog") +int new_test_pkt_write_access_subprog(struct __sk_buff *skb, __u32 off) +{ + + void *data = (void *)(long)skb->data; + void *data_end = (void *)(long)skb->data_end; + struct tcphdr *tcp; + + if (off > sizeof(struct ethhdr) + sizeof(struct ipv6hdr)) + return -1; + + tcp = data + off; + if (tcp + 1 > data_end) + return -1; + + /* make modifications to the packet data */ + tcp->check++; + tcp->syn = 0; + + test_pkt_write_access_subprog = 1; + return 0; +} + char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/freplace_attach_probe.c b/tools/testing/selftests/bpf/progs/freplace_attach_probe.c new file mode 100644 index 000000000000..bb2a77c5b62b --- /dev/null +++ b/tools/testing/selftests/bpf/progs/freplace_attach_probe.c @@ -0,0 +1,40 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Facebook + +#include <linux/ptrace.h> +#include <linux/bpf.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +#define VAR_NUM 2 + +struct hmap_elem { + struct bpf_spin_lock lock; + int var[VAR_NUM]; +}; + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, struct hmap_elem); +} hash_map SEC(".maps"); + +SEC("freplace/handle_kprobe") +int new_handle_kprobe(struct pt_regs *ctx) +{ + struct hmap_elem zero = {}, *val; + int key = 0; + + val = bpf_map_lookup_elem(&hash_map, &key); + if (!val) + return 1; + /* spin_lock in hash map */ + bpf_spin_lock(&val->lock); + val->var[0] = 99; + bpf_spin_unlock(&val->lock); + + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c b/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c new file mode 100644 index 000000000000..68a5a9db928a --- /dev/null +++ b/tools/testing/selftests/bpf/progs/freplace_cls_redirect.c @@ -0,0 +1,34 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Facebook + +#include <linux/stddef.h> +#include <linux/bpf.h> +#include <linux/pkt_cls.h> +#include <bpf/bpf_endian.h> +#include <bpf/bpf_helpers.h> + +struct bpf_map_def SEC("maps") sock_map = { + .type = BPF_MAP_TYPE_SOCKMAP, + .key_size = sizeof(int), + .value_size = sizeof(int), + .max_entries = 2, +}; + +SEC("freplace/cls_redirect") +int freplace_cls_redirect_test(struct __sk_buff *skb) +{ + int ret = 0; + const int zero = 0; + struct bpf_sock *sk; + + sk = bpf_map_lookup_elem(&sock_map, &zero); + if (!sk) + return TC_ACT_SHOT; + + ret = bpf_map_update_elem(&sock_map, &zero, sk, 0); + bpf_sk_release(sk); + + return ret == 0 ? 
TC_ACT_OK : TC_ACT_SHOT; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/freplace_connect_v4_prog.c b/tools/testing/selftests/bpf/progs/freplace_connect_v4_prog.c new file mode 100644 index 000000000000..544e5ac90461 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/freplace_connect_v4_prog.c @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Facebook + +#include <linux/stddef.h> +#include <linux/ipv6.h> +#include <linux/bpf.h> +#include <linux/in.h> +#include <sys/socket.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> + +SEC("freplace/connect_v4_prog") +int new_connect_v4_prog(struct bpf_sock_addr *ctx) +{ + // return value thats in invalid range + return 255; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/local_storage.c b/tools/testing/selftests/bpf/progs/local_storage.c new file mode 100644 index 000000000000..0758ba229ae0 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/local_storage.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Copyright 2020 Google LLC. + */ + +#include <errno.h> +#include <linux/bpf.h> +#include <stdbool.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +char _license[] SEC("license") = "GPL"; + +#define DUMMY_STORAGE_VALUE 0xdeadbeef + +int monitored_pid = 0; +int inode_storage_result = -1; +int sk_storage_result = -1; + +struct dummy_storage { + __u32 value; +}; + +struct { + __uint(type, BPF_MAP_TYPE_INODE_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct dummy_storage); +} inode_storage_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_SK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_CLONE); + __type(key, int); + __type(value, struct dummy_storage); +} sk_storage_map SEC(".maps"); + +/* TODO Use vmlinux.h once BTF pruning for embedded types is fixed. 
+ */ +struct sock {} __attribute__((preserve_access_index)); +struct sockaddr {} __attribute__((preserve_access_index)); +struct socket { + struct sock *sk; +} __attribute__((preserve_access_index)); + +struct inode {} __attribute__((preserve_access_index)); +struct dentry { + struct inode *d_inode; +} __attribute__((preserve_access_index)); +struct file { + struct inode *f_inode; +} __attribute__((preserve_access_index)); + + +SEC("lsm/inode_unlink") +int BPF_PROG(unlink_hook, struct inode *dir, struct dentry *victim) +{ + __u32 pid = bpf_get_current_pid_tgid() >> 32; + struct dummy_storage *storage; + + if (pid != monitored_pid) + return 0; + + storage = bpf_inode_storage_get(&inode_storage_map, victim->d_inode, 0, + BPF_SK_STORAGE_GET_F_CREATE); + if (!storage) + return 0; + + if (storage->value == DUMMY_STORAGE_VALUE) + inode_storage_result = -1; + + inode_storage_result = + bpf_inode_storage_delete(&inode_storage_map, victim->d_inode); + + return 0; +} + +SEC("lsm/socket_bind") +int BPF_PROG(socket_bind, struct socket *sock, struct sockaddr *address, + int addrlen) +{ + __u32 pid = bpf_get_current_pid_tgid() >> 32; + struct dummy_storage *storage; + + if (pid != monitored_pid) + return 0; + + storage = bpf_sk_storage_get(&sk_storage_map, sock->sk, 0, + BPF_SK_STORAGE_GET_F_CREATE); + if (!storage) + return 0; + + if (storage->value == DUMMY_STORAGE_VALUE) + sk_storage_result = -1; + + sk_storage_result = bpf_sk_storage_delete(&sk_storage_map, sock->sk); + return 0; +} + +SEC("lsm/socket_post_create") +int BPF_PROG(socket_post_create, struct socket *sock, int family, int type, + int protocol, int kern) +{ + __u32 pid = bpf_get_current_pid_tgid() >> 32; + struct dummy_storage *storage; + + if (pid != monitored_pid) + return 0; + + storage = bpf_sk_storage_get(&sk_storage_map, sock->sk, 0, + BPF_SK_STORAGE_GET_F_CREATE); + if (!storage) + return 0; + + storage->value = DUMMY_STORAGE_VALUE; + + return 0; +} + +SEC("lsm/file_open") +int BPF_PROG(file_open, struct file *file) +{ + __u32 pid = bpf_get_current_pid_tgid() >> 32; + struct dummy_storage *storage; + + if (pid != monitored_pid) + return 0; + + if (!file->f_inode) + return 0; + + storage = bpf_inode_storage_get(&inode_storage_map, file->f_inode, 0, + BPF_LOCAL_STORAGE_GET_F_CREATE); + if (!storage) + return 0; + + storage->value = DUMMY_STORAGE_VALUE; + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/lsm.c b/tools/testing/selftests/bpf/progs/lsm.c index b4598d4bc4f7..ff4d343b94b5 100644 --- a/tools/testing/selftests/bpf/progs/lsm.c +++ b/tools/testing/selftests/bpf/progs/lsm.c @@ -9,6 +9,27 @@ #include <bpf/bpf_tracing.h> #include <errno.h> +struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} array SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} hash SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_LRU_HASH); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} lru_hash SEC(".maps"); + char _license[] SEC("license") = "GPL"; int monitored_pid = 0; @@ -36,13 +57,54 @@ int BPF_PROG(test_int_hook, struct vm_area_struct *vma, return ret; } -SEC("lsm/bprm_committed_creds") +SEC("lsm.s/bprm_committed_creds") int BPF_PROG(test_void_hook, struct linux_binprm *bprm) { __u32 pid = bpf_get_current_pid_tgid() >> 32; + char args[64]; + __u32 key = 0; + __u64 *value; if (monitored_pid == pid) bprm_count++; + bpf_copy_from_user(args, sizeof(args), (void 
*)bprm->vma->vm_mm->arg_start); + bpf_copy_from_user(args, sizeof(args), (void *)bprm->mm->arg_start); + + value = bpf_map_lookup_elem(&array, &key); + if (value) + *value = 0; + value = bpf_map_lookup_elem(&hash, &key); + if (value) + *value = 0; + value = bpf_map_lookup_elem(&lru_hash, &key); + if (value) + *value = 0; + + return 0; +} +SEC("lsm/task_free") /* lsm/ is ok, lsm.s/ fails */ +int BPF_PROG(test_task_free, struct task_struct *task) +{ + return 0; +} + +int copy_test = 0; + +SEC("fentry.s/__x64_sys_setdomainname") +int BPF_PROG(test_sys_setdomainname, struct pt_regs *regs) +{ + void *ptr = (void *)PT_REGS_PARM1(regs); + int len = PT_REGS_PARM2(regs); + int buf = 0; + long ret; + + ret = bpf_copy_from_user(&buf, sizeof(buf), ptr); + if (len == -2 && ret == 0 && buf == 1234) + copy_test++; + if (len == -3 && ret == -EFAULT) + copy_test++; + if (len == -4 && ret == -EFAULT) + copy_test++; return 0; } diff --git a/tools/testing/selftests/bpf/progs/map_ptr_kern.c b/tools/testing/selftests/bpf/progs/map_ptr_kern.c index 473665cac67e..982a2d8aa844 100644 --- a/tools/testing/selftests/bpf/progs/map_ptr_kern.c +++ b/tools/testing/selftests/bpf/progs/map_ptr_kern.c @@ -589,7 +589,7 @@ static inline int check_stack(void) return 1; } -struct bpf_sk_storage_map { +struct bpf_local_storage_map { struct bpf_map map; } __attribute__((preserve_access_index)); @@ -602,8 +602,8 @@ struct { static inline int check_sk_storage(void) { - struct bpf_sk_storage_map *sk_storage = - (struct bpf_sk_storage_map *)&m_sk_storage; + struct bpf_local_storage_map *sk_storage = + (struct bpf_local_storage_map *)&m_sk_storage; struct bpf_map *map = (struct bpf_map *)&m_sk_storage; VERIFY(check(&sk_storage->map, map, sizeof(__u32), sizeof(__u32), 0)); diff --git a/tools/testing/selftests/bpf/progs/test_btf_map_in_map.c b/tools/testing/selftests/bpf/progs/test_btf_map_in_map.c index e5093796be97..193fe0198b21 100644 --- a/tools/testing/selftests/bpf/progs/test_btf_map_in_map.c +++ b/tools/testing/selftests/bpf/progs/test_btf_map_in_map.c @@ -11,6 +11,13 @@ struct inner_map { } inner_map1 SEC(".maps"), inner_map2 SEC(".maps"); +struct inner_map_sz2 { + __uint(type, BPF_MAP_TYPE_ARRAY); + __uint(max_entries, 2); + __type(key, int); + __type(value, int); +} inner_map_sz2 SEC(".maps"); + struct outer_arr { __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); __uint(max_entries, 3); @@ -50,6 +57,30 @@ struct outer_hash { }, }; +struct sockarr_sz1 { + __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY); + __uint(max_entries, 1); + __type(key, int); + __type(value, int); +} sockarr_sz1 SEC(".maps"); + +struct sockarr_sz2 { + __uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY); + __uint(max_entries, 2); + __type(key, int); + __type(value, int); +} sockarr_sz2 SEC(".maps"); + +struct outer_sockarr_sz1 { + __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS); + __uint(max_entries, 1); + __uint(key_size, sizeof(int)); + __uint(value_size, sizeof(int)); + __array(values, struct sockarr_sz1); +} outer_sockarr SEC(".maps") = { + .values = { (void *)&sockarr_sz1 }, +}; + int input = 0; SEC("raw_tp/sys_enter") diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_enumval.c b/tools/testing/selftests/bpf/progs/test_core_reloc_enumval.c new file mode 100644 index 000000000000..e7ef3dada2bf --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_core_reloc_enumval.c @@ -0,0 +1,72 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Facebook + +#include <linux/bpf.h> +#include <stdint.h> +#include <stdbool.h> +#include <bpf/bpf_helpers.h> 
+#include <bpf/bpf_core_read.h> + +char _license[] SEC("license") = "GPL"; + +struct { + char in[256]; + char out[256]; + bool skip; +} data = {}; + +enum named_enum { + NAMED_ENUM_VAL1 = 1, + NAMED_ENUM_VAL2 = 2, + NAMED_ENUM_VAL3 = 3, +}; + +typedef enum { + ANON_ENUM_VAL1 = 0x10, + ANON_ENUM_VAL2 = 0x20, + ANON_ENUM_VAL3 = 0x30, +} anon_enum; + +struct core_reloc_enumval_output { + bool named_val1_exists; + bool named_val2_exists; + bool named_val3_exists; + bool anon_val1_exists; + bool anon_val2_exists; + bool anon_val3_exists; + + int named_val1; + int named_val2; + int anon_val1; + int anon_val2; +}; + +SEC("raw_tracepoint/sys_enter") +int test_core_enumval(void *ctx) +{ +#if __has_builtin(__builtin_preserve_enum_value) + struct core_reloc_enumval_output *out = (void *)&data.out; + enum named_enum named = 0; + anon_enum anon = 0; + + out->named_val1_exists = bpf_core_enum_value_exists(named, NAMED_ENUM_VAL1); + out->named_val2_exists = bpf_core_enum_value_exists(enum named_enum, NAMED_ENUM_VAL2); + out->named_val3_exists = bpf_core_enum_value_exists(enum named_enum, NAMED_ENUM_VAL3); + + out->anon_val1_exists = bpf_core_enum_value_exists(anon, ANON_ENUM_VAL1); + out->anon_val2_exists = bpf_core_enum_value_exists(anon_enum, ANON_ENUM_VAL2); + out->anon_val3_exists = bpf_core_enum_value_exists(anon_enum, ANON_ENUM_VAL3); + + out->named_val1 = bpf_core_enum_value(named, NAMED_ENUM_VAL1); + out->named_val2 = bpf_core_enum_value(named, NAMED_ENUM_VAL2); + /* NAMED_ENUM_VAL3 value is optional */ + + out->anon_val1 = bpf_core_enum_value(anon, ANON_ENUM_VAL1); + out->anon_val2 = bpf_core_enum_value(anon, ANON_ENUM_VAL2); + /* ANON_ENUM_VAL3 value is optional */ +#else + data.skip = true; +#endif + + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c b/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c index aba928fd60d3..145028b52ad8 100644 --- a/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c +++ b/tools/testing/selftests/bpf/progs/test_core_reloc_kernel.c @@ -3,6 +3,7 @@ #include <linux/bpf.h> #include <stdint.h> +#include <stdbool.h> #include <bpf/bpf_helpers.h> #include <bpf/bpf_core_read.h> @@ -11,6 +12,7 @@ char _license[] SEC("license") = "GPL"; struct { char in[256]; char out[256]; + bool skip; uint64_t my_pid_tgid; } data = {}; diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_type_based.c b/tools/testing/selftests/bpf/progs/test_core_reloc_type_based.c new file mode 100644 index 000000000000..fb60f8195c53 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_core_reloc_type_based.c @@ -0,0 +1,110 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Facebook + +#include <linux/bpf.h> +#include <stdint.h> +#include <stdbool.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> + +char _license[] SEC("license") = "GPL"; + +struct { + char in[256]; + char out[256]; + bool skip; +} data = {}; + +struct a_struct { + int x; +}; + +union a_union { + int y; + int z; +}; + +typedef struct a_struct named_struct_typedef; + +typedef struct { int x, y, z; } anon_struct_typedef; + +typedef struct { + int a, b, c; +} *struct_ptr_typedef; + +enum an_enum { + AN_ENUM_VAL1 = 1, + AN_ENUM_VAL2 = 2, + AN_ENUM_VAL3 = 3, +}; + +typedef int int_typedef; + +typedef enum { TYPEDEF_ENUM_VAL1, TYPEDEF_ENUM_VAL2 } enum_typedef; + +typedef void *void_ptr_typedef; + +typedef int (*func_proto_typedef)(long); + +typedef char arr_typedef[20]; + +struct core_reloc_type_based_output { + bool struct_exists; + bool 
union_exists; + bool enum_exists; + bool typedef_named_struct_exists; + bool typedef_anon_struct_exists; + bool typedef_struct_ptr_exists; + bool typedef_int_exists; + bool typedef_enum_exists; + bool typedef_void_ptr_exists; + bool typedef_func_proto_exists; + bool typedef_arr_exists; + + int struct_sz; + int union_sz; + int enum_sz; + int typedef_named_struct_sz; + int typedef_anon_struct_sz; + int typedef_struct_ptr_sz; + int typedef_int_sz; + int typedef_enum_sz; + int typedef_void_ptr_sz; + int typedef_func_proto_sz; + int typedef_arr_sz; +}; + +SEC("raw_tracepoint/sys_enter") +int test_core_type_based(void *ctx) +{ +#if __has_builtin(__builtin_preserve_type_info) + struct core_reloc_type_based_output *out = (void *)&data.out; + + out->struct_exists = bpf_core_type_exists(struct a_struct); + out->union_exists = bpf_core_type_exists(union a_union); + out->enum_exists = bpf_core_type_exists(enum an_enum); + out->typedef_named_struct_exists = bpf_core_type_exists(named_struct_typedef); + out->typedef_anon_struct_exists = bpf_core_type_exists(anon_struct_typedef); + out->typedef_struct_ptr_exists = bpf_core_type_exists(struct_ptr_typedef); + out->typedef_int_exists = bpf_core_type_exists(int_typedef); + out->typedef_enum_exists = bpf_core_type_exists(enum_typedef); + out->typedef_void_ptr_exists = bpf_core_type_exists(void_ptr_typedef); + out->typedef_func_proto_exists = bpf_core_type_exists(func_proto_typedef); + out->typedef_arr_exists = bpf_core_type_exists(arr_typedef); + + out->struct_sz = bpf_core_type_size(struct a_struct); + out->union_sz = bpf_core_type_size(union a_union); + out->enum_sz = bpf_core_type_size(enum an_enum); + out->typedef_named_struct_sz = bpf_core_type_size(named_struct_typedef); + out->typedef_anon_struct_sz = bpf_core_type_size(anon_struct_typedef); + out->typedef_struct_ptr_sz = bpf_core_type_size(struct_ptr_typedef); + out->typedef_int_sz = bpf_core_type_size(int_typedef); + out->typedef_enum_sz = bpf_core_type_size(enum_typedef); + out->typedef_void_ptr_sz = bpf_core_type_size(void_ptr_typedef); + out->typedef_func_proto_sz = bpf_core_type_size(func_proto_typedef); + out->typedef_arr_sz = bpf_core_type_size(arr_typedef); +#else + data.skip = true; +#endif + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_core_reloc_type_id.c b/tools/testing/selftests/bpf/progs/test_core_reloc_type_id.c new file mode 100644 index 000000000000..22aba3f6e344 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_core_reloc_type_id.c @@ -0,0 +1,115 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Facebook + +#include <linux/bpf.h> +#include <stdint.h> +#include <stdbool.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_core_read.h> + +char _license[] SEC("license") = "GPL"; + +struct { + char in[256]; + char out[256]; + bool skip; +} data = {}; + +/* some types are shared with test_core_reloc_type_based.c */ +struct a_struct { + int x; +}; + +union a_union { + int y; + int z; +}; + +enum an_enum { + AN_ENUM_VAL1 = 1, + AN_ENUM_VAL2 = 2, + AN_ENUM_VAL3 = 3, +}; + +typedef struct a_struct named_struct_typedef; + +typedef int (*func_proto_typedef)(long); + +typedef char arr_typedef[20]; + +struct core_reloc_type_id_output { + int local_anon_struct; + int local_anon_union; + int local_anon_enum; + int local_anon_func_proto_ptr; + int local_anon_void_ptr; + int local_anon_arr; + + int local_struct; + int local_union; + int local_enum; + int local_int; + int local_struct_typedef; + int local_func_proto_typedef; + int local_arr_typedef; + + int 
targ_struct; + int targ_union; + int targ_enum; + int targ_int; + int targ_struct_typedef; + int targ_func_proto_typedef; + int targ_arr_typedef; +}; + +/* preserve types even if Clang doesn't support built-in */ +struct a_struct t1 = {}; +union a_union t2 = {}; +enum an_enum t3 = 0; +named_struct_typedef t4 = {}; +func_proto_typedef t5 = 0; +arr_typedef t6 = {}; + +SEC("raw_tracepoint/sys_enter") +int test_core_type_id(void *ctx) +{ + /* We use __builtin_btf_type_id() in this tests, but up until the time + * __builtin_preserve_type_info() was added it contained a bug that + * would make this test fail. The bug was fixed ([0]) with addition of + * __builtin_preserve_type_info(), though, so that's what we are using + * to detect whether this test has to be executed, however strange + * that might look like. + * + * [0] https://reviews.llvm.org/D85174 + */ +#if __has_builtin(__builtin_preserve_type_info) + struct core_reloc_type_id_output *out = (void *)&data.out; + + out->local_anon_struct = bpf_core_type_id_local(struct { int marker_field; }); + out->local_anon_union = bpf_core_type_id_local(union { int marker_field; }); + out->local_anon_enum = bpf_core_type_id_local(enum { MARKER_ENUM_VAL = 123 }); + out->local_anon_func_proto_ptr = bpf_core_type_id_local(_Bool(*)(int)); + out->local_anon_void_ptr = bpf_core_type_id_local(void *); + out->local_anon_arr = bpf_core_type_id_local(_Bool[47]); + + out->local_struct = bpf_core_type_id_local(struct a_struct); + out->local_union = bpf_core_type_id_local(union a_union); + out->local_enum = bpf_core_type_id_local(enum an_enum); + out->local_int = bpf_core_type_id_local(int); + out->local_struct_typedef = bpf_core_type_id_local(named_struct_typedef); + out->local_func_proto_typedef = bpf_core_type_id_local(func_proto_typedef); + out->local_arr_typedef = bpf_core_type_id_local(arr_typedef); + + out->targ_struct = bpf_core_type_id_kernel(struct a_struct); + out->targ_union = bpf_core_type_id_kernel(union a_union); + out->targ_enum = bpf_core_type_id_kernel(enum an_enum); + out->targ_int = bpf_core_type_id_kernel(int); + out->targ_struct_typedef = bpf_core_type_id_kernel(named_struct_typedef); + out->targ_func_proto_typedef = bpf_core_type_id_kernel(func_proto_typedef); + out->targ_arr_typedef = bpf_core_type_id_kernel(arr_typedef); +#else + data.skip = true; +#endif + + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_d_path.c b/tools/testing/selftests/bpf/progs/test_d_path.c new file mode 100644 index 000000000000..61f007855649 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_d_path.c @@ -0,0 +1,58 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_tracing.h> + +#define MAX_PATH_LEN 128 +#define MAX_FILES 7 + +pid_t my_pid = 0; +__u32 cnt_stat = 0; +__u32 cnt_close = 0; +char paths_stat[MAX_FILES][MAX_PATH_LEN] = {}; +char paths_close[MAX_FILES][MAX_PATH_LEN] = {}; +int rets_stat[MAX_FILES] = {}; +int rets_close[MAX_FILES] = {}; + +SEC("fentry/vfs_getattr") +int BPF_PROG(prog_stat, struct path *path, struct kstat *stat, + __u32 request_mask, unsigned int query_flags) +{ + pid_t pid = bpf_get_current_pid_tgid() >> 32; + __u32 cnt = cnt_stat; + int ret; + + if (pid != my_pid) + return 0; + + if (cnt >= MAX_FILES) + return 0; + ret = bpf_d_path(path, paths_stat[cnt], MAX_PATH_LEN); + + rets_stat[cnt] = ret; + cnt_stat++; + return 0; +} + +SEC("fentry/filp_close") +int BPF_PROG(prog_close, struct file *file, void *id) +{ + pid_t pid = bpf_get_current_pid_tgid() >> 
32; + __u32 cnt = cnt_close; + int ret; + + if (pid != my_pid) + return 0; + + if (cnt >= MAX_FILES) + return 0; + ret = bpf_d_path(&file->f_path, + paths_close[cnt], MAX_PATH_LEN); + + rets_close[cnt] = ret; + cnt_close++; + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c new file mode 100644 index 000000000000..3a216d1d0226 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_misc_tcp_hdr_options.c @@ -0,0 +1,325 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ + +#include <stddef.h> +#include <errno.h> +#include <stdbool.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <linux/ipv6.h> +#include <linux/tcp.h> +#include <linux/socket.h> +#include <linux/bpf.h> +#include <linux/types.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> +#define BPF_PROG_TEST_TCP_HDR_OPTIONS +#include "test_tcp_hdr_options.h" + +__u16 last_addr16_n = __bpf_htons(0xeB9F); +__u16 active_lport_n = 0; +__u16 active_lport_h = 0; +__u16 passive_lport_n = 0; +__u16 passive_lport_h = 0; + +/* options received at passive side */ +unsigned int nr_pure_ack = 0; +unsigned int nr_data = 0; +unsigned int nr_syn = 0; +unsigned int nr_fin = 0; + +/* Check the header received from the active side */ +static int __check_active_hdr_in(struct bpf_sock_ops *skops, bool check_syn) +{ + union { + struct tcphdr th; + struct ipv6hdr ip6; + struct tcp_exprm_opt exprm_opt; + struct tcp_opt reg_opt; + __u8 data[100]; /* IPv6 (40) + Max TCP hdr (60) */ + } hdr = {}; + __u64 load_flags = check_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0; + struct tcphdr *pth; + int ret; + + hdr.reg_opt.kind = 0xB9; + + /* The option is 4 bytes long instead of 2 bytes */ + ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, 2, load_flags); + if (ret != -ENOSPC) + RET_CG_ERR(ret); + + /* Test searching magic with regular kind */ + hdr.reg_opt.len = 4; + ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt), + load_flags); + if (ret != -EINVAL) + RET_CG_ERR(ret); + + hdr.reg_opt.len = 0; + ret = bpf_load_hdr_opt(skops, &hdr.reg_opt, sizeof(hdr.reg_opt), + load_flags); + if (ret != 4 || hdr.reg_opt.len != 4 || hdr.reg_opt.kind != 0xB9 || + hdr.reg_opt.data[0] != 0xfa || hdr.reg_opt.data[1] != 0xce) + RET_CG_ERR(ret); + + /* Test searching experimental option with invalid kind length */ + hdr.exprm_opt.kind = TCPOPT_EXP; + hdr.exprm_opt.len = 5; + hdr.exprm_opt.magic = 0; + ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt), + load_flags); + if (ret != -EINVAL) + RET_CG_ERR(ret); + + /* Test searching experimental option with 0 magic value */ + hdr.exprm_opt.len = 4; + ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt), + load_flags); + if (ret != -ENOMSG) + RET_CG_ERR(ret); + + hdr.exprm_opt.magic = __bpf_htons(0xeB9F); + ret = bpf_load_hdr_opt(skops, &hdr.exprm_opt, sizeof(hdr.exprm_opt), + load_flags); + if (ret != 4 || hdr.exprm_opt.len != 4 || + hdr.exprm_opt.kind != TCPOPT_EXP || + hdr.exprm_opt.magic != __bpf_htons(0xeB9F)) + RET_CG_ERR(ret); + + if (!check_syn) + return CG_OK; + + /* Test loading from skops->syn_skb if sk_state == TCP_NEW_SYN_RECV + * + * Test loading from tp->saved_syn for other sk_state. 
+ */ + ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr.ip6, + sizeof(hdr.ip6)); + if (ret != -ENOSPC) + RET_CG_ERR(ret); + + if (hdr.ip6.saddr.s6_addr16[7] != last_addr16_n || + hdr.ip6.daddr.s6_addr16[7] != last_addr16_n) + RET_CG_ERR(0); + + ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN_IP, &hdr, sizeof(hdr)); + if (ret < 0) + RET_CG_ERR(ret); + + pth = (struct tcphdr *)(&hdr.ip6 + 1); + if (pth->dest != passive_lport_n || pth->source != active_lport_n) + RET_CG_ERR(0); + + ret = bpf_getsockopt(skops, SOL_TCP, TCP_BPF_SYN, &hdr, sizeof(hdr)); + if (ret < 0) + RET_CG_ERR(ret); + + if (hdr.th.dest != passive_lport_n || hdr.th.source != active_lport_n) + RET_CG_ERR(0); + + return CG_OK; +} + +static int check_active_syn_in(struct bpf_sock_ops *skops) +{ + return __check_active_hdr_in(skops, true); +} + +static int check_active_hdr_in(struct bpf_sock_ops *skops) +{ + struct tcphdr *th; + + if (__check_active_hdr_in(skops, false) == CG_ERR) + return CG_ERR; + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + if (tcp_hdrlen(th) < skops->skb_len) + nr_data++; + + if (th->fin) + nr_fin++; + + if (th->ack && !th->fin && tcp_hdrlen(th) == skops->skb_len) + nr_pure_ack++; + + return CG_OK; +} + +static int active_opt_len(struct bpf_sock_ops *skops) +{ + int err; + + /* Reserve more than enough to allow the -EEXIST test in + * the write_active_opt(). + */ + err = bpf_reserve_hdr_opt(skops, 12, 0); + if (err) + RET_CG_ERR(err); + + return CG_OK; +} + +static int write_active_opt(struct bpf_sock_ops *skops) +{ + struct tcp_exprm_opt exprm_opt = {}; + struct tcp_opt win_scale_opt = {}; + struct tcp_opt reg_opt = {}; + struct tcphdr *th; + int err, ret; + + exprm_opt.kind = TCPOPT_EXP; + exprm_opt.len = 4; + exprm_opt.magic = __bpf_htons(0xeB9F); + + reg_opt.kind = 0xB9; + reg_opt.len = 4; + reg_opt.data[0] = 0xfa; + reg_opt.data[1] = 0xce; + + win_scale_opt.kind = TCPOPT_WINDOW; + + err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0); + if (err) + RET_CG_ERR(err); + + /* Store the same exprm option */ + err = bpf_store_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0); + if (err != -EEXIST) + RET_CG_ERR(err); + + err = bpf_store_hdr_opt(skops, ®_opt, sizeof(reg_opt), 0); + if (err) + RET_CG_ERR(err); + err = bpf_store_hdr_opt(skops, ®_opt, sizeof(reg_opt), 0); + if (err != -EEXIST) + RET_CG_ERR(err); + + /* Check the option has been written and can be searched */ + ret = bpf_load_hdr_opt(skops, &exprm_opt, sizeof(exprm_opt), 0); + if (ret != 4 || exprm_opt.len != 4 || exprm_opt.kind != TCPOPT_EXP || + exprm_opt.magic != __bpf_htons(0xeB9F)) + RET_CG_ERR(ret); + + reg_opt.len = 0; + ret = bpf_load_hdr_opt(skops, ®_opt, sizeof(reg_opt), 0); + if (ret != 4 || reg_opt.len != 4 || reg_opt.kind != 0xB9 || + reg_opt.data[0] != 0xfa || reg_opt.data[1] != 0xce) + RET_CG_ERR(ret); + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + if (th->syn) { + active_lport_h = skops->local_port; + active_lport_n = th->source; + + /* Search the win scale option written by kernel + * in the SYN packet. + */ + ret = bpf_load_hdr_opt(skops, &win_scale_opt, + sizeof(win_scale_opt), 0); + if (ret != 3 || win_scale_opt.len != 3 || + win_scale_opt.kind != TCPOPT_WINDOW) + RET_CG_ERR(ret); + + /* Write the win scale option that kernel + * has already written. 
+ */ + err = bpf_store_hdr_opt(skops, &win_scale_opt, + sizeof(win_scale_opt), 0); + if (err != -EEXIST) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static int handle_hdr_opt_len(struct bpf_sock_ops *skops) +{ + __u8 tcp_flags = skops_tcp_flags(skops); + + if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK) + /* Check the SYN from bpf_sock_ops_kern->syn_skb */ + return check_active_syn_in(skops); + + /* Passive side should have cleared the write hdr cb by now */ + if (skops->local_port == passive_lport_h) + RET_CG_ERR(0); + + return active_opt_len(skops); +} + +static int handle_write_hdr_opt(struct bpf_sock_ops *skops) +{ + if (skops->local_port == passive_lport_h) + RET_CG_ERR(0); + + return write_active_opt(skops); +} + +static int handle_parse_hdr(struct bpf_sock_ops *skops) +{ + /* Passive side is not writing any non-standard/unknown + * option, so the active side should never be called. + */ + if (skops->local_port == active_lport_h) + RET_CG_ERR(0); + + return check_active_hdr_in(skops); +} + +static int handle_passive_estab(struct bpf_sock_ops *skops) +{ + int err; + + /* No more write hdr cb */ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags & + ~BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG); + + /* Recheck the SYN but check the tp->saved_syn this time */ + err = check_active_syn_in(skops); + if (err == CG_ERR) + return err; + + nr_syn++; + + /* The ack has header option written by the active side also */ + return check_active_hdr_in(skops); +} + +SEC("sockops/misc_estab") +int misc_estab(struct bpf_sock_ops *skops) +{ + int true_val = 1; + + switch (skops->op) { + case BPF_SOCK_OPS_TCP_LISTEN_CB: + passive_lport_h = skops->local_port; + passive_lport_n = __bpf_htons(passive_lport_h); + bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN, + &true_val, sizeof(true_val)); + set_hdr_cb_flags(skops); + break; + case BPF_SOCK_OPS_TCP_CONNECT_CB: + set_hdr_cb_flags(skops); + break; + case BPF_SOCK_OPS_PARSE_HDR_OPT_CB: + return handle_parse_hdr(skops); + case BPF_SOCK_OPS_HDR_OPT_LEN_CB: + return handle_hdr_opt_len(skops); + case BPF_SOCK_OPS_WRITE_HDR_OPT_CB: + return handle_write_hdr_opt(skops); + case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: + return handle_passive_estab(skops); + } + + return CG_OK; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_pkt_access.c b/tools/testing/selftests/bpf/progs/test_pkt_access.c index e72eba4a93d2..852051064507 100644 --- a/tools/testing/selftests/bpf/progs/test_pkt_access.c +++ b/tools/testing/selftests/bpf/progs/test_pkt_access.c @@ -79,6 +79,24 @@ int get_skb_ifindex(int val, struct __sk_buff *skb, int var) return skb->ifindex * val * var; } +__attribute__ ((noinline)) +int test_pkt_write_access_subprog(struct __sk_buff *skb, __u32 off) +{ + void *data = (void *)(long)skb->data; + void *data_end = (void *)(long)skb->data_end; + struct tcphdr *tcp = NULL; + + if (off > sizeof(struct ethhdr) + sizeof(struct ipv6hdr)) + return -1; + + tcp = data + off; + if (tcp + 1 > data_end) + return -1; + /* make modification to the packet data */ + tcp->check++; + return 0; +} + SEC("classifier/test_pkt_access") int test_pkt_access(struct __sk_buff *skb) { @@ -117,6 +135,8 @@ int test_pkt_access(struct __sk_buff *skb) if (test_pkt_access_subprog3(3, skb) != skb->len * 3 * skb->ifindex) return TC_ACT_SHOT; if (tcp) { + if (test_pkt_write_access_subprog(skb, (void *)tcp - data)) + return TC_ACT_SHOT; if (((void *)(tcp) + 20) > data_end || proto != 6) return TC_ACT_SHOT; barrier(); /* to force ordering of checks */ 
diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_invalid_update.c b/tools/testing/selftests/bpf/progs/test_sockmap_invalid_update.c new file mode 100644 index 000000000000..02a59e220cbc --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_sockmap_invalid_update.c @@ -0,0 +1,23 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Cloudflare +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> + +struct { + __uint(type, BPF_MAP_TYPE_SOCKMAP); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} map SEC(".maps"); + +SEC("sockops") +int bpf_sockmap(struct bpf_sock_ops *skops) +{ + __u32 key = 0; + + if (skops->sk) + bpf_map_update_elem(&map, &key, skops->sk, 0); + return 0; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_sockmap_update.c b/tools/testing/selftests/bpf/progs/test_sockmap_update.c new file mode 100644 index 000000000000..9d0c9f28cab2 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_sockmap_update.c @@ -0,0 +1,48 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (c) 2020 Cloudflare +#include "vmlinux.h" +#include <bpf/bpf_helpers.h> + +struct { + __uint(type, BPF_MAP_TYPE_SOCKMAP); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} src SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_SOCKMAP); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} dst_sock_map SEC(".maps"); + +struct { + __uint(type, BPF_MAP_TYPE_SOCKHASH); + __uint(max_entries, 1); + __type(key, __u32); + __type(value, __u64); +} dst_sock_hash SEC(".maps"); + +SEC("classifier/copy_sock_map") +int copy_sock_map(void *ctx) +{ + struct bpf_sock *sk; + bool failed = false; + __u32 key = 0; + + sk = bpf_map_lookup_elem(&src, &key); + if (!sk) + return SK_DROP; + + if (bpf_map_update_elem(&dst_sock_map, &key, sk, 0)) + failed = true; + + if (bpf_map_update_elem(&dst_sock_hash, &key, sk, 0)) + failed = true; + + bpf_sk_release(sk); + return failed ? 
SK_DROP : SK_PASS; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c new file mode 100644 index 000000000000..9197a23df3da --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_tcp_hdr_options.c @@ -0,0 +1,623 @@ +// SPDX-License-Identifier: GPL-2.0 +/* Copyright (c) 2020 Facebook */ + +#include <stddef.h> +#include <errno.h> +#include <stdbool.h> +#include <sys/types.h> +#include <sys/socket.h> +#include <linux/tcp.h> +#include <linux/socket.h> +#include <linux/bpf.h> +#include <linux/types.h> +#include <bpf/bpf_helpers.h> +#include <bpf/bpf_endian.h> +#define BPF_PROG_TEST_TCP_HDR_OPTIONS +#include "test_tcp_hdr_options.h" + +#ifndef sizeof_field +#define sizeof_field(TYPE, MEMBER) sizeof((((TYPE *)0)->MEMBER)) +#endif + +__u8 test_kind = TCPOPT_EXP; +__u16 test_magic = 0xeB9F; + +struct bpf_test_option passive_synack_out = {}; +struct bpf_test_option passive_fin_out = {}; + +struct bpf_test_option passive_estab_in = {}; +struct bpf_test_option passive_fin_in = {}; + +struct bpf_test_option active_syn_out = {}; +struct bpf_test_option active_fin_out = {}; + +struct bpf_test_option active_estab_in = {}; +struct bpf_test_option active_fin_in = {}; + +struct { + __uint(type, BPF_MAP_TYPE_SK_STORAGE); + __uint(map_flags, BPF_F_NO_PREALLOC); + __type(key, int); + __type(value, struct hdr_stg); +} hdr_stg_map SEC(".maps"); + +static bool skops_want_cookie(const struct bpf_sock_ops *skops) +{ + return skops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE; +} + +static bool skops_current_mss(const struct bpf_sock_ops *skops) +{ + return skops->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS; +} + +static __u8 option_total_len(__u8 flags) +{ + __u8 i, len = 1; /* +1 for flags */ + + if (!flags) + return 0; + + /* RESEND bit does not use a byte */ + for (i = OPTION_RESEND + 1; i < __NR_OPTION_FLAGS; i++) + len += !!TEST_OPTION_FLAGS(flags, i); + + if (test_kind == TCPOPT_EXP) + return len + TCP_BPF_EXPOPT_BASE_LEN; + else + return len + 2; /* +1 kind, +1 kind-len */ +} + +static void write_test_option(const struct bpf_test_option *test_opt, + __u8 *data) +{ + __u8 offset = 0; + + data[offset++] = test_opt->flags; + if (TEST_OPTION_FLAGS(test_opt->flags, OPTION_MAX_DELACK_MS)) + data[offset++] = test_opt->max_delack_ms; + + if (TEST_OPTION_FLAGS(test_opt->flags, OPTION_RAND)) + data[offset++] = test_opt->rand; +} + +static int store_option(struct bpf_sock_ops *skops, + const struct bpf_test_option *test_opt) +{ + union { + struct tcp_exprm_opt exprm; + struct tcp_opt regular; + } write_opt; + int err; + + if (test_kind == TCPOPT_EXP) { + write_opt.exprm.kind = TCPOPT_EXP; + write_opt.exprm.len = option_total_len(test_opt->flags); + write_opt.exprm.magic = __bpf_htons(test_magic); + write_opt.exprm.data32 = 0; + write_test_option(test_opt, write_opt.exprm.data); + err = bpf_store_hdr_opt(skops, &write_opt.exprm, + sizeof(write_opt.exprm), 0); + } else { + write_opt.regular.kind = test_kind; + write_opt.regular.len = option_total_len(test_opt->flags); + write_opt.regular.data32 = 0; + write_test_option(test_opt, write_opt.regular.data); + err = bpf_store_hdr_opt(skops, &write_opt.regular, + sizeof(write_opt.regular), 0); + } + + if (err) + RET_CG_ERR(err); + + return CG_OK; +} + +static int parse_test_option(struct bpf_test_option *opt, const __u8 *start) +{ + opt->flags = *start++; + + if (TEST_OPTION_FLAGS(opt->flags, OPTION_MAX_DELACK_MS)) + opt->max_delack_ms = *start++; + + if 
(TEST_OPTION_FLAGS(opt->flags, OPTION_RAND)) + opt->rand = *start++; + + return 0; +} + +static int load_option(struct bpf_sock_ops *skops, + struct bpf_test_option *test_opt, bool from_syn) +{ + union { + struct tcp_exprm_opt exprm; + struct tcp_opt regular; + } search_opt; + int ret, load_flags = from_syn ? BPF_LOAD_HDR_OPT_TCP_SYN : 0; + + if (test_kind == TCPOPT_EXP) { + search_opt.exprm.kind = TCPOPT_EXP; + search_opt.exprm.len = 4; + search_opt.exprm.magic = __bpf_htons(test_magic); + search_opt.exprm.data32 = 0; + ret = bpf_load_hdr_opt(skops, &search_opt.exprm, + sizeof(search_opt.exprm), load_flags); + if (ret < 0) + return ret; + return parse_test_option(test_opt, search_opt.exprm.data); + } else { + search_opt.regular.kind = test_kind; + search_opt.regular.len = 0; + search_opt.regular.data32 = 0; + ret = bpf_load_hdr_opt(skops, &search_opt.regular, + sizeof(search_opt.regular), load_flags); + if (ret < 0) + return ret; + return parse_test_option(test_opt, search_opt.regular.data); + } +} + +static int synack_opt_len(struct bpf_sock_ops *skops) +{ + struct bpf_test_option test_opt = {}; + __u8 optlen; + int err; + + if (!passive_synack_out.flags) + return CG_OK; + + err = load_option(skops, &test_opt, true); + + /* bpf_test_option is not found */ + if (err == -ENOMSG) + return CG_OK; + + if (err) + RET_CG_ERR(err); + + optlen = option_total_len(passive_synack_out.flags); + if (optlen) { + err = bpf_reserve_hdr_opt(skops, optlen, 0); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static int write_synack_opt(struct bpf_sock_ops *skops) +{ + struct bpf_test_option opt; + + if (!passive_synack_out.flags) + /* We should not even be called since no header + * space has been reserved. + */ + RET_CG_ERR(0); + + opt = passive_synack_out; + if (skops_want_cookie(skops)) + SET_OPTION_FLAGS(opt.flags, OPTION_RESEND); + + return store_option(skops, &opt); +} + +static int syn_opt_len(struct bpf_sock_ops *skops) +{ + __u8 optlen; + int err; + + if (!active_syn_out.flags) + return CG_OK; + + optlen = option_total_len(active_syn_out.flags); + if (optlen) { + err = bpf_reserve_hdr_opt(skops, optlen, 0); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static int write_syn_opt(struct bpf_sock_ops *skops) +{ + if (!active_syn_out.flags) + RET_CG_ERR(0); + + return store_option(skops, &active_syn_out); +} + +static int fin_opt_len(struct bpf_sock_ops *skops) +{ + struct bpf_test_option *opt; + struct hdr_stg *hdr_stg; + __u8 optlen; + int err; + + if (!skops->sk) + RET_CG_ERR(0); + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0); + if (!hdr_stg) + RET_CG_ERR(0); + + if (hdr_stg->active) + opt = &active_fin_out; + else + opt = &passive_fin_out; + + optlen = option_total_len(opt->flags); + if (optlen) { + err = bpf_reserve_hdr_opt(skops, optlen, 0); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static int write_fin_opt(struct bpf_sock_ops *skops) +{ + struct bpf_test_option *opt; + struct hdr_stg *hdr_stg; + + if (!skops->sk) + RET_CG_ERR(0); + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0); + if (!hdr_stg) + RET_CG_ERR(0); + + if (hdr_stg->active) + opt = &active_fin_out; + else + opt = &passive_fin_out; + + if (!opt->flags) + RET_CG_ERR(0); + + return store_option(skops, opt); +} + +static int resend_in_ack(struct bpf_sock_ops *skops) +{ + struct hdr_stg *hdr_stg; + + if (!skops->sk) + return -1; + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0); + if (!hdr_stg) + return -1; + + return !!hdr_stg->resend_syn; +} + 
+static int nodata_opt_len(struct bpf_sock_ops *skops) +{ + int resend; + + resend = resend_in_ack(skops); + if (resend < 0) + RET_CG_ERR(0); + + if (resend) + return syn_opt_len(skops); + + return CG_OK; +} + +static int write_nodata_opt(struct bpf_sock_ops *skops) +{ + int resend; + + resend = resend_in_ack(skops); + if (resend < 0) + RET_CG_ERR(0); + + if (resend) + return write_syn_opt(skops); + + return CG_OK; +} + +static int data_opt_len(struct bpf_sock_ops *skops) +{ + /* Same as the nodata version. Mostly to show + * an example usage on skops->skb_len. + */ + return nodata_opt_len(skops); +} + +static int write_data_opt(struct bpf_sock_ops *skops) +{ + return write_nodata_opt(skops); +} + +static int current_mss_opt_len(struct bpf_sock_ops *skops) +{ + /* Reserve maximum that may be needed */ + int err; + + err = bpf_reserve_hdr_opt(skops, option_total_len(OPTION_MASK), 0); + if (err) + RET_CG_ERR(err); + + return CG_OK; +} + +static int handle_hdr_opt_len(struct bpf_sock_ops *skops) +{ + __u8 tcp_flags = skops_tcp_flags(skops); + + if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK) + return synack_opt_len(skops); + + if (tcp_flags & TCPHDR_SYN) + return syn_opt_len(skops); + + if (tcp_flags & TCPHDR_FIN) + return fin_opt_len(skops); + + if (skops_current_mss(skops)) + /* The kernel is calculating the MSS */ + return current_mss_opt_len(skops); + + if (skops->skb_len) + return data_opt_len(skops); + + return nodata_opt_len(skops); +} + +static int handle_write_hdr_opt(struct bpf_sock_ops *skops) +{ + __u8 tcp_flags = skops_tcp_flags(skops); + struct tcphdr *th; + + if ((tcp_flags & TCPHDR_SYNACK) == TCPHDR_SYNACK) + return write_synack_opt(skops); + + if (tcp_flags & TCPHDR_SYN) + return write_syn_opt(skops); + + if (tcp_flags & TCPHDR_FIN) + return write_fin_opt(skops); + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + if (skops->skb_len > tcp_hdrlen(th)) + return write_data_opt(skops); + + return write_nodata_opt(skops); +} + +static int set_delack_max(struct bpf_sock_ops *skops, __u8 max_delack_ms) +{ + __u32 max_delack_us = max_delack_ms * 1000; + + return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_DELACK_MAX, + &max_delack_us, sizeof(max_delack_us)); +} + +static int set_rto_min(struct bpf_sock_ops *skops, __u8 peer_max_delack_ms) +{ + __u32 min_rto_us = peer_max_delack_ms * 1000; + + return bpf_setsockopt(skops, SOL_TCP, TCP_BPF_RTO_MIN, &min_rto_us, + sizeof(min_rto_us)); +} + +static int handle_active_estab(struct bpf_sock_ops *skops) +{ + struct hdr_stg init_stg = { + .active = true, + }; + int err; + + err = load_option(skops, &active_estab_in, false); + if (err && err != -ENOMSG) + RET_CG_ERR(err); + + init_stg.resend_syn = TEST_OPTION_FLAGS(active_estab_in.flags, + OPTION_RESEND); + if (!skops->sk || !bpf_sk_storage_get(&hdr_stg_map, skops->sk, + &init_stg, + BPF_SK_STORAGE_GET_F_CREATE)) + RET_CG_ERR(0); + + if (init_stg.resend_syn) + /* Don't clear the write_hdr cb now because + * the ACK may get lost and retransmit may + * be needed. + * + * PARSE_ALL_HDR cb flag is set to learn if this + * resend_syn option has received by the peer. + * + * The header option will be resent until a valid + * packet is received at handle_parse_hdr() + * and all hdr cb flags will be cleared in + * handle_parse_hdr(). 
+ */ + set_parse_all_hdr_cb_flags(skops); + else if (!active_fin_out.flags) + /* No options will be written from now */ + clear_hdr_cb_flags(skops); + + if (active_syn_out.max_delack_ms) { + err = set_delack_max(skops, active_syn_out.max_delack_ms); + if (err) + RET_CG_ERR(err); + } + + if (active_estab_in.max_delack_ms) { + err = set_rto_min(skops, active_estab_in.max_delack_ms); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static int handle_passive_estab(struct bpf_sock_ops *skops) +{ + struct hdr_stg init_stg = {}; + struct tcphdr *th; + int err; + + err = load_option(skops, &passive_estab_in, true); + if (err == -ENOENT) { + /* saved_syn is not found. It was in syncookie mode. + * We have asked the active side to resend the options + * in ACK, so try to find the bpf_test_option from ACK now. + */ + err = load_option(skops, &passive_estab_in, false); + init_stg.syncookie = true; + } + + /* ENOMSG: The bpf_test_option is not found which is fine. + * Bail out now for all other errors. + */ + if (err && err != -ENOMSG) + RET_CG_ERR(err); + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + if (th->syn) { + /* Fastopen */ + + /* Cannot clear cb_flags to stop write_hdr cb. + * synack is not sent yet for fast open. + * Even it was, the synack may need to be retransmitted. + * + * PARSE_ALL_HDR cb flag is set to learn + * if synack has reached the peer. + * All cb_flags will be cleared in handle_parse_hdr(). + */ + set_parse_all_hdr_cb_flags(skops); + init_stg.fastopen = true; + } else if (!passive_fin_out.flags) { + /* No options will be written from now */ + clear_hdr_cb_flags(skops); + } + + if (!skops->sk || + !bpf_sk_storage_get(&hdr_stg_map, skops->sk, &init_stg, + BPF_SK_STORAGE_GET_F_CREATE)) + RET_CG_ERR(0); + + if (passive_synack_out.max_delack_ms) { + err = set_delack_max(skops, passive_synack_out.max_delack_ms); + if (err) + RET_CG_ERR(err); + } + + if (passive_estab_in.max_delack_ms) { + err = set_rto_min(skops, passive_estab_in.max_delack_ms); + if (err) + RET_CG_ERR(err); + } + + return CG_OK; +} + +static int handle_parse_hdr(struct bpf_sock_ops *skops) +{ + struct hdr_stg *hdr_stg; + struct tcphdr *th; + + if (!skops->sk) + RET_CG_ERR(0); + + th = skops->skb_data; + if (th + 1 > skops->skb_data_end) + RET_CG_ERR(0); + + hdr_stg = bpf_sk_storage_get(&hdr_stg_map, skops->sk, NULL, 0); + if (!hdr_stg) + RET_CG_ERR(0); + + if (hdr_stg->resend_syn || hdr_stg->fastopen) + /* The PARSE_ALL_HDR cb flag was turned on + * to ensure that the previously written + * options have reached the peer. + * Those previously written option includes: + * - Active side: resend_syn in ACK during syncookie + * or + * - Passive side: SYNACK during fastopen + * + * A valid packet has been received here after + * the 3WHS, so the PARSE_ALL_HDR cb flag + * can be cleared now. + */ + clear_parse_all_hdr_cb_flags(skops); + + if (hdr_stg->resend_syn && !active_fin_out.flags) + /* Active side resent the syn option in ACK + * because the server was in syncookie mode. + * A valid packet has been received, so + * clear header cb flags if there is no + * more option to send. + */ + clear_hdr_cb_flags(skops); + + if (hdr_stg->fastopen && !passive_fin_out.flags) + /* Passive side was in fastopen. + * A valid packet has been received, so + * the SYNACK has reached the peer. + * Clear header cb flags if there is no more + * option to send. 
+ */ + clear_hdr_cb_flags(skops); + + if (th->fin) { + struct bpf_test_option *fin_opt; + int err; + + if (hdr_stg->active) + fin_opt = &active_fin_in; + else + fin_opt = &passive_fin_in; + + err = load_option(skops, fin_opt, false); + if (err && err != -ENOMSG) + RET_CG_ERR(err); + } + + return CG_OK; +} + +SEC("sockops/estab") +int estab(struct bpf_sock_ops *skops) +{ + int true_val = 1; + + switch (skops->op) { + case BPF_SOCK_OPS_TCP_LISTEN_CB: + bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN, + &true_val, sizeof(true_val)); + set_hdr_cb_flags(skops); + break; + case BPF_SOCK_OPS_TCP_CONNECT_CB: + set_hdr_cb_flags(skops); + break; + case BPF_SOCK_OPS_PARSE_HDR_OPT_CB: + return handle_parse_hdr(skops); + case BPF_SOCK_OPS_HDR_OPT_LEN_CB: + return handle_hdr_opt_len(skops); + case BPF_SOCK_OPS_WRITE_HDR_OPT_CB: + return handle_write_hdr_opt(skops); + case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: + return handle_passive_estab(skops); + case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB: + return handle_active_estab(skops); + } + + return CG_OK; +} + +char _license[] SEC("license") = "GPL"; diff --git a/tools/testing/selftests/bpf/progs/test_vmlinux.c b/tools/testing/selftests/bpf/progs/test_vmlinux.c index 29fa09d6a6c6..e9dfa0313d1b 100644 --- a/tools/testing/selftests/bpf/progs/test_vmlinux.c +++ b/tools/testing/selftests/bpf/progs/test_vmlinux.c @@ -19,12 +19,14 @@ SEC("tp/syscalls/sys_enter_nanosleep") int handle__tp(struct trace_event_raw_sys_enter *args) { struct __kernel_timespec *ts; + long tv_nsec; if (args->id != __NR_nanosleep) return 0; ts = (void *)args->args[0]; - if (BPF_CORE_READ(ts, tv_nsec) != MY_TV_NSEC) + if (bpf_probe_read_user(&tv_nsec, sizeof(ts->tv_nsec), &ts->tv_nsec) || + tv_nsec != MY_TV_NSEC) return 0; tp_called = true; @@ -35,12 +37,14 @@ SEC("raw_tp/sys_enter") int BPF_PROG(handle__raw_tp, struct pt_regs *regs, long id) { struct __kernel_timespec *ts; + long tv_nsec; if (id != __NR_nanosleep) return 0; ts = (void *)PT_REGS_PARM1_CORE(regs); - if (BPF_CORE_READ(ts, tv_nsec) != MY_TV_NSEC) + if (bpf_probe_read_user(&tv_nsec, sizeof(ts->tv_nsec), &ts->tv_nsec) || + tv_nsec != MY_TV_NSEC) return 0; raw_tp_called = true; @@ -51,12 +55,14 @@ SEC("tp_btf/sys_enter") int BPF_PROG(handle__tp_btf, struct pt_regs *regs, long id) { struct __kernel_timespec *ts; + long tv_nsec; if (id != __NR_nanosleep) return 0; ts = (void *)PT_REGS_PARM1_CORE(regs); - if (BPF_CORE_READ(ts, tv_nsec) != MY_TV_NSEC) + if (bpf_probe_read_user(&tv_nsec, sizeof(ts->tv_nsec), &ts->tv_nsec) || + tv_nsec != MY_TV_NSEC) return 0; tp_btf_called = true; diff --git a/tools/testing/selftests/bpf/progs/trigger_bench.c b/tools/testing/selftests/bpf/progs/trigger_bench.c index 8b36b6640e7e..9a4d09590b3d 100644 --- a/tools/testing/selftests/bpf/progs/trigger_bench.c +++ b/tools/testing/selftests/bpf/progs/trigger_bench.c @@ -39,6 +39,13 @@ int bench_trigger_fentry(void *ctx) return 0; } +SEC("fentry.s/__x64_sys_getpgid") +int bench_trigger_fentry_sleep(void *ctx) +{ + __sync_add_and_fetch(&hits, 1); + return 0; +} + SEC("fmod_ret/__x64_sys_getpgid") int bench_trigger_fmodret(void *ctx) { diff --git a/tools/testing/selftests/bpf/test_current_pid_tgid_new_ns.c b/tools/testing/selftests/bpf/test_current_pid_tgid_new_ns.c index ed253f252cd0..ec53b1ef90d2 100644 --- a/tools/testing/selftests/bpf/test_current_pid_tgid_new_ns.c +++ b/tools/testing/selftests/bpf/test_current_pid_tgid_new_ns.c @@ -156,4 +156,5 @@ cleanup: bpf_object__close(obj); } } + return 0; } diff --git a/tools/testing/selftests/bpf/test_tcp_hdr_options.h 
b/tools/testing/selftests/bpf/test_tcp_hdr_options.h new file mode 100644 index 000000000000..78a8cf9eab42 --- /dev/null +++ b/tools/testing/selftests/bpf/test_tcp_hdr_options.h @@ -0,0 +1,151 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright (c) 2020 Facebook */ + +#ifndef _TEST_TCP_HDR_OPTIONS_H +#define _TEST_TCP_HDR_OPTIONS_H + +struct bpf_test_option { + __u8 flags; + __u8 max_delack_ms; + __u8 rand; +} __attribute__((packed)); + +enum { + OPTION_RESEND, + OPTION_MAX_DELACK_MS, + OPTION_RAND, + __NR_OPTION_FLAGS, +}; + +#define OPTION_F_RESEND (1 << OPTION_RESEND) +#define OPTION_F_MAX_DELACK_MS (1 << OPTION_MAX_DELACK_MS) +#define OPTION_F_RAND (1 << OPTION_RAND) +#define OPTION_MASK ((1 << __NR_OPTION_FLAGS) - 1) + +#define TEST_OPTION_FLAGS(flags, option) (1 & ((flags) >> (option))) +#define SET_OPTION_FLAGS(flags, option) ((flags) |= (1 << (option))) + +/* Store in bpf_sk_storage */ +struct hdr_stg { + bool active; + bool resend_syn; /* active side only */ + bool syncookie; /* passive side only */ + bool fastopen; /* passive side only */ +}; + +struct linum_err { + unsigned int linum; + int err; +}; + +#define TCPHDR_FIN 0x01 +#define TCPHDR_SYN 0x02 +#define TCPHDR_RST 0x04 +#define TCPHDR_PSH 0x08 +#define TCPHDR_ACK 0x10 +#define TCPHDR_URG 0x20 +#define TCPHDR_ECE 0x40 +#define TCPHDR_CWR 0x80 +#define TCPHDR_SYNACK (TCPHDR_SYN | TCPHDR_ACK) + +#define TCPOPT_EOL 0 +#define TCPOPT_NOP 1 +#define TCPOPT_WINDOW 3 +#define TCPOPT_EXP 254 + +#define TCP_BPF_EXPOPT_BASE_LEN 4 +#define MAX_TCP_HDR_LEN 60 +#define MAX_TCP_OPTION_SPACE 40 + +#ifdef BPF_PROG_TEST_TCP_HDR_OPTIONS + +#define CG_OK 1 +#define CG_ERR 0 + +#ifndef SOL_TCP +#define SOL_TCP 6 +#endif + +struct tcp_exprm_opt { + __u8 kind; + __u8 len; + __u16 magic; + union { + __u8 data[4]; + __u32 data32; + }; +} __attribute__((packed)); + +struct tcp_opt { + __u8 kind; + __u8 len; + union { + __u8 data[4]; + __u32 data32; + }; +} __attribute__((packed)); + +struct { + __uint(type, BPF_MAP_TYPE_HASH); + __uint(max_entries, 2); + __type(key, int); + __type(value, struct linum_err); +} lport_linum_map SEC(".maps"); + +static inline unsigned int tcp_hdrlen(const struct tcphdr *th) +{ + return th->doff << 2; +} + +static inline __u8 skops_tcp_flags(const struct bpf_sock_ops *skops) +{ + return skops->skb_tcp_flags; +} + +static inline void clear_hdr_cb_flags(struct bpf_sock_ops *skops) +{ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags & + ~(BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG | + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG)); +} + +static inline void set_hdr_cb_flags(struct bpf_sock_ops *skops) +{ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags | + BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG | + BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG); +} +static inline void +clear_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops) +{ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags & + ~BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG); +} + +static inline void +set_parse_all_hdr_cb_flags(struct bpf_sock_ops *skops) +{ + bpf_sock_ops_cb_flags_set(skops, + skops->bpf_sock_ops_cb_flags | + BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG); +} + +#define RET_CG_ERR(__err) ({ \ + struct linum_err __linum_err; \ + int __lport; \ + \ + __linum_err.linum = __LINE__; \ + __linum_err.err = __err; \ + __lport = skops->local_port; \ + bpf_map_update_elem(&lport_linum_map, &__lport, &__linum_err, BPF_NOEXIST); \ + clear_hdr_cb_flags(skops); \ + clear_parse_all_hdr_cb_flags(skops); \ + return CG_ERR; \ +}) + +#endif /* 
BPF_PROG_TEST_TCP_HDR_OPTIONS */ + +#endif /* _TEST_TCP_HDR_OPTIONS_H */ diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c index 78a6bae56ea6..9be395d9dc64 100644 --- a/tools/testing/selftests/bpf/test_verifier.c +++ b/tools/testing/selftests/bpf/test_verifier.c @@ -114,6 +114,7 @@ struct bpf_test { bpf_testdata_struct_t retvals[MAX_TEST_RUNS]; }; enum bpf_attach_type expected_attach_type; + const char *kfunc; }; /* Note we want this to be 64 bit aligned so that the end of our array is @@ -984,8 +985,24 @@ static void do_test_single(struct bpf_test *test, bool unpriv, attr.log_level = 4; attr.prog_flags = pflags; + if (prog_type == BPF_PROG_TYPE_TRACING && test->kfunc) { + attr.attach_btf_id = libbpf_find_vmlinux_btf_id(test->kfunc, + attr.expected_attach_type); + if (attr.attach_btf_id < 0) { + printf("FAIL\nFailed to find BTF ID for '%s'!\n", + test->kfunc); + (*errors)++; + return; + } + } + fd_prog = bpf_load_program_xattr(&attr, bpf_vlog, sizeof(bpf_vlog)); - if (fd_prog < 0 && !bpf_probe_prog_type(prog_type, 0)) { + + /* BPF_PROG_TYPE_TRACING requires more setup and + * bpf_probe_prog_type won't give correct answer + */ + if (fd_prog < 0 && prog_type != BPF_PROG_TYPE_TRACING && + !bpf_probe_prog_type(prog_type, 0)) { printf("SKIP (unsupported program type %d)\n", prog_type); skips++; goto close_fds; diff --git a/tools/testing/selftests/bpf/verifier/bounds.c b/tools/testing/selftests/bpf/verifier/bounds.c index 4d6645f2874c..dac40de3f868 100644 --- a/tools/testing/selftests/bpf/verifier/bounds.c +++ b/tools/testing/selftests/bpf/verifier/bounds.c @@ -557,3 +557,149 @@ .result = ACCEPT, .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS, }, +{ + "bounds check for reg = 0, reg xor 1", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_MOV64_IMM(BPF_REG_1, 0), + BPF_ALU64_IMM(BPF_XOR, BPF_REG_1, 1), + BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 3 }, + .result = ACCEPT, +}, +{ + "bounds check for reg32 = 0, reg32 xor 1", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_MOV32_IMM(BPF_REG_1, 0), + BPF_ALU32_IMM(BPF_XOR, BPF_REG_1, 1), + BPF_JMP32_IMM(BPF_JNE, BPF_REG_1, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 3 }, + .result = ACCEPT, +}, +{ + "bounds check for reg = 2, reg xor 3", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_MOV64_IMM(BPF_REG_1, 2), + BPF_ALU64_IMM(BPF_XOR, BPF_REG_1, 3), + BPF_JMP_IMM(BPF_JGT, BPF_REG_1, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 3 }, + .result = ACCEPT, +}, 
+{ + "bounds check for reg = any, reg xor 3", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0), + BPF_ALU64_IMM(BPF_XOR, BPF_REG_1, 3), + BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 3 }, + .result = REJECT, + .errstr = "invalid access to map value", + .errstr_unpriv = "invalid access to map value", +}, +{ + "bounds check for reg32 = any, reg32 xor 3", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0), + BPF_ALU32_IMM(BPF_XOR, BPF_REG_1, 3), + BPF_JMP32_IMM(BPF_JNE, BPF_REG_1, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 3 }, + .result = REJECT, + .errstr = "invalid access to map value", + .errstr_unpriv = "invalid access to map value", +}, +{ + "bounds check for reg > 0, reg xor 3", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0), + BPF_JMP_IMM(BPF_JLE, BPF_REG_1, 0, 3), + BPF_ALU64_IMM(BPF_XOR, BPF_REG_1, 3), + BPF_JMP_IMM(BPF_JGE, BPF_REG_1, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 3 }, + .result = ACCEPT, +}, +{ + "bounds check for reg32 > 0, reg32 xor 3", + .insns = { + BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_LD_MAP_FD(BPF_REG_1, 0), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1), + BPF_EXIT_INSN(), + BPF_LDX_MEM(BPF_DW, BPF_REG_1, BPF_REG_0, 0), + BPF_JMP32_IMM(BPF_JLE, BPF_REG_1, 0, 3), + BPF_ALU32_IMM(BPF_XOR, BPF_REG_1, 3), + BPF_JMP32_IMM(BPF_JGE, BPF_REG_1, 0, 1), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 8), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .fixup_map_hash_8b = { 3 }, + .result = ACCEPT, +}, diff --git a/tools/testing/selftests/bpf/verifier/d_path.c b/tools/testing/selftests/bpf/verifier/d_path.c new file mode 100644 index 000000000000..b988396379a7 --- /dev/null +++ b/tools/testing/selftests/bpf/verifier/d_path.c @@ -0,0 +1,37 @@ +{ + "d_path accept", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_MOV64_IMM(BPF_REG_6, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6, 0), + BPF_LD_IMM64(BPF_REG_3, 8), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_d_path), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .result = ACCEPT, + .prog_type = BPF_PROG_TYPE_TRACING, + .expected_attach_type = BPF_TRACE_FENTRY, + .kfunc = "dentry_open", 
+}, +{ + "d_path reject", + .insns = { + BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_1, 0), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), + BPF_MOV64_IMM(BPF_REG_6, 0), + BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6, 0), + BPF_LD_IMM64(BPF_REG_3, 8), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_d_path), + BPF_MOV64_IMM(BPF_REG_0, 0), + BPF_EXIT_INSN(), + }, + .errstr = "helper call is not allowed in probe", + .result = REJECT, + .prog_type = BPF_PROG_TYPE_TRACING, + .expected_attach_type = BPF_TRACE_FENTRY, + .kfunc = "d_path", +}, |