aboutsummaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
2022-10-07Merge tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linuxLinus Torvalds
Pull passthrough updates from Jens Axboe: "With these changes, passthrough NVMe support over io_uring now performs at the same level as block device O_DIRECT, and in many cases 6-8% better. This contains: - Add support for fixed buffers for passthrough (Anuj, Kanchan) - Enable batched allocations and freeing on passthrough, similarly to what we support on the normal storage path (me) - Fix from Geert fixing an issue with !CONFIG_IO_URING" * tag 'for-6.1/passthrough-2022-10-04' of git://git.kernel.dk/linux: io_uring: Add missing inline to io_uring_cmd_import_fixed() dummy nvme: wire up fixed buffer support for nvme passthrough nvme: pass ubuffer as an integer block: extend functionality to map bvec iterator block: factor out blk_rq_map_bio_alloc helper block: rename bio_map_put to blk_mq_map_bio_put nvme: refactor nvme_alloc_request nvme: refactor nvme_add_user_metadata nvme: Use blk_rq_map_user_io helper scsi: Use blk_rq_map_user_io helper block: add blk_rq_map_user_io io_uring: introduce fixed buffer support for io_uring_cmd io_uring: add io_uring_cmd_import_fixed nvme: enable batched completions of passthrough IO nvme: split out metadata vs non metadata end_io uring_cmd completions block: allow end_io based requests in the completion batch handling block: change request end_io handler to pass back a return value block: enable batched allocation for blk_mq_alloc_request() block: kill deprecated BUG_ON() in the flush handling
2022-10-07Merge tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds
Pull block updates from Jens Axboe: - NVMe pull requests via Christoph: - handle number of queue changes in the TCP and RDMA drivers (Daniel Wagner) - allow changing the number of queues in nvmet (Daniel Wagner) - also consider host_iface when checking ip options (Daniel Wagner) - don't map pages which can't come from HIGHMEM (Fabio M. De Francesco) - avoid unnecessary flush bios in nvmet (Guixin Liu) - shrink and better pack the nvme_iod structure (Keith Busch) - add comment for unaligned "fake" nqn (Linjun Bao) - print actual source IP address through sysfs "address" attr (Martin Belanger) - various cleanups (Jackie Liu, Wolfram Sang, Genjian Zhang) - handle effects after freeing the request (Keith Busch) - copy firmware_rev on each init (Keith Busch) - restrict management ioctls to admin (Keith Busch) - ensure subsystem reset is single threaded (Keith Busch) - report the actual number of tagset maps in nvme-pci (Keith Busch) - small fabrics authentication fixups (Christoph Hellwig) - add common code for tagset allocation and freeing (Christoph Hellwig) - stop using the request_queue in nvmet (Christoph Hellwig) - set min_align_mask before calculating max_hw_sectors (Rishabh Bhatnagar) - send a rediscover uevent when a persistent discovery controller reconnects (Sagi Grimberg) - misc nvmet-tcp fixes (Varun Prakash, zhenwei pi) - MD pull request via Song: - Various raid5 fix and clean up, by Logan Gunthorpe and David Sloan. - Raid10 performance optimization, by Yu Kuai. - sbitmap wakeup hang fixes (Hugh, Keith, Jan, Yu) - IO scheduler switching quisce fix (Keith) - s390/dasd block driver updates (Stefan) - support for recovery for the ublk driver (ZiyangZhang) - rnbd drivers fixes and updates (Guoqing, Santosh, ye, Christoph) - blk-mq and null_blk map fixes (Bart) - various bcache fixes (Coly, Jilin, Jules) - nbd signal hang fix (Shigeru) - block writeback throttling fix (Yu) - optimize the passthrough mapping handling (me) - prepare block cgroups to being gendisk based (Christoph) - get rid of an old PSI hack in the block layer, moving it to the callers instead where it belongs (Christoph) - blk-throttle fixes and cleanups (Yu) - misc fixes and cleanups (Liu Shixin, Liu Song, Miaohe, Pankaj, Ping-Xiang, Wolfram, Saurabh, Li Jinlin, Li Lei, Lin, Li zeming, Miaohe, Bart, Coly, Gaosheng * tag 'for-6.1/block-2022-10-03' of git://git.kernel.dk/linux: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ...
2022-10-07Merge tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - Add supported for more directly managed task_work running. This is beneficial for real world applications that end up issuing lots of system calls as part of handling work. Normal task_work will always execute as we transition in and out of the kernel, even for "unrelated" system calls. It's more efficient to defer the handling of io_uring's deferred work until the application wants it to be run, generally in batches. As part of ongoing work to write an io_uring network backend for Thrift, this has been shown to greatly improve performance. (Dylan) - Add IOPOLL support for passthrough (Kanchan) - Improvements and fixes to the send zero-copy support (Pavel) - Partial IO handling fixes (Pavel) - CQE ordering fixes around CQ ring overflow (Pavel) - Support sendto() for non-zc as well (Pavel) - Support sendmsg for zerocopy (Pavel) - Networking iov_iter fix (Stefan) - Misc fixes and cleanups (Pavel, me) * tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...
2022-10-06Merge tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfsLinus Torvalds
Pull vfs constification updates from Al Viro: "whack-a-mole: constifying struct path *" * tag 'pull-path' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: ecryptfs: constify path spufs: constify path nd_jump_link(): constify path audit_init_parent(): constify path __io_setxattr(): constify path do_proc_readlink(): constify path overlayfs: constify path fs/notify: constify path may_linkat(): constify path do_sys_name_to_handle(): constify path ->getprocattr(): attribute name is const char *, TYVM...
2022-09-30Merge tag 'io_uring-6.0-2022-09-29' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fixes from Jens Axboe: "Two fixes that should go into 6.0: - Tweak the single issuer logic to register the task at creation, rather than at first submit. SINGLE_ISSUER was added for 6.0, and after some discussion on this, we decided to make it a bit stricter while it's still possible to do so (Dylan). - Stefan from Samba had some doubts on the level triggered poll that was added for this release. Rather than attempt to mess around with it now, just do the quick one-liner to disable it for release and we have time to discuss and change it for 6.1 instead (me)" * tag 'io_uring-6.0-2022-09-29' of git://git.kernel.dk/linux: io_uring/poll: disable level triggered poll io_uring: register single issuer task at creation
2022-09-30io_uring: introduce fixed buffer support for io_uring_cmdAnuj Gupta
Add IORING_URING_CMD_FIXED flag that is to be used for sending io_uring command with previously registered buffers. User-space passes the buffer index in sqe->buf_index, same as done in read/write variants that uses fixed buffers. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-3-anuj20.g@samsung.com [axboe: shuffle valid flags check before acting on it] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30io_uring: add io_uring_cmd_import_fixedAnuj Gupta
This is a new helper that callers can use to obtain a bvec iterator for the previously mapped buffer. This is preparatory work to enable fixed-buffer support for io_uring_cmd. Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com> Link: https://lore.kernel.org/r/20220930062749.152261-2-anuj20.g@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-30Merge branch 'for-6.1/io_uring' into for-6.1/passthroughJens Axboe
* for-6.1/io_uring: (56 commits) io_uring/net: fix notif cqe reordering io_uring/net: don't update msg_name if not provided io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL io_uring/rw: defer fsnotify calls to task context io_uring/net: fix fast_iov assignment in io_setup_async_msg() io_uring/net: fix non-zc send with address io_uring/net: don't skip notifs for failed requests io_uring/rw: don't lose short results on io_setup_async_rw() io_uring/rw: fix unexpected link breakage io_uring/net: fix cleanup double free free_iov init io_uring: fix CQE reordering io_uring/net: fix UAF in io_sendrecv_fail() selftest/net: adjust io_uring sendzc notif handling io_uring: ensure local task_work marks task as running io_uring/net: zerocopy sendmsg io_uring/net: combine fail handlers io_uring/net: rename io_sendzc() io_uring/net: support non-zerocopy sendto io_uring/net: refactor io_setup_async_addr io_uring/net: don't lose partial send_zc on fail ...
2022-09-30Merge branch 'for-6.1/block' into for-6.1/passthroughJens Axboe
* for-6.1/block: (162 commits) sbitmap: fix lockup while swapping block: add rationale for not using blk_mq_plug() when applicable block: adapt blk_mq_plug() to not plug for writes that require a zone lock s390/dasd: use blk_mq_alloc_disk blk-cgroup: don't update the blkg lookup hint in blkg_conf_prep nvmet: don't look at the request_queue in nvmet_bdev_set_limits nvmet: don't look at the request_queue in nvmet_bdev_zone_mgmt_emulate_all blk-mq: use quiesced elevator switch when reinitializing queues block: replace blk_queue_nowait with bdev_nowait nvme: remove nvme_ctrl_init_connect_q nvme-loop: use the tagset alloc/free helpers nvme-loop: store the generic nvme_ctrl in set->driver_data nvme-loop: initialize sqsize later nvme-fc: use the tagset alloc/free helpers nvme-fc: store the generic nvme_ctrl in set->driver_data nvme-fc: keep ctrl->sqsize in sync with opts->queue_size nvme-rdma: use the tagset alloc/free helpers nvme-rdma: store the generic nvme_ctrl in set->driver_data nvme-tcp: use the tagset alloc/free helpers nvme-tcp: store the generic nvme_ctrl in set->driver_data ... Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29io_uring/net: fix notif cqe reorderingPavel Begunkov
send zc is not restricted to !IO_URING_F_UNLOCKED anymore and so we can't use task-tw ordering trick to order notification cqes with requests completions. In this case leave it alone and let io_send_zc_cleanup() flush it. Cc: stable@vger.kernel.org Fixes: 53bdc88aac9a2 ("io_uring/notif: order notif vs send CQEs") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0031f3a00d492e814a4a0935a2029a46d9c9ba06.1664486545.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29io_uring/net: don't update msg_name if not providedPavel Begunkov
io_sendmsg_copy_hdr() may clear msg->msg_name if the userspace didn't provide it, we should retain NULL in this case. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/97d49f61b5ec76d0900df658cfde3aa59ff22121.1664486545.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29io_uring: don't gate task_work run on TIF_NOTIFY_SIGNALJens Axboe
This isn't a reliable mechanism to tell if we have task_work pending, we really should be looking at whether we have any items queued. This is problematic if forward progress is gated on running said task_work. One such example is reading from a pipe, where the write side has been closed right before the read is started. The fput() of the file queues TWA_RESUME task_work, and we need that task_work to be run before ->release() is called for the pipe. If ->release() isn't called, then the read will sit forever waiting on data that will never arise. Fix this by io_run_task_work() so it checks if we have task_work pending rather than rely on TIF_NOTIFY_SIGNAL for that. The latter obviously doesn't work for task_work that is queued without TWA_SIGNAL. Reported-by: Christiano Haesbaert <haesbaert@haesbaert.org> Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/665 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29io_uring/rw: defer fsnotify calls to task contextJens Axboe
We can't call these off the kiocb completion as that might be off soft/hard irq context. Defer the calls to when we process the task_work for this request. That avoids valid complaints like: stack backtrace: CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.0.0-rc6-syzkaller-00321-g105a36f3694e #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_usage_bug kernel/locking/lockdep.c:3961 [inline] valid_state kernel/locking/lockdep.c:3973 [inline] mark_lock_irq kernel/locking/lockdep.c:4176 [inline] mark_lock.part.0.cold+0x18/0xd8 kernel/locking/lockdep.c:4632 mark_lock kernel/locking/lockdep.c:4596 [inline] mark_usage kernel/locking/lockdep.c:4527 [inline] __lock_acquire+0x11d9/0x56d0 kernel/locking/lockdep.c:5007 lock_acquire kernel/locking/lockdep.c:5666 [inline] lock_acquire+0x1ab/0x570 kernel/locking/lockdep.c:5631 __fs_reclaim_acquire mm/page_alloc.c:4674 [inline] fs_reclaim_acquire+0x115/0x160 mm/page_alloc.c:4688 might_alloc include/linux/sched/mm.h:271 [inline] slab_pre_alloc_hook mm/slab.h:700 [inline] slab_alloc mm/slab.c:3278 [inline] __kmem_cache_alloc_lru mm/slab.c:3471 [inline] kmem_cache_alloc+0x39/0x520 mm/slab.c:3491 fanotify_alloc_fid_event fs/notify/fanotify/fanotify.c:580 [inline] fanotify_alloc_event fs/notify/fanotify/fanotify.c:813 [inline] fanotify_handle_event+0x1130/0x3f40 fs/notify/fanotify/fanotify.c:948 send_to_group fs/notify/fsnotify.c:360 [inline] fsnotify+0xafb/0x1680 fs/notify/fsnotify.c:570 __fsnotify_parent+0x62f/0xa60 fs/notify/fsnotify.c:230 fsnotify_parent include/linux/fsnotify.h:77 [inline] fsnotify_file include/linux/fsnotify.h:99 [inline] fsnotify_access include/linux/fsnotify.h:309 [inline] __io_complete_rw_common+0x485/0x720 io_uring/rw.c:195 io_complete_rw+0x1a/0x1f0 io_uring/rw.c:228 iomap_dio_complete_work fs/iomap/direct-io.c:144 [inline] iomap_dio_bio_end_io+0x438/0x5e0 fs/iomap/direct-io.c:178 bio_endio+0x5f9/0x780 block/bio.c:1564 req_bio_endio block/blk-mq.c:695 [inline] blk_update_request+0x3fc/0x1300 block/blk-mq.c:825 scsi_end_request+0x7a/0x9a0 drivers/scsi/scsi_lib.c:541 scsi_io_completion+0x173/0x1f70 drivers/scsi/scsi_lib.c:971 scsi_complete+0x122/0x3b0 drivers/scsi/scsi_lib.c:1438 blk_complete_reqs+0xad/0xe0 block/blk-mq.c:1022 __do_softirq+0x1d3/0x9c6 kernel/softirq.c:571 invoke_softirq kernel/softirq.c:445 [inline] __irq_exit_rcu+0x123/0x180 kernel/softirq.c:650 irq_exit_rcu+0x5/0x20 kernel/softirq.c:662 common_interrupt+0xa9/0xc0 arch/x86/kernel/irq.c:240 Fixes: f63cf5192fe3 ("io_uring: ensure that fsnotify is always called") Link: https://lore.kernel.org/all/20220929135627.ykivmdks2w5vzrwg@quack3/ Reported-by: syzbot+dfcc5f4da15868df7d4d@syzkaller.appspotmail.com Reported-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-29io_uring/net: fix fast_iov assignment in io_setup_async_msg()Stefan Metzmacher
I hit a very bad problem during my tests of SENDMSG_ZC. BUG(); in first_iovec_segment() triggered very easily. The problem was io_setup_async_msg() in the partial retry case, which seems to happen more often with _ZC. iov_iter_iovec_advance() may change i->iov in order to have i->iov_offset being only relative to the first element. Which means kmsg->msg.msg_iter.iov is no longer the same as kmsg->fast_iov. But this would rewind the copy to be the start of async_msg->fast_iov, which means the internal state of sync_msg->msg.msg_iter is inconsitent. I tested with 5 vectors with length like this 4, 0, 64, 20, 8388608 and got a short writes with: - ret=2675244 min_ret=8388692 => remaining 5713448 sr->done_io=2675244 - ret=-EAGAIN => io_uring_poll_arm - ret=4911225 min_ret=5713448 => remaining 802223 sr->done_io=7586469 - ret=-EAGAIN => io_uring_poll_arm - ret=802223 min_ret=802223 => res=8388692 While this was easily triggered with SENDMSG_ZC (queued for 6.1), it was a potential problem starting with 7ba89d2af17aa879dda30f5d5d3f152e587fc551 in 5.18 for IORING_OP_RECVMSG. And also with 4c3c09439c08b03d9503df0ca4c7619c5842892e in 5.19 for IORING_OP_SENDMSG. However 257e84a5377fbbc336ff563833a8712619acce56 introduced the critical code into io_setup_async_msg() in 5.11. Fixes: 7ba89d2af17aa ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly") Fixes: 257e84a5377fb ("io_uring: refactor sendmsg/recvmsg iov managing") Cc: stable@vger.kernel.org Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b2e7be246e2fb173520862b0c7098e55767567a2.1664436949.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-28io_uring/net: fix non-zc send with addressPavel Begunkov
We're currently ignoring the dest address with non-zerocopy send because even though we copy it from the userspace shortly after ->msg_name gets zeroed. Move msghdr init earlier. Fixes: 516e82f0e043a ("io_uring/net: support non-zerocopy sendto") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/176ced5e8568aa5d300ca899b7f05b303ebc49fd.1664409532.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-28io_uring/poll: disable level triggered pollJens Axboe
Stefan reports that there are issues with the level triggered notification. Since we're late in the cycle, and it was introduced for the 6.0 release, just disable it at prep time and we can bring this back when Samba is happy with it. Reported-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-28io_uring/net: don't skip notifs for failed requestsPavel Begunkov
We currently only add a notification CQE when the send succeded, i.e. cqe.res >= 0. However, it'd be more robust to do buffer notifications for failed requests as well in case drivers decide do something fanky. Always return a buffer notification after initial prep, don't hide it. This behaviour is better aligned with documentation and the patch also helps the userspace to respect it. Cc: stable@vger.kernel.org # 6.0 Suggested-by: Stefan Metzmacher <metze@samba.org> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9c8bead87b2b980fcec441b8faef52188b4a6588.1664292100.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-27block: replace blk_queue_nowait with bdev_nowaitChristoph Hellwig
Replace blk_queue_nowait with a bdev_nowait helpers that takes the block_device given that the I/O submission path should not have to look into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Link: https://lore.kernel.org/r/20220927075815.269694-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26io_uring/rw: don't lose short results on io_setup_async_rw()Pavel Begunkov
If a retry io_setup_async_rw() fails we lose result from the first io_iter_do_read(), which is a problem mostly for streams/sockets. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e8d20cebe5fc9c96ed268463c394237daabc384.1664235732.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26io_uring/rw: fix unexpected link breakagePavel Begunkov
req->cqe.res is set in io_read() to the amount of bytes left to be done, which is used to figure out whether to fail a read or not. However, io_read() may do another without returning, and we stash the previous value into ->bytes_done but forget to update cqe.res. Then we ask a read to do strictly less than cqe.res but expect the return to be exactly cqe.res. Fix the bug by updating cqe.res for retries. Cc: stable@vger.kernel.org Reported-and-Tested-by: Beld Zhang <beldzhang@gmail.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3a1088440c7be98e5800267af922a67da0ef9f13.1664235732.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26io_uring: register single issuer task at creationDylan Yudaken
Instead of picking the task from the first submitter task, rather use the creator task or in the case of disabled (IORING_SETUP_R_DISABLED) the enabling task. This approach allows a lot of simplification of the logic here. This removes init logic from the submission path, which can always be a bit confusing, but also removes the need for locking to write (or read) the submitter_task. Users that want to move a ring before submitting can create the ring disabled and then enable it on the submitting task. Signed-off-by: Dylan Yudaken <dylany@fb.com> Fixes: 97bbdc06a444 ("io_uring: add IORING_SETUP_SINGLE_ISSUER") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-26io_uring/net: fix cleanup double free free_iov initPavel Begunkov
Having ->async_data doesn't mean it's initialised and previously we vere relying on setting F_CLEANUP at the right moment. With zc sendmsg though, we set F_CLEANUP early in prep when we alloc a notif and so we may allocate async_data, fail in copy_msg_hdr() leaving struct io_async_msghdr not initialised correctly but with F_CLEANUP set, which causes a ->free_iov double free and probably other nastiness. Always initialise ->free_iov. Also, now it might point to fast_iov when fails, so avoid freeing it during cleanups. Reported-by: syzbot+edfd15cd4246a3fc615a@syzkaller.appspotmail.com Fixes: 493108d95f146 ("io_uring/net: zerocopy sendmsg") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-24Merge tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring fix from Jens Axboe: "Just a single fix for an issue with un-reaped IOPOLL requests on ring exit" * tag 'io_uring-6.0-2022-09-23' of git://git.kernel.dk/linux: io_uring: ensure that cached task references are always put on exit
2022-09-23io_uring: ensure that cached task references are always put on exitJens Axboe
io_uring caches task references to avoid doing atomics for each of them per request. If a request is put from the same task that allocated it, then we can maintain a per-ctx cache of them. This obviously relies on io_uring always pruning caches in a reliable way, and there's currently a case off io_uring fd release where we can miss that. One example is a ring setup with IOPOLL, which relies on the task polling for completions, which will free them. However, if such a task submits a request and then exits or closes the ring without reaping the completion, then ring release will reap and put. If release happens from that very same task, the completed request task refs will get put back into the cache pool. This is problematic, as we're now beyond the point of pruning caches. Manually drop these caches after doing an IOPOLL reap. This releases references from the current task, which is enough. If another task happens to be doing the release, then the caching will not be triggered and there's no issue. Cc: stable@vger.kernel.org Fixes: e98e49b2bbf7 ("io_uring: extend task put optimisations") Reported-by: Homin Rhee <hominlab@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23io_uring: fix CQE reorderingPavel Begunkov
Overflowing CQEs may result in reordering, which is buggy in case of links, F_MORE and so on. If we guarantee that we don't reorder for the unlikely event of a CQ ring overflow, then we can further extend this to not have to terminate multishot requests if it happens. For other operations, like zerocopy sends, we have no choice but to honor CQE ordering. Reported-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ec3bc55687b0768bbe20fb62d7d06cfced7d7e70.1663892031.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-23io_uring/net: fix UAF in io_sendrecv_fail()Pavel Begunkov
We should not assume anything about ->free_iov just from REQ_F_ASYNC_DATA but rather rely on REQ_F_NEED_CLEANUP, as we may allocate ->async_data but failed init would leave the field in not consistent state. The easiest solution is to remove removing REQ_F_NEED_CLEANUP and so ->async_data dealloc from io_sendrecv_fail() and let io_send_zc_cleanup() do the job. The catch here is that we also need to prevent double notif flushing, just test it for NULL and zero where it's needed. BUG: KASAN: use-after-free in io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221 Write of size 8 at addr ffff8880771b4080 by task syz-executor.3/30199 CPU: 1 PID: 30199 Comm: syz-executor.3 Not tainted 6.0.0-rc6-next-20220923-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:284 [inline] print_report+0x15e/0x45d mm/kasan/report.c:395 kasan_report+0xbb/0x1f0 mm/kasan/report.c:495 io_sendrecv_fail+0x3b0/0x3e0 io_uring/net.c:1221 io_req_complete_failed+0x155/0x1b0 io_uring/io_uring.c:873 io_drain_req io_uring/io_uring.c:1648 [inline] io_queue_sqe_fallback.cold+0x29f/0x788 io_uring/io_uring.c:1931 io_submit_sqe io_uring/io_uring.c:2160 [inline] io_submit_sqes+0x1180/0x1df0 io_uring/io_uring.c:2276 __do_sys_io_uring_enter+0xac6/0x2410 io_uring/io_uring.c:3216 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd Fixes: c4c0009e0b56e ("io_uring/net: combine fail handlers") Reported-by: syzbot+4c597a574a3f5a251bda@syzkaller.appspotmail.com Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/23ab8346e407ea50b1198a172c8a97e1cf22915b.1663945875.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring: ensure local task_work marks task as runningJens Axboe
io_uring will run task_work from contexts that have been prepared for waiting, and in doing so it'll implicitly set the task running again to avoid issues with blocking conditions. The new deferred local task_work doesn't do that, which can result in spews on this being an invalid condition: 

[ 112.917576] do not call blocking ops when !TASK_RUNNING; state=1 set at [<00000000ad64af64>] prepare_to_wait_exclusive+0x3f/0xd0 [ 112.983088] WARNING: CPU: 1 PID: 190 at kernel/sched/core.c:9819 __might_sleep+0x5a/0x60 [ 112.987240] Modules linked in: [ 112.990504] CPU: 1 PID: 190 Comm: io_uring Not tainted 6.0.0-rc6+ #1617 [ 113.053136] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 [ 113.133650] RIP: 0010:__might_sleep+0x5a/0x60 [ 113.136507] Code: ee 48 89 df 5b 31 d2 5d e9 33 ff ff ff 48 8b 90 30 0b 00 00 48 c7 c7 90 de 45 82 c6 05 20 8b 79 01 01 48 89 d1 e8 3a 49 77 00 <0f> 0b eb d1 66 90 0f 1f 44 00 00 9c 58 f6 c4 02 74 35 65 8b 05 ed [ 113.223940] RSP: 0018:ffffc90000537ca0 EFLAGS: 00010286 [ 113.232903] RAX: 0000000000000000 RBX: ffffffff8246782c RCX: ffffffff8270bcc8 IOPS=133.15K, BW=520MiB/s, IOS/call=32/31 [ 113.353457] RDX: ffffc90000537b50 RSI: 00000000ffffdfff RDI: 0000000000000001 [ 113.358970] RBP: 00000000000003bc R08: 0000000000000000 R09: c0000000ffffdfff [ 113.361746] R10: 0000000000000001 R11: ffffc90000537b48 R12: ffff888103f97280 [ 113.424038] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001 [ 113.428009] FS: 00007f67ae7fc700(0000) GS:ffff88842fc80000(0000) knlGS:0000000000000000 [ 113.432794] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 113.503186] CR2: 00007f67b8b9b3b0 CR3: 0000000102b9b005 CR4: 0000000000770ee0 [ 113.507291] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 113.512669] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 113.574374] PKRU: 55555554 [ 113.576800] Call Trace: [ 113.578325] <TASK> [ 113.579799] set_page_dirty_lock+0x1b/0x90 [ 113.582411] __bio_release_pages+0x141/0x160 [ 113.673078] ? set_next_entity+0xd7/0x190 [ 113.675632] blk_rq_unmap_user+0xaa/0x210 [ 113.678398] ? timerqueue_del+0x2a/0x40 [ 113.679578] nvme_uring_task_cb+0x94/0xb0 [ 113.683025] __io_run_local_work+0x8a/0x150 [ 113.743724] ? io_cqring_wait+0x33d/0x500 [ 113.746091] io_run_local_work.part.76+0x2e/0x60 [ 113.750091] io_cqring_wait+0x2e7/0x500 [ 113.752395] ? trace_event_raw_event_io_uring_req_failed+0x180/0x180 [ 113.823533] __x64_sys_io_uring_enter+0x131/0x3c0 [ 113.827382] ? switch_fpu_return+0x49/0xc0 [ 113.830753] do_syscall_64+0x34/0x80 [ 113.832620] entry_SYSCALL_64_after_hwframe+0x5e/0xc8 Ensure that we mark current as TASK_RUNNING for deferred task_work as well. Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Reported-by: Stefan Roesch <shr@fb.com> Reviewed-by: Dylan Yudaken <dylany@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: zerocopy sendmsgPavel Begunkov
Add a zerocopy version of sendmsg. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6aabc4bdfc0ec78df6ec9328137e394af9d4e7ef.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: combine fail handlersPavel Begunkov
Merge io_send_zc_fail() into io_sendrecv_fail(), saves a few lines of code and some headache for following patch. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e0eba1d577413aef5602cd45f588b9230207082d.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: rename io_sendzc()Pavel Begunkov
Simple renaming of io_sendzc*() functions in preparatio to adding a zerocopy sendmsg variant. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/265af46829e6076dd220011b1858dc3151969226.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: support non-zerocopy sendtoPavel Begunkov
We have normal sends, but what is missing is sendto-like requests. Add sendto() capabilities to IORING_OP_SEND by passing in addr just as we do for IORING_OP_SEND_ZC. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/69fbd8b2cb830e57d1bf9ec351e9bf95c5b77e3f.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: refactor io_setup_async_addrPavel Begunkov
Instead of passing the right address into io_setup_async_addr() only specify local on-stack storage and let the function infer where to grab it from. It optimises out one local variable we have to deal with. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6bfa9ab810d776853eb26ed59301e2536c3a5471.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: don't lose partial send_zc on failPavel Begunkov
Partial zc send may end up in io_req_complete_failed(), which not only would return invalid result but also mask out the notification leading to lifetime issues. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/5673285b5e83e6ceca323727b4ddaa584b5cc91e.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: don't lose partial send/recv on failPavel Begunkov
Just as with rw, partial send/recv may end up in io_req_complete_failed() and loose the result, make sure we return the number of bytes processed. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/a4ff95897b5419356fca9ea55db91ac15b2975f9.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/rw: don't lose partial IO result on failPavel Begunkov
A partially done read/write may end up in io_req_complete_failed() and loose the result, make sure we return the number of bytes processed. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/05e0879c226bcd53b441bf92868eadd4bf04e2fc.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring: add custom opcode hooks on failPavel Begunkov
Sometimes we have to do a little bit of a fixup on a request failuer in io_req_complete_failed(). Add a callback in opdef for that. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b734cff4e67cb30cca976b9face321023f37549a.1663668091.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/fdinfo: fix sqe dumping for IORING_SETUP_SQE128Jens Axboe
If we have doubly sized SQEs, then we need to shift the sq index by 1 to account for using two entries for a single request. The CQE dumping gets this right, but the SQE one does not. Improve the SQE dumping in general, the information dumped is pretty sparse and doesn't even cover the whole basic part of the SQE. Include information on the extended part of the SQE, if doubly sized SQEs are in use. A typical dump now looks like the following: [...] SQEs: 32 32: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2721, e0:0x0, e1:0xffffb8041000, e2:0x100000000000, e3:0x5500, e4:0x7, e5:0x0, e6:0x0, e7:0x0 33: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2722, e0:0x0, e1:0xffffb8043000, e2:0x100000000000, e3:0x5508, e4:0x7, e5:0x0, e6:0x0, e7:0x0 34: opcode:URING_CMD, fd:0, flags:1, off:3225964160, addr:0x0, rw_flags:0x0, buf_index:0 user_data:2723, e0:0x0, e1:0xffffb8045000, e2:0x100000000000, e3:0x5510, e4:0x7, e5:0x0, e6:0x0, e7:0x0 [...] Fixes: ebdeb7c01d02 ("io_uring: add support for 128-byte SQEs") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/fdinfo: get rid of unnecessary is_cqe32 variableJens Axboe
We already have the cq_shift, just use that to tell if we have doubly sized CQEs or not. While in there, cleanup the CQE32 vs normal CQE size printing. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring: remove unused return from io_disarm_nextPavel Begunkov
We removed conditional io_commit_cqring_flush() guarding against spurious eventfd and the io_disarm_next()'s return value is not used anymore, just void it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9a441c9a32a58bcc586076fa9a7d0dc33f1fb3cb.1662652536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring: add fast path for io_run_local_work()Pavel Begunkov
We'll grab uring_lock and call __io_run_local_work() with several atomics inside even if there are no task works. Skip it if ->work_llist is empty. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/f6a885f372bad2d77d9cd87341b0a86a4000c0ff.1662652536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/iopoll: unify tw breaking logicPavel Begunkov
Let's keep checks for whether to break the iopoll loop or not same for normal and defer tw, this includes ->cached_cq_tail checks guarding against polling more than asked for. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d2fa8a44f8114f55a4807528da438cde93815360.1662652536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/iopoll: fix unexpected returnsPavel Begunkov
We may propagate a positive return value of io_run_task_work() out of io_iopoll_check(), which breaks our tests. io_run_task_work() doesn't return anything useful for us, ignore the return value. Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/c442bb87f79cea10b3f857cbd4b9a4f0a0493fa3.1662652536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring: disallow defer-tw run w/ no submittersPavel Begunkov
We try to restrict CQ waiters when IORING_SETUP_DEFER_TASKRUN is set, but if nothing has been submitted yet it'll allow any waiter, which violates the contract. Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/b4f0d3f14236d7059d08c5abe2661ef0b78b5528.1662652536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring: further limit non-owner defer-tw cq waitingPavel Begunkov
In case of DEFER_TASK_WORK we try to restrict waiters to only one task, which is also the only submitter; however, we don't do it reliably, which might be very confusing and backfire in the future. E.g. we currently allow multiple tasks in io_iopoll_check(). Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Dylan Yudaken <dylany@fb.com> Link: https://lore.kernel.org/r/94c83c0a7fe468260ee2ec31bdb0095d6e874ba2.1662652536.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: use io_sr_msg for sendzcPavel Begunkov
Reuse struct io_sr_msg for zerocopy sends, which is handy. There is only one zerocopy specific field, namely .notif, and we have enough space for it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/408c5b1b2d8869e1a12da5f5a78ed72cac112149.1662639236.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: refactor io_sr_msg typesPavel Begunkov
In preparation for using struct io_sr_msg for zerocopy sends, clean up types. First, flags can be u16 as it's provided by the userspace in u16 ioprio, as well as addr_len. This saves us 4 bytes. Also use unsigned for size and done_io, both are as well limited to u32. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/42c2639d6385b8b2181342d2af3a42d3b1c5bcd2.1662639236.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: add non-bvec sg chunking callbackPavel Begunkov
Add a sg_from_iter() for when we initiate non-bvec zerocopy sends, which helps us to remove some extra steps from io_sg_from_iter(). The only thing the new function has to do before giving control away to __zerocopy_sg_from_iter() is to check if the skb has managed frags and downgrade them if so. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cda3dea0d36f7931f63a70f350130f085ac3f3dd.1662639236.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: io_async_msghdr caches for sendzcPavel Begunkov
We already keep io_async_msghdr caches for normal send/recv requests, use them also for zerocopy send. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/42fa615b6e0be25f47a685c35d7b5e4f1b03d348.1662639236.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: use async caches for async prepPavel Begunkov
send/recv have async_data caches but there are only used from within issue handlers. Extend their use also to ->prep_async, should be handy with links and IOSQE_ASYNC. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b9a2264b807582a97ed606c5bfcdc2399384e8a5.1662639236.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-09-21io_uring/net: reshuffle error handlingPavel Begunkov
We should prioritise send/recv retry cases over failures, they're more important. Shuffle -ERESTARTSYS after we handled retries. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d9059691b30d0963b7269fa4a0c81ee7720555e6.1662639236.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>