aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-02-08Merge branch 'pipe-exclusive-wakeup'Linus Torvalds
Merge thundering herd avoidance on pipe IO. This would have been applied for 5.5 already, but got delayed because of a user-space race condition in the GNU make jobserver code. Now that there's a new GNU make 4.3 release, and most distributions seem to have at least applied the (almost three year old) fix for the problem, let's see if people notice. And it might have been just bad random timing luck on my machine. If you do hit the race condition, things will still work, but the symptom is that you don't get nearly the expected parallelism when using "make -j<N>". The jobserver bug can definitely happen without this patch too, but seems to be easier to trigger when we no longer wake up pipe waiters unnecessarily. * pipe-exclusive-wakeup: pipe: use exclusive waits when reading or writing
2020-02-08pipe: use exclusive waits when reading or writingLinus Torvalds
This makes the pipe code use separate wait-queues and exclusive waiting for readers and writers, avoiding a nasty thundering herd problem when there are lots of readers waiting for data on a pipe (or, less commonly, lots of writers waiting for a pipe to have space). While this isn't a common occurrence in the traditional "use a pipe as a data transport" case, where you typically only have a single reader and a single writer process, there is one common special case: using a pipe as a source of "locking tokens" rather than for data communication. In particular, the GNU make jobserver code ends up using a pipe as a way to limit parallelism, where each job consumes a token by reading a byte from the jobserver pipe, and releases the token by writing a byte back to the pipe. This pattern is fairly traditional on Unix, and works very well, but will waste a lot of time waking up a lot of processes when only a single reader needs to be woken up when a writer releases a new token. A simplified test-case of just this pipe interaction is to create 64 processes, and then pass a single token around between them (this test-case also intentionally passes another token that gets ignored to test the "wake up next" logic too, in case anybody wonders about it): #include <unistd.h> int main(int argc, char **argv) { int fd[2], counters[2]; pipe(fd); counters[0] = 0; counters[1] = -1; write(fd[1], counters, sizeof(counters)); /* 64 processes */ fork(); fork(); fork(); fork(); fork(); fork(); do { int i; read(fd[0], &i, sizeof(i)); if (i < 0) continue; counters[0] = i+1; write(fd[1], counters, (1+(i & 1)) *sizeof(int)); } while (counters[0] < 1000000); return 0; } and in a perfect world, passing that token around should only cause one context switch per transfer, when the writer of a token causes a directed wakeup of just a single reader. But with the "writer wakes all readers" model we traditionally had, on my test box the above case causes more than an order of magnitude more scheduling: instead of the expected ~1M context switches, "perf stat" shows 231,852.37 msec task-clock # 15.857 CPUs utilized 11,250,961 context-switches # 0.049 M/sec 616,304 cpu-migrations # 0.003 M/sec 1,648 page-faults # 0.007 K/sec 1,097,903,998,514 cycles # 4.735 GHz 120,781,778,352 instructions # 0.11 insn per cycle 27,997,056,043 branches # 120.754 M/sec 283,581,233 branch-misses # 1.01% of all branches 14.621273891 seconds time elapsed 0.018243000 seconds user 3.611468000 seconds sys before this commit. After this commit, I get 5,229.55 msec task-clock # 3.072 CPUs utilized 1,212,233 context-switches # 0.232 M/sec 103,951 cpu-migrations # 0.020 M/sec 1,328 page-faults # 0.254 K/sec 21,307,456,166 cycles # 4.074 GHz 12,947,819,999 instructions # 0.61 insn per cycle 2,881,985,678 branches # 551.096 M/sec 64,267,015 branch-misses # 2.23% of all branches 1.702148350 seconds time elapsed 0.004868000 seconds user 0.110786000 seconds sys instead. Much better. [ Note! This kernel improvement seems to be very good at triggering a race condition in the make jobserver (in GNU make 4.2.1) for me. It's a long known bug that was fixed back in June 2017 by GNU make commit b552b0525198 ("[SV 51159] Use a non-blocking read with pselect to avoid hangs."). But there wasn't a new release of GNU make until 4.3 on Jan 19 2020, so a number of distributions may still have the buggy version. Some have backported the fix to their 4.2.1 release, though, and even without the fix it's quite timing-dependent whether the bug actually is hit. ] Josh Triplett says: "I've been hammering on your pipe fix patch (switching to exclusive wait queues) for a month or so, on several different systems, and I've run into no issues with it. The patch *substantially* improves parallel build times on large (~100 CPU) systems, both with parallel make and with other things that use make's pipe-based jobserver. All current distributions (including stable and long-term stable distributions) have versions of GNU make that no longer have the jobserver bug" Tested-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-07Merge tag 'fuse-fixes-5.6-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse fixes from Miklos Szeredi: - Fix a regression introduced in v5.1 that triggers WARNINGs for some fuse filesystems - Fix an xfstest failure - Allow overlayfs to be used on top of fuse/virtiofs - Code and documentation cleanups * tag 'fuse-fixes-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: use true,false for bool variable Documentation: filesystems: convert fuse to RST fuse: Support RENAME_WHITEOUT flag fuse: don't overflow LLONG_MAX with end offset fix up iter on short count in fuse_direct_io()
2020-02-07Merge tag 'gfs2-for-5.6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 fixes from Andreas Gruenbacher: - Fix a bug in Abhi Das's journal head lookup improvements that can cause a valid journal to be rejected. - Fix an O_SYNC write handling bug reported by Christoph Hellwig. * tag 'gfs2-for-5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: fix O_SYNC write handling gfs2: move setting current->backing_dev_info gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0
2020-02-07Merge tag 'for-linus-5.6-ofs1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux Pull orangefs fix from Mike Marshall: "Debugfs fix for orangefs. Vasliy Averin noticed that 'if seq_file .next function does not change position index, read after some lseek can generate unexpected output' and sent in this fix" * tag 'for-linus-5.6-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux: help_next should increase position index
2020-02-07Merge tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields: "Highlights: - Server-to-server copy code from Olga. To use it, client and both servers must have support, the target server must be able to access the source server over NFSv4.2, and the target server must have the inter_copy_offload_enable module parameter set. - Improvements and bugfixes for the new filehandle cache, especially in the container case, from Trond - Also from Trond, better reporting of write errors. - Y2038 work from Arnd" * tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux: (55 commits) sunrpc: expiry_time should be seconds not timeval nfsd: make nfsd_filecache_wq variable static nfsd4: fix double free in nfsd4_do_async_copy() nfsd: convert file cache to use over/underflow safe refcount nfsd: Define the file access mode enum for tracing nfsd: Fix a perf warning nfsd: Ensure sampling of the write verifier is atomic with the write nfsd: Ensure sampling of the commit verifier is atomic with the commit sunrpc: clean up cache entry add/remove from hashtable sunrpc: Fix potential leaks in sunrpc_cache_unhash() nfsd: Ensure exclusion between CLONE and WRITE errors nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range() nfsd: Update the boot verifier on stable writes too. nfsd: Fix stable writes nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argument nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create() nfsd: Reduce the number of calls to nfsd_file_gc() nfsd: Schedule the laundrette regularly irrespective of file errors nfsd: Remove unused constant NFSD_FILE_LRU_RESCAN nfsd: Containerise filecache laundrette ...
2020-02-07Merge tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Puyll NFS client updates from Anna Schumaker: "Stable bugfixes: - Fix memory leaks and corruption in readdir # v2.6.37+ - Directory page cache needs to be locked when read # v2.6.37+ New features: - Convert NFS to use the new mount API - Add "softreval" mount option to let clients use cache if server goes down - Add a config option to compile without UDP support - Limit the number of inactive delegations the client can cache at once - Improved readdir concurrency using iterate_shared() Other bugfixes and cleanups: - More 64-bit time conversions - Add additional diagnostic tracepoints - Check for holes in swapfiles, and add dependency on CONFIG_SWAP - Various xprtrdma cleanups to prepare for 5.7's changes - Several fixes for NFS writeback and commit handling - Fix acls over krb5i/krb5p mounts - Recover from premature loss of openstateids - Fix NFS v3 chacl and chmod bug - Compare creds using cred_fscmp() - Use kmemdup_nul() in more places - Optimize readdir cache page invalidation - Lease renewal and recovery fixes" * tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits) NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals NFSv4: try lease recovery on NFS4ERR_EXPIRED NFS: Fix memory leaks nfs: optimise readdir cache page invalidation NFS: Switch readdir to using iterate_shared() NFS: Use kmemdup_nul() in nfs_readdir_make_qstr() NFS: Directory page cache pages need to be locked when read NFS: Fix memory leaks and corruption in readdir SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id() NFS: Replace various occurrences of kstrndup() with kmemdup_nul() NFSv4: Limit the total number of cached delegations NFSv4: Add accounting for the number of active delegations held NFSv4: Try to return the delegation immediately when marked for return on close NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING NFS: nfs_find_open_context() should use cred_fscmp() NFS: nfs_access_get_cached_rcu() should use cred_fscmp() NFSv4: pnfs_roc() must use cred_fscmp() to compare creds NFS: remove unused macros nfs: Return EINVAL rather than ERANGE for mount parse errors ...
2020-02-07nfsd: make nfsd_filecache_wq variable staticChen Zhou
Fix sparse warning: fs/nfsd/filecache.c:55:25: warning: symbol 'nfsd_filecache_wq' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Chen Zhou <chenzhou10@huawei.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06gfs2: fix O_SYNC write handlingAndreas Gruenbacher
In gfs2_file_write_iter, for direct writes, the error checking in the buffered write fallback case is incomplete. This can cause inode write errors to go undetected. Fix and clean up gfs2_file_write_iter along the way. Based on a proposed fix by Christoph Hellwig <hch@lst.de>. Fixes: 967bcc91b044 ("gfs2: iomap direct I/O support") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06gfs2: move setting current->backing_dev_infoChristoph Hellwig
Set current->backing_dev_info just around the buffered write calls to prepare for the next fix. Fixes: 967bcc91b044 ("gfs2: iomap direct I/O support") Cc: stable@vger.kernel.org # v4.19+ Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06gfs2: fix gfs2_find_jhead that returns uninitialized jhead with seq 0Abhi Das
When the first log header in a journal happens to have a sequence number of 0, a bug in gfs2_find_jhead() causes it to prematurely exit, and return an uninitialized jhead with seq 0. This can cause failures in the caller. For instance, a mount fails in one test case. The correct behavior is for it to continue searching through the journal to find the correct journal head with the highest sequence number. Fixes: f4686c26ecc3 ("gfs2: read journal in large chunks") Cc: stable@vger.kernel.org # v5.2+ Signed-off-by: Abhi Das <adas@redhat.com> Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
2020-02-06nfsd4: fix double free in nfsd4_do_async_copy()Dan Carpenter
This frees "copy->nf_src" before and again after the goto. Fixes: ce0887ac96d3 ("NFSD add nfs4 inter ssc to nfsd4_copy") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Olga Kornievskaia <kolga@netapp.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06nfsd: convert file cache to use over/underflow safe refcountTrond Myklebust
Use the 'refcount_t' type instead of 'atomic_t' for improved refcounting safety. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06nfsd: Define the file access mode enum for tracingTrond Myklebust
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06nfsd: Fix a perf warningTrond Myklebust
perf does not know how to deal with a __builtin_bswap32() call, and complains. All other functions just store the xid etc in host endian form, so let's do that in the tracepoint for nfsd_file_acquire too. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-06fuse: use true,false for bool variablezhengbin
Fixes coccicheck warning: fs/fuse/readdir.c:335:1-19: WARNING: Assignment of 0/1 to bool variable fs/fuse/file.c:1398:2-19: WARNING: Assignment of 0/1 to bool variable fs/fuse/file.c:1400:2-20: WARNING: Assignment of 0/1 to bool variable fs/fuse/cuse.c:454:1-20: WARNING: Assignment of 0/1 to bool variable fs/fuse/cuse.c:455:1-19: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:497:2-17: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:504:2-23: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:511:2-22: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:518:2-23: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:522:2-26: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:526:2-18: WARNING: Assignment of 0/1 to bool variable fs/fuse/inode.c:1000:1-20: WARNING: Assignment of 0/1 to bool variable Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06fuse: Support RENAME_WHITEOUT flagVivek Goyal
Allow fuse to pass RENAME_WHITEOUT to fuse server. Overlayfs on top of virtiofs uses RENAME_WHITEOUT. Without this patch renaming a directory in overlayfs (dir is on lower) fails with -EINVAL. With this patch it works. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06fuse: don't overflow LLONG_MAX with end offsetMiklos Szeredi
Handle the special case of fuse_readpages() wanting to read the last page of a hugest file possible and overflowing the end offset in the process. This is basically to unbreak xfstests:generic/525 and prevent filesystems from doing bad things with an overflowing offset. Reported-by: Xiao Yang <ice_yangxiao@163.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06fix up iter on short count in fuse_direct_io()Miklos Szeredi
fuse_direct_io() can end up advancing the iterator by more than the amount of data read or written. This case is handled by the generic code if going through ->direct_IO(), but not in the FOPEN_DIRECT_IO case. Fix by reverting the extra bytes from the iterator in case of error or a short count. To test: install lxcfs, then the following testcase int fd = open("/var/lib/lxcfs/proc/uptime", O_RDONLY); sendfile(1, fd, NULL, 16777216); sendfile(1, fd, NULL, 16777216); will spew WARN_ON() in iov_iter_pipe(). Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Al Viro <viro@zeniv.linux.org.uk> Fixes: 3c3db095b68c ("fuse: use iov_iter based generic splice helpers") Cc: <stable@vger.kernel.org> # v5.1 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-06Merge tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: - a set of patches that fixes various corner cases in mount and umount code (Xiubo Li). This has to do with choosing an MDS, distinguishing between laggy and down MDSes and parsing the server path. - inode initialization fixes (Jeff Layton). The one included here mostly concerns things like open_by_handle() and there is another one that will come through Al. - copy_file_range() now uses the new copy-from2 op (Luis Henriques). The existing copy-from op turned out to be infeasible for generic filesystem use; we disable the copy offload if OSDs don't support copy-from2. - a patch to link "rbd" and "block" devices together in sysfs (Hannes Reinecke) ... and a smattering of cleanups from Xiubo, Jeff and Chengguang. * tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits) rbd: set the 'device' link in sysfs ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c ceph: print name of xattr in __ceph_{get,set}xattr() douts ceph: print r_direct_hash in hex in __choose_mds() dout ceph: use copy-from2 op in copy_file_range ceph: close holes in structs ceph_mds_session and ceph_mds_request rbd: work around -Wuninitialized warning ceph: allocate the correct amount of extra bytes for the session features ceph: rename get_session and switch to use ceph_get_mds_session ceph: remove the extra slashes in the server path ceph: add possible_max_rank and make the code more readable ceph: print dentry offset in hex and fix xattr_version type ceph: only touch the caps which have the subset mask requested ceph: don't clear I_NEW until inode metadata is fully populated ceph: retry the same mds later after the new session is opened ceph: check availability of mds cluster on mount after wait timeout ceph: keep the session state until it is released ceph: add __send_request helper ceph: ensure we have a new cap before continuing in fill_inode ceph: drop unused ttl_from parameter from fill_inode ...
2020-02-06Merge tag 'xfs-5.6-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull moar xfs updates from Darrick Wong: "This contains the buffer error code refactoring I mentioned last week, now that it has had extra time to complete the full xfs fuzz testing suite to make sure there aren't any obvious new bugs" * tag 'xfs-5.6-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix xfs_buf_ioerror_alert location reporting xfs: remove unnecessary null pointer checks from _read_agf callers xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers xfs: remove the xfs_btree_get_buf[ls] functions xfs: make xfs_trans_get_buf return an error code xfs: make xfs_trans_get_buf_map return an error code xfs: make xfs_buf_read return an error code xfs: make xfs_buf_get_uncached return an error code xfs: make xfs_buf_get return an error code xfs: make xfs_buf_read_map return an error code xfs: make xfs_buf_get_map return an error code xfs: make xfs_buf_alloc return an error code
2020-02-06Merge tag 'trace-v5.6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - Added new "bootconfig". This looks for a file appended to initrd to add boot config options, and has been discussed thoroughly at Linux Plumbers. Very useful for adding kprobes at bootup. Only enabled if "bootconfig" is on the real kernel command line. - Created dynamic event creation. Merges common code between creating synthetic events and kprobe events. - Rename perf "ring_buffer" structure to "perf_buffer" - Rename ftrace "ring_buffer" structure to "trace_buffer" Had to rename existing "trace_buffer" to "array_buffer" - Allow trace_printk() to work withing (some) tracing code. - Sort of tracing configs to be a little better organized - Fixed bug where ftrace_graph hash was not being protected properly - Various other small fixes and clean ups * tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits) bootconfig: Show the number of nodes on boot message tools/bootconfig: Show the number of bootconfig nodes bootconfig: Add more parse error messages bootconfig: Use bootconfig instead of boot config ftrace: Protect ftrace_graph_hash with ftrace_sync ftrace: Add comment to why rcu_dereference_sched() is open coded tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu tracing: Annotate ftrace_graph_hash pointer with __rcu bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline tracing: Use seq_buf for building dynevent_cmd string tracing: Remove useless code in dynevent_arg_pair_add() tracing: Remove check_arg() callbacks from dynevent args tracing: Consolidate some synth_event_trace code tracing: Fix now invalid var_ref_vals assumption in trace action tracing: Change trace_boot to use synth_event interface tracing: Move tracing selftests to bottom of menu tracing: Move mmio tracer config up with the other tracers tracing: Move tracing test module configs together tracing: Move all function tracing configs together tracing: Documentation for in-kernel synthetic event API ...
2020-02-06Merge tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Some later fixes for io_uring: - Small cleanup series from Pavel - Belt and suspenders build time check of sqe size and layout (Stefan) - Addition of ->show_fdinfo() on request of Jann Horn, to aid in understanding mapped personalities - eventfd recursion/deadlock fix, for both io_uring and aio - Fixup for send/recv handling - Fixup for double deferral of read/write request - Fix for potential double completion event for close request - Adjust fadvise advice async/inline behavior - Fix for shutdown hang with SQPOLL thread - Fix for potential use-after-free of fixed file table" * tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block: io_uring: cleanup fixed file data table references io_uring: spin for sq thread to idle on shutdown aio: prevent potential eventfd recursion on poll io_uring: put the flag changing code in the same spot io_uring: iterate req cache backwards io_uring: punt even fadvise() WILLNEED to async context io_uring: fix sporadic double CQE entry for close io_uring: remove extra ->file check io_uring: don't map read/write iovec potentially twice io_uring: use the proper helpers for io_send/recv io_uring: prevent potential eventfd recursion on poll eventfd: track eventfd_signal() recursion depth io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe io_uring: add ->show_fdinfo() for the io_uring file descriptor
2020-02-05Merge tag 'jfs-5.6' of git://github.com/kleikamp/linux-shaggyLinus Torvalds
Pull jfs update from David Kleikamp: "Trivial cleanup for jfs" * tag 'jfs-5.6' of git://github.com/kleikamp/linux-shaggy: jfs: remove unused MAXL2PAGES
2020-02-05Merge branch 'work.recursive_removal' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs recursive removal updates from Al Viro: "We have quite a few places where synthetic filesystems do an equivalent of 'rm -rf', with varying amounts of code duplication, wrong locking, etc. That really ought to be a library helper. Only debugfs (and very similar tracefs) are converted here - I have more conversions, but they'd never been in -next, so they'll have to wait" * 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems
2020-02-05Merge branch 'imm.timestamp' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs timestamp updates from Al Viro: "More 64bit timestamp work" * 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kernfs: don't bother with timestamp truncation fs: Do not overload update_time fs: Delete timespec64_trunc() fs: ubifs: Eliminate timespec64_trunc() usage fs: ceph: Delete timespec64_trunc() usage fs: cifs: Delete usage of timespec64_trunc fs: fat: Eliminate timespec64_trunc() usage utimes: Clamp the timestamps in notify_change()
2020-02-04io_uring: cleanup fixed file data table referencesJens Axboe
syzbot reports a use-after-free in io_ring_file_ref_switch() when it tries to switch back to percpu mode. When we put the final reference to the table by calling percpu_ref_kill_and_confirm(), we don't want the zero reference to queue async work for flushing the potentially queued up items. We currently do a few flush_work(), but they merely paper around the issue, since the work item may not have been queued yet depending on the when the percpu-ref callback gets run. Coming into the file unregister, we know we have the ring quiesced. io_ring_file_ref_switch() can check for whether or not the ref is dying or not, and not queue anything async at that point. Once the ref has been confirmed killed, flush any potential items manually. Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-04io_uring: spin for sq thread to idle on shutdownJens Axboe
As part of io_uring shutdown, we cancel work that is pending and won't necessarily complete on its own. That includes requests like poll commands and timeouts. If we're using SQPOLL for kernel side submission and we shutdown the ring immediately after queueing such work, we can race with the sqthread doing the submission. This means we may miss cancelling some work, which results in the io_uring shutdown hanging forever. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-04help_next should increase position indexVasily Averin
if seq_file .next fuction does not change position index, read after some lseek can generate unexpected output. https://bugzilla.kernel.org/show_bug.cgi?id=206283 Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Mike Marshall <hubcap@omnibond.com>
2020-02-04NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewalsRobert Milkowski
Currently, each time nfs4_do_fsinfo() is called it will do an implicit NFS4 lease renewal, which is not compliant with the NFS4 specification. This can result in a lease being expired by an NFS server. Commit 83ca7f5ab31f ("NFS: Avoid PUTROOTFH when managing leases") introduced implicit client lease renewal in nfs4_do_fsinfo(), which can result in the NFSv4.0 lease to expire on a server side, and servers returning NFS4ERR_EXPIRED or NFS4ERR_STALE_CLIENTID. This can easily be reproduced by frequently unmounting a sub-mount, then stat'ing it to get it mounted again, which will delay or even completely prevent client from sending RENEW operations if no other NFS operations are issued. Eventually nfs server will expire client's lease and return an error on file access or next RENEW. This can also happen when a sub-mount is automatically unmounted due to inactivity (after nfs_mountpoint_expiry_timeout), then it is mounted again via stat(). This can result in a short window during which client's lease will expire on a server but not on a client. This specific case was observed on production systems. This patch removes the implicit lease renewal from nfs4_do_fsinfo(). Fixes: 83ca7f5ab31f ("NFS: Avoid PUTROOTFH when managing leases") Signed-off-by: Robert Milkowski <rmilkowski@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04NFSv4: try lease recovery on NFS4ERR_EXPIREDRobert Milkowski
Currently, if an nfs server returns NFS4ERR_EXPIRED to open(), we return EIO to applications without even trying to recover. Fixes: 272289a3df72 ("NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid") Signed-off-by: Robert Milkowski <rmilkowski@gmail.com> Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04NFS: Fix memory leaksWenwen Wang
In _nfs42_proc_copy(), 'res->commit_res.verf' is allocated through kzalloc() if 'args->sync' is true. In the following code, if 'res->synchronous' is false, handle_async_copy() will be invoked. If an error occurs during the invocation, the following code will not be executed and the error will be returned . However, the allocated 'res->commit_res.verf' is not deallocated, leading to a memory leak. This is also true if the invocation of process_copy_commit() returns an error. To fix the above leaks, redirect the execution to the 'out' label if an error is encountered. Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04nfs: optimise readdir cache page invalidationDai Ngo
When the directory is large and it's being modified by one client while another client is doing the 'ls -l' on the same directory then the cache page invalidation from nfs_force_use_readdirplus causes the reading client to keep restarting READDIRPLUS from cookie 0 which causes the 'ls -l' to take a very long time to complete, possibly never completing. Currently when nfs_force_use_readdirplus is called to switch from READDIR to READDIRPLUS, it invalidates all the cached pages of the directory. This cache page invalidation causes the next nfs_readdir to re-read the directory content from cookie 0. This patch is to optimise the cache invalidation in nfs_force_use_readdirplus by only truncating the cached pages from last page index accessed to the end the file. It also marks the inode to delay invalidating all the cached page of the directory until the next initial nfs_readdir of the next 'ls' instance. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com> [Anna - Fix conflicts with Trond's readdir patches] [Anna - Remove redundant call to nfs_zap_mapping()] [Anna - Replace d_inode(file_dentry(desc->file)) with file_inode(desc->file)] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04Merge tag 'ovl-update-5.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Try to preserve holes in sparse files when copying up, thus saving disk space and improving performance. - Fix a performance regression introduced in v4.19 by preserving asynchronicity of IO when fowarding to underlying layers. Add VFS helpers to submit async iocbs. - Fix a regression in lseek(2) introduced in v4.19 that breaks >2G seeks on 32bit kernels. - Fix a corner case where st_ino/st_dev was not preserved across copy up. - Miscellaneous fixes and cleanups. * tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix lseek overflow on 32bit ovl: add splice file read write helper ovl: implement async IO routines vfs: add vfs_iocb_iter_[read|write] helper functions ovl: layer is const ovl: fix corner case of non-constant st_dev;st_ino ovl: fix corner case of conflicting lower layer uuid ovl: generalize the lower_fs[] array ovl: simplify ovl_same_sb() helper ovl: generalize the lower_layers[] array ovl: improving copy-up efficiency for big sparse file ovl: use ovl_inode_lock in ovl_llseek() ovl: use pr_fmt auto generate prefix ovl: fix wrong WARN_ON() in ovl_cache_update_ino()
2020-02-04treewide: remove redundant IS_ERR() before error code checkMasahiro Yamada
'PTR_ERR(p) == -E*' is a stronger condition than IS_ERR(p). Hence, IS_ERR(p) is unneeded. The semantic patch that generates this commit is as follows: // <smpl> @@ expression ptr; constant error_code; @@ -IS_ERR(ptr) && (PTR_ERR(ptr) == - error_code) +PTR_ERR(ptr) == - error_code // </smpl> Link: http://lkml.kernel.org/r/20200106045833.1725-1-masahiroy@kernel.org Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Julia Lawall <julia.lawall@lip6.fr> Acked-by: Stephen Boyd <sboyd@kernel.org> [drivers/clk/clk.c] Acked-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> [GPIO] Acked-by: Wolfram Sang <wsa@the-dreams.de> [drivers/i2c] Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi/scan.c] Acked-by: Rob Herring <robh@kernel.org> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04proc: convert everything to "struct proc_ops"Alexey Dobriyan
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h. Conversion rule is: llseek => proc_lseek unlocked_ioctl => proc_ioctl xxx => proc_xxx delete ".owner = THIS_MODULE" line [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c] [sfr@canb.auug.org.au: fix kernel/sched/psi.c] Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04proc: decouple proc from VFS with "struct proc_ops"Alexey Dobriyan
Currently core /proc code uses "struct file_operations" for custom hooks, however, VFS doesn't directly call them. Every time VFS expands file_operations hook set, /proc code bloats for no reason. Introduce "struct proc_ops" which contains only those hooks which /proc allows to call into (open, release, read, write, ioctl, mmap, poll). It doesn't contain module pointer as well. Save ~184 bytes per usage: add/remove: 26/26 grow/shrink: 1/4 up/down: 1922/-6674 (-4752) Function old new delta sysvipc_proc_ops - 72 +72 ... config_gz_proc_ops - 72 +72 proc_get_inode 289 339 +50 proc_reg_get_unmapped_area 110 107 -3 close_pdeo 227 224 -3 proc_reg_open 289 284 -5 proc_create_data 60 53 -7 rt_cpu_seq_fops 256 - -256 ... default_affinity_proc_fops 256 - -256 Total: Before=5430095, After=5425343, chg -0.09% Link: http://lkml.kernel.org/r/20191225172228.GA13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: pagewalk: add 'depth' parameter to pte_holeSteven Price
The pte_hole() callback is called at multiple levels of the page tables. Code dumping the kernel page tables needs to know what at what depth the missing entry is. Add this is an extra parameter to pte_hole(). When the depth isn't know (e.g. processing a vma) then -1 is passed. The depth that is reported is the actual level where the entry is missing (ignoring any folding that is in place), i.e. any levels where PTRS_PER_P?D is set to 1 are ignored. Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their natural numbers as levels 2/3/4. Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Tested-by: Zong Li <zong.li@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04fs/proc/page.c: allow inspection of last section and fix end detectionDavid Hildenbrand
If max_pfn does not fall onto a section boundary, it is possible to inspect PFNs up to max_pfn, and PFNs above max_pfn, however, max_pfn itself can't be inspected. We can have a valid (and online) memmap at and above max_pfn if max_pfn is not aligned to a section boundary. The whole early section has a memmap and is marked online. Being able to inspect the state of these PFNs is valuable for debugging, especially because max_pfn can change on memory hotplug and expose these memmaps. Also, querying page flags via "./page-types -r -a 0x144001," (tools/vm/page-types.c) inside a x86-64 guest with 4160MB under QEMU results in an (almost) endless loop in user space, because the end is not detected properly when starting after max_pfn. Instead, let's allow to inspect all pages in the highest section and return 0 directly if we try to access pages above that section. While at it, check the count before adjusting it, to avoid masking user errors. Link: http://lkml.kernel.org/r/20191211163201.17179-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Bob Picco <bob.picco@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ocfs2: fix oops when writing cloned fileGang He
Writing a cloned file triggers a kernel oops and the user-space command process is also killed by the system. The bug can be reproduced stably via: 1) create a file under ocfs2 file system directory. journalctl -b > aa.txt 2) create a cloned file for this file. reflink aa.txt bb.txt 3) write the cloned file with dd command. dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc The dd command is killed by the kernel, then you can see the oops message via dmesg command. [ 463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028 [ 463.875413] #PF: supervisor read access in kernel mode [ 463.875416] #PF: error_code(0x0000) - not-present page [ 463.875418] PGD 0 P4D 0 [ 463.875425] Oops: 0000 [#1] SMP PTI [ 463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G OE 5.3.16-2-default [ 463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2] [ 463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28 [ 463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202 [ 463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0 [ 463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000 [ 463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000 [ 463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048 [ 463.875526] FS: 00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000 [ 463.875529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0 [ 463.875546] Call Trace: [ 463.875596] ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2] [ 463.875648] ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2] [ 463.875672] new_sync_write+0x12d/0x1d0 [ 463.875688] vfs_write+0xad/0x1a0 [ 463.875697] ksys_write+0xa1/0xe0 [ 463.875710] do_syscall_64+0x60/0x1f0 [ 463.875743] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 463.875758] RIP: 0033:0x7f8f4482ed44 [ 463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 [ 463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44 [ 463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001 [ 463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003 [ 463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000 [ 463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000 This regression problem was introduced by commit e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-03aio: prevent potential eventfd recursion on pollJens Axboe
If we have nested or circular eventfd wakeups, then we can deadlock if we run them inline from our poll waitqueue wakeup handler. It's also possible to have very long chains of notifications, to the extent where we could risk blowing the stack. Check the eventfd recursion count before calling eventfd_signal(). If it's non-zero, then punt the signaling to async context. This is always safe, as it takes us out-of-line in terms of stack and locking context. Cc: stable@vger.kernel.org # 4.19+ Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: put the flag changing code in the same spotPavel Begunkov
Both iocb_flags() and kiocb_set_rw_flags() are inline and modify kiocb->ki_flags. Place them close, so they can be potentially better optimised. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: iterate req cache backwardsPavel Begunkov
Grab requests from cache-array from the end, so can get by only free_reqs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: punt even fadvise() WILLNEED to async contextJens Axboe
Andres correctly points out that read-ahead can block, if it needs to read in meta data (or even just through the page cache page allocations). Play it safe for now and just ensure WILLNEED is also punted to async context. While in there, allow the file settings hints from non-blocking context. They don't need to start/do IO, and we can safely do them inline. Fixes: 4840e418c2fc ("io_uring: add IORING_OP_FADVISE") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: fix sporadic double CQE entry for closeJens Axboe
We punt close to async for the final fput(), but we log the completion even before that even in that case. We rely on the request not having a files table assigned to detect what the final async close should do. However, if we punt the async queue to __io_queue_sqe(), we'll get ->files assigned and this makes io_close_finish() think it should both close the filp again (which does no harm) AND log a new CQE event for this request. This causes duplicate CQEs. Queue the request up for async manually so we don't grab files needlessly and trigger this condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: remove extra ->file checkPavel Begunkov
It won't ever get into io_prep_rw() when req->file haven't been set in io_req_set_file(), hence remove the check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: don't map read/write iovec potentially twiceJens Axboe
If we have a read/write that is deferred, we already setup the async IO context for that request, and mapped it. When we later try and execute the request and we get -EAGAIN, we don't want to attempt to re-map it. If we do, we end up with garbage in the iovec, which typically leads to an -EFAULT or -EINVAL completion. Cc: stable@vger.kernel.org # 5.5 Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: use the proper helpers for io_send/recvJens Axboe
Don't use the recvmsg/sendmsg helpers, use the same helpers that the recv(2) and send(2) system calls use. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: prevent potential eventfd recursion on pollJens Axboe
If we have nested or circular eventfd wakeups, then we can deadlock if we run them inline from our poll waitqueue wakeup handler. It's also possible to have very long chains of notifications, to the extent where we could risk blowing the stack. Check the eventfd recursion count before calling eventfd_signal(). If it's non-zero, then punt the signaling to async context. This is always safe, as it takes us out-of-line in terms of stack and locking context. Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03eventfd: track eventfd_signal() recursion depthJens Axboe
eventfd use cases from aio and io_uring can deadlock due to circular or resursive calling, when eventfd_signal() tries to grab the waitqueue lock. On top of that, it's also possible to construct notification chains that are deep enough that we could blow the stack. Add a percpu counter that tracks the percpu recursion depth, warn if we exceed it. The counter is also exposed so that users of eventfd_signal() can do the right thing if it's non-zero in the context where it is called. Cc: stable@vger.kernel.org # 4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk>