aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2020-02-07Merge tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfsLinus Torvalds
Puyll NFS client updates from Anna Schumaker: "Stable bugfixes: - Fix memory leaks and corruption in readdir # v2.6.37+ - Directory page cache needs to be locked when read # v2.6.37+ New features: - Convert NFS to use the new mount API - Add "softreval" mount option to let clients use cache if server goes down - Add a config option to compile without UDP support - Limit the number of inactive delegations the client can cache at once - Improved readdir concurrency using iterate_shared() Other bugfixes and cleanups: - More 64-bit time conversions - Add additional diagnostic tracepoints - Check for holes in swapfiles, and add dependency on CONFIG_SWAP - Various xprtrdma cleanups to prepare for 5.7's changes - Several fixes for NFS writeback and commit handling - Fix acls over krb5i/krb5p mounts - Recover from premature loss of openstateids - Fix NFS v3 chacl and chmod bug - Compare creds using cred_fscmp() - Use kmemdup_nul() in more places - Optimize readdir cache page invalidation - Lease renewal and recovery fixes" * tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits) NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals NFSv4: try lease recovery on NFS4ERR_EXPIRED NFS: Fix memory leaks nfs: optimise readdir cache page invalidation NFS: Switch readdir to using iterate_shared() NFS: Use kmemdup_nul() in nfs_readdir_make_qstr() NFS: Directory page cache pages need to be locked when read NFS: Fix memory leaks and corruption in readdir SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id() NFS: Replace various occurrences of kstrndup() with kmemdup_nul() NFSv4: Limit the total number of cached delegations NFSv4: Add accounting for the number of active delegations held NFSv4: Try to return the delegation immediately when marked for return on close NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING NFS: nfs_find_open_context() should use cred_fscmp() NFS: nfs_access_get_cached_rcu() should use cred_fscmp() NFSv4: pnfs_roc() must use cred_fscmp() to compare creds NFS: remove unused macros nfs: Return EINVAL rather than ERANGE for mount parse errors ...
2020-02-06Merge tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: - a set of patches that fixes various corner cases in mount and umount code (Xiubo Li). This has to do with choosing an MDS, distinguishing between laggy and down MDSes and parsing the server path. - inode initialization fixes (Jeff Layton). The one included here mostly concerns things like open_by_handle() and there is another one that will come through Al. - copy_file_range() now uses the new copy-from2 op (Luis Henriques). The existing copy-from op turned out to be infeasible for generic filesystem use; we disable the copy offload if OSDs don't support copy-from2. - a patch to link "rbd" and "block" devices together in sysfs (Hannes Reinecke) ... and a smattering of cleanups from Xiubo, Jeff and Chengguang. * tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits) rbd: set the 'device' link in sysfs ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c ceph: print name of xattr in __ceph_{get,set}xattr() douts ceph: print r_direct_hash in hex in __choose_mds() dout ceph: use copy-from2 op in copy_file_range ceph: close holes in structs ceph_mds_session and ceph_mds_request rbd: work around -Wuninitialized warning ceph: allocate the correct amount of extra bytes for the session features ceph: rename get_session and switch to use ceph_get_mds_session ceph: remove the extra slashes in the server path ceph: add possible_max_rank and make the code more readable ceph: print dentry offset in hex and fix xattr_version type ceph: only touch the caps which have the subset mask requested ceph: don't clear I_NEW until inode metadata is fully populated ceph: retry the same mds later after the new session is opened ceph: check availability of mds cluster on mount after wait timeout ceph: keep the session state until it is released ceph: add __send_request helper ceph: ensure we have a new cap before continuing in fill_inode ceph: drop unused ttl_from parameter from fill_inode ...
2020-02-06Merge tag 'xfs-5.6-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull moar xfs updates from Darrick Wong: "This contains the buffer error code refactoring I mentioned last week, now that it has had extra time to complete the full xfs fuzz testing suite to make sure there aren't any obvious new bugs" * tag 'xfs-5.6-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix xfs_buf_ioerror_alert location reporting xfs: remove unnecessary null pointer checks from _read_agf callers xfs: make xfs_*read_agf return EAGAIN to ALLOC_FLAG_TRYLOCK callers xfs: remove the xfs_btree_get_buf[ls] functions xfs: make xfs_trans_get_buf return an error code xfs: make xfs_trans_get_buf_map return an error code xfs: make xfs_buf_read return an error code xfs: make xfs_buf_get_uncached return an error code xfs: make xfs_buf_get return an error code xfs: make xfs_buf_read_map return an error code xfs: make xfs_buf_get_map return an error code xfs: make xfs_buf_alloc return an error code
2020-02-06Merge tag 'trace-v5.6-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace Pull tracing updates from Steven Rostedt: - Added new "bootconfig". This looks for a file appended to initrd to add boot config options, and has been discussed thoroughly at Linux Plumbers. Very useful for adding kprobes at bootup. Only enabled if "bootconfig" is on the real kernel command line. - Created dynamic event creation. Merges common code between creating synthetic events and kprobe events. - Rename perf "ring_buffer" structure to "perf_buffer" - Rename ftrace "ring_buffer" structure to "trace_buffer" Had to rename existing "trace_buffer" to "array_buffer" - Allow trace_printk() to work withing (some) tracing code. - Sort of tracing configs to be a little better organized - Fixed bug where ftrace_graph hash was not being protected properly - Various other small fixes and clean ups * tag 'trace-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (88 commits) bootconfig: Show the number of nodes on boot message tools/bootconfig: Show the number of bootconfig nodes bootconfig: Add more parse error messages bootconfig: Use bootconfig instead of boot config ftrace: Protect ftrace_graph_hash with ftrace_sync ftrace: Add comment to why rcu_dereference_sched() is open coded tracing: Annotate ftrace_graph_notrace_hash pointer with __rcu tracing: Annotate ftrace_graph_hash pointer with __rcu bootconfig: Only load bootconfig if "bootconfig" is on the kernel cmdline tracing: Use seq_buf for building dynevent_cmd string tracing: Remove useless code in dynevent_arg_pair_add() tracing: Remove check_arg() callbacks from dynevent args tracing: Consolidate some synth_event_trace code tracing: Fix now invalid var_ref_vals assumption in trace action tracing: Change trace_boot to use synth_event interface tracing: Move tracing selftests to bottom of menu tracing: Move mmio tracer config up with the other tracers tracing: Move tracing test module configs together tracing: Move all function tracing configs together tracing: Documentation for in-kernel synthetic event API ...
2020-02-06Merge tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Some later fixes for io_uring: - Small cleanup series from Pavel - Belt and suspenders build time check of sqe size and layout (Stefan) - Addition of ->show_fdinfo() on request of Jann Horn, to aid in understanding mapped personalities - eventfd recursion/deadlock fix, for both io_uring and aio - Fixup for send/recv handling - Fixup for double deferral of read/write request - Fix for potential double completion event for close request - Adjust fadvise advice async/inline behavior - Fix for shutdown hang with SQPOLL thread - Fix for potential use-after-free of fixed file table" * tag 'io_uring-5.6-2020-02-05' of git://git.kernel.dk/linux-block: io_uring: cleanup fixed file data table references io_uring: spin for sq thread to idle on shutdown aio: prevent potential eventfd recursion on poll io_uring: put the flag changing code in the same spot io_uring: iterate req cache backwards io_uring: punt even fadvise() WILLNEED to async context io_uring: fix sporadic double CQE entry for close io_uring: remove extra ->file check io_uring: don't map read/write iovec potentially twice io_uring: use the proper helpers for io_send/recv io_uring: prevent potential eventfd recursion on poll eventfd: track eventfd_signal() recursion depth io_uring: add BUILD_BUG_ON() to assert the layout of struct io_uring_sqe io_uring: add ->show_fdinfo() for the io_uring file descriptor
2020-02-05Merge tag 'jfs-5.6' of git://github.com/kleikamp/linux-shaggyLinus Torvalds
Pull jfs update from David Kleikamp: "Trivial cleanup for jfs" * tag 'jfs-5.6' of git://github.com/kleikamp/linux-shaggy: jfs: remove unused MAXL2PAGES
2020-02-05Merge branch 'work.recursive_removal' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs recursive removal updates from Al Viro: "We have quite a few places where synthetic filesystems do an equivalent of 'rm -rf', with varying amounts of code duplication, wrong locking, etc. That really ought to be a library helper. Only debugfs (and very similar tracefs) are converted here - I have more conversions, but they'd never been in -next, so they'll have to wait" * 'work.recursive_removal' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: simple_recursive_removal(): kernel-side rm -rf for ramfs-style filesystems
2020-02-05Merge branch 'imm.timestamp' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs timestamp updates from Al Viro: "More 64bit timestamp work" * 'imm.timestamp' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: kernfs: don't bother with timestamp truncation fs: Do not overload update_time fs: Delete timespec64_trunc() fs: ubifs: Eliminate timespec64_trunc() usage fs: ceph: Delete timespec64_trunc() usage fs: cifs: Delete usage of timespec64_trunc fs: fat: Eliminate timespec64_trunc() usage utimes: Clamp the timestamps in notify_change()
2020-02-04io_uring: cleanup fixed file data table referencesJens Axboe
syzbot reports a use-after-free in io_ring_file_ref_switch() when it tries to switch back to percpu mode. When we put the final reference to the table by calling percpu_ref_kill_and_confirm(), we don't want the zero reference to queue async work for flushing the potentially queued up items. We currently do a few flush_work(), but they merely paper around the issue, since the work item may not have been queued yet depending on the when the percpu-ref callback gets run. Coming into the file unregister, we know we have the ring quiesced. io_ring_file_ref_switch() can check for whether or not the ref is dying or not, and not queue anything async at that point. Once the ref has been confirmed killed, flush any potential items manually. Reported-by: syzbot+7caeaea49c2c8a591e3d@syzkaller.appspotmail.com Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-04io_uring: spin for sq thread to idle on shutdownJens Axboe
As part of io_uring shutdown, we cancel work that is pending and won't necessarily complete on its own. That includes requests like poll commands and timeouts. If we're using SQPOLL for kernel side submission and we shutdown the ring immediately after queueing such work, we can race with the sqthread doing the submission. This means we may miss cancelling some work, which results in the io_uring shutdown hanging forever. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-04NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewalsRobert Milkowski
Currently, each time nfs4_do_fsinfo() is called it will do an implicit NFS4 lease renewal, which is not compliant with the NFS4 specification. This can result in a lease being expired by an NFS server. Commit 83ca7f5ab31f ("NFS: Avoid PUTROOTFH when managing leases") introduced implicit client lease renewal in nfs4_do_fsinfo(), which can result in the NFSv4.0 lease to expire on a server side, and servers returning NFS4ERR_EXPIRED or NFS4ERR_STALE_CLIENTID. This can easily be reproduced by frequently unmounting a sub-mount, then stat'ing it to get it mounted again, which will delay or even completely prevent client from sending RENEW operations if no other NFS operations are issued. Eventually nfs server will expire client's lease and return an error on file access or next RENEW. This can also happen when a sub-mount is automatically unmounted due to inactivity (after nfs_mountpoint_expiry_timeout), then it is mounted again via stat(). This can result in a short window during which client's lease will expire on a server but not on a client. This specific case was observed on production systems. This patch removes the implicit lease renewal from nfs4_do_fsinfo(). Fixes: 83ca7f5ab31f ("NFS: Avoid PUTROOTFH when managing leases") Signed-off-by: Robert Milkowski <rmilkowski@gmail.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04NFSv4: try lease recovery on NFS4ERR_EXPIREDRobert Milkowski
Currently, if an nfs server returns NFS4ERR_EXPIRED to open(), we return EIO to applications without even trying to recover. Fixes: 272289a3df72 ("NFSv4: nfs4_do_handle_exception() handle revoke/expiry of a single stateid") Signed-off-by: Robert Milkowski <rmilkowski@gmail.com> Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04NFS: Fix memory leaksWenwen Wang
In _nfs42_proc_copy(), 'res->commit_res.verf' is allocated through kzalloc() if 'args->sync' is true. In the following code, if 'res->synchronous' is false, handle_async_copy() will be invoked. If an error occurs during the invocation, the following code will not be executed and the error will be returned . However, the allocated 'res->commit_res.verf' is not deallocated, leading to a memory leak. This is also true if the invocation of process_copy_commit() returns an error. To fix the above leaks, redirect the execution to the 'out' label if an error is encountered. Signed-off-by: Wenwen Wang <wenwen@cs.uga.edu> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04nfs: optimise readdir cache page invalidationDai Ngo
When the directory is large and it's being modified by one client while another client is doing the 'ls -l' on the same directory then the cache page invalidation from nfs_force_use_readdirplus causes the reading client to keep restarting READDIRPLUS from cookie 0 which causes the 'ls -l' to take a very long time to complete, possibly never completing. Currently when nfs_force_use_readdirplus is called to switch from READDIR to READDIRPLUS, it invalidates all the cached pages of the directory. This cache page invalidation causes the next nfs_readdir to re-read the directory content from cookie 0. This patch is to optimise the cache invalidation in nfs_force_use_readdirplus by only truncating the cached pages from last page index accessed to the end the file. It also marks the inode to delay invalidating all the cached page of the directory until the next initial nfs_readdir of the next 'ls' instance. Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Reviewed-by: Trond Myklebust <trond.myklebust@hammerspace.com> [Anna - Fix conflicts with Trond's readdir patches] [Anna - Remove redundant call to nfs_zap_mapping()] [Anna - Replace d_inode(file_dentry(desc->file)) with file_inode(desc->file)] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-04Merge tag 'ovl-update-5.6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: - Try to preserve holes in sparse files when copying up, thus saving disk space and improving performance. - Fix a performance regression introduced in v4.19 by preserving asynchronicity of IO when fowarding to underlying layers. Add VFS helpers to submit async iocbs. - Fix a regression in lseek(2) introduced in v4.19 that breaks >2G seeks on 32bit kernels. - Fix a corner case where st_ino/st_dev was not preserved across copy up. - Miscellaneous fixes and cleanups. * tag 'ovl-update-5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix lseek overflow on 32bit ovl: add splice file read write helper ovl: implement async IO routines vfs: add vfs_iocb_iter_[read|write] helper functions ovl: layer is const ovl: fix corner case of non-constant st_dev;st_ino ovl: fix corner case of conflicting lower layer uuid ovl: generalize the lower_fs[] array ovl: simplify ovl_same_sb() helper ovl: generalize the lower_layers[] array ovl: improving copy-up efficiency for big sparse file ovl: use ovl_inode_lock in ovl_llseek() ovl: use pr_fmt auto generate prefix ovl: fix wrong WARN_ON() in ovl_cache_update_ino()
2020-02-04treewide: remove redundant IS_ERR() before error code checkMasahiro Yamada
'PTR_ERR(p) == -E*' is a stronger condition than IS_ERR(p). Hence, IS_ERR(p) is unneeded. The semantic patch that generates this commit is as follows: // <smpl> @@ expression ptr; constant error_code; @@ -IS_ERR(ptr) && (PTR_ERR(ptr) == - error_code) +PTR_ERR(ptr) == - error_code // </smpl> Link: http://lkml.kernel.org/r/20200106045833.1725-1-masahiroy@kernel.org Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Cc: Julia Lawall <julia.lawall@lip6.fr> Acked-by: Stephen Boyd <sboyd@kernel.org> [drivers/clk/clk.c] Acked-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> [GPIO] Acked-by: Wolfram Sang <wsa@the-dreams.de> [drivers/i2c] Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi/scan.c] Acked-by: Rob Herring <robh@kernel.org> Cc: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04proc: convert everything to "struct proc_ops"Alexey Dobriyan
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in seq_file.h. Conversion rule is: llseek => proc_lseek unlocked_ioctl => proc_ioctl xxx => proc_xxx delete ".owner = THIS_MODULE" line [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c] [sfr@canb.auug.org.au: fix kernel/sched/psi.c] Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04proc: decouple proc from VFS with "struct proc_ops"Alexey Dobriyan
Currently core /proc code uses "struct file_operations" for custom hooks, however, VFS doesn't directly call them. Every time VFS expands file_operations hook set, /proc code bloats for no reason. Introduce "struct proc_ops" which contains only those hooks which /proc allows to call into (open, release, read, write, ioctl, mmap, poll). It doesn't contain module pointer as well. Save ~184 bytes per usage: add/remove: 26/26 grow/shrink: 1/4 up/down: 1922/-6674 (-4752) Function old new delta sysvipc_proc_ops - 72 +72 ... config_gz_proc_ops - 72 +72 proc_get_inode 289 339 +50 proc_reg_get_unmapped_area 110 107 -3 close_pdeo 227 224 -3 proc_reg_open 289 284 -5 proc_create_data 60 53 -7 rt_cpu_seq_fops 256 - -256 ... default_affinity_proc_fops 256 - -256 Total: Before=5430095, After=5425343, chg -0.09% Link: http://lkml.kernel.org/r/20191225172228.GA13378@avx2 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04mm: pagewalk: add 'depth' parameter to pte_holeSteven Price
The pte_hole() callback is called at multiple levels of the page tables. Code dumping the kernel page tables needs to know what at what depth the missing entry is. Add this is an extra parameter to pte_hole(). When the depth isn't know (e.g. processing a vma) then -1 is passed. The depth that is reported is the actual level where the entry is missing (ignoring any folding that is in place), i.e. any levels where PTRS_PER_P?D is set to 1 are ignored. Note that depth starts at 0 for a PGD so that PUD/PMD/PTE retain their natural numbers as levels 2/3/4. Link: http://lkml.kernel.org/r/20191218162402.45610-16-steven.price@arm.com Signed-off-by: Steven Price <steven.price@arm.com> Tested-by: Zong Li <zong.li@sifive.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexandre Ghiti <alex@ghiti.fr> Cc: Andy Lutomirski <luto@kernel.org> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David S. Miller <davem@davemloft.net> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Hogan <jhogan@kernel.org> Cc: James Morse <james.morse@arm.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Burton <paul.burton@mips.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04fs/proc/page.c: allow inspection of last section and fix end detectionDavid Hildenbrand
If max_pfn does not fall onto a section boundary, it is possible to inspect PFNs up to max_pfn, and PFNs above max_pfn, however, max_pfn itself can't be inspected. We can have a valid (and online) memmap at and above max_pfn if max_pfn is not aligned to a section boundary. The whole early section has a memmap and is marked online. Being able to inspect the state of these PFNs is valuable for debugging, especially because max_pfn can change on memory hotplug and expose these memmaps. Also, querying page flags via "./page-types -r -a 0x144001," (tools/vm/page-types.c) inside a x86-64 guest with 4160MB under QEMU results in an (almost) endless loop in user space, because the end is not detected properly when starting after max_pfn. Instead, let's allow to inspect all pages in the highest section and return 0 directly if we try to access pages above that section. While at it, check the count before adjusting it, to avoid masking user errors. Link: http://lkml.kernel.org/r/20191211163201.17179-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Bob Picco <bob.picco@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Steven Sistare <steven.sistare@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04ocfs2: fix oops when writing cloned fileGang He
Writing a cloned file triggers a kernel oops and the user-space command process is also killed by the system. The bug can be reproduced stably via: 1) create a file under ocfs2 file system directory. journalctl -b > aa.txt 2) create a cloned file for this file. reflink aa.txt bb.txt 3) write the cloned file with dd command. dd if=/dev/zero of=bb.txt bs=512 count=1 conv=notrunc The dd command is killed by the kernel, then you can see the oops message via dmesg command. [ 463.875404] BUG: kernel NULL pointer dereference, address: 0000000000000028 [ 463.875413] #PF: supervisor read access in kernel mode [ 463.875416] #PF: error_code(0x0000) - not-present page [ 463.875418] PGD 0 P4D 0 [ 463.875425] Oops: 0000 [#1] SMP PTI [ 463.875431] CPU: 1 PID: 2291 Comm: dd Tainted: G OE 5.3.16-2-default [ 463.875433] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 [ 463.875500] RIP: 0010:ocfs2_refcount_cow+0xa4/0x5d0 [ocfs2] [ 463.875505] Code: 06 89 6c 24 38 89 eb f6 44 24 3c 02 74 be 49 8b 47 28 [ 463.875508] RSP: 0018:ffffa2cb409dfce8 EFLAGS: 00010202 [ 463.875512] RAX: ffff8b1ebdca8000 RBX: 0000000000000001 RCX: ffff8b1eb73a9df0 [ 463.875515] RDX: 0000000000056a01 RSI: 0000000000000000 RDI: 0000000000000000 [ 463.875517] RBP: 0000000000000001 R08: ffff8b1eb73a9de0 R09: 0000000000000000 [ 463.875520] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 [ 463.875522] R13: ffff8b1eb922f048 R14: 0000000000000000 R15: ffff8b1eb922f048 [ 463.875526] FS: 00007f8f44d15540(0000) GS:ffff8b1ebeb00000(0000) knlGS:0000000000000000 [ 463.875529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 463.875532] CR2: 0000000000000028 CR3: 000000003c17a000 CR4: 00000000000006e0 [ 463.875546] Call Trace: [ 463.875596] ? ocfs2_inode_lock_full_nested+0x18b/0x960 [ocfs2] [ 463.875648] ocfs2_file_write_iter+0xaf8/0xc70 [ocfs2] [ 463.875672] new_sync_write+0x12d/0x1d0 [ 463.875688] vfs_write+0xad/0x1a0 [ 463.875697] ksys_write+0xa1/0xe0 [ 463.875710] do_syscall_64+0x60/0x1f0 [ 463.875743] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 463.875758] RIP: 0033:0x7f8f4482ed44 [ 463.875762] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 80 00 00 00 [ 463.875765] RSP: 002b:00007fff300a79d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 463.875769] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f8f4482ed44 [ 463.875771] RDX: 0000000000000200 RSI: 000055f771b5c000 RDI: 0000000000000001 [ 463.875774] RBP: 0000000000000200 R08: 00007f8f44af9c78 R09: 0000000000000003 [ 463.875776] R10: 000000000000089f R11: 0000000000000246 R12: 000055f771b5c000 [ 463.875779] R13: 0000000000000200 R14: 0000000000000000 R15: 000055f771b5c000 This regression problem was introduced by commit e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Link: http://lkml.kernel.org/r/20200121050153.13290-1-ghe@suse.com Fixes: e74540b28556 ("ocfs2: protect extent tree in ocfs2_prepare_inode_for_write()"). Signed-off-by: Gang He <ghe@suse.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-03aio: prevent potential eventfd recursion on pollJens Axboe
If we have nested or circular eventfd wakeups, then we can deadlock if we run them inline from our poll waitqueue wakeup handler. It's also possible to have very long chains of notifications, to the extent where we could risk blowing the stack. Check the eventfd recursion count before calling eventfd_signal(). If it's non-zero, then punt the signaling to async context. This is always safe, as it takes us out-of-line in terms of stack and locking context. Cc: stable@vger.kernel.org # 4.19+ Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: put the flag changing code in the same spotPavel Begunkov
Both iocb_flags() and kiocb_set_rw_flags() are inline and modify kiocb->ki_flags. Place them close, so they can be potentially better optimised. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: iterate req cache backwardsPavel Begunkov
Grab requests from cache-array from the end, so can get by only free_reqs. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: punt even fadvise() WILLNEED to async contextJens Axboe
Andres correctly points out that read-ahead can block, if it needs to read in meta data (or even just through the page cache page allocations). Play it safe for now and just ensure WILLNEED is also punted to async context. While in there, allow the file settings hints from non-blocking context. They don't need to start/do IO, and we can safely do them inline. Fixes: 4840e418c2fc ("io_uring: add IORING_OP_FADVISE") Reported-by: Andres Freund <andres@anarazel.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: fix sporadic double CQE entry for closeJens Axboe
We punt close to async for the final fput(), but we log the completion even before that even in that case. We rely on the request not having a files table assigned to detect what the final async close should do. However, if we punt the async queue to __io_queue_sqe(), we'll get ->files assigned and this makes io_close_finish() think it should both close the filp again (which does no harm) AND log a new CQE event for this request. This causes duplicate CQEs. Queue the request up for async manually so we don't grab files needlessly and trigger this condition. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: remove extra ->file checkPavel Begunkov
It won't ever get into io_prep_rw() when req->file haven't been set in io_req_set_file(), hence remove the check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: don't map read/write iovec potentially twiceJens Axboe
If we have a read/write that is deferred, we already setup the async IO context for that request, and mapped it. When we later try and execute the request and we get -EAGAIN, we don't want to attempt to re-map it. If we do, we end up with garbage in the iovec, which typically leads to an -EFAULT or -EINVAL completion. Cc: stable@vger.kernel.org # 5.5 Reported-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: use the proper helpers for io_send/recvJens Axboe
Don't use the recvmsg/sendmsg helpers, use the same helpers that the recv(2) and send(2) system calls use. Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03io_uring: prevent potential eventfd recursion on pollJens Axboe
If we have nested or circular eventfd wakeups, then we can deadlock if we run them inline from our poll waitqueue wakeup handler. It's also possible to have very long chains of notifications, to the extent where we could risk blowing the stack. Check the eventfd recursion count before calling eventfd_signal(). If it's non-zero, then punt the signaling to async context. This is always safe, as it takes us out-of-line in terms of stack and locking context. Cc: stable@vger.kernel.org # 5.1+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03eventfd: track eventfd_signal() recursion depthJens Axboe
eventfd use cases from aio and io_uring can deadlock due to circular or resursive calling, when eventfd_signal() tries to grab the waitqueue lock. On top of that, it's also possible to construct notification chains that are deep enough that we could blow the stack. Add a percpu counter that tracks the percpu recursion depth, warn if we exceed it. The counter is also exposed so that users of eventfd_signal() can do the right thing if it's non-zero in the context where it is called. Cc: stable@vger.kernel.org # 4.19+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-02-03NFS: Switch readdir to using iterate_shared()Trond Myklebust
Now that the page cache locking is repaired, we should be able to switch to using iterate_shared() for improved concurrency when doing readdir(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()Trond Myklebust
The directory strings stored in the readdir cache may be used with printk(), so it is better to ensure they are nul-terminated. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFS: Directory page cache pages need to be locked when readTrond Myklebust
When a NFS directory page cache page is removed from the page cache, its contents are freed through a call to nfs_readdir_clear_array(). To prevent the removal of the page cache entry until after we've finished reading it, we must take the page lock. Fixes: 11de3b11e08c ("NFS: Fix a memory leak in nfs_readdir") Cc: stable@vger.kernel.org # v2.6.37+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFS: Fix memory leaks and corruption in readdirTrond Myklebust
nfs_readdir_xdr_to_array() must not exit without having initialised the array, so that the page cache deletion routines can safely call nfs_readdir_clear_array(). Furthermore, we should ensure that if we exit nfs_readdir_filler() with an error, we free up any page contents to prevent a leak if we try to fill the page again. Fixes: 11de3b11e08c ("NFS: Fix a memory leak in nfs_readdir") Cc: stable@vger.kernel.org # v2.6.37+ Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFS: Replace various occurrences of kstrndup() with kmemdup_nul()Trond Myklebust
When we already know the string length, it is more efficient to use kmemdup_nul(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> [Anna - Changes to super.c were already made during fscontext conversion] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFSv4: Limit the total number of cached delegationsTrond Myklebust
Delegations can be expensive to return, and can cause scalability issues for the server. Let's therefore try to limit the number of inactive delegations we hold. Once the number of delegations is above a certain threshold, start to return them on close. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFSv4: Add accounting for the number of active delegations heldTrond Myklebust
In order to better manage our delegation caching, add a counter to track the number of active delegations. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFSv4: Try to return the delegation immediately when marked for return on closeTrond Myklebust
Add a routine to return the delegation immediately upon close of the file if it was marked for return-on-close. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returnedTrond Myklebust
If a delegation is marked as needing to be returned when the file is closed, then don't clear that marking until we're ready to return it. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNINGTrond Myklebust
In particular, the pnfs return-on-close code will check for that flag, so ensure we set it appropriately. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFS: nfs_find_open_context() should use cred_fscmp()Trond Myklebust
We want to find open contexts that match our filesystem access properties. They don't have to exactly match the cred. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFS: nfs_access_get_cached_rcu() should use cred_fscmp()Trond Myklebust
We do not need to have the rcu lookup method fail in the case where the fsuid/fsgid and supplemental groups match. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03NFSv4: pnfs_roc() must use cred_fscmp() to compare credsTrond Myklebust
When comparing two 'struct cred' for equality w.r.t. behaviour under filesystem access, we need to use cred_fscmp(). Fixes: a52458b48af1 ("NFS/NFSD/SUNRPC: replace generic creds with 'struct cred'.") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03Merge tag 'for-5.6-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs updates from David Sterba: "Fixes that arrived after the merge window freeze, mostly stable material. - fix race in tree-mod-log element tracking - fix bio flushing inside extent writepages - fix assertion when in-memory tracking of discarded extents finds an empty tree (eg. after adding a new device) - update logic of temporary read-only block groups to take into account overcommit - fix some fixup worker corner cases: - page could not go through proper COW cycle and the dirty status is lost due to page migration - deadlock if delayed allocation is performed under page lock - fix send emitting invalid clones within the same file - fix statfs reporting 0 free space when global block reserve size is larger than remaining free space but there is still space for new chunks" * tag 'for-5.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: do not zero f_bavail if we have available space Btrfs: send, fix emission of invalid clone operations within the same file btrfs: do not do delalloc reservation under page lock btrfs: drop the -EBUSY case in __extent_writepage_io Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker btrfs: take overcommit into account in inc_block_group_ro btrfs: fix force usage in inc_block_group_ro btrfs: Correctly handle empty trees in find_first_clear_extent_bit btrfs: flush write bio if we loop in extent_write_cache_pages Btrfs: fix race between adding and putting tree mod seq elements and nodes
2020-02-03NFS: remove unused macrosAlex Shi
MNT_fhs_status_sz/MNT_fhandle3_sz are never used after they were introduced. So better to remove them. Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Anna Schumaker <anna.schumaker@netapp.com> Cc: linux-nfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03ovl: fix lseek overflow on 32bitMiklos Szeredi
ovl_lseek() is using ssize_t to return the value from vfs_llseek(). On a 32-bit kernel ssize_t is a 32-bit signed int, which overflows above 2 GB. Assign the return value of vfs_llseek() to loff_t to fix this. Reported-by: Boris Gjenero <boris.gjenero@gmail.com> Fixes: 9e46b840c705 ("ovl: support stacked SEEK_HOLE/SEEK_DATA") Cc: <stable@vger.kernel.org> # v4.19 Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-02-02btrfs: do not zero f_bavail if we have available spaceJosef Bacik
There was some logic added a while ago to clear out f_bavail in statfs() if we did not have enough free metadata space to satisfy our global reserve. This was incorrect at the time, however didn't really pose a problem for normal file systems because we would often allocate chunks if we got this low on free metadata space, and thus wouldn't really hit this case unless we were actually full. Fast forward to today and now we are much better about not allocating metadata chunks all of the time. Couple this with d792b0f19711 ("btrfs: always reserve our entire size for the global reserve") which now means we'll easily have a larger global reserve than our free space, we are now more likely to trip over this while still having plenty of space. Fix this by skipping this logic if the global rsv's space_info is not full. space_info->full is 0 unless we've attempted to allocate a chunk for that space_info and that has failed. If this happens then the space for the global reserve is definitely sacred and we need to report b_avail == 0, but before then we can just use our calculated b_avail. Reported-by: Martin Steigerwald <martin@lichtvoll.de> Fixes: ca8a51b3a979 ("btrfs: statfs: report zero available if metadata are exhausted") CC: stable@vger.kernel.org # 4.5+ Reviewed-by: Qu Wenruo <wqu@suse.com> Tested-By: Martin Steigerwald <martin@lichtvoll.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-01Merge tag '5.6-rc-small-smb3-fix-for-stable' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull cifs fix from Steve French: "Small SMB3 fix for stable (fixes problem with soft mounts)" * tag '5.6-rc-small-smb3-fix-for-stable' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module version number cifs: fix soft mounts hanging in the reconnect code
2020-02-01vfs: fix do_last() regressionAl Viro
Brown paperbag time: fetching ->i_uid/->i_mode really should've been done from nd->inode. I even suggested that, but the reason for that has slipped through the cracks and I went for dir->d_inode instead - made for more "obvious" patch. Analysis: - at the entry into do_last() and all the way to step_into(): dir (aka nd->path.dentry) is known not to have been freed; so's nd->inode and it's equal to dir->d_inode unless we are already doomed to -ECHILD. inode of the file to get opened is not known. - after step_into(): inode of the file to get opened is known; dir might be pointing to freed memory/be negative/etc. - at the call of may_create_in_sticky(): guaranteed to be out of RCU mode; inode of the file to get opened is known and pinned; dir might be garbage. The last was the reason for the original patch. Except that at the do_last() entry we can be in RCU mode and it is possible that nd->path.dentry->d_inode has already changed under us. In that case we are going to fail with -ECHILD, but we need to be careful; nd->inode is pointing to valid struct inode and it's the same as nd->path.dentry->d_inode in "won't fail with -ECHILD" case, so we should use that. Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" <tommi.t.rantala@nokia.com> Reported-by: syzbot+190005201ced78a74ad6@syzkaller.appspotmail.com Wearing-brown-paperbag: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org Fixes: d0cb50185ae9 ("do_last(): fetch directory ->i_mode and ->i_uid before it's too late") Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>