path: root/fs
Age | Commit message | Author
2022-03-29 | fs: fix fd table size alignment properly | Linus Torvalds
Jason Donenfeld reports that my commit 1c24a186398f ("fs: fd tables have to be multiples of BITS_PER_LONG") doesn't work, and the reason is an embarrassing brown-paper-bag bug.

Yes, we want to align the number of fds to BITS_PER_LONG, and yes, the reason they might not be aligned is because the incoming 'max_fd' argument might not be aligned.

But aligning the argument - while simple - will cause an "infinitely big" maxfd (eg NR_OPEN_MAX) to just overflow to zero. Which most definitely isn't what we want either.

The obvious fix was always just to do the alignment last, but I had moved it earlier just to make the patch smaller and the code look simpler. Duh. It certainly made _me_ look simple.

Fixes: 1c24a186398f ("fs: fd tables have to be multiples of BITS_PER_LONG")
Reported-and-tested-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Fedor Pchelkin <aissur0002@gmail.com>
Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
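The overflow described above can be reproduced in userspace. The following is a minimal sketch, not the kernel code: ALIGN_UP is a stand-in for the kernel's ALIGN(), UINT_MAX stands in for an "infinitely big" limit, and size_align_first() / size_align_last() are hypothetical helpers that only model the order of the clamp and the alignment.

```c
#include <limits.h>

/* Userspace stand-ins for the kernel's BITS_PER_LONG and ALIGN(). */
#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((a) - 1))

/* Hypothetical model of the buggy order: align the limit, then clamp. */
static unsigned int size_align_first(unsigned int open_files,
				     unsigned int max_fds)
{
	/* For max_fds == UINT_MAX the addition wraps and the result is 0. */
	max_fds = ALIGN_UP(max_fds, BITS_PER_LONG);
	return open_files < max_fds ? open_files : max_fds;
}

/* Hypothetical model of the fixed order: clamp first, align last. */
static unsigned int size_align_last(unsigned int open_files,
				    unsigned int max_fds)
{
	unsigned int count = open_files < max_fds ? open_files : max_fds;

	return ALIGN_UP(count, BITS_PER_LONG);
}
```

With 100 open files and an unlimited maximum, the align-first variant yields a table size of zero, while the align-last variant rounds 100 up to the next multiple of BITS_PER_LONG.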
2022-03-29 | Merge tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs | Linus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Features: - Switch NFS to use readahead instead of the obsolete readpages. - Readdir fixes to improve cacheability of large directories when there are multiple readers and writers. - Readdir performance improvements when doing a seekdir() immediately after opening the directory (common when re-exporting NFS). - NFS swap improvements from Neil Brown. - Loosen up memory allocation to permit direct reclaim and write back in cases where there is no danger of deadlocking the writeback code or NFS swap. - Avoid sillyrename when the NFSv4 server claims to support the necessary features to recover the unlinked but open file after reboot. Bugfixes: - Patch from Olga to add a mount option to control NFSv4.1 session trunking discovery, and default it to being off. - Fix a lockup in nfs_do_recoalesce(). - Two fixes for list iterator variables being used when pointing to the list head. - Fix a kernel memory scribble when reading from a non-socket transport in /sys/kernel/sunrpc. - Fix a race where reconnecting to a server could leave the TCP socket stuck forever in the connecting state. - Patch from Neil to fix a shutdown race which can leave the SUNRPC transport timer primed after we free the struct xprt itself. - Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy offload. - Sunrpc patch from Olga to avoid resending a task on an offlined transport. 
Cleanups: - Patches from Dave Wysochanski to clean up the fscache code" * tag 'nfs-for-5.18-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (91 commits) NFSv4/pNFS: Fix another issue with a list iterator pointing to the head NFS: Don't loop forever in nfs_do_recoalesce() SUNRPC: Don't return error values in sysfs read of closed files SUNRPC: Do not dereference non-socket transports in sysfs NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error SUNRPC don't resend a task on an offlined transport NFS: replace usage of found with dedicated list iterator variable SUNRPC: avoid race between mod_timer() and del_timer_sync() pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod NFS: Avoid writeback threads getting stuck in mempool_alloc() NFS: nfsiod should not block forever in mempool_alloc() SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent SUNRPC: Fix unx_lookup_cred() allocation NFS: Fix memory allocation in rpc_alloc_task() NFS: Fix memory allocation in rpc_malloc() SUNRPC: Improve accuracy of socket ENOBUFS determination SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE SUNRPC: Fix socket waits for write buffer space ...
2022-03-29 | Merge tag 'jfs-5.18' of https://github.com/kleikamp/linux-shaggy | Linus Torvalds
Pull jfs updates from Dave Kleikamp:
 "A couple bug fixes"

* tag 'jfs-5.18' of https://github.com/kleikamp/linux-shaggy:
  jfs: prevent NULL deref in diFree
  jfs: fix divide error in dbNextAG
2022-03-29 | fs: fd tables have to be multiples of BITS_PER_LONG | Linus Torvalds
This has always been the rule: fdtables have several bitmaps in them, and as a result they have to be sized properly for bitmaps. We walk those bitmaps in chunks of 'unsigned long' in several cases, but even when we don't, we use the regular kernel bitops that are defined to work on arrays of 'unsigned long', not on some byte array.

Now, the distinction between arrays of bytes and 'unsigned long' normally only really ends up being noticeable on big-endian systems, but Fedor Pchelkin and Alexey Khoroshilov reported that copy_fd_bitmaps() could be called with an argument that wasn't even a multiple of BITS_PER_BYTE. And then it fails to do the proper copy even on little-endian machines.

The bug wasn't in copy_fd_bitmaps(), but in sane_fdtable_size(), which didn't actually sanitize the fdtable size sufficiently, and never made sure it had the proper BITS_PER_LONG alignment.

That's partly because the alignment historically came not from having to explicitly align things, but simply from previous fdtable sizes, and from count_open_files(), which counts the file descriptors by walking them one 'unsigned long' word at a time and thus naturally ends up doing sizing in the proper 'chunks of unsigned long'.

But with the introduction of close_range(), we now have an external source of "this is how many files we want to have", and so sane_fdtable_size() needs to do a better job.

This also adds that explicit alignment to alloc_fdtable(), although there it is mainly just for documentation at a source code level. The arithmetic we do there to pick a reasonable fdtable size already aligns the result sufficiently.

In fact, clang notices that the added ALIGN() in that function doesn't actually do anything, and does not generate any extra code for it.

It turns out that gcc ends up confusing itself by combining a previous constant-sized shift operation with the variable-sized shift operations in roundup_pow_of_two(). And probably due to that doesn't notice that the ALIGN() is a no-op. But that's a (tiny) gcc misfeature that doesn't matter. Having the explicit alignment makes sense, and would actually matter on a 128-bit architecture if we ever go there.

This also adds big comments above both functions about how fdtable sizes have to have that BITS_PER_LONG alignment.

Fixes: 60997c3d45d9 ("close_range: add CLOSE_RANGE_UNSHARE")
Reported-by: Fedor Pchelkin <aissur0002@gmail.com>
Reported-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Link: https://lore.kernel.org/all/20220326114009.1690-1-aissur0002@gmail.com/
Tested-and-acked-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
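To illustrate why a bit count that is not even a multiple of BITS_PER_BYTE loses data, here is a small userspace sketch. It assumes a copy that converts the bit count to a byte count by integer division, which is how the message describes the failure; copy_bitmap_bytes(), bit_is_set() and align_to_long() are hypothetical names, not kernel functions.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define BITS_PER_BYTE 8
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Copy the first nbits of a bitmap using a byte count, as a memcpy would. */
static void copy_bitmap_bytes(unsigned long *dst, const unsigned long *src,
			      size_t nbits)
{
	/* Integer division drops the trailing bits when nbits % 8 != 0. */
	memcpy(dst, src, nbits / BITS_PER_BYTE);
}

/* Test one bit in an array of unsigned long, word by word. */
static int bit_is_set(const unsigned long *map, size_t nr)
{
	return (int)((map[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG)) & 1UL);
}

/* Round a bit count up to a whole number of unsigned long words. */
static size_t align_to_long(size_t nbits)
{
	return (nbits + BITS_PER_LONG - 1) & ~(BITS_PER_LONG - 1);
}
```

Copying 12 bits this way moves a single byte, so a descriptor recorded at bit 9 is lost; copying align_to_long(12) bits moves the whole word and keeps it.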
2022-03-28 | Merge tag 'ptrace-cleanups-for-v5.18' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull ptrace cleanups from Eric Biederman:
 "This set of changes removes tracehook.h, moves modification of all of the ptrace fields inside of siglock to remove races, and adds a missing permission check to ptrace.c.

  The removal of tracehook.h is quite significant as it has been a major source of confusion in recent years. Much of that confusion was around task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled, making the semantics clearer).

  For people who don't know, tracehook.h is a vestige of an attempt to implement uprobes-like functionality that was never fully merged, and was later superseded by uprobes when uprobes was merged. For many years now we have been removing tracehook functionality a little bit at a time, to the point where anything left in tracehook.h was some weird strange thing that was difficult to understand"

* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  ptrace: Remove duplicated include in ptrace.c
  ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
  ptrace: Return the signal to continue with from ptrace_stop
  ptrace: Move setting/clearing ptrace_message into ptrace_stop
  tracehook: Remove tracehook.h
  resume_user_mode: Move to resume_user_mode.h
  resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
  signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
  task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
  task_work: Call tracehook_notify_signal from get_signal on all architectures
  task_work: Introduce task_work_pending
  task_work: Remove unnecessary include from posix_timers.h
  ptrace: Remove tracehook_signal_handler
  ptrace: Remove arch_syscall_{enter,exit}_tracehook
  ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
  ptrace/arm: Rename tracehook_report_syscall report_syscall
  ptrace: Move ptrace_report_syscall into ptrace.h
2022-03-28 | Merge tag 'driver-core-5.18-rc1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here is the set of driver core changes for 5.18-rc1. Not much here, primarily it was a bunch of cleanups and small updates: - kobj_type cleanups for default_groups - documentation updates - firmware loader minor changes - component common helper added and take advantage of it in many drivers (the largest part of this pull request). All of these have been in linux-next for a while with no reported problems" * tag 'driver-core-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (54 commits) Documentation: update stable review cycle documentation drivers/base/dd.c : Remove the initial value of the global variable Documentation: update stable tree link Documentation: add link to stable release candidate tree devres: fix typos in comments Documentation: add note block surrounding security patch note samples/kobject: Use sysfs_emit instead of sprintf base: soc: Make soc_device_match() simpler and easier to read driver core: dd: fix return value of __setup handler driver core: Refactor sysfs and drv/bus remove hooks driver core: Refactor multiple copies of device cleanup scripts: get_abi.pl: Fix typo in help message kernfs: fix typos in comments kernfs: remove unneeded #if 0 guard ALSA: hda/realtek: Make use of the helper component_compare_dev_name video: omapfb: dss: Make use of the helper component_compare_dev power: supply: ab8500: Make use of the helper component_compare_dev ASoC: codecs: wcd938x: Make use of the helper component_compare/release_of iommu/mediatek: Make use of the helper component_compare/release_of drm: of: Make use of the helper component_release_of ...
2022-03-28 | NFSv4/pNFS: Fix another issue with a list iterator pointing to the head | Trond Myklebust
In nfs4_callback_devicenotify(), if we don't find a matching entry for the deviceid, we're left with a pointer to 'struct nfs_server' that actually points to the list of super blocks associated with our struct nfs_client. Furthermore, even if we have a valid pointer, nothing pins the super block, and so the struct nfs_server could end up getting freed while we're using it. Since all we want is a pointer to the struct pnfs_layoutdriver_type, let's skip all the iteration over super blocks, and just use APIs to find the layout driver directly. Reported-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Fixes: 1be5683b03a7 ("pnfs: CB_NOTIFY_DEVICEID") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-27 | Merge tag 'x86_core_for_5.18_rc1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 CET-IBT (Control-Flow-Integrity) support from Peter Zijlstra: "Add support for Intel CET-IBT, available since Tigerlake (11th gen), which is a coarse grained, hardware based, forward edge Control-Flow-Integrity mechanism where any indirect CALL/JMP must target an ENDBR instruction or suffer #CP. Additionally, since Alderlake (12th gen)/Sapphire-Rapids, speculation is limited to 2 instructions (and typically fewer) on branch targets not starting with ENDBR. CET-IBT also limits speculation of the next sequential instruction after the indirect CALL/JMP [1]. CET-IBT is fundamentally incompatible with retpolines, but provides, as described above, speculation limits itself" [1] https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/technical-documentation/branch-history-injection.html * tag 'x86_core_for_5.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits) kvm/emulate: Fix SETcc emulation for ENDBR x86/Kconfig: Only allow CONFIG_X86_KERNEL_IBT with ld.lld >= 14.0.0 x86/Kconfig: Only enable CONFIG_CC_HAS_IBT for clang >= 14.0.0 kbuild: Fixup the IBT kbuild changes x86/Kconfig: Do not allow CONFIG_X86_X32_ABI=y with llvm-objcopy x86: Remove toolchain check for X32 ABI capability x86/alternative: Use .ibt_endbr_seal to seal indirect calls objtool: Find unused ENDBR instructions objtool: Validate IBT assumptions objtool: Add IBT/ENDBR decoding objtool: Read the NOENDBR annotation x86: Annotate idtentry_df() x86,objtool: Move the ASM_REACHABLE annotation to objtool.h x86: Annotate call_on_stack() objtool: Rework ASM_REACHABLE x86: Mark __invalid_creds() __noreturn exit: Mark do_group_exit() __noreturn x86: Mark stop_this_cpu() __noreturn objtool: Ignore extra-symbol code objtool: Rename --duplicate to --lto ...
2022-03-26 | Merge tag 'write-page-prefaulting' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull iomap fixlet from Andreas Gruenbacher: "Fix buffered write page prefaulting. I forgot to send it the previous merge window. I've only improved the patch description since" * tag 'write-page-prefaulting' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: fs/iomap: Fix buffered write page prefaulting
2022-03-26 | Merge tag 'for-5.18/alloc-cleanups-2022-03-25' of ↵ | Linus Torvalds
git://git.kernel.dk/linux-block Pull bio allocation fix from Jens Axboe: "We got some reports of users seeing: Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x1192888 which is a regression caused by the bio allocation cleanups" * tag 'for-5.18/alloc-cleanups-2022-03-25' of git://git.kernel.dk/linux-block: fs: do not pass __GFP_HIGHMEM to bio_alloc in do_mpage_readpage
2022-03-26 | Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block | Linus Torvalds
Pull NVMe write streams removal from Jens Axboe:
 "This removes the write streams support in NVMe. No vendor ever really shipped working support for this, and they are not interested in supporting it.

  With the NVMe support gone, we have nothing in the tree that supports this. Remove passing around of the hints.

  The only discussion point in this patchset imho is the fact that the file specific write hint setting/getting fcntl helpers will now return -1/EINVAL like they did before we supported write hints. No known applications use these functions; I only know of one prototype that I helped do for RocksDB, and that's not used. That said, with a change like this, it's always a bit controversial. Alternatively, we could just make them return 0 and pretend it worked. It's placement based hints after all"

* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
  fs: remove fs.f_write_hint
  fs: remove kiocb.ki_hint
  block: remove the per-bio/request write hint
  nvme: remove support or stream based temperature hint
2022-03-25 | NFS: Don't loop forever in nfs_do_recoalesce() | Trond Myklebust
If __nfs_pageio_add_request() fails to add the request, it will return with either desc->pg_error < 0, or mirror->pg_recoalesce will be set, so we are guaranteed either to exit the function altogether, or to loop. However if there is nothing left in mirror->pg_list to coalesce, we must exit, so make sure that we clear mirror->pg_recoalesce every time we loop. Reported-by: Olga Kornievskaia <aglo@umich.edu> Fixes: 70536bf4eb07 ("NFS: Clean up reset of the mirror accounting variables") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
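The control-flow problem can be sketched with a toy model. This is not the NFS code: struct mirror_model, its fields and recoalesce_passes() are hypothetical, and the max_passes guard exists only so the stale-flag case terminates in a demonstration.

```c
#include <stdbool.h>

/* Toy stand-in for the per-mirror state: a work count and a retry flag. */
struct mirror_model {
	int pending;		/* requests still waiting to be coalesced */
	bool recoalesce;	/* "go around the loop again" flag */
};

/*
 * Returns the number of passes taken. When clear_flag_each_pass is false,
 * a flag left set from earlier keeps the loop going even though nothing
 * is pending; max_passes bounds that case for the sake of the example.
 */
static int recoalesce_passes(struct mirror_model *m,
			     bool clear_flag_each_pass, int max_passes)
{
	int passes = 0;

	do {
		if (clear_flag_each_pass)
			m->recoalesce = false;
		m->pending = 0;	/* model: everything on the list is handled */
		passes++;
	} while (m->recoalesce && passes < max_passes);

	return passes;
}
```

With the flag cleared at the top of every pass, the loop exits after one pass once nothing remains; without it, only the artificial guard stops it.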
2022-03-25 | Merge tag 'fs_for_v5.18-rc1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull reiserfs updates from Jan Kara: "The biggest change in this pull is the addition of a deprecation message about reiserfs with the outlook that we'd eventually be able to remove it from the kernel. Because it is practically unmaintained and untested and odd enough that people don't want to bother with it anymore... Otherwise there are small udf and ext2 fixes" * tag 'fs_for_v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: udf: remove redundant assignment of variable etype reiserfs: Deprecate reiserfs ext2: correct max file size computing reiserfs: get rid of AOP_FLAG_CONT_EXPAND flag
2022-03-25 | Merge tag 'fsnotify_for_v5.18-rc1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify updates from Jan Kara: "A few fsnotify improvements and cleanups" * tag 'fsnotify_for_v5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: fsnotify: remove redundant parameter judgment fsnotify: optimize FS_MODIFY events with no ignored masks fsnotify: fix merge with parent's ignored mask
2022-03-25 | Merge tag 'kbuild-gnu11-v5.18' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild update for C11 language base from Masahiro Yamada: "Kbuild -std=gnu11 updates for v5.18 Linus pointed out the benefits of C99 some years ago, especially variable declarations in loops [1]. At that time, we were not ready for the migration due to old compilers. Recently, Jakob Koschel reported a bug in list_for_each_entry(), which leaks the invalid pointer out of the loop [2]. In the discussion, we agreed that the time had come. Now that GCC 5.1 is the minimum compiler version, there is nothing to prevent us from going to -std=gnu99, or even straight to -std=gnu11. Discussions for a better list iterator implementation are ongoing, but this patch set must land first" [1] https://lore.kernel.org/all/CAHk-=wgr12JkKmRd21qh-se-_Gs69kbPgR9x4C+Es-yJV2GLkA@mail.gmail.com/ [2] https://lore.kernel.org/lkml/86C4CE7D-6D93-456B-AA82-F8ADEACA40B7@gmail.com/ * tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: Kbuild: use -std=gnu11 for KBUILD_USERCFLAGS Kbuild: move to -std=gnu11 Kbuild: use -Wdeclaration-after-statement Kbuild: add -Wno-shift-negative-value where -Wextra is used
2022-03-25 | fs/iomap: Fix buffered write page prefaulting | Andreas Gruenbacher
When part of the user buffer passed to generic_perform_write() or iomap_file_buffered_write() cannot be faulted in for reading, the entire write currently fails. The correct behavior would be to write all the data that can be written, up to the point of failure. Commit a6294593e8a1 ("iov_iter: Turn iov_iter_fault_in_readable into fault_in_iov_iter_readable") gave us the information needed, so fix the page prefaulting in generic_perform_write() and iomap_write_iter() to only bail out when no pages could be faulted in. We already factor in that pages that are faulted in may no longer be resident by the time they are accessed. Paging out pages has the same effect as not faulting in those pages in the first place, so the code can already deal with that. Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
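The change in the bail-out condition can be sketched as two predicates. This is a simplified model, assuming the prefault helper reports how many of the requested bytes could not be faulted in; prefault_check_old() and prefault_check_new() are hypothetical names, not functions from the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Old behavior (model): any shortfall in prefaulting fails the write. */
static long prefault_check_old(size_t bytes, size_t not_faulted)
{
	(void)bytes;
	return not_faulted ? -EFAULT : 0;
}

/* New behavior (model): fail only when nothing at all could be faulted in. */
static long prefault_check_new(size_t bytes, size_t not_faulted)
{
	return not_faulted == bytes ? -EFAULT : 0;
}
```

For a 4096-byte chunk with 1000 bytes not faulted in, the old check gives up while the new one lets the copy proceed and write what it can.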
2022-03-24 | Merge tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-client | Linus Torvalds
Pull ceph updates from Ilya Dryomov: "The highlights are: - several changes to how snap context and snap realms are tracked (Xiubo Li). In particular, this should resolve a long-standing issue of high kworker CPU usage and various stalls caused by needless iteration over all inodes in the snap realm. - async create fixes to address hangs in some edge cases (Jeff Layton) - support for getvxattr MDS op for querying server-side xattrs, such as file/directory layouts and ephemeral pins (Milind Changire) - average latency is now maintained for all metrics (Venky Shankar) - some tweaks around handling inline data to make it fit better with netfs helper library (David Howells) Also a couple of memory leaks got plugged along with a few assorted fixups. Last but not least, Xiubo has stepped up to serve as a CephFS co-maintainer" * tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-client: (27 commits) ceph: fix memory leak in ceph_readdir when note_last_dentry returns error ceph: uninitialized variable in debug output ceph: use tracked average r/w/m latencies to display metrics in debugfs ceph: include average/stdev r/w/m latency in mds metrics ceph: track average r/w/m latency ceph: use ktime_to_timespec64() rather than jiffies_to_timespec64() ceph: assign the ci only when the inode isn't NULL ceph: fix inode reference leakage in ceph_get_snapdir() ceph: misc fix for code style and logs ceph: allocate capsnap memory outside of ceph_queue_cap_snap() ceph: do not release the global snaprealm until unmounting ceph: remove incorrect and unused CEPH_INO_DOTDOT macro MAINTAINERS: add Xiubo Li as cephfs co-maintainer ceph: eliminate the recursion when rebuilding the snap context ceph: do not update snapshot context when there is no new snapshot ceph: zero the dir_entries memory when allocating it ceph: move to a dedicated slabcache for ceph_cap_snap ceph: add getvxattr op libceph: drop else branches in prepare_read_data{,_cont} ceph: fix comments mentioning i_mutex ...
2022-03-24 | Merge tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux | Linus Torvalds
Pull xfs updates from Darrick Wong:
 "The biggest change this cycle is bringing XFS' inode attribute setting code back towards alignment with what the VFS does. IOWs, setgid bit handling should be a closer match with ext4 and btrfs behavior.

  The rest of the branch is bug fixes around the filesystem -- patching gaps in quota enforcement, removing bogus selinux audit messages, and fixing log corruption and problems with log recovery.

  There will be a second pull request later on in the merge window with more bug fixes.

  Dave Chinner will be taking over as XFS maintainer for one release cycle, starting from the day 5.18-rc1 drops until 5.19-rc1 is tagged so that I can focus on starting a massive design review for the (feature complete after five years) online repair feature.

  Summary:

   - Fix some incorrect mapping state being passed to iomap during COW

   - Don't create bogus selinux audit messages when deciding to degrade gracefully due to lack of privilege

   - Fix setattr implementation to use VFS helpers so that we drop setgid consistently with the other filesystems

   - Fix link/unlink/rename to check quota limits

   - Constify xfs_name_dotdot to prevent abuse of in-kernel symbols

   - Fix log livelock between the AIL and inodegc threads during recovery

   - Fix a log stall when the AIL races with pushers

   - Fix stalls in CIL flushes due to pinned inode cluster buffers during recovery

   - Fix log corruption due to incorrect usage of xfs_is_shutdown vs xlog_is_shutdown because during an induced fs shutdown, AIL writeback must continue until the log is shut down, even if the filesystem has already shut down"

* tag 'xfs-5.18-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: xfs_is_shutdown vs xlog_is_shutdown cage fight
  xfs: AIL should be log centric
  xfs: log items should have a xlog pointer, not a mount
  xfs: async CIL flushes need pending pushes to be made stable
  xfs: xfs_ail_push_all_sync() stalls when racing with updates
  xfs: check buffer pin state after locking in delwri_submit
  xfs: log worker needs to start before intent/unlink recovery
  xfs: constify xfs_name_dotdot
  xfs: constify the name argument to various directory functions
  xfs: reserve quota for target dir expansion when renaming files
  xfs: reserve quota for dir expansion when linking/unlinking files
  xfs: refactor user/group quota chown in xfs_setattr_nonsize
  xfs: use setattr_copy to set vfs inode attributes
  xfs: don't generate selinux audit messages for capability testing
  xfs: add missing cmap->br_state = XFS_EXT_NORM update
2022-03-24 | Merge tag 'dax-for-5.18' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull DAX updates from Dan Williams: "Andrew has been shepherding major dax features that touch the core -mm through his tree, but I still collect the dax updates that are core-mm independent. - Fix a crash due to a missing rcu_barrier() in dax_fs_exit() - Fix two miscellaneous doc issues" * tag 'dax-for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: dax: Fix missing kdoc for dax_device dax: make sure inodes are flushed before destroy cache fsdax: fix function description
2022-03-24 | Merge branch 'akpm' (patches from Andrew) | Linus Torvalds
Merge more updates from Andrew Morton: "Various misc subsystems, before getting into the post-linux-next material. 41 patches. Subsystems affected by this patch series: procfs, misc, core-kernel, lib, checkpatch, init, pipe, minix, fat, cgroups, kexec, kdump, taskstats, panic, kcov, resource, and ubsan" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (41 commits) Revert "ubsan, kcsan: Don't combine sanitizer with kcov on clang" kernel/resource: fix kfree() of bootmem memory again kcov: properly handle subsequent mmap calls kcov: split ioctl handling into locked and unlocked parts panic: move panic_print before kmsg dumpers panic: add option to dump all CPUs backtraces in panic_print docs: sysctl/kernel: add missing bit to panic_print taskstats: remove unneeded dead assignment kasan: no need to unset panic_on_warn in end_report() ubsan: no need to unset panic_on_warn in ubsan_epilogue() panic: unset panic_on_warn inside panic() docs: kdump: add scp example to write out the dump file docs: kdump: update description about sysfs file system support arm64: mm: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef x86/setup: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef riscv: mm: init: use IS_ENABLED(CONFIG_KEXEC_CORE) instead of #ifdef kexec: make crashk_res, crashk_low_res and crash_notes symbols always visible cgroup: use irqsave in cgroup_rstat_flush_locked(). fat: use pointer to simple type in put_user() minix: fix bug when opening a file with O_DIRECT ...
2022-03-24 | Merge tag 'flexible-array-transformations-5.18-rc1' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux Pull flexible-array transformations from Gustavo Silva: "Treewide patch that replaces zero-length arrays with flexible-array members. This has been baking in linux-next for a whole development cycle" * tag 'flexible-array-transformations-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: treewide: Replace zero-length arrays with flexible-array members
2022-03-24 | Merge tag 'fs.rt.v5.18' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull mount attributes PREEMPT_RT update from Christian Brauner: "This contains Sebastian's fix to make changing mount attributes/getting write access compatible with CONFIG_PREEMPT_RT. The change only applies when users explicitly opt-in to real-time via CONFIG_PREEMPT_RT otherwise things are exactly as before. We've waited quite a long time with this to make sure folks could take a good look" * tag 'fs.rt.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: fs/namespace: Boost the mount_lock.lock owner instead of spinning on PREEMPT_RT.
2022-03-24 | Merge tag 'fs.v5.18' of ↵ | Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull mount_setattr updates from Christian Brauner:
 "This contains a few more patches to massage the mount_setattr() codepaths and one minor fix to reuse a helper we added some time back.

  The final two patches do similar cleanups in different ways. One patch is mine and the other is Al's who was nice enough to give me a branch for it. Since his came in later and my branch had been sitting in -next for quite some time we just put his on top instead of swapping them"

* tag 'fs.v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  mount_setattr(): clean the control flow and calling conventions
  fs: clean up mount_setattr control flow
  fs: don't open-code mnt_hold_writers()
  fs: simplify check in mount_setattr_commit()
  fs: add mnt_allow_writers() and simplify mount_setattr_prepare()
2022-03-24 | NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error | Olga Kornievskaia
There is no reason to retry the operation if a session error has occurred; in that case the result structure isn't filled out.

Fixes: dff58530c4ca ("NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-03-24 | NFS: replace usage of found with dedicated list iterator variable | Jakob Koschel
To allow the list iterator variable to move into the list_for_each_entry_*() macro in the future, the list iterator variable should not be used after the loop body.

To *never* use the list iterator variable after the loop, it was concluded to use a separate iterator variable instead of a found boolean [1].

This removes the need for a found variable: simply checking whether the separate variable was set determines whether the break/goto was hit.

Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
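The pattern can be shown outside the kernel with a plain singly linked list. This is an illustrative sketch only: struct item and find_item() are hypothetical, and an ordinary for loop stands in for the list_for_each_entry_*() macros.

```c
#include <stddef.h>

/* Minimal list node used only for this illustration. */
struct item {
	int key;
	struct item *next;
};

/*
 * 'iter' is used strictly inside the loop. The result is carried out in
 * the separate 'found' pointer, which stays NULL when no entry matched,
 * so no boolean flag is needed and 'iter' is never read after the loop.
 */
static struct item *find_item(struct item *head, int key)
{
	struct item *iter;
	struct item *found = NULL;

	for (iter = head; iter; iter = iter->next) {
		if (iter->key == key) {
			found = iter;
			break;
		}
	}

	return found;
}
```

A caller then tests the returned pointer against NULL to learn whether the loop hit its break.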
2022-03-23 | fat: use pointer to simple type in put_user() | Helge Deller
The put_user(val,ptr) macro wants a pointer to a simple type, but in fat_ioctl_filldir() the d_name field references an "array of chars". Be more accurate and explicitly give the pointer to the first character of the d_name[] array.

I noticed that issue while trying to optimize the parisc put_user() macro and used an intermediate variable to store the pointer. In that case I got this error:

  In file included from include/linux/uaccess.h:11,
                   from include/linux/compat.h:17,
                   from fs/fat/dir.c:18:
  fs/fat/dir.c: In function `fat_ioctl_filldir':
  fs/fat/dir.c:725:33: error: invalid initializer
    725 |   if (put_user(0, d2->d_name) || \
        |                                 ^~
  include/asm/uaccess.h:152:33: note: in definition of macro `__put_user'
    152 |  __typeof__(ptr) __ptr = ptr; \
        |                                 ^~~
  fs/fat/dir.c:759:1: note: in expansion of macro `FAT_IOCTL_FILLDIR_FUNC'
    759 | FAT_IOCTL_FILLDIR_FUNC(fat_ioctl_filldir, __fat_dirent)

Andreas Schwab <schwab@linux-m68k.org> suggested to use

   __typeof__(&*(ptr)) __ptr = ptr;

instead. This works, but nevertheless it's probably reasonable to fix the original caller too.

Link: https://lkml.kernel.org/r/Ygo+A9MREmC1H3kr@p100
Signed-off-by: Helge Deller <deller@gmx.de>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: David Laight <David.Laight@aculab.com>
Cc: Andreas Schwab <schwab@linux-m68k.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
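The type problem can be reproduced with a small userspace macro. This is a sketch under assumptions: store_via_tmp() is a hypothetical stand-in for a put_user()-style macro that keeps the pointer in an intermediate variable, struct dirent_model is invented for the example, and __typeof__ is the GNU C extension the message itself uses.

```c
/*
 * Hypothetical stand-in for a put_user()-style macro that stores through
 * an intermediate variable declared with the type of its argument.
 */
#define store_via_tmp(val, ptr)			\
	do {					\
		__typeof__(ptr) tmp_ = (ptr);	\
		*tmp_ = (val);			\
	} while (0)

/* Invented structure with an array member, mirroring a d_name[] field. */
struct dirent_model {
	char d_name[256];
};

static void clear_name(struct dirent_model *d)
{
	/*
	 * Passing d->d_name would declare tmp_ with the array type
	 * char[256], which cannot be initialized from another array and is
	 * rejected as an invalid initializer. A pointer to the first
	 * element has the simple type char * and works.
	 */
	store_via_tmp(0, &d->d_name[0]);
}
```

The caller-side form &d->d_name[0] compiles regardless of how the macro declares its temporary, which is why fixing the caller is worthwhile even when the macro is also made more tolerant.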
2022-03-23 | minix: fix bug when opening a file with O_DIRECT | Qinghua Jin
Testcase:
1. create a minix file system and mount it
2. open a file on the file system with O_RDWR|O_CREAT|O_TRUNC|O_DIRECT
3. open fails with -EINVAL but leaves an empty file behind. All other open() failures don't leave the failed open files behind.

It is hard to check the direct_IO op before creating the inode. Just as ext4 and btrfs do, this patch resolves the issue by allowing the file to be created with O_DIRECT but returning an error when writing the file.

Link: https://lkml.kernel.org/r/20220107133626.413379-1-qhjin.dev@gmail.com
Signed-off-by: Qinghua Jin <qhjin.dev@gmail.com>
Reported-by: Colin Ian King <colin.king@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23fs/pipe.c: local vars have to match types of proper pipe_inode_info fieldsAndrei Vagin
head, tail, ring_size are declared as unsigned int, so all local variables that operate with these fields have to be unsigned to avoid signed integer overflow. Right now, it isn't an issue because the maximum pipe size is limited by 1U<<31. Link: https://lkml.kernel.org/r/20220106171946.36128-1-avagin@gmail.com Signed-off-by: Andrei Vagin <avagin@gmail.com> Suggested-by: Dmitry Safonov <0x7f454c46@gmail.com> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
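Why the locals should be unsigned can be illustrated in user space. This is a minimal sketch, assuming free-running 32-bit `unsigned int` indices; the `demo_` helpers are hypothetical and only loosely modelled on the idea of computing occupancy as a difference of head and tail.

```c
#include <limits.h>

/*
 * Number of occupied slots. Unsigned subtraction is defined modulo 2^N,
 * so the result stays correct even after head has wrapped past UINT_MAX.
 * The same expression on signed int would be undefined behaviour on
 * overflow.
 */
static unsigned int demo_ring_occupancy(unsigned int head, unsigned int tail)
{
	return head - tail;
}

/* The ring is full once the occupancy reaches the number of slots. */
static int demo_ring_full(unsigned int head, unsigned int tail,
			  unsigned int ring_size)
{
	return demo_ring_occupancy(head, tail) >= ring_size;
}
```

For example, with `tail == UINT_MAX - 1` and a head that has advanced four slots and wrapped to `2`, the occupancy is still computed as 4.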
2022-03-23fs/pipe: use kvcalloc to allocate a pipe_buffer arrayAndrei Vagin
Right now, kcalloc is used to allocate a pipe_buffer array. The size of the pipe_buffer struct is 40 bytes. kcalloc can reliably allocate chunks only up to order PAGE_ALLOC_COSTLY_ORDER (3), which means that the maximum pipe size is 3.2MB in this case. In CRIU, we use pipes to dump process memory. CRIU freezes a target process, injects parasite code into it, and then this code splices memory into pipes. If the maximum pipe size is small, we need to do many iterations or create many pipes. kvcalloc attempts to allocate physically contiguous memory but, upon failure, falls back to a non-contiguous (vmalloc) allocation, so it isn't limited by PAGE_ALLOC_COSTLY_ORDER. The maximum pipe size for non-root users is limited by the /proc/sys/fs/pipe-max-size sysctl, which is 1MB by default, so only the root user will be able to trigger vmalloc allocations. Link: https://lkml.kernel.org/r/20220104171058.22580-1-avagin@gmail.com Signed-off-by: Andrei Vagin <avagin@gmail.com> Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
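The 3.2MB figure can be checked with a short calculation. This is a sketch under stated assumptions: 4 KiB pages, one page of data per pipe_buffer, and the 40-byte struct size quoted in the message. The `DEMO_` constants and helpers are hypothetical names, and the sketch ignores any rounding of the slot count that the pipe code may apply.

```c
#include <stddef.h>

#define DEMO_PAGE_SIZE        4096u /* assumption: 4 KiB pages */
#define DEMO_COSTLY_ORDER     3u    /* PAGE_ALLOC_COSTLY_ORDER per the message */
#define DEMO_PIPE_BUFFER_SIZE 40u   /* sizeof(struct pipe_buffer) per the message */

/* Largest contiguous chunk treated as reliable: 2^order pages. */
static size_t demo_max_contig_bytes(void)
{
	return (size_t)DEMO_PAGE_SIZE << DEMO_COSTLY_ORDER;
}

/* How many pipe_buffer structs fit into that chunk. */
static size_t demo_max_pipe_buffers(void)
{
	return demo_max_contig_bytes() / DEMO_PIPE_BUFFER_SIZE;
}

/* Each pipe_buffer refers to one page of data. */
static size_t demo_max_pipe_bytes(void)
{
	return demo_max_pipe_buffers() * DEMO_PAGE_SIZE;
}
```

Under these assumptions the chunk is 32768 bytes, which holds 819 buffers, for 3354624 bytes of pipe data (about 3.2 MiB).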
2022-03-23proc/vmcore: fix vmcore_alloc_buf() kernel-doc commentYang Li
Fix a spelling problem to remove warnings found by running scripts/kernel-doc, which is caused by using 'make W=1'. fs/proc/vmcore.c:492: warning: Function parameter or member 'size' not described in 'vmcore_alloc_buf' fs/proc/vmcore.c:492: warning: Excess function parameter 'sizez' description in 'vmcore_alloc_buf' Link: https://lkml.kernel.org/r/20220129011449.105278-1-yang.lee@linux.alibaba.com Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Reported-by: Abaci Robot <abaci@linux.alibaba.com> Acked-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23proc/vmcore: fix possible deadlock on concurrent mmap and readDavid Hildenbrand
Lockdep noticed that there is chance for a deadlock if we have concurrent mmap, concurrent read, and the addition/removal of a callback. As nicely explained by Boqun: "Lockdep warned about the above sequences because rw_semaphore is a fair read-write lock, and the following can cause a deadlock: TASK 1 TASK 2 TASK 3 ====== ====== ====== down_write(mmap_lock); down_read(vmcore_cb_rwsem) down_write(vmcore_cb_rwsem); // blocked down_read(vmcore_cb_rwsem); // cannot get the lock because of the fairness down_read(mmap_lock); // blocked IOW, a reader can block another read if there is a writer queued by the second reader and the lock is fair" To fix this, convert to srcu to make this deadlock impossible. We need srcu as our callbacks can sleep. With this change, I cannot trigger any lockdep warnings. ====================================================== WARNING: possible circular locking dependency detected 5.17.0-0.rc0.20220117git0c947b893d69.68.test.fc36.x86_64 #1 Not tainted ------------------------------------------------------ makedumpfile/542 is trying to acquire lock: ffffffff832d2eb8 (vmcore_cb_rwsem){.+.+}-{3:3}, at: mmap_vmcore+0x340/0x580 but task is already holding lock: ffff8880af226438 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x150 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_lock#2){++++}-{3:3}: lock_acquire+0xc3/0x1a0 __might_fault+0x4e/0x70 _copy_to_user+0x1f/0x90 __copy_oldmem_page+0x72/0xc0 read_from_oldmem+0x77/0x1e0 read_vmcore+0x2c2/0x310 proc_reg_read+0x47/0xa0 vfs_read+0x101/0x340 __x64_sys_pread64+0x5d/0xa0 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae -> #0 (vmcore_cb_rwsem){.+.+}-{3:3}: validate_chain+0x9f4/0x2670 __lock_acquire+0x8f7/0xbc0 lock_acquire+0xc3/0x1a0 down_read+0x4a/0x140 mmap_vmcore+0x340/0x580 proc_reg_mmap+0x3e/0x90 mmap_region+0x504/0x880 do_mmap+0x38a/0x520 vm_mmap_pgoff+0xc1/0x150 ksys_mmap_pgoff+0x178/0x200 do_syscall_64+0x43/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&mm->mmap_lock#2); lock(vmcore_cb_rwsem); lock(&mm->mmap_lock#2); lock(vmcore_cb_rwsem); *** DEADLOCK *** 1 lock held by makedumpfile/542: #0: ffff8880af226438 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0x84/0x150 stack backtrace: CPU: 0 PID: 542 Comm: makedumpfile Not tainted 5.17.0-0.rc0.20220117git0c947b893d69.68.test.fc36.x86_64 #1 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 Call Trace: __lock_acquire+0x8f7/0xbc0 lock_acquire+0xc3/0x1a0 down_read+0x4a/0x140 mmap_vmcore+0x340/0x580 proc_reg_mmap+0x3e/0x90 mmap_region+0x504/0x880 do_mmap+0x38a/0x520 vm_mmap_pgoff+0xc1/0x150 ksys_mmap_pgoff+0x178/0x200 do_syscall_64+0x43/0x90 Link: https://lkml.kernel.org/r/20220119193417.100385-1-david@redhat.com Fixes: cc5f2704c934 ("proc/vmcore: convert oldmem_pfn_is_ram callback to more generic vmcore callbacks") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Baoquan He <bhe@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: "Paul E. 
McKenney" <paulmck@kernel.org> Cc: Josh Triplett <josh@joshtriplett.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23proc: alloc PATH_MAX bytes for /proc/${pid}/fd/ symlinksHao Lee
Using __get_free_page() to allocate the path buffer directly is not a standard approach. We'd better use kmalloc and PATH_MAX. PAGE_SIZE is different on different archs. An unlinked file with a very long canonical pathname will readlink differently because "(deleted)" eats into the buffer. --adobriyan [akpm@linux-foundation.org: remove now-unneeded cast] Link: https://lkml.kernel.org/r/Ye1fCxyZZ0I5lgOL@localhost.localdomain Signed-off-by: Hao Lee <haolee.swjtu@gmail.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Kees Cook <keescook@chromium.org> Cc: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-23Merge tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-genericLinus Torvalds
Pull asm-generic updates from Arnd Bergmann: "There are three sets of updates for 5.18 in the asm-generic tree: - The set_fs()/get_fs() infrastructure gets removed for good. This was already gone from all major architectures, but now we can finally remove it everywhere, which loses some particularly tricky and error-prone code. There is a small merge conflict against a parisc cleanup, the solution is to use their new version. - The nds32 architecture ends its tenure in the Linux kernel. The hardware is still used and the code is in reasonable shape, but the mainline port is not actively maintained any more, as all remaining users are thought to run vendor kernels that would never be updated to a future release. - A series from Masahiro Yamada cleans up some of the uapi header files to pass the compile-time checks" * tag 'asm-generic-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: (27 commits) nds32: Remove the architecture uaccess: remove CONFIG_SET_FS ia64: remove CONFIG_SET_FS support sh: remove CONFIG_SET_FS support sparc64: remove CONFIG_SET_FS support lib/test_lockup: fix kernel pointer check for separate address spaces uaccess: generalize access_ok() uaccess: fix type mismatch warnings from access_ok() arm64: simplify access_ok() m68k: fix access_ok for coldfire MIPS: use simpler access_ok() MIPS: Handle address errors for accesses above CPU max virtual user address uaccess: add generic __{get,put}_kernel_nofault nios2: drop access_ok() check from __put_user() x86: use more conventional access_ok() definition x86: remove __range_not_ok() sparc64: add __{get,put}_kernel_nofault() nds32: fix access_ok() checks in get/put_user uaccess: fix nios2 and microblaze get_user_8() sparc64: fix building assembly files ...
2022-03-23fs: do not pass __GFP_HIGHMEM to bio_alloc in do_mpage_readpageChristoph Hellwig
The mpage bio alloc cleanup accidentally removed clearing ~GFP_KERNEL bits from the mask passed to bio_alloc. Fix this up in a slightly less obfuscated way that mirrors what iomap does in its readpage code. Fixes: 77c436de01c0 ("mpage: pass the operation to bio_alloc") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Link: https://lore.kernel.org/r/20220323153952.1418560-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2022-03-22Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull filesystem folio updates from Matthew Wilcox: "Primarily this series converts some of the address_space operations to take a folio instead of a page. Notably: - a_ops->is_partially_uptodate() takes a folio instead of a page and changes the type of the 'from' and 'count' arguments to make it obvious they're bytes. - a_ops->invalidatepage() becomes ->invalidate_folio() and has a similar type change. - a_ops->launder_page() becomes ->launder_folio() - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the address_space as an argument. There are a couple of other misc changes up front that weren't worth separating into their own pull request" * tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits) fs: Remove aops ->set_page_dirty fb_defio: Use noop_dirty_folio() fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio fs: Convert __set_page_dirty_buffers to block_dirty_folio nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio() mm: Convert swap_set_page_dirty() to swap_dirty_folio() ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio() btrfs: Convert extent_range_redirty_for_io() to use folios fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio btrfs: Convert from set_page_dirty to dirty_folio fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio() fs: Add aops->dirty_folio fs: Remove aops->launder_page orangefs: Convert launder_page to launder_folio nfs: Convert from launder_page to launder_folio fuse: Convert from launder_page to launder_folio ...
2022-03-22Merge tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecacheLinus Torvalds
Pull folio updates from Matthew Wilcox: - Rewrite how munlock works to massively reduce the contention on i_mmap_rwsem (Hugh Dickins): https://lore.kernel.org/linux-mm/8e4356d-9622-a7f0-b2c-f116b5f2efea@google.com/ - Sort out the page refcount mess for ZONE_DEVICE pages (Christoph Hellwig): https://lore.kernel.org/linux-mm/20220210072828.2930359-1-hch@lst.de/ - Convert GUP to use folios and make pincount available for order-1 pages. (Matthew Wilcox) - Convert a few more truncation functions to use folios (Matthew Wilcox) - Convert page_vma_mapped_walk to use PFNs instead of pages (Matthew Wilcox) - Convert rmap_walk to use folios (Matthew Wilcox) - Convert most of shrink_page_list() to use a folio (Matthew Wilcox) - Add support for creating large folios in readahead (Matthew Wilcox) * tag 'folio-5.18c' of git://git.infradead.org/users/willy/pagecache: (114 commits) mm/damon: minor cleanup for damon_pa_young selftests/vm/transhuge-stress: Support file-backed PMD folios mm/filemap: Support VM_HUGEPAGE for file mappings mm/readahead: Switch to page_cache_ra_order mm/readahead: Align file mappings for non-DAX mm/readahead: Add large folio readahead mm: Support arbitrary THP sizes mm: Make large folios depend on THP mm: Fix READ_ONLY_THP warning mm/filemap: Allow large folios to be added to the page cache mm: Turn can_split_huge_page() into can_split_folio() mm/vmscan: Convert pageout() to take a folio mm/vmscan: Turn page_check_references() into folio_check_references() mm/vmscan: Account large folios correctly mm/vmscan: Optimise shrink_page_list for non-PMD-sized folios mm/vmscan: Free non-shmem folios without splitting them mm/rmap: Constify the rmap_walk_control argument mm/rmap: Convert rmap_walk() to take a folio mm: Turn page_anon_vma() into folio_anon_vma() mm/rmap: Turn page_lock_anon_vma_read() into folio_lock_anon_vma_read() ...
2022-03-22Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge updates from Andrew Morton: - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs - Most the MM patches which precede the patches in Willy's tree: kasan, pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap, sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb, userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp, cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap, zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon. * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits) mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release() Docs/ABI/testing: add DAMON sysfs interface ABI document Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface selftests/damon: add a test for DAMON sysfs interface mm/damon/sysfs: support DAMOS stats mm/damon/sysfs: support DAMOS watermarks mm/damon/sysfs: support schemes prioritization mm/damon/sysfs: support DAMOS quotas mm/damon/sysfs: support DAMON-based Operation Schemes mm/damon/sysfs: support the physical address space monitoring mm/damon/sysfs: link DAMON for virtual address spaces monitoring mm/damon: implement a minimal stub for sysfs-based DAMON interface mm/damon/core: add number of each enum type values mm/damon/core: allow non-exclusive DAMON start/stop Docs/damon: update outdated term 'regions update interval' Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling Docs/vm/damon: call low level monitoring primitives the operations mm/damon: remove unnecessary CONFIG_DAMON option mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}() mm/damon/dbgfs-test: fix is_target_id() change ...
2022-03-22mm/fs: delete PF_SWAPWRITEHugh Dickins
PF_SWAPWRITE has been redundant since v3.2 commit ee72886d8ed5 ("mm: vmscan: do not writeback filesystem pages in direct reclaim"). Coincidentally, NeilBrown's current patch "remove inode_congested()" deletes may_write_to_inode(), which appeared to be the one function which took notice of PF_SWAPWRITE. But if you study the old logic, and the conditions under which may_write_to_inode() was called, you discover that flag and function have been pointless for a decade. Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: NeilBrown <neilb@suse.de> Cc: Jan Kara <jack@suse.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22userfaultfd: provide unmasked address on page-faultNadav Amit
Userfaultfd is supposed to provide the full address (i.e., unmasked) of the faulting access back to userspace. However, that has not been the case for quite some time. Even running "userfaultfd_demo" from the userfaultfd man page provides the wrong output (and contradicts the man page). Notice that "UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and not the first read address (0x7fc5e30b300f). Address returned by mmap() = 0x7fc5e30b3000 fault_handler_thread(): poll() returns: nready = 1; POLLIN = 1; POLLERR = 0 UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000 (uffdio_copy.copy returned 4096) Read address 0x7fc5e30b300f in main(): A Read address 0x7fc5e30b340f in main(): A Read address 0x7fc5e30b380f in main(): A Read address 0x7fc5e30b3c0f in main(): A The exact address is useful for various reasons and specifically for prefetching decisions. If it is known that the memory is populated by certain objects whose size is not page-aligned, then based on the faulting address, the uffd-monitor can decide whether to prefetch and prefault the adjacent page. This bug has been in the kernel for quite some time: since commit 1a29d85eb0f1 ("mm: use vmf->address instead of of vmf->virtual_address"), which dates back to 2016. A concern has been raised that existing userspace applications might rely on the old/wrong behavior in which the address is masked. Therefore, it was suggested to provide the masked address unless the user explicitly asks for the exact address. Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct userfaultfd to provide the exact address. Add a new "real_address" field to vmf to hold the unmasked address. Provide the address to userspace accordingly. Initialize real_address in various code-paths to be consistent with address, even when it is not used, to be on the safe side. 
[namit@vmware.com: initialize real_address on all code paths, per Jan] Link: https://lkml.kernel.org/r/20220226022655.350562-1-namit@vmware.com [akpm@linux-foundation.org: fix typo in comment, per Jan] Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
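The difference between the masked and the exact address can be sketched in user space. This is only an illustration under the assumption of a power-of-two page size; `demo_masked_address` and `demo_reported_address` are hypothetical helpers, and the `exact_requested` parameter merely stands in for whether the monitor asked for UFFD_FEATURE_EXACT_ADDRESS. It does not show the kernel code path.

```c
#include <stdint.h>

/* Round a faulting address down to the start of its page. */
static uint64_t demo_masked_address(uint64_t fault_addr, uint64_t page_size)
{
	/* page_size is assumed to be a power of two */
	return fault_addr & ~(page_size - 1);
}

/*
 * Address handed to the monitor: the exact one only when it was
 * explicitly requested, otherwise the page-aligned one, so that
 * existing users keep seeing the old behaviour.
 */
static uint64_t demo_reported_address(uint64_t fault_addr,
				      uint64_t page_size,
				      int exact_requested)
{
	return exact_requested ? fault_addr
			       : demo_masked_address(fault_addr, page_size);
}
```

With the addresses from the demo output, a fault at 0x7fc5e30b300f on a 4096-byte page is reported as 0x7fc5e30b3000 by default and as 0x7fc5e30b300f when the exact address is requested.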
2022-03-22mm: dcache: use kmem_cache_alloc_lru() to allocate dentryMuchun Song
Like inode cache, the dentry will also be added to its memcg list_lru. So replace kmem_cache_alloc() with kmem_cache_alloc_lru() to allocate dentry. Link: https://lkml.kernel.org/r/20220228122126.37293-8-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22f2fs: allocate inode by using alloc_inode_sb()Muchun Song
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-6-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22fs: allocate inode by using alloc_inode_sb()Muchun Song
The inode allocation is supposed to use alloc_inode_sb(), so convert kmem_cache_alloc() of all filesystems to alloc_inode_sb(). Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Theodore Ts'o <tytso@mit.edu> [ext4] Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22fs: introduce alloc_inode_sb() to allocate filesystems specific inodeMuchun Song
The allocated inode cache is supposed to be added to its memcg list_lru, which should be allocated as well in advance. That can be done by kmem_cache_alloc_lru(), which allocates the object and the list_lru. File systems are the main users of it. So introduce alloc_inode_sb() to allocate file system specific inodes and set up the inode reclaim context properly. The file system is supposed to use alloc_inode_sb() to allocate inodes. In later patches, we will convert all users to the new API. Link: https://lkml.kernel.org/r/20220228122126.37293-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22mm: fs: fix lru_cache_disabled race in bh_lruMinchan Kim
Check lru_cache_disabled under bh_lru_lock. Otherwise, the race below can occur, and pages containing buffer_head fail to migrate. CPU 0 CPU 1 bh_lru_install lru_cache_disable lru_cache_disabled = false atomic_inc(&lru_disable_count); invalidate_bh_lrus_cpu of CPU 0 bh_lru_lock __invalidate_bh_lrus bh_lru_unlock bh_lru_lock install the bh bh_lru_unlock When this race happens, a CMA allocation fails, which is critical for workloads that depend on CMA. Link: https://lkml.kernel.org/r/20220308180709.2017638-1-minchan@kernel.org Fixes: 8cc621d2f45d ("mm: fs: invalidate BH LRU during page migration") Signed-off-by: Minchan Kim <minchan@kernel.org> Cc: Chris Goldsworthy <cgoldswo@codeaurora.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: John Dias <joaodias@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22mount: warn only once about timestamp range expirationAnthony Iliopoulos
Commit f8b92ba67c5d ("mount: Add mount warning for impending timestamp expiry") introduced a mount warning regarding filesystem timestamp limits, that is printed upon each writable mount or remount. This can result in a lot of unnecessary messages in the kernel log in setups where filesystems are being frequently remounted (or mounted multiple times). Avoid this by setting a superblock flag which indicates that the warning has been emitted at least once for any particular mount, as suggested in [1]. Link: https://lore.kernel.org/CAHk-=wim6VGnxQmjfK_tDg6fbHYKL4EFkmnTjVr9QnRqjDBAeA@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20220119202934.26495-1-ailiop@suse.com Signed-off-by: Anthony Iliopoulos <ailiop@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <christian.brauner@ubuntu.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
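The warn-once idea can be sketched in user space as a flag bit stored on the object that outlives individual mounts. This is a minimal illustration; `struct demo_super_block`, `DEMO_SB_TS_EXPIRY_WARNED` and the message text are hypothetical and do not reproduce the kernel's flag name, value, or log output.

```c
#include <stdio.h>

#define DEMO_SB_TS_EXPIRY_WARNED 0x1u /* hypothetical flag bit */

/* Hypothetical stand-in for the per-filesystem superblock. */
struct demo_super_block {
	unsigned int s_iflags;
};

/*
 * Emit the timestamp-expiry warning at most once per superblock.
 * Returns 1 if the warning was printed, 0 if it was suppressed.
 */
static int demo_warn_ts_expiry(struct demo_super_block *sb)
{
	if (sb->s_iflags & DEMO_SB_TS_EXPIRY_WARNED)
		return 0;

	fprintf(stderr, "filesystem timestamps are close to expiry\n");
	sb->s_iflags |= DEMO_SB_TS_EXPIRY_WARNED;
	return 1;
}
```

Keeping the flag on the superblock rather than on the mount means repeated remounts, or additional mounts of the same filesystem, find the bit already set and stay quiet.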
2022-03-22f2fs: replace congestion_wait() calls with io_schedule_timeout()NeilBrown
As congestion is no longer tracked, congestion_wait() is effectively equivalent to io_schedule_timeout(). So introduce f2fs_io_schedule_timeout() which sets TASK_UNINTERRUPTIBLE and call that instead. Link: https://lkml.kernel.org/r/164549983744.9187.6425865370954230902.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22remove bdi_congested() and wb_congested() and related functionsNeilBrown
These functions are no longer useful as no BDIs report congestion any more. Removing the test on bdi_write_congested() in current_may_throttle() could cause a small change in behaviour, but only when PF_LOCAL_THROTTLE is set. So replace the calls by 'false' and simplify the code - and remove the functions. [akpm@linux-foundation.org: fix build] Link: https://lkml.kernel.org/r/164549983742.9187.2570198746005819592.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> [nilfs] Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22remove inode_congested()NeilBrown
inode_congested() reports if the backing-device for the inode is congested. No bdi reports congestion any more, so this always returns 'false'. So remove inode_congested() and related functions, and remove the call sites, assuming that inode_congested() always returns 'false'. Link: https://lkml.kernel.org/r/164549983741.9187.2174285592262191311.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22ceph: remove reliance on bdi congestionNeilBrown
The bdi congestion tracking is not widely used and will be removed. CEPHfs is one of a small number of filesystems that uses it, setting just the async (write) congestion flags at what it determines are appropriate times. The only remaining effect of the async flag is to cause (some) WB_SYNC_NONE writes to be skipped. So instead of setting the flag, set an internal flag and change: - .writepages to do nothing if WB_SYNC_NONE and the flag is set - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the flag is set. The writepages change causes a behavioural change in that pageout() can now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will be called on the page which (I think) will further delay the next attempt at writeout. This might be a good thing. Link: https://lkml.kernel.org/r/164549983739.9187.14895675781408171186.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22nfs: remove reliance on bdi congestionNeilBrown
The bdi congestion tracking is not widely used and will be removed. NFS is one of a small number of filesystems that uses it, setting just the async (write) congestion flag at what it determines are appropriate times. The only remaining effect of the async flag is to cause (some) WB_SYNC_NONE writes to be skipped. So instead of setting the flag, set an internal flag and change: - .writepages to do nothing if WB_SYNC_NONE and the flag is set - .writepage to return AOP_WRITEPAGE_ACTIVATE if WB_SYNC_NONE and the flag is set. The writepages change causes a behavioural change in that pageout() can now return PAGE_ACTIVATE instead of PAGE_KEEP, so SetPageActive() will be called on the page which (I think) will further delay the next attempt at writeout. This might be a good thing. Link: https://lkml.kernel.org/r/164549983738.9187.3972219847989393182.stgit@noble.brown Signed-off-by: NeilBrown <neilb@suse.de> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Paolo Valente <paolo.valente@linaro.org> Cc: Philipp Reisner <philipp.reisner@linbit.com> Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
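The decision logic described for .writepages and .writepage can be modelled in user space. This is a rough sketch only: the enum, the `struct demo_fs_state` flag, the numeric value of `DEMO_AOP_WRITEPAGE_ACTIVATE`, and the function names are all hypothetical and do not mirror the NFS or Ceph code.

```c
/* Hypothetical mirror of the two writeback sync modes. */
enum demo_sync_mode { DEMO_WB_SYNC_NONE, DEMO_WB_SYNC_ALL };

#define DEMO_AOP_WRITEPAGE_ACTIVATE 1 /* hypothetical value */

/* Filesystem-internal flag that replaces the bdi congestion bit. */
struct demo_fs_state {
	int write_congested;
};

/* .writepages analogue: returns 1 when the pass should do nothing. */
static int demo_writepages_should_skip(const struct demo_fs_state *fs,
				       enum demo_sync_mode mode)
{
	return mode == DEMO_WB_SYNC_NONE && fs->write_congested;
}

/*
 * .writepage analogue: ask the caller to keep the page active instead
 * of writing it when the filesystem considers itself congested.
 * Returns 0 when the page would be written out normally.
 */
static int demo_writepage(const struct demo_fs_state *fs,
			  enum demo_sync_mode mode)
{
	if (mode == DEMO_WB_SYNC_NONE && fs->write_congested)
		return DEMO_AOP_WRITEPAGE_ACTIVATE;
	return 0;
}
```

Data-integrity writeback (the `DEMO_WB_SYNC_ALL` case here) is never skipped in this sketch, which matches the message's point that only WB_SYNC_NONE writes are affected by the flag.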