aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2022-08-14take care to handle NULL ->proc_lseek()Al Viro
Easily done now, just by clearing FMODE_LSEEK in ->f_mode during proc_reg_open() for such entries. Fixes: 868941b14441 "fs: remove no_llseek" Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-13Merge tag 'pull-work.misc' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull /proc/mounts fix from Al Viro: "Fix for /proc/mounts escaping - escape the '#' character too" * tag 'pull-work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: vfs: escape hash as well
2022-08-13Merge tag '5.20-rc-smb3-client-fixes-part2' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull more cifs updates from Steve French: - two fixes for stable, one for a lock length miscalculation, and another fixes a lease break timeout bug - improvement to handle leases, allows the close timeout to be configured more safely - five restructuring/cleanup patches * tag '5.20-rc-smb3-client-fixes-part2' of git://git.samba.org/sfrench/cifs-2.6: cifs: Do not access tcon->cfids->cfid directly from is_path_accessible cifs: Add constructor/destructors for tcon->cfid SMB3: fix lease break timeout when multiple deferred close handles for the same file. smb3: allow deferred close timeout to be configurable cifs: Do not use tcon->cfid directly, use the cfid we get from open_cached_dir cifs: Move cached-dir functions into a separate file cifs: Remove {cifs,nfs}_fscache_release_page() cifs: fix lock length calculation
2022-08-13afs: Enable multipage folio supportDavid Howells
Enable multipage folio support for the afs filesystem. Support has already been implemented in netfslib, fscache and cachefiles and in most of afs, but I've waited for Matthew Wilcox's latest folio changes. Note that it does require a change to afs_write_begin() to return the correct subpage. This is a "temporary" change as we're working on getting rid of the need for ->write_begin() and ->write_end() completely, at least as far as network filesystems are concerned - but it doesn't prevent afs from making use of the capability. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: kafs-testing@auristor.com Cc: Marc Dionne <marc.dionne@auristor.com> Cc: linux-afs@lists.infradead.org Link: https://lore.kernel.org/lkml/2274528.1645833226@warthog.procyon.org.uk/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-13Merge tag 'timers-urgent-2022-08-13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fixes from Ingo Molnar: "Misc timer fixes: - fix a potential use-after-free bug in posix timers - correct a prototype - address a build warning" * tag 'timers-urgent-2022-08-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: posix-cpu-timers: Cleanup CPU timers before freeing them during exec time: Correct the prototype of ns_to_kernel_old_timeval and ns_to_timespec64 posix-timers: Make do_clock_gettime() static
2022-08-13Merge tag 'xfs-5.20-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull more xfs updates from Darrick Wong: "There's not a lot this time around, just the usual bug fixes and corrections for missing error returns. - Return error codes from block device flushes to userspace - Fix a deadlock between reclaim and mount time quotacheck - Fix an unnecessary ENOSPC return when doing COW on a filesystem with severe free space fragmentation - Fix a miscalculation in the transaction reservation computations for file removal operations" * tag 'xfs-5.20-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: fix inode reservation space for removing transaction xfs: Fix false ENOSPC when performing direct write on a delalloc extent in cow fork xfs: fix intermittent hang during quotacheck xfs: check return codes when flushing block devices
2022-08-12cifs: Do not access tcon->cfids->cfid directly from is_path_accessibleRonnie Sahlberg
cfids will soon keep a list of cached fids so we should not access this directly from outside of cached_dir.c Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11cifs: Add constructor/destructors for tcon->cfidRonnie Sahlberg
and move the structure definitions into cached_dir.h Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11SMB3: fix lease break timeout when multiple deferred close handles for the ↵Bharath SM
same file. Solution is to send lease break ack immediately even in case of deferred close handles to avoid lease break request timing out and let deferred closed handle gets closed as scheduled. Later patches could optimize cases where we then close some of these handles sooner for the cases where lease break is to 'none' Cc: stable@kernel.org Signed-off-by: Bharath SM <bharathsm@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11smb3: allow deferred close timeout to be configurableSteve French
Deferred close can be a very useful feature for allowing caching data for read, and for minimizing the number of reopens needed for a file that is repeatedly opened and close but there are workloads where its default (1 second, similar to actimeo/acregmax) is much too small. Allow the user to configure the amount of time we can defer sending the final smb3 close when we have a handle lease on the file (rather than forcing it to depend on value of actimeo which is often unrelated, and less safe). Adds new mount parameter "closetimeo=" which is the maximum number of seconds we can wait before sending an SMB3 close when we have a handle lease for it. Default value also is set to slightly larger at 5 seconds (although some other clients use larger default this should still help). Suggested-by: Bharath SM <bharathsm@microsoft.com> Reviewed-by: Bharath SM <bharathsm@microsoft.com> Reviewed-by: Shyam Prasad N <sprasad@microsoft.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11cifs: Do not use tcon->cfid directly, use the cfid we get from open_cached_dirRonnie Sahlberg
They are the same right now but tcon-> will later point to a different type of struct containing a list of cfids. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-11Merge tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds
Pull more iomap updates from Darrick Wong: "In the past 10 days or so I've not heard any ZOMG STOP style complaints about removing ->writepage support from gfs2 or zonefs, so here's the pull request removing them (and the underlying fs iomap support) from the kernel: - Remove iomap_writepage and all callers, since the mm apparently never called the zonefs or gfs2 writepage functions" * tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: iomap: remove iomap_writepage zonefs: remove ->writepage gfs2: remove ->writepage gfs2: stop using generic_writepages in gfs2_ail1_start_one
2022-08-11Merge tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph updates from Ilya Dryomov: "We have a good pile of various fixes and cleanups from Xiubo, Jeff, Luis and others, almost exclusively in the filesystem. Several patches touch files outside of our normal purview to set the stage for bringing in Jeff's long awaited ceph+fscrypt series in the near future. All of them have appropriate acks and sat in linux-next for a while" * tag 'ceph-for-5.20-rc1' of https://github.com/ceph/ceph-client: (27 commits) libceph: clean up ceph_osdc_start_request prototype libceph: fix ceph_pagelist_reserve() comment typo ceph: remove useless check for the folio ceph: don't truncate file in atomic_open ceph: make f_bsize always equal to f_frsize ceph: flush the dirty caps immediatelly when quota is approaching libceph: print fsid and epoch with osd id libceph: check pointer before assigned to "c->rules[]" ceph: don't get the inline data for new creating files ceph: update the auth cap when the async create req is forwarded ceph: make change_auth_cap_ses a global symbol ceph: fix incorrect old_size length in ceph_mds_request_args ceph: switch back to testing for NULL folio->private in ceph_dirty_folio ceph: call netfs_subreq_terminated with was_async == false ceph: convert to generic_file_llseek ceph: fix the incorrect comment for the ceph_mds_caps struct ceph: don't leak snap_rwsem in handle_cap_grant ceph: prevent a client from exceeding the MDS maximum xattr size ceph: choose auth MDS for getxattr with the Xs caps ceph: add session already open notify support ...
2022-08-11cifs: Move cached-dir functions into a separate fileRonnie Sahlberg
Also rename crfid to cfid to have consistent naming for this variable. This commit does not change any logic. Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-10cifs: Remove {cifs,nfs}_fscache_release_page()David Howells
Remove {cifs,nfs}_fscache_release_page() from fs/cifs/fscache.h. This functionality got built directly into cifs_release_folio() and will hopefully be replaced with netfs_release_folio() at some point. The "nfs_" version is a copy and paste error and should've been altered to read "cifs_". That can also be removed. Reported-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@redhat.com> cc: Steve French <smfrench@gmail.com> cc: linux-cifs@vger.kernel.org cc: samba-technical@lists.samba.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-10xfs: fix inode reservation space for removing transactionhexiaole
In 'fs/xfs/libxfs/xfs_trans_resv.c', the comment for transaction of removing a directory entry writes: /* fs/xfs/libxfs/xfs_trans_resv.c begin */ /* * For removing a directory entry we can modify: * the parent directory inode: inode size * the removed inode: inode size ... xfs_calc_remove_reservation( struct xfs_mount *mp) { return XFS_DQUOT_LOGRES(mp) + xfs_calc_iunlink_add_reservation(mp) + max((xfs_calc_inode_res(mp, 1) + ... /* fs/xfs/libxfs/xfs_trans_resv.c end */ There has 2 inode size of space to be reserverd, but the actual code for inode reservation space writes. There only count for 1 inode size to be reserved in 'xfs_calc_inode_res(mp, 1)', rather than 2. Signed-off-by: hexiaole <hexiaole@kylinos.cn> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: remove redundant code citations] Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-08-10cifs: fix lock length calculationPaulo Alcantara
The lock length was wrongly set to 0 when fl_end == OFFSET_MAX, thus failing to lock the whole file when l_start=0 and l_len=0. This fixes test 2 from cthon04. Before patch: $ ./cthon04/lock/tlocklfs -t 2 /mnt Creating parent/child synchronization pipes. Test #1 - Test regions of an unlocked file. Parent: 1.1 - F_TEST [ 0, 1] PASSED. Parent: 1.2 - F_TEST [ 0, ENDING] PASSED. Parent: 1.3 - F_TEST [ 0,7fffffffffffffff] PASSED. Parent: 1.4 - F_TEST [ 1, 1] PASSED. Parent: 1.5 - F_TEST [ 1, ENDING] PASSED. Parent: 1.6 - F_TEST [ 1,7fffffffffffffff] PASSED. Parent: 1.7 - F_TEST [7fffffffffffffff, 1] PASSED. Parent: 1.8 - F_TEST [7fffffffffffffff, ENDING] PASSED. Parent: 1.9 - F_TEST [7fffffffffffffff,7fffffffffffffff] PASSED. Test #2 - Try to lock the whole file. Parent: 2.0 - F_TLOCK [ 0, ENDING] PASSED. Child: 2.1 - F_TEST [ 0, 1] FAILED! Child: **** Expected EACCES, returned success... Child: **** Probably implementation error. ** CHILD pass 1 results: 0/0 pass, 0/0 warn, 1/1 fail (pass/total). Parent: Child died ** PARENT pass 1 results: 10/10 pass, 0/0 warn, 0/0 fail (pass/total). After patch: $ ./cthon04/lock/tlocklfs -t 2 /mnt Creating parent/child synchronization pipes. Test #2 - Try to lock the whole file. Parent: 2.0 - F_TLOCK [ 0, ENDING] PASSED. Child: 2.1 - F_TEST [ 0, 1] PASSED. Child: 2.2 - F_TEST [ 0, ENDING] PASSED. Child: 2.3 - F_TEST [ 0,7fffffffffffffff] PASSED. Child: 2.4 - F_TEST [ 1, 1] PASSED. Child: 2.5 - F_TEST [ 1, ENDING] PASSED. Child: 2.6 - F_TEST [ 1,7fffffffffffffff] PASSED. Child: 2.7 - F_TEST [7fffffffffffffff, 1] PASSED. Child: 2.8 - F_TEST [7fffffffffffffff, ENDING] PASSED. Child: 2.9 - F_TEST [7fffffffffffffff,7fffffffffffffff] PASSED. Parent: 2.10 - F_ULOCK [ 0, ENDING] PASSED. ** PARENT pass 1 results: 2/2 pass, 0/0 warn, 0/0 fail (pass/total). ** CHILD pass 1 results: 9/9 pass, 0/0 warn, 0/0 fail (pass/total). Fixes: d80c69846ddf ("cifs: fix signed integer overflow when fl_end is OFFSET_MAX") Reported-by: Xiaoli Feng <xifeng@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz> Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com> Signed-off-by: Steve French <stfrench@microsoft.com>
2022-08-10Merge tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - pNFS/flexfiles: Fix infinite looping when the RDMA connection errors out Bugfixes: - NFS: fix port value parsing - SUNRPC: Reinitialise the backchannel request buffers before reuse - SUNRPC: fix expiry of auth creds - NFSv4: Fix races in the legacy idmapper upcall - NFS: O_DIRECT fixes from Jeff Layton - NFSv4.1: Fix OP_SEQUENCE error handling - SUNRPC: Fix an RPC/RDMA performance regression - NFS: Fix case insensitive renames - NFSv4/pnfs: Fix a use-after-free bug in open - NFSv4.1: RECLAIM_COMPLETE must handle EACCES Features: - NFSv4.1: session trunking enhancements - NFSv4.2: READ_PLUS performance optimisations - NFS: relax the rules for rsize/wsize mount options - NFS: don't unhash dentry during unlink/rename - SUNRPC: Fail faster on bad verifier - NFS/SUNRPC: Various tracing improvements" * tag 'nfs-for-5.20-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (46 commits) NFS: Improve readpage/writepage tracing NFS: Improve O_DIRECT tracing NFS: Improve write error tracing NFS: don't unhash dentry during unlink/rename NFSv4/pnfs: Fix a use-after-free bug in open NFS: nfs_async_write_reschedule_io must not recurse into the writeback code SUNRPC: Don't reuse bvec on retransmission of the request SUNRPC: Reinitialise the backchannel request buffers before reuse NFSv4.1: RECLAIM_COMPLETE must handle EACCES NFSv4.1 probe offline transports for trunking on session creation SUNRPC create a function that probes only offline transports SUNRPC export xprt_iter_rewind function SUNRPC restructure rpc_clnt_setup_test_and_add_xprt NFSv4.1 remove xprt from xprt_switch if session trunking test fails SUNRPC create an rpc function that allows xprt removal from rpc_clnt SUNRPC enable back offline transports in trunking discovery SUNRPC create an iterator to list only OFFLINE xprts NFSv4.1 offline trunkable transports on DESTROY_SESSION SUNRPC add function to offline remove trunkable transports SUNRPC expose functions for offline remote xprt functionality ...
2022-08-10Merge tag 'mm-stable-2022-08-09' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull remaining MM updates from Andrew Morton: "Three patch series - two that perform cleanups and one feature: - hugetlb_vmemmap cleanups from Muchun Song - hardware poisoning support for 1GB hugepages, from Naoya Horiguchi - highmem documentation fixups from Fabio De Francesco" * tag 'mm-stable-2022-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (23 commits) Documentation/mm: add details about kmap_local_page() and preemption highmem: delete a sentence from kmap_local_page() kdocs Documentation/mm: rrefer kmap_local_page() and avoid kmap() Documentation/mm: avoid invalid use of addresses from kmap_local_page() Documentation/mm: don't kmap*() pages which can't come from HIGHMEM highmem: specify that kmap_local_page() is callable from interrupts highmem: remove unneeded spaces in kmap_local_page() kdocs mm, hwpoison: enable memory error handling on 1GB hugepage mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage mm, hwpoison: make __page_handle_poison returns int mm, hwpoison: set PG_hwpoison for busy hugetlb pages mm, hwpoison: make unpoison aware of raw error info in hwpoisoned hugepage mm, hwpoison, hugetlb: support saving mechanism of raw error pages mm/hugetlb: make pud_huge() and follow_huge_pud() aware of non-present pud entry mm/hugetlb: check gigantic_page_runtime_supported() in return_unused_surplus_pages() mm: hugetlb_vmemmap: use PTRS_PER_PTE instead of PMD_SIZE / PAGE_SIZE mm: hugetlb_vmemmap: move code comments to vmemmap_dedup.rst mm: hugetlb_vmemmap: improve hugetlb_vmemmap code readability mm: hugetlb_vmemmap: replace early_param() with core_param() mm: hugetlb_vmemmap: move vmemmap code related to HugeTLB to hugetlb_vmemmap.c ...
2022-08-09Merge tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds
Pull nfsd updates from Chuck Lever: "Work on 'courteous server', which was introduced in 5.19, continues apace. This release introduces a more flexible limit on the number of NFSv4 clients that NFSD allows, now that NFSv4 clients can remain in courtesy state long after the lease expiration timeout. The client limit is adjusted based on the physical memory size of the server. The NFSD filecache is a cache of files held open by NFSv4 clients or recently touched by NFSv2 or NFSv3 clients. This cache had some significant scalability constraints that have been relieved in this release. Thanks to all who contributed to this work. A data corruption bug found during the most recent NFS bake-a-thon that involves NFSv3 and NFSv4 clients writing the same file has been addressed in this release. This release includes several improvements in CPU scalability for NFSv4 operations. In addition, Neil Brown provided patches that simplify locking during file lookup, creation, rename, and removal that enables subsequent work on making these operations more scalable. We expect to see that work materialize in the next release. There are also numerous single-patch fixes, clean-ups, and the usual improvements in observability" * tag 'nfsd-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (78 commits) lockd: detect and reject lock arguments that overflow NFSD: discard fh_locked flag and fh_lock/fh_unlock NFSD: use (un)lock_inode instead of fh_(un)lock for file operations NFSD: use explicit lock/unlock for directory ops NFSD: reduce locking in nfsd_lookup() NFSD: only call fh_unlock() once in nfsd_link() NFSD: always drop directory lock in nfsd_unlink() NFSD: change nfsd_create()/nfsd_symlink() to unlock directory before returning. NFSD: add posix ACLs to struct nfsd_attrs NFSD: add security label to struct nfsd_attrs NFSD: set attributes when creating symlinks NFSD: introduce struct nfsd_attrs NFSD: verify the opened dentry after setting a delegation NFSD: drop fh argument from alloc_init_deleg NFSD: Move copy offload callback arguments into a separate structure NFSD: Add nfsd4_send_cb_offload() NFSD: Remove kmalloc from nfsd4_do_async_copy() NFSD: Refactor nfsd4_do_copy() NFSD: Refactor nfsd4_cleanup_inter_ssc() (2/2) NFSD: Refactor nfsd4_cleanup_inter_ssc() (1/2) ...
2022-08-09NFS: Improve readpage/writepage tracingTrond Myklebust
Switch formatting to better match that used by other NFS tracepoints. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-09NFS: Improve O_DIRECT tracingTrond Myklebust
Switch the formatting to match the other NFS tracepoints. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-09NFS: Improve write error tracingTrond Myklebust
Don't leak request pointers, but use the "device:inode" labelling that is used by all the other trace points. Furthermore, replace use of page indexes with an offset, again in order to align behaviour with other NFS trace points. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-09posix-cpu-timers: Cleanup CPU timers before freeing them during execThadeu Lima de Souza Cascardo
Commit 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task") started looking up tasks by PID when deleting a CPU timer. When a non-leader thread calls execve, it will switch PIDs with the leader process. Then, as it calls exit_itimers, posix_cpu_timer_del cannot find the task because the timer still points out to the old PID. That means that armed timers won't be disarmed, that is, they won't be removed from the timerqueue_list. exit_itimers will still release their memory, and when that list is later processed, it leads to a use-after-free. Clean up the timers from the de-threaded task before freeing them. This prevents a reported use-after-free. Fixes: 55e8c8eb2c7b ("posix-cpu-timers: Store a reference to a pid not a task") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: https://lore.kernel.org/r/20220809170751.164716-1-cascardo@canonical.com
2022-08-09Merge tag 'fscache-fixes-20220809' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull fscache updates from David Howells: - Fix a cookie access ref leak if a cookie is invalidated a second time before the first invalidation is actually processed. - Add a tracepoint to log cookie lookup failure * tag 'fscache-fixes-20220809' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: fscache: add tracepoint when failing cookie fscache: don't leak cookie access refs if invalidation is in progress or failed
2022-08-09Merge tag 'afs-fixes-20220802' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull AFS fixes from David Howells: "Fix AFS refcount handling. The first patch converts afs to use refcount_t for its refcounts and the second patch fixes afs_put_call() and afs_put_server() to save the values they're going to log in the tracepoint before decrementing the refcount" * tag 'afs-fixes-20220802' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: afs: Fix access after dec in put functions afs: Use refcount_t rather than atomic_t
2022-08-09Merge tag 'fs.setgid.v6.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux Pull setgid updates from Christian Brauner: "This contains the work to move setgid stripping out of individual filesystems and into the VFS itself. Creating files that have both the S_IXGRP and S_ISGID bit raised in directories that themselves have the S_ISGID bit set requires additional privileges to avoid security issues. When a filesystem creates a new inode it needs to take care that the caller is either in the group of the newly created inode or they have CAP_FSETID in their current user namespace and are privileged over the parent directory of the new inode. If any of these two conditions is true then the S_ISGID bit can be raised for an S_IXGRP file and if not it needs to be stripped. However, there are several key issues with the current implementation: - S_ISGID stripping logic is entangled with umask stripping. For example, if the umask removes the S_IXGRP bit from the file about to be created then the S_ISGID bit will be kept. The inode_init_owner() helper is responsible for S_ISGID stripping and is called before posix_acl_create(). So we can end up with two different orderings: 1. FS without POSIX ACL support First strip umask then strip S_ISGID in inode_init_owner(). In other words, if a filesystem doesn't support or enable POSIX ACLs then umask stripping is done directly in the vfs before calling into the filesystem: 2. FS with POSIX ACL support First strip S_ISGID in inode_init_owner() then strip umask in posix_acl_create(). In other words, if the filesystem does support POSIX ACLs then unmask stripping may be done in the filesystem itself when calling posix_acl_create(). Note that technically filesystems are free to impose their own ordering between posix_acl_create() and inode_init_owner() meaning that there's additional ordering issues that influence S_ISGID inheritance. (Note that the commit message of commit 1639a49ccdce ("fs: move S_ISGID stripping into the vfs_*() helpers") gets the ordering between inode_init_owner() and posix_acl_create() the wrong way around. I realized this too late.) - Filesystems that don't rely on inode_init_owner() don't get S_ISGID stripping logic. While that may be intentional (e.g. network filesystems might just defer setgid stripping to a server) it is often just a security issue. Note that mandating the use of inode_init_owner() was proposed as an alternative solution but that wouldn't fix the ordering issues and there are examples such as afs where the use of inode_init_owner() isn't possible. In any case, we should also try the cleaner and generalized solution first before resorting to this approach. - We still have S_ISGID inheritance bugs years after the initial round of S_ISGID inheritance fixes: e014f37db1a2 ("xfs: use setattr_copy to set vfs inode attributes") 01ea173e103e ("xfs: fix up non-directory creation in SGID directories") fd84bfdddd16 ("ceph: fix up non-directory creation in SGID directories") All of this led us to conclude that the current state is too messy. While we won't be able to make it completely clean as posix_acl_create() is still a filesystem specific call we can improve the S_SIGD stripping situation quite a bit by hoisting it out of inode_init_owner() and into the respective vfs creation operations. The obvious advantage is that we don't need to rely on individual filesystems getting S_ISGID stripping right and instead can standardize the ordering between S_ISGID and umask stripping directly in the VFS. A few short implementation notes: - The stripping logic needs to happen in vfs_*() helpers for the sake of stacking filesystems such as overlayfs that rely on these helpers taking care of S_ISGID stripping. - Security hooks have never seen the mode as it is ultimately seen by the filesystem because of the ordering issue we mentioned. Nothing is changed for them. We simply continue to strip the umask before passing the mode down to the security hooks. - The following filesystems use inode_init_owner() and thus relied on S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus, hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs, reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs. We've audited all callchains as best as we could. More details can be found in the commit message to 1639a49ccdce ("fs: move S_ISGID stripping into the vfs_*() helpers")" * tag 'fs.setgid.v6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: ceph: rely on vfs for setgid stripping fs: move S_ISGID stripping into the vfs_*() helpers fs: Add missing umask strip in vfs_tmpfile fs: add mode_strip_sgid() helper
2022-08-09fscache: add tracepoint when failing cookieJeff Layton
Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com>
2022-08-09fscache: don't leak cookie access refs if invalidation is in progress or failedJeff Layton
It's possible for a request to invalidate a fscache_cookie will come in while we're already processing an invalidation. If that happens we currently take an extra access reference that will leak. Only call __fscache_begin_cookie_access if the FSCACHE_COOKIE_DO_INVALIDATE bit was previously clear. Also, ensure that we attempt to clear the bit when the cookie is "FAILED" and put the reference to avoid an access leak. Fixes: 85e4ea1049c7 ("fscache: Fix invalidation/lookup race") Suggested-by: David Howells <dhowells@redhat.com> Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: David Howells <dhowells@redhat.com>
2022-08-08Merge tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbdLinus Torvalds
Pull ksmbd updates from Steve French: - fixes for memory access bugs (out of bounds access, oops, leak) - multichannel fixes - session disconnect performance improvement, and session register improvement - cleanup * tag '5.20-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd: ksmbd: fix heap-based overflow in set_ntacl_dacl() ksmbd: prevent out of bound read for SMB2_TREE_CONNNECT ksmbd: prevent out of bound read for SMB2_WRITE ksmbd: fix use-after-free bug in smb2_tree_disconect ksmbd: fix memory leak in smb2_handle_negotiate ksmbd: fix racy issue while destroying session on multichannel ksmbd: use wait_event instead of schedule_timeout() ksmbd: fix kernel oops from idr_remove() ksmbd: add channel rwlock ksmbd: replace sessions list in connection with xarray MAINTAINERS: ksmbd: add entry for documentation ksmbd: remove unused ksmbd_share_configs_cleanup function
2022-08-08Merge tag 'pull-work.iov_iter-rebased' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull more iov_iter updates from Al Viro: - more new_sync_{read,write}() speedups - ITER_UBUF introduction - ITER_PIPE cleanups - unification of iov_iter_get_pages/iov_iter_get_pages_alloc and switching them to advancing semantics - making ITER_PIPE take high-order pages without splitting them - handling copy_page_from_iter() for high-order pages properly * tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits) fix copy_page_from_iter() for compound destinations hugetlbfs: copy_page_to_iter() can deal with compound pages copy_page_to_iter(): don't split high-order page in case of ITER_PIPE expand those iov_iter_advance()... pipe_get_pages(): switch to append_pipe() get rid of non-advancing variants ceph: switch the last caller of iov_iter_get_pages_alloc() 9p: convert to advancing variant of iov_iter_get_pages_alloc() af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages() iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() block: convert to advancing variants of iov_iter_get_pages{,_alloc}() iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() iov_iter: saner helper for page array allocation fold __pipe_get_pages() into pipe_get_pages() ITER_XARRAY: don't open-code DIV_ROUND_UP() unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts unify xarray_get_pages() and xarray_get_pages_alloc() unify pipe_get_pages() and pipe_get_pages_alloc() iov_iter_get_pages(): sanity-check arguments iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper ...
2022-08-08hugetlbfs: copy_page_to_iter() can deal with compound pagesAl Viro
... since April 2021 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08ceph: switch the last caller of iov_iter_get_pages_alloc()Al Viro
here nothing even looks at the iov_iter after the call, so we couldn't care less whether it advances or not. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()Al Viro
... and untangle the cleanup on failure to add into pipe. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()Al Viro
Most of the users immediately follow successful iov_iter_get_pages() with advancing by the amount it had returned. Provide inline wrappers doing that, convert trivial open-coded uses of those. BTW, iov_iter_get_pages() never returns more than it had been asked to; such checks in cifs ought to be removed someday... Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08splice: stop abusing iov_iter_advance() to flush a pipeAl Viro
Use pipe_discard_from() explicitly in generic_file_read_iter(); don't bother with rather non-obvious use of iov_iter_advance() in there. Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08switch new_sync_{read,write}() to ITER_UBUFAl Viro
Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08new iov_iter flavour - ITER_UBUFAl Viro
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(), checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC ones. We are going to expose the things like ->write_iter() et.al. to those in subsequent commits. New predicate (user_backed_iter()) that is true for ITER_IOVEC and ITER_UBUF; places like direct-IO handling should use that for checking that pages we modify after getting them from iov_iter_get_pages() would need to be dirtied. DO NOT assume that replacing iter_is_iovec() with user_backed_iter() will solve all problems - there's code that uses iter_is_iovec() to decide how to poke around in iov_iter guts and for that the predicate replacement obviously won't suffice. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2022-08-08mm: hugetlb_vmemmap: introduce the name HVOMuchun Song
It it inconvenient to mention the feature of optimizing vmemmap pages associated with HugeTLB pages when communicating with others since there is no specific or abbreviated name for it when it is first introduced. Let us give it a name HVO (HugeTLB Vmemmap Optimization) from now. This commit also updates the document about "hugetlb_free_vmemmap" by the way discussed in thread [1]. Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1] Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-08-08NFS: don't unhash dentry during unlink/renameNeilBrown
NFS unlink() (and rename over existing target) must determine if the file is open, and must perform a "silly rename" instead of an unlink (or before rename) if it is. Otherwise the client might hold a file open which has been removed on the server. Consequently if it determines that the file isn't open, it must block any subsequent opens until the unlink/rename has been completed on the server. This is currently achieved by unhashing the dentry. This forces any open attempt to the slow-path for lookup which will block on i_rwsem on the directory until the unlink/rename completes. A future patch will change the VFS to only get a shared lock on i_rwsem for unlink, so this will no longer work. Instead we introduce an explicit interlock. A special value is stored in dentry->d_fsdata while the unlink/rename is running and ->d_revalidate blocks while that value is present. When ->d_revalidate unblocks, the dentry will be invalid. This closes the race without requiring exclusion on i_rwsem. d_fsdata is already used in two different ways. 1/ an IS_ROOT directory dentry might have a "devname" stored in d_fsdata. Such a dentry doesn't have a name and so cannot be the target of unlink or rename. For safety we check if an old devname is still stored, and remove it if it is. 2/ a dentry with DCACHE_NFSFS_RENAMED set will have a 'struct nfs_unlinkdata' stored in d_fsdata. While this is set maydelete() will fail, so an unlink or rename will never proceed on such a dentry. Neither of these can be in effect when a dentry is the target of unlink or rename. So we can expect d_fsdata to be NULL, and store a special value ((void*)1) which is given the name NFS_FSDATA_BLOCKED to indicate that any lookup will be blocked. The d_count() is incremented under d_lock() when a lookup finds the dentry, so we check d_count() is low, and set NFS_FSDATA_BLOCKED under the same lock to avoid any races. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2022-08-08Merge tag 'f2fs-for-6.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this cycle, we mainly fixed some corner cases that manipulate a per-file compression flag inappropriately. And, we found f2fs counted valid blocks in a section incorrectly when zone capacity is set, and thus, fixed it with additional sysfs entry to check it easily. Lastly, this series includes several patches with respect to the new atomic write support such as a couple of bug fixes and re-adding atomic_write_abort support that we removed by mistake in the previous release. Enhancements: - add sysfs entries to understand atomic write operations and zone capacity - introduce memory mode to get a hint for low-memory devices - adjust the waiting time of foreground GC - decompress clusters under softirq to avoid non-deterministic latency - do not skip updating inode when retrying to flush node page - enforce single zone capacity Bug fixes: - set the compression/no-compression flags correctly - revive F2FS_IOC_ABORT_VOLATILE_WRITE - check inline_data during compressed inode conversion - understand zone capacity when calculating valid block count As usual, the series includes several minor clean-ups and sanity checks" * tag 'f2fs-for-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (29 commits) f2fs: use onstack pages instead of pvec f2fs: intorduce f2fs_all_cluster_page_ready f2fs: clean up f2fs_abort_atomic_write() f2fs: handle decompress only post processing in softirq f2fs: do not allow to decompress files have FI_COMPRESS_RELEASED f2fs: do not set compression bit if kernel doesn't support f2fs: remove device type check for direct IO f2fs: fix null-ptr-deref in f2fs_get_dnode_of_data f2fs: revive F2FS_IOC_ABORT_VOLATILE_WRITE f2fs: fix to do sanity check on segment type in build_sit_entries() f2fs: obsolete unused MAX_DISCARD_BLOCKS f2fs: fix to avoid use f2fs_bug_on() in f2fs_new_node_page() f2fs: fix to remove F2FS_COMPR_FL and tag F2FS_NOCOMP_FL at the same time f2fs: introduce sysfs atomic write statistics f2fs: don't bother wait_ms by foreground gc f2fs: invalidate meta pages only for post_read required inode f2fs: allow compression of files without blocks f2fs: fix to check inline_data during compressed inode conversion f2fs: Delete f2fs_copy_page() and replace with memcpy_page() f2fs: fix to invalidate META_MAPPING before DIO write ...
2022-08-08Merge tag 'fuse-update-6.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Fix an issue with reusing the bdi in case of block based filesystems - Allow root (in init namespace) to access fuse filesystems in user namespaces if expicitly enabled with a module param - Misc fixes * tag 'fuse-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: retire block-device-based superblock on force unmount vfs: function to prevent re-use of block-device-based superblocks virtio_fs: Modify format for virtio_fs_direct_access virtiofs: delete unused parameter for virtio_fs_cleanup_vqs fuse: Add module param for CAP_SYS_ADMIN access bypassing allow_other fuse: Remove the control interface for virtio-fs fuse: ioctl: translate ENOSYS fuse: limit nsec fuse: avoid unnecessary spinlock bump fuse: fix deadlock between atomic O_TRUNC and page invalidation fuse: write inode in fuse_release()
2022-08-08Merge tag 'ovl-update-6.0' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs Pull overlayfs update from Miklos Szeredi: "Just a small update" * tag 'ovl-update-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: ovl: fix spelling mistakes ovl: drop WARN_ON() dentry is NULL in ovl_encode_fh() ovl: improve ovl_get_acl() if POSIX ACL support is off ovl: fix some kernel-doc comments ovl: warn if trusted xattr creation fails
2022-08-08Merge tag 'exfat-for-5.20-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat Pull exfat updates from Namjae Jeon: - fix the error code of rename syscall - cleanup and suppress the superfluous error messages - remove duplicate directory entry update - add exfat git tree in MAINTAINERS * tag 'exfat-for-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat: MAINTAINERS: Add Namjae's exfat git tree exfat: Drop superfluous new line for error messages exfat: Downgrade ENAMETOOLONG error message to debug messages exfat: Expand exfat_err() and co directly to pr_*() macro exfat: Define NLS_NAME_* as bit flags explicitly exfat: Return ENAMETOOLONG consistently for oversized paths exfat: remove duplicate write inode for extending dir/file exfat: remove duplicate write inode for truncating file exfat: reuse __exfat_write_inode() to update directory entry
2022-08-08vfs: Check the truncate maximum size in inode_newsize_ok()David Howells
If something manages to set the maximum file size to MAX_OFFSET+1, this can cause the xfs and ext4 filesystems at least to become corrupt. Ordinarily, the kernel protects against userspace trying this by checking the value early in the truncate() and ftruncate() system calls calls - but there are at least two places that this check is bypassed: (1) Cachefiles will round up the EOF of the backing file to DIO block size so as to allow DIO on the final block - but this might push the offset negative. It then calls notify_change(), but this inadvertently bypasses the checking. This can be triggered if someone puts an 8EiB-1 file on a server for someone else to try and access by, say, nfs. (2) ksmbd doesn't check the value it is given in set_end_of_file_info() and then calls vfs_truncate() directly - which also bypasses the check. In both cases, it is potentially possible for a network filesystem to cause a disk filesystem to be corrupted: cachefiles in the client's cache filesystem; ksmbd in the server's filesystem. nfsd is okay as it checks the value, but we can then remove this check too. Fix this by adding a check to inode_newsize_ok(), as called from setattr_prepare(), thereby catching the issue as filesystems set up to perform the truncate with minimal opportunity for bypassing the new check. Fixes: 1f08c925e7a3 ("cachefiles: Implement backing file wrangling") Fixes: f44158485826 ("cifsd: add file operations") Signed-off-by: David Howells <dhowells@redhat.com> Reported-by: Jeff Layton <jlayton@kernel.org> Tested-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Namjae Jeon <linkinjeon@kernel.org> Cc: stable@kernel.org Acked-by: Alexander Viro <viro@zeniv.linux.org.uk> cc: Steve French <sfrench@samba.org> cc: Hyunchul Lee <hyc.lee@gmail.com> cc: Chuck Lever <chuck.lever@oracle.com> cc: Dave Wysochanski <dwysocha@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-08-07Merge tag '5.20-rc-smb3-client-fixes-part1' of ↵Linus Torvalds
git://git.samba.org/sfrench/cifs-2.6 Pull cifs updates from Steve French: "Mostly cleanup, including smb1 refactoring: - multichannel perf improvement - move additional SMB1 code to not be compiled in when legacy support is disabled. - bug fixes, including one important one for memory leak - various cleanup patches We are still working on and testing some deferred close improvements including an important lease break fix for case when multiple deferred closes are still open, and also some additional perf improvements - those are not included here" * tag '5.20-rc-smb3-client-fixes-part1' of git://git.samba.org/sfrench/cifs-2.6: cifs: update internal module number cifs: alloc_mid function should be marked as static cifs: remove "cifs_" prefix from init/destroy mids functions cifs: remove useless DeleteMidQEntry() cifs: when insecure legacy is disabled shrink amount of SMB1 code cifs: trivial style fixup cifs: fix wrong unlock before return from cifs_tree_connect() cifs: avoid use of global locks for high contention data cifs: remove remaining build warnings cifs: list_for_each() -> list_for_each_entry() cifs: update MAINTAINERS file with reviewers smb2: small refactor in smb2_check_message() cifs: Fix memory leak when using fscache cifs: remove minor build warning cifs: remove some camelCase and also some static build warnings cifs: remove unnecessary (void*) conversions. cifs: remove unnecessary type castings cifs: remove redundant initialization to variable mnt_sign_enabled smb3: check xattr value length earlier
2022-08-07Merge tag 'mm-nonmm-stable-2022-08-06-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc updates from Andrew Morton: "Updates to various subsystems which I help look after. lib, ocfs2, fatfs, autofs, squashfs, procfs, etc. A relatively small amount of material this time" * tag 'mm-nonmm-stable-2022-08-06-2' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (72 commits) scripts/gdb: ensure the absolute path is generated on initial source MAINTAINERS: kunit: add David Gow as a maintainer of KUnit mailmap: add linux.dev alias for Brendan Higgins mailmap: update Kirill's email profile: setup_profiling_timer() is moslty not implemented ocfs2: fix a typo in a comment ocfs2: use the bitmap API to simplify code ocfs2: remove some useless functions lib/mpi: fix typo 'the the' in comment proc: add some (hopefully) insightful comments bdi: remove enum wb_congested_state kernel/hung_task: fix address space of proc_dohung_task_timeout_secs lib/lzo/lzo1x_compress.c: replace ternary operator with min() and min_t() squashfs: support reading fragments in readahead call squashfs: implement readahead squashfs: always build "file direct" version of page actor Revert "squashfs: provide backing_dev_info in order to disable read-ahead" fs/ocfs2: Fix spelling typo in comment ia64: old_rr4 added under CONFIG_HUGETLB_PAGE proc: fix test for "vsyscall=xonly" boot option ...
2022-08-06Merge tag '9p-for-5.20' of https://github.com/martinetd/linuxLinus Torvalds
Pull 9p updates from Dominique Martinet: - a couple of fixes - add a tracepoint for fid refcounting - some cleanup/followup on fid lookup - some cleanup around req refcounting * tag '9p-for-5.20' of https://github.com/martinetd/linux: net/9p: Initialize the iounit field during fid creation net: 9p: fix refcount leak in p9_read_work() error handling 9p: roll p9_tag_remove into p9_req_put 9p: Add client parameter to p9_req_put() 9p: Drop kref usage 9p: Fix some kernel-doc comments 9p fid refcount: cleanup p9_fid_put calls 9p fid refcount: add a 9p_fid_ref tracepoint 9p fid refcount: add p9_fid_get/put wrappers 9p: Fix minor typo in code comment 9p: Remove unnecessary variable for old fids while walking from d_parent 9p: Make the path walk logic more clear about when cloning is required 9p: Track the root fid with its own variable during lookups
2022-08-06Merge tag 'gfs2-v5.19-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2 Pull gfs2 updates from Andreas Gruenbacher: - Instantiate glocks ouside of the glock state engine, in the contect of the process taking the glock. This moves unnecessary complexity out of the core glock code. Clean up the instantiate logic to be more sensible. - In gfs2_glock_async_wait(), cancel pending locking request upon failure. Make sure all glocks are left in a consistent state. - Various other minor cleanups and fixes. * tag 'gfs2-v5.19-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2: gfs2: List traversal in do_promote is safe gfs2: do_promote glock holder stealing fix gfs2: Use better variable name gfs2: Make go_instantiate take a glock gfs2: Add new go_held glock operation gfs2: Revert 'Fix "truncate in progress" hang' gfs2: Instantiate glocks ouside of glock state engine gfs2: Fix up gfs2_glock_async_wait gfs2: Minor gfs2_glock_nq_m cleanup gfs2: Fix spelling mistake in comment gfs2: Rewrap overlong comment in do_promote gfs2: Remove redundant NULL check before kfree
2022-08-05xfs: Fix false ENOSPC when performing direct write on a delalloc extent in ↵Chandan Babu R
cow fork On a higly fragmented filesystem a Direct IO write can fail with -ENOSPC error even though the filesystem has sufficient number of free blocks. This occurs if the file offset range on which the write operation is being performed has a delalloc extent in the cow fork and this delalloc extent begins much before the Direct IO range. In such a scenario, xfs_reflink_allocate_cow() invokes xfs_bmapi_write() to allocate the blocks mapped by the delalloc extent. The extent thus allocated may not cover the beginning of file offset range on which the Direct IO write was issued. Hence xfs_reflink_allocate_cow() ends up returning -ENOSPC. The following script reliably recreates the bug described above. #!/usr/bin/bash device=/dev/loop0 shortdev=$(basename $device) mntpnt=/mnt/ file1=${mntpnt}/file1 file2=${mntpnt}/file2 fragmentedfile=${mntpnt}/fragmentedfile punchprog=/root/repos/xfstests-dev/src/punch-alternating errortag=/sys/fs/xfs/${shortdev}/errortag/bmap_alloc_minlen_extent umount $device > /dev/null 2>&1 echo "Create FS" mkfs.xfs -f -m reflink=1 $device > /dev/null 2>&1 if [[ $? != 0 ]]; then echo "mkfs failed." exit 1 fi echo "Mount FS" mount $device $mntpnt > /dev/null 2>&1 if [[ $? != 0 ]]; then echo "mount failed." exit 1 fi echo "Create source file" xfs_io -f -c "pwrite 0 32M" $file1 > /dev/null 2>&1 sync echo "Create Reflinked file" xfs_io -f -c "reflink $file1" $file2 &>/dev/null echo "Set cowextsize" xfs_io -c "cowextsize 16M" $file1 > /dev/null 2>&1 echo "Fragment FS" xfs_io -f -c "pwrite 0 64M" $fragmentedfile > /dev/null 2>&1 sync $punchprog $fragmentedfile echo "Allocate block sized extent from now onwards" echo -n 1 > $errortag echo "Create 16MiB delalloc extent in CoW fork" xfs_io -c "pwrite 0 4k" $file1 > /dev/null 2>&1 sync echo "Direct I/O write at offset 12k" xfs_io -d -c "pwrite 12k 8k" $file1 This commit fixes the bug by invoking xfs_bmapi_write() in a loop until disk blocks are allocated for atleast the starting file offset of the Direct IO write range. Fixes: 3c68d44a2b49 ("xfs: allocate direct I/O COW blocks in iomap_begin") Reported-and-Root-caused-by: Wengang Wang <wen.gang.wang@oracle.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [djwong: slight editing to make the locking less grody, and fix some style things] Signed-off-by: Darrick J. Wong <djwong@kernel.org>