Age | Commit message (Collapse) | Author |
|
commit f588d72bd95f748849685412b1f0c7959ca228cf upstream.
The Linux NFS server strips the SUID and SGID from the file mode
on ALLOCATE op.
Modify _nfs42_proc_fallocate to add NFS_INO_REVAL_FORCED to
nfs_set_cache_invalid's argument to force update of the file
mode suid/sgid bit.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 379e4adfddd6a2f95a4f2029b8ddcbacf92b21f9 upstream.
This patches fixes commit 51d674a5e488 "NFSv4.1: use
EXCHGID4_FLAG_USE_PNFS_DS for DS server", purpose of that
commit was to mark EXCHANGE_ID to the DS with the appropriate
flag.
However, connection to MDS can return both EXCHGID4_FLAG_USE_PNFS_DS
and EXCHGID4_FLAG_USE_PNFS_MDS set but previous patch would only
remember the USE_PNFS_DS and for the 2nd EXCHANGE_ID send that
to the MDS.
Instead, just mark the pnfs path exclusively.
Fixes: 51d674a5e488 ("NFSv4.1: use EXCHGID4_FLAG_USE_PNFS_DS for DS server")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e1c6cfbb3bd1377e2ddcbe06cf8fb1ec323ea7d3 upstream.
Ensure that we check the layout pointer and validity after dereferencing
it in ff_layout_mirror_prepare_stats.
Fixes: 08e2e5bc6c9a ("pNFS/flexfiles: Clean up layoutstats")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f63955721a8020e979b99cc417dcb6da3106aa24 upstream.
We are not allowed to call pnfs_mark_matching_lsegs_return() without
also holding a reference to the layout header, since doing so could lead
to the reference count going to zero when we call
pnfs_layout_remove_lseg(). This again can lead to a hang when we get to
nfs4_evict_inode() and are unable to clear the layout pointer.
pnfs_layout_return_unused_byserver() is guilty of this behaviour, and
has been seen to trigger the refcount warning prior to a hang.
Fixes: b6d49ecd1081 ("NFSv4: Fix a pNFS layout related use-after-free race when freeing the inode")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e36f94914021e58ee88a8856c7fdf35adf9c7ee1 ]
At btrfs_realloc_node() we have these checks to verify we are not using a
stale transaction (a past transaction with an unblocked state or higher),
and the only thing we do is to trigger two WARN_ON(). This however is a
critical problem, highly unexpected and if it happens it's most likely due
to a bug, so we should error out and turn the fs into error state so that
such issue is much more easily noticed if it's triggered.
The problem is critical because in btrfs_realloc_node() we COW tree blocks,
and using such stale transaction will lead to not persisting the extent
buffers used for the COW operations, as allocating tree block adds the
range of the respective extent buffers to the ->dirty_pages iotree of the
transaction, and a stale transaction, in the unlocked state or higher,
will not flush dirty extent buffers anymore, therefore resulting in not
persisting the tree block and resource leaks (not cleaning the dirty_pages
iotree for example).
So do the following changes:
1) Return -EUCLEAN if we find a stale transaction;
2) Turn the fs into error state, with error -EUCLEAN, so that no
transaction can be committed, and generate a stack trace;
3) Combine both conditions into a single if statement, as both are related
and have the same error message;
4) Mark the check as unlikely, since this is not expected to ever happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a2caab29884397e583d09be6546259a83ebfbdb1 ]
At btrfs_cow_block() we check if the block being COWed belongs to a root
that is being deleted and if so we log an error message. However this is
an unexpected case and it indicates a bug somewhere, so we should return
an error and abort the transaction. So change this in the following ways:
1) Abort the transaction with -EUCLEAN, so that if the issue ever happens
it can easily be noticed;
2) Change the logged message level from error to critical, and change the
message itself to print the block's logical address and the ID of the
root;
3) Return -EUCLEAN to the caller;
4) As this is an unexpected scenario, that should never happen, mark the
check as unlikely, allowing the compiler to potentially generate better
code.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 48774f3bf8b4dd3b1a0e155825c9ce48483db14c ]
At btrfs_cow_block() we have these checks to verify we are not using a
stale transaction (a past transaction with an unblocked state or higher),
and the only thing we do is to trigger a WARN with a message and a stack
trace. This however is a critical problem, highly unexpected and if it
happens it's most likely due to a bug, so we should error out and turn the
fs into error state so that such issue is much more easily noticed if it's
triggered.
The problem is critical because using such stale transaction will lead to
not persisting the extent buffer used for the COW operation, as allocating
a tree block adds the range of the respective extent buffer to the
->dirty_pages iotree of the transaction, and a stale transaction, in the
unlocked state or higher, will not flush dirty extent buffers anymore,
therefore resulting in not persisting the tree block and resource leaks
(not cleaning the dirty_pages iotree for example).
So do the following changes:
1) Return -EUCLEAN if we find a stale transaction;
2) Turn the fs into error state, with error -EUCLEAN, so that no
transaction can be committed, and generate a stack trace;
3) Combine both conditions into a single if statement, as both are related
and have the same error message;
4) Mark the check as unlikely, since this is not expected to ever happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 9147b9ded499d9853bdf0e9804b7eaa99c4429ed ]
Jens reported the following warnings from -Wmaybe-uninitialized recent
Linus' branch.
In file included from ./include/asm-generic/rwonce.h:26,
from ./arch/arm64/include/asm/rwonce.h:71,
from ./include/linux/compiler.h:246,
from ./include/linux/export.h:5,
from ./include/linux/linkage.h:7,
from ./include/linux/kernel.h:17,
from fs/btrfs/ioctl.c:6:
In function ‘instrument_copy_from_user_before’,
inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3,
inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7,
inlined from ‘btrfs_ioctl_space_info’ at fs/btrfs/ioctl.c:2999:6,
inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4616:10:
./include/linux/kasan-checks.h:38:27: warning: ‘space_args’ may be used
uninitialized [-Wmaybe-uninitialized]
38 | #define kasan_check_write __kasan_check_write
./include/linux/instrumented.h:129:9: note: in expansion of macro
‘kasan_check_write’
129 | kasan_check_write(to, n);
| ^~~~~~~~~~~~~~~~~
./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’:
./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const
volatile void *’ to ‘__kasan_check_write’ declared here
20 | bool __kasan_check_write(const volatile void *p, unsigned int
size);
| ^~~~~~~~~~~~~~~~~~~
fs/btrfs/ioctl.c:2981:39: note: ‘space_args’ declared here
2981 | struct btrfs_ioctl_space_args space_args;
| ^~~~~~~~~~
In function ‘instrument_copy_from_user_before’,
inlined from ‘_copy_from_user’ at ./include/linux/uaccess.h:148:3,
inlined from ‘copy_from_user’ at ./include/linux/uaccess.h:183:7,
inlined from ‘_btrfs_ioctl_send’ at fs/btrfs/ioctl.c:4343:9,
inlined from ‘btrfs_ioctl’ at fs/btrfs/ioctl.c:4658:10:
./include/linux/kasan-checks.h:38:27: warning: ‘args32’ may be used
uninitialized [-Wmaybe-uninitialized]
38 | #define kasan_check_write __kasan_check_write
./include/linux/instrumented.h:129:9: note: in expansion of macro
‘kasan_check_write’
129 | kasan_check_write(to, n);
| ^~~~~~~~~~~~~~~~~
./include/linux/kasan-checks.h: In function ‘btrfs_ioctl’:
./include/linux/kasan-checks.h:20:6: note: by argument 1 of type ‘const
volatile void *’ to ‘__kasan_check_write’ declared here
20 | bool __kasan_check_write(const volatile void *p, unsigned int
size);
| ^~~~~~~~~~~~~~~~~~~
fs/btrfs/ioctl.c:4341:49: note: ‘args32’ declared here
4341 | struct btrfs_ioctl_send_args_32 args32;
| ^~~~~~
This was due to his config options and having KASAN turned on,
which adds some extra checks around copy_from_user(), which then
triggered the -Wmaybe-uninitialized checker for these cases.
Fix the warnings by initializing the different structs we're copying
into.
Reported-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 03dbab3bba5f009d053635c729d1244f2c8bad38 ]
Nathan reported that he was seeing the new warning in
setattr_copy_mgtime pop when starting podman containers. Overlayfs is
trying to set the atime and mtime via notify_change without also
setting the ctime.
POSIX states that when the atime and mtime are updated via utimes() that
we must also update the ctime to the current time. The situation with
overlayfs copy-up is analogies, so add ATTR_CTIME to the bitmask.
notify_change will fill in the value.
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Christian Brauner <brauner@kernel.org>
Acked-by: Amir Goldstein <amir73il@gmail.com>
Message-Id: <20230913-ctime-v1-1-c6bc509cbc27@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b4c639f699349880b7918b861e1bd360442ec450 ]
Jens reported a compiler warning when using
CONFIG_CC_OPTIMIZE_FOR_SIZE=y that looks like this
fs/btrfs/tree-log.c: In function ‘btrfs_log_prealloc_extents’:
fs/btrfs/tree-log.c:4828:23: warning: ‘start_slot’ may be used
uninitialized [-Wmaybe-uninitialized]
4828 | ret = copy_items(trans, inode, dst_path, path,
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
4829 | start_slot, ins_nr, 1, 0);
| ~~~~~~~~~~~~~~~~~~~~~~~~~
fs/btrfs/tree-log.c:4725:13: note: ‘start_slot’ was declared here
4725 | int start_slot;
| ^~~~~~~~~~
The compiler is incorrect, as we only use this code when ins_len > 0,
and when ins_len > 0 we have start_slot properly initialized. However
we generally find the -Wmaybe-uninitialized warnings valuable, so
initialize start_slot to get rid of the warning.
Reported-by: Jens Axboe <axboe@kernel.dk>
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 1bf76df3fee56d6637718e267f7c34ed70d0c7dc ]
When running a delayed tree reference, if we find a ref count different
from 1, we return -EIO. This isn't an IO error, as it indicates either a
bug in the delayed refs code or a memory corruption, so change the error
code from -EIO to -EUCLEAN. Also tag the branch as 'unlikely' as this is
not expected to ever happen, and change the error message to print the
tree block's bytenr without the parenthesis (and there was a missing space
between the 'block' word and the opening parenthesis), for consistency as
that's the style we used everywhere else.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a7ddeeb079505961355cf0106154da0110f1fdff ]
When starting a transaction, with a non-zero number of items, we reserve
metadata space for that number of items and for delayed refs by doing a
call to btrfs_block_rsv_add(), with the transaction block reserve passed
as the block reserve argument. This reserves metadata space and adds it
to the transaction block reserve. Later we migrate the space we reserved
for delayed references from the transaction block reserve into the delayed
refs block reserve, by calling btrfs_migrate_to_delayed_refs_rsv().
btrfs_migrate_to_delayed_refs_rsv() decrements the number of bytes to
migrate from the source block reserve, and this however may result in an
underflow in case the space added to the transaction block reserve ended
up being used by another task that has not reserved enough space for its
own use - examples are tasks doing reflinks or hole punching because they
end up calling btrfs_replace_file_extents() -> btrfs_drop_extents() and
may need to modify/COW a variable number of leaves/paths, so they keep
trying to use space from the transaction block reserve when they need to
COW an extent buffer, and may end up trying to use more space then they
have reserved (1 unit/path only for removing file extent items).
This can be avoided by simply reserving space first without adding it to
the transaction block reserve, then add the space for delayed refs to the
delayed refs block reserve and finally add the remaining reserved space
to the transaction block reserve. This also makes the code a bit shorter
and simpler. So just do that.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit be049c3a088d512187407b7fd036cecfab46d565 ]
When writing back an inode and performing an fsync on it concurrently, a
deadlock issue may arise as shown below. In each writeback iteration, a
clean inode is requeued to the wb->b_dirty queue due to non-zero
pages_skipped, without anything actually being written. This causes an
infinite loop and prevents the plug from being flushed, resulting in a
deadlock. We now avoid requeuing the clean inode to prevent this issue.
wb_writeback fsync (inode-Y)
blk_start_plug(&plug)
for (;;) {
iter i-1: some reqs with page-X added into plug->mq_list // f2fs node page-X with PG_writeback
filemap_fdatawrite
__filemap_fdatawrite_range // write inode-Y with sync_mode WB_SYNC_ALL
do_writepages
f2fs_write_data_pages
__f2fs_write_data_pages // wb_sync_req[DATA]++ for WB_SYNC_ALL
f2fs_write_cache_pages
f2fs_write_single_data_page
f2fs_do_write_data_page
f2fs_outplace_write_data
f2fs_update_data_blkaddr
f2fs_wait_on_page_writeback
wait_on_page_writeback // wait for f2fs node page-X
iter i:
progress = __writeback_inodes_wb(wb, work)
. writeback_sb_inodes
. __writeback_single_inode // write inode-Y with sync_mode WB_SYNC_NONE
. . do_writepages
. . f2fs_write_data_pages
. . . __f2fs_write_data_pages // skip writepages due to (wb_sync_req[DATA]>0)
. . . wbc->pages_skipped += get_dirty_pages(inode) // wbc->pages_skipped = 1
. if (!(inode->i_state & I_DIRTY_ALL)) // i_state = I_SYNC | I_SYNC_QUEUED
. total_wrote++; // total_wrote = 1
. requeue_inode // requeue inode-Y to wb->b_dirty queue due to non-zero pages_skipped
if (progress) // progress = 1
continue;
iter i+1:
queue_io
// similar process with iter i, infinite for-loop !
}
blk_finish_plug(&plug) // flush plug won't be called
Signed-off-by: Chunhai Guo <guochunhai@vivo.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Message-Id: <20230916045131.957929-1-guochunhai@vivo.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit bfbe5b31caa74ab97f1784fe9ade5f45e0d3de91 upstream.
Reported-by: syzbot+e94d98936a0ed08bde43@syzkaller.appspotmail.com
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8e7e27b2ee1e19c4040d4987e345f678a74c0aed upstream.
Here is a BUG report about linux-6.1 from syzbot, but it still remains
within upstream:
BUG: KASAN: slab-out-of-bounds in ntfs_list_ea fs/ntfs3/xattr.c:191 [inline]
BUG: KASAN: slab-out-of-bounds in ntfs_listxattr+0x401/0x570 fs/ntfs3/xattr.c:710
Read of size 1 at addr ffff888021acaf3d by task syz-executor128/3632
Call Trace:
kasan_report+0x139/0x170 mm/kasan/report.c:495
ntfs_list_ea fs/ntfs3/xattr.c:191 [inline]
ntfs_listxattr+0x401/0x570 fs/ntfs3/xattr.c:710
vfs_listxattr fs/xattr.c:457 [inline]
listxattr+0x293/0x2d0 fs/xattr.c:804
path_listxattr fs/xattr.c:828 [inline]
__do_sys_llistxattr fs/xattr.c:846 [inline]
Before derefering field members of `ea` in unpacked_ea_size(), we need to
check whether the EA_FULL struct is located in access validate range.
Similarly, when derefering `ea->name` field member, we need to check
whethe the ea->name is located in access validate range, too.
Fixes: be71b5cba2e6 ("fs/ntfs3: Add attrib operations")
Reported-by: syzbot+9fcea5ef6dc4dc72d334@syzkaller.appspotmail.com
Signed-off-by: Zeng Heng <zengheng4@huawei.com>
[almaz.alexandrovich@paragon-software.com: took the ret variable out of the loop block]
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1f9b94af923c88539426ed811ae7e9543834a5c5 upstream.
Upon investigation of the C reproducer provided by Syzbot, it seemed
the reproducer was trying to mount a corrupted NTFS filesystem, then
issue a rename syscall to some nodes in the filesystem. This can be
shown by modifying the reproducer to only include the mount syscall,
and investigating the filesystem by e.g. `ls` and `rm` commands. As a
result, during the problematic call to `hdr_fine_e`, the `inode` being
supplied did not go through `indx_init`, hence the `cmp` function
pointer was never set.
The fix is simply to check whether `cmp` is not set, and return NULL
if that's the case, in order to be consistent with other error
scenarios of the `hdr_find_e` method. The rationale behind this patch
is that:
- We should prevent crashing the kernel even if the mounted filesystem
is corrupted. Any syscalls made on the filesystem could return
invalid, but the kernel should be able to sustain these calls.
- Only very specific corruption would lead to this bug, so it would be
a pretty rare case in actual usage anyways. Therefore, introducing a
check to specifically protect against this bug seems appropriate.
Because of its rarity, an `unlikely` clause is used to wrap around
this nullity check.
Reported-by: syzbot+60cf892fc31d1f4358fc@syzkaller.appspotmail.com
Signed-off-by: Ziqi Zhao <astrajoan@yahoo.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 03adc61edad49e1bbecfb53f7ea5d78f398fe368 upstream.
An io_uring openat operation can update an audit reference count
from multiple threads resulting in the call trace below.
A call to io_uring_submit() with a single openat op with a flag of
IOSQE_ASYNC results in the following reference count updates.
These first part of the system call performs two increments that do not race.
do_syscall_64()
__do_sys_io_uring_enter()
io_submit_sqes()
io_openat_prep()
__io_openat_prep()
getname()
getname_flags() /* update 1 (increment) */
__audit_getname() /* update 2 (increment) */
The openat op is queued to an io_uring worker thread which starts the
opportunity for a race. The system call exit performs one decrement.
do_syscall_64()
syscall_exit_to_user_mode()
syscall_exit_to_user_mode_prepare()
__audit_syscall_exit()
audit_reset_context()
putname() /* update 3 (decrement) */
The io_uring worker thread performs one increment and two decrements.
These updates can race with the system call decrement.
io_wqe_worker()
io_worker_handle_work()
io_wq_submit_work()
io_issue_sqe()
io_openat()
io_openat2()
do_filp_open()
path_openat()
__audit_inode() /* update 4 (increment) */
putname() /* update 5 (decrement) */
__audit_uring_exit()
audit_reset_context()
putname() /* update 6 (decrement) */
The fix is to change the refcnt member of struct audit_names
from int to atomic_t.
kernel BUG at fs/namei.c:262!
Call Trace:
...
? putname+0x68/0x70
audit_reset_context.part.0.constprop.0+0xe1/0x300
__audit_uring_exit+0xda/0x1c0
io_issue_sqe+0x1f3/0x450
? lock_timer_base+0x3b/0xd0
io_wq_submit_work+0x8d/0x2b0
? __try_to_del_timer_sync+0x67/0xa0
io_worker_handle_work+0x17c/0x2b0
io_wqe_worker+0x10a/0x350
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/lkml/MW2PR2101MB1033FFF044A258F84AEAA584F1C9A@MW2PR2101MB1033.namprd21.prod.outlook.com/
Fixes: 5bd2182d58e9 ("audit,io_uring,io-wq: add some basic audit support to io_uring")
Signed-off-by: Dan Clash <daclash@linux.microsoft.com>
Link: https://lore.kernel.org/r/20231012215518.GA4048@linuxonhyperv3.guj3yctzbm1etfxqx2vob5hsef.xx.internal.cloudapp.net
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 8a540e990d7da36813cb71a4a422712bfba448a4 upstream.
Commit f6fca3917b4d "btrfs: store chunk size in space-info struct"
broke data chunk allocations on non-zoned multi-device filesystems when
using default chunk_size. Commit 5da431b71d4b "btrfs: fix the max chunk
size and stripe length calculation" partially fixed that, and this patch
completes the fix for that case.
After commit f6fca3917b4d and 5da431b71d4b, the sequence of events for
a data chunk allocation on a non-zoned filesystem is:
1. btrfs_create_chunk calls init_alloc_chunk_ctl, which copies
space_info->chunk_size (default 10 GiB) to ctl->max_stripe_len
unmodified. Before f6fca3917b4d, ctl->max_stripe_len value was
1 GiB for non-zoned data chunks and not configurable.
2. btrfs_create_chunk calls gather_device_info which consumes
and produces more fields of chunk_ctl.
3. gather_device_info multiplies ctl->max_stripe_len by
ctl->dev_stripes (which is 1 in all cases except dup)
and calls find_free_dev_extent with that number as num_bytes.
4. find_free_dev_extent locates the first dev_extent hole on
a device which is at least as large as num_bytes. With default
max_chunk_size from f6fca3917b4d, it finds the first hole which is
longer than 10 GiB, or the largest hole if that hole is shorter
than 10 GiB. This is different from the pre-f6fca3917b4d
behavior, where num_bytes is 1 GiB, and find_free_dev_extent
may choose a different hole.
5. gather_device_info repeats step 4 with all devices to find
the first or largest dev_extent hole that can be allocated on
each device.
6. gather_device_info sorts the device list by the hole size
on each device, using total unallocated space on each device to
break ties, then returns to btrfs_create_chunk with the list.
7. btrfs_create_chunk calls decide_stripe_size_regular.
8. decide_stripe_size_regular finds the largest stripe_len that
fits across the first nr_devs device dev_extent holes that were
found by gather_device_info (and satisfies other constraints
on stripe_len that are not relevant here).
9. decide_stripe_size_regular caps the length of the stripe it
computed at 1 GiB. This cap appeared in 5da431b71d4b to correct
one of the other regressions introduced in f6fca3917b4d.
10. btrfs_create_chunk creates a new chunk with the above
computed size and number of devices.
At step 4, gather_device_info() has found a location where stripe up to
10 GiB in length could be allocated on several devices, and selected
which devices should have a dev_extent allocated on them, but at step
9, only 1 GiB of the space that was found on each device can be used.
This mismatch causes new suboptimal chunk allocation cases that did not
occur in pre-f6fca3917b4d kernels.
Consider a filesystem using raid1 profile with 3 devices. After some
balances, device 1 has 10x 1 GiB unallocated space, while devices 2
and 3 have 1x 10 GiB unallocated space, i.e. the same total amount of
space, but distributed across different numbers of dev_extent holes.
For visualization, let's ignore all the chunks that were allocated before
this point, and focus on the remaining holes:
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10x 1 GiB unallocated)
Device 2: [__________] (10 GiB contig unallocated)
Device 3: [__________] (10 GiB contig unallocated)
Before f6fca3917b4d, the allocator would fill these optimally by
allocating chunks with dev_extents on devices 1 and 2 ([12]), 1 and 3
([13]), or 2 and 3 ([23]):
[after 0 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [__________] (10 GiB)
Device 3: [__________] (10 GiB)
[after 1 chunk allocation]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_]
Device 2: [12] [_________] (9 GiB)
Device 3: [__________] (10 GiB)
[after 2 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [12] [_________] (9 GiB)
Device 3: [13] [_________] (9 GiB)
[after 3 chunk allocations]
Device 1: [12] [13] [12] [_] [_] [_] [_] [_] [_] [_] (7 GiB)
Device 2: [12] [12] [________] (8 GiB)
Device 3: [13] [_________] (9 GiB)
[...]
[after 12 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [_] [_] (2 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [__] (2 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 13 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [_] (1 GiB)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [__] (2 GiB)
[after 14 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [_] (1 GiB)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [_] (1 GiB)
[after 15 chunk allocations]
Device 1: [12] [13] [12] [13] [12] [13] [12] [13] [12] [13] (full)
Device 2: [12] [12] [23] [23] [12] [12] [23] [23] [12] [23] (full)
Device 3: [13] [13] [23] [23] [13] [23] [13] [23] [13] [23] (full)
This allocates all of the space with no waste. The sorting function used
by gather_device_info considers free space holes above 1 GiB in length
to be equal to 1 GiB, so once find_free_dev_extent locates a sufficiently
long hole on each device, all the holes appear equal in the sort, and the
comparison falls back to sorting devices by total free space. This keeps
usable space on each device equal so they can all be filled completely.
After f6fca3917b4d, the allocator prefers the devices with larger holes
over the devices with more free space, so it makes bad allocation choices:
[after 1 chunk allocation]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [_________] (9 GiB)
Device 3: [23] [_________] (9 GiB)
[after 2 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [________] (8 GiB)
Device 3: [23] [23] [________] (8 GiB)
[after 3 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [_______] (7 GiB)
Device 3: [23] [23] [23] [_______] (7 GiB)
[...]
[after 9 chunk allocations]
Device 1: [_] [_] [_] [_] [_] [_] [_] [_] [_] [_] (10 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 10 chunk allocations]
Device 1: [12] [_] [_] [_] [_] [_] [_] [_] [_] [_] (9 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [_] (1 GiB)
[after 11 chunk allocations]
Device 1: [12] [13] [_] [_] [_] [_] [_] [_] [_] [_] (8 GiB)
Device 2: [23] [23] [23] [23] [23] [23] [23] [23] [12] (full)
Device 3: [23] [23] [23] [23] [23] [23] [23] [23] [13] (full)
No further allocations are possible, with 8 GiB wasted (4 GiB of data
space). The sort in gather_device_info now considers free space in
holes longer than 1 GiB to be distinct, so it will prefer devices 2 and
3 over device 1 until all but 1 GiB is allocated on devices 2 and 3.
At that point, with only 1 GiB unallocated on every device, the largest
hole length on each device is equal at 1 GiB, so the sort finally moves
to ordering the devices with the most free space, but by this time it
is too late to make use of the free space on device 1.
Note that it's possible to contrive a case where the pre-f6fca3917b4d
allocator fails the same way, but these cases generally have extensive
dev_extent fragmentation as a precondition (e.g. many holes of 768M
in length on one device, and few holes 1 GiB in length on the others).
With the regression in f6fca3917b4d, bad chunk allocation can occur even
under optimal conditions, when all dev_extent holes are exact multiples
of stripe_len in length, as in the example above.
Also note that post-f6fca3917b4d kernels do treat dev_extent holes
larger than 10 GiB as equal, so the bad behavior won't show up on a
freshly formatted filesystem; however, as the filesystem ages and fills
up, and holes ranging from 1 GiB to 10 GiB in size appear, the problem
can show up as a failure to balance after adding or removing devices,
or an unexpected shortfall in available space due to unequal allocation.
To fix the regression and make data chunk allocation work
again, set ctl->max_stripe_len back to the original SZ_1G, or
space_info->chunk_size if that's smaller (the latter can happen if the
user set space_info->chunk_size to less than 1 GiB via sysfs, or it's
a 32 MiB system chunk with a hardcoded chunk_size and stripe_len).
While researching the background of the earlier commits, I found that an
identical fix was already proposed at:
https://lore.kernel.org/linux-btrfs/de83ac46-a4a3-88d3-85ce-255b7abc5249@gmx.com/
The previous review missed one detail: ctl->max_stripe_len is used
before decide_stripe_size_regular() is called, when it is too late for
the changes in that function to have any effect. ctl->max_stripe_len is
not used directly by decide_stripe_size_regular(), but the parameter
does heavily influence the per-device free space data presented to
the function.
Fixes: f6fca3917b4d ("btrfs: store chunk size in space-info struct")
CC: stable@vger.kernel.org # 6.1+
Link: https://lore.kernel.org/linux-btrfs/20231007051421.19657-1-ce3g8jdj@umail.furryterror.org/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 07bb00ef00ace88dd6f695fadbba76565756e55c upstream.
In this code "ret" is type long and "src_objlen" is unsigned int. The
problem is that on 32bit systems, when we do the comparison signed longs
are type promoted to unsigned int. So negative error codes from
do_splice_direct() are treated as success instead of failure.
Cc: stable@vger.kernel.org
Fixes: 1b0c3b9f91f0 ("ceph: re-org copy_file_range and fix some error paths")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Xiubo Li <xiubli@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 15c0a870dc44ed14e01efbdd319d232234ee639f upstream.
When truncating the inode the MDS will acquire the xlock for the
ifile Locker, which will revoke the 'Frwsxl' caps from the clients.
But when the client just releases and flushes the 'Fw' caps to MDS,
for exmaple, and once the MDS receives the caps flushing msg it
just thought the revocation has finished. Then the MDS will continue
truncating the inode and then issued the truncate notification to
all the clients. While just before the clients receives the cap
flushing ack they receive the truncation notification, the clients
will detecte that the 'issued | dirty' is still holding the 'Fw'
caps.
Cc: stable@vger.kernel.org
Link: https://tracker.ceph.com/issues/56693
Fixes: b0d7c2231015 ("ceph: introduce i_truncate_mutex")
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f43328357defc0dc9d28dbd06dc3361fd2b22e28 upstream.
Cthon test fail with the following error.
check for proper open/unlink operation
nfsjunk files before unlink:
-rwxr-xr-x 1 root root 0 9월 25 11:03 ./nfs2y8Jm9
./nfs2y8Jm9 open; unlink ret = 0
nfsjunk files after unlink:
-rwxr-xr-x 1 root root 0 9월 25 11:03 ./nfs2y8Jm9
data compare ok
nfsjunk files after close:
ls: cannot access './nfs2y8Jm9': No such file or directory
special tests failed
Cthon expect to second unlink failure when file is already unlinked.
ksmbd can not allow to open file if flags of ksmbd inode is set with
S_DEL_ON_CLS flags.
Cc: stable@vger.kernel.org
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 869b6ea1609f655a43251bf41757aa44e5350a8f upstream.
Eric has reported that commit dabc8b207566 ("quota: fix dqput() to
follow the guarantees dquot_srcu should provide") heavily increases
runtime of generic/270 xfstest for ext4 in nojournal mode. The reason
for this is that ext4 in nojournal mode leaves dquots dirty until the last
dqput() and thus the cleanup done in quota_release_workfn() has to write
them all. Due to the way quota_release_workfn() is written this results
in synchronize_srcu() call for each dirty dquot which makes the dquot
cleanup when turning quotas off extremely slow.
To be able to avoid synchronize_srcu() for each dirty dquot we need to
rework how we track dquots to be cleaned up. Instead of keeping the last
dquot reference while it is on releasing_dquots list, we drop it right
away and mark the dquot with new DQ_RELEASING_B bit instead. This way we
can we can remove dquot from releasing_dquots list when new reference to
it is acquired and thus there's no need to call synchronize_srcu() each
time we drop dq_list_lock.
References: https://lore.kernel.org/all/ZRytn6CxFK2oECUt@debian-BULLSEYE-live-builder-AMD64
Reported-by: Eric Whitney <enwlinux@gmail.com>
Fixes: dabc8b207566 ("quota: fix dqput() to follow the guarantees dquot_srcu should provide")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit f16fd0b11f0f4d41846b5102b1656ea1fc9ac7a0 which is
commit 954998b60caa8f2a3bf3abe490de6f08d283687a upstream.
There are reported NFS problems in the 6.1.56 release, so revert a set
of NFS patches to hopefully resolve the issue.
Reported-by: poester <poester@internetbrands.com>
Link: https://lore.kernel.org/r/20231012165439.137237-2-kernel@linuxace.com
Reported-by: Daniel Díaz <daniel.diaz@linaro.org>
Link: https://lore.kernel.org/r/2023100755-livestock-barcode-fe41@gregkh
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit 4d98038e5bd939bd13cc4e602dfe60cd5110efa8 which is
commit 7c6339322ce0c6128acbe36aacc1eeb986dd7bf1 upstream.
There are reported NFS problems in the 6.1.56 release, so revert a set
of NFS patches to hopefully resolve the issue.
Reported-by: poester <poester@internetbrands.com>
Link: https://lore.kernel.org/r/20231012165439.137237-2-kernel@linuxace.com
Reported-by: Daniel Díaz <daniel.diaz@linaro.org>
Link: https://lore.kernel.org/r/2023100755-livestock-barcode-fe41@gregkh
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit 1f49386d67792424028acfe781d466b010f8fa3f which is
commit 8982f7aff39fb526aba4441fff2525fcedd5e1a3 upstream.
There are reported NFS problems in the 6.1.56 release, so revert a set
of NFS patches to hopefully resolve the issue.
Reported-by: poester <poester@internetbrands.com>
Link: https://lore.kernel.org/r/20231012165439.137237-2-kernel@linuxace.com
Reported-by: Daniel Díaz <daniel.diaz@linaro.org>
Link: https://lore.kernel.org/r/2023100755-livestock-barcode-fe41@gregkh
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit d4729af1c73cfacb64facda3d196e25940f0e7a5 which is
commit b193a78ddb5ee7dba074d3f28dc050069ba083c0 upstream.
There are reported NFS problems in the 6.1.56 release, so revert a set
of NFS patches to hopefully resolve the issue.
Reported-by: poester <poester@internetbrands.com>
Link: https://lore.kernel.org/r/20231012165439.137237-2-kernel@linuxace.com
Reported-by: Daniel Díaz <daniel.diaz@linaro.org>
Link: https://lore.kernel.org/r/2023100755-livestock-barcode-fe41@gregkh
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit edd1f06145101dab83497806bb6162641255ef50 which is
commit b11243f720ee5f9376861099019c8542969b6318 upstream.
There are reported NFS problems in the 6.1.56 release, so revert a set
of NFS patches to hopefully resolve the issue.
Reported-by: poester <poester@internetbrands.com>
Link: https://lore.kernel.org/r/20231012165439.137237-2-kernel@linuxace.com
Reported-by: Daniel Díaz <daniel.diaz@linaro.org>
Link: https://lore.kernel.org/r/2023100755-livestock-barcode-fe41@gregkh
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 9af86694fd5d387992699ec99007ed374966ce9a upstream.
This was noticed by Miklos that file_remove_privs might call into
notify_change(), which requires to hold an exclusive lock. The problem
exists in FUSE and btrfs. We can fix it without any additional helpers
from VFS, in case the privileges would need to be dropped, change the
lock type to be exclusive and redo the loop.
Fixes: e9adabb9712e ("btrfs: use shared lock for direct writes within EOF")
CC: Miklos Szeredi <miklos@szeredi.hu>
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bernd Schubert <bschubert@ddn.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fee4c19937439693f2420a916169d08e88576e8e upstream.
When logging a new name, we don't expect to fail joining a log transaction
since we know at least one of the inodes was logged before in the current
transaction. However if we fail for some unexpected reason, we end up not
freeing the fscrypt name we previously allocated. So fix that by freeing
the name in case we failed to join a log transaction.
Fixes: ab3c5c18e8fa ("btrfs: setup qstr from dentrys using fscrypt helper")
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit abe3bf7425fb695a9b37394af18b9ea58a800802 upstream.
If new_whiteout_inode() fails, some resources need to be freed.
Add the missing goto to the error handling path.
Fixes: ab3c5c18e8fa ("btrfs: setup qstr from dentrys using fscrypt helper")
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c69813471a1ec081a0b9bf0c6bd7e8afd818afce upstream.
drop reference after use opinfo.
Signed-off-by: luosili <rootlab@huawei.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 53ff5cf89142b978b1a5ca8dc4d4425e6a09745f upstream.
Thread A + Thread B
ksmbd_session_lookup | smb2_sess_setup
sess = xa_load |
|
| xa_erase(&conn->sessions, sess->id);
|
| ksmbd_session_destroy(sess) --> kfree(sess)
|
// UAF! |
sess->last_active = jiffies |
+
This patch add rwsem to fix race condition between ksmbd_session_lookup
and ksmbd_expire_session.
Reported-by: luosili <rootlab@huawei.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit cedc019b9f260facfadd20c6c490e403abf292e3 upstream.
Recent changes to kernel_connect() and kernel_bind() ensure that
callers are insulated from changes to the address parameter made by BPF
SOCK_ADDR hooks. This patch wraps direct calls to ops->connect() and
ops->bind() with kernel_connect() and kernel_bind() to ensure that SMB
mounts do not see their mount address overwritten in such cases.
Link: https://lore.kernel.org/netdev/9944248dba1bce861375fcce9de663934d933ba9.camel@redhat.com/
Cc: <stable@vger.kernel.org> # 6.0+
Signed-off-by: Jordan Rife <jrife@google.com>
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ed1cc05aa1f7fe8197d300e914afc28ab9818f89 ]
If the NFS4CLNT_RUN_MANAGER flag got set just before we cleared
NFS4CLNT_MANAGER_RUNNING, then we might have won the race against
nfs4_schedule_state_manager(), and are responsible for handling the
recovery situation.
Fixes: aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 75a5221630fe5aa3fedba7a06be618db0f79ba1e ]
When stressing microLZMA EROFS images with the new global compressed
deduplication feature enabled (`-Ededupe`), I found some short-lived
temporary pages weren't properly released, which could slowly cause
unexpected OOMs hours later.
Let's fix it now (LZ4 and DEFLATE don't have this issue.)
Fixes: 5c2a64252c5d ("erofs: introduce partial-referenced pclusters")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230907050542.97152-1-hsiangkao@linux.alibaba.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 5b4a82a0724af1dfd1320826e0266117b6a57fbd ]
Olga Kornievskaia reports that this patch breaks NFSv4.0 state recovery.
It also introduces additional complexity in the error paths for cases not
related to the original problem. Let's revert it for now, and address the
original problem in another manner.
This reverts commit f5ea16137a3fa2858620dc9084466491c128535f.
Fixes: f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during delegation return")
Reported-by: Kornievskaia, Olga <Olga.Kornievskaia@netapp.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6db75318823a169e836a478ca57d6a7c0a156b77 ]
While struct qstr is more natural without fscrypt, since it's provided
by dentries, struct fscrypt_str is provided by the fscrypt handlers
processing dentries, and is thus more natural in the fscrypt world.
Replace all of the struct qstr uses with struct fscrypt_str.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ab3c5c18e8fa3f8ea116016095d25adab466cd39 ]
Most places where we get a struct qstr, we are doing so from a dentry.
With fscrypt, the dentry's name may be encrypted on-disk, so fscrypt
provides a helper to convert a dentry name to the appropriate disk name
if necessary. Convert each of the dentry name accesses to use
fscrypt_setup_filename(), then convert the resulting fscrypt_name back
to an unencrypted qstr. This does not work for nokey names, but the
specific locations that could spawn nokey names are noted.
At present, since there are no encrypted directories, nothing goes down
the filename encryption paths.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e43eec81c5167b655b72c781b0e75e62a05e415e ]
Many functions throughout btrfs take name buffer and name length
arguments. Most of these functions at the highest level are usually
called with these arguments extracted from a supplied dentry's name.
But the entire name can be passed instead, making each function a little
more elegant.
Each function whose arguments are currently the name and length
extracted from a dentry is herein converted to instead take a pointer to
the name in the dentry. The couple of calls to these calls without a
struct dentry are converted to create an appropriate qstr to pass in.
Additionally, every function which is only called with a name/len
extracted directly from a qstr is also converted.
This change has positive effect on stack consumption, frame of many
functions is reduced but this will be used in the future for fscrypt
related structures.
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 9af86694fd5d ("btrfs: file_remove_privs needs an exclusive lock in direct io write")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 956fd46f97d238032cb5fa4771cdaccc6e760f9a ]
Commit 4dc73c679114 reintroduces the deadlock that was fixed by commit
aeabb3c96186 ("NFSv4: Fix a NFSv4 state manager deadlock") because it
prevents the setup of new threads to handle reboot recovery, while the
older recovery thread is stuck returning delegations.
Fixes: 4dc73c679114 ("NFSv4: keep state manager thread active if swap is enabled")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8b18a2edecc0741b0eecf8b18fdb356a0f8682de ]
Be brief and match the subsystem name. There's no need to distinguish this
kset variable from the server.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Stable-dep-of: 956fd46f97d2 ("NFSv4: Fix a state manager thread deadlock regression")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e025f0a73f6acb920d86549b2177a5883535421d ]
The root rpc_clnt is not used here, clean it up.
Fixes: 4dc73c679114 ("NFSv4: keep state manager thread active if swap is enabled")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Stable-dep-of: 956fd46f97d2 ("NFSv4: Fix a state manager thread deadlock regression")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 7c3151585730b7095287be8162b846d31e6eee61 upstream.
The elf-fdpic loader hard sets the process personality to either
PER_LINUX_FDPIC for true elf-fdpic binaries or to PER_LINUX for normal ELF
binaries (in this case they would be constant displacement compiled with
-pie for example). The problem with that is that it will lose any other
bits that may be in the ELF header personality (such as the "bug
emulation" bits).
On the ARM architecture the ADDR_LIMIT_32BIT flag is used to signify a
normal 32bit binary - as opposed to a legacy 26bit address binary. This
matters since start_thread() will set the ARM CPSR register as required
based on this flag. If the elf-fdpic loader loses this bit the process
will be mis-configured and crash out pretty quickly.
Modify elf-fdpic loader personality setting so that it preserves the upper
three bytes by using the SET_PERSONALITY macro to set it. This macro in
the generic case sets PER_LINUX and preserves the upper bytes.
Architectures can override this for their specific use case, and ARM does
exactly this.
The problem shows up quite easily running under qemu using the ARM
architecture, but not necessarily on all types of real ARM hardware. If
the underlying ARM processor does not support the legacy 26-bit addressing
mode then everything will work as expected.
Link: https://lkml.kernel.org/r/20230907011808.2985083-1-gerg@kernel.org
Fixes: 1bde925d23547 ("fs/binfmt_elf_fdpic.c: provide NOMMU loader for regular ELF binaries")
Signed-off-by: Greg Ungerer <gerg@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Greg Ungerer <gerg@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e6e43b8aa7cd3c3af686caf0c2e11819a886d705 upstream.
Forget to reset ctx->password to NULL will lead to bug like double free
Cc: stable@vger.kernel.org
Cc: Willy Tarreau <w@1wt.eu>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Quang Le <quanglex97@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 58bfe2ccec5f9f137b41dd38f335290dcc13cd5c upstream.
A user reported some issues with smaller file systems that get very
full. While investigating this issue I noticed that df wasn't showing
100% full, despite having 0 chunk space and having < 1MiB of available
metadata space.
This turns out to be an overflow issue, we're doing:
total_available_metadata_space - SZ_4M < global_block_rsv_size
to determine if there's not enough space to make metadata allocations,
which overflows if total_available_metadata_space is < 4M. Fix this by
checking to see if our available space is greater than the 4M threshold.
This makes df properly report 100% usage on the file system.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7ee29facd8a9c5a26079148e36bcf07141b3a6bc upstream.
In nilfs_gccache_submit_read_data(), brelse(bh) is called to drop the
reference count of bh when the call to nilfs_dat_translate() fails. If
the reference count hits 0 and its owner page gets unlocked, bh may be
freed. However, bh->b_page is dereferenced to put the page after that,
which may result in a use-after-free bug. This patch moves the release
operation after unlocking and putting the page.
NOTE: The function in question is only called in GC, and in combination
with current userland tools, address translation using DAT does not occur
in that function, so the code path that causes this issue will not be
executed. However, it is possible to run that code path by intentionally
modifying the userland GC library or by calling the GC ioctl directly.
[konishi.ryusuke@gmail.com: NOTE added to the commit log]
Link: https://lkml.kernel.org/r/1543201709-53191-1-git-send-email-bianpan2016@163.com
Link: https://lkml.kernel.org/r/20230921141731.10073-1-konishi.ryusuke@gmail.com
Fixes: a3d93f709e89 ("nilfs2: block cache for garbage collection")
Signed-off-by: Pan Bian <bianpan2016@163.com>
Reported-by: Ferry Meng <mengferry@linux.alibaba.com>
Closes: https://lkml.kernel.org/r/20230818092022.111054-1-mengferry@linux.alibaba.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4506f23e117161a20104c8fa04f33e1ca63c26af ]
Currently, if the OPEN compound experiencing an error and needs to
get the file attributes separately, it will send a stand alone
GETATTR but it would use the filehandle from the results of
the OPEN compound. In case of the CLAIM_FH OPEN, nfs_openres's fh
is zero value. That generate a GETATTR that's sent with a zero
value filehandle, and results in the server returning an error.
Instead, for the CLAIM_FH OPEN, take the filehandle that was used
in the PUTFH of the OPEN compound.
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit ebc3d4e44a7e05457825e03d0560153687265523 ]
checkpatch flagged a few places with:
WARNING: ENOTSUPP is not a SUSV4 error code, prefer EOPNOTSUPP
Also fixed minor typo
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a57c2d4e46f519b24558ae0752c17eec416ac72a ]
When removing a delayed item, or releasing which will remove it as well,
we will modify one of the delayed node's rbtrees and item counter if the
delayed item is in one of the rbtrees. This require having the delayed
node's mutex locked, otherwise we will race with other tasks modifying
the rbtrees and the counter.
This is motivated by a previous version of another patch actually calling
btrfs_release_delayed_item() after unlocking the delayed node's mutex and
against a delayed item that is in a rbtree.
So assert at __btrfs_remove_delayed_item() that the delayed node's mutex
is locked.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e3dfcab2080dc1f9a4b09cc1327361bc2845bfcd ]
When unmounting all the dirty buffers will be flushed and after
the last osd request is finished the last reference of the i_count
will be released. Then it will flush the dirty cap/snap to MDSs,
and the unmounting won't wait the possible acks, which will ihold
the inodes when updating the metadata locally but makes no sense
any more, of this. This will make the evict_inodes() to skip these
inodes.
If encrypt is enabled the kernel generate a warning when removing
the encrypt keys when the skipped inodes still hold the keyring:
WARNING: CPU: 4 PID: 168846 at fs/crypto/keyring.c:242 fscrypt_destroy_keyring+0x7e/0xd0
CPU: 4 PID: 168846 Comm: umount Tainted: G S 6.1.0-rc5-ceph-g72ead199864c #1
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015
RIP: 0010:fscrypt_destroy_keyring+0x7e/0xd0
RSP: 0018:ffffc9000b277e28 EFLAGS: 00010202
RAX: 0000000000000002 RBX: ffff88810d52ac00 RCX: ffff88810b56aa00
RDX: 0000000080000000 RSI: ffffffff822f3a09 RDI: ffff888108f59000
RBP: ffff8881d394fb88 R08: 0000000000000028 R09: 0000000000000000
R10: 0000000000000001 R11: 11ff4fe6834fcd91 R12: ffff8881d394fc40
R13: ffff888108f59000 R14: ffff8881d394f800 R15: 0000000000000000
FS: 00007fd83f6f1080(0000) GS:ffff88885fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f918d417000 CR3: 000000017f89a005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
generic_shutdown_super+0x47/0x120
kill_anon_super+0x14/0x30
ceph_kill_sb+0x36/0x90 [ceph]
deactivate_locked_super+0x29/0x60
cleanup_mnt+0xb8/0x140
task_work_run+0x67/0xb0
exit_to_user_mode_prepare+0x23d/0x240
syscall_exit_to_user_mode+0x25/0x60
do_syscall_64+0x40/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd83dc39e9b
Later the kernel will crash when iput() the inodes and dereferencing
the "sb->s_master_keys", which has been released by the
generic_shutdown_super().
Link: https://tracker.ceph.com/issues/59162
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-and-tested-by: Luís Henriques <lhenriques@suse.de>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|