Age | Commit message (Collapse) | Author |
|
commit 4b081ce0d830b684fdf967abc3696d1261387254 upstream.
If authblob->SessionKey.Length is bigger than session key
size(CIFS_KEY_SIZE), slub overflow can happen in key exchange codes.
cifs_arc4_crypt copy to session key array from SessionKey from client.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21940
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 17d5b135bb720832364e8f55f6a887a3c7ec8fdb upstream.
If ->DataOffset of create context is 0, DataBuffer size is not correctly
validated. This patch change wrong validation code and consider tag
length in request.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21824
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit e4c1cf523d820730a86cae2c6d55924833b6f7ac upstream.
This was accidentally fixed up in commit e4c1cf523d82 but we can't
take the full change due to other dependancy issues, so here is just
the actual bugfix that is needed.
[Background]
keltargw reported an issue [1] that with mmaped I/Os, sometimes the
tail of the last page (after file ends) is not filled with zeroes.
The root cause is that such tail page could be wrongly selected for
inplace I/Os so the zeroed part will then be filled with compressed
data instead of zeroes.
A simple fix is to avoid doing inplace I/Os for such tail parts,
actually that was already fixed upstream in commit e4c1cf523d82
("erofs: tidy up z_erofs_do_read_page()") by accident.
[1] https://lore.kernel.org/r/3ad8b469-25db-a297-21f9-75db2d6ad224@linux.alibaba.com
Reported-by: keltargw <keltar.gw@gmail.com>
Fixes: 3883a79abd02 ("staging: erofs: introduce VLE decompression support")
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 2d8ae8c417db284f598dffb178cc01e7db0f1821 upstream.
We've aligned setgid behavior over multiple kernel releases. The details
can be found in commit cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2' of
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping") and
commit 426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0' of
git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux").
Consistent setgid stripping behavior is now encapsulated in the
setattr_should_drop_sgid() helper which is used by all filesystems that
strip setgid bits outside of vfs proper. Usually ATTR_KILL_SGID is
raised in e.g., chown_common() and is subject to the
setattr_should_drop_sgid() check to determine whether the setgid bit can
be retained. Since nfsd is raising ATTR_KILL_SGID unconditionally it
will cause notify_change() to strip it even if the caller had the
necessary privileges to retain it. Ensure that nfsd only raises
ATR_KILL_SGID if the caller lacks the necessary privileges to retain the
setgid bit.
Without this patch the setgid stripping tests in LTP will fail:
> As you can see, the problem is S_ISGID (0002000) was dropped on a
> non-group-executable file while chown was invoked by super-user, while
[...]
> fchown02.c:66: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700
[...]
> chown02.c:57: TFAIL: testfile2: wrong mode permissions 0100700, expected 0102700
With this patch all tests pass.
Reported-by: Sherry Yang <sherry.yang@oracle.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
[Harshit: backport to 6.1.y:
Use init_user_ns instead of nop_mnt_idmap as we don't have
commit abf08576afe3 ("fs: port vfs_*() helpers to struct mnt_idmap")]
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 4f704d9a8352f5c0a8fcdb6213b934630342bd44 upstream.
We've aligned setgid behavior over multiple kernel releases. The details
can be found in the following two merge messages:
cf619f891971 ("Merge tag 'fs.ovl.setgid.v6.2')
426b4ca2d6a5 ("Merge tag 'fs.setgid.v6.0')
Consistent setgid stripping behavior is now encapsulated in the
setattr_should_drop_sgid() helper which is used by all filesystems that
strip setgid bits outside of vfs proper. Switch nfs to rely on this
helper as well. Without this patch the setgid stripping tests in
xfstests will fail.
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Message-Id: <20230313-fs-nfs-setgid-v2-1-9a59f436cfc0@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
[ Harshit: backport to 6.1.y:
fs/internal.h -- minor conflict due to code change differences.
include/linux/fs.h -- Used struct user_namespace *mnt_userns
instead of struct mnt_idmap *idmap
fs/nfs/inode.c -- Used init_user_ns instead of nop_mnt_idmap ]
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 3b816601e279756e781e6c4d9b3f3bd21a72ac67 upstream.
We have some reports of linux NFS clients that cannot satisfy a linux knfsd
server that always sets SEQ4_STATUS_RECALLABLE_STATE_REVOKED even though
those clients repeatedly walk all their known state using TEST_STATEID and
receive NFS4_OK for all.
Its possible for revoke_delegation() to set NFS4_REVOKED_DELEG_STID, then
nfsd4_free_stateid() finds the delegation and returns NFS4_OK to
FREE_STATEID. Afterward, revoke_delegation() moves the same delegation to
cl_revoked. This would produce the observed client/server effect.
Fix this by ensuring that the setting of sc_type to NFS4_REVOKED_DELEG_STID
and move to cl_revoked happens within the same cl_lock. This will allow
nfsd4_free_stateid() to properly remove the delegation from cl_revoked.
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2217103
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2176575
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Cc: stable@vger.kernel.org # v4.17+
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit be2fd1560eb57b7298aa3c258ddcca0d53ecdea3 upstream.
Be more careful when tearing down the subrequests of an O_DIRECT write
as part of a retransmission.
Reported-by: Chris Mason <clm@fb.com>
Fixes: ed5d588fe47f ("NFS: Try to join page groups before an O_DIRECT retransmission")
Cc: stable@vger.kernel.org
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 1cbc11aaa01f80577b67ae02c73ee781112125fd upstream.
Commmit f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during delegation
return") attempted to solve this problem by using nfs4's generic async error
handling, but introduced a regression where v4.0 lock recovery would hang.
The additional complexity introduced by overloading that error handling is
not necessary for this case. This patch expects that commit to be
reverted.
The problem as originally explained in the above commit is:
There's a small window where a LOCK sent during a delegation return can
race with another OPEN on client, but the open stateid has not yet been
updated. In this case, the client doesn't handle the OLD_STATEID error
from the server and will lose this lock, emitting:
"NFS: nfs4_handle_delegation_recall_error: unhandled error -10024".
Fix this by using the old_stateid refresh helpers if the server replies
with OLD_STATEID.
Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 46f881b5b1758dc4a35fba4a643c10717d0cf427 ]
Before removing checkpoint buffer from the t_checkpoint_list, we have to
check both BH_Dirty and BH_Lock bits together to distinguish buffers
have not been or were being written back. But __cp_buffer_busy() checks
them separately, it first check lock state and then check dirty, the
window between these two checks could be raced by writing back
procedure, which locks buffer and clears buffer dirty before I/O
completes. So it cannot guarantee checkpointing buffers been written
back to disk if some error happens later. Finally, it may clean
checkpoint transactions and lead to inconsistent filesystem.
jbd2_journal_forget() and __journal_try_to_free_buffer() also have the
same problem (journal_unmap_buffer() escape from this issue since it's
running under the buffer lock), so fix them through introducing a new
helper to try holding the buffer lock and remove really clean buffer.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=217490
Cc: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b98dba273a0e47dbfade89c9af73c5b012a4eabb ]
journal_clean_one_cp_list() and journal_shrink_one_cp_list() are almost
the same, so merge them into journal_shrink_one_cp_list(), remove the
nr_to_scan parameter, always scan and try to free the whole checkpoint
list.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: 46f881b5b175 ("jbd2: fix a race when checking checkpoint buffer busy")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit be22255360f80d3af789daad00025171a65424a5 ]
Since t_checkpoint_io_list was stop using in jbd2_log_do_checkpoint()
now, it's time to remove the whole t_checkpoint_io_list logic.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230606135928.434610-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Stable-dep-of: 46f881b5b175 ("jbd2: fix a race when checking checkpoint buffer busy")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f4e89f1a6dab4c063fc1e823cc9dddc408ff40cf ]
Another highly rare error case when a page allocating loop (inside
__nfs4_get_acl_uncached, this time) is not properly unwound on error.
Since pages array is allocated being uninitialized, need to free only
lower array indices. NULL checks were useful before commit 62a1573fcf84
("NFSv4 fix acl retrieval over krb5i/krb5p mounts") when the array had
been initialized to zero on stack.
Found by Linux Verification Center (linuxtesting.org).
Fixes: 62a1573fcf84 ("NFSv4 fix acl retrieval over krb5i/krb5p mounts")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 4e3733fd2b0f677faae21cf838a43faf317986d3 ]
There is a slight issue with error handling code inside
nfs42_proc_getxattr(). If page allocating loop fails then we free the
failing page array element which is NULL but __free_page() can't deal with
NULL args.
Found by Linux Verification Center (linuxtesting.org).
Fixes: a1f26739ccdc ("NFSv4.2: improve page handling for GETXATTR")
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
This reverts commit a78a8bcdc26de5ef3a0ee27c9c6c512e54a6051c which is
commit a6ec83786ab9f13f25fb18166dee908845713a95 upstream.
Something is currently broken in the f2fs code, Guenter has reported
boot problems with it for a few releases now, so revert the most recent
f2fs changes in the hope to get this back to a working filesystem.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/b392e1a8-b987-4993-bd45-035db9415a6e@roeck-us.net
Cc: Chao Yu <chao@kernel.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit 6ba0594a81f91d6fd8ca9bd4ad23aa1618635a0f which is
commit 967eaad1fed5f6335ea97a47d45214744dc57925 upstream.
Something is currently broken in the f2fs code, Guenter has reported
boot problems with it for a few releases now, so revert the most recent
f2fs changes in the hope to get this back to a working filesystem.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/b392e1a8-b987-4993-bd45-035db9415a6e@roeck-us.net
Cc: Chao Yu <chao@kernel.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Yangtao Li <frank.li@vivo.com>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
This reverts commit e2fb24ce37caeaecff08af4e9967c8462624312b which is
commit 458c15dfbce62c35fefd9ca637b20a051309c9f1 upstream.
Something is currently broken in the f2fs code, Guenter has reported
boot problems with it for a few releases now, so revert the most recent
f2fs changes in the hope to get this back to a working filesystem.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/b392e1a8-b987-4993-bd45-035db9415a6e@roeck-us.net
Cc: Chao Yu <chao@kernel.org>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 69513dd669e243928f7450893190915a88f84a2b upstream.
Under the current code, when cifs_readpage_worker is called, the call
contract is that the callee should unlock the page. This is documented
in the read_folio section of Documentation/filesystems/vfs.rst as:
> The filesystem should unlock the folio once the read has completed,
> whether it was successful or not.
Without this change, when fscache is in use and cache hit occurs during
a read, the page lock is leaked, producing the following stack on
subsequent reads (via mmap) to the page:
$ cat /proc/3890/task/12864/stack
[<0>] folio_wait_bit_common+0x124/0x350
[<0>] filemap_read_folio+0xad/0xf0
[<0>] filemap_fault+0x8b1/0xab0
[<0>] __do_fault+0x39/0x150
[<0>] do_fault+0x25c/0x3e0
[<0>] __handle_mm_fault+0x6ca/0xc70
[<0>] handle_mm_fault+0xe9/0x350
[<0>] do_user_addr_fault+0x225/0x6c0
[<0>] exc_page_fault+0x84/0x1b0
[<0>] asm_exc_page_fault+0x27/0x30
This requires a reboot to resolve; it is a deadlock.
Note however that the call to cifs_readpage_from_fscache does mark the
page clean, but does not free the folio lock. This happens in
__cifs_readpage_from_fscache on success. Releasing the lock at that
point however is not appropriate as cifs_readahead also calls
cifs_readpage_from_fscache and *does* unconditionally release the lock
after its return. This change therefore effectively makes
cifs_readpage_worker work like cifs_readahead.
Signed-off-by: Russell Harmon <russ@har.mn>
Acked-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 29eefa6d0d07e185f7bfe9576f91e6dba98189c2 upstream.
Pausing and canceling balance can race to interrupt balance lead to BUG_ON
panic in btrfs_cancel_balance. The BUG_ON condition in btrfs_cancel_balance
does not take this race scenario into account.
However, the race condition has no other side effects. We can fix that.
Reproducing it with panic trace like this:
kernel BUG at fs/btrfs/volumes.c:4618!
RIP: 0010:btrfs_cancel_balance+0x5cf/0x6a0
Call Trace:
<TASK>
? do_nanosleep+0x60/0x120
? hrtimer_nanosleep+0xb7/0x1a0
? sched_core_clone_cookie+0x70/0x70
btrfs_ioctl_balance_ctl+0x55/0x70
btrfs_ioctl+0xa46/0xd20
__x64_sys_ioctl+0x7d/0xa0
do_syscall_64+0x38/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Race scenario as follows:
> mutex_unlock(&fs_info->balance_mutex);
> --------------------
> .......issue pause and cancel req in another thread
> --------------------
> ret = __btrfs_balance(fs_info);
>
> mutex_lock(&fs_info->balance_mutex);
> if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req)) {
> btrfs_info(fs_info, "balance: paused");
> btrfs_exclop_balance(fs_info, BTRFS_EXCLOP_BALANCE_PAUSED);
> }
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: xiaoshoukui <xiaoshoukui@ruijie.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c962098ca4af146f2625ed64399926a098752c9c upstream.
In production we were seeing a variety of WARN_ON()'s in the extent_map
code, specifically in btrfs_drop_extent_map_range() when we have to call
add_extent_mapping() for our second split.
Consider the following extent map layout
PINNED
[0 16K) [32K, 48K)
and then we call btrfs_drop_extent_map_range for [0, 36K), with
skip_pinned == true. The initial loop will have
start = 0
end = 36K
len = 36K
we will find the [0, 16k) extent, but since we are pinned we will skip
it, which has this code
start = em_end;
if (end != (u64)-1)
len = start + len - em_end;
em_end here is 16K, so now the values are
start = 16K
len = 16K + 36K - 16K = 36K
len should instead be 20K. This is a problem when we find the next
extent at [32K, 48K), we need to split this extent to leave [36K, 48k),
however the code for the split looks like this
split->start = start + len;
split->len = em_end - (start + len);
In this case we have
em_end = 48K
split->start = 16K + 36K // this should be 16K + 20K
split->len = 48K - (16K + 36K) // this overflows as 16K + 36K is 52K
and now we have an invalid extent_map in the tree that potentially
overlaps other entries in the extent map. Even in the non-overlapping
case we will have split->start set improperly, which will cause problems
with any block related calculations.
We don't actually need len in this loop, we can simply use end as our
end point, and only adjust start up when we find a pinned extent we need
to skip.
Adjust the logic to do this, which keeps us from inserting an invalid
extent map.
We only skip_pinned in the relocation case, so this is relatively rare,
except in the case where you are running relocation a lot, which can
happen with auto relocation on.
Fixes: 55ef68990029 ("Btrfs: Fix btrfs_drop_extent_cache for skip pinned case")
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e8f5f849ffce24490eb9449e98312b66c0dba76f ]
With deferred close we can have closes that race with lease breaks,
and so with the current checks for whether to send the lease response,
oplock_response(), this can mean that an unmount (kill_sb) can occur
just before we were checking if the tcon->ses is valid. See below:
[Fri Aug 4 04:12:50 2023] RIP: 0010:cifs_oplock_break+0x1f7/0x5b0 [cifs]
[Fri Aug 4 04:12:50 2023] Code: 7d a8 48 8b 7d c0 c0 e9 02 48 89 45 b8 41 89 cf e8 3e f5 ff ff 4c 89 f7 41 83 e7 01 e8 82 b3 03 f2 49 8b 45 50 48 85 c0 74 5e <48> 83 78 60 00 74 57 45 84 ff 75 52 48 8b 43 98 48 83 eb 68 48 39
[Fri Aug 4 04:12:50 2023] RSP: 0018:ffffb30607ddbdf8 EFLAGS: 00010206
[Fri Aug 4 04:12:50 2023] RAX: 632d223d32612022 RBX: ffff97136944b1e0 RCX: 0000000080100009
[Fri Aug 4 04:12:50 2023] RDX: 0000000000000001 RSI: 0000000080100009 RDI: ffff97136944b188
[Fri Aug 4 04:12:50 2023] RBP: ffffb30607ddbe58 R08: 0000000000000001 R09: ffffffffc08e0900
[Fri Aug 4 04:12:50 2023] R10: 0000000000000001 R11: 000000000000000f R12: ffff97136944b138
[Fri Aug 4 04:12:50 2023] R13: ffff97149147c000 R14: ffff97136944b188 R15: 0000000000000000
[Fri Aug 4 04:12:50 2023] FS: 0000000000000000(0000) GS:ffff9714f7c00000(0000) knlGS:0000000000000000
[Fri Aug 4 04:12:50 2023] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Aug 4 04:12:50 2023] CR2: 00007fd8de9c7590 CR3: 000000011228e000 CR4: 0000000000350ef0
[Fri Aug 4 04:12:50 2023] Call Trace:
[Fri Aug 4 04:12:50 2023] <TASK>
[Fri Aug 4 04:12:50 2023] process_one_work+0x225/0x3d0
[Fri Aug 4 04:12:50 2023] worker_thread+0x4d/0x3e0
[Fri Aug 4 04:12:50 2023] ? process_one_work+0x3d0/0x3d0
[Fri Aug 4 04:12:50 2023] kthread+0x12a/0x150
[Fri Aug 4 04:12:50 2023] ? set_kthread_struct+0x50/0x50
[Fri Aug 4 04:12:50 2023] ret_from_fork+0x22/0x30
[Fri Aug 4 04:12:50 2023] </TASK>
To fix this change the ordering of the checks before sending the oplock_response
to first check if the openFileList is empty.
Fixes: da787d5b7498 ("SMB3: Do not send lease break acknowledgment if all file handles have been closed")
Suggested-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Bharath SM <bharathsm@microsoft.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0657b20c5a76c938612f8409735a8830d257866e ]
If a task creates a new block group and that block group becomes unused
before we finish its creation, at btrfs_create_pending_block_groups(),
then when btrfs_mark_bg_unused() is called against the block group, we
assume that the block group is currently in the list of block groups to
reclaim, and we move it out of the list of new block groups and into the
list of unused block groups. This has two consequences:
1) We move it out of the list of new block groups associated to the
current transaction. So the block group creation is not finished and
if we attempt to delete the bg because it's unused, we will not find
the block group item in the extent tree (or the new block group tree),
its device extent items in the device tree etc, resulting in the
deletion to fail due to the missing items;
2) We don't increment the reference count on the block group when we
move it to the list of unused block groups, because we assumed the
block group was on the list of block groups to reclaim, and in that
case it already has the correct reference count. However the block
group was on the list of new block groups, in which case no extra
reference was taken because it's local to the current task. This
later results in doing an extra reference count decrement when
removing the block group from the unused list, eventually leading the
reference count to 0.
This second case was caught when running generic/297 from fstests, which
produced the following assertion failure and stack trace:
[589.559] assertion failed: refcount_read(&block_group->refs) == 1, in fs/btrfs/block-group.c:4299
[589.559] ------------[ cut here ]------------
[589.559] kernel BUG at fs/btrfs/block-group.c:4299!
[589.560] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[589.560] CPU: 8 PID: 2819134 Comm: umount Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[589.560] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[589.560] RIP: 0010:btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.561] Code: 68 62 da c0 (...)
[589.561] RSP: 0018:ffffa55a8c3b3d98 EFLAGS: 00010246
[589.561] RAX: 0000000000000058 RBX: ffff8f030d7f2000 RCX: 0000000000000000
[589.562] RDX: 0000000000000000 RSI: ffffffff953f0878 RDI: 00000000ffffffff
[589.562] RBP: ffff8f030d7f2088 R08: 0000000000000000 R09: ffffa55a8c3b3c50
[589.562] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8f05850b4c00
[589.562] R13: ffff8f030d7f2090 R14: ffff8f05850b4cd8 R15: dead000000000100
[589.563] FS: 00007f497fd2e840(0000) GS:ffff8f09dfc00000(0000) knlGS:0000000000000000
[589.563] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[589.563] CR2: 00007f497ff8ec10 CR3: 0000000271472006 CR4: 0000000000370ee0
[589.563] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[589.564] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[589.564] Call Trace:
[589.564] <TASK>
[589.565] ? __die_body+0x1b/0x60
[589.565] ? die+0x39/0x60
[589.565] ? do_trap+0xeb/0x110
[589.565] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? do_error_trap+0x6a/0x90
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.566] ? exc_invalid_op+0x4e/0x70
[589.566] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? asm_exc_invalid_op+0x16/0x20
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] ? btrfs_free_block_groups+0x449/0x4a0 [btrfs]
[589.567] close_ctree+0x35d/0x560 [btrfs]
[589.568] ? fsnotify_sb_delete+0x13e/0x1d0
[589.568] ? dispose_list+0x3a/0x50
[589.568] ? evict_inodes+0x151/0x1a0
[589.568] generic_shutdown_super+0x73/0x1a0
[589.569] kill_anon_super+0x14/0x30
[589.569] btrfs_kill_super+0x12/0x20 [btrfs]
[589.569] deactivate_locked_super+0x2e/0x70
[589.569] cleanup_mnt+0x104/0x160
[589.570] task_work_run+0x56/0x90
[589.570] exit_to_user_mode_prepare+0x160/0x170
[589.570] syscall_exit_to_user_mode+0x22/0x50
[589.570] ? __x64_sys_umount+0x12/0x20
[589.571] do_syscall_64+0x48/0x90
[589.571] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[589.571] RIP: 0033:0x7f497ff0a567
[589.571] Code: af 98 0e (...)
[589.572] RSP: 002b:00007ffc98347358 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
[589.572] RAX: 0000000000000000 RBX: 00007f49800b8264 RCX: 00007f497ff0a567
[589.572] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000557f558abfa0
[589.573] RBP: 0000557f558a6ba0 R08: 0000000000000000 R09: 00007ffc98346100
[589.573] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[589.573] R13: 0000557f558abfa0 R14: 0000557f558a6cb0 R15: 0000557f558a6dd0
[589.573] </TASK>
[589.574] Modules linked in: dm_snapshot dm_thin_pool (...)
[589.576] ---[ end trace 0000000000000000 ]---
Fix this by adding a runtime flag to the block group to tell that the
block group is still in the list of new block groups, and therefore it
should not be moved to the list of unused block groups, at
btrfs_mark_bg_unused(), until the flag is cleared, when we finish the
creation of the block group at btrfs_create_pending_block_groups().
Fixes: a9f189716cf1 ("btrfs: move out now unused BG from the reclaim list")
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 961f5b8bf48a463ac5fe5b13143426d79eb41817 ]
In zoned mode the sequential status of zone can be also tracked in the
runtime flags of block group.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0d7764ff58b4b45c39eb03f2c74a819c1a88fa7b ]
We already have flags in block group to track various status bits,
convert needs_free_space as well and reduce size of btrfs_block_group.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable-dep-of: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit a9f189716cf15913c453299d72f69c51a9b0f86b ]
An unused block group is easy to remove to free up space and should be
reclaimed fast. Such block group can often already be a target of the
reclaim process. As we check list_empty(&bg->bg_list), we keep it in the
reclaim list. That block group is never reclaimed until the file system
is filled e.g. up to 75%.
Instead, we can move unused block group to the unused list and delete it
fast.
Fixes: 18bb8bbf13c1 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit e0f363a98830e8d7d70fbaf91c07ae0b7c57aafe ]
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 97498cd610c0d030a7bd49a7efad974790661162 ]
In a previous commit 2681631c2973 ("fs/ntfs3: Add null pointer check to
attr_load_runs_vcn"), ni can be NULL in attr_load_runs_vcn(), and thus it
should be checked before being used.
However, in the call stack of this commit, mft_ni in mi_read() is
aliased with ni in attr_load_runs_vcn(), and it is also used in
mi_read() at two places:
mi_read()
rw_lock = &mft_ni->file.run_lock -> No check
attr_load_runs_vcn(mft_ni, ...)
ni (namely mft_ni) is checked in the previous commit
attr_load_runs_vcn(..., &mft_ni->file.run) -> No check
Thus, to avoid possible null-pointer dereferences, the related checks
should be added.
These bugs are reported by a static analysis tool implemented by myself,
and they are found by extending a known bug fixed in the previous commit.
Thus, they could be theoretical bugs.
Signed-off-by: Jia-Ju Bai <baijiaju@buaa.edu.cn>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fdec309c7672cbee4dc0229ee4cbb33c948a1bdd ]
ni_create_attr_list uses WARN_ON to catch error cases while generating
attribute list, which only prints out stack trace and may not be enough.
This repalces them with more proper error handling flow.
[ 59.666332] BUG: kernel NULL pointer dereference, address: 000000000000000e
[ 59.673268] #PF: supervisor read access in kernel mode
[ 59.678354] #PF: error_code(0x0000) - not-present page
[ 59.682831] PGD 8000000005ff1067 P4D 8000000005ff1067 PUD 7dee067 PMD 0
[ 59.688556] Oops: 0000 [#1] PREEMPT SMP KASAN PTI
[ 59.692642] CPU: 0 PID: 198 Comm: poc Tainted: G B W 6.2.0-rc1+ #4
[ 59.698868] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 59.708795] RIP: 0010:ni_create_attr_list+0x505/0x860
[ 59.713657] Code: 7e 10 e8 5e d0 d0 ff 45 0f b7 76 10 48 8d 7b 16 e8 00 d1 d0 ff 66 44 89 73 16 4d 8d 75 0e 4c 89 f7 e8 3f d0 d0 ff 4c 8d8
[ 59.731559] RSP: 0018:ffff88800a56f1e0 EFLAGS: 00010282
[ 59.735691] RAX: 0000000000000001 RBX: ffff88800b7b5088 RCX: ffffffffb83079fe
[ 59.741792] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffbb7f9fc0
[ 59.748423] RBP: ffff88800a56f3a8 R08: ffff88800b7b50a0 R09: fffffbfff76ff3f9
[ 59.754654] R10: ffffffffbb7f9fc7 R11: fffffbfff76ff3f8 R12: ffff88800b756180
[ 59.761552] R13: 0000000000000000 R14: 000000000000000e R15: 0000000000000050
[ 59.768323] FS: 00007feaa8c96440(0000) GS:ffff88806d400000(0000) knlGS:0000000000000000
[ 59.776027] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 59.781395] CR2: 00007f3a2e0b1000 CR3: 000000000a5bc000 CR4: 00000000000006f0
[ 59.787607] Call Trace:
[ 59.790271] <TASK>
[ 59.792488] ? __pfx_ni_create_attr_list+0x10/0x10
[ 59.797235] ? kernel_text_address+0xd3/0xe0
[ 59.800856] ? unwind_get_return_address+0x3e/0x60
[ 59.805101] ? __kasan_check_write+0x18/0x20
[ 59.809296] ? preempt_count_sub+0x1c/0xd0
[ 59.813421] ni_ins_attr_ext+0x52c/0x5c0
[ 59.817034] ? __pfx_ni_ins_attr_ext+0x10/0x10
[ 59.821926] ? __vfs_setxattr+0x121/0x170
[ 59.825718] ? __vfs_setxattr_noperm+0x97/0x300
[ 59.829562] ? __vfs_setxattr_locked+0x145/0x170
[ 59.833987] ? vfs_setxattr+0x137/0x2a0
[ 59.836732] ? do_setxattr+0xce/0x150
[ 59.839807] ? setxattr+0x126/0x140
[ 59.842353] ? path_setxattr+0x164/0x180
[ 59.845275] ? __x64_sys_setxattr+0x71/0x90
[ 59.848838] ? do_syscall_64+0x3f/0x90
[ 59.851898] ? entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 59.857046] ? stack_depot_save+0x17/0x20
[ 59.860299] ni_insert_attr+0x1ba/0x420
[ 59.863104] ? __pfx_ni_insert_attr+0x10/0x10
[ 59.867069] ? preempt_count_sub+0x1c/0xd0
[ 59.869897] ? _raw_spin_unlock_irqrestore+0x2b/0x50
[ 59.874088] ? __create_object+0x3ae/0x5d0
[ 59.877865] ni_insert_resident+0xc4/0x1c0
[ 59.881430] ? __pfx_ni_insert_resident+0x10/0x10
[ 59.886355] ? kasan_save_alloc_info+0x1f/0x30
[ 59.891117] ? __kasan_kmalloc+0x8b/0xa0
[ 59.894383] ntfs_set_ea+0x90d/0xbf0
[ 59.897703] ? __pfx_ntfs_set_ea+0x10/0x10
[ 59.901011] ? kernel_text_address+0xd3/0xe0
[ 59.905308] ? __kernel_text_address+0x16/0x50
[ 59.909811] ? unwind_get_return_address+0x3e/0x60
[ 59.914898] ? __pfx_stack_trace_consume_entry+0x10/0x10
[ 59.920250] ? arch_stack_walk+0xa2/0x100
[ 59.924560] ? filter_irq_stacks+0x27/0x80
[ 59.928722] ntfs_setxattr+0x405/0x440
[ 59.932512] ? __pfx_ntfs_setxattr+0x10/0x10
[ 59.936634] ? kvmalloc_node+0x2d/0x120
[ 59.940378] ? kasan_save_stack+0x41/0x60
[ 59.943870] ? kasan_save_stack+0x2a/0x60
[ 59.947719] ? kasan_set_track+0x29/0x40
[ 59.951417] ? kasan_save_alloc_info+0x1f/0x30
[ 59.955733] ? __kasan_kmalloc+0x8b/0xa0
[ 59.959598] ? __kmalloc_node+0x68/0x150
[ 59.963163] ? kvmalloc_node+0x2d/0x120
[ 59.966490] ? vmemdup_user+0x2b/0xa0
[ 59.969060] __vfs_setxattr+0x121/0x170
[ 59.972456] ? __pfx___vfs_setxattr+0x10/0x10
[ 59.976008] __vfs_setxattr_noperm+0x97/0x300
[ 59.981562] __vfs_setxattr_locked+0x145/0x170
[ 59.986100] vfs_setxattr+0x137/0x2a0
[ 59.989964] ? __pfx_vfs_setxattr+0x10/0x10
[ 59.993616] ? __kasan_check_write+0x18/0x20
[ 59.997425] do_setxattr+0xce/0x150
[ 60.000304] setxattr+0x126/0x140
[ 60.002967] ? __pfx_setxattr+0x10/0x10
[ 60.006471] ? __virt_addr_valid+0xcb/0x140
[ 60.010461] ? __call_rcu_common.constprop.0+0x1c7/0x330
[ 60.016037] ? debug_smp_processor_id+0x1b/0x30
[ 60.021008] ? kasan_quarantine_put+0x5b/0x190
[ 60.025545] ? putname+0x84/0xa0
[ 60.027910] ? __kasan_slab_free+0x11e/0x1b0
[ 60.031483] ? putname+0x84/0xa0
[ 60.033986] ? preempt_count_sub+0x1c/0xd0
[ 60.036876] ? __mnt_want_write+0xae/0x100
[ 60.040738] ? mnt_want_write+0x8f/0x150
[ 60.044317] path_setxattr+0x164/0x180
[ 60.048096] ? __pfx_path_setxattr+0x10/0x10
[ 60.052096] ? strncpy_from_user+0x175/0x1c0
[ 60.056482] ? debug_smp_processor_id+0x1b/0x30
[ 60.059848] ? fpregs_assert_state_consistent+0x6b/0x80
[ 60.064557] __x64_sys_setxattr+0x71/0x90
[ 60.068892] do_syscall_64+0x3f/0x90
[ 60.072868] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 60.077523] RIP: 0033:0x7feaa86e4469
[ 60.080915] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 088
[ 60.097353] RSP: 002b:00007ffdbd8311e8 EFLAGS: 00000286 ORIG_RAX: 00000000000000bc
[ 60.103386] RAX: ffffffffffffffda RBX: 9461c5e290baac00 RCX: 00007feaa86e4469
[ 60.110322] RDX: 00007ffdbd831fe0 RSI: 00007ffdbd831305 RDI: 00007ffdbd831263
[ 60.116808] RBP: 00007ffdbd836180 R08: 0000000000000001 R09: 00007ffdbd836268
[ 60.123879] R10: 000000000000007d R11: 0000000000000286 R12: 0000000000400500
[ 60.130540] R13: 00007ffdbd836260 R14: 0000000000000000 R15: 0000000000000000
[ 60.136553] </TASK>
[ 60.138818] Modules linked in:
[ 60.141839] CR2: 000000000000000e
[ 60.144831] ---[ end trace 0000000000000000 ]---
[ 60.149058] RIP: 0010:ni_create_attr_list+0x505/0x860
[ 60.153975] Code: 7e 10 e8 5e d0 d0 ff 45 0f b7 76 10 48 8d 7b 16 e8 00 d1 d0 ff 66 44 89 73 16 4d 8d 75 0e 4c 89 f7 e8 3f d0 d0 ff 4c 8d8
[ 60.172443] RSP: 0018:ffff88800a56f1e0 EFLAGS: 00010282
[ 60.176246] RAX: 0000000000000001 RBX: ffff88800b7b5088 RCX: ffffffffb83079fe
[ 60.182752] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffffffffbb7f9fc0
[ 60.189949] RBP: ffff88800a56f3a8 R08: ffff88800b7b50a0 R09: fffffbfff76ff3f9
[ 60.196950] R10: ffffffffbb7f9fc7 R11: fffffbfff76ff3f8 R12: ffff88800b756180
[ 60.203671] R13: 0000000000000000 R14: 000000000000000e R15: 0000000000000050
[ 60.209595] FS: 00007feaa8c96440(0000) GS:ffff88806d400000(0000) knlGS:0000000000000000
[ 60.216299] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 60.222276] CR2: 00007f3a2e0b1000 CR3: 000000000a5bc000 CR4: 00000000000006f0
Signed-off-by: Edward Lo <loyuantsung@gmail.com>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 8b0da5c549ae63ba1debd92a350f90773cb4bfe7 ]
When the msgs are corrupted we need to dump them and then it will
be easier to dig what has happened and where the issue is.
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Milind Changire <mchangir@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6fa0a72cbbe45db4ed967a51f9e6f4e3afe61d20 ]
Some fields such as gt_logd_secs of the struct gfs2_tune are accessed
without holding the lock gt_spin in gfs2_show_options():
val = sdp->sd_tune.gt_logd_secs;
if (val != 30)
seq_printf(s, ",commit=%d", val);
And thus can cause data races when gfs2_show_options() and other functions
such as gfs2_reconfigure() are concurrently executed:
spin_lock(>->gt_spin);
gt->gt_logd_secs = newargs->ar_commit;
To fix these possible data races, the lock sdp->sd_tune.gt_spin is
acquired before accessing the fields of gfs2_tune and released after these
accesses.
Further changes by Andreas:
- Don't hold the spin lock over the seq_printf operations.
Reported-by: BassCheck <bass@buaa.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 99f280700b4cc02d5f141b8d15f8e9fad0418f65 ]
Don't collect exiting session in smb2_reconnect_server(), because it
will be released soon.
Note that the exiting session will stay in server->smb_ses_list until
it complete the cifs_free_ipc() and logoff() and then delete itself
from the list.
Signed-off-by: Winston Wen <wentao@uniontech.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 12c30f33cc6769bf411088a2872843c4f9ea32f9 ]
This fixes the following warning reported by kernel test robot
fs/smb/client/cifsfs.c:982 cifs_smb3_do_mount() warn: possible
memory leak of 'cifs_sb'
Link: https://lore.kernel.org/all/202306170124.CtQqzf0I-lkp@intel.com/
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 92fb94b69c6accf1e49fff699640fa0ce03dc910 upstream.
We set cache_block_group_error if btrfs_cache_block_group() returns an
error, this is because we could end up not finding space to allocate and
mistakenly return -ENOSPC, and which could then abort the transaction
with the incorrect errno, and in the case of ENOSPC result in a
WARN_ON() that will trip up tests like generic/475.
However there's the case where multiple threads can be racing, one
thread gets the proper error, and the other thread doesn't actually call
btrfs_cache_block_group(), it instead sees ->cached ==
BTRFS_CACHE_ERROR. Again the result is the same, we fail to allocate
our space and return -ENOSPC. Instead we need to set
cache_block_group_error to -EIO in this case to make sure that if we do
not make our allocation we get the appropriate error returned back to
the caller.
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 6ebcd021c92b8e4b904552e4d87283032100796d upstream.
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
That ASSERT() makes sure the reloc tree is properly pointed back by its
subvolume tree.
[CAUSE]
After more debugging output, it turns out we had an invalid reloc tree:
BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17
Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM,
QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree.
But reloc trees can only exist for subvolumes, as for non-subvolume
trees, we just COW the involved tree block, no need to create a reloc
tree since those tree blocks won't be shared with other trees.
Only subvolumes tree can share tree blocks with other trees (thus they
have BTRFS_ROOT_SHAREABLE flag).
Thus this new debug output proves my previous assumption that corrupted
on-disk data can trigger that ASSERT().
[FIX]
Besides the dedicated fix and the graceful exit, also let tree-checker to
check such root keys, to make sure reloc trees can only exist for subvolumes.
CC: stable@vger.kernel.org # 5.15+
Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 05d7ce504545f7874529701664c90814ca645c5d upstream.
[BUG]
Syzbot reported a crash that an ASSERT() got triggered inside
prepare_to_merge().
[CAUSE]
The root cause of the triggered ASSERT() is we can have a race between
quota tree creation and relocation.
This leads us to create a duplicated quota tree in the
btrfs_read_fs_root() path, and since it's treated as fs tree, it would
have ROOT_SHAREABLE flag, causing us to create a reloc tree for it.
The bug itself is fixed by a dedicated patch for it, but this already
taught us the ASSERT() is not something straightforward for
developers.
[ENHANCEMENT]
Instead of using an ASSERT(), let's handle it gracefully and output
extra info about the mismatch reloc roots to help debug.
Also with the above ASSERT() removed, we can trigger ASSERT(0)s inside
merge_reloc_roots() later.
Also replace those ASSERT(0)s with WARN_ON()s.
CC: stable@vger.kernel.org # 5.15+
Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 12b2d64e591652a2d97dd3afa2b062ca7a4ba352 upstream.
When the call to btrfs_reloc_clone_csums in cow_file_range returns an
error, we jump to the out_unlock label with the extent_reserved variable
set to false. The cleanup at the label will then call
extent_clear_unlock_delalloc on the range from start to end. But we've
already added cur_alloc_size to start before the jump, so there might no
range be left from the newly incremented start to end. Move the check for
'start < end' so that it is reached by also for the !extent_reserved case.
CC: stable@vger.kernel.org # 6.1+
Fixes: a315e68f6e8b ("Btrfs: fix invalid attempt to free reserved space on failure to cow range")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit effa24f689ce0948f68c754991a445a8d697d3a8 upstream.
extent_write_cache_pages stops writing pages as soon as nr_to_write hits
zero. That is the right thing for opportunistic writeback, but incorrect
for data integrity writeback, which needs to ensure that no dirty pages
are left in the range. Thus only stop the writeback for WB_SYNC_NONE
if nr_to_write hits 0.
This is a port of write_cache_pages changes in commit 05fe478dd04e
("mm: write_cache_pages integrity fix").
Note that I've only trigger the problem with other changes to the btrfs
writeback code, but this condition seems worthwhile fixing anyway.
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ updated comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit fc1f91b9231a28fba333f931a031bf776bc6ef0e upstream.
Recently we've been having mysterious hangs while running generic/475 on
the CI system. This turned out to be something like this:
Task 1
dmsetup suspend --nolockfs
-> __dm_suspend
-> dm_wait_for_completion
-> dm_wait_for_bios_completion
-> Unable to complete because of IO's on a plug in Task 2
Task 2
wb_workfn
-> wb_writeback
-> blk_start_plug
-> writeback_sb_inodes
-> Infinite loop unable to make an allocation
Task 3
cache_block_group
->read_extent_buffer_pages
->Waiting for IO to complete that can't be submitted because Task 1
suspended the DM device
The problem here is that we need Task 2 to be scheduled completely for
the blk plug to flush. Normally this would happen, we normally wait for
the block group caching to finish (Task 3), and this schedule would
result in the block plug flushing.
However if there's enough free space available from the current caching
to satisfy the allocation we won't actually wait for the caching to
complete. This check however just checks that we have enough space, not
that we can make the allocation. In this particular case we were trying
to allocate 9MiB, and we had 10MiB of free space, but we didn't have
9MiB of contiguous space to allocate, and thus the allocation failed and
we looped.
We specifically don't cycle through the FFE loop until we stop finding
cached block groups because we don't want to allocate new block groups
just because we're caching, so we short circuit the normal loop once we
hit LOOP_CACHING_WAIT and we found a caching block group.
This is normally fine, except in this particular case where the caching
thread can't make progress because the DM device has been suspended.
Fix this by not only waiting for free space to >= the amount of space we
want to allocate, but also that we make some progress in caching from
the time we start waiting. This will keep us from busy looping when the
caching is taking a while but still theoretically has enough space for
us to allocate from, and fixes this particular case by forcing us to
actually sleep and wait for forward progress, which will flush the plug.
With this fix we're no longer hanging with generic/475.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit f8654743a0e6909dc634cbfad6db6816f10f3399 upstream.
During unmount process of nilfs2, nothing holds nilfs_root structure after
nilfs2 detaches its writer in nilfs_detach_log_writer(). Previously,
nilfs_evict_inode() could cause use-after-free read for nilfs_root if
inodes are left in "garbage_list" and released by nilfs_dispose_list at
the end of nilfs_detach_log_writer(), and this bug was fixed by commit
9b5a04ac3ad9 ("nilfs2: fix use-after-free bug of nilfs_root in
nilfs_evict_inode()").
However, it turned out that there is another possibility of UAF in the
call path where mark_inode_dirty_sync() is called from iput():
nilfs_detach_log_writer()
nilfs_dispose_list()
iput()
mark_inode_dirty_sync()
__mark_inode_dirty()
nilfs_dirty_inode()
__nilfs_mark_inode_dirty()
nilfs_load_inode_block() --> causes UAF of nilfs_root struct
This can happen after commit 0ae45f63d4ef ("vfs: add support for a
lazytime mount option"), which changed iput() to call
mark_inode_dirty_sync() on its final reference if i_state has I_DIRTY_TIME
flag and i_nlink is non-zero.
This issue appears after commit 28a65b49eb53 ("nilfs2: do not write dirty
data after degenerating to read-only") when using the syzbot reproducer,
but the issue has potentially existed before.
Fix this issue by adding a "purging flag" to the nilfs structure, setting
that flag while disposing the "garbage_list" and checking it in
__nilfs_mark_inode_dirty().
Unlike commit 9b5a04ac3ad9 ("nilfs2: fix use-after-free bug of nilfs_root
in nilfs_evict_inode()"), this patch does not rely on ns_writer to
determine whether to skip operations, so as not to break recovery on
mount. The nilfs_salvage_orphan_logs routine dirties the buffer of
salvaged data before attaching the log writer, so changing
__nilfs_mark_inode_dirty() to skip the operation when ns_writer is NULL
will cause recovery write to fail. The purpose of using the cleanup-only
flag is to allow for narrowing of such conditions.
Link: https://lkml.kernel.org/r/20230728191318.33047-1-konishi.ryusuke@gmail.com
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+74db8b3087f293d3a13a@syzkaller.appspotmail.com
Closes: https://lkml.kernel.org/r/000000000000b4e906060113fd63@google.com
Fixes: 0ae45f63d4ef ("vfs: add support for a lazytime mount option")
Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Cc: <stable@vger.kernel.org> # 4.0+
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 79ed288cef201f1f212dfb934bcaac75572fb8f6 upstream.
There are multiple smb2_ea_info buffers in FILE_FULL_EA_INFORMATION request
from client. ksmbd find next smb2_ea_info using ->NextEntryOffset of
current smb2_ea_info. ksmbd need to validate buffer length Before
accessing the next ea. ksmbd should check buffer length using buf_len,
not next variable. next is the start offset of current ea that got from
previous ea.
Cc: stable@vger.kernel.org
Reported-by: zdi-disclosures@trendmicro.com # ZDI-CAN-21598
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 5aa4fda5aa9c2a5a7bac67b4a12b089ab81fee3c upstream.
In commit 2b9b8f3b68ed ("ksmbd: validate command payload size"), except
for SMB2_OPLOCK_BREAK_HE command, the request size of other commands
is not checked, it's not expected. Fix it by add check for request
size of other commands.
Cc: stable@vger.kernel.org
Fixes: 2b9b8f3b68ed ("ksmbd: validate command payload size")
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Long Li <leo.lilong@huawei.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d42334578eba1390859012ebb91e1e556d51db49 ]
exfat_extract_uni_name copies characters from a given file name entry into
the 'uniname' variable. This variable is actually defined on the stack of
the exfat_readdir() function. According to the definition of
the 'exfat_uni_name' type, the file name should be limited 255 characters
(+ null teminator space), but the exfat_get_uniname_from_ext_entry()
function can write more characters because there is no check if filename
entries exceeds max filename length. This patch add the check not to copy
filename characters when exceeding max filename length.
Cc: stable@vger.kernel.org
Cc: Yuezhang Mo <Yuezhang.Mo@sony.com>
Reported-by: Maxim Suhanov <dfirblog@gmail.com>
Reviewed-by: Sungjong Seo <sj1557.seo@samsung.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 458c15dfbce62c35fefd9ca637b20a051309c9f1 ]
syzbot reports a bug as below:
general protection fault, probably for non-canonical address 0xdffffc0000000009: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:__lock_acquire+0x69/0x2000 kernel/locking/lockdep.c:4942
Call Trace:
lock_acquire+0x1e3/0x520 kernel/locking/lockdep.c:5691
__raw_write_lock include/linux/rwlock_api_smp.h:209 [inline]
_raw_write_lock+0x2e/0x40 kernel/locking/spinlock.c:300
__drop_extent_tree+0x3ac/0x660 fs/f2fs/extent_cache.c:1100
f2fs_drop_extent_tree+0x17/0x30 fs/f2fs/extent_cache.c:1116
f2fs_insert_range+0x2d5/0x3c0 fs/f2fs/file.c:1664
f2fs_fallocate+0x4e4/0x6d0 fs/f2fs/file.c:1838
vfs_fallocate+0x54b/0x6b0 fs/open.c:324
ksys_fallocate fs/open.c:347 [inline]
__do_sys_fallocate fs/open.c:355 [inline]
__se_sys_fallocate fs/open.c:353 [inline]
__x64_sys_fallocate+0xbd/0x100 fs/open.c:353
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The root cause is race condition as below:
- since it tries to remount rw filesystem, so that do_remount won't
call sb_prepare_remount_readonly to block fallocate, there may be race
condition in between remount and fallocate.
- in f2fs_remount(), default_options() will reset mount option to default
one, and then update it based on result of parse_options(), so there is
a hole which race condition can happen.
Thread A Thread B
- f2fs_fill_super
- parse_options
- clear_opt(READ_EXTENT_CACHE)
- f2fs_remount
- default_options
- set_opt(READ_EXTENT_CACHE)
- f2fs_fallocate
- f2fs_insert_range
- f2fs_drop_extent_tree
- __drop_extent_tree
- __may_extent_tree
- test_opt(READ_EXTENT_CACHE) return true
- write_lock(&et->lock) access NULL pointer
- parse_options
- clear_opt(READ_EXTENT_CACHE)
Cc: <stable@vger.kernel.org>
Reported-by: syzbot+d015b6c2fbb5c383bf08@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/20230522124203.3838360-1-chao@kernel.org
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 967eaad1fed5f6335ea97a47d45214744dc57925 ]
Some minor modifications to flush_merge and related parameters:
1.The FLUSH_MERGE opt is set by default only in non-ro mode.
2.When ro and merge are set at the same time, an error is reported.
3.Display noflush_merge mount opt.
Suggested-by: Chao Yu <chao@kernel.org>
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: 458c15dfbce6 ("f2fs: don't reset unchangable mount option in f2fs_remount()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit a6ec83786ab9f13f25fb18166dee908845713a95 upstream.
syzbot reports below bug:
BUG: KASAN: slab-use-after-free in f2fs_truncate_data_blocks_range+0x122a/0x14c0 fs/f2fs/file.c:574
Read of size 4 at addr ffff88802a25c000 by task syz-executor148/5000
CPU: 1 PID: 5000 Comm: syz-executor148 Not tainted 6.4.0-rc7-syzkaller-00041-ge660abd551f1 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106
print_address_description.constprop.0+0x2c/0x3c0 mm/kasan/report.c:351
print_report mm/kasan/report.c:462 [inline]
kasan_report+0x11c/0x130 mm/kasan/report.c:572
f2fs_truncate_data_blocks_range+0x122a/0x14c0 fs/f2fs/file.c:574
truncate_dnode+0x229/0x2e0 fs/f2fs/node.c:944
f2fs_truncate_inode_blocks+0x64b/0xde0 fs/f2fs/node.c:1154
f2fs_do_truncate_blocks+0x4ac/0xf30 fs/f2fs/file.c:721
f2fs_truncate_blocks+0x7b/0x300 fs/f2fs/file.c:749
f2fs_truncate.part.0+0x4a5/0x630 fs/f2fs/file.c:799
f2fs_truncate include/linux/fs.h:825 [inline]
f2fs_setattr+0x1738/0x2090 fs/f2fs/file.c:1006
notify_change+0xb2c/0x1180 fs/attr.c:483
do_truncate+0x143/0x200 fs/open.c:66
handle_truncate fs/namei.c:3295 [inline]
do_open fs/namei.c:3640 [inline]
path_openat+0x2083/0x2750 fs/namei.c:3791
do_filp_open+0x1ba/0x410 fs/namei.c:3818
do_sys_openat2+0x16d/0x4c0 fs/open.c:1356
do_sys_open fs/open.c:1372 [inline]
__do_sys_creat fs/open.c:1448 [inline]
__se_sys_creat fs/open.c:1442 [inline]
__x64_sys_creat+0xcd/0x120 fs/open.c:1442
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x39/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The root cause is, inodeA references inodeB via inodeB's ino, once inodeA
is truncated, it calls truncate_dnode() to truncate data blocks in inodeB's
node page, it traverse mapping data from node->i.i_addr[0] to
node->i.i_addr[ADDRS_PER_BLOCK() - 1], result in out-of-boundary access.
This patch fixes to add sanity check on dnode page in truncate_dnode(),
so that, it can help to avoid triggering such issue, and once it encounters
such issue, it will record newly introduced ERROR_INVALID_NODE_REFERENCE
error into superblock, later fsck can detect such issue and try repairing.
Also, it removes f2fs_truncate_data_blocks() for cleanup due to the
function has only one caller, and uses f2fs_truncate_data_blocks_range()
instead.
Reported-and-tested-by: syzbot+12cb4425b22169b52036@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-f2fs-devel/000000000000f3038a05fef867f8@google.com
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d8ccbd21918fd7fa6ce3226cffc22c444228e8ad upstream.
At add_new_free_space() we have these BUG_ON()'s that are there to deal
with any failure to add free space to the in memory free space cache.
Such failures are mostly -ENOMEM that should be very rare. However there's
no need to have these BUG_ON()'s, we can just return any error to the
caller and all callers and their upper call chain are already dealing with
errors.
So just make add_new_free_space() return any errors, while removing the
BUG_ON()'s, and returning the total amount of added free space to an
optional u64 pointer argument.
Reported-by: syzbot+3ba856e07b7127889d8c@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000e9cb8305ff4e8327@google.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 404615d7f1dcd4cca200e9a7a9df3a1dcae1dd62 upstream.
Ext2 has fields in superblock reserved for subblock allocation support.
However that never landed. Drop the many years dead code.
Reported-by: syzbot+af5e10f73dbff48f70af@syzkaller.appspotmail.com
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit c541dce86c537714b6761a79a969c1623dfa222b upstream.
The reconfigure / remount code takes a lot of effort to protect
filesystem's reconfiguration code from racing writes on remounting
read-only. However during remounting read-only filesystem to read-write
mode userspace writes can start immediately once we clear SB_RDONLY
flag. This is inconvenient for example for ext4 because we need to do
some writes to the filesystem (such as preparation of quota files)
before we can take userspace writes so we are clearing SB_RDONLY flag
before we are fully ready to accept userpace writes and syzbot has found
a way to exploit this [1]. Also as far as I'm reading the code
the filesystem remount code was protected from racing writes in the
legacy mount path by the mount's MNT_READONLY flag so this is relatively
new problem. It is actually fairly easy to protect remount read-write
from racing writes using sb->s_readonly_remount flag so let's just do
that instead of having to workaround these races in the filesystem code.
[1] https://lore.kernel.org/all/00000000000006a0df05f6667499@google.com/T/
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230615113848.8439-1-jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ea2b62f305893992156a798f665847e0663c9f41 upstream.
sb_getblk(inode->i_sb, parent) return a null ptr and taking lock on
that leads to the null-ptr-deref bug.
Reported-by: syzbot+aad58150cbc64ba41bdc@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=aad58150cbc64ba41bdc
Signed-off-by: Prince Kumar Maurya <princekumarmaurya06@gmail.com>
Message-Id: <20230531013141.19487-1-princekumarmaurya06@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit ea303f72d70ce2f0b0aa94ab127085289768c5a6 upstream.
syzbot is reporting too large allocation at ntfs_load_attr_list(), for
a crafted filesystem can have huge data_size.
Reported-by: syzbot <syzbot+89dbb3a789a5b9711793@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=89dbb3a789a5b9711793
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 797964253d358cf8d705614dda394dbe30120223 upstream.
In commit 20ea1e7d13c1 ("file: always lock position for
FMODE_ATOMIC_POS") we ended up always taking the file pos lock, because
pidfd_getfd() could get a reference to the file even when it didn't have
an elevated file count due to threading of other sharing cases.
But Mateusz Guzik reports that the extra locking is actually measurable,
so let's re-introduce the optimization, and only force the locking for
directory traversal.
Directories need the lock for correctness reasons, while regular files
only need it for "POSIX semantics". Since pidfd_getfd() is about
debuggers etc special things that are _way_ outside of POSIX, we can
relax the rules for that case.
Reported-by: Mateusz Guzik <mjguzik@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Link: https://lore.kernel.org/linux-fsdevel/20230803095311.ijpvhx3fyrbkasul@f/
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|