aboutsummaryrefslogtreecommitdiff
path: root/fs/ext4
AgeCommit message (Collapse)Author
2023-12-20ext4: prevent the normalized size from exceeding EXT_MAX_BLOCKSBaokun Li
commit 2dcf5fde6dffb312a4bfb8ef940cea2d1f402e32 upstream. For files with logical blocks close to EXT_MAX_BLOCKS, the file size predicted in ext4_mb_normalize_request() may exceed EXT_MAX_BLOCKS. This can cause some blocks to be preallocated that will not be used. And after [Fixes], the following issue may be triggered: ========================================================= kernel BUG at fs/ext4/mballoc.c:4653! Internal error: Oops - BUG: 00000000f2000800 [#1] SMP CPU: 1 PID: 2357 Comm: xfs_io 6.7.0-rc2-00195-g0f5cc96c367f Hardware name: linux,dummy-virt (DT) pc : ext4_mb_use_inode_pa+0x148/0x208 lr : ext4_mb_use_inode_pa+0x98/0x208 Call trace: ext4_mb_use_inode_pa+0x148/0x208 ext4_mb_new_inode_pa+0x240/0x4a8 ext4_mb_use_best_found+0x1d4/0x208 ext4_mb_try_best_found+0xc8/0x110 ext4_mb_regular_allocator+0x11c/0xf48 ext4_mb_new_blocks+0x790/0xaa8 ext4_ext_map_blocks+0x7cc/0xd20 ext4_map_blocks+0x170/0x600 ext4_iomap_begin+0x1c0/0x348 ========================================================= Here is a calculation when adjusting ac_b_ex in ext4_mb_new_inode_pa(): ex.fe_logical = orig_goal_end - EXT4_C2B(sbi, ex.fe_len); if (ac->ac_o_ex.fe_logical >= ex.fe_logical) goto adjust_bex; The problem is that when orig_goal_end is subtracted from ac_b_ex.fe_len it is still greater than EXT_MAX_BLOCKS, which causes ex.fe_logical to overflow to a very small value, which ultimately triggers a BUG_ON in ext4_mb_new_inode_pa() because pa->pa_free < len. The last logical block of an actual write request does not exceed EXT_MAX_BLOCKS, so in ext4_mb_normalize_request() also avoids normalizing the last logical block to exceed EXT_MAX_BLOCKS to avoid the above issue. The test case in [Link] can reproduce the above issue with 64k block size. Link: https://patchwork.kernel.org/project/fstests/list/?series=804003 Cc: <stable@kernel.org> # 6.4 Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20231127063313.3734294-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-12-20ext4: fix warning in ext4_dio_write_end_io()Jan Kara
[ Upstream commit 619f75dae2cf117b1d07f27b046b9ffb071c4685 ] The syzbot has reported that it can hit the warning in ext4_dio_write_end_io() because i_size < i_disksize. Indeed the reproducer creates a race between DIO IO completion and truncate expanding the file and thus ext4_dio_write_end_io() sees an inconsistent inode state where i_disksize is already updated but i_size is not updated yet. Since we are careful when setting up DIO write and consider it extending (and thus performing the IO synchronously with i_rwsem held exclusively) whenever it goes past either of i_size or i_disksize, we can use the same test during IO completion without risking entering ext4_handle_inode_extension() without i_rwsem held. This way we make it obvious both i_size and i_disksize are large enough when we report DIO completion without relying on unreliable WARN_ON. Reported-by: <syzbot+47479b71cdfc78f56d30@syzkaller.appspotmail.com> Fixes: 91562895f803 ("ext4: properly sync file size update after O_SYNC direct IO") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20231130095653.22679-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: make sure allocate pending entry not failZhang Yi
[ Upstream commit 8e387c89e96b9543a339f84043cf9df15fed2632 ] __insert_pending() allocate memory in atomic context, so the allocation could fail, but we are not handling that failure now. It could lead ext4_es_remove_extent() to get wrong reserved clusters, and the global data blocks reservation count will be incorrect. The same to extents_status entry preallocation, preallocate pending entry out of the i_es_lock with __GFP_NOFAIL, make sure __insert_pending() and __revise_pending() always succeeds. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230824092619.1327976-3-yi.zhang@huaweicloud.com Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: fix slab-use-after-free in ext4_es_insert_extent()Baokun Li
[ Upstream commit 768d612f79822d30a1e7d132a4d4b05337ce42ec ] Yikebaer reported an issue: ================================================================== BUG: KASAN: slab-use-after-free in ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894 Read of size 4 at addr ffff888112ecc1a4 by task syz-executor/8438 CPU: 1 PID: 8438 Comm: syz-executor Not tainted 6.5.0-rc5 #1 Call Trace: [...] kasan_report+0xba/0xf0 mm/kasan/report.c:588 ext4_es_insert_extent+0xc68/0xcb0 fs/ext4/extents_status.c:894 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] Allocated by task 8438: [...] kmem_cache_zalloc include/linux/slab.h:693 [inline] __es_alloc_extent fs/ext4/extents_status.c:469 [inline] ext4_es_insert_extent+0x672/0xcb0 fs/ext4/extents_status.c:873 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] Freed by task 8438: [...] kmem_cache_free+0xec/0x490 mm/slub.c:3823 ext4_es_try_to_merge_right fs/ext4/extents_status.c:593 [inline] __es_insert_extent+0x9f4/0x1440 fs/ext4/extents_status.c:802 ext4_es_insert_extent+0x2ca/0xcb0 fs/ext4/extents_status.c:882 ext4_map_blocks+0x92a/0x16f0 fs/ext4/inode.c:680 ext4_alloc_file_blocks.isra.0+0x2df/0xb70 fs/ext4/extents.c:4462 ext4_zero_range fs/ext4/extents.c:4622 [inline] ext4_fallocate+0x251c/0x3ce0 fs/ext4/extents.c:4721 [...] ================================================================== The flow of issue triggering is as follows: 1. remove es raw es es removed es1 |-------------------| -> |----|.......|------| 2. insert es es insert es1 merge with es es1 merge with es and free es1 |----|.......|------| -> |------------|------| -> |-------------------| es merges with newes, then merges with es1, frees es1, then determines if es1->es_len is 0 and triggers a UAF. The code flow is as follows: ext4_es_insert_extent es1 = __es_alloc_extent(true); es2 = __es_alloc_extent(true); __es_remove_extent(inode, lblk, end, NULL, es1) __es_insert_extent(inode, &newes, es1) ---> insert es1 to es tree __es_insert_extent(inode, &newes, es2) ext4_es_try_to_merge_right ext4_es_free_extent(inode, es1) ---> es1 is freed if (es1 && !es1->es_len) // Trigger UAF by determining if es1 is used. We determine whether es1 or es2 is used immediately after calling __es_remove_extent() or __es_insert_extent() to avoid triggering a UAF if es1 or es2 is freed. Reported-by: Yikebaer Aizezi <yikebaer61@gmail.com> Closes: https://lore.kernel.org/lkml/CALcu4raD4h9coiyEBL4Bm0zjDwxC2CyPiTwsP3zFuhot6y9Beg@mail.gmail.com Fixes: 2a69c450083d ("ext4: using nofail preallocation in ext4_es_insert_extent()") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230815070808.3377171-1-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8e387c89e96b ("ext4: make sure allocate pending entry not fail") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: using nofail preallocation in ext4_es_insert_extent()Baokun Li
[ Upstream commit 2a69c450083db164596c75c0f5b4d9c4c0e18eba ] Similar to in ext4_es_insert_delayed_block(), we use preallocations that do not fail to avoid inconsistencies, but we do not care about es that are not must be kept, and we return 0 even if such es memory allocation fails. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-9-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8e387c89e96b ("ext4: make sure allocate pending entry not fail") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: using nofail preallocation in ext4_es_insert_delayed_block()Baokun Li
[ Upstream commit 4a2d98447b37bcb68a7f06a1078edcb4f7e6ce7e ] Similar to in ext4_es_remove_extent(), we use a no-fail preallocation to avoid inconsistencies, except that here we may have to preallocate two extent_status. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-8-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8e387c89e96b ("ext4: make sure allocate pending entry not fail") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: using nofail preallocation in ext4_es_remove_extent()Baokun Li
[ Upstream commit e9fe2b882bd5b26b987c9ba110c2222796f72af5 ] If __es_remove_extent() returns an error it means that when splitting extent, allocating an extent that must be kept failed, where returning an error directly would cause the extent tree to be inconsistent. So we use GFP_NOFAIL to pre-allocate an extent_status and pass it to __es_remove_extent() to avoid this problem. In addition, since the allocated memory is outside the i_es_lock, the extent_status tree may change and the pre-allocated extent_status is no longer needed, so we release the pre-allocated extent_status when es->es_len is not initialized. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-7-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8e387c89e96b ("ext4: make sure allocate pending entry not fail") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: use pre-allocated es in __es_remove_extent()Baokun Li
[ Upstream commit bda3efaf774fb687c2b7a555aaec3006b14a8857 ] When splitting extent, if the second extent can not be dropped, we return -ENOMEM and use GFP_NOFAIL to preallocate an extent_status outside of i_es_lock and pass it to __es_remove_extent() to be used as the second extent. This ensures that __es_remove_extent() is executed successfully, thus ensuring consistency in the extent status tree. If the second extent is not undroppable, we simply drop it and return 0. Then retry is no longer necessary, remove it. Now, __es_remove_extent() will always remove what it should, maybe more. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-6-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8e387c89e96b ("ext4: make sure allocate pending entry not fail") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: use pre-allocated es in __es_insert_extent()Baokun Li
[ Upstream commit 95f0b320339a977cf69872eac107122bf536775d ] Pass a extent_status pointer prealloc to __es_insert_extent(). If the pointer is non-null, it is used directly when a new extent_status is needed to avoid memory allocation failures. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-5-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8e387c89e96b ("ext4: make sure allocate pending entry not fail") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: factor out __es_alloc_extent() and __es_free_extent()Baokun Li
[ Upstream commit 73a2f033656be11298912201ad50615307b4477a ] Factor out __es_alloc_extent() and __es_free_extent(), which only allocate and free extent_status in these two helpers. The ext4_es_alloc_extent() function is split into __es_alloc_extent() and ext4_es_init_extent(). In __es_alloc_extent() we allocate memory using GFP_KERNEL | __GFP_NOFAIL | __GFP_ZERO if the memory allocation cannot fail, otherwise we use GFP_ATOMIC. and the ext4_es_init_extent() is used to initialize extent_status and update related variables after a successful allocation. This is to prepare for the use of pre-allocated extent_status later. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8e387c89e96b ("ext4: make sure allocate pending entry not fail") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-12-03ext4: add a new helper to check if es must be keptBaokun Li
[ Upstream commit 9649eb18c6288f514cacffdd699d5cd999c2f8f6 ] In the extent status tree, we have extents which we can just drop without issues and extents we must not drop - this depends on the extent's status - currently ext4_es_is_delayed() extents must stay, others may be dropped. A helper function is added to help determine if the current extent can be dropped, although only ext4_es_is_delayed() extents cannot be dropped currently. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 8e387c89e96b ("ext4: make sure allocate pending entry not fail") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-28ext4: properly sync file size update after O_SYNC direct IOJan Kara
commit 91562895f8030cb9a0470b1db49de79346a69f91 upstream. Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly sync file size update and thus if we crash at unfortunate moment, the file can have smaller size although O_SYNC IO has reported successful completion. The problem happens because update of on-disk inode size is handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus dio_complete() in particular) has returned and generic_file_sync() gets called by dio_complete(). Fix the problem by handling on-disk inode size update directly in our ->end_io completion handler. References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com> CC: stable@vger.kernel.org Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure") Signed-off-by: Jan Kara <jack@suse.cz> Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20231013121350.26872-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28ext4: add missed brelse in update_backupsKemeng Shi
commit 9adac8b01f4be28acd5838aade42b8daa4f0b642 upstream. add missed brelse in update_backups Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230826174712.4059355-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28ext4: remove gdb backup copy for meta bg in setup_new_flex_group_blocksKemeng Shi
commit 40dd7953f4d606c280074f10d23046b6812708ce upstream. Wrong check of gdb backup in meta bg as following: first_group is the first group of meta_bg which contains target group, so target group is always >= first_group. We check if target group has gdb backup by comparing first_group with [group + 1] and [group + EXT4_DESC_PER_BLOCK(sb) - 1]. As group >= first_group, then [group + N] is > first_group. So no copy of gdb backup in meta bg is done in setup_new_flex_group_blocks. No need to do gdb backup copy in meta bg from setup_new_flex_group_blocks as we always copy updated gdb block to backups at end of ext4_flex_group_add as following: ext4_flex_group_add /* no gdb backup copy for meta bg any more */ setup_new_flex_group_blocks /* update current group number */ ext4_update_super sbi->s_groups_count += flex_gd->count; /* * if group in meta bg contains backup is added, the primary gdb block * of the meta bg will be copy to backup in new added group here. */ for (; gdb_num <= gdb_num_end; gdb_num++) update_backups(...) In summary, we can remove wrong gdb backup copy code in setup_new_flex_group_blocks. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230826174712.4059355-5-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28ext4: correct the start block of counting reserved clustersZhang Yi
commit 40ea98396a3659062267d1fe5f99af4f7e4f05e3 upstream. When big allocate feature is enabled, we need to count and update reserved clusters before removing a delayed only extent_status entry. {init|count|get}_rsvd() have already done this, but the start block number of this counting isn't correct in the following case. lblk end | | v v ------------------------- | | orig_es ------------------------- ^ ^ len1 is 0 | len2 | If the start block of the orig_es entry founded is bigger than lblk, we passed lblk as start block to count_rsvd(), but the length is correct, finally, the range to be counted is offset. This patch fix this by passing the start blocks to 'orig_es->lblk + len1'. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Cc: stable@kernel.org Link: https://lore.kernel.org/r/20230824092619.1327976-2-yi.zhang@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28ext4: correct return value of ext4_convert_meta_bgKemeng Shi
commit 48f1551592c54f7d8e2befc72a99ff4e47f7dca0 upstream. Avoid to ignore error in "err". Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Link: https://lore.kernel.org/r/20230826174712.4059355-4-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28ext4: mark buffer new if it is unwritten to avoid stale data exposureOjaswin Mujoo
commit 2cd8bdb5efc1e0d5b11a4b7ba6b922fd2736a87f upstream. ** Short Version ** In ext4 with dioread_nolock, we could have a scenario where the bh returned by get_blocks (ext4_get_block_unwritten()) in __block_write_begin_int() has UNWRITTEN and MAPPED flag set. Since such a bh does not have NEW flag set we never zero out the range of bh that is not under write, causing whatever stale data is present in the folio at that time to be written out to disk. To fix this mark the buffer as new, in case it is unwritten, in ext4_get_block_unwritten(). ** Long Version ** The issue mentioned above was resulting in two different bugs: 1. On block size < page size case in ext4, generic/269 was reliably failing with dioread_nolock. The state of the write was as follows: * The write was extending i_size. * The last block of the file was fallocated and had an unwritten extent * We were near ENOSPC and hence we were switching to non-delayed alloc allocation. In this case, the back trace that triggers the bug is as follows: ext4_da_write_begin() /* switch to nodelalloc due to low space */ ext4_write_begin() ext4_should_dioread_nolock() // true since mount flags still have delalloc __block_write_begin(..., ext4_get_block_unwritten) __block_write_begin_int() for(each buffer head in page) { /* first iteration, this is bh1 which contains i_size */ if (!buffer_mapped) get_block() /* returns bh with only UNWRITTEN and MAPPED */ /* second iteration, bh2 */ if (!buffer_mapped) get_block() /* we fail here, could be ENOSPC */ } if (err) /* * this would zero out all new buffers and mark them uptodate. * Since bh1 was never marked new, we skip it here which causes * the bug later. */ folio_zero_new_buffers(); /* ext4_wrte_begin() error handling */ ext4_truncate_failed_write() ext4_truncate() ext4_block_truncate_page() __ext4_block_zero_page_range() if(!buffer_uptodate()) ext4_read_bh_lock() ext4_read_bh() -> ... ext4_submit_bh_wbc() BUG_ON(buffer_unwritten(bh)); /* !!! */ 2. The second issue is stale data exposure with page size >= blocksize with dioread_nolock. The conditions needed for it to happen are same as the previous issue ie dioread_nolock around ENOSPC condition. The issue is also similar where in __block_write_begin_int() when we call ext4_get_block_unwritten() on the buffer_head and the underlying extent is unwritten, we get an unwritten and mapped buffer head. Since it is not new, we never zero out the partial range which is not under write, thus writing stale data to disk. This can be easily observed with the following reproducer: fallocate -l 4k testfile xfs_io -c "pwrite 2k 2k" testfile # hexdump output will have stale data in from byte 0 to 2k in testfile hexdump -C testfile NOTE: To trigger this, we need dioread_nolock enabled and write happening via ext4_write_begin(), which is usually used when we have -o nodealloc. Since dioread_nolock is disabled with nodelalloc, the only alternate way to call ext4_write_begin() is to ensure that delayed alloc switches to nodelalloc ie ext4_da_write_begin() calls ext4_write_begin(). This will usually happen when ext4 is almost full like the way generic/269 was triggering it in Issue 1 above. This might make the issue harder to hit. Hence, for reliable replication, I used the below patch to temporarily allow dioread_nolock with nodelalloc and then mount the disk with -o nodealloc,dioread_nolock. With this you can hit the stale data issue 100% of times: @@ -508,8 +508,8 @@ static inline int ext4_should_dioread_nolock(struct inode *inode) if (ext4_should_journal_data(inode)) return 0; /* temporary fix to prevent generic/422 test failures */ - if (!test_opt(inode->i_sb, DELALLOC)) - return 0; + // if (!test_opt(inode->i_sb, DELALLOC)) + // return 0; return 1; } After applying this patch to mark buffer as NEW, both the above issues are fixed. Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Cc: stable@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/d0ed09d70a9733fbb5349c5c7b125caac186ecdf.1695033645.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28ext4: correct offset of gdb backup in non meta_bg group to update_backupsKemeng Shi
commit 31f13421c004a420c0e9d288859c9ea9259ea0cc upstream. Commit 0aeaa2559d6d5 ("ext4: fix corruption when online resizing a 1K bigalloc fs") found that primary superblock's offset in its group is not equal to offset of backup superblock in its group when block size is 1K and bigalloc is enabled. As group descriptor blocks are right after superblock, we can't pass block number of gdb to update_backups for the same reason. The root casue of the issue above is that leading 1K padding block is count as data block offset for primary block while backup block has no padding block offset in its group. Remove padding data block count to fix the issue for gdb backups. For meta_bg case, update_backups treat blk_off as block number, do no conversion in this case. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230826174712.4059355-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-28ext4: apply umask if ACL support is disabledMax Kellermann
commit 484fd6c1de13b336806a967908a927cc0356e312 upstream. The function ext4_init_acl() calls posix_acl_create() which is responsible for applying the umask. But without CONFIG_EXT4_FS_POSIX_ACL, ext4_init_acl() is an empty inline function, and nobody applies the umask. This fixes a bug which causes the umask to be ignored with O_TMPFILE on ext4: https://github.com/MusicPlayerDaemon/MPD/issues/558 https://bugs.gentoo.org/show_bug.cgi?id=686142#c3 https://bugzilla.kernel.org/show_bug.cgi?id=203625 Reviewed-by: "J. Bruce Fields" <bfields@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Max Kellermann <max.kellermann@ionos.com> Link: https://lore.kernel.org/r/20230919081824.1096619-1-max.kellermann@ionos.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-20ext4: move 'ix' sanity check to corrent positionGou Hao
[ Upstream commit af90a8f4a09ec4a3de20142e37f37205d4687f28 ] Check 'ix' before it is used. Fixes: 80e675f906db ("ext4: optimize memmmove lengths in extent/index insertions") Signed-off-by: Gou Hao <gouhao@uniontech.com> Link: https://lore.kernel.org/r/20230906013341.7199-1-gouhao@uniontech.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-11-02ext4: avoid overlapping preallocations due to overflowBaokun Li
commit bedc5d34632c21b5adb8ca7143d4c1f794507e4c upstream. Let's say we want to allocate 2 blocks starting from 4294966386, after predicting the file size, start is aligned to 4294965248, len is changed to 2048, then end = start + size = 0x100000000. Since end is of type ext4_lblk_t, i.e. uint, end is truncated to 0. This causes (pa->pa_lstart >= end) to always hold when checking if the current extent to be allocated crosses already preallocated blocks, so the resulting ac_g_ex may cross already preallocated blocks. Hence we convert the end type to loff_t and use pa_logical_end() to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-4-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-02ext4: fix BUG in ext4_mb_new_inode_pa() due to overflowBaokun Li
commit bc056e7163ac7db945366de219745cf94f32a3e6 upstream. When we calculate the end position of ext4_free_extent, this position may be exactly where ext4_lblk_t (i.e. uint) overflows. For example, if ac_g_ex.fe_logical is 4294965248 and ac_orig_goal_len is 2048, then the computed end is 0x100000000, which is 0. If ac->ac_o_ex.fe_logical is not the first case of adjusting the best extent, that is, new_bex_end > 0, the following BUG_ON will be triggered: ========================================================= kernel BUG at fs/ext4/mballoc.c:5116! invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 673 Comm: xfs_io Tainted: G E 6.5.0-rc1+ #279 RIP: 0010:ext4_mb_new_inode_pa+0xc5/0x430 Call Trace: <TASK> ext4_mb_use_best_found+0x203/0x2f0 ext4_mb_try_best_found+0x163/0x240 ext4_mb_regular_allocator+0x158/0x1550 ext4_mb_new_blocks+0x86a/0xe10 ext4_ext_map_blocks+0xb0c/0x13a0 ext4_map_blocks+0x2cd/0x8f0 ext4_iomap_begin+0x27b/0x400 iomap_iter+0x222/0x3d0 __iomap_dio_rw+0x243/0xcb0 iomap_dio_rw+0x16/0x80 ========================================================= A simple reproducer demonstrating the problem: mkfs.ext4 -F /dev/sda -b 4096 100M mount /dev/sda /tmp/test fallocate -l1M /tmp/test/tmp fallocate -l10M /tmp/test/file fallocate -i -o 1M -l16777203M /tmp/test/file fsstress -d /tmp/test -l 0 -n 100000 -p 8 & sleep 10 && killall -9 fsstress rm -f /tmp/test/tmp xfs_io -c "open -ad /tmp/test/file" -c "pwrite -S 0xff 0 8192" We simply refactor the logic for adjusting the best extent by adding a temporary ext4_free_extent ex and use extent_logical_end() to avoid overflow, which also simplifies the code. Cc: stable@kernel.org # 6.4 Fixes: 93cdf49f6eca ("ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()") Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-3-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-11-02ext4: add two helper functions extent_logical_end() and pa_logical_end()Baokun Li
commit 43bbddc067883d94de7a43d5756a295439fbe37d upstream. When we use lstart + len to calculate the end of free extent or prealloc space, it may exceed the maximum value of 4294967295(0xffffffff) supported by ext4_lblk_t and cause overflow, which may lead to various problems. Therefore, we add two helper functions, extent_logical_end() and pa_logical_end(), to limit the type of end to loff_t, and also convert lstart to loff_t for calculation to avoid overflow. Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230724121059.11834-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Baokun Li <libaokun1@huawei.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-10-06ext4: do not let fstrim block system suspendJan Kara
[ Upstream commit 5229a658f6453362fbb9da6bf96872ef25a7097e ] Len Brown has reported that system suspend sometimes fail due to inability to freeze a task working in ext4_trim_fs() for one minute. Trimming a large filesystem on a disk that slowly processes discard requests can indeed take a long time. Since discard is just an advisory call, it is perfectly fine to interrupt it at any time and the return number of discarded blocks until that moment. Do that when we detect the task is being frozen. Cc: stable@kernel.org Reported-by: Len Brown <lenb@kernel.org> Suggested-by: Dave Chinner <david@fromorbit.com> References: https://bugzilla.kernel.org/show_bug.cgi?id=216322 Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-2-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06ext4: move setting of trimmed bit into ext4_try_to_trim_range()Jan Kara
[ Upstream commit 45e4ab320c9b5fa67b1fc3b6a9b381cfcc0c8488 ] Currently we set the group's trimmed bit in ext4_trim_all_free() based on return value of ext4_try_to_trim_range(). However when we will want to abort trimming because of suspend attempt, we want to return success from ext4_try_to_trim_range() but not set the trimmed bit. Instead implementing awkward propagation of this information, just move setting of trimmed bit into ext4_try_to_trim_range() when the whole group is trimmed. Cc: stable@kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230913150504.9054-1-jack@suse.cz Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-10-06ext4: replace the traditional ternary conditional operator with with max()/min()Kemeng Shi
[ Upstream commit de8bf0e5ee7482585450357c6d4eddec8efc5cb7 ] Replace the traditional ternary conditional operator with with max()/min() Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-7-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Stable-dep-of: 45e4ab320c9b ("ext4: move setting of trimmed bit into ext4_try_to_trim_range()") Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-23ext4: fix rec_len verify errorShida Zhang
commit 7fda67e8c3ab6069f75888f67958a6d30454a9f6 upstream. With the configuration PAGE_SIZE 64k and filesystem blocksize 64k, a problem occurred when more than 13 million files were directly created under a directory: EXT4-fs error (device xx): ext4_dx_csum_set:492: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D. EXT4-fs error (device xx): ext4_dx_csum_verify:463: inode #xxxx: comm xxxxx: dir seems corrupt? Run e2fsck -D. EXT4-fs error (device xx): dx_probe:856: inode #xxxx: block 8188: comm xxxxx: Directory index failed checksum When enough files are created, the fake_dirent->reclen will be 0xffff. it doesn't equal to the blocksize 65536, i.e. 0x10000. But it is not the same condition when blocksize equals to 4k. when enough files are created, the fake_dirent->reclen will be 0x1000. it equals to the blocksize 4k, i.e. 0x1000. The problem seems to be related to the limitation of the 16-bit field when the blocksize is set to 64k. To address this, helpers like ext4_rec_len_{from,to}_disk has already been introduced to complete the conversion between the encoded and the plain form of rec_len. So fix this one by using the helper, and all the other in this file too. Cc: stable@kernel.org Fixes: dbe89444042a ("ext4: Calculate and verify checksums for htree nodes") Suggested-by: Andreas Dilger <adilger@dilger.ca> Suggested-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Shida Zhang <zhangshida@kylinos.cn> Reviewed-by: Andreas Dilger <adilger@dilger.ca> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20230803060938.1929759-1-zhangshida@kylinos.cn Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-19ext4: fix memory leaks in ext4_fname_{setup_filename,prepare_lookup}Luís Henriques
commit 7ca4b085f430f3774c3838b3da569ceccd6a0177 upstream. If the filename casefolding fails, we'll be leaking memory from the fscrypt_name struct, namely from the 'crypto_buf.name' member. Make sure we free it in the error path on both ext4_fname_setup_filename() and ext4_fname_prepare_lookup() functions. Cc: stable@kernel.org Fixes: 1ae98e295fa2 ("ext4: optimize match for casefolded encrypted dirs") Signed-off-by: Luís Henriques <lhenriques@suse.de> Reviewed-by: Eric Biggers <ebiggers@google.com> Link: https://lore.kernel.org/r/20230803091713.13239-1-lhenriques@suse.de Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-19ext4: add correct group descriptors and reserved GDT blocks to system zoneWang Jianjian
commit 68228da51c9a436872a4ef4b5a7692e29f7e5bc7 upstream. When setup_system_zone, flex_bg is not initialized so it is always 1. Use a new helper function, ext4_num_base_meta_blocks() which does not depend on sbi->s_log_groups_per_flex being initialized. [ Squashed two patches in the Link URL's below together into a single commit, which is simpler to review/understand. Also fix checkpatch warnings. --TYT ] Cc: stable@kernel.org Signed-off-by: Wang Jianjian <wangjianjian0@foxmail.com> Link: https://lore.kernel.org/r/tencent_21AF0D446A9916ED5C51492CC6C9A0A77B05@qq.com Link: https://lore.kernel.org/r/tencent_D744D1450CC169AEA77FCF0A64719909ED05@qq.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13ext4: fix unttached inode after power cut with orphan file feature enabledZhihao Cheng
[ Upstream commit 1524773425ae8113b0b782886366e68656b34e53 ] Running generic/475(filesystem consistent tests after power cut) could easily trigger unattached inode error while doing fsck: Unattached zero-length inode 39405. Clear? no Unattached inode 39405 Connect to /lost+found? no Above inconsistence is caused by following process: P1 P2 ext4_create inode = ext4_new_inode_start_handle // itable records nlink=1 ext4_add_nondir err = ext4_add_entry // ENOSPC ext4_append ext4_bread ext4_getblk ext4_map_blocks // returns ENOSPC drop_nlink(inode) // won't be updated into disk inode ext4_orphan_add(handle, inode) ext4_orphan_file_add ext4_journal_stop(handle) jbd2_journal_commit_transaction // commit success >> power cut << ext4_fill_super ext4_load_and_init_journal // itable records nlink=1 ext4_orphan_cleanup ext4_process_orphan if (inode->i_nlink) // true, inode won't be deleted Then, allocated inode will be reserved on disk and corresponds to no dentries, so e2fsck reports 'unattached inode' problem. The problem won't happen if orphan file feature is disabled, because ext4_orphan_add() will update disk inode in orphan list mode. There are several places not updating disk inode while putting inode into orphan area, such as ext4_add_nondir(), ext4_symlink() and whiteout in ext4_rename(). Fix it by updating inode into disk in all error branches of these places. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217605 Fixes: 02f310fcf47f ("ext4: Speedup ext4 orphan inode handling") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230628132011.650383-1-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13ext4: avoid potential data overflow in next_linear_groupKemeng Shi
[ Upstream commit 60c672b7f2d1e5dd1774f2399b355c9314e709f8 ] ngroups is ext4_group_t (unsigned int) while next_linear_group treat it in int. If ngroups is bigger than max number described by int, it will be treat as a negative number. Then "return group + 1 >= ngroups ? 0 : group + 1;" may keep returning 0. Switch int to ext4_group_t in next_linear_group to fix the overflow. Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-3-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13ext4: correct grp validation in ext4_mb_good_groupKemeng Shi
[ Upstream commit a9ce5993a0f5c0887c8a1b4ffa3b8046fbcfdc93 ] Group corruption check will access memory of grp and will trigger kernel crash if grp is NULL. So do NULL check before corruption check. Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail") Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Link: https://lore.kernel.org/r/20230801143204.2284343-2-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-27ext4: correct inline offset when handling xattrs in inode bodyEric Whitney
commit 6909cf5c4101214f4305a62d582a5b93c7e1eb9a upstream. When run on a file system where the inline_data feature has been enabled, xfstests generic/269, generic/270, and generic/476 cause ext4 to emit error messages indicating that inline directory entries are corrupted. This occurs because the inline offset used to locate inline directory entries in the inode body is not updated when an xattr in that shared region is deleted and the region is shifted in memory to recover the space it occupied. If the deleted xattr precedes the system.data attribute, which points to the inline directory entries, that attribute will be moved further up in the region. The inline offset continues to point to whatever is located in system.data's former location, with unfortunate effects when used to access directory entries or (presumably) inline data in the inode body. Cc: stable@kernel.org Signed-off-by: Eric Whitney <enwlinux@gmail.com> Link: https://lore.kernel.org/r/20230522181520.1570360-1-enwlinux@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ext4: only update i_reserved_data_blocks on successful block allocationBaokun Li
commit de25d6e9610a8b30cce9bbb19b50615d02ebca02 upstream. In our fault injection test, we create an ext4 file, migrate it to non-extent based file, then punch a hole and finally trigger a WARN_ON in the ext4_da_update_reserve_space(): EXT4-fs warning (device sda): ext4_da_update_reserve_space:369: ino 14, used 11 with only 10 reserved data blocks When writing back a non-extent based file, if we enable delalloc, the number of reserved blocks will be subtracted from the number of blocks mapped by ext4_ind_map_blocks(), and the extent status tree will be updated. We update the extent status tree by first removing the old extent_status and then inserting the new extent_status. If the block range we remove happens to be in an extent, then we need to allocate another extent_status with ext4_es_alloc_extent(). use old to remove to add new |----------|------------|------------| old extent_status The problem is that the allocation of a new extent_status failed due to a fault injection, and __es_shrink() did not get free memory, resulting in a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM, we map to the same extent again, and the number of reserved blocks is again subtracted from the number of blocks in that extent. Since the blocks in the same extent are subtracted twice, we end up triggering WARN_ON at ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks. For non-extent based file, we update the number of reserved blocks after ext4_ind_map_blocks() is executed, which causes a problem that when we call ext4_ind_map_blocks() to create a block, it doesn't always create a block, but we always reduce the number of reserved blocks. So we move the logic for updating reserved blocks to ext4_ind_map_blocks() to ensure that the number of reserved blocks is updated only after we do succeed in allocating some new blocks. Fixes: 5f634d064c70 ("ext4: Fix quota accounting error with fallocate") Cc: stable@kernel.org Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230424033846.4732-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ext4: turn quotas off if mount failed after enabling quotasBaokun Li
commit d13f99632748462c32fc95d729f5e754bab06064 upstream. Yi found during a review of the patch "ext4: don't BUG on inconsistent journal feature" that when ext4_mark_recovery_complete() returns an error value, the error handling path does not turn off the enabled quotas, which triggers the following kmemleak: ================================================================ unreferenced object 0xffff8cf68678e7c0 (size 64): comm "mount", pid 746, jiffies 4294871231 (age 11.540s) hex dump (first 32 bytes): 00 90 ef 82 f6 8c ff ff 00 00 00 00 41 01 00 00 ............A... c7 00 00 00 bd 00 00 00 0a 00 00 00 48 00 00 00 ............H... backtrace: [<00000000c561ef24>] __kmem_cache_alloc_node+0x4d4/0x880 [<00000000d4e621d7>] kmalloc_trace+0x39/0x140 [<00000000837eee74>] v2_read_file_info+0x18a/0x3a0 [<0000000088f6c877>] dquot_load_quota_sb+0x2ed/0x770 [<00000000340a4782>] dquot_load_quota_inode+0xc6/0x1c0 [<0000000089a18bd5>] ext4_enable_quotas+0x17e/0x3a0 [ext4] [<000000003a0268fa>] __ext4_fill_super+0x3448/0x3910 [ext4] [<00000000b0f2a8a8>] ext4_fill_super+0x13d/0x340 [ext4] [<000000004a9489c4>] get_tree_bdev+0x1dc/0x370 [<000000006e723bf1>] ext4_get_tree+0x1d/0x30 [ext4] [<00000000c7cb663d>] vfs_get_tree+0x31/0x160 [<00000000320e1bed>] do_new_mount+0x1d5/0x480 [<00000000c074654c>] path_mount+0x22e/0xbe0 [<0000000003e97a8e>] do_mount+0x95/0xc0 [<000000002f3d3736>] __x64_sys_mount+0xc4/0x160 [<0000000027d2140c>] do_syscall_64+0x3f/0x90 ================================================================ To solve this problem, we add a "failed_mount10" tag, and call ext4_quota_off_umount() in this tag to release the enabled qoutas. Fixes: 11215630aada ("ext4: don't BUG on inconsistent journal feature") Cc: stable@kernel.org Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230327141630.156875-2-libaokun1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ext4: fix to check return value of freeze_bdev() in ext4_shutdown()Chao Yu
commit c4d13222afd8a64bf11bc7ec68645496ee8b54b9 upstream. freeze_bdev() can fail due to a lot of reasons, it needs to check its reason before later process. Fixes: 783d94854499 ("ext4: add EXT4_IOC_GOINGDOWN ioctl") Cc: stable@kernel.org Signed-off-by: Chao Yu <chao@kernel.org> Link: https://lore.kernel.org/r/20230606073203.1310389-1-chao@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ext4: fix wrong unit use in ext4_mb_new_blocksKemeng Shi
commit 2ec6d0a5ea72689a79e6f725fd8b443a788ae279 upstream. Function ext4_free_blocks_simple needs count in cluster. Function ext4_free_blocks accepts count in block. Convert count to cluster to fix the mismatch. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-12-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ext4: get block from bh in ext4_free_blocks for fast commit replayKemeng Shi
commit 11b6890be0084ad4df0e06d89a9fdcc948472c65 upstream. ext4_free_blocks will retrieve block from bh if block parameter is zero. Retrieve block before ext4_free_blocks_simple to avoid potentially passing wrong block to ext4_free_blocks_simple. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-9-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ext4: fix wrong unit use in ext4_mb_clear_bbKemeng Shi
commit 247c3d214c23dfeeeb892e91a82ac1188bdaec9f upstream. Function ext4_issue_discard need count in cluster. Pass count_clusters instead of count to fix the mismatch. Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Cc: stable@kernel.org Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Link: https://lore.kernel.org/r/20230603150327.3596033-11-shikemeng@huaweicloud.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23ext4: Fix reusing stale buffer heads from last failed mountingZhihao Cheng
commit 26fb5290240dc31cae99b8b4dd2af7f46dfcba6b upstream. Following process makes ext4 load stale buffer heads from last failed mounting in a new mounting operation: mount_bdev ext4_fill_super | ext4_load_and_init_journal | ext4_load_journal | jbd2_journal_load | load_superblock | journal_get_superblock | set_buffer_verified(bh) // buffer head is verified | jbd2_journal_recover // failed caused by EIO | goto failed_mount3a // skip 'sb->s_root' initialization deactivate_locked_super kill_block_super generic_shutdown_super if (sb->s_root) // false, skip ext4_put_super->invalidate_bdev-> // invalidate_mapping_pages->mapping_evict_folio-> // filemap_release_folio->try_to_free_buffers, which // cannot drop buffer head. blkdev_put blkdev_put_whole if (atomic_dec_and_test(&bdev->bd_openers)) // false, systemd-udev happens to open the device. Then // blkdev_flush_mapping->kill_bdev->truncate_inode_pages-> // truncate_inode_folio->truncate_cleanup_folio-> // folio_invalidate->block_invalidate_folio-> // filemap_release_folio->try_to_free_buffers will be skipped, // dropping buffer head is missed again. Second mount: ext4_fill_super ext4_load_and_init_journal ext4_load_journal ext4_get_journal jbd2_journal_init_inode journal_init_common bh = getblk_unmovable bh = __find_get_block // Found stale bh in last failed mounting journal->j_sb_buffer = bh jbd2_journal_load load_superblock journal_get_superblock if (buffer_verified(bh)) // true, skip journal->j_format_version = 2, value is 0 jbd2_journal_recover do_one_pass next_log_block += count_tags(journal, bh) // According to journal_tag_bytes(), 'tag_bytes' calculating is // affected by jbd2_has_feature_csum3(), jbd2_has_feature_csum3() // returns false because 'j->j_format_version >= 2' is not true, // then we get wrong next_log_block. The do_one_pass may exit // early whenoccuring non JBD2_MAGIC_NUMBER in 'next_log_block'. The filesystem is corrupted here, journal is partially replayed, and new journal sequence number actually is already used by last mounting. The invalidate_bdev() can drop all buffer heads even racing with bare reading block device(eg. systemd-udev), so we can fix it by invalidating bdev in error handling path in __ext4_fill_super(). Fetch a reproducer in [Link]. Link: https://bugzilla.kernel.org/show_bug.cgi?id=217171 Fixes: 25ed6e8a54df ("jbd2: enable journal clients to enable v2 checksumming") Cc: stable@vger.kernel.org # v3.5 Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230315013128.3911115-2-chengzhihao1@huawei.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-19ext4: Remove ext4 locking of moved directoryJan Kara
commit 3658840cd363f2be094f5dfd2f0b174a9055dd0f upstream. Remove locking of moved directory in ext4_rename2(). We will take care of it in VFS instead. This effectively reverts commit 0813299c586b ("ext4: Fix possible corruption when moving a directory") and followup fixes. CC: Ted Tso <tytso@mit.edu> CC: stable@vger.kernel.org Signed-off-by: Jan Kara <jack@suse.cz> Message-Id: <20230601105830.13168-1-jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-21ext4: drop the call to ext4_error() from ext4_get_group_info()Fabio M. De Francesco
[ Upstream commit f451fd97dd2b78f286379203a47d9d295c467255 ] A recent patch added a call to ext4_error() which is problematic since some callers of the ext4_get_group_info() function may be holding a spinlock, whereas ext4_error() must never be called in atomic context. This triggered a report from Syzbot: "BUG: sleeping function called from invalid context in ext4_update_super" (see the link below). Therefore, drop the call to ext4_error() from ext4_get_group_info(). In the meantime use eight characters tabs instead of nine characters ones. Reported-by: syzbot+4acc7d910e617b360859@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000070575805fdc6cdb2@google.com/ Fixes: 5354b2af3406 ("ext4: allow ext4_get_group_info() to fail") Suggested-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Link: https://lore.kernel.org/r/20230614100446.14337-1-fmdefrancesco@gmail.com Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-14ext4: only check dquot_initialize_needed() when debuggingTheodore Ts'o
commit dea9d8f7643fab07bf89a1155f1f94f37d096a5e upstream. ext4_xattr_block_set() relies on its caller to call dquot_initialize() on the inode. To assure that this has happened there are WARN_ON checks. Unfortunately, this is subject to false positives if there is an antagonist thread which is flipping the file system at high rates between r/o and rw. So only do the check if EXT4_XATTR_DEBUG is enabled. Link: https://lore.kernel.org/r/20230608044056.GA1418535@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-14Revert "ext4: don't clear SB_RDONLY when remounting r/w until quota is ↵Theodore Ts'o
re-enabled" commit 1b29243933098cdbc31b579b5616e183b4275e2f upstream. This reverts commit a44be64bbecb15a452496f60db6eacfee2b59c79. Link: https://lore.kernel.org/r/653b3359-2005-21b1-039d-c55ca4cffdcc@gmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09ext4: enable the lazy init thread when remounting read/writeTheodore Ts'o
commit eb1f822c76beeaa76ab8b6737ab9dc9f9798408c upstream. In commit a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until quota is re-enabled") we defer clearing tyhe SB_RDONLY flag in struct super. However, we didn't defer when we checked sb_rdonly() to determine the lazy itable init thread should be enabled, with the next result that the lazy inode table initialization would not be properly started. This can cause generic/231 to fail in ext4's nojournal mode. Fix this by moving when we decide to start or stop the lazy itable init thread to after we clear the SB_RDONLY flag when we are remounting the file system read/write. Fixes a44be64bbecb ("ext4: don't clear SB_RDONLY when remounting r/w until...") Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230527035729.1001605-1-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09ext4: add lockdep annotations for i_data_sem for ea_inode'sTheodore Ts'o
commit aff3bea95388299eec63440389b4545c8041b357 upstream. Treat i_data_sem for ea_inodes as being in their own lockdep class to avoid lockdep complaints about ext4_setattr's use of inode_lock() on normal inodes potentially causing lock ordering with i_data_sem on ea_inodes in ext4_xattr_inode_write(). However, ea_inodes will be operated on by ext4_setattr(), so this isn't a problem. Cc: stable@kernel.org Link: https://syzkaller.appspot.com/bug?extid=298c5d8fb4a128bc27b0 Reported-by: syzbot+298c5d8fb4a128bc27b0@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230524034951.779531-5-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09ext4: disallow ea_inodes with extended attributesTheodore Ts'o
commit 2bc7e7c1a3bc9bd0cbf0f71006f6fe7ef24a00c2 upstream. An ea_inode stores the value of an extended attribute; it can not have extended attributes itself, or this will cause recursive nightmares. Add a check in ext4_iget() to make sure this is the case. Cc: stable@kernel.org Reported-by: syzbot+e44749b6ba4d0434cd47@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230524034951.779531-4-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09ext4: set lockdep subclass for the ea_inode in ext4_xattr_inode_cache_find()Theodore Ts'o
commit b928dfdcb27d8fa59917b794cfba53052a2f050f upstream. If the ea_inode has been pushed out of the inode cache while there is still a reference in the mb_cache, the lockdep subclass will not be set on the inode, which can lead to some lockdep false positives. Fixes: 33d201e0277b ("ext4: fix lockdep warning about recursive inode locking") Cc: stable@kernel.org Reported-by: syzbot+d4b971e744b1f5439336@syzkaller.appspotmail.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230524034951.779531-3-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09ext4: add EA_INODE checking to ext4_iget()Theodore Ts'o
commit b3e6bcb94590dea45396b9481e47b809b1be4afa upstream. Add a new flag, EXT4_IGET_EA_INODE which indicates whether the inode is expected to have the EA_INODE flag or not. If the flag is not set/clear as expected, then fail the iget() operation and mark the file system as corrupted. This commit also makes the ext4_iget() always perform the is_bad_inode() check even when the inode is already inode cache. This allows us to remove the is_bad_inode() check from the callers of ext4_iget() in the ea_inode code. Reported-by: syzbot+cbb68193bdb95af4340a@syzkaller.appspotmail.com Reported-by: syzbot+62120febbd1ee3c3c860@syzkaller.appspotmail.com Reported-by: syzbot+edce54daffee36421b4c@syzkaller.appspotmail.com Cc: stable@kernel.org Signed-off-by: Theodore Ts'o <tytso@mit.edu> Link: https://lore.kernel.org/r/20230524034951.779531-2-tytso@mit.edu Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-24ext4: Fix best extent lstart adjustment logic in ext4_mb_new_inode_pa()Ojaswin Mujoo
[ Upstream commit 93cdf49f6eca5e23f6546b8f28457b2e6a6961d9 ] When the length of best extent found is less than the length of goal extent we need to make sure that the best extent atleast covers the start of the original request. This is done by adjusting the ac_b_ex.fe_logical (logical start) of the extent. While doing so, the current logic sometimes results in the best extent's logical range overflowing the goal extent. Since this best extent is later added to the inode preallocation list, we have a possibility of introducing overlapping preallocations. This is discussed in detail here [1]. As per Jan's suggestion, to fix this, replace the existing logic with the below logic for adjusting best extent as it keeps fragmentation in check while ensuring logical range of best extent doesn't overflow out of goal extent: 1. Check if best extent can be kept at end of goal range and still cover original start. 2. Else, check if best extent can be kept at start of goal range and still cover original start. 3. Else, keep the best extent at start of original request. Also, add a few extra BUG_ONs that might help catch errors faster. [1] https://lore.kernel.org/r/Y+OGkVvzPN0RMv0O@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/f96aca6d415b36d1f90db86c1a8cd7e2e9d7ab0e.1679731817.git.ojaswin@linux.ibm.com Signed-off-by: Theodore Ts'o <tytso@mit.edu> Signed-off-by: Sasha Levin <sashal@kernel.org>