path: root/drivers/md
2023-09-13  md: fix regression for null-ptr-dereference in __md_stop()  (Yu Kuai)

commit 433279beba1d4872da10b7b60a539e0cb828b32b upstream.

Commit 3e453522593d ("md: Free resources in __md_stop") tried to fix a
null-ptr-dereference for 'active_io' by moving percpu_ref_exit() to
__md_stop(). However, the commit also moved 'writes_pending' to
__md_stop(), and this broke the mdadm tests:

  BUG: kernel NULL pointer dereference, address: 0000000000000038
  Oops: 0000 [#1] PREEMPT SMP
  CPU: 15 PID: 17830 Comm: mdadm Not tainted 6.3.0-rc3-next-20230324-00009-g520d37
  RIP: 0010:free_percpu+0x465/0x670
  Call Trace:
   <TASK>
   __percpu_ref_exit+0x48/0x70
   percpu_ref_exit+0x1a/0x90
   __md_stop+0xe9/0x170
   do_md_stop+0x1e1/0x7b0
   md_ioctl+0x90c/0x1aa0
   blkdev_ioctl+0x19b/0x400
   vfs_ioctl+0x20/0x50
   __x64_sys_ioctl+0xba/0xe0
   do_syscall_64+0x6c/0xe0
   entry_SYSCALL_64_after_hwframe+0x63/0xcd

The problem can be reproduced 100% of the time by the following test:

  mdadm -CR /dev/md0 -l1 -n1 /dev/sda --force
  echo inactive > /sys/block/md0/md/array_state
  echo read-auto > /sys/block/md0/md/array_state
  echo inactive > /sys/block/md0/md/array_state

Root cause:

  // start raid
  raid1_run
   mddev_init_writes_pending
    percpu_ref_init

  // inactivate raid
  array_state_store
   do_md_stop
    __md_stop
     percpu_ref_exit

  // start raid again
  array_state_store
   do_md_run
    raid1_run
     mddev_init_writes_pending
      if (mddev->writes_pending.percpu_count_ptr)
      // won't reinit

  // inactivate raid again
  ...
  percpu_ref_exit -> null-ptr-dereference

Before the commit, 'writes_pending' was exited when the mddev was freed,
and it was safe to restart the raid because mddev_init_writes_pending()
already made sure that 'writes_pending' would only be initialized once.

Fix the problem by moving 'writes_pending' back. It becomes a little
harder to see the relationship between where the memory is allocated and
where it is freed, but the code change is much smaller and we have lived
with this arrangement for a long time already.

Fixes: 3e453522593d ("md: Free resources in __md_stop")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230328094400.1448955-1-yukuai1@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
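The init-once guard that makes the restart safe looks roughly like the
following (a minimal sketch patterned on mddev_init_writes_pending() in
drivers/md/md.c; simplified, not the verbatim stable backport):

  int mddev_init_writes_pending(struct mddev *mddev)
  {
      /* the guard: a stop/start cycle must not re-init the ref */
      if (mddev->writes_pending.percpu_count_ptr)
          return 0;
      if (percpu_ref_init(&mddev->writes_pending, no_op,
                  PERCPU_REF_ALLOW_REINIT, GFP_KERNEL) < 0)
          return -ENOMEM;
      /* start with the refcount at zero */
      percpu_ref_put(&mddev->writes_pending);
      return 0;
  }

With the fix, percpu_ref_exit() for 'writes_pending' happens only on
final free, so the guard above sees a still-initialized ref across
inactive/read-auto transitions and the double exit can no longer occur.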
2023-09-13  md: Free resources in __md_stop  (Xiao Ni)

commit 3e453522593d74a87cf68a38e14aa36ebca1dbcd upstream.

If md_run() fails after ->active_io is initialized, percpu_ref_exit()
is called in the error path. However, md_free_disk() will later call
percpu_ref_exit() again, which leads to a panic because of a null
pointer dereference. The bug can also be triggered when resources are
initialized but freed in an error path, and then freed again in
md_free_disk():

  BUG: kernel NULL pointer dereference, address: 0000000000000038
  Oops: 0000 [#1] PREEMPT SMP
  Workqueue: md_misc mddev_delayed_delete
  RIP: 0010:free_percpu+0x110/0x630
  Call Trace:
   <TASK>
   __percpu_ref_exit+0x44/0x70
   percpu_ref_exit+0x16/0x90
   md_free_disk+0x2f/0x80
   disk_release+0x101/0x180
   device_release+0x84/0x110
   kobject_put+0x12a/0x380
   kobject_put+0x160/0x380
   mddev_delayed_delete+0x19/0x30
   process_one_work+0x269/0x680
   worker_thread+0x266/0x640
   kthread+0x151/0x1b0
   ret_from_fork+0x1f/0x30

For creating a raid device, md raid calls do_md_run() -> md_run(), and
dm raid calls md_run() directly; we allocate those resources in
md_run(). For stopping a raid device, md raid calls do_md_stop() ->
__md_stop(), and dm raid calls md_stop() -> __md_stop(). So we can free
those memory resources in __md_stop().

Fixes: 72adae23a72c ("md: Change active_io to percpu")
Reported-and-tested-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-09-13  md: raid0: account for split bio in iostat accounting  (David Jeffery)

[ Upstream commit cc22b5407e9ca76adb7efeed843146510b1b72a5 ]

When a bio is split by md raid0, the newly created bio will not be
tracked by md for I/O accounting. Only the portion of I/O still
assigned to the original bio, which was reduced by the split, will be
accounted for. This results in md iostat data sometimes showing I/O
values far below the actual amount of data being sent through md.

md_account_bio() needs to be called for all bios generated by the bio
split.

A simple example of the issue was generated using a raid0 device on
partitions of the same device. Since all raid0 I/O then goes to one
device, it is easy to see a gap between the md device and its sd
storage. Reading an lvm device on top of the md device, the iostat
output (some 0 columns and extra devices removed to make the data more
compact) was:

  Device  tps      kB_read/s  kB_wrtn/s  kB_dscd/s  kB_read
  md2     0.00     0.00       0.00       0.00       0
  sde     0.00     0.00       0.00       0.00       0
  md2     1364.00  411496.00  0.00       0.00       411496
  sde     1734.00  646144.00  0.00       0.00       646144
  md2     1699.00  510680.00  0.00       0.00       510680
  sde     2155.00  802784.00  0.00       0.00       802784
  md2     803.00   241480.00  0.00       0.00       241480
  sde     1016.00  377888.00  0.00       0.00       377888
  md2     0.00     0.00       0.00       0.00       0
  sde     0.00     0.00       0.00       0.00       0

I/O was generated by doing large direct I/O reads (12M) with dd to a
linear lvm volume on top of the 4-leg raid0 device. The md2 reads were
showing as roughly 2/3 of the reads to the sde device containing all of
md2's raid partitions. The sum of reads to sde was 1826816 kB, which
was the expected amount as it was the amount read by dd. With the
patch, the total reads from md match the reads from sde and are
consistent with the amount of I/O generated.

Fixes: 10764815ff47 ("md: add io accounting for raid0 and raid5")
Signed-off-by: David Jeffery <djeffery@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Laurence Oberman <loberman@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230816181433.13289-1-djeffery@redhat.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
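The shape of the fix can be sketched as follows (a hedged illustration
of raid0_make_request() after the patch; the exact placement of the
md_account_bio() calls in the stable backport may differ, but every
fragment produced by the split must pass through it):

  if (sectors < bio_sectors(bio)) {
      struct bio *split = bio_split(bio, sectors, GFP_NOIO,
                        &mddev->bio_set);
      bio_chain(split, bio);
      md_account_bio(mddev, &bio);    /* remainder: was missed before */
      raid0_map_submit_bio(mddev, bio);
      bio = split;
  }
  md_account_bio(mddev, &bio);        /* chunk-sized first fragment */
  raid0_map_submit_bio(mddev, bio);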
2023-09-13  md/raid0: Fix performance regression for large sequential writes  (Jan Kara)

[ Upstream commit 319ff40a542736d67e5bce18635de35d0e7a0bff ]

Commit f00d7c85be9e ("md/raid0: fix up bio splitting.") among other
things changed how a bio that needs to be split is submitted. Before
this commit, we split the bio, then mapped and submitted each part.
After this commit, we map only the first part of the split bio and
submit the second part unmapped. Due to bio sorting in
__submit_bio_noacct() this results in the following request ordering:

  9,0  18  1181  0.525037895  15995  Q  WS 1479315464 + 63392

Split off chunk-sized (1024 sectors) request:

  9,0  18  1182  0.629019647  15995  X  WS 1479315464 / 1479316488

Request is unaligned to the chunk so it's split in
raid0_make_request(). This is the first part, mapped and punted to
bio_list:

  8,0  18  7053  0.629020455  15995  A  WS 739921928 + 1016 <- (9,0) 1479315464

Now raid0_make_request() returns, the second part is postponed on
bio_list. __submit_bio_noacct() resorts the bio_list, and the mapped
request is submitted to the underlying device:

  8,0  18  7054  0.629022782  15995  G  WS 739921928 + 1016

Now we take another request from the bio_list, which is the remainder
of the original huge request. Split off another chunk-sized bit from it
and the situation repeats:

  9,0  18  1183  0.629024499  15995  X  WS 1479316488 / 1479317512
  8,16 18  6998  0.629025110  15995  A  WS 739921928 + 1016 <- (9,0) 1479316488
  8,16 18  6999  0.629026728  15995  G  WS 739921928 + 1016
  ...
  9,0  18  1184  0.629032940  15995  X  WS 1479317512 / 1479318536 [libnetacq-write]
  8,0  18  7059  0.629033294  15995  A  WS 739922952 + 1016 <- (9,0) 1479317512
  8,0  18  7060  0.629033902  15995  G  WS 739922952 + 1016
  ...

This repeats until we consume the whole original huge request. Only
then do we finally get to processing the second parts of the split-off
requests (in reverse order):

  8,16 18  7181  0.629161384  15995  A  WS 739952640 + 8 <- (9,0) 1479377920
  8,0  18  7239  0.629162140  15995  A  WS 739952640 + 8 <- (9,0) 1479376896
  8,16 18  7186  0.629163881  15995  A  WS 739951616 + 8 <- (9,0) 1479375872
  8,0  18  7242  0.629164421  15995  A  WS 739951616 + 8 <- (9,0) 1479374848
  ...

It should be obvious that this IO pattern is an extremely inefficient
way to perform sequential IO. It also makes the bio_list grow to rather
long lengths.

Change raid0_make_request() to map both parts of the split bio. Since
we know we are provided with at most chunk-sized bios, we will always
need to split the incoming bio at most once.

Fixes: f00d7c85be9e ("md/raid0: fix up bio splitting.")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230814092720.3931-2-jack@suse.cz
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13  md/raid0: Factor out helper for mapping and submitting a bio  (Jan Kara)

[ Upstream commit af50e20afb401cc203bd2a9ff62ece0ae4976103 ]

Factor out a helper function for mapping and submitting a bio out of
raid0_make_request(). We will use it later for submitting both parts of
a split bio.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230814092720.3931-1-jack@suse.cz
Signed-off-by: Song Liu <song@kernel.org>
Stable-dep-of: 319ff40a5427 ("md/raid0: Fix performance regression for large sequential writes")
Signed-off-by: Sasha Levin <sashal@kernel.org>
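The helper looks roughly like this (a condensed sketch of
raid0_map_submit_bio() based on the upstream patch; error handling,
trace points and the iostat hook added by the later accounting fix are
elided, and details may differ from the stable backport):

  static void raid0_map_submit_bio(struct mddev *mddev, struct bio *bio)
  {
      struct r0conf *conf = mddev->private;
      struct strip_zone *zone;
      struct md_rdev *tmp_dev;
      sector_t bio_sector = bio->bi_iter.bi_sector;
      sector_t sector = bio_sector;

      zone = find_zone(conf, &sector);    /* also rebases 'sector' */
      switch (conf->layout) {
      case RAID0_ORIG_LAYOUT:
          tmp_dev = map_sector(mddev, zone, bio_sector, &sector);
          break;
      case RAID0_ALT_MULTIZONE_LAYOUT:
          tmp_dev = map_sector(mddev, zone, sector, &sector);
          break;
      default:
          WARN(1, "md/raid0:%s: Invalid layout\n", mdname(mddev));
          bio_io_error(bio);
          return;
      }

      bio_set_dev(bio, tmp_dev->bdev);
      bio->bi_iter.bi_sector = sector + zone->dev_start +
                   tmp_dev->data_offset;
      submit_bio_noacct(bio);
  }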
2023-09-13  md: add error_handlers for raid0 and linear  (Mariusz Tkaczyk)

[ Upstream commit c31fea2f8e2a72c817f318016bbc327095175a9f ]

After commit 9631abdbf406c ("md: Set MD_BROKEN for RAID1 and RAID10"),
MD_BROKEN must be set if the array is failed, because state_store()
checks it; if it is set, -EBUSY is returned to userspace.

For raid0 and linear, MD_BROKEN is not set by error_handler(). As a
result, mdadm is unable to trigger clean-up actions. It is a
regression.

This patch adds an appropriate error_handler for raid0 and linear. The
error handler sets MD_BROKEN for the device.

Reviewed-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230306130317.3418-1-mariusz.tkaczyk@linux.intel.com
Stable-dep-of: 319ff40a5427 ("md/raid0: Fix performance regression for large sequential writes")
Signed-off-by: Sasha Levin <sashal@kernel.org>
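The new handler is small; a sketch patterned on the upstream patch (the
raid0 variant, the linear one is analogous):

  static void raid0_error(struct mddev *mddev, struct md_rdev *rdev)
  {
      if (!test_and_set_bit(MD_BROKEN, &mddev->flags)) {
          char *md_name = mdname(mddev);

          pr_crit("md/raid0%s: Disk failure on %pg detected, failing array.\n",
              md_name, rdev->bdev);
      }
  }

With MD_BROKEN set, state_store() refuses further state changes with
-EBUSY, which is the signal mdadm keys its clean-up actions on.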
2023-09-13  md/raid5-cache: fix null-ptr-deref for r5l_flush_stripe_to_raid()  (Yu Kuai)

[ Upstream commit 0d0bd28c500173bfca78aa840f8f36d261ef1765 ]

r5l_flush_stripe_to_raid() will check if the list 'flushing_ios' is
empty, and then submit 'flush_bio'. However, r5l_log_flush_endio()
clears the list first and clears the bio afterwards, which will cause a
null-ptr-deref:

  T1: submit flush io
  raid5d
   handle_active_stripes
    r5l_flush_stripe_to_raid
     // list is empty
     // add 'io_end_ios' to the list
     bio_init
     submit_bio
     // io1

  T2: io1 is done
  r5l_log_flush_endio
   list_splice_tail_init
   // clear the list

  T3: submit new flush io
  ...
  r5l_flush_stripe_to_raid
   // list is empty
   // add 'io_end_ios' to the list
   bio_init
    bio_uninit
    // clear bio->bi_blkg
   submit_bio
   // null-ptr-deref

Fix this problem by clearing the bio before clearing the list in
r5l_log_flush_endio().

Fixes: 0dd00cba99c3 ("raid5-cache: fully initialize flush_bio when needed")
Reported-and-tested-by: Corey Hickey <bugfood-ml@fatooh.org>
Closes: https://lore.kernel.org/all/cddd7213-3dfd-4ab7-a3ac-edd54d74a626@fatooh.org/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
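A sketch of the reordered completion handler (simplified from
r5l_log_flush_endio() in drivers/md/raid5-cache.c; stripe-run details
may differ from the stable backport):

  static void r5l_log_flush_endio(struct bio *bio)
  {
      struct r5l_log *log = container_of(bio, struct r5l_log, flush_bio);
      struct r5l_io_unit *io;
      unsigned long flags;

      if (bio->bi_status)
          md_error(log->rdev->mddev, log->rdev);

      /* clear the bio BEFORE emptying 'flushing_ios': once the list
       * is seen empty, r5l_flush_stripe_to_raid() may bio_init() and
       * resubmit 'flush_bio' concurrently */
      bio_uninit(bio);

      spin_lock_irqsave(&log->io_list_lock, flags);
      list_for_each_entry(io, &log->flushing_ios, log_sibling)
          r5l_io_run_stripes(io);
      list_splice_tail_init(&log->flushing_ios, &log->finished_ios);
      spin_unlock_irqrestore(&log->io_list_lock, flags);
  }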
2023-09-13  md/raid5-cache: fix a deadlock in r5l_exit_log()  (Yu Kuai)

[ Upstream commit a705b11b358dee677aad80630e7608b2d5f56691 ]

Commit b13015af94cf ("md/raid5-cache: Clear conf->log after finishing
work") introduced a new problem:

  // caller holds reconfig_mutex
  r5l_exit_log
   flush_work(&log->disable_writeback_work)
                    r5c_disable_writeback_async
                     wait_event
                     /*
                      * conf->log is not NULL, and mddev_trylock()
                      * will fail, wait_event() can never pass.
                      */
   conf->log = NULL

Fix this problem by setting 'conf->log' to NULL before wake_up() as it
used to be, so that wait_event() from r5c_disable_writeback_async() can
exit. In the meantime, move md_unregister_thread() forward so that the
null-ptr-deref fixed by that commit stays fixed.

Fixes: b13015af94cf ("md/raid5-cache: Clear conf->log after finishing work")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230708091727.1417894-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13  md/md-bitmap: hold 'reconfig_mutex' in backlog_store()  (Yu Kuai)

[ Upstream commit 44abfa6a95df425c0660d56043020b67e6d93ab8 ]

There are several reasons why 'reconfig_mutex' should be held:

1) rdev_for_each() is not safe to call without the lock, because an
   rdev can be removed concurrently.
2) mddev_destroy_serial_pool() and mddev_create_serial_pool() should
   not be called concurrently.
3) mddev_suspend() from mddev_destroy/create_serial_pool() should be
   protected by the lock.

Fixes: 10c92fca636e ("md-bitmap: create and destroy wb_info_pool with the change of backlog")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230706083727.608914-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
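After the patch, the store path takes the lock around everything that
touches rdevs and the serial pool; a simplified sketch (the
write-mostly rdev scan of the real function is elided):

  static ssize_t backlog_store(struct mddev *mddev, const char *buf, size_t len)
  {
      unsigned long backlog;
      unsigned long old_mwb = mddev->bitmap_info.max_write_behind;
      int rv;

      rv = kstrtoul(buf, 10, &backlog);
      if (rv)
          return rv;
      if (backlog > COUNTER_MAX)
          return -EINVAL;

      rv = mddev_lock(mddev);        /* takes reconfig_mutex */
      if (rv)
          return rv;

      /* rdev_for_each() and the serial-pool create/destroy calls
       * are only safe under reconfig_mutex */
      mddev->bitmap_info.max_write_behind = backlog;
      if (!backlog && mddev->serial_info_pool)
          mddev_destroy_serial_pool(mddev, NULL, false);
      else if (backlog && !mddev->serial_info_pool)
          mddev_create_serial_pool(mddev, NULL, false);

      if (old_mwb != backlog)
          md_bitmap_update_sb(mddev->bitmap);

      mddev_unlock(mddev);
      return len;
  }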
2023-09-13  md/md-bitmap: remove unnecessary local variable in backlog_store()  (Yu Kuai)

[ Upstream commit b4d129640f194ffc4cc64c3e97f98ae944c072e8 ]

The local variable is already defined at the beginning of
backlog_store(); there is no need to define it again.

Fixes: 8c13ab115b57 ("md/bitmap: don't set max_write_behind if there is no write mostly device")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230706083727.608914-2-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13  md/raid10: use dereference_rdev_and_rrdev() to get devices  (Li Nan)

[ Upstream commit 673643490b9a0eb3b25633abe604f62b8f63dba1 ]

Commit 2ae6aaf76912 ("md/raid10: fix io loss while replacement replace
rdev") reads the replacement first to prevent io loss. However, the
same issue exists in wait_blocked_dev() and raid10_handle_discard(),
too. Fix it by using dereference_rdev_and_rrdev() to get the devices.

Fixes: d30588b2731f ("md/raid10: improve raid10 discard request")
Fixes: f2e7e269a752 ("md/raid10: pull the code that wait for blocked dev into one function")
Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20230701080529.2684932-4-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13  md/raid10: factor out dereference_rdev_and_rrdev()  (Li Nan)

[ Upstream commit b99f8fd2d91eb734f13098aa1cf337edaca454b7 ]

Factor out a helper to get 'rdev' and 'replacement' from
conf->mirrors. This makes the code cleaner and prepares for fixing the
bug of io loss while 'replacement' replaces 'rdev'. There is no
functional change.

Signed-off-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20230701080529.2684932-3-linan666@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Stable-dep-of: 673643490b9a ("md/raid10: use dereference_rdev_and_rrdev() to get devices")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-09-13  md: restore 'noio_flag' for the last mddev_resume()  (Yu Kuai)

[ Upstream commit e24ed04389f9619e0aaef615a8948633c182a8b0 ]

memalloc_noio_save() is called for the first mddev_suspend(), and
repeated mddev_suspend() calls only increase 'suspended'. However,
memalloc_noio_restore() is also called for the first mddev_resume(),
which means that memory reclaim will be enabled before the last
mddev_resume() is called, while the array is still suspended.

Fix this problem by restoring 'noio_flag' only for the last
mddev_resume().

Fixes: 78f57ef9d50a ("md: use memalloc scope APIs in mddev_suspend()/mddev_resume()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230628012931.88911-3-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
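The pairing the fix restores, as a sketch (simplified from
mddev_suspend()/mddev_resume() in drivers/md/md.c; quiesce details
elided):

  void mddev_suspend(struct mddev *mddev)
  {
      lockdep_assert_held(&mddev->reconfig_mutex);

      if (mddev->suspended++)
          return;            /* nested suspend: just count */

      /* first suspend: enter the noio scope for the whole window */
      mddev->noio_flag = memalloc_noio_save();
      /* ... wait for io to drain and quiesce the personality ... */
  }

  void mddev_resume(struct mddev *mddev)
  {
      lockdep_assert_held(&mddev->reconfig_mutex);

      if (--mddev->suspended)
          return;            /* someone else still holds a suspend */

      /* last resume: only now may reclaim IO be allowed again */
      memalloc_noio_restore(mddev->noio_flag);
      /* ... unquiesce the personality and restart the thread ... */
  }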
2023-09-13  md: Change active_io to percpu  (Xiao Ni)

[ Upstream commit 72adae23a72cb12e2ef0dcd7c0aa042867f27998 ]

Currently the type of active_io is atomic. It is used to count how many
ios are in the submitting process, and it is incremented and decremented
every time. But it only needs to be checked against zero when suspending
the raid, so we can switch from atomic to percpu to improve performance.

After switching active_io to a percpu type, we use the state of
active_io to judge whether the raid device is suspended, and we no
longer need to wake up ->sb_wait in md_handle_request; that is done in
the callback function registered when initializing active_io. The
field mddev->suspended is now only used to count how many users are
trying to set the raid to the suspended state.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Stable-dep-of: e24ed04389f9 ("md: restore 'noio_flag' for the last mddev_resume()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
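The conversion hinges on the release callback; a sketch patterned on
this pair of patches (the init wrapper shown is an illustrative name,
not a real md.c function):

  static void active_io_release(struct percpu_ref *ref)
  {
      struct mddev *mddev = container_of(ref, struct mddev, active_io);

      /* last in-flight io is gone: let mddev_suspend() proceed */
      wake_up(&mddev->sb_wait);
  }

  static bool is_md_suspended(struct mddev *mddev)
  {
      return percpu_ref_is_dying(&mddev->active_io);
  }

  /* at array init (hypothetical wrapper for brevity) */
  static int mddev_active_io_init(struct mddev *mddev)
  {
      return percpu_ref_init(&mddev->active_io, active_io_release,
                     PERCPU_REF_ALLOW_REINIT, GFP_KERNEL);
  }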
2023-09-13  md: Factor out is_md_suspended helper  (Xiao Ni)

[ Upstream commit d19329133d25ad3dc32f8a62635692cb2f189014 ]

This helper function will be used in the next patch. It makes the code
easier to understand.

Signed-off-by: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <song@kernel.org>
Stable-dep-of: e24ed04389f9 ("md: restore 'noio_flag' for the last mddev_resume()")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03  dm cache policy smq: ensure IO doesn't prevent cleaner policy progress  (Joe Thornber)

commit 1e4ab7b4c881cf26c1c72b3f56519e03475486fb upstream.

When using the cleaner policy to decommission the cache, there is never
any writeback started from the cache as it is constantly delayed due to
normal I/O keeping the device busy. Meaning @idle=false was always
being passed to clean_target_met().

Fix this by adding a specific 'cleaner' flag that is set when the
cleaner policy is configured. This flag serves to always allow the
cleaner's writeback work to be queued until the cache is decommissioned
(even if the cache isn't idle).

Reported-by: David Jeffery <djeffery@redhat.com>
Fixes: b29d4986d0da ("dm cache: significant rework to leverage dm-bio-prison-v2")
Cc: stable@vger.kernel.org
Signed-off-by: Joe Thornber <ejt@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-08-03  dm raid: protect md_stop() with 'reconfig_mutex'  (Yu Kuai)

[ Upstream commit 7d5fff8982a2199d49ec067818af7d84d4f95ca0 ]

__md_stop_writes() and __md_stop() will modify many fields that are
protected by 'reconfig_mutex', and all the callers will grab
'reconfig_mutex' except for md_stop(). Also, update md_stop() to make
certain 'reconfig_mutex' is held using lockdep_assert_held().

Fixes: 9d09e663d550 ("dm: raid456 basic support")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
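The contract becomes explicit in md_stop() itself; a sketch, with the
dm-raid caller shown for context (teardown details simplified):

  void md_stop(struct mddev *mddev)
  {
      lockdep_assert_held(&mddev->reconfig_mutex);

      /* both helpers modify fields guarded by reconfig_mutex */
      __md_stop_writes(mddev);
      __md_stop(mddev);
  }

  /* dm-raid destructor, simplified: previously the only bare caller */
  static void raid_dtr(struct dm_target *ti)
  {
      struct raid_set *rs = ti->private;

      mddev_lock_nointr(&rs->md);
      md_stop(&rs->md);
      mddev_unlock(&rs->md);
      raid_set_free(rs);
  }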
2023-08-03  dm raid: clean up four equivalent goto tags in raid_ctr()  (Yu Kuai)

[ Upstream commit e74c874eabe2e9173a8fbdad616cd89c70eb8ffd ]

There are four equivalent goto tags in raid_ctr(); clean them up to use
just one. There is no functional change, and this is preparation for
fixing raid_ctr()'s unprotected md_stop().

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Stable-dep-of: 7d5fff8982a2 ("dm raid: protect md_stop() with 'reconfig_mutex'")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-08-03  dm raid: fix missing reconfig_mutex unlock in raid_ctr() error paths  (Yu Kuai)

[ Upstream commit bae3028799dc4f1109acc4df37c8ff06f2d8f1a0 ]

In the error paths 'bad_stripe_cache' and 'bad_check_reshape',
'reconfig_mutex' is still held after raid_ctr() returns.

Fixes: 9dbd1aa3a81c ("dm raid: add reshaping support to the target")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-23  dm: verity-loadpin: Add NULL pointer check for 'bdev' parameter  (Matthias Kaehlcke)

commit 47f04616f2c9b2f4f0c9127e30ca515a078db591 upstream.

Add a NULL check for the 'bdev' parameter of
dm_verity_loadpin_is_bdev_trusted(). The function is called by
loadpin_check(), which passes the block device that corresponds to the
super block of the file system from which a file is being loaded.
Generally a super_block structure has an associated block device,
however that is not always the case (e.g. tmpfs).

Cc: stable@vger.kernel.org # v6.0+
Fixes: b6c1c5745ccc ("dm: Add verity helpers for LoadPin")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Link: https://lore.kernel.org/r/20230627202800.1.Id63f7f59536d20f1ab83e1abdc1fda1471c7d031@changeid
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23  dm init: add dm-mod.waitfor to wait for asynchronously probed block devices  (Peter Korsgaard)

commit 035641b01e72af4f6c6cf22a4bdb5d7dfc4e8e8e upstream.

Just calling wait_for_device_probe() is not enough to ensure that
asynchronously probed block devices are available (e.g. mmc, usb), so
add a "dm-mod.waitfor=<device1>[,..,<deviceN>]" parameter to get
dm-init to explicitly wait for specific block devices before
initializing the tables, with logic similar to the rootwait logic that
was introduced with commit cc1ed7542c8c ("init: wait for asynchronously
scanned block devices").

E.g. with dm-verity on mmc using:

  dm-mod.waitfor="PARTLABEL=hash-a,PARTLABEL=root-a"

  [ 0.671671] device-mapper: init: waiting for all devices to be available before creating mapped devices
  [ 0.671679] device-mapper: init: waiting for device PARTLABEL=hash-a ...
  [ 0.710695] mmc0: new HS200 MMC card at address 0001
  [ 0.711158] mmcblk0: mmc0:0001 004GA0 3.69 GiB
  [ 0.715954] mmcblk0boot0: mmc0:0001 004GA0 partition 1 2.00 MiB
  [ 0.722085] mmcblk0boot1: mmc0:0001 004GA0 partition 2 2.00 MiB
  [ 0.728093] mmcblk0rpmb: mmc0:0001 004GA0 partition 3 512 KiB, chardev (249:0)
  [ 0.738274] mmcblk0: p1 p2 p3 p4 p5 p6 p7
  [ 0.751282] device-mapper: init: waiting for device PARTLABEL=root-a ...
  [ 0.751306] device-mapper: init: all devices available
  [ 0.751683] device-mapper: verity: sha256 using implementation "sha256-generic"
  [ 0.759344] device-mapper: ioctl: dm-0 (vroot) is ready
  [ 0.766540] VFS: Mounted root (squashfs filesystem) readonly on device 254:0.

Signed-off-by: Peter Korsgaard <peter@korsgaard.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23  md/raid0: add discard support for the 'original' layout  (Jason Baron)

commit e836007089ba8fdf24e636ef2b007651fb4582e6 upstream.

We've found that using raid0 with the 'original' layout and discard
enabled with different disk sizes (such that at least two zones are
created) can result in data corruption. This is due to the fact that
the discard handling in raid0_handle_discard() assumes the 'alternate'
layout. We've seen this corruption using ext4, but other filesystems
are likely susceptible as well.

More specifically, while multiple zones are necessary to create the
corruption, the corruption may not occur with multiple zones if they
lay out in such a way that the layout matches what the 'alternate'
layout would have produced. Thus, not all raid0 devices with the
'original' layout, different size disks and discard enabled will
encounter this corruption.

The 3.14 kernel inadvertently changed the raid0 disk layout for
different size disks. Thus, running a pre-3.14 kernel and a post-3.14
kernel on the same raid0 array could corrupt data. This led to the
creation of the 'original' layout (to match the pre-3.14 layout) and
the 'alternate' layout (to match the post-3.14 layout) in the 5.4
kernel time frame, and an option to tell the kernel which layout to use
(since it couldn't be autodetected). However, when the 'original'
layout was added back to 5.4, discard support for the 'original' layout
was not added, leading to this issue.

I've been able to reliably reproduce the corruption with the following
test case:

  1. create raid0 array with different size disks using original layout
  2. mkfs
  3. mount -o discard
  4. create lots of files
  5. remove 1/2 the files
  6. fstrim -a (or just the mount point for the raid0 array)
  7. umount
  8. fsck -fn /dev/md0 (spews all sorts of corruptions)

Let's fix this by adding proper discard support to the 'original'
layout. The fix 'maps' the 'original' layout disks to the order in
which they are read/written, such that we can compare the disks in the
same way that the current 'alternate' layout does. A 'disk_shift' field
is added to 'struct strip_zone'. This could be computed on the fly in
raid0_handle_discard(), but by adding this field we save some
computation in the discard path.

Note we could also potentially fix this by re-ordering the disks in the
zones that follow the first one, and then always reading/writing them
using the 'alternate' layout. However, that is seen as a more
substantial change, and we are attempting the least invasive fix at
this time to remedy the corruption.

I've verified the change using the reproducer mentioned above.
Typically, the corruption is seen in fewer than 3 iterations, while the
patch has run 500+ iterations.

Cc: NeilBrown <neilb@suse.de>
Cc: Song Liu <song@kernel.org>
Fixes: c84a1372df92 ("md/raid0: avoid RAID0 data corruption due to layout confusion.")
Cc: stable@vger.kernel.org
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230623180523.1901230-1-jbaron@akamai.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-23  dm integrity: reduce vmalloc space footprint on 32-bit architectures  (Mikulas Patocka)

commit 6d50eb4725934fd22f5eeccb401000687c790fd0 upstream.

It was reported that dm-integrity runs out of vmalloc space on 32-bit
architectures. On x86, there is only 128MiB of vmalloc space, and
dm-integrity consumes it quickly because it has a 64MiB journal and an
8MiB recalculate buffer.

Fix this by reducing the size of the journal to 4MiB and the size of
the recalculate buffer to 1MiB, so that multiple dm-integrity devices
can be created and activated on 32-bit architectures.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-19  bcache: Fix __bch_btree_node_alloc to make the failure behavior consistent  (Zheng Wang)

commit 80fca8a10b604afad6c14213fdfd816c4eda3ee4 upstream.

In some specific situations, the return value of __bch_btree_node_alloc
may be NULL. This may lead to a potential NULL pointer dereference in a
caller, for example via the calling chain:

  btree_split -> bch_btree_node_alloc -> __bch_btree_node_alloc

Fix it by initializing the return value in __bch_btree_node_alloc.

Fixes: cafe56359144 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-6-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-19  bcache: Remove unnecessary NULL pointer check in node allocations  (Zheng Wang)

commit 028ddcac477b691dd9205c92f991cc15259d033e upstream.

Due to the previous fix of __bch_btree_node_alloc, the return value
will never be a NULL pointer, so an IS_ERR check is enough to handle
the failure situation. Fix it by replacing the IS_ERR_OR_NULL check
with an IS_ERR check.

Fixes: cafe56359144 ("bcache: A block layer cache")
Cc: stable@vger.kernel.org
Signed-off-by: Zheng Wang <zyytlz.wz@163.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-19  bcache: fixup btree_cache_wait list damage  (Mingzhe Zou)

commit f0854489fc07d2456f7cc71a63f4faf9c716ffbe upstream.

We got a kernel crash with "list_add corruption. next->prev should be
prev (ffff9c801bc01210), but was ffff9c77b688237c.
(next=ffffae586d8afe68)."

  crash> struct list_head 0xffff9c801bc01210
  struct list_head {
    next = 0xffffae586d8afe68,
    prev = 0xffffae586d8afe68
  }
  crash> struct list_head 0xffff9c77b688237c
  struct list_head {
    next = 0x0,
    prev = 0x0
  }
  crash> struct list_head 0xffffae586d8afe68
  struct list_head struct: invalid kernel virtual address: ffffae586d8afe68  type: "gdb_readmem_callback"
  Cannot access memory at address 0xffffae586d8afe68

  [230469.019492] Call Trace:
  [230469.032041]  prepare_to_wait+0x8a/0xb0
  [230469.044363]  ? bch_btree_keys_free+0x6c/0xc0 [escache]
  [230469.056533]  mca_cannibalize_lock+0x72/0x90 [escache]
  [230469.068788]  mca_alloc+0x2ae/0x450 [escache]
  [230469.080790]  bch_btree_node_get+0x136/0x2d0 [escache]
  [230469.092681]  bch_btree_check_thread+0x1e1/0x260 [escache]
  [230469.104382]  ? finish_wait+0x80/0x80
  [230469.115884]  ? bch_btree_check_recurse+0x1a0/0x1a0 [escache]
  [230469.127259]  kthread+0x112/0x130
  [230469.138448]  ? kthread_flush_work_fn+0x10/0x10
  [230469.149477]  ret_from_fork+0x35/0x40

bch_btree_check_thread() and bch_dirty_init_thread() may call
mca_cannibalize() to cannibalize other cached btree nodes. Only one
thread can do this at a time, so the ops of other threads will be added
to the btree_cache_wait list. We must call finish_wait() to remove an
op from btree_cache_wait before freeing its memory; otherwise, the list
will be damaged. We should also call bch_cannibalize_unlock() to
release the btree_cache_alloc_lock and wake up the other waiters.

Fixes: 8e7102273f59 ("bcache: make bch_btree_check() to be multithreaded")
Fixes: b144e45fc576 ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Cc: stable@vger.kernel.org
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20230615121223.22502-7-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
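For context, the lock that queues waiters and the release the fix
requires, as a sketch (simplified from drivers/md/bcache/btree.c; the
exit helper shown is a hypothetical name for what the patched threads
do before freeing the structure embedding 'op'):

  static int mca_cannibalize_lock(struct cache_set *c, struct btree_op *op)
  {
      spin_lock(&c->btree_cannibalize_lock);
      if (likely(c->btree_cache_alloc_lock == NULL)) {
          c->btree_cache_alloc_lock = current;
      } else if (c->btree_cache_alloc_lock != current) {
          if (op)
              /* queues op->wait on c->btree_cache_wait */
              prepare_to_wait(&c->btree_cache_wait, &op->wait,
                      TASK_UNINTERRUPTIBLE);
          spin_unlock(&c->btree_cannibalize_lock);
          return -EINTR;
      }
      spin_unlock(&c->btree_cannibalize_lock);
      return 0;
  }

  static void bch_btree_thread_exit(struct cache_set *c, struct btree_op *op)
  {
      /* dequeue op before its memory goes away, then release the
       * cannibalize lock and wake the remaining waiters */
      finish_wait(&c->btree_cache_wait, &op->wait);
      bch_cannibalize_unlock(c);
  }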
2023-07-19  dm ioctl: Avoid double-fetch of version  (Demi Marie Obenour)

[ Upstream commit 249bed821b4db6d95a99160f7d6d236ea5fe6362 ]

The version is fetched once in check_version(), which then does some
validation and then overwrites the version in userspace with the API
version supported by the kernel. copy_params() then fetches the version
from userspace *again*, and this time no validation is done. The result
is that the kernel's version number is completely controllable by
userspace, provided that userspace can win a race condition.

Fix this flaw by not copying the version back to the kernel the second
time. This is not exploitable as the version is not further used in the
kernel. However, it could become a problem if future patches start
relying on the version field.

Cc: stable@vger.kernel.org
Signed-off-by: Demi Marie Obenour <demi@invisiblethingslab.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
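The single-fetch rule, sketched (a simplified check_version(); after
the fix the validated copy is what copy_params() reuses rather than
re-reading the user pointer; the exact plumbing in the stable tree may
differ):

  static int check_version(unsigned int cmd, struct dm_ioctl __user *user,
               struct dm_ioctl *kparam)
  {
      int r = 0;

      /* one and only fetch of the user-controlled version triple */
      if (copy_from_user(kparam->version, user->version,
                 sizeof(kparam->version)))
          return -EFAULT;

      if (kparam->version[0] != DM_VERSION_MAJOR ||
          kparam->version[1] > DM_VERSION_MINOR)
          r = -EINVAL;

      /* tell userspace which API version the kernel speaks */
      kparam->version[0] = DM_VERSION_MAJOR;
      kparam->version[1] = DM_VERSION_MINOR;
      kparam->version[2] = DM_VERSION_PATCHLEVEL;
      if (copy_to_user(user->version, kparam->version,
               sizeof(kparam->version)))
          return -EFAULT;

      return r;    /* all later decisions use kparam, never 'user' */
  }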
2023-07-19  dm ioctl: have constant on the right side of the test  (Heinz Mauelshagen)

[ Upstream commit 5cae0aa77397015f530aeb34f3ced32db6ac2875 ]

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Stable-dep-of: 249bed821b4d ("dm ioctl: Avoid double-fetch of version")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  dm: avoid split of quoted strings where possible  (Heinz Mauelshagen)

[ Upstream commit 2e84fecf19e1694338deec8bf6c90ff84f8f31fb ]

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Stable-dep-of: 249bed821b4d ("dm ioctl: Avoid double-fetch of version")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  dm: fix undue/missing spaces  (Heinz Mauelshagen)

[ Upstream commit 43be9c743c2553519c2093d1798b542f28095a51 ]

Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Stable-dep-of: 249bed821b4d ("dm ioctl: Avoid double-fetch of version")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  md/raid10: fix the condition to call bio_end_io_acct()  (Li Nan)

[ Upstream commit 125bfc7cd750e68c99f1d446e2c22abea08c237f ]

/sys/block/[device]/queue/iostats is used to control whether to count
io stats. Writing 0 to it clears the queue flag QUEUE_FLAG_IO_STAT,
which means iostats is disabled. If we disable iostats and later enable
it, the io issued during this period will be counted incorrectly, and
inflight will be decreased to -1:

  //T1 set iostats
  echo 0 > /sys/block/md0/queue/iostats
  clear QUEUE_FLAG_IO_STAT

  //T2 issue io
  if (QUEUE_FLAG_IO_STAT) -> false
  bio_start_io_acct
  inflight++

  echo 1 > /sys/block/md0/queue/iostats
  set QUEUE_FLAG_IO_STAT

  //T3 io end
  if (QUEUE_FLAG_IO_STAT) -> true
  bio_end_io_acct
  inflight--  -> -1

Also, if iostats is enabled while issuing io but disabled when the io
ends, inflight will never be decreased.

Fix it by checking start_time when the io ends. If start_time is not 0,
call bio_end_io_acct().

Fixes: 528bc2cf2fcc ("md/raid10: enable io accounting")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230609094320.2397604-1-linan666@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
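The decision now keys off what actually happened at submission time; a
sketch (fragments modeled on the raid10 io-accounting code, wrapped in
illustrative helper names that are not in the source):

  /* submission: only start accounting if iostats is enabled now */
  static void r10_start_acct(struct r10bio *r10_bio, struct bio *bio)
  {
      if (blk_queue_io_stat(bio->bi_bdev->bd_disk->queue))
          r10_bio->start_time = bio_start_io_acct(bio);
  }

  /* completion: trust the recorded start_time, not the current flag */
  static void r10_end_acct(struct r10bio *r10_bio, struct bio *bio)
  {
      if (r10_bio->start_time)
          bio_end_io_acct(bio, r10_bio->start_time);
  }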
2023-07-19  md/raid1-10: fix casting from randomized structure in raid1_submit_write()  (Yu Kuai)

commit b5a99602b74bbfa655be509c615181dd95b0719e upstream.

The following build error is triggered when building with clang 17.0.0
with W=1 (it cannot be reproduced with gcc 13.1.0):

  drivers/md/raid1-10.c:117:25: error: casting from randomized structure pointer type 'struct block_device *' to 'struct md_rdev *'
    117 |         struct md_rdev *rdev = (struct md_rdev *)bio->bi_bdev;
        |                                ^

Fix this by casting 'bio->bi_bdev' to 'void *', as it used to be.

Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202306142042.fmjfmTF8-lkp@intel.com/
Fixes: 8295efbe68c0 ("md/raid1-10: factor out a helper to submit normal write")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230616012136.3047071-1-yukuai1@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-07-19  md/raid1-10: submit write io directly if bitmap is not enabled  (Yu Kuai)

[ Upstream commit 7db922bae3abdf0a1db81ef7228cc0b996a0c1e3 ]

Commit 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for
raid10") added bitmap support, and it changed things so that write io
is submitted through the daemon thread, because the bitmap needs to be
updated before the write io. Later, a plug was used to fix a
performance regression, because all the write io went to the daemon
thread, which meant io could not be issued concurrently.

However, if the bitmap is not enabled, the write io should not go to
the daemon thread in the first place, and the plug is not needed
either.

Fixes: 6cce3b23f6f8 ("[PATCH] md: write intent bitmap support for raid10")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-5-yukuai1@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  md/raid1-10: factor out a helper to submit normal write  (Yu Kuai)

[ Upstream commit 8295efbe68c080047e98d9c0eb5cb933b238a8cb ]

There are multiple places that do the same thing; factor out a helper
to prevent redundant code. The helper will also be used in a following
patch.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-4-yukuai1@huaweicloud.com
Stable-dep-of: 7db922bae3ab ("md/raid1-10: submit write io directly if bitmap is not enabled")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  md/raid1-10: factor out a helper to add bio to plug  (Yu Kuai)

[ Upstream commit 5ec6ca140a034682e421e2e808ef5ddfdfd65242 ]

The code in raid1 and raid10 is identical; prepare to limit the number
of plugged bios.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230529131106.2123367-3-yukuai1@huaweicloud.com
Stable-dep-of: 7db922bae3ab ("md/raid1-10: submit write io directly if bitmap is not enabled")
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  md/raid10: fix io loss while replacement replace rdev  (Li Nan)

[ Upstream commit 2ae6aaf76912bae53c74b191569d2ab484f24bf3 ]

When removing a disk with a replacement, the replacement will be used
to replace the rdev. During this process, there is a brief window in
which both rdev and replacement are read as NULL in
raid10_write_request(). This results in io not being submitted even
though it should be:

  //remove                                //write
  raid10_remove_disk                      raid10_write_request
   mirror->rdev = NULL
                                           read rdev -> NULL
   mirror->rdev = mirror->replacement
   mirror->replacement = NULL
                                           read replacement -> NULL

Fix it by reading the replacement first and the rdev later, using
smp_mb() in between to prevent memory reordering.

Fixes: 475b0321a4df ("md/raid10: writes should get directed to replacement as well as original.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230602091839.743798-3-linan666@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
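Together with the two dereference_rdev_and_rrdev() entries above, the
ordering fix ends up encapsulated in one helper; a sketch close to the
upstream code in drivers/md/raid10.c:

  static struct md_rdev *dereference_rdev_and_rrdev(struct raid10_info *mirror,
                            struct md_rdev **prrdev)
  {
      struct md_rdev *rdev, *rrdev;

      rrdev = rcu_dereference(mirror->replacement);
      /*
       * Read replacement before rdev: this pairs with the removal
       * path, which moves replacement into rdev before clearing it.
       * If replacement was seen as NULL here, the write to rdev
       * must already be visible below.
       */
      smp_mb();
      rdev = rcu_dereference(mirror->rdev);

      *prrdev = rrdev;
      return rdev;
  }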
2023-07-19  md/raid10: fix null-ptr-deref of mreplace in raid10_sync_request  (Li Nan)

[ Upstream commit 34817a2441747b48e444cb0e05d84e14bc9443da ]

There are two checks of 'mreplace' in raid10_sync_request(). In the
first check, 'need_replace' will be set and 'mreplace' will be used
later if a non-Faulty 'mreplace' exists. In the second check,
'mreplace' will be set to NULL if it is Faulty, but 'need_replace' will
not be changed accordingly. A null-ptr-deref occurs if Faulty is set
between the two checks.

Fix it by merging the two checks into one, and replace 'need_replace'
with 'mreplace' because their values are always the same.

Fixes: ee37d7314a32 ("md/raid10: Fix raid10 replace hang when new added disk faulty")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230527072218.2365857-2-linan666@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  md/raid10: fix wrong setting of max_corr_read_errors  (Li Nan)

[ Upstream commit f8b20a405428803bd9881881d8242c9d72c6b2b2 ]

There is no input check when echoing into md/max_read_errors, so an
overflow might occur. Add a check of the input number.

Fixes: 1e50915fe0bb ("raid: improve MD/raid10 handling of correctable read errors.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230522072535.1523740-3-linan666@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  md/raid10: fix overflow of md/safe_mode_delay  (Li Nan)

[ Upstream commit 6beb489b2eed25978523f379a605073f99240c50 ]

There is no input check when echoing into md/safe_mode_delay in
safe_delay_store(), and msec might also overflow when HZ < 1000 in
safe_delay_show(). Fix it by checking for overflow in
safe_delay_store() and using an unsigned long conversion in
safe_delay_show().

Fixes: 72e02075a33f ("md: factor out parsing of fixed-point numbers")
Signed-off-by: Li Nan <linan122@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230522072535.1523740-2-linan666@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-07-19  md/raid10: check slab-out-of-bounds in md_bitmap_get_counter  (Li Nan)

[ Upstream commit 301867b1c16805aebbc306aafa6ecdc68b73c7e5 ]

If we write a large number to md/bitmap_set_bits, md_bitmap_checkpage()
will return -EINVAL because 'page >= bitmap->pages', but the return
value is not checked immediately in md_bitmap_get_counter(), where the
*blocks value is set, and a slab-out-of-bounds access occurs.

Move the check of 'page >= bitmap->pages' into md_bitmap_get_counter()
and return directly if it is true.

Fixes: ef4256733506 ("md/bitmap: optimise scanning of empty bitmaps.")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230515134808.3936750-2-linan666@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21  dm: don't lock fs when the map is NULL during suspend or resume  (Li Lingfeng)

[ Upstream commit 2760904d895279f87196f0fa9ec570c79fe6a2e4 ]

As described in commit 38d11da522aa ("dm: don't lock fs when the map is
NULL in process of resume"), a deadlock may be triggered between
do_resume() and do_mount().

This commit preserves the fix from commit 38d11da522aa but moves it to
where it also serves to fix a similar deadlock between do_suspend() and
do_mount(). It does so, if the active map is NULL, by clearing
DM_SUSPEND_LOCKFS_FLAG in dm_suspend(), which is called by both
do_suspend() and do_resume().

Fixes: 38d11da522aa ("dm: don't lock fs when the map is NULL in process of resume")
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-06-21  dm thin: fix issue_discard to pass GFP_NOIO to __blkdev_issue_discard  (Mike Snitzer)

commit 722d90822321497e2837cfc9000202e256e6b32f upstream.

issue_discard() passes GFP_NOWAIT to __blkdev_issue_discard() despite
its code assuming bio_alloc() always succeeds.

Commit 3dba53a958a75 ("dm thin: use __blkdev_issue_discard for async
discard support") clearly shows where things went bad: before that
commit, dm-thin.c's open-coded __blkdev_issue_discard_async() properly
handled using GFP_NOWAIT. Unfortunately __blkdev_issue_discard()
doesn't, and this was missed during review.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-21  dm thin metadata: check fail_io before using data_sm  (Li Lingfeng)

commit cb65b282c9640c27d3129e2e04b711ce1b352838 upstream.

Must check pmd->fail_io before using pmd->data_sm, since pmd->data_sm
may be destroyed by other processes:

  P1(kworker)                            P2(message)
  do_worker
   process_prepared
    process_prepared_discard_passdown_pt2
     dm_pool_dec_data_range
                                         pool_message
                                          commit
                                           dm_pool_commit_metadata
                                           // commit failed
                                          metadata_operation_failed
                                           abort_transaction
                                            dm_pool_abort_metadata
                                             __open_or_format_metadata
                                              dm_sm_disk_open
                                              // open failed
                                              // pmd->data_sm is NULL
      dm_sm_dec_blocks
      // try to access pmd->data_sm --> UAF

As shown above, if dm_pool_commit_metadata() and
dm_pool_abort_metadata() fail in the pool_message process, the kworker
may trigger a UAF.

Fixes: be500ed721a6 ("dm space maps: improve performance with inc/dec on ranges of blocks")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-06-09  md/raid5: fix miscalculation of 'end_sector' in raid5_read_one_chunk()  (Yu Kuai)

commit 8557dc27126949c702bd3aafe8a7e0b7e4fcb44c upstream.

'end_sector' is compared to 'rdev->recovery_offset', which is an offset
into the rdev; however, commit e82ed3a4fbb5 ("md/raid6: refactor
raid5_read_one_chunk") changed the calculation of 'end_sector' to an
offset into the array. Fix this miscalculation.

Fixes: e82ed3a4fbb5 ("md/raid6: refactor raid5_read_one_chunk")
Cc: stable@vger.kernel.org # v5.12+
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230524014118.3172781-1-yukuai1@huaweicloud.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
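The arithmetic at issue, sketched (a fragment modeled on
raid5_read_one_chunk(); variable names follow the upstream function,
surrounding logic condensed):

  sector = raid5_compute_sector(conf, raid_bio->bi_iter.bi_sector, 0,
                    &dd_idx, NULL);          /* offset into the rdev */
  end_sector = sector + bio_sectors(raid_bio);   /* rdev-relative end */

  /* pre-fix bug: using bio_end_sector(raid_bio) here yields an
   * array-relative offset, but recovery_offset is rdev-relative */
  if (!rdev || test_bit(Faulty, &rdev->flags) ||
      !(test_bit(In_sync, &rdev->flags) ||
        rdev->recovery_offset >= end_sector)) {
      /* can't bypass the stripe cache; take the normal path */
  }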
2023-05-24  md: fix soft lockup in status_resync  (Yu Kuai)

[ Upstream commit 6efddf1e32e2a264694766ca485a4f5e04ee82a7 ]

status_resync() will calculate 'curr_resync - recovery_active' to show
the user a progress bar like the following:

  [============>........]  resync = 61.4%

'curr_resync' and 'recovery_active' are updated in md_do_sync(), and
status_resync() can read them concurrently, hence it's possible that
'curr_resync - recovery_active' can overflow to a huge number. In this
case status_resync() will be stuck in the loop printing a large amount
of '=', which will end up as a soft lockup.

Fix the problem by setting 'resync' to MD_RESYNC_ACTIVE in this case;
this way a resync in progress will be reported to the user.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20230310073855.1337560-3-yukuai1@huaweicloud.com
Signed-off-by: Sasha Levin <sashal@kernel.org>
2023-05-11  dm: don't lock fs when the map is NULL in process of resume  (Li Lingfeng)

commit 38d11da522aacaa05898c734a1cec86f1e611129 upstream.

Commit fa247089de99 ("dm: requeue IO if mapping table not yet
available") added a detection of whether the mapping table is available
in the IO submission process. If the mapping table is unavailable, it
returns BLK_STS_RESOURCE and requeues the IO. This can lead to the
following deadlock problem:

  dm create                               mount
  ioctl(DM_DEV_CREATE_CMD)
  ioctl(DM_TABLE_LOAD_CMD)
                                          do_mount
                                           vfs_get_tree
                                            ext4_get_tree
                                             get_tree_bdev
                                              sget_fc
                                               alloc_super
                                                // got &s->s_umount
                                                down_write_nested(&s->s_umount, ...);
                                             ext4_fill_super
                                              ext4_load_super
                                               ext4_read_bh
                                                submit_bio
                                                // submit and wait io end
  ioctl(DM_DEV_SUSPEND_CMD)
  dev_suspend
   do_resume
    dm_suspend
     __dm_suspend
      lock_fs
       freeze_bdev
        get_active_super
         grab_super
          // wait for &s->s_umount
          down_write(&s->s_umount);
    dm_swap_table
     __bind
      // set md->map (can't get here)

IO will be continuously requeued while holding the lock, since the
mapping table is NULL. At the same time, the mapping table won't be set
since the lock is not available. Like request-based DM, bio-based DM
also has the same problem.

It's not proper to just abort IO if the mapping table is not available,
so clear DM_SKIP_LOCKFS_FLAG when the mapping table is NULL; this
allows the DM table to be loaded and the IO submitted upon resume.

Fixes: fa247089de99 ("dm: requeue IO if mapping table not yet available")
Cc: stable@vger.kernel.org
Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11  dm ioctl: fix nested locking in table_clear() to remove deadlock concern  (Mike Snitzer)

commit 3d32aaa7e66d5c1479a3c31d6c2c5d45dd0d3b89 upstream.

syzkaller found the following problematic rwsem locking (with the write
lock already held):

  down_read+0x9d/0x450 kernel/locking/rwsem.c:1509
  dm_get_inactive_table+0x2b/0xc0 drivers/md/dm-ioctl.c:773
  __dev_status+0x4fd/0x7c0 drivers/md/dm-ioctl.c:844
  table_clear+0x197/0x280 drivers/md/dm-ioctl.c:1537

In table_clear, it first acquires a write lock
(https://elixir.bootlin.com/linux/v6.2/source/drivers/md/dm-ioctl.c#L1520):

  down_write(&_hash_lock);

Then, before the lock is released at L1539, there is a path shown
above:

  table_clear -> __dev_status -> dm_get_inactive_table -> down_read

(https://elixir.bootlin.com/linux/v6.2/source/drivers/md/dm-ioctl.c#L773):

  down_read(&_hash_lock);

It tries to acquire the same read lock again, resulting in a deadlock.

Fix this by moving table_clear()'s __dev_status() call to after its
up_write(&_hash_lock);

Cc: stable@vger.kernel.org
Reported-by: Zheng Zhang <zheng.zhang@email.ucr.edu>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
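The reordering, sketched (a condensed table_clear() from
drivers/md/dm-ioctl.c; details may differ slightly from the stable
backport):

  static int table_clear(struct file *filp, struct dm_ioctl *param,
                 size_t param_size)
  {
      struct hash_cell *hc;
      struct mapped_device *md;

      down_write(&_hash_lock);
      hc = __find_device_hash_cell(param);
      if (!hc) {
          up_write(&_hash_lock);
          return -ENXIO;
      }
      if (hc->new_map) {
          dm_table_destroy(hc->new_map);
          hc->new_map = NULL;
      }
      md = hc->md;
      up_write(&_hash_lock);

      /* moved out of the write-locked section: __dev_status() takes
       * _hash_lock for read internally */
      param->flags &= ~DM_INACTIVE_PRESENT_FLAG;
      __dev_status(md, param);
      dm_put(md);
      return 0;
  }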
2023-05-11  dm flakey: fix a crash with invalid table line  (Mikulas Patocka)

commit 98dba02d9a93eec11bffbb93c7c51624290702d2 upstream.

This command will crash with a NULL pointer dereference:

  dmsetup create flakey --table \
   "0 `blockdev --getsize /dev/ram0` flakey /dev/ram0 0 0 1 2 corrupt_bio_byte 512"

Fix the crash by checking if arg_name is non-NULL before comparing it.

Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
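The guard is a one-liner in the feature-argument loop; a sketch
(simplified from the flakey feature parsing in drivers/md/dm-flakey.c,
with only one feature keyword shown):

  while (argc) {
      arg_name = dm_shift_arg(as);
      argc--;

      /* the fix: a truncated table line runs out of tokens here */
      if (!arg_name) {
          ti->error = "Insufficient feature arguments";
          return -EINVAL;
      }

      if (!strcasecmp(arg_name, "drop_writes")) {
          if (test_and_set_bit(DROP_WRITES, &fc->flags)) {
              ti->error = "Feature drop_writes duplicated";
              return -EINVAL;
          }
          continue;
      }
      /* ... remaining feature keywords unchanged ... */
  }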
2023-05-11  dm integrity: call kmem_cache_destroy() in dm_integrity_init() error path  (Mike Snitzer)

commit 6b79a428c02769f2a11f8ae76bf866226d134887 upstream.

Otherwise the journal_io_cache will leak if dm_register_target() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-05-11  dm clone: call kmem_cache_destroy() in dm_clone_init() error path  (Mike Snitzer)

commit 6827af4a9a9f5bb664c42abf7c11af4978d72201 upstream.

Otherwise the _hydration_cache will leak if dm_register_target() fails.

Cc: stable@vger.kernel.org
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>