aboutsummaryrefslogtreecommitdiff
path: root/drivers
AgeCommit message (Collapse)Author
2021-10-20nvmet: use struct_size over open coded arithmeticLen Baker
As noted in the "Deprecated Interfaces, Language Features, Attributes, and Conventions" documentation [1], size calculations (especially multiplication) should not be performed in memory allocator (or similar) function arguments due to the risk of them overflowing. This could lead to values wrapping around and a smaller allocation being made than the caller was expecting. Using those allocations could lead to linear overflows of heap memory and other misbehaviors. In this case this is not actually dynamic size: all the operands involved in the calculation are constant values. However it is better to refactor this anyway, just to keep the open-coded math idiom out of code. So, use the struct_size() helper to do the arithmetic instead of the argument "size + count * size" in the kmalloc() function. This code was detected with the help of Coccinelle and audited and fixed manually. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments Signed-off-by: Len Baker <len.baker@gmx.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme: drop scan_lock and always kick requeue list when removing namespacesHannes Reinecke
When reading the partition table on initial scan hits an I/O error the I/O will hang with the scan_mutex held: [<0>] do_read_cache_page+0x49b/0x790 [<0>] read_part_sector+0x39/0xe0 [<0>] read_lba+0xf9/0x1d0 [<0>] efi_partition+0xf1/0x7f0 [<0>] bdev_disk_changed+0x1ee/0x550 [<0>] blkdev_get_whole+0x81/0x90 [<0>] blkdev_get_by_dev+0x128/0x2e0 [<0>] device_add_disk+0x377/0x3c0 [<0>] nvme_mpath_set_live+0x130/0x1b0 [nvme_core] [<0>] nvme_mpath_add_disk+0x150/0x160 [nvme_core] [<0>] nvme_alloc_ns+0x417/0x950 [nvme_core] [<0>] nvme_validate_or_alloc_ns+0xe9/0x1e0 [nvme_core] [<0>] nvme_scan_work+0x168/0x310 [nvme_core] [<0>] process_one_work+0x231/0x420 and trying to delete the controller will deadlock as it tries to grab the scan mutex: [<0>] nvme_mpath_clear_ctrl_paths+0x25/0x80 [nvme_core] [<0>] nvme_remove_namespaces+0x31/0xf0 [nvme_core] [<0>] nvme_do_delete_ctrl+0x4b/0x80 [nvme_core] As we're now properly ordering the namespace list there is no need to hold the scan_mutex in nvme_mpath_clear_ctrl_paths() anymore. And we always need to kick the requeue list as the path will be marked as unusable and I/O will be requeued _without_ a current path. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme-pci: clear shadow doorbell memory on resetsKeith Busch
The host memory doorbell and event buffers need to be initialized on each reset so the driver doesn't observe stale values from the previous instantiation. Signed-off-by: Keith Busch <kbusch@kernel.org> Tested-by: John Levon <john.levon@nutanix.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme-rdma: fix error code in nvme_rdma_setup_ctrlMax Gurtovoy
In case that icdoff is not zero or mandatory keyed sgls are not supported by the NVMe/RDMA target, we'll go to error flow but we'll return 0 to the caller. Fix it by returning an appropriate error code. Fixes: c66e2998c8ca ("nvme-rdma: centralize controller setup sequence") Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme-multipath: add error handling support for add_disk()Luis Chamberlain
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Since we now can tell for sure when a disk was added, move setting the bit NVME_NSHEAD_DISK_LIVE only when we did add the disk successfully. Nothing to do here as the cleanup is done elsewhere. We take care and use test_and_set_bit() because it is protects against two nvme paths simultaneously calling device_add_disk() on the same namespace head. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet: use macro definitions for setting cmic valueMax Gurtovoy
This makes the code more readable. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet: use macro definition for setting nmic valueMax Gurtovoy
This makes the code more readable. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme: display correct subsystem NQNHannes Reinecke
With discovery controllers supporting unique subsystem NQNs the actual subsystem NQN might be different from that one passed in via the connect args. So add a helper to display the resulting subsystem NQN. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme: Add connect option 'discovery'Hannes Reinecke
Add a connect option 'discovery' to specify that the connection should be made to a discovery controller, not a normal I/O controller. With discovery controllers supporting unique subsystem NQNs we cannot easily distinguish by the subsystem NQN if this should be a discovery connection, but we need this information to blank out options not supported by discovery controllers. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme: expose subsystem type in sysfs attribute 'subsystype'Hannes Reinecke
With unique discovery controller NQNs we cannot distinguish the subsystem type by the NQN alone, but need to check the subsystem type, too. So expose the subsystem type in a new sysfs attribute 'subsystype'. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet: set 'CNTRLTYPE' in the identify controller dataHannes Reinecke
Set the correct 'CNTRLTYPE' field in the identify controller data. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet: add nvmet_is_disc_subsys() helperHannes Reinecke
Add a helper function to determine if a given subsystem is a discovery subsystem. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet: make discovery NQN configurableHannes Reinecke
TPAR8013 allows for unique discovery NQNs, so make the discovery controller NQN configurable by exposing a subsys attribute 'discovery_nqn'. Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet-rdma: implement get_max_queue_size controller opMax Gurtovoy
Limit the maximal queue size for RDMA controllers. Today, the target reports a limit of 1024 and this limit isn't valid for some of the RDMA based controllers. For now, limit RDMA transport to 128 entries (the max queue depth configured for Linux NVMe/RDMA host). Future general solution should use RDMA/core API to calculate this size according to device capabilities and number of WRs needed per NVMe IO request. Reported-by: Mark Ruijter <mruijter@primelogic.nl> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet: add get_max_queue_size op for controllersMax Gurtovoy
Some transports, such as RDMA, would like to set the queue size according to device/port/ctrl characteristics. Add a new nvmet transport op that is called during ctrl initialization. This will not effect transports that don't implement this option. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme-rdma: limit the maximal queue size for RDMA controllersMax Gurtovoy
Corrent limit of 1024 isn't valid for some of the RDMA based ctrls. In case the target expose a cap of larger amount of entries (e.g. 1024), the initiator may fail to create a QP with this size. Thus limit to a value that works for all RDMA adapters. Future general solution should use RDMA/core API to calculate this size according to device capabilities and number of WRs needed per NVMe IO request. Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet-tcp: fix use-after-free when a port is removedIsrael Rukshin
When removing a port, all its controllers are being removed, but there are queues on the port that doesn't belong to any controller (during connection time). This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_alloc_ctrl). Those queues should be destroyed before freeing the port via configfs. Destroy the remaining queues after the accept_work was cancelled guarantees that no new queue will be created. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet-rdma: fix use-after-free when a port is removedIsrael Rukshin
When removing a port, all its controllers are being removed, but there are queues on the port that doesn't belong to any controller (during connection time). This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_alloc_ctrl). Those queues should be destroyed before freeing the port via configfs. Destroy the remaining queues after the RDMA-CM was destroyed guarantees that no new queue will be created. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvmet: fix use-after-free when a port is removedIsrael Rukshin
When a port is removed through configfs, any connected controllers are starting teardown flow asynchronously and can still send commands. This causes a use-after-free bug for any command that dereferences req->port (like in nvmet_parse_io_cmd). To fix this, wait for all the teardown scheduled works to complete (like release_work at rdma/tcp drivers). This ensures there are no active controllers when the port is eventually removed. Signed-off-by: Israel Rukshin <israelr@nvidia.com> Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20qla2xxx: add ->map_queues support for nvmeSaurav Kashyap
Implement ->map queues and use the block layer blk_mq_pci_map_queues helper for mapping queues to CPUs. With this mapping minimum 10%+ increase in performance is noticed. Signed-off-by: Saurav Kashyap <skashyap@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme-fc: add support for ->map_queuesSaurav Kashyap
NVMe FC don't have support for map queues, unlike the PCI, RDMA and TCP transports. Add a ->map_queues callout for the LLDDs to provide such functionality. Signed-off-by: Saurav Kashyap <skashyap@marvell.com> Signed-off-by: Nilesh Javali <njavali@marvell.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-20nvme: generate uevent once a multipath namespace is operational againHannes Reinecke
When fast_io_fail_tmo is set I/O will be aborted while recovery is still ongoing. This causes MD to set the namespace to failed, and no futher I/O will be submitted to that namespace. However, once the recovery succeeds and the namespace becomes operational again the NVMe subsystem doesn't send a notification, so MD cannot automatically reinstate operation and requires manual interaction. This patch will send a KOBJ_CHANGE uevent per multipathed namespace once the underlying controller transitions to LIVE, allowing an automatic MD reassembly with these udev rules: /etc/udev/rules.d/65-md-auto-re-add.rules: SUBSYSTEM!="block", GOTO="md_end" ACTION!="change", GOTO="md_end" ENV{ID_FS_TYPE}!="linux_raid_member", GOTO="md_end" PROGRAM="/sbin/md_raid_auto_readd.sh $devnode" LABEL="md_end" /sbin/md_raid_auto_readd.sh: MDADM=/sbin/mdadm DEVNAME=$1 export $(${MDADM} --examine --export ${DEVNAME}) if [ -z "${MD_UUID}" ]; then exit 1 fi UUID_LINK=$(readlink /dev/disk/by-id/md-uuid-${MD_UUID}) MD_DEVNAME=${UUID_LINK##*/} export $(${MDADM} --detail --export /dev/${MD_DEVNAME}) if [ -z "${MD_METADATA}" ] ; then exit 1 fi if [ $(cat /sys/block/${MD_DEVNAME}/md/degraded) != 1 ]; then echo "${MD_DEVNAME}: array not degraded, nothing to do" exit 0 fi MD_STATE=$(cat /sys/block/${MD_DEVNAME}/md/array_state) if [ ${MD_STATE} != "clean" ] ; then echo "${MD_DEVNAME}: array state ${MD_STATE}, cannot re-add" exit 1 fi MD_VARNAME="MD_DEVICE_dev_${DEVNAME##*/}_ROLE" if [ ${!MD_VARNAME} = "spare" ] ; then ${MDADM} --manage /dev/${MD_DEVNAME} --re-add ${DEVNAME} fi Changes to v2: - Add udev rules example to description Changes to v1: - use disk_uevent() as suggested by hch Signed-off-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Christoph Hellwig <hch@lst.de>
2021-10-19nvme: don't memset() the normal read/write commandJens Axboe
This memset in the fast path costs a lot of cycles on my setup. Here's a top-of-profile of doing ~6.7M IOPS: + 5.90% io_uring [nvme] [k] nvme_queue_rq + 5.32% io_uring [nvme_core] [k] nvme_setup_cmd + 5.17% io_uring [kernel.vmlinux] [k] io_submit_sqes + 4.97% io_uring [kernel.vmlinux] [k] blkdev_direct_IO and a perf diff with this patch: 0.92% +4.40% [nvme_core] [k] nvme_setup_cmd reducing it from 5.3% to only 0.9%. This takes it from the 2nd most cycle consumer to something that's mostly irrelevant. Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19nvme: move command clear into the various setup helpersJens Axboe
We don't have to worry about doing extra memsets by moving it outside the protection of RQF_DONTPREP, as nvme doesn't do partial completions. This is in preparation for making the read/write fast path not do a full memset of the command. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-19block: ataflop: fix breakage introduced at blk-mq refactoringMichael Schmitz
Refactoring of the Atari floppy driver when converting to blk-mq has broken the state machine in not-so-subtle ways: finish_fdc() must be called when operations on the floppy device have completed. This is crucial in order to relase the ST-DMA lock, which protects against concurrent access to the ST-DMA controller by other drivers (some DMA related, most just related to device register access - broken beyond compare, I know). When rewriting the driver's old do_request() function, the fact that finish_fdc() was called only when all queued requests had completed appears to have been overlooked. Instead, the new request function calls finish_fdc() immediately after the last request has been queued. finish_fdc() executes a dummy seek after most requests, and this overwrites the state machine's interrupt hander that was set up to wait for completion of the read/write request just prior. To make matters worse, finish_fdc() is called before device interrupts are re-enabled, making certain that the read/write interupt is missed. Shifting the finish_fdc() call into the read/write request completion handler ensures the driver waits for the request to actually complete. With a queue depth of 2, we won't see long request sequences, so calling finish_fdc() unconditionally just adds a little overhead for the dummy seeks, and keeps the code simple. While we're at it, kill ataflop_commit_rqs() which does nothing but run finish_fdc() unconditionally, again likely wiping out an in-flight request. Signed-off-by: Michael Schmitz <schmitzmic@gmail.com> Fixes: 6ec3938cff95 ("ataflop: convert to blk-mq") CC: linux-block@vger.kernel.org CC: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Link: https://lore.kernel.org/r/20211019061321.26425-1-schmitzmic@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18nbd: fix uaf in nbd_handle_reply()Yu Kuai
There is a problem that nbd_handle_reply() might access freed request: 1) At first, a normal io is submitted and completed with scheduler: internel_tag = blk_mq_get_tag -> get tag from sched_tags blk_mq_rq_ctx_init sched_tags->rq[internel_tag] = sched_tag->static_rq[internel_tag] ... blk_mq_get_driver_tag __blk_mq_get_driver_tag -> get tag from tags tags->rq[tag] = sched_tag->static_rq[internel_tag] So, both tags->rq[tag] and sched_tags->rq[internel_tag] are pointing to the request: sched_tags->static_rq[internal_tag]. Even if the io is finished. 2) nbd server send a reply with random tag directly: recv_work nbd_handle_reply blk_mq_tag_to_rq(tags, tag) rq = tags->rq[tag] 3) if the sched_tags->static_rq is freed: blk_mq_sched_free_requests blk_mq_free_rqs(q->tag_set, hctx->sched_tags, i) -> step 2) access rq before clearing rq mapping blk_mq_clear_rq_mapping(set, tags, hctx_idx); __free_pages() -> rq is freed here 4) Then, nbd continue to use the freed request in nbd_handle_reply Fix the problem by get 'q_usage_counter' before blk_mq_tag_to_rq(), thus request is ensured not to be freed because 'q_usage_counter' is not zero. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210916141810.2325276-1-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18nbd: partition nbd_read_stat() into nbd_read_reply() and nbd_handle_reply()Yu Kuai
Prepare to fix uaf in nbd_read_stat(), no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-7-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18nbd: clean up return value checking of sock_xmit()Yu Kuai
Check if sock_xmit() return 0 is useless because it'll never return 0, comment it and remove such checkings. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-6-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18nbd: don't start request if nbd_queue_rq() failedYu Kuai
commit 6a468d5990ec ("nbd: don't start req until after the dead connection logic") move blk_mq_start_request() from nbd_queue_rq() to nbd_handle_cmd() to skip starting request if the connection is dead. However, request is still started in other error paths. Currently, blk_mq_end_request() will be called immediately if nbd_queue_rq() failed, thus start request in such situation is useless. So remove blk_mq_start_request() from error paths in nbd_handle_cmd(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-5-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18nbd: check sock index in nbd_read_stat()Yu Kuai
The sock that clent send request in nbd_send_cmd() and receive reply in nbd_read_stat() should be the same. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-4-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18nbd: make sure request completion won't concurrentYu Kuai
commit cddce0116058 ("nbd: Aovid double completion of a request") try to fix that nbd_clear_que() and recv_work() can complete a request concurrently. However, the problem still exists: t1 t2 t3 nbd_disconnect_and_put flush_workqueue recv_work blk_mq_complete_request blk_mq_complete_request_remote -> this is true WRITE_ONCE(rq->state, MQ_RQ_COMPLETE) blk_mq_raise_softirq blk_done_softirq blk_complete_reqs nbd_complete_rq blk_mq_end_request blk_mq_free_request WRITE_ONCE(rq->state, MQ_RQ_IDLE) nbd_clear_que blk_mq_tagset_busy_iter nbd_clear_req __blk_mq_free_request blk_mq_put_tag blk_mq_complete_request -> complete again There are three places where request can be completed in nbd: recv_work(), nbd_clear_que() and nbd_xmit_timeout(). Since they all hold cmd->lock before completing the request, it's easy to avoid the problem by setting and checking a cmd flag. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-3-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18nbd: don't handle response without a corresponding request messageYu Kuai
While handling a response message from server, nbd_read_stat() will try to get request by tag, and then complete the request. However, this is problematic if nbd haven't sent a corresponding request message: t1 t2 submit_bio nbd_queue_rq blk_mq_start_request recv_work nbd_read_stat blk_mq_tag_to_rq blk_mq_complete_request nbd_send_cmd Thus add a new cmd flag 'NBD_CMD_INFLIGHT', it will be set in nbd_send_cmd() and checked in nbd_read_stat(). Noted that this patch can't fix that blk_mq_tag_to_rq() might return a freed request, and this will be fixed in following patches. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Link: https://lore.kernel.org/r/20210916093350.1410403-2-yukuai3@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18mtip32xx: Remove redundant 'flush_workqueue()' callsChristophe JAILLET
'destroy_workqueue()' already drains the queue before destroying it, so there is no need to flush it explicitly. Remove the redundant 'flush_workqueue()' calls. This was generated with coccinelle: @@ expression E; @@ - flush_workqueue(E); destroy_workqueue(E); Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/0fea349c808c6cfbf549b0e33701320c7860c8b7.1634234221.git.christophe.jaillet@wanadoo.fr Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: update superblock after changing rdev flags in state_storeXiao Ni
When the in memory flag is changed, we need to persist the change in the rdev superblock flags. This is needed for "writemostly" and "failfast". Reviewed-by: Li Feng <fengli@smartx.com> Signed-off-by: Xiao Ni <xni@redhat.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: remove unused argument from md_new_eventGuoqing Jiang
Actually, mddev is not used by md_new_event. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md/raid5: call roundup_pow_of_two in raid5_runGuoqing Jiang
Let's call roundup_pow_of_two here instead of open code. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md/raid1: use rdev in raid1_write_request directlyGuoqing Jiang
We already get rdev from conf->mirrors[i].rdev at the beginning of the loop, so just use it. Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md/raid1: only allocate write behind bio for WriteMostly deviceGuoqing Jiang
Commit 6607cd319b6b91bff94e90f798a61c031650b514 ("raid1: ensure write behind bio has less than BIO_MAX_VECS sectors") tried to guarantee the size of behind bio is not bigger than BIO_MAX_VECS sectors. Unfortunately the same calltrace still could happen since an array could enable write-behind without write mostly device. To match the manpage of mdadm (which says "write-behind is only attempted on drives marked as write-mostly"), we need to check WriteMostly flag to avoid such unexpected behavior. [1]. https://bugzilla.kernel.org/show_bug.cgi?id=213181#c25 Cc: stable@vger.kernel.org # v5.12+ Cc: Jens Stutte <jens@chianterastutte.eu> Reported-by: Jens Stutte <jens@chianterastutte.eu> Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: properly unwind when failing to add the kobject in md_allocChristoph Hellwig
Add proper error handling to delete the gendisk when failing to add the md kobject and clean up the error unwinding in general. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: extend disks_mutex coverageChristoph Hellwig
disks_mutex is intended to serialize md_alloc. Extended it to also cover the kobject_uevent call and getting the sysfs dirent to help reducing error handling complexity. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: add the bitmap group to the default groups for the md kobjectChristoph Hellwig
Replace the deprecated default_attrs with the default_groups mechanism, and add the always visible bitmap group to the groups created add kobject_add time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18md: add error handling support for add_disk()Luis Chamberlain
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. We just do the unwinding of what was not done before, and are sure to unlock prior to bailing. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18swim3: add missing major.h includeJens Axboe
swim3 got this through blkdev.h previously, but blkdev.h is not including it anymore. Include it specifically for the driver, otherwise FLOPPY_MAJOR is undefined and breaks the compile on PPC if swim3 is configured. Fixes: b81e0c2372e6 ("block: drop unused includes in <linux/genhd.h>") Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18sx8: fix an error code in carm_init_one()Dan Carpenter
Return a negative error code here on this error path instead of returning success. Fixes: 637208e74a86 ("block/sx8: add error handling support for add_disk()") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20211001122722.GC2283@kili Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18pf: fix error codes in pf_init_unit()Dan Carpenter
Return a negative error code instead of success on these error paths. Fixes: fb367e6baeb0 ("pf: cleanup initialization") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20211001122654.GB2283@kili Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18pcd: fix error codes in pcd_init_unit()Dan Carpenter
Return -ENODEV on these error paths instead of returning success. Fixes: af761f277b7f ("pcd: cleanup initialization") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Link: https://lore.kernel.org/r/20211001122623.GA2283@kili Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block/ataflop: add error handling support for add_disk()Luis Chamberlain
We never checked for errors on add_disk() as this function returned void. Now that this is fixed, use the shiny new error handling. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20210927220302.1073499-15-mcgrof@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block/ataflop: provide a helper for cleanup up an atari diskLuis Chamberlain
Instead of using two separate code paths for cleaning up an atari disk, use one. We take the more careful approach to check for *all* disk types, as is done on exit. The init path didn't have that check as the alternative disk types are only probed for later, they are not initialized by default. Yes, there is a shared tag for all disks. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20210927220302.1073499-14-mcgrof@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block/ataflop: add registration bool before calling del_gendisk()Luis Chamberlain
The ataflop assumes del_gendisk() is safe to call, this is only true because add_disk() does not return a failure, but that will change soon. And so, before we get to adding error handling for that case, let's make sure we keep track of which disks actually get registered. Then we use this to only call del_gendisk for them. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20210927220302.1073499-13-mcgrof@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-10-18block/ataflop: use the blk_cleanup_disk() helperLuis Chamberlain
Use the helper to replace two lines with one. Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Link: https://lore.kernel.org/r/20210927220302.1073499-12-mcgrof@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>