aboutsummaryrefslogtreecommitdiff
path: root/drivers/md/raid10.c
AgeCommit message (Collapse)Author
2015-09-05Merge linux-block/for-4.3/core into md/for-linuxNeilBrown
There were a few conflicts that are fairly easy to resolve. Signed-off-by: NeilBrown <neilb@suse.com>
2015-09-02Merge branch 'for-4.3/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block updates from Jens Axboe: "This first core part of the block IO changes contains: - Cleanup of the bio IO error signaling from Christoph. We used to rely on the uptodate bit and passing around of an error, now we store the error in the bio itself. - Improvement of the above from myself, by shrinking the bio size down again to fit in two cachelines on x86-64. - Revert of the max_hw_sectors cap removal from a revision again, from Jeff Moyer. This caused performance regressions in various tests. Reinstate the limit, bump it to a more reasonable size instead. - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me. Most devices have huge trim limits, which can cause nasty latencies when deleting files. Enable the admin to configure the size down. We will look into having a more sane default instead of UINT_MAX sectors. - Improvement of the SGP gaps logic from Keith Busch. - Enable the block core to handle arbitrarily sized bios, which enables a nice simplification of bio_add_page() (which is an IO hot path). From Kent. - Improvements to the partition io stats accounting, making it faster. From Ming Lei. - Also from Ming Lei, a basic fixup for overflow of the sysfs pending file in blk-mq, as well as a fix for a blk-mq timeout race condition. - Ming Lin has been carrying Kents above mentioned patches forward for a while, and testing them. Ming also did a few fixes around that. - Sasha Levin found and fixed a use-after-free problem introduced by the bio->bi_error changes from Christoph. - Small blk cgroup cleanup from Viresh Kumar" * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits) blk: Fix bio_io_vec index when checking bvec gaps block: Replace SG_GAPS with new queue limits mask block: bump BLK_DEF_MAX_SECTORS to 2560 Revert "block: remove artifical max_hw_sectors cap" blk-mq: fix race between timeout and freeing request blk-mq: fix buffer overflow when reading sysfs file of 'pending' Documentation: update notes in biovecs about arbitrarily sized bios block: remove bio_get_nr_vecs() fs: use helper bio_add_page() instead of open coding on bi_io_vec block: kill merge_bvec_fn() completely md/raid5: get rid of bio_fits_rdev() md/raid5: split bio for chunk_aligned_read block: remove split code in blkdev_issue_{discard,write_same} btrfs: remove bio splitting and merge_bvec_fn() calls bcache: remove driver private bio splitting code block: simplify bio_add_page() block: make generic_make_request handle arbitrarily sized bios blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL) block: don't access bio->bi_error after bio_put() block: shrink struct bio down to 2 cache lines again ...
2015-08-31md/raid10: ensure device failure recorded before write request returns.NeilBrown
When a write to one of the legs of a RAID10 fails, the failure is recorded in the metadata of the other legs so that after a restart the data on the failed drive wont be trusted even if that drive seems to be working again (maybe a cable was unplugged). Currently there is no interlock between the write request completing and the metadata update. So it is possible that the write will complete, the app will confirm success in some way, and then the machine will crash before the metadata update completes. This is an extremely small hole for a racy to fit in, but it is theoretically possible and so should be closed. So: - set MD_CHANGE_PENDING when requesting a metadata update for a failed device, so we can know with certainty when it completes - queue requests that experienced an error on a new queue which is only processed after the metadata update completes - call raid_end_bio_io() on bios in that queue when the time comes. Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md/raid10: fix a few typos in commentsNeilBrown
Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-13block: kill merge_bvec_fn() completelyKent Overstreet
As generic_make_request() is now able to handle arbitrarily sized bios, it's no longer necessary for each individual block driver to define its own ->merge_bvec_fn() callback. Remove every invocation completely. Cc: Jens Axboe <axboe@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: drbd-user@lists.linbit.com Cc: Jiri Kosina <jkosina@suse.cz> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@kernel.org> Cc: ceph-devel@vger.kernel.org Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Cc: Christoph Hellwig <hch@infradead.org> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Acked-by: NeilBrown <neilb@suse.de> (for the 'md' bits) Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: also remove ->merge_bvec_fn() in dm-thin as well as dm-era-target, and resolve merge conflicts] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <ming.l@ssi.samsung.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29block: manipulate bio->bi_flags through helpersJens Axboe
Some places use helpers now, others don't. We only have the 'is set' helper, add helpers for setting and clearing flags too. It was a bit of a mess of atomic vs non-atomic access. With BIO_UPTODATE gone, we don't have any risk of concurrent access to the flags. So relax the restriction and don't make any of them atomic. The flags that do have serialization issues (reffed and chained), we already handle those separately. Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-29block: add a bi_error field to struct bioChristoph Hellwig
Currently we have two different ways to signal an I/O error on a BIO: (1) by clearing the BIO_UPTODATE flag (2) by returning a Linux errno value to the bi_end_io callback The first one has the drawback of only communicating a single possible error (-EIO), and the second one has the drawback of not beeing persistent when bios are queued up, and are not passed along from child to parent bio in the ever more popular chaining scenario. Having both mechanisms available has the additional drawback of utterly confusing driver authors and introducing bugs where various I/O submitters only deal with one of them, and the others have to add boilerplate code to deal with both kinds of error returns. So add a new bi_error field to store an errno value directly in struct bio and remove the existing mechanisms to clean all this up. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: NeilBrown <neilb@suse.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-07-22md/raid10: always set reshape_safe when initializing reshape_position.NeilBrown
'reshape_position' tracks where in the reshape we have reached. 'reshape_safe' tracks where in the reshape we have safely recorded in the metadata. These are compared to determine when to update the metadata. So it is important that reshape_safe is initialised properly. Currently it isn't. When starting a reshape from the beginning it usually has the correct value by luck. But when reducing the number of devices in a RAID10, it has the wrong value and this leads to the metadata not being updated correctly. This can lead to corruption if the reshape is not allowed to complete. This patch is suitable for any -stable kernel which supports RAID10 reshape, which is 3.5 and later. Fixes: 3ea7daa5d7fd ("md/raid10: add reshape support") Cc: stable@vger.kernel.org (v3.5+ please wait for -final to be out for 2 weeks) Signed-off-by: NeilBrown <neilb@suse.com>
2015-06-29Merge tag 'md/4.2' of git://neil.brown.name/mdLinus Torvalds
Pull md updates from Neil Brown: "A mixed bag - a few bug fixes - some performance improvement that decrease lock contention - some clean-up Nothing major" * tag 'md/4.2' of git://neil.brown.name/md: md: clear Blocked flag on failed devices when array is read-only. md: unlock mddev_lock on an error path. md: clear mddev->private when it has been freed. md: fix a build warning md/raid5: ignore released_stripes check md/raid5: per hash value and exclusive wait_for_stripe md/raid5: split wait_for_stripe and introduce wait_for_quiescent wait: introduce wait_event_exclusive_cmd md: convert to kstrto*() md/raid10: make sync_request_write() call bio_copy_data()
2015-06-25Merge branch 'for-4.2/writeback' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull cgroup writeback support from Jens Axboe: "This is the big pull request for adding cgroup writeback support. This code has been in development for a long time, and it has been simmering in for-next for a good chunk of this cycle too. This is one of those problems that has been talked about for at least half a decade, finally there's a solution and code to go with it. Also see last weeks writeup on LWN: http://lwn.net/Articles/648292/" * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits) writeback, blkio: add documentation for cgroup writeback support vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB writeback: do foreign inode detection iff cgroup writeback is enabled v9fs: fix error handling in v9fs_session_init() bdi: fix wrong error return value in cgwb_create() buffer: remove unusued 'ret' variable writeback: disassociate inodes from dying bdi_writebacks writeback: implement foreign cgroup inode bdi_writeback switching writeback: add lockdep annotation to inode_to_wb() writeback: use unlocked_inode_to_wb transaction in inode_congested() writeback: implement unlocked_inode_to_wb transaction and use it for stat updates writeback: implement [locked_]inode_to_wb_and_lock_list() writeback: implement foreign cgroup inode detection writeback: make writeback_control track the inode being written back writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb() mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use writeback: implement memcg writeback domain based throttling writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes writeback: implement memcg wb_domain writeback: update wb_over_bg_thresh() to use wb_domain aware operations ...
2015-06-17md/raid10: make sync_request_write() call bio_copy_data()Kent Overstreet
Refactor sync_request_write() of md/raid10 to use bio_copy_data() instead of open coding bio_vec iterations. Cc: Christoph Hellwig <hch@infradead.org> Cc: Neil Brown <neilb@suse.de> Cc: linux-raid@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com> [dpark: add more description in commit message] Signed-off-by: Dongsu Park <dpark@posteo.net> Signed-off-by: Ming Lin <mlin@kernel.org> Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-12md: make sure MD_RECOVERY_DONE is clear before starting recovery/resyncNeilBrown
MD_RECOVERY_DONE is normally cleared by md_check_recovery after a resync etc finished. However it is possible for raid5_start_reshape to race and start a reshape before MD_RECOVERY_DONE is cleared. This can lean to multiple reshapes running at the same time, which isn't good. To make sure it is cleared before starting a reshape, and also clear it when reaping a thread, just to be safe. Signed-off-by: NeilBrown <neilb@suse.de>
2015-06-02writeback: move backing_dev_info->state into bdi_writebackTejun Heo
Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback) and the role of the separation is unclear. For cgroup support for writeback IOs, a bdi will be updated to host multiple wb's where each wb serves writeback IOs of a different cgroup on the bdi. To achieve that, a wb should carry all states necessary for servicing writeback IOs for a cgroup independently. This patch moves bdi->state into wb. * enum bdi_state is renamed to wb_state and the prefix of all enums is changed from BDI_ to WB_. * Explicit zeroing of bdi->state is removed without adding zeoring of wb->state as the whole data structure is zeroed on init anyway. * As there's still only one bdi_writeback per backing_dev_info, all uses of bdi->state are mechanically replaced with bdi->wb.state introducing no behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Jens Axboe <axboe@kernel.dk> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: drbd-dev@lists.linbit.com Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2015-04-22md: remove 'go_faster' option from ->sync_request()NeilBrown
This option is not well justified and testing suggests that it hardly ever makes any difference. The comment suggests there might be a need to wait for non-resync activity indicated by ->nr_waiting, however raise_barrier() already waits for all of that. So just remove it to simplify reasoning about speed limiting. This allows us to remove a 'FIXME' comment from raid5.c as that never used the flag. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-16md/raid10: round up to bdev_logical_block_size in narrow_write_error.NeilBrown
RAID10 version of earlier fix for RAID1. We must never initiate IO with sizes less that logical_block_size. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-12md/raid10: fix conversion from RAID0 to RAID10NeilBrown
A RAID0 array (like a LINEAR array) does not have a concept of 'size' being the amount of each device that is in use. Rather, as much of each device as is available is used. So the 'size' is set to 0 and ignored. RAID10 does have this concept and needs it to be set correctly. So when we convert RAID0 to RAID10 we must determine the 'size' (that being the size of the first 'strip_zone' in the RAID0), and set it correctly. Reported-and-tested-by: Xiao Ni <xni@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: rename ->stop to ->freeNeilBrown
Now that the ->stop function only frees the private data, rename is accordingly. Also pass in the private pointer as an arg rather than using mddev->private. This flexibility will be useful in level_store(). Finally, don't clear ->private. It doesn't make sense to clear it seeing that isn't what we free, and it is no longer necessary to clear ->private (it was some time ago before ->to_remove was introduced). Setting ->to_remove in ->free() is a bit of a wart, but not a big problem at the moment. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: split detach operation out from ->stop.NeilBrown
Each md personality has a 'stop' operation which does two things: 1/ it finalizes some aspects of the array to ensure nothing is accessing the ->private data 2/ it frees the ->private data. All the steps in '1' can apply to all arrays and so can be performed in common code. This is useful as in the case where we change the personality which manages an array (in level_store()), it would be helpful to do step 1 early, and step 2 later. So split the 'step 1' functionality out into a new mddev_detach(). Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: make merge_bvec_fn more robust in face of personality changes.NeilBrown
There is no locking around calls to merge_bvec_fn(), so it is possible that calls which coincide with a level (or personality) change could go wrong. So create a central dispatch point for these functions and use rcu_read_lock(). If the array is suspended, reject any merge that can be rejected. If not, we know it is safe to call the function. Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-04md: make ->congested robust against personality changes.NeilBrown
There is currently no locking around calls to the 'congested' bdi function. If called at an awkward time while an array is being converted from one level (or personality) to another, there is a tiny chance of running code in an unreferenced module etc. So add a 'congested' function to the md_personality operations structure, and call it with appropriate locking from a central 'mddev_congested'. When the array personality is changing the array will be 'suspended' so no IO is processed. If mddev_congested detects this, it simply reports that the array is congested, which is a safe guess. As mddev_suspend calls synchronize_rcu(), mddev_congested can avoid races by included the whole call inside an rcu_read_lock() region. This require that the congested functions for all subordinate devices can be run under rcu_lock. Fortunately this is the case. Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14md: remove unwanted white space from md.cNeilBrown
My editor shows much of this is RED. Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-14md/raid10: another memory leak due to reshape.NeilBrown
Signed-off-by: NeilBrown <neilb@suse.de>
2014-10-09md: use set_bit/clear_bit instead of shift/mask for bi_flags changes.NeilBrown
Using {set,clear}_bit is more consistent than shifting and masking. No functional change. Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-19md/raid10: always initialise ->state on newly allocated r10_bioNeilBrown
Most places which allocate an r10_bio zero the ->state, some don't. As the r10_bio comes from a mempool, and the allocation function uses kzalloc it is often zero anyway. But sometimes it isn't and it is best to be safe. I only noticed this because of the bug fixed by an earlier patch where the r10_bios allocated for a reshape were left around to be used by a subsequent resync. In that case the R10BIO_IsReshape flag caused problems. Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-19md/raid10: avoid memory leak on error path during reshape.NeilBrown
If raid10 reshape fails to find somewhere to read a block from, it returns without freeing memory... Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-19md/raid10: Fix memory leak when raid10 reshape completes.NeilBrown
When a raid10 commences a resync/recovery/reshape it allocates some buffer space. When a resync/recovery completes the buffer space is freed. But not when the reshape completes. This can result in a small memory leak. There is a subtle side-effect of this bug. When a RAID10 is reshaped to a larger array (more devices), the reshape is immediately followed by a "resync" of the new space. This "resync" will use the buffer space which was allocated for "reshape". This can cause problems including a "BUG" in the SCSI layer. So this is suitable for -stable. Cc: stable@vger.kernel.org (v3.5+) Fixes: 3ea7daa5d7fde47cd41f4d56c2deb949114da9d6 Signed-off-by: NeilBrown <neilb@suse.de>
2014-08-19md/raid10: fix memory leak when reshaping a RAID10.NeilBrown
raid10 reshape clears unwanted bits from a bio->bi_flags using a method which, while clumsy, worked until 3.10 when BIO_OWNS_VEC was added. Since then it clears that bit but shouldn't. This results in a memory leak. So change to used the approved method of clearing unwanted bits. As this causes a memory leak which can consume all of memory the fix is suitable for -stable. Fixes: a38352e0ac02dbbd4fa464dc22d1352b5fbd06fd Cc: stable@vger.kernel.org (v3.10+) Reported-by: mdraid.pkoch@dfgh.net (Peter Koch) Signed-off-by: NeilBrown <neilb@suse.de>
2014-07-31md/raid1,raid10: always abort recover on write error.NeilBrown
Currently we don't abort recovery on a write error if the write error to the recovering device was triggerd by normal IO (as opposed to recovery IO). This means that for one bitmap region, the recovery might write to the recovering device for a few sectors, then not bother for subsequent sectors (as it never writes to failed devices). In this case the bitmap bit will be cleared, but it really shouldn't. The result is that if the recovering device fails and is then re-added (after fixing whatever hardware problem triggerred the failure), the second recovery won't redo the region it was in the middle of, so some of the device will not be recovered properly. If we abort the recovery, the region being processes will be cancelled (bit not cleared) and the whole region will be retried. As the bug can result in data corruption the patch is suitable for -stable. For kernels prior to 3.11 there is a conflict in raid10.c which will require care. Original-from: jiao hui <jiaohui@bwstor.com.cn> Reported-and-tested-by: jiao hui <jiaohui@bwstor.com.cn> Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@vger.kernel.org
2014-05-06md/raid10: call wait_barrier() for each request submitted.NeilBrown
wait_barrier() includes a counter, so we must call it precisely once (unless balanced by allow_barrier()) for each request submitted. Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1 block: Introduce new bio_split() in 3.14-rc1, we don't call it for the extra requests generated when we need to split a bio. When this happens the counter goes negative, any resync/recovery will never start, and "mdadm --stop" will hang. Reported-by: Chris Murphy <lists@colorremedies.com> Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1 Cc: stable@vger.kernel.org (3.14+) Cc: Kent Overstreet <kmo@daterainc.com> Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-30Merge branch 'for-3.14/core' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull core block IO changes from Jens Axboe: "The major piece in here is the immutable bio_ve series from Kent, the rest is fairly minor. It was supposed to go in last round, but various issues pushed it to this release instead. The pull request contains: - Various smaller blk-mq fixes from different folks. Nothing major here, just minor fixes and cleanups. - Fix for a memory leak in the error path in the block ioctl code from Christian Engelmayer. - Header export fix from CaiZhiyong. - Finally the immutable biovec changes from Kent Overstreet. This enables some nice future work on making arbitrarily sized bios possible, and splitting more efficient. Related fixes to immutable bio_vecs: - dm-cache immutable fixup from Mike Snitzer. - btrfs immutable fixup from Muthu Kumar. - bio-integrity fix from Nic Bellinger, which is also going to stable" * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits) xtensa: fixup simdisk driver to work with immutable bio_vecs block/blk-mq-cpu.c: use hotcpu_notifier() blk-mq: for_each_* macro correctness block: Fix memory leak in rw_copy_check_uvector() handling bio-integrity: Fix bio_integrity_verify segment start bug block: remove unrelated header files and export symbol blk-mq: uses page->list incorrectly blk-mq: use __smp_call_function_single directly btrfs: fix missing increment of bi_remaining Revert "block: Warn and free bio if bi_end_io is not set" block: Warn and free bio if bi_end_io is not set blk-mq: fix initializing request's start time block: blk-mq: don't export blk_mq_free_queue() block: blk-mq: make blk_sync_queue support mq block: blk-mq: support draining mq queue dm cache: increment bi_remaining when bi_end_io is restored block: fixup for generic bio chaining block: Really silence spurious compiler warnings block: Silence spurious compiler warnings block: Kill bio_pair_split() ...
2014-01-14md/raid10: avoid fullsync when not necessary.NeilBrown
This is the raid10 equivalent of commit 4f0a5e012cf41321d611e7cad63e1017d143d138 MD RAID1: Further conditionalize 'fullsync' If a device in a newly assembled array is not fully recovered we currently do a fully resync by don't need to. Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid10: fix bug when raid10 recovery fails to recover a block.NeilBrown
commit e875ecea266a543e643b19e44cf472f1412708f9 md/raid10 record bad blocks as needed during recovery. added code to the "cannot recover this block" path to record a bad block rather than fail the whole recovery. Unfortunately this new case was placed *after* r10bio was freed rather than *before*, yet it still uses r10bio. This is will crash with a null dereference. So move the freeing of r10bio down where it is safe. Cc: stable@vger.kernel.org (v3.1+) Fixes: e875ecea266a543e643b19e44cf472f1412708f9 Reported-by: Damian Nowak <spam@nowaker.net> URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181 Signed-off-by: NeilBrown <neilb@suse.de>
2014-01-14md/raid10: fix two bugs in handling of known-bad-blocks.NeilBrown
If we discover a bad block when reading we split the request and potentially read some of it from a different device. The code path of this has two bugs in RAID10. 1/ we get a spin_lock with _irq, but unlock without _irq!! 2/ The calculation of 'sectors_handled' is wrong, as can be clearly seen by comparison with raid1.c This leads to at least 2 warnings and a probable crash is a RAID10 ever had known bad blocks. Cc: stable@vger.kernel.org (v3.1+) Fixes: 856e08e23762dfb92ffc68fd0a8d228f9e152160 Reported-by: Damian Nowak <spam@nowaker.net> URL: https://bugzilla.kernel.org/show_bug.cgi?id=68181 Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-23block: Introduce new bio_split()Kent Overstreet
The new bio_split() can split arbitrary bios - it's not restricted to single page bios, like the old bio_split() (previously renamed to bio_pair_split()). It also has different semantics - it doesn't allocate a struct bio_pair, leaving it up to the caller to handle completions. Then convert the existing bio_pair_split() users to the new bio_split() - and also nvme, which was open coding bio splitting. (We have to take that BUG_ON() out of bio_integrity_trim() because this bio_split() needs to use it, and there's no reason it has to be used on bios marked as cloned; BIO_CLONED doesn't seem to have clearly documented semantics anyways.) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Neil Brown <neilb@suse.de>
2013-11-23block: Rename bio_split() -> bio_pair_split()Kent Overstreet
This is prep work for introducing a more general bio_split(). Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: NeilBrown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Lars Ellenberg <lars.ellenberg@linbit.com> Cc: Peter Osterlund <petero2@telia.com> Cc: Sage Weil <sage@inktank.com>
2013-11-23block: Kill bio_segments()/bi_vcnt usageKent Overstreet
When we start sharing biovecs, keeping bi_vcnt accurate for splits is going to be error prone - and unnecessary, if we refactor some code. So bio_segments() has to go - but most of the existing users just needed to know if the bio had multiple segments, which is easier - add a bio_multiple_segments() for them. (Two of the current uses of bio_segments() are going to go away in a couple patches, but the current implementation of bio_segments() is unsafe as soon as we start doing driver conversions for immutable biovecs - so implement a dumb version for bisectability, it'll go away in a couple patches) Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Nagalakshmi Nandigama <Nagalakshmi.Nandigama@lsi.com> Cc: Sreekanth Reddy <Sreekanth.Reddy@lsi.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
2013-11-23block: Abstract out bvec iteratorKent Overstreet
Immutable biovecs are going to require an explicit iterator. To implement immutable bvecs, a later patch is going to add a bi_bvec_done member to this struct; for now, this patch effectively just renames things. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: "Ed L. Cashin" <ecashin@coraid.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Lars Ellenberg <drbd-dev@lists.linbit.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Matthew Wilcox <willy@linux.intel.com> Cc: Geoff Levand <geoff@infradead.org> Cc: Yehuda Sadeh <yehuda@inktank.com> Cc: Sage Weil <sage@inktank.com> Cc: Alex Elder <elder@inktank.com> Cc: ceph-devel@vger.kernel.org Cc: Joshua Morris <josh.h.morris@us.ibm.com> Cc: Philip Kelleher <pjk1939@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Neil Brown <neilb@suse.de> Cc: Alasdair Kergon <agk@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: dm-devel@redhat.com Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: linux390@de.ibm.com Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Benny Halevy <bhalevy@tonian.com> Cc: "James E.J. Bottomley" <JBottomley@parallels.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Nicholas A. Bellinger" <nab@linux-iscsi.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chris Mason <chris.mason@fusionio.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Jaegeuk Kim <jaegeuk.kim@samsung.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Dave Kleikamp <shaggy@kernel.org> Cc: Joern Engel <joern@logfs.org> Cc: Prasad Joshi <prasadjoshi.linux@gmail.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: KONISHI Ryusuke <konishi.ryusuke@lab.ntt.co.jp> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: xfs@oss.sgi.com Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Herton Ronaldo Krzesinski <herton.krzesinski@canonical.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Guo Chao <yan@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Cc: "Roger Pau Monné" <roger.pau@citrix.com> Cc: Jan Beulich <jbeulich@suse.com> Cc: Stefano Stabellini <stefano.stabellini@eu.citrix.com> Cc: Ian Campbell <Ian.Campbell@citrix.com> Cc: Sebastian Ott <sebott@linux.vnet.ibm.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Nitin Gupta <ngupta@vflare.org> Cc: Jerome Marchand <jmarchand@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Peng Tao <tao.peng@emc.com> Cc: Andy Adamson <andros@netapp.com> Cc: fanchaoting <fanchaoting@cn.fujitsu.com> Cc: Jie Liu <jeff.liu@oracle.com> Cc: Sunil Mushran <sunil.mushran@gmail.com> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Cc: Namjae Jeon <namjae.jeon@samsung.com> Cc: Pankaj Kumar <pankaj.km@samsung.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Cc: Mel Gorman <mgorman@suse.de>6
2013-11-20Merge tag 'md/3.13' of git://neil.brown.name/mdLinus Torvalds
Pull md update from Neil Brown: "Mostly optimisations and obscure bug fixes. - raid5 gets less lock contention - raid1 gets less contention between normal-io and resync-io during resync" * tag 'md/3.13' of git://neil.brown.name/md: md/raid5: Use conf->device_lock protect changing of multi-thread resources. md/raid5: Before freeing old multi-thread worker, it should flush them. md/raid5: For stripe with R5_ReadNoMerge, we replace REQ_FLUSH with REQ_NOMERGE. UAPI: include <asm/byteorder.h> in linux/raid/md_p.h raid1: Rewrite the implementation of iobarrier. raid1: Add some macros to make code clearly. raid1: Replace raise_barrier/lower_barrier with freeze_array/unfreeze_array when reconfiguring the array. raid1: Add a field array_frozen to indicate whether raid in freeze state. md: Convert use of typedef ctl_table to struct ctl_table md/raid5: avoid deadlock when raid5 array has unack badblocks during md_stop_writes. md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread. md: fix some places where mddev_lock return value is not checked. raid5: Retry R5_ReadNoMerge flag when hit a read error. raid5: relieve lock contention in get_active_stripe() raid5: relieve lock contention in get_active_stripe() wait: add wait_event_cmd() md/raid5.c: add proper locking to error path of raid5_start_reshape. md: fix calculation of stacking limits on level change. raid5: Use slow_path to release stripe when mddev->thread is null
2013-11-19md: use MD_RECOVERY_INTR instead of kthread_should_stop in resync thread.NeilBrown
We currently use kthread_should_stop() in various places in the sync/reshape code to abort early. However some places set MD_RECOVERY_INTR but don't immediately call md_reap_sync_thread() (and we will shortly get another one). When this happens we are relying on md_check_recovery() to reap the thread and that only happen when it finishes normally. So MD_RECOVERY_INTR must lead to a normal finish without the kthread_should_stop() test. So replace all relevant tests, and be more careful when the thread is interrupted not to acknowledge that latest step in a reshape as it may not be fully committed yet. Also add a test on MD_RECOVERY_INTR in the 'is_mddev_idle' loop so we don't wait have to wait for the speed to drop before we can abort. Signed-off-by: NeilBrown <neilb@suse.de>
2013-11-08block: Consolidate duplicated bio_trim() implementationsKent Overstreet
Someone cut and pasted md's md_trim_bio() into xen-blkfront.c. Come on, we should know better than this. Signed-off-by: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Neil Brown <neilb@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2013-10-24md: Fix skipping recovery for read-only arrays.Lukasz Dorau
Since: commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86 md: Allow devices to be re-added to a read-only array. spares are activated on a read-only array. In case of raid1 and raid10 personalities it causes that not-in-sync devices are marked in-sync without checking if recovery has been finished. If a read-only array is degraded and one of its devices is not in-sync (because the array has been only partially recovered) recovery will be skipped. This patch adds checking if recovery has been finished before marking a device in-sync for raid1 and raid10 personalities. In case of raid5 personality such condition is already present (at raid5.c:6029). Bug was introduced in 3.10 and causes data corruption. Cc: stable@vger.kernel.org Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com> Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-25md/raid10: remove use-after-free bug.NeilBrown
We always need to be careful when calling generic_make_request, as it can start a chain of events which might free something that we are using. Here is one place I wasn't careful enough. If the wbio2 is not in use, then it might get freed at the first generic_make_request call. So perform all necessary tests first. This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an oops, so fix is suitable for any -stable since then. Cc: stable@vger.kernel.org (3.3+) Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-18md/raid10: fix two problems with RAID10 resync.NeilBrown
1/ When an different between blocks is found, data is copied from one bio to the other. However bv_len is used as the length to copy and this could be zero. So use r10_bio->sectors to calculate length instead. Using bv_len was probably always a bit dubious, but the introduction of bio_advance made it much more likely to be a problem. 2/ When preparing some blocks for sync, we don't set BIO_UPTODATE except on bios that we schedule for a read. This ensures that missing/failed devices don't confuse the loop at the top of sync_request write. Commit 8be185f2c9d54d6 "raid10: Use bio_reset()" removed a loop which set BIO_UPTDATE on all appropriate bios. So we need to re-add that flag. These bugs were introduced in 3.10, so this patch is suitable for 3.10-stable, and can remove a potential for data corruption. Cc: stable@vger.kernel.org (3.10) Reported-by: Brassow Jonathan <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-04md/raid10: fix bug which causes all RAID10 reshapes to move no data.NeilBrown
The recent comment: commit 7e83ccbecd608b971f340e951c9e84cd0343002f md/raid10: Allow skipping recovery when clean arrays are assembled Causes raid10 to skip a recovery in certain cases where it is safe to do so. Unfortunately it also causes a reshape to be skipped which is never safe. The result is that an attempt to reshape a RAID10 will appear to complete instantly, but no data will have been moves so the array will now contain garbage. (If nothing is written, you can recovery by simple performing the reverse reshape which will also complete instantly). Bug was introduced in 3.10, so this is suitable for 3.10-stable. Cc: stable@vger.kernel.org (3.10) Cc: Martin Wilck <mwilck@arcor.de> Signed-off-by: NeilBrown <neilb@suse.de>
2013-07-03md/raid10: fix two bugs affecting RAID10 reshape.NeilBrown
1/ If a RAID10 is being reshaped to a fewer number of devices and is stopped while this is ongoing, then when the array is reassembled the 'mirrors' array will be allocated too small. This will lead to an access error or memory corruption. 2/ A sanity test for a reshaping RAID10 array is restarted is slightly incorrect. Due to the first bug, this is suitable for any -stable kernel since 3.5 where this code was introduced. Cc: stable@vger.kernel.org (v3.5+) Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14md/raid10: check In_sync flag in 'enough()'.NeilBrown
It isn't really enough to check that the rdev is present, we need to also be sure that the device is still In_sync. Doing this requires using rcu_dereference to access the rdev, and holding the rcu_read_lock() to ensure the rdev doesn't disappear while we look at it. Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14md/raid10: locking changes for 'enough()'.NeilBrown
As 'enough' accesses conf->prev and conf->geo, which can change spontanously, it should guard against changes. This can be done with device_lock as start_reshape holds device_lock while updating 'geo' and end_reshape holds it while updating 'prev'. So 'error' needs to hold 'device_lock'. On the other hand, raid10_end_read_request knows which of the two it really wants to access, and as it is an active request on that one, the value cannot change underneath it. So change _enough to take flag rather than a pointer, pass the appropriate flag from raid10_end_read_request(), and remove the locking. All other calls to 'enough' are made with reconfig_mutex held, so neither 'prev' nor 'geo' can change. Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-14DM RAID: Add ability to restore transiently failed devices on resumeJonathan Brassow
DM RAID: Add ability to restore transiently failed devices on resume This patch adds code to the resume function to check over the devices in the RAID array. If any are found to be marked as failed and their superblocks can be read, an attempt is made to reintegrate them into the array. This allows the user to refresh the array with a simple suspend and resume of the array - rather than having to load a completely new table, allocate and initialize all the structures and throw away the old instantiation. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2013-06-13Merge tag 'md-3.10-fixes' of git://neil.brown.name/mdLinus Torvalds
Pull md bugfixes from Neil Brown: "A few bugfixes for md Some tagged for -stable" * tag 'md-3.10-fixes' of git://neil.brown.name/md: md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in place md/raid1,raid10: use freeze_array in place of raise_barrier in various places. md/raid1: consider WRITE as successful only if at least one non-Faulty and non-rebuilding drive completed it. md: md_stop_writes() should always freeze recovery.
2013-06-13md/raid1,5,10: Disable WRITE SAME until a recovery strategy is in placeH. Peter Anvin
There are cases where the kernel will believe that the WRITE SAME command is supported by a block device which does not, in fact, support WRITE SAME. This currently happens for SATA drivers behind a SAS controller, but there are probably a hundred other ways that can happen, including drive firmware bugs. After receiving an error for WRITE SAME the block layer will retry the request as a plain write of zeroes, but mdraid will consider the failure as fatal and consider the drive failed. This has the effect that all the mirrors containing a specific set of data are each offlined in very rapid succession resulting in data loss. However, just bouncing the request back up to the block layer isn't ideal either, because the whole initial request-retry sequence should be inside the write bitmap fence, which probably means that md needs to do its own conversion of WRITE SAME to write zero. Until the failure scenario has been sorted out, disable WRITE SAME for raid1, raid5, and raid10. [neilb: added raid5] This patch is appropriate for any -stable since 3.7 when write_same support was added. Cc: stable@vger.kernel.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NeilBrown <neilb@suse.de>