Age | Commit message (Collapse) | Author |
|
Introduce functions that modify the queue flags and that protect
these modifications with the request queue lock. Except for moving
one wake_up_all() call from inside to outside a critical section,
this patch does not change any functionality.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Except for changing the atomic queue flag manipulations that are
protected by the queue lock into non-atomic manipulations, this
patch does not change any functionality.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Move the definition of queue_flag_clear_unlocked() up and move the
definition of queue_in_flight() down such that all queue flag
manipulation function definitions become contiguous.
This patch does not change any functionality.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Tokens are prefixed by a variable length of bytes. If a bytestring is
not stored in an tiny or short atom, we have to skip more than one byte
in order to have the actual bytes not prefixed by the bytes describing
the actual length of the string.
Acked-by: Jonathan Derrick <jonathan.derrick@intel.com>
Signed-off-by: Jonas Rabenstein <jonas.rabenstein@studium.uni-erlangen.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
On ARM64, the default page size has been 64K on some distributions, and
we should allow ARM64 people to play null_blk.
This patch fixes the issue by extend page bitmap size for supporting
other non-4KB PAGE_SIZE.
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Kyungchan Koh <kkc6196@fb.com>,
Cc: weiping zhang <zhangweiping@didichuxing.com>
Cc: Yi Zhang <yi.zhang@redhat.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A change to the generic scatterlist code caused a conflict with
the rtsx card reader driver:
In file included from drivers/staging/rts5208/rtsx.h:180,
from drivers/staging/rts5208/rtsx.c:28:
drivers/staging/rts5208/rtsx_chip.h:343: error: "SG_END" redefined [-Werror]
This changes one instance of the driver to prefix SG_END and
related constants.
Fixes: 723fbf563a6a ("lib/scatterlist: Add SG_CHAIN and SG_END macros for LSB encodings")
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
A change to the generic scatterlist code caused a conflict with
the rtsx card reader driver:
In file included from drivers/misc/cardreader/rtsx_pcr.c:32:
include/linux/rtsx_pci.h:40: error: "SG_END" redefined [-Werror]
This changes one instance of the driver to prefix SG_END and
related constants.
Fixes: 723fbf563a6a ("lib/scatterlist: Add SG_CHAIN and SG_END macros for LSB encodings")
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Avoid that the following race can occur:
blk_cleanup_queue() blkcg_print_blkgs()
spin_lock_irq(lock) (1) spin_lock_irq(blkg->q->queue_lock) (2,5)
q->queue_lock = &q->__queue_lock (3)
spin_unlock_irq(lock) (4)
spin_unlock_irq(blkg->q->queue_lock) (6)
(1) take driver lock;
(2) busy loop for driver lock;
(3) override driver lock with internal lock;
(4) unlock driver lock;
(5) can take driver lock now;
(6) but unlock internal lock.
This change is safe because only the SCSI core and the NVME core keep
a reference on a request queue after having called blk_cleanup_queue().
Neither driver accesses any of the removed data structures between its
blk_cleanup_queue() and blk_put_queue() calls.
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jan Kara <jack@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Initialize the request queue lock earlier such that the following
race can no longer occur:
blk_init_queue_node() blkcg_print_blkgs()
blk_alloc_queue_node (1)
q->queue_lock = &q->__queue_lock (2)
blkcg_init_queue(q) (3)
spin_lock_irq(blkg->q->queue_lock) (4)
q->queue_lock = lock (5)
spin_unlock_irq(blkg->q->queue_lock) (6)
(1) allocate an uninitialized queue;
(2) initialize queue_lock to its default internal lock;
(3) initialize blkcg part of request queue, which will create blkg and
then insert it to blkg_list;
(4) traverse blkg_list and find the created blkg, and then take its
queue lock, here it is the default *internal lock*;
(5) *race window*, now queue_lock is overridden with *driver specified
lock*;
(6) now unlock *driver specified lock*, not the locked *internal lock*,
unlock balance breaks.
The changes in this patch are as follows:
- Move the .queue_lock initialization from blk_init_queue_node() into
blk_alloc_queue_node().
- Only override the .queue_lock pointer for legacy queues because it
is not useful for blk-mq queues to override this pointer.
- For all all block drivers that initialize .queue_lock explicitly,
change the blk_alloc_queue() call in the driver into a
blk_alloc_queue_node() call and remove the explicit .queue_lock
initialization. Additionally, initialize the spin lock that will
be used as queue lock earlier if necessary.
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch does not change any functionality.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Philipp Reisner <philipp.reisner@linbit.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Remove the disk, partition and bdi sysfs attributes before cleaning up
the request queue associated with the disk.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Shaohua Li <shli@fb.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Similarly to the support we have for testing/faking timeouts for
null_blk, this adds support for triggering a requeue condition.
Considering the issues around restart we've been seeing, this should be
a useful addition to the testing arsenal to ensure that we are handling
requeue conditions correctly.
This works for queue mode 1 (legacy request_fn based path) and 2 (blk-mq
path), as there's no good way to do requeue with a bio based driver.
This is similar to the timeout path. For the blk-mq path, we alternate
between passing back BLK_STS_RESOURCE and manually calling
blk_mq_requeue_request() in the driver. The former will hit the core
requeue path, while the latter exercises the IO scheduler requeue
path.
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
sbitmap_queue_get()/sbitmap_queue_clear() are used for
allocating/freeing a resource, so they should provide acquire/release
barrier semantics, respectively. sbitmap_get() currently contains a full
barrier, which is unnecessary, so use test_and_set_bit_lock() instead of
test_and_set_bit() (these are equivalent on x86_64). sbitmap_clear_bit()
does not imply any barriers, which is incorrect, as accesses of the
resource (e.g., request) could potentially get reordered to after the
clear_bit(). Introduce sbitmap_clear_bit_unlock() and use it for
sbitmap_queue_clear() (this only adds a compiler barrier on x86_64). The
other existing user of sbitmap_clear_bit() (the blk-mq software queue
pending map) is serialized through a spinlock and does not need this.
Reported-by: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When we insert a request, we set the software queue pending bit while
holding the software queue lock. However, we clear it outside of the
lock, so it's possible that a concurrent insert could reset the bit
after we clear it but before we empty the request list. Afterwards, the
bit would still be set but the software queue wouldn't have any requests
in it, leading us to do a spurious run in the future. This is mostly a
benign/theoretical issue, but it makes the following change easier to
justify.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When debugging the ZBC code in the mq-deadline scheduler it is very
important to know which zones are locked and which zones are not
locked. Hence this patch that exports the zone locking information
through debugfs.
Cc: Omar Sandoval <osandov@fb.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Make sure that the queue show and store methods are contiguous and
also that these appear in alphabetical order.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This replaces scatterlist->page_link LSB encodings with SG_CHAIN and
SG_END definitions without any functional change.
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Pull NVMe fixes from Keith for 4.16-rc.
* 'for-jens' of git://git.infradead.org/nvme:
nvmet: fix PSDT field check in command format
nvme-multipath: fix sysfs dangerously created links
nvme-pci: Fix nvme queue cleanup if IRQ setup fails
nvmet-loop: use blk_rq_payload_bytes for sgl selection
nvme-rdma: use blk_rq_payload_bytes instead of blk_rq_bytes
nvme-fabrics: don't check for non-NULL module in nvmf_register_transport
|
|
PSDT field section according to NVM_Express-1.3:
"This field specifies whether PRPs or SGLs are used for any data
transfer associated with the command. PRPs shall be used for all
Admin commands for NVMe over PCIe. SGLs shall be used for all Admin
and I/O commands for NVMe over Fabrics. This field shall be set to
01b for NVMe over Fabrics 1.0 implementations.
Suggested-by: Idan Burstein <idanb@mellanox.com>
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
|
If multipathing is enabled, each NVMe subsystem creates a head
namespace (e.g., nvme0n1) and multiple private namespaces
(e.g., nvme0c0n1 and nvme0c1n1) in sysfs. When creating links for
private namespaces, links of head namespace are used, so the
namespace creation order must be followed (e.g., nvme0n1 ->
nvme0c1n1). If the order is not followed, links of sysfs will be
incomplete or kernel panic will occur.
The kernel panic was:
kernel BUG at fs/sysfs/symlink.c:27!
Call Trace:
nvme_mpath_add_disk_links+0x5d/0x80 [nvme_core]
nvme_validate_ns+0x5c2/0x850 [nvme_core]
nvme_scan_work+0x1af/0x2d0 [nvme_core]
Correct order
Context A Context B
nvme0n1
nvme0c0n1 nvme0c1n1
Incorrect order
Context A Context B
nvme0c1n1
nvme0n1
nvme0c0n1
The nvme_mpath_add_disk (for creating head namespace) is called
just before the nvme_mpath_add_disk_links (for creating private
namespaces). In nvme_mpath_add_disk, the first context acquires
the lock of subsystem and creates a head namespace, and other
contexts do nothing by checking GENHD_FL_UP of a head namespace
after waiting to acquire the lock. We verified the code with or
without multipathing using three vendors of dual-port NVMe SSDs.
Signed-off-by: Baegjae Sung <baegjae@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
|
It seems that the proper value to return in this particular case is the
one contained into variable new_index instead of ret.
Addresses-Coverity-ID: 1465148 ("Copy-paste error")
Fixes: e46c7287b1c2 ("nbd: add a basic netlink interface")
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Kernel crashed when run fio in a RAID5 backend bcache device, the call
trace is bellow:
[ 440.012034] kernel BUG at block/blk-ioc.c:146!
[ 440.012696] invalid opcode: 0000 [#1] SMP NOPTI
[ 440.026537] CPU: 2 PID: 2205 Comm: md127_raid5 Not tainted 4.15.0 #8
[ 440.027441] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 07/16
/2015
[ 440.028615] RIP: 0010:put_io_context+0x8b/0x90
[ 440.029246] RSP: 0018:ffffa8c882b43af8 EFLAGS: 00010246
[ 440.029990] RAX: 0000000000000000 RBX: ffffa8c88294fca0 RCX: 0000000000
0f4240
[ 440.031006] RDX: 0000000000000004 RSI: 0000000000000286 RDI: ffffa8c882
94fca0
[ 440.032030] RBP: ffffa8c882b43b10 R08: 0000000000000003 R09: ffff949cb8
0c1700
[ 440.033206] R10: 0000000000000104 R11: 000000000000b71c R12: 00000000000
01000
[ 440.034222] R13: 0000000000000000 R14: ffff949cad84db70 R15: ffff949cb11
bd1e0
[ 440.035239] FS: 0000000000000000(0000) GS:ffff949cba280000(0000) knlGS:
0000000000000000
[ 440.060190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 440.084967] CR2: 00007ff0493ef000 CR3: 00000002f1e0a002 CR4: 00000000001
606e0
[ 440.110498] Call Trace:
[ 440.135443] bio_disassociate_task+0x1b/0x60
[ 440.160355] bio_free+0x1b/0x60
[ 440.184666] bio_put+0x23/0x30
[ 440.208272] search_free+0x23/0x40 [bcache]
[ 440.231448] cached_dev_write_complete+0x31/0x70 [bcache]
[ 440.254468] closure_put+0xb6/0xd0 [bcache]
[ 440.277087] request_endio+0x30/0x40 [bcache]
[ 440.298703] bio_endio+0xa1/0x120
[ 440.319644] handle_stripe+0x418/0x2270 [raid456]
[ 440.340614] ? load_balance+0x17b/0x9c0
[ 440.360506] handle_active_stripes.isra.58+0x387/0x5a0 [raid456]
[ 440.380675] ? __release_stripe+0x15/0x20 [raid456]
[ 440.400132] raid5d+0x3ed/0x5d0 [raid456]
[ 440.419193] ? schedule+0x36/0x80
[ 440.437932] ? schedule_timeout+0x1d2/0x2f0
[ 440.456136] md_thread+0x122/0x150
[ 440.473687] ? wait_woken+0x80/0x80
[ 440.491411] kthread+0x102/0x140
[ 440.508636] ? find_pers+0x70/0x70
[ 440.524927] ? kthread_associate_blkcg+0xa0/0xa0
[ 440.541791] ret_from_fork+0x35/0x40
[ 440.558020] Code: c2 48 00 5b 41 5c 41 5d 5d c3 48 89 c6 4c 89 e7 e8 bb c2
48 00 48 8b 3d bc 36 4b 01 48 89 de e8 7c f7 e0 ff 5b 41 5c 41 5d 5d c3 <0f> 0b
0f 1f 00 0f 1f 44 00 00 55 48 8d 47 b8 48 89 e5 41 57 41
[ 440.610020] RIP: put_io_context+0x8b/0x90 RSP: ffffa8c882b43af8
[ 440.628575] ---[ end trace a1fd79d85643a73e ]--
All the crash issue happened when a bypass IO coming, in such scenario
s->iop.bio is pointed to the s->orig_bio. In search_free(), it finishes the
s->orig_bio by calling bio_complete(), and after that, s->iop.bio became
invalid, then kernel would crash when calling bio_put(). Maybe its upper
layer's faulty, since bio should not be freed before we calling bio_put(),
but we'd better calling bio_put() first before calling bio_complete() to
notify upper layer ending this bio.
This patch moves bio_complete() under bio_put() to avoid kernel crash.
[mlyle: fixed commit subject for character limits]
Reported-by: Matthias Ferdinand <bcache@mfedv.net>
Tested-by: Matthias Ferdinand <bcache@mfedv.net>
Signed-off-by: Tang Junhui <tang.junhui@zte.com.cn>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 2831231d4c3f ("bcache: reduce cache_set devices iteration by
devices_max_used") adds c->devices_max_used to reduce iteration of
c->uuids elements, this value is updated in bcache_device_attach().
But for flash only volume, when calling flash_devs_run(), the function
bcache_device_attach() is not called yet and c->devices_max_used is not
updated. The unexpected result is, the flash only volume won't be run
by flash_devs_run().
This patch fixes the issue by iterate all c->uuids elements in
flash_devs_run(). c->devices_max_used will be updated properly when
bcache_device_attach() gets called.
[mlyle: commit subject edited for character limit]
Fixes: 2831231d4c3f ("bcache: reduce cache_set devices iteration by devices_max_used")
Reported-by: Tang Junhui <tang.junhui@zte.com.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Michael Lyle <mlyle@lyle.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
'struct blk_user_trace_setup' is passed to BLKTRACESETUP, not
BLKTRACESTART.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When blkdev_open() races with device removal and creation it can happen
that unhashed bdev inode gets associated with newly created gendisk
like:
CPU0 CPU1
blkdev_open()
bdev = bd_acquire()
del_gendisk()
bdev_unhash_inode(bdev);
remove device
create new device with the same number
__blkdev_get()
disk = get_gendisk()
- gets reference to gendisk of the new device
Now another blkdev_open() will not find original 'bdev' as it got
unhashed, create a new one and associate it with the same 'disk' at
which point problems start as we have two independent page caches for
one device.
Fix the problem by verifying that the bdev inode didn't get unhashed
before we acquired gendisk reference. That way we make sure gendisk can
get associated only with visible bdev inodes.
Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When two blkdev_open() calls for a partition race with device removal
and recreation, we can hit BUG_ON(!bd_may_claim(bdev, whole, holder)) in
blkdev_open(). The race can happen as follows:
CPU0 CPU1 CPU2
del_gendisk()
bdev_unhash_inode(part1);
blkdev_open(part1, O_EXCL) blkdev_open(part1, O_EXCL)
bdev = bd_acquire() bdev = bd_acquire()
blkdev_get(bdev)
bd_start_claiming(bdev)
- finds old inode 'whole'
bd_prepare_to_claim() -> 0
bdev_unhash_inode(whole);
<device removed>
<new device under same
number created>
blkdev_get(bdev);
bd_start_claiming(bdev)
- finds new inode 'whole'
bd_prepare_to_claim()
- this also succeeds as we have
different 'whole' here...
- bad things happen now as we
have two exclusive openers of
the same bdev
The problem here is that block device opens can see various intermediate
states while gendisk is shutting down and then being recreated.
We fix the problem by introducing new lookup_sem in gendisk that
synchronizes gendisk deletion with get_gendisk() and furthermore by
making sure that get_gendisk() does not return gendisk that is being (or
has been) deleted. This makes sure that once we ever manage to look up
newly created bdev inode, we are also guaranteed that following
get_gendisk() will either return failure (and we fail open) or it
returns gendisk for the new device and following bdget_disk() will
return new bdev inode (i.e., blkdev_open() follows the path as if it is
completely run after new device is created).
Reported-and-analyzed-by: Hou Tao <houtao1@huawei.com>
Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
When two blkdev_open() calls race with device removal and recreation,
__blkdev_get() can use looked up gendisk after it is freed:
CPU0 CPU1 CPU2
del_gendisk(disk);
bdev_unhash_inode(inode);
blkdev_open() blkdev_open()
bdev = bd_acquire(inode);
- creates and returns new inode
bdev = bd_acquire(inode);
- returns the same inode
__blkdev_get(devt) __blkdev_get(devt)
disk = get_gendisk(devt);
- got structure of device going away
<finish device removal>
<new device gets
created under the same
device number>
disk = get_gendisk(devt);
- got new device structure
if (!bdev->bd_openers) {
does the first open
}
if (!bdev->bd_openers)
- false
} else {
put_disk_and_module(disk)
- remember this was old device - this was last ref and disk is
now freed
}
disk_unblock_events(disk); -> oops
Fix the problem by making sure we drop reference to disk in
__blkdev_get() only after we are really done with it.
Reported-by: Hou Tao <houtao1@huawei.com>
Tested-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Add a proper counterpart to get_disk_and_module() -
put_disk_and_module(). Currently it is opencoded in several places.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Rename get_disk() to get_disk_and_module() to make sure what the
function does. It's not a great name but at least it is now clear that
put_disk() is not it's counterpart.
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit 8ddcd653257c "block: introduce GENHD_FL_HIDDEN" added handling of
hidden devices to get_gendisk() but forgot to drop module reference
which is also acquired by get_disk(). Drop the reference as necessary.
Arguably the function naming here is misleading as put_disk() is *not*
the counterpart of get_disk() but let's fix that in the follow up
commit since that will be more intrusive.
Fixes: 8ddcd653257c18a669fcb75ee42c37054908e0d6
CC: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Commit e864f39569f4 "fs: add RWF_DSYNC aand RWF_SYNC" added additional
way for direct IO to become synchronous and thus trigger fsync from the
IO completion handler. Then commit 9830f4be159b "fs: Use RWF_* flags for
AIO operations" allowed these flags to be set for AIO as well. However
that commit forgot to update the condition checking whether the IO
completion handling should be defered to a workqueue and thus AIO DIO
with RWF_[D]SYNC set will call fsync() from IRQ context resulting in
sleep in atomic.
Fix the problem by checking directly iocb flags (the same way as it is
done in dio_complete()) instead of checking all conditions that could
lead to IO being synchronous.
CC: Christoph Hellwig <hch@lst.de>
CC: Goldwyn Rodrigues <rgoldwyn@suse.com>
CC: stable@vger.kernel.org
Reported-by: Mark Rutland <mark.rutland@arm.com>
Tested-by: Mark Rutland <mark.rutland@arm.com>
Fixes: 9830f4be159b29399d107bffb99e0132bc5aedd4
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch fixes nvme queue cleanup if requesting an IRQ handler for
the queue's vector fails. It does this by resetting the cq_vector to
the uninitialized value of -1 so it is ignored for a controller reset.
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
[changelog updates, removed misc whitespace changes]
Signed-off-by: Keith Busch <keith.busch@intel.com>
|
|
When requeuing request, the domain token should have been freed
before re-inserting the request to io scheduler. Otherwise, the
assigned domain token will be leaked, and IO hang can be caused.
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: stable@vger.kernel.org
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
__blk_mq_requeue_request() covers two cases:
- one is that the requeued request is added to hctx->dispatch, such as
blk_mq_dispatch_rq_list()
- another case is that the request is requeued to io scheduler, such as
blk_mq_requeue_request().
We should call io sched's .requeue_request callback only for the 2nd
case.
Cc: Paolo Valente <paolo.valente@linaro.org>
Cc: Omar Sandoval <osandov@fb.com>
Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Cc: stable@vger.kernel.org
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The 'lend' parameter of truncate_inode_pages_range is required to be
inclusive, so follow the rule.
This patch fixes one memory corruption triggered by discard.
Cc: <stable@vger.kernel.org>
Cc: Dmitry Monakhov <dmonakhov@openvz.org>
Fixes: 351499a172c0 ("block: Invalidate cache on discard v2")
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The MIPS %.its.S compiler command did not define __ASSEMBLY__, which meant
when compiler_types.h was added to kconfig.h, unexpected things appeared
(e.g. struct declarations) which should not have been present. As done in
the general %.S compiler command, __ASSEMBLY__ is now included here too.
The failure was:
Error: arch/mips/boot/vmlinux.gz.its:201.1-2 syntax error
FATAL ERROR: Unable to parse input tree
/usr/bin/mkimage: Can't read arch/mips/boot/vmlinux.gz.itb.tmp: Invalid argument
/usr/bin/mkimage Can't add hashes to FIT blob
Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 28128c61e08e ("kconfig.h: Include compiler types to avoid missed struct attributes")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull siginfo fix from Eric Biederman:
"This fixes a build error that only shows up on blackfin"
* 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
fs/signalfd: fix build error for BUS_MCEERR_AR
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fix from Herbert Xu:
"Fix an oops in the s5p-sss driver when used with ecb(aes)"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: s5p-sss - Fix kernel Oops in AES-ECB mode
|
|
Fix build error in fs/signalfd.c by using same method that is used in
kernel/signal.c: separate blocks for different signal si_code values.
./fs/signalfd.c: error: 'BUS_MCEERR_AR' undeclared (first use in this function)
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are a number of USB fixes for 4.16-rc3
Nothing major, but a number of different fixes all over the place in
the USB stack for reported issues. Mostly gadget driver fixes,
although the typical set of xhci bugfixes are there, along with some
new quirks additions as well.
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (39 commits)
Revert "usb: musb: host: don't start next rx urb if current one failed"
usb: musb: fix enumeration after resume
usb: cdc_acm: prevent race at write to acm while system resumes
Add delay-init quirk for Corsair K70 RGB keyboards
usb: ohci: Proper handling of ed_rm_list to handle race condition between usb_kill_urb() and finish_unlinks()
usb: host: ehci: always enable interrupt for qtd completion at test mode
usb: ldusb: add PIDs for new CASSY devices supported by this driver
usb: renesas_usbhs: missed the "running" flag in usb_dmac with rx path
usb: host: ehci: use correct device pointer for dma ops
usbip: keep usbip_device sockfd state in sync with tcp_socket
ohci-hcd: Fix race condition caused by ohci_urb_enqueue() and io_watchdog_func()
USB: serial: option: Add support for Quectel EP06
xhci: fix xhci debugfs errors in xhci_stop
xhci: xhci debugfs device nodes weren't removed after device plugged out
xhci: Fix xhci debugfs devices node disappearance after hibernation
xhci: Fix NULL pointer in xhci debugfs
xhci: Don't print a warning when setting link state for disabled ports
xhci: workaround for AMD Promontory disabled ports wakeup
usb: dwc3: core: Fix ULPI PHYs and prevent phy_get/ulpi_init during suspend/resume
USB: gadget: udc: Add missing platform_device_put() on error in bdc_pci_probe()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO fixes from Greg KH:
"Here are a small number of staging and iio driver fixes for 4.16-rc2.
The IIO fixes are all for reported things, and the android driver
fixes also resolve some reported problems. The remaining fsl-mc
Kconfig change resolves a build testing error that Arnd reported.
All of these have been in linux-next with no reported issues"
* tag 'staging-4.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: buffer: check if a buffer has been set up when poll is called
iio: adis_lib: Initialize trigger before requesting interrupt
staging: android: ion: Zero CMA allocated memory
staging: android: ashmem: Fix a race condition in pin ioctls
staging: fsl-mc: fix build testing on x86
iio: srf08: fix link error "devm_iio_triggered_buffer_setup" undefined
staging: iio: ad5933: switch buffer mode to software
iio: adc: stm32: fix stm32h7_adc_enable error handling
staging: iio: adc: ad7192: fix external frequency setting
iio: adc: aspeed: Fix error handling path
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are a handful of char/misc driver fixes for 4.16-rc3.
There are some binder driver fixes to resolve reported issues in
stress testing the recent binder changes, some extcon driver fixes,
and a few mei driver fixes and new device ids.
All of these, with the exception of the mei driver id additions, have
been in linux-next for a while. I forgot to push out the mei driver id
additions to kernel.org until today, but all build tests pass with
them enabled"
* tag 'char-misc-4.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
mei: me: add cannon point device ids for 4th device
mei: me: add cannon point device ids
mei: set device client to the disconnected state upon suspend.
ANDROID: binder: synchronize_rcu() when using POLLFREE.
binder: replace "%p" with "%pK"
ANDROID: binder: remove WARN() for redundant txn error
binder: check for binder_thread allocation failure in binder_poll()
extcon: int3496: process id-pin first so that we start with the right status
Revert "extcon: axp288: Redo charger type detection a couple of seconds after probe()"
extcon: axp288: Constify the axp288_pwr_up_down_info array
|
|
Pull rdma fixes from Doug Ledford:
"Nothing in this is overly interesting, it's mostly your garden variety
fixes.
There was some work in this merge cycle around the new ioctl kABI, so
there are fixes in here related to that (probably with more to come).
We've also recently added new netlink support with a goal of moving
the primary means of configuring the entire subsystem to netlink
(eventually, this is a long term project), so there are fixes for
that.
Then a few bnxt_re driver fixes, and a few minor WARN_ON removals, and
that covers this pull request. There are already a few more fixes on
the list as of this morning, so there will certainly be more to come
in this rc cycle ;-)
Summary:
- Lots of fixes for the new IOCTL interface and general uverbs flow.
Found through testing and syzkaller
- Bugfixes for the new resource track netlink reporting
- Remove some unneeded WARN_ONs that were triggering for some users
in IPoIB
- Various fixes for the bnxt_re driver"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (27 commits)
RDMA/uverbs: Fix kernel panic while using XRC_TGT QP type
RDMA/bnxt_re: Avoid system hang during device un-reg
RDMA/bnxt_re: Fix system crash during load/unload
RDMA/bnxt_re: Synchronize destroy_qp with poll_cq
RDMA/bnxt_re: Unpin SQ and RQ memory if QP create fails
RDMA/bnxt_re: Disable atomic capability on bnxt_re adapters
RDMA/restrack: don't use uaccess_kernel()
RDMA/verbs: Check existence of function prior to accessing it
RDMA/vmw_pvrdma: Fix usage of user response structures in ABI file
RDMA/uverbs: Sanitize user entered port numbers prior to access it
RDMA/uverbs: Fix circular locking dependency
RDMA/uverbs: Fix bad unlock balance in ib_uverbs_close_xrcd
RDMA/restrack: Increment CQ restrack object before committing
RDMA/uverbs: Protect from command mask overflow
IB/uverbs: Fix unbalanced unlock on error path for rdma_explicit_destroy
IB/uverbs: Improve lockdep_check
RDMA/uverbs: Protect from races between lookup and destroy of uobjects
IB/uverbs: Hold the uobj write lock after allocate
IB/uverbs: Fix possible oops with duplicate ioctl attributes
IB/uverbs: Add ioctl support for 32bit processes
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
Pull RISC-V cleanups from Palmer Dabbelt:
"This contains a handful of small cleanups.
The only functional change is that IRQs are now enabled during
exception handling, which was found when some warnings triggered with
`CONFIG_DEBUG_ATOMIC_SLEEP=y`.
The remaining fixes should have no functional change: `sbi_save()` has
been renamed to `parse_dtb()` reflect what it actually does, and a
handful of unused Kconfig entries have been removed"
* tag 'riscv-for-linus-4.16-rc3-riscv_cleanups' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
Rename sbi_save to parse_dtb to improve code readability
RISC-V: Enable IRQ during exception handling
riscv: Remove ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE select
riscv: kconfig: Remove RISCV_IRQ_INTC select
riscv: Remove ARCH_WANT_OPTIONAL_GPIOLIB select
|
|
Merge misc fixes from Andrew Morton:
"16 fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm: don't defer struct page initialization for Xen pv guests
lib/Kconfig.debug: enable RUNTIME_TESTING_MENU
vmalloc: fix __GFP_HIGHMEM usage for vmalloc_32 on 32b systems
selftests/memfd: add run_fuse_test.sh to TEST_FILES
bug.h: work around GCC PR82365 in BUG()
mm/swap.c: make functions and their kernel-doc agree (again)
mm/zpool.c: zpool_evictable: fix mismatch in parameter name and kernel-doc
ida: do zeroing in ida_pre_get()
mm, swap, frontswap: fix THP swap if frontswap enabled
certs/blacklist_nohashes.c: fix const confusion in certs blacklist
kernel/relay.c: limit kmalloc size to KMALLOC_MAX_SIZE
mm, mlock, vmscan: no more skipping pagevecs
mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats
Kbuild: always define endianess in kconfig.h
include/linux/sched/mm.h: re-inline mmdrop()
tools: fix cross-compile var clobbering
|
|
Each read from a file in efivarfs results in two calls to EFI
(one to get the file size, another to get the actual data).
On X86 these EFI calls result in broadcast system management
interrupts (SMI) which affect performance of the whole system.
A malicious user can loop performing reads from efivarfs bringing
the system to its knees.
Linus suggested per-user rate limit to solve this.
So we add a ratelimit structure to "user_struct" and initialize
it for the root user for no limit. When allocating user_struct for
other users we set the limit to 100 per second. This could be used
for other places that want to limit the rate of some detrimental
user action.
In efivarfs if the limit is exceeded when reading, we take an
interruptible nap for 50ms and check the rate limit again.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The header files for some structures could get included in such a way
that struct attributes (specifically __randomize_layout from path.h) would
be parsed as variable names instead of attributes. This could lead to
some instances of a structure being unrandomized, causing nasty GPFs, etc.
This patch makes sure the compiler_types.h header is included in
kconfig.h so that we've always got types and struct attributes defined,
since kconfig.h is included from the compiler command line.
Reported-by: Patrick McLean <chutzpah@gentoo.org>
Root-caused-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Tested-by: Maciej S. Szmigiero <mail@maciej.szmigiero.name>
Fixes: 3859a271a003 ("randstruct: Mark various structs for randomization")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|