aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/extent-tree.c
AgeCommit message (Collapse)Author
2019-11-18btrfs: rename btrfs_block_group_cacheDavid Sterba
The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format. Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: Ensure we trim ranges across block group boundaryQu Wenruo
[BUG] When deleting large files (which cross block group boundary) with discard mount option, we find some btrfs_discard_extent() calls only trimmed part of its space, not the whole range: btrfs_discard_extent: type=0x1 start=19626196992 len=2144530432 trimmed=1073741824 ratio=50% type: bbio->map_type, in above case, it's SINGLE DATA. start: Logical address of this trim len: Logical length of this trim trimmed: Physically trimmed bytes ratio: trimmed / len Thus leaving some unused space not discarded. [CAUSE] When discard mount option is specified, after a transaction is fully committed (super block written to disk), we begin to cleanup pinned extents in the following call chain: btrfs_commit_transaction() |- btrfs_finish_extent_commit() |- find_first_extent_bit(unpin, 0, &start, &end, EXTENT_DIRTY); |- btrfs_discard_extent() However, pinned extents are recorded in an extent_io_tree, which can merge adjacent extent states. When a large file gets deleted and it has adjacent file extents across block group boundary, we will get a large merged range like this: |<--- BG1 --->|<--- BG2 --->| |//////|<-- Range to discard --->|/////| To discard that range, we have the following calls: btrfs_discard_extent() |- btrfs_map_block() | Returned bbio will end at BG1's end. As btrfs_map_block() | never returns result across block group boundary. |- btrfs_issuse_discard() Issue discard for each stripe. So we will only discard the range in BG1, not the remaining part in BG2. Furthermore, this bug is not that reliably observed, for above case, if there is no other extent in BG2, BG2 will be empty and btrfs will trim all space of BG2, covering up the bug. [FIX] - Allow __btrfs_map_block_for_discard() to modify @length parameter btrfs_map_block() uses its @length paramter to notify the caller how many bytes are mapped in current call. With __btrfs_map_block_for_discard() also modifing the @length, btrfs_discard_extent() now understands when to do extra trim. - Call btrfs_map_block() in a loop until we hit the range end Since we now know how many bytes are mapped each time, we can iterate through each block group boundary and issue correct trim for each range. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: add dedicated members for start and length of a block groupDavid Sterba
The on-disk format of block group item makes use of the key that stores the offset and length. This is further used in the code, although this makes thing harder to understand. The key is also packed so the offset/length is not properly aligned as u64. Add start (key.objectid) and length (key.offset) members to block group and remove the embedded key. When the item is searched or written, a local variable for key is used. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: move block_group_item::used to block groupDavid Sterba
For unknown reasons, the member 'used' in the block group struct is stored in the b-tree item and accessed everywhere using the special accessor helper. Let's unify it and make it a regular member and only update the item before writing it to the tree. The item is still being used for flags and chunk_objectid, there's some duplication until the item is removed in following patches. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: User assert to document transaction requirementNikolay Borisov
Using an ASSERT in btrfs_pin_extent allows to more stringently observe whether the function is called under a transaction or not. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-11-18btrfs: opencode extent_buffer_getDavid Sterba
The helper is trivial and we can understand what the atomic_inc on something named refs does. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: refactor the ticket wakeup codeJosef Bacik
Now that btrfs_space_info_add_old_bytes simply checks if we can make the reservation and updates bytes_may_use, there's no reason to have both helpers in place. Factor out the ticket wakeup logic into it's own helper, make btrfs_space_info_add_old_bytes() update bytes_may_use and then call the wakeup helper, and replace all calls to btrfs_space_info_add_new_bytes() with the wakeup helper. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: roll tracepoint into btrfs_space_info_update helperJosef Bacik
We duplicate this tracepoint everywhere we call these helpers, so update the helper to have the tracepoint as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move math functions to misc.hDavid Sterba
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move cond_wake_up functions out of ctreeDavid Sterba
The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: improve comments around nocow pathNikolay Borisov
run_delalloc_nocow contains numerous, somewhat subtle, checks when figuring out whether a particular extent should be CoW'ed or not. This patch explicitly states the assumptions those checks verify. As a result also document 2 of the more subtle checks in check_committed_ref as well. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: tree-checker: Add EXTENT_DATA_REF checkQu Wenruo
EXTENT_DATA_REF is a little like DIR_ITEM which contains hash in its key->offset. This patch will check the following contents: - Key->objectid Basic alignment check. - Hash Hash of each extent_data_ref item must match key->offset. - Offset Basic alignment check. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the block group cleanup codeJosef Bacik
This can now be easily migrated as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh on top of sysfs cleanups ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the alloc_profile helpersJosef Bacik
These feel more at home in block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, adjust btrfs_get_alloc_profile exports ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the chunk allocation codeJosef Bacik
This feels more at home in block-group.c than in extent-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com>i [ refresh ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the block group space accounting helpersJosef Bacik
We can now easily migrate this code as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: export block group accounting helpersJosef Bacik
Want to move these functions into block-group.c, so export them. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the dirty bg writeout codeJosef Bacik
This can be easily migrated over now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate inc/dec_block_group_ro codeJosef Bacik
This can easily be moved now. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: temporarily export btrfs_get_restripe_targetJosef Bacik
This gets used by a few different logical chunks of the block group code, export it while we move things around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the block group read/creation codeJosef Bacik
All of the prep work has been done so we can now cleanly move this chunk over. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, add btrfs_get_alloc_profile export, comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the block group removal codeJosef Bacik
This is the removal code and the unused bgs code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ refresh, move clear_incompat_bg_bits ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: temporarily export inc_block_group_roJosef Bacik
This is used in a few logical parts of the block group code, temporarily export it so we can move things in pieces. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the block group caching codeJosef Bacik
We can now just copy it over to block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: factor out sysfs code for deleting block group and space infosDavid Sterba
The helpers to create block group and space info directories already live in sysfs.c, move the deletion part there too. Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: factor sysfs code out of link_block_groupDavid Sterba
The part of link_block_group that just creates the sysfs object is independent and can be factored out to a helper. Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: make caching_thread use btrfs_find_next_keyJosef Bacik
extent-tree.c has a find_next_key that just walks up the path to find the next key, but it is used for both the caching stuff and the snapshot delete stuff. The snapshot deletion stuff is special so it can't really use btrfs_find_next_key, but the caching thread stuff can. We just need to fix btrfs_find_next_key to deal with ->skip_locking and then it works exactly the same as the private find_next_key helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: temporarily export fragment_free_spaceJosef Bacik
This is used in caching and reading block groups, so export it while we move these chunks independently. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: export the caching control helpersJosef Bacik
Man a lot of people use this stuff. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: export the excluded extents helpersJosef Bacik
We'll need this to move the caching stuff around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: export the block group caching helpersJosef Bacik
This will make it so we can move them easily. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ coding style updates ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate nocow and reservation helpersJosef Bacik
These are relatively straightforward as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the block group ref counting stuffJosef Bacik
Another easy set to move over to block-group.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: migrate the block group lookup codeJosef Bacik
Move these bits first as they are the easiest to move. Export two of the helpers so they can be moved all at once. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style updates ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: move basic block_group definitions to their own headerJosef Bacik
This is prep work for moving all of the block group cache code into its own file. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: extent-tree: Make sure we only allocate extents from block groups ↵Qu Wenruo
with the same type [BUG] With fuzzed image and MIXED_GROUPS super flag, we can hit the following BUG_ON(): kernel BUG at fs/btrfs/delayed-ref.c:491! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 1849 Comm: sync Tainted: G O 5.2.0-custom #27 RIP: 0010:update_existing_head_ref.cold+0x44/0x46 [btrfs] Call Trace: add_delayed_ref_head+0x20c/0x2d0 [btrfs] btrfs_add_delayed_tree_ref+0x1fc/0x490 [btrfs] btrfs_free_tree_block+0x123/0x380 [btrfs] __btrfs_cow_block+0x435/0x500 [btrfs] btrfs_cow_block+0x110/0x240 [btrfs] btrfs_search_slot+0x230/0xa00 [btrfs] ? __lock_acquire+0x105e/0x1e20 btrfs_insert_empty_items+0x67/0xc0 [btrfs] alloc_reserved_file_extent+0x9e/0x340 [btrfs] __btrfs_run_delayed_refs+0x78e/0x1240 [btrfs] ? kvm_clock_read+0x18/0x30 ? __sched_clock_gtod_offset+0x21/0x50 btrfs_run_delayed_refs.part.0+0x4e/0x180 [btrfs] btrfs_run_delayed_refs+0x23/0x30 [btrfs] btrfs_commit_transaction+0x53/0x9f0 [btrfs] btrfs_sync_fs+0x7c/0x1c0 [btrfs] ? __ia32_sys_fdatasync+0x20/0x20 sync_fs_one_sb+0x23/0x30 iterate_supers+0x95/0x100 ksys_sync+0x62/0xb0 __ia32_sys_sync+0xe/0x20 do_syscall_64+0x65/0x240 entry_SYSCALL_64_after_hwframe+0x49/0xbe [CAUSE] This situation is caused by several factors: - Fuzzed image The extent tree of this fs missed one backref for extent tree root. So we can allocated space from that slot. - MIXED_BG feature Super block has MIXED_BG flag. - No mixed block groups exists All block groups are just regular ones. This makes data space_info->block_groups[] contains metadata block groups. And when we reserve space for data, we can use space in metadata block group. Then we hit the following file operations: - fallocate We need to allocate data extents. find_free_extent() choose to use the metadata block to allocate space from, and choose the space of extent tree root, since its backref is missing. This generate one delayed ref head with is_data = 1. - extent tree update We need to update extent tree at run_delayed_ref time. This generate one delayed ref head with is_data = 0, for the same bytenr of old extent tree root. Then we trigger the BUG_ON(). [FIX] The quick fix here is to check block_group->flags before using it. The problem can only happen for MIXED_GROUPS fs. Regular filesystems won't have space_info with DATA|METADATA flag, and no way to hit the bug. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203255 Reported-by: Jungyeon Yoon <jungyeon.yoon@gmail.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: volumes: Remove ENOSPC-prone btrfs_can_relocate()Qu Wenruo
[BUG] Test case btrfs/156 fails since commit 302167c50b32 ("btrfs: don't end the transaction for delayed refs in throttle") with ENOSPC. [CAUSE] The ENOSPC is reported from btrfs_can_relocate(). This function will check: - If this block group is empty, we can relocate - If we can enough free space, we can relocate Above checks are valid but the following check is vague due to its implementation: - If and only if we can allocated a new block group to contain all the used space, we can relocate This design itself is OK, but the way to determine if we can allocate a new block group is problematic. btrfs_can_relocate() uses find_free_dev_extent() to find free space on a device. However find_free_dev_extent() only searches commit root and excludes dev extents allocated in current trans, this makes it unable to use dev extent just freed in current transaction. So for the following example, btrfs_can_relocate() will report ENOSPC: The example block group layout: 1M 129M 257M 385M 513M 550M |///////|///////////|//////////| | | // = Used bg, consider all bg is 100% used for easy calculation. And all block groups are SINGLE, on-disk bytenr is the same as the logical bytenr. 1) Bg in [129M, 257M) get relocated to [385M, 513M), transid=100 1M 129M 257M 385M 513M 550M |///////| |//////////|/////////| In transid 100, bg in [129M, 257M) get relocated to [385M, 513M) However transid 100 is not committed yet, so in dev commit tree, we still have the old dev extents layout: 1M 129M 257M 385M 513M 550M |///////|///////////|//////////| | | 2) Try to relocate bg [257M, 385M) We goes into btrfs_can_relocate(), no free space in current bgs, so we check if we can find large enough free dev extents. The first slot is [385M, 513M), but that is already used by new bg at [385M, 513M), so we continue search. The remaining slot is [512M, 550M), smaller than the bg's length 128M. So btrfs_can_relocate report ENOSPC. However this is over killed, in fact if we just skip btrfs_can_relocate() check, and go into regular relocation routine, at extent reservation time, if we can't find free extent, then we fallback to commit transaction, which will free up the dev extents and allow new block group to be created. [FIX] The fix here is to remove btrfs_can_relocate() completely. If we hit the false ENOSPC case just like btrfs/156, extent allocator will push harder by committing transaction and we will have space for new block group, avoiding the false ENOSPC. If we really ran out of space, we will hit ENOSPC at relocate_block_group(), and btrfs will just reports the ENOSPC error as usual. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-09-09btrfs: extent-tree: Add comment for inc_block_group_ro()Qu Wenruo
inc_block_group_ro() is only designed to mark one block group read-only, it doesn't really care if other block groups have enough free space to contain the used space in the block group. However due to the close connection between this function and relocation, sometimes we can be confused and think this function is responsible for balance space reservation, which is not true. Add some comment to make the functionality clear. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-08-07btrfs: trim: Check the range passed into to prevent overflowQu Wenruo
Normally the range->len is set to default value (U64_MAX), but when it's not default value, we should check if the range overflows. And if it overflows, return -EINVAL before doing anything. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-08-07Btrfs: fix sysfs warning and missing raid sysfs directoriesFilipe Manana
In the 5.3 merge window, commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups"), we started using the member "defaults_groups" for the kobject type "btrfs_raid_ktype". That leads to a series of warnings when running some test cases of fstests, such as btrfs/027, btrfs/124 and btrfs/176. The traces produced by those warnings are like the following: [116648.059212] kernfs: can not remove 'total_bytes', no directory [116648.060112] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.066482] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.069376] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.072385] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.073437] RAX: 0000000000000000 RBX: ffffffffc0c11998 RCX: 0000000000000000 [116648.074201] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.074956] RBP: ffffffffc0b9ca2f R08: 0000000000000000 R09: 0000000000000001 [116648.075708] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.076434] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.077143] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.077852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.078546] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.079235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.079907] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.080585] Call Trace: [116648.081262] remove_files+0x31/0x70 [116648.081929] sysfs_remove_group+0x38/0x80 [116648.082596] sysfs_remove_groups+0x34/0x70 [116648.083258] kobject_del+0x20/0x60 [116648.083933] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.084608] close_ctree+0x19a/0x380 [btrfs] [116648.085278] generic_shutdown_super+0x6c/0x110 [116648.085951] kill_anon_super+0xe/0x30 [116648.086621] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.087289] deactivate_locked_super+0x3a/0x70 [116648.087956] cleanup_mnt+0xb4/0x160 [116648.088620] task_work_run+0x7e/0xc0 [116648.089285] exit_to_usermode_loop+0xfa/0x100 [116648.089933] do_syscall_64+0x1cb/0x220 [116648.090567] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.091197] RIP: 0033:0x7f9cdc073b37 (...) [116648.100046] ---[ end trace 22e24db328ccadf8 ]--- [116648.100618] ------------[ cut here ]------------ [116648.101175] kernfs: can not remove 'used_bytes', no directory [116648.101731] WARNING: CPU: 3 PID: 28500 at fs/kernfs/dir.c:1504 kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.105649] CPU: 3 PID: 28500 Comm: umount Tainted: G W 5.3.0-rc3-btrfs-next-54 #1 (...) [116648.107461] RIP: 0010:kernfs_remove_by_name_ns+0x75/0x80 (...) [116648.109336] RSP: 0018:ffffabfd0090bd08 EFLAGS: 00010282 [116648.109979] RAX: 0000000000000000 RBX: ffffffffc0c119a0 RCX: 0000000000000000 [116648.110625] RDX: ffff9fff603a7a00 RSI: ffff9fff603978a8 RDI: ffff9fff603978a8 [116648.111283] RBP: ffffffffc0b9ca41 R08: 0000000000000000 R09: 0000000000000001 [116648.111940] R10: ffff9ffe1f72e1c0 R11: 0000000000000000 R12: ffffffffc0b94120 [116648.112603] R13: ffffffffb3d9b4e0 R14: 0000000000000000 R15: dead000000000100 [116648.113268] FS: 00007f9cdc78a2c0(0000) GS:ffff9fff60380000(0000) knlGS:0000000000000000 [116648.113939] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [116648.114607] CR2: 00007f9fc4747ab4 CR3: 00000005c7832003 CR4: 00000000003606e0 [116648.115286] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [116648.115966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [116648.116649] Call Trace: [116648.117326] remove_files+0x31/0x70 [116648.117997] sysfs_remove_group+0x38/0x80 [116648.118671] sysfs_remove_groups+0x34/0x70 [116648.119342] kobject_del+0x20/0x60 [116648.120022] btrfs_free_block_groups+0x405/0x430 [btrfs] [116648.120707] close_ctree+0x19a/0x380 [btrfs] [116648.121396] generic_shutdown_super+0x6c/0x110 [116648.122057] kill_anon_super+0xe/0x30 [116648.122702] btrfs_kill_super+0x12/0xa0 [btrfs] [116648.123335] deactivate_locked_super+0x3a/0x70 [116648.123961] cleanup_mnt+0xb4/0x160 [116648.124586] task_work_run+0x7e/0xc0 [116648.125210] exit_to_usermode_loop+0xfa/0x100 [116648.125830] do_syscall_64+0x1cb/0x220 [116648.126463] entry_SYSCALL_64_after_hwframe+0x49/0xbe [116648.127080] RIP: 0033:0x7f9cdc073b37 (...) [116648.135923] ---[ end trace 22e24db328ccadf9 ]--- These happen because, during the unmount path, we call kobject_del() for raid kobjects that are not fully initialized, meaning that we set their ktype (as btrfs_raid_ktype) through link_block_group() but we didn't set their parent kobject, which is done through btrfs_add_raid_kobjects(). We have this split raid kobject setup since commit 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") in order to avoid triggering reclaim during contextes where we can not (either we are holding a transaction handle or some lock required by the transaction commit path), so that we do the calls to kobject_add(), which triggers GFP_KERNEL allocations, through btrfs_add_raid_kobjects() in contextes where it is safe to trigger reclaim. That change expected that a new raid kobject can only be created either when mounting the filesystem or after raid profile conversion through the relocation path. However, we can have new raid kobject created in other two cases at least: 1) During device replace (or scrub) after adding a device a to the filesystem. The replace procedure (and scrub) do calls to btrfs_inc_block_group_ro() which can allocate a new block group with a new raid profile (because we now have more devices). This can be triggered by test cases btrfs/027 and btrfs/176. 2) During a degraded mount trough any write path. This can be triggered by test case btrfs/124. Fixing this by adding extra calls to btrfs_add_raid_kobjects(), not only makes things more complex and fragile, can also introduce deadlocks with reclaim the following way: 1) Calling btrfs_add_raid_kobjects() at btrfs_inc_block_group_ro() or anywhere in the replace/scrub path will cause a deadlock with reclaim because if reclaim happens and a transaction commit is triggered, the transaction commit path will block at btrfs_scrub_pause(). 2) During degraded mounts it is essentially impossible to figure out where to add extra calls to btrfs_add_raid_kobjects(), because allocation of a block group with a new raid profile can happen anywhere, which means we can't safely figure out which contextes are safe for reclaim, as we can either hold a transaction handle or some lock needed by the transaction commit path. So it is too complex and error prone to have this split setup of raid kobjects. So fix the issue by consolidating the setup of the kobjects in a single place, at link_block_group(), and setup a nofs context there in order to prevent reclaim being triggered by the memory allocations done through the call chain of kobject_add(). Besides fixing the sysfs warnings during kobject_del(), this also ensures the sysfs directories for the new raid profiles end up created and visible to users (a bug that existed before the 5.3 commit 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups")). Fixes: 75cb379d263521 ("btrfs: defer adding raid type kobject until after chunk relocation") Fixes: 7c7e301406d0a9 ("btrfs: sysfs: Replace default_attrs in ktypes with groups") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: move the subvolume reservation stuff out of extent-tree.cJosef Bacik
This is just two functions, put it in root-tree.c since it involves root items. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: migrate the delalloc space stuff to it's own homeJosef Bacik
We have code for data and metadata reservations for delalloc. There's quite a bit of code here, and it's used in a lot of places so I've separated it out to it's own file. inode.c and file.c are already pretty large, and this code is complicated enough to live in its own space. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: migrate btrfs_trans_release_chunk_metadataJosef Bacik
Move this into transaction.c with the rest of the transaction related code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-04btrfs: migrate the delayed refs rsv codeJosef Bacik
These belong with the delayed refs related code, not in extent-tree.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: migrate the global_block_rsv helpers to block-rsv.cJosef Bacik
These helpers belong in block-rsv.c Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: migrate the block-rsv code to block-rsv.cJosef Bacik
This moves everything out of extent-tree.c to block-rsv.c. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: stop using block_rsv_release_bytes everywhereJosef Bacik
block_rsv_release_bytes() is the internal to the block_rsv code, and shouldn't be called directly by anything else. Switch all users to the exported helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: cleanup the target logic in __btrfs_block_rsv_releaseJosef Bacik
This works for all callers already, but if we wanted to use the helper for the global_block_rsv it would end up trying to refill itself. Fix the logic to be able to be used no matter which block rsv is passed in to this helper. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: export __btrfs_block_rsv_releaseJosef Bacik
The delalloc reserve stuff calls this directly because it cares about the qgroup accounting stuff, so export it so we can move it around. Fix btrfs_block_rsv_release() to just be a static inline since it just calls __btrfs_block_rsv_release() with NULL for the qgroup stuff. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
2019-07-02btrfs: export btrfs_block_rsv_add_bytesJosef Bacik
This is used in a few places, we need to make sure it's exported so we can move it around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>