aboutsummaryrefslogtreecommitdiff
path: root/fs/btrfs/ioctl.c
AgeCommit message (Collapse)Author
2016-01-18Merge branch 'for-linus-4.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This has our usual assortment of fixes and cleanups, but the biggest change included is Omar Sandoval's free space tree. It's not the default yet, mounting -o space_cache=v2 enables it and sets a readonly compat bit. The tree can actually be deleted and regenerated if there are any problems, but it has held up really well in testing so far. For very large filesystems (30T+) our existing free space caching code can end up taking a huge amount of time during commits. The new tree based code is faster and less work overall to update as the commit progresses. Omar worked on this during the summer and we'll hammer on it in production here at FB over the next few months" * 'for-linus-4.5' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (73 commits) Btrfs: fix fitrim discarding device area reserved for boot loader's use Btrfs: Check metadata redundancy on balance btrfs: statfs: report zero available if metadata are exhausted btrfs: preallocate path for snapshot creation at ioctl time btrfs: allocate root item at snapshot ioctl time btrfs: do an allocation earlier during snapshot creation btrfs: use smaller type for btrfs_path locks btrfs: use smaller type for btrfs_path lowest_level btrfs: use smaller type for btrfs_path reada btrfs: cleanup, use enum values for btrfs_path reada btrfs: constify static arrays btrfs: constify remaining structs with function pointers btrfs tests: replace whole ops structure for free space tests btrfs: use list_for_each_entry* in backref.c btrfs: use list_for_each_entry_safe in free-space-cache.c btrfs: use list_for_each_entry* in check-integrity.c Btrfs: use linux/sizes.h to represent constants btrfs: cleanup, remove stray return statements btrfs: zero out delayed node upon allocation btrfs: pass proper enum type to start_transaction() ...
2016-01-11Merge branch 'misc-cleanups-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2016-01-11Merge branch 'misc-for-4.5' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5
2016-01-07btrfs: preallocate path for snapshot creation at ioctl timeDavid Sterba
We can also preallocate btrfs_path that's used during pending snapshot creation and avoid another late ENOMEM failure. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: allocate root item at snapshot ioctl timeDavid Sterba
The actual snapshot creation is delayed until transaction commit. If we cannot get enough memory for the root item there, we have to fail the whole transaction commit which is bad. So we'll allocate the memory at the ioctl call and pass it along with the pending_snapshot struct. The potential ENOMEM will be returned to the caller of snapshot ioctl. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: do an allocation earlier during snapshot creationDavid Sterba
We can allocate pending_snapshot earlier and do not have to do cleanup in case of failure. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: cleanup, use enum values for btrfs_path readaDavid Sterba
Replace the integers by enums for better readability. The value 2 does not have any meaning since a717531942f488209dded30f6bc648167bcefa72 "Btrfs: do less aggressive btree readahead" (2009-01-22). Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07btrfs: constify static arraysDavid Sterba
There are a few statically initialized arrays that can be made const. The remaining (like file_system_type, sysfs attributes or prop handlers) do not allow that due to type mismatch when passed to the APIs or because the structures are modified through other members. Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-07Btrfs: use linux/sizes.h to represent constantsByongho Lee
We use many constants to represent size and offset value. And to make code readable we use '256 * 1024 * 1024' instead of '268435456' to represent '256MB'. However we can make far more readable with 'SZ_256MB' which is defined in the 'linux/sizes.h'. So this patch replaces 'xxx * 1024 * 1024' kind of expression with single 'SZ_xxxMB' if 'xxx' is a power of 2 then 'xxx * SZ_1M' if 'xxx' is not a power of 2. And I haven't touched to '4096' & '8192' because it's more intuitive than 'SZ_4KB' & 'SZ_8KB'. Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2016-01-01btrfs: use new dedupe data function pointerDarrick J. Wong
Now that the VFS encapsulates the dedupe ioctl, wire up btrfs to it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-23Merge branch 'dev/simplify-set-bit' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.5 Signed-off-by: Chris Mason <clm@fb.com>
2015-12-07vfs: pull btrfs clone API to vfs layerChristoph Hellwig
The btrfs clone ioctls are now adopted by other file systems, with NFS and CIFS already having support for them, and XFS being under active development. To avoid growth of various slightly incompatible implementations, add one to the VFS. Note that clones are different from file copies in several ways: - they are atomic vs other writers - they support whole file clones - they support 64-bit legth clones - they do not allow partial success (aka short writes) - clones are expected to be a fast metadata operation Because of that it would be rather cumbersome to try to piggyback them on top of the recent clone_file_range infrastructure. The converse isn't true and the clone_file_range system call could try clone file range as a first attempt to copy, something that further patches will enable. Based on earlier work from Peng Tao. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-12-03btrfs: use GFP_KERNEL for allocations in ioctl handlersDavid Sterba
We don't have to use GFP_NOFS in the ioctl handlers because there's no risk of looping through the allocators back to the filesystem. This patch covers only allocations that are directly in the ioctl handlers. Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-03btrfs: drop unused parameter from lock_extent_bitsDavid Sterba
We've always passed 0. Stack usage will slightly decrease. Signed-off-by: David Sterba <dsterba@suse.com>
2015-12-01btrfs: add .copy_file_range file operationZach Brown
This rearranges the existing COPY_RANGE ioctl implementation so that the .copy_file_range file operation can call the core loop that copies file data extent items. The extent copying loop is lifted up into its own function. It retains the core btrfs error checks that should be shared. Signed-off-by: Zach Brown <zab@redhat.com> [Anna Schumaker: Make flags an unsigned int, Check for COPY_FR_REFLINK] Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-11-06Merge branch 'for-linus-4.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "We have a lot of subvolume quota improvements in here, along with big piles of cleanups from Dave Sterba and Anand Jain and others. Josef pitched in a batch of allocator fixes based on production use here at FB. We found that mount -o ssd_spread greatly improved our performance on hardware raid5/6, but it exposed some CPU bottlenecks in the allocator. These patches make a huge difference" * 'for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (100 commits) Btrfs: fix hole punching when using the no-holes feature Btrfs: find_free_extent: Do not erroneously skip LOOP_CACHING_WAIT state btrfs: Fix a data space underflow warning btrfs: qgroup: Fix a rebase bug which will cause qgroup double free btrfs: qgroup: Fix a race in delayed_ref which leads to abort trans btrfs: clear PF_NOFREEZE in cleaner_kthread() btrfs: qgroup: Don't copy extent buffer to do qgroup rescan btrfs: add balance filters limits, stripes and usage to supported mask btrfs: extend balance filter usage to take minimum and maximum btrfs: add balance filter for stripes btrfs: extend balance filter limit to take minimum and maximum btrfs: fix use after free iterating extrefs btrfs: check unsupported filters in balance arguments Btrfs: fix regression running delayed references when using qgroups Btrfs: fix regression when running delayed references Btrfs: don't do extra bitmap search in one bit case Btrfs: keep track of largest extent in bitmaps Btrfs: don't keep trying to build clusters if we are fragmented Btrfs: cut down on loops through the allocator Btrfs: don't continue setting up space cache when enospc ...
2015-10-26btrfs: check unsupported filters in balance argumentsDavid Sterba
We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-25Btrfs: fix regression running delayed references when using qgroupsFilipe Manana
In the kernel 4.2 merge window we had a big changes to the implementation of delayed references and qgroups which made the no_quota field of delayed references not used anymore. More specifically the no_quota field is not used anymore as of: commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism.") Leaving the no_quota field actually prevents delayed references from getting merged, which in turn cause the following BUG_ON(), at fs/btrfs/extent-tree.c, to be hit when qgroups are enabled: static int run_delayed_tree_ref(...) { (...) BUG_ON(node->ref_mod != 1); (...) } This happens on a scenario like the following: 1) Ref1 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. 2) Ref2 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with Ref1 because Ref1->no_quota != Ref2->no_quota. 3) Ref3 bytenr X, action = BTRFS_ADD_DELAYED_REF, no_quota = 1, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref2 is incompatible due to Ref2->no_quota != Ref3->no_quota. 4) Ref4 bytenr X, action = BTRFS_DROP_DELAYED_REF, no_quota = 0, added. It's not merged with the reference at the tail of the list of refs for bytenr X because the reference at the tail, Ref3 is incompatible due to Ref3->no_quota != Ref4->no_quota. 5) We run delayed references, trigger merging of delayed references, through __btrfs_run_delayed_refs() -> btrfs_merge_delayed_refs(). 6) Ref1 and Ref3 are merged as Ref1->no_quota = Ref3->no_quota and all other conditions are satisfied too. So Ref1 gets a ref_mod value of 2. 7) Ref2 and Ref4 are merged as Ref2->no_quota = Ref4->no_quota and all other conditions are satisfied too. So Ref2 gets a ref_mod value of 2. 8) Ref1 and Ref2 aren't merged, because they have different values for their no_quota field. 9) Delayed reference Ref1 is picked for running (select_delayed_ref() always prefers references with an action == BTRFS_ADD_DELAYED_REF). So run_delayed_tree_ref() is called for Ref1 which triggers the BUG_ON because Ref1->red_mod != 1 (equals 2). So fix this by removing the no_quota field, as it's not used anymore as of commit 0ed4792af0e8 ("btrfs: qgroup: Switch to new extent-oriented qgroup mechanism."). The use of no_quota was also buggy in at least two places: 1) At delayed-refs.c:btrfs_add_delayed_tree_ref() - we were setting no_quota to 0 instead of 1 when the following condition was true: is_fstree(ref_root) || !fs_info->quota_enabled 2) At extent-tree.c:__btrfs_inc_extent_ref() - we were attempting to reset a node's no_quota when the condition "!is_fstree(root_objectid) || !root->fs_info->quota_enabled" was true but we did it only in an unused local stack variable, that is, we never reset the no_quota value in the node itself. This fixes the remainder of problems several people have been having when running delayed references, mostly while a balance is running in parallel, on a 4.2+ kernel. Very special thanks to Stéphane Lesimple for helping debugging this issue and testing this fix on his multi terabyte filesystem (which took more than one day to balance alone, plus fsck, etc). Also, this fixes deadlock issue when using the clone ioctl with qgroups enabled, as reported by Elias Probst in the mailing list. The deadlock happens because after calling btrfs_insert_empty_item we have our path holding a write lock on a leaf of the fs/subvol tree and then before releasing the path we called check_ref() which did backref walking, when qgroups are enabled, and tried to read lock the same leaf. The trace for this case is the following: INFO: task systemd-nspawn:6095 blocked for more than 120 seconds. (...) Call Trace: [<ffffffff86999201>] schedule+0x74/0x83 [<ffffffff863ef64c>] btrfs_tree_read_lock+0xc0/0xea [<ffffffff86137ed7>] ? wait_woken+0x74/0x74 [<ffffffff8639f0a7>] btrfs_search_old_slot+0x51a/0x810 [<ffffffff863a129b>] btrfs_next_old_leaf+0xdf/0x3ce [<ffffffff86413a00>] ? ulist_add_merge+0x1b/0x127 [<ffffffff86411688>] __resolve_indirect_refs+0x62a/0x667 [<ffffffff863ef546>] ? btrfs_clear_lock_blocking_rw+0x78/0xbe [<ffffffff864122d3>] find_parent_nodes+0xaf3/0xfc6 [<ffffffff86412838>] __btrfs_find_all_roots+0x92/0xf0 [<ffffffff864128f2>] btrfs_find_all_roots+0x45/0x65 [<ffffffff8639a75b>] ? btrfs_get_tree_mod_seq+0x2b/0x88 [<ffffffff863e852e>] check_ref+0x64/0xc4 [<ffffffff863e9e01>] btrfs_clone+0x66e/0xb5d [<ffffffff863ea77f>] btrfs_ioctl_clone+0x48f/0x5bb [<ffffffff86048a68>] ? native_sched_clock+0x28/0x77 [<ffffffff863ed9b0>] btrfs_ioctl+0xabc/0x25cb (...) The problem goes away by eleminating check_ref(), which no longer is needed as its purpose was to get a value for the no_quota field of a delayed reference (this patch removes the no_quota field as mentioned earlier). Reported-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Tested-by: Stéphane Lesimple <stephane_btrfs@lesimple.fr> Reported-by: Elias Probst <mail@eliasprobst.eu> Reported-by: Peter Becker <floyd.net@gmail.com> Reported-by: Malte Schröder <malte@tnxip.de> Reported-by: Derek Dongray <derek@valedon.co.uk> Reported-by: Erkki Seppala <flux-btrfs@inside.org> Cc: stable@vger.kernel.org # 4.2+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
2015-10-21btrfs: qgroup: Cleanup old inaccurate facilitiesQu Wenruo
Cleanup the old facilities which use old btrfs_qgroup_reserve() function call, replace them with the newer version, and remove the "__" prefix in them. Also, make btrfs_qgroup_reserve/free() functions private, as they are now only used inside qgroup codes. Now, the whole btrfs qgroup is swithed to use the new reserve facilities. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21btrfs: extent-tree: Switch to new delalloc space reserve and releaseQu Wenruo
Use new __btrfs_delalloc_reserve_space() and __btrfs_delalloc_release_space() to reserve and release space for delalloc. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21Merge branch 'integration-4.4' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux into for-linus-4.4
2015-10-21Merge branch 'cleanups/for-4.4' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4
2015-10-21btrfs: fix possible leak in btrfs_ioctl_balance()Christian Engelmayer
Commit 8eb934591f8b ("btrfs: check unsupported filters in balance arguments") adds a jump to exit label out_bargs in case the argument check fails. At this point in addition to the bargs memory, the memory for struct btrfs_balance_control has already been allocated. Ownership of bctl is passed to btrfs_balance() in the good case, thus the memory is not freed due to the introduced jump. Make sure that the memory gets freed in any case as necessary. Detected by Coverity CID 1328378. Signed-off-by: Christian Engelmayer <cengelma@gmx.at> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-21btrfs: replace unnecessary list_for_each_entry_safe to list_for_each_entryByongho Lee
There is no removing list element while iterating over list. So, replace list_for_each_entry_safe to list_for_each_entry. Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Byongho Lee <bhlee.kernel@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-14Btrfs: fix file corruption and data loss after cloning inline extentsFilipe Manana
Currently the clone ioctl allows to clone an inline extent from one file to another that already has other (non-inlined) extents. This is a problem because btrfs is not designed to deal with files having inline and regular extents, if a file has an inline extent then it must be the only extent in the file and must start at file offset 0. Having a file with an inline extent followed by regular extents results in EIO errors when doing reads or writes against the first 4K of the file. Also, the clone ioctl allows one to lose data if the source file consists of a single inline extent, with a size of N bytes, and the destination file consists of a single inline extent with a size of M bytes, where we have M > N. In this case the clone operation removes the inline extent from the destination file and then copies the inline extent from the source file into the destination file - we lose the M - N bytes from the destination file, a read operation will get the value 0x00 for any bytes in the the range [N, M] (the destination inode's i_size remained as M, that's why we can read past N bytes). So fix this by not allowing such destructive operations to happen and return errno EOPNOTSUPP to user space. Currently the fstest btrfs/035 tests the data loss case but it totally ignores this - i.e. expects the operation to succeed and does not check the we got data loss. The following test case for fstests exercises all these cases that result in file corruption and data loss: seq=`basename $0` seqres=$RESULT_DIR/$seq echo "QA output created by $seq" tmp=/tmp/$$ status=1 # failure is the default! trap "_cleanup; exit \$status" 0 1 2 3 15 _cleanup() { rm -f $tmp.* } # get standard environment, filters and checks . ./common/rc . ./common/filter # real QA test starts here _need_to_be_root _supported_fs btrfs _supported_os Linux _require_scratch _require_cloner _require_btrfs_fs_feature "no_holes" _require_btrfs_mkfs_feature "no-holes" rm -f $seqres.full test_cloning_inline_extents() { local mkfs_opts=$1 local mount_opts=$2 _scratch_mkfs $mkfs_opts >>$seqres.full 2>&1 _scratch_mount $mount_opts # File bar, the source for all the following clone operations, consists # of a single inline extent (50 bytes). $XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \ | _filter_xfs_io # Test cloning into a file with an extent (non-inlined) where the # destination offset overlaps that extent. It should not be possible to # clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo data after clone operation:" # All bytes should have the value 0xaa (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo $XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io # Test cloning the inline extent against a file which has a hole in its # first 4K followed by a non-inlined extent. It should not be possible # as well to clone the inline extent from file bar into this file. $XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "File foo2 data after clone operation:" # All bytes should have the value 0x00 (clone operation failed and did # not modify our file). od -t x1 $SCRATCH_MNT/foo2 $XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io # Test cloning the inline extent against a file which has a size of zero # but has a prealloc extent. It should not be possible as well to clone # the inline extent from file bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3 # Doing IO against any range in the first 4K of the file should work. # Due to a past clone ioctl bug which allowed cloning the inline extent, # these operations resulted in EIO errors. echo "First 50 bytes of foo3 after clone operation:" # Should not be able to read any bytes, file has 0 bytes i_size (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo3 $XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io # Test cloning the inline extent against a file which consists of a # single inline extent that has a size not greater than the size of # bar's inline extent (40 < 50). # It should be possible to do the extent cloning from bar to this file. $XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4 # Doing IO against any range in the first 4K of the file should work. echo "File foo4 data after clone operation:" # Must match file bar's content. od -t x1 $SCRATCH_MNT/foo4 $XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io # Test cloning the inline extent against a file which consists of a # single inline extent that has a size greater than the size of bar's # inline extent (60 > 50). # It should not be possible to clone the inline extent from file bar # into this file. $XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \ | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5 # Reading the file should not fail. echo "File foo5 data after clone operation:" # Must have a size of 60 bytes, with all bytes having a value of 0x03 # (the clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo5 # Test cloning the inline extent against a file which has no extents but # has a size greater than bar's inline extent (16K > 50). # It should not be possible to clone the inline extent from file bar # into this file. $XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6 # Reading the file should not fail. echo "File foo6 data after clone operation:" # Must have a size of 16K, with all bytes having a value of 0x00 (the # clone operation failed and did not modify our file). od -t x1 $SCRATCH_MNT/foo6 # Test cloning the inline extent against a file which has no extents but # has a size not greater than bar's inline extent (30 < 50). # It should be possible to clone the inline extent from file bar into # this file. $XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7 # Reading the file should not fail. echo "File foo7 data after clone operation:" # Must have a size of 50 bytes, with all bytes having a value of 0xbb. od -t x1 $SCRATCH_MNT/foo7 # Test cloning the inline extent against a file which has a size not # greater than the size of bar's inline extent (20 < 50) but has # a prealloc extent that goes beyond the file's size. It should not be # possible to clone the inline extent from bar into this file. $XFS_IO_PROG -f -c "falloc -k 0 1M" \ -c "pwrite -S 0x88 0 20" \ $SCRATCH_MNT/foo8 | _filter_xfs_io $CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8 echo "File foo8 data after clone operation:" # Must have a size of 20 bytes, with all bytes having a value of 0x88 # (the clone operation did not modify our file). od -t x1 $SCRATCH_MNT/foo8 _scratch_unmount } echo -e "\nTesting without compression and without the no-holes feature...\n" test_cloning_inline_extents echo -e "\nTesting with compression and without the no-holes feature...\n" test_cloning_inline_extents "" "-o compress" echo -e "\nTesting without compression and with the no-holes feature...\n" test_cloning_inline_extents "-O no-holes" "" echo -e "\nTesting with compression and with the no-holes feature...\n" test_cloning_inline_extents "-O no-holes" "-o compress" status=0 exit Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-10-13btrfs: check unsupported filters in balance argumentsDavid Sterba
We don't verify that all the balance filter arguments supplemented by the flags are actually known to the kernel. Thus we let it silently pass and do nothing. At the moment this means only the 'limit' filter, but we're going to add a few more soon so it's better to have that fixed. Also in older stable kernels so that it works with newer userspace tools. Cc: stable@vger.kernel.org # 3.16+ Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-10-12Merge branch 'anand/sysfs-updates-v4.3-rc3' of ↵Chris Mason
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.4 Signed-off-by: Chris Mason <clm@fb.com>
2015-10-08btrfs: switch more printks to our helpersDavid Sterba
Convert the simple cases, not all functions provide a way to reach the fs_info. Also skipped debugging messages (print-tree, integrity checker and pr_debug) and messages that are printed from possibly unfinished mount. Signed-off-by: David Sterba <dsterba@suse.com>
2015-10-08btrfs: switch message printers to _in_rcu variantsDavid Sterba
Signed-off-by: David Sterba <dsterba@suse.com>
2015-09-29Btrfs: consolidate btrfs_error() to btrfs_std_error()Anand Jain
btrfs_error() and btrfs_std_error() does the same thing and calls _btrfs_std_error(), so consolidate them together. And the main motivation is that btrfs_error() is closely named with btrfs_err(), one handles error action the other is to log the error, so don't closely name them. Signed-off-by: Anand Jain <anand.jain@oracle.com> Suggested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2015-08-09btrfs: fix clone / extent-same deadlocksMark Fasheh
Clone and extent same lock their source and target inodes in opposite order. In addition to this, the range locking in clone doesn't take ordering into account. Fix this by having clone use the same locking helpers as btrfs-extent-same. In addition, I do a small cleanup of the locking helpers, removing a case (both inodes being the same) which was poorly accounted for and never actually used by the callers. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09Btrfs: fix defrag to merge tail file extentLiu Bo
The file layout is [extent 1]...[extent n][4k extent][HOLE][extent x] extent 1~n and 4k extent can be merged during defrag, and the whole defrag bytes is larger than our defrag thresh(256k), 4k extent as a tail is left unmerged since we check if its next extent can be merged (the next one is a hole, so the check will fail), the layout thus can be [new extent][4k extent][HOLE][extent x] (1~n) To fix it, beside looking at the next one, this also looks at the previous one by checking @defrag_end, which is set to 0 when we decide to stop merging contiguous extents, otherwise, we can merge the previous one with our extent. Also, this makes btrfs behave consistent with how xfs and ext4 do. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-08-09btrfs: fix search key advancing conditionNaohiro Aota
The search key advancing condition used in copy_to_sk() is loose. It can advance the key even if it reaches sk->max_*: e.g. when the max key = (512, 1024, -1) and the current key = (512, 1025, 10), it increments the offset by 1, continues hopeless search from (512, 1025, 11). This issue make ioctl() to take unexpectedly long time scanning all the leaf a blocks one by one. This commit fix the problem using standard way of key comparison: btrfs_comp_cpu_keys() Signed-off-by: Naohiro Aota <naota@elisp.net> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-14Btrfs: fix file corruption after cloning inline extentsFilipe Manana
Using the clone ioctl (or extent_same ioctl, which calls the same extent cloning function as well) we end up allowing copy an inline extent from the source file into a non-zero offset of the destination file. This is something not expected and that the btrfs code is not prepared to deal with - all inline extents must be at a file offset equals to 0. For example, the following excerpt of a test case for fstests triggers a crash/BUG_ON() on a write operation after an inline extent is cloned into a non-zero offset: _scratch_mkfs >>$seqres.full 2>&1 _scratch_mount # Create our test files. File foo has the same 2K of data at offset 4K # as file bar has at its offset 0. $XFS_IO_PROG -f -s -c "pwrite -S 0xaa 0 4K" \ -c "pwrite -S 0xbb 4k 2K" \ -c "pwrite -S 0xcc 8K 4K" \ $SCRATCH_MNT/foo | _filter_xfs_io # File bar consists of a single inline extent (2K size). $XFS_IO_PROG -f -s -c "pwrite -S 0xbb 0 2K" \ $SCRATCH_MNT/bar | _filter_xfs_io # Now call the clone ioctl to clone the extent of file bar into file # foo at its offset 4K. This made file foo have an inline extent at # offset 4K, something which the btrfs code can not deal with in future # IO operations because all inline extents are supposed to start at an # offset of 0, resulting in all sorts of chaos. # So here we validate that clone ioctl returns an EOPNOTSUPP, which is # what it returns for other cases dealing with inlined extents. $CLONER_PROG -s 0 -d $((4 * 1024)) -l $((2 * 1024)) \ $SCRATCH_MNT/bar $SCRATCH_MNT/foo # Because of the inline extent at offset 4K, the following write made # the kernel crash with a BUG_ON(). $XFS_IO_PROG -c "pwrite -S 0xdd 6K 2K" $SCRATCH_MNT/foo | _filter_xfs_io status=0 exit The stack trace of the BUG_ON() triggered by the last write is: [152154.035903] ------------[ cut here ]------------ [152154.036424] kernel BUG at mm/page-writeback.c:2286! [152154.036424] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC [152154.036424] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse parport_pc acpi_cpu$ [152154.036424] CPU: 2 PID: 17873 Comm: xfs_io Tainted: G W 4.1.0-rc6-btrfs-next-11+ #2 [152154.036424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20150316_085822-nilsson.home.kraxel.org 04/01/2014 [152154.036424] task: ffff880429f70990 ti: ffff880429efc000 task.ti: ffff880429efc000 [152154.036424] RIP: 0010:[<ffffffff8111a9d5>] [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP: 0018:ffff880429effc68 EFLAGS: 00010246 [152154.036424] RAX: 0200000000000806 RBX: ffffea0006a6d8f0 RCX: 0000000000000001 [152154.036424] RDX: 0000000000000000 RSI: ffffffff81155d1b RDI: ffffea0006a6d8f0 [152154.036424] RBP: ffff880429effc78 R08: ffff8801ce389fe0 R09: 0000000000000001 [152154.036424] R10: 0000000000002000 R11: ffffffffffffffff R12: ffff8800200dce68 [152154.036424] R13: 0000000000000000 R14: ffff8800200dcc88 R15: ffff8803d5736d80 [152154.036424] FS: 00007fbf119f6700(0000) GS:ffff88043d280000(0000) knlGS:0000000000000000 [152154.036424] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [152154.036424] CR2: 0000000001bdc000 CR3: 00000003aa555000 CR4: 00000000000006e0 [152154.036424] Stack: [152154.036424] ffff8803d5736d80 0000000000000001 ffff880429effcd8 ffffffffa04e97c1 [152154.036424] ffff880429effd68 ffff880429effd60 0000000000000001 ffff8800200dc9c8 [152154.036424] 0000000000000001 ffff8800200dcc88 0000000000000000 0000000000001000 [152154.036424] Call Trace: [152154.036424] [<ffffffffa04e97c1>] lock_and_cleanup_extent_if_need+0x147/0x18d [btrfs] [152154.036424] [<ffffffffa04ea82c>] __btrfs_buffered_write+0x245/0x4c8 [btrfs] [152154.036424] [<ffffffffa04ed14b>] ? btrfs_file_write_iter+0x150/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed15a>] ? btrfs_file_write_iter+0x15f/0x3e0 [btrfs] [152154.036424] [<ffffffffa04ed2c7>] btrfs_file_write_iter+0x2cc/0x3e0 [btrfs] [152154.036424] [<ffffffff81165a4a>] __vfs_write+0x7c/0xa5 [152154.036424] [<ffffffff81165f89>] vfs_write+0xa0/0xe4 [152154.036424] [<ffffffff81166855>] SyS_pwrite64+0x64/0x82 [152154.036424] [<ffffffff81465197>] system_call_fastpath+0x12/0x6f [152154.036424] Code: 48 89 c7 e8 0f ff ff ff 5b 41 5c 5d c3 0f 1f 44 00 00 55 48 89 e5 41 54 53 48 89 fb e8 ae ef 00 00 49 89 c4 48 8b 03 a8 01 75 02 <0f> 0b 4d 85 e4 74 59 49 8b 3c 2$ [152154.036424] RIP [<ffffffff8111a9d5>] clear_page_dirty_for_io+0x1e/0x90 [152154.036424] RSP <ffff880429effc68> [152154.242621] ---[ end trace e3d3376b23a57041 ]--- Fix this by returning the error EOPNOTSUPP if an attempt to copy an inline extent into a non-zero offset happens, just like what is done for other scenarios that would require copying/splitting inline extents, which were introduced by the following commits: 00fdf13a2e9f ("Btrfs: fix a crash of clone with inline extents's split") 3f9e3df8da3c ("btrfs: replace error code from btrfs_drop_extents") Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com>
2015-07-11Btrfs: fix memory leak in the extent_same ioctlFilipe Manana
We were allocating memory with memdup_user() but we were never releasing that memory. This affected pretty much every call to the ioctl, whether it deduplicated extents or not. This issue was reported on IRC by Julian Taylor and on the mailing list by Marcel Ritter, credit goes to them for finding the issue. Reported-by: Julian Taylor <jtaylor.debian@googlemail.com> Reported-by: Marcel Ritter <ritter.marcel@gmail.com> Cc: stable@vger.kernel.org Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Mark Fasheh <mfasheh@suse.de>
2015-07-01btrfs: don't update mtime/ctime on deduped inodesMark Fasheh
One issue users have reported is that dedupe changes mtime on files, resulting in tools like rsync thinking that their contents have changed when in fact the data is exactly the same. We also skip the ctime update as no user-visible metadata changes here and we want dedupe to be transparent to the user. Clone still wants time changes, so we special case this in the code. This was tested with the btrfs-extent-same tool. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01btrfs: allow dedupe of same inodeMark Fasheh
clone() supports cloning within an inode so extent-same can do the same now. This patch fixes up the locking in extent-same to know about the single-inode case. In addition to that, we add a check for overlapping ranges, which clone does not allow. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01btrfs: fix deadlock with extent-same and readpageMark Fasheh
->readpage() does page_lock() before extent_lock(), we do the opposite in extent-same. We want to reverse the order in btrfs_extent_same() but it's not quite straightforward since the page locks are taken inside btrfs_cmp_data(). So I split btrfs_cmp_data() into 3 parts with a small context structure that is passed between them. The first, btrfs_cmp_data_prepare() gathers up the pages needed (taking page lock as required) and puts them on our context structure. At this point, we are safe to lock the extent range. Afterwards, we use btrfs_cmp_data() to do the data compare as usual and btrfs_cmp_data_free() to clean up our context. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-07-01btrfs: pass unaligned length to btrfs_cmp_data()Mark Fasheh
In the case that we dedupe the tail of a file, we might expand the dedupe len out to the end of our last block. We don't want to compare data past i_size however, so pass the original length to btrfs_cmp_data(). Signed-off-by: Mark Fasheh <mfasheh@suse.de> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10btrfs: Handle unaligned length in extent_sameMark Fasheh
The extent-same code rejects requests with an unaligned length. This poses a problem when we want to dedupe the tail extent of files as we skip cloning the portion between i_size and the extent boundary. If we don't clone the entire extent, it won't be deleted. So the combination of these behaviors winds up giving us worst-case dedupe on many files. We can fix this by allowing a length that extents to i_size and internally aligining those to the end of the block. This is what btrfs_ioctl_clone() so we can just copy that check over. Signed-off-by: Mark Fasheh <mfasheh@suse.de> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10Btrfs: btrfs_defrag_file: Fix calculation of max_to_defrag.chandan
max_to_defrag represents the number of pages to defrag rather than the last page of the file range to be defragged. Consider a file having 10 4k blocks (i.e. blocks in the range [0 - 9]). If the defrag ioctl was invoked for the block range [3 - 6], then max_to_defrag should actually have the value 4. Instead in the current code we end up setting it to 6. Now, this does not (yet) cause an issue since the first part of the while loop condition in btrfs_defrag_file() (i.e. "i <= last_index") causes the control to flow out of the while loop before any buggy behavior is actually caused. So the patch just makes sure that max_to_defrag ends up having the right value rather than fixing a bug. I did run the xfstests suite to make sure that the code does not regress. Changelog: v1->v2: Provide a much descriptive commit message. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-10Btrfs: btrfs_defrag_file: Fix ra_index computation.chandan
Read-ahead is done for the pages in the range [ra_index, ra_index + cluster - 1]. So the next read-ahead should be starting from the page at index 'ra_index + cluster' (unless we deemed that the extent at 'ra_index + cluster' as non-defraggable) rather than from the page at index 'ra_index + max_cluster'. This patch fixes this. I did run the xfstests suite to make sure that the code does not regress. Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02btrfs: make root id query unprivilegedDavid Sterba
The INO_LOOKUP ioctl can lookup path for a given inode number and is thus restricted. As a sideefect it can find the root id of the containing subvolume and we're using this int the 'btrfs inspect rootid' command. The restriction is unnecessary in case we set the ioctl args args::treeid = 0 args::objectid = 256 (BTRFS_FIRST_FREE_OBJECTID) Then the path will be empty and the treeid is filled with the root id of the inode on which the ioctl is called. This behaviour is unchanged, after the root restriction is removed. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02btrfs: fix warnings after changes in btrfs_abort_transactionDavid Sterba
fs/btrfs/volumes.c: In function ‘btrfs_create_uuid_tree’: fs/btrfs/volumes.c:3909:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, tree_root, ^ CC [M] fs/btrfs/ioctl.o fs/btrfs/ioctl.c: In function ‘create_subvol’: fs/btrfs/ioctl.c:549:3: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘long int’ [-Wformat=] btrfs_abort_transaction(trans, root, PTR_ERR(new_root)); PTR_ERR returns long, but we're really using 'int' for the error codes everywhere so just set and use the local variable. Signed-off-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
2015-06-02Btrfs: don't invalidate root dentry when subvolume deletion failsOmar Sandoval
Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate"), mounted subvolumes can be deleted because d_invalidate() won't fail. However, we run into problems when we attempt to delete the default subvolume while it is mounted as the root filesystem: # btrfs subvol list / ID 257 gen 306 top level 5 path rootvol ID 267 gen 334 top level 5 path snap1 # btrfs subvol get-default / ID 267 gen 334 top level 5 path snap1 # btrfs inspect-internal rootid / 267 # mount -o subvol=/ /dev/vda1 /mnt # btrfs subvol del /mnt/snap1 Delete subvolume (no-commit): '/mnt/snap1' ERROR: cannot delete '/mnt/snap1' - Operation not permitted # findmnt / findmnt: can't read /proc/mounts: No such file or directory # ls /proc # Markus reported that this same scenario simply led to a kernel oops. This happens because in btrfs_ioctl_snap_destroy(), we call d_invalidate() before we check may_destroy_subvol(), which means that we detach the submounts and drop the dentry before erroring out. Instead, we should only invalidate the dentry once the deletion has succeeded. Additionally, the shrink_dcache_sb() isn't necessary; d_invalidate() will prune the dcache for the deleted subvolume. Cc: <stable@vger.kernel.org> Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate") Reported-by: Markus Schauler <mschauler@gmail.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-05-01Merge branch 'for-linus-4.1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "A few more btrfs fixes. These range from corners Filipe found in the new free space cache writeback to a grab bag of fixes from the list" * 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. btrfs: unlock i_mutex after attempting to delete subvolume during send btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache btrfs: fix race on ENOMEM in alloc_extent_buffer btrfs: handle ENOMEM in btrfs_alloc_tree_block Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole Btrfs: don't check for delalloc_bytes in cache_save_setup Btrfs: fix deadlock when starting writeback of bg caches Btrfs: fix race between start dirty bg cache writeout and bg deletion
2015-04-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26btrfs: unlock i_mutex after attempting to delete subvolume during sendOmar Sandoval
Whenever the check for a send in progress introduced in commit 521e0546c970 (btrfs: protect snapshots from deleting during send) is hit, we return without unlocking inode->i_mutex. This is easy to see with lockdep enabled: [ +0.000059] ================================================ [ +0.000028] [ BUG: lock held when returning to user space! ] [ +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted [ +0.000026] ------------------------------------------------ [ +0.000029] btrfs/211 is leaving the kernel with locks still held! [ +0.000029] 1 lock held by btrfs/211: [ +0.000023] #0: (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0 Make sure we unlock it in the error path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-15VFS: normal filesystems (and lustre): d_inode() annotationsDavid Howells
that's the bulk of filesystem drivers dealing with inodes of their own Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-13btrfs: quota: Update quota tree after qgroup relationship change.Qu Wenruo
Previous patch modified the in memory struct but it's not written in quota tree until next commit. So user will still get old data using "btrfs qgroup show" after assign/remove. This patch will call btrfs_run_qgroups in assign ioctl so it will be updated to in memory quota trees and user will get up-to-date results. Signed-off-by: Qu Wenruo <quwenruo@cn.fujitsu.com> Reviewed-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Chris Mason <clm@fb.com>