path: root/fs
Age Commit message (Author)
2015-06-04 xfs: check min blks for random debug mode sparse allocations (Brian Foster)
The inode allocator enables random sparse inode chunk allocations in DEBUG mode to facilitate testing. Sparse inode allocations are not always possible, however, depending on the fs geometry. For example, there is no possibility for a sparse inode allocation on filesystems where the block size is large enough to fit one or more inode chunks within a single block. Fix up the DEBUG mode sparse inode allocation logic to trigger random sparse allocations only when the geometry of the fs allows it. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 xfs: fix sparse inodes 32-bit compile failure (Brian Foster)
The kbuild test robot reports the following compilation failure with a 32-bit kernel configuration:
    fs/built-in.o: In function `xfs_ifree_cluster':
    >> xfs_inode.c:(.text+0x17ac84): undefined reference to `__umoddi3'
This is due to the use of the modulus operator on a 64-bit variable in the ASSERT() added as part of the following commit:
    xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster()
This ASSERT() simply checks that the offset of the inode in a sparse cluster is appropriately aligned. Since the maximum inode record offset is 63 (for a 64 inode record) and the calculated offset here should be something less than that, just use a 32-bit variable to store the offset and call the do_mod() helper.
Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
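The idea behind this fix can be sketched in user-space C. This is only an illustration under stated assumptions: the function name and its parameters (`first_ino`, `inum`, `inodes_per_cluster`) are hypothetical, and a plain `%` on a 32-bit value stands in for the kernel's do_mod() helper.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the alignment check: first_ino, inum and
 * inodes_per_cluster are illustrative parameters, not the kernel's own.
 */
int cluster_offset_is_aligned(uint64_t first_ino, uint64_t inum,
                              uint32_t inodes_per_cluster)
{
    uint32_t ioffset;

    assert(inodes_per_cluster != 0);
    /* A record tracks 64 inodes, so the offset is at most 63. */
    assert(inum >= first_ino && inum - first_ino < 64);

    /* Narrow to 32 bits before taking the modulus. */
    ioffset = (uint32_t)(inum - first_ino);

    /* A 32-bit modulus needs no 64-bit division helper on 32-bit builds. */
    return (ioffset % inodes_per_cluster) == 0;
}
```

The subtraction is still done in 64 bits, which is cheap everywhere; only the modulus is moved to a 32-bit operand, which is what avoids the reference to `__umoddi3`.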
2015-05-29 xfs: enable sparse inode chunks for v5 superblocks (Brian Foster)
Enable mounting of filesystems with sparse inode support enabled. Add the incompat. feature bit to the *_ALL mask. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: skip unallocated regions of inode chunks in xfs_ifree_cluster() (Brian Foster)
xfs_ifree_cluster() is called to mark all in-memory inodes and inode buffers as stale. This occurs after we've removed the inobt records and dropped any references of inobt data. xfs_ifree_cluster() uses the starting inode number to walk the namespace of inodes expected for a single chunk, one cluster buffer at a time. The cluster buffer disk addresses are calculated by decoding the sequential inode numbers expected from the chunk. The problem with this approach is that if the inode chunk being removed is a sparse chunk, not all of the buffer addresses that are calculated as part of this sequence may be inode clusters. Attempting to acquire the buffer based on expected inode characteristics (i.e., cluster length) can lead to errors and is generally incorrect. We already use a couple of variables to carry requisite state from xfs_difree() to xfs_ifree_cluster(). Rather than add a third, define a new internal structure to carry the existing parameters through these functions. Add an alloc field that represents the physical allocation bitmap of inodes in the chunk being removed. Modify xfs_ifree_cluster() to check each inode against the bitmap and skip the clusters that were never allocated as real inodes on disk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: only free allocated regions of inode chunks (Brian Foster)
An inode chunk is currently added to the transaction free list based on a simple fsb conversion and hardcoded chunk length. The nature of sparse chunks is such that the physical chunk of inodes on disk may consist of one or more discontiguous parts. Blocks that reside in the holes of the inode chunk are not inodes and could be allocated to any other use or not allocated at all. Refactor the existing xfs_bmap_add_free() call into the xfs_difree_inode_chunk() helper. The new helper uses the existing calculation if a chunk is not sparse. Otherwise, use the inobt record holemask to free the contiguous regions of the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
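A rough sketch of walking a hole mask to find the contiguous allocated regions follows. It assumes a 16-bit hole mask in which each bit covers 4 of the record's 64 inodes and bit 0 covers the lowest-numbered inodes; the names (`allocated_ranges`, `struct ino_range`) are hypothetical, and the ranges are expressed in inodes rather than in the filesystem blocks the real helper would free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INODES_PER_CHUNK        64
#define HOLEMASK_BITS           16
#define INODES_PER_HOLEMASK_BIT (INODES_PER_CHUNK / HOLEMASK_BITS)

/* One contiguous run of physically allocated inodes (hypothetical type). */
struct ino_range {
    unsigned int start;   /* first inode offset within the record */
    unsigned int count;   /* number of inodes in the run */
};

/*
 * Fill out[] with the allocated runs described by holemask (a set bit
 * marks a hole) and return how many runs were found.
 */
size_t allocated_ranges(uint16_t holemask, struct ino_range out[HOLEMASK_BITS])
{
    size_t n = 0;
    unsigned int bit = 0;

    assert(out != NULL);

    while (bit < HOLEMASK_BITS) {
        unsigned int first;

        if (holemask & (1u << bit)) {   /* hole: nothing to free here */
            bit++;
            continue;
        }

        /* Extend the run across consecutive allocated regions. */
        first = bit;
        while (bit < HOLEMASK_BITS && !(holemask & (1u << bit)))
            bit++;

        out[n].start = first * INODES_PER_HOLEMASK_BIT;
        out[n].count = (bit - first) * INODES_PER_HOLEMASK_BIT;
        n++;
    }
    return n;
}
```

A fully allocated chunk (hole mask 0) yields a single run of 64 inodes, which corresponds to the non-sparse path that keeps the existing calculation.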
2015-05-29 xfs: filter out sparse regions from individual inode allocation (Brian Foster)
Inode allocation from an existing record with free inodes traditionally selects the first inode available according to the ir_free mask. With sparse inode chunks, the ir_free mask could refer to an unallocated region. We must mask the unallocated regions out of ir_free before using it to select a free inode in the chunk. Update the xfs_inobt_first_free_inode() helper to find the first free inode available of the allocated regions of the inode chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
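The masking step can be illustrated with a small user-space sketch. The function name and the `allocmask` parameter (a 64-bit bitmap with one bit per physically allocated inode) are assumptions made for illustration; the kernel helper named in the commit works on an in-core record instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the index (0..63) of the first inode that is both marked free
 * in ir_free and physically allocated according to allocmask, or -1 if
 * there is none. Both parameters are illustrative.
 */
int first_free_allocated_inode(uint64_t ir_free, uint64_t allocmask)
{
    /* Drop free bits that refer to unallocated (sparse) regions. */
    uint64_t realfree = ir_free & allocmask;
    int i = 0;

    if (realfree == 0)
        return -1;

    /* Find the lowest set bit. */
    while ((realfree & 1) == 0) {
        realfree >>= 1;
        i++;
    }
    assert(i < 64);
    return i;
}
```

Without the mask, a record whose first 32 inodes are a hole would hand out inode 0 even though no such inode exists on disk.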
2015-05-29 xfs: randomly do sparse inode allocations in DEBUG mode (Brian Foster)
Sparse inode allocations generally only occur when full inode chunk allocation fails. This requires some level of filesystem space usage and fragmentation. For filesystems formatted with sparse inode chunks enabled, do random sparse inode chunk allocs when compiled in DEBUG mode to increase test coverage. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: allocate sparse inode chunks on full chunk allocation failure (Brian Foster)
xfs_ialloc_ag_alloc() makes several attempts to allocate a full inode chunk. If all else fails, reduce the allocation to the sparse length and alignment and attempt to allocate a sparse inode chunk. If sparse chunk allocation succeeds, check whether an inobt record already exists that can track the chunk. If so, inherit and update the existing record. Otherwise, insert a new record for the sparse chunk. Create helpers to align sparse chunk inode records and insert or update existing records in the inode btrees. The xfs_inobt_insert_sprec() helper implements the merge or update semantics required for sparse inode records with respect to both the inobt and finobt. To update the inobt, either insert a new record or merge with an existing record. To update the finobt, use the updated inobt record to either insert or replace an existing record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: helper to convert holemask to inode alloc. bitmap (Brian Foster)
The inobt record holemask field is a condensed data type designed to fit into the existing on-disk record and is zero based (allocated regions are set to 0, sparse regions are set to 1) to provide backwards compatibility. This makes the type somewhat complex for use in higher level inode manipulations such as individual inode allocation, etc. Rather than foist the complexity of dealing with this field to every bit of logic that requires inode granular information, create a helper to convert the holemask to an inode allocation bitmap. The inode allocation bitmap is inode granularity similar to the inobt record free mask and indicates which inodes of the chunk are physically allocated on disk, irrespective of whether the inode is considered allocated or free by the filesystem. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
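A minimal sketch of such a conversion, assuming a 16-bit hole mask where each bit covers 4 inodes and bit 0 maps to the lowest-numbered inodes of the record. The function name is hypothetical and this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_CHUNK        64
#define HOLEMASK_BITS           16
#define INODES_PER_HOLEMASK_BIT (INODES_PER_CHUNK / HOLEMASK_BITS)

/*
 * Expand a hole mask (set bit = hole) into a per-inode allocation
 * bitmap (set bit = inode physically allocated on disk).
 */
uint64_t holemask_to_allocmask(uint16_t holemask)
{
    /* A run of INODES_PER_HOLEMASK_BIT set bits, i.e. 0xF. */
    const uint64_t region = (1ULL << INODES_PER_HOLEMASK_BIT) - 1;
    uint64_t allocmask = 0;
    unsigned int bit;

    /* Sanity check on the assumed geometry. */
    assert(HOLEMASK_BITS * INODES_PER_HOLEMASK_BIT == INODES_PER_CHUNK);

    for (bit = 0; bit < HOLEMASK_BITS; bit++) {
        if (!(holemask & (1u << bit)))
            allocmask |= region << (bit * INODES_PER_HOLEMASK_BIT);
    }
    return allocmask;
}
```

Because the result has the same granularity as the free mask, callers can combine the two with a plain bitwise AND.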
2015-05-29 xfs: handle sparse inode chunks in icreate log recovery (Brian Foster)
Recovery of icreate transactions assumes hardcoded values for the inode count and chunk length. Sparse inode chunks are allocated in units of m_ialloc_min_blks. Update the icreate validity checks to allow for appropriately sized inode chunks and verify the inode count matches what is expected based on the extent length rather than assuming a hardcoded count. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: pass inode count through ordered icreate log item (Brian Foster)
v5 superblocks use an ordered log item for logging the initialization of inode chunks. The icreate log item is currently hardcoded to an inode count of 64 inodes. The agbno and extent length are used to initialize the inode chunk from log recovery. While an incorrect inode count does not lead to bad inode chunk initialization, we should pass the correct inode count such that log recovery has enough data to perform meaningful validity checks on the chunk. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: use actual inode count for sparse records in bulkstat/inumbers (Brian Foster)
The bulkstat and inumbers mechanisms make the assumption that inode records consist of a full 64 inode chunk in several places. For example, this is used to track how many inodes have been processed overall as well as to determine whether a record has allocated inodes that must be handled. This assumption is invalid for sparse inode records. While sparse inodes will be marked as free in the ir_free mask, they are not accounted as free in ir_freecount because they cannot be allocated. Therefore, ir_freecount may be less than 64 inodes in an inode record for which all physically allocated inodes are free (and in turn ir_freecount < 64 does not signify that the record has allocated inodes). The new in-core inobt record format includes the ir_count field. This holds the number of true, physical inodes tracked by the record. The in-core ir_count field is always valid as it is hardcoded to XFS_INODES_PER_CHUNK when sparse inodes is not enabled. Use ir_count to handle inode records correctly in bulkstat in a generic manner. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
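The accounting difference can be shown with a tiny sketch. The struct below is a cut-down, hypothetical stand-in for the in-core record that keeps only the two counters discussed above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced in-core record: only the fields used here. */
struct inobt_rec_incore {
    uint8_t ir_count;      /* physical inodes tracked by the record */
    uint8_t ir_freecount;  /* free inodes among those physical inodes */
};

/* Number of inodes in the record that are actually in use. */
unsigned int inodes_in_use(const struct inobt_rec_incore *rec)
{
    assert(rec->ir_freecount <= rec->ir_count);
    return (unsigned int)(rec->ir_count - rec->ir_freecount);
}
```

For a sparse record with 32 physical inodes that are all free, this returns 0, whereas subtracting ir_freecount from a hardcoded 64 would wrongly report 32 inodes in use.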
2015-05-29 xfs: introduce inode record hole mask for sparse inode chunks (Brian Foster)
The inode btrees track 64 inodes per record regardless of inode size. Thus, inode chunks on disk vary in size depending on the size of the inodes. This creates a contiguous allocation requirement for new inode chunks that can be difficult to satisfy on aged and fragmented (free space) filesystems. The inode record freecount currently uses 4 bytes on disk to track the free inode count. With a maximum freecount value of 64, only one byte is required. Convert the freecount field to a single byte and use two of the remaining three higher-order bytes for the hole mask field. Use the final leftover byte for the total count field. The hole mask field tracks holes in the chunks of physical space that the inode record refers to. This facilitates the sparse allocation of inode chunks when contiguous chunks are not available and allows the inode btrees to identify what portions of the chunk contain valid inodes. The total count field contains the total number of valid inodes referred to by the record. This can also be deduced from the hole mask. The count field provides clarity and redundancy for internal record verification. Note that neither of the new fields can be written to disk on filesystems without sparse inode support. Doing so writes to the high-order bytes of freecount and causes corruption from the perspective of older kernels. The on-disk inobt record data structure is updated with a union to distinguish between the original, "full" format and the new, "sparse" format. The conversion routines to get, insert and update records are updated to translate to and from the on-disk record accordingly such that freecount remains a 4-byte value on unsupported filesystems, yet the new fields of the in-core record are always valid with respect to the record. This means that higher level code can refer to the current in-core record format unconditionally and lower level code ensures that records are translated to/from disk according to the capabilities of the fs.
Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
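To make the layout concern concrete, here is a sketch that decodes the same four on-disk bytes in both ways. It assumes big-endian on-disk ordering and the field order hole mask, count, freecount within those bytes; the function and struct names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the "sparse" interpretation of the four bytes. */
struct sparse_fields {
    uint16_t holemask;
    uint8_t  count;
    uint8_t  freecount;
};

/* "Full" format: all four bytes form one big-endian freecount. */
uint32_t read_full_freecount(const uint8_t raw[4])
{
    assert(raw != NULL);
    return ((uint32_t)raw[0] << 24) | ((uint32_t)raw[1] << 16) |
           ((uint32_t)raw[2] << 8)  |  (uint32_t)raw[3];
}

/* "Sparse" format: two bytes of hole mask, one of count, one of freecount. */
struct sparse_fields read_sparse_fields(const uint8_t raw[4])
{
    struct sparse_fields f;

    assert(raw != NULL);
    f.holemask  = (uint16_t)(((unsigned int)raw[0] << 8) | raw[1]);
    f.count     = raw[2];
    f.freecount = raw[3];
    return f;
}
```

With the bytes `00 f0 30 0a`, the sparse view reads hole mask 0x00f0, count 48 and freecount 10, while the full view reads a freecount of 0x00f0300a, far above the maximum of 64. That is the corruption an older kernel would see if the new fields were written on a filesystem without the feature.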
2015-05-29 xfs: add fs geometry bit for sparse inode chunks (Brian Foster)
Define an fs geometry bit for sparse inode chunks such that the characteristic of the fs can be identified by userspace. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: sparse inode chunks feature helpers and mount requirements (Brian Foster)
The sparse inode chunks feature uses a helper function to enable the allocation of sparse inode chunks. The incompatible feature bit is set on disk at mkfs time to prevent mounting on unsupported kernels. Also, enforce the inode alignment requirements for sparse inode chunks at mount time. When the feature is enabled, the alignment of full inode chunks (and of all inode records) is increased from cluster size to inode chunk size. Sparse inode alignment must match the cluster size of the fs. Both superblock alignment fields are set as such by mkfs when sparse inode support is enabled. Finally, warn that sparse inode chunks is an experimental feature until further notice. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: use sparse chunk alignment for min. inode allocation requirement (Brian Foster)
xfs_ialloc_ag_select() iterates through the allocation groups looking for free inodes or free space to determine whether to allow an inode allocation to proceed. If no free inodes are available, it assumes that an AG must have an extent longer than mp->m_ialloc_blks. Sparse inode chunk support currently allows for allocations smaller than the traditional inode chunk size specified in m_ialloc_blks. The current minimum sparse allocation is set in the superblock sb_spino_align field at mkfs time. Create a new m_ialloc_min_blks field in xfs_mount and use this to represent the minimum supported allocation size for inode chunks. Initialize m_ialloc_min_blks at mount time based on whether sparse inodes are supported. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: add sparse inode chunk alignment superblock field (Brian Foster)
Add sb_spino_align to the superblock to specify sparse inode chunk alignment. This also currently represents the minimum allowable sparse chunk allocation size. Signed-off-by: Brian Foster <bfoster@redhat.com>
2015-05-29 xfs: support min/max agbno args in block allocator (Brian Foster)
The block allocator supports various arguments to tweak block allocation behavior and set allocation requirements. The sparse inode chunk feature introduces a new requirement not supported by the current arguments. Sparse inode allocations must convert or merge into an inode record that describes a fixed length chunk (64 inodes x inodesize). Full inode chunk allocations by definition always result in valid inode records. Sparse chunk allocations are smaller and the associated records can refer to blocks not owned by the inode chunk. This model can result in invalid inode records in certain cases. For example, if a sparse allocation occurs near the start of an AG, the aligned inode record for that chunk might refer to agbno 0. If an allocation occurs towards the end of the AG and the AG size is not aligned, the inode record could refer to blocks beyond the end of the AG. While neither of these scenarios directly result in corruption, they both insert invalid inode records and at minimum cause repair to complain, are unlikely to merge into full chunks over time and set land mines for other areas of code. To guarantee sparse inode chunk allocation creates valid inode records, support the ability to specify an agbno range limit for XFS_ALLOCTYPE_NEAR_BNO block allocations. The min/max agbno's are specified in the allocation arguments and limit the block allocation algorithms to that range. The starting 'agbno' hint is clamped to the range if the specified agbno is out of range. If no sufficient extent is available within the range, the allocation fails. For backwards compatibility, the min/max fields can be initialized to 0 to disable range limiting (e.g., equivalent to min=0,max=agsize). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
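The clamping of the starting hint can be sketched as follows. The function and its parameters are hypothetical; the sketch only covers the hint adjustment and the "0/0 disables limiting" convention described above, not the allocation algorithms themselves.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clamp an agbno hint into [min_agbno, max_agbno]. Passing 0 for both
 * limits disables range limiting, which is treated here as min=0 and
 * max=agsize. All parameter names are illustrative.
 */
uint32_t clamp_agbno_hint(uint32_t agbno, uint32_t min_agbno,
                          uint32_t max_agbno, uint32_t agsize)
{
    if (min_agbno == 0 && max_agbno == 0)
        max_agbno = agsize;

    assert(min_agbno <= max_agbno);

    if (agbno < min_agbno)
        return min_agbno;
    if (agbno > max_agbno)
        return max_agbno;
    return agbno;
}
```

Treating zeroed limits as "no limit" means existing callers that never set the new fields keep their previous behavior.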
2015-05-29 xfs: update free inode record logic to support sparse inode records (Brian Foster)
xfs_difree_inobt() uses logic in a couple places that assume inobt records refer to fully allocated chunks. Specifically, the use of mp->m_ialloc_inos can cause problems for inode chunks that are sparsely allocated. Sparse inode chunks can, by definition, define a smaller number of inodes than a full inode chunk. Fix the logic that determines whether an inode record should be removed from the inobt to use the ir_free mask rather than ir_freecount. Fix the agi counters modification to use ir_freecount to add the actual number of inodes freed rather than assuming a full inode chunk. Also make sure that we preserve the behavior to not remove inode chunks if the block size is large enough for multiple inode chunks (e.g., bsize=64k, isize=512). This behavior was previously implicit in that in such configurations, ir.freecount of a single record never matches m_ialloc_inos. Hence, add some comments as well. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-29 xfs: create individual inode alloc. helper (Brian Foster)
Inode allocation from sparse inode records must filter the ir_free mask against ir_holemask. In preparation for this requirement, create a helper to allocate an individual inode from an inode record. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-05-03 Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (Linus Torvalds)
Pull ext4 fixes from Ted Ts'o: "Some miscellaneous bug fixes and some final on-disk and ABI changes for ext4 encryption which provide better security and performance"
* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
  ext4: fix growing of tiny filesystems
  ext4: move check under lock scope to close a race.
  ext4: fix data corruption caused by unwritten and delayed extents
  ext4 crypto: remove duplicated encryption mode definitions
  ext4 crypto: do not select from EXT4_FS_ENCRYPTION
  ext4 crypto: add padding to filenames before encrypting
  ext4 crypto: simplify and speed up filename encryption
2015-05-02 ext4: fix growing of tiny filesystems (Jan Kara)
The estimate of necessary transaction credits in ext4_flex_group_add() is too pessimistic. It reserves credit for sb, resize inode, and resize inode dindirect block for each group added in a flex group although they are always the same block and thus it is enough to account for them only once. Also, the number of modified GDT blocks is overestimated since we fit EXT4_DESC_PER_BLOCK(sb) descriptors in one block. Make the estimation more precise. That reduces the number of requested credits enough that we can grow a 20 MB filesystem (which has a 1 MB journal, 79 reserved GDT blocks, and flex group size 16 by default). Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
2015-05-02 ext4: move check under lock scope to close a race. (Davide Italiano)
fallocate() checks that the file is extent-based and returns EOPNOTSUPP if it is not. Other tasks can convert the file between indirect and extent formats, so it is only safe to check after grabbing the inode mutex. Signed-off-by: Davide Italiano <dccitaliano@gmail.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-05-02 ext4: fix data corruption caused by unwritten and delayed extents (Lukas Czerner)
Currently it is possible to lose a whole file system block worth of data when we hit a specific interaction between unwritten and delayed extents in the extent status tree. The problem is that when we insert a delayed extent into the extent status tree, the only way to get rid of it is to write out the delayed buffer. However, there is a limitation in the extent status tree implementation: when inserting an unwritten extent, should there be even a single delayed block, the whole unwritten extent would be marked as delayed. At this point, there is no way to get rid of the delayed extents, because there are no delayed buffers to write out. So when we write into said unwritten extent we will convert it to written, but it still remains delayed. When we try to write into that block later, ext4_da_map_blocks() will set the buffer new and delayed and map it to an invalid block, which causes the rest of the block to be zeroed, losing already written data. For now we can fix this by simply not allowing the delayed status to be set on a written extent in the extent status tree. Also add WARN_ON() to make sure that we notice if this happens in the future. This problem can be easily reproduced by running the following xfs_io commands:
    xfs_io -f -c "pwrite -S 0xaa 4096 2048" \
              -c "falloc 0 131072" \
              -c "pwrite -S 0xbb 65536 2048" \
              -c "fsync" /mnt/test/fff
    echo 3 > /proc/sys/vm/drop_caches
    xfs_io -c "pwrite -S 0xdd 67584 2048" /mnt/test/fff
This can theoretically also be reproduced at random by running fsx, but that is not very reliable; on machines with a bigger page size (like ppc) it can be seen more often (especially with xfstest generic/127).
Signed-off-by: Lukas Czerner <lczerner@redhat.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu> Cc: stable@vger.kernel.org
2015-05-02 ext4 crypto: remove duplicated encryption mode definitions (Chanho Park)
This patch removes duplicated encryption modes which were already in ext4.h. They were duplicated from commit 3edc18d and commit f542fb. Cc: Theodore Ts'o <tytso@mit.edu> Cc: Michael Halcrow <mhalcrow@google.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Signed-off-by: Chanho Park <chanho61.park@samsung.com> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-02 ext4 crypto: do not select from EXT4_FS_ENCRYPTION (Herbert Xu)
This patch adds a tristate EXT4_ENCRYPTION to do the selections for EXT4_FS_ENCRYPTION because selecting from a bool causes all the selected options to be built-in, even if EXT4 itself is a module. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-01 ext4 crypto: add padding to filenames before encrypting (Theodore Ts'o)
This obscures the length of the filenames, to decrease the amount of information leakage. By default, we pad the filenames to the next 4 byte boundaries. This costs nothing, since the directory entries are aligned to 4 byte boundaries anyway. Filenames can also be padded to 8, 16, or 32 bytes, which will consume more directory space. Change-Id: Ibb7a0fb76d2c48e2061240a709358ff40b14f322 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
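The rounding involved can be sketched in a few lines. The function name is hypothetical, and the sketch only shows the round-up to a 4, 8, 16 or 32 byte boundary; it ignores any minimum length that the cipher itself may impose.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Round a filename length up to the next multiple of pad, where pad is
 * one of the boundaries mentioned above (4, 8, 16 or 32 bytes).
 */
size_t padded_name_len(size_t len, size_t pad)
{
    assert(pad == 4 || pad == 8 || pad == 16 || pad == 32);

    /* pad is a power of two, so masking rounds up without a division. */
    return (len + pad - 1) & ~(pad - 1);
}
```

With 4-byte padding, names of 5, 6, 7 and 8 bytes all occupy 8 bytes, so an observer cannot tell them apart by length; larger boundaries hide more at the cost of directory space.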
2015-05-01 ext4 crypto: simplify and speed up filename encryption (Theodore Ts'o)
Avoid using SHA-1 when calculating the user-visible filename when the encryption key is available, and avoid decrypting lots of filenames when searching for a directory entry in a directory block. Change-Id: If4655f144784978ba0305b597bfa1c8d7bb69e63 Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2015-05-01 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs (Linus Torvalds)
Pull btrfs fixes from Chris Mason: "A few more btrfs fixes. These range from corners Filipe found in the new free space cache writeback to a grab bag of fixes from the list"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent
  Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode.
  btrfs: unlock i_mutex after attempting to delete subvolume during send
  btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache
  btrfs: fix race on ENOMEM in alloc_extent_buffer
  btrfs: handle ENOMEM in btrfs_alloc_tree_block
  Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole
  Btrfs: don't check for delalloc_bytes in cache_save_setup
  Btrfs: fix deadlock when starting writeback of bg caches
  Btrfs: fix race between start dirty bg cache writeout and bg deletion
2015-04-29 Btrfs: btrfs_release_extent_buffer_page didn't free pages of dummy extent (Forrest Liu)
btrfs_release_extent_buffer_page() can't properly handle a dummy extent allocated by btrfs_clone_extent_buffer(). That is because the reference count of pages allocated by btrfs_clone_extent_buffer() is 2: 1 from alloc_page(), and another from attach_extent_buffer_page(). Repeatedly running the following command can be used to check for this memory leak:
    btrfs inspect-internal inode-resolve 256 /mnt/btrfs
Signed-off-by: Chien-Kuan Yeh <ckya@synology.com> Signed-off-by: Forrest Liu <forrestl@synology.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Tested-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs (Linus Torvalds)
Pull btrfs fixes from Chris Mason: "Filipe hit two problems in my block group cache patches. We finalized the fixes last week and ran through more tests"
* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
  Btrfs: prevent list corruption during free space cache processing
  Btrfs: fix inode cache writeout
2015-04-26 Merge tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs (Linus Torvalds)
Pull NFS client updates from Trond Myklebust: "Another set of mainly bugfixes and a couple of cleanups. No new functionality in this round. Highlights include:
Stable patches:
 - Fix a regression in /proc/self/mountstats
 - Fix the pNFS flexfiles O_DIRECT support
 - Fix high load average due to callback thread sleeping
Bugfixes:
 - Various patches to fix the pNFS layoutcommit support
 - Do not cache pNFS deviceids unless server notifications are enabled
 - Fix a SUNRPC transport reconnection regression
 - make debugfs file creation failure non-fatal in SUNRPC
 - Another fix for circular directory warnings on NFSv4 "junctioned" mountpoints
 - Fix locking around NFSv4.2 fallocate() support
 - Truncating NFSv4 file opens should also sync O_DIRECT writes
 - Prevent infinite loop in rpcrdma_ep_create()
Features:
 - Various improvements to the RDMA transport code's handling of memory registration
 - Various code cleanups"
* tag 'nfs-for-4.1-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (55 commits)
  fs/nfs: fix new compiler warning about boolean in switch
  nfs: Remove unneeded casts in nfs
  NFS: Don't attempt to decode missing directory entries
  Revert "nfs: replace nfs_add_stats with nfs_inc_stats when add one"
  NFS: Rename idmap.c to nfs4idmap.c
  NFS: Move nfs_idmap.h into fs/nfs/
  NFS: Remove CONFIG_NFS_V4 checks from nfs_idmap.h
  NFS: Add a stub for GETDEVICELIST
  nfs: remove WARN_ON_ONCE from nfs_direct_good_bytes
  nfs: fix DIO good bytes calculation
  nfs: Fetch MOUNTED_ON_FILEID when updating an inode
  sunrpc: make debugfs file creation failure non-fatal
  nfs: fix high load average due to callback thread sleeping
  NFS: Reduce time spent holding the i_mutex during fallocate()
  NFS: Don't zap caches on fallocate()
  xprtrdma: Make rpcrdma_{un}map_one() into inline functions
  xprtrdma: Handle non-SEND completions via a callout
  xprtrdma: Add "open" memreg op
  xprtrdma: Add "destroy MRs" memreg op
  xprtrdma: Add "reset MRs" memreg op
  ...
2015-04-26 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs (Linus Torvalds)
Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. (Yang Dongsheng)
We need to fill the inode when we find a node for it in delayed_nodes_tree. But currently we do not fill ->last_trans, which causes xfstest generic/311 to fail. The scenario of test 311 is shown below:
Problem:
    (1). test_fd = open(fname, O_RDWR|O_DIRECT)
    (2). pwrite(test_fd, buf, 4096, 0)
    (3). close(test_fd)
    (4). drop_all_caches()  <-------- "echo 3 > /proc/sys/vm/drop_caches"
    (5). test_fd = open(fname, O_RDWR|O_DIRECT)
    (6). fsync(test_fd);  <-------- we did not get the correct log entry for the file
Reason: When we re-open this file in (5), we find a node in delayed_nodes_tree and fill the inode we are looking up with its information. But ->last_trans is not filled, so fsync() checks ->last_trans, finds that it is 0, and concludes that this inode is already in our tree, which is committed, so it does not record the extents for it.
Fix: This patch fills ->last_trans properly and sets the runtime_flags if needed in this situation. Then we get the log entries we expect after (6), and generic/311 passes.
Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Reviewed-by: Miao Xie <miaoxie@huawei.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 btrfs: unlock i_mutex after attempting to delete subvolume during send (Omar Sandoval)
Whenever the check for a send in progress introduced in commit 521e0546c970 (btrfs: protect snapshots from deleting during send) is hit, we return without unlocking inode->i_mutex. This is easy to see with lockdep enabled:
    [ +0.000059] ================================================
    [ +0.000028] [ BUG: lock held when returning to user space! ]
    [ +0.000029] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
    [ +0.000026] ------------------------------------------------
    [ +0.000029] btrfs/211 is leaving the kernel with locks still held!
    [ +0.000029] 1 lock held by btrfs/211:
    [ +0.000023] #0: (&type->i_mutex_dir_key){+.+.+.}, at: [<ffffffff8135b8df>] btrfs_ioctl_snap_destroy+0x2df/0x7a0
Make sure we unlock it in the error path.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.cz> Cc: stable@vger.kernel.org Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26 btrfs: check io_ctl_prepare_pages return in __btrfs_write_out_cache (Omar Sandoval)
If io_ctl_prepare_pages fails, the pages in io_ctl.pages are not valid. When we try to access them later, things will blow up in various ways. Also fix the comment about the return value, which is an errno on error, not -1, and update the cases where it was not. Reviewed-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26btrfs: fix race on ENOMEM in alloc_extent_bufferOmar Sandoval
Consider the following interleaving of overlapping calls to alloc_extent_buffer:

Call 1:
- Successfully allocates a few pages with find_or_create_page
- find_or_create_page fails, goto free_eb
- Unlocks the allocated pages

Call 2:
- Calls find_or_create_page and gets a page in call 1's extent_buffer
- Finds that the page is already associated with an extent_buffer
- Grabs a reference to the half-written extent_buffer and calls mark_extent_buffer_accessed on it

mark_extent_buffer_accessed will then try to call mark_page_accessed on a null page and panic.

The fix is to decrement the reference count on the half-written extent_buffer before unlocking the pages so call 2 won't use it. We should also set exists = NULL in the case that we don't use exists to avoid accidentally returning a freed extent_buffer in an error case.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Reviewed-by: David Sterba <dsterba@suse.cz>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26btrfs: handle ENOMEM in btrfs_alloc_tree_blockOmar Sandoval
This is one of the first places to give out when memory is tight. Handle it properly rather than with a BUG_ON. Also fix the comment about the return value, which is an ERR_PTR, not NULL, on error. Signed-off-by: Omar Sandoval <osandov@osandov.com> Reviewed-by: David Sterba <dsterba@suse.cz> Signed-off-by: Chris Mason <clm@fb.com>
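For readers unfamiliar with the ERR_PTR convention mentioned above, here is a userspace re-implementation for illustration only. The helper names mirror the kernel's `ERR_PTR`, `PTR_ERR` and `IS_ERR`, but these are local definitions written for this sketch, and `tree_block_model` and `alloc_tree_block_model` are hypothetical stand-ins rather than btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace re-implementations of the kernel helpers, for illustration:
 * a negative errno is encoded in the top 4095 values of the pointer range. */
static inline void *ERR_PTR(long error)
{
    return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical stand-in for a tree block. */
struct tree_block_model {
    int dummy;
};

/* Report allocation failure to the caller instead of asserting on it. */
static struct tree_block_model *alloc_tree_block_model(int simulate_enomem)
{
    struct tree_block_model *block;

    if (simulate_enomem)
        return ERR_PTR(-ENOMEM);

    block = malloc(sizeof(*block));
    if (!block)
        return ERR_PTR(-ENOMEM);

    block->dummy = 0;
    return block;
}
```

A caller then tests the result with `IS_ERR()` and recovers the errno with `PTR_ERR()`; checking the result against NULL would miss the failure, which is why the comment about the return value matters.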
2015-04-26Btrfs: fix find_free_dev_extent() malfunction in case device tree has holeForrest Liu
If the device tree has a hole, find_free_dev_extent() cannot find an available address properly.

The problem can be reproduced with the following script.

    mntpath=/btrfs
    loopdev=/dev/loop0
    filepath=/home/forrest/image

    umount $mntpath
    losetup -d $loopdev
    truncate --size 100g $filepath
    losetup $loopdev $filepath
    mkfs.btrfs -f $loopdev
    mount $loopdev $mntpath

    # make device tree with one big hole
    for i in `seq 1 1 100`; do
      fallocate -l 1g $mntpath/$i
    done
    sync

    for i in `seq 1 1 95`; do
      rm $mntpath/$i
    done
    sync

    # wait cleaner thread remove unused block group
    sleep 300

    fallocate -l 1g $mntpath/aaa

    # failed to allocate new chunk
    fallocate -l 1g $mntpath/bbb

The above script creates a device tree with one big hole, and only one chunk can be allocated in a transaction, so allocating a new chunk for $mntpath/bbb fails.

    item 8 key (1 DEV_EXTENT 2185232384) itemoff 15859 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 106292051968 length 1073741824
    item 9 key (1 DEV_EXTENT 104190705664) itemoff 15811 itemsize 48
        dev extent chunk_tree 3
        chunk objectid 256 chunk offset 103108575232 length 1073741824

Signed-off-by: Forrest Liu <forrestl@synology.com>
Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>
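To make the notion of a hole concrete, the sketch below scans a sorted array of allocated extents for the first gap of a requested size. It is an illustration of the general idea only and does not reproduce find_free_dev_extent() or its bug; `dev_extent_model` and `find_hole_model` are hypothetical names, and the array-based layout is an assumption (the real code walks a btree).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DEV_EXTENT item: [start, start + len). */
struct dev_extent_model {
    uint64_t start;
    uint64_t len;
};

/* Scan extents sorted by start offset and return the first hole of at
 * least num_bytes at or after search_start. Returns 0 and stores the hole
 * start in *out on success, or -1 if no hole is large enough. */
static int find_hole_model(const struct dev_extent_model *ext, size_t n,
                           uint64_t search_start, uint64_t dev_size,
                           uint64_t num_bytes, uint64_t *out)
{
    uint64_t cursor = search_start;

    for (size_t i = 0; i < n; i++) {
        uint64_t end = ext[i].start + ext[i].len;

        /* Gap between the cursor and the next allocated extent. */
        if (ext[i].start > cursor && ext[i].start - cursor >= num_bytes) {
            *out = cursor;
            return 0;
        }
        if (end > cursor)
            cursor = end;
    }

    /* Tail of the device after the last extent. */
    if (dev_size > cursor && dev_size - cursor >= num_bytes) {
        *out = cursor;
        return 0;
    }
    return -1;
}
```

Fed the two extents from the dump above (starting at 2185232384 and 104190705664, each 1073741824 bytes long) and a search starting at the first extent, the model reports the hole beginning at 3258974208, directly after item 8.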
2015-04-26Btrfs: don't check for delalloc_bytes in cache_save_setupChris Mason
Now that we're doing free space cache writeback outside the critical section in the commit, there is a bigger window for delalloc_bytes to be added after a cache has been written. find_free_extent may do this without putting the block group back into the dirty list, and also without a transaction running. Checking for delalloc_bytes in cache_save_setup means we might leave the cache marked as written without invalidating it. Consistency checks during mount will toss the cache, but it's better to get rid of the check in cache_save_setup and let it get invalidated by the checks already done during cache write out. Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26Btrfs: fix deadlock when starting writeback of bg cachesFilipe Manana
While starting the writes of the dirty block group caches, if we don't find a block group item in the extent tree we were leaving without releasing our path, running delayed references and then looping again to process any new dirty block groups. However this second iteration of the loop could cause a deadlock because it tries to lock some other extent tree node/leaf which another task already locked and it's blocked because it's waiting for a lock on some node/leaf that is in our path that was not released before. We could also deadlock when running the delayed references - as we could end up trying to lock the same nodes/leafs that we have in our local path (with a different lock type). Got into such case when running xfstests: [20892.242791] ------------[ cut here ]------------ [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [20892.245874] BTRFS: Transaction aborted (error -2) (...) [20892.269378] Call Trace: [20892.269915] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [20892.271097] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [20892.272173] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [20892.273386] [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [20892.274857] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [20892.275851] [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [20892.277341] [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs] [20892.278628] [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs] [20892.280191] [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] (...) 
[20892.291316] ---[ end trace 597f77e664245373 ]--- [20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry [20892.297390] BTRFS info (device sdg): forced readonly [20892.298222] ------------[ cut here ]------------ [20892.299190] WARNING: CPU: 0 PID: 13299 at fs/btrfs/ctree.c:2683 btrfs_search_slot+0x7e/0x7d2 [btrfs]() (...) [20892.326253] Call Trace: [20892.326904] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [20892.329503] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [20892.330815] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [20892.332556] [<ffffffffa0510b73>] ? btrfs_search_slot+0x7e/0x7d2 [btrfs] [20892.333955] [<ffffffff81045f62>] warn_slowpath_null+0x1a/0x1c [20892.335562] [<ffffffffa0510b73>] btrfs_search_slot+0x7e/0x7d2 [btrfs] [20892.336849] [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc [20892.338222] [<ffffffffa051ad52>] ? cache_save_setup+0x43/0x2a5 [btrfs] [20892.339823] [<ffffffffa051ad66>] ? cache_save_setup+0x57/0x2a5 [btrfs] [20892.341275] [<ffffffff814351a4>] ? _raw_spin_unlock+0x32/0x46 [20892.342810] [<ffffffffa0515de7>] write_one_cache_group+0x3f/0xaf [btrfs] [20892.344184] [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs] [20892.347162] [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] (...) [20892.361015] ---[ end trace 597f77e664245374 ]--- [21120.688097] INFO: task kworker/u8:17:29854 blocked for more than 120 seconds. [21120.689881] Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [21120.691384] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. (...) [21120.703696] Call Trace: [21120.704310] [<ffffffff8143107e>] schedule+0x74/0x83 [21120.705490] [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs] [21120.706757] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [21120.708156] [<ffffffffa054ac1e>] lock_extent_buffer_for_io+0x3e/0x194 [btrfs] [21120.709892] [<ffffffffa054bb86>] ? 
btree_write_cache_pages+0x273/0x385 [btrfs] [21120.711605] [<ffffffffa054bc42>] btree_write_cache_pages+0x32f/0x385 [btrfs] [21120.723440] [<ffffffffa0527552>] btree_writepages+0x23/0x5c [btrfs] [21120.724943] [<ffffffff8110c4c8>] do_writepages+0x23/0x2c [21120.726008] [<ffffffff81176dde>] __writeback_single_inode+0x73/0x2fa [21120.727230] [<ffffffff8117714a>] ? writeback_sb_inodes+0xe5/0x38b [21120.728526] [<ffffffff811771fb>] ? writeback_sb_inodes+0x196/0x38b [21120.729701] [<ffffffff8117726a>] writeback_sb_inodes+0x205/0x38b (...) [21120.747853] INFO: task btrfs:13282 blocked for more than 120 seconds. [21120.749459] Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [21120.751137] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. (...) [21120.768457] Call Trace: [21120.769039] [<ffffffff8143107e>] schedule+0x74/0x83 [21120.770107] [<ffffffffa052f25c>] btrfs_commit_transaction+0x315/0x9c9 [btrfs] [21120.771558] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [21120.773659] [<ffffffffa056fd8c>] prepare_to_relocate+0xcb/0xd2 [btrfs] [21120.776257] [<ffffffffa05741da>] relocate_block_group+0x44/0x4a9 [btrfs] [21120.777755] [<ffffffffa05747a0>] ? btrfs_relocate_block_group+0x161/0x288 [btrfs] [21120.779459] [<ffffffffa05747a8>] btrfs_relocate_block_group+0x169/0x288 [btrfs] [21120.781153] [<ffffffffa0550403>] btrfs_relocate_chunk.isra.29+0x3e/0xa7 [btrfs] [21120.783918] [<ffffffffa05518fd>] btrfs_balance+0xaa4/0xc52 [btrfs] [21120.785436] [<ffffffff8114306e>] ? cpu_cache_get.isra.39+0xe/0x1f [21120.786434] [<ffffffffa0559252>] btrfs_ioctl_balance+0x23f/0x2b0 [btrfs] (...) [21120.889251] INFO: task fsstress:13288 blocked for more than 120 seconds. [21120.890526] Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [21120.891773] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. (...) 
[21120.899960] Call Trace: [21120.900743] [<ffffffff8143107e>] schedule+0x74/0x83 [21120.903004] [<ffffffffa055f025>] btrfs_tree_lock+0xd7/0x236 [btrfs] [21120.904383] [<ffffffff81075cd6>] ? signal_pending_state+0x31/0x31 [21120.905608] [<ffffffffa051125b>] btrfs_search_slot+0x766/0x7d2 [btrfs] [21120.906812] [<ffffffff8114290e>] ? virt_to_head_page+0x9/0x2c [21120.907874] [<ffffffff81144b7f>] ? cache_alloc_debugcheck_after.isra.42+0x16c/0x1cb [21120.909551] [<ffffffffa05124e0>] btrfs_insert_empty_items+0x5d/0xa8 [btrfs] [21120.910914] [<ffffffffa0512585>] btrfs_insert_item+0x5a/0xa5 [btrfs] [21120.912181] [<ffffffffa0520271>] ? btrfs_create_pending_block_groups+0x96/0x130 [btrfs] [21120.913784] [<ffffffffa052028a>] btrfs_create_pending_block_groups+0xaf/0x130 [btrfs] [21120.915374] [<ffffffffa052ffc2>] __btrfs_end_transaction+0x84/0x366 [btrfs] [21120.916735] [<ffffffffa05302b4>] btrfs_end_transaction+0x10/0x12 [btrfs] [21120.917996] [<ffffffffa051ab26>] btrfs_check_data_free_space+0x11f/0x27c [btrfs] [21120.919478] [<ffffffffa051ba25>] btrfs_delalloc_reserve_space+0x1e/0x51 [btrfs] [21120.921226] [<ffffffffa05382f2>] btrfs_truncate_page+0x85/0x2c4 [btrfs] [21120.923121] [<ffffffffa0538572>] btrfs_cont_expand+0x41/0x3ef [btrfs] [21120.924449] [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [21120.926602] [<ffffffff8107b024>] ? arch_local_irq_save+0x9/0xc [21120.927769] [<ffffffffa0541091>] ? btrfs_file_write_iter+0x19a/0x431 [btrfs] [21120.929324] [<ffffffffa05410a0>] ? btrfs_file_write_iter+0x1a9/0x431 [btrfs] [21120.930723] [<ffffffffa05410d9>] btrfs_file_write_iter+0x1e2/0x431 [btrfs] [21120.931897] [<ffffffff81067d85>] ? get_parent_ip+0xe/0x3e [21120.934446] [<ffffffff811534c3>] new_sync_write+0x7c/0xa0 [21120.935528] [<ffffffff81153b58>] vfs_write+0xb2/0x117 (...) 
Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit") Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-26Btrfs: fix race between start dirty bg cache writeout and bg deletionFilipe Manana
While running xfstests I ran into the following: [20892.242791] ------------[ cut here ]------------ [20892.243776] WARNING: CPU: 0 PID: 13299 at fs/btrfs/super.c:260 __btrfs_abort_transaction+0x52/0x114 [btrfs]() [20892.245874] BTRFS: Transaction aborted (error -2) [20892.247329] Modules linked in: btrfs dm_snapshot dm_bufio dm_flakey dm_mod crc32c_generic xor raid6_pq nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc loop fuse$ [20892.258488] CPU: 0 PID: 13299 Comm: fsstress Tainted: G W 4.0.0-rc5-btrfs-next-9+ #2 [20892.262011] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 [20892.264738] 0000000000000009 ffff880427f8bc18 ffffffff8142fa46 ffffffff8108b6a2 [20892.266244] ffff880427f8bc68 ffff880427f8bc58 ffffffff81045ea5 ffff880427f8bc48 [20892.267761] ffffffffa0509a6d 00000000fffffffe ffff8803545d6f40 ffffffffa05a15a0 [20892.269378] Call Trace: [20892.269915] [<ffffffff8142fa46>] dump_stack+0x4f/0x7b [20892.271097] [<ffffffff8108b6a2>] ? console_unlock+0x361/0x3ad [20892.272173] [<ffffffff81045ea5>] warn_slowpath_common+0xa1/0xbb [20892.273386] [<ffffffffa0509a6d>] ? __btrfs_abort_transaction+0x52/0x114 [btrfs] [20892.274857] [<ffffffff81045f05>] warn_slowpath_fmt+0x46/0x48 [20892.275851] [<ffffffffa0509a6d>] __btrfs_abort_transaction+0x52/0x114 [btrfs] [20892.277341] [<ffffffffa0515e10>] write_one_cache_group+0x68/0xaf [btrfs] [20892.278628] [<ffffffffa052088a>] btrfs_start_dirty_block_groups+0x18d/0x29b [btrfs] [20892.280191] [<ffffffffa052f077>] btrfs_commit_transaction+0x130/0x9c9 [btrfs] [20892.281781] [<ffffffff8107d33d>] ? trace_hardirqs_on+0xd/0xf [20892.282873] [<ffffffffa054163b>] btrfs_sync_file+0x313/0x387 [btrfs] [20892.284111] [<ffffffff8117acad>] vfs_fsync_range+0x95/0xa4 [20892.285203] [<ffffffff810e603f>] ? time_hardirqs_on+0x15/0x28 [20892.286290] [<ffffffff8123960b>] ? 
trace_hardirqs_on_thunk+0x3a/0x3f
[20892.287469] [<ffffffff8117acd8>] vfs_fsync+0x1c/0x1e
[20892.288412] [<ffffffff8117ae54>] do_fsync+0x34/0x4e
[20892.289348] [<ffffffff8117b07c>] SyS_fsync+0x10/0x14
[20892.290255] [<ffffffff81435b32>] system_call_fastpath+0x12/0x17
[20892.291316] ---[ end trace 597f77e664245373 ]---
[20892.293955] BTRFS: error (device sdg) in write_one_cache_group:3184: errno=-2 No such entry
[20892.297390] BTRFS info (device sdg): forced readonly

This happens because in btrfs_start_dirty_block_groups() we splice the transaction's list of dirty block groups into a local list and then we keep extracting the first element of the list without holding the cache_write_mutex mutex. This means that before we acquire that mutex the first block group on the list might be removed by a concurrent task running btrfs_remove_block_group(). So make sure we extract the first element (and test the list emptiness) while holding that mutex.

Fixes: 1bbc621ef284 ("Btrfs: allow block group cache writeout outside critical section in commit")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>
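The pattern described, testing for emptiness and taking the first element inside one critical section, can be sketched as follows. This is a single-threaded userspace model with hypothetical names (`bg_model`, `dirty_list_model`, `pop_first_locked`) and a boolean standing in for cache_write_mutex. It is not the btrfs code, and it releases the lock right after the pop as a simplification; how long the real code holds the mutex is not something this sketch claims to show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dirty block group on a singly linked list. */
struct bg_model {
    int id;
    struct bg_model *next;
};

struct dirty_list_model {
    bool mutex_held;        /* toy stand-in for cache_write_mutex */
    struct bg_model *head;
};

static void model_lock(struct dirty_list_model *l)
{
    assert(!l->mutex_held);
    l->mutex_held = true;
}

static void model_unlock(struct dirty_list_model *l)
{
    assert(l->mutex_held);
    l->mutex_held = false;
}

/* The emptiness test and the removal happen under the same lock hold,
 * so a concurrent remover cannot take the head in between. */
static struct bg_model *pop_first_locked(struct dirty_list_model *l)
{
    struct bg_model *first;

    model_lock(l);
    first = l->head;
    if (first) {
        l->head = first->next;
        first->next = NULL;
    }
    model_unlock(l);
    return first;
}
```

A caller loops on `pop_first_locked()` until it returns NULL, rather than checking for emptiness first and locking afterwards.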
2015-04-24RCU pathwalk breakage when running into a symlink overmounting somethingAl Viro
Calling unlazy_walk() in walk_component() and do_last() when we find a symlink that needs to be followed doesn't acquire a reference to vfsmount. That's fine when the symlink is on the same vfsmount as the parent directory (which is almost always the case), but it's not always true - one _can_ manage to bind a symlink on top of something. And in such cases we end up with excessive mntput(). Cc: stable@vger.kernel.org # since 2.6.39 Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24direct-io: only inc/dec inode->i_dio_count for file systemsJens Axboe
do_blockdev_direct_IO() increments and decrements the inode ->i_dio_count for each IO operation. It does this to protect against truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared state between applications accessing the device for O_DIRECT, and it presents a scaling wall for that. In my testing, as much as 30% of system time is spent incrementing and decrementing this value. A mixed read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with better latencies too.

Before:

clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]

After:

clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]

In other setups, Robert Elliott reported seeing good performance improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flag, DIO_SKIP_DIO_COUNT, which tells do_blockdev_direct_IO() that it need not worry about incrementing or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
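A flag-gated counter of this kind might look roughly like the userspace sketch below. The flag name DIO_SKIP_DIO_COUNT is taken from the commit message, but its numeric value here is an assumption, and `inode_model` and `direct_io_model` are hypothetical names; C11 atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>

/* Flag name from the commit message; the value is assumed for this sketch. */
#define DIO_SKIP_DIO_COUNT 0x08

/* Hypothetical stand-in for the inode field that all submitters share. */
struct inode_model {
    atomic_int i_dio_count;
};

/* Returns how many atomic operations touched the shared counter, so the
 * saving for block-device callers is visible. */
static int direct_io_model(struct inode_model *inode, int flags)
{
    int atomic_ops = 0;

    if (!(flags & DIO_SKIP_DIO_COUNT)) {
        atomic_fetch_add(&inode->i_dio_count, 1);
        atomic_ops++;
    }

    /* ... the I/O itself would be submitted and completed here ... */

    if (!(flags & DIO_SKIP_DIO_COUNT)) {
        atomic_fetch_sub(&inode->i_dio_count, 1);
        atomic_ops++;
    }
    return atomic_ops;
}
```

A file system caller passes no flag and pays for two atomic operations on the shared cache line per I/O; a block device caller passes DIO_SKIP_DIO_COUNT and pays for none.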
2015-04-24fs/9p: fix readdir()Johannes Berg
Al Viro's IOV changes broke 9p readdir() because the new code didn't abort the read when it returned nothing. The original code checked if the combined error/length was <= 0 but in the new code that accidentally got changed to just an error check. Add back the return from the function when nothing is read. Cc: Al Viro <viro@zeniv.linux.org.uk> Fixes: e1200fe68f20 ("9p: switch p9_client_read() to passing struct iov_iter *") Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
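The difference between checking only for an error and also stopping on an empty read can be shown with a small model. Everything named here (`reader_model`, `read_model`, `readdir_model`, the 16-byte chunk size) is hypothetical and is not the 9p client API; the sketch only shows why the zero-length check is what terminates the loop.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical data source: hands out fixed-size chunks, then nothing. */
struct reader_model {
    int chunks_left;
    int fail;
};

/* Returns the number of bytes read and reports failure through *err,
 * mirroring a read call with a separate error output. */
static int read_model(struct reader_model *r, int *err)
{
    *err = 0;
    if (r->fail) {
        *err = -EIO;
        return 0;
    }
    if (r->chunks_left == 0)
        return 0;
    r->chunks_left--;
    return 16;
}

/* Returns the total bytes consumed, or a negative errno. */
static int readdir_model(struct reader_model *r)
{
    int total = 0;

    for (;;) {
        int err;
        int n = read_model(r, &err);

        if (err)
            return err;
        if (n == 0)          /* the check that had been lost */
            return total;
        total += n;
    }
}
```

Without the `n == 0` branch, the loop in this model would never exit once the source is drained, since an empty read is not an error.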
2015-04-24Btrfs: prevent list corruption during free space cache processingChris Mason
__btrfs_write_out_cache is holding the ctl->tree_lock while it prepares a list of bitmaps to record in the free space cache. It was dropping the lock while it worked on other components, which made a window for free_bitmap() to free the bitmap struct without removing it from the list. This changes things to hold the lock the whole time, and also makes sure we hold the lock during enospc cleanup. Reported-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Chris Mason <clm@fb.com>
2015-04-24Merge branch 'for-4.1' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd updates from Bruce Fields:
"A quiet cycle this time; this is basically entirely bugfixes. The few that aren't cc'd to stable are cleanup or seemed unlikely to affect anyone much"

* 'for-4.1' of git://linux-nfs.org/~bfields/linux:
  uapi: Remove kernel internal declaration
  nfsd: fix nsfd startup race triggering BUG_ON
  nfsd: eliminate NFSD_DEBUG
  nfsd4: fix READ permission checking
  nfsd4: disallow SEEK with special stateids
  nfsd4: disallow ALLOCATE with special stateids
  nfsd: add NFSEXP_PNFS to the exflags array
  nfsd: Remove duplicate macro define for max sec label length
  nfsd: allow setting acls with unenforceable DENYs
  nfsd: NFSD_FAULT_INJECTION depends on DEBUG_FS
  nfsd: remove unused status arg to nfsd4_cleanup_open_state
  nfsd: remove bogus setting of status in nfsd4_process_open2
  NFSD: Use correct reply size calculating function
  NFSD: Using path_equal() for checking two paths
2015-04-24Merge branch 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfsLinus Torvalds
Pull btrfs updates from Chris Mason:
"I've been running these through a longer set of load tests because my commits change the free space cache writeout. It fixes commit stalls on large filesystems (~20T space used and up) that we have been triggering here. We were seeing new writers blocked for 10 seconds or more during commits, which is far from good.

Josef and I fixed up ENOSPC aborts when deleting huge files (3T or more), that are triggered because our metadata reservations were not properly accounting for crcs and were not replenishing during the truncate.

Also in this series, a number of qgroup fixes from Fujitsu and Dave Sterba collected most of the pending cleanups from the list"

* 'for-linus-4.1' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (93 commits)
  btrfs: quota: Update quota tree after qgroup relationship change.
  btrfs: quota: Automatically update related qgroups or mark INCONSISTENT flags when assigning/deleting a qgroup relations.
  btrfs: qgroup: clear STATUS_FLAG_ON in disabling quota.
  btrfs: Update btrfs qgroup status item when rescan is done.
  btrfs: qgroup: Fix dead judgement on qgroup_rescan_leaf() return value.
  btrfs: Don't allow subvolid >= (1 << BTRFS_QGROUP_LEVEL_SHIFT) to be created
  btrfs: Check qgroup level in kernel qgroup assign.
  btrfs: qgroup: allow to remove qgroup which has parent but no child.
  btrfs: qgroup: return EINVAL if level of parent is not higher than child's.
  btrfs: qgroup: do a reservation in a higher level.
  Btrfs: qgroup, Account data space in more proper timings.
  Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use.
  Btrfs: qgroup: free reserved in exceeding quota.
  Btrfs: qgroup: cleanup, remove an unsued parameter in btrfs_create_qgroup().
  btrfs: qgroup: fix limit args override whole limit struct
  btrfs: qgroup: update limit info in function btrfs_run_qgroups().
  btrfs: qgroup: consolidate the parameter of fucntion update_qgroup_limit_item().
  btrfs: qgroup: update qgroup in memory at the same time when we update it in btree.
  btrfs: qgroup: inherit limit info from srcgroup in creating snapshot.
  btrfs: Support busy loop of write and delete
  ...
2015-04-24Merge tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfsLinus Torvalds
Pull xfs update from Dave Chinner:
"This update contains:
 - RENAME_WHITEOUT support
 - conversion of per-cpu superblock accounting to use generic counters
 - new inode mmap lock so that we can lock page faults out of truncate, hole punch and other direct extent manipulation functions to avoid racing mmap writes from causing data corruption
 - rework of direct IO submission and completion to solve data corruption issue when running concurrent extending DIO writes. Also solves problem of running IO completion transactions in interrupt context during size extending AIO writes.
 - FALLOC_FL_INSERT_RANGE support for inserting holes into a file via direct extent manipulation to avoid needing to copy data within the file
 - attribute block header field overflow fix for 64k block size filesystems
 - Lots of changes to log messaging to be more informative and concise when errors occur. Also prevent a lot of unnecessary log spamming due to cascading failures in error conditions.
 - lots of cleanups and bug fixes

One thing of note is the direct IO fixes that we merged last week after the window opened. Even though a little late, they fix a user reported data corruption and have been pretty well tested. I figured there was not much point waiting another 2 weeks for -rc1 to be released just so I could send them to you..."

* tag 'xfs-for-linus-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (49 commits)
  xfs: using generic_file_direct_write() is unnecessary
  xfs: direct IO EOF zeroing needs to drain AIO
  xfs: DIO write completion size updates race
  xfs: DIO writes within EOF don't need an ioend
  xfs: handle DIO overwrite EOF update completion correctly
  xfs: DIO needs an ioend for writes
  xfs: move DIO mapping size calculation
  xfs: factor DIO write mapping from get_blocks
  xfs: unlock i_mutex in xfs_break_layouts
  xfs: kill unnecessary firstused overflow check on attr3 leaf removal
  xfs: use larger in-core attr firstused field and detect overflow
  xfs: pass attr geometry to attr leaf header conversion functions
  xfs: disallow ro->rw remount on norecovery mount
  xfs: xfs_shift_file_space can be static
  xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate
  fs: Add support FALLOC_FL_INSERT_RANGE for fallocate
  xfs: Fix incorrect positive ENOMEM return
  xfs: xfs_mru_cache_insert() should use GFP_NOFS
  xfs: %pF is only for function pointers
  xfs: fix shadow warning in xfs_da3_root_split()
  ...
2015-04-23Btrfs: fix inode cache writeoutChris Mason
The code to fix stalls during free space cache IO wasn't using the correct root when waiting on the IO for inode caches. This is only a problem when the inode cache is enabled with

mount -o inode_cache

This fixes the inode cache writeout to preserve any error values and makes sure not to override the root when inode cache writeout is done.

Reported-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chris Mason <clm@fb.com>