path: root/fs/xfs/libxfs
Age | Commit message | Author
2016-05-26 | Merge tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs | Linus Torvalds
Pull xfs updates from Dave Chinner:
"A pretty average collection of fixes, cleanups and improvements in this request.
Summary:
- fixes for mount line parsing, sparse warnings, read-only compat feature remount behaviour
- allow fast path symlink lookups for inline symlinks.
- attribute listing cleanups
- writeback goes direct to bios rather than indirecting through bufferheads
- transaction allocation cleanup
- optimised kmem_realloc
- added configurable error handling for metadata write errors, changed default error handling behaviour from "retry forever" to "retry until unmount then fail"
- fixed several inode cluster writeback lookup vs reclaim race conditions
- fixed inode cluster writeback checking wrong inode after lookup
- fixed bugs where struct xfs_inode freeing wasn't actually RCU safe
- cleaned up inode reclaim tagging"
* tag 'xfs-for-linus-4.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (39 commits)
  xfs: fix warning in xfs_finish_page_writeback for non-debug builds
  xfs: move reclaim tagging functions
  xfs: simplify inode reclaim tagging interfaces
  xfs: rename variables in xfs_iflush_cluster for clarity
  xfs: xfs_iflush_cluster has range issues
  xfs: mark reclaimed inodes invalid earlier
  xfs: xfs_inode_free() isn't RCU safe
  xfs: optimise xfs_iext_destroy
  xfs: skip stale inodes in xfs_iflush_cluster
  xfs: fix inode validity check in xfs_iflush_cluster
  xfs: xfs_iflush_cluster fails to abort on error
  xfs: remove xfs_fs_evict_inode()
  xfs: add "fail at unmount" error handling configuration
  xfs: add configuration handlers for specific errors
  xfs: add configuration of error failure speed
  xfs: introduce table-based init for error behaviors
  xfs: add configurable error support to metadata buffers
  xfs: introduce metadata IO error class
  xfs: configurable error behavior via sysfs
  xfs: buffer ->bi_end_io function requires irq-safe lock
  ...
2016-05-20 | Merge branch 'xfs-4.7-inode-reclaim' into for-next | Dave Chinner
2016-05-20 | Merge branch 'xfs-4.7-misc-fixes' into for-next | Dave Chinner
2016-05-20 | Merge branch 'xfs-4.7-optimise-inline-symlinks' into for-next | Dave Chinner
2016-05-18 | xfs: optimise xfs_iext_destroy | Alex Lyakas
When unmounting XFS, we call:
    xfs_inode_free => xfs_idestroy_fork => xfs_iext_destroy
This goes over the whole indirection array and calls xfs_iext_irec_remove for each one of the erps (from the last one to the first one). As a result, we keep shrinking (reallocating actually) the indirection array until we shrink out all of its elements. When we have files with huge numbers of extents, umount takes 30-80 sec, depending on the amount of files that XFS loaded and the amount of indirection entries of each file. The unmount stack looks like:
    [<ffffffffc0b6d200>] xfs_iext_realloc_indirect+0x40/0x60 [xfs]
    [<ffffffffc0b6cd8e>] xfs_iext_irec_remove+0xee/0xf0 [xfs]
    [<ffffffffc0b6cdcd>] xfs_iext_destroy+0x3d/0xb0 [xfs]
    [<ffffffffc0b6cef6>] xfs_idestroy_fork+0xb6/0xf0 [xfs]
    [<ffffffffc0b87002>] xfs_inode_free+0xb2/0xc0 [xfs]
    [<ffffffffc0b87260>] xfs_reclaim_inode+0x250/0x340 [xfs]
    [<ffffffffc0b87583>] xfs_reclaim_inodes_ag+0x233/0x370 [xfs]
    [<ffffffffc0b8823d>] xfs_reclaim_inodes+0x1d/0x20 [xfs]
    [<ffffffffc0b96feb>] xfs_unmountfs+0x7b/0x1a0 [xfs]
    [<ffffffffc0b98e4d>] xfs_fs_put_super+0x2d/0x70 [xfs]
    [<ffffffff811e9e36>] generic_shutdown_super+0x76/0x100
    [<ffffffff811ea207>] kill_block_super+0x27/0x70
    [<ffffffff811ea519>] deactivate_locked_super+0x49/0x60
    [<ffffffff811eaaee>] deactivate_super+0x4e/0x70
    [<ffffffff81207593>] cleanup_mnt+0x43/0x90
    [<ffffffff81207632>] __cleanup_mnt+0x12/0x20
    [<ffffffff8108f8e7>] task_work_run+0xa7/0xe0
    [<ffffffff81014ff7>] do_notify_resume+0x97/0xb0
    [<ffffffff81717c6f>] int_signal+0x12/0x17
Further, this reallocation prevents us from freeing the extent list from a RCU callback as allocation can block. Hence if the extent list is in indirect format, optimise the freeing of the extent list to only use kmem_free calls by freeing entire extent buffer pages at a time, rather than extent by extent.
[dchinner: simplified freeing loop based on Christoph's suggestion]
Signed-off-by: Alex Lyakas <alex@zadarastorage.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
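The two teardown strategies described in this commit can be compared with a small userspace model. This is only a sketch under stated assumptions: `struct irec`, `build_indirect()`, `destroy_shrinking()` and `destroy_direct()` are hypothetical names, and `malloc`/`realloc`/`free` stand in for the kernel allocator. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for one indirection array entry: it owns one
 * buffer ("page") of extent records. */
struct irec {
	void	*ext_buf;
};

/* Build an indirection array with n entries, each owning one buffer. */
struct irec *build_indirect(size_t n, size_t buf_size)
{
	struct irec *erp;

	assert(n > 0 && buf_size > 0);
	erp = calloc(n, sizeof(*erp));
	assert(erp != NULL);
	for (size_t i = 0; i < n; i++) {
		erp[i].ext_buf = malloc(buf_size);
		assert(erp[i].ext_buf != NULL);
	}
	return erp;
}

/* Old pattern: drop entries last to first and shrink the array after
 * every removal. Returns the number of reallocations performed. */
size_t destroy_shrinking(struct irec *erp, size_t n)
{
	size_t reallocs = 0;

	assert(erp != NULL && n > 0);
	while (n > 1) {
		free(erp[n - 1].ext_buf);
		n--;
		struct irec *tmp = realloc(erp, n * sizeof(*erp));
		assert(tmp != NULL);
		erp = tmp;
		reallocs++;
	}
	free(erp[0].ext_buf);
	free(erp);
	return reallocs;
}

/* New pattern: free every buffer, then the array itself. Only free()
 * is called, so nothing here can block on an allocation. */
size_t destroy_direct(struct irec *erp, size_t n)
{
	assert(erp != NULL && n > 0);
	for (size_t i = 0; i < n; i++)
		free(erp[i].ext_buf);
	free(erp);
	return 0;	/* no reallocations */
}
```

In this model, tearing down an array of n entries costs n - 1 reallocations with the shrinking pattern and none with the direct pattern, which is the property the commit relies on for RCU-callback freeing.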
2016-04-06 | xfs: improve kmem_realloc | Christoph Hellwig
Use krealloc to implement our realloc function. This helps to avoid new allocations if we are still in the slab bucket. At least for the bmap btree root that's actually the common case. This also allows removing the now unused oldsize argument. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
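The interface change can be illustrated in userspace, with `realloc` standing in for `krealloc`. Both function names below are hypothetical, and the sketch ignores the allocation flags and retry behaviour the kernel helper has to deal with.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Old pattern: always allocate, copy, free. The caller has to pass the
 * old size because this helper cannot find it out on its own. */
void *realloc_by_copy(void *old, size_t oldsize, size_t newsize)
{
	void *buf;

	assert(newsize > 0);
	buf = malloc(newsize);
	if (buf == NULL)
		return NULL;
	if (old != NULL) {
		memcpy(buf, old, oldsize < newsize ? oldsize : newsize);
		free(old);
	}
	return buf;
}

/* New pattern: delegate to the allocator, which already tracks the old
 * size and may satisfy the request without moving the data. */
void *realloc_delegated(void *old, size_t newsize)
{
	assert(newsize > 0);
	return realloc(old, newsize);
}
```

The second form needs no `oldsize` parameter, which mirrors the argument removal mentioned in the commit message.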
2016-04-06 | xfs: remove transaction types | Christoph Hellwig
These aren't used for CIL-style logging and can be dropped. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 | xfs: better xfs_trans_alloc interface | Christoph Hellwig
Merge xfs_trans_reserve and xfs_trans_alloc into a single function call that returns a transaction with all the required log and block reservations, and which allows passing transaction flags directly to avoid the cumbersome _xfs_trans_alloc interface. While we're at it we also get rid of the transaction type argument that has been superfluous since we stopped supporting the non-CIL logging mode. The guts of it will be removed in another patch. [dchinner: fixed transaction leak in error path in xfs_setattr_nonsize] Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-06 | xfs: optimize inline symlinks | Christoph Hellwig
By overallocating the in-core inode fork data buffer and zero terminating the link target in xfs_init_local_fork we can avoid the memory allocation in ->follow_link. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
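The over-allocate-and-terminate idea can be sketched as follows. This is a userspace approximation, not the kernel's xfs_init_local_fork: the function name `init_local_fork`, its signature and the use of `malloc` are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy `size` bytes of inline fork data into a fresh in-core buffer.
 * For symlinks, allocate one extra byte and NUL-terminate the copy so
 * the buffer can later be handed out directly as a C string. Returns
 * NULL for an empty fork or on allocation failure.
 */
char *init_local_fork(const void *data, size_t size, bool is_symlink)
{
	size_t mem_size = size + (is_symlink ? 1 : 0);
	char *buf;

	assert(data != NULL || size == 0);
	if (size == 0)
		return NULL;

	buf = malloc(mem_size);
	if (buf == NULL)
		return NULL;

	memcpy(buf, data, size);
	if (is_symlink)
		buf[size] = '\0';
	return buf;
}
```

Because the terminator is written once when the fork is read in, a later lookup can return the buffer as-is instead of allocating a temporary copy just to append the NUL byte.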
2016-04-06 | xfs: factor out a helper to initialize a local format inode fork | Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-04-04 | mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros | Kirill A. Shutemov
PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced *long* time ago with promise that one day it will be possible to implement page cache with bigger chunks than PAGE_SIZE. This promise never materialized. And unlikely will. We have many places where PAGE_CACHE_SIZE assumed to be equal to PAGE_SIZE. And it's constant source of confusion on whether PAGE_CACHE_* or PAGE_* constant should be used in a particular case, especially on the border between fs and mm. Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much breakage to be doable. Let's stop pretending that pages in page cache are special. They are not. The changes are pretty straight-forward:
- <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
- PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
- page_cache_get() -> get_page();
- page_cache_release() -> put_page();
This patch contains automated changes generated with coccinelle using script below. For some reason, coccinelle doesn't patch header files. I've called spatch for them manually. The only adjustment after coccinelle is revert of changes to PAGE_CACHE_ALIGN definition: we are going to drop it later. There are few places in the code where coccinelle didn't reach. I'll fix them manually in a separate patch. Comments and documentation also will be addressed with the separate patch.
    virtual patch
    @@
    expression E;
    @@
    - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E
    @@
    expression E;
    @@
    - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
    + E
    @@
    @@
    - PAGE_CACHE_SHIFT
    + PAGE_SHIFT
    @@
    @@
    - PAGE_CACHE_SIZE
    + PAGE_SIZE
    @@
    @@
    - PAGE_CACHE_MASK
    + PAGE_MASK
    @@
    expression E;
    @@
    - PAGE_CACHE_ALIGN(E)
    + PAGE_ALIGN(E)
    @@
    expression E;
    @@
    - page_cache_get(E)
    + get_page(E)
    @@
    expression E;
    @@
    - page_cache_release(E)
    + put_page(E)
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-03-15 | Merge branch 'xfs-misc-fixes-4.6-4' into for-next | Dave Chinner
2016-03-15 | xfs: always set rvalp in xfs_dir2_node_trim_free | Christoph Hellwig
xfs_dir2_node_trim_free can return without setting the rvalp argument pointer. Initialize it to 0 at the beginning of the function and only update it to 1 if we succeeded trimming a freespace block. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 | xfs: borrow indirect blocks from freed extent when available | Brian Foster
xfs_bmap_del_extent() handles extent removal from the in-core and on-disk extent lists. When removing a delalloc range, it updates the indirect block reservation appropriately based on the removal. It currently enforces that the new indirect block reservation is less than or equal to the original. This is normally the case in all situations except for in certain cases when the removed range creates a hole in a single delalloc extent, thus splitting a single delalloc extent in two. It is possible with small enough extents to split an indlen==1 extent into two such slightly smaller extents. This leaves one extent with 0 indirect blocks and leads to assert failures in other areas (e.g., xfs_bunmapi() if the extent happens to be removed). Update the indlen distribution code to steal blocks from the deleted extent, if necessary, to satisfy the worst case total indirect reservation for the new extents. This is safe as the caller does not update the fdblocks counters until the extent is removed. Blocks stolen in this manner simply remain accounted as allocated, having ownership transferred from the data extent to an indirect reservation. As a precaution, fall back to the original reservation algorithm if the new indlen requirement is not met and warn if we end up with extents without any reservation at all to detect this more easily in the future. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 | xfs: refactor delalloc indlen reservation split into helper | Brian Foster
The delayed allocation indirect reservation splitting code is not sufficient in some cases where a delalloc extent is split in two. In preparation for enhancements to this code, refactor the current indlen distribution algorithm into a new helper function. [dchinner: rename temp, temp2 variables] Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-15 | xfs: update freeblocks counter after extent deletion | Brian Foster
xfs_bunmapi() currently updates the fdblocks counter, unreserves quota, etc. before the extent is deleted by xfs_bmap_del_extent(). The function has problems dividing up the indirect reserved blocks for scenarios where a single delalloc extent is split in two. Particularly, there aren't always enough blocks reserved for multiple extents in a single extent reservation. The solution to this problem is to allow the extent removal code to steal from the deleted extent to meet indirect reservation requirements. Move the block of code in xfs_bunmapi() that updates the fdblocks counter to after the call to xfs_bmap_del_extent() to allow the codepath to update the extent record before the free blocks are accounted. Also, reshuffle the code slightly so the delalloc accounting occurs near the xfs_bmap_del_extent() call to provide context for the comments. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-09 | Merge branch 'xfs-misc-fixes-4.6-3' into for-next | Dave Chinner
2016-03-09 | xfs: remove impossible condition | Luis de Bethencourt
bp_release is set to 0 just before the breakpoint of the for loop before the conditional check (in line 458). The other breakpoint is a goto that skips the dead code. Addresses-Coverity-Id: 102338 Signed-off-by: Luis de Bethencourt <luisbg@osg.samsung.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-03-07 | Merge branch 'xfs-misc-fixes-4.6-2' into for-next | Dave Chinner
2016-03-07 | Merge branch 'xfs-gut-icdinode-4.6' into for-next | Dave Chinner
2016-03-07 | Merge branch 'xfs-misc-fixes-4.6' into for-next | Dave Chinner
2016-03-07 | Merge branch 'xfs-get-next-dquot-4.6' into for-next | Dave Chinner
2016-03-07 | xfs: fix computation of inode btree maxlevels | Darrick J. Wong
Commit 88740da18[1] introduced a function to compute the maximum height of the inode btree back in 1994. Back then, apparently, the freespace and inode btrees shared the same geometry; however, it has long since been the case that the inode and freespace btrees have different record and key sizes. Therefore, we must use m_inobt_mnr if we want a correct calculation/log reservation/etc. (Yes, this bug has been around for 21 years and ten months.) (Yes, I was in middle school when this bug was committed.) [1] http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=88740da18ddd9d7ba3ebaa9502fefc6ef2fd19cd Historical-research-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
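Why the per-btree minimum record counts matter can be seen from a generic worst-case height calculation. This is a sketch of the arithmetic only: `btree_max_levels` and its parameters are hypothetical names, and the numbers used to exercise it are made up rather than real XFS geometry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case btree height for `max_leaf_recs` records when every leaf
 * holds only `min_leaf_recs` records and every interior node holds only
 * `min_node_recs` pointers (minimum fill gives the tallest tree).
 */
unsigned int btree_max_levels(uint64_t max_leaf_recs,
			      unsigned int min_leaf_recs,
			      unsigned int min_node_recs)
{
	uint64_t blocks;
	unsigned int level;

	assert(min_leaf_recs > 0 && min_node_recs > 1);

	/* Number of leaf blocks at minimum fill, rounded up. */
	blocks = (max_leaf_recs + min_leaf_recs - 1) / min_leaf_recs;

	/* Each extra level divides the block count by the node fanout. */
	for (level = 1; blocks > 1; level++)
		blocks = (blocks + min_node_recs - 1) / min_node_recs;

	return level;
}
```

For 1000 records with 10 records per leaf, a node fanout of 10 gives 3 levels while a fanout of 5 gives 4. Feeding one btree's minimum record counts into another btree's height calculation can therefore yield the wrong maximum, and with it a wrong log reservation.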
2016-03-02 | xfs: remove xfs_trans_get_block_res | Christoph Hellwig
Just use the t_blk_res field directly instead of obfuscating the reference by a macro. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: move di_mode to vfs inode | Dave Chinner
Move the di_mode value from the xfs_icdinode to the VFS inode, reducing the xfs_icdinode by another 2 bytes and collapsing another 2 byte hole in the structure. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: move di_changecount to VFS inode | Dave Chinner
We can store the di_changecount in the i_version field of the VFS inode and remove another 8 bytes from the xfs_icdinode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: move inode generation count to VFS inode | Dave Chinner
Pull another 4 bytes out of the xfs_icdinode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: use vfs inode nlink field everywhere | Dave Chinner
The VFS tracks the inode nlink just like the xfs_icdinode. We can remove the variable from the icdinode and use the VFS inode variable everywhere, reducing the size of the xfs_icdinode by a further 4 bytes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: move v1 inode conversion to xfs_inode_from_disk | Dave Chinner
So we don't have to carry a di_onlink variable around anymore, move the inode conversion from v1 inode format to v2 inode format into xfs_inode_from_disk(). This means we can remove the di_onlink fields from the struct xfs_icdinode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: cull unnecessary icdinode fields | Dave Chinner
Now that the struct xfs_icdinode is not directly related to the on-disk format, we can cull things in it we really don't need to store:
- magic number never changes
- padding is not necessary
- next_unlinked is never used
- inode number is redundant
- uuid is redundant
- lsn is accessed directly from dinode
- inode CRC is only accessed directly from dinode
Hence we can remove these from the struct xfs_icdinode and redirect the code that uses them to the xfs_dinode appropriately. This reduces the size of the struct icdinode from 152 bytes to 88 bytes, and removes a fair chunk of unnecessary code, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: remove timestamps from incore inode | Dave Chinner
The struct xfs_inode has two copies of the current timestamps in it, one in the vfs inode and one in the struct xfs_icdinode. Now that we no longer log the struct xfs_icdinode directly, we don't need to keep the timestamps in this structure. Instead, we can copy them straight out of the VFS inode when formatting the inode log item or the on-disk inode. This reduces the struct xfs_inode in size by 24 bytes. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: introduce inode log format object | Dave Chinner
We currently carry around and log an entire inode core in the struct xfs_inode. A lot of the information in the inode core is duplicated in the VFS inode, but we cannot remove this duplication of information because the inode core is logged directly in xfs_inode_item_format(). Add a new function xfs_inode_item_format_core() that copies the inode core data into a struct xfs_icdinode that is pulled directly from the log vector buffer. This means we no longer directly copy the inode core, but copy the structures one member at a time. This will be slightly less efficient than copying, but will allow us to remove duplicate and unnecessary items from the struct xfs_inode. To enable us to do this, call the new structure a xfs_log_dinode, so that we know it's different to the physical xfs_dinode and the in-core xfs_icdinode. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: RT bitmap and summary buffers need verifiers | Dave Chinner
Buffers without verifiers issue runtime warnings on XFS. We don't have anything we can actually verify in the RT buffers (no CRCs, no magic numbers, etc), but we still need verifiers to avoid the warnings. Add a set of dummy verifier operations for the realtime buffers and apply them in the appropriate places. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-09 | xfs: RT bitmap and summary buffers are not typed | Dave Chinner
When logging buffers, we attach a type to them that follows the buffer all the way into the log and is used to identify the buffer contents in log recovery. Both the realtime summary buffers and the bitmap buffers do not have types defined or set, so when we try to log them we see assert failure: XFS: Assertion failed: (bip->bli_flags & XFS_BLI_STALE) || (xfs_blft_from_flags(&bip->__bli_format) > XFS_BLFT_UNKNOWN_BUF && xfs_blft_from_flags(&bip->__bli_format) < XFS_BLFT_MAX_BUF), file: fs/xfs/xfs_buf_item.c, line: 294 Fix this by adding buffer log format types for these buffers, and add identification support into log recovery for them. Only build the log recovery support if CONFIG_XFS_RT=y - we can't get into log recovery for real time filesystems if support is not built into the kernel, and this avoids potential build problems. Signed-off-by: Dave Chinner <dchinner@redhat.com> Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 | xfs: move struct xfs_attr_shortform to xfs_da_format.h | Darrick J. Wong
Move the shortform attr structure definition to the same place as the other attribute structure definitions for consistency and also so that xfs/122 verifies the structure size. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 | xfs: remove unused function definitions | Eric Sandeen
Old leftovers. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 | xfs: move buffer invalidation to xfs_btree_free_block | Christoph Hellwig
... instead of leaving it in the methods. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 | xfs: factor btree block freeing into a helper | Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 | xfs: handle errors from ->free_blocks in xfs_btree_kill_iroot | Christoph Hellwig
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-02-08 | xfs: wire up Q_XGETNEXTQUOTA / get_nextdqblk | Eric Sandeen
Add code to allow the Q_XGETNEXTQUOTA quotactl to quickly find all active quotas by examining the quota inode, and skipping over unallocated or uninitialized regions. Userspace can then use this interface rather than, e.g., a getpwent() loop when asked to report all active quotas. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-19 | Merge branch 'xfs-setxattr-promotion' into for-next | Dave Chinner
2016-01-12 | Merge branch 'xfs-misc-fixes-for-4.5-2' into for-next | Dave Chinner
2016-01-12 | xfs: handle dquot buffer readahead in log recovery correctly | Dave Chinner
When we do dquot readahead in log recovery, we do not use a verifier as the underlying buffer may not have dquots in it. e.g. the allocation operation hasn't yet been replayed. Hence we do not want to fail recovery because we detect an operation to be replayed has not been run yet. This problem was addressed for inodes in commit d891400 ("xfs: inode buffers may not be valid during recovery readahead") but the problem was not recognised to exist for dquots and their buffers as the dquot readahead did not have a verifier. The result of not using a verifier is that when the buffer is then next read to replay a dquot modification, the dquot buffer verifier will only be attached to the buffer if *readahead is not complete*. Hence we can read the buffer, replay the dquot changes and then add it to the delwri submission list without it having a verifier attached to it. This then generates warnings in xfs_buf_ioapply(), which catches and warns about this case. Fix this and make it handle the same readahead verifier error cases as for inode buffers by adding a new readahead verifier that has a write operation as well as a read operation that marks the buffer as not done if any corruption is detected. Also make sure we don't run readahead if the dquot buffer has been marked as cancelled by recovery. This will result in readahead either succeeding and the buffer having a valid write verifier, or readahead failing and the buffer state requiring the subsequent read to resubmit the IO with the new verifier. In either case, this will result in the buffer always ending up with a valid write verifier on it. Note: we also need to fix the inode buffer readahead error handling to mark the buffer with EIO. Brian noticed the code I copied from there wrong during review, so fix it at the same time. Add comments linking the two functions that handle readahead verifier errors together so we don't forget this behavioural link in future. 
cc: <stable@vger.kernel.org> # 3.12 - current Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-12 | xfs: inode recovery readahead can race with inode buffer creation | Dave Chinner
When we do inode readahead in log recovery, we can do the readahead before we've replayed the icreate transaction that stamps the buffer with inode cores. The inode readahead verifier catches this and marks the buffer as !done to indicate that it doesn't yet contain valid inodes. In adding buffer error notification (i.e. setting b_error = -EIO at the same time as we clear the done flag) to such a readahead verifier failure, we can then get subsequent inode recovery failing with this error:
    XFS (dm-0): metadata I/O error: block 0xa00060 ("xlog_recover_do..(read#2)") error 5 numblks 32
This occurs when readahead completion races with icreate item replay such as:
    inode readahead
        find buffer
        lock buffer
        submit RA io
    ....
    icreate recovery
        xfs_trans_get_buffer
            find buffer
            lock buffer
            <blocks on RA completion>
    .....
    <ra completion>
        fails verifier
        clear XBF_DONE
        set bp->b_error = -EIO
        release and unlock buffer
    <icreate gains lock>
    icreate initialises buffer
    marks buffer as done
    adds buffer to delayed write queue
    releases buffer
At this point, we have an initialised inode buffer that is up to date but has an -EIO state registered against it. When we finally get to recovering an inode in that buffer:
    inode item recovery
        xfs_trans_read_buffer
            find buffer
            lock buffer
            sees XBF_DONE is set, returns buffer
        sees bp->b_error is set
            fail log recovery!
Essentially, we need xfs_trans_get_buf_map() to clear the error status of the buffer when doing a lookup. This function returns uninitialised buffers, so the buffer returned can not be in an error state and none of the code that uses this function expects b_error to be set on return. Indeed, there is an ASSERT(!bp->b_error); in the transaction case in xfs_trans_get_buf_map() that would have caught this if log recovery used transactions....
This patch firstly changes the inode readahead failure to set -EIO on the buffer, and secondly changes xfs_buf_get_map() to never return a buffer with an error state set so this first change doesn't cause unexpected log recovery failures. cc: <stable@vger.kernel.org> # 3.12 - current Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-11 | xfs: eliminate committed arg from xfs_bmap_finish | Eric Sandeen
Calls to xfs_bmap_finish() and xfs_trans_ijoin(), and the associated comments were replicated several times across the attribute code, all dealing with what to do if the transaction was or wasn't committed. And in that replicated code, an ASSERT() test of an uninitialized variable occurs in several locations:
    error = xfs_attr_thing(&args);
    if (!error) {
        error = xfs_bmap_finish(&args.trans, args.flist, &committed);
    }
    if (error) {
        ASSERT(committed);
If the first xfs_attr_thing() failed, we'd skip the xfs_bmap_finish, never set "committed", and then test it in the ASSERT. Fix this up by moving the committed state internal to xfs_bmap_finish, and add a new inode argument. If an inode is passed in, it is passed through to __xfs_trans_roll() and joined to the transaction there if the transaction was committed. xfs_qm_dqalloc() was a little unique in that it called bjoin rather than ijoin, but as Dave points out we can detect the committed state by checking whether (*tpp != tp). Addresses-Coverity-Id: 102360 Addresses-Coverity-Id: 102361 Addresses-Coverity-Id: 102363 Addresses-Coverity-Id: 102364 Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <david@fromorbit.com>
2016-01-08 | xfs: bmapbt checking on debug kernels too expensive | Dave Chinner
For large sparse or fragmented files, checking every single entry in the bmapbt on every operation is prohibitively expensive. Especially as such checks rarely discover problems during normal operations on high extent count files. Our regression tests don't tend to exercise files with hundreds of thousands to millions of extents, so mostly this isn't noticed. However, trying to run things like xfs_mdrestore of large filesystem dumps on a debug kernel quickly becomes impossible as the CPU is completely burnt up repeatedly walking the sparse file bmapbt that is generated for every allocation that is made. Hence, if the file has more than 10,000 extents, just don't bother with walking the tree to check it exhaustively. The btree code has checks that ensure that the newly inserted/removed/modified record is correctly ordered, so the entire tree walk in these cases has limited additional value. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
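The threshold described above amounts to an early return in front of an exhaustive check. Here is a sketch using a flat array of extent start offsets instead of a btree. `check_extents_sorted` and `CHECK_EXTENT_LIMIT` are hypothetical names, and only the 10,000 figure comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CHECK_EXTENT_LIMIT	10000	/* threshold from the commit message */

/*
 * Debug-style check: verify that extent start offsets are strictly
 * increasing. Returns true without looking when the list is too large
 * for an exhaustive walk to be worthwhile.
 */
bool check_extents_sorted(const uint64_t *startoff, size_t nextents)
{
	assert(startoff != NULL || nextents == 0);

	if (nextents > CHECK_EXTENT_LIMIT)
		return true;	/* skipped, not verified */

	for (size_t i = 1; i < nextents; i++)
		if (startoff[i - 1] >= startoff[i])
			return false;
	return true;
}
```

The trade-off is explicit in the return value: above the limit, a `true` result means "not checked" rather than "known good", which is acceptable only because each individual record update is still validated elsewhere.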
2016-01-05 | Merge branch 'xfs-misc-fixes-for-4.5' into for-next | Dave Chinner
2016-01-04 | xfs: introduce per-inode DAX enablement | Dave Chinner
Rather than just being able to turn DAX on and off via a mount option, some applications may only want to enable DAX for certain performance critical files in a filesystem. This patch introduces a new inode flag to enable DAX in the v3 inode di_flags2 field. It adds support for setting and clearing flags in the di_flags2 field via the XFS_IOC_FSSETXATTR ioctl, and sets the S_DAX inode flag appropriately when it is seen. When this flag is set on a directory, it acts as an "inherit flag". That is, inodes created in the directory will automatically inherit the on-disk inode DAX flag, enabling administrators to set up directory hierarchies that automatically use DAX. Setting this flag on an empty root directory will make the entire filesystem use DAX by default. Signed-off-by: Dave Chinner <dchinner@redhat.com>
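The "inherit flag" behaviour can be modelled in a few lines. Everything in this sketch is hypothetical: `struct inode_sketch`, `FLAG2_DAX` and its bit position, and both helper functions are illustrative stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FLAG2_DAX	(1u << 0)	/* hypothetical bit position */

struct inode_sketch {
	uint64_t	flags2;		/* models the on-disk di_flags2 field */
	bool		is_dir;
};

/* A new inode copies the DAX flag from a flagged parent directory. */
void inherit_flags2(struct inode_sketch *child,
		    const struct inode_sketch *parent)
{
	assert(child != NULL);
	if (parent != NULL && parent->is_dir && (parent->flags2 & FLAG2_DAX))
		child->flags2 |= FLAG2_DAX;
}

/* Whether the in-memory inode should be treated as DAX-enabled. */
bool wants_dax(const struct inode_sketch *ip)
{
	assert(ip != NULL);
	return (ip->flags2 & FLAG2_DAX) != 0;
}
```

Because a subdirectory inherits the flag just like a file does, flagging one directory propagates to everything created beneath it afterwards, which is how flagging an empty root directory can cover a whole filesystem.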
2016-01-04 | xfs: use FS_XFLAG definitions directly | Dave Chinner
Now that the ioctls have been hoisted up to the VFS level, use the VFS definitions directly and remove the XFS specific definitions completely. Userspace is going to have to handle the change of this interface separately, so removing the definitions from xfs_fs.h is not an issue here at all. Signed-off-by: Dave Chinner <dchinner@redhat.com>
2016-01-04 | fs: XFS_IOC_FS[SG]ETXATTR to FS_IOC_FS[SG]ETXATTR promotion | Dave Chinner
Hoist the ioctl definitions for the XFS_IOC_FS[SG]ETXATTR API from fs/xfs/libxfs/xfs_fs.h to include/uapi/linux/fs.h so that the ioctls can be used by all filesystems, not just XFS. This enables (initially) ext4 to use the ioctl to set project IDs on inodes. Based-on-patch-from: Li Xi <lixi@ddn.com> Signed-off-by: Dave Chinner <dchinner@redhat.com>