aboutsummaryrefslogtreecommitdiff
path: root/fs/afs/inode.c
AgeCommit message (Collapse)Author
2021-01-24fs: make helpers idmap mount awareChristian Brauner
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2021-01-24stat: handle idmapped mountsChristian Brauner
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-11-22afs: Fix speculative status fetch going out of order wrt to modificationsDavid Howells
When doing a lookup in a directory, the afs filesystem uses a bulk status fetch to speculatively retrieve the statuses of up to 48 other vnodes found in the same directory and it will then either update extant inodes or create new ones - effectively doing 'lookup ahead'. To avoid the possibility of deadlocking itself, however, the filesystem doesn't lock all of those inodes; rather just the directory inode is locked (by the VFS). When the operation completes, afs_inode_init_from_status() or afs_apply_status() is called, depending on whether the inode already exists, to commit the new status. A case exists, however, where the speculative status fetch operation may straddle a modification operation on one of those vnodes. What can then happen is that the speculative bulk status RPC retrieves the old status, and whilst that is happening, the modification happens - which returns an updated status, then the modification status is committed, then we attempt to commit the speculative status. This results in something like the following being seen in dmesg: kAFS: vnode modified {100058:861} 8->9 YFS.InlineBulkStatus showing that for vnode 861 on volume 100058, we saw YFS.InlineBulkStatus say that the vnode had data version 8 when we'd already recorded version 9 due to a local modification. This was causing the cache to be invalidated for that vnode when it shouldn't have been. If it happens on a data file, this might lead to local changes being lost. Fix this by ignoring speculative status updates if the data version doesn't match the expected value. Note that it is possible to get a DV regression if a volume gets restored from a backup - but we should get a callback break in such a case that should trigger a recheck anyway. It might be worth checking the volume creation time in the volsync info and, if a change is observed in that (as would happen on a restore), invalidate all caches associated with the volume. Fixes: 5cf9dd55a0ec ("afs: Prospectively look up extra files when doing a single lookup") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-08afs: Fix deadlock between writeback and truncateDavid Howells
The afs filesystem has a lock[*] that it uses to serialise I/O operations going to the server (vnode->io_lock), as the server will only perform one modification operation at a time on any given file or directory. This prevents the the filesystem from filling up all the call slots to a server with calls that aren't going to be executed in parallel anyway, thereby allowing operations on other files to obtain slots. [*] Note that is probably redundant for directories at least since i_rwsem is used to serialise directory modifications and lookup/reading vs modification. The server does allow parallel non-modification ops, however. When a file truncation op completes, we truncate the in-memory copy of the file to match - but we do it whilst still holding the io_lock, the idea being to prevent races with other operations. However, if writeback starts in a worker thread simultaneously with truncation (whilst notify_change() is called with i_rwsem locked, writeback pays it no heed), it may manage to set PG_writeback bits on the pages that will get truncated before afs_setattr_success() manages to call truncate_pagecache(). Truncate will then wait for those pages - whilst still inside io_lock: # cat /proc/8837/stack [<0>] wait_on_page_bit_common+0x184/0x1e7 [<0>] truncate_inode_pages_range+0x37f/0x3eb [<0>] truncate_pagecache+0x3c/0x53 [<0>] afs_setattr_success+0x4d/0x6e [<0>] afs_wait_for_operation+0xd8/0x169 [<0>] afs_do_sync_operation+0x16/0x1f [<0>] afs_setattr+0x1fb/0x25d [<0>] notify_change+0x2cf/0x3c4 [<0>] do_truncate+0x7f/0xb2 [<0>] do_sys_ftruncate+0xd1/0x104 [<0>] do_syscall_64+0x2d/0x3a [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The writeback operation, however, stalls indefinitely because it needs to get the io_lock to proceed: # cat /proc/5940/stack [<0>] afs_get_io_locks+0x58/0x1ae [<0>] afs_begin_vnode_operation+0xc7/0xd1 [<0>] afs_store_data+0x1b2/0x2a3 [<0>] afs_write_back_from_locked_page+0x418/0x57c [<0>] afs_writepages_region+0x196/0x224 [<0>] afs_writepages+0x74/0x156 [<0>] do_writepages+0x2d/0x56 [<0>] __writeback_single_inode+0x84/0x207 [<0>] writeback_sb_inodes+0x238/0x3cf [<0>] __writeback_inodes_wb+0x68/0x9f [<0>] wb_writeback+0x145/0x26c [<0>] wb_do_writeback+0x16a/0x194 [<0>] wb_workfn+0x74/0x177 [<0>] process_one_work+0x174/0x264 [<0>] worker_thread+0x117/0x1b9 [<0>] kthread+0xec/0xf1 [<0>] ret_from_fork+0x1f/0x30 and thus deadlock has occurred. Note that whilst afs_setattr() calls filemap_write_and_wait(), the fact that the caller is holding i_rwsem doesn't preclude more pages being dirtied through an mmap'd region. Fix this by: (1) Use the vnode validate_lock to mediate access between afs_setattr() and afs_writepages(): (a) Exclusively lock validate_lock in afs_setattr() around the whole RPC operation. (b) If WB_SYNC_ALL isn't set on entry to afs_writepages(), trying to shared-lock validate_lock and returning immediately if we couldn't get it. (c) If WB_SYNC_ALL is set, wait for the lock. The validate_lock is also used to validate a file and to zap its cache if the file was altered by a third party, so it's probably a good fit for this. (2) Move the truncation outside of the io_lock in setattr, using the same hook as is used for local directory editing. This requires the old i_size to be retained in the operation record as we commit the revised status to the inode members inside the io_lock still, but we still need to know if we reduced the file size. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-16afs: Fix silly renameDavid Howells
Fix AFS's silly rename by the following means: (1) Set the destination directory in afs_do_silly_rename() so as to avoid misbehaviour and indicate that the directory data version will increment by 1 so as to avoid warnings about unexpected changes in the DV. Also indicate that the ctime should be updated to avoid xfstest grumbling. (2) Note when the server indicates that a directory changed more than we expected (AFS_OPERATION_DIR_CONFLICT), indicating a conflict with a third party change, checking on successful completion of unlink and rename. The problem is that the FS.RemoveFile RPC op doesn't report the status of the unlinked file, though YFS.RemoveFile2 does. This can be mitigated by the assumption that if the directory DV cranked by exactly 1, we can be sure we removed one link from the file; further, ordinarily in AFS, files cannot be hardlinked across directories, so if we reduce nlink to 0, the file is deleted. However, if the directory DV jumps by more than 1, we cannot know if a third party intervened by adding or removing a link on the file we just removed a link from. The same also goes for any vnode that is at the destination of the FS.Rename RPC op. (3) Make afs_vnode_commit_status() apply the nlink drop inside the cb_lock section along with the other attribute updates if ->op_unlinked is set on the descriptor for the appropriate vnode. (4) Issue a follow up status fetch to the unlinked file in the event of a third party conflict that makes it impossible for us to know if we actually deleted the file or not. (5) Provide a flag, AFS_VNODE_SILLY_DELETED, to make afs_getattr() lie to the user about the nlink of a silly deleted file so that it appears as 0, not 1. Found with the generic/035 and generic/084 xfstests. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: afs_vnode_commit_status() doesn't need to check the RPC errorDavid Howells
afs_vnode_commit_status() is only ever called if op->error is 0, so remove the op->error checks from the function. Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-16afs: Fix use of afs_check_for_remote_deletion()David Howells
afs_check_for_remote_deletion() checks to see if error ENOENT is returned by the server in response to an operation and, if so, marks the primary vnode as having been deleted as the FID is no longer valid. However, it's being called from the operation success functions, where no abort has happened - and if an inline abort is recorded, it's handled by afs_vnode_commit_status(). Fix this by actually calling the operation aborted method if provided and having that point to afs_check_for_remote_deletion(). Fixes: e49c7b2f6de7 ("afs: Build an abstraction around an "operation" concept") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Fix truncation issues and mmap writeback sizeDavid Howells
Fix the following issues: (1) Fix writeback to reduce the size of a store operation to i_size, effectively discarding the extra data. The problem comes when afs_page_mkwrite() records that a page is about to be modified by mmap(). It doesn't know what bits of the page are going to be modified, so it records the whole page as being dirty (this is stored in page->private as start and end offsets). Without this, the marshalling for the store to the server extends the size of the file to the end of the page (in afs_fs_store_data() and yfs_fs_store_data()). (2) Fix setattr to actually truncate the pagecache, thereby clearing the discarded part of a file. (3) Fix setattr to check that the new size is okay and to disable ATTR_SIZE if i_size wouldn't change. (4) Force i_size to be updated as the result of a truncate. (5) Don't truncate if ATTR_SIZE is not set. (6) Call pagecache_isize_extended() if the file was enlarged. Note that truncate_set_size() isn't used because the setting of i_size is done inside afs_vnode_commit_status() under the vnode->cb_lock. Found with the generic/029 and generic/393 xfstests. Fixes: 31143d5d515e ("AFS: implement basic file write support") Fixes: 4343d00872e1 ("afs: Get rid of the afs_writeback record") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Concoct ctimesDavid Howells
The in-kernel afs filesystem ignores ctime because the AFS fileserver protocol doesn't support ctimes. This, however, causes various xfstests to fail. Work around this by: (1) Setting ctime to attr->ia_ctime in afs_setattr(). (2) Not ignoring ATTR_MTIME_SET, ATTR_TIMES_SET and ATTR_TOUCH settings. (3) Setting the ctime from the server mtime when on the target file when creating a hard link to it. (4) Setting the ctime on directories from their revised mtimes when renaming/moving a file. Found by the generic/221 and generic/309 xfstests. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-15afs: Fix EOF corruptionDavid Howells
When doing a partial writeback, afs_write_back_from_locked_page() may generate an FS.StoreData RPC request that writes out part of a file when a file has been constructed from pieces by doing seek, write, seek, write, ... as is done by ld. The FS.StoreData RPC is given the current i_size as the file length, but the server basically ignores it unless the data length is 0 (in which case it's just a truncate operation). The revised file length returned in the result of the RPC may then not reflect what we suggested - and this leads to i_size getting moved backwards - which causes issues later. Fix the client to take account of this by ignoring the returned file size unless the data version number jumped unexpectedly - in which case we're going to have to clear the pagecache and reload anyway. This can be observed when doing a kernel build on an AFS mount. The following pair of commands produce the issue: ld -m elf_x86_64 -z max-page-size=0x200000 --emit-relocs \ -T arch/x86/realmode/rm/realmode.lds \ arch/x86/realmode/rm/header.o \ arch/x86/realmode/rm/trampoline_64.o \ arch/x86/realmode/rm/stack.o \ arch/x86/realmode/rm/reboot.o \ -o arch/x86/realmode/rm/realmode.elf arch/x86/tools/relocs --realmode \ arch/x86/realmode/rm/realmode.elf \ >arch/x86/realmode/rm/realmode.relocs This results in the latter giving: Cannot read ELF section headers 0/18: Success as the realmode.elf file got corrupted. The sequence of events can also be driven with: xfs_io -t -f \ -c "pwrite -S 0x58 0 0x58" \ -c "pwrite -S 0x59 10000 1000" \ -c "close" \ /afs/example.com/scratch/a Fixes: 31143d5d515e ("AFS: implement basic file write support") Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-09afs: Make afs_zap_data() staticDavid Howells
Make afs_zap_data() static as it's only used in the file in which it is defined. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04afs: Reorganise volume and server trees to be rooted on the cellDavid Howells
Reorganise afs_volume objects such that they're in a tree keyed on volume ID, rooted at on an afs_cell object rather than being in multiple trees, each of which is rooted on an afs_server object. afs_server structs become per-cell and acquire a pointer to the cell. The process of breaking a callback then starts with finding the server by its network address, following that to the cell and then looking up each volume ID in the volume tree. This is simpler than the afs_vol_interest/afs_cb_interest N:M mapping web and allows those structs and the code for maintaining them to be simplified or removed. It does make a couple of things a bit more tricky, though: (1) Operations now start with a volume, not a server, so there can be more than one answer as to whether or not the server we'll end up using supports the FS.InlineBulkStatus RPC. (2) CB RPC operations that specify the server UUID. There's still a tree of servers by UUID on the afs_net struct, but the UUIDs in it aren't guaranteed unique. Signed-off-by: David Howells <dhowells@redhat.com>
2020-06-04afs: Build an abstraction around an "operation" conceptDavid Howells
Turn the afs_operation struct into the main way that most fileserver operations are managed. Various things are added to the struct, including the following: (1) All the parameters and results of the relevant operations are moved into it, removing corresponding fields from the afs_call struct. afs_call gets a pointer to the op. (2) The target volume is made the main focus of the operation, rather than the target vnode(s), and a bunch of op->vnode->volume are made op->volume instead. (3) Two vnode records are defined (op->file[]) for the vnode(s) involved in most operations. The vnode record (struct afs_vnode_param) contains: - The vnode pointer. - The fid of the vnode to be included in the parameters or that was returned in the reply (eg. FS.MakeDir). - The status and callback information that may be returned in the reply about the vnode. - Callback break and data version tracking for detecting simultaneous third-parth changes. (4) Pointers to dentries to be updated with new inodes. (5) An operations table pointer. The table includes pointers to functions for issuing AFS and YFS-variant RPCs, handling the success and abort of an operation and handling post-I/O-lock local editing of a directory. To make this work, the following function restructuring is made: (A) The rotation loop that issues calls to fileservers that can be found in each function that wants to issue an RPC (such as afs_mkdir()) is extracted out into common code, in a new file called fs_operation.c. (B) The rotation loops, such as the one in afs_mkdir(), are replaced with a much smaller piece of code that allocates an operation, sets the parameters and then calls out to the common code to do the actual work. (C) The code for handling the success and failure of an operation are moved into operation functions (as (5) above) and these are called from the core code at appropriate times. (D) The pseudo inode getting stuff used by the dynamic root code is moved over into dynroot.c. (E) struct afs_iget_data is absorbed into the operation struct and afs_iget() expects to be given an op pointer and a vnode record. (F) Point (E) doesn't work for the root dir of a volume, but we know the FID in advance (it's always vnode 1, unique 1), so a separate inode getter, afs_root_iget(), is provided to special-case that. (G) The inode status init/update functions now also take an op and a vnode record. (H) The RPC marshalling functions now, for the most part, just take an afs_operation struct as their only argument. All the data they need is held there. The result delivery functions write their answers there as well. (I) The call is attached to the operation and then the operation core does the waiting. And then the new operation code is, for the moment, made to just initialise the operation, get the appropriate vnode I/O locks and do the same rotation loop as before. This lays the foundation for the following changes in the future: (*) Overhauling the rotation (again). (*) Support for asynchronous I/O, where the fileserver rotation must be done asynchronously also. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31afs: Rename struct afs_fs_cursor to afs_operationDavid Howells
As a prelude to implementing asynchronous fileserver operations in the afs filesystem, rename struct afs_fs_cursor to afs_operation. This struct is going to form the core of the operation management and is going to acquire more members in later. Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31afs: Remove the error argument from afs_protocol_error()David Howells
Remove the error argument from afs_protocol_error() as it's always -EBADMSG. Signed-off-by: David Howells <dhowells@redhat.com>
2019-11-25Merge tag 'printk-for-5.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk Pull printk updates from Petr Mladek: - Allow to print symbolic error names via new %pe modifier. - Use pr_warn() instead of the remaining pr_warning() calls. Fix formatting of the related lines. - Add VSPRINTF entry to MAINTAINERS. * tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (32 commits) checkpatch: don't warn about new vsprintf pointer extension '%pe' MAINTAINERS: Add VSPRINTF tools lib api: Renaming pr_warning to pr_warn ASoC: samsung: Use pr_warn instead of pr_warning lib: cpu_rmap: Use pr_warn instead of pr_warning trace: Use pr_warn instead of pr_warning dma-debug: Use pr_warn instead of pr_warning vgacon: Use pr_warn instead of pr_warning fs: afs: Use pr_warn instead of pr_warning sh/intc: Use pr_warn instead of pr_warning scsi: Use pr_warn instead of pr_warning platform/x86: intel_oaktrail: Use pr_warn instead of pr_warning platform/x86: asus-laptop: Use pr_warn instead of pr_warning platform/x86: eeepc-laptop: Use pr_warn instead of pr_warning oprofile: Use pr_warn instead of pr_warning of: Use pr_warn instead of pr_warning macintosh: Use pr_warn instead of pr_warning idsn: Use pr_warn instead of pr_warning ide: Use pr_warn instead of pr_warning crypto: n2: Use pr_warn instead of pr_warning ...
2019-10-18fs: afs: Use pr_warn instead of pr_warningKefeng Wang
As said in commit f2c2cbcc35d4 ("powerpc: Use pr_warn instead of pr_warning"), removing pr_warning so all logging messages use a consistent <prefix>_warn style. Let's do it. Link: http://lkml.kernel.org/r/20191018031850.48498-23-wangkefeng.wang@huawei.com To: linux-kernel@vger.kernel.org Cc: David Howells <dhowells@redhat.com> Cc: linux-afs@lists.infradead.org Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: Petr Mladek <pmladek@suse.com>
2019-09-15afs dynroot: switch to simple_dir_operationsAl Viro
no point reinventing it (with wrong ->read(), BTW). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2019-06-20afs: Add some callback management tracepointsDavid Howells
Add a couple of tracepoints to track callback management: (1) afs_cb_miss - Logs when we were unable to apply a callback, either due to the inode being discarded or due to a competing thread applying a callback first. (2) afs_cb_break - Logs when we attempted to clear the noted callback promise, either due to the server explicitly breaking the callback, the callback promise lapsing or a local event obsoleting it. Signed-off-by: David Howells <dhowells@redhat.com>
2019-06-20afs: Fix setting of i_blocksDavid Howells
The setting of i_blocks, which is calculated from i_size, has got accidentally misordered relative to the setting of i_size when initially setting up an inode. Further, i_blocks isn't updated by afs_apply_status() when the size is updated. To fix this, break the i_size/i_blocks setting out into a helper function and call it from both places. Fixes: a58823ac4589 ("afs: Fix application of status and callback to be under same lock") Signed-off-by: David Howells <dhowells@redhat.com>
2019-06-20afs: Fix over zealous "vnode modified" warningsDavid Howells
Occasionally, warnings like this: vnode modified 2af7 on {10000b:1} [exp 2af2] YFS.FetchStatus(vnode) are emitted into the kernel log. This indicates that when we were applying the updated vnode (file) status retrieved from the server to an inode we saw that the data version number wasn't what we were expecting (in this case it's 0x2af7 rather than 0x2af2). We've usually received a callback from the server prior to this point - or the callback promise has lapsed - so the warning is merely informative and the state is to be expected. Fix this by only emitting the warning if the we still think that we have a valid callback promise and haven't received a callback. Also change the format slightly so so that the new data version doesn't look like part of the text, the like is prefixed with "kAFS: " and the message is ranked as a warning. Fixes: 31143d5d515e ("AFS: implement basic file write support") Reported-by: Ian Wienand <iwienand@redhat.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Pass pre-fetch server and volume break counts into afs_iget5_set()David Howells
Pass the server and volume break counts from before the status fetch operation that queried the attributes of a file into afs_iget5_set() so that the new vnode's break counters can be initialised appropriately. This allows detection of a volume or server break that happened whilst we were fetching the status or setting up the vnode. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix unlink to handle YFS.RemoveFile2 betterDavid Howells
Make use of the status update for the target file that the YFS.RemoveFile2 RPC op returns to correctly update the vnode as to whether the file was actually deleted or just had nlink reduced. Fixes: 30062bd13e36 ("afs: Implement YFS support in the fs client") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Clear AFS_VNODE_CB_PROMISED if we detect callback expiryDavid Howells
Fix afs_validate() to clear AFS_VNODE_CB_PROMISED on a vnode if we detect any condition that causes the callback promise to be broken implicitly, including server break (cb_s_break), volume break (cb_v_break) or callback expiry. Fixes: ae3b7361dc0e ("afs: Fix validation/callback interaction") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make vnode->cb_interest RCU safeDavid Howells
Use RCU-based freeing for afs_cb_interest struct objects and use RCU on vnode->cb_interest. Use that change to allow afs_check_validity() to use read_seqbegin_or_lock() instead of read_seqlock_excl(). This also requires the caller of afs_check_validity() to hold the RCU read lock across the call. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Split afs_validate() so first part can be used under LOOKUP_RCUDavid Howells
Split afs_validate() so that the part that decides if the vnode is still valid can be used under LOOKUP_RCU conditions from afs_d_revalidate(). Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Don't save callback version and type fieldsDavid Howells
Don't save callback version and type fields as the version is about the format of the callback information and the type is relative to the particular RPC call. Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Fix application of status and callback to be under same lockDavid Howells
When applying the status and callback in the response of an operation, apply them in the same critical section so that there's no race between checking the callback state and checking status-dependent state (such as the data version). Fix this by: (1) Allocating a joint {status,callback} record (afs_status_cb) before calling the RPC function for each vnode for which the RPC reply contains a status or a status plus a callback. A flag is set in the record to indicate if a callback was actually received. (2) These records are passed into the RPC functions to be filled in. The afs_decode_status() and yfs_decode_status() functions are removed and the cb_lock is no longer taken. (3) xdr_decode_AFSFetchStatus() and xdr_decode_YFSFetchStatus() no longer update the vnode. (4) xdr_decode_AFSCallBack() and xdr_decode_YFSCallBack() no longer update the vnode. (5) vnodes, expected data-version numbers and callback break counters (cb_break) no longer need to be passed to the reply delivery functions. Note that, for the moment, the file locking functions still need access to both the call and the vnode at the same time. (6) afs_vnode_commit_status() is now given the cb_break value and the expected data_version and the task of applying the status and the callback to the vnode are now done here. This is done under a single taking of vnode->cb_lock. (7) afs_pages_written_back() is now called by afs_store_data() rather than by the reply delivery function. afs_pages_written_back() has been moved to before the call point and is now given the first and last page numbers rather than a pointer to the call. (8) The indicator from YFS.RemoveFile2 as to whether the target file actually got removed (status.abort_code == VNOVNODE) rather than merely dropping a link is now checked in afs_unlink rather than in xdr_decode_YFSFetchStatus(). Supplementary fixes: (*) afs_cache_permit() now gets the caller_access mask from the afs_status_cb object rather than picking it out of the vnode's status record. afs_fetch_status() returns caller_access through its argument list for this purpose also. (*) afs_inode_init_from_status() now uses a write lock on cb_lock rather than a read lock and now sets the callback inside the same critical section. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not setDavid Howells
Don't invalidate the callback promise on a directory if the AFS_VNODE_DIR_VALID flag is not set (which indicates that the directory contents are invalid, due to edit failure, callback break, page reclaim). The directory will be reloaded next time the directory is accessed, so clearing the callback flag at this point may race with a reload of the directory and cancel it's recorded callback promise. Fixes: f3ddee8dc4e2 ("afs: Fix directory handling") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16afs: Make some RPC operations non-interruptibleDavid Howells
Make certain RPC operations non-interruptible, including: (*) Set attributes (*) Store data We don't want to get interrupted during a flush on close, flush on unlock, writeback or an inode update, leaving us in a state where we still need to do the writeback or update. (*) Extend lock (*) Release lock We don't want to get lock extension interrupted as the file locks on the server are time-limited. Interruption during lock release is less of an issue since the lock is time-limited, but it's better to complete the release to avoid a several-minute wait to recover it. *Setting* the lock isn't a problem if it's interrupted since we can just return to the user and tell them they were interrupted - at which point they can elect to retry. (*) Silly unlink We want to remove silly unlink files if we can, rather than leaving them for the salvager to clear up. Note that whilst these calls are no longer interruptible, they do have timeouts on them, so if the server stops responding the call will fail with something like ETIME or ECONNRESET. Without this, the following: kAFS: Unexpected error from FS.StoreData -512 appears in dmesg when a pending store data gets interrupted and some processes may just hang. Additionally, make the code that checks/updates the server record ignore failure due to interruption if the main call is uninterruptible and if the server has an address list. The next op will check it again since the expiration time on the old list has past. Fixes: d2ddc776a458 ("afs: Overhaul volume and server record caching and fileserver rotation") Reported-by: Jonathan Billings <jsbillings@jsbillings.org> Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-15afs: Fix key leak in afs_release() and afs_evict_inode()David Howells
Fix afs_release() to go through the cleanup part of the function if FMODE_WRITE is set rather than exiting through vfs_fsync() (which skips the cleanup). The cleanup involves discarding the refs on the key used for file ops and the writeback key record. Also fix afs_evict_inode() to clean up any left over wb keys attached to the inode/vnode when it is removed. Fixes: 5a8132761609 ("afs: Do better accretion of small writes on newly created content") Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-07afs: Calculate i_blocks based on file sizeMarc Dionne
While it's not possible to give an accurate number for the blocks used on the server, populate i_blocks based on the file size so that 'du' can give a reasonable estimate. The value is rounded up to 1K granularity, for consistency with what other AFS clients report, and the servers' 1K usage quota unit. Note that the value calculated by 'du' at the root of a volume can still be slightly lower than the quota usage on the server, as 0-length files are charged 1 quota block, but are reported as occupying 0 blocks. Again, this is consistent with other AFS clients. Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-07afs: Log more information for "kAFS: AFS vnode with undefined type\n"David Howells
Log more information when "kAFS: AFS vnode with undefined type\n" is displayed due to a vnode record being retrieved from the server that appears to have a duff file type (usually 0). This prints more information to try and help pin down the problem. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-25afs: Implement sillyrename for unlink and renameDavid Howells
Implement sillyrename for AFS unlink and rename, using the NFS variant implementation as a basis. Note that the asynchronous file locking extender/releaser has to be notified with a state change to stop it complaining if there's a race between that and the actual file deletion. A tracepoint, afs_silly_rename, is also added to note the silly rename and the cleanup. The afs_edit_dir tracepoint is given some extra reason indicators and the afs_flock_ev tracepoint is given a silly-delete file lock cancellation indicator. Signed-off-by: David Howells <dhowells@redhat.com>
2019-04-13afs: avoid deprecated get_seconds()Arnd Bergmann
get_seconds() has a limited range on 32-bit architectures and is deprecated because of that. While AFS uses the same limits for its inode timestamps on the wire protocol, let's just use the simpler current_time() as we do for other file systems. This will still zero out the 'tv_nsec' field of the timestamps internally. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17afs: Fix key refcounting in file locking codeDavid Howells
Fix the refcounting of the authentication keys in the file locking code. The vnode->lock_key member points to a key on which it expects to be holding a ref, but it isn't always given an extra ref, however. Fixes: 0fafdc9f888b ("afs: Fix file locking") Signed-off-by: David Howells <dhowells@redhat.com>
2019-01-17afs: Don't set vnode->cb_s_break in afs_validate()Marc Dionne
A cb_interest record is not necessarily attached to the vnode on entry to afs_validate(), which can cause an oops when we try to bring the vnode's cb_s_break up to date in the default case (ie. no current callback promise and the vnode has not been deleted). Fix this by simply removing the line, as vnode->cb_s_break will be set when needed by afs_register_server_cb_interest() when we next get a callback promise from RPC call. The oops looks something like: BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 ... RIP: 0010:afs_validate+0x66/0x250 [kafs] ... Call Trace: afs_d_revalidate+0x8d/0x340 [kafs] ? __d_lookup+0x61/0x150 lookup_dcache+0x44/0x70 ? lookup_dcache+0x44/0x70 __lookup_hash+0x24/0xa0 do_unlinkat+0x11d/0x2c0 __x64_sys_unlink+0x23/0x30 do_syscall_64+0x4d/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: ae3b7361dc0e ("afs: Fix validation/callback interaction") Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2018-11-29afs: Fix validation/callback interactionDavid Howells
When afs_validate() is called to validate a vnode (inode), there are two unhandled cases in the fastpath at the top of the function: (1) If the vnode is promised (AFS_VNODE_CB_PROMISED is set), the break counters match and the data has expired, then there's an implicit case in which the vnode needs revalidating. This has no consequences since the default "valid = false" set at the top of the function happens to do the right thing. (2) If the vnode is not promised and it hasn't been deleted (AFS_VNODE_DELETED is not set) then there's a default case we're not handling in which the vnode is invalid. If the vnode is invalid, we need to bring cb_s_break and cb_v_break up to date before we refetch the status. As a consequence, once the server loses track of the client (ie. sufficient time has passed since we last sent it an operation), it will send us a CB.InitCallBackState* operation when we next try to talk to it. This calls afs_init_callback_state() which increments afs_server::cb_s_break, but this then doesn't propagate to the afs_vnode record. The result being that every afs_validate() call thereafter sends a status fetch operation to the server. Clarify and fix this by: (A) Setting valid in all the branches rather than initialising it at the top so that the compiler catches where we've missed. (B) Restructuring the logic in the 'promised' branch so that we set valid to false if the callback is due to expire (or has expired) and so that the final case is that the vnode is still valid. (C) Adding an else-statement that ups cb_s_break and cb_v_break if the promised and deleted cases don't match. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-10-24afs: Calc callback expiry in op reply deliveryDavid Howells
Calculate the callback expiration time at the point of operation reply delivery, using the reply time queried from AF_RXRPC on that call as a base. Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFSDavid Howells
Increase the sizes of the volume ID to 64 bits and the vnode ID (inode number equivalent) to 96 bits to allow the support of YFS. This requires the iget comparator to check the vnode->fid rather than i_ino and i_generation as i_ino is not sufficiently capacious. It also requires this data to be placed into the vnode cache key for fscache. For the moment, just discard the top 32 bits of the vnode ID when returning it though stat. Signed-off-by: David Howells <dhowells@redhat.com>
2018-10-24afs: Better tracing of protocol errorsDavid Howells
Include the site of detection of AFS protocol errors in trace lines to better be able to determine what went wrong. Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14afs: Fix whole-volume callback handlingDavid Howells
It's possible for an AFS file server to issue a whole-volume notification that callbacks on all the vnodes in the file have been broken. This is done for R/O and backup volumes (which don't have per-file callbacks) and for things like a volume being taken offline. Fix callback handling to detect whole-volume notifications, to track it across operations and to check it during inode validation. Fixes: c435ee34551e ("afs: Overhaul the callback handling") Signed-off-by: David Howells <dhowells@redhat.com>
2018-05-14afs: Fix directory page lockingDavid Howells
The afs directory loading code (primarily afs_read_dir()) locks all the pages that hold a directory's content blob to defend against getdents/getdents races and getdents/lookup races where the competitors issue conflicting reads on the same data. As the reads will complete consecutively, they may retrieve different versions of the data and one may overwrite the data that the other is busy parsing. Fix this by not locking the pages at all, but rather by turning the validation lock into an rwsem and getting an exclusive lock on it whilst reading the data or validating the attributes and a shared lock whilst parsing the data. Sharing the attribute validation lock should be fine as the data fetch will retrieve the attributes also. The individual page locks aren't needed at all as the only place they're being used is to serialise data loading. Without this patch, the: if (!test_bit(AFS_VNODE_DIR_VALID, &dvnode->flags)) { ... } part of afs_read_dir() may be skipped, leaving the pages unlocked when we hit the success: clause - in which case we try to unlock the not-locked pages, leading to the following oops: page:ffffe38b405b4300 count:3 mapcount:0 mapping:ffff98156c83a978 index:0x0 flags: 0xfffe000001004(referenced|private) raw: 000fffe000001004 ffff98156c83a978 0000000000000000 00000003ffffffff raw: dead000000000100 dead000000000200 0000000000000001 ffff98156b27c000 page dumped because: VM_BUG_ON_PAGE(!PageLocked(page)) page->mem_cgroup:ffff98156b27c000 ------------[ cut here ]------------ kernel BUG at mm/filemap.c:1205! ... RIP: 0010:unlock_page+0x43/0x50 ... Call Trace: afs_dir_iterate+0x789/0x8f0 [kafs] ? _cond_resched+0x15/0x30 ? kmem_cache_alloc_trace+0x166/0x1d0 ? afs_do_lookup+0x69/0x490 [kafs] ? afs_do_lookup+0x101/0x490 [kafs] ? key_default_cmp+0x20/0x20 ? request_key+0x3c/0x80 ? afs_lookup+0xf1/0x340 [kafs] ? __lookup_slow+0x97/0x150 ? lookup_slow+0x35/0x50 ? walk_component+0x1bf/0x490 ? path_lookupat.isra.52+0x75/0x200 ? filename_lookup.part.66+0xa0/0x170 ? afs_end_vnode_operation+0x41/0x60 [kafs] ? __check_object_size+0x9c/0x171 ? strncpy_from_user+0x4a/0x170 ? vfs_statx+0x73/0xe0 ? __do_sys_newlstat+0x39/0x70 ? __x64_sys_getdents+0xc9/0x140 ? __x64_sys_getdents+0x140/0x140 ? do_syscall_64+0x5b/0x160 ? entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fixes: f3ddee8dc4e2 ("afs: Fix directory handling") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09afs: Trace protocol errorsDavid Howells
Trace protocol errors detected in afs. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09afs: Locally edit directory data for mkdir/create/unlink/...David Howells
Locally edit the contents of an AFS directory upon a successful inode operation that modifies that directory (such as mkdir, create and unlink) so that we can avoid the current practice of re-downloading the directory after each change. This is viable provided that the directory version number we get back from the modifying RPC op is exactly incremented by 1 from what we had previously. The data in the directory contents is in a defined format that we have to parse locally to perform lookups and readdir, so modifying isn't a problem. If the edit fails, we just clear the VALID flag on the directory and it will be reloaded next time it is needed. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09afs: Fix directory handlingDavid Howells
AFS directories are structured blobs that are downloaded just like files and then parsed by the lookup and readdir code and, as such, are currently handled in the pagecache like any other file, with the entire directory content being thrown away each time the directory changes. However, since the blob is a known structure and since the data version counter on a directory increases by exactly one for each change committed to that directory, we can actually edit the directory locally rather than fetching it from the server after each locally-induced change. What we can't do, though, is mix data from the server and data from the client since the server is technically at liberty to rearrange or compress a directory if it sees fit, provided it updates the data version number when it does so and breaks the callback (ie. sends a notification). Further, lookup with lookup-ahead, readdir and, when it arrives, local editing are likely want to scan the whole of a directory. So directory handling needs to be improved to maintain the coherency of the directory blob prior to permitting local directory editing. To this end: (1) If any directory page gets discarded, invalidate and reread the entire directory. (2) If readpage notes that if when it fetches a single page that the version number has changed, the entire directory is flagged for invalidation. (3) Read as much of the directory in one go as we can. Note that this removes local caching of directories in fscache for the moment as we can't pass the pages to fscache_read_or_alloc_pages() since page->lru is in use by the LRU. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09afs: Keep track of invalid-before version for dentry coherencyDavid Howells
Each afs dentry is tagged with the version that the parent directory was at last time it was validated and, currently, if this differs, the directory is scanned and the dentry is refreshed. However, this leads to an excessive amount of revalidation on directories that get modified on the client without conflict with another client. We know there's no conflict because the parent directory's data version number got incremented by exactly 1 on any create, mkdir, unlink, etc., therefore we can trust the current state of the unaffected dentries when we perform a local directory modification. Optimise by keeping track of the last version of the parent directory that was changed outside of the client in the parent directory's vnode and using that to validate the dentries rather than the current version. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09afs: Rearrange status mappingDavid Howells
Rearrange the AFSFetchStatus to inode attribute mapping code in a number of ways: (1) Use an XDR structure rather than a series of incremented pointer accesses when decoding an AFSFetchStatus object. This allows out-of-order decode. (2) Don't store the if_version value but rather just check it and abort if it's not something we can handle. (3) Store the owner and group in the status record as raw values rather than converting them to kuid/kgid. Do that when they're mapped into i_uid/i_gid. (4) Validate the type and abort code up front and abort if they're wrong. (5) Split the inode attribute setting out into its own function from the XDR decode of an AFSFetchStatus object. This allows it to be called from elsewhere too. (6) Differentiate changes to data from changes to metadata. (7) Use the split-out attribute mapping function from afs_iget(). Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09afs: Make it possible to get the data version in readpageDavid Howells
Store the data version number indicated by an FS.FetchData op into the read request structure so that it's accessible by the page reader. Signed-off-by: David Howells <dhowells@redhat.com>
2018-04-09afs: Init inode before accessing cacheDavid Howells
We no longer parse symlinks when we get the inode to determine if this symlink is actually a mountpoint as we detect that by examining the mode instead (symlinks are always 0777 and mountpoints 0644). Access the cache after mapping the status so that we don't have to manually set the inode size now. Note that this may need adjusting if the disconnected operation is implemented as the file metadata may have to be obtained from the cache. Signed-off-by: David Howells <dhowells@redhat.com>