aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2014-01-13Revert "kernfs: implement kernfs_{de|re}activate[_self]()"Greg Kroah-Hartman
This reverts commit 9f010c2ad5194a4b682e747984477850fabd03be. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs, sysfs, driver-core: implement kernfs_remove_self() and its ↵Greg Kroah-Hartman
wrappers" This reverts commit 1ae06819c77cff1ea2833c94f8c093fe8a5c79db. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "sysfs, driver-core: remove unused ↵Greg Kroah-Hartman
{sysfs|device}_schedule_callback_owner()" This reverts commit d1ba277e79889085a2faec3b68b91ce89c63f888. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-13Revert "kernfs: remove unnecessary NULL check in __kernfs_remove()"Greg Kroah-Hartman
This reverts commit 88533f990c616cf50c2fe585ea03f75c806a293d. Tejun writes: I'm sorry but can you please revert the whole series? get_active() waiting while a node is deactivated has potential to lead to deadlock and that deactivate/reactivate interface is something fundamentally flawed and that cgroup will have to work with the remove_self() like everybody else. IOW, I think the first posting was correct. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-11kernfs: remove unnecessary NULL check in __kernfs_remove()Tejun Heo
895a068a524e ("kernfs: make kernfs_get_active() block if the node is deactivated but not removed") added "struct kernfs_root *root = kernfs_root(kn);" at the head of the function; however, the parameter @kn is checked for later implying that the function may be called with NULL. This means that we may end up invoking kernfs_root() with NULL which will oops. None of the existing users invokes removal with NULL @kn, so this bug doesn't actually trigger. We can relocate kernfs_root() invocation after NULL check; however, allowing NULL param tends to cause more confusion than actually helping anything. As there's no existing user, let's remove the spurious NULL check. This bug was detected by smatch. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10sysfs, driver-core: remove unused {sysfs|device}_schedule_callback_owner()Tejun Heo
All device_schedule_callback_owner() users are converted to use device_remove_file_self(). Remove now unused {sysfs|device}_schedule_callback_owner(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs, sysfs, driver-core: implement kernfs_remove_self() and its wrappersTejun Heo
Sometimes it's necessary to implement a node which wants to delete nodes including itself. This isn't straightforward because of kernfs active reference. While a file operation is in progress, an active reference is held and kernfs_remove() waits for all such references to drain before completing. For a self-deleting node, this is a deadlock as kernfs_remove() ends up waiting for an active reference that itself is sitting on top of. This currently is worked around in the sysfs layer using sysfs_schedule_callback() which makes such removals asynchronous. While it works, it's rather cumbersome and inherently breaks synchronicity of the operation - the file operation which triggered the operation may complete before the removal is finished (or even started) and the removal may fail asynchronously. If a removal operation is immmediately followed by another operation which expects the specific name to be available (e.g. removal followed by rename onto the same name), there's no way to make the latter operation reliable. The thing is there's no inherent reason for this to be asynchrnous. All that's necessary to do this synchronous is a dedicated operation which drops its own active ref and deactivates self. This patch implements kernfs_remove_self() and its wrappers in sysfs and driver core. kernfs_remove_self() is to be called from one of the file operations, drops the active ref and deactivates using __kernfs_deactivate_self(), removes the self node, and restores active ref to the dead node using __kernfs_reactivate_self() so that the ref is balanced afterwards. __kernfs_remove() is updated so that it takes an early exit if the target node is already fully removed so that the active ref restored by kernfs_remove_self() after removal doesn't confuse the deactivation path. This makes implementing self-deleting nodes very easy. The normal removal path doesn't even need to be changed to use kernfs_remove_self() for the self-deleting node. The method can invoke kernfs_remove_self() on itself before proceeding the normal removal path. kernfs_remove() invoked on the node by the normal deletion path will simply be ignored. This will replace sysfs_schedule_callback(). A subtle feature of sysfs_schedule_callback() is that it collapses multiple invocations - even if multiple removals are triggered, the removal callback is run only once. An equivalent effect can be achieved by testing the return value of kernfs_remove_self() - only the one which gets %true return value should proceed with actual deletion. All other instances of kernfs_remove_self() will wait till the enclosing kernfs operation which invoked the winning instance of kernfs_remove_self() finishes and then return %false. This trivially makes all users of kernfs_remove_self() automatically show correct synchronous behavior even when there are multiple concurrent operations - all "echo 1 > delete" instances will finish only after the whole operation is completed by one of the instances. v2: For !CONFIG_SYSFS, dummy version kernfs_remove_self() was missing and sysfs_remove_file_self() had incorrect return type. Fix it. Reported by kbuild test bot. v3: Updated to use __kernfs_{de|re}activate_self(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Alan Stern <stern@rowland.harvard.edu> Cc: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: implement kernfs_{de|re}activate[_self]()Tejun Heo
This patch implements four functions to manipulate deactivation state - deactivate, reactivate and the _self suffixed pair. A new fields kernfs_node->deact_depth is added so that concurrent and nested deactivations are handled properly. kernfs_node->hash is moved so that it's paired with the new field so that it doesn't increase the size of kernfs_node. A kernfs user's lock would normally nest inside active ref but during removal the user may want to perform kernfs_remove() while holding the said lock, which would introduce a reverse locking dependency. This function can be used to break such reverse dependency by allowing deactivation step to performed separately outside user's critical section. This will also be used implement kernfs_remove_self(). Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: make kernfs_get_active() block if the node is deactivated but not ↵Tejun Heo
removed Currently, kernfs_get_active() fails if the target node is deactivated. This is fine as a node always gets removed after deactivation; however, we're gonna add reactivation so the assumption won't hold. It'd be incorrect for kernfs_get_active() to fail for a node which was deactivated only temporarily. This patch makes kernfs_get_active() block if the node is deactivated but not removed. If the node gets reactivated (not yet implemented), it will be retried and succeed. If the node gets removed, it will be woken up and fail. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: remove kernfs_addrm_cxtTejun Heo
kernfs_addrm_cxt and the accompanying kernfs_addrm_start/finish() were added because there were operations which should be performed outside kernfs_mutex after adding and removing kernfs_nodes. The necessary operations were recorded in kernfs_addrm_cxt and performed by kernfs_addrm_finish(); however, after the recent changes which relocated deactivation and unmapping so that they're performed directly during removal, the only operation kernfs_addrm_finish() performs is kernfs_put(), which can be moved inside the removal path too. This patch moves the kernfs_put() of the base ref to __kernfs_remove() and remove kernfs_addrm_cxt and kernfs_addrm_start/finish(). * kernfs_add_one() is updated to grab and release the parent's active ref and kernfs_mutex itself. kernfs_get/put_active() and kernfs_addrm_start/finish() invocations around it are removed from all users. * __kernfs_remove() puts an unlinked node directly instead of chaining it to kernfs_addrm_cxt. Its callers are updated to grab and release kernfs_mutex instead of calling kernfs_addrm_start/finish() around it. v2: Updated to fit the v2 restructuring of removal path. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: invoke kernfs_unmap_bin_file() directly from __kernfs_remove()Tejun Heo
kernfs_unmap_bin_file() is supposed to unmap all memory mappings of the target file before kernfs_remove() finishes; however, it currently is being called from kernfs_addrm_finish() and has the same race problem as the original implementation of deactivation when there are multiple removers - only the remover which snatches the node to its addrm_cxt->removed list is guaranteed to wait for its completion before returning. It can be fixed by moving kernfs_unmap_bin_file() invocation from kernfs_addrm_finish() to __kernfs_remove(). The function may be called multiple times but that shouldn't do any harm. We end up dropping kernfs_mutex in the removal loop and the node may be removed inbetween by someone else. kernfs_unlink_sibling() is updated to test whether the node has already been removed and return accordingly. __kernfs_remove() in turn performs post-unlinking cleanup only if it actually unlinked the node. KERNFS_HAS_MMAP test is moved out of the unmap function into __kernfs_remove() so that we don't unlock kernfs_mutex unnecessarily. While at it, drop the now meaningless "bin" qualifier from the function name. v2: Rewritten to fit the v2 restructuring of removal path. HAS_MMAP test relocated. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: restructure removal path to fix possible premature returnTejun Heo
The recursive nature of kernfs_remove() means that, even if kernfs_remove() is not allowed to be called multiple times on the same node, there may be race conditions between removal of parent and its descendants. While we can claim that kernfs_remove() shouldn't be called on one of the descendants while the removal of an ancestor is in progress, such rule is unnecessarily restrictive and very difficult to enforce. It's better to simply allow invoking kernfs_remove() as the caller sees fit as long as the caller ensures that the node is accessible. The current behavior in such situations is broken. Whoever enters removal path first takes the node off the hierarchy and then deactivates. Following removers either return as soon as it notices that it's not the first one or can't even find the target node as it has already been removed from the hierarchy. In both cases, the following removers may finish prematurely while the nodes which should be removed and drained are still being processed by the first one. This patch restructures so that multiple removers, whether through recursion or direction invocation, always follow the following rules. * When there are multiple concurrent removers, only one puts the base ref. * Regardless of which one puts the base ref, all removers are blocked until the target node is fully deactivated and removed. To achieve the above, removal path now first deactivates the subtree, drains it and then unlinks one-by-one. __kernfs_deactivate() is called directly from __kernfs_removal() and drops and regrabs kernfs_mutex for each descendant to drain active refs. As this means that multiple removers can enter __kernfs_deactivate() for the same node, the function is updated so that it can handle multiple deactivators of the same node - only one actually deactivates but all wait till drain completion. The restructured removal path guarantees that a removed node gets unlinked only after the node is deactivated and drained. Combined with proper multiple deactivator handling, this guarantees that any invocation of kernfs_remove() returns only after the node itself and all its descendants are deactivated, drained and removed. v2: Draining separated into a separate loop (used to be in the same loop as unlink) and done from __kernfs_deactivate(). This is to allow exposing deactivation as a separate interface later. Root node removal was broken in v1 patch. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: remove KERNFS_REMOVEDTejun Heo
KERNFS_REMOVED is used to mark half-initialized and dying nodes so that they don't show up in lookups and deny adding new nodes under or renaming it; however, its role overlaps those of deactivation and removal from rbtree. It's necessary to deny addition of new children while removal is in progress; however, this role considerably intersects with deactivation - KERNFS_REMOVED prevents new children while deactivation prevents new file operations. There's no reason to have them separate making things more complex than necessary. KERNFS_REMOVED is also used to decide whether a node is still visible to vfs layer, which is rather redundant as equivalent determination can be made by testing whether the node is on its parent's children rbtree or not. This patch removes KERNFS_REMOVED. * Instead of KERNFS_REMOVED, each node now starts its life deactivated. This means that we now use both atomic_add() and atomic_sub() on KN_DEACTIVATED_BIAS, which is INT_MIN. The compiler generates an overflow warnings when negating INT_MIN as the negation can't be represented as a positive number. Nothing is actually broken but let's bump BIAS by one to avoid the warnings for archs which negates the subtrahend.. * KERNFS_REMOVED tests in add and rename paths are replaced with kernfs_get/put_active() of the target nodes. Due to the way the add path is structured now, active ref handling is done in the callers of kernfs_add_one(). This will be consolidated up later. * kernfs_remove_one() is updated to deactivate instead of setting KERNFS_REMOVED. This removes deactivation from kernfs_deactivate(), which is now renamed to kernfs_drain(). * kernfs_dop_revalidate() now tests RB_EMPTY_NODE(&kn->rb) instead of KERNFS_REMOVED and KERNFS_REMOVED test in kernfs_dir_pos() is dropped. A node which is removed from the children rbtree is not included in the iteration in the first place. This means that a node may be visible through vfs a bit longer - it's now also visible after deactivation until the actual removal. This slightly enlarged window difference doesn't make any difference to the userland. * Sanity check on KERNFS_REMOVED in kernfs_put() is replaced with checks on the active ref. * Some comment style updates in the affected area. v2: Reordered before removal path restructuring. kernfs_active() dropped and kernfs_get/put_active() used instead. RB_EMPTY_NODE() used in the lookup paths. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: remove KERNFS_ACTIVE_REF and add kernfs_lockdep()Tejun Heo
There currently are two mechanisms gating active ref lockdep annotations - KERNFS_LOCKDEP flag and KERNFS_ACTIVE_REF type mask. The former disables lockdep annotations in kernfs_get/put_active() while the latter disables all of kernfs_deactivate(). While KERNFS_ACTIVE_REF also behaves as an optimization to skip the deactivation step for non-file nodes, the benefit is marginal and it needlessly diverges code paths. Let's drop KERNFS_ACTIVE_REF and use KERNFS_LOCKDEP in kernfs_deactivate() too. While at it, add a test helper kernfs_lockdep() to test KERNFS_LOCKDEP flag so that it's more convenient and the related code can be compiled out when not enabled. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: replace kernfs_node->u.completion with kernfs_root->deactivate_waitqTejun Heo
kernfs_node->u.completion is used to notify deactivation completion from kernfs_put_active() to kernfs_deactivate(). We now allow multiple racing removals of the same node and the current removal scheme is no longer correct - kernfs_remove() invocation may return before the node is properly deactivated if it races against another removal. The removal path will be restructured to address the issue. To help such restructure which requires supporting multiple waiters, this patch replaces kernfs_node->u.completion with kernfs_root->deactivate_waitq. This makes deactivation event notifications share a per-root waitqueue_head; however, the wait path is quite cold and this will also allow shaving one pointer off kernfs_node. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-10kernfs: fix get_active failure handling in kernfs_seq_*()Tejun Heo
When kernfs_seq_start() fails to obtain an active reference, it returns ERR_PTR(-ENODEV). kernfs_seq_stop() is then invoked with the error pointer value; however, it still proceeds to invoke kernfs_put_active() on the node leading to unbalanced put. If kernfs_seq_stop() is called even after active ref failure, it should skip invocation of @ops->seq_stop() and put_active. Unfortunately, this is a bit complicated because active ref failure isn't the only thing which may fail with ERR_PTR(-ENODEV). @ops->seq_start/next() may also fail with the error value and kernfs_seq_stop() doesn't have a way to tell apart those failures. Work it around by factoring out the active part of kernfs_seq_stop() into kernfs_seq_stop_active() and invoking it directly if @ops->seq_start/next() fail with ERR_PTR(-ENODEV) and updating kernfs_seq_stop() to skip kernfs_seq_stop_active() on ERR_PTR(-ENODEV). This is a bit nasty but ensures that the active put is skipped iff get_active failed in kernfs_seq_start(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-24Merge 3.13-rc5 into staging-nextGreg Kroah-Hartman
We want these fixes here to handle some merge issues. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-22Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds
Pull AIO leak fixes from Ben LaHaise: "I've put these two patches plus Linus's change through a round of tests, and it passes millions of iterations of the aio numa migratepage test, as well as a number of repetitions of a few simple read and write tests. The first patch fixes the memory leak Kent introduced, while the second patch makes aio_migratepage() much more paranoid and robust" * git://git.kvack.org/~bcrl/aio-next: aio/migratepages: make aio migrate pages sane aio: fix kioctx leak introduced by "aio: Fix a trinity splat"
2013-12-22aio: clean up and fix aio_setup_ring page mappingLinus Torvalds
Since commit 36bc08cc01709 ("fs/aio: Add support to aio ring pages migration") the aio ring setup code has used a special per-ring backing inode for the page allocations, rather than just using random anonymous pages. However, rather than remembering the pages as it allocated them, it would allocate the pages, insert them into the file mapping (dirty, so that they couldn't be free'd), and then forget about them. And then to look them up again, it would mmap the mapping, and then use "get_user_pages()" to get back an array of the pages we just created. Now, not only is that incredibly inefficient, it also leaked all the pages if the mmap failed (which could happen due to excessive number of mappings, for example). So clean it all up, making it much more straightforward. Also remove some left-overs of the previous (broken) mm_populate() usage that was removed in commit d6c355c7dabc ("aio: fix race in ring buffer page lookup introduced by page migration support") but left the pointless and now misleading MAP_POPULATE flag around. Tested-and-acked-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-21aio/migratepages: make aio migrate pages saneBenjamin LaHaise
The arbitrary restriction on page counts offered by the core migrate_page_move_mapping() code results in rather suspicious looking fiddling with page reference counts in the aio_migratepage() operation. To fix this, make migrate_page_move_mapping() take an extra_count parameter that allows aio to tell the code about its own reference count on the page being migrated. While cleaning up aio_migratepage(), make it validate that the old page being passed in is actually what aio_migratepage() expects to prevent misbehaviour in the case of races. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
2013-12-21aio: fix kioctx leak introduced by "aio: Fix a trinity splat"Benjamin LaHaise
e34ecee2ae791df674dfb466ce40692ca6218e43 reworked the percpu reference counting to correct a bug trinity found. Unfortunately, the change lead to kioctxes being leaked because there was no final reference count to put. Add that reference count back in to fix things. Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: stable@vger.kernel.org
2013-12-20Merge tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs bugfixes from Ben Myers: "This contains fixes for some asserts related to project quotas, a memory leak, a hang when disabling group or project quotas before disabling user quotas, Dave's email address, several fixes for the alignment of file allocation to stripe unit/width geometry, a fix for an assertion with xfs_zero_remaining_bytes, and the behavior of metadata writeback in the face of IO errors. Details: - fix memory leak in xfs_dir2_node_removename - fix quota assertion in xfs_setattr_size - fix quota assertions in xfs_qm_vop_create_dqattach - fix for hang when disabling group and project quotas before disabling user quotas - fix Dave Chinner's email address in MAINTAINERS - fix for file allocation alignment - fix for assertion in xfs_buf_stale by removing xfsbdstrat - fix for alignment with swalloc mount option - fix for "retry forever" semantics on IO errors" * tag 'xfs-for-linus-v3.13-rc5' of git://oss.sgi.com/xfs/xfs: xfs: abort metadata writeback on permanent errors xfs: swalloc doesn't align allocations properly xfs: remove xfsbdstrat error xfs: align initial file allocations correctly MAINTAINERS: fix incorrect mail address of XFS maintainer xfs: fix infinite loop by detaching the group/project hints from user dquot xfs: fix assertion failure at xfs_setattr_nonsize xfs: fix false assertion at xfs_qm_vop_create_dqattach xfs: fix memory leak in xfs_dir2_node_removename
2013-12-20pstore: Don't allow high traffic options on fragile devicesLuck, Tony
Some pstore backing devices use on board flash as persistent storage. These have limited numbers of write cycles so it is a poor idea to use them from high frequency operations. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-18Merge tag 'driver-core-3.13-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fix from Greg KH: "Here's a single sysfs fix for 3.13-rc5 that resolves a lockdep issue in sysfs that has been reported" * tag 'driver-core-3.13-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: sysfs: give different locking key to regular and bin files
2013-12-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client Pull two Ceph fixes from Sage Weil: "One of these is fixing a regression from the d_flags file type patch that went into -rc1 that broke instantiation of inodes and dentries (we were doing dentries first). The other is just an off-by-one corner case" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: Avoid data inconsistency due to d-cache aliasing in readpage() ceph: initialize inode before instantiating dentry
2013-12-17kernfs: add kernfs_dir_opsTejun Heo
Add support for mkdir(2), rmdir(2) and rename(2) syscalls. This is implemented through optional kernfs_dir_ops callback table which can be specified on kernfs_create_root(). An implemented callback is invoked when the matching syscall is invoked. As kernfs keep dcache syncs with internal representation and revalidates dentries on each access, the implementation of these methods is extremely simple. Each just discovers the relevant kernfs_node(s) and invokes the requested callback which is allowed to do any kernfs operations and the end result doesn't necessarily have to match the expected semantics of the syscall. This will be used to convert cgroup to use kernfs instead of its own filesystem implementation. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: allow negative dentriesTejun Heo
kernfs doesn't allow negative dentries - kernfs_iop_lookup() returns ERR_PTR(-ENOENT) instead of NULL which short-circuits negative dentry creation and kernfs's d_delete() callback, kernfs_dop_delete(), returns 1 for all removed nodes. This in turn allows kernfs_dop_revalidate() to assume that there's no negative dentry for kernfs. This worked fine for sysfs but kernfs is scheduled to grow mkdir(2) support which depend on negative dentries. This patch updates so that kernfs allows negative dentries. The required changes are almost trivial - kernfs_iop_lookup() now returns NULL instead of ERR_PTR(-ENOENT) when the target kernfs_node doesn't exist, kernfs_dop_delete() is removed and kernfs_dop_revalidate() is updated to check whether the target dentry is negative and request fresh lookup if so. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: update kernfs_rename_ns() to consider KERNFS_STATIC_NAMETejun Heo
kernfs_rename_ns() currently assumes that the target sysfs_dirent has a copied name. This has been okay because sysfs supports rename only for directories which always have copied names; however, there's nothing in kernfs interface which calls for such restriction and currently invoking kernfs_rename_ns() on a regular file leads to oops because it ends up trying to kfree() a static name. This patch updates kernfs_rename_ns() so that it skips kfree() of the old name if it's static. This allows it to be used for all node types. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: mark static names with KERNFS_STATIC_NAMETejun Heo
Because sysfs used struct attribute which are supposed to stay constant, sysfs didn't copy names when creating regular files. The specified string for name was supposed to stay constant. Such distinction isn't inherent for kernfs. kernfs_create_file[_ns]() should be able to take the same @name as kernfs_create_dir[_ns]() As there can be huge number of sysfs attributes, we still want to be able to use static names for sysfs attributes. This patch renames kernfs_create_file_ns_key() to __kernfs_create_file() and adds @name_is_static parameter so that the caller can explicitly indicate that @name can be used without copying. kernfs is updated to use KERNFS_STATIC_NAME to distinguish static and copied names. This patch doesn't introduce any behavior changes. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: add REMOVED check to create and rename pathsTejun Heo
kernfs currently assumes that the caller doesn't try to create a new node under a removed parent, rename a removed node, or move a node under a removed node. While this works fine for sysfs, it'd be nice to have protection against such cases especially given that kernfs is planned to add support for mkdir, rmdir and rename requsts from userland which may make race conditions more likely. This patch updates create and rename paths to check REMOVED and fail the operation with -ENOENT if performed on or towards removed nodes. Note that remove path already has such check. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17kernfs: add @mode to kernfs_create_dir[_ns]()Tejun Heo
sysfs assumed 0755 for all newly created directories and kernfs inherited it. This assumption is unnecessarily restrictive and inconsistent with kernfs_create_file[_ns](). This patch adds @mode parameter to kernfs_create_dir[_ns]() and update uses in sysfs accordingly. Among others, this will be useful for implementations of the planned ->mkdir() method. This patch doesn't introduce any behavior differences. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-12-17xfs: abort metadata writeback on permanent errorsDave Chinner
If we are doing aysnc writeback of metadata, we can get write errors but have nobody to report them to. At the moment, we simply attempt to reissue the write from io completion in the hope that it's a transient error. When it's not a transient error, the buffer is stuck forever in this loop, and we cannot break out of it. Eventually, unmount will hang because the AIL cannot be emptied and everything goes downhill from them. To solve this problem, only retry the write IO once before aborting it. We don't throw the buffer away because some transient errors can last minutes (e.g. FC path failover) or even hours (thin provisioned devices that have run out of backing space) before they go away. Hence we really want to keep trying until we can't try any more. Because the buffer was not cleaned, however, it does not get removed from the AIL and hence the next pass across the AIL will start IO on it again. As such, we still get the "retry forever" semantics that we currently have, but we allow other access to the buffer in the mean time. Meanwhile the filesystem can continue to modify the buffer and relog it, so the IO errors won't hang the log or the filesystem. Now when we are pushing the AIL, we can see all these "permanent IO error" buffers and we can issue a warning about failures before we retry the IO. We can also catch these buffers when unmounting an issue a corruption warning, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: swalloc doesn't align allocations properlyDave Chinner
When swalloc is specified as a mount option, allocations are supposed to be aligned to the stripe width rather than the stripe unit of the underlying filesystem. However, it does not do this. What the implementation does is round up the allocation size to a stripe width, hence ensuring that all allocations span a full stripe width. It does not, however, ensure that that allocation is aligned to a stripe width, and hence the allocations can span multiple underlying stripes and so still see RMW cycles for things like direct IO on MD RAID. So, if the swalloc mount option is set, change the allocation alignment in xfs_bmap_btalloc() to use the stripe width rather than the stripe unit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: remove xfsbdstrat errorChristoph Hellwig
The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that handles the case of a shut down filesystem. Most of the users have private, uncached buffers that can just be freed in this case, but the complex error handling in xfs_bioerror_relse messes up the case when it's called without a locked buffer. Remove xfsbdstrat and opencode the error handling in the callers. All but one can simply return an error and don't need to deal with buffer state, and the one caller that cares about the buffer state could do with a major cleanup as well, but we'll defer that to later. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com>
2013-12-17xfs: align initial file allocations correctlyDave Chinner
The function xfs_bmap_isaeof() is used to indicate that an allocation is occurring at or past the end of file, and as such should be aligned to the underlying storage geometry if possible. Commit 27a3f8f ("xfs: introduce xfs_bmap_last_extent") changed the behaviour of this function for empty files - it turned off allocation alignment for this case accidentally. Hence large initial allocations from direct IO are not getting correctly aligned to the underlying geometry, and that is cause write performance to drop in alignment sensitive configurations. Fix it by considering allocation into empty files as requiring aligned allocation again. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit f9b395a8ef8f34d19cae2cde361e19c96e097fad)
2013-12-17xfs: fix infinite loop by detaching the group/project hints from user dquotJie Liu
xfs_quota(8) will hang up if trying to turn group/project quota off before the user quota is off, this could be 100% reproduced by: # mount -ouquota,gquota /dev/sda7 /xfs # mkdir /xfs/test # xfs_quota -xc 'off -g' /xfs <-- hangs up # echo w > /proc/sysrq-trigger # dmesg SysRq : Show Blocked State task PC stack pid father xfs_quota D 0000000000000000 0 27574 2551 0x00000000 [snip] Call Trace: [<ffffffff81aaa21d>] schedule+0xad/0xc0 [<ffffffff81aa327e>] schedule_timeout+0x35e/0x3c0 [<ffffffff8114b506>] ? mark_held_locks+0x176/0x1c0 [<ffffffff810ad6c0>] ? call_timer_fn+0x2c0/0x2c0 [<ffffffffa0c25380>] ? xfs_qm_shrink_count+0x30/0x30 [xfs] [<ffffffff81aa3306>] schedule_timeout_uninterruptible+0x26/0x30 [<ffffffffa0c26155>] xfs_qm_dquot_walk+0x235/0x260 [xfs] [<ffffffffa0c059d8>] ? xfs_perag_get+0x1d8/0x2d0 [xfs] [<ffffffffa0c05805>] ? xfs_perag_get+0x5/0x2d0 [xfs] [<ffffffffa0b7707e>] ? xfs_inode_ag_iterator+0xae/0xf0 [xfs] [<ffffffffa0c22280>] ? xfs_trans_free_dqinfo+0x50/0x50 [xfs] [<ffffffffa0b7709f>] ? xfs_inode_ag_iterator+0xcf/0xf0 [xfs] [<ffffffffa0c261e6>] xfs_qm_dqpurge_all+0x66/0xb0 [xfs] [<ffffffffa0c2497a>] xfs_qm_scall_quotaoff+0x20a/0x5f0 [xfs] [<ffffffffa0c2b8f6>] xfs_fs_set_xstate+0x136/0x180 [xfs] [<ffffffff8136cf7a>] do_quotactl+0x53a/0x6b0 [<ffffffff812fba4b>] ? iput+0x5b/0x90 [<ffffffff8136d257>] SyS_quotactl+0x167/0x1d0 [<ffffffff814cf2ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f [<ffffffff81abcd19>] system_call_fastpath+0x16/0x1b It's fine if we turn user quota off at first, then turn off other kind of quotas if they are enabled since the group/project dquot refcount is decreased to zero once the user quota if off. Otherwise, those dquots refcount is non-zero due to the user dquot might refer to them as hint(s). Hence, above operation cause an infinite loop at xfs_qm_dquot_walk() while trying to purge dquot cache. This problem has been around since Linux 3.4, it was introduced by: [ b84a3a9675 xfs: remove the per-filesystem list of dquots ] Originally we will release the group dquot pointers because the user dquots maybe carrying around as a hint via xfs_qm_detach_gdquots(). However, with above change, there is no such work to be done before purging group/project dquot cache. In order to solve this problem, this patch introduces a special routine xfs_qm_dqpurge_hints(), and it would release the group/project dquot pointers the user dquots maybe carrying around as a hint, and then it will proceed to purge the user dquot cache if requested. Cc: stable@vger.kernel.org Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit df8052e7dae00bde6f21b40b6e3e1099770f3afc)
2013-12-17xfs: fix assertion failure at xfs_setattr_nonsizeJie Liu
For CRC enabled v5 super block, change a file's ownership can simply trigger an ASSERT failure at xfs_setattr_nonsize() if both group and project quota are enabled, i.e, [ 305.337609] XFS: Assertion failed: !XFS_IS_PQUOTA_ON(mp), file: fs/xfs/xfs_iops.c, line: 621 [ 305.339250] Kernel BUG at ffffffffa0a7fa32 [verbose debug info unavailable] [ 305.383939] Call Trace: [ 305.385536] [<ffffffffa0a7d95a>] xfs_setattr_nonsize+0x69a/0x720 [xfs] [ 305.387142] [<ffffffffa0a7dea9>] xfs_vn_setattr+0x29/0x70 [xfs] [ 305.388727] [<ffffffff811ca388>] notify_change+0x1a8/0x350 [ 305.390298] [<ffffffff811ac39d>] chown_common+0xfd/0x110 [ 305.391868] [<ffffffff811ad6bf>] SyS_fchownat+0xaf/0x110 [ 305.393440] [<ffffffff811ad760>] SyS_lchown+0x20/0x30 [ 305.394995] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f [ 305.399870] RIP [<ffffffffa0a7fa32>] assfail+0x22/0x30 [xfs] This fix adjust the assertion to check if the super block support both quota inodes or not. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 5a01dd54f4a7fb513062070c5acef20d13cad980)
2013-12-17xfs: fix false assertion at xfs_qm_vop_create_dqattachJie Liu
After the previous fix, there still has another ASSERT failure if turning off any type of quota while fsstress is running at the same time. Backtrace in this case: [ 50.867897] XFS: Assertion failed: XFS_IS_GQUOTA_ON(mp), file: fs/xfs/xfs_qm.c, line: 2118 [ 50.867924] ------------[ cut here ]------------ ... <snip> [ 50.867957] Kernel BUG at ffffffffa0b55a32 [verbose debug info unavailable] [ 50.867999] invalid opcode: 0000 [#1] SMP [ 50.869407] Call Trace: [ 50.869446] [<ffffffffa0bc408a>] xfs_qm_vop_create_dqattach+0x19a/0x2d0 [xfs] [ 50.869512] [<ffffffffa0b9cc45>] xfs_create+0x5c5/0x6a0 [xfs] [ 50.869564] [<ffffffffa0b5307c>] xfs_vn_mknod+0xac/0x1d0 [xfs] [ 50.869615] [<ffffffffa0b531d6>] xfs_vn_mkdir+0x16/0x20 [xfs] [ 50.869655] [<ffffffff811becd5>] vfs_mkdir+0x95/0x130 [ 50.869689] [<ffffffff811bf63a>] SyS_mkdirat+0xaa/0xe0 [ 50.869723] [<ffffffff811bf689>] SyS_mkdir+0x19/0x20 [ 50.869757] [<ffffffff8170f7dd>] system_call_fastpath+0x1a/0x1f [ 50.869793] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 <snip> [ 50.870003] RIP [<ffffffffa0b55a32>] assfail+0x22/0x30 [xfs] [ 50.870050] RSP <ffff88002941fd60> [ 50.879251] ---[ end trace c93a2b342341c65b ]--- We're hitting the ASSERT(XFS_IS_*QUOTA_ON(mp)) in xfs_qm_vop_create_dqattach(), however the assertion itself is not right IMHO. While performing quota off, we firstly clear the XFS_*QUOTA_ACTIVE bit(s) from struct xfs_mount without taking any special locks, see xfs_qm_scall_quotaoff(). Hence there is no guarantee that the desired quota is still active. Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 37eb9706ebf5b99d14c6086cdeef2c2f73f9c9fb)
2013-12-17xfs: fix memory leak in xfs_dir2_node_removenameMark Tinguely
Fix the leak of kernel memory in xfs_dir2_node_removename() when xfs_dir2_leafn_remove() returns an error code. Signed-off-by: Mark Tinguely <tinguely@sgi.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit ef701600fd26cace9d513ee174688a2b83832126)
2013-12-13ceph: Avoid data inconsistency due to d-cache aliasing in readpage()Li Wang
If the length of data to be read in readpage() is exactly PAGE_CACHE_SIZE, the original code does not flush d-cache for data consistency after finishing reading. This patches fixes this. Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Sage Weil <sage@inktank.com>
2013-12-13ceph: initialize inode before instantiating dentryYan, Zheng
commit b18825a7c8 (Put a small type field into struct dentry::d_flags) put a type field into struct dentry::d_flags. __d_instantiate() set the field by checking inode->i_mode. So we should initialize inode before instantiating dentry when handling mds reply. Fixes: http://tracker.ceph.com/issues/6930 Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-12-12Merge branch 'akpm' (fixes from Andrew)Linus Torvalds
Merge patches from Andrew Morton: "13 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm: memcg: do not allow task about to OOM kill to bypass the limit mm: memcg: fix race condition between memcg teardown and swapin thp: move preallocated PTE page table on move_huge_pmd() mfd/rtc: s5m: fix register updating by adding regmap for RTC rtc: s5m: enable IRQ wake during suspend rtc: s5m: limit endless loop waiting for register update rtc: s5m: fix unsuccesful IRQ request during probe drivers/rtc/rtc-s5m.c: fix info->rtc assignment include/linux/kernel.h: make might_fault() a nop for !MMU drivers/rtc/rtc-at91rm9200.c: correct alarm over day/month wrap procfs: also fix proc_reg_get_unmapped_area() for !MMU case mm: memcg: do not declare OOM from __GFP_NOFAIL allocations include/linux/hugetlb.h: make isolate_huge_page() an inline
2013-12-12procfs: also fix proc_reg_get_unmapped_area() for !MMU caseJan Beulich
Commit fad1a86e25e0 ("procfs: call default get_unmapped_area on MMU-present architectures"), as its title says, took care of only the MMU case, leaving the !MMU side still in the regressed state (returning -EIO in all cases where pde->proc_fops->get_unmapped_area is NULL). From the fad1a86e25e0 changelog: "Commit c4fe24485729 ("sparc: fix PCI device proc file mmap(2)") added proc_reg_get_unmapped_area in proc_reg_file_ops and proc_reg_file_ops_no_compat, by which now mmap always returns EIO if get_unmapped_area method is not defined for the target procfs file, which causes regression of mmap on /proc/vmcore. To address this issue, like get_unmapped_area(), call default current->mm->get_unmapped_area on MMU-present architectures if pde->proc_fops->get_unmapped_area, i.e. the one in actual file operation in the procfs file, is not defined" Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David S. Miller <davem@davemloft.net> Cc: <stable@vger.kernel.org> [3.12.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "This is a small collection of fixes. It was rebased this morning, but I was just fixing signed-off-by tags with the wrong email" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix access_ok() check in btrfs_ioctl_send() Btrfs: make sure we cleanup all reloc roots if error happens Btrfs: skip building backref tree for uuid and quota tree when doing balance relocation Btrfs: fix an oops when doing balance relocation Btrfs: don't miss skinny extent items on delayed ref head contention btrfs: call mnt_drop_write after interrupted subvol deletion Btrfs: don't clear the default compression type
2013-12-12Merge branch 'for-3.13' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd reply cache bugfix from Bruce Fields: "One bugfix for nfsd crashes" * 'for-3.13' of git://linux-nfs.org/~bfields/linux: nfsd: when reusing an existing repcache entry, unhash it first
2013-12-12dcache: allow word-at-a-time name hashing with big-endian CPUsWill Deacon
When explicitly hashing the end of a string with the word-at-a-time interface, we have to be careful which end of the word we pick up. On big-endian CPUs, the upper-bits will contain the data we're after, so ensure we generate our masks accordingly (and avoid hashing whatever random junk may have been sitting after the string). This patch adds a new dcache helper, bytemask_from_count, which creates a mask appropriate for the CPU endianness. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12Merge tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfsLinus Torvalds
Pull xfs bugfixes from Ben Myers: - fix for buffer overrun in agfl with growfs on v4 superblock - return EINVAL if requested discard length is less than a block - fix possible memory corruption in xfs_attrlist_by_handle() * tag 'xfs-for-linus-v3.13-rc4' of git://oss.sgi.com/xfs/xfs: xfs: growfs overruns AGFL buffer on V4 filesystems xfs: don't perform discard if the given range length is less than block size xfs: underflow bug in xfs_attrlist_by_handle()
2013-12-12Btrfs: fix access_ok() check in btrfs_ioctl_send()Dan Carpenter
The closing parenthesis is in the wrong place. We want to check "sizeof(*arg->clone_sources) * arg->clone_sources_count" instead of "sizeof(*arg->clone_sources * arg->clone_sources_count)". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org
2013-12-12Btrfs: make sure we cleanup all reloc roots if error happensWang Shilong
I hit an oops when merging reloc roots fails, the reason is that new reloc roots may be added and we should make sure we cleanup all reloc roots. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>
2013-12-12Btrfs: skip building backref tree for uuid and quota tree when doing balance ↵Wang Shilong
relocation Quota tree and UUID Tree is only cowed, they can not be snapshoted. Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com> Signed-off-by: Chris Mason <clm@fb.com>