aboutsummaryrefslogtreecommitdiff
path: root/fs
AgeCommit message (Collapse)Author
2013-08-27ceph: use vfs __set_page_dirty_nobuffers interface instead of doing it ↵Sha Zhengju
inside filesystem Following we will begin to add memcg dirty page accounting around __set_page_dirty_{buffers,nobuffers} in vfs layer, so we'd better use vfs interface to avoid exporting those details to filesystems. Since vfs set_page_dirty() should be called under page lock, here we don't need elaborate codes to handle racy anymore, and two WARN_ON() are added to detect such exceptions. Thanks very much for Sage and Yan Zheng's coaching! I tested it in a two server's ceph environment that one is client and the other is mds/osd/mon, and run the following fsx test from xfstests: ./fsx 1MB -N 50000 -p 10000 -l 1048576 ./fsx 10MB -N 50000 -p 10000 -l 10485760 ./fsx 100MB -N 50000 -p 10000 -l 104857600 The fsx does lots of mmap-read/mmap-write/truncate operations and the tests completed successfully without triggering any of WARN_ON. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-27ceph: allow sync_read/write return partial successed size of read/write.majianpeng
For sync_read/write, it may do multi stripe operations.If one of those met erro, we return the former successed size rather than a error value. There is a exception for write-operation met -EOLDSNAPC.If this occur,we retry the whole write again. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-27ceph: fix bugs about handling short-read for sync read mode.majianpeng
cephfs . show_layout >layyout.data_pool: 0 >layout.object_size: 4194304 >layout.stripe_unit: 4194304 >layout.stripe_count: 1 TestA: >dd if=/dev/urandom of=test bs=1M count=2 oflag=direct >dd if=/dev/urandom of=test bs=1M count=2 seek=4 oflag=direct >dd if=test of=/dev/null bs=6M count=1 iflag=direct The messages from func striped_read are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT ceph: file.c:350 : striped_read 2097152~4194304 (read 2097152) got 0 HITSTRIPE SHORT ceph: file.c:381 : zero tail 4194304 ceph: file.c:390 : striped_read returns 6291456 The hole of file is from 2M--4M.But actualy it zero the last 4M include the last 2M area which isn't a hole. Using this patch, the messages are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 2097152 HITSTRIPE SHORT ceph: file.c:358 : zero gap 2097152 to 4194304 ceph: file.c:350 : striped_read 4194304~2097152 (read 4194304) got 2097152 ceph: file.c:384 : striped_read returns 6291456 TestB: >echo majianpeng > test >dd if=test of=/dev/null bs=2M count=1 iflag=direct The messages are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT ceph: file.c:350 : striped_read 11~6291445 (read 11) got 0 HITSTRIPE SHORT ceph: file.c:390 : striped_read returns 11 For this case,it did once more striped_read.It's no meaningless. Using this patch, the message are: ceph: file.c:350 : striped_read 0~6291456 (read 0) got 11 HITSTRIPE SHORT ceph: file.c:384 : striped_read returns 11 Big thanks to Yan Zheng for the patch. Reviewed-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-27ceph: remove useless variable revoked_rdcacheLi Wang
Cleanup in handle_cap_grant(). Signed-off-by: Li Wang <liwang@ubuntukylin.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-27ceph: fix fallocate divisionSage Weil
We need to use do_div to divide by a 64-bit value. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
2013-08-15ceph: punch hole supportLi Wang
This patch implements fallocate and punch hole support for Ceph kernel client. Signed-off-by: Li Wang <liwang@ubuntukylin.com> Signed-off-by: Yunchuan Wen <yunchuanwen@ubuntukylin.com>
2013-08-15ceph: fix request max sizeYan, Zheng
ceph_check_caps() requests new max size only when there is Fw cap. If we call check_max_size() while there is no Fw cap. It updates i_wanted_max_size and calls ceph_check_caps(), but ceph_check_caps() does nothing. Later when Fw cap is issued, we call check_max_size() again. But i_wanted_max_size is equal to 'endoff' at this time, so check_max_size() doesn't call ceph_check_caps() and we end up with waiting for the new max size forever. The fix is duplicate ceph_check_caps()'s "request max size" code in check_max_size(), and make try_get_cap_refs() wait for the Fw cap before retry requesting new max size. This patch also removes the "endoff > (inode->i_size << 1)" check in check_max_size(). It's useless because there is no corresponding logic in ceph_check_caps(). Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-15ceph: introduce i_truncate_mutexYan, Zheng
I encountered below deadlock when running fsstress wmtruncate work truncate MDS --------------- ------------------ -------------------------- lock i_mutex <- truncate file lock i_mutex (blocked) <- revoking Fcb (filelock to MIX) send request -> handle request (xlock filelock) At the initial time, there are some dirty pages in the page cache. When the kclient receives the truncate message, it reduces inode size and creates some 'out of i_size' dirty pages. wmtruncate work can't truncate these dirty pages because it's blocked by the i_mutex. Later when the kclient receives the cap message that revokes Fcb caps, It can't flush all dirty pages because writepages() only flushes dirty pages within the inode size. When the MDS handles the 'truncate' request from kclient, it waits for the filelock to become stable. But the filelock is stuck in unstable state because it can't finish revoking kclient's Fcb caps. The truncate pagecache locking has already caused lots of trouble for use. I think it's time simplify it by introducing a new mutex. We use the new mutex to prevent concurrent truncate_inode_pages(). There is no need to worry about race between buffered write and truncate_inode_pages(), because our "get caps" mechanism prevents them from concurrent execution. Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-15ceph: cleanup the logic in ceph_invalidatepageMilosz Tanski
The invalidatepage code bails if it encounters a non-zero page offset. The current logic that does is non-obvious with multiple if statements. This should be logically and functionally equivalent. Signed-off-by: Milosz Tanski <milosz@adfin.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-15Merge remote-tracking branch 'linus/master' into testingSage Weil
2013-08-13fs/proc/task_mmu.c: fix buffer overflow in add_page_map()yonghua zheng
Recently we met quite a lot of random kernel panic issues after enabling CONFIG_PROC_PAGE_MONITOR. After debuggind we found this has something to do with following bug in pagemap: In struct pagemapread: struct pagemapread { int pos, len; pagemap_entry_t *buffer; bool v2; }; pos is number of PM_ENTRY_BYTES in buffer, but len is the size of buffer, it is a mistake to compare pos and len in add_page_map() for checking buffer is full or not, and this can lead to buffer overflow and random kernel panic issue. Correct len to be total number of PM_ENTRY_BYTES in buffer. [akpm@linux-foundation.org: document pagemapread.pos and .len units, fix PM_ENTRY_BYTES definition] Signed-off-by: Yonghua Zheng <younghua.zheng@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13ocfs2: fix null pointer dereference in ocfs2_dir_foreach_blk_id()Jeff Liu
Fix a NULL pointer deference while removing an empty directory, which was introduced by commit 3704412bdbf3 ("[readdir] convert ocfs2"). BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<(null)>] (null) PGD 6da85067 PUD 6da89067 PMD 0 Oops: 0010 [#1] SMP CPU: 0 PID: 6564 Comm: rmdir Tainted: G O 3.11.0-rc1 #4 RIP: 0010:[<0000000000000000>] [< (null)>] (null) Call Trace: ocfs2_dir_foreach+0x49/0x50 [ocfs2] ocfs2_empty_dir+0x12c/0x3e0 [ocfs2] ocfs2_unlink+0x56e/0xc10 [ocfs2] vfs_rmdir+0xd5/0x140 do_rmdir+0x1cb/0x1e0 SyS_rmdir+0x16/0x20 system_call_fastpath+0x16/0x1b Code: Bad RIP value. RIP [< (null)>] (null) RSP <ffff88006daddc10> CR2: 0000000000000000 [dan.carpenter@oracle.com: fix pointer math] Signed-off-by: Jie Liu <jeff.liu@oracle.com> Reported-by: David Weber <wb@munzinger.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13ocfs2: fix NULL pointer dereference in ocfs2_duplicate_clusters_by_pageTiger Yang
Since ocfs2_cow_file_pos will invoke ocfs2_refcount_icow with a NULL as the struct file pointer, it finally result in a null pointer dereference in ocfs2_duplicate_clusters_by_page. This patch replace file pointer with inode pointer in cow_duplicate_clusters to fix this issue. [jeff.liu@oracle.com: rebased patch against linux-next tree] Signed-off-by: Tiger Yang <tiger.yang@oracle.com> Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Acked-by: Tao Ma <tm@tao.ma> Tested-by: David Weber <wb@munzinger.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13ocfs2: Revert 40bd62e to avoid regression in extended allocationJie Liu
Revert commit 40bd62eb7fb8 ("fs/ocfs2/journal.h: add bits_wanted while calculating credits in ocfs2_calc_extend_credits"). Unfortunately this change broke fallocate even if there is insufficient disk space for the preallocation, which is a serious problem. # df -h /dev/sda8 22G 1.2G 21G 6% /ocfs2 # fallocate -o 0 -l 200M /ocfs2/testfile fallocate: /ocfs2/test: fallocate failed: No space left on device and a kernel warning: CPU: 3 PID: 3656 Comm: fallocate Tainted: G W O 3.11.0-rc3 #2 Call Trace: dump_stack+0x77/0x9e warn_slowpath_common+0xc4/0x110 warn_slowpath_null+0x2a/0x40 start_this_handle+0x6c/0x640 [jbd2] jbd2__journal_start+0x138/0x300 [jbd2] jbd2_journal_start+0x23/0x30 [jbd2] ocfs2_start_trans+0x166/0x300 [ocfs2] __ocfs2_extend_allocation+0x38f/0xdb0 [ocfs2] ocfs2_allocate_unwritten_extents+0x3c9/0x520 __ocfs2_change_file_space+0x5e0/0xa60 [ocfs2] ocfs2_fallocate+0xb1/0xe0 [ocfs2] do_fallocate+0x1cb/0x220 SyS_fallocate+0x6f/0xb0 system_call_fastpath+0x16/0x1b JBD2: fallocate wants too many credits (51216 > 4381) Signed-off-by: Jie Liu <jeff.liu@oracle.com> Cc: Goldwyn Rodrigues <rgoldwyn@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mfasheh@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13hugetlb: fix lockdep splat caused by pmd sharingMichal Hocko
Dave has reported the following lockdep splat: ================================= [ INFO: inconsistent lock state ] 3.11.0-rc1+ #9 Not tainted --------------------------------- inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage. kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes: (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3 {RECLAIM_FS-ON-W} state was registered at: mark_held_locks+0x81/0xe7 lockdep_trace_alloc+0x5e/0xbc __alloc_pages_nodemask+0x8b/0x9b6 __get_free_pages+0x20/0x31 get_zeroed_page+0x12/0x14 __pmd_alloc+0x1c/0x6b huge_pmd_share+0x265/0x283 huge_pte_alloc+0x5d/0x71 hugetlb_fault+0x7c/0x64a handle_mm_fault+0x255/0x299 __do_page_fault+0x142/0x55c do_page_fault+0xd/0x16 error_code+0x6c/0x74 irq event stamp: 3136917 hardirqs last enabled at (3136917): _raw_spin_unlock_irq+0x27/0x50 hardirqs last disabled at (3136916): _raw_spin_lock_irq+0x15/0x78 softirqs last enabled at (3136180): __do_softirq+0x137/0x30f softirqs last disabled at (3136175): irq_exit+0xa8/0xaa other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&mapping->i_mmap_mutex); <Interrupt> lock(&mapping->i_mmap_mutex); *** DEADLOCK *** no locks held by kswapd0/49. stack backtrace: CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9 Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008 Call Trace: dump_stack+0x4b/0x79 print_usage_bug+0x1d9/0x1e3 mark_lock+0x1e0/0x261 __lock_acquire+0x623/0x17f2 lock_acquire+0x7d/0x195 mutex_lock_nested+0x6c/0x3a7 page_referenced+0x87/0x5e3 shrink_page_list+0x3d9/0x947 shrink_inactive_list+0x155/0x4cb shrink_lruvec+0x300/0x5ce shrink_zone+0x53/0x14e kswapd+0x517/0xa75 kthread+0xa8/0xaa ret_from_kernel_thread+0x1b/0x28 which is a false positive caused by hugetlb pmd sharing code which allocates a new pmd from withing mapping->i_mmap_mutex. If this allocation causes reclaim then the lockdep detector complains that we might self-deadlock. This is not correct though, because hugetlb pages are not reclaimable so their mapping will be never touched from the reclaim path. The patch tells lockup detector that hugetlb i_mmap_mutex is special by assigning it a separate lockdep class so it won't report possible deadlocks on unrelated mappings. [peterz@infradead.org: comment for annotation] Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Peter Zijlstra <peterz@infradead.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13mm: save soft-dirty bits on file pagesCyrill Gorcunov
Andy reported that if file page get reclaimed we lose the soft-dirty bit if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get encoded into pte entry. Thus when #pf happens on such non-present pte we can restore it back. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-13mm: save soft-dirty bits on swapped pagesCyrill Gorcunov
Andy Lutomirski reported that if a page with _PAGE_SOFT_DIRTY bit set get swapped out, the bit is getting lost and no longer available when pte read back. To resolve this we introduce _PTE_SWP_SOFT_DIRTY bit which is saved in pte entry for the page being swapped out. When such page is to be read back from a swap cache we check for bit presence and if it's there we clear it and restore the former _PAGE_SOFT_DIRTY bit back. One of the problem was to find a place in pte entry where we can save the _PTE_SWP_SOFT_DIRTY bit while page is in swap. The _PAGE_PSE was chosen for that, it doesn't intersect with swap entry format stored in pte. Reported-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Cc: Matt Mackall <mpm@selenic.com> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-12Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull CIFS fixes from Steve French: "A set of small cifs fixes, including 3 relating to symlink handling" * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: cifs: don't instantiate new dentries in readdir for inodes that need to be revalidated immediately cifs: set sb->s_d_op before calling d_make_root() cifs: fix bad error handling in crypto code cifs: file: initialize oparms.reconnect before using it Do not attempt to do cifs operations reading symlinks with SMB2 cifs: extend the buffer length enought for sprintf() using
2013-08-12Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull more ext4 bugfixes from Ted Ts'o: "A number of miscellaneous ext4 bugs fixes for v3.11, including a fix so that if ext4 is built as a module, to allow it to be unloaded" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOT ext4: fix mount/remount error messages for incompatible mount options ext4: allow the mount options nodelalloc and data=journal
2013-08-12ext4: flush the extent status cache during EXT4_IOC_SWAP_BOOTTheodore Ts'o
Previously we weren't swapping only some of the extent_status LRU fields during the processing of the EXT4_IOC_SWAP_BOOT ioctl. The much safer thing to do is to just completely flush the extent status tree when doing the swap. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: Zheng Liu <gnehzuil.liu@gmail.com> Cc: stable@vger.kernel.org
2013-08-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes from Chris Mason: "These are assorted fixes, mostly from Josef nailing down xfstests runs. Zach also has a long standing fix for problems with readdir wrapping f_pos (or ctx->pos) These patches were spread out over different bases, so I rebased things on top of rc4 and retested overnight" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: btrfs: don't loop on large offsets in readdir Btrfs: check to see if root_list is empty before adding it to dead roots Btrfs: release both paths before logging dir/changed extents Btrfs: allow splitting of hole em's when dropping extent cache Btrfs: make sure the backref walker catches all refs to our extent Btrfs: fix backref walking when we hit a compressed extent Btrfs: do not offset physical if we're compressed Btrfs: fix extent buffer leak after backref walking Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified
2013-08-10Merge tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client bugfixes from Trond Myklebust: - Stable patch for lockd to fix Oopses due to inappropriate calls to utsname()->nodename - Stable patches for sunrpc to fix Oopses on shutdown when using AF_LOCAL sockets with rpcbind - Fix memory leak and error checking issues in nfs4_proc_lookup_mountpoint - Fix a regression with the sync mount option failing to work for nfs4 mounts - Fix a writeback performance issue when doing cache invalidation - Remove an incorrect call to nfs_setsecurity in nfs_fhget * tag 'nfs-for-3.11-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFSv4: Fix up nfs4_proc_lookup_mountpoint NFS: Remove unnecessary call to nfs_setsecurity in nfs_fhget() NFSv4: Fix the sync mount option for nfs4 mounts NFS: Fix writeback performance issue on cache invalidation SUNRPC: If the rpcbind channel is disconnected, fail the call to unregister SUNRPC: Don't auto-disconnect from the local rpcbind socket LOCKD: Don't call utsname()->nodename from nlmclnt_setlockargs
2013-08-10Merge branch 'for-3.11' of git://linux-nfs.org/~bfields/linuxLinus Torvalds
Pull nfsd fixes from Bruce Fields: "Some fixes for a 4.1 feature that in retrospect probably should have waited for 3.12.... But it appears to be working now" * 'for-3.11' of git://linux-nfs.org/~bfields/linux: nfsd: Fix SP4_MACH_CRED negotiation in EXCHANGE_ID nfsd4: Fix MACH_CRED NULL dereference
2013-08-09ceph: Remove bogus check in invalidatepageMilosz Tanski
The early bug checks are moot because the VMA layer ensures those things. 1. It will not call invalidatepage unless PagePrivate (or PagePrivate2) are set 2. It will not call invalidatepage without taking a PageLock first. 3. Guantrees that the inode page is mapped. Signed-off-by: Milosz Tanski <milosz@adfin.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09ceph: replace hold_mutex flag with gotoSage Weil
All of the early exit paths need to drop the mutex; it is only the normal path through the function that does not. Skip the unlock in that case with a goto out_unlocked. Signed-off-by: Sage Weil <sage@inktank.com> Reviewed-by: Jianpeng Ma <majianpeng@gmail.com>
2013-08-09ceph: Move the place for EOLDSNAPC handle in ceph_aio_write to easily understandmajianpeng
Only for ceph_sync_write, the osd can return EOLDSNAPC.so move the related codes after the call ceph_sync_write. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09ceph: fix freeing inode vs removing session caps raceYan, Zheng
remove_session_caps() uses iterate_session_caps() to remove caps, but iterate_session_caps() skips inodes that are being deleted. So session->s_nr_caps can be non-zero after iterate_session_caps() return. We can fix the issue by waiting until deletions are complete. __wait_on_freeing_inode() is designed for the job, but it is not exported, so we use lookup inode function to access it. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
2013-08-09ceph: Add check returned value on func ceph_calc_ceph_pg.majianpeng
Func ceph_calc_ceph_pg maybe failed.So add check for returned value. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com> Signed-off-by: Sage Weil <sage@inktank.com>
2013-08-09ceph: Don't use ceph-sync-mode for synchronous-fs.majianpeng
Sending reads and writes through the sync read/write paths bypasses the page cache, which is not expected or generally a good idea. Removing the write check is safe as there is a conditional vfs_fsync_range() later in ceph_aio_write that already checks for the same flag (via IS_SYNC(inode)). Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09ceph: cleanup types in striped_read()Dan Carpenter
We pass in a u64 value for "len" and then immediately truncate away the upper 32 bits. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Sage Weil <sage@inktank.com> Reviewed-by: Alex Elder <alex.elder@linaro.org>
2013-08-09ceph: trim deleted inodeYan, Zheng
The MDS uses caps message to notify clients about deleted inode. when receiving a such message, invalidate any alias of the inode. This makes the kernel release the inode ASAP. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09ceph: wake up writer if vmtruncate work get blockedYan, Zheng
To write data, the writer first acquires the i_mutex, then try getting caps. The writer may sleep while holding the i_mutex. If the MDS revokes Fb cap in this case, vmtruncate work can't do its job because i_mutex is locked. We should wake up the writer and let it truncate the pages. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09ceph: drop CAP_LINK_SHARED when sending "link" request to MDSYan, Zheng
To handle "link" request, the MDS need to xlock inode's linklock, which requires revoking any CAP_LINK_SHARED. Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09ceph: fix null pointer dereferenceNathaniel Yazdani
When register_session() is given an out-of-range argument for mds, ceph_mdsmap_get_addr() will return a null pointer, which would be given to ceph_con_open() & be dereferenced, causing a kernel oops. This fixes bug #4685 in the Ceph bug tracker <http://tracker.ceph.com/issues/4685>. Signed-off-by: Nathaniel Yazdani <n1ght.4nd.d4y@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09ceph: Don't forget the 'up_read(&osdc->map_sem)' if met error.majianpeng
CC: stable@vger.kernel.org Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Reviewed-by: Sage Weil <sage@inktank.com>
2013-08-09btrfs: don't loop on large offsets in readdirZach Brown
When btrfs readdir() hits the last entry it sets the readdir offset to a huge value to stop buggy apps from breaking when the same name is returned by readdir() with concurrent rename()s. But unconditionally setting the offset to INT_MAX causes readdir() to loop returning any entries with offsets past INT_MAX. It only takes a few hours of constant file creation and removal to create entries past INT_MAX. So let's set the huge offset to LLONG_MAX if the last entry has already overflowed 32bit loff_t. Without large offsets behaviour is identical. With large offsets 64bit apps will work and 32bit apps will be no more broken than they currently are if they see large offsets. Signed-off-by: Zach Brown <zab@redhat.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: check to see if root_list is empty before adding it to dead rootsJosef Bacik
A user reported a panic when running with autodefrag and deleting snapshots. This is because we could end up trying to add the root to the dead roots list twice. To fix this check to see if we are empty before adding ourselves to the dead roots list. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: release both paths before logging dir/changed extentsJosef Bacik
The ceph guys tripped over this bug where we were still holding onto the original path that we used to copy the inode with when logging. This is based on Chris's fix which was reported to fix the problem. We need to drop the paths in two cases anyway so just move the drop up so that we don't have duplicate code. Thanks, Cc: stable@vger.kernel.org Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: allow splitting of hole em's when dropping extent cacheJosef Bacik
I noticed while running multi-threaded fsync tests that sometimes fsck would complain about an improper gap. This happens because we fail to add a hole extent to the file, which was happening when we'd split a hole EM because btrfs_drop_extent_cache was just discarding the whole em instead of splitting it. So this patch fixes this by allowing us to split a hole em properly, which means that added holes actually get logged properly and we no longer see this fsck error. Thankfully we're tolerant of these sort of problems so a user would not see any adverse effects of this bug, other than fsck complaining. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: make sure the backref walker catches all refs to our extentJosef Bacik
Because we don't mess with the offset into the extent for compressed we will properly find both extents for this case [extent a][extent b][rest of extent a] but because we already added a ref for the front half we won't add the inode information for the second half. This causes us to leak that memory and not print out the other offset when we do logical-resolve. So fix this by calling ulist_add_merge and then add our eie to the existing entry if there is one. With this patch we get both offsets out of logical-resolve. With this and the other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo set. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: fix backref walking when we hit a compressed extentJosef Bacik
If you do btrfs inspect-internal logical-resolve on a compressed extent that has been partly overwritten it won't find anything. This is because we try and match the extent offset we've searched for based on the extent offset in the data extent entry. However this doesn't work for compressed extents because the offsets are for the uncompressed size, not the compressed size. So instead only do this check if we are not compressed, that way we can get an actual entry for the physical offset rather than nothing for compressed. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: do not offset physical if we're compressedJosef Bacik
xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was offsetting the physical based on the extent offset. This is perfectly fine with uncompressed extents, however the extent offset is into the uncompressed area, not the compressed. So we can return a physical value that isn't at all within the area we have allocated on disk. Fix this by returning the start of the extent if it is compressed no matter what the offset. Thanks, Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: fix extent buffer leak after backref walkingLiu Bo
commit 47fb091fb787420cd195e66f162737401cce023f(Btrfs: fix unlock after free on rewinded tree blocks) takes an extra increment on the reference of allocated dummy extent buffer, so now we cannot free this dummy one, and end up with extent buffer leak. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extentsLiu Bo
For partial extents, snapshot-aware defrag does not work as expected, since a) we use the wrong logical offset to search for parents, which should be disk_bytenr + extent_offset, not just disk_bytenr, b) 'offset' returned by the backref walking just refers to key.offset, not the 'offset' stored in btrfs_extent_data_ref which is (key.offset - extent_offset). The reproducer: $ mkfs.btrfs sda $ mount sda /mnt $ btrfs sub create /mnt/sub $ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done $ btrfs sub snap /mnt/sub /mnt/snap1 $ btrfs sub snap /mnt/sub /mnt/snap2 $ sync; btrfs filesystem defrag /mnt/sub/foo; $ umount /mnt $ btrfs-debug-tree sda (Here we can check whether the defrag operation is snapshot-awared. This addresses the above two problems. Signed-off-by: Liu Bo <bo.li.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specifiedJie Liu
Create a small file and fallocate it to a big size with FALLOC_FL_KEEP_SIZE option, then truncate it back to the small size again, the disk free space is not changed back in this case. i.e, total 4 -rw-r--r-- 1 root root 512 Jun 28 11:35 test Filesystem Size Used Avail Use% Mounted on .... /dev/sdb1 8.0G 56K 7.2G 1% /mnt -rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test Filesystem Size Used Avail Use% Mounted on .... /dev/sdb1 8.0G 5.1G 2.2G 70% /mnt Filesystem Size Used Avail Use% Mounted on .... /dev/sdb1 8.0G 5.1G 2.2G 70% /mnt With this fix, the truncated up space is back as: Filesystem Size Used Avail Use% Mounted on .... /dev/sdb1 8.0G 56K 7.2G 1% /mnt Signed-off-by: Jie Liu <jeff.liu@oracle.com> Signed-off-by: Josef Bacik <jbacik@fusionio.com> Signed-off-by: Chris Mason <chris.mason@fusionio.com>
2013-08-09dlm: kill the unnecessary and wrong device_close()->recalc_sigpending()Oleg Nesterov
device_close()->recalc_sigpending() is not needed, sigprocmask() takes care of TIF_SIGPENDING correctly. And without ->siglock it is racy and wrong, it can wrongly clear TIF_SIGPENDING and miss a signal. But even with this patch device_close() is still buggy: 1. sigprocmask() should not be used, we have set_task_blocked(), but this is minor. 2. We should never block SIGKILL or SIGSTOP, and this is what the code tries to do. 3. This can't protect against SIGKILL or SIGSTOP anyway. Another thread can do signal_wake_up(), say, do_signal_stop() or complete_signal() or debugger. 4. sigprocmask(SIG_BLOCK, allsigs) doesn't necessarily clears TIF_SIGPENDING, say, freezing() or ->jobctl. 5. device_write() looks equally wrong by the same reason. Looks like, this tries to protect some wait_event_interruptible() logic from signals, it should be turned into uninterruptible wait. Or we need to implement something like signals_stop/start for such a use-case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-08ext4: fix mount/remount error messages for incompatible mount optionsPiotr Sarna
Commit 5688978 ("ext4: improve handling of conflicting mount options") introduced incorrect messages shown while choosing wrong mount options. First of all, both cases of incorrect mount options, "data=journal,delalloc" and "data=journal,dioread_nolock" result in the same error message. Secondly, the problem above isn't solved for remount option: the mismatched parameter is simply ignored. Moreover, ext4_msg states that remount with options "data=journal,delalloc" succeeded, which is not true. To fix it up, I added a simple check after parse_options() call to ensure that data=journal and delalloc/dioread_nolock parameters are not present at the same time. Signed-off-by: Piotr Sarna <p.sarna@partner.samsung.com> Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-08-08ext4: allow the mount options nodelalloc and data=journalTheodore Ts'o
Commit 26092bf ("ext4: use a table-driven handler for mount options") wrongly disallows the specifying the mount options nodelalloc and data=journal simultaneously. This is incorrect; it should have only disallowed the combination of delalloc and data=journal simultaneously. Reported-by: Piotr Sarna <p.sarna@partner.samsung.com> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org
2013-08-08Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 bugfixes from Ted Ts'o. Misc ext4 fixes, delayed by Ted moving mail servers and email getting marked as spam due to bad spf records. * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: ext4: add WARN_ON to check the length of allocated blocks ext4: fix retry handling in ext4_ext_truncate() ext4: destroy ext4_es_cachep on module unload ext4: make sure group number is bumped after a inode allocation race
2013-08-07NFSv4: Fix up nfs4_proc_lookup_mountpointTrond Myklebust
Currently, we do not check the return value of client = rpc_clone_client(), nor do we shut down the resulting cloned rpc_clnt in the case where a NFS4ERR_WRONGSEC has caused nfs4_proc_lookup_common() to replace the original value of 'client' (causing a memory leak). Fix both issues and simplify the code by moving the call to rpc_clone_client() until after nfs4_proc_lookup_common() has done its business. Reported-by: Andy Adamson <andros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>