Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This patch series contains several performance tuning patches
regarding to the IO submission flow, in addition to supporting new
features such as a ZBC-base drive and multiple devices.
It also includes some major bug fixes such as:
- checkpoint version control
- fdatasync-related roll-forward recovery routine
- memory boundary or null-pointer access in corner cases
- missing error cases
It has various minor clean-up patches as well"
* tag 'for-f2fs-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (66 commits)
f2fs: fix a missing size change in f2fs_setattr
f2fs: fix to access nullified flush_cmd_control pointer
f2fs: free meta pages if sanity check for ckpt is failed
f2fs: detect wrong layout
f2fs: call sync_fs when f2fs is idle
Revert "f2fs: use percpu_counter for # of dirty pages in inode"
f2fs: return AOP_WRITEPAGE_ACTIVATE for writepage
f2fs: do not activate auto_recovery for fallocated i_size
f2fs: fix to determine start_cp_addr by sbi->cur_cp_pack
f2fs: fix 32-bit build
f2fs: set ->owner for debugfs status file's file_operations
f2fs: fix incorrect free inode count in ->statfs
f2fs: drop duplicate header timer.h
f2fs: fix wrong AUTO_RECOVER condition
f2fs: do not recover i_size if it's valid
f2fs: fix fdatasync
f2fs: fix to account total free nid correctly
f2fs: fix an infinite loop when flush nodes in cp
f2fs: don't wait writeback for datas during checkpoint
f2fs: fix wrong written_valid_blocks counting
...
|
|
Thread A Thread B Thread C
- f2fs_create
- f2fs_new_inode
- f2fs_lock_op
- alloc_nid
alloc last nid
- f2fs_unlock_op
- f2fs_create
- f2fs_new_inode
- f2fs_lock_op
- alloc_nid
as node count still not
be increased, we will
loop in alloc_nid
- f2fs_write_node_pages
- f2fs_balance_fs_bg
- f2fs_sync_fs
- write_checkpoint
- block_operations
- f2fs_lock_all
- f2fs_lock_op
While creating new inode, we do not allocate and account nid atomically,
so that when there is almost no free nids left, we may encounter deadloop
like above stack.
In order to avoid that, reuse nm_i::available_nids for accounting free nids
and make nid allocation and counting being atomical during node creation.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Thread A Thread B
- write_checkpoint
- block_operations
-blk_start_plug
-sync_node_pages - f2fs_do_sync_file
- fsync_node_pages
- f2fs_wait_on_page_writeback
Thread A wait for global F2FS_DIRTY_NODES decreased to zero,
it start a plug list, some requests have been added to this list.
Thread B lock one dirty node page, and wait this page write back.
But this page has been in plug list of thread A with PG_writeback flag.
Thread A keep on running and its plug list has no chance to finish,
so it seems a deadlock between cp and fsync path.
This patch add a wait on page write back before set node page dirty
to avoid this problem.
Signed-off-by: Yunlei He <heyunlei@huawei.com>
Signed-off-by: Pengyang Hou <houpengyang@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
We don't need to allocate bio partially in order to maximize sequential writes.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch avoids build warning.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Let build_free_nids support sync/async methods, in allocation flow of nids,
we use synchronuous method, so that we can avoid looping in alloc_nid when
free memory is low; in unblock_operations and f2fs_balance_fs_bg we use
asynchronuous method in where low memory condition can interrupt us.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch cleans up to use consistent free nid list ops.
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
During free nid allocation, in order to do preallocation, we will tag free
nid entry as allocated one and still leave it in free nid list, for other
allocators who want to grab free nids, it needs to traverse the free nid
list for lookup. It becomes overhead in scenario of allocating free nid
intensively by multithreads.
This patch splits free nid list to two list: {free,alloc}_nid_list, to
keep free nids and preallocated free nids separately, after that, traverse
latency will be gone, besides split nid_cnt for separate statistic.
Additionally, introduce __insert_nid_to_list and __remove_nid_from_list for
cleanup.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: modify f2fs_bug_on to avoid needless branches]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
f2fs contained a number of endianness conversion bugs.
Also, one function should have been 'static'.
Found with sparse by running 'make C=2 CF=-D__CHECK_ENDIAN__ fs/f2fs/'
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In fsync_node_pages, if f2fs was taged with CP_ERROR_FLAG, make sure bio
cache was flushed before return.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If there is no dirty pages in inode, we should give a chance to detach
the inode from global dirty list, otherwise it needs to call another
unnecessary .writepages for detaching.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
During nid allocation, it needs to exclude building and allocating flow
of free nids, this is because while building free nid cache, there are two
steps: a) load free nids from unused nat entries in NAT pages, b) update
free nid cache by checking nat journal. The two steps should be atomical,
otherwise an used nid can be allocated as free one after a) and before b).
This patch adds missing lock which covers build_free_nids in
unlock_operation and f2fs_balance_fs_bg to avoid that.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Add wbc_to_write_flags(), which returns the write modifier flags to use,
based on a struct writeback_control. No functional changes in this
patch, but it prepares us for factoring other wbc fields for write type.
Signed-off-by: Jens Axboe <axboe@fb.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
Remove the WRITE_* and READ_SYNC wrappers, and just use the flags
directly. Where applicable this also drops usage of the
bio_set_op_attrs wrapper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
"Assorted misc bits and pieces.
There are several single-topic branches left after this (rename2
series from Miklos, current_time series from Deepa Dinamani, xattr
series from Andreas, uaccess stuff from from me) and I'd prefer to
send those separately"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
proc: switch auxv to use of __mem_open()
hpfs: support FIEMAP
cifs: get rid of unused arguments of CIFSSMBWrite()
posix_acl: uapi header split
posix_acl: xattr representation cleanups
fs/aio.c: eliminate redundant loads in put_aio_ring_file
fs/internal.h: add const to ns_dentry_operations declaration
compat: remove compat_printk()
fs/buffer.c: make __getblk_slow() static
proc: unsigned file descriptors
fs/file: more unsigned file descriptors
fs: compat: remove redundant check of nr_segs
cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
cifs: don't use memcpy() to copy struct iov_iter
get rid of separate multipage fault-in primitives
fs: Avoid premature clearing of capabilities
fs: Give dentry to inode_change_ok() instead of inode
fuse: Propagate dentry down to inode_change_ok()
ceph: Propagate dentry down to inode_change_ok()
xfs: Propagate dentry down to inode_change_ok()
...
|
|
In sync_node_pages, we won't check and commit last merged pages in private
bio cache of f2fs, as these pages were taged as writeback, someone who is
waiting for writebacking of the page will be blocked until the cache was
committed by someone else.
We need to commit node type bio cache to avoid potential deadlock or long
delay of waiting writeback.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Previously, we only support global fault injection configuration, so that
when we configure type/rate of fault injection through sysfs, mount
option, it will influence all f2fs partition which is being used.
It is not make sence, since it will be not convenient if developer want
to test separated partitions with different fault injection rate/type
simultaneously, also it's not possible to enable fault injection in one
partition and disable fault injection in other one.
>From now on, we move global configuration of fault injection in module
into per-superblock, hence injection testing can be more flexible.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch improves the migration of dirty pages and allows migrating atomic
written pages that F2FS uses in Page Cache. Instead of the fallback releasing
page path, it provides better performance for memory compaction, CMA and other
users of memory page migrating. For dirty pages, there is no need to write back
first when migrating. For an atomic written page before committing, we can
migrate the page and update the related 'inmem_pages' list at the same time.
Signed-off-by: Weichao Guo <guoweichao@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
[Jaegeuk Kim: fix some coding style]
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch gives another chances during roll-forward recovery regarding to
-ENOMEM.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In available_free_memory, there are two same judgement conditions which
is used for checking NAT excess, remove one of them.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
LKP reported -36.3% regression of fsmark.files_per_sec due to this patch.
I've confirmed that fxmark [1] has also slight regression for DWAL.
[1] https://github.com/sslab-gatech/fxmark
This reverts commit ec795418c41850056feb956534edf059dc1155d4.
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"The major change in this version is mitigating cpu overheads on write
paths by replacing redundant inode page updates with mark_inode_dirty
calls. And we tried to reduce lock contentions as well to improve
filesystem scalability. Other feature is setting F2FS automatically
when detecting host-managed SMR.
Enhancements:
- ioctl to move a range of data between files
- inject orphan inode errors
- avoid flush commands congestion
- support lazytime
Bug fixes:
- return proper results for some dentry operations
- fix deadlock in add_link failure
- disable extent_cache for fcollapse/finsert"
* tag 'for-f2fs-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (68 commits)
f2fs: clean up coding style and redundancy
f2fs: get victim segment again after new cp
f2fs: handle error case with f2fs_bug_on
f2fs: avoid data race when deciding checkpoin in f2fs_sync_file
f2fs: support an ioctl to move a range of data blocks
f2fs: fix to report error number of f2fs_find_entry
f2fs: avoid memory allocation failure due to a long length
f2fs: reset default idle interval value
f2fs: use blk_plug in all the possible paths
f2fs: fix to avoid data update racing between GC and DIO
f2fs: add maximum prefree segments
f2fs: disable extent_cache for fcollapse/finsert inodes
f2fs: refactor __exchange_data_block for speed up
f2fs: fix ERR_PTR returned by bio
f2fs: avoid mark_inode_dirty
f2fs: move i_size_write in f2fs_write_end
f2fs: fix to avoid redundant discard during fstrim
f2fs: avoid mismatching block range for discard
f2fs: fix incorrect f_bfree calculation in ->statfs
f2fs: use percpu_rw_semaphore
...
|
|
These two are confusing leftover of the old world order, combining
values of the REQ_OP_ and REQ_ namespaces. For callers that don't
special case we mostly just replace bi_rw with bio_data_dir or
op_is_write, except for the few cases where a switch over the REQ_OP_
values makes more sense. Any check for READA is replaced with an
explicit check for REQ_RAHEAD. Also remove the READA alias for
REQ_RAHEAD.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Mike Christie <mchristi@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This patch reverts 19a5f5e2ef37 (f2fs: drop any block plugging),
and adds blk_plug in write paths additionally.
The main reason is that blk_start_plug can be used to wake up from low-power
mode before submitting further bios.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This reduces the elapsed time to do xfstests/generic/017.
Before: 715 s
After: 458 s
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch replaces rw_semaphore with percpu_rw_semaphore for:
sbi->cp_rwsem
nm_i->nat_tree_lock
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If the node page is up-to-date, it should be alive.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
SetPageUptodate() issues memory barrier, resulting in performance degrdation.
Let's avoid that.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch adds f2fs_set_page_dirty_nobuffer() copied from __set_page_dirty_buffer.
When appending 4KB blocks in f2fs on pmem with multiple cores, this improves the
overall performance.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In procedure of synchonized read, after sending out the read request, reader
will try to lock the page for waiting device to finish the read jobs and
unlock the page, but meanwhile, truncater will race with reader, so after
reader get lock of the page, it should check page's mapping to detect
whether someone has truncated the page in advance, then reader has the
chance to do the retry if truncation was done, otherwise read can be failed
due to previous condition check.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The readahead nat pages are more likely to be reclaimed quickly, so it'd better
to gather more free nids in advance.
And, let's keep some free nids as much as possible.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Separate the op from the rq_flag_bits and have f2fs
set/get the bio using bio_set_op_attrs/bio_op.
Signed-off-by: Mike Christie <mchristi@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
|
|
This is to avoid cache entry management overhead including radix tree.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Commit aaf9607516ed38825268515ef4d773289a44f429 ("f2fs: check node page
contents all the time") pointed out that "sometimes it was reported that
its contents was missing", so it checks the page's mapping and contents.
When "nid != nid_of_node(page)", ERR_PTR(-EIO) will be returned to the
caller. However, commit e1c51b9f1df2f9efc2ec11488717e40cd12015f9 ("f2fs:
clean up node page updating flow") moves "nid != nid_of_node(page)" test
to "f2fs_bug_on(sbi, nid != nid_of_node(page))", this will return a
wrong page to the caller when F2FS_CHECK_FS is off when "sometimes it
was reported that its contents was missing" happens.
This patch restores to check node page contents all the time, and
returns the errno to make the caller known something is wrong and avoid
to use the page. This patch also moves f2fs_bug_on to its proper location.
Signed-off-by: Yunlong Song <yunlong.song@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
If roll-forward recovery can recover i_size, we don't need to update inode's
metadata during fsync.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch reduces to call them across the whole tree.
- sync_inode_page()
- update_inode_page()
- update_inode()
- f2fs_write_inode()
Instead, checkpoint will flush all the dirty inode metadata before syncing
node pages.
Note that, this is doable, since we call mark_inode_dirty_sync() for all
inode's field change which needs to update on-disk inode as well.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch registers all the inodes which have dirty metadata to sync when
checkpoint is doing.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch calls mark_inode_dirty_sync() for the following on-disk inode
changes.
-> largest
-> ctime/mtime/atime
-> i_current_depth
-> i_xattr_nid
-> i_pino
-> i_advise
-> i_flags
-> i_mode
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch refactors to use inode pointer for set_inode_flag and
clear_inode_flag.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
Once we failed to merge inline data into inode page during flushing inline
inode, we will skip invoking inode_dec_dirty_pages, which makes dirty page
count incorrect, result in panic in ->evict_inode, Fix it.
------------[ cut here ]------------
kernel BUG at /home/yuchao/git/devf2fs/inode.c:336!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 10004 Comm: umount Tainted: G O 4.6.0-rc5+ #17
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
task: f0c33000 ti: c5212000 task.ti: c5212000
EIP: 0060:[<f89aacb5>] EFLAGS: 00010202 CPU: 3
EIP is at f2fs_evict_inode+0x85/0x490 [f2fs]
EAX: 00000001 EBX: c4529ea0 ECX: 00000001 EDX: 00000000
ESI: c0131000 EDI: f89dd0a0 EBP: c5213e9c ESP: c5213e78
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
CR0: 80050033 CR2: b75878c0 CR3: 1a36a700 CR4: 000406f0
Stack:
c4529ea0 c4529ef4 c5213e8c c176d45c c4529ef4 00000000 c4529ea0 c4529fac
f89dd0a0 c5213eb0 c1204a68 c5213ed8 c452a2b4 c6680930 c5213ec0 c1204b64
c6680d44 c6680620 c5213eec c120588d ee84b000 ee84b5c0 c5214000 ee84b5e0
Call Trace:
[<c176d45c>] ? _raw_spin_unlock+0x2c/0x50
[<c1204a68>] evict+0xa8/0x170
[<c1204b64>] dispose_list+0x34/0x50
[<c120588d>] evict_inodes+0x10d/0x130
[<c11ea941>] generic_shutdown_super+0x41/0xe0
[<c1185190>] ? unregister_shrinker+0x40/0x50
[<c1185190>] ? unregister_shrinker+0x40/0x50
[<c11eac52>] kill_block_super+0x22/0x70
[<f89af23e>] kill_f2fs_super+0x1e/0x20 [f2fs]
[<c11eae1d>] deactivate_locked_super+0x3d/0x70
[<c11eb383>] deactivate_super+0x43/0x60
[<c1208ec9>] cleanup_mnt+0x39/0x80
[<c1208f50>] __cleanup_mnt+0x10/0x20
[<c107d091>] task_work_run+0x71/0x90
[<c105725a>] exit_to_usermode_loop+0x72/0x9e
[<c1001c7c>] do_fast_syscall_32+0x19c/0x1c0
[<c176dd48>] sysenter_past_esp+0x45/0x74
EIP: [<f89aacb5>] f2fs_evict_inode+0x85/0x490 [f2fs] SS:ESP 0068:c5213e78
---[ end trace d30536330b7fdc58 ]---
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch enables reading node blocks in advance when truncating large
data blocks.
> time rm $MNT/testfile (500GB) after drop_cachees
Before : 9.422 s
After : 4.821 s
Reported-by: Stephen Bates <stephen.bates@microsemi.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch removes an obsolete variable used in add_free_nid.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch injects ENOSPC failures.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch converts grab_cache_page to f2fs_grab_cache_page.
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
For foreground GC, we cache node blocks in victim section and set them
dirty, then we call sync_node_pages to flush these node pages, but
meanwhile, those node pages which does not locate in victim section
will be flushed together, so more bandwidth and continuous free space
would be occupied.
So for this condition, it's better to leave those unrelated node page
in cache for further write hit, and let CP or VM to flush them afterward.
Signed-off-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
In order to give atomic writes, we should consider power failure during
sync_node_pages in fsync.
So, this patch marks fsync flag only in the last dnode block.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
The fsync_node_pages should return pass or failure so that user could know
fsync is completed or not.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
This patch splits the existing sync_node_pages into (f)sync_node_pages.
The fsync_node_pages is used for f2fs_sync_file only.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
When fsync is called, sync_node_pages finds a proper direct node pages to flush.
But, it locks unrelated direct node pages together unnecessarily.
Acked-by: Chao Yu <yuchao0@huawei.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|