Age | Commit message (Collapse) | Author |
|
[ Upstream commit b74cd55aa9a9d0aca760028a51343ec79812e410 ]
First, function gfs2_ail_flush_reqd checks the SDF_FORCE_AIL_FLUSH flag
to determine if an AIL flush should be forced in low-memory situations.
However, it also immediately clears the flag, and when called repeatedly
as in function gfs2_logd, the flag will be lost. Fix that by pulling
the SDF_FORCE_AIL_FLUSH flag check out of gfs2_ail_flush_reqd.
Second, function gfs2_writepages sets the SDF_FORCE_AIL_FLUSH flag
whether or not enough pages were written. If enough pages could be
written, flushing the AIL is unnecessary, though.
Third, gfs2_writepages doesn't wake up logd after setting the
SDF_FORCE_AIL_FLUSH flag, so it can take a long time for logd to react.
It would be preferable to wake up logd, but that hurts the performance
of some workloads and we don't quite understand why so far, so don't
wake up logd so far.
Fixes: b066a4eebd4f ("gfs2: forcibly flush ail to relieve memory pressure")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6df373b09b1dcf2f7d579f515f653f89a896d417 ]
In gfs2_logd(), switch from an open-coded wait loop to
wait_event_interruptible_timeout().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Stable-dep-of: b74cd55aa9a9 ("gfs2: low-memory forced flush fixes")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 6fa0a72cbbe45db4ed967a51f9e6f4e3afe61d20 ]
Some fields such as gt_logd_secs of the struct gfs2_tune are accessed
without holding the lock gt_spin in gfs2_show_options():
val = sdp->sd_tune.gt_logd_secs;
if (val != 30)
seq_printf(s, ",commit=%d", val);
And thus can cause data races when gfs2_show_options() and other functions
such as gfs2_reconfigure() are concurrently executed:
spin_lock(>->gt_spin);
gt->gt_logd_secs = newargs->ar_commit;
To fix these possible data races, the lock sdp->sd_tune.gt_spin is
acquired before accessing the fields of gfs2_tune and released after these
accesses.
Further changes by Andreas:
- Don't hold the spin lock over the seq_printf operations.
Reported-by: BassCheck <bass@buaa.edu.cn>
Signed-off-by: Tuo Li <islituo@gmail.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit c8ed1b35931245087968fd95b2ec3dfc50f77769 ]
In gfs2_file_buffered_write(), we currently jump from the second call of
function should_fault_in_pages() to above the first call, so
should_fault_in_pages() is getting called twice in a row, causing it to
accidentally fall back to single-page writes rather than trying the more
efficient multi-page writes first.
Fix that by moving the retry label to the correct place, behind the
first call to should_fault_in_pages().
Fixes: e1fa9ea85ce8 ("gfs2: Stop using glock holder auto-demotion for now")
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit fa58cc888d67e640e354d8b3ceef877ea167b0cf ]
When a direct I/O write is performed, iomap_dio_rw() invalidates the
part of the page cache which the write is going to before carrying out
the write. In the odd case, the direct I/O write will be reading from
the same page it is writing to. gfs2 carries out writes with page
faults disabled, so it should have been obvious that this page
invalidation can cause iomap_dio_rw() to never make any progress.
Currently, gfs2 will end up in an endless retry loop in
gfs2_file_direct_write() instead, though.
Break this endless loop by limiting the number of retries and falling
back to buffered I/O after that.
Also simplify should_fault_in_pages() sightly and add a comment to make
the above case easier to understand.
Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 504a10d9e46bc37b23d0a1ae2f28973c8516e636 ]
On corrupt gfs2 file systems the evict code can try to reference the
journal descriptor structure, jdesc, after it has been freed and set to
NULL. The sequence of events is:
init_journal()
...
fail_jindex:
gfs2_jindex_free(sdp); <------frees journals, sets jdesc = NULL
if (gfs2_holder_initialized(&ji_gh))
gfs2_glock_dq_uninit(&ji_gh);
fail:
iput(sdp->sd_jindex); <--references jdesc in evict_linked_inode
evict()
gfs2_evict_inode()
evict_linked_inode()
ret = gfs2_trans_begin(sdp, 0, sdp->sd_jdesc->jd_blocks);
<------references the now freed/zeroed sd_jdesc pointer.
The call to gfs2_trans_begin is done because the truncate_inode_pages
call can cause gfs2 events that require a transaction, such as removing
journaled data (jdata) blocks from the journal.
This patch fixes the problem by adding a check for sdp->sd_jdesc to
function gfs2_evict_inode. In theory, this should only happen to corrupt
gfs2 file systems, when gfs2 detects the problem, reports it, then tries
to evict all the system inodes it has read in up to that point.
Reported-by: Yang Lan <lanyang0908@gmail.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cfcdb5bad34f600aed7613c3c1a5e618111f77b7 ]
The maximum allowed height of an inode's metadata tree depends on the
filesystem block size; it is lower for bigger-block filesystems. When
reading in an inode, make sure that the height doesn't exceed the
maximum allowed height.
Arrays like sd_heightsize are sized to be big enough for any filesystem
block size; they will often be slightly bigger than what's needed for a
specific filesystem.
Reported-by: syzbot+45d4691b1ed3c48eba05@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit b66f723bb552ad59c2acb5d45ea45c890f84498b ]
In gfs2_make_fs_rw(), make sure to call gfs2_consist() to report an
inconsistency and mark the filesystem as withdrawn when
gfs2_find_jhead() fails.
At the end of gfs2_make_fs_rw(), when we discover that the filesystem
has been withdrawn, make sure we report an error. This also replaces
the gfs2_withdrawn() check after gfs2_find_jhead().
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: syzbot+f51cb4b9afbd87ec06f2@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit cbb60951ce18c9b6e91d2eb97deb41d8ff616622 ]
The ->writepage() and ->writepages() operations are supposed to write
entire pages. However, on filesystems with a block size smaller than
PAGE_SIZE, __gfs2_jdata_writepage() only adds the first block to the
current transaction instead of adding the entire page. Fix that.
Fixes: 18ec7d5c3f43 ("[GFS2] Make journaled data files identical to normal files on disk")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 70376c7ff31221f1d21db5611d8209e677781d3a upstream.
Check if the inode size of stuffed (inline) inodes is within the allowed
range when reading inodes from disk (gfs2_dinode_in()). This prevents
us from on-disk corruption.
The two checks in stuffed_readpage() and gfs2_unstuffer_page() that just
truncate inline data to the maximum allowed size don't actually make
sense, and they can be removed now as well.
Reported-by: syzbot+7bb81dfa9cda07d9cd9d@syzkaller.appspotmail.com
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 7db354444ad8429e660b0f8145d425285d4f90ff upstream.
In each of the two functions, add an inode variable that points to
&ip->i_inode and use that throughout the rest of the function.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 95ecbd0f162fc06ef4c4045a66f653f47b62a2d3 ]
Commit b2b0a5e97855 switched from generic_writepages() to
filemap_fdatawrite_wbc() in gfs2_ail1_start_one() on the path to
replacing ->writepage() with ->writepages() and eventually eliminating
the former. Function gfs2_ail1_start_one() is called from
gfs2_log_flush(), our main function for flushing the filesystem log.
Unfortunately, at least as implemented today, ->writepage() and
->writepages() are entirely different operations for journaled data
inodes: while the former creates and submits transactions covering the
data to be written, the latter flushes dirty buffers out to disk.
With gfs2_ail1_start_one() now calling ->writepages(), we end up
creating filesystem transactions while we are in the course of a log
flush, which immediately deadlocks on the sdp->sd_log_flush_lock
semaphore.
Work around that by going back to how things used to work before commit
b2b0a5e97855 for now; figuring out a superior solution will take time we
don't have available right now. However ...
Since the removal of generic_writepages() is imminent, open-code it
here. We're already inside a blk_start_plug() ... blk_finish_plug()
section here, so skip that part of the original generic_writepages().
This reverts commit b2b0a5e978552e348f85ad9c7568b630a5ede659.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 88f4a9f813c549f6b8a6fbf12030949b48a4d5a4 ]
Commit c412a97cf6c5 changed delete_work_func() to always perform an
inode lookup when gfs2_try_evict() fails. This doesn't make sense as a
gfs2_try_evict() failure indicates that the inode is likely still in
use. Revert that change.
Fixes: c412a97cf6c5 ("gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes")
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 debugfs updates from Andreas Gruenbacher:
- Improve the way how the state of glocks is reported in debugfs for
glocks which are not held by processes, but rather by other resouces
like cached inodes or flocks.
* tag 'gfs2-nopid-for-v6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Mark the remaining process-independent glock holders as GL_NOPID
gfs2: Mark flock glock holders as GL_NOPID
gfs2: Add GL_NOPID flag for process-independent glock holders
gfs2: Add flocks to glockfd debugfs file
gfs2: Add glockfd debugfs file
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Make sure to initialize the filesystem work queues before registering
the filesystem; this prevents them from being used uninitialized.
- On filesystem withdraw: prevent a a double iput() and immediately
reject pending locking requests that can no longer succeed.
- Use TRY lock in gfs2_inode_lookup() to prevent a rare glock hang
during evict.
- During filesystem mount, explicitly make sure that the sb_bsize and
sb_bsize_shift super block fields are consistent with each other.
This prevents messy error messages during fuzz testing.
- Switch from strlcpy to strscpy.
* tag 'gfs2-v6.0-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: Register fs after creating workqueues
gfs2: Check sb_bsize_shift after reading superblock
gfs2: Switch from strlcpy to strscpy
gfs2: Clear flags when withdraw prevents xmote
gfs2: Dequeue waiters when withdrawn
gfs2: Prevent double iput for journal on error
gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
|
|
Resolves a conflict in gfs2_inode_lookup() between the following commits:
gfs2: Use TRY lock in gfs2_inode_lookup for UNLINKED inodes
gfs2: Mark the remaining process-independent glock holders as GL_NOPID
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Pull vfs file updates from Al Viro:
"struct file-related stuff"
* tag 'pull-file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
dma_buf_getfile(): don't bother with ->f_flags reassignments
Change calling conventions for filldir_t
locks: fix TOCTOU race when granting write lease
|
|
Before this patch, the gfs2 file system was registered prior to creating
the three workqueues. In some cases this allowed dlm to send recovery
work to a workqueue that did not yet exist because gfs2 was still
initializing.
This patch changes the order of gfs2's initialization routine so it only
registers the file system after the work queues are created.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Fuzzers like to scribble over sb_bsize_shift but in reality it's very
unlikely that this field would be corrupted on its own. Nevertheless it
should be checked to avoid the possibility of messy mount errors due to
bad calculations. It's always a fixed value based on the block size so
we can just check that it's the expected value.
Tested with:
mkfs.gfs2 -O -p lock_nolock /dev/vdb
for i in 0 -1 64 65 32 33; do
gfs2_edit -p sb field sb_bsize_shift $i /dev/vdb
mount /dev/vdb /mnt/test && umount /mnt/test
done
Before this patch we get a withdraw after
[ 76.413681] gfs2: fsid=loop0.0: fatal: invalid metadata block
[ 76.413681] bh = 19 (type: exp=5, found=4)
[ 76.413681] function = gfs2_meta_buffer, file = fs/gfs2/meta_io.c, line = 492
and with UBSAN configured we also get complaints like
[ 76.373395] UBSAN: shift-out-of-bounds in fs/gfs2/ops_fstype.c:295:19
[ 76.373815] shift exponent 4294967287 is too large for 64-bit type 'long unsigned int'
After the patch, these complaints don't appear, mount fails immediately
and we get an explanation in dmesg.
Reported-by: syzbot+dcf33a7aae997956fe06@syzkaller.appspotmail.com
Signed-off-by: Andrew Price <anprice@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
ll_rw_block() is not safe for the sync read path because it cannot
guarantee that always submitting read IO if the buffer has been locked,
so stop using it. We also switch to new bh_readahead() helper for the
readahead path.
Link: https://lkml.kernel.org/r/20220901133505.2510834-5-yi.zhang@huawei.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Switch from strlcpy to strscpy and make sure that @count is the size of
the smaller of the source and destination buffers. This prevents
reading beyond the end of the source buffer when the source string isn't
null terminated.
Found by a modified version of syzkaller.
Suggested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
There are a couple places in function do_xmote where normal processing
is circumvented due to withdraws in progress. However, since we bypass
most of do_xmote() we bypass telling dlm to lock the dlm lock, which
means dlm will never respond with a completion callback. Since the
completion callback ordinarily clears GLF_LOCK, this patch changes
function do_xmote to handle those situations more gracefully so the
file system may be unmounted after withdraw.
A very similar situation happens with the GLF_DEMOTE_IN_PROGRESS flag,
which is cleared by function finish_xmote(). Since the withdraw causes
us to skip the majority of do_xmote, it therefore also skips the call
to finish_xmote() so the DEMOTE_IN_PROGRESS flag needs to be cleared
manually.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
When a withdraw occurs, ordinary (not system) glocks may not be granted
anymore. Later, when the file system is unmounted, gfs2_gl_hash_clear()
tries to clear out all the glocks, but these un-grantable pending
waiters prevent some glocks from being freed. So the unmount hangs, at
least for its ten-minute timeout period.
This patch takes measures to remove any pending waiters from
the glocks that will never be granted. This allows the unmount to
proceed in a reasonable period of time.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
When a gfs2 file system is withdrawn it does iput on its journal to
allow recovery from another cluster node. If it's unable to get a
replacement inode for whatever reason, the journal descriptor would
still be pointing at the evicted inode. So when unmount clears out the
list of journals, it would do a second iput referencing the pointer.
To avoid this, set the inode pointer to NULL.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Before this patch, delete_work_func() would check for the GLF_DEMOTE
flag on the iopen glock and if set, it would perform special processing.
However, there was a race whereby the GLF_DEMOTE flag could be set by
another process after the check. Then when it called
gfs2_lookup_by_inum() which calls gfs2_inode_lookup(), it tried to lock
the iopen glock in SH mode, but the GLF_DEMOTE flag prevented the
request from being granted. But the iopen glock could never be demoted
because that happens when the inode is evicted, and the evict was never
completed because of the failed lookup.
To fix that, change function gfs2_inode_lookup() so that when
GFS2_BLKST_UNLINKED inodes are searched, it uses the LM_FLAG_TRY flag
for the iopen glock. If the locking request fails, fail
gfs2_inode_lookup() with -EAGAIN so that delete_work_func() can retry
the operation later.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
The DLM_LSFL_FS flag is set in lockspaces created directly
for a kernel user, as opposed to those lockspaces created
for user space applications. The user space libdlm allowed
this flag to be set for lockspaces created from user space,
but then used by a kernel user. No kernel user has ever
used this method, so remove the ability to do it.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
|
|
filldir_t instances (directory iterators callbacks) used to return 0 for
"OK, keep going" or -E... for "stop". Note that it's *NOT* how the
error values are reported - the rules for those are callback-dependent
and ->iterate{,_shared}() instances only care about zero vs. non-zero
(look at emit_dir() and friends).
So let's just return bool ("should we keep going?") - it's less confusing
that way. The choice between "true means keep going" and "true means
stop" is bikesheddable; we have two groups of callbacks -
do something for everything in directory, until we run into problem
and
find an entry in directory and do something to it.
The former tended to use 0/-E... conventions - -E<something> on failure.
The latter tended to use 0/1, 1 being "stop, we are done".
The callers treated anything non-zero as "stop", ignoring which
non-zero value did they get.
"true means stop" would be more natural for the second group; "true
means keep going" - for the first one. I tried both variants and
the things like
if allocation failed
something = -ENOMEM;
return true;
just looked unnatural and asking for trouble.
[folded suggestion from Matthew Wilcox <willy@infradead.org>]
Acked-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Pull more iomap updates from Darrick Wong:
"In the past 10 days or so I've not heard any ZOMG STOP style
complaints about removing ->writepage support from gfs2 or zonefs, so
here's the pull request removing them (and the underlying fs iomap
support) from the kernel:
- Remove iomap_writepage and all callers, since the mm apparently
never called the zonefs or gfs2 writepage functions"
* tag 'iomap-6.0-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
iomap: remove iomap_writepage
zonefs: remove ->writepage
gfs2: remove ->writepage
gfs2: stop using generic_writepages in gfs2_ail1_start_one
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more iov_iter updates from Al Viro:
- more new_sync_{read,write}() speedups - ITER_UBUF introduction
- ITER_PIPE cleanups
- unification of iov_iter_get_pages/iov_iter_get_pages_alloc and
switching them to advancing semantics
- making ITER_PIPE take high-order pages without splitting them
- handling copy_page_from_iter() for high-order pages properly
* tag 'pull-work.iov_iter-rebased' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (32 commits)
fix copy_page_from_iter() for compound destinations
hugetlbfs: copy_page_to_iter() can deal with compound pages
copy_page_to_iter(): don't split high-order page in case of ITER_PIPE
expand those iov_iter_advance()...
pipe_get_pages(): switch to append_pipe()
get rid of non-advancing variants
ceph: switch the last caller of iov_iter_get_pages_alloc()
9p: convert to advancing variant of iov_iter_get_pages_alloc()
af_alg_make_sg(): switch to advancing variant of iov_iter_get_pages()
iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()
block: convert to advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()
iov_iter: saner helper for page array allocation
fold __pipe_get_pages() into pipe_get_pages()
ITER_XARRAY: don't open-code DIV_ROUND_UP()
unify the rest of iov_iter_get_pages()/iov_iter_get_pages_alloc() guts
unify xarray_get_pages() and xarray_get_pages_alloc()
unify pipe_get_pages() and pipe_get_pages_alloc()
iov_iter_get_pages(): sanity-check arguments
iov_iter_get_pages_alloc(): lift freeing pages array on failure exits into wrapper
...
|
|
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(),
checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC
ones.
We are going to expose the things like ->write_iter() et.al. to those
in subsequent commits.
New predicate (user_backed_iter()) that is true for ITER_IOVEC and
ITER_UBUF; places like direct-IO handling should use that for
checking that pages we modify after getting them from iov_iter_get_pages()
would need to be dirtied.
DO NOT assume that replacing iter_is_iovec() with user_backed_iter()
will solve all problems - there's code that uses iter_is_iovec() to
decide how to poke around in iov_iter guts and for that the predicate
replacement obviously won't suffice.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
Pull gfs2 updates from Andreas Gruenbacher:
- Instantiate glocks ouside of the glock state engine, in the contect
of the process taking the glock. This moves unnecessary complexity
out of the core glock code. Clean up the instantiate logic to be more
sensible.
- In gfs2_glock_async_wait(), cancel pending locking request upon
failure. Make sure all glocks are left in a consistent state.
- Various other minor cleanups and fixes.
* tag 'gfs2-v5.19-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
gfs2: List traversal in do_promote is safe
gfs2: do_promote glock holder stealing fix
gfs2: Use better variable name
gfs2: Make go_instantiate take a glock
gfs2: Add new go_held glock operation
gfs2: Revert 'Fix "truncate in progress" hang'
gfs2: Instantiate glocks ouside of glock state engine
gfs2: Fix up gfs2_glock_async_wait
gfs2: Minor gfs2_glock_nq_m cleanup
gfs2: Fix spelling mistake in comment
gfs2: Rewrap overlong comment in do_promote
gfs2: Remove redundant NULL check before kfree
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Most of the MM queue. A few things are still pending.
Liam's maple tree rework didn't make it. This has resulted in a few
other minor patch series being held over for next time.
Multi-gen LRU still isn't merged as we were waiting for mapletree to
stabilize. The current plan is to merge MGLRU into -mm soon and to
later reintroduce mapletree, with a view to hopefully getting both
into 6.1-rc1.
Summary:
- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe
Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
- Some kmemleak fixes from Patrick Wang and Waiman Long
- DAMON updates from SeongJae Park
- memcg debug/visibility work from Roman Gushchin
- vmalloc speedup from Uladzislau Rezki
- more folio conversion work from Matthew Wilcox
- enhancements for coherent device memory mapping from Alex Sierra
- addition of shared pages tracking and CoW support for fsdax, from
Shiyang Ruan
- hugetlb optimizations from Mike Kravetz
- Mel Gorman has contributed some pagealloc changes to improve
latency and realtime behaviour.
- mprotect soft-dirty checking has been improved by Peter Xu
- Many other singleton patches all over the place"
[ XFS merge from hell as per Darrick Wong in
https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]
* tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
tools/testing/selftests/vm/hmm-tests.c: fix build
mm: Kconfig: fix typo
mm: memory-failure: convert to pr_fmt()
mm: use is_zone_movable_page() helper
hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
hugetlbfs: cleanup some comments in inode.c
hugetlbfs: remove unneeded header file
hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
hugetlbfs: use helper macro SZ_1{K,M}
mm: cleanup is_highmem()
mm/hmm: add a test for cross device private faults
selftests: add soft-dirty into run_vmtests.sh
selftests: soft-dirty: add test for mprotect
mm/mprotect: fix soft-dirty check in can_change_pte_writable()
mm: memcontrol: fix potential oom_lock recursion deadlock
mm/gup.c: fix formatting in check_and_migrate_movable_page()
xfs: fail dax mount if reflink is enabled on a partition
mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
userfaultfd: don't fail on unrecognized features
hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
...
|
|
|
|
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
|
|
There is nothing iomap-specific about iomap_migratepage(), and it fits
a pattern used by several other filesystems, so move it to mm/migrate.c,
convert it to be filemap_migrate_folio() and convert the iomap filesystems
to use it.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use folio_put_refs() to perform only one atomic operation instead of two.
The other changes are straightforward conversions from page APIs to
their folio equivalents.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
|
|
->writepage is only used for single page writeback from memory reclaim,
and not called at all for cgroup writeback. Follow the lead of XFS
and remove ->writepage and rely entirely on ->writepages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Use filemap_fdatawrite_wbc instead of generic_writepages in
gfs2_ail1_start_one so that the functin can also cope with address_space
operations that only implement ->writepages and to properly account
for cgroup writeback.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|
Improve static type checking by using the enum req_op type for variables
that represent a request operation and the new blk_opf_t type for
variables that represent request flags. Combine the first two
gfs2_submit_bhs() arguments into a single argument.
Reviewed-by: Andreas Gruenbacher <agruenba@redhat.com>
Cc: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-54-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Both submit_bh() and ll_rw_block() accept a request operation type and
request flags as their first two arguments. Micro-optimize these two
functions by combining these first two arguments into a single argument.
This patch does not change the behavior of any of the modified code.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Jan Kara <jack@suse.cz>
Acked-by: Song Liu <song@kernel.org> (for the md changes)
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220714180729.1065367-48-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Currently shrinkers are anonymous objects. For debugging purposes they
can be identified by count/scan function names, but it's not always
useful: e.g. for superblock's shrinkers it's nice to have at least an
idea of to which superblock the shrinker belongs.
This commit adds names to shrinkers. register_shrinker() and
prealloc_shrinker() functions are extended to take a format and arguments
to master a name.
In some cases it's not possible to determine a good name at the time when
a shrinker is allocated. For such cases shrinker_debugfs_rename() is
provided.
The expected format is:
<subsystem>-<shrinker_type>[:<instance>]-<id>
For some shrinkers an instance can be encoded as (MAJOR:MINOR) pair.
After this change the shrinker debugfs directory looks like:
$ cd /sys/kernel/debug/shrinker/
$ ls
dquota-cache-16 sb-devpts-28 sb-proc-47 sb-tmpfs-42
mm-shadow-18 sb-devtmpfs-5 sb-proc-48 sb-tmpfs-43
mm-zspool:zram0-34 sb-hugetlbfs-17 sb-pstore-31 sb-tmpfs-44
rcu-kfree-0 sb-hugetlbfs-33 sb-rootfs-2 sb-tmpfs-49
sb-aio-20 sb-iomem-12 sb-securityfs-6 sb-tracefs-13
sb-anon_inodefs-15 sb-mqueue-21 sb-selinuxfs-22 sb-xfs:vda1-36
sb-bdev-3 sb-nsfs-4 sb-sockfs-8 sb-zsmalloc-19
sb-bpf-32 sb-pipefs-14 sb-sysfs-26 thp-deferred_split-10
sb-btrfs:vda2-24 sb-proc-25 sb-tmpfs-1 thp-zero-9
sb-cgroup2-30 sb-proc-39 sb-tmpfs-27 xfs-buf:vda1-37
sb-configfs-23 sb-proc-41 sb-tmpfs-29 xfs-inodegc:vda1-38
sb-dax-11 sb-proc-45 sb-tmpfs-35
sb-debugfs-7 sb-proc-46 sb-tmpfs-40
[roman.gushchin@linux.dev: fix build warnings]
Link: https://lkml.kernel.org/r/Yr+ZTnLb9lJk6fJO@castle
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lkml.kernel.org/r/20220601032227.4076670-4-roman.gushchin@linux.dev
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Cc: Dave Chinner <dchinner@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In do_promote(), we're never removing the current entry from the list
and so the list traversal is actually safe. Switch back to
list_for_each_entry().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In do_promote(), when the glock had no strong holders, we were
accidentally calling demote_incompat_holders() with new_gh == NULL, so
no weak holders were considered incompatible. Instead, the new holder
should have been passed in.
For doing that, the HIF_HOLDER flag needs to be set in new_gh to prevent
may_grant() from complaining. This means that the new holder will now
be recognized as a current holder, so skip over it explicitly in
demote_incompat_holders() to prevent it from being dequeued.
To further clarify things, we can now rename new_gh to current_gh in
demote_incompat_holders(); after all, the HIF_HOLDER flag is already set,
which means the new holder is already a current holder.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
In do_promote() and add_to_queue(), use current_gh as the variable name
for the first strong holder we could find: this matches the variable
name is may_grant(), and more clearly indicates that we're interested in
one (any) of the current strong holders.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Make go_instantiate take a glock instead of a glock holder as its argument:
this handler is supposed to instantiate the object associated with the glock.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Right now, inode_go_instantiate() contains functionality that relates to
how a glock is held rather than the glock itself, like waiting for
pending direct I/O to complete and completing interrupted truncates.
This code is meant to be run each time a holder is acquired, but
go_instantiate is actually only called once, when the glock is
instantiated.
To fix that, introduce a new go_held glock operation that is called each
time a glock holder is acquired. Move the holder specific code in
inode_go_instantiate() over to inode_go_held().
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Now that interrupted truncates are completed in the context of the
process taking the glock, there is no need for the glock state engine to
delegate that task to gfs2_quotad or for quotad to perform those
truncates anymore. Get rid of the obsolete associated infrastructure.
Reverts commit 813e0c46c9e2 ("GFS2: Fix "truncate in progress" hang").
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
|
|
Instantiate glocks outside of the glock state engine: there is no real
reason for instantiating them inside the glock state engine; it only
complicates the code.
Instead, instantiate them in gfs2_glock_wait() and gfs2_glock_async_wait()
using the new gfs2_glock_holder_ready() helper. On top of that, the only
other place that acquires a glock without using gfs2_glock_wait() or
gfs2_glock_async_wait() is gfs2_upgrade_iopen_glock(), so call
gfs2_glock_holder_ready() there as well.
If a dinode has a pending truncate, the glock-specific instantiate function
for inodes wakes up the truncate function in the quota daemon. Waiting for
the completion of the truncate was previously done by the glock state
engine, but we now need to wait in inode_go_instantiate().
This also means that gfs2_instantiate() will now no longer return any
"special" error codes.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|
|
Since commit 1fc05c8d8426 ("gfs2: cancel timed-out glock requests"), a
pending locking request can be canceled by calling gfs2_glock_dq() on
the pending holder. In gfs2_glock_async_wait(), when we time out, use
that to cancel the remaining locking requests and dequeue the locking
requests already granted. That's simpler as well as more efficient than
waiting for all locking requests to eventually be granted and dequeuing
them then.
In addition, gfs2_glock_async_wait() promises that by the time the
function completes, all glocks are either granted or dequeued, but the
implementation doesn't keep that promise if individual locking requests
fail. Fix that as well.
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
|