Age | Commit message (Collapse) | Author |
|
[ Upstream commit e7547daccd6a37522f0af74ec4b5a3036f3dd328 ]
This patch prepares extent_cache to be ready for addition.
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Stable-dep-of: 043d2d00b443 ("f2fs: factor out victim_entry usage from general rb_tree use")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 054fbf7ff8143d35ca7d3bb5414bb44ee1574194 ]
The arguments passed to the trace events are of type unsigned int,
however the signature of the events used __le32 parameters.
I may be missing the point here, but sparse flagged this and it
does seem incorrect to me.
net/qrtr/ns.c: note: in included file (through include/trace/trace_events.h, include/trace/define_trace.h, include/trace/events/qrtr.h):
./include/trace/events/qrtr.h:11:1: warning: cast to restricted __le32
./include/trace/events/qrtr.h:11:1: warning: restricted __le32 degrades to integer
./include/trace/events/qrtr.h:11:1: warning: restricted __le32 degrades to integer
... (a lot more similar warnings)
net/qrtr/ns.c:115:47: expected restricted __le32 [usertype] service
net/qrtr/ns.c:115:47: got unsigned int service
net/qrtr/ns.c:115:61: warning: incorrect type in argument 2 (different base types)
... (a lot more similar warnings)
Fixes: dfddb54043f0 ("net: qrtr: Add tracepoint support")
Reviewed-by: Manivannan Sadhasivam <mani@kernel.org>
Signed-off-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230402-qrtr-trace-types-v1-1-92ad55008dd3@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit db7b464df9d820186e98a65aa6a10f0d51fbf8ce ]
This commit adds checks for the TICK_DEP_MASK_RCU_EXP bit, thus enabling
RCU expedited grace periods to actually force-enable scheduling-clock
interrupts on holdout CPUs.
Fixes: df1e849ae455 ("rcu: Enable tick for nohz_full CPUs slow to provide expedited QS")
Signed-off-by: Zqiang <qiang1.zhang@intel.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit f82e7ca019dfad3b006fd3b772f7ac569672db55 ]
A __field() in the TRACE_EVENT() macro is used to set up the fields of the
trace event data. It is for single storage units (word, char, int,
pointer, etc) and not for complex structures or arrays. Unfortunately,
there's nothing preventing the build from accepting:
__field(int, arr[5]);
from building. It will turn into a array value. This use to work fine, as
the offset and size use to be determined by the macro using the field name,
but things have changed and the offset and size are now determined by the
type. So the above would only be size 4, and the next field will be
located 4 bytes from it (instead of 20).
The proper way to declare static arrays is to use the __array() macro.
Instead of __field(int, arr[5]) it should be __array(int, arr, 5).
Add some macro tricks to the building of a trace event from the
TRACE_EVENT() macro such that __field(int, arr[5]) will fail to build. A
comment by the failure will explain why the build failed.
Link: https://lore.kernel.org/lkml/20230306122549.236561-1-douglas.raillard@arm.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230309221302.642e82d9@gandalf.local.home
Reported-by: Douglas RAILLARD <douglas.raillard@arm.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0b04d4c0542e8573a837b1d81b94209e48723b25 ]
Fix the nid_t field so that its size is correctly reported in the text
format embedded in trace.dat files. As it stands, it is reported as
being of size 4:
field:nid_t nid[3]; offset:24; size:4; signed:0;
Instead of 12:
field:nid_t nid[3]; offset:24; size:12; signed:0;
This also fixes the reported offset of subsequent fields so that they
match with the actual struct layout.
Signed-off-by: Douglas Raillard <douglas.raillard@arm.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit d18a04157fc171fd48075e3dc96471bd3b87f0dd upstream.
Fix the rcutorturename field so that its size is correctly reported in
the text format embedded in trace.dat files. As it stands, it is
reported as being of size 1:
field:char rcutorturename[8]; offset:8; size:1; signed:0;
Signed-off-by: Douglas Raillard <douglas.raillard@arm.com>
Reviewed-by: Mukesh Ojha <quic_mojha@quicinc.com>
Cc: stable@vger.kernel.org
Fixes: 04ae87a52074e ("ftrace: Rework event_create_dir()")
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
[ boqun: Add "Cc" and "Fixes" tags per Steven ]
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e01c4b7bd41522ae0299c07e2ee8c721fee02595 ]
This patch adds tracepoints for send and recv cases of dlm messages and
dlm rcom messages. In case of send and dlm message we add the dlm rsb
resource name this dlm messages belongs to. This has the advantage to
follow dlm messages on a per lock basis. In case of recv message the
resource name can be extracted by follow the send message sequence
number.
The dlm message DLM_MSG_PURGE doesn't belong to a lock request and will
not set the resource name in a dlm_message trace. The same for all rcom
messages.
There is additional handling required for this debugging functionality
which is tried to be small as possible. Also the midcomms layer gets
aware of lock resource names, for now this is required to make a
connection between sequence number and lock resource names. It is for
debugging purpose only.
Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
Stable-dep-of: 724b6bab0d75 ("fs: dlm: fix use after free in midcomms commit")
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 2f3a9ae990a7881c9a57a073bb52ebe34fdc3160 ]
Commit 3db1de0e582c ("f2fs: change the current atomic write way")
removed old tracepoints, but it missed to add new one, this patch
fixes to introduce trace_f2fs_replace_atomic_write_block to trace
atomic_write commit flow.
Fixes: 3db1de0e582c ("f2fs: change the current atomic write way")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit d0ab772c1f1558af84f3293a52e9e886e08e0754 ]
Fix a bug in trace point definition for devlink health report, as
TP_STRUCT_entry of reporter_name should get reporter_name and not msg.
Note no fixes tag as this is a harmless bug as both reporter_name and
msg are strings and TP_fast_assign for this entry is correct.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit b6c7abd1c28a63ad633433d037ee15a1bc3023ba upstream.
After commit 3087c61ed2c4 ("tools/testing/selftests/bpf: replace open-coded 16 with TASK_COMM_LEN"),
the content of the format file under
/sys/kernel/tracing/events/task/task_newtask was changed from
field:char comm[16]; offset:12; size:16; signed:0;
to
field:char comm[TASK_COMM_LEN]; offset:12; size:16; signed:0;
John reported that this change breaks older versions of perfetto.
Then Mathieu pointed out that this behavioral change was caused by the
use of __stringify(_len), which happens to work on macros, but not on enum
labels. And he also gave the suggestion on how to fix it:
:One possible solution to make this more robust would be to extend
:struct trace_event_fields with one more field that indicates the length
:of an array as an actual integer, without storing it in its stringified
:form in the type, and do the formatting in f_show where it belongs.
The result as follows after this change,
$ cat /sys/kernel/tracing/events/task/task_newtask/format
field:char comm[16]; offset:12; size:16; signed:0;
Link: https://lore.kernel.org/lkml/Y+QaZtz55LIirsUO@google.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230210155921.4610-1-laoar.shao@gmail.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230212151303.12353-1-laoar.shao@gmail.com
Cc: stable@vger.kernel.org
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Kajetan Puchalski <kajetan.puchalski@arm.com>
CC: Qais Yousef <qyousef@layalina.io>
Fixes: 3087c61ed2c4 ("tools/testing/selftests/bpf: replace open-coded 16 with TASK_COMM_LEN")
Reported-by: John Stultz <jstultz@google.com>
Debugged-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0a3212de8ab3e2ce5808c6265855e528d4a6767b ]
Fix a typo of printing FLUSH_DELAYED_REFS event in flush_space() as
FLUSH_ELAYED_REFS.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
commit 0fbcb5251fc81b58969b272c4fb7374a7b922e3e upstream.
fast-commit of create, link, and unlink operations in encrypted
directories is completely broken because the unencrypted filenames are
being written to the fast-commit journal instead of the encrypted
filenames. These operations can't be replayed, as encryption keys
aren't present at journal replay time. It is also an information leak.
Until if/when we can get this working properly, make encrypted directory
operations ineligible for fast-commit.
Note that fast-commit operations on encrypted regular files continue to
be allowed, as they seem to work.
Fixes: aa75f4d3daae ("ext4: main fast-commit commit path")
Cc: <stable@vger.kernel.org> # v5.10+
Signed-off-by: Eric Biggers <ebiggers@google.com>
Link: https://lore.kernel.org/r/20221106224841.279231-2-ebiggers@kernel.org
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit d87a7b4c77a997d5388566dd511ca8e6b8e8a0a8 upstream.
The print format error was found when using ftrace event:
<...>-1406 [000] .... 23599442.895823: jbd2_end_commit: dev 252,8 transaction -1866216965 sync 0 head -1866217368
<...>-1406 [000] .... 23599442.896299: jbd2_start_commit: dev 252,8 transaction -1866216964 sync 0
Use the correct print format for transaction, head and tid.
Fixes: 879c5e6b7cb4 ('jbd2: convert instrumentation from markers to tracepoints')
Signed-off-by: Bixuan Cui <cuibixuan@linux.alibaba.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/1665488024-95172-1-git-send-email-cuibixuan@linux.alibaba.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 5c20311d76cbaeb7ed2ecf9c8b8322f8fc4a7ae3 ]
Tracepoints are not allowed to sleep, as such the following splat is
generated due to call to ib_query_pkey() in atomic context.
WARNING: CPU: 0 PID: 1888000 at kernel/trace/ring_buffer.c:2492 rb_commit+0xc1/0x220
CPU: 0 PID: 1888000 Comm: kworker/u9:0 Kdump: loaded Tainted: G OE --------- - - 4.18.0-305.3.1.el8.x86_64 #1
Hardware name: Red Hat KVM, BIOS 1.13.0-2.module_el8.3.0+555+a55c8938 04/01/2014
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
RIP: 0010:rb_commit+0xc1/0x220
RSP: 0000:ffffa8ac80f9bca0 EFLAGS: 00010202
RAX: ffff8951c7c01300 RBX: ffff8951c7c14a00 RCX: 0000000000000246
RDX: ffff8951c707c000 RSI: ffff8951c707c57c RDI: ffff8951c7c14a00
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8951c7c01300 R11: 0000000000000001 R12: 0000000000000246
R13: 0000000000000000 R14: ffffffff964c70c0 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff8951fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f20e8f39010 CR3: 000000002ca10005 CR4: 0000000000170ef0
Call Trace:
ring_buffer_unlock_commit+0x1d/0xa0
trace_buffer_unlock_commit_regs+0x3b/0x1b0
trace_event_buffer_commit+0x67/0x1d0
trace_event_raw_event_ib_mad_recv_done_handler+0x11c/0x160 [ib_core]
ib_mad_recv_done+0x48b/0xc10 [ib_core]
? trace_event_raw_event_cq_poll+0x6f/0xb0 [ib_core]
__ib_process_cq+0x91/0x1c0 [ib_core]
ib_cq_poll_work+0x26/0x80 [ib_core]
process_one_work+0x1a7/0x360
? create_worker+0x1a0/0x1a0
worker_thread+0x30/0x390
? create_worker+0x1a0/0x1a0
kthread+0x116/0x130
? kthread_flush_work_fn+0x10/0x10
ret_from_fork+0x35/0x40
---[ end trace 78ba8509d3830a16 ]---
Fixes: 821bf1de45a1 ("IB/MAD: Add recv path trace point")
Signed-off-by: Leonid Ravich <lravich@gmail.com>
Link: https://lore.kernel.org/r/Y2t5feomyznrVj7V@leonid-Inspiron-3421
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
[ Upstream commit 0db18eec0d9a7ee525209e31e3ac2f673545b12f ]
commit 18ae8d12991b ("f2fs: show more DIO information in tracepoint")
introduces iocb field in 'f2fs_direct_IO_enter' trace event
And it only assigns the pointer and later it accesses its field
in trace print log.
Unable to handle kernel paging request at virtual address ffffffc04cef3d30
Mem abort info:
ESR = 0x96000007
EC = 0x25: DABT (current EL), IL = 32 bits
pc : trace_raw_output_f2fs_direct_IO_enter+0x54/0xa4
lr : trace_raw_output_f2fs_direct_IO_enter+0x2c/0xa4
sp : ffffffc0443cbbd0
x29: ffffffc0443cbbf0 x28: ffffff8935b120d0 x27: ffffff8935b12108
x26: ffffff8935b120f0 x25: ffffff8935b12100 x24: ffffff8935b110c0
x23: ffffff8935b10000 x22: ffffff88859a936c x21: ffffff88859a936c
x20: ffffff8935b110c0 x19: ffffff8935b10000 x18: ffffffc03b195060
x17: ffffff8935b11e76 x16: 00000000000000cc x15: ffffffef855c4f2c
x14: 0000000000000001 x13: 000000000000004e x12: ffff0000ffffff00
x11: ffffffef86c350d0 x10: 00000000000010c0 x9 : 000000000fe0002c
x8 : ffffffc04cef3d28 x7 : 7f7f7f7f7f7f7f7f x6 : 0000000002000000
x5 : ffffff8935b11e9a x4 : 0000000000006250 x3 : ffff0a00ffffff04
x2 : 0000000000000002 x1 : ffffffef86a0a31f x0 : ffffff8935b10000
Call trace:
trace_raw_output_f2fs_direct_IO_enter+0x54/0xa4
print_trace_fmt+0x9c/0x138
print_trace_line+0x154/0x254
tracing_read_pipe+0x21c/0x380
vfs_read+0x108/0x3ac
ksys_read+0x7c/0xec
__arm64_sys_read+0x20/0x30
invoke_syscall+0x60/0x150
el0_svc_common.llvm.1237943816091755067+0xb8/0xf8
do_el0_svc+0x28/0xa0
Fix it by copying the required variables for printing and while at
it fix the similar issue at some other places in the same file.
Fixes: bd984c03097b ("f2fs: show more DIO information in tracepoint")
Signed-off-by: Mukesh Ojha <quic_mojha@quicinc.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
|
If a cookie expires from the LRU and the LRU_DISCARD flag is set, but
the state machine has not run yet, it's possible another thread can call
fscache_use_cookie and begin to use it.
When the cookie_worker finally runs, it will see the LRU_DISCARD flag
set, transition the cookie->state to LRU_DISCARDING, which will then
withdraw the cookie. Once the cookie is withdrawn the object is removed
the below oops will occur because the object associated with the cookie
is now NULL.
Fix the oops by clearing the LRU_DISCARD bit if another thread uses the
cookie before the cookie_worker runs.
BUG: kernel NULL pointer dereference, address: 0000000000000008
...
CPU: 31 PID: 44773 Comm: kworker/u130:1 Tainted: G E 6.0.0-5.dneg.x86_64 #1
Hardware name: Google Compute Engine/Google Compute Engine, BIOS Google 08/26/2022
Workqueue: events_unbound netfs_rreq_write_to_cache_work [netfs]
RIP: 0010:cachefiles_prepare_write+0x28/0x90 [cachefiles]
...
Call Trace:
netfs_rreq_write_to_cache_work+0x11c/0x320 [netfs]
process_one_work+0x217/0x3e0
worker_thread+0x4a/0x3b0
kthread+0xd6/0x100
Fixes: 12bb21a29c19 ("fscache: Implement cookie user counting and resource pinning")
Reported-by: Daire Byrne <daire.byrne@gmail.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Daire Byrne <daire@dneg.com>
Link: https://lore.kernel.org/r/20221117115023.1350181-1-dwysocha@redhat.com/ # v1
Link: https://lore.kernel.org/r/20221117142915.1366990-1-dwysocha@redhat.com/ # v2
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
filename from function call
Refactor the mm_khugepaged_scan_file tracepoint to move filename
dereference to the tracepoint definition, to maintain consistency with
other tracepoints[1].
[1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/
Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com
Fixes: d41fd2016ed07 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()")
Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- Add tracing events for the most common watchdog events
* tag 'linux-watchdog-6.1-rc2' of git://www.linux-watchdog.org/linux-watchdog:
watchdog: Add tracing events for the most usual watchdog events
|
|
To simplify debugging which process touches a watchdog and when, add
tracing events for .start(), .set_timeout(), .ping() and .stop().
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Link: https://lore.kernel.org/r/20221008174602.3972859-1-u.kleine-koenig@pengutronix.de
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Wim Van Sebroeck <wim@linux-watchdog.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs
Pull f2fs updates from Jaegeuk Kim:
"This round looks fairly small comparing to the previous updates and
includes mostly minor bug fixes. Nevertheless, as we've still
interested in improving the stability, Chao added some debugging
methods to diagnoze subtle runtime inconsistency problem.
Enhancements:
- store all the corruption or failure reasons in superblock
- detect meta inode, summary info, and block address inconsistency
- increase the limit for reserve_root for low-end devices
- add the number of compressed IO in iostat
Bug fixes:
- DIO write fix for zoned devices
- do out-of-place writes for cold files
- fix some stat updates (FS_CP_DATA_IO, dirty page count)
- fix race condition on setting FI_NO_EXTENT flag
- fix data races when freezing super
- fix wrong continue condition check in GC
- do not allow ATGC for LFS mode
In addition, there're some code enhancement and clean-ups as usual"
* tag 'f2fs-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (32 commits)
f2fs: change to use atomic_t type form sbi.atomic_files
f2fs: account swapfile inodes
f2fs: allow direct read for zoned device
f2fs: support recording errors into superblock
f2fs: support recording stop_checkpoint reason into super_block
f2fs: remove the unnecessary check in f2fs_xattr_fiemap
f2fs: introduce cp_status sysfs entry
f2fs: fix to detect corrupted meta ino
f2fs: fix to account FS_CP_DATA_IO correctly
f2fs: code clean and fix a type error
f2fs: add "c_len" into trace_f2fs_update_extent_tree_range for compressed file
f2fs: fix to do sanity check on summary info
f2fs: port to vfs{g,u}id_t and associated helpers
f2fs: fix to do sanity check on destination blkaddr during recovery
f2fs: let FI_OPU_WRITE override FADVISE_COLD_BIT
f2fs: fix race condition on setting FI_NO_EXTENT flag
f2fs: remove redundant check in f2fs_sanity_check_cluster
f2fs: add static init_idisk_time function to reduce the code
f2fs: fix typo
f2fs: fix wrong dirty page count when race between mmap and fallocate.
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
- Yu Zhao's Multi-Gen LRU patches are here. They've been under test in
linux-next for a couple of months without, to my knowledge, any
negative reports (or any positive ones, come to that).
- Also the Maple Tree from Liam Howlett. An overlapping range-based
tree for vmas. It it apparently slightly more efficient in its own
right, but is mainly targeted at enabling work to reduce mmap_lock
contention.
Liam has identified a number of other tree users in the kernel which
could be beneficially onverted to mapletrees.
Yu Zhao has identified a hard-to-hit but "easy to fix" lockdep splat
at [1]. This has yet to be addressed due to Liam's unfortunately
timed vacation. He is now back and we'll get this fixed up.
- Dmitry Vyukov introduces KMSAN: the Kernel Memory Sanitizer. It uses
clang-generated instrumentation to detect used-unintialized bugs down
to the single bit level.
KMSAN keeps finding bugs. New ones, as well as the legacy ones.
- Yang Shi adds a userspace mechanism (madvise) to induce a collapse of
memory into THPs.
- Zach O'Keefe has expanded Yang Shi's madvise(MADV_COLLAPSE) to
support file/shmem-backed pages.
- userfaultfd updates from Axel Rasmussen
- zsmalloc cleanups from Alexey Romanov
- cleanups from Miaohe Lin: vmscan, hugetlb_cgroup, hugetlb and
memory-failure
- Huang Ying adds enhancements to NUMA balancing memory tiering mode's
page promotion, with a new way of detecting hot pages.
- memcg updates from Shakeel Butt: charging optimizations and reduced
memory consumption.
- memcg cleanups from Kairui Song.
- memcg fixes and cleanups from Johannes Weiner.
- Vishal Moola provides more folio conversions
- Zhang Yi removed ll_rw_block() :(
- migration enhancements from Peter Xu
- migration error-path bugfixes from Huang Ying
- Aneesh Kumar added ability for a device driver to alter the memory
tiering promotion paths. For optimizations by PMEM drivers, DRM
drivers, etc.
- vma merging improvements from Jakub Matěn.
- NUMA hinting cleanups from David Hildenbrand.
- xu xin added aditional userspace visibility into KSM merging
activity.
- THP & KSM code consolidation from Qi Zheng.
- more folio work from Matthew Wilcox.
- KASAN updates from Andrey Konovalov.
- DAMON cleanups from Kaixu Xia.
- DAMON work from SeongJae Park: fixes, cleanups.
- hugetlb sysfs cleanups from Muchun Song.
- Mike Kravetz fixes locking issues in hugetlbfs and in hugetlb core.
Link: https://lkml.kernel.org/r/CAOUHufZabH85CeUN-MEMgL8gJGzJEWUrkiM58JkTbBhh-jew0Q@mail.gmail.com [1]
* tag 'mm-stable-2022-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (555 commits)
hugetlb: allocate vma lock for all sharable vmas
hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer
hugetlb: fix vma lock handling during split vma and range unmapping
mglru: mm/vmscan.c: fix imprecise comments
mm/mglru: don't sync disk for each aging cycle
mm: memcontrol: drop dead CONFIG_MEMCG_SWAP config symbol
mm: memcontrol: use do_memsw_account() in a few more places
mm: memcontrol: deprecate swapaccounting=0 mode
mm: memcontrol: don't allocate cgroup swap arrays when memcg is disabled
mm/secretmem: remove reduntant return value
mm/hugetlb: add available_huge_pages() func
mm: remove unused inline functions from include/linux/mm_inline.h
selftests/vm: add selftest for MADV_COLLAPSE of uffd-minor memory
selftests/vm: add file/shmem MADV_COLLAPSE selftest for cleared pmd
selftests/vm: add thp collapse shmem testing
selftests/vm: add thp collapse file and tmpfs testing
selftests/vm: modularize thp collapse memory operations
selftests/vm: dedup THP helpers
mm/khugepaged: add tracepoint to hpage_collapse_scan_file()
mm/madvise: add file and shmem support to MADV_COLLAPSE
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab
Pull slab fixes from Vlastimil Babka:
- The "common kmalloc v4" series [1] by Hyeonggon Yoo.
While the plan after LPC is to try again if it's possible to get rid
of SLOB and SLAB (and if any critical aspect of those is not possible
to achieve with SLUB today, modify it accordingly), it will take a
while even in case there are no objections.
Meanwhile this is a nice cleanup and some parts (e.g. to the
tracepoints) will be useful even if we end up with a single slab
implementation in the future:
- Improves the mm/slab_common.c wrappers to allow deleting
duplicated code between SLAB and SLUB.
- Large kmalloc() allocations in SLAB are passed to page allocator
like in SLUB, reducing number of kmalloc caches.
- Removes the {kmem_cache_alloc,kmalloc}_node variants of
tracepoints, node id parameter added to non-_node variants.
- Addition of kmalloc_size_roundup()
The first two patches from a series by Kees Cook [2] that introduce
kmalloc_size_roundup(). This will allow merging of per-subsystem
patches using the new function and ultimately stop (ab)using ksize()
in a way that causes ongoing trouble for debugging functionality and
static checkers.
- Wasted kmalloc() memory tracking in debugfs alloc_traces
A patch from Feng Tang that enhances the existing debugfs
alloc_traces file for kmalloc caches with information about how much
space is wasted by allocations that needs less space than the
particular kmalloc cache provides.
- My series [3] to fix validation races for caches with enabled
debugging:
- By decoupling the debug cache operation more from non-debug
fastpaths, extra locking simplifications were possible and thus
done afterwards.
- Additional cleanup of PREEMPT_RT specific code on top, by Thomas
Gleixner.
- A late fix for slab page leaks caused by the series, by Feng
Tang.
- Smaller fixes and cleanups:
- Unneeded variable removals, by ye xingchen
- A cleanup removing a BUG_ON() in create_unique_id(), by Chao Yu
Link: https://lore.kernel.org/all/20220817101826.236819-1-42.hyeyoo@gmail.com/ [1]
Link: https://lore.kernel.org/all/20220923202822.2667581-1-keescook@chromium.org/ [2]
Link: https://lore.kernel.org/all/20220823170400.26546-1-vbabka@suse.cz/ [3]
* tag 'slab-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: (30 commits)
mm/slub: fix a slab missed to be freed problem
slab: Introduce kmalloc_size_roundup()
slab: Remove __malloc attribute from realloc functions
mm/slub: clean up create_unique_id()
mm/slub: enable debugging memory wasting of kmalloc
slub: Make PREEMPT_RT support less convoluted
mm/slub: simplify __cmpxchg_double_slab() and slab_[un]lock()
mm/slub: convert object_map_lock to non-raw spinlock
mm/slub: remove slab_lock() usage for debug operations
mm/slub: restrict sysfs validation to debug caches and make it safe
mm/sl[au]b: check if large object is valid in __ksize()
mm/slab_common: move declaration of __ksize() to mm/slab.h
mm/slab_common: drop kmem_alloc & avoid dereferencing fields when not using
mm/slab_common: unify NUMA and UMA version of tracepoints
mm/sl[au]b: cleanup kmem_cache_alloc[_node]_trace()
mm/sl[au]b: generalize kmalloc subsystem
mm/slub: move free_debug_processing() further
mm/sl[au]b: introduce common alloc/free functions without tracepoint
mm/slab: kmalloc: pass requests larger than order-1 page to page allocator
mm/slab_common: cleanup kmalloc_large()
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc and other driver updates from Greg KH:
"Here is the large set of char/misc and other small driver subsystem
changes for 6.1-rc1. Loads of different things in here:
- IIO driver updates, additions, and changes. Probably the largest
part of the diffstat
- habanalabs driver update with support for new hardware and
features, the second largest part of the diff.
- fpga subsystem driver updates and additions
- mhi subsystem updates
- Coresight driver updates
- gnss subsystem updates
- extcon driver updates
- icc subsystem updates
- fsi subsystem updates
- nvmem subsystem and driver updates
- misc driver updates
- speakup driver additions for new features
- lots of tiny driver updates and cleanups
All of these have been in the linux-next tree for a while with no
reported issues"
* tag 'char-misc-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (411 commits)
w1: Split memcpy() of struct cn_msg flexible array
spmi: pmic-arb: increase SPMI transaction timeout delay
spmi: pmic-arb: block access for invalid PMIC arbiter v5 SPMI writes
spmi: pmic-arb: correct duplicate APID to PPID mapping logic
spmi: pmic-arb: add support to dispatch interrupt based on IRQ status
spmi: pmic-arb: check apid against limits before calling irq handler
spmi: pmic-arb: do not ack and clear peripheral interrupts in cleanup_irq
spmi: pmic-arb: handle spurious interrupt
spmi: pmic-arb: add a print in cleanup_irq
drivers: spmi: Directly use ida_alloc()/free()
MAINTAINERS: add TI ECAP driver info
counter: ti-ecap-capture: capture driver support for ECAP
Documentation: ABI: sysfs-bus-counter: add frequency & num_overflows items
dt-bindings: counter: add ti,am62-ecap-capture.yaml
counter: Introduce the COUNTER_COMP_ARRAY component type
counter: Consolidate Counter extension sysfs attribute creation
counter: Introduce the Count capture component
counter: 104-quad-8: Add Signal polarity component
counter: Introduce the Signal polarity component
counter: interrupt-cnt: Implement watch_validate callback
...
|
|
Pull io_uring updates from Jens Axboe:
- Add supported for more directly managed task_work running.
This is beneficial for real world applications that end up issuing
lots of system calls as part of handling work. Normal task_work will
always execute as we transition in and out of the kernel, even for
"unrelated" system calls. It's more efficient to defer the handling
of io_uring's deferred work until the application wants it to be run,
generally in batches.
As part of ongoing work to write an io_uring network backend for
Thrift, this has been shown to greatly improve performance. (Dylan)
- Add IOPOLL support for passthrough (Kanchan)
- Improvements and fixes to the send zero-copy support (Pavel)
- Partial IO handling fixes (Pavel)
- CQE ordering fixes around CQ ring overflow (Pavel)
- Support sendto() for non-zc as well (Pavel)
- Support sendmsg for zerocopy (Pavel)
- Networking iov_iter fix (Stefan)
- Misc fixes and cleanups (Pavel, me)
* tag 'for-6.1/io_uring-2022-10-03' of git://git.kernel.dk/linux: (56 commits)
io_uring/net: fix notif cqe reordering
io_uring/net: don't update msg_name if not provided
io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL
io_uring/rw: defer fsnotify calls to task context
io_uring/net: fix fast_iov assignment in io_setup_async_msg()
io_uring/net: fix non-zc send with address
io_uring/net: don't skip notifs for failed requests
io_uring/rw: don't lose short results on io_setup_async_rw()
io_uring/rw: fix unexpected link breakage
io_uring/net: fix cleanup double free free_iov init
io_uring: fix CQE reordering
io_uring/net: fix UAF in io_sendrecv_fail()
selftest/net: adjust io_uring sendzc notif handling
io_uring: ensure local task_work marks task as running
io_uring/net: zerocopy sendmsg
io_uring/net: combine fail handlers
io_uring/net: rename io_sendzc()
io_uring/net: support non-zerocopy sendto
io_uring/net: refactor io_setup_async_addr
io_uring/net: don't lose partial send_zc on fail
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There's a bunch of performance improvements, most notably the FIEMAP
speedup, the new block group tree to speed up mount on large
filesystems, more io_uring integration, some sysfs exports and the
usual fixes and core updates.
Summary:
Performance:
- outstanding FIEMAP speed improvement
- algorithmic change how extents are enumerated leads to orders of
magnitude speed boost (uncached and cached)
- extent sharing check speedup (2.2x uncached, 3x cached)
- add more cancellation points, allowing to interrupt seeking in
files with large number of extents
- more efficient hole and data seeking (4x uncached, 1.3x cached)
- sample results:
256M, 32K extents: 4s -> 29ms (~150x)
512M, 64K extents: 30s -> 59ms (~550x)
1G, 128K extents: 225s -> 120ms (~1800x)
- improved inode logging, especially for directories (on dbench
workload throughput +25%, max latency -21%)
- improved buffered IO, remove redundant extent state tracking,
lowering memory consumption and avoiding rb tree traversal
- add sysfs tunable to let qgroup temporarily skip exact accounting
when deleting snapshot, leading to a speedup but requiring a rescan
after that, will be used by snapper
- support io_uring and buffered writes, until now it was just for
direct IO, with the no-wait semantics implemented in the buffered
write path it now works and leads to speed improvement in IOPS
(2x), throughput (2.2x), latency (depends, 2x to 150x)
- small performance improvements when dropping and searching for
extent maps as well as when flushing delalloc in COW mode
(throughput +5MB/s)
User visible changes:
- new incompatible feature block-group-tree adding a dedicated tree
for tracking block groups, this allows a much faster load during
mount and avoids seeking unlike when it's scattered in the extent
tree items
- this reduces mount time for many-terabyte sized filesystems
- conversion tool will be provided so existing filesystem can also
be updated in place
- to reduce test matrix and feature combinations requires no-holes
and free-space-tree (mkfs defaults since 5.15)
- improved reporting of super block corruption detected by scrub
- scrub also tries to repair super block and does not wait until next
commit
- discard stats and tunables are exported in sysfs
(/sys/fs/btrfs/FSID/discard)
- qgroup status is exported in sysfs
(/sys/sys/fs/btrfs/FSID/qgroups/)
- verify that super block was not modified when thawing filesystem
Fixes:
- FIEMAP fixes
- fix extent sharing status, does not depend on the cached status
where merged
- flush delalloc so compressed extents are reported correctly
- fix alignment of VMA for memory mapped files on THP
- send: fix failures when processing inodes with no links (orphan
files and directories)
- fix race between quota enable and quota rescan ioctl
- handle more corner cases for read-only compat feature verification
- fix missed extent on fsync after dropping extent maps
Core:
- lockdep annotations to validate various transactions states and
state transitions
- preliminary support for fs-verity in send
- more effective memory use in scrub for subpage where sector is
smaller than page
- block group caching progress logic has been removed, load is now
synchronous
- simplify end IO callbacks and bio handling, use chained bios
instead of own tracking
- add no-wait semantics to several functions (tree search, nocow,
flushing, buffered write
- cleanups and refactoring
MM changes:
- export balance_dirty_pages_ratelimited_flags"
* tag 'for-6.1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (177 commits)
btrfs: set generation before calling btrfs_clean_tree_block in btrfs_init_new_buffer
btrfs: drop extent map range more efficiently
btrfs: avoid pointless extent map tree search when flushing delalloc
btrfs: remove unnecessary next extent map search
btrfs: remove unnecessary NULL pointer checks when searching extent maps
btrfs: assert tree is locked when clearing extent map from logging
btrfs: remove unnecessary extent map initializations
btrfs: remove the refcount warning/check at free_extent_map()
btrfs: add helper to replace extent map range with a new extent map
btrfs: move open coded extent map tree deletion out of inode eviction
btrfs: use cond_resched_rwlock_write() during inode eviction
btrfs: use extent_map_end() at btrfs_drop_extent_map_range()
btrfs: move btrfs_drop_extent_cache() to extent_map.c
btrfs: fix missed extent on fsync after dropping extent maps
btrfs: remove stale prototype of btrfs_write_inode
btrfs: enable nowait async buffered writes
btrfs: assert nowait mode is not used for some btree search functions
btrfs: make btrfs_buffered_write nowait compatible
btrfs: plumb NOWAIT through the write path
btrfs: make lock_and_cleanup_extent_if_need nowait compatible
...
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound updates from Takashi Iwai:
"The majority of changes are ASoC drivers (SOF, Intel, AMD, Mediatek,
Qualcomm, TI, Apple Silicon, etc), while we see a few small fixes in
ALSA / ASoC core side, too.
Here are highlights:
Core:
- A new string helper parse_int_array_user() and cleanups with it
- Continued cleanup of memory allocation helpers
- PCM core optimization and hardening
- Continued ASoC core code cleanups
ASoC:
- Improvements to the SOF IPC4 code, especially around trace
- Support for AMD Rembrant DSPs, AMD Pink Sardine ACP 6.2, Apple
Silicon systems, Everest ES8326, Intel Sky Lake and Kaby Lake,
Mediatek MT8186 support, NXP i.MX8ULP DSPs, Qualcomm SC8280XP,
SM8250 and SM8450 and Texas Instruments SRC4392
HD- and USB-audio:
- Cleanups for unification of hda-ext bus
- HD-audio HDMI codec driver cleanups
- Continued endpoint management fixes for USB-audio
- New quirks as usual"
* tag 'sound-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (422 commits)
ALSA: hda: Fix position reporting on Poulsbo
ALSA: hda/hdmi: Don't skip notification handling during PM operation
ASoC: rockchip: i2s: use regmap_read_poll_timeout_atomic to poll I2S_CLR
ASoC: dt-bindings: Document audio OF graph dai-tdm-slot-num dai-tdm-slot-width props
ASoC: qcom: fix unmet direct dependencies for SND_SOC_QDSP6
ALSA: usb-audio: Fix potential memory leaks
ALSA: usb-audio: Fix NULL dererence at error path
ASoC: mediatek: mt8192-mt6359: Set the driver name for the card
ALSA: hda/realtek: More robust component matching for CS35L41
ASoC: Intel: sof_rt5682: remove SOF_RT1015_SPEAKER_AMP_100FS flag
ASoC: nau8825: Add TDM support
ASoC: core: clarify the driver name initialization
ASoC: mt6660: Fix PM disable depth imbalance in mt6660_i2c_probe
ASoC: wm5102: Fix PM disable depth imbalance in wm5102_probe
ASoC: wm5110: Fix PM disable depth imbalance in wm5110_probe
ASoC: wm8997: Fix PM disable depth imbalance in wm8997_probe
ASoC: wcd-mbhc-v2: Revert "ASoC: wcd-mbhc-v2: use pm_runtime_resume_and_get()"
ASoC: mediatek: mt8186: Fix spelling mistake "slect" -> "select"
ALSA: hda/realtek: Add quirk for HP Zbook Firefly 14 G9 model
ALSA: asihpi - Remove unused struct hpi_subsys_response
...
|
|
The trace_f2fs_update_extent_tree_range could not record compressed
block length in the cluster of compress file and we just add it.
Signed-off-by: Zhang Qilong <zhangqilong3@huawei.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
- Fix a couple races found with a new torture test
- Improve errors when api functions are used incorrectly
- Improve tracing for lock requests from user space
- Fix use after free in recently added tracing cod.
- Small internal code cleanups
* tag 'dlm-6.1' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
fs: dlm: fix possible use after free if tracing
fs: dlm: const void resource name parameter
fs: dlm: LSFL_CB_DELAY only for kernel lockspaces
fs: dlm: remove DLM_LSFL_FS from uapi
fs: dlm: trace user space callbacks
fs: dlm: change ls_clear_proc_locks to spinlock
fs: dlm: remove dlm_del_ast prototype
fs: dlm: handle rcom in else if branch
fs: dlm: allow lockspaces have zero lvblen
fs: dlm: fix invalid derefence of sb_lvbptr
fs: dlm: handle -EINVAL as log_error()
fs: dlm: use __func__ for function name
fs: dlm: handle -EBUSY first in unlock validation
fs: dlm: handle -EBUSY first in lock arg validation
fs: dlm: fix race between test_bit() and queue_work()
fs: dlm: fix race in lowcomms
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs updates from Gao Xiang:
"In this cycle, for container use cases, fscache-based shared domain is
introduced [1] so that data blobs in the same domain will be storage
deduplicated and it will also be used for page cache sharing later.
Also, a special packed inode is now introduced to record inode
fragments which keep the tail part of files by Yue Hu [2]. You can
keep arbitary length or (at will) the whole file as a fragment and
then fragments can be optionally compressed in the packed inode
together and even deduplicated for smaller image sizes.
In addition to that, global compressed data deduplication by sharing
partial-referenced pclusters is also supported in this cycle.
Summary:
- Introduce fscache-based domain to share blobs between images
- Support recording fragments in a special packed inode
- Support partial-referenced pclusters for global compressed data
deduplication
- Fix an order >= MAX_ORDER warning due to crafted negative i_size
- Several cleanups"
Link: https://lore.kernel.org/r/20220916085940.89392-1-zhujia.zj@bytedance.com [1]
Link: https://lore.kernel.org/r/cover.1663065968.git.huyue2@coolpad.com [2]
* tag 'erofs-for-6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: clean up erofs_iget()
erofs: clean up unnecessary code and comments
erofs: fold in z_erofs_reload_indexes()
erofs: introduce partial-referenced pclusters
erofs: support on-disk compressed fragments data
erofs: support interlaced uncompressed data for compressed files
erofs: clean up .read_folio() and .readahead() in fscache mode
erofs: introduce 'domain_id' mount option
erofs: Support sharing cookies in the same domain
erofs: introduce a pseudo mnt to manage shared cookies
erofs: introduce fscache-based domain
erofs: code clean up for fscache
erofs: use kill_anon_super() to kill super in fscache mode
erofs: fix order >= MAX_ORDER warning due to crafted negative i_size
|
|
Add huge_memory:trace_mm_khugepaged_scan_file tracepoint to
hpage_collapse_scan_file() analogously to hpage_collapse_scan_pmd().
While this change is targeted at debugging MADV_COLLAPSE pathway, the
"mm_khugepaged" prefix is retained for symmetry with
huge_memory:trace_mm_khugepaged_scan_pmd, which retains it's legacy name
to prevent changing kernel ABI as much as possible.
Link: https://lkml.kernel.org/r/20220907144521.3115321-5-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-5-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Add support for MADV_COLLAPSE to collapse shmem-backed and file-backed
memory into THPs (requires CONFIG_READ_ONLY_THP_FOR_FS=y).
On success, the backing memory will be a hugepage. For the memory range
and process provided, the page tables will synchronously have a huge pmd
installed, mapping the THP. Other mappings of the file extent mapped by
the memory range may be added to a set of entries that khugepaged will
later process and attempt update their page tables to map the THP by a
pmd.
This functionality unlocks two important uses:
(1) Immediately back executable text by THPs. Current support provided
by CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large
system which might impair services from serving at their full rated
load after (re)starting. Tricks like mremap(2)'ing text onto
anonymous memory to immediately realize iTLB performance prevents
page sharing and demand paging, both of which increase steady state
memory footprint. Now, we can have the best of both worlds: Peak
upfront performance and lower RAM footprints.
(2) userfaultfd-based live migration of virtual machines satisfy UFFD
faults by fetching native-sized pages over the network (to avoid
latency of transferring an entire hugepage). However, after guest
memory has been fully copied to the new host, MADV_COLLAPSE can
be used to immediately increase guest performance.
Since khugepaged is single threaded, this change now introduces
possibility of collapse contexts racing in file collapse path. There a
important few places to consider:
(1) hpage_collapse_scan_file(), when we xas_pause() and drop RCU.
We could have the memory collapsed out from under us, but
the next xas_for_each() iteration will correctly pick up the
hugepage. The hugepage might not be up to date (insofar as
copying of small page contents might not have completed - the
page still may be locked), but regardless what small page index
we were iterating over, we'll find the hugepage and identify it
as a suitably aligned compound page of order HPAGE_PMD_ORDER.
In khugepaged path, we locklessly check the value of the pmd,
and only add it to deferred collapse array if we find pmd
mapping pte table. This is fine, since other values that could
have raced in right afterwards denote failure, or that the
memory was successfully collapsed, so we don't need further
processing.
In madvise path, we'll take mmap_lock() in write to serialize
against page table updates and will know what to do based on the
true value of the pmd: recheck all ptes if we point to a pte table,
directly install the pmd, if the pmd has been cleared, but
memory not yet faulted, or nothing at all if we find a huge pmd.
It's worth putting emphasis here on how we treat the none pmd
here. If khugepaged has processed this mm's page tables
already, it will have left the pmd cleared (ready for refault by
the process). Depending on the VMA flags and sysfs settings,
amount of RAM on the machine, and the current load, could be a
relatively common occurrence - and as such is one we'd like to
handle successfully in MADV_COLLAPSE. When we see the none pmd
in collapse_pte_mapped_thp(), we've locked mmap_lock in write
and checked (a) huepaged_vma_check() to see if the backing
memory is appropriate still, along with VMA sizing and
appropriate hugepage alignment within the file, and (b) we've
found a hugepage head of order HPAGE_PMD_ORDER at the offset
in the file mapped by our hugepage-aligned virtual address.
Even though the common-case is likely race with khugepaged,
given these checks (regardless how we got here - we could be
operating on a completely different file than originally checked
in hpage_collapse_scan_file() for all we know) it should be safe
to directly make the pmd a huge pmd pointing to this hugepage.
(2) collapse_file() is mostly serialized on the same file extent by
lock sequence:
| lock hupepage
| lock mapping->i_pages
| lock 1st page
| unlock mapping->i_pages
| <page checks>
| lock mapping->i_pages
| page_ref_freeze(3)
| xas_store(hugepage)
| unlock mapping->i_pages
| page_ref_unfreeze(1)
| unlock 1st page
V unlock hugepage
Once a context (who already has their fresh hugepage locked)
locks mapping->i_pages exclusively, it will hold said lock
until it locks the first page, and it will hold that lock until
the after the hugepage has been added to the page cache (and
will unlock the hugepage after page table update, though that
isn't important here).
A racing context that loses the race for mapping->i_pages will
then lose the race to locking the first page. Here - depending
on how far the other racing context has gotten - we might find
the new hugepage (in which case we'll exit cleanly when we
check PageTransCompound()), or we'll find the "old" 1st small
page (in which we'll exit cleanly when we discover unexpected
refcount of 2 after isolate_lru_page()). This is assuming we
are able to successfully lock the page we find - in shmem path,
we could just fail the trylock and exit cleanly anyways.
Failure path in collapse_file() is similar: once we hold lock
on 1st small page, we are serialized against other collapse
contexts. Before the 1st small page is unlocked, we add it
back to the pagecache and unfreeze the refcount appropriately.
Contexts who lost the race to the 1st small page will then find
the same 1st small page with the correct refcount and will be
able to proceed.
[zokeefe@google.com: don't check pmd value twice in collapse_pte_mapped_thp()]
Link: https://lkml.kernel.org/r/20220927033854.477018-1-zokeefe@google.com
[shy828301@gmail.com: Delete hugepage_vma_revalidate_anon(), remove
check for multi-add in khugepaged_add_pte_mapped_thp()]
Link: https://lore.kernel.org/linux-mm/CAHbLzkrtpM=ic7cYAHcqkubah5VTR8N5=k5RT8MTvv5rN1Y91w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220907144521.3115321-4-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-4-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
The main benefit of THPs are that they can be mapped at the pmd level,
increasing the likelihood of TLB hit and spending less cycles in page
table walks. pte-mapped hugepages - that is - hugepage-aligned compound
pages of order HPAGE_PMD_ORDER mapped by ptes - although being contiguous
in physical memory, don't have this advantage. In fact, one could argue
they are detrimental to system performance overall since they occupy a
precious hugepage-aligned/sized region of physical memory that could
otherwise be used more effectively. Additionally, pte-mapped hugepages
can be the cheapest memory to collapse for khugepaged since no new
hugepage allocation or copying of memory contents is necessary - we only
need to update the mapping page tables.
In the anonymous collapse path, we are able to collapse pte-mapped
hugepages (albeit, perhaps suboptimally), but the file/shmem path makes no
effort when compound pages (of any order) are encountered.
Identify pte-mapped hugepages in the file/shmem collapse path. The
final step of which makes a racy check of the value of the pmd to
ensure it maps a pte table. This should be fine, since races that
result in false-positive (i.e. attempt collapse even though we
shouldn't) will fail later in collapse_pte_mapped_thp() once we
actually lock mmap_lock and reinspect the pmd value. Races that result
in false-negatives (i.e. where we decide to not attempt collapse, but
should have) shouldn't be an issue, since in the worst case, we do
nothing - which is what we've done up to this point. We make a similar
check in retract_page_tables(). If we do think we've found a
pte-mapped hugepgae in khugepaged context, attempt to update page
tables mapping this hugepage.
Note that these collapses still count towards the
/sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed counter,
and if the pte-mapped hugepage was also mapped into multiple process'
address spaces, could be incremented for each page table update. Since we
increment the counter when a pte-mapped hugepage is successfully added to
the list of to-collapse pte-mapped THPs, it's possible that we never
actually update the page table either. This is different from how
file/shmem pages_collapsed accounting works today where only a successful
page cache update is counted (it's also possible here that no page tables
are actually changed). Though it incurs some slop, this is preferred to
either not accounting for the event at all, or plumbing through data in
struct mm_slot on whether to account for the collapse or not.
Also note that work still needs to be done to support arbitrary compound
pages, and that this should all be converted to using folios.
[shy828301@gmail.com: Spelling mistake, update comment, and add Documentation]
Link: https://lore.kernel.org/linux-mm/CAHbLzkpHwZxFzjfX9nxVoRhzup8WMjMfyL6Xiq8mZ9M-N3ombw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20220907144521.3115321-3-zokeefe@google.com
Link: https://lkml.kernel.org/r/20220922224046.1143204-3-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
isdir indicated REQ_META|REQ_PRIO which no longer works now.
Get rid of isdir entirely.
Link: https://lore.kernel.org/r/20220927063607.54832-2-hsiangkao@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
|
|
Start tracking the VMAs with the new maple tree structure in parallel with
the rb_tree. Add debug and trace events for maple tree operations and
duplicate the rb_tree that is created on forks into the maple tree.
The maple tree is added to the mm_struct including the mm_init struct,
added support in required mm/mmap functions, added tracking in kernel/fork
for process forking, and used to find the unmapped_area and checked
against what the rbtree finds.
This also moves the mmap_lock() in exit_mmap() since the oom reaper call
does walk the VMAs. Otherwise lockdep will be unhappy if oom happens.
When splitting a vma fails due to allocations of the maple tree nodes,
the error path in __split_vma() calls new->vm_ops->close(new). The page
accounting for hugetlb is actually in the close() operation, so it
accounts for the removal of 1/2 of the VMA which was not adjusted. This
results in a negative exit value. To avoid the negative charge, set
vm_start = vm_end and vm_pgoff = 0.
There is also a potential accounting issue in special mappings from
insert_vm_struct() failing to allocate, so reverse the charge there in
the failure scenario.
Link: https://lkml.kernel.org/r/20220906194824.2110408-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Patch series "Introducing the Maple Tree"
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
Davidlor said
: Yes I like the maple tree, and at this stage I don't think we can ask for
: more from this series wrt the MM - albeit there seems to still be some
: folks reporting breakage. Fundamentally I see Liam's work to (re)move
: complexity out of the MM (not to say that the actual maple tree is not
: complex) by consolidating the three complimentary data structures very
: much worth it considering performance does not take a hit. This was very
: much a turn off with the range locking approach, which worst case scenario
: incurred in prohibitive overhead. Also as Liam and Matthew have
: mentioned, RCU opens up a lot of nice performance opportunities, and in
: addition academia[1] has shown outstanding scalability of address spaces
: with the foundation of replacing the locked rbtree with RCU aware trees.
A similar work has been discovered in the academic press
https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
Sheer coincidence. We designed our tree with the intention of solving the
hardest problem first. Upon settling on a b-tree variant and a rough
outline, we researched ranged based b-trees and RCU b-trees and did find
that article. So it was nice to find reassurances that we were on the
right path, but our design choice of using ranges made that paper unusable
for us.
This patch (of 70):
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently. There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface. If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.
The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes. With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses. The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.
The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct. The
long term goal is to reduce or remove the mmap_lock contention.
The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.
There is additional BUG_ON() calls added within the tree, most of which
are in debug code. These will be replaced with a WARN_ON() call in the
future. There is also additional BUG_ON() calls within the code which
will also be reduced in number at a later date. These exist to catch
things such as out-of-range accesses which would crash anyways.
Link: https://lkml.kernel.org/r/20220906194824.2110408-1-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20220906194824.2110408-2-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: David Howells <dhowells@redhat.com>
Tested-by: Sven Schnelle <svens@linux.ibm.com>
Tested-by: Yu Zhao <yuzhao@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
There is a separate I/O failure tree to track the fail reads, so remove
the extra EXTENT_DAMAGED bit in the I/O tree as it's set but never used.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
We still have this oddity of stashing the io_failure_record in the
extent state for the io_failure_tree, which is leftover from when we
used to stuff private pointers in extent_io_trees.
However this doesn't make a lot of sense for the io failure records, we
can simply use a normal rb_tree for this. This will allow us to further
simplify the extent_io_tree code by removing the io_failure_rec pointer
from the extent state.
Convert the io_failure_tree to an rb tree + spinlock in the inode, and
then use our rb tree simple helpers to insert and find failed records.
This greatly cleans up this code and makes it easier to separate out the
extent_io_tree code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Another set of fixes for fixes for the soc tree:
- A fix for the interrupt number on at91/lan966 ethernet PHYs
- A second round of fixes for NXP i.MX series, including a couple of
build issues, and board specific DT corrections on TQMa8MPQL,
imx8mp-venice-gw74xx and imx8mm-verdin for reliability and
partially broken functionality
- Several fixes for Rockchip SoCs, addressing a USB issue on
BPI-R2-Pro, wakeup on Gru-Bob and reliability of high-speed SD
cards, among other minor issues
- A fix for a long-running naming mistake that prevented the moxart
mmc driver from working at all
- Multiple Arm SCMI firmware fixes for hardening some corner cases"
* tag 'soc-fixes-6.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (30 commits)
arm64: dts: imx8mp-venice-gw74xx: fix port/phy validation
ARM: dts: lan966x: Fix the interrupt number for internal PHYs
arm64: dts: imx8mp-venice-gw74xx: fix ksz9477 cpu port
arm64: dts: imx8mp-venice-gw74xx: fix CAN STBY polarity
dt-bindings: memory-controllers: fsl,imx8m-ddrc: drop Leonard Crestez
arm64: dts: tqma8mqml: Include phy-imx8-pcie.h header
arm64: defconfig: enable ARCH_NXP
arm64: dts: imx8mp-tqma8mpql-mba8mpxl: add missing pinctrl for RTC alarm
ARM: dts: fix Moxa SDIO 'compatible', remove 'sdhci' misnomer
arm64: dts: imx8mm-verdin: extend pmic voltages
arm64: dts: rockchip: Remove 'enable-active-low' from rk3566-quartz64-a
arm64: dts: rockchip: Remove 'enable-active-low' from rk3399-puma
arm64: dts: rockchip: fix property for usb2 phy supply on rk3568-evb1-v10
arm64: dts: rockchip: fix property for usb2 phy supply on rock-3a
arm64: dts: imx8ulp: add #reset-cells for pcc
arm64: dts: tqma8mpxl-ba8mpxl: Fix button GPIOs
arm64: dts: imx8mn: remove GPU power domain reset
arm64: dts: rockchip: Set RK3399-Gru PCLK_EDP to 24 MHz
arm64: dts: imx8mm: Reverse CPLD_Dn GPIO label mapping on MX8Menlo
arm64: dts: rockchip: fix upper usb port on BPI-R2-Pro
...
|
|
Add tracing for io_run_local_task_work
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220830125013.570060-8-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
This patch replaces dev_vdbg with tracepoints in new ipc4-loader code.
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Noah Klayman <noah.klayman@intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20220919122108.43764-8-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
This patch removes unneeded dev_vdbg calls and replaces remaining ones
with tracepoints to reduce overhead and enable use of trace collection
and analysis tools.
Signed-off-by: Noah Klayman <noah.klayman@intel.com>
Reviewed-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Link: https://lore.kernel.org/r/20220919122108.43764-7-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
This patch replaces all dev_vdbg calls with tracepoints to reduce
overhead and enable use of trace collection and analysis tools.
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Noah Klayman <noah.klayman@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20220919122108.43764-6-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
The Intel HDaudio controller relies on a single interrupt line which
wire-ORs multiple interrupt sources, such as stream, IPC, SoundWire and
wakes. This patch adds the ability to trace each event occurrence.
Reviewed-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Noah Klayman <noah.klayman@intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20220919122108.43764-3-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
Enables tracking of use_count during widget setup and free routines.
Useful for debugging unbalanced use_counts during suspend/resume.
Reviewed-by: Péter Ujfalusi <peter.ujfalusi@linux.intel.com>
Signed-off-by: Noah Klayman <noah.klayman@intel.com>
Signed-off-by: Bard Liao <yung-chuan.liao@linux.intel.com>
Signed-off-by: Pierre-Louis Bossart <pierre-louis.bossart@linux.intel.com>
Link: https://lore.kernel.org/r/20220919122108.43764-2-pierre-louis.bossart@linux.intel.com
Signed-off-by: Mark Brown <broonie@kernel.org>
|
|
This patch add tracepoints in the code for DMA allocation.
The main purpose is to be able to cross data with the map operations and
determine whether memory violation occurred, for example free DMA
allocation before unmapping it from device memory.
To achieve this the DMA alloc/free code flows were refactored so that a
single DMA tracepoint will catch many flows.
To get better understanding of what happened in the DMA allocations
the real allocating function is added to the trace as well.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
This patch adds trace events for habanalabs driver to gain all the
benefits such an infrastructure can supply.
The following events were added:
- MMU map/unmap: to be able to track driver's memory allocations
- DMA alloc/free: to track our DMA allocation
the above trace points in conjunction will help us map the device memory
usage as well as to be able to track memory violations.
Signed-off-by: Ohad Sharabi <osharabi@habana.ai>
Acked-by: Oded Gabbay <ogabbay@kernel.org>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
|
|
When scanning an anon pmd to see if it's eligible for collapse, return
SCAN_PMD_MAPPED if the pmd already maps a hugepage. Note that
SCAN_PMD_MAPPED is different from SCAN_PAGE_COMPOUND used in the
file-collapse path, since the latter might identify pte-mapped compound
pages. This is required by MADV_COLLAPSE which necessarily needs to know
what hugepage-aligned/sized regions are already pmd-mapped.
In order to determine if a pmd already maps a hugepage, refactor
mm_find_pmd():
Return mm_find_pmd() to it's pre-commit f72e7dcdd252 ("mm: let mm_find_pmd
fix buggy race with THP fault") behavior. ksm was the only caller that
explicitly wanted a pte-mapping pmd, so open code the pte-mapping logic
there (pmd_present() and pmd_trans_huge() checks).
Undo revert change in commit f72e7dcdd252 ("mm: let mm_find_pmd fix buggy
race with THP fault") that open-coded split_huge_pmd_address() pmd lookup
and use mm_find_pmd() instead.
Link: https://lkml.kernel.org/r/20220706235936.2197195-9-zokeefe@google.com
Signed-off-by: Zach O'Keefe <zokeefe@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
As Eric reported, the 'reason' field is not presented when trace the
kfree_skb event by perf:
$ perf record -e skb:kfree_skb -a sleep 10
$ perf script
ip_defrag 14605 [021] 221.614303: skb:kfree_skb:
skbaddr=0xffff9d2851242700 protocol=34525 location=0xffffffffa39346b1
reason:
The cause seems to be passing kernel address directly to TP_printk(),
which is not right. As the enum 'skb_drop_reason' is not exported to
user space through TRACE_DEFINE_ENUM(), perf can't get the drop reason
string from the 'reason' field, which is a number.
Therefore, we introduce the macro DEFINE_DROP_REASON(), which is used
to define the trace enum by TRACE_DEFINE_ENUM(). With the help of
DEFINE_DROP_REASON(), now we can remove the auto-generate that we
introduced in the commit ec43908dd556
("net: skb: use auto-generation to convert skb drop reason to string"),
and define the string array 'drop_reasons'.
Hmmmm...now we come back to the situation that have to maintain drop
reasons in both enum skb_drop_reason and DEFINE_DROP_REASON. But they
are both in dropreason.h, which makes it easier.
After this commit, now the format of kfree_skb is like this:
$ cat /tracing/events/skb/kfree_skb/format
name: kfree_skb
ID: 1524
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:void * skbaddr; offset:8; size:8; signed:0;
field:void * location; offset:16; size:8; signed:0;
field:unsigned short protocol; offset:24; size:2; signed:0;
field:enum skb_drop_reason reason; offset:28; size:4; signed:0;
print fmt: "skbaddr=%p protocol=%u location=%p reason: %s", REC->skbaddr, REC->protocol, REC->location, __print_symbolic(REC->reason, { 1, "NOT_SPECIFIED" }, { 2, "NO_SOCKET" } ......
Fixes: ec43908dd556 ("net: skb: use auto-generation to convert skb drop reason to string")
Link: https://lore.kernel.org/netdev/CANn89i+bx0ybvE55iMYf5GJM48WwV1HNpdm9Q6t-HaEstqpCSA@mail.gmail.com/
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Drop kmem_alloc event class, and define kmalloc and kmem_cache_alloc
using TRACE_EVENT() macro.
And then this patch does:
- Do not pass pointer to struct kmem_cache to trace_kmalloc.
gfp flag is enough to know if it's accounted or not.
- Avoid dereferencing s->object_size and s->size when not using kmem_cache_alloc event.
- Avoid dereferencing s->name in when not using kmem_cache_free event.
- Adjust s->size to SLOB_UNITS(s->size) * SLOB_UNIT in SLOB
Cc: Vasily Averin <vasily.averin@linux.dev>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|
|
Drop kmem_alloc event class, rename kmem_alloc_node to kmem_alloc, and
remove _node postfix for NUMA version of tracepoints.
This will break some tools that depend on {kmem_cache_alloc,kmalloc}_node,
but at this point maintaining both kmem_alloc and kmem_alloc_node
event classes does not makes sense at all.
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
|