aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2023-01-04mm, compaction: fix fast_isolate_around() to stay within boundariesNARIBAYASHI Akira
commit be21b32afe470c5ae98e27e49201158a47032942 upstream. Depending on the memory configuration, isolate_freepages_block() may scan pages out of the target range and causes panic. Panic can occur on systems with multiple zones in a single pageblock. The reason it is rare is that it only happens in special configurations. Depending on how many similar systems there are, it may be a good idea to fix this problem for older kernels as well. The problem is that pfn as argument of fast_isolate_around() could be out of the target range. Therefore we should consider the case where pfn < start_pfn, and also the case where end_pfn < pfn. This problem should have been addressd by the commit 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone") but there was an oversight. Case1: pfn < start_pfn <at memory compaction for node Y> | node X's zone | node Y's zone +-----------------+------------------------------... pageblock ^ ^ ^ +-----------+-----------+-----------+-----------+... ^ ^ ^ ^ ^ end_pfn ^ start_pfn = cc->zone->zone_start_pfn pfn <---------> scanned range by "Scan After" Case2: end_pfn < pfn <at memory compaction for node X> | node X's zone | node Y's zone +-----------------+------------------------------... pageblock ^ ^ ^ +-----------+-----------+-----------+-----------+... ^ ^ ^ ^ ^ pfn ^ end_pfn start_pfn <---------> scanned range by "Scan Before" It seems that there is no good reason to skip nr_isolated pages just after given pfn. So let perform simple scan from start to end instead of dividing the scan into "Before" and "After". Link: https://lkml.kernel.org/r/20221026112438.236336-1-a.naribayashi@fujitsu.com Fixes: 6e2b7044c199 ("mm, compaction: make fast_isolate_freepages() stay within zone"). Signed-off-by: NARIBAYASHI Akira <a.naribayashi@fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-04kmsan: include linux/vmalloc.hArnd Bergmann
commit aaa746ad8b30f38ef89a301faf339ef1c19cf33a upstream. This is needed for the vmap/vunmap declarations: mm/kmsan/kmsan_test.c:316:9: error: implicit declaration of function 'vmap' is invalid in C99 [-Werror,-Wimplicit-function-declaration] vbuf = vmap(pages, npages, VM_MAP, PAGE_KERNEL); ^ mm/kmsan/kmsan_test.c:316:29: error: use of undeclared identifier 'VM_MAP' vbuf = vmap(pages, npages, VM_MAP, PAGE_KERNEL); ^ mm/kmsan/kmsan_test.c:322:3: error: implicit declaration of function 'vunmap' is invalid in C99 [-Werror,-Wimplicit-function-declaration] vunmap(vbuf); ^ Link: https://lkml.kernel.org/r/20221215163046.4079767-1-arnd@kernel.org Fixes: 8ed691b02ade ("kmsan: add tests for KMSAN") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-04kmsan: export kmsan_handle_urbArnd Bergmann
commit 7ba594d700998bafa96a75360d2e060aa39156d2 upstream. USB support can be in a loadable module, and this causes a link failure with KMSAN: ERROR: modpost: "kmsan_handle_urb" [drivers/usb/core/usbcore.ko] undefined! Export the symbol so it can be used by this module. Link: https://lkml.kernel.org/r/20221215162710.3802378-1-arnd@kernel.org Fixes: 553a80188a5d ("kmsan: handle memory sent to/from USB") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-04mm/mempolicy: fix memory leak in set_mempolicy_home_node system callMathieu Desnoyers
commit 38ce7c9bdfc228c14d7621ba36d3eebedd9d4f76 upstream. When encountering any vma in the range with policy other than MPOL_BIND or MPOL_PREFERRED_MANY, an error is returned without issuing a mpol_put on the policy just allocated with mpol_dup(). This allows arbitrary users to leak kernel memory. Link: https://lkml.kernel.org/r/20221215194621.202816-1-mathieu.desnoyers@efficios.com Fixes: c6018b4b2549 ("mm/mempolicy: add set_mempolicy_home_node syscall") Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andi Kleen <ak@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2023-01-04mm, mremap: fix mremap() expanding vma with addr inside vmaVlastimil Babka
commit 6f12be792fde994ed934168f93c2a0d2a0cf0bc5 upstream. Since 6.1 we have noticed random rpm install failures that were tracked to mremap() returning -ENOMEM and to commit ca3d76b0aa80 ("mm: add merging after mremap resize"). The problem occurs when mremap() expands a VMA in place, but using an starting address that's not vma->vm_start, but somewhere in the middle. The extension_pgoff calculation introduced by the commit is wrong in that case, so vma_merge() fails due to pgoffs not being compatible. Fix the calculation. By the way it seems that the situations, where rpm now expands a vma from the middle, were made possible also due to that commit, thanks to the improved vma merging. Yet it should work just fine, except for the buggy calculation. Link: https://lkml.kernel.org/r/20221216163227.24648-1-vbabka@suse.cz Reported-by: Jiri Slaby <jirislaby@kernel.org> Link: https://bugzilla.suse.com/show_bug.cgi?id=1206359 Fixes: ca3d76b0aa80 ("mm: add merging after mremap resize") Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jakub Matěna <matenajakub@gmail.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-31mm/gup: disallow FOLL_FORCE|FOLL_WRITE on hugetlb mappingsDavid Hildenbrand
commit f347454d034184b4f0a2caf6e14daf7848cea01c upstream. hugetlb does not support fake write-faults (write faults without write permissions). However, we are currently able to trigger a FAULT_FLAG_WRITE fault on a VMA without VM_WRITE. If we'd ever want to support FOLL_FORCE|FOLL_WRITE, we'd have to teach hugetlb to: (1) Leave the page mapped R/O after the fake write-fault, like maybe_mkwrite() does. (2) Allow writing to an exclusive anon page that's mapped R/O when FOLL_FORCE is set, like can_follow_write_pte(). E.g., __follow_hugetlb_must_fault() needs adjustment. For now, it's not clear if that added complexity is really required. History tolds us that FOLL_FORCE is dangerous and that we better limit its use to a bare minimum. -------------------------------------------------------------------------- #include <stdio.h> #include <stdlib.h> #include <fcntl.h> #include <unistd.h> #include <errno.h> #include <stdint.h> #include <sys/mman.h> #include <linux/mman.h> int main(int argc, char **argv) { char *map; int mem_fd; map = mmap(NULL, 2 * 1024 * 1024u, PROT_READ, MAP_PRIVATE|MAP_ANON|MAP_HUGETLB|MAP_HUGE_2MB, -1, 0); if (map == MAP_FAILED) { fprintf(stderr, "mmap() failed: %d\n", errno); return 1; } mem_fd = open("/proc/self/mem", O_RDWR); if (mem_fd < 0) { fprintf(stderr, "open(/proc/self/mem) failed: %d\n", errno); return 1; } if (pwrite(mem_fd, "0", 1, (uintptr_t) map) == 1) { fprintf(stderr, "write() succeeded, which is unexpected\n"); return 1; } printf("write() failed as expected: %d\n", errno); return 0; } -------------------------------------------------------------------------- Fortunately, we have a sanity check in hugetlb_wp() in place ever since commit 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared mappings"), that bails out instead of silently mapping a page writable in a !PROT_WRITE VMA. Consequently, above reproducer triggers a warning, similar to the one reported by szsbot: ------------[ cut here ]------------ WARNING: CPU: 1 PID: 3612 at mm/hugetlb.c:5313 hugetlb_wp+0x20a/0x1af0 mm/hugetlb.c:5313 Modules linked in: CPU: 1 PID: 3612 Comm: syz-executor250 Not tainted 6.1.0-rc2-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:hugetlb_wp+0x20a/0x1af0 mm/hugetlb.c:5313 Code: ea 03 80 3c 02 00 0f 85 31 14 00 00 49 8b 5f 20 31 ff 48 89 dd 83 e5 02 48 89 ee e8 70 ab b7 ff 48 85 ed 75 5b e8 76 ae b7 ff <0f> 0b 41 bd 40 00 00 00 e8 69 ae b7 ff 48 b8 00 00 00 00 00 fc ff RSP: 0018:ffffc90003caf620 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000008640070 RCX: 0000000000000000 RDX: ffff88807b963a80 RSI: ffffffff81c4ed2a RDI: 0000000000000007 RBP: 0000000000000000 R08: 0000000000000007 R09: 0000000000000000 R10: 0000000000000000 R11: 000000000008c07e R12: ffff888023805800 R13: 0000000000000000 R14: ffffffff91217f38 R15: ffff88801d4b0360 FS: 0000555555bba300(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fff7a47a1b8 CR3: 000000002378d000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> hugetlb_no_page mm/hugetlb.c:5755 [inline] hugetlb_fault+0x19cc/0x2060 mm/hugetlb.c:5874 follow_hugetlb_page+0x3f3/0x1850 mm/hugetlb.c:6301 __get_user_pages+0x2cb/0xf10 mm/gup.c:1202 __get_user_pages_locked mm/gup.c:1434 [inline] __get_user_pages_remote+0x18f/0x830 mm/gup.c:2187 get_user_pages_remote+0x84/0xc0 mm/gup.c:2260 __access_remote_vm+0x287/0x6b0 mm/memory.c:5517 ptrace_access_vm+0x181/0x1d0 kernel/ptrace.c:61 generic_ptrace_pokedata kernel/ptrace.c:1323 [inline] ptrace_request+0xb46/0x10c0 kernel/ptrace.c:1046 arch_ptrace+0x36/0x510 arch/x86/kernel/ptrace.c:828 __do_sys_ptrace kernel/ptrace.c:1296 [inline] __se_sys_ptrace kernel/ptrace.c:1269 [inline] __x64_sys_ptrace+0x178/0x2a0 kernel/ptrace.c:1269 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd [...] So let's silence that warning by teaching GUP code that FOLL_FORCE -- so far -- does not apply to hugetlb. Note that FOLL_FORCE for read-access seems to be working as expected. The assumption is that this has been broken forever, only ever since above commit, we actually detect the wrong handling and WARN_ON_ONCE(). I assume this has been broken at least since 2014, when mm/gup.c came to life. I failed to come up with a suitable Fixes tag quickly. Link: https://lkml.kernel.org/r/20221031152524.173644-1-david@redhat.com Fixes: 1d8d14641fd9 ("mm/hugetlb: support write-faults in shared mappings") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: <syzbot+f0b97304ef90f0d0b1dc@syzkaller.appspotmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2022-12-10Merge tag 'mm-hotfixes-stable-2022-12-10-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "Nine hotfixes. Six for MM, three for other areas. Four of these patches address post-6.0 issues" * tag 'mm-hotfixes-stable-2022-12-10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: memcg: fix possible use-after-free in memcg_write_event_control() MAINTAINERS: update Muchun Song's email mm/gup: fix gup_pud_range() for dax mmap: fix do_brk_flags() modifying obviously incorrect VMAs mm/swap: fix SWP_PFN_BITS with CONFIG_PHYS_ADDR_T_64BIT on 32bit tmpfs: fix data loss from failed fallocate kselftests: cgroup: update kmem test precision tolerance mm: do not BUG_ON missing brk mapping, because userspace can unmap it mailmap: update Matti Vaittinen's email address
2022-12-09memcg: fix possible use-after-free in memcg_write_event_control()Tejun Heo
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Link: https://lkml.kernel.org/r/Y5FRm/cfcKPGzWwl@slm.duckdns.org Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Jann Horn <jannh@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> [3.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09mm/gup: fix gup_pud_range() for daxJohn Starks
For dax pud, pud_huge() returns true on x86. So the function works as long as hugetlb is configured. However, dax doesn't depend on hugetlb. Commit 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") fixed devmap-backed huge PMDs, but missed devmap-backed huge PUDs. Fix this as well. This fixes the below kernel panic: general protection fault, probably for non-canonical address 0x69e7c000cc478: 0000 [#1] SMP < snip > Call Trace: <TASK> get_user_pages_fast+0x1f/0x40 iov_iter_get_pages+0xc6/0x3b0 ? mempool_alloc+0x5d/0x170 bio_iov_iter_get_pages+0x82/0x4e0 ? bvec_alloc+0x91/0xc0 ? bio_alloc_bioset+0x19a/0x2a0 blkdev_direct_IO+0x282/0x480 ? __io_complete_rw_common+0xc0/0xc0 ? filemap_range_has_page+0x82/0xc0 generic_file_direct_write+0x9d/0x1a0 ? inode_update_time+0x24/0x30 __generic_file_write_iter+0xbd/0x1e0 blkdev_write_iter+0xb4/0x150 ? io_import_iovec+0x8d/0x340 io_write+0xf9/0x300 io_issue_sqe+0x3c3/0x1d30 ? sysvec_reschedule_ipi+0x6c/0x80 __io_queue_sqe+0x33/0x240 ? fget+0x76/0xa0 io_submit_sqes+0xe6a/0x18d0 ? __fget_light+0xd1/0x100 __x64_sys_io_uring_enter+0x199/0x880 ? __context_tracking_enter+0x1f/0x70 ? irqentry_exit_to_user_mode+0x24/0x30 ? irqentry_exit+0x1d/0x30 ? __context_tracking_exit+0xe/0x70 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x61/0xcb RIP: 0033:0x7fc97c11a7be < snip > </TASK> ---[ end trace 48b2e0e67debcaeb ]--- RIP: 0010:internal_get_user_pages_fast+0x340/0x990 < snip > Kernel panic - not syncing: Fatal exception Kernel Offset: disabled Link: https://lkml.kernel.org/r/1670392853-28252-1-git-send-email-ssengar@linux.microsoft.com Fixes: 414fd080d125 ("mm/gup: fix gup_pmd_range() for dax") Signed-off-by: John Starks <jostarks@microsoft.com> Signed-off-by: Saurabh Sengar <ssengar@linux.microsoft.com> Cc: Jan Kara <jack@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09mmap: fix do_brk_flags() modifying obviously incorrect VMAsLiam Howlett
Add more sanity checks to the VMA that do_brk_flags() will expand. Ensure the VMA matches basic merge requirements within the function before calling can_vma_merge_after(). Drop the duplicate checks from vm_brk_flags() since they will be enforced later. The old code would expand file VMAs on brk(), which is functionally wrong and also dangerous in terms of locking because the brk() path isn't designed for file VMAs and therefore doesn't lock the file mapping. Checking can_vma_merge_after() ensures that new anonymous VMAs can't be merged into file VMAs. See https://lore.kernel.org/linux-mm/CAG48ez1tJZTOjS_FjRZhvtDA-STFmdw8PEizPDwMGFd_ui0Nrw@mail.gmail.com/ Link: https://lkml.kernel.org/r/20221205192304.1957418-1-Liam.Howlett@oracle.com Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Suggested-by: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: SeongJae Park <sj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09tmpfs: fix data loss from failed fallocateHugh Dickins
Fix tmpfs data loss when the fallocate system call is interrupted by a signal, or fails for some other reason. The partial folio handling in shmem_undo_range() forgot to consider this unfalloc case, and was liable to erase or truncate out data which had already been committed earlier. It turns out that none of the partial folio handling there is appropriate for the unfalloc case, which just wants to proceed to removal of whole folios: which find_get_entries() provides, even when partially covered. Original patch by Rui Wang. Link: https://lore.kernel.org/linux-mm/33b85d82.7764.1842e9ab207.Coremail.chenguoqic@163.com/ Link: https://lkml.kernel.org/r/a5dac112-cf4b-7af-a33-f386e347fd38@google.com Fixes: b9a8a4195c7d ("truncate,shmem: Handle truncates that split large folios") Signed-off-by: Hugh Dickins <hughd@google.com> Reported-by: Guoqi Chen <chenguoqic@163.com> Link: https://lore.kernel.org/all/20221101032248.819360-1-kernel@hev.cc/ Cc: Rui Wang <kernel@hev.cc> Cc: Huacai Chen <chenhuacai@loongson.cn> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: <stable@vger.kernel.org> [5.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-09mm: do not BUG_ON missing brk mapping, because userspace can unmap itJason A. Donenfeld
The following program will trigger the BUG_ON that this patch removes, because the user can munmap() mm->brk: #include <sys/syscall.h> #include <sys/mman.h> #include <assert.h> #include <unistd.h> static void *brk_now(void) { return (void *)syscall(SYS_brk, 0); } static void brk_set(void *b) { assert(syscall(SYS_brk, b) != -1); } int main(int argc, char *argv[]) { void *b = brk_now(); brk_set(b + 4096); assert(munmap(b - 4096, 4096 * 2) == 0); brk_set(b); return 0; } Compile that with musl, since glibc actually uses brk(), and then execute it, and it'll hit this splat: kernel BUG at mm/mmap.c:229! invalid opcode: 0000 [#1] PREEMPT SMP CPU: 12 PID: 1379 Comm: a.out Tainted: G S U 6.1.0-rc7+ #419 RIP: 0010:__do_sys_brk+0x2fc/0x340 Code: 00 00 4c 89 ef e8 04 d3 fe ff eb 9a be 01 00 00 00 4c 89 ff e8 35 e0 fe ff e9 6e ff ff ff 4d 89 a7 20> RSP: 0018:ffff888140bc7eb0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 00000000007e7000 RCX: ffff8881020fe000 RDX: ffff8881020fe001 RSI: ffff8881955c9b00 RDI: ffff8881955c9b08 RBP: 0000000000000000 R08: ffff8881955c9b00 R09: 00007ffc77844000 R10: 0000000000000000 R11: 0000000000000001 R12: 00000000007e8000 R13: 00000000007e8000 R14: 00000000007e7000 R15: ffff8881020fe000 FS: 0000000000604298(0000) GS:ffff88901f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000603fe0 CR3: 000000015ba9a005 CR4: 0000000000770ee0 PKRU: 55555554 Call Trace: <TASK> do_syscall_64+0x2b/0x50 entry_SYSCALL_64_after_hwframe+0x46/0xb0 RIP: 0033:0x400678 Code: 10 4c 8d 41 08 4c 89 44 24 10 4c 8b 01 8b 4c 24 08 83 f9 2f 77 0a 4c 8d 4c 24 20 4c 01 c9 eb 05 48 8b> RSP: 002b:00007ffc77863890 EFLAGS: 00000212 ORIG_RAX: 000000000000000c RAX: ffffffffffffffda RBX: 000000000040031b RCX: 0000000000400678 RDX: 00000000004006a1 RSI: 00000000007e6000 RDI: 00000000007e7000 RBP: 00007ffc77863900 R08: 0000000000000000 R09: 00000000007e6000 R10: 00007ffc77863930 R11: 0000000000000212 R12: 00007ffc77863978 R13: 00007ffc77863988 R14: 0000000000000000 R15: 0000000000000000 </TASK> Instead, just return the old brk value if the original mapping has been removed. [akpm@linux-foundation.org: fix changelog, per Liam] Link: https://lkml.kernel.org/r/20221202162724.2009-1-Jason@zx2c4.com Fixes: 2e7ce7d354f2 ("mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Jann Horn <jannh@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-12-08memcg: Fix possible use-after-free in memcg_write_event_control()Tejun Heo
memcg_write_event_control() accesses the dentry->d_name of the specified control fd to route the write call. As a cgroup interface file can't be renamed, it's safe to access d_name as long as the specified file is a regular cgroup file. Also, as these cgroup interface files can't be removed before the directory, it's safe to access the parent too. Prior to 347c4a874710 ("memcg: remove cgroup_event->cft"), there was a call to __file_cft() which verified that the specified file is a regular cgroupfs file before further accesses. The cftype pointer returned from __file_cft() was no longer necessary and the commit inadvertently dropped the file type check with it allowing any file to slip through. With the invarients broken, the d_name and parent accesses can now race against renames and removals of arbitrary files and cause use-after-free's. Fix the bug by resurrecting the file type check in __file_cft(). Now that cgroupfs is implemented through kernfs, checking the file operations needs to go through a layer of indirection. Instead, let's check the superblock and dentry type. Signed-off-by: Tejun Heo <tj@kernel.org> Fixes: 347c4a874710 ("memcg: remove cgroup_event->cft") Cc: stable@kernel.org # v3.14+ Reported-by: Jann Horn <jannh@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-04Revert "mm: align larger anonymous mappings on THP boundaries"Linus Torvalds
This reverts commit f35b5d7d676e59e401690b678cd3cfec5e785c23. It has been reported to cause huge performance regressions on some loads (will-it-scale.per_process_ops, but also building the kernel with clang). The commit did speed up gcc builds by a small amount, so it's not an unambiguous regression, but until the big regressions are understood, let's revert it. Reported-by: kernel test robot <yujie.liu@intel.com> Link: https://lore.kernel.org/r/202210181535.7144dd15-yujie.liu@intel.com Reported-by: Nathan Chancellor <nathan@kernel.org> Link: https://lore.kernel.org/lkml/Y1DNQaoPWxE%2BrGce@dev-arch.thelio-3990X/ Cc: Huang, Ying <ying.huang@intel.com> Cc: Rik van Riel <riel@surriel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-12-02Merge tag 'mm-hotfixes-stable-2022-12-02' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "15 hotfixes, 11 marked cc:stable. Only three or four of the latter address post-6.0 issues, which is hopefully a sign that things are converging" * tag 'mm-hotfixes-stable-2022-12-02' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: revert "kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible" Kconfig.debug: provide a little extra FRAME_WARN leeway when KASAN is enabled drm/amdgpu: temporarily disable broken Clang builds due to blown stack-frame mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths mm/khugepaged: fix GUP-fast interaction by sending IPI mm/khugepaged: take the right locks for page table retraction mm: migrate: fix THP's mapcount on isolation mm: introduce arch_has_hw_nonleaf_pmd_young() mm: add dummy pmd_young() for architectures not having it mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in damon_sysfs_set_schemes() tools/vm/slabinfo-gnuplot: use "grep -E" instead of "egrep" nilfs2: fix NULL pointer dereference in nilfs_palloc_commit_free_entry() hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing madvise: use zap_page_range_single for madvise dontneed mm: replace VM_WARN_ON to pr_warn if the node is offline with __GFP_THISNODE
2022-11-30mm/khugepaged: invoke MMU notifiers in shmem/file collapse pathsJann Horn
Any codepath that zaps page table entries must invoke MMU notifiers to ensure that secondary MMUs (like KVM) don't keep accessing pages which aren't mapped anymore. Secondary MMUs don't hold their own references to pages that are mirrored over, so failing to notify them can lead to page use-after-free. I'm marking this as addressing an issue introduced in commit f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of the security impact of this only came in commit 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP"), which actually omitted flushes for the removal of present PTEs, not just for the removal of empty page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages") Signed-off-by: Jann Horn <jannh@google.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/khugepaged: fix GUP-fast interaction by sending IPIJann Horn
Since commit 70cbc3cc78a99 ("mm: gup: fix the fast GUP race against THP collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to ensure that the page table was not removed by khugepaged in between. However, lockless_pages_from_mm() still requires that the page table is not concurrently freed. Fix it by sending IPIs (if the architecture uses semi-RCU-style page table freeing) before freeing/reusing page tables. Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com Fixes: ba76149f47d8 ("thp: khugepaged") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/khugepaged: take the right locks for page table retractionJann Horn
pagetable walks on address ranges mapped by VMAs can be done under the mmap lock, the lock of an anon_vma attached to the VMA, or the lock of the VMA's address_space. Only one of these needs to be held, and it does not need to be held in exclusive mode. Under those circumstances, the rules for concurrent access to page table entries are: - Terminal page table entries (entries that don't point to another page table) can be arbitrarily changed under the page table lock, with the exception that they always need to be consistent for hardware page table walks and lockless_pages_from_mm(). This includes that they can be changed into non-terminal entries. - Non-terminal page table entries (which point to another page table) can not be modified; readers are allowed to READ_ONCE() an entry, verify that it is non-terminal, and then assume that its value will stay as-is. Retracting a page table involves modifying a non-terminal entry, so page-table-level locks are insufficient to protect against concurrent page table traversal; it requires taking all the higher-level locks under which it is possible to start a page walk in the relevant range in exclusive mode. The collapse_huge_page() path for anonymous THP already follows this rule, but the shmem/file THP path was getting it wrong, making it possible for concurrent rmap-based operations to cause corruption. Link: https://lkml.kernel.org/r/20221129154730.2274278-1-jannh@google.com Link: https://lkml.kernel.org/r/20221128180252.1684965-1-jannh@google.com Link: https://lkml.kernel.org/r/20221125213714.4115729-1-jannh@google.com Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP") Signed-off-by: Jann Horn <jannh@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: migrate: fix THP's mapcount on isolationGavin Shan
The issue is reported when removing memory through virtio_mem device. The transparent huge page, experienced copy-on-write fault, is wrongly regarded as pinned. The transparent huge page is escaped from being isolated in isolate_migratepages_block(). The transparent huge page can't be migrated and the corresponding memory block can't be put into offline state. Fix it by replacing page_mapcount() with total_mapcount(). With this, the transparent huge page can be isolated and migrated, and the memory block can be put into offline state. Besides, The page's refcount is increased a bit earlier to avoid the page is released when the check is executed. Link: https://lkml.kernel.org/r/20221124095523.31061-1-gshan@redhat.com Fixes: 1da2f328fa64 ("mm,thp,compaction,cma: allow THP migration for CMA allocations") Signed-off-by: Gavin Shan <gshan@redhat.com> Reported-by: Zhenyu Zhang <zhenyzha@redhat.com> Tested-by: Zhenyu Zhang <zhenyzha@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: <stable@vger.kernel.org> [5.7+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm: introduce arch_has_hw_nonleaf_pmd_young()Juergen Gross
When running as a Xen PV guests commit eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") can cause a protection violation in pmdp_test_and_clear_young(): BUG: unable to handle page fault for address: ffff8880083374d0 #PF: supervisor write access in kernel mode #PF: error_code(0x0003) - permissions violation PGD 3026067 P4D 3026067 PUD 3027067 PMD 7fee5067 PTE 8010000008337065 Oops: 0003 [#1] PREEMPT SMP NOPTI CPU: 7 PID: 158 Comm: kswapd0 Not tainted 6.1.0-rc5-20221118-doflr+ #1 RIP: e030:pmdp_test_and_clear_young+0x25/0x40 This happens because the Xen hypervisor can't emulate direct writes to page table entries other than PTEs. This can easily be fixed by introducing arch_has_hw_nonleaf_pmd_young() similar to arch_has_hw_pte_young() and test that instead of CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG. Link: https://lkml.kernel.org/r/20221123064510.16225-1-jgross@suse.com Fixes: eed9a328aa1a ("mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG") Signed-off-by: Juergen Gross <jgross@suse.com> Reported-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: Yu Zhao <yuzhao@google.com> Tested-by: Sander Eikelenboom <linux@eikelenboom.it> Acked-by: David Hildenbrand <david@redhat.com> [core changes] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30mm/damon/sysfs: fix wrong empty schemes assumption under online tuning in ↵SeongJae Park
damon_sysfs_set_schemes() Commit da87878010e5 ("mm/damon/sysfs: support online inputs update") made 'damon_sysfs_set_schemes()' to be called for running DAMON context, which could have schemes. In the case, DAMON sysfs interface is supposed to update, remove, or add schemes to reflect the sysfs files. However, the code is assuming the DAMON context wouldn't have schemes at all, and therefore creates and adds new schemes. As a result, the code doesn't work as intended for online schemes tuning and could have more than expected memory footprint. The schemes are all in the DAMON context, so it doesn't leak the memory, though. Remove the wrong asssumption (the DAMON context wouldn't have schemes) in 'damon_sysfs_set_schemes()' to fix the bug. Link: https://lkml.kernel.org/r/20221122194831.3472-1-sj@kernel.org Fixes: da87878010e5 ("mm/damon/sysfs: support online inputs update") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [5.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processingMike Kravetz
madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page tables associated with the address range. For hugetlb vmas, zap_page_range will call __unmap_hugepage_range_final. However, __unmap_hugepage_range_final assumes the passed vma is about to be removed and deletes the vma_lock to prevent pmd sharing as the vma is on the way out. In the case of madvise(MADV_DONTNEED) the vma remains, but the missing vma_lock prevents pmd sharing and could potentially lead to issues with truncation/fault races. This issue was originally reported here [1] as a BUG triggered in page_try_dup_anon_rmap. Prior to the introduction of the hugetlb vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to prevent pmd sharing. Subsequent faults on this vma were confused as VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was not set in new pages added to the page table. This resulted in pages that appeared anonymous in a VM_SHARED vma and triggered the BUG. Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap call from unmap_vmas(). This is used to indicate the 'final' unmapping of a hugetlb vma. When called via MADV_DONTNEED, this flag is not set and the vm_lock is not deleted. [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/ Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Wei Chen <harperchen1110@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-30madvise: use zap_page_range_single for madvise dontneedMike Kravetz
This series addresses the issue first reported in [1], and fully described in patch 2. Patches 1 and 2 address the user visible issue and are tagged for stable backports. While exploring solutions to this issue, related problems with mmu notification calls were discovered. This is addressed in the patch "hugetlb: remove duplicate mmu notifications:". Since there are no user visible effects, this third is not tagged for stable backports. Previous discussions suggested further cleanup by removing the routine zap_page_range. This is possible because zap_page_range_single is now exported, and all callers of zap_page_range pass ranges entirely within a single vma. This work will be done in a later patch so as not to distract from this bug fix. [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/ This patch (of 2): Expose the routine zap_page_range_single to zap a range within a single vma. The madvise routine madvise_dontneed_single_vma can use this routine as it explicitly operates on a single vma. Also, update the mmu notification range in zap_page_range_single to take hugetlb pmd sharing into account. This is required as MADV_DONTNEED supports hugetlb vmas. Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com Fixes: 90e7e7f5ef3f ("mm: enable MADV_DONTNEED for hugetlb mappings") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Wei Chen <harperchen1110@gmail.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Peter Xu <peterx@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-25Merge tag 'mm-hotfixes-stable-2022-11-24' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull hotfixes from Andrew Morton: "24 MM and non-MM hotfixes. 8 marked cc:stable and 16 for post-6.0 issues. There have been a lot of hotfixes this cycle, and this is quite a large batch given how far we are into the -rc cycle. Presumably a reflection of the unusually large amount of MM material which went into 6.1-rc1" * tag 'mm-hotfixes-stable-2022-11-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (24 commits) test_kprobes: fix implicit declaration error of test_kprobes nilfs2: fix nilfs_sufile_mark_dirty() not set segment usage as dirty mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1 mm: fix unexpected changes to {failslab|fail_page_alloc}.attr swapfile: fix soft lockup in scan_swap_map_slots hugetlb: fix __prep_compound_gigantic_page page flag setting kfence: fix stack trace pruning proc/meminfo: fix spacing in SecPageTables mm: multi-gen LRU: retry folios written back while isolated mailmap: update email address for Satya Priya mm/migrate_device: return number of migrating pages in args->cpages kbuild: fix -Wimplicit-function-declaration in license_is_gpl_compatible MAINTAINERS: update Alex Hung's email address mailmap: update Alex Hung's email address mm: mmap: fix documentation for vma_mas_szero mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed mm/memory: return vm_fault_t result from migrate_to_ram() callback mm: correctly charge compressed memory to its memcg ipc/shm: call underlying open/close vm_ops gcov: clang: fix the buffer overflow issue ...
2022-11-22mm/cgroup/reclaim: fix dirty pages throttling on cgroup v1Aneesh Kumar K.V
balance_dirty_pages doesn't do the required dirty throttling on cgroupv1. See commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback on traditional hierarchies"). Instead, the kernel depends on writeback throttling in shrink_folio_list to achieve the same goal. With large memory systems, the flusher may not be able to writeback quickly enough such that we will start finding pages in the shrink_folio_list already in writeback. Hence for cgroupv1 let's do a reclaim throttle after waking up the flusher. The below test which used to fail on a 256GB system completes till the the file system is full with this change. root@lp2:/sys/fs/cgroup/memory# mkdir test root@lp2:/sys/fs/cgroup/memory# cd test/ root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M Killed Link: https://lkml.kernel.org/r/20221118070603.84081-1-aneesh.kumar@linux.ibm.com Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: zefan li <lizefan.x@bytedance.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm: fix unexpected changes to {failslab|fail_page_alloc}.attrQi Zheng
When we specify __GFP_NOWARN, we only expect that no warnings will be issued for current caller. But in the __should_failslab() and __should_fail_alloc_page(), the local GFP flags alter the global {failslab|fail_page_alloc}.attr, which is persistent and shared by all tasks. This is not what we expected, let's fix it. [akpm@linux-foundation.org: unexport should_fail_ex()] Link: https://lkml.kernel.org/r/20221118100011.2634-1-zhengqi.arch@bytedance.com Fixes: 3f913fc5f974 ("mm: fix missing handler for __GFP_NOWARN") Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reported-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22swapfile: fix soft lockup in scan_swap_map_slotsChen Wandun
A softlockup occurs in scan free swap slot under huge memory pressure. The test scenario is: 64 CPU cores, 64GB memory, and 28 zram devices, the disksize of each zram device is 50MB. LATENCY_LIMIT is used to prevent softlockups in scan_swap_map_slots(), but the real loop number would more than LATENCY_LIMIT because of "goto checks and goto scan" repeatly without decreasing latency limit. In order to fix it, decrease latency_ration in advance. There is also a suspicious place that will cause softlockups in get_swap_pages(). In this function, the "goto start_over" may result in continuous scanning of the swap partition. If there is no cond_sched in scan_swap_map_slots(), it would cause a softlockup (I am not sure about this). WARN: soft lockup - CPU#11 stuck for 11s! [kswapd0:466] CPU: 11 PID: 466 Comm: kswapd@ Kdump: loaded Tainted: G dump backtrace+0x0/0x1le4 show stack+0x20/@x2c dump_stack+0xd8/0x140 watchdog print_info+0x48/0x54 watchdog_process_before_softlockup+0x98/0xa0 watchdog_timer_fn+0xlac/0x2d0 hrtimer_rum_queues+0xb0/0x130 hrtimer_interrupt+0x13c/0x3c0 arch_timer_handler_virt+0x3c/0x50 handLe_percpu_devid_irq+0x90/0x1f4 handle domain irq+0x84/0x100 gic_handle_irq+0x88/0x2b0 e11 ira+0xhB/Bx140 scan_swap_map_slots+0x678/0x890 get_swap_pages+0x29c/0x440 get_swap_page+0x120/0x2e0 add_to_swap+UX2U/0XyC shrink_page_list+0x5d0/0x152c shrink_inactive_list+0xl6c/Bx500 shrink_lruvec+0x270/0x304 WARN: soft lockup - CPU#32 stuck for 11s! [stress-ng:309915] watchdog_timer_fn+0x1ac/0x2d0 __run_hrtimer+0x98/0x2a0 __hrtimer_run_queues+0xb0/0x130 hrtimer_interrupt+0x13c/0x3c0 arch_timer_handler_virt+0x3c/0x50 handle_percpu_devid_irq+0x90/0x1f4 __handle_domain_irq+0x84/0x100 gic_handle_irq+0x88/0x2b0 el1_irq+0xb8/0x140 get_swap_pages+0x1e8/0x440 get_swap_page+0x1c8/0x2e0 add_to_swap+0x20/0x9c shrink_page_list+0x5d0/0x152c reclaim_pages+0x160/0x310 madvise_cold_or_pageout_pte_range+0x7bc/0xe3c walk_pmd_range.isra.0+0xac/0x22c walk_pud_range+0xfc/0x1c0 walk_pgd_range+0x158/0x1b0 __walk_page_range+0x64/0x100 walk_page_range+0x104/0x150 Link: https://lkml.kernel.org/r/20221118133850.3360369-1-chenwandun@huawei.com Fixes: 048c27fd7281 ("[PATCH] swap: scan_swap_map latency breaks") Signed-off-by: Chen Wandun <chenwandun@huawei.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: <xialonglong1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22hugetlb: fix __prep_compound_gigantic_page page flag settingMike Kravetz
Commit 2b21624fc232 ("hugetlb: freeze allocated pages before creating hugetlb pages") changed the order page flags were cleared and set in the head page. It moved the __ClearPageReserved after __SetPageHead. However, there is a check to make sure __ClearPageReserved is never done on a head page. If CONFIG_DEBUG_VM_PGFLAGS is enabled, the following BUG will be hit when creating a hugetlb gigantic page: page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) ------------[ cut here ]------------ kernel BUG at include/linux/page-flags.h:500! Call Trace will differ depending on whether hugetlb page is created at boot time or run time. Make sure to __ClearPageReserved BEFORE __SetPageHead. Link: https://lkml.kernel.org/r/20221118195249.178319-1-mike.kravetz@oracle.com Fixes: 2b21624fc232 ("hugetlb: freeze allocated pages before creating hugetlb pages") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Tarun Sahu <tsahu@linux.ibm.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22kfence: fix stack trace pruningMarco Elver
Commit b14051352465 ("mm/sl[au]b: generalize kmalloc subsystem") refactored large parts of the kmalloc subsystem, resulting in the stack trace pruning logic done by KFENCE to no longer work. While b14051352465 attempted to fix the situation by including '__kmem_cache_free' in the list of functions KFENCE should skip through, this only works when the compiler actually optimized the tail call from kfree() to __kmem_cache_free() into a jump (and thus kfree() _not_ appearing in the full stack trace to begin with). In some configurations, the compiler no longer optimizes the tail call into a jump, and __kmem_cache_free() appears in the stack trace. This means that the pruned stack trace shown by KFENCE would include kfree() which is not intended - for example: | BUG: KFENCE: invalid free in kfree+0x7c/0x120 | | Invalid free of 0xffff8883ed8fefe0 (in kfence-#126): | kfree+0x7c/0x120 | test_double_free+0x116/0x1a9 | kunit_try_run_case+0x90/0xd0 | [...] Fix it by moving __kmem_cache_free() to the list of functions that may be tail called by an allocator entry function, making the pruning logic work in both the optimized and unoptimized tail call cases. Link: https://lkml.kernel.org/r/20221118152216.3914899-1-elver@google.com Fixes: b14051352465 ("mm/sl[au]b: generalize kmalloc subsystem") Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm: multi-gen LRU: retry folios written back while isolatedYu Zhao
The page reclaim isolates a batch of folios from the tail of one of the LRU lists and works on those folios one by one. For a suitable swap-backed folio, if the swap device is async, it queues that folio for writeback. After the page reclaim finishes an entire batch, it puts back the folios it queued for writeback to the head of the original LRU list. In the meantime, the page writeback flushes the queued folios also by batches. Its batching logic is independent from that of the page reclaim. For each of the folios it writes back, the page writeback calls folio_rotate_reclaimable() which tries to rotate a folio to the tail. folio_rotate_reclaimable() only works for a folio after the page reclaim has put it back. If an async swap device is fast enough, the page writeback can finish with that folio while the page reclaim is still working on the rest of the batch containing it. In this case, that folio will remain at the head and the page reclaim will not retry it before reaching there. This patch adds a retry to evict_folios(). After evict_folios() has finished an entire batch and before it puts back folios it cannot free immediately, it retries those that may have missed the rotation. Before this patch, ~60% of folios swapped to an Intel Optane missed folio_rotate_reclaimable(). After this patch, ~99% of missed folios were reclaimed upon retry. This problem affects relatively slow async swap devices like Samsung 980 Pro much less and does not affect sync swap devices like zram or zswap at all. Link: https://lkml.kernel.org/r/20221116013808.3995280-1-yuzhao@google.com Fixes: ac35a4902374 ("mm: multi-gen LRU: minimal implementation") Signed-off-by: Yu Zhao <yuzhao@google.com> Cc: "Yin, Fengwei" <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm/migrate_device: return number of migrating pages in args->cpagesAlistair Popple
migrate_vma->cpages originally contained a count of the number of pages migrating including non-present pages which can be populated directly on the target. Commit 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and migrate_device_coherent_page()") inadvertantly changed this to contain just the number of pages that were unmapped. Usage of migrate_vma->cpages isn't documented, but most drivers use it to see if all the requested addresses can be migrated so restore the original behaviour. Link: https://lkml.kernel.org/r/20221111005135.1344004-1-apopple@nvidia.com Fixes: 241f68859656 ("mm/migrate_device.c: refactor migrate_vma and migrate_deivce_coherent_page()") Signed-off-by: Alistair Popple <apopple@nvidia.com> Reported-by: Ralph Campbell <rcampbell@nvidia.com> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm: mmap: fix documentation for vma_mas_szeroIan Cowan
When the struct_mm input, mm, was changed to a struct ma_state, mas, the documentation for the function was never updated. This updates that documentation reference. Link: https://lkml.kernel.org/r/20221114003349.41235-1-ian@linux.cowan.aero Signed-off-by: Ian Cowan <ian@linux.cowan.aero> Acked-by: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm/damon/sysfs-schemes: skip stats update if the scheme directory is removedSeongJae Park
A DAMON sysfs interface user can start DAMON with a scheme, remove the sysfs directory for the scheme, and then ask update of the scheme's stats. Because the schemes stats update logic isn't aware of the situation, it results in an invalid memory access. Fix the bug by checking if the scheme sysfs directory exists. Link: https://lkml.kernel.org/r/20221114175552.1951-1-sj@kernel.org Fixes: 0ac32b8affb5 ("mm/damon/sysfs: support DAMOS stats") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [v5.18] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm/memory: return vm_fault_t result from migrate_to_ram() callbackAlistair Popple
The migrate_to_ram() callback should always succeed, but in rare cases can fail usually returning VM_FAULT_SIGBUS. Commit 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") incorrectly stopped passing the return code up the stack. Fix this by setting the ret variable, restoring the previous behaviour on migrate_to_ram() failure. Link: https://lkml.kernel.org/r/20221114115537.727371-1-apopple@nvidia.com Fixes: 16ce101db85d ("mm/memory.c: fix race when faulting a device private page") Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm: correctly charge compressed memory to its memcgLi Liguang
Kswapd will reclaim memory when memory pressure is high, the annonymous memory will be compressed and stored in the zpool if zswap is enabled. The memcg_kmem_bypass() in get_obj_cgroup_from_page() will bypass the kernel thread and cause the compressed memory not be charged to its memory cgroup. Remove the memcg_kmem_bypass() call and properly charge compressed memory to its corresponding memory cgroup. Link: https://lore.kernel.org/linux-mm/CALvZod4nnn8BHYqAM4xtcR0Ddo2-Wr8uKm9h_CHWUaXw7g_DCg@mail.gmail.com/ Link: https://lkml.kernel.org/r/20221114194828.100822-1-hannes@cmpxchg.org Fixes: f4840ccfca25 ("zswap: memcg accounting") Signed-off-by: Li Liguang <liliguang@baidu.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> [5.19+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm/khugepaged: refactor mm_khugepaged_scan_file tracepoint to remove ↵Gautam Menghani
filename from function call Refactor the mm_khugepaged_scan_file tracepoint to move filename dereference to the tracepoint definition, to maintain consistency with other tracepoints[1]. [1]:lore.kernel.org/lkml/20221024111621.3ba17e2c@gandalf.local.home/ Link: https://lkml.kernel.org/r/20221026044524.54793-1-gautammenghani201@gmail.com Fixes: d41fd2016ed07 ("mm/khugepaged: add tracepoint to hpage_collapse_scan_file()") Signed-off-by: Gautam Menghani <gautammenghani201@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: David Hildenbrand <david@redhat.com> Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm/page_exit: fix kernel doc warning in page_ext_put()Charan Teja Kalla
Fix the below compiler warnings reported with 'make W=1 mm/'. mm/page_ext.c:178: warning: Function parameter or member 'page_ext' not described in 'page_ext_put'. [quic_pkondeti@quicinc.com: better patch title] Link: https://lkml.kernel.org/r/1667884582-2465-1-git-send-email-quic_charante@quicinc.com Fixes: b1d5488a252dc9 ("mm: fix use-after free of page_ext after race with memory-offline") Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Reported-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Vlastimil Babka <vbabka@suse.cz> Cc: Pavan Kondeti <quic_pkondeti@quicinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm: khugepaged: allow page allocation fallback to eligible nodesYang Shi
Syzbot reported the below splat: WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 __alloc_pages_node include/linux/gfp.h:221 [inline] WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] WARNING: CPU: 1 PID: 3646 at include/linux/gfp.h:221 alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 Modules linked in: CPU: 1 PID: 3646 Comm: syz-executor210 Not tainted 6.1.0-rc1-syzkaller-00454-ga70385240892 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 RIP: 0010:__alloc_pages_node include/linux/gfp.h:221 [inline] RIP: 0010:hpage_collapse_alloc_page mm/khugepaged.c:807 [inline] RIP: 0010:alloc_charge_hpage+0x802/0xaa0 mm/khugepaged.c:963 Code: e5 01 4c 89 ee e8 6e f9 ae ff 4d 85 ed 0f 84 28 fc ff ff e8 70 fc ae ff 48 8d 6b ff 4c 8d 63 07 e9 16 fc ff ff e8 5e fc ae ff <0f> 0b e9 96 fa ff ff 41 bc 1a 00 00 00 e9 86 fd ff ff e8 47 fc ae RSP: 0018:ffffc90003fdf7d8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff888077f457c0 RSI: ffffffff81cd8f42 RDI: 0000000000000001 RBP: ffff888079388c0c R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: dffffc0000000000 R14: 0000000000000000 R15: 0000000000000000 FS: 00007f6b48ccf700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6b48a819f0 CR3: 00000000171e7000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> collapse_file+0x1ca/0x5780 mm/khugepaged.c:1715 hpage_collapse_scan_file+0xd6c/0x17a0 mm/khugepaged.c:2156 madvise_collapse+0x53a/0xb40 mm/khugepaged.c:2611 madvise_vma_behavior+0xd0a/0x1cc0 mm/madvise.c:1066 madvise_walk_vmas+0x1c7/0x2b0 mm/madvise.c:1240 do_madvise.part.0+0x24a/0x340 mm/madvise.c:1419 do_madvise mm/madvise.c:1432 [inline] __do_sys_madvise mm/madvise.c:1432 [inline] __se_sys_madvise mm/madvise.c:1430 [inline] __x64_sys_madvise+0x113/0x150 mm/madvise.c:1430 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f6b48a4eef9 Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 b1 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f6b48ccf318 EFLAGS: 00000246 ORIG_RAX: 000000000000001c RAX: ffffffffffffffda RBX: 00007f6b48af0048 RCX: 00007f6b48a4eef9 RDX: 0000000000000019 RSI: 0000000000600003 RDI: 0000000020000000 RBP: 00007f6b48af0040 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6b48aa53a4 R13: 00007f6b48bffcbf R14: 00007f6b48ccf400 R15: 0000000000022000 </TASK> The khugepaged code would pick up the node with the most hit as the preferred node, and also tries to do some balance if several nodes have the same hit record. Basically it does conceptually: * If the target_node <= last_target_node, then iterate from last_target_node + 1 to MAX_NUMNODES (1024 on default config) * If the max_value == node_load[nid], then target_node = nid But there is a corner case, paritucularly for MADV_COLLAPSE, that the non-existing node may be returned as preferred node. Assuming the system has 2 nodes, the target_node is 0 and the last_target_node is 1, if MADV_COLLAPSE path is hit, the max_value may be 0, then it may return 2 for target_node, but it is actually not existing (offline), so the warn is triggered. The node balance was introduced by commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node") to satisfy "numactl --interleave=all". But interleaving is a mere hint rather than something that has hard requirements. So use nodemask to record the nodes which have the same hit record, the hugepage allocation could fallback to those nodes. And remove __GFP_THISNODE since it does disallow fallback. And if the nodemask just has one node set, it means there is one single node has the most hit record, the nodemask approach actually behaves like __GFP_THISNODE. Link: https://lkml.kernel.org/r/20221108184357.55614-2-shy828301@gmail.com Fixes: 7d8faaf15545 ("mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse") Signed-off-by: Yang Shi <shy828301@gmail.com> Suggested-by: Zach O'Keefe <zokeefe@google.com> Suggested-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reported-by: <syzbot+0044b22d177870ee974f@syzkaller.appspotmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-22mm: vmscan: fix extreme overreclaim and swap floodsJohannes Weiner
During proactive reclaim, we sometimes observe severe overreclaim, with several thousand times more pages reclaimed than requested. This trace was obtained from shrink_lruvec() during such an instance: prio:0 anon_cost:1141521 file_cost:7767 nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190) nr=[7161123 345 578 1111] While he reclaimer requested 4M, vmscan reclaimed close to 16G, most of it by swapping. These requests take over a minute, during which the write() to memory.reclaim is unkillably stuck inside the kernel. Digging into the source, this is caused by the proportional reclaim bailout logic. This code tries to resolve a fundamental conflict: to reclaim roughly what was requested, while also aging all LRUs fairly and in accordance to their size, swappiness, refault rates etc. The way it attempts fairness is that once the reclaim goal has been reached, it stops scanning the LRUs with the smaller remaining scan targets, and adjusts the remainder of the bigger LRUs according to how much of the smaller LRUs was scanned. It then finishes scanning that remainder regardless of the reclaim goal. This works fine if priority levels are low and the LRU lists are comparable in size. However, in this instance, the cgroup that is targeted by proactive reclaim has almost no files left - they've already been squeezed out by proactive reclaim earlier - and the remaining anon pages are hot. Anon rotations cause the priority level to drop to 0, which results in reclaim targeting all of anon (a lot) and all of file (almost nothing). By the time reclaim decides to bail, it has scanned most or all of the file target, and therefor must also scan most or all of the enormous anon target. This target is thousands of times larger than the reclaim goal, thus causing the overreclaim. The bailout code hasn't changed in years, why is this failing now? The most likely explanations are two other recent changes in anon reclaim: 1. Before the series starting with commit 5df741963d52 ("mm: fix LRU balancing effect of new transparent huge pages"), the VM was overall relatively reluctant to swap at all, even if swap was configured. This means the LRU balancing code didn't come into play as often as it does now, and mostly in high pressure situations where pronounced swap activity wouldn't be as surprising. 2. For historic reasons, shrink_lruvec() loops on the scan targets of all LRU lists except the active anon one, meaning it would bail if the only remaining pages to scan were active anon - even if there were a lot of them. Before the series starting with commit ccc5dc67340c ("mm/vmscan: make active/inactive ratio as 1:1 for anon lru"), most anon pages would live on the active LRU; the inactive one would contain only a handful of preselected reclaim candidates. After the series, anon gets aged similarly to file, and the inactive list is the default for new anon pages as well, making it often the much bigger list. As a result, the VM is now more likely to actually finish large anon targets than before. Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the larger LRU lists is made before bailing out on a met reclaim goal. This fixes the extreme overreclaim problem. Fairness is more subtle and harder to evaluate. No obvious misbehavior was observed on the test workload, in any case. Conceptually, fairness should primarily be a cumulative effect from regular, lower priority scans. Once the VM is in trouble and needs to escalate scan targets to make forward progress, fairness needs to take a backseat. This is also acknowledged by the myriad exceptions in get_scan_count(). This patch makes fairness decrease gradually, as it keeps fairness work static over increasing priority levels with growing scan targets. This should make more sense - although we may have to re-visit the exact values. Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-17Merge tag 'net-6.1-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from bpf. Current release - regressions: - tls: fix memory leak in tls_enc_skb() and tls_sw_fallback_init() Previous releases - regressions: - bridge: fix memory leaks when changing VLAN protocol - dsa: make dsa_master_ioctl() see through port_hwtstamp_get() shims - dsa: don't leak tagger-owned storage on switch driver unbind - eth: mlxsw: avoid warnings when not offloaded FDB entry with IPv6 is removed - eth: stmmac: ensure tx function is not running in stmmac_xdp_release() - eth: hns3: fix return value check bug of rx copybreak Previous releases - always broken: - kcm: close race conditions on sk_receive_queue - bpf: fix alignment problem in bpf_prog_test_run_skb() - bpf: fix writing offset in case of fault in strncpy_from_kernel_nofault - eth: macvlan: use built-in RCU list checking - eth: marvell: add sleep time after enabling the loopback bit - eth: octeon_ep: fix potential memory leak in octep_device_setup() Misc: - tcp: configurable source port perturb table size - bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace)" * tag 'net-6.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (51 commits) net: use struct_group to copy ip/ipv6 header addresses net: usb: smsc95xx: fix external PHY reset net: usb: qmi_wwan: add Telit 0x103a composition netdevsim: Fix memory leak of nsim_dev->fa_cookie tcp: configurable source port perturb table size l2tp: Serialize access to sk_user_data with sk_callback_lock net: thunderbolt: Fix error handling in tbnet_init() net: microchip: sparx5: Fix potential null-ptr-deref in sparx_stats_init() and sparx5_start() net: lan966x: Fix potential null-ptr-deref in lan966x_stats_init() net: dsa: don't leak tagger-owned storage on switch driver unbind net/x25: Fix skb leak in x25_lapb_receive_frame() net: ag71xx: call phylink_disconnect_phy if ag71xx_hw_enable() fail in ag71xx_open() bridge: switchdev: Fix memory leaks when changing VLAN protocol net: hns3: fix setting incorrect phy link ksettings for firmware in resetting process net: hns3: fix return value check bug of rx copybreak net: hns3: fix incorrect hw rss hash type of rx packet net: phy: marvell: add sleep time after enabling the loopback bit net: ena: Fix error handling in ena_init() kcm: close race conditions on sk_receive_queue net: ionic: Fix error handling in ionic_init_module() ...
2022-11-11Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski
Andrii Nakryiko says: ==================== bpf 2022-11-11 We've added 11 non-merge commits during the last 8 day(s) which contain a total of 11 files changed, 83 insertions(+), 74 deletions(-). The main changes are: 1) Fix strncpy_from_kernel_nofault() to prevent out-of-bounds writes, from Alban Crequy. 2) Fix for bpf_prog_test_run_skb() to prevent wrong alignment, from Baisong Zhong. 3) Switch BPF_DISPATCHER to static_call() instead of ftrace infra, with a small build fix on top, from Peter Zijlstra and Nathan Chancellor. 4) Fix memory leak in BPF verifier in some error cases, from Wang Yufen. 5) 32-bit compilation error fixes for BPF selftests, from Pu Lehui and Yang Jihong. 6) Ensure even distribution of per-CPU free list elements, from Xu Kuohai. 7) Fix copy_map_value() to track special zeroed out areas properly, from Xu Kuohai. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Fix offset calculation error in __copy_map_value and zero_map_value bpf: Initialize same number of free nodes for each pcpu_freelist selftests: bpf: Add a test when bpf_probe_read_kernel_str() returns EFAULT maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault() selftests/bpf: Fix test_progs compilation failure in 32-bit arch selftests/bpf: Fix casting error when cross-compiling test_verifier for 32-bit platforms bpf: Fix memory leaks in __check_func_call bpf: Add explicit cast to 'void *' for __BPF_DISPATCHER_UPDATE() bpf: Convert BPF_DISPATCHER to use static_call() (not ftrace) bpf: Revert ("Fix dispatcher patchable function entry to 5 bytes nop") bpf, test_run: Fix alignment problem in bpf_prog_test_run_skb() ==================== Link: https://lore.kernel.org/r/20221111231624.938829-1-andrii@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-11-11Merge tag 'mm-hotfixes-stable-2022-11-11' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc hotfixes from Andrew Morton: "22 hotfixes. Eight are cc:stable and the remainder address issues which were introduced post-6.0 or which aren't considered serious enough to justify a -stable backport" * tag 'mm-hotfixes-stable-2022-11-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits) docs: kmsan: fix formatting of "Example report" mm/damon/dbgfs: check if rm_contexts input is for a real context maple_tree: don't set a new maximum on the node when not reusing nodes maple_tree: fix depth tracking in maple_state arch/x86/mm/hugetlbpage.c: pud_huge() returns 0 when using 2-level paging fs: fix leaked psi pressure state nilfs2: fix use-after-free bug of ns_writer on remount x86/traps: avoid KMSAN bugs originating from handle_bug() kmsan: make sure PREEMPT_RT is off Kconfig.debug: ensure early check for KMSAN in CONFIG_KMSAN_WARN x86/uaccess: instrument copy_from_user_nmi() kmsan: core: kmsan_in_runtime() should return true in NMI context mm: hugetlb_vmemmap: include missing linux/moduleparam.h mm/shmem: use page_mapping() to detect page cache for uffd continue mm/memremap.c: map FS_DAX device memory as decrypted Partly revert "mm/thp: carry over dirty bit when thp splits on pmd" nilfs2: fix deadlock in nilfs_count_free_blocks() mm/mmap: fix memory leak in mmap_region() hugetlbfs: don't delete error page from pagecache maple_tree: reorganize testing to restore module testing ...
2022-11-11maccess: Fix writing offset in case of fault in strncpy_from_kernel_nofault()Alban Crequy
If a page fault occurs while copying the first byte, this function resets one byte before dst. As a consequence, an address could be modified and leaded to kernel crashes if case the modified address was accessed later. Fixes: b58294ead14c ("maccess: allow architectures to provide kernel probing directly") Signed-off-by: Alban Crequy <albancrequy@linux.microsoft.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Tested-by: Francis Laniel <flaniel@linux.microsoft.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <stable@vger.kernel.org> [5.8] Link: https://lore.kernel.org/bpf/20221110085614.111213-2-albancrequy@linux.microsoft.com
2022-11-09Merge tag 'slab-for-6.1-rc4-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab fixes from Vlastimil Babka: "Most are small fixups as described below. The !CONFIG_TRACING fix is a bit bigger and would normally be done in the next merge window as part of upcoming hardening changes. But we realized it can make the kmalloc waste tracking introduced in this window inaccurate, so decided to go with it now. Summary: - Remove !CONFIG_TRACING kmalloc() wrappers intended to save a function call, due to incompatilibity with recently introduced wasted space tracking and planned hardening changes. - A tracing parameter regression fix, by Kees Cook. - Two kernel-doc warning fixups, by Lukas Bulwahn and myself * tag 'slab-for-6.1-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm, slab: remove duplicate kernel-doc comment for ksize() mm/slab_common: Restore passing "caller" for tracing mm/slab: remove !CONFIG_TRACING variants of kmalloc_[node_]trace() mm/slab_common: repair kernel-doc for __ksize()
2022-11-08mm/damon/dbgfs: check if rm_contexts input is for a real contextSeongJae Park
A user could write a name of a file under 'damon/' debugfs directory, which is not a user-created context, to 'rm_contexts' file. In the case, 'dbgfs_rm_context()' just assumes it's the valid DAMON context directory only if a file of the name exist. As a result, invalid memory access could happen as below. Fix the bug by checking if the given input is for a directory. This check can filter out non-context inputs because directories under 'damon/' debugfs directory can be created via only 'mk_contexts' file. This bug has found by syzbot[1]. [1] https://lore.kernel.org/damon/000000000000ede3ac05ec4abf8e@google.com/ Link: https://lkml.kernel.org/r/20221107165001.5717-2-sj@kernel.org Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts") Signed-off-by: SeongJae Park <sj@kernel.org> Reported-by: syzbot+6087eafb76a94c4ac9eb@syzkaller.appspotmail.com Cc: <stable@vger.kernel.org> [5.15.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08kmsan: core: kmsan_in_runtime() should return true in NMI contextAlexander Potapenko
Without that, every call to __msan_poison_alloca() in NMI may end up allocating memory, which is NMI-unsafe. Link: https://lkml.kernel.org/r/20221102110611.1085175-1-glider@google.com Link: https://lore.kernel.org/lkml/20221025221755.3810809-1-glider@google.com/ Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm: hugetlb_vmemmap: include missing linux/moduleparam.hVasily Gorbik
The kernel test robot reported build failures with a 'randconfig' on s390: >> mm/hugetlb_vmemmap.c:421:11: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes] core_param(hugetlb_free_vmemmap, vmemmap_optimize_enabled, bool, 0); ^ Link: https://lore.kernel.org/linux-mm/202210300751.rG3UDsuc-lkp@intel.com/ Link: https://lkml.kernel.org/r/patch.git-296b83ca939b.your-ad-here.call-01667411912-ext-5073@work.hours Fixes: 30152245c63b ("mm: hugetlb_vmemmap: replace early_param() with core_param()") Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm/shmem: use page_mapping() to detect page cache for uffd continuePeter Xu
mfill_atomic_install_pte() checks page->mapping to detect whether one page is used in the page cache. However as pointed out by Matthew, the page can logically be a tail page rather than always the head in the case of uffd minor mode with UFFDIO_CONTINUE. It means we could wrongly install one pte with shmem thp tail page assuming it's an anonymous page. It's not that clear even for anonymous page, since normally anonymous pages also have page->mapping being setup with the anon vma. It's safe here only because the only such caller to mfill_atomic_install_pte() is always passing in a newly allocated page (mcopy_atomic_pte()), whose page->mapping is not yet setup. However that's not extremely obvious either. For either of above, use page_mapping() instead. Link: https://lkml.kernel.org/r/Y2K+y7wnhC4vbnP2@x1n Fixes: 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Matthew Wilcox <willy@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08mm/memremap.c: map FS_DAX device memory as decryptedPankaj Gupta
virtio_pmem use devm_memremap_pages() to map the device memory. By default this memory is mapped as encrypted with SEV. Guest reboot changes the current encryption key and guest no longer properly decrypts the FSDAX device meta data. Mark the corresponding device memory region for FSDAX devices (mapped with memremap_pages) as decrypted to retain the persistent memory property. Link: https://lkml.kernel.org/r/20221102160728.3184016-1-pankaj.gupta@amd.com Fixes: b7b3c01b19159 ("mm/memremap_pages: support multiple ranges per invocation") Signed-off-by: Pankaj Gupta <pankaj.gupta@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-11-08Partly revert "mm/thp: carry over dirty bit when thp splits on pmd"Peter Xu
Anatoly Pugachev reported sparc64 breakage on the patch: https://lore.kernel.org/r/20221021160603.GA23307@u164.east.ru The sparc64 impl of pte_mkdirty() is definitely slightly special in that it leverages a code patching mechanism for sun4u/sun4v on relevant pgtable entry operations. Before having a clue of why the sparc64 is special and caused the patch to SIGSEGV the processes, revert the patch for now. The swap path of dirty bit inheritage is kept because that's using the swap shared code so we assume it'll not be affected. Link: https://lkml.kernel.org/r/Y1Wbi4yyVvDtg4zN@x1n Fixes: 0ccf7f168e17 ("mm/thp: carry over dirty bit when thp splits on pmd") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: Anatoly Pugachev <matorola@gmail.com> Tested-by: Anatoly Pugachev <matorola@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Minchan Kim <minchan@kernel.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>