diff options
author | Linus Torvalds | 2020-08-12 11:24:12 -0700 |
---|---|---|
committer | Linus Torvalds | 2020-08-12 11:24:12 -0700 |
commit | 9ad57f6dfc2345ed5d3a8bf4dabac0a34069c54c (patch) | |
tree | 9e12a809a2020178eab234395b0f3e1149cb3c0d | |
parent | 24fb33d40d60bd7d196400e7d5b26ff566fd98b7 (diff) | |
parent | 64019a2e467a288a16b65ab55ddcbf58c1b00187 (diff) |
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
- most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
memory-hotplug, cleanups, uaccess, migration, gup, pagemap),
- various other subsystems (alpha, misc, sparse, bitmap, lib, bitops,
checkpatch, autofs, minix, nilfs, ufs, fat, signals, kmod, coredump,
exec, kdump, rapidio, panic, kcov, kgdb, ipc).
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
mm/gup: remove task_struct pointer for all gup code
mm: clean up the last pieces of page fault accountings
mm/xtensa: use general page fault accounting
mm/x86: use general page fault accounting
mm/sparc64: use general page fault accounting
mm/sparc32: use general page fault accounting
mm/sh: use general page fault accounting
mm/s390: use general page fault accounting
mm/riscv: use general page fault accounting
mm/powerpc: use general page fault accounting
mm/parisc: use general page fault accounting
mm/openrisc: use general page fault accounting
mm/nios2: use general page fault accounting
mm/nds32: use general page fault accounting
mm/mips: use general page fault accounting
mm/microblaze: use general page fault accounting
mm/m68k: use general page fault accounting
mm/ia64: use general page fault accounting
mm/hexagon: use general page fault accounting
mm/csky: use general page fault accounting
...
264 files changed, 2357 insertions, 1437 deletions
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index fa4018afa5a4..6be43781ec7f 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1274,6 +1274,10 @@ PAGE_SIZE multiple when read back. Amount of memory used for storing in-kernel data structures. + percpu + Amount of memory used for storing per-cpu kernel + data structures. + sock Amount of memory used in network transmission buffers diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 2ae9669eb22c..d4b32cc32bb7 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -164,7 +164,8 @@ core_pattern %s signal number %t UNIX time of dump %h hostname - %e executable filename (may be shortened) + %e executable filename (may be shortened, could be changed by prctl etc) + %f executable filename %E executable path %c maximum size of core file by resource limit RLIMIT_CORE %<OTHER> both are dropped diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index d997cc3c26d0..4b9d2e8e9142 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -119,6 +119,21 @@ all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required. +compaction_proactiveness +======================== + +This tunable takes a value in the range [0, 100] with a default value of +20. This tunable determines how aggressively compaction is done in the +background. Setting it to 0 disables proactive compaction. + +Note that compaction has a non-trivial system-wide impact as pages +belonging to different processes are moved around, which could also lead +to latency spikes in unsuspecting applications. The kernel employs +various heuristics to avoid wasting CPU cycles if it detects that +proactive compaction is not being effective. + +Be careful when setting it to extreme values like 100, as that may +cause excessive background compaction activity. compact_unevictable_allowed =========================== diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index e024a9efffd8..533c79e8d2cd 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -1633,9 +1633,6 @@ may allocate from based on an estimation of its current memory and swap use. For example, if a task is using all allowed memory, its badness score will be 1000. If it is using half of its allowed memory, its score will be 500. -There is an additional factor included in the badness score: the current memory -and swap usage is discounted by 3% for root processes. - The amount of "allowed" memory depends on the context in which the oom killer was called. If it is due to the memory assigned to the allocating task's cpuset being exhausted, the allowed memory represents the set of mems assigned to that @@ -1671,11 +1668,6 @@ The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last value set by a CAP_SYS_RESOURCE process. To reduce the value any lower requires CAP_SYS_RESOURCE. -Caveat: when a parent task is selected, the oom killer will sacrifice any first -generation children with separate address spaces instead, if possible. This -avoids servers and important system daemons from being killed and loses the -minimal amount of work. - 3.2 /proc/<pid>/oom_score - Display current oom-killer score ------------------------------------------------------------- @@ -1684,6 +1676,9 @@ This file can be used to check the current score used by the oom-killer for any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which process should be killed in an out-of-memory situation. +Please note that the exported value includes oom_score_adj so it is +effectively in range [0,2000]. + 3.3 /proc/<pid>/io - Display the IO accounting fields ------------------------------------------------------- diff --git a/Documentation/vm/page_migration.rst b/Documentation/vm/page_migration.rst index 1d6cd7db4e43..68883ac485fa 100644 --- a/Documentation/vm/page_migration.rst +++ b/Documentation/vm/page_migration.rst @@ -253,5 +253,32 @@ which are function pointers of struct address_space_operations. PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag for own purpose. +Monitoring Migration +===================== + +The following events (counters) can be used to monitor page migration. + +1. PGMIGRATE_SUCCESS: Normal page migration success. Each count means that a + page was migrated. If the page was a non-THP page, then this counter is + increased by one. If the page was a THP, then this counter is increased by + the number of THP subpages. For example, migration of a single 2MB THP that + has 4KB-size base pages (subpages) will cause this counter to increase by + 512. + +2. PGMIGRATE_FAIL: Normal page migration failure. Same counting rules as for + _SUCCESS, above: this will be increased by the number of subpages, if it was + a THP. + +3. THP_MIGRATION_SUCCESS: A THP was migrated without being split. + +4. THP_MIGRATION_FAIL: A THP could not be migrated nor it could be split. + +5. THP_MIGRATION_SPLIT: A THP was migrated, but not as such: first, the THP had + to be split. After splitting, a migration retry was used for it's sub-pages. + +THP_MIGRATION_* events also update the appropriate PGMIGRATE_SUCCESS or +PGMIGRATE_FAIL events. For example, a THP migration failure will cause both +THP_MIGRATION_FAIL and PGMIGRATE_FAIL to increase. + Christoph Lameter, May 8, 2006. Minchan Kim, Mar 28, 2016. @@ -893,6 +893,10 @@ KBUILD_CFLAGS += $(CC_FLAGS_SCS) export CC_FLAGS_SCS endif +ifdef CONFIG_DEBUG_FORCE_FUNCTION_ALIGN_32B +KBUILD_CFLAGS += -falign-functions=32 +endif + # arch Makefile may override CC so keep this after arch Makefile is included NOSTDINC_FLAGS += -nostdinc -isystem $(shell $(CC) -print-file-name=include) diff --git a/arch/alpha/include/asm/io.h b/arch/alpha/include/asm/io.h index a4d0c19f1e79..640e1a2f57b4 100644 --- a/arch/alpha/include/asm/io.h +++ b/arch/alpha/include/asm/io.h @@ -489,10 +489,10 @@ extern inline void writeq(u64 b, volatile void __iomem *addr) } #endif -#define ioread16be(p) be16_to_cpu(ioread16(p)) -#define ioread32be(p) be32_to_cpu(ioread32(p)) -#define iowrite16be(v,p) iowrite16(cpu_to_be16(v), (p)) -#define iowrite32be(v,p) iowrite32(cpu_to_be32(v), (p)) +#define ioread16be(p) swab16(ioread16(p)) +#define ioread32be(p) swab32(ioread32(p)) +#define iowrite16be(v,p) iowrite16(swab16(v), (p)) +#define iowrite32be(v,p) iowrite32(swab32(v), (p)) #define inb_p inb #define inw_p inw diff --git a/arch/alpha/include/asm/uaccess.h b/arch/alpha/include/asm/uaccess.h index 1fe2b56cb861..1b6f25efa247 100644 --- a/arch/alpha/include/asm/uaccess.h +++ b/arch/alpha/include/asm/uaccess.h @@ -20,7 +20,7 @@ #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) /* * Is a address valid? This does a straightforward calculation rather diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c index c2303a8c2b9f..09172f017efc 100644 --- a/arch/alpha/mm/fault.c +++ b/arch/alpha/mm/fault.c @@ -25,6 +25,7 @@ #include <linux/interrupt.h> #include <linux/extable.h> #include <linux/uaccess.h> +#include <linux/perf_event.h> extern void die_if_kernel(char *,struct pt_regs *,long, unsigned long *); @@ -116,6 +117,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr, #endif if (user_mode(regs)) flags |= FAULT_FLAG_USER; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -148,7 +150,7 @@ retry: /* If for any reason at all we couldn't handle the fault, make sure we exit gracefully rather than endlessly redo the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -164,10 +166,6 @@ retry: } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/arc/include/asm/segment.h b/arch/arc/include/asm/segment.h index 6a2a5be5026d..871f8ab11bfd 100644 --- a/arch/arc/include/asm/segment.h +++ b/arch/arc/include/asm/segment.h @@ -14,8 +14,7 @@ typedef unsigned long mm_segment_t; #define KERNEL_DS MAKE_MM_SEG(0) #define USER_DS MAKE_MM_SEG(TASK_SIZE) - -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) #endif /* __ASSEMBLY__ */ #endif /* __ASMARC_SEGMENT_H */ diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c index e12c80d71b78..efeba1fe7252 100644 --- a/arch/arc/kernel/process.c +++ b/arch/arc/kernel/process.c @@ -91,7 +91,7 @@ fault: goto fail; mmap_read_lock(current->mm); - ret = fixup_user_fault(current, current->mm, (unsigned long) uaddr, + ret = fixup_user_fault(current->mm, (unsigned long) uaddr, FAULT_FLAG_WRITE, NULL); mmap_read_unlock(current->mm); diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c index 7287c793d1c9..f5657cb68e4f 100644 --- a/arch/arc/mm/fault.c +++ b/arch/arc/mm/fault.c @@ -105,6 +105,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs) if (write) flags |= FAULT_FLAG_WRITE; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); @@ -130,7 +131,7 @@ retry: goto bad_area; } - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -155,22 +156,9 @@ bad_area: * Major/minor page fault accounting * (in case of retry we only land here once) */ - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); - - if (likely(!(fault & VM_FAULT_ERROR))) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - } - + if (likely(!(fault & VM_FAULT_ERROR))) /* Normal return path: fault Handled Gracefully */ return; - } if (!user_mode(regs)) goto no_context; diff --git a/arch/arm/include/asm/uaccess.h b/arch/arm/include/asm/uaccess.h index b5fdd30252f8..a13d90206472 100644 --- a/arch/arm/include/asm/uaccess.h +++ b/arch/arm/include/asm/uaccess.h @@ -76,7 +76,7 @@ static inline void set_fs(mm_segment_t fs) modify_domain(DOMAIN_KERNEL, fs ? DOMAIN_CLIENT : DOMAIN_MANAGER); } -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) /* * We use 33-bit arithmetic here. Success returns zero, failure returns @@ -267,7 +267,7 @@ extern int __put_user_8(void *, unsigned long long); */ #define USER_DS KERNEL_DS -#define segment_eq(a, b) (1) +#define uaccess_kernel() (true) #define __addr_ok(addr) ((void)(addr), 1) #define __range_ok(addr, size) ((void)(addr), 0) #define get_fs() (KERNEL_DS) diff --git a/arch/arm/kernel/signal.c b/arch/arm/kernel/signal.c index ab2568996ddb..c9dc912b83f0 100644 --- a/arch/arm/kernel/signal.c +++ b/arch/arm/kernel/signal.c @@ -713,7 +713,9 @@ struct page *get_signal_page(void) /* Defer to generic check */ asmlinkage void addr_limit_check_failed(void) { +#ifdef CONFIG_MMU addr_limit_user_check(); +#endif } #ifdef CONFIG_DEBUG_RSEQ diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index c6550eddfce1..efa402025031 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -202,7 +202,8 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma) static vm_fault_t __kprobes __do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr, - unsigned int flags, struct task_struct *tsk) + unsigned int flags, struct task_struct *tsk, + struct pt_regs *regs) { struct vm_area_struct *vma; vm_fault_t fault; @@ -224,7 +225,7 @@ good_area: goto out; } - return handle_mm_fault(vma, addr & PAGE_MASK, flags); + return handle_mm_fault(vma, addr & PAGE_MASK, flags, regs); check_stack: /* Don't allow expansion below FIRST_USER_ADDRESS */ @@ -266,6 +267,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) if ((fsr & FSR_WRITE) && !(fsr & FSR_CM)) flags |= FAULT_FLAG_WRITE; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); + /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, @@ -290,7 +293,7 @@ retry: #endif } - fault = __do_page_fault(mm, addr, fsr, flags, tsk); + fault = __do_page_fault(mm, addr, fsr, flags, tsk, regs); /* If we need to retry but a fatal signal is pending, handle the * signal first. We do not need to release the mmap_lock because @@ -302,23 +305,7 @@ retry: return 0; } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ - - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); if (!(fault & VM_FAULT_ERROR) && flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, addr); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, addr); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; goto retry; diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 8d7c466f809b..991dd5f031e4 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -50,7 +50,7 @@ static inline void set_fs(mm_segment_t fs) CONFIG_ARM64_UAO)); } -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) /* * Test whether a block of memory is a valid user space address. diff --git a/arch/arm64/kernel/sdei.c b/arch/arm64/kernel/sdei.c index dab88260b137..7689f2031c0c 100644 --- a/arch/arm64/kernel/sdei.c +++ b/arch/arm64/kernel/sdei.c @@ -180,7 +180,7 @@ static __kprobes unsigned long _sdei_handler(struct pt_regs *regs, /* * We didn't take an exception to get here, set PAN. UAO will be cleared - * by sdei_event_handler()s set_fs(USER_DS) call. + * by sdei_event_handler()s force_uaccess_begin() call. */ __uaccess_enable_hw_pan(); diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 8afb238ff335..f07333e86c2f 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -404,7 +404,8 @@ static void do_bad_area(unsigned long addr, unsigned int esr, struct pt_regs *re #define VM_FAULT_BADACCESS 0x020000 static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr, - unsigned int mm_flags, unsigned long vm_flags) + unsigned int mm_flags, unsigned long vm_flags, + struct pt_regs *regs) { struct vm_area_struct *vma = find_vma(mm, addr); @@ -428,7 +429,7 @@ static vm_fault_t __do_page_fault(struct mm_struct *mm, unsigned long addr, */ if (!(vma->vm_flags & vm_flags)) return VM_FAULT_BADACCESS; - return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags); + return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags, regs); } static bool is_el0_instruction_abort(unsigned int esr) @@ -450,7 +451,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, { const struct fault_info *inf; struct mm_struct *mm = current->mm; - vm_fault_t fault, major = 0; + vm_fault_t fault; unsigned long vm_flags = VM_ACCESS_FLAGS; unsigned int mm_flags = FAULT_FLAG_DEFAULT; @@ -516,8 +517,7 @@ retry: #endif } - fault = __do_page_fault(mm, addr, mm_flags, vm_flags); - major |= fault & VM_FAULT_MAJOR; + fault = __do_page_fault(mm, addr, mm_flags, vm_flags, regs); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -538,25 +538,8 @@ retry: * Handle the "normal" (no error) case first. */ if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | - VM_FAULT_BADACCESS)))) { - /* - * Major/minor page fault accounting is only done - * once. If we go through a retry, it is extremely - * likely that the page will be found in page cache at - * that point. - */ - if (major) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, - addr); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, - addr); - } - + VM_FAULT_BADACCESS)))) return 0; - } /* * If we are in kernel mode at this point, we have no context to diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c index aafcee3e3f7e..73f8b49d485c 100644 --- a/arch/arm64/mm/numa.c +++ b/arch/arm64/mm/numa.c @@ -461,13 +461,3 @@ void __init arm64_numa_init(void) numa_init(dummy_numa_init); } - -/* - * We hope that we will be hotplugging memory on nodes we already know about, - * such that acpi_get_node() succeeds and we never fall back to this... - */ -int memory_add_physaddr_to_nid(u64 addr) -{ - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", addr); - return 0; -} diff --git a/arch/csky/include/asm/segment.h b/arch/csky/include/asm/segment.h index db2640d5f575..79ede9b1a646 100644 --- a/arch/csky/include/asm/segment.h +++ b/arch/csky/include/asm/segment.h @@ -13,6 +13,6 @@ typedef struct { #define USER_DS ((mm_segment_t) { 0x80000000UL }) #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #endif /* __ASM_CSKY_SEGMENT_H */ diff --git a/arch/csky/mm/fault.c b/arch/csky/mm/fault.c index b1dce9f2f04d..081b178b41b1 100644 --- a/arch/csky/mm/fault.c +++ b/arch/csky/mm/fault.c @@ -150,7 +150,8 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, write ? FAULT_FLAG_WRITE : 0); + fault = handle_mm_fault(vma, address, write ? FAULT_FLAG_WRITE : 0, + regs); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -160,16 +161,6 @@ good_area: goto bad_area; BUG(); } - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, - address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, - address); - } - mmap_read_unlock(mm); return; diff --git a/arch/h8300/include/asm/segment.h b/arch/h8300/include/asm/segment.h index a407978f9f9f..37950725d9b9 100644 --- a/arch/h8300/include/asm/segment.h +++ b/arch/h8300/include/asm/segment.h @@ -33,7 +33,7 @@ static inline mm_segment_t get_fs(void) return USER_DS; } -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #endif /* __ASSEMBLY__ */ diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c index cd3808f96b93..ef32c5a84ff3 100644 --- a/arch/hexagon/mm/vm_fault.c +++ b/arch/hexagon/mm/vm_fault.c @@ -18,6 +18,7 @@ #include <linux/signal.h> #include <linux/extable.h> #include <linux/hardirq.h> +#include <linux/perf_event.h> /* * Decode of hardware exception sends us to one of several @@ -53,6 +54,8 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs) if (user_mode(regs)) flags |= FAULT_FLAG_USER; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -88,7 +91,7 @@ good_area: break; } - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -96,10 +99,6 @@ good_area: /* The most common case -- we are done. */ if (likely(!(fault & VM_FAULT_ERROR))) { if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; goto retry; diff --git a/arch/ia64/include/asm/uaccess.h b/arch/ia64/include/asm/uaccess.h index 8aa473a4b0f4..179243c3dfc7 100644 --- a/arch/ia64/include/asm/uaccess.h +++ b/arch/ia64/include/asm/uaccess.h @@ -50,7 +50,7 @@ #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) /* * When accessing user memory, we need to make sure the entire area really is in diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c index 3a4dec334cc5..cd9766d2b6e0 100644 --- a/arch/ia64/mm/fault.c +++ b/arch/ia64/mm/fault.c @@ -14,6 +14,7 @@ #include <linux/kdebug.h> #include <linux/prefetch.h> #include <linux/uaccess.h> +#include <linux/perf_event.h> #include <asm/processor.h> #include <asm/exception.h> @@ -105,6 +106,8 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re flags |= FAULT_FLAG_USER; if (mask & VM_WRITE) flags |= FAULT_FLAG_WRITE; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); @@ -143,7 +146,7 @@ retry: * sure we exit gracefully rather than endlessly redo the * fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -166,10 +169,6 @@ retry: } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/ia64/mm/numa.c b/arch/ia64/mm/numa.c index 5e1015eb6d0d..f34964271101 100644 --- a/arch/ia64/mm/numa.c +++ b/arch/ia64/mm/numa.c @@ -106,7 +106,5 @@ int memory_add_physaddr_to_nid(u64 addr) return 0; return nid; } - -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif #endif diff --git a/arch/m68k/include/asm/segment.h b/arch/m68k/include/asm/segment.h index c6686559e9b7..2b5e68a71ef7 100644 --- a/arch/m68k/include/asm/segment.h +++ b/arch/m68k/include/asm/segment.h @@ -52,7 +52,7 @@ static inline void set_fs(mm_segment_t val) #define set_fs(x) (current_thread_info()->addr_limit = (x)) #endif -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #endif /* __ASSEMBLY__ */ diff --git a/arch/m68k/include/asm/tlbflush.h b/arch/m68k/include/asm/tlbflush.h index 191e75a6bb24..5337bc2c262f 100644 --- a/arch/m68k/include/asm/tlbflush.h +++ b/arch/m68k/include/asm/tlbflush.h @@ -85,10 +85,10 @@ static inline void flush_tlb_mm(struct mm_struct *mm) static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long addr) { if (vma->vm_mm == current->active_mm) { - mm_segment_t old_fs = get_fs(); - set_fs(USER_DS); + mm_segment_t old_fs = force_uaccess_begin(); + __flush_tlb_one(addr); - set_fs(old_fs); + force_uaccess_end(old_fs); } } diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c index 508abb63da67..795f483b1050 100644 --- a/arch/m68k/mm/fault.c +++ b/arch/m68k/mm/fault.c @@ -12,6 +12,7 @@ #include <linux/interrupt.h> #include <linux/module.h> #include <linux/uaccess.h> +#include <linux/perf_event.h> #include <asm/setup.h> #include <asm/traps.h> @@ -84,6 +85,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); @@ -134,7 +137,7 @@ good_area: * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); pr_debug("handle_mm_fault returns %x\n", fault); if (fault_signal_pending(fault, regs)) @@ -150,16 +153,7 @@ good_area: BUG(); } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/microblaze/include/asm/uaccess.h b/arch/microblaze/include/asm/uaccess.h index 6723c56ec378..304b04ffea2f 100644 --- a/arch/microblaze/include/asm/uaccess.h +++ b/arch/microblaze/include/asm/uaccess.h @@ -41,7 +41,7 @@ # define get_fs() (current_thread_info()->addr_limit) # define set_fs(val) (current_thread_info()->addr_limit = (val)) -# define segment_eq(a, b) ((a).seg == (b).seg) +# define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #ifndef CONFIG_MMU diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c index a2bfe587b491..b3fed2cecf84 100644 --- a/arch/microblaze/mm/fault.c +++ b/arch/microblaze/mm/fault.c @@ -28,6 +28,7 @@ #include <linux/mman.h> #include <linux/mm.h> #include <linux/interrupt.h> +#include <linux/perf_event.h> #include <asm/page.h> #include <asm/mmu.h> @@ -121,6 +122,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long address, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -214,7 +217,7 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -230,10 +233,6 @@ good_area: } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (unlikely(fault & VM_FAULT_MAJOR)) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/mips/include/asm/uaccess.h b/arch/mips/include/asm/uaccess.h index 62b298c50905..61fc01f177a6 100644 --- a/arch/mips/include/asm/uaccess.h +++ b/arch/mips/include/asm/uaccess.h @@ -72,7 +72,7 @@ extern u64 __ua_limit; #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) /* * eva_kernel_access() - determine whether kernel memory access on an EVA system diff --git a/arch/mips/kernel/unaligned.c b/arch/mips/kernel/unaligned.c index 0adce604fa44..126a5f3f4e4c 100644 --- a/arch/mips/kernel/unaligned.c +++ b/arch/mips/kernel/unaligned.c @@ -191,17 +191,16 @@ static void emulate_load_store_insn(struct pt_regs *regs, * memory, so we need to "switch" the address limit to * user space, so that address check can work properly. */ - seg = get_fs(); - set_fs(USER_DS); + seg = force_uaccess_begin(); switch (insn.spec3_format.func) { case lhe_op: if (!access_ok(addr, 2)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } LoadHWE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } compute_return_epc(regs); @@ -209,12 +208,12 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case lwe_op: if (!access_ok(addr, 4)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } LoadWE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } compute_return_epc(regs); @@ -222,12 +221,12 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case lhue_op: if (!access_ok(addr, 2)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } LoadHWUE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } compute_return_epc(regs); @@ -235,35 +234,35 @@ static void emulate_load_store_insn(struct pt_regs *regs, break; case she_op: if (!access_ok(addr, 2)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } compute_return_epc(regs); value = regs->regs[insn.spec3_format.rt]; StoreHWE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } break; case swe_op: if (!access_ok(addr, 4)) { - set_fs(seg); + force_uaccess_end(seg); goto sigbus; } compute_return_epc(regs); value = regs->regs[insn.spec3_format.rt]; StoreWE(addr, value, res); if (res) { - set_fs(seg); + force_uaccess_end(seg); goto fault; } break; default: - set_fs(seg); + force_uaccess_end(seg); goto sigill; } - set_fs(seg); + force_uaccess_end(seg); } #endif break; diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c index 01b168a90434..7c871b14e74a 100644 --- a/arch/mips/mm/fault.c +++ b/arch/mips/mm/fault.c @@ -96,6 +96,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -152,12 +154,11 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); if (unlikely(fault & VM_FAULT_ERROR)) { if (fault & VM_FAULT_OOM) goto out_of_memory; @@ -168,15 +169,6 @@ good_area: BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - tsk->maj_flt++; - } else { - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - tsk->min_flt++; - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/nds32/include/asm/uaccess.h b/arch/nds32/include/asm/uaccess.h index 3a9219f53ee0..010ba5f1d7dd 100644 --- a/arch/nds32/include/asm/uaccess.h +++ b/arch/nds32/include/asm/uaccess.h @@ -44,7 +44,7 @@ static inline void set_fs(mm_segment_t fs) current_thread_info()->addr_limit = fs; } -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) #define __range_ok(addr, size) (size <= get_fs() && addr <= (get_fs() -size)) diff --git a/arch/nds32/kernel/process.c b/arch/nds32/kernel/process.c index e85bbbadc0e7..e01ad5d17224 100644 --- a/arch/nds32/kernel/process.c +++ b/arch/nds32/kernel/process.c @@ -121,7 +121,7 @@ void show_regs(struct pt_regs *regs) regs->uregs[3], regs->uregs[2], regs->uregs[1], regs->uregs[0]); pr_info(" IRQs o%s Segment %s\n", interrupts_enabled(regs) ? "n" : "ff", - segment_eq(get_fs(), KERNEL_DS)? "kernel" : "user"); + uaccess_kernel() ? "kernel" : "user"); } EXPORT_SYMBOL(show_regs); diff --git a/arch/nds32/mm/alignment.c b/arch/nds32/mm/alignment.c index c8b9061a2ee3..1eb7ded6992b 100644 --- a/arch/nds32/mm/alignment.c +++ b/arch/nds32/mm/alignment.c @@ -512,7 +512,7 @@ int do_unaligned_access(unsigned long addr, struct pt_regs *regs) { unsigned long inst; int ret = -EFAULT; - mm_segment_t seg = get_fs(); + mm_segment_t seg; inst = get_inst(regs->ipc); @@ -520,13 +520,12 @@ int do_unaligned_access(unsigned long addr, struct pt_regs *regs) "Faulting addr: 0x%08lx, pc: 0x%08lx [inst: 0x%08lx ]\n", addr, regs->ipc, inst); - set_fs(USER_DS); - + seg = force_uaccess_begin(); if (inst & NDS32_16BIT_INSTRUCTION) ret = do_16((inst >> 16) & 0xffff, regs); else ret = do_32(inst, regs); - set_fs(seg); + force_uaccess_end(seg); return ret; } diff --git a/arch/nds32/mm/fault.c b/arch/nds32/mm/fault.c index 8fb73f6401a0..f02524eb6d56 100644 --- a/arch/nds32/mm/fault.c +++ b/arch/nds32/mm/fault.c @@ -121,6 +121,8 @@ void do_page_fault(unsigned long entry, unsigned long addr, if (unlikely(faulthandler_disabled() || !mm)) goto no_context; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); + /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, @@ -206,7 +208,7 @@ good_area: * the fault. */ - fault = handle_mm_fault(vma, addr, flags); + fault = handle_mm_fault(vma, addr, flags, regs); /* * If we need to retry but a fatal signal is pending, handle the @@ -228,22 +230,7 @@ good_area: goto bad_area; } - /* - * Major/minor page fault accounting is only done on the initial - * attempt. If we go through a retry, it is extremely likely that the - * page will be found in page cache at that point. - */ - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, addr); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, addr); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/nios2/include/asm/uaccess.h b/arch/nios2/include/asm/uaccess.h index e83f831a76f9..a741abbed6fb 100644 --- a/arch/nios2/include/asm/uaccess.h +++ b/arch/nios2/include/asm/uaccess.h @@ -30,7 +30,7 @@ #define get_fs() (current_thread_info()->addr_limit) #define set_fs(seg) (current_thread_info()->addr_limit = (seg)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define __access_ok(addr, len) \ (((signed long)(((long)get_fs().seg) & \ diff --git a/arch/nios2/mm/fault.c b/arch/nios2/mm/fault.c index 4112ef0e247e..9476feecf512 100644 --- a/arch/nios2/mm/fault.c +++ b/arch/nios2/mm/fault.c @@ -24,6 +24,7 @@ #include <linux/mm.h> #include <linux/extable.h> #include <linux/uaccess.h> +#include <linux/perf_event.h> #include <asm/mmu_context.h> #include <asm/traps.h> @@ -83,6 +84,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause, if (user_mode(regs)) flags |= FAULT_FLAG_USER; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + if (!mmap_read_trylock(mm)) { if (!user_mode(regs) && !search_exception_tables(regs->ea)) goto bad_area_nosemaphore; @@ -131,7 +134,7 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -146,16 +149,7 @@ good_area: BUG(); } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/openrisc/include/asm/uaccess.h b/arch/openrisc/include/asm/uaccess.h index 17c24f14615f..48b691530d3e 100644 --- a/arch/openrisc/include/asm/uaccess.h +++ b/arch/openrisc/include/asm/uaccess.h @@ -43,7 +43,7 @@ #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) -#define segment_eq(a, b) ((a) == (b)) +#define uaccess_kernel() (get_fs() == KERNEL_DS) /* Ensure that the range from addr to addr+size is all within the process' * address space diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c index d2224ccca294..ca97d9baab51 100644 --- a/arch/openrisc/mm/fault.c +++ b/arch/openrisc/mm/fault.c @@ -15,6 +15,7 @@ #include <linux/interrupt.h> #include <linux/extable.h> #include <linux/sched/signal.h> +#include <linux/perf_event.h> #include <linux/uaccess.h> #include <asm/siginfo.h> @@ -103,6 +104,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address, if (in_interrupt() || !mm) goto no_context; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -159,7 +162,7 @@ good_area: * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -176,10 +179,6 @@ good_area: if (flags & FAULT_FLAG_ALLOW_RETRY) { /*RGD modeled on Cris */ - if (fault & VM_FAULT_MAJOR) - tsk->maj_flt++; - else - tsk->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/parisc/include/asm/uaccess.h b/arch/parisc/include/asm/uaccess.h index ebbb9ffe038c..ed2cd4fb479b 100644 --- a/arch/parisc/include/asm/uaccess.h +++ b/arch/parisc/include/asm/uaccess.h @@ -14,7 +14,7 @@ #define KERNEL_DS ((mm_segment_t){0}) #define USER_DS ((mm_segment_t){1}) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c index 66ac0719bd49..4bfe2da9fbe3 100644 --- a/arch/parisc/mm/fault.c +++ b/arch/parisc/mm/fault.c @@ -18,6 +18,7 @@ #include <linux/extable.h> #include <linux/uaccess.h> #include <linux/hugetlb.h> +#include <linux/perf_event.h> #include <asm/traps.h> @@ -281,6 +282,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code, acc_type = parisc_acctyp(code, regs->iir); if (acc_type & VM_WRITE) flags |= FAULT_FLAG_WRITE; + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); retry: mmap_read_lock(mm); vma = find_vma_prev(mm, address, &prev_vma); @@ -302,7 +304,7 @@ good_area: * fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -323,10 +325,6 @@ good_area: BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { /* * No need to mmap_read_unlock(mm) as we would diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h index 64c04ab09112..00699903f1ef 100644 --- a/arch/powerpc/include/asm/uaccess.h +++ b/arch/powerpc/include/asm/uaccess.h @@ -38,8 +38,7 @@ static inline void set_fs(mm_segment_t fs) set_thread_flag(TIF_FSCHECK); } -#define segment_eq(a, b) ((a).seg == (b).seg) - +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define user_addr_max() (get_fs().seg) #ifdef __powerpc64__ diff --git a/arch/powerpc/mm/copro_fault.c b/arch/powerpc/mm/copro_fault.c index b83abbead4a2..8acd00178956 100644 --- a/arch/powerpc/mm/copro_fault.c +++ b/arch/powerpc/mm/copro_fault.c @@ -64,7 +64,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, } ret = 0; - *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0); + *flt = handle_mm_fault(vma, ea, is_write ? FAULT_FLAG_WRITE : 0, NULL); if (unlikely(*flt & VM_FAULT_ERROR)) { if (*flt & VM_FAULT_OOM) { ret = -ENOMEM; @@ -76,11 +76,6 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea, BUG(); } - if (*flt & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; - out_unlock: mmap_read_unlock(mm); return ret; diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 925a7231abb3..0add963a849b 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -511,7 +511,7 @@ retry: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); major |= fault & VM_FAULT_MAJOR; @@ -537,14 +537,9 @@ retry: /* * Major/minor page fault accounting. */ - if (major) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); + if (major) cmo_account_page_fault(); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - } + return 0; } NOKPROBE_SYMBOL(__do_page_fault); diff --git a/arch/riscv/include/asm/uaccess.h b/arch/riscv/include/asm/uaccess.h index 8ce9d607b53d..f56c66b3f5fe 100644 --- a/arch/riscv/include/asm/uaccess.h +++ b/arch/riscv/include/asm/uaccess.h @@ -8,6 +8,8 @@ #ifndef _ASM_RISCV_UACCESS_H #define _ASM_RISCV_UACCESS_H +#include <asm/pgtable.h> /* for TASK_SIZE */ + /* * User space memory access functions */ @@ -62,11 +64,9 @@ static inline void set_fs(mm_segment_t fs) current_thread_info()->addr_limit = fs; } -#define segment_eq(a, b) ((a).seg == (b).seg) - +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define user_addr_max() (get_fs().seg) - /** * access_ok: - Checks if a user space pointer is valid * @addr: User space pointer to start of block to check diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 5873835a3e6b..716d64e36f83 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -109,7 +109,7 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, addr, flags); + fault = handle_mm_fault(vma, addr, flags, regs); /* * If we need to retry but a fatal signal is pending, handle the @@ -127,21 +127,7 @@ good_area: BUG(); } - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, addr); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, addr); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/s390/include/asm/uaccess.h b/arch/s390/include/asm/uaccess.h index 324438889fe1..f09444d6aeab 100644 --- a/arch/s390/include/asm/uaccess.h +++ b/arch/s390/include/asm/uaccess.h @@ -32,7 +32,7 @@ #define USER_DS_SACF (3) #define get_fs() (current->thread.mm_segment) -#define segment_eq(a,b) (((a) & 2) == ((b) & 2)) +#define uaccess_kernel() ((get_fs() & 2) == KERNEL_DS) void set_fs(mm_segment_t fs); diff --git a/arch/s390/kvm/interrupt.c b/arch/s390/kvm/interrupt.c index 1608fd99bbee..2f177298c663 100644 --- a/arch/s390/kvm/interrupt.c +++ b/arch/s390/kvm/interrupt.c @@ -2768,7 +2768,7 @@ static struct page *get_map_page(struct kvm *kvm, u64 uaddr) struct page *page = NULL; mmap_read_lock(kvm->mm); - get_user_pages_remote(NULL, kvm->mm, uaddr, 1, FOLL_WRITE, + get_user_pages_remote(kvm->mm, uaddr, 1, FOLL_WRITE, &page, NULL, NULL); mmap_read_unlock(kvm->mm); return page; diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c index 66da278a67fb..6b74b92c1a58 100644 --- a/arch/s390/kvm/kvm-s390.c +++ b/arch/s390/kvm/kvm-s390.c @@ -1892,7 +1892,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args) r = set_guest_storage_key(current->mm, hva, keys[i], 0); if (r) { - r = fixup_user_fault(current, current->mm, hva, + r = fixup_user_fault(current->mm, hva, FAULT_FLAG_WRITE, &unlocked); if (r) break; diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c index 2f721a923b54..cd74989ce0b0 100644 --- a/arch/s390/kvm/priv.c +++ b/arch/s390/kvm/priv.c @@ -273,7 +273,7 @@ retry: rc = get_guest_storage_key(current->mm, vmaddr, &key); if (rc) { - rc = fixup_user_fault(current, current->mm, vmaddr, + rc = fixup_user_fault(current->mm, vmaddr, FAULT_FLAG_WRITE, &unlocked); if (!rc) { mmap_read_unlock(current->mm); @@ -319,7 +319,7 @@ retry: mmap_read_lock(current->mm); rc = reset_guest_reference_bit(current->mm, vmaddr); if (rc < 0) { - rc = fixup_user_fault(current, current->mm, vmaddr, + rc = fixup_user_fault(current->mm, vmaddr, FAULT_FLAG_WRITE, &unlocked); if (!rc) { mmap_read_unlock(current->mm); @@ -390,7 +390,7 @@ static int handle_sske(struct kvm_vcpu *vcpu) m3 & SSKE_MC); if (rc < 0) { - rc = fixup_user_fault(current, current->mm, vmaddr, + rc = fixup_user_fault(current->mm, vmaddr, FAULT_FLAG_WRITE, &unlocked); rc = !rc ? -EAGAIN : rc; } @@ -1094,7 +1094,7 @@ static int handle_pfmf(struct kvm_vcpu *vcpu) rc = cond_set_guest_storage_key(current->mm, vmaddr, key, NULL, nq, mr, mc); if (rc < 0) { - rc = fixup_user_fault(current, current->mm, vmaddr, + rc = fixup_user_fault(current->mm, vmaddr, FAULT_FLAG_WRITE, &unlocked); rc = !rc ? -EAGAIN : rc; } diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index aebf9183bedd..4c8c063bce5b 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -476,7 +476,7 @@ retry: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) { fault = VM_FAULT_SIGNAL; if (flags & FAULT_FLAG_RETRY_NOWAIT) @@ -486,21 +486,7 @@ retry: if (unlikely(fault & VM_FAULT_ERROR)) goto out_up; - /* - * Major/minor page fault accounting is only done on the - * initial attempt. If we go through a retry, it is extremely - * likely that the page will be found in page cache at that point. - */ if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - } if (fault & VM_FAULT_RETRY) { if (IS_ENABLED(CONFIG_PGSTE) && gmap && (flags & FAULT_FLAG_RETRY_NOWAIT)) { diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c index 190357ff86b3..8747487c50a8 100644 --- a/arch/s390/mm/gmap.c +++ b/arch/s390/mm/gmap.c @@ -649,7 +649,7 @@ retry: rc = vmaddr; goto out_up; } - if (fixup_user_fault(current, gmap->mm, vmaddr, fault_flags, + if (fixup_user_fault(gmap->mm, vmaddr, fault_flags, &unlocked)) { rc = -EFAULT; goto out_up; @@ -879,7 +879,7 @@ static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr, BUG_ON(gmap_is_shadow(gmap)); fault_flags = (prot == PROT_WRITE) ? FAULT_FLAG_WRITE : 0; - if (fixup_user_fault(current, mm, vmaddr, fault_flags, &unlocked)) + if (fixup_user_fault(mm, vmaddr, fault_flags, &unlocked)) return -EFAULT; if (unlocked) /* lost mmap_lock, caller has to retry __gmap_translate */ diff --git a/arch/sh/include/asm/segment.h b/arch/sh/include/asm/segment.h index 33d1d28057cb..02e54a3335d6 100644 --- a/arch/sh/include/asm/segment.h +++ b/arch/sh/include/asm/segment.h @@ -24,8 +24,7 @@ typedef struct { #define USER_DS KERNEL_DS #endif -#define segment_eq(a, b) ((a).seg == (b).seg) - +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define get_fs() (current_thread_info()->addr_limit) #define set_fs(x) (current_thread_info()->addr_limit = (x)) diff --git a/arch/sh/include/asm/sparsemem.h b/arch/sh/include/asm/sparsemem.h index 4eb899751e45..084706bb8cca 100644 --- a/arch/sh/include/asm/sparsemem.h +++ b/arch/sh/include/asm/sparsemem.h @@ -5,11 +5,9 @@ #ifdef __KERNEL__ /* * SECTION_SIZE_BITS 2^N: how big each section will be - * MAX_PHYSADDR_BITS 2^N: how much physical address space we have - * MAX_PHYSMEM_BITS 2^N: how much memory we can have in that space + * MAX_PHYSMEM_BITS 2^N: how much physical address space we have */ #define SECTION_SIZE_BITS 26 -#define MAX_PHYSADDR_BITS 32 #define MAX_PHYSMEM_BITS 32 #endif diff --git a/arch/sh/kernel/traps_32.c b/arch/sh/kernel/traps_32.c index 058c6181bb30..b62ad0ba2395 100644 --- a/arch/sh/kernel/traps_32.c +++ b/arch/sh/kernel/traps_32.c @@ -482,8 +482,6 @@ asmlinkage void do_address_error(struct pt_regs *regs, error_code = lookup_exception_vector(); #endif - oldfs = get_fs(); - if (user_mode(regs)) { int si_code = BUS_ADRERR; unsigned int user_action; @@ -491,13 +489,13 @@ asmlinkage void do_address_error(struct pt_regs *regs, local_irq_enable(); inc_unaligned_user_access(); - set_fs(USER_DS); + oldfs = force_uaccess_begin(); if (copy_from_user(&instruction, (insn_size_t *)(regs->pc & ~1), sizeof(instruction))) { - set_fs(oldfs); + force_uaccess_end(oldfs); goto uspace_segv; } - set_fs(oldfs); + force_uaccess_end(oldfs); /* shout about userspace fixups */ unaligned_fixups_notify(current, instruction, regs); @@ -520,11 +518,11 @@ fixup: goto uspace_segv; } - set_fs(USER_DS); + oldfs = force_uaccess_begin(); tmp = handle_unaligned_access(instruction, regs, &user_mem_access, 0, address); - set_fs(oldfs); + force_uaccess_end(oldfs); if (tmp == 0) return; /* sorted */ diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c index fbe1f2fe9a8c..482668a2f6d3 100644 --- a/arch/sh/mm/fault.c +++ b/arch/sh/mm/fault.c @@ -482,22 +482,13 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (unlikely(fault & (VM_FAULT_RETRY | VM_FAULT_ERROR))) if (mm_fault_error(regs, error_code, address, fault)) return; if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, - regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, - regs, address); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c index 613de8096335..cd1379360f08 100644 --- a/arch/sh/mm/init.c +++ b/arch/sh/mm/init.c @@ -425,15 +425,6 @@ int arch_add_memory(int nid, u64 start, u64 size, return ret; } -#ifdef CONFIG_NUMA -int memory_add_physaddr_to_nid(u64 addr) -{ - /* Node 0 for now.. */ - return 0; -} -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); -#endif - void arch_remove_memory(int nid, u64 start, u64 size, struct vmem_altmap *altmap) { diff --git a/arch/sparc/include/asm/sparsemem.h b/arch/sparc/include/asm/sparsemem.h index 1dd1b61432db..aa9a676bc341 100644 --- a/arch/sparc/include/asm/sparsemem.h +++ b/arch/sparc/include/asm/sparsemem.h @@ -7,7 +7,6 @@ #include <asm/page.h> #define SECTION_SIZE_BITS 30 -#define MAX_PHYSADDR_BITS MAX_PHYS_ADDRESS_BITS #define MAX_PHYSMEM_BITS MAX_PHYS_ADDRESS_BITS #endif /* !(__KERNEL__) */ diff --git a/arch/sparc/include/asm/uaccess_32.h b/arch/sparc/include/asm/uaccess_32.h index d6d8413eca83..0a2d3ebc4bb8 100644 --- a/arch/sparc/include/asm/uaccess_32.h +++ b/arch/sparc/include/asm/uaccess_32.h @@ -28,7 +28,7 @@ #define get_fs() (current->thread.current_ds) #define set_fs(val) ((current->thread.current_ds) = (val)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) /* We have there a nice not-mapped page at PAGE_OFFSET - PAGE_SIZE, so that this test * can be fairly lightweight. diff --git a/arch/sparc/include/asm/uaccess_64.h b/arch/sparc/include/asm/uaccess_64.h index bf9d330073b2..698cf69f74e9 100644 --- a/arch/sparc/include/asm/uaccess_64.h +++ b/arch/sparc/include/asm/uaccess_64.h @@ -32,7 +32,7 @@ #define get_fs() ((mm_segment_t){(current_thread_info()->current_ds)}) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define set_fs(val) \ do { \ diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c index cfef656eda0f..8071bfd72349 100644 --- a/arch/sparc/mm/fault_32.c +++ b/arch/sparc/mm/fault_32.c @@ -234,7 +234,7 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -250,15 +250,6 @@ good_area: } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, address); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, address); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -410,7 +401,7 @@ good_area: if (!(vma->vm_flags & (VM_READ | VM_EXEC))) goto bad_area; } - switch (handle_mm_fault(vma, address, flags)) { + switch (handle_mm_fault(vma, address, flags, NULL)) { case VM_FAULT_SIGBUS: case VM_FAULT_OOM: goto do_sigbus; diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c index a3806614e4dc..0a6bcc85fba7 100644 --- a/arch/sparc/mm/fault_64.c +++ b/arch/sparc/mm/fault_64.c @@ -422,7 +422,7 @@ good_area: goto bad_area; } - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) goto exit_exception; @@ -438,15 +438,6 @@ good_area: } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) { - current->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, - 1, regs, address); - } else { - current->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, - 1, regs, address); - } if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c index 2b3afa354a90..ad12f78bda7e 100644 --- a/arch/um/kernel/trap.c +++ b/arch/um/kernel/trap.c @@ -71,7 +71,7 @@ good_area: do { vm_fault_t fault; - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, NULL); if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) goto out_nosemaphore; @@ -88,10 +88,6 @@ good_area: BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h index 2f3e8f2a958f..ecefaffd15d4 100644 --- a/arch/x86/include/asm/uaccess.h +++ b/arch/x86/include/asm/uaccess.h @@ -33,7 +33,7 @@ static inline void set_fs(mm_segment_t fs) set_thread_flag(TIF_FSCHECK); } -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define user_addr_max() (current->thread.addr_limit.seg) /* diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 0c7643d9f7cb..35f1498e9832 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1139,7 +1139,7 @@ void do_user_addr_fault(struct pt_regs *regs, struct vm_area_struct *vma; struct task_struct *tsk; struct mm_struct *mm; - vm_fault_t fault, major = 0; + vm_fault_t fault; unsigned int flags = FAULT_FLAG_DEFAULT; tsk = current; @@ -1291,8 +1291,7 @@ good_area: * userland). The return to userland is identified whenever * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags. */ - fault = handle_mm_fault(vma, address, flags); - major |= fault & VM_FAULT_MAJOR; + fault = handle_mm_fault(vma, address, flags, regs); /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { @@ -1319,18 +1318,6 @@ good_area: return; } - /* - * Major/minor page fault accounting. If any of the events - * returned VM_FAULT_MAJOR, we account it as a major fault. - */ - if (major) { - tsk->maj_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); - } else { - tsk->min_flt++; - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - } - check_v8086_mode(regs, address, tsk); } NOKPROBE_SYMBOL(do_user_addr_fault); diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index 3b246ae40c8f..a4ac13cc3fdc 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1452,6 +1452,15 @@ static unsigned long probe_memory_block_size(void) goto done; } + /* + * Use max block size to minimize overhead on bare metal, where + * alignment for memory hotplug isn't a concern. + */ + if (!boot_cpu_has(X86_FEATURE_HYPERVISOR)) { + bz = MAX_BLOCK_SIZE; + goto done; + } + /* Find the largest allowed block size that aligns to memory end */ for (bz = MAX_BLOCK_SIZE; bz > MIN_MEMORY_BLOCK_SIZE; bz >>= 1) { if (IS_ALIGNED(boot_mem_end, bz)) diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c index b05f45e5e8e2..aa76ec2d359b 100644 --- a/arch/x86/mm/numa.c +++ b/arch/x86/mm/numa.c @@ -929,5 +929,4 @@ int memory_add_physaddr_to_nid(u64 start) nid = numa_meminfo.blk[0].nid; return nid; } -EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); #endif diff --git a/arch/xtensa/include/asm/uaccess.h b/arch/xtensa/include/asm/uaccess.h index e57f0d0a88d8..b9758119feca 100644 --- a/arch/xtensa/include/asm/uaccess.h +++ b/arch/xtensa/include/asm/uaccess.h @@ -35,7 +35,7 @@ #define get_fs() (current->thread.current_ds) #define set_fs(val) (current->thread.current_ds = (val)) -#define segment_eq(a, b) ((a).seg == (b).seg) +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #define __kernel_ok (uaccess_kernel()) #define __user_ok(addr, size) \ diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c index c128dcc7c85b..7666408ce12a 100644 --- a/arch/xtensa/mm/fault.c +++ b/arch/xtensa/mm/fault.c @@ -72,6 +72,9 @@ void do_page_fault(struct pt_regs *regs) if (user_mode(regs)) flags |= FAULT_FLAG_USER; + + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); + retry: mmap_read_lock(mm); vma = find_vma(mm, address); @@ -107,7 +110,7 @@ good_area: * make sure we exit gracefully rather than endlessly redo * the fault. */ - fault = handle_mm_fault(vma, address, flags); + fault = handle_mm_fault(vma, address, flags, regs); if (fault_signal_pending(fault, regs)) return; @@ -122,10 +125,6 @@ good_area: BUG(); } if (flags & FAULT_FLAG_ALLOW_RETRY) { - if (fault & VM_FAULT_MAJOR) - current->maj_flt++; - else - current->min_flt++; if (fault & VM_FAULT_RETRY) { flags |= FAULT_FLAG_TRIED; @@ -139,12 +138,6 @@ good_area: } mmap_read_unlock(mm); - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address); - if (flags & VM_FAULT_MAJOR) - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); - else - perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); - return; /* Something tried to access memory that isn't in our memory map.. diff --git a/drivers/firmware/arm_sdei.c b/drivers/firmware/arm_sdei.c index e7e36aab2386..b4b9ce97f415 100644 --- a/drivers/firmware/arm_sdei.c +++ b/drivers/firmware/arm_sdei.c @@ -1136,15 +1136,14 @@ int sdei_event_handler(struct pt_regs *regs, * access kernel memory. * Do the same here because this doesn't come via the same entry code. */ - orig_addr_limit = get_fs(); - set_fs(USER_DS); + orig_addr_limit = force_uaccess_begin(); err = arg->callback(event_num, regs, arg->callback_arg); if (err) pr_err_ratelimited("event %u on CPU %u failed with error: %d\n", event_num, smp_processor_id(), err); - set_fs(orig_addr_limit); + force_uaccess_end(orig_addr_limit); return err; } diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c index e946032b13e4..2c2bf24140c9 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c @@ -469,7 +469,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work) locked = 1; } ret = pin_user_pages_remote - (work->task, mm, + (mm, obj->userptr.ptr + pinned * PAGE_SIZE, npages - pinned, flags, diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c index 5e32f61a2fe4..cc6b4befde7c 100644 --- a/drivers/infiniband/core/umem_odp.c +++ b/drivers/infiniband/core/umem_odp.c @@ -439,7 +439,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt, * complex (and doesn't gain us much performance in most use * cases). */ - npages = get_user_pages_remote(owning_process, owning_mm, + npages = get_user_pages_remote(owning_mm, user_virt, gup_num_pages, flags, local_page_list, NULL, NULL); mmap_read_unlock(owning_mm); diff --git a/drivers/iommu/amd/iommu_v2.c b/drivers/iommu/amd/iommu_v2.c index e4b025c5637c..c259108ab6dd 100644 --- a/drivers/iommu/amd/iommu_v2.c +++ b/drivers/iommu/amd/iommu_v2.c @@ -495,7 +495,7 @@ static void do_fault(struct work_struct *work) if (access_error(vma, fault)) goto out; - ret = handle_mm_fault(vma, address, flags); + ret = handle_mm_fault(vma, address, flags, NULL); out: mmap_read_unlock(mm); diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 442623ac4b47..95c3164a2302 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -962,7 +962,8 @@ static irqreturn_t prq_event_thread(int irq, void *d) goto invalid; ret = handle_mm_fault(vma, address, - req->wr_req ? FAULT_FLAG_WRITE : 0); + req->wr_req ? FAULT_FLAG_WRITE : 0, + NULL); if (ret & VM_FAULT_ERROR) goto invalid; diff --git a/drivers/rapidio/devices/rio_mport_cdev.c b/drivers/rapidio/devices/rio_mport_cdev.c index 451608e960a1..c07ceec3c6d4 100644 --- a/drivers/rapidio/devices/rio_mport_cdev.c +++ b/drivers/rapidio/devices/rio_mport_cdev.c @@ -981,7 +981,7 @@ static int rio_mport_transfer_ioctl(struct file *filp, void __user *arg) if (unlikely(copy_from_user(transfer, (void __user *)(uintptr_t)transaction.block, - transaction.count * sizeof(*transfer)))) { + array_size(sizeof(*transfer), transaction.count)))) { ret = -EFAULT; goto out_free; } @@ -994,7 +994,7 @@ static int rio_mport_transfer_ioctl(struct file *filp, void __user *arg) if (unlikely(copy_to_user((void __user *)(uintptr_t)transaction.block, transfer, - transaction.count * sizeof(*transfer)))) + array_size(sizeof(*transfer), transaction.count)))) ret = -EFAULT; out_free: @@ -1710,8 +1710,7 @@ static int rio_mport_add_riodev(struct mport_cdev_priv *priv, if (rval & RIO_PEF_SWITCH) { rio_mport_read_config_32(mport, destid, hopcount, RIO_SWP_INFO_CAR, &swpinfo); - size += (RIO_GET_TOTAL_PORTS(swpinfo) * - sizeof(rswitch->nextdev[0])) + sizeof(*rswitch); + size += struct_size(rswitch, nextdev, RIO_GET_TOTAL_PORTS(swpinfo)); } rdev = kzalloc(size, GFP_KERNEL); diff --git a/drivers/rapidio/rio-scan.c b/drivers/rapidio/rio-scan.c index eb8ed28533f8..19b0c33f4a62 100644 --- a/drivers/rapidio/rio-scan.c +++ b/drivers/rapidio/rio-scan.c @@ -330,7 +330,7 @@ static struct rio_dev *rio_setup_device(struct rio_net *net, size_t size; u32 swpinfo = 0; - size = sizeof(struct rio_dev); + size = sizeof(*rdev); if (rio_mport_read_config_32(port, destid, hopcount, RIO_PEF_CAR, &result)) return NULL; @@ -338,10 +338,8 @@ static struct rio_dev *rio_setup_device(struct rio_net *net, if (result & (RIO_PEF_SWITCH | RIO_PEF_MULTIPORT)) { rio_mport_read_config_32(port, destid, hopcount, RIO_SWP_INFO_CAR, &swpinfo); - if (result & RIO_PEF_SWITCH) { - size += (RIO_GET_TOTAL_PORTS(swpinfo) * - sizeof(rswitch->nextdev[0])) + sizeof(*rswitch); - } + if (result & RIO_PEF_SWITCH) + size += struct_size(rswitch, nextdev, RIO_GET_TOTAL_PORTS(swpinfo)); } rdev = kzalloc(size, GFP_KERNEL); diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 5e556ac9102a..9d41105bfd01 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -425,7 +425,7 @@ static int follow_fault_pfn(struct vm_area_struct *vma, struct mm_struct *mm, if (ret) { bool unlocked = false; - ret = fixup_user_fault(NULL, mm, vaddr, + ret = fixup_user_fault(mm, vaddr, FAULT_FLAG_REMOTE | (write_fault ? FAULT_FLAG_WRITE : 0), &unlocked); @@ -453,7 +453,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, flags |= FOLL_WRITE; mmap_read_lock(mm); - ret = pin_user_pages_remote(NULL, mm, vaddr, 1, flags | FOLL_LONGTERM, + ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM, page, NULL, NULL); if (ret == 1) { *pfn = page_to_pfn(page[0]); diff --git a/fs/coredump.c b/fs/coredump.c index 7237f07ff6be..76e7c10edfc0 100644 --- a/fs/coredump.c +++ b/fs/coredump.c @@ -153,10 +153,10 @@ int cn_esc_printf(struct core_name *cn, const char *fmt, ...) return ret; } -static int cn_print_exe_file(struct core_name *cn) +static int cn_print_exe_file(struct core_name *cn, bool name_only) { struct file *exe_file; - char *pathbuf, *path; + char *pathbuf, *path, *ptr; int ret; exe_file = get_mm_exe_file(current->mm); @@ -175,6 +175,11 @@ static int cn_print_exe_file(struct core_name *cn) goto free_buf; } + if (name_only) { + ptr = strrchr(path, '/'); + if (ptr) + path = ptr + 1; + } ret = cn_esc_printf(cn, "%s", path); free_buf: @@ -301,12 +306,16 @@ static int format_corename(struct core_name *cn, struct coredump_params *cprm, utsname()->nodename); up_read(&uts_sem); break; - /* executable */ + /* executable, could be changed by prctl PR_SET_NAME etc */ case 'e': err = cn_esc_printf(cn, "%s", current->comm); break; + /* file name of executable */ + case 'f': + err = cn_print_exe_file(cn, true); + break; case 'E': - err = cn_print_exe_file(cn); + err = cn_print_exe_file(cn, false); break; /* core limit size */ case 'c': diff --git a/fs/exec.c b/fs/exec.c index 3698252719a3..a91003e28eaa 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -141,12 +141,14 @@ SYSCALL_DEFINE1(uselib, const char __user *, library) if (IS_ERR(file)) goto out; - error = -EINVAL; - if (!S_ISREG(file_inode(file)->i_mode)) - goto exit; - + /* + * may_open() has already checked for this, so it should be + * impossible to trip now. But we need to be extra cautious + * and check again at the very end too. + */ error = -EACCES; - if (path_noexec(&file->f_path)) + if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) || + path_noexec(&file->f_path))) goto exit; fsnotify_open(file); @@ -215,7 +217,7 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos, * We are doing an exec(). 'current' is the process * doing the exec and bprm->mm is the new process's mm. */ - ret = get_user_pages_remote(current, bprm->mm, pos, 1, gup_flags, + ret = get_user_pages_remote(bprm->mm, pos, 1, gup_flags, &page, NULL, NULL); if (ret <= 0) return NULL; @@ -909,11 +911,14 @@ static struct file *do_open_execat(int fd, struct filename *name, int flags) if (IS_ERR(file)) goto out; + /* + * may_open() has already checked for this, so it should be + * impossible to trip now. But we need to be extra cautious + * and check again at the very end too. + */ err = -EACCES; - if (!S_ISREG(file_inode(file)->i_mode)) - goto exit; - - if (path_noexec(&file->f_path)) + if (WARN_ON_ONCE(!S_ISREG(file_inode(file)->i_mode) || + path_noexec(&file->f_path))) goto exit; err = deny_write_access(file); @@ -1402,7 +1407,12 @@ int begin_new_exec(struct linux_binprm * bprm) if (retval) goto out_unlock; - set_fs(USER_DS); + /* + * Ensure that the uaccess routines can actually operate on userspace + * pointers: + */ + force_uaccess_begin(); + me->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD | PF_NOFREEZE | PF_NO_SETAFFINITY); flush_thread(); diff --git a/fs/fat/Kconfig b/fs/fat/Kconfig index ca31993dcb47..66532a71e8fd 100644 --- a/fs/fat/Kconfig +++ b/fs/fat/Kconfig @@ -41,7 +41,7 @@ config MSDOS_FS they are compressed; to access compressed MSDOS partitions under Linux, you can either use the DOS emulator DOSEMU, described in the DOSEMU-HOWTO, available from - <http://www.tldp.org/docs.html#howto>, or try dmsdosfs in + <https://www.tldp.org/docs.html#howto>, or try dmsdosfs in <ftp://ibiblio.org/pub/Linux/system/filesystems/dosfs/>. If you intend to use dosemu with a non-compressed MSDOS partition, say Y here) and MSDOS floppies. This means that file access becomes diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c index bbfe18c07417..f7e3304b7802 100644 --- a/fs/fat/fatent.c +++ b/fs/fat/fatent.c @@ -657,6 +657,9 @@ static void fat_ra_init(struct super_block *sb, struct fatent_ra *ra, unsigned long ra_pages = sb->s_bdi->ra_pages; unsigned int reada_blocks; + if (fatent->entry >= ent_limit) + return; + if (ra_pages > sb->s_bdi->io_pages) ra_pages = rounddown(ra_pages, sb->s_bdi->io_pages); reada_blocks = ra_pages << (PAGE_SHIFT - sb->s_blocksize_bits + 1); diff --git a/fs/fat/file.c b/fs/fat/file.c index 42134c58c87e..f9ee27cf4d7c 100644 --- a/fs/fat/file.c +++ b/fs/fat/file.c @@ -25,9 +25,9 @@ static int fat_ioctl_get_attributes(struct inode *inode, u32 __user *user_attr) { u32 attr; - inode_lock(inode); + inode_lock_shared(inode); attr = fat_make_attrs(inode); - inode_unlock(inode); + inode_unlock_shared(inode); return put_user(attr, user_attr); } diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 523954d00dff..b5c109703daa 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -1364,6 +1364,12 @@ hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc) sb->s_magic = HUGETLBFS_MAGIC; sb->s_op = &hugetlbfs_ops; sb->s_time_gran = 1; + + /* + * Due to the special and limited functionality of hugetlbfs, it does + * not work well as a stacking filesystem. + */ + sb->s_stack_depth = FILESYSTEM_MAX_STACK_DEPTH; sb->s_root = d_make_root(hugetlbfs_get_root(sb, ctx)); if (!sb->s_root) goto out_free; diff --git a/fs/minix/inode.c b/fs/minix/inode.c index 7cb5fd38eb14..7b09a9158e40 100644 --- a/fs/minix/inode.c +++ b/fs/minix/inode.c @@ -150,6 +150,25 @@ static int minix_remount (struct super_block * sb, int * flags, char * data) return 0; } +static bool minix_check_superblock(struct super_block *sb) +{ + struct minix_sb_info *sbi = minix_sb(sb); + + if (sbi->s_imap_blocks == 0 || sbi->s_zmap_blocks == 0) + return false; + + /* + * s_max_size must not exceed the block mapping limitation. This check + * is only needed for V1 filesystems, since V2/V3 support an extra level + * of indirect blocks which places the limit well above U32_MAX. + */ + if (sbi->s_version == MINIX_V1 && + sb->s_maxbytes > (7 + 512 + 512*512) * BLOCK_SIZE) + return false; + + return true; +} + static int minix_fill_super(struct super_block *s, void *data, int silent) { struct buffer_head *bh; @@ -185,7 +204,7 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) sbi->s_zmap_blocks = ms->s_zmap_blocks; sbi->s_firstdatazone = ms->s_firstdatazone; sbi->s_log_zone_size = ms->s_log_zone_size; - sbi->s_max_size = ms->s_max_size; + s->s_maxbytes = ms->s_max_size; s->s_magic = ms->s_magic; if (s->s_magic == MINIX_SUPER_MAGIC) { sbi->s_version = MINIX_V1; @@ -216,7 +235,7 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) sbi->s_zmap_blocks = m3s->s_zmap_blocks; sbi->s_firstdatazone = m3s->s_firstdatazone; sbi->s_log_zone_size = m3s->s_log_zone_size; - sbi->s_max_size = m3s->s_max_size; + s->s_maxbytes = m3s->s_max_size; sbi->s_ninodes = m3s->s_ninodes; sbi->s_nzones = m3s->s_zones; sbi->s_dirsize = 64; @@ -228,11 +247,12 @@ static int minix_fill_super(struct super_block *s, void *data, int silent) } else goto out_no_fs; + if (!minix_check_superblock(s)) + goto out_illegal_sb; + /* * Allocate the buffer map to keep the superblock small. */ - if (sbi->s_imap_blocks == 0 || sbi->s_zmap_blocks == 0) - goto out_illegal_sb; i = (sbi->s_imap_blocks + sbi->s_zmap_blocks) * sizeof(bh); map = kzalloc(i, GFP_KERNEL); if (!map) @@ -468,6 +488,13 @@ static struct inode *V1_minix_iget(struct inode *inode) iget_failed(inode); return ERR_PTR(-EIO); } + if (raw_inode->i_nlinks == 0) { + printk("MINIX-fs: deleted inode referenced: %lu\n", + inode->i_ino); + brelse(bh); + iget_failed(inode); + return ERR_PTR(-ESTALE); + } inode->i_mode = raw_inode->i_mode; i_uid_write(inode, raw_inode->i_uid); i_gid_write(inode, raw_inode->i_gid); @@ -501,6 +528,13 @@ static struct inode *V2_minix_iget(struct inode *inode) iget_failed(inode); return ERR_PTR(-EIO); } + if (raw_inode->i_nlinks == 0) { + printk("MINIX-fs: deleted inode referenced: %lu\n", + inode->i_ino); + brelse(bh); + iget_failed(inode); + return ERR_PTR(-ESTALE); + } inode->i_mode = raw_inode->i_mode; i_uid_write(inode, raw_inode->i_uid); i_gid_write(inode, raw_inode->i_gid); diff --git a/fs/minix/itree_common.c b/fs/minix/itree_common.c index 043c3fdbc8e7..446148792f41 100644 --- a/fs/minix/itree_common.c +++ b/fs/minix/itree_common.c @@ -75,6 +75,7 @@ static int alloc_branch(struct inode *inode, int n = 0; int i; int parent = minix_new_block(inode); + int err = -ENOSPC; branch[0].key = cpu_to_block(parent); if (parent) for (n = 1; n < num; n++) { @@ -85,6 +86,11 @@ static int alloc_branch(struct inode *inode, break; branch[n].key = cpu_to_block(nr); bh = sb_getblk(inode->i_sb, parent); + if (!bh) { + minix_free_block(inode, nr); + err = -ENOMEM; + break; + } lock_buffer(bh); memset(bh->b_data, 0, bh->b_size); branch[n].bh = bh; @@ -103,7 +109,7 @@ static int alloc_branch(struct inode *inode, bforget(branch[i].bh); for (i = 0; i < n; i++) minix_free_block(inode, block_to_cpu(branch[i].key)); - return -ENOSPC; + return err; } static inline int splice_branch(struct inode *inode, diff --git a/fs/minix/itree_v1.c b/fs/minix/itree_v1.c index 046cc96ee7ad..1fed906042aa 100644 --- a/fs/minix/itree_v1.c +++ b/fs/minix/itree_v1.c @@ -29,12 +29,12 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH]) if (block < 0) { printk("MINIX-fs: block_to_path: block %ld < 0 on dev %pg\n", block, inode->i_sb->s_bdev); - } else if (block >= (minix_sb(inode->i_sb)->s_max_size/BLOCK_SIZE)) { - if (printk_ratelimit()) - printk("MINIX-fs: block_to_path: " - "block %ld too big on dev %pg\n", - block, inode->i_sb->s_bdev); - } else if (block < 7) { + return 0; + } + if ((u64)block * BLOCK_SIZE >= inode->i_sb->s_maxbytes) + return 0; + + if (block < 7) { offsets[n++] = block; } else if ((block -= 7) < 512) { offsets[n++] = 7; diff --git a/fs/minix/itree_v2.c b/fs/minix/itree_v2.c index f7fc7ecccccc..9d00f31a2d9d 100644 --- a/fs/minix/itree_v2.c +++ b/fs/minix/itree_v2.c @@ -32,13 +32,12 @@ static int block_to_path(struct inode * inode, long block, int offsets[DEPTH]) if (block < 0) { printk("MINIX-fs: block_to_path: block %ld < 0 on dev %pg\n", block, sb->s_bdev); - } else if ((u64)block * (u64)sb->s_blocksize >= - minix_sb(sb)->s_max_size) { - if (printk_ratelimit()) - printk("MINIX-fs: block_to_path: " - "block %ld too big on dev %pg\n", - block, sb->s_bdev); - } else if (block < DIRCOUNT) { + return 0; + } + if ((u64)block * (u64)sb->s_blocksize >= sb->s_maxbytes) + return 0; + + if (block < DIRCOUNT) { offsets[n++] = block; } else if ((block -= DIRCOUNT) < INDIRCOUNT(sb)) { offsets[n++] = DIRCOUNT; diff --git a/fs/minix/minix.h b/fs/minix/minix.h index df081e8afcc3..168d45d3de73 100644 --- a/fs/minix/minix.h +++ b/fs/minix/minix.h @@ -32,7 +32,6 @@ struct minix_sb_info { unsigned long s_zmap_blocks; unsigned long s_firstdatazone; unsigned long s_log_zone_size; - unsigned long s_max_size; int s_dirsize; int s_namelen; struct buffer_head ** s_imap; diff --git a/fs/namei.c b/fs/namei.c index 9fa10c614de7..2112e578dccc 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -2849,18 +2849,24 @@ static int may_open(const struct path *path, int acc_mode, int flag) case S_IFLNK: return -ELOOP; case S_IFDIR: - if (acc_mode & MAY_WRITE) + if (acc_mode & (MAY_WRITE | MAY_EXEC)) return -EISDIR; break; case S_IFBLK: case S_IFCHR: if (!may_open_dev(path)) return -EACCES; - /*FALLTHRU*/ + fallthrough; case S_IFIFO: case S_IFSOCK: + if (acc_mode & MAY_EXEC) + return -EACCES; flag &= ~O_TRUNC; break; + case S_IFREG: + if ((acc_mode & MAY_EXEC) && path_noexec(path)) + return -EACCES; + break; } error = inode_permission(inode, MAY_OPEN | acc_mode); diff --git a/fs/nilfs2/alloc.c b/fs/nilfs2/alloc.c index 235b959fc2b3..adf3bb0a8048 100644 --- a/fs/nilfs2/alloc.c +++ b/fs/nilfs2/alloc.c @@ -613,10 +613,10 @@ void nilfs_palloc_commit_free_entry(struct inode *inode, lock = nilfs_mdt_bgl_lock(inode, group); if (!nilfs_clear_bit_atomic(lock, group_offset, bitmap)) - nilfs_msg(inode->i_sb, KERN_WARNING, - "%s (ino=%lu): entry number %llu already freed", - __func__, inode->i_ino, - (unsigned long long)req->pr_entry_nr); + nilfs_warn(inode->i_sb, + "%s (ino=%lu): entry number %llu already freed", + __func__, inode->i_ino, + (unsigned long long)req->pr_entry_nr); else nilfs_palloc_group_desc_add_entries(desc, lock, 1); @@ -654,10 +654,10 @@ void nilfs_palloc_abort_alloc_entry(struct inode *inode, lock = nilfs_mdt_bgl_lock(inode, group); if (!nilfs_clear_bit_atomic(lock, group_offset, bitmap)) - nilfs_msg(inode->i_sb, KERN_WARNING, - "%s (ino=%lu): entry number %llu already freed", - __func__, inode->i_ino, - (unsigned long long)req->pr_entry_nr); + nilfs_warn(inode->i_sb, + "%s (ino=%lu): entry number %llu already freed", + __func__, inode->i_ino, + (unsigned long long)req->pr_entry_nr); else nilfs_palloc_group_desc_add_entries(desc, lock, 1); @@ -763,10 +763,10 @@ int nilfs_palloc_freev(struct inode *inode, __u64 *entry_nrs, size_t nitems) do { if (!nilfs_clear_bit_atomic(lock, group_offset, bitmap)) { - nilfs_msg(inode->i_sb, KERN_WARNING, - "%s (ino=%lu): entry number %llu already freed", - __func__, inode->i_ino, - (unsigned long long)entry_nrs[j]); + nilfs_warn(inode->i_sb, + "%s (ino=%lu): entry number %llu already freed", + __func__, inode->i_ino, + (unsigned long long)entry_nrs[j]); } else { n++; } @@ -808,10 +808,10 @@ int nilfs_palloc_freev(struct inode *inode, __u64 *entry_nrs, size_t nitems) ret = nilfs_palloc_delete_entry_block(inode, last_nrs[k]); if (ret && ret != -ENOENT) - nilfs_msg(inode->i_sb, KERN_WARNING, - "error %d deleting block that object (entry=%llu, ino=%lu) belongs to", - ret, (unsigned long long)last_nrs[k], - inode->i_ino); + nilfs_warn(inode->i_sb, + "error %d deleting block that object (entry=%llu, ino=%lu) belongs to", + ret, (unsigned long long)last_nrs[k], + inode->i_ino); } desc_kaddr = kmap_atomic(desc_bh->b_page); @@ -826,9 +826,9 @@ int nilfs_palloc_freev(struct inode *inode, __u64 *entry_nrs, size_t nitems) if (nfree == nilfs_palloc_entries_per_group(inode)) { ret = nilfs_palloc_delete_bitmap_block(inode, group); if (ret && ret != -ENOENT) - nilfs_msg(inode->i_sb, KERN_WARNING, - "error %d deleting bitmap block of group=%lu, ino=%lu", - ret, group, inode->i_ino); + nilfs_warn(inode->i_sb, + "error %d deleting bitmap block of group=%lu, ino=%lu", + ret, group, inode->i_ino); } } return 0; diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c index 23e043eca237..f42ab57201e7 100644 --- a/fs/nilfs2/btree.c +++ b/fs/nilfs2/btree.c @@ -351,10 +351,10 @@ static int nilfs_btree_node_broken(const struct nilfs_btree_node *node, (flags & NILFS_BTREE_NODE_ROOT) || nchildren < 0 || nchildren > NILFS_BTREE_NODE_NCHILDREN_MAX(size))) { - nilfs_msg(inode->i_sb, KERN_CRIT, - "bad btree node (ino=%lu, blocknr=%llu): level = %d, flags = 0x%x, nchildren = %d", - inode->i_ino, (unsigned long long)blocknr, level, - flags, nchildren); + nilfs_crit(inode->i_sb, + "bad btree node (ino=%lu, blocknr=%llu): level = %d, flags = 0x%x, nchildren = %d", + inode->i_ino, (unsigned long long)blocknr, level, + flags, nchildren); ret = 1; } return ret; @@ -381,9 +381,9 @@ static int nilfs_btree_root_broken(const struct nilfs_btree_node *node, level >= NILFS_BTREE_LEVEL_MAX || nchildren < 0 || nchildren > NILFS_BTREE_ROOT_NCHILDREN_MAX)) { - nilfs_msg(inode->i_sb, KERN_CRIT, - "bad btree root (ino=%lu): level = %d, flags = 0x%x, nchildren = %d", - inode->i_ino, level, flags, nchildren); + nilfs_crit(inode->i_sb, + "bad btree root (ino=%lu): level = %d, flags = 0x%x, nchildren = %d", + inode->i_ino, level, flags, nchildren); ret = 1; } return ret; @@ -450,10 +450,10 @@ static int nilfs_btree_bad_node(const struct nilfs_bmap *btree, { if (unlikely(nilfs_btree_node_get_level(node) != level)) { dump_stack(); - nilfs_msg(btree->b_inode->i_sb, KERN_CRIT, - "btree level mismatch (ino=%lu): %d != %d", - btree->b_inode->i_ino, - nilfs_btree_node_get_level(node), level); + nilfs_crit(btree->b_inode->i_sb, + "btree level mismatch (ino=%lu): %d != %d", + btree->b_inode->i_ino, + nilfs_btree_node_get_level(node), level); return 1; } return 0; @@ -508,7 +508,7 @@ static int __nilfs_btree_get_block(const struct nilfs_bmap *btree, __u64 ptr, out_no_wait: if (!buffer_uptodate(bh)) { - nilfs_msg(btree->b_inode->i_sb, KERN_ERR, + nilfs_err(btree->b_inode->i_sb, "I/O error reading b-tree node block (ino=%lu, blocknr=%llu)", btree->b_inode->i_ino, (unsigned long long)ptr); brelse(bh); @@ -2074,10 +2074,10 @@ static int nilfs_btree_propagate(struct nilfs_bmap *btree, ret = nilfs_btree_do_lookup(btree, path, key, NULL, level + 1, 0); if (ret < 0) { if (unlikely(ret == -ENOENT)) - nilfs_msg(btree->b_inode->i_sb, KERN_CRIT, - "writing node/leaf block does not appear in b-tree (ino=%lu) at key=%llu, level=%d", - btree->b_inode->i_ino, - (unsigned long long)key, level); + nilfs_crit(btree->b_inode->i_sb, + "writing node/leaf block does not appear in b-tree (ino=%lu) at key=%llu, level=%d", + btree->b_inode->i_ino, + (unsigned long long)key, level); goto out; } @@ -2114,11 +2114,11 @@ static void nilfs_btree_add_dirty_buffer(struct nilfs_bmap *btree, if (level < NILFS_BTREE_LEVEL_NODE_MIN || level >= NILFS_BTREE_LEVEL_MAX) { dump_stack(); - nilfs_msg(btree->b_inode->i_sb, KERN_WARNING, - "invalid btree level: %d (key=%llu, ino=%lu, blocknr=%llu)", - level, (unsigned long long)key, - btree->b_inode->i_ino, - (unsigned long long)bh->b_blocknr); + nilfs_warn(btree->b_inode->i_sb, + "invalid btree level: %d (key=%llu, ino=%lu, blocknr=%llu)", + level, (unsigned long long)key, + btree->b_inode->i_ino, + (unsigned long long)bh->b_blocknr); return; } diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c index 8d41311b5db4..86d4d850d130 100644 --- a/fs/nilfs2/cpfile.c +++ b/fs/nilfs2/cpfile.c @@ -322,7 +322,7 @@ int nilfs_cpfile_delete_checkpoints(struct inode *cpfile, int ret, ncps, nicps, nss, count, i; if (unlikely(start == 0 || start > end)) { - nilfs_msg(cpfile->i_sb, KERN_ERR, + nilfs_err(cpfile->i_sb, "cannot delete checkpoints: invalid range [%llu, %llu)", (unsigned long long)start, (unsigned long long)end); return -EINVAL; @@ -376,7 +376,7 @@ int nilfs_cpfile_delete_checkpoints(struct inode *cpfile, cpfile, cno); if (ret == 0) continue; - nilfs_msg(cpfile->i_sb, KERN_ERR, + nilfs_err(cpfile->i_sb, "error %d deleting checkpoint block", ret); break; @@ -981,12 +981,10 @@ int nilfs_cpfile_read(struct super_block *sb, size_t cpsize, int err; if (cpsize > sb->s_blocksize) { - nilfs_msg(sb, KERN_ERR, - "too large checkpoint size: %zu bytes", cpsize); + nilfs_err(sb, "too large checkpoint size: %zu bytes", cpsize); return -EINVAL; } else if (cpsize < NILFS_MIN_CHECKPOINT_SIZE) { - nilfs_msg(sb, KERN_ERR, - "too small checkpoint size: %zu bytes", cpsize); + nilfs_err(sb, "too small checkpoint size: %zu bytes", cpsize); return -EINVAL; } diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c index 6f4066636be9..8bccdf1158fc 100644 --- a/fs/nilfs2/dat.c +++ b/fs/nilfs2/dat.c @@ -340,11 +340,11 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, sector_t blocknr) kaddr = kmap_atomic(entry_bh->b_page); entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr); if (unlikely(entry->de_blocknr == cpu_to_le64(0))) { - nilfs_msg(dat->i_sb, KERN_CRIT, - "%s: invalid vblocknr = %llu, [%llu, %llu)", - __func__, (unsigned long long)vblocknr, - (unsigned long long)le64_to_cpu(entry->de_start), - (unsigned long long)le64_to_cpu(entry->de_end)); + nilfs_crit(dat->i_sb, + "%s: invalid vblocknr = %llu, [%llu, %llu)", + __func__, (unsigned long long)vblocknr, + (unsigned long long)le64_to_cpu(entry->de_start), + (unsigned long long)le64_to_cpu(entry->de_end)); kunmap_atomic(kaddr); brelse(entry_bh); return -EINVAL; @@ -471,11 +471,11 @@ int nilfs_dat_read(struct super_block *sb, size_t entry_size, int err; if (entry_size > sb->s_blocksize) { - nilfs_msg(sb, KERN_ERR, "too large DAT entry size: %zu bytes", + nilfs_err(sb, "too large DAT entry size: %zu bytes", entry_size); return -EINVAL; } else if (entry_size < NILFS_MIN_DAT_ENTRY_SIZE) { - nilfs_msg(sb, KERN_ERR, "too small DAT entry size: %zu bytes", + nilfs_err(sb, "too small DAT entry size: %zu bytes", entry_size); return -EINVAL; } diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c index 533e24ea3a88..f353101955e3 100644 --- a/fs/nilfs2/direct.c +++ b/fs/nilfs2/direct.c @@ -328,16 +328,18 @@ static int nilfs_direct_assign(struct nilfs_bmap *bmap, key = nilfs_bmap_data_get_key(bmap, *bh); if (unlikely(key > NILFS_DIRECT_KEY_MAX)) { - nilfs_msg(bmap->b_inode->i_sb, KERN_CRIT, - "%s (ino=%lu): invalid key: %llu", __func__, - bmap->b_inode->i_ino, (unsigned long long)key); + nilfs_crit(bmap->b_inode->i_sb, + "%s (ino=%lu): invalid key: %llu", + __func__, + bmap->b_inode->i_ino, (unsigned long long)key); return -EINVAL; } ptr = nilfs_direct_get_ptr(bmap, key); if (unlikely(ptr == NILFS_BMAP_INVALID_PTR)) { - nilfs_msg(bmap->b_inode->i_sb, KERN_CRIT, - "%s (ino=%lu): invalid pointer: %llu", __func__, - bmap->b_inode->i_ino, (unsigned long long)ptr); + nilfs_crit(bmap->b_inode->i_sb, + "%s (ino=%lu): invalid pointer: %llu", + __func__, + bmap->b_inode->i_ino, (unsigned long long)ptr); return -EINVAL; } diff --git a/fs/nilfs2/gcinode.c b/fs/nilfs2/gcinode.c index aa3c328ee189..448320496856 100644 --- a/fs/nilfs2/gcinode.c +++ b/fs/nilfs2/gcinode.c @@ -142,7 +142,7 @@ int nilfs_gccache_wait_and_mark_dirty(struct buffer_head *bh) if (!buffer_uptodate(bh)) { struct inode *inode = bh->b_page->mapping->host; - nilfs_msg(inode->i_sb, KERN_ERR, + nilfs_err(inode->i_sb, "I/O error reading %s block for GC (ino=%lu, vblocknr=%llu)", buffer_nilfs_node(bh) ? "node" : "data", inode->i_ino, (unsigned long long)bh->b_blocknr); diff --git a/fs/nilfs2/ifile.c b/fs/nilfs2/ifile.c index 4140d232cadc..02727ed3a7c6 100644 --- a/fs/nilfs2/ifile.c +++ b/fs/nilfs2/ifile.c @@ -142,8 +142,8 @@ int nilfs_ifile_get_inode_block(struct inode *ifile, ino_t ino, err = nilfs_palloc_get_entry_block(ifile, ino, 0, out_bh); if (unlikely(err)) - nilfs_msg(sb, KERN_WARNING, "error %d reading inode: ino=%lu", - err, (unsigned long)ino); + nilfs_warn(sb, "error %d reading inode: ino=%lu", + err, (unsigned long)ino); return err; } diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c index 28009ec54420..745d371d6fea 100644 --- a/fs/nilfs2/inode.c +++ b/fs/nilfs2/inode.c @@ -104,10 +104,10 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff, * However, the page having this block must * be locked in this case. */ - nilfs_msg(inode->i_sb, KERN_WARNING, - "%s (ino=%lu): a race condition while inserting a data block at offset=%llu", - __func__, inode->i_ino, - (unsigned long long)blkoff); + nilfs_warn(inode->i_sb, + "%s (ino=%lu): a race condition while inserting a data block at offset=%llu", + __func__, inode->i_ino, + (unsigned long long)blkoff); err = 0; } nilfs_transaction_abort(inode->i_sb); @@ -388,7 +388,8 @@ struct inode *nilfs_new_inode(struct inode *dir, umode_t mode) failed_after_creation: clear_nlink(inode); - unlock_new_inode(inode); + if (inode->i_state & I_NEW) + unlock_new_inode(inode); iput(inode); /* * raw_inode will be deleted through * nilfs_evict_inode(). @@ -706,9 +707,8 @@ repeat: goto repeat; failed: - nilfs_msg(ii->vfs_inode.i_sb, KERN_WARNING, - "error %d truncating bmap (ino=%lu)", ret, - ii->vfs_inode.i_ino); + nilfs_warn(ii->vfs_inode.i_sb, "error %d truncating bmap (ino=%lu)", + ret, ii->vfs_inode.i_ino); } void nilfs_truncate(struct inode *inode) @@ -919,9 +919,9 @@ int nilfs_set_file_dirty(struct inode *inode, unsigned int nr_dirty) * This will happen when somebody is freeing * this inode. */ - nilfs_msg(inode->i_sb, KERN_WARNING, - "cannot set file dirty (ino=%lu): the file is being freed", - inode->i_ino); + nilfs_warn(inode->i_sb, + "cannot set file dirty (ino=%lu): the file is being freed", + inode->i_ino); spin_unlock(&nilfs->ns_inode_lock); return -EINVAL; /* * NILFS_I_DIRTY may remain for @@ -942,9 +942,9 @@ int __nilfs_mark_inode_dirty(struct inode *inode, int flags) err = nilfs_load_inode_block(inode, &ibh); if (unlikely(err)) { - nilfs_msg(inode->i_sb, KERN_WARNING, - "cannot mark inode dirty (ino=%lu): error %d loading inode block", - inode->i_ino, err); + nilfs_warn(inode->i_sb, + "cannot mark inode dirty (ino=%lu): error %d loading inode block", + inode->i_ino, err); return err; } nilfs_update_inode(inode, ibh, flags); @@ -970,8 +970,8 @@ void nilfs_dirty_inode(struct inode *inode, int flags) struct nilfs_mdt_info *mdi = NILFS_MDT(inode); if (is_bad_inode(inode)) { - nilfs_msg(inode->i_sb, KERN_WARNING, - "tried to mark bad_inode dirty. ignored."); + nilfs_warn(inode->i_sb, + "tried to mark bad_inode dirty. ignored."); dump_stack(); return; } diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c index 4ba73dbf3e8d..07d26f61f22a 100644 --- a/fs/nilfs2/ioctl.c +++ b/fs/nilfs2/ioctl.c @@ -569,25 +569,25 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode, if (unlikely(ret < 0)) { if (ret == -ENOENT) - nilfs_msg(inode->i_sb, KERN_CRIT, - "%s: invalid virtual block address (%s): ino=%llu, cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu", - __func__, vdesc->vd_flags ? "node" : "data", - (unsigned long long)vdesc->vd_ino, - (unsigned long long)vdesc->vd_cno, - (unsigned long long)vdesc->vd_offset, - (unsigned long long)vdesc->vd_blocknr, - (unsigned long long)vdesc->vd_vblocknr); + nilfs_crit(inode->i_sb, + "%s: invalid virtual block address (%s): ino=%llu, cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu", + __func__, vdesc->vd_flags ? "node" : "data", + (unsigned long long)vdesc->vd_ino, + (unsigned long long)vdesc->vd_cno, + (unsigned long long)vdesc->vd_offset, + (unsigned long long)vdesc->vd_blocknr, + (unsigned long long)vdesc->vd_vblocknr); return ret; } if (unlikely(!list_empty(&bh->b_assoc_buffers))) { - nilfs_msg(inode->i_sb, KERN_CRIT, - "%s: conflicting %s buffer: ino=%llu, cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu", - __func__, vdesc->vd_flags ? "node" : "data", - (unsigned long long)vdesc->vd_ino, - (unsigned long long)vdesc->vd_cno, - (unsigned long long)vdesc->vd_offset, - (unsigned long long)vdesc->vd_blocknr, - (unsigned long long)vdesc->vd_vblocknr); + nilfs_crit(inode->i_sb, + "%s: conflicting %s buffer: ino=%llu, cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu", + __func__, vdesc->vd_flags ? "node" : "data", + (unsigned long long)vdesc->vd_ino, + (unsigned long long)vdesc->vd_cno, + (unsigned long long)vdesc->vd_offset, + (unsigned long long)vdesc->vd_blocknr, + (unsigned long long)vdesc->vd_vblocknr); brelse(bh); return -EEXIST; } @@ -837,8 +837,7 @@ int nilfs_ioctl_prepare_clean_segments(struct the_nilfs *nilfs, return 0; failed: - nilfs_msg(nilfs->ns_sb, KERN_ERR, "error %d preparing GC: %s", ret, - msg); + nilfs_err(nilfs->ns_sb, "error %d preparing GC: %s", ret, msg); return ret; } @@ -947,7 +946,7 @@ static int nilfs_ioctl_clean_segments(struct inode *inode, struct file *filp, ret = nilfs_ioctl_move_blocks(inode->i_sb, &argv[0], kbufs[0]); if (ret < 0) { - nilfs_msg(inode->i_sb, KERN_ERR, + nilfs_err(inode->i_sb, "error %d preparing GC: cannot read source blocks", ret); } else { diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c index 700870a92bc4..c0361ce45f62 100644 --- a/fs/nilfs2/mdt.c +++ b/fs/nilfs2/mdt.c @@ -199,7 +199,7 @@ static int nilfs_mdt_read_block(struct inode *inode, unsigned long block, out_no_wait: err = -EIO; if (!buffer_uptodate(first_bh)) { - nilfs_msg(inode->i_sb, KERN_ERR, + nilfs_err(inode->i_sb, "I/O error reading meta-data file (ino=%lu, block-offset=%lu)", inode->i_ino, block); goto failed_bh; diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c index 9fe6d4ab74f0..a6ec7961d4f5 100644 --- a/fs/nilfs2/namei.c +++ b/fs/nilfs2/namei.c @@ -272,9 +272,9 @@ static int nilfs_do_unlink(struct inode *dir, struct dentry *dentry) goto out; if (!inode->i_nlink) { - nilfs_msg(inode->i_sb, KERN_WARNING, - "deleting nonexistent file (ino=%lu), %d", - inode->i_ino, inode->i_nlink); + nilfs_warn(inode->i_sb, + "deleting nonexistent file (ino=%lu), %d", + inode->i_ino, inode->i_nlink); set_nlink(inode, 1); } err = nilfs_delete_entry(de, page); diff --git a/fs/nilfs2/nilfs.h b/fs/nilfs2/nilfs.h index 42395ba52da6..f8450ee3fd06 100644 --- a/fs/nilfs2/nilfs.h +++ b/fs/nilfs2/nilfs.h @@ -289,9 +289,8 @@ static inline int nilfs_mark_inode_dirty_sync(struct inode *inode) /* super.c */ extern struct inode *nilfs_alloc_inode(struct super_block *); -extern __printf(3, 4) -void __nilfs_msg(struct super_block *sb, const char *level, - const char *fmt, ...); +__printf(2, 3) +void __nilfs_msg(struct super_block *sb, const char *fmt, ...); extern __printf(3, 4) void __nilfs_error(struct super_block *sb, const char *function, const char *fmt, ...); @@ -299,7 +298,7 @@ void __nilfs_error(struct super_block *sb, const char *function, #ifdef CONFIG_PRINTK #define nilfs_msg(sb, level, fmt, ...) \ - __nilfs_msg(sb, level, fmt, ##__VA_ARGS__) + __nilfs_msg(sb, level fmt, ##__VA_ARGS__) #define nilfs_error(sb, fmt, ...) \ __nilfs_error(sb, __func__, fmt, ##__VA_ARGS__) @@ -307,7 +306,7 @@ void __nilfs_error(struct super_block *sb, const char *function, #define nilfs_msg(sb, level, fmt, ...) \ do { \ - no_printk(fmt, ##__VA_ARGS__); \ + no_printk(level fmt, ##__VA_ARGS__); \ (void)(sb); \ } while (0) #define nilfs_error(sb, fmt, ...) \ @@ -318,6 +317,15 @@ void __nilfs_error(struct super_block *sb, const char *function, #endif /* CONFIG_PRINTK */ +#define nilfs_crit(sb, fmt, ...) \ + nilfs_msg(sb, KERN_CRIT, fmt, ##__VA_ARGS__) +#define nilfs_err(sb, fmt, ...) \ + nilfs_msg(sb, KERN_ERR, fmt, ##__VA_ARGS__) +#define nilfs_warn(sb, fmt, ...) \ + nilfs_msg(sb, KERN_WARNING, fmt, ##__VA_ARGS__) +#define nilfs_info(sb, fmt, ...) \ + nilfs_msg(sb, KERN_INFO, fmt, ##__VA_ARGS__) + extern struct nilfs_super_block * nilfs_read_super_block(struct super_block *, u64, int, struct buffer_head **); extern int nilfs_store_magic_and_option(struct super_block *, diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c index d7fc8d369d89..b175f1330408 100644 --- a/fs/nilfs2/page.c +++ b/fs/nilfs2/page.c @@ -391,9 +391,8 @@ void nilfs_clear_dirty_page(struct page *page, bool silent) BUG_ON(!PageLocked(page)); if (!silent) - nilfs_msg(sb, KERN_WARNING, - "discard dirty page: offset=%lld, ino=%lu", - page_offset(page), inode->i_ino); + nilfs_warn(sb, "discard dirty page: offset=%lld, ino=%lu", + page_offset(page), inode->i_ino); ClearPageUptodate(page); ClearPageMappedToDisk(page); @@ -409,9 +408,9 @@ void nilfs_clear_dirty_page(struct page *page, bool silent) do { lock_buffer(bh); if (!silent) - nilfs_msg(sb, KERN_WARNING, - "discard dirty block: blocknr=%llu, size=%zu", - (u64)bh->b_blocknr, bh->b_size); + nilfs_warn(sb, + "discard dirty block: blocknr=%llu, size=%zu", + (u64)bh->b_blocknr, bh->b_size); set_mask_bits(&bh->b_state, clear_bits, 0); unlock_buffer(bh); diff --git a/fs/nilfs2/recovery.c b/fs/nilfs2/recovery.c index 140b663e91c7..0b453ef8fae5 100644 --- a/fs/nilfs2/recovery.c +++ b/fs/nilfs2/recovery.c @@ -51,7 +51,7 @@ static int nilfs_warn_segment_error(struct super_block *sb, int err) switch (err) { case NILFS_SEG_FAIL_IO: - nilfs_msg(sb, KERN_ERR, "I/O error reading segment"); + nilfs_err(sb, "I/O error reading segment"); return -EIO; case NILFS_SEG_FAIL_MAGIC: msg = "Magic number mismatch"; @@ -72,10 +72,10 @@ static int nilfs_warn_segment_error(struct super_block *sb, int err) msg = "No super root in the last segment"; break; default: - nilfs_msg(sb, KERN_ERR, "unrecognized segment error %d", err); + nilfs_err(sb, "unrecognized segment error %d", err); return -EINVAL; } - nilfs_msg(sb, KERN_WARNING, "invalid segment: %s", msg); + nilfs_warn(sb, "invalid segment: %s", msg); return -EINVAL; } @@ -543,10 +543,10 @@ static int nilfs_recover_dsync_blocks(struct the_nilfs *nilfs, put_page(page); failed_inode: - nilfs_msg(sb, KERN_WARNING, - "error %d recovering data block (ino=%lu, block-offset=%llu)", - err, (unsigned long)rb->ino, - (unsigned long long)rb->blkoff); + nilfs_warn(sb, + "error %d recovering data block (ino=%lu, block-offset=%llu)", + err, (unsigned long)rb->ino, + (unsigned long long)rb->blkoff); if (!err2) err2 = err; next: @@ -669,8 +669,7 @@ static int nilfs_do_roll_forward(struct the_nilfs *nilfs, } if (nsalvaged_blocks) { - nilfs_msg(sb, KERN_INFO, "salvaged %lu blocks", - nsalvaged_blocks); + nilfs_info(sb, "salvaged %lu blocks", nsalvaged_blocks); ri->ri_need_recovery = NILFS_RECOVERY_ROLLFORWARD_DONE; } out: @@ -681,7 +680,7 @@ static int nilfs_do_roll_forward(struct the_nilfs *nilfs, confused: err = -EINVAL; failed: - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "error %d roll-forwarding partial segment at blocknr = %llu", err, (unsigned long long)pseg_start); goto out; @@ -703,8 +702,8 @@ static void nilfs_finish_roll_forward(struct the_nilfs *nilfs, set_buffer_dirty(bh); err = sync_dirty_buffer(bh); if (unlikely(err)) - nilfs_msg(nilfs->ns_sb, KERN_WARNING, - "buffer sync write failed during post-cleaning of recovery."); + nilfs_warn(nilfs->ns_sb, + "buffer sync write failed during post-cleaning of recovery."); brelse(bh); } @@ -739,8 +738,7 @@ int nilfs_salvage_orphan_logs(struct the_nilfs *nilfs, err = nilfs_attach_checkpoint(sb, ri->ri_cno, true, &root); if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, - "error %d loading the latest checkpoint", err); + nilfs_err(sb, "error %d loading the latest checkpoint", err); return err; } @@ -751,8 +749,7 @@ int nilfs_salvage_orphan_logs(struct the_nilfs *nilfs, if (ri->ri_need_recovery == NILFS_RECOVERY_ROLLFORWARD_DONE) { err = nilfs_prepare_segment_for_recovery(nilfs, sb, ri); if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, - "error %d preparing segment for recovery", + nilfs_err(sb, "error %d preparing segment for recovery", err); goto failed; } @@ -766,8 +763,7 @@ int nilfs_salvage_orphan_logs(struct the_nilfs *nilfs, nilfs_detach_log_writer(sb); if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, - "error %d writing segment for recovery", + nilfs_err(sb, "error %d writing segment for recovery", err); goto failed; } diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c index 20c479b5e41b..1a8729eded8b 100644 --- a/fs/nilfs2/segbuf.c +++ b/fs/nilfs2/segbuf.c @@ -505,7 +505,7 @@ static int nilfs_segbuf_wait(struct nilfs_segment_buffer *segbuf) } while (--segbuf->sb_nbio > 0); if (unlikely(atomic_read(&segbuf->sb_err) > 0)) { - nilfs_msg(segbuf->sb_super, KERN_ERR, + nilfs_err(segbuf->sb_super, "I/O error writing log (start-blocknr=%llu, block-count=%lu) in segment %llu", (unsigned long long)segbuf->sb_pseg_start, segbuf->sb_sum.nblocks, diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index 91b58c897f92..a651e821c2de 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -158,7 +158,7 @@ static int nilfs_prepare_segment_lock(struct super_block *sb, * it is saved and will be restored on * nilfs_transaction_commit(). */ - nilfs_msg(sb, KERN_WARNING, "journal info from a different FS"); + nilfs_warn(sb, "journal info from a different FS"); save = current->journal_info; } if (!ti) { @@ -1940,9 +1940,9 @@ static int nilfs_segctor_collect_dirty_files(struct nilfs_sc_info *sci, err = nilfs_ifile_get_inode_block( ifile, ii->vfs_inode.i_ino, &ibh); if (unlikely(err)) { - nilfs_msg(sci->sc_super, KERN_WARNING, - "log writer: error %d getting inode block (ino=%lu)", - err, ii->vfs_inode.i_ino); + nilfs_warn(sci->sc_super, + "log writer: error %d getting inode block (ino=%lu)", + err, ii->vfs_inode.i_ino); return err; } spin_lock(&nilfs->ns_inode_lock); @@ -2449,7 +2449,7 @@ int nilfs_clean_segments(struct super_block *sb, struct nilfs_argv *argv, if (likely(!err)) break; - nilfs_msg(sb, KERN_WARNING, "error %d cleaning segments", err); + nilfs_warn(sb, "error %d cleaning segments", err); set_current_state(TASK_INTERRUPTIBLE); schedule_timeout(sci->sc_interval); } @@ -2457,9 +2457,9 @@ int nilfs_clean_segments(struct super_block *sb, struct nilfs_argv *argv, int ret = nilfs_discard_segments(nilfs, sci->sc_freesegs, sci->sc_nfreesegs); if (ret) { - nilfs_msg(sb, KERN_WARNING, - "error %d on discard request, turning discards off for the device", - ret); + nilfs_warn(sb, + "error %d on discard request, turning discards off for the device", + ret); nilfs_clear_opt(nilfs, DISCARD); } } @@ -2540,9 +2540,9 @@ static int nilfs_segctor_thread(void *arg) /* start sync. */ sci->sc_task = current; wake_up(&sci->sc_wait_task); /* for nilfs_segctor_start_thread() */ - nilfs_msg(sci->sc_super, KERN_INFO, - "segctord starting. Construction interval = %lu seconds, CP frequency < %lu seconds", - sci->sc_interval / HZ, sci->sc_mjcp_freq / HZ); + nilfs_info(sci->sc_super, + "segctord starting. Construction interval = %lu seconds, CP frequency < %lu seconds", + sci->sc_interval / HZ, sci->sc_mjcp_freq / HZ); spin_lock(&sci->sc_state_lock); loop: @@ -2616,8 +2616,8 @@ static int nilfs_segctor_start_thread(struct nilfs_sc_info *sci) if (IS_ERR(t)) { int err = PTR_ERR(t); - nilfs_msg(sci->sc_super, KERN_ERR, - "error %d creating segctord thread", err); + nilfs_err(sci->sc_super, "error %d creating segctord thread", + err); return err; } wait_event(sci->sc_wait_task, sci->sc_task != NULL); @@ -2727,14 +2727,14 @@ static void nilfs_segctor_destroy(struct nilfs_sc_info *sci) nilfs_segctor_write_out(sci); if (!list_empty(&sci->sc_dirty_files)) { - nilfs_msg(sci->sc_super, KERN_WARNING, - "disposed unprocessed dirty file(s) when stopping log writer"); + nilfs_warn(sci->sc_super, + "disposed unprocessed dirty file(s) when stopping log writer"); nilfs_dispose_list(nilfs, &sci->sc_dirty_files, 1); } if (!list_empty(&sci->sc_iput_queue)) { - nilfs_msg(sci->sc_super, KERN_WARNING, - "disposed unprocessed inode(s) in iput queue when stopping log writer"); + nilfs_warn(sci->sc_super, + "disposed unprocessed inode(s) in iput queue when stopping log writer"); nilfs_dispose_list(nilfs, &sci->sc_iput_queue, 1); } @@ -2812,8 +2812,8 @@ void nilfs_detach_log_writer(struct super_block *sb) spin_lock(&nilfs->ns_inode_lock); if (!list_empty(&nilfs->ns_dirty_files)) { list_splice_init(&nilfs->ns_dirty_files, &garbage_list); - nilfs_msg(sb, KERN_WARNING, - "disposed unprocessed dirty file(s) when detaching log writer"); + nilfs_warn(sb, + "disposed unprocessed dirty file(s) when detaching log writer"); } spin_unlock(&nilfs->ns_inode_lock); up_write(&nilfs->ns_segctor_sem); diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c index bf3f8f05c89b..42ff67c0c14f 100644 --- a/fs/nilfs2/sufile.c +++ b/fs/nilfs2/sufile.c @@ -171,9 +171,9 @@ int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs, down_write(&NILFS_MDT(sufile)->mi_sem); for (seg = segnumv; seg < segnumv + nsegs; seg++) { if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) { - nilfs_msg(sufile->i_sb, KERN_WARNING, - "%s: invalid segment number: %llu", - __func__, (unsigned long long)*seg); + nilfs_warn(sufile->i_sb, + "%s: invalid segment number: %llu", + __func__, (unsigned long long)*seg); nerr++; } } @@ -230,9 +230,8 @@ int nilfs_sufile_update(struct inode *sufile, __u64 segnum, int create, int ret; if (unlikely(segnum >= nilfs_sufile_get_nsegments(sufile))) { - nilfs_msg(sufile->i_sb, KERN_WARNING, - "%s: invalid segment number: %llu", - __func__, (unsigned long long)segnum); + nilfs_warn(sufile->i_sb, "%s: invalid segment number: %llu", + __func__, (unsigned long long)segnum); return -EINVAL; } down_write(&NILFS_MDT(sufile)->mi_sem); @@ -410,9 +409,8 @@ void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum, kaddr = kmap_atomic(su_bh->b_page); su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr); if (unlikely(!nilfs_segment_usage_clean(su))) { - nilfs_msg(sufile->i_sb, KERN_WARNING, - "%s: segment %llu must be clean", __func__, - (unsigned long long)segnum); + nilfs_warn(sufile->i_sb, "%s: segment %llu must be clean", + __func__, (unsigned long long)segnum); kunmap_atomic(kaddr); return; } @@ -468,9 +466,8 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 segnum, kaddr = kmap_atomic(su_bh->b_page); su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr); if (nilfs_segment_usage_clean(su)) { - nilfs_msg(sufile->i_sb, KERN_WARNING, - "%s: segment %llu is already clean", - __func__, (unsigned long long)segnum); + nilfs_warn(sufile->i_sb, "%s: segment %llu is already clean", + __func__, (unsigned long long)segnum); kunmap_atomic(kaddr); return; } @@ -1168,12 +1165,12 @@ int nilfs_sufile_read(struct super_block *sb, size_t susize, int err; if (susize > sb->s_blocksize) { - nilfs_msg(sb, KERN_ERR, - "too large segment usage size: %zu bytes", susize); + nilfs_err(sb, "too large segment usage size: %zu bytes", + susize); return -EINVAL; } else if (susize < NILFS_MIN_SEGMENT_USAGE_SIZE) { - nilfs_msg(sb, KERN_ERR, - "too small segment usage size: %zu bytes", susize); + nilfs_err(sb, "too small segment usage size: %zu bytes", + susize); return -EINVAL; } diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c index 5729ee86da9a..2eee5fb1a882 100644 --- a/fs/nilfs2/super.c +++ b/fs/nilfs2/super.c @@ -62,19 +62,25 @@ struct kmem_cache *nilfs_btree_path_cache; static int nilfs_setup_super(struct super_block *sb, int is_mount); static int nilfs_remount(struct super_block *sb, int *flags, char *data); -void __nilfs_msg(struct super_block *sb, const char *level, const char *fmt, - ...) +void __nilfs_msg(struct super_block *sb, const char *fmt, ...) { struct va_format vaf; va_list args; + int level; va_start(args, fmt); - vaf.fmt = fmt; + + level = printk_get_level(fmt); + vaf.fmt = printk_skip_level(fmt); vaf.va = &args; + if (sb) - printk("%sNILFS (%s): %pV\n", level, sb->s_id, &vaf); + printk("%c%cNILFS (%s): %pV\n", + KERN_SOH_ASCII, level, sb->s_id, &vaf); else - printk("%sNILFS: %pV\n", level, &vaf); + printk("%c%cNILFS: %pV\n", + KERN_SOH_ASCII, level, &vaf); + va_end(args); } @@ -106,7 +112,7 @@ static void nilfs_set_error(struct super_block *sb) * * This implements the body of nilfs_error() macro. Normally, * nilfs_error() should be used. As for sustainable errors such as a - * single-shot I/O error, nilfs_msg() should be used instead. + * single-shot I/O error, nilfs_err() should be used instead. * * Callers should not add a trailing newline since this will do it. */ @@ -178,8 +184,7 @@ static int nilfs_sync_super(struct super_block *sb, int flag) } if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, "unable to write superblock: err=%d", - err); + nilfs_err(sb, "unable to write superblock: err=%d", err); if (err == -EIO && nilfs->ns_sbh[1]) { /* * sbp[0] points to newer log than sbp[1], @@ -249,7 +254,7 @@ struct nilfs_super_block **nilfs_prepare_super(struct super_block *sb, sbp[1]->s_magic == cpu_to_le16(NILFS_SUPER_MAGIC)) { memcpy(sbp[0], sbp[1], nilfs->ns_sbsize); } else { - nilfs_msg(sb, KERN_CRIT, "superblock broke"); + nilfs_crit(sb, "superblock broke"); return NULL; } } else if (sbp[1] && @@ -359,9 +364,9 @@ static int nilfs_move_2nd_super(struct super_block *sb, loff_t sb2off) offset = sb2off & (nilfs->ns_blocksize - 1); nsbh = sb_getblk(sb, newblocknr); if (!nsbh) { - nilfs_msg(sb, KERN_WARNING, - "unable to move secondary superblock to block %llu", - (unsigned long long)newblocknr); + nilfs_warn(sb, + "unable to move secondary superblock to block %llu", + (unsigned long long)newblocknr); ret = -EIO; goto out; } @@ -524,7 +529,7 @@ int nilfs_attach_checkpoint(struct super_block *sb, __u64 cno, int curr_mnt, up_read(&nilfs->ns_segctor_sem); if (unlikely(err)) { if (err == -ENOENT || err == -EINVAL) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "Invalid checkpoint (checkpoint number=%llu)", (unsigned long long)cno); err = -EINVAL; @@ -622,8 +627,7 @@ static int nilfs_statfs(struct dentry *dentry, struct kstatfs *buf) err = nilfs_ifile_count_free_inodes(root->ifile, &nmaxinodes, &nfreeinodes); if (unlikely(err)) { - nilfs_msg(sb, KERN_WARNING, - "failed to count free inodes: err=%d", err); + nilfs_warn(sb, "failed to count free inodes: err=%d", err); if (err == -ERANGE) { /* * If nilfs_palloc_count_max_entries() returns @@ -755,7 +759,7 @@ static int parse_options(char *options, struct super_block *sb, int is_remount) break; case Opt_snapshot: if (is_remount) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "\"%s\" option is invalid for remount", p); return 0; @@ -771,8 +775,7 @@ static int parse_options(char *options, struct super_block *sb, int is_remount) nilfs_clear_opt(nilfs, DISCARD); break; default: - nilfs_msg(sb, KERN_ERR, - "unrecognized mount option \"%s\"", p); + nilfs_err(sb, "unrecognized mount option \"%s\"", p); return 0; } } @@ -808,10 +811,10 @@ static int nilfs_setup_super(struct super_block *sb, int is_mount) mnt_count = le16_to_cpu(sbp[0]->s_mnt_count); if (nilfs->ns_mount_state & NILFS_ERROR_FS) { - nilfs_msg(sb, KERN_WARNING, "mounting fs with errors"); + nilfs_warn(sb, "mounting fs with errors"); #if 0 } else if (max_mnt_count >= 0 && mnt_count >= max_mnt_count) { - nilfs_msg(sb, KERN_WARNING, "maximal mount count reached"); + nilfs_warn(sb, "maximal mount count reached"); #endif } if (!max_mnt_count) @@ -874,7 +877,7 @@ int nilfs_check_feature_compatibility(struct super_block *sb, features = le64_to_cpu(sbp->s_feature_incompat) & ~NILFS_FEATURE_INCOMPAT_SUPP; if (features) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "couldn't mount because of unsupported optional features (%llx)", (unsigned long long)features); return -EINVAL; @@ -882,7 +885,7 @@ int nilfs_check_feature_compatibility(struct super_block *sb, features = le64_to_cpu(sbp->s_feature_compat_ro) & ~NILFS_FEATURE_COMPAT_RO_SUPP; if (!sb_rdonly(sb) && features) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "couldn't mount RDWR because of unsupported optional features (%llx)", (unsigned long long)features); return -EINVAL; @@ -901,12 +904,12 @@ static int nilfs_get_root_dentry(struct super_block *sb, inode = nilfs_iget(sb, root, NILFS_ROOT_INO); if (IS_ERR(inode)) { ret = PTR_ERR(inode); - nilfs_msg(sb, KERN_ERR, "error %d getting root inode", ret); + nilfs_err(sb, "error %d getting root inode", ret); goto out; } if (!S_ISDIR(inode->i_mode) || !inode->i_blocks || !inode->i_size) { iput(inode); - nilfs_msg(sb, KERN_ERR, "corrupt root inode"); + nilfs_err(sb, "corrupt root inode"); ret = -EINVAL; goto out; } @@ -934,7 +937,7 @@ static int nilfs_get_root_dentry(struct super_block *sb, return ret; failed_dentry: - nilfs_msg(sb, KERN_ERR, "error %d getting root dentry", ret); + nilfs_err(sb, "error %d getting root dentry", ret); goto out; } @@ -954,7 +957,7 @@ static int nilfs_attach_snapshot(struct super_block *s, __u64 cno, ret = (ret == -ENOENT) ? -EINVAL : ret; goto out; } else if (!ret) { - nilfs_msg(s, KERN_ERR, + nilfs_err(s, "The specified checkpoint is not a snapshot (checkpoint number=%llu)", (unsigned long long)cno); ret = -EINVAL; @@ -963,7 +966,7 @@ static int nilfs_attach_snapshot(struct super_block *s, __u64 cno, ret = nilfs_attach_checkpoint(s, cno, false, &root); if (ret) { - nilfs_msg(s, KERN_ERR, + nilfs_err(s, "error %d while loading snapshot (checkpoint number=%llu)", ret, (unsigned long long)cno); goto out; @@ -1060,7 +1063,7 @@ nilfs_fill_super(struct super_block *sb, void *data, int silent) cno = nilfs_last_cno(nilfs); err = nilfs_attach_checkpoint(sb, cno, true, &fsroot); if (err) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "error %d while loading last checkpoint (checkpoint number=%llu)", err, (unsigned long long)cno); goto failed_unload; @@ -1122,8 +1125,8 @@ static int nilfs_remount(struct super_block *sb, int *flags, char *data) err = -EINVAL; if (!nilfs_valid_fs(nilfs)) { - nilfs_msg(sb, KERN_WARNING, - "couldn't remount because the filesystem is in an incomplete recovery state"); + nilfs_warn(sb, + "couldn't remount because the filesystem is in an incomplete recovery state"); goto restore_opts; } @@ -1155,9 +1158,9 @@ static int nilfs_remount(struct super_block *sb, int *flags, char *data) ~NILFS_FEATURE_COMPAT_RO_SUPP; up_read(&nilfs->ns_sem); if (features) { - nilfs_msg(sb, KERN_WARNING, - "couldn't remount RDWR because of unsupported optional features (%llx)", - (unsigned long long)features); + nilfs_warn(sb, + "couldn't remount RDWR because of unsupported optional features (%llx)", + (unsigned long long)features); err = -EROFS; goto restore_opts; } @@ -1216,7 +1219,7 @@ static int nilfs_parse_snapshot_option(const char *option, return 0; parse_error: - nilfs_msg(NULL, KERN_ERR, "invalid option \"%s\": %s", option, msg); + nilfs_err(NULL, "invalid option \"%s\": %s", option, msg); return 1; } @@ -1319,7 +1322,7 @@ nilfs_mount(struct file_system_type *fs_type, int flags, } else if (!sd.cno) { if (nilfs_tree_is_busy(s->s_root)) { if ((flags ^ s->s_flags) & SB_RDONLY) { - nilfs_msg(s, KERN_ERR, + nilfs_err(s, "the device already has a %s mount.", sb_rdonly(s) ? "read-only" : "read/write"); err = -EBUSY; diff --git a/fs/nilfs2/sysfs.c b/fs/nilfs2/sysfs.c index e60be7bb55b0..303d71430bdd 100644 --- a/fs/nilfs2/sysfs.c +++ b/fs/nilfs2/sysfs.c @@ -263,8 +263,8 @@ nilfs_checkpoints_checkpoints_number_show(struct nilfs_checkpoints_attr *attr, err = nilfs_cpfile_get_stat(nilfs->ns_cpfile, &cpstat); up_read(&nilfs->ns_segctor_sem); if (err < 0) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "unable to get checkpoint stat: err=%d", err); + nilfs_err(nilfs->ns_sb, "unable to get checkpoint stat: err=%d", + err); return err; } @@ -286,8 +286,8 @@ nilfs_checkpoints_snapshots_number_show(struct nilfs_checkpoints_attr *attr, err = nilfs_cpfile_get_stat(nilfs->ns_cpfile, &cpstat); up_read(&nilfs->ns_segctor_sem); if (err < 0) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "unable to get checkpoint stat: err=%d", err); + nilfs_err(nilfs->ns_sb, "unable to get checkpoint stat: err=%d", + err); return err; } @@ -405,8 +405,8 @@ nilfs_segments_dirty_segments_show(struct nilfs_segments_attr *attr, err = nilfs_sufile_get_stat(nilfs->ns_sufile, &sustat); up_read(&nilfs->ns_segctor_sem); if (err < 0) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "unable to get segment stat: err=%d", err); + nilfs_err(nilfs->ns_sb, "unable to get segment stat: err=%d", + err); return err; } @@ -779,15 +779,15 @@ nilfs_superblock_sb_update_frequency_store(struct nilfs_superblock_attr *attr, err = kstrtouint(skip_spaces(buf), 0, &val); if (err) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "unable to convert string: err=%d", err); + nilfs_err(nilfs->ns_sb, "unable to convert string: err=%d", + err); return err; } if (val < NILFS_SB_FREQ) { val = NILFS_SB_FREQ; - nilfs_msg(nilfs->ns_sb, KERN_WARNING, - "superblock update frequency cannot be lesser than 10 seconds"); + nilfs_warn(nilfs->ns_sb, + "superblock update frequency cannot be lesser than 10 seconds"); } down_write(&nilfs->ns_sem); @@ -990,8 +990,7 @@ int nilfs_sysfs_create_device_group(struct super_block *sb) nilfs->ns_dev_subgroups = kzalloc(devgrp_size, GFP_KERNEL); if (unlikely(!nilfs->ns_dev_subgroups)) { err = -ENOMEM; - nilfs_msg(sb, KERN_ERR, - "unable to allocate memory for device group"); + nilfs_err(sb, "unable to allocate memory for device group"); goto failed_create_device_group; } @@ -1101,15 +1100,13 @@ int __init nilfs_sysfs_init(void) nilfs_kset = kset_create_and_add(NILFS_ROOT_GROUP_NAME, NULL, fs_kobj); if (!nilfs_kset) { err = -ENOMEM; - nilfs_msg(NULL, KERN_ERR, - "unable to create sysfs entry: err=%d", err); + nilfs_err(NULL, "unable to create sysfs entry: err=%d", err); goto failed_sysfs_init; } err = sysfs_create_group(&nilfs_kset->kobj, &nilfs_feature_attr_group); if (unlikely(err)) { - nilfs_msg(NULL, KERN_ERR, - "unable to create feature group: err=%d", err); + nilfs_err(NULL, "unable to create feature group: err=%d", err); goto cleanup_sysfs_init; } diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c index 484785cdf96e..221a1cc597f0 100644 --- a/fs/nilfs2/the_nilfs.c +++ b/fs/nilfs2/the_nilfs.c @@ -183,7 +183,7 @@ static int nilfs_store_log_cursor(struct the_nilfs *nilfs, nilfs_get_segnum_of_block(nilfs, nilfs->ns_last_pseg); nilfs->ns_cno = nilfs->ns_last_cno + 1; if (nilfs->ns_segnum >= nilfs->ns_nsegments) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, + nilfs_err(nilfs->ns_sb, "pointed segment number is out of range: segnum=%llu, nsegments=%lu", (unsigned long long)nilfs->ns_segnum, nilfs->ns_nsegments); @@ -210,12 +210,12 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) int err; if (!valid_fs) { - nilfs_msg(sb, KERN_WARNING, "mounting unchecked fs"); + nilfs_warn(sb, "mounting unchecked fs"); if (s_flags & SB_RDONLY) { - nilfs_msg(sb, KERN_INFO, - "recovery required for readonly filesystem"); - nilfs_msg(sb, KERN_INFO, - "write access will be enabled during recovery"); + nilfs_info(sb, + "recovery required for readonly filesystem"); + nilfs_info(sb, + "write access will be enabled during recovery"); } } @@ -230,12 +230,11 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) goto scan_error; if (!nilfs_valid_sb(sbp[1])) { - nilfs_msg(sb, KERN_WARNING, - "unable to fall back to spare super block"); + nilfs_warn(sb, + "unable to fall back to spare super block"); goto scan_error; } - nilfs_msg(sb, KERN_INFO, - "trying rollback from an earlier position"); + nilfs_info(sb, "trying rollback from an earlier position"); /* * restore super block with its spare and reconfigure @@ -248,9 +247,9 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) /* verify consistency between two super blocks */ blocksize = BLOCK_SIZE << le32_to_cpu(sbp[0]->s_log_block_size); if (blocksize != nilfs->ns_blocksize) { - nilfs_msg(sb, KERN_WARNING, - "blocksize differs between two super blocks (%d != %d)", - blocksize, nilfs->ns_blocksize); + nilfs_warn(sb, + "blocksize differs between two super blocks (%d != %d)", + blocksize, nilfs->ns_blocksize); goto scan_error; } @@ -269,8 +268,7 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) err = nilfs_load_super_root(nilfs, sb, ri.ri_super_root); if (unlikely(err)) { - nilfs_msg(sb, KERN_ERR, "error %d while loading super root", - err); + nilfs_err(sb, "error %d while loading super root", err); goto failed; } @@ -281,28 +279,28 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) __u64 features; if (nilfs_test_opt(nilfs, NORECOVERY)) { - nilfs_msg(sb, KERN_INFO, - "norecovery option specified, skipping roll-forward recovery"); + nilfs_info(sb, + "norecovery option specified, skipping roll-forward recovery"); goto skip_recovery; } features = le64_to_cpu(nilfs->ns_sbp[0]->s_feature_compat_ro) & ~NILFS_FEATURE_COMPAT_RO_SUPP; if (features) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "couldn't proceed with recovery because of unsupported optional features (%llx)", (unsigned long long)features); err = -EROFS; goto failed_unload; } if (really_read_only) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "write access unavailable, cannot proceed"); err = -EROFS; goto failed_unload; } sb->s_flags &= ~SB_RDONLY; } else if (nilfs_test_opt(nilfs, NORECOVERY)) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "recovery cancelled because norecovery option was specified for a read/write mount"); err = -EINVAL; goto failed_unload; @@ -318,12 +316,12 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) up_write(&nilfs->ns_sem); if (err) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "error %d updating super block. recovery unfinished.", err); goto failed_unload; } - nilfs_msg(sb, KERN_INFO, "recovery complete"); + nilfs_info(sb, "recovery complete"); skip_recovery: nilfs_clear_recovery_info(&ri); @@ -331,7 +329,7 @@ int load_nilfs(struct the_nilfs *nilfs, struct super_block *sb) return 0; scan_error: - nilfs_msg(sb, KERN_ERR, "error %d while searching super root", err); + nilfs_err(sb, "error %d while searching super root", err); goto failed; failed_unload: @@ -378,7 +376,7 @@ static int nilfs_store_disk_layout(struct the_nilfs *nilfs, struct nilfs_super_block *sbp) { if (le32_to_cpu(sbp->s_rev_level) < NILFS_MIN_SUPP_REV) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, + nilfs_err(nilfs->ns_sb, "unsupported revision (superblock rev.=%d.%d, current rev.=%d.%d). Please check the version of mkfs.nilfs(2).", le32_to_cpu(sbp->s_rev_level), le16_to_cpu(sbp->s_minor_rev_level), @@ -391,13 +389,11 @@ static int nilfs_store_disk_layout(struct the_nilfs *nilfs, nilfs->ns_inode_size = le16_to_cpu(sbp->s_inode_size); if (nilfs->ns_inode_size > nilfs->ns_blocksize) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "too large inode size: %d bytes", + nilfs_err(nilfs->ns_sb, "too large inode size: %d bytes", nilfs->ns_inode_size); return -EINVAL; } else if (nilfs->ns_inode_size < NILFS_MIN_INODE_SIZE) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "too small inode size: %d bytes", + nilfs_err(nilfs->ns_sb, "too small inode size: %d bytes", nilfs->ns_inode_size); return -EINVAL; } @@ -406,8 +402,7 @@ static int nilfs_store_disk_layout(struct the_nilfs *nilfs, nilfs->ns_blocks_per_segment = le32_to_cpu(sbp->s_blocks_per_segment); if (nilfs->ns_blocks_per_segment < NILFS_SEG_MIN_BLOCKS) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, - "too short segment: %lu blocks", + nilfs_err(nilfs->ns_sb, "too short segment: %lu blocks", nilfs->ns_blocks_per_segment); return -EINVAL; } @@ -417,7 +412,7 @@ static int nilfs_store_disk_layout(struct the_nilfs *nilfs, le32_to_cpu(sbp->s_r_segments_percentage); if (nilfs->ns_r_segments_percentage < 1 || nilfs->ns_r_segments_percentage > 99) { - nilfs_msg(nilfs->ns_sb, KERN_ERR, + nilfs_err(nilfs->ns_sb, "invalid reserved segments percentage: %lu", nilfs->ns_r_segments_percentage); return -EINVAL; @@ -503,16 +498,16 @@ static int nilfs_load_super_block(struct the_nilfs *nilfs, if (!sbp[0]) { if (!sbp[1]) { - nilfs_msg(sb, KERN_ERR, "unable to read superblock"); + nilfs_err(sb, "unable to read superblock"); return -EIO; } - nilfs_msg(sb, KERN_WARNING, - "unable to read primary superblock (blocksize = %d)", - blocksize); + nilfs_warn(sb, + "unable to read primary superblock (blocksize = %d)", + blocksize); } else if (!sbp[1]) { - nilfs_msg(sb, KERN_WARNING, - "unable to read secondary superblock (blocksize = %d)", - blocksize); + nilfs_warn(sb, + "unable to read secondary superblock (blocksize = %d)", + blocksize); } /* @@ -534,14 +529,14 @@ static int nilfs_load_super_block(struct the_nilfs *nilfs, } if (!valid[swp]) { nilfs_release_super_block(nilfs); - nilfs_msg(sb, KERN_ERR, "couldn't find nilfs on the device"); + nilfs_err(sb, "couldn't find nilfs on the device"); return -EINVAL; } if (!valid[!swp]) - nilfs_msg(sb, KERN_WARNING, - "broken superblock, retrying with spare superblock (blocksize = %d)", - blocksize); + nilfs_warn(sb, + "broken superblock, retrying with spare superblock (blocksize = %d)", + blocksize); if (swp) nilfs_swap_super_block(nilfs); @@ -575,7 +570,7 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data) blocksize = sb_min_blocksize(sb, NILFS_MIN_BLOCK_SIZE); if (!blocksize) { - nilfs_msg(sb, KERN_ERR, "unable to set blocksize"); + nilfs_err(sb, "unable to set blocksize"); err = -EINVAL; goto out; } @@ -594,7 +589,7 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data) blocksize = BLOCK_SIZE << le32_to_cpu(sbp->s_log_block_size); if (blocksize < NILFS_MIN_BLOCK_SIZE || blocksize > NILFS_MAX_BLOCK_SIZE) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "couldn't mount because of unsupported filesystem blocksize %d", blocksize); err = -EINVAL; @@ -604,7 +599,7 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data) int hw_blocksize = bdev_logical_block_size(sb->s_bdev); if (blocksize < hw_blocksize) { - nilfs_msg(sb, KERN_ERR, + nilfs_err(sb, "blocksize %d too small for device (sector-size = %d)", blocksize, hw_blocksize); err = -EINVAL; diff --git a/fs/open.c b/fs/open.c index c80e9f497e9b..9af548fb841b 100644 --- a/fs/open.c +++ b/fs/open.c @@ -779,12 +779,6 @@ static int do_dentry_open(struct file *f, return 0; } - /* Any file opened for execve()/uselib() has to be a regular file. */ - if (unlikely(f->f_flags & FMODE_EXEC && !S_ISREG(inode->i_mode))) { - error = -EACCES; - goto cleanup_file; - } - if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) { error = get_write_access(inode); if (unlikely(error)) diff --git a/fs/proc/base.c b/fs/proc/base.c index a333caeca291..617db4e0faa0 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -551,8 +551,17 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns, { unsigned long totalpages = totalram_pages() + total_swap_pages; unsigned long points = 0; + long badness; + + badness = oom_badness(task, totalpages); + /* + * Special case OOM_SCORE_ADJ_MIN for all others scale the + * badness value into [0, 2000] range which we have been + * exporting for a long time so userspace might depend on it. + */ + if (badness != LONG_MIN) + points = (1000 + badness * 1000 / (long)totalpages) * 2 / 3; - points = oom_badness(task, totalpages) * 1000 / totalpages; seq_printf(m, "%lu\n", points); return 0; diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index dbda4499a859..5066b0251ed8 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -786,7 +786,7 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss, SEQ_PUT_DEC(" kB\nLazyFree: ", mss->lazyfree); SEQ_PUT_DEC(" kB\nAnonHugePages: ", mss->anonymous_thp); SEQ_PUT_DEC(" kB\nShmemPmdMapped: ", mss->shmem_thp); - SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); + SEQ_PUT_DEC(" kB\nFilePmdMapped: ", mss->file_thp); SEQ_PUT_DEC(" kB\nShared_Hugetlb: ", mss->shared_hugetlb); seq_put_decimal_ull_width(m, " kB\nPrivate_Hugetlb: ", mss->private_hugetlb >> 10, 7); @@ -816,7 +816,7 @@ static int show_smap(struct seq_file *m, void *v) __show_smap(m, &mss, false); - seq_printf(m, "THPeligible: %d\n", + seq_printf(m, "THPeligible: %d\n", transparent_hugepage_enabled(vma)); if (arch_pkeys_enabled()) diff --git a/fs/signalfd.c b/fs/signalfd.c index 44b6845b071c..5b78719be445 100644 --- a/fs/signalfd.c +++ b/fs/signalfd.c @@ -314,9 +314,10 @@ SYSCALL_DEFINE4(signalfd4, int, ufd, sigset_t __user *, user_mask, { sigset_t mask; - if (sizemask != sizeof(sigset_t) || - copy_from_user(&mask, user_mask, sizeof(mask))) + if (sizemask != sizeof(sigset_t)) return -EINVAL; + if (copy_from_user(&mask, user_mask, sizeof(mask))) + return -EFAULT; return do_signalfd4(ufd, &mask, flags); } @@ -325,9 +326,10 @@ SYSCALL_DEFINE3(signalfd, int, ufd, sigset_t __user *, user_mask, { sigset_t mask; - if (sizemask != sizeof(sigset_t) || - copy_from_user(&mask, user_mask, sizeof(mask))) + if (sizemask != sizeof(sigset_t)) return -EINVAL; + if (copy_from_user(&mask, user_mask, sizeof(mask))) + return -EFAULT; return do_signalfd4(ufd, &mask, 0); } diff --git a/fs/ufs/super.c b/fs/ufs/super.c index 1da0be667409..e3b69fb280e8 100644 --- a/fs/ufs/super.c +++ b/fs/ufs/super.c @@ -101,7 +101,7 @@ static struct inode *ufs_nfs_get_inode(struct super_block *sb, u64 ino, u32 gene struct ufs_sb_private_info *uspi = UFS_SB(sb)->s_uspi; struct inode *inode; - if (ino < UFS_ROOTINO || ino > uspi->s_ncg * uspi->s_ipg) + if (ino < UFS_ROOTINO || ino > (u64)uspi->s_ncg * uspi->s_ipg) return ERR_PTR(-ESTALE); inode = ufs_iget(sb, ino); diff --git a/include/asm-generic/uaccess.h b/include/asm-generic/uaccess.h index e935318804f8..ba68ee4dabfa 100644 --- a/include/asm-generic/uaccess.h +++ b/include/asm-generic/uaccess.h @@ -86,8 +86,8 @@ static inline void set_fs(mm_segment_t fs) } #endif -#ifndef segment_eq -#define segment_eq(a, b) ((a).seg == (b).seg) +#ifndef uaccess_kernel +#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg) #endif #define access_ok(addr, size) __access_ok((unsigned long)(addr),(size)) diff --git a/include/clocksource/timer-ti-dm.h b/include/clocksource/timer-ti-dm.h index 531ca87fcd08..4c61dade8835 100644 --- a/include/clocksource/timer-ti-dm.h +++ b/include/clocksource/timer-ti-dm.h @@ -1,7 +1,7 @@ /* * OMAP Dual-Mode Timers * - * Copyright (C) 2010 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2010 Texas Instruments Incorporated - https://www.ti.com/ * Tarun Kanti DebBarma <tarun.kanti@ti.com> * Thara Gopinath <thara@ti.com> * diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h index 75e582b8d2d9..4c328fef403c 100644 --- a/include/linux/async_tx.h +++ b/include/linux/async_tx.h @@ -36,7 +36,7 @@ struct dma_chan_ref { /** * async_tx_flags - modifiers for the async_* calls * @ASYNC_TX_XOR_ZERO_DST: this flag must be used for xor operations where the - * the destination address is not a source. The asynchronous case handles this + * destination address is not a source. The asynchronous case handles this * implicitly, the synchronous case needs to zero the destination block. * @ASYNC_TX_XOR_DROP_DST: this flag must be used if the destination address is * also one of the source addresses. In the synchronous case the destination diff --git a/include/linux/btree.h b/include/linux/btree.h index 68f858c831b1..243ee544397a 100644 --- a/include/linux/btree.h +++ b/include/linux/btree.h @@ -10,7 +10,7 @@ * * A B+Tree is a data structure for looking up arbitrary (currently allowing * unsigned long, u32, u64 and 2 * u64) keys into pointers. The data structure - * is described at http://en.wikipedia.org/wiki/B-tree, we currently do not + * is described at https://en.wikipedia.org/wiki/B-tree, we currently do not * use binary search to find the key on lookups. * * Each B+Tree consists of a head, that contains bookkeeping information and diff --git a/include/linux/compaction.h b/include/linux/compaction.h index 6fa0eea3f530..25a521d299c1 100644 --- a/include/linux/compaction.h +++ b/include/linux/compaction.h @@ -85,11 +85,13 @@ static inline unsigned long compact_gap(unsigned int order) #ifdef CONFIG_COMPACTION extern int sysctl_compact_memory; +extern unsigned int sysctl_compaction_proactiveness; extern int sysctl_compaction_handler(struct ctl_table *table, int write, void *buffer, size_t *length, loff_t *ppos); extern int sysctl_extfrag_threshold; extern int sysctl_compact_unevictable_allowed; +extern unsigned int extfrag_for_order(struct zone *zone, unsigned int order); extern int fragmentation_index(struct zone *zone, unsigned int order); extern enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, unsigned int alloc_flags, diff --git a/include/linux/compiler-clang.h b/include/linux/compiler-clang.h index 8a072d00e688..cee0c728d39a 100644 --- a/include/linux/compiler-clang.h +++ b/include/linux/compiler-clang.h @@ -40,7 +40,7 @@ #endif /* - * Not all versions of clang implement the the type-generic versions + * Not all versions of clang implement the type-generic versions * of the builtin overflow checkers. Fortunately, clang implements * __has_builtin allowing us to avoid awkward version * checks. Unfortunately, we don't know which version of gcc clang diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h index 2e231ba8fe3f..4b33cb385f96 100644 --- a/include/linux/compiler_types.h +++ b/include/linux/compiler_types.h @@ -5,48 +5,54 @@ #ifndef __ASSEMBLY__ #ifdef __CHECKER__ +/* address spaces */ # define __kernel __attribute__((address_space(0))) # define __user __attribute__((noderef, address_space(__user))) -# define __safe __attribute__((safe)) -# define __force __attribute__((force)) -# define __nocast __attribute__((nocast)) # define __iomem __attribute__((noderef, address_space(__iomem))) +# define __percpu __attribute__((noderef, address_space(__percpu))) +# define __rcu __attribute__((noderef, address_space(__rcu))) +extern void __chk_user_ptr(const volatile void __user *); +extern void __chk_io_ptr(const volatile void __iomem *); +/* context/locking */ # define __must_hold(x) __attribute__((context(x,1,1))) # define __acquires(x) __attribute__((context(x,0,1))) # define __releases(x) __attribute__((context(x,1,0))) # define __acquire(x) __context__(x,1) # define __release(x) __context__(x,-1) # define __cond_lock(x,c) ((c) ? ({ __acquire(x); 1; }) : 0) -# define __percpu __attribute__((noderef, address_space(__percpu))) -# define __rcu __attribute__((noderef, address_space(__rcu))) +/* other */ +# define __force __attribute__((force)) +# define __nocast __attribute__((nocast)) +# define __safe __attribute__((safe)) # define __private __attribute__((noderef)) -extern void __chk_user_ptr(const volatile void __user *); -extern void __chk_io_ptr(const volatile void __iomem *); # define ACCESS_PRIVATE(p, member) (*((typeof((p)->member) __force *) &(p)->member)) #else /* __CHECKER__ */ +/* address spaces */ +# define __kernel # ifdef STRUCTLEAK_PLUGIN -# define __user __attribute__((user)) +# define __user __attribute__((user)) # else # define __user # endif -# define __kernel -# define __safe -# define __force -# define __nocast # define __iomem -# define __chk_user_ptr(x) (void)0 -# define __chk_io_ptr(x) (void)0 -# define __builtin_warning(x, y...) (1) +# define __percpu +# define __rcu +# define __chk_user_ptr(x) (void)0 +# define __chk_io_ptr(x) (void)0 +/* context/locking */ # define __must_hold(x) # define __acquires(x) # define __releases(x) -# define __acquire(x) (void)0 -# define __release(x) (void)0 +# define __acquire(x) (void)0 +# define __release(x) (void)0 # define __cond_lock(x,c) (c) -# define __percpu -# define __rcu +/* other */ +# define __force +# define __nocast +# define __safe # define __private # define ACCESS_PRIVATE(p, member) ((p)->member) +# define __builtin_warning(x, y...) (1) #endif /* __CHECKER__ */ /* Indirect macros required for expanded argument pasting, eg. __LINE__. */ diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h index 525510a9f965..6594dbc34a37 100644 --- a/include/linux/crash_core.h +++ b/include/linux/crash_core.h @@ -38,6 +38,8 @@ phys_addr_t paddr_vmcoreinfo_note(void); #define VMCOREINFO_OSRELEASE(value) \ vmcoreinfo_append_str("OSRELEASE=%s\n", value) +#define VMCOREINFO_BUILD_ID(value) \ + vmcoreinfo_append_str("BUILD-ID=%s\n", value) #define VMCOREINFO_PAGESIZE(value) \ vmcoreinfo_append_str("PAGESIZE=%ld\n", value) #define VMCOREINFO_SYMBOL(name) \ @@ -64,6 +66,10 @@ extern unsigned char *vmcoreinfo_data; extern size_t vmcoreinfo_size; extern u32 *vmcoreinfo_note; +/* raw contents of kernel .notes section */ +extern const void __start_notes __weak; +extern const void __stop_notes __weak; + Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type, void *data, size_t data_len); void final_note(Elf_Word *buf); diff --git a/include/linux/delay.h b/include/linux/delay.h index 5e016a4029d9..1d0e2ce6b6d9 100644 --- a/include/linux/delay.h +++ b/include/linux/delay.h @@ -16,7 +16,7 @@ * 3. CPU clock rate changes. * * Please see this thread: - * http://lists.openwall.net/linux-kernel/2011/01/09/56 + * https://lists.openwall.net/linux-kernel/2011/01/09/56 */ #include <linux/kernel.h> diff --git a/include/linux/dma/k3-psil.h b/include/linux/dma/k3-psil.h index 61d5cc0ad601..1962f75fa2d3 100644 --- a/include/linux/dma/k3-psil.h +++ b/include/linux/dma/k3-psil.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* - * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2019 Texas Instruments Incorporated - https://www.ti.com */ #ifndef K3_PSIL_H_ diff --git a/include/linux/dma/k3-udma-glue.h b/include/linux/dma/k3-udma-glue.h index caadbab1632a..5eb34ad973a7 100644 --- a/include/linux/dma/k3-udma-glue.h +++ b/include/linux/dma/k3-udma-glue.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ /* - * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2019 Texas Instruments Incorporated - https://www.ti.com */ #ifndef K3_UDMA_GLUE_H_ diff --git a/include/linux/dma/ti-cppi5.h b/include/linux/dma/ti-cppi5.h index 579356ae447e..5896441ee604 100644 --- a/include/linux/dma/ti-cppi5.h +++ b/include/linux/dma/ti-cppi5.h @@ -2,7 +2,7 @@ /* * CPPI5 descriptors interface * - * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2019 Texas Instruments Incorporated - https://www.ti.com */ #ifndef __TI_CPPI5_H__ diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h index d896b8657085..3ceb72b67a7a 100644 --- a/include/linux/exportfs.h +++ b/include/linux/exportfs.h @@ -178,7 +178,7 @@ struct fid { * get_name: * @get_name should find a name for the given @child in the given @parent * directory. The name should be stored in the @name (with the - * understanding that it is already pointing to a a %NAME_MAX+1 sized + * understanding that it is already pointing to a %NAME_MAX+1 sized * buffer. get_name() should return %0 on success, a negative error code * or error. @get_name will be called without @parent->i_mutex held. * diff --git a/include/linux/frontswap.h b/include/linux/frontswap.h index 6d775984905b..b07d88c92bb2 100644 --- a/include/linux/frontswap.h +++ b/include/linux/frontswap.h @@ -10,7 +10,7 @@ /* * Return code to denote that requested number of * frontswap pages are unused(moved to page cache). - * Used in in shmem_unuse and try_to_unuse. + * Used in shmem_unuse and try_to_unuse. */ #define FRONTSWAP_PAGES_UNUSED 2 diff --git a/include/linux/fs.h b/include/linux/fs.h index 011af396aa17..7c69dd7c6160 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -518,6 +518,16 @@ static inline void i_mmap_unlock_read(struct address_space *mapping) up_read(&mapping->i_mmap_rwsem); } +static inline void i_mmap_assert_locked(struct address_space *mapping) +{ + lockdep_assert_held(&mapping->i_mmap_rwsem); +} + +static inline void i_mmap_assert_write_locked(struct address_space *mapping) +{ + lockdep_assert_held_write(&mapping->i_mmap_rwsem); +} + /* * Might pages of this file be mapped into userspace? */ diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h index 02393c0c98f9..bfd00320c7f3 100644 --- a/include/linux/generic-radix-tree.h +++ b/include/linux/generic-radix-tree.h @@ -44,7 +44,7 @@ struct genradix_root; struct __genradix { - struct genradix_root __rcu *root; + struct genradix_root *root; }; /* diff --git a/include/linux/highmem.h b/include/linux/highmem.h index d6e82e3de027..14e6202ce47f 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -73,7 +73,7 @@ static inline void kunmap(struct page *page) * no global lock is needed and because the kmap code must perform a global TLB * invalidation when the kmap pool wraps. * - * However when holding an atomic kmap is is not legal to sleep, so atomic + * However when holding an atomic kmap it is not legal to sleep, so atomic * kmaps are appropriate for short, tight code paths only. * * The use of kmap_atomic/kunmap_atomic is discouraged - kmap/kunmap diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 17c4c4975145..467302056e17 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -181,13 +181,6 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma, #define transparent_hugepage_use_zero_page() \ (transparent_hugepage_flags & \ (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG)) -#ifdef CONFIG_DEBUG_VM -#define transparent_hugepage_debug_cow() \ - (transparent_hugepage_flags & \ - (1<<TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG)) -#else /* CONFIG_DEBUG_VM */ -#define transparent_hugepage_debug_cow() 0 -#endif /* CONFIG_DEBUG_VM */ extern unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 50650d0d01b9..d5cc5f802dd4 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -10,6 +10,7 @@ #include <linux/list.h> #include <linux/kref.h> #include <linux/pgtable.h> +#include <linux/gfp.h> struct ctl_table; struct user_struct; @@ -164,7 +165,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz); pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr, unsigned long sz); -int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep); +int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *addr, pte_t *ptep); void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma, unsigned long *start, unsigned long *end); struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address, @@ -203,8 +205,9 @@ static inline struct address_space *hugetlb_page_mapping_lock_write( return NULL; } -static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, - pte_t *ptep) +static inline int huge_pmd_unshare(struct mm_struct *mm, + struct vm_area_struct *vma, + unsigned long *addr, pte_t *ptep) { return 0; } @@ -504,13 +507,10 @@ struct huge_bootmem_page { struct page *alloc_huge_page(struct vm_area_struct *vma, unsigned long addr, int avoid_reserve); -struct page *alloc_huge_page_node(struct hstate *h, int nid); struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, - nodemask_t *nmask); + nodemask_t *nmask, gfp_t gfp_mask); struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address); -struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, - int nid, nodemask_t *nmask); int huge_add_to_page_cache(struct page *page, struct address_space *mapping, pgoff_t idx); @@ -692,6 +692,27 @@ static inline bool hugepage_movable_supported(struct hstate *h) return true; } +/* Movability of hugepages depends on migration support. */ +static inline gfp_t htlb_alloc_mask(struct hstate *h) +{ + if (hugepage_movable_supported(h)) + return GFP_HIGHUSER_MOVABLE; + else + return GFP_HIGHUSER; +} + +static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask) +{ + gfp_t modified_mask = htlb_alloc_mask(h); + + /* Some callers might want to enforce node */ + modified_mask |= (gfp_mask & __GFP_THISNODE); + + modified_mask |= (gfp_mask & __GFP_NOWARN); + + return modified_mask; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { @@ -759,13 +780,9 @@ static inline struct page *alloc_huge_page(struct vm_area_struct *vma, return NULL; } -static inline struct page *alloc_huge_page_node(struct hstate *h, int nid) -{ - return NULL; -} - static inline struct page * -alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, nodemask_t *nmask) +alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, + nodemask_t *nmask, gfp_t gfp_mask) { return NULL; } @@ -878,6 +895,16 @@ static inline bool hugepage_movable_supported(struct hstate *h) return false; } +static inline gfp_t htlb_alloc_mask(struct hstate *h) +{ + return 0; +} + +static inline gfp_t htlb_modify_alloc_mask(struct hstate *h, gfp_t gfp_mask) +{ + return 0; +} + static inline spinlock_t *huge_pte_lockptr(struct hstate *h, struct mm_struct *mm, pte_t *pte) { diff --git a/include/linux/irqchip/irq-omap-intc.h b/include/linux/irqchip/irq-omap-intc.h index 216e5adf80ce..dca379c0d7eb 100644 --- a/include/linux/irqchip/irq-omap-intc.h +++ b/include/linux/irqchip/irq-omap-intc.h @@ -2,7 +2,7 @@ /** * irq-omap-intc.h - INTC Idle Functions * - * Copyright (C) 2014 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2014 Texas Instruments Incorporated - https://www.ti.com * * Author: Felipe Balbi <balbi@ti.com> */ diff --git a/include/linux/jhash.h b/include/linux/jhash.h index ba2f6a9776b6..19ddd43aee68 100644 --- a/include/linux/jhash.h +++ b/include/linux/jhash.h @@ -5,7 +5,7 @@ * * Copyright (C) 2006. Bob Jenkins (bob_jenkins@burtleburtle.net) * - * http://burtleburtle.net/bob/hash/ + * https://burtleburtle.net/bob/hash/ * * These are the credits from Bob's sources: * diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 7339a00c895e..500def620d8f 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -17,7 +17,6 @@ #include <asm/byteorder.h> #include <asm/div64.h> #include <uapi/linux/kernel.h> -#include <asm/div64.h> #define STACK_MAGIC 0xdeadbeef @@ -322,8 +321,7 @@ void panic(const char *fmt, ...) __noreturn __cold; void nmi_panic(struct pt_regs *regs, const char *msg); extern void oops_enter(void); extern void oops_exit(void); -void print_oops_end_marker(void); -extern int oops_may_print(void); +extern bool oops_may_print(void); void do_exit(long error_code) __noreturn; void complete_and_exit(struct completion *, long) __noreturn; @@ -347,7 +345,7 @@ int __must_check kstrtoll(const char *s, unsigned int base, long long *res); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the simple_strtoull. Return code must be checked. + * Preferred over simple_strtoul(). Return code must be checked. */ static inline int __must_check kstrtoul(const char *s, unsigned int base, unsigned long *res) { @@ -375,7 +373,7 @@ static inline int __must_check kstrtoul(const char *s, unsigned int base, unsign * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the simple_strtoull. Return code must be checked. + * Preferred over simple_strtol(). Return code must be checked. */ static inline int __must_check kstrtol(const char *s, unsigned int base, long *res) { diff --git a/include/linux/leds-ti-lmu-common.h b/include/linux/leds-ti-lmu-common.h index 5eb111f38803..420b61e5a213 100644 --- a/include/linux/leds-ti-lmu-common.h +++ b/include/linux/leds-ti-lmu-common.h @@ -1,6 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ // TI LMU Common Core -// Copyright (C) 2018 Texas Instruments Incorporated - http://www.ti.com/ +// Copyright (C) 2018 Texas Instruments Incorporated - https://www.ti.com/ #ifndef _TI_LMU_COMMON_H_ #define _TI_LMU_COMMON_H_ diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1bb49b600310..385237e4cb44 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -32,6 +32,7 @@ struct kmem_cache; enum memcg_stat_item { MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS, MEMCG_SOCK, + MEMCG_PERCPU_B, MEMCG_NR_STAT, }; @@ -64,8 +65,8 @@ struct mem_cgroup_id { /* * Per memcg event counter is incremented at every pagein/pageout. With THP, - * it will be incremated by the number of pages. This counter is used for - * for trigger some periodic events. This is straightforward and better + * it will be incremented by the number of pages. This counter is used + * to trigger some periodic events. This is straightforward and better * than using jiffies etc. to handle periodic memcg event. */ enum mem_cgroup_events_target { @@ -339,6 +340,13 @@ struct mem_cgroup { extern struct mem_cgroup *root_mem_cgroup; +static __always_inline bool memcg_stat_item_in_bytes(int idx) +{ + if (idx == MEMCG_PERCPU_B) + return true; + return vmstat_item_in_bytes(idx); +} + static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg) { return (memcg == root_mem_cgroup); diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index ea9c15b60a96..5f1c74df264d 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -6,7 +6,7 @@ #ifndef _LINUX_MEMPOLICY_H #define _LINUX_MEMPOLICY_H 1 - +#include <linux/sched.h> #include <linux/mmzone.h> #include <linux/dax.h> #include <linux/slab.h> @@ -28,7 +28,7 @@ struct mm_struct; * the process policy is used. Interrupts ignore the memory policy * of the current process. * - * Locking policy for interlave: + * Locking policy for interleave: * In process context there is no locking because only the process accesses * its own state. All vma manipulation is somewhat protected by a down_read on * mmap_lock. @@ -152,6 +152,15 @@ extern int huge_node(struct vm_area_struct *vma, extern bool init_nodemask_of_mempolicy(nodemask_t *mask); extern bool mempolicy_nodemask_intersects(struct task_struct *tsk, const nodemask_t *mask); +extern nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy); + +static inline nodemask_t *policy_nodemask_current(gfp_t gfp) +{ + struct mempolicy *mpol = get_task_policy(current); + + return policy_nodemask(gfp, mpol); +} + extern unsigned int mempolicy_slab_node(void); extern enum zone_type policy_zone; @@ -281,5 +290,10 @@ static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma, static inline void mpol_put_task_policy(struct task_struct *task) { } + +static inline nodemask_t *policy_nodemask_current(gfp_t gfp) +{ + return NULL; +} #endif /* CONFIG_NUMA */ #endif diff --git a/include/linux/migrate.h b/include/linux/migrate.h index 540998d9810b..0f8d1583fa8e 100644 --- a/include/linux/migrate.h +++ b/include/linux/migrate.h @@ -10,6 +10,8 @@ typedef struct page *new_page_t(struct page *page, unsigned long private); typedef void free_page_t(struct page *page, unsigned long private); +struct migration_target_control; + /* * Return values from addresss_space_operations.migratepage(): * - negative errno on page migration failure; @@ -31,34 +33,6 @@ enum migrate_reason { /* In mm/debug.c; also keep sync with include/trace/events/migrate.h */ extern const char *migrate_reason_names[MR_TYPES]; -static inline struct page *new_page_nodemask(struct page *page, - int preferred_nid, nodemask_t *nodemask) -{ - gfp_t gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL; - unsigned int order = 0; - struct page *new_page = NULL; - - if (PageHuge(page)) - return alloc_huge_page_nodemask(page_hstate(compound_head(page)), - preferred_nid, nodemask); - - if (PageTransHuge(page)) { - gfp_mask |= GFP_TRANSHUGE; - order = HPAGE_PMD_ORDER; - } - - if (PageHighMem(page) || (zone_idx(page_zone(page)) == ZONE_MOVABLE)) - gfp_mask |= __GFP_HIGHMEM; - - new_page = __alloc_pages_nodemask(gfp_mask, order, - preferred_nid, nodemask); - - if (new_page && PageTransHuge(new_page)) - prep_transhuge_page(new_page); - - return new_page; -} - #ifdef CONFIG_MIGRATION extern void putback_movable_pages(struct list_head *l); @@ -67,6 +41,7 @@ extern int migrate_page(struct address_space *mapping, enum migrate_mode mode); extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, int reason); +extern struct page *alloc_migration_target(struct page *page, unsigned long private); extern int isolate_movable_page(struct page *page, isolate_mode_t mode); extern void putback_movable_page(struct page *page); @@ -85,6 +60,9 @@ static inline int migrate_pages(struct list_head *l, new_page_t new, free_page_t free, unsigned long private, enum migrate_mode mode, int reason) { return -ENOSYS; } +static inline struct page *alloc_migration_target(struct page *page, + unsigned long private) + { return NULL; } static inline int isolate_movable_page(struct page *page, isolate_mode_t mode) { return -EBUSY; } diff --git a/include/linux/mm.h b/include/linux/mm.h index f6a82f9bccd7..e7602a3bcef1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -38,6 +38,7 @@ struct file_ra_state; struct user_struct; struct writeback_control; struct bdi_writeback; +struct pt_regs; void init_mm_internals(void); @@ -479,7 +480,7 @@ static inline bool fault_flag_allow_retry_first(unsigned int flags) { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" } /* - * vm_fault is filled by the the pagefault handler and passed to the vma's + * vm_fault is filled by the pagefault handler and passed to the vma's * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. * @@ -1658,8 +1659,9 @@ int invalidate_inode_page(struct page *page); #ifdef CONFIG_MMU extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma, - unsigned long address, unsigned int flags); -extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, + unsigned long address, unsigned int flags, + struct pt_regs *regs); +extern int fixup_user_fault(struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked); void unmap_mapping_pages(struct address_space *mapping, @@ -1668,14 +1670,14 @@ void unmap_mapping_range(struct address_space *mapping, loff_t const holebegin, loff_t const holelen, int even_cows); #else static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma, - unsigned long address, unsigned int flags) + unsigned long address, unsigned int flags, + struct pt_regs *regs) { /* should never happen if there's no MMU */ BUG(); return VM_FAULT_SIGBUS; } -static inline int fixup_user_fault(struct task_struct *tsk, - struct mm_struct *mm, unsigned long address, +static inline int fixup_user_fault(struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked) { /* should never happen if there's no MMU */ @@ -1701,11 +1703,11 @@ extern int access_remote_vm(struct mm_struct *mm, unsigned long addr, extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, unsigned long addr, void *buf, int len, unsigned int gup_flags); -long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked); -long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long pin_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked); @@ -2599,7 +2601,7 @@ extern unsigned long stack_guard_gap; /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */ extern int expand_stack(struct vm_area_struct *vma, unsigned long address); -/* CONFIG_STACK_GROWSUP still needs to to grow downwards at some places */ +/* CONFIG_STACK_GROWSUP still needs to grow downwards at some places */ extern int expand_downwards(struct vm_area_struct *vma, unsigned long address); #if VM_GROWSUP diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 635a96cd9b1f..8379432f4f2f 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -173,9 +173,15 @@ enum node_stat_item { NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */ NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */ WORKINGSET_NODES, - WORKINGSET_REFAULT, - WORKINGSET_ACTIVATE, - WORKINGSET_RESTORE, + WORKINGSET_REFAULT_BASE, + WORKINGSET_REFAULT_ANON = WORKINGSET_REFAULT_BASE, + WORKINGSET_REFAULT_FILE, + WORKINGSET_ACTIVATE_BASE, + WORKINGSET_ACTIVATE_ANON = WORKINGSET_ACTIVATE_BASE, + WORKINGSET_ACTIVATE_FILE, + WORKINGSET_RESTORE_BASE, + WORKINGSET_RESTORE_ANON = WORKINGSET_RESTORE_BASE, + WORKINGSET_RESTORE_FILE, WORKINGSET_NODERECLAIM, NR_ANON_MAPPED, /* Mapped anonymous pages */ NR_FILE_MAPPED, /* pagecache pages mapped into pagetables. @@ -277,8 +283,8 @@ struct lruvec { unsigned long file_cost; /* Non-resident age, driven by LRU movement */ atomic_long_t nonresident_age; - /* Refaults at the time of last reclaim cycle */ - unsigned long refaults; + /* Refaults at the time of last reclaim cycle, anon=0, file=1 */ + unsigned long refaults[2]; /* Various lruvec state flags (enum lruvec_flags) */ unsigned long flags; #ifdef CONFIG_MEMCG @@ -530,6 +536,7 @@ struct zone { * On compaction failure, 1<<compact_defer_shift compactions * are skipped before trying again. The number attempted since * last failure is tracked with compact_considered. + * compact_order_failed is the minimum compaction failed order. */ unsigned int compact_considered; unsigned int compact_defer_shift; diff --git a/include/linux/oom.h b/include/linux/oom.h index c696c265f019..f022f581ac29 100644 --- a/include/linux/oom.h +++ b/include/linux/oom.h @@ -48,7 +48,7 @@ struct oom_control { /* Used by oom implementation, do not set */ unsigned long totalpages; struct task_struct *chosen; - unsigned long chosen_points; + long chosen_points; /* Used to print the constraint info. */ enum oom_constraint constraint; @@ -107,7 +107,7 @@ static inline vm_fault_t check_stable_address_space(struct mm_struct *mm) bool __oom_reap_task_mm(struct mm_struct *mm); -extern unsigned long oom_badness(struct task_struct *p, +long oom_badness(struct task_struct *p, unsigned long totalpages); extern bool out_of_memory(struct oom_control *oc); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 53e97da1e8e2..a124c21e3204 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -804,7 +804,7 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma, /* * No-op macros that just return the current protection value. Defined here - * because these macros can be used used even if CONFIG_MMU is not defined. + * because these macros can be used even if CONFIG_MMU is not defined. */ #ifndef pgprot_nx @@ -1234,7 +1234,7 @@ static inline int pmd_trans_unstable(pmd_t *pmd) * Technically a PTE can be PROTNONE even when not doing NUMA balancing but * the only case the kernel cares is for NUMA balancing and is only ever set * when the VMA is accessible. For PROT_NONE VMAs, the PTEs are not marked - * _PAGE_PROTNONE so by by default, implement the helper as "always no". It + * _PAGE_PROTNONE so by default, implement the helper as "always no". It * is the responsibility of the caller to distinguish between PROT_NONE * protections and NUMA hinting fault protections. */ @@ -1318,10 +1318,10 @@ static inline int pmd_free_pte_page(pmd_t *pmd, unsigned long addr) /* * ARCHes with special requirements for evicting THP backing TLB entries can * implement this. Otherwise also, it can help optimize normal TLB flush in - * THP regime. stock flush_tlb_range() typically has optimization to nuke the - * entire TLB TLB if flush span is greater than a threshold, which will - * likely be true for a single huge page. Thus a single thp flush will - * invalidate the entire TLB which is not desitable. + * THP regime. Stock flush_tlb_range() typically has optimization to nuke the + * entire TLB if flush span is greater than a threshold, which will + * likely be true for a single huge page. Thus a single THP flush will + * invalidate the entire TLB which is not desirable. * e.g. see arch/arc: flush_pmd_tlb_range */ #define flush_pmd_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end) diff --git a/include/linux/platform_data/davinci-cpufreq.h b/include/linux/platform_data/davinci-cpufreq.h index 3fbf9f2793b5..bc208c64e3d7 100644 --- a/include/linux/platform_data/davinci-cpufreq.h +++ b/include/linux/platform_data/davinci-cpufreq.h @@ -2,7 +2,7 @@ /* * TI DaVinci CPUFreq platform support. * - * Copyright (C) 2009 Texas Instruments, Inc. http://www.ti.com/ + * Copyright (C) 2009 Texas Instruments, Inc. https://www.ti.com/ */ #ifndef _MACH_DAVINCI_CPUFREQ_H diff --git a/include/linux/platform_data/davinci_asp.h b/include/linux/platform_data/davinci_asp.h index 7fe80f1c7e08..5d1fb0d78a22 100644 --- a/include/linux/platform_data/davinci_asp.h +++ b/include/linux/platform_data/davinci_asp.h @@ -1,7 +1,7 @@ /* * TI DaVinci Audio Serial Port support * - * Copyright (C) 2012 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2012 Texas Instruments Incorporated - https://www.ti.com/ * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License as diff --git a/include/linux/platform_data/elm.h b/include/linux/platform_data/elm.h index 0f491d8abfdd..3cc78f0447b1 100644 --- a/include/linux/platform_data/elm.h +++ b/include/linux/platform_data/elm.h @@ -2,7 +2,7 @@ /* * BCH Error Location Module * - * Copyright (C) 2012 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2012 Texas Instruments Incorporated - https://www.ti.com/ */ #ifndef __ELM_H diff --git a/include/linux/platform_data/gpio-davinci.h b/include/linux/platform_data/gpio-davinci.h index a93841bfb9f7..e182a46e609f 100644 --- a/include/linux/platform_data/gpio-davinci.h +++ b/include/linux/platform_data/gpio-davinci.h @@ -1,7 +1,7 @@ /* * DaVinci GPIO Platform Related Defines * - * Copyright (C) 2013 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2013 Texas Instruments Incorporated - https://www.ti.com/ * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License as diff --git a/include/linux/platform_data/gpmc-omap.h b/include/linux/platform_data/gpmc-omap.h index ef663e570552..c9cc4e32435d 100644 --- a/include/linux/platform_data/gpmc-omap.h +++ b/include/linux/platform_data/gpmc-omap.h @@ -2,7 +2,7 @@ /* * OMAP GPMC Platform data * - * Copyright (C) 2014 Texas Instruments, Inc. - http://www.ti.com + * Copyright (C) 2014 Texas Instruments, Inc. - https://www.ti.com * Roger Quadros <rogerq@ti.com> */ diff --git a/include/linux/platform_data/mtd-davinci-aemif.h b/include/linux/platform_data/mtd-davinci-aemif.h index a403dd51dacc..a49826214a39 100644 --- a/include/linux/platform_data/mtd-davinci-aemif.h +++ b/include/linux/platform_data/mtd-davinci-aemif.h @@ -1,7 +1,7 @@ /* * TI DaVinci AEMIF support * - * Copyright 2010 (C) Texas Instruments, Inc. http://www.ti.com/ + * Copyright 2010 (C) Texas Instruments, Inc. https://www.ti.com/ * * This file is licensed under the terms of the GNU General Public License * version 2. This program is licensed "as is" without any warranty of any diff --git a/include/linux/platform_data/omap-twl4030.h b/include/linux/platform_data/omap-twl4030.h index 8419c8caf54e..0dd851ea1c72 100644 --- a/include/linux/platform_data/omap-twl4030.h +++ b/include/linux/platform_data/omap-twl4030.h @@ -3,7 +3,7 @@ * omap-twl4030.h - ASoC machine driver for TI SoC based boards with twl4030 * codec, header. * - * Copyright (C) 2012 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2012 Texas Instruments Incorporated - https://www.ti.com * All rights reserved. * * Author: Peter Ujfalusi <peter.ujfalusi@ti.com> diff --git a/include/linux/platform_data/uio_pruss.h b/include/linux/platform_data/uio_pruss.h index 3d47d219827f..31f2e22661bc 100644 --- a/include/linux/platform_data/uio_pruss.h +++ b/include/linux/platform_data/uio_pruss.h @@ -3,7 +3,7 @@ * * Platform data for uio_pruss driver * - * Copyright (C) 2010-11 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2010-11 Texas Instruments Incorporated - https://www.ti.com/ * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public License as diff --git a/include/linux/platform_data/usb-omap.h b/include/linux/platform_data/usb-omap.h index fa579b4c666b..5e70d667031c 100644 --- a/include/linux/platform_data/usb-omap.h +++ b/include/linux/platform_data/usb-omap.h @@ -1,7 +1,7 @@ /* * usb-omap.h - Platform data for the various OMAP USB IPs * - * Copyright (C) 2012 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2012 Texas Instruments Incorporated - https://www.ti.com * * This software is distributed under the terms of the GNU General Public * License ("GPL") version 2, as published by the Free Software Foundation. diff --git a/include/linux/poison.h b/include/linux/poison.h index df34330b4e34..dc8ae5d8db03 100644 --- a/include/linux/poison.h +++ b/include/linux/poison.h @@ -24,10 +24,6 @@ #define LIST_POISON2 ((void *) 0x122 + POISON_POINTER_DELTA) /********** include/linux/timer.h **********/ -/* - * Magic number "tsta" to indicate a static timer initializer - * for the object debugging code. - */ #define TIMER_ENTRY_STATIC ((void *) 0x300 + POISON_POINTER_DELTA) /********** mm/page_poison.c **********/ diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h index 85023ddc2dc2..f889e332912f 100644 --- a/include/linux/sched/mm.h +++ b/include/linux/sched/mm.h @@ -178,14 +178,16 @@ static inline bool in_vfork(struct task_struct *tsk) */ static inline gfp_t current_gfp_context(gfp_t flags) { - if (unlikely(current->flags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) { + unsigned int pflags = READ_ONCE(current->flags); + + if (unlikely(pflags & (PF_MEMALLOC_NOIO | PF_MEMALLOC_NOFS))) { /* * NOIO implies both NOIO and NOFS and it is a weaker context * so always make sure it makes precedence */ - if (current->flags & PF_MEMALLOC_NOIO) + if (pflags & PF_MEMALLOC_NOIO) flags &= ~(__GFP_IO | __GFP_FS); - else if (current->flags & PF_MEMALLOC_NOFS) + else if (pflags & PF_MEMALLOC_NOFS) flags &= ~__GFP_FS; } return flags; diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index ae3060f0b0c9..a98965007eef 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -88,6 +88,7 @@ struct task_struct *fork_idle(int); struct mm_struct *copy_init_mm(void); extern pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags); extern long kernel_wait4(pid_t, int __user *, int, struct rusage *); +int kernel_wait(pid_t pid, int *stat); extern void free_task(struct task_struct *tsk); diff --git a/include/linux/soc/ti/k3-ringacc.h b/include/linux/soc/ti/k3-ringacc.h index 7ac115432fa1..5a472eca5ee4 100644 --- a/include/linux/soc/ti/k3-ringacc.h +++ b/include/linux/soc/ti/k3-ringacc.h @@ -2,7 +2,7 @@ /* * K3 Ring Accelerator (RA) subsystem interface * - * Copyright (C) 2019 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2019 Texas Instruments Incorporated - https://www.ti.com */ #ifndef __SOC_TI_K3_RINGACC_API_H_ diff --git a/include/linux/soc/ti/knav_qmss.h b/include/linux/soc/ti/knav_qmss.h index 9745df6ed9d3..c75ef99c99ca 100644 --- a/include/linux/soc/ti/knav_qmss.h +++ b/include/linux/soc/ti/knav_qmss.h @@ -1,7 +1,7 @@ /* * Keystone Navigator Queue Management Sub-System header * - * Copyright (C) 2014 Texas Instruments Incorporated - http://www.ti.com + * Copyright (C) 2014 Texas Instruments Incorporated - https://www.ti.com * Author: Sandeep Nair <sandeep_n@ti.com> * Cyril Chemparathy <cyril@ti.com> * Santosh Shilimkar <santosh.shilimkar@ti.com> diff --git a/include/linux/soc/ti/ti-msgmgr.h b/include/linux/soc/ti/ti-msgmgr.h index eac8e0c6fe11..1f6e76d423cf 100644 --- a/include/linux/soc/ti/ti-msgmgr.h +++ b/include/linux/soc/ti/ti-msgmgr.h @@ -1,7 +1,7 @@ /* * Texas Instruments' Message Manager * - * Copyright (C) 2015-2016 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2015-2016 Texas Instruments Incorporated - https://www.ti.com/ * Nishanth Menon * * This program is free software; you can redistribute it and/or modify diff --git a/include/linux/swap.h b/include/linux/swap.h index 7eb59bc552a5..661046994db4 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -352,7 +352,7 @@ extern void deactivate_page(struct page *page); extern void mark_page_lazyfree(struct page *page); extern void swap_setup(void); -extern void lru_cache_add_active_or_unevictable(struct page *page, +extern void lru_cache_add_inactive_or_unevictable(struct page *page, struct vm_area_struct *vma); /* linux/mm/vmscan.c */ @@ -414,9 +414,14 @@ extern struct address_space *swapper_spaces[]; extern unsigned long total_swapcache_pages(void); extern void show_swap_cache_info(void); extern int add_to_swap(struct page *page); -extern int add_to_swap_cache(struct page *, swp_entry_t, gfp_t); -extern void __delete_from_swap_cache(struct page *, swp_entry_t entry); +extern void *get_shadow_from_swap_cache(swp_entry_t entry); +extern int add_to_swap_cache(struct page *page, swp_entry_t entry, + gfp_t gfp, void **shadowp); +extern void __delete_from_swap_cache(struct page *page, + swp_entry_t entry, void *shadow); extern void delete_from_swap_cache(struct page *); +extern void clear_shadow_from_swap_cache(int type, unsigned long begin, + unsigned long end); extern void free_page_and_swap_cache(struct page *); extern void free_pages_and_swap_cache(struct page **, int); extern struct page *lookup_swap_cache(swp_entry_t entry, @@ -569,14 +574,19 @@ static inline int add_to_swap(struct page *page) return 0; } +static inline void *get_shadow_from_swap_cache(swp_entry_t entry) +{ + return NULL; +} + static inline int add_to_swap_cache(struct page *page, swp_entry_t entry, - gfp_t gfp_mask) + gfp_t gfp_mask, void **shadowp) { return -1; } static inline void __delete_from_swap_cache(struct page *page, - swp_entry_t entry) + swp_entry_t entry, void *shadow) { } @@ -584,6 +594,11 @@ static inline void delete_from_swap_cache(struct page *page) { } +static inline void clear_shadow_from_swap_cache(int type, unsigned long begin, + unsigned long end) +{ +} + static inline int page_swapcount(struct page *page) { return 0; diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index a2429d336593..dc2b827c81e5 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -263,7 +263,7 @@ static inline void addr_limit_user_check(void) return; #endif - if (CHECK_DATA_CORRUPTION(!segment_eq(get_fs(), USER_DS), + if (CHECK_DATA_CORRUPTION(uaccess_kernel(), "Invalid address limit on user-mode return")) force_sig(SIGKILL); diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h index 0a76ddc07d59..94b285411659 100644 --- a/include/linux/uaccess.h +++ b/include/linux/uaccess.h @@ -6,11 +6,27 @@ #include <linux/sched.h> #include <linux/thread_info.h> -#define uaccess_kernel() segment_eq(get_fs(), KERNEL_DS) - #include <asm/uaccess.h> /* + * Force the uaccess routines to be wired up for actual userspace access, + * overriding any possible set_fs(KERNEL_DS) still lingering around. Undone + * using force_uaccess_end below. + */ +static inline mm_segment_t force_uaccess_begin(void) +{ + mm_segment_t fs = get_fs(); + + set_fs(USER_DS); + return fs; +} + +static inline void force_uaccess_end(mm_segment_t oldfs) +{ + set_fs(oldfs); +} + +/* * Architectures should provide two primitives (raw_copy_{to,from}_user()) * and get rid of their private instances of copy_{to,from}_user() and * __copy_{to,from}_user{,_inatomic}(). diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 24fc7c3ae7d6..2e6ca53b9bbd 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -56,6 +56,9 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #endif #ifdef CONFIG_MIGRATION PGMIGRATE_SUCCESS, PGMIGRATE_FAIL, + THP_MIGRATION_SUCCESS, + THP_MIGRATION_FAIL, + THP_MIGRATION_SPLIT, #endif #ifdef CONFIG_COMPACTION COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED, diff --git a/include/linux/wkup_m3_ipc.h b/include/linux/wkup_m3_ipc.h index e497e621dbb7..3f496967b538 100644 --- a/include/linux/wkup_m3_ipc.h +++ b/include/linux/wkup_m3_ipc.h @@ -1,7 +1,7 @@ /* * TI Wakeup M3 for AMx3 SoCs Power Management Routines * - * Copyright (C) 2015 Texas Instruments Incorporated - http://www.ti.com/ + * Copyright (C) 2015 Texas Instruments Incorporated - https://www.ti.com/ * Dave Gerlach <d-gerlach@ti.com> * * This program is free software; you can redistribute it and/or diff --git a/include/linux/xxhash.h b/include/linux/xxhash.h index 52b073fea17f..df42511438d0 100644 --- a/include/linux/xxhash.h +++ b/include/linux/xxhash.h @@ -34,7 +34,7 @@ * ("BSD"). * * You can contact the author at: - * - xxHash homepage: http://cyan4973.github.io/xxHash/ + * - xxHash homepage: https://cyan4973.github.io/xxHash/ * - xxHash source repository: https://github.com/Cyan4973/xxHash */ diff --git a/include/linux/xz.h b/include/linux/xz.h index 64cffa6ddfce..9884c8440188 100644 --- a/include/linux/xz.h +++ b/include/linux/xz.h @@ -2,7 +2,7 @@ * XZ decompressor * * Authors: Lasse Collin <lasse.collin@tukaani.org> - * Igor Pavlov <http://7-zip.org/> + * Igor Pavlov <https://7-zip.org/> * * This file has been put into the public domain. * You can do whatever you want with this file. @@ -28,7 +28,7 @@ * enum xz_mode - Operation mode * * @XZ_SINGLE: Single-call mode. This uses less RAM than - * than multi-call modes, because the LZMA2 + * multi-call modes, because the LZMA2 * dictionary doesn't need to be allocated as * part of the decoder state. All required data * structures are allocated at initialization, diff --git a/include/linux/zlib.h b/include/linux/zlib.h index c757d848a758..78ede944c082 100644 --- a/include/linux/zlib.h +++ b/include/linux/zlib.h @@ -23,7 +23,7 @@ The data format used by the zlib library is described by RFCs (Request for - Comments) 1950 to 1952 in the files http://www.ietf.org/rfc/rfc1950.txt + Comments) 1950 to 1952 in the files https://www.ietf.org/rfc/rfc1950.txt (zlib format), rfc1951.txt (deflate format) and rfc1952.txt (gzip format). */ diff --git a/include/soc/arc/aux.h b/include/soc/arc/aux.h index e223c4ffa153..9c2eff6140b6 100644 --- a/include/soc/arc/aux.h +++ b/include/soc/arc/aux.h @@ -22,7 +22,7 @@ static inline int read_aux_reg(u32 r) /* * function helps elide unused variable warning - * see: http://lists.infradead.org/pipermail/linux-snps-arc/2016-November/001748.html + * see: https://lists.infradead.org/pipermail/linux-snps-arc/2016-November/001748.html */ static inline void write_aux_reg(u32 r, u32 v) { diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h index 705b33d1e395..4d434398d64d 100644 --- a/include/trace/events/migrate.h +++ b/include/trace/events/migrate.h @@ -46,13 +46,18 @@ MIGRATE_REASON TRACE_EVENT(mm_migrate_pages, TP_PROTO(unsigned long succeeded, unsigned long failed, - enum migrate_mode mode, int reason), + unsigned long thp_succeeded, unsigned long thp_failed, + unsigned long thp_split, enum migrate_mode mode, int reason), - TP_ARGS(succeeded, failed, mode, reason), + TP_ARGS(succeeded, failed, thp_succeeded, thp_failed, + thp_split, mode, reason), TP_STRUCT__entry( __field( unsigned long, succeeded) __field( unsigned long, failed) + __field( unsigned long, thp_succeeded) + __field( unsigned long, thp_failed) + __field( unsigned long, thp_split) __field( enum migrate_mode, mode) __field( int, reason) ), @@ -60,13 +65,19 @@ TRACE_EVENT(mm_migrate_pages, TP_fast_assign( __entry->succeeded = succeeded; __entry->failed = failed; + __entry->thp_succeeded = thp_succeeded; + __entry->thp_failed = thp_failed; + __entry->thp_split = thp_split; __entry->mode = mode; __entry->reason = reason; ), - TP_printk("nr_succeeded=%lu nr_failed=%lu mode=%s reason=%s", + TP_printk("nr_succeeded=%lu nr_failed=%lu nr_thp_succeeded=%lu nr_thp_failed=%lu nr_thp_split=%lu mode=%s reason=%s", __entry->succeeded, __entry->failed, + __entry->thp_succeeded, + __entry->thp_failed, + __entry->thp_split, __print_symbolic(__entry->mode, MIGRATE_MODE), __print_symbolic(__entry->reason, MIGRATE_REASON)) ); diff --git a/include/uapi/linux/auto_dev-ioctl.h b/include/uapi/linux/auto_dev-ioctl.h index 374742651c30..62e625356dc8 100644 --- a/include/uapi/linux/auto_dev-ioctl.h +++ b/include/uapi/linux/auto_dev-ioctl.h @@ -82,7 +82,7 @@ struct args_ismountpoint { /* * All the ioctls use this structure. * When sending a path size must account for the total length - * of the chunk of memory otherwise is is the size of the + * of the chunk of memory otherwise it is the size of the * structure. */ diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h index c6dd0215482e..22220945a5fd 100644 --- a/include/uapi/linux/elf.h +++ b/include/uapi/linux/elf.h @@ -53,7 +53,7 @@ typedef __s64 Elf64_Sxword; * * - Oracle: Linker and Libraries. * Part No: 817–1984–19, August 2011. - * http://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf + * https://docs.oracle.com/cd/E18752_01/pdf/817-1984.pdf * * - System V ABI AMD64 Architecture Processor Supplement * Draft Version 0.99.4, diff --git a/include/uapi/linux/map_to_7segment.h b/include/uapi/linux/map_to_7segment.h index f9ed18134b83..13a06e5e966e 100644 --- a/include/uapi/linux/map_to_7segment.h +++ b/include/uapi/linux/map_to_7segment.h @@ -24,7 +24,7 @@ * of (ASCII) characters to a 7-segments notation. * * The 7 segment's wikipedia notation below is used as standard. - * See: http://en.wikipedia.org/wiki/Seven_segment_display + * See: https://en.wikipedia.org/wiki/Seven_segment_display * * Notation: +-a-+ * f b diff --git a/include/uapi/linux/types.h b/include/uapi/linux/types.h index 2fce8b6876e9..f6d2f83cbe29 100644 --- a/include/uapi/linux/types.h +++ b/include/uapi/linux/types.h @@ -7,7 +7,7 @@ #ifndef __ASSEMBLY__ #ifndef __KERNEL__ #ifndef __EXPORTED_HEADERS__ -#warning "Attempt to use kernel headers from user space, see http://kernelnewbies.org/KernelHeaders" +#warning "Attempt to use kernel headers from user space, see https://kernelnewbies.org/KernelHeaders" #endif /* __EXPORTED_HEADERS__ */ #endif diff --git a/include/uapi/linux/usb/ch9.h b/include/uapi/linux/usb/ch9.h index 48766fdf6580..0f865ae4ba89 100644 --- a/include/uapi/linux/usb/ch9.h +++ b/include/uapi/linux/usb/ch9.h @@ -1229,7 +1229,7 @@ struct usb_set_sel_req { * As per USB compliance update, a device that is actively drawing * more than 100mA from USB must report itself as bus-powered in * the GetStatus(DEVICE) call. - * http://compliance.usb.org/index.asp?UpdateFile=Electrical&Format=Standard#34 + * https://compliance.usb.org/index.asp?UpdateFile=Electrical&Format=Standard#34 */ #define USB_SELF_POWER_VBUS_MAX_DRAW 100 diff --git a/ipc/sem.c b/ipc/sem.c index 3687b71151b3..8c0244e0365e 100644 --- a/ipc/sem.c +++ b/ipc/sem.c @@ -585,8 +585,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params) /* * Called with sem_ids.rwsem and ipcp locked. */ -static inline int sem_more_checks(struct kern_ipc_perm *ipcp, - struct ipc_params *params) +static int sem_more_checks(struct kern_ipc_perm *ipcp, struct ipc_params *params) { struct sem_array *sma; diff --git a/ipc/shm.c b/ipc/shm.c index bf38d7e2fbe9..f1ed36e3ac9f 100644 --- a/ipc/shm.c +++ b/ipc/shm.c @@ -711,8 +711,7 @@ no_file: /* * Called with shm_ids.rwsem and ipcp locked. */ -static inline int shm_more_checks(struct kern_ipc_perm *ipcp, - struct ipc_params *params) +static int shm_more_checks(struct kern_ipc_perm *ipcp, struct ipc_params *params) { struct shmid_kernel *shp; @@ -1381,7 +1380,6 @@ static long compat_ksys_shmctl(int shmid, int cmd, void __user *uptr, int versio case SHM_LOCK: case SHM_UNLOCK: return shmctl_do_lock(ns, shmid, cmd); - break; default: return -EINVAL; } diff --git a/kernel/Makefile b/kernel/Makefile index 5350fd292910..b3da548691c9 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -36,7 +36,7 @@ KCOV_INSTRUMENT_stacktrace.o := n KCOV_INSTRUMENT_kcov.o := n KASAN_SANITIZE_kcov.o := n KCSAN_SANITIZE_kcov.o := n -CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector) +CFLAGS_kcov.o := $(call cc-option, -fno-conserve-stack) -fno-stack-protector # cond_syscall is currently not LTO compatible CFLAGS_sys_ni.o = $(DISABLE_LTO) diff --git a/kernel/crash_core.c b/kernel/crash_core.c index 18175687133a..106e4500fd53 100644 --- a/kernel/crash_core.c +++ b/kernel/crash_core.c @@ -11,6 +11,8 @@ #include <asm/page.h> #include <asm/sections.h> +#include <crypto/sha.h> + /* vmcoreinfo stuff */ unsigned char *vmcoreinfo_data; size_t vmcoreinfo_size; @@ -376,6 +378,53 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void) } EXPORT_SYMBOL(paddr_vmcoreinfo_note); +#define NOTES_SIZE (&__stop_notes - &__start_notes) +#define BUILD_ID_MAX SHA1_DIGEST_SIZE +#define NT_GNU_BUILD_ID 3 + +struct elf_note_section { + struct elf_note n_hdr; + u8 n_data[]; +}; + +/* + * Add build ID from .notes section as generated by the GNU ld(1) + * or LLVM lld(1) --build-id option. + */ +static void add_build_id_vmcoreinfo(void) +{ + char build_id[BUILD_ID_MAX * 2 + 1]; + int n_remain = NOTES_SIZE; + + while (n_remain >= sizeof(struct elf_note)) { + const struct elf_note_section *note_sec = + &__start_notes + NOTES_SIZE - n_remain; + const u32 n_namesz = note_sec->n_hdr.n_namesz; + + if (note_sec->n_hdr.n_type == NT_GNU_BUILD_ID && + n_namesz != 0 && + !strcmp((char *)¬e_sec->n_data[0], "GNU")) { + if (note_sec->n_hdr.n_descsz <= BUILD_ID_MAX) { + const u32 n_descsz = note_sec->n_hdr.n_descsz; + const u8 *s = ¬e_sec->n_data[n_namesz]; + + s = PTR_ALIGN(s, 4); + bin2hex(build_id, s, n_descsz); + build_id[2 * n_descsz] = '\0'; + VMCOREINFO_BUILD_ID(build_id); + return; + } + pr_warn("Build ID is too large to include in vmcoreinfo: %u > %u\n", + note_sec->n_hdr.n_descsz, + BUILD_ID_MAX); + return; + } + n_remain -= sizeof(struct elf_note) + + ALIGN(note_sec->n_hdr.n_namesz, 4) + + ALIGN(note_sec->n_hdr.n_descsz, 4); + } +} + static int __init crash_save_vmcoreinfo_init(void) { vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL); @@ -394,6 +443,7 @@ static int __init crash_save_vmcoreinfo_init(void) } VMCOREINFO_OSRELEASE(init_uts_ns.name.release); + add_build_id_vmcoreinfo(); VMCOREINFO_PAGESIZE(PAGE_SIZE); VMCOREINFO_SYMBOL(init_uts_ns); diff --git a/kernel/events/callchain.c b/kernel/events/callchain.c index c6ce894e4ce9..58cbe357fb2b 100644 --- a/kernel/events/callchain.c +++ b/kernel/events/callchain.c @@ -217,10 +217,9 @@ get_perf_callchain(struct pt_regs *regs, u32 init_nr, bool kernel, bool user, if (add_mark) perf_callchain_store_context(&ctx, PERF_CONTEXT_USER); - fs = get_fs(); - set_fs(USER_DS); + fs = force_uaccess_begin(); perf_callchain_user(&ctx, regs); - set_fs(fs); + force_uaccess_end(fs); } } diff --git a/kernel/events/core.c b/kernel/events/core.c index d1f0a7e5b182..6961333ebad5 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -6453,10 +6453,9 @@ perf_output_sample_ustack(struct perf_output_handle *handle, u64 dump_size, /* Data. */ sp = perf_user_stack_pointer(regs); - fs = get_fs(); - set_fs(USER_DS); + fs = force_uaccess_begin(); rem = __output_copy_user(handle, (void *) sp, dump_size); - set_fs(fs); + force_uaccess_end(fs); dyn_size = dump_size - rem; perf_output_skip(handle, rem); diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 25de10c904e6..649fd53dc9ad 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -184,7 +184,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr, if (new_page) { get_page(new_page); page_add_new_anon_rmap(new_page, vma, addr, false); - lru_cache_add_active_or_unevictable(new_page, vma); + lru_cache_add_inactive_or_unevictable(new_page, vma); } else /* no new page, just dec_mm_counter for old_page */ dec_mm_counter(mm, MM_ANONPAGES); @@ -376,7 +376,7 @@ __update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d) if (!vaddr || !d) return -EINVAL; - ret = get_user_pages_remote(NULL, mm, vaddr, 1, + ret = get_user_pages_remote(mm, vaddr, 1, FOLL_WRITE, &page, &vma, NULL); if (unlikely(ret <= 0)) { /* @@ -477,7 +477,7 @@ retry: if (is_register) gup_flags |= FOLL_SPLIT_PMD; /* Read the page with vaddr into memory */ - ret = get_user_pages_remote(NULL, mm, vaddr, 1, gup_flags, + ret = get_user_pages_remote(mm, vaddr, 1, gup_flags, &old_page, &vma, NULL); if (ret <= 0) return ret; @@ -2029,7 +2029,7 @@ static int is_trap_at_addr(struct mm_struct *mm, unsigned long vaddr) * but we treat this as a 'remote' access since it is * essentially a kernel access to the memory. */ - result = get_user_pages_remote(NULL, mm, vaddr, 1, FOLL_FORCE, &page, + result = get_user_pages_remote(mm, vaddr, 1, FOLL_FORCE, &page, NULL, NULL); if (result < 0) return result; diff --git a/kernel/exit.c b/kernel/exit.c index e731c414e024..733e80f334e7 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -732,7 +732,7 @@ void __noreturn do_exit(long code) * mm_release()->clear_child_tid() from writing to a user-controlled * kernel address. */ - set_fs(USER_DS); + force_uaccess_begin(); if (unlikely(in_atomic())) { pr_info("note: %s[%d] exited with preempt_count %d\n", @@ -1626,6 +1626,22 @@ long kernel_wait4(pid_t upid, int __user *stat_addr, int options, return ret; } +int kernel_wait(pid_t pid, int *stat) +{ + struct wait_opts wo = { + .wo_type = PIDTYPE_PID, + .wo_pid = find_get_pid(pid), + .wo_flags = WEXITED, + }; + int ret; + + ret = do_wait(&wo); + if (ret > 0 && wo.wo_stat) + *stat = wo.wo_stat; + put_pid(wo.wo_pid); + return ret; +} + SYSCALL_DEFINE4(wait4, pid_t, upid, int __user *, stat_addr, int, options, struct rusage __user *, ru) { diff --git a/kernel/futex.c b/kernel/futex.c index 83404124b77b..61e8153e6c76 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -678,7 +678,7 @@ static int fault_in_user_writeable(u32 __user *uaddr) int ret; mmap_read_lock(mm); - ret = fixup_user_fault(current, mm, (unsigned long)uaddr, + ret = fixup_user_fault(mm, (unsigned long)uaddr, FAULT_FLAG_WRITE, NULL); mmap_read_unlock(mm); diff --git a/kernel/kcov.c b/kernel/kcov.c index 6afae0bcbac4..6b8368be89c8 100644 --- a/kernel/kcov.c +++ b/kernel/kcov.c @@ -96,7 +96,7 @@ struct kcov_percpu_data { int saved_sequence; }; -DEFINE_PER_CPU(struct kcov_percpu_data, kcov_percpu_data); +static DEFINE_PER_CPU(struct kcov_percpu_data, kcov_percpu_data); /* Must be called with kcov_remote_lock locked. */ static struct kcov_remote *kcov_remote_find(u64 handle) @@ -775,7 +775,7 @@ static inline bool kcov_mode_enabled(unsigned int mode) return (mode & ~KCOV_IN_CTXSW) != KCOV_MODE_DISABLED; } -void kcov_remote_softirq_start(struct task_struct *t) +static void kcov_remote_softirq_start(struct task_struct *t) { struct kcov_percpu_data *data = this_cpu_ptr(&kcov_percpu_data); unsigned int mode; @@ -792,7 +792,7 @@ void kcov_remote_softirq_start(struct task_struct *t) } } -void kcov_remote_softirq_stop(struct task_struct *t) +static void kcov_remote_softirq_stop(struct task_struct *t) { struct kcov_percpu_data *data = this_cpu_ptr(&kcov_percpu_data); diff --git a/kernel/kmod.c b/kernel/kmod.c index 37c3c4b97b8e..3cd075ce2a1e 100644 --- a/kernel/kmod.c +++ b/kernel/kmod.c @@ -36,9 +36,8 @@ * * If you need less than 50 threads would mean we're dealing with systems * smaller than 3200 pages. This assumes you are capable of having ~13M memory, - * and this would only be an be an upper limit, after which the OOM killer - * would take effect. Systems like these are very unlikely if modules are - * enabled. + * and this would only be an upper limit, after which the OOM killer would take + * effect. Systems like these are very unlikely if modules are enabled. */ #define MAX_KMOD_CONCURRENT 50 static atomic_t kmod_concurrent_max = ATOMIC_INIT(MAX_KMOD_CONCURRENT); diff --git a/kernel/kthread.c b/kernel/kthread.c index b2807e7be772..3edaa380dc7b 100644 --- a/kernel/kthread.c +++ b/kernel/kthread.c @@ -1258,8 +1258,7 @@ void kthread_use_mm(struct mm_struct *mm) if (active_mm != mm) mmdrop(active_mm); - to_kthread(tsk)->oldfs = get_fs(); - set_fs(USER_DS); + to_kthread(tsk)->oldfs = force_uaccess_begin(); } EXPORT_SYMBOL_GPL(kthread_use_mm); @@ -1274,7 +1273,7 @@ void kthread_unuse_mm(struct mm_struct *mm) WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD)); WARN_ON_ONCE(!tsk->mm); - set_fs(to_kthread(tsk)->oldfs); + force_uaccess_end(to_kthread(tsk)->oldfs); task_lock(tsk); sync_mm_rss(mm); diff --git a/kernel/panic.c b/kernel/panic.c index e2157ca387c8..aef8872ba843 100644 --- a/kernel/panic.c +++ b/kernel/panic.c @@ -505,7 +505,7 @@ static void do_oops_enter_exit(void) * Return true if the calling CPU is allowed to print oops-related info. * This is a bit racy.. */ -int oops_may_print(void) +bool oops_may_print(void) { return pause_on_oops_flag == 0; } @@ -551,7 +551,7 @@ static int init_oops_id(void) } late_initcall(init_oops_id); -void print_oops_end_marker(void) +static void print_oops_end_marker(void) { init_oops_id(); pr_warn("---[ end trace %016llx ]---\n", (unsigned long long)oops_id); diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c index 2af66e449aa6..946f44a9e86a 100644 --- a/kernel/stacktrace.c +++ b/kernel/stacktrace.c @@ -233,10 +233,9 @@ unsigned int stack_trace_save_user(unsigned long *store, unsigned int size) if (current->flags & PF_KTHREAD) return 0; - fs = get_fs(); - set_fs(USER_DS); + fs = force_uaccess_begin(); arch_stack_walk_user(consume_entry, &c, task_pt_regs(current)); - set_fs(fs); + force_uaccess_end(fs); return c.len; } diff --git a/kernel/sysctl.c b/kernel/sysctl.c index f785de3caac0..287862f91717 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -2852,6 +2852,15 @@ static struct ctl_table vm_table[] = { .proc_handler = sysctl_compaction_handler, }, { + .procname = "compaction_proactiveness", + .data = &sysctl_compaction_proactiveness, + .maxlen = sizeof(sysctl_compaction_proactiveness), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = SYSCTL_ZERO, + .extra2 = &one_hundred, + }, + { .procname = "extfrag_threshold", .data = &sysctl_extfrag_threshold, .maxlen = sizeof(int), diff --git a/kernel/umh.c b/kernel/umh.c index a25433f9cd9a..fcf3ee803630 100644 --- a/kernel/umh.c +++ b/kernel/umh.c @@ -119,37 +119,16 @@ static void call_usermodehelper_exec_sync(struct subprocess_info *sub_info) { pid_t pid; - /* If SIGCLD is ignored kernel_wait4 won't populate the status. */ + /* If SIGCLD is ignored do_wait won't populate the status. */ kernel_sigaction(SIGCHLD, SIG_DFL); pid = kernel_thread(call_usermodehelper_exec_async, sub_info, SIGCHLD); - if (pid < 0) { + if (pid < 0) sub_info->retval = pid; - } else { - int ret = -ECHILD; - /* - * Normally it is bogus to call wait4() from in-kernel because - * wait4() wants to write the exit code to a userspace address. - * But call_usermodehelper_exec_sync() always runs as kernel - * thread (workqueue) and put_user() to a kernel address works - * OK for kernel threads, due to their having an mm_segment_t - * which spans the entire address space. - * - * Thus the __user pointer cast is valid here. - */ - kernel_wait4(pid, (int __user *)&ret, 0, NULL); - - /* - * If ret is 0, either call_usermodehelper_exec_async failed and - * the real error code is already in sub_info->retval or - * sub_info->retval is 0 anyway, so don't mess with it then. - */ - if (ret) - sub_info->retval = ret; - } + else + kernel_wait(pid, &sub_info->retval); /* Restore default kernel sig handler */ kernel_sigaction(SIGCHLD, SIG_IGN); - umh_complete(sub_info); } diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index a164785c3b48..e068c3c7189a 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -365,6 +365,17 @@ config SECTION_MISMATCH_WARN_ONLY If unsure, say Y. +config DEBUG_FORCE_FUNCTION_ALIGN_32B + bool "Force all function address 32B aligned" if EXPERT + help + There are cases that a commit from one domain changes the function + address alignment of other domains, and cause magic performance + bump (regression or improvement). Enable this option will help to + verify if the bump is caused by function alignment changes, while + it will slightly increase the kernel size and affect icache usage. + + It is mainly for debug and performance tuning use. + # # Select this config option from the architecture Kconfig, if it # is preferred to always offer frame pointers as a config @@ -906,7 +917,7 @@ config PANIC_TIMEOUT int "panic timeout" default 0 help - Set the timeout value (in seconds) until a reboot occurs when the + Set the timeout value (in seconds) until a reboot occurs when the kernel panics. If n = 0, then we wait forever. A timeout value n > 0 will wait n seconds before rebooting, while a timeout value n < 0 will reboot immediately. @@ -1067,6 +1078,7 @@ config WQ_WATCHDOG config TEST_LOCKUP tristate "Test module to generate lockups" + depends on m help This builds the "test_lockup" module that helps to make sure that watchdogs and lockup detectors are working properly. @@ -2203,7 +2215,7 @@ config LIST_KUNIT_TEST and associated macros. KUnit tests run during boot and output the results to the debug log - in TAP format (http://testanything.org/). Only useful for kernel devs + in TAP format (https://testanything.org/). Only useful for kernel devs running the KUnit test harness, and not intended for inclusion into a production build. @@ -2224,6 +2236,17 @@ config LINEAR_RANGES_TEST If unsure, say N. +config BITS_TEST + tristate "KUnit test for bits.h" + depends on KUNIT + help + This builds the bits unit test. + Tests the logic of macros defined in bits.h. + For more information on KUnit and unit tests in general please refer + to the KUnit documentation in Documentation/dev-tools/kunit/. + + If unsure, say N. + config TEST_UDELAY tristate "udelay test driver" help diff --git a/lib/Makefile b/lib/Makefile index 9d1fd82ea145..e290fc5707ea 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -342,3 +342,4 @@ obj-$(CONFIG_PLDMFW) += pldmfw/ # KUnit tests obj-$(CONFIG_LIST_KUNIT_TEST) += list-test.o obj-$(CONFIG_LINEAR_RANGES_TEST) += test_linear_ranges.o +obj-$(CONFIG_BITS_TEST) += test_bits.o diff --git a/lib/bitmap.c b/lib/bitmap.c index 0364452b1617..c13d859bc7ab 100644 --- a/lib/bitmap.c +++ b/lib/bitmap.c @@ -212,13 +212,13 @@ void bitmap_cut(unsigned long *dst, const unsigned long *src, unsigned long keep = 0, carry; int i; - memmove(dst, src, len * sizeof(*dst)); - if (first % BITS_PER_LONG) { keep = src[first / BITS_PER_LONG] & (~0UL >> (BITS_PER_LONG - first % BITS_PER_LONG)); } + memmove(dst, src, len * sizeof(*dst)); + while (cut--) { for (i = first / BITS_PER_LONG; i < len; i++) { if (i < len - 1) diff --git a/lib/crc64.c b/lib/crc64.c index f8928ce28280..47cfa054827f 100644 --- a/lib/crc64.c +++ b/lib/crc64.c @@ -4,7 +4,7 @@ * * This is a basic crc64 implementation following ECMA-182 specification, * which can be found from, - * http://www.ecma-international.org/publications/standards/Ecma-182.htm + * https://www.ecma-international.org/publications/standards/Ecma-182.htm * * Dr. Ross N. Williams has a great document to introduce the idea of CRC * algorithm, here the CRC64 code is also inspired by the table-driven diff --git a/lib/decompress_bunzip2.c b/lib/decompress_bunzip2.c index 7c4932eed748..f9628f3924ce 100644 --- a/lib/decompress_bunzip2.c +++ b/lib/decompress_bunzip2.c @@ -34,7 +34,7 @@ Phone (337) 232-1234 or 1-800-738-2226 Fax (337) 232-1297 - http://www.hospiceacadiana.com/ + https://www.hospiceacadiana.com/ Manuel */ diff --git a/lib/decompress_unlzma.c b/lib/decompress_unlzma.c index ed7a1fd819f2..1cf409ef8d04 100644 --- a/lib/decompress_unlzma.c +++ b/lib/decompress_unlzma.c @@ -8,7 +8,7 @@ *implementation for lzma. *Copyright (C) 2006 Aurelien Jacobs < aurel@gnuage.org > * - *Based on LzmaDecode.c from the LZMA SDK 4.22 (http://www.7-zip.org/) + *Based on LzmaDecode.c from the LZMA SDK 4.22 (https://www.7-zip.org/) *Copyright (C) 1999-2005 Igor Pavlov * *Copyrights of the parts, see headers below. @@ -56,7 +56,7 @@ static long long INIT read_int(unsigned char *ptr, int size) /* Small range coder implementation for lzma. *Copyright (C) 2006 Aurelien Jacobs < aurel@gnuage.org > * - *Based on LzmaDecode.c from the LZMA SDK 4.22 (http://www.7-zip.org/) + *Based on LzmaDecode.c from the LZMA SDK 4.22 (https://www.7-zip.org/) *Copyright (c) 1999-2005 Igor Pavlov */ @@ -213,7 +213,7 @@ rc_bit_tree_decode(struct rc *rc, uint16_t *p, int num_levels, int *symbol) * Small lzma deflate implementation. * Copyright (C) 2006 Aurelien Jacobs < aurel@gnuage.org > * - * Based on LzmaDecode.c from the LZMA SDK 4.22 (http://www.7-zip.org/) + * Based on LzmaDecode.c from the LZMA SDK 4.22 (https://www.7-zip.org/) * Copyright (C) 1999-2005 Igor Pavlov */ diff --git a/lib/kstrtox.c b/lib/kstrtox.c index 1006bf70bf74..a14ccf905055 100644 --- a/lib/kstrtox.c +++ b/lib/kstrtox.c @@ -115,8 +115,7 @@ static int _kstrtoull(const char *s, unsigned int base, unsigned long long *res) * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull. Return code must - * be checked. + * Preferred over simple_strtoull(). Return code must be checked. */ int kstrtoull(const char *s, unsigned int base, unsigned long long *res) { @@ -139,8 +138,7 @@ EXPORT_SYMBOL(kstrtoull); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull. Return code must - * be checked. + * Preferred over simple_strtoll(). Return code must be checked. */ int kstrtoll(const char *s, unsigned int base, long long *res) { @@ -211,8 +209,7 @@ EXPORT_SYMBOL(_kstrtol); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull. Return code must - * be checked. + * Preferred over simple_strtoul(). Return code must be checked. */ int kstrtouint(const char *s, unsigned int base, unsigned int *res) { @@ -242,8 +239,7 @@ EXPORT_SYMBOL(kstrtouint); * @res: Where to write the result of the conversion on success. * * Returns 0 on success, -ERANGE on overflow and -EINVAL on parsing error. - * Used as a replacement for the obsolete simple_strtoull. Return code must - * be checked. + * Preferred over simple_strtol(). Return code must be checked. */ int kstrtoint(const char *s, unsigned int base, int *res) { diff --git a/lib/math/rational.c b/lib/math/rational.c index 31fb27db2deb..df75c8809693 100644 --- a/lib/math/rational.c +++ b/lib/math/rational.c @@ -27,7 +27,7 @@ * with the fractional part size described in given_denominator. * * for theoretical background, see: - * http://en.wikipedia.org/wiki/Continued_fraction + * https://en.wikipedia.org/wiki/Continued_fraction */ void rational_best_approximation( diff --git a/lib/rbtree.c b/lib/rbtree.c index 8545872e61db..c4ac5c2421f2 100644 --- a/lib/rbtree.c +++ b/lib/rbtree.c @@ -13,7 +13,7 @@ #include <linux/export.h> /* - * red-black trees properties: http://en.wikipedia.org/wiki/Rbtree + * red-black trees properties: https://en.wikipedia.org/wiki/Rbtree * * 1) A node is either red or black * 2) The root is black diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c index 6b13150667f5..df903c53952b 100644 --- a/lib/test_bitmap.c +++ b/lib/test_bitmap.c @@ -610,6 +610,63 @@ static void __init test_for_each_set_clump8(void) expect_eq_clump8(start, CLUMP_EXP_NUMBITS, clump_exp, &clump); } +struct test_bitmap_cut { + unsigned int first; + unsigned int cut; + unsigned int nbits; + unsigned long in[4]; + unsigned long expected[4]; +}; + +static struct test_bitmap_cut test_cut[] = { + { 0, 0, 8, { 0x0000000aUL, }, { 0x0000000aUL, }, }, + { 0, 0, 32, { 0xdadadeadUL, }, { 0xdadadeadUL, }, }, + { 0, 3, 8, { 0x000000aaUL, }, { 0x00000015UL, }, }, + { 3, 3, 8, { 0x000000aaUL, }, { 0x00000012UL, }, }, + { 0, 1, 32, { 0xa5a5a5a5UL, }, { 0x52d2d2d2UL, }, }, + { 0, 8, 32, { 0xdeadc0deUL, }, { 0x00deadc0UL, }, }, + { 1, 1, 32, { 0x5a5a5a5aUL, }, { 0x2d2d2d2cUL, }, }, + { 0, 15, 32, { 0xa5a5a5a5UL, }, { 0x00014b4bUL, }, }, + { 0, 16, 32, { 0xa5a5a5a5UL, }, { 0x0000a5a5UL, }, }, + { 15, 15, 32, { 0xa5a5a5a5UL, }, { 0x000125a5UL, }, }, + { 15, 16, 32, { 0xa5a5a5a5UL, }, { 0x0000a5a5UL, }, }, + { 16, 15, 32, { 0xa5a5a5a5UL, }, { 0x0001a5a5UL, }, }, + + { BITS_PER_LONG, BITS_PER_LONG, BITS_PER_LONG, + { 0xa5a5a5a5UL, 0xa5a5a5a5UL, }, + { 0xa5a5a5a5UL, 0xa5a5a5a5UL, }, + }, + { 1, BITS_PER_LONG - 1, BITS_PER_LONG, + { 0xa5a5a5a5UL, 0xa5a5a5a5UL, }, + { 0x00000001UL, 0x00000001UL, }, + }, + + { 0, BITS_PER_LONG * 2, BITS_PER_LONG * 2 + 1, + { 0xa5a5a5a5UL, 0x00000001UL, 0x00000001UL, 0x00000001UL }, + { 0x00000001UL, }, + }, + { 16, BITS_PER_LONG * 2 + 1, BITS_PER_LONG * 2 + 1 + 16, + { 0x0000ffffUL, 0x5a5a5a5aUL, 0x5a5a5a5aUL, 0x5a5a5a5aUL }, + { 0x2d2dffffUL, }, + }, +}; + +static void __init test_bitmap_cut(void) +{ + unsigned long b[5], *in = &b[1], *out = &b[0]; /* Partial overlap */ + int i; + + for (i = 0; i < ARRAY_SIZE(test_cut); i++) { + struct test_bitmap_cut *t = &test_cut[i]; + + memcpy(in, t->in, sizeof(t->in)); + + bitmap_cut(out, in, t->first, t->cut, t->nbits); + + expect_eq_bitmap(t->expected, out, t->nbits); + } +} + static void __init selftest(void) { test_zero_clear(); @@ -623,6 +680,7 @@ static void __init selftest(void) test_bitmap_parselist_user(); test_mem_optimisations(); test_for_each_set_clump8(); + test_bitmap_cut(); } KSTM_MODULE_LOADERS(test_bitmap); diff --git a/lib/test_bitops.c b/lib/test_bitops.c index ced25e3a779b..471141ddd691 100644 --- a/lib/test_bitops.c +++ b/lib/test_bitops.c @@ -52,9 +52,9 @@ static unsigned long order_comb_long[][2] = { static int __init test_bitops_startup(void) { - int i; + int i, bit_set; - pr_warn("Loaded test module\n"); + pr_info("Starting bitops test\n"); set_bit(BITOPS_4, g_bitmap); set_bit(BITOPS_7, g_bitmap); set_bit(BITOPS_11, g_bitmap); @@ -81,12 +81,8 @@ static int __init test_bitops_startup(void) order_comb_long[i][0]); } #endif - return 0; -} -static void __exit test_bitops_unstartup(void) -{ - int bit_set; + barrier(); clear_bit(BITOPS_4, g_bitmap); clear_bit(BITOPS_7, g_bitmap); @@ -98,7 +94,13 @@ static void __exit test_bitops_unstartup(void) if (bit_set != BITOPS_LAST) pr_err("ERROR: FOUND SET BIT %d\n", bit_set); - pr_warn("Unloaded test module\n"); + pr_info("Completed bitops test\n"); + + return 0; +} + +static void __exit test_bitops_unstartup(void) +{ } module_init(test_bitops_startup); diff --git a/lib/test_bits.c b/lib/test_bits.c new file mode 100644 index 000000000000..c9368a2314e7 --- /dev/null +++ b/lib/test_bits.c @@ -0,0 +1,75 @@ +// SPDX-License-Identifier: GPL-2.0+ +/* + * Test cases for functions and macros in bits.h + */ + +#include <kunit/test.h> +#include <linux/bits.h> + + +static void genmask_test(struct kunit *test) +{ + KUNIT_EXPECT_EQ(test, 1ul, GENMASK(0, 0)); + KUNIT_EXPECT_EQ(test, 3ul, GENMASK(1, 0)); + KUNIT_EXPECT_EQ(test, 6ul, GENMASK(2, 1)); + KUNIT_EXPECT_EQ(test, 0xFFFFFFFFul, GENMASK(31, 0)); + +#ifdef TEST_GENMASK_FAILURES + /* these should fail compilation */ + GENMASK(0, 1); + GENMASK(0, 10); + GENMASK(9, 10); +#endif + + +} + +static void genmask_ull_test(struct kunit *test) +{ + KUNIT_EXPECT_EQ(test, 1ull, GENMASK_ULL(0, 0)); + KUNIT_EXPECT_EQ(test, 3ull, GENMASK_ULL(1, 0)); + KUNIT_EXPECT_EQ(test, 0x000000ffffe00000ull, GENMASK_ULL(39, 21)); + KUNIT_EXPECT_EQ(test, 0xffffffffffffffffull, GENMASK_ULL(63, 0)); + +#ifdef TEST_GENMASK_FAILURES + /* these should fail compilation */ + GENMASK_ULL(0, 1); + GENMASK_ULL(0, 10); + GENMASK_ULL(9, 10); +#endif +} + +static void genmask_input_check_test(struct kunit *test) +{ + unsigned int x, y; + int z, w; + + /* Unknown input */ + KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(x, 0)); + KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(0, x)); + KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(x, y)); + + KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(z, 0)); + KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(0, z)); + KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(z, w)); + + /* Valid input */ + KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(1, 1)); + KUNIT_EXPECT_EQ(test, 0, GENMASK_INPUT_CHECK(39, 21)); +} + + +static struct kunit_case bits_test_cases[] = { + KUNIT_CASE(genmask_test), + KUNIT_CASE(genmask_ull_test), + KUNIT_CASE(genmask_input_check_test), + {} +}; + +static struct kunit_suite bits_test_suite = { + .name = "bits-test", + .test_cases = bits_test_cases, +}; +kunit_test_suite(bits_test_suite); + +MODULE_LICENSE("GPL"); diff --git a/lib/test_kmod.c b/lib/test_kmod.c index e651c37d56db..eab52770070d 100644 --- a/lib/test_kmod.c +++ b/lib/test_kmod.c @@ -745,7 +745,7 @@ static int trigger_config_run_type(struct kmod_test_device *test_dev, break; case TEST_KMOD_FS_TYPE: kfree_const(config->test_fs); - config->test_driver = NULL; + config->test_fs = NULL; copied = config_copy_test_fs(config, test_str, strlen(test_str)); break; diff --git a/lib/test_lockup.c b/lib/test_lockup.c index ff26f36d729f..f1a020bcc763 100644 --- a/lib/test_lockup.c +++ b/lib/test_lockup.c @@ -400,7 +400,7 @@ static void test_lockup(bool master) test_unlock(master, true); } -DEFINE_PER_CPU(struct work_struct, test_works); +static DEFINE_PER_CPU(struct work_struct, test_works); static void test_work_fn(struct work_struct *work) { @@ -512,8 +512,8 @@ static int __init test_lockup_init(void) if (test_file_path[0]) { test_file = filp_open(test_file_path, O_RDONLY, 0); if (IS_ERR(test_file)) { - pr_err("cannot find file_path\n"); - return -EINVAL; + pr_err("failed to open %s: %ld\n", test_file_path, PTR_ERR(test_file)); + return PTR_ERR(test_file); } test_inode = file_inode(test_file); } else if (test_lock_inode || diff --git a/lib/ts_bm.c b/lib/ts_bm.c index 277cb4417ac2..4cf250031f0f 100644 --- a/lib/ts_bm.c +++ b/lib/ts_bm.c @@ -11,7 +11,7 @@ * [1] A Fast String Searching Algorithm, R.S. Boyer and Moore. * Communications of the Association for Computing Machinery, * 20(10), 1977, pp. 762-772. - * http://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf + * https://www.cs.utexas.edu/users/moore/publications/fstrpos.pdf * * [2] Handbook of Exact String Matching Algorithms, Thierry Lecroq, 2004 * http://www-igm.univ-mlv.fr/~lecroq/string/string.pdf diff --git a/lib/xxhash.c b/lib/xxhash.c index aa61e2a3802f..d5bb9ff10607 100644 --- a/lib/xxhash.c +++ b/lib/xxhash.c @@ -34,7 +34,7 @@ * ("BSD"). * * You can contact the author at: - * - xxHash homepage: http://cyan4973.github.io/xxHash/ + * - xxHash homepage: https://cyan4973.github.io/xxHash/ * - xxHash source repository: https://github.com/Cyan4973/xxHash */ diff --git a/lib/xz/xz_crc32.c b/lib/xz/xz_crc32.c index 912aae5fa09e..88a2c35e1b59 100644 --- a/lib/xz/xz_crc32.c +++ b/lib/xz/xz_crc32.c @@ -2,7 +2,7 @@ * CRC32 using the polynomial from IEEE-802.3 * * Authors: Lasse Collin <lasse.collin@tukaani.org> - * Igor Pavlov <http://7-zip.org/> + * Igor Pavlov <https://7-zip.org/> * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/lib/xz/xz_dec_bcj.c b/lib/xz/xz_dec_bcj.c index a768e6d28bbb..72ddac6ef2ec 100644 --- a/lib/xz/xz_dec_bcj.c +++ b/lib/xz/xz_dec_bcj.c @@ -2,7 +2,7 @@ * Branch/Call/Jump (BCJ) filter decoders * * Authors: Lasse Collin <lasse.collin@tukaani.org> - * Igor Pavlov <http://7-zip.org/> + * Igor Pavlov <https://7-zip.org/> * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/lib/xz/xz_dec_lzma2.c b/lib/xz/xz_dec_lzma2.c index 156f26fdc4c9..9f336bc07ed6 100644 --- a/lib/xz/xz_dec_lzma2.c +++ b/lib/xz/xz_dec_lzma2.c @@ -2,7 +2,7 @@ * LZMA2 decoder * * Authors: Lasse Collin <lasse.collin@tukaani.org> - * Igor Pavlov <http://7-zip.org/> + * Igor Pavlov <https://7-zip.org/> * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/lib/xz/xz_lzma2.h b/lib/xz/xz_lzma2.h index 071d67bee9f5..92d852d4f87a 100644 --- a/lib/xz/xz_lzma2.h +++ b/lib/xz/xz_lzma2.h @@ -2,7 +2,7 @@ * LZMA2 definitions * * Authors: Lasse Collin <lasse.collin@tukaani.org> - * Igor Pavlov <http://7-zip.org/> + * Igor Pavlov <https://7-zip.org/> * * This file has been put into the public domain. * You can do whatever you want with this file. diff --git a/lib/xz/xz_stream.h b/lib/xz/xz_stream.h index 66cb5a7055ec..430bb3a0d195 100644 --- a/lib/xz/xz_stream.h +++ b/lib/xz/xz_stream.h @@ -19,7 +19,7 @@ /* * See the .xz file format specification at - * http://tukaani.org/xz/xz-file-format.txt + * https://tukaani.org/xz/xz-file-format.txt * to understand the container format. */ @@ -52,7 +52,7 @@ unsigned long cma_get_size(const struct cma *cma) const char *cma_get_name(const struct cma *cma) { - return cma->name ? cma->name : "(undefined)"; + return cma->name; } static unsigned long cma_bitmap_aligned_mask(const struct cma *cma, @@ -93,17 +93,15 @@ static void cma_clear_bitmap(struct cma *cma, unsigned long pfn, mutex_unlock(&cma->lock); } -static int __init cma_activate_area(struct cma *cma) +static void __init cma_activate_area(struct cma *cma) { unsigned long base_pfn = cma->base_pfn, pfn = base_pfn; unsigned i = cma->count >> pageblock_order; struct zone *zone; cma->bitmap = bitmap_zalloc(cma_bitmap_maxno(cma), GFP_KERNEL); - if (!cma->bitmap) { - cma->count = 0; - return -ENOMEM; - } + if (!cma->bitmap) + goto out_error; WARN_ON_ONCE(!pfn_valid(pfn)); zone = page_zone(pfn_to_page(pfn)); @@ -133,25 +131,22 @@ static int __init cma_activate_area(struct cma *cma) spin_lock_init(&cma->mem_head_lock); #endif - return 0; + return; not_in_zone: - pr_err("CMA area %s could not be activated\n", cma->name); bitmap_free(cma->bitmap); +out_error: cma->count = 0; - return -EINVAL; + pr_err("CMA area %s could not be activated\n", cma->name); + return; } static int __init cma_init_reserved_areas(void) { int i; - for (i = 0; i < cma_area_count; i++) { - int ret = cma_activate_area(&cma_areas[i]); - - if (ret) - return ret; - } + for (i = 0; i < cma_area_count; i++) + cma_activate_area(&cma_areas[i]); return 0; } @@ -202,13 +197,12 @@ int __init cma_init_reserved_mem(phys_addr_t base, phys_addr_t size, * subsystems (like slab allocator) are available. */ cma = &cma_areas[cma_area_count]; - if (name) { - cma->name = name; - } else { - cma->name = kasprintf(GFP_KERNEL, "cma%d\n", cma_area_count); - if (!cma->name) - return -ENOMEM; - } + + if (name) + snprintf(cma->name, CMA_MAX_NAME, name); + else + snprintf(cma->name, CMA_MAX_NAME, "cma%d\n", cma_area_count); + cma->base_pfn = PFN_DOWN(base); cma->count = size >> PAGE_SHIFT; cma->order_per_bit = order_per_bit; @@ -425,7 +419,7 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, struct page *page = NULL; int ret = -ENOMEM; - if (!cma || !cma->count) + if (!cma || !cma->count || !cma->bitmap) return NULL; pr_debug("%s(cma %p, count %zu, align %d)\n", __func__, (void *)cma, @@ -4,6 +4,8 @@ #include <linux/debugfs.h> +#define CMA_MAX_NAME 64 + struct cma { unsigned long base_pfn; unsigned long count; @@ -15,7 +17,7 @@ struct cma { spinlock_t mem_head_lock; struct debugfs_u32_array dfs_bitmap; #endif - const char *name; + char name[CMA_MAX_NAME]; }; extern struct cma cma_areas[MAX_CMA_AREAS]; diff --git a/mm/compaction.c b/mm/compaction.c index 86375605faa9..b89581bf859c 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -50,6 +50,24 @@ static inline void count_compact_events(enum vm_event_item item, long delta) #define pageblock_start_pfn(pfn) block_start_pfn(pfn, pageblock_order) #define pageblock_end_pfn(pfn) block_end_pfn(pfn, pageblock_order) +/* + * Fragmentation score check interval for proactive compaction purposes. + */ +static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500; + +/* + * Page order with-respect-to which proactive compaction + * calculates external fragmentation, which is used as + * the "fragmentation score" of a node/zone. + */ +#if defined CONFIG_TRANSPARENT_HUGEPAGE +#define COMPACTION_HPAGE_ORDER HPAGE_PMD_ORDER +#elif defined CONFIG_HUGETLBFS +#define COMPACTION_HPAGE_ORDER HUGETLB_PAGE_ORDER +#else +#define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT) +#endif + static unsigned long release_freepages(struct list_head *freelist) { struct page *page, *next; @@ -136,7 +154,7 @@ EXPORT_SYMBOL(__ClearPageMovable); /* * Compaction is deferred when compaction fails to result in a page - * allocation success. 1 << compact_defer_limit compactions are skipped up + * allocation success. 1 << compact_defer_shift, compactions are skipped up * to a limit of 1 << COMPACT_MAX_DEFER_SHIFT */ void defer_compaction(struct zone *zone, int order) @@ -1459,7 +1477,7 @@ static void isolate_freepages(struct compact_control *cc) * this pfn aligned down to the pageblock boundary, because we do * block_start_pfn -= pageblock_nr_pages in the for loop. * For ending point, take care when isolating in last pageblock of a - * a zone which ends in the middle of a pageblock. + * zone which ends in the middle of a pageblock. * The low boundary is the end of the pageblock the migration scanner * is using. */ @@ -1857,6 +1875,76 @@ static inline bool is_via_compact_memory(int order) return order == -1; } +static bool kswapd_is_running(pg_data_t *pgdat) +{ + return pgdat->kswapd && (pgdat->kswapd->state == TASK_RUNNING); +} + +/* + * A zone's fragmentation score is the external fragmentation wrt to the + * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value + * in the range [0, 100]. + * + * The scaling factor ensures that proactive compaction focuses on larger + * zones like ZONE_NORMAL, rather than smaller, specialized zones like + * ZONE_DMA32. For smaller zones, the score value remains close to zero, + * and thus never exceeds the high threshold for proactive compaction. + */ +static unsigned int fragmentation_score_zone(struct zone *zone) +{ + unsigned long score; + + score = zone->present_pages * + extfrag_for_order(zone, COMPACTION_HPAGE_ORDER); + return div64_ul(score, zone->zone_pgdat->node_present_pages + 1); +} + +/* + * The per-node proactive (background) compaction process is started by its + * corresponding kcompactd thread when the node's fragmentation score + * exceeds the high threshold. The compaction process remains active till + * the node's score falls below the low threshold, or one of the back-off + * conditions is met. + */ +static unsigned int fragmentation_score_node(pg_data_t *pgdat) +{ + unsigned int score = 0; + int zoneid; + + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { + struct zone *zone; + + zone = &pgdat->node_zones[zoneid]; + score += fragmentation_score_zone(zone); + } + + return score; +} + +static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low) +{ + unsigned int wmark_low; + + /* + * Cap the low watermak to avoid excessive compaction + * activity in case a user sets the proactivess tunable + * close to 100 (maximum). + */ + wmark_low = max(100U - sysctl_compaction_proactiveness, 5U); + return low ? wmark_low : min(wmark_low + 10, 100U); +} + +static bool should_proactive_compact_node(pg_data_t *pgdat) +{ + int wmark_high; + + if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat)) + return false; + + wmark_high = fragmentation_score_wmark(pgdat, false); + return fragmentation_score_node(pgdat) > wmark_high; +} + static enum compact_result __compact_finished(struct compact_control *cc) { unsigned int order; @@ -1883,6 +1971,25 @@ static enum compact_result __compact_finished(struct compact_control *cc) return COMPACT_PARTIAL_SKIPPED; } + if (cc->proactive_compaction) { + int score, wmark_low; + pg_data_t *pgdat; + + pgdat = cc->zone->zone_pgdat; + if (kswapd_is_running(pgdat)) + return COMPACT_PARTIAL_SKIPPED; + + score = fragmentation_score_zone(cc->zone); + wmark_low = fragmentation_score_wmark(pgdat, true); + + if (score > wmark_low) + ret = COMPACT_CONTINUE; + else + ret = COMPACT_SUCCESS; + + goto out; + } + if (is_via_compact_memory(cc->order)) return COMPACT_CONTINUE; @@ -1941,6 +2048,7 @@ static enum compact_result __compact_finished(struct compact_control *cc) } } +out: if (cc->contended || fatal_signal_pending(current)) ret = COMPACT_CONTENDED; @@ -2421,6 +2529,41 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order, return rc; } +/* + * Compact all zones within a node till each zone's fragmentation score + * reaches within proactive compaction thresholds (as determined by the + * proactiveness tunable). + * + * It is possible that the function returns before reaching score targets + * due to various back-off conditions, such as, contention on per-node or + * per-zone locks. + */ +static void proactive_compact_node(pg_data_t *pgdat) +{ + int zoneid; + struct zone *zone; + struct compact_control cc = { + .order = -1, + .mode = MIGRATE_SYNC_LIGHT, + .ignore_skip_hint = true, + .whole_zone = true, + .gfp_mask = GFP_KERNEL, + .proactive_compaction = true, + }; + + for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) { + zone = &pgdat->node_zones[zoneid]; + if (!populated_zone(zone)) + continue; + + cc.zone = zone; + + compact_zone(&cc, NULL); + + VM_BUG_ON(!list_empty(&cc.freepages)); + VM_BUG_ON(!list_empty(&cc.migratepages)); + } +} /* Compact all zones within a node */ static void compact_node(int nid) @@ -2468,6 +2611,13 @@ static void compact_nodes(void) int sysctl_compact_memory; /* + * Tunable for proactive compaction. It determines how + * aggressively the kernel should compact memory in the + * background. It takes values in the range [0, 100]. + */ +unsigned int __read_mostly sysctl_compaction_proactiveness = 20; + +/* * This is the entry point for compacting all nodes via * /proc/sys/vm/compact_memory */ @@ -2646,6 +2796,7 @@ static int kcompactd(void *p) { pg_data_t *pgdat = (pg_data_t*)p; struct task_struct *tsk = current; + unsigned int proactive_defer = 0; const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id); @@ -2661,12 +2812,34 @@ static int kcompactd(void *p) unsigned long pflags; trace_mm_compaction_kcompactd_sleep(pgdat->node_id); - wait_event_freezable(pgdat->kcompactd_wait, - kcompactd_work_requested(pgdat)); + if (wait_event_freezable_timeout(pgdat->kcompactd_wait, + kcompactd_work_requested(pgdat), + msecs_to_jiffies(HPAGE_FRAG_CHECK_INTERVAL_MSEC))) { + + psi_memstall_enter(&pflags); + kcompactd_do_work(pgdat); + psi_memstall_leave(&pflags); + continue; + } - psi_memstall_enter(&pflags); - kcompactd_do_work(pgdat); - psi_memstall_leave(&pflags); + /* kcompactd wait timeout */ + if (should_proactive_compact_node(pgdat)) { + unsigned int prev_score, score; + + if (proactive_defer) { + proactive_defer--; + continue; + } + prev_score = fragmentation_score_node(pgdat); + proactive_compact_node(pgdat); + score = fragmentation_score_node(pgdat); + /* + * Defer proactive compaction if the fragmentation + * score did not go down i.e. no progress made. + */ + proactive_defer = score < prev_score ? + 0 : 1 << COMPACT_MAX_DEFER_SHIFT; + } } return 0; diff --git a/mm/filemap.c b/mm/filemap.c index f2bb5ff0293d..8e75bce0346d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2885,7 +2885,7 @@ filler: * Case a, the page will be up to date when the page is unlocked. * There is no need to serialise on the page lock here as the page * is pinned so the lock gives no additional protection. Even if the - * the page is truncated, the data is still valid if PageUptodate as + * page is truncated, the data is still valid if PageUptodate as * it's a race vs truncate race. * Case b, the page will not be up to date * Case c, the page may be truncated but in itself, the data may still @@ -859,7 +859,7 @@ unmap: * does not include FOLL_NOWAIT, the mmap_lock may be released. If it * is, *@locked will be set to 0 and -EBUSY returned. */ -static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma, +static int faultin_page(struct vm_area_struct *vma, unsigned long address, unsigned int *flags, int *locked) { unsigned int fault_flags = 0; @@ -884,7 +884,7 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma, fault_flags |= FAULT_FLAG_TRIED; } - ret = handle_mm_fault(vma, address, fault_flags); + ret = handle_mm_fault(vma, address, fault_flags, NULL); if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, *flags); @@ -893,13 +893,6 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma, BUG(); } - if (tsk) { - if (ret & VM_FAULT_MAJOR) - tsk->maj_flt++; - else - tsk->min_flt++; - } - if (ret & VM_FAULT_RETRY) { if (locked && !(fault_flags & FAULT_FLAG_RETRY_NOWAIT)) *locked = 0; @@ -969,7 +962,6 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) /** * __get_user_pages() - pin user pages in memory - * @tsk: task_struct of target task * @mm: mm_struct of target mm * @start: starting user address * @nr_pages: number of pages from start to pin @@ -1028,7 +1020,7 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags) * instead of __get_user_pages. __get_user_pages should be used only if * you need some special @gup_flags. */ -static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, +static long __get_user_pages(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1110,8 +1102,7 @@ retry: page = follow_page_mask(vma, start, foll_flags, &ctx); if (!page) { - ret = faultin_page(tsk, vma, start, &foll_flags, - locked); + ret = faultin_page(vma, start, &foll_flags, locked); switch (ret) { case 0: goto retry; @@ -1185,8 +1176,6 @@ static bool vma_permits_fault(struct vm_area_struct *vma, /** * fixup_user_fault() - manually resolve a user page fault - * @tsk: the task_struct to use for page fault accounting, or - * NULL if faults are not to be recorded. * @mm: mm_struct of target mm * @address: user address * @fault_flags:flags to pass down to handle_mm_fault() @@ -1214,7 +1203,7 @@ static bool vma_permits_fault(struct vm_area_struct *vma, * This function will not return with an unlocked mmap_lock. So it has not the * same semantics wrt the @mm->mmap_lock as does filemap_fault(). */ -int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm, +int fixup_user_fault(struct mm_struct *mm, unsigned long address, unsigned int fault_flags, bool *unlocked) { @@ -1238,7 +1227,7 @@ retry: fatal_signal_pending(current)) return -EINTR; - ret = handle_mm_fault(vma, address, fault_flags); + ret = handle_mm_fault(vma, address, fault_flags, NULL); major |= ret & VM_FAULT_MAJOR; if (ret & VM_FAULT_ERROR) { int err = vm_fault_to_errno(ret, 0); @@ -1255,12 +1244,6 @@ retry: goto retry; } - if (tsk) { - if (major) - tsk->maj_flt++; - else - tsk->min_flt++; - } return 0; } EXPORT_SYMBOL_GPL(fixup_user_fault); @@ -1269,8 +1252,7 @@ EXPORT_SYMBOL_GPL(fixup_user_fault); * Please note that this function, unlike __get_user_pages will not * return 0 for nr_pages > 0 without FOLL_NOWAIT */ -static __always_inline long __get_user_pages_locked(struct task_struct *tsk, - struct mm_struct *mm, +static __always_inline long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, @@ -1303,7 +1285,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk, pages_done = 0; lock_dropped = false; for (;;) { - ret = __get_user_pages(tsk, mm, start, nr_pages, flags, pages, + ret = __get_user_pages(mm, start, nr_pages, flags, pages, vmas, locked); if (!locked) /* VM_FAULT_RETRY couldn't trigger, bypass */ @@ -1363,7 +1345,7 @@ retry: } *locked = 1; - ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED, + ret = __get_user_pages(mm, start, 1, flags | FOLL_TRIED, pages, NULL, locked); if (!*locked) { /* Continue to retry until we succeeded */ @@ -1450,7 +1432,7 @@ long populate_vma_page_range(struct vm_area_struct *vma, * We made sure addr is within a VMA, so the following will * not result in a stack expansion that recurses back here. */ - return __get_user_pages(current, mm, start, nr_pages, gup_flags, + return __get_user_pages(mm, start, nr_pages, gup_flags, NULL, NULL, locked); } @@ -1534,7 +1516,7 @@ struct page *get_dump_page(unsigned long addr) struct vm_area_struct *vma; struct page *page; - if (__get_user_pages(current, current->mm, addr, 1, + if (__get_user_pages(current->mm, addr, 1, FOLL_FORCE | FOLL_DUMP | FOLL_GET, &page, &vma, NULL) < 1) return NULL; @@ -1543,8 +1525,7 @@ struct page *get_dump_page(unsigned long addr) } #endif /* CONFIG_ELF_CORE */ #else /* CONFIG_MMU */ -static long __get_user_pages_locked(struct task_struct *tsk, - struct mm_struct *mm, unsigned long start, +static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, struct vm_area_struct **vmas, int *locked, unsigned int foll_flags) @@ -1609,59 +1590,7 @@ static bool check_dax_vmas(struct vm_area_struct **vmas, long nr_pages) } #ifdef CONFIG_CMA -static struct page *new_non_cma_page(struct page *page, unsigned long private) -{ - /* - * We want to make sure we allocate the new page from the same node - * as the source page. - */ - int nid = page_to_nid(page); - /* - * Trying to allocate a page for migration. Ignore allocation - * failure warnings. We don't force __GFP_THISNODE here because - * this node here is the node where we have CMA reservation and - * in some case these nodes will have really less non movable - * allocation memory. - */ - gfp_t gfp_mask = GFP_USER | __GFP_NOWARN; - - if (PageHighMem(page)) - gfp_mask |= __GFP_HIGHMEM; - -#ifdef CONFIG_HUGETLB_PAGE - if (PageHuge(page)) { - struct hstate *h = page_hstate(page); - /* - * We don't want to dequeue from the pool because pool pages will - * mostly be from the CMA region. - */ - return alloc_migrate_huge_page(h, gfp_mask, nid, NULL); - } -#endif - if (PageTransHuge(page)) { - struct page *thp; - /* - * ignore allocation failure warnings - */ - gfp_t thp_gfpmask = GFP_TRANSHUGE | __GFP_NOWARN; - - /* - * Remove the movable mask so that we don't allocate from - * CMA area again. - */ - thp_gfpmask &= ~__GFP_MOVABLE; - thp = __alloc_pages_node(nid, thp_gfpmask, HPAGE_PMD_ORDER); - if (!thp) - return NULL; - prep_transhuge_page(thp); - return thp; - } - - return __alloc_pages_node(nid, gfp_mask, 0); -} - -static long check_and_migrate_cma_pages(struct task_struct *tsk, - struct mm_struct *mm, +static long check_and_migrate_cma_pages(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, @@ -1674,6 +1603,10 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, bool migrate_allow = true; LIST_HEAD(cma_page_list); long ret = nr_pages; + struct migration_target_control mtc = { + .nid = NUMA_NO_NODE, + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_NOWARN, + }; check_again: for (i = 0; i < nr_pages;) { @@ -1719,8 +1652,8 @@ check_again: for (i = 0; i < nr_pages; i++) put_page(pages[i]); - if (migrate_pages(&cma_page_list, new_non_cma_page, - NULL, 0, MIGRATE_SYNC, MR_CONTIG_RANGE)) { + if (migrate_pages(&cma_page_list, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_SYNC, MR_CONTIG_RANGE)) { /* * some of the pages failed migration. Do get_user_pages * without migration. @@ -1735,7 +1668,7 @@ check_again: * again migrating any new CMA pages which we failed to isolate * earlier. */ - ret = __get_user_pages_locked(tsk, mm, start, nr_pages, + ret = __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL, gup_flags); @@ -1749,8 +1682,7 @@ check_again: return ret; } #else -static long check_and_migrate_cma_pages(struct task_struct *tsk, - struct mm_struct *mm, +static long check_and_migrate_cma_pages(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, @@ -1765,8 +1697,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk, * __gup_longterm_locked() is a wrapper for __get_user_pages_locked which * allows us to process the FOLL_LONGTERM flag. */ -static long __gup_longterm_locked(struct task_struct *tsk, - struct mm_struct *mm, +static long __gup_longterm_locked(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, @@ -1791,11 +1722,10 @@ static long __gup_longterm_locked(struct task_struct *tsk, flags = memalloc_nocma_save(); } - rc = __get_user_pages_locked(tsk, mm, start, nr_pages, pages, + rc = __get_user_pages_locked(mm, start, nr_pages, pages, vmas_tmp, NULL, gup_flags); if (gup_flags & FOLL_LONGTERM) { - memalloc_nocma_restore(flags); if (rc < 0) goto out; @@ -1806,32 +1736,31 @@ static long __gup_longterm_locked(struct task_struct *tsk, goto out; } - rc = check_and_migrate_cma_pages(tsk, mm, start, rc, pages, + rc = check_and_migrate_cma_pages(mm, start, rc, pages, vmas_tmp, gup_flags); +out: + memalloc_nocma_restore(flags); } -out: if (vmas_tmp != vmas) kfree(vmas_tmp); return rc; } #else /* !CONFIG_FS_DAX && !CONFIG_CMA */ -static __always_inline long __gup_longterm_locked(struct task_struct *tsk, - struct mm_struct *mm, +static __always_inline long __gup_longterm_locked(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, struct page **pages, struct vm_area_struct **vmas, unsigned int flags) { - return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, + return __get_user_pages_locked(mm, start, nr_pages, pages, vmas, NULL, flags); } #endif /* CONFIG_FS_DAX || CONFIG_CMA */ #ifdef CONFIG_MMU -static long __get_user_pages_remote(struct task_struct *tsk, - struct mm_struct *mm, +static long __get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1850,20 +1779,18 @@ static long __get_user_pages_remote(struct task_struct *tsk, * This will check the vmas (even if our vmas arg is NULL) * and return -ENOTSUPP if DAX isn't allowed in this case: */ - return __gup_longterm_locked(tsk, mm, start, nr_pages, pages, + return __gup_longterm_locked(mm, start, nr_pages, pages, vmas, gup_flags | FOLL_TOUCH | FOLL_REMOTE); } - return __get_user_pages_locked(tsk, mm, start, nr_pages, pages, vmas, + return __get_user_pages_locked(mm, start, nr_pages, pages, vmas, locked, gup_flags | FOLL_TOUCH | FOLL_REMOTE); } /** * get_user_pages_remote() - pin user pages in memory - * @tsk: the task_struct to use for page fault accounting, or - * NULL if faults are not to be recorded. * @mm: mm_struct of target mm * @start: starting user address * @nr_pages: number of pages from start to pin @@ -1922,7 +1849,7 @@ static long __get_user_pages_remote(struct task_struct *tsk, * should use get_user_pages_remote because it cannot pass * FAULT_FLAG_ALLOW_RETRY to handle_mm_fault. */ -long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1934,13 +1861,13 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) return -EINVAL; - return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, + return __get_user_pages_remote(mm, start, nr_pages, gup_flags, pages, vmas, locked); } EXPORT_SYMBOL(get_user_pages_remote); #else /* CONFIG_MMU */ -long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1948,8 +1875,7 @@ long get_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, return 0; } -static long __get_user_pages_remote(struct task_struct *tsk, - struct mm_struct *mm, +static long __get_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -1969,11 +1895,10 @@ static long __get_user_pages_remote(struct task_struct *tsk, * @vmas: array of pointers to vmas corresponding to each page. * Or NULL if the caller does not require them. * - * This is the same as get_user_pages_remote(), just with a - * less-flexible calling convention where we assume that the task - * and mm being operated on are the current task's and don't allow - * passing of a locked parameter. We also obviously don't pass - * FOLL_REMOTE in here. + * This is the same as get_user_pages_remote(), just with a less-flexible + * calling convention where we assume that the mm being operated on belongs to + * the current task, and doesn't allow passing of a locked parameter. We also + * obviously don't pass FOLL_REMOTE in here. */ long get_user_pages(unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, @@ -1986,7 +1911,7 @@ long get_user_pages(unsigned long start, unsigned long nr_pages, if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) return -EINVAL; - return __gup_longterm_locked(current, current->mm, start, nr_pages, + return __gup_longterm_locked(current->mm, start, nr_pages, pages, vmas, gup_flags | FOLL_TOUCH); } EXPORT_SYMBOL(get_user_pages); @@ -1996,7 +1921,7 @@ EXPORT_SYMBOL(get_user_pages); * * mmap_read_lock(mm); * do_something() - * get_user_pages(tsk, mm, ..., pages, NULL); + * get_user_pages(mm, ..., pages, NULL); * mmap_read_unlock(mm); * * to: @@ -2004,7 +1929,7 @@ EXPORT_SYMBOL(get_user_pages); * int locked = 1; * mmap_read_lock(mm); * do_something() - * get_user_pages_locked(tsk, mm, ..., pages, &locked); + * get_user_pages_locked(mm, ..., pages, &locked); * if (locked) * mmap_read_unlock(mm); * @@ -2042,7 +1967,7 @@ long get_user_pages_locked(unsigned long start, unsigned long nr_pages, if (WARN_ON_ONCE(gup_flags & FOLL_PIN)) return -EINVAL; - return __get_user_pages_locked(current, current->mm, start, nr_pages, + return __get_user_pages_locked(current->mm, start, nr_pages, pages, NULL, locked, gup_flags | FOLL_TOUCH); } @@ -2052,12 +1977,12 @@ EXPORT_SYMBOL(get_user_pages_locked); * get_user_pages_unlocked() is suitable to replace the form: * * mmap_read_lock(mm); - * get_user_pages(tsk, mm, ..., pages, NULL); + * get_user_pages(mm, ..., pages, NULL); * mmap_read_unlock(mm); * * with: * - * get_user_pages_unlocked(tsk, mm, ..., pages); + * get_user_pages_unlocked(mm, ..., pages); * * It is functionally equivalent to get_user_pages_fast so * get_user_pages_fast should be used instead if specific gup_flags @@ -2080,7 +2005,7 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages, return -EINVAL; mmap_read_lock(mm); - ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL, + ret = __get_user_pages_locked(mm, start, nr_pages, pages, NULL, &locked, gup_flags | FOLL_TOUCH); if (locked) mmap_read_unlock(mm); @@ -2725,7 +2650,7 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages, */ if (gup_flags & FOLL_LONGTERM) { mmap_read_lock(current->mm); - ret = __gup_longterm_locked(current, current->mm, + ret = __gup_longterm_locked(current->mm, start, nr_pages, pages, NULL, gup_flags); mmap_read_unlock(current->mm); @@ -2968,10 +2893,8 @@ int pin_user_pages_fast_only(unsigned long start, int nr_pages, EXPORT_SYMBOL_GPL(pin_user_pages_fast_only); /** - * pin_user_pages_remote() - pin pages of a remote process (task != current) + * pin_user_pages_remote() - pin pages of a remote process * - * @tsk: the task_struct to use for page fault accounting, or - * NULL if faults are not to be recorded. * @mm: mm_struct of target mm * @start: starting user address * @nr_pages: number of pages from start to pin @@ -2992,7 +2915,7 @@ EXPORT_SYMBOL_GPL(pin_user_pages_fast_only); * FOLL_PIN means that the pages must be released via unpin_user_page(). Please * see Documentation/core-api/pin_user_pages.rst for details. */ -long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, +long pin_user_pages_remote(struct mm_struct *mm, unsigned long start, unsigned long nr_pages, unsigned int gup_flags, struct page **pages, struct vm_area_struct **vmas, int *locked) @@ -3002,7 +2925,7 @@ long pin_user_pages_remote(struct task_struct *tsk, struct mm_struct *mm, return -EINVAL; gup_flags |= FOLL_PIN; - return __get_user_pages_remote(tsk, mm, start, nr_pages, gup_flags, + return __get_user_pages_remote(mm, start, nr_pages, gup_flags, pages, vmas, locked); } EXPORT_SYMBOL(pin_user_pages_remote); @@ -3034,7 +2957,7 @@ long pin_user_pages(unsigned long start, unsigned long nr_pages, return -EINVAL; gup_flags |= FOLL_PIN; - return __gup_longterm_locked(current, current->mm, start, nr_pages, + return __gup_longterm_locked(current->mm, start, nr_pages, pages, vmas, gup_flags); } EXPORT_SYMBOL(pin_user_pages); @@ -3079,7 +3002,7 @@ long pin_user_pages_locked(unsigned long start, unsigned long nr_pages, return -EINVAL; gup_flags |= FOLL_PIN; - return __get_user_pages_locked(current, current->mm, start, nr_pages, + return __get_user_pages_locked(current->mm, start, nr_pages, pages, NULL, locked, gup_flags | FOLL_TOUCH); } @@ -75,7 +75,8 @@ static int hmm_vma_fault(unsigned long addr, unsigned long end, } for (; addr < end; addr += PAGE_SIZE) - if (handle_mm_fault(vma, addr, fault_flags) & VM_FAULT_ERROR) + if (handle_mm_fault(vma, addr, fault_flags, NULL) & + VM_FAULT_ERROR) return -EFAULT; return -EBUSY; } @@ -249,7 +250,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr, swp_entry_t entry = pte_to_swp_entry(pte); /* - * Never fault in device private pages pages, but just report + * Never fault in device private pages, but just report * the PFN even if not present. */ if (hmm_is_device_private_entry(range, entry)) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 206f52b36ffb..2ccff8472cd4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -303,24 +303,6 @@ static ssize_t hpage_pmd_size_show(struct kobject *kobj, static struct kobj_attribute hpage_pmd_size_attr = __ATTR_RO(hpage_pmd_size); -#ifdef CONFIG_DEBUG_VM -static ssize_t debug_cow_show(struct kobject *kobj, - struct kobj_attribute *attr, char *buf) -{ - return single_hugepage_flag_show(kobj, attr, buf, - TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG); -} -static ssize_t debug_cow_store(struct kobject *kobj, - struct kobj_attribute *attr, - const char *buf, size_t count) -{ - return single_hugepage_flag_store(kobj, attr, buf, count, - TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG); -} -static struct kobj_attribute debug_cow_attr = - __ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store); -#endif /* CONFIG_DEBUG_VM */ - static struct attribute *hugepage_attr[] = { &enabled_attr.attr, &defrag_attr.attr, @@ -329,9 +311,6 @@ static struct attribute *hugepage_attr[] = { #ifdef CONFIG_SHMEM &shmem_enabled_attr.attr, #endif -#ifdef CONFIG_DEBUG_VM - &debug_cow_attr.attr, -#endif NULL, }; @@ -640,7 +619,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf, entry = mk_huge_pmd(page, vma->vm_page_prot); entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma); page_add_new_anon_rmap(page, vma, haddr, true); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable); set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry); add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR); diff --git a/mm/hugetlb.c b/mm/hugetlb.c index e52c878940bb..a301c2d672bf 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -19,6 +19,7 @@ #include <linux/memblock.h> #include <linux/sysfs.h> #include <linux/slab.h> +#include <linux/sched/mm.h> #include <linux/mmdebug.h> #include <linux/sched/signal.h> #include <linux/rmap.h> @@ -133,7 +134,7 @@ void hugepage_put_subpool(struct hugepage_subpool *spool) /* * Subpool accounting for allocating and reserving pages. * Return -ENOMEM if there are not enough resources to satisfy the - * the request. Otherwise, return the number of pages by which the + * request. Otherwise, return the number of pages by which the * global pools must be adjusted (upward). The returned value may * only be different than the passed value (delta) in the case where * a subpool minimum size must be maintained. @@ -1040,10 +1041,16 @@ static void enqueue_huge_page(struct hstate *h, struct page *page) static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid) { struct page *page; + bool nocma = !!(current->flags & PF_MEMALLOC_NOCMA); + + list_for_each_entry(page, &h->hugepage_freelists[nid], lru) { + if (nocma && is_migrate_cma_page(page)) + continue; - list_for_each_entry(page, &h->hugepage_freelists[nid], lru) if (!PageHWPoison(page)) break; + } + /* * if 'non-isolated free hugepage' not found on the list, * the allocation fails. @@ -1093,15 +1100,6 @@ retry_cpuset: return NULL; } -/* Movability of hugepages depends on migration support. */ -static inline gfp_t htlb_alloc_mask(struct hstate *h) -{ - if (hugepage_movable_supported(h)) - return GFP_HIGHUSER_MOVABLE; - else - return GFP_HIGHUSER; -} - static struct page *dequeue_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, unsigned long address, int avoid_reserve, @@ -1944,7 +1942,7 @@ out_unlock: return page; } -struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, +static struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask, int nid, nodemask_t *nmask) { struct page *page; @@ -1986,31 +1984,9 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h, } /* page migration callback function */ -struct page *alloc_huge_page_node(struct hstate *h, int nid) -{ - gfp_t gfp_mask = htlb_alloc_mask(h); - struct page *page = NULL; - - if (nid != NUMA_NO_NODE) - gfp_mask |= __GFP_THISNODE; - - spin_lock(&hugetlb_lock); - if (h->free_huge_pages - h->resv_huge_pages > 0) - page = dequeue_huge_page_nodemask(h, gfp_mask, nid, NULL); - spin_unlock(&hugetlb_lock); - - if (!page) - page = alloc_migrate_huge_page(h, gfp_mask, nid, NULL); - - return page; -} - -/* page migration callback function */ struct page *alloc_huge_page_nodemask(struct hstate *h, int preferred_nid, - nodemask_t *nmask) + nodemask_t *nmask, gfp_t gfp_mask) { - gfp_t gfp_mask = htlb_alloc_mask(h); - spin_lock(&hugetlb_lock); if (h->free_huge_pages - h->resv_huge_pages > 0) { struct page *page; @@ -2038,7 +2014,7 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma, gfp_mask = htlb_alloc_mask(h); node = huge_node(vma, address, gfp_mask, &mpol, &nodemask); - page = alloc_huge_page_nodemask(h, node, nodemask); + page = alloc_huge_page_nodemask(h, node, nodemask, gfp_mask); mpol_cond_put(mpol); return page; @@ -2167,7 +2143,7 @@ static void return_unused_surplus_pages(struct hstate *h, * evenly across all nodes with memory. Iterate across these nodes * until we can no longer free unreserved surplus pages. This occurs * when the nodes with surplus pages have no free pages. - * free_pool_huge_page() will balance the the freed pages across the + * free_pool_huge_page() will balance the freed pages across the * on-line nodes with memory and will handle the hstate accounting. * * Note that we decrement resv_huge_pages as we free the pages. If @@ -3458,13 +3434,21 @@ static int __init default_hugepagesz_setup(char *s) } __setup("default_hugepagesz=", default_hugepagesz_setup); -static unsigned int cpuset_mems_nr(unsigned int *array) +static unsigned int allowed_mems_nr(struct hstate *h) { int node; unsigned int nr = 0; + nodemask_t *mpol_allowed; + unsigned int *array = h->free_huge_pages_node; + gfp_t gfp_mask = htlb_alloc_mask(h); - for_each_node_mask(node, cpuset_current_mems_allowed) - nr += array[node]; + mpol_allowed = policy_nodemask_current(gfp_mask); + + for_each_node_mask(node, cpuset_current_mems_allowed) { + if (!mpol_allowed || + (mpol_allowed && node_isset(node, *mpol_allowed))) + nr += array[node]; + } return nr; } @@ -3643,12 +3627,18 @@ static int hugetlb_acct_memory(struct hstate *h, long delta) * we fall back to check against current free page availability as * a best attempt and hopefully to minimize the impact of changing * semantics that cpuset has. + * + * Apart from cpuset, we also have memory policy mechanism that + * also determines from which node the kernel will allocate memory + * in a NUMA system. So similar to cpuset, we also should consider + * the memory policy of the current task. Similar to the description + * above. */ if (delta > 0) { if (gather_surplus_pages(h, delta) < 0) goto out; - if (delta > cpuset_mems_nr(h->free_huge_pages_node)) { + if (delta > allowed_mems_nr(h)) { return_unused_surplus_pages(h, delta); goto out; } @@ -3953,7 +3943,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma, continue; ptl = huge_pte_lock(h, mm, ptep); - if (huge_pmd_unshare(mm, &address, ptep)) { + if (huge_pmd_unshare(mm, vma, &address, ptep)) { spin_unlock(ptl); /* * We just unmapped a page of PMDs by clearing a PUD. @@ -4540,10 +4530,6 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, } else if (unlikely(is_hugetlb_entry_hwpoisoned(entry))) return VM_FAULT_HWPOISON_LARGE | VM_FAULT_SET_HINDEX(hstate_index(h)); - } else { - ptep = huge_pte_alloc(mm, haddr, huge_page_size(h)); - if (!ptep) - return VM_FAULT_OOM; } /* @@ -5020,7 +5006,7 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma, if (!ptep) continue; ptl = huge_pte_lock(h, mm, ptep); - if (huge_pmd_unshare(mm, &address, ptep)) { + if (huge_pmd_unshare(mm, vma, &address, ptep)) { pages++; spin_unlock(ptl); shared_pmd = true; @@ -5401,12 +5387,14 @@ out: * returns: 1 successfully unmapped a shared pte page * 0 the underlying pte page is not shared, or it is the last user */ -int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep) +int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *addr, pte_t *ptep) { pgd_t *pgd = pgd_offset(mm, *addr); p4d_t *p4d = p4d_offset(pgd, *addr); pud_t *pud = pud_offset(p4d, *addr); + i_mmap_assert_write_locked(vma->vm_file->f_mapping); BUG_ON(page_count(virt_to_page(ptep)) == 0); if (page_count(virt_to_page(ptep)) == 1) return 0; @@ -5424,7 +5412,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud) return NULL; } -int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep) +int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma, + unsigned long *addr, pte_t *ptep) { return 0; } @@ -5694,12 +5683,14 @@ void __init hugetlb_cma_reserve(int order) reserved = 0; for_each_node_state(nid, N_ONLINE) { int res; + char name[20]; size = min(per_node, hugetlb_cma_size - reserved); size = round_up(size, PAGE_SIZE << order); + snprintf(name, 20, "hugetlb%d", nid); res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order, - 0, false, "hugetlb", + 0, false, name, &hugetlb_cma[nid], nid); if (res) { pr_warn("hugetlb_cma: reservation failed: err %d, node %d", diff --git a/mm/internal.h b/mm/internal.h index 9886db20d94f..d11a9a8d2135 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -239,6 +239,7 @@ struct compact_control { bool no_set_skip_hint; /* Don't mark blocks for skipping */ bool ignore_block_suitable; /* Scan blocks considered unsuitable */ bool direct_compaction; /* False from kcompactd or /proc/... */ + bool proactive_compaction; /* kcompactd proactive compaction */ bool whole_zone; /* Whole zone should/has been scanned */ bool contended; /* Signal lock or sched contention */ bool rescan; /* Rescanning the same pageblock */ @@ -612,5 +613,11 @@ static inline bool is_migrate_highatomic_page(struct page *page) } void setup_zone_pageset(struct zone *zone); -extern struct page *alloc_new_node_page(struct page *page, unsigned long node); + +struct migration_target_control { + int nid; /* preferred node id */ + nodemask_t *nmask; + gfp_t gfp_mask; +}; + #endif /* __MM_INTERNAL_H */ diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b52bd46ad146..15a9af791014 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1173,7 +1173,7 @@ static void collapse_huge_page(struct mm_struct *mm, spin_lock(pmd_ptl); BUG_ON(!pmd_none(*pmd)); page_add_new_anon_rmap(new_page, vma, address, true); - lru_cache_add_active_or_unevictable(new_page, vma); + lru_cache_add_inactive_or_unevictable(new_page, vma); pgtable_trans_huge_deposit(mm, pmd, pgtable); set_pmd_at(mm, address, pmd, _pmd); update_mmu_cache_pmd(vma, address, pmd); @@ -480,7 +480,8 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr) break; if (PageKsm(page)) ret = handle_mm_fault(vma, addr, - FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE); + FAULT_FLAG_WRITE | FAULT_FLAG_REMOTE, + NULL); else ret = VM_FAULT_WRITE; put_page(page); diff --git a/mm/maccess.c b/mm/maccess.c index f98ff91e32c6..3bd70405f2d8 100644 --- a/mm/maccess.c +++ b/mm/maccess.c @@ -205,15 +205,14 @@ long strncpy_from_kernel_nofault(char *dst, const void *unsafe_addr, long count) long copy_from_user_nofault(void *dst, const void __user *src, size_t size) { long ret = -EFAULT; - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs = force_uaccess_begin(); - set_fs(USER_DS); if (access_ok(src, size)) { pagefault_disable(); ret = __copy_from_user_inatomic(dst, src, size); pagefault_enable(); } - set_fs(old_fs); + force_uaccess_end(old_fs); if (ret) return -EFAULT; @@ -233,15 +232,14 @@ EXPORT_SYMBOL_GPL(copy_from_user_nofault); long copy_to_user_nofault(void __user *dst, const void *src, size_t size) { long ret = -EFAULT; - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs = force_uaccess_begin(); - set_fs(USER_DS); if (access_ok(dst, size)) { pagefault_disable(); ret = __copy_to_user_inatomic(dst, src, size); pagefault_enable(); } - set_fs(old_fs); + force_uaccess_end(old_fs); if (ret) return -EFAULT; @@ -270,17 +268,17 @@ EXPORT_SYMBOL_GPL(copy_to_user_nofault); long strncpy_from_user_nofault(char *dst, const void __user *unsafe_addr, long count) { - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs; long ret; if (unlikely(count <= 0)) return 0; - set_fs(USER_DS); + old_fs = force_uaccess_begin(); pagefault_disable(); ret = strncpy_from_user(dst, unsafe_addr, count); pagefault_enable(); - set_fs(old_fs); + force_uaccess_end(old_fs); if (ret >= count) { ret = count; @@ -310,14 +308,14 @@ long strncpy_from_user_nofault(char *dst, const void __user *unsafe_addr, */ long strnlen_user_nofault(const void __user *unsafe_addr, long count) { - mm_segment_t old_fs = get_fs(); + mm_segment_t old_fs; int ret; - set_fs(USER_DS); + old_fs = force_uaccess_begin(); pagefault_disable(); ret = strnlen_user(unsafe_addr, count); pagefault_enable(); - set_fs(old_fs); + force_uaccess_end(old_fs); return ret; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 8d9ceea7fe4d..d59fd9af6e63 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -781,7 +781,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val) if (mem_cgroup_disabled()) return; - if (vmstat_item_in_bytes(idx)) + if (memcg_stat_item_in_bytes(idx)) threshold <<= PAGE_SHIFT; x = val + __this_cpu_read(memcg->vmstats_percpu->stat[idx]); @@ -1488,6 +1488,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg) seq_buf_printf(&s, "slab %llu\n", (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) + memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B))); + seq_buf_printf(&s, "percpu %llu\n", + (u64)memcg_page_state(memcg, MEMCG_PERCPU_B)); seq_buf_printf(&s, "sock %llu\n", (u64)memcg_page_state(memcg, MEMCG_SOCK) * PAGE_SIZE); @@ -1528,12 +1530,18 @@ static char *memory_stat_format(struct mem_cgroup *memcg) seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT), memcg_events(memcg, PGMAJFAULT)); - seq_buf_printf(&s, "workingset_refault %lu\n", - memcg_page_state(memcg, WORKINGSET_REFAULT)); - seq_buf_printf(&s, "workingset_activate %lu\n", - memcg_page_state(memcg, WORKINGSET_ACTIVATE)); + seq_buf_printf(&s, "workingset_refault_anon %lu\n", + memcg_page_state(memcg, WORKINGSET_REFAULT_ANON)); + seq_buf_printf(&s, "workingset_refault_file %lu\n", + memcg_page_state(memcg, WORKINGSET_REFAULT_FILE)); + seq_buf_printf(&s, "workingset_activate_anon %lu\n", + memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON)); + seq_buf_printf(&s, "workingset_activate_file %lu\n", + memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE)); + seq_buf_printf(&s, "workingset_restore %lu\n", + memcg_page_state(memcg, WORKINGSET_RESTORE_ANON)); seq_buf_printf(&s, "workingset_restore %lu\n", - memcg_page_state(memcg, WORKINGSET_RESTORE)); + memcg_page_state(memcg, WORKINGSET_RESTORE_FILE)); seq_buf_printf(&s, "workingset_nodereclaim %lu\n", memcg_page_state(memcg, WORKINGSET_NODERECLAIM)); @@ -2414,7 +2422,7 @@ static void high_work_func(struct work_struct *work) * * - MEMCG_DELAY_PRECISION_SHIFT: Extra precision bits while translating the * overage ratio to a delay. - * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down down the + * - MEMCG_DELAY_SCALING_SHIFT: The number of bits to scale down the * proposed penalty in order to reduce to a reasonable number of jiffies, and * to produce a reasonable delay curve. * @@ -5129,13 +5137,18 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node) if (!pn) return 1; - pn->lruvec_stat_local = alloc_percpu(struct lruvec_stat); + /* We charge the parent cgroup, never the current task */ + WARN_ON_ONCE(!current->active_memcg); + + pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat, + GFP_KERNEL_ACCOUNT); if (!pn->lruvec_stat_local) { kfree(pn); return 1; } - pn->lruvec_stat_cpu = alloc_percpu(struct lruvec_stat); + pn->lruvec_stat_cpu = alloc_percpu_gfp(struct lruvec_stat, + GFP_KERNEL_ACCOUNT); if (!pn->lruvec_stat_cpu) { free_percpu(pn->lruvec_stat_local); kfree(pn); @@ -5209,11 +5222,16 @@ static struct mem_cgroup *mem_cgroup_alloc(void) goto fail; } - memcg->vmstats_local = alloc_percpu(struct memcg_vmstats_percpu); + /* We charge the parent cgroup, never the current task */ + WARN_ON_ONCE(!current->active_memcg); + + memcg->vmstats_local = alloc_percpu_gfp(struct memcg_vmstats_percpu, + GFP_KERNEL_ACCOUNT); if (!memcg->vmstats_local) goto fail; - memcg->vmstats_percpu = alloc_percpu(struct memcg_vmstats_percpu); + memcg->vmstats_percpu = alloc_percpu_gfp(struct memcg_vmstats_percpu, + GFP_KERNEL_ACCOUNT); if (!memcg->vmstats_percpu) goto fail; @@ -5262,7 +5280,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) struct mem_cgroup *memcg; long error = -ENOMEM; + memalloc_use_memcg(parent); memcg = mem_cgroup_alloc(); + memalloc_unuse_memcg(); if (IS_ERR(memcg)) return ERR_CAST(memcg); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 47b8ccb1fb9b..f1aa6433f404 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1648,9 +1648,12 @@ EXPORT_SYMBOL(unpoison_memory); static struct page *new_page(struct page *p, unsigned long private) { - int nid = page_to_nid(p); + struct migration_target_control mtc = { + .nid = page_to_nid(p), + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + }; - return new_page_nodemask(p, nid, &node_states[N_MEMORY]); + return alloc_migration_target(p, (unsigned long)&mtc); } /* diff --git a/mm/memory.c b/mm/memory.c index c39a13b09602..228efaca75d3 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -71,6 +71,8 @@ #include <linux/dax.h> #include <linux/oom.h> #include <linux/numa.h> +#include <linux/perf_event.h> +#include <linux/ptrace.h> #include <trace/events/kmem.h> @@ -1800,7 +1802,7 @@ out_unlock: * @pfn: source kernel pfn * @pgprot: pgprot flags for the inserted page * - * This is exactly like vmf_insert_pfn(), except that it allows drivers to + * This is exactly like vmf_insert_pfn(), except that it allows drivers * to override pgprot on a per-page basis. * * This only makes sense for IO mappings, and it makes no sense for @@ -1936,7 +1938,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma, * @pfn: source kernel pfn * @pgprot: pgprot flags for the inserted page * - * This is exactly like vmf_insert_mixed(), except that it allows drivers to + * This is exactly like vmf_insert_mixed(), except that it allows drivers * to override pgprot on a per-page basis. * * Typically this function should be used by drivers to set caching- and @@ -2715,7 +2717,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) */ ptep_clear_flush_notify(vma, vmf->address, vmf->pte); page_add_new_anon_rmap(new_page, vma, vmf->address, false); - lru_cache_add_active_or_unevictable(new_page, vma); + lru_cache_add_inactive_or_unevictable(new_page, vma); /* * We call the notify macro here because, when using secondary * mmu page tables (such as kvm shadow page tables), we want the @@ -3098,6 +3100,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) int locked; int exclusive = 0; vm_fault_t ret = 0; + void *shadow = NULL; if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) goto out; @@ -3149,13 +3152,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out_page; } - /* - * XXX: Move to lru_cache_add() when it - * supports new vs putback - */ - spin_lock_irq(&page_pgdat(page)->lru_lock); - lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); + shadow = get_shadow_from_swap_cache(entry); + if (shadow) + workingset_refault(page, shadow); lru_cache_add(page); swap_readpage(page, true); @@ -3266,10 +3265,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) /* ksm created a completely new copy */ if (unlikely(page != swapcache && swapcache)) { page_add_new_anon_rmap(page, vma, vmf->address, false); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); } else { do_page_add_anon_rmap(page, vma, vmf->address, exclusive); - activate_page(page); } swap_free(entry); @@ -3414,7 +3412,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf) inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, vmf->address, false); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); setpte: set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); @@ -3672,7 +3670,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page) if (write && !(vma->vm_flags & VM_SHARED)) { inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, vmf->address, false); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); } else { inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page)); page_add_file_rmap(page, false); @@ -4360,6 +4358,67 @@ retry_pud: return handle_pte_fault(&vmf); } +/** + * mm_account_fault - Do page fault accountings + * + * @regs: the pt_regs struct pointer. When set to NULL, will skip accounting + * of perf event counters, but we'll still do the per-task accounting to + * the task who triggered this page fault. + * @address: the faulted address. + * @flags: the fault flags. + * @ret: the fault retcode. + * + * This will take care of most of the page fault accountings. Meanwhile, it + * will also include the PERF_COUNT_SW_PAGE_FAULTS_[MAJ|MIN] perf counter + * updates. However note that the handling of PERF_COUNT_SW_PAGE_FAULTS should + * still be in per-arch page fault handlers at the entry of page fault. + */ +static inline void mm_account_fault(struct pt_regs *regs, + unsigned long address, unsigned int flags, + vm_fault_t ret) +{ + bool major; + + /* + * We don't do accounting for some specific faults: + * + * - Unsuccessful faults (e.g. when the address wasn't valid). That + * includes arch_vma_access_permitted() failing before reaching here. + * So this is not a "this many hardware page faults" counter. We + * should use the hw profiling for that. + * + * - Incomplete faults (VM_FAULT_RETRY). They will only be counted + * once they're completed. + */ + if (ret & (VM_FAULT_ERROR | VM_FAULT_RETRY)) + return; + + /* + * We define the fault as a major fault when the final successful fault + * is VM_FAULT_MAJOR, or if it retried (which implies that we couldn't + * handle it immediately previously). + */ + major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED); + + if (major) + current->maj_flt++; + else + current->min_flt++; + + /* + * If the fault is done for GUP, regs will be NULL. We only do the + * accounting for the per thread fault counters who triggered the + * fault, and we skip the perf event updates. + */ + if (!regs) + return; + + if (major) + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address); + else + perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address); +} + /* * By the time we get here, we already hold the mm semaphore * @@ -4367,7 +4426,7 @@ retry_pud: * return value. See filemap_fault() and __lock_page_or_retry(). */ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, - unsigned int flags) + unsigned int flags, struct pt_regs *regs) { vm_fault_t ret; @@ -4408,6 +4467,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, mem_cgroup_oom_synchronize(false); } + mm_account_fault(regs, address, flags, ret); + return ret; } EXPORT_SYMBOL_GPL(handle_mm_fault); @@ -4681,7 +4742,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm, void *maddr; struct page *page = NULL; - ret = get_user_pages_remote(tsk, mm, addr, 1, + ret = get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL); if (ret <= 0) { #ifndef CONFIG_HAVE_IOREMAP_PROT diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index ac6961abaa10..c32ead89c911 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -350,6 +350,16 @@ int __ref __add_pages(int nid, unsigned long pfn, unsigned long nr_pages, return err; } +#ifdef CONFIG_NUMA +int __weak memory_add_physaddr_to_nid(u64 start) +{ + pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n", + start); + return 0; +} +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); +#endif + /* find the smallest valid pfn in the range [start_pfn, end_pfn) */ static unsigned long find_smallest_section_pfn(int nid, struct zone *zone, unsigned long start_pfn, @@ -844,8 +854,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, node_states_set_node(nid, &arg); if (need_zonelists_rebuild) build_all_zonelists(NULL); - else - zone_pcp_update(zone); + zone_pcp_update(zone); init_per_zone_wmark_min(); @@ -1267,19 +1276,23 @@ found: static struct page *new_node_page(struct page *page, unsigned long private) { - int nid = page_to_nid(page); nodemask_t nmask = node_states[N_MEMORY]; + struct migration_target_control mtc = { + .nid = page_to_nid(page), + .nmask = &nmask, + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + }; /* * try to allocate from a different node but reuse this node if there * are no other online nodes to be used (e.g. we are offlining a part * of the only existing node) */ - node_clear(nid, nmask); + node_clear(mtc.nid, nmask); if (nodes_empty(nmask)) - node_set(nid, nmask); + node_set(mtc.nid, nmask); - return new_page_nodemask(page, nid, &nmask); + return alloc_migration_target(page, (unsigned long)&mtc); } static int @@ -1747,7 +1760,7 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) */ rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb); if (rc) - goto done; + return rc; /* remove memmap entry */ firmware_map_remove(start, start + size, "System RAM"); @@ -1771,9 +1784,8 @@ static int __ref try_remove_memory(int nid, u64 start, u64 size) try_offline_node(nid); -done: mem_hotplug_done(); - return rc; + return 0; } /** diff --git a/mm/mempolicy.c b/mm/mempolicy.c index b9e85d467352..afaa09ff9f6c 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -129,7 +129,7 @@ static struct mempolicy preferred_node_policy[MAX_NUMNODES]; /** * numa_map_to_online_node - Find closest online node - * @nid: Node id to start the search + * @node: Node id to start the search * * Lookup the next closest node by distance if @nid is not online. */ @@ -1065,27 +1065,6 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist, return 0; } -/* page allocation callback for NUMA node migration */ -struct page *alloc_new_node_page(struct page *page, unsigned long node) -{ - if (PageHuge(page)) - return alloc_huge_page_node(page_hstate(compound_head(page)), - node); - else if (PageTransHuge(page)) { - struct page *thp; - - thp = alloc_pages_node(node, - (GFP_TRANSHUGE | __GFP_THISNODE), - HPAGE_PMD_ORDER); - if (!thp) - return NULL; - prep_transhuge_page(thp); - return thp; - } else - return __alloc_pages_node(node, GFP_HIGHUSER_MOVABLE | - __GFP_THISNODE, 0); -} - /* * Migrate pages from one node to a target node. * Returns error or the number of pages not migrated. @@ -1096,6 +1075,10 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, nodemask_t nmask; LIST_HEAD(pagelist); int err = 0; + struct migration_target_control mtc = { + .nid = dest, + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, + }; nodes_clear(nmask); node_set(source, nmask); @@ -1110,8 +1093,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest, flags | MPOL_MF_DISCONTIG_OK, &pagelist); if (!list_empty(&pagelist)) { - err = migrate_pages(&pagelist, alloc_new_node_page, NULL, dest, - MIGRATE_SYNC, MR_SYSCALL); + err = migrate_pages(&pagelist, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL); if (err) putback_movable_pages(&pagelist); } @@ -1632,11 +1615,11 @@ static int kernel_get_mempolicy(int __user *policy, int pval; nodemask_t nodes; - addr = untagged_addr(addr); - if (nmask != NULL && maxnode < nr_node_ids) return -EINVAL; + addr = untagged_addr(addr); + err = do_get_mempolicy(&pval, &nodes, addr, flags); if (err) @@ -1890,7 +1873,7 @@ static int apply_policy_zone(struct mempolicy *policy, enum zone_type zone) * Return a nodemask representing a mempolicy for filtering nodes for * page allocation */ -static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) +nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy) { /* Lower zones don't get a nodemask applied for MPOL_BIND */ if (unlikely(policy->mode == MPOL_BIND) && diff --git a/mm/migrate.c b/mm/migrate.c index d179657f8685..5053439be6ab 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -1418,22 +1418,35 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page, enum migrate_mode mode, int reason) { int retry = 1; + int thp_retry = 1; int nr_failed = 0; int nr_succeeded = 0; + int nr_thp_succeeded = 0; + int nr_thp_failed = 0; + int nr_thp_split = 0; int pass = 0; + bool is_thp = false; struct page *page; struct page *page2; int swapwrite = current->flags & PF_SWAPWRITE; - int rc; + int rc, nr_subpages; if (!swapwrite) current->flags |= PF_SWAPWRITE; - for(pass = 0; pass < 10 && retry; pass++) { + for (pass = 0; pass < 10 && (retry || thp_retry); pass++) { retry = 0; + thp_retry = 0; list_for_each_entry_safe(page, page2, from, lru) { retry: + /* + * THP statistics is based on the source huge page. + * Capture required information that might get lost + * during migration. + */ + is_thp = PageTransHuge(page); + nr_subpages = hpage_nr_pages(page); cond_resched(); if (PageHuge(page)) @@ -1464,15 +1477,30 @@ retry: unlock_page(page); if (!rc) { list_safe_reset_next(page, page2, lru); + nr_thp_split++; goto retry; } } + if (is_thp) { + nr_thp_failed++; + nr_failed += nr_subpages; + goto out; + } nr_failed++; goto out; case -EAGAIN: + if (is_thp) { + thp_retry++; + break; + } retry++; break; case MIGRATEPAGE_SUCCESS: + if (is_thp) { + nr_thp_succeeded++; + nr_succeeded += nr_subpages; + break; + } nr_succeeded++; break; default: @@ -1482,19 +1510,27 @@ retry: * removed from migration page list and not * retried in the next outer loop. */ + if (is_thp) { + nr_thp_failed++; + nr_failed += nr_subpages; + break; + } nr_failed++; break; } } } - nr_failed += retry; + nr_failed += retry + thp_retry; + nr_thp_failed += thp_retry; rc = nr_failed; out: - if (nr_succeeded) - count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded); - if (nr_failed) - count_vm_events(PGMIGRATE_FAIL, nr_failed); - trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason); + count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded); + count_vm_events(PGMIGRATE_FAIL, nr_failed); + count_vm_events(THP_MIGRATION_SUCCESS, nr_thp_succeeded); + count_vm_events(THP_MIGRATION_FAIL, nr_thp_failed); + count_vm_events(THP_MIGRATION_SPLIT, nr_thp_split); + trace_mm_migrate_pages(nr_succeeded, nr_failed, nr_thp_succeeded, + nr_thp_failed, nr_thp_split, mode, reason); if (!swapwrite) current->flags &= ~PF_SWAPWRITE; @@ -1502,6 +1538,49 @@ out: return rc; } +struct page *alloc_migration_target(struct page *page, unsigned long private) +{ + struct migration_target_control *mtc; + gfp_t gfp_mask; + unsigned int order = 0; + struct page *new_page = NULL; + int nid; + int zidx; + + mtc = (struct migration_target_control *)private; + gfp_mask = mtc->gfp_mask; + nid = mtc->nid; + if (nid == NUMA_NO_NODE) + nid = page_to_nid(page); + + if (PageHuge(page)) { + struct hstate *h = page_hstate(compound_head(page)); + + gfp_mask = htlb_modify_alloc_mask(h, gfp_mask); + return alloc_huge_page_nodemask(h, nid, mtc->nmask, gfp_mask); + } + + if (PageTransHuge(page)) { + /* + * clear __GFP_RECLAIM to make the migration callback + * consistent with regular THP allocations. + */ + gfp_mask &= ~__GFP_RECLAIM; + gfp_mask |= GFP_TRANSHUGE; + order = HPAGE_PMD_ORDER; + } + zidx = zone_idx(page_zone(page)); + if (is_highmem_idx(zidx) || zidx == ZONE_MOVABLE) + gfp_mask |= __GFP_HIGHMEM; + + new_page = __alloc_pages_nodemask(gfp_mask, order, nid, mtc->nmask); + + if (new_page && PageTransHuge(new_page)) + prep_transhuge_page(new_page); + + return new_page; +} + #ifdef CONFIG_NUMA static int store_status(int __user *status, int start, int value, int nr) @@ -1519,9 +1598,13 @@ static int do_move_pages_to_node(struct mm_struct *mm, struct list_head *pagelist, int node) { int err; + struct migration_target_control mtc = { + .nid = node, + .gfp_mask = GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, + }; - err = migrate_pages(pagelist, alloc_new_node_page, NULL, node, - MIGRATE_SYNC, MR_SYSCALL); + err = migrate_pages(pagelist, alloc_migration_target, NULL, + (unsigned long)&mtc, MIGRATE_SYNC, MR_SYSCALL); if (err) putback_movable_pages(pagelist); return err; @@ -2168,6 +2251,16 @@ static int migrate_vma_collect_hole(unsigned long start, struct migrate_vma *migrate = walk->private; unsigned long addr; + /* Only allow populating anonymous memory. */ + if (!vma_is_anonymous(walk->vma)) { + for (addr = start; addr < end; addr += PAGE_SIZE) { + migrate->src[migrate->npages] = 0; + migrate->dst[migrate->npages] = 0; + migrate->npages++; + } + return 0; + } + for (addr = start; addr < end; addr += PAGE_SIZE) { migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE; migrate->dst[migrate->npages] = 0; @@ -2260,8 +2353,10 @@ again: pte = *ptep; if (pte_none(pte)) { - mpfn = MIGRATE_PFN_MIGRATE; - migrate->cpages++; + if (vma_is_anonymous(vma)) { + mpfn = MIGRATE_PFN_MIGRATE; + migrate->cpages++; + } goto next; } @@ -2619,7 +2714,7 @@ restore: /** * migrate_vma_setup() - prepare to migrate a range of memory - * @args: contains the vma, start, and and pfns arrays for the migration + * @args: contains the vma, start, and pfns arrays for the migration * * Returns: negative errno on failures, 0 when 0 or more pages were migrated * without an error. @@ -2830,7 +2925,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate, inc_mm_counter(mm, MM_ANONPAGES); page_add_new_anon_rmap(page, vma, addr, false); if (!is_zone_device_page(page)) - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); get_page(page); if (flush) { diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c index 352bb9f3ecc0..4fc918163dd3 100644 --- a/mm/mmu_notifier.c +++ b/mm/mmu_notifier.c @@ -166,7 +166,7 @@ static void mn_itree_inv_end(struct mmu_notifier_subscriptions *subscriptions) /** * mmu_interval_read_begin - Begin a read side critical section against a VA * range - * interval_sub: The interval subscription + * @interval_sub: The interval subscription * * mmu_iterval_read_begin()/mmu_iterval_read_retry() implement a * collision-retry scheme similar to seqcount for the VA range under @@ -686,7 +686,7 @@ EXPORT_SYMBOL_GPL(__mmu_notifier_register); /** * mmu_notifier_register - Register a notifier on a mm - * @mn: The notifier to attach + * @subscription: The notifier to attach * @mm: The mm to attach the notifier to * * Must not hold mmap_lock nor any other VM related lock when calling @@ -856,7 +856,7 @@ static void mmu_notifier_free_rcu(struct rcu_head *rcu) /** * mmu_notifier_put - Release the reference on the notifier - * @mn: The notifier to act on + * @subscription: The notifier to act on * * This function must be paired with each mmu_notifier_get(), it releases the * reference obtained by the get. If this is the last reference then process @@ -965,7 +965,8 @@ static int __mmu_interval_notifier_insert( * @interval_sub: Interval subscription to register * @start: Starting virtual address to monitor * @length: Length of the range to monitor - * @mm : mm_struct to attach to + * @mm: mm_struct to attach to + * @ops: Interval notifier operations to be called on matching events * * This function subscribes the interval notifier for notifications from the * mm. Upon return the ops related to mmu_interval_notifier will be called diff --git a/mm/nommu.c b/mm/nommu.c index 340ae7774c13..75a327149af1 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1762,8 +1762,8 @@ EXPORT_SYMBOL_GPL(access_process_vm); * @newsize: The proposed filesize of the inode * * Check the shared mappings on an inode on behalf of a shrinking truncate to - * make sure that that any outstanding VMAs aren't broken and then shrink the - * vm_regions that extend that beyond so that do_mmap() doesn't + * make sure that any outstanding VMAs aren't broken and then shrink the + * vm_regions that extend beyond so that do_mmap() doesn't * automatically grant mappings that are too large. */ int nommu_shrink_inode_mappings(struct inode *inode, size_t size, diff --git a/mm/oom_kill.c b/mm/oom_kill.c index d30ce75f23fb..e90f25d6385d 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -196,17 +196,17 @@ static bool is_dump_unreclaim_slabs(void) * predictable as possible. The goal is to return the highest value for the * task consuming the most memory to avoid subsequent oom failures. */ -unsigned long oom_badness(struct task_struct *p, unsigned long totalpages) +long oom_badness(struct task_struct *p, unsigned long totalpages) { long points; long adj; if (oom_unkillable_task(p)) - return 0; + return LONG_MIN; p = find_lock_task_mm(p); if (!p) - return 0; + return LONG_MIN; /* * Do not even consider tasks which are explicitly marked oom @@ -218,7 +218,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages) test_bit(MMF_OOM_SKIP, &p->mm->flags) || in_vfork(p)) { task_unlock(p); - return 0; + return LONG_MIN; } /* @@ -233,11 +233,7 @@ unsigned long oom_badness(struct task_struct *p, unsigned long totalpages) adj *= totalpages / 1000; points += adj; - /* - * Never return 0 for an eligible task regardless of the root bonus and - * oom_score_adj (oom_score_adj can't be OOM_SCORE_ADJ_MIN here). - */ - return points > 0 ? points : 1; + return points; } static const char * const oom_constraint_text[] = { @@ -310,7 +306,7 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc) static int oom_evaluate_task(struct task_struct *task, void *arg) { struct oom_control *oc = arg; - unsigned long points; + long points; if (oom_unkillable_task(task)) goto next; @@ -336,12 +332,12 @@ static int oom_evaluate_task(struct task_struct *task, void *arg) * killed first if it triggers an oom, then select it. */ if (oom_task_origin(task)) { - points = ULONG_MAX; + points = LONG_MAX; goto select; } points = oom_badness(task, oc->totalpages); - if (!points || points < oc->chosen_points) + if (points == LONG_MIN || points < oc->chosen_points) goto next; select: @@ -365,6 +361,8 @@ abort: */ static void select_bad_process(struct oom_control *oc) { + oc->chosen_points = LONG_MIN; + if (is_memcg_oom(oc)) mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc); else { @@ -863,6 +861,8 @@ static void __oom_kill_process(struct task_struct *victim, const char *message) p = find_lock_task_mm(victim); if (!p) { + pr_info("%s: OOM victim %d (%s) is already exiting. Skip killing the task\n", + message, task_pid_nr(victim), victim->comm); put_task_struct(victim); return; } else if (victim != p) { diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 167732f4d124..8b7d0ecf30b1 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4282,7 +4282,7 @@ retry: /* * If an allocation failed after direct reclaim, it could be because * pages are pinned on the per-cpu lists or in high alloc reserves. - * Shrink them them and try again + * Shrink them and try again */ if (!page && !drained) { unreserve_highatomic_pageblock(ac, false); @@ -6192,7 +6192,7 @@ static int zone_batchsize(struct zone *zone) * locking. * * Any new users of pcp->batch and pcp->high should ensure they can cope with - * those fields changing asynchronously (acording the the above rule). + * those fields changing asynchronously (acording to the above rule). * * mutex_is_locked(&pcp_batch_high_lock) required when calling this function * outside of boot time (or some other assurance that no concurrent updaters @@ -8203,7 +8203,7 @@ void *__init alloc_large_system_hash(const char *tablename, * race condition. So you can't expect this function should be exact. * * Returns a page without holding a reference. If the caller wants to - * dereference that page (e.g., dumping), it has to make sure that that it + * dereference that page (e.g., dumping), it has to make sure that it * cannot get removed (e.g., via memory unplug) concurrently. * */ @@ -8347,6 +8347,10 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, unsigned long pfn = start; unsigned int tries = 0; int ret = 0; + struct migration_target_control mtc = { + .nid = zone_to_nid(cc->zone), + .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL, + }; migrate_prep(); @@ -8373,8 +8377,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc, &cc->migratepages); cc->nr_migratepages -= nr_reclaimed; - ret = migrate_pages(&cc->migratepages, alloc_migrate_target, - NULL, 0, cc->mode, MR_CONTIG_RANGE); + ret = migrate_pages(&cc->migratepages, alloc_migration_target, + NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE); } if (ret < 0) { putback_movable_pages(&cc->migratepages); diff --git a/mm/page_isolation.c b/mm/page_isolation.c index f6d07c5f0d34..242c03121d73 100644 --- a/mm/page_isolation.c +++ b/mm/page_isolation.c @@ -306,8 +306,3 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn, return pfn < end_pfn ? -EBUSY : 0; } - -struct page *alloc_migrate_target(struct page *page, unsigned long private) -{ - return new_page_nodemask(page, numa_node_id(), &node_states[N_MEMORY]); -} diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index 0468ba500bd4..18b768ac7dca 100644 --- a/mm/percpu-internal.h +++ b/mm/percpu-internal.h @@ -6,6 +6,25 @@ #include <linux/percpu.h> /* + * There are two chunk types: root and memcg-aware. + * Chunks of each type have separate slots list. + * + * Memcg-aware chunks have an attached vector of obj_cgroup pointers, which is + * used to store memcg membership data of a percpu object. Obj_cgroups are + * ref-counted pointers to a memory cgroup with an ability to switch dynamically + * to the parent memory cgroup. This allows to reclaim a deleted memory cgroup + * without reclaiming of all outstanding objects, which hold a reference at it. + */ +enum pcpu_chunk_type { + PCPU_CHUNK_ROOT, +#ifdef CONFIG_MEMCG_KMEM + PCPU_CHUNK_MEMCG, +#endif + PCPU_NR_CHUNK_TYPES, + PCPU_FAIL_ALLOC = PCPU_NR_CHUNK_TYPES +}; + +/* * pcpu_block_md is the metadata block struct. * Each chunk's bitmap is split into a number of full blocks. * All units are in terms of bits. @@ -54,6 +73,9 @@ struct pcpu_chunk { int end_offset; /* additional area required to have the region end page aligned */ +#ifdef CONFIG_MEMCG_KMEM + struct obj_cgroup **obj_cgroups; /* vector of object cgroups */ +#endif int nr_pages; /* # of pages served by this chunk */ int nr_populated; /* # of populated pages */ @@ -63,7 +85,7 @@ struct pcpu_chunk { extern spinlock_t pcpu_lock; -extern struct list_head *pcpu_slot; +extern struct list_head *pcpu_chunk_lists; extern int pcpu_nr_slots; extern int pcpu_nr_empty_pop_pages; @@ -106,6 +128,37 @@ static inline int pcpu_chunk_map_bits(struct pcpu_chunk *chunk) return pcpu_nr_pages_to_map_bits(chunk->nr_pages); } +#ifdef CONFIG_MEMCG_KMEM +static inline enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk) +{ + if (chunk->obj_cgroups) + return PCPU_CHUNK_MEMCG; + return PCPU_CHUNK_ROOT; +} + +static inline bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type) +{ + return chunk_type == PCPU_CHUNK_MEMCG; +} + +#else +static inline enum pcpu_chunk_type pcpu_chunk_type(struct pcpu_chunk *chunk) +{ + return PCPU_CHUNK_ROOT; +} + +static inline bool pcpu_is_memcg_chunk(enum pcpu_chunk_type chunk_type) +{ + return false; +} +#endif + +static inline struct list_head *pcpu_chunk_list(enum pcpu_chunk_type chunk_type) +{ + return &pcpu_chunk_lists[pcpu_nr_slots * + pcpu_is_memcg_chunk(chunk_type)]; +} + #ifdef CONFIG_PERCPU_STATS #include <linux/spinlock.h> diff --git a/mm/percpu-km.c b/mm/percpu-km.c index 20d2b69a13b0..35c9941077ee 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -44,7 +44,8 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, /* nada */ } -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp) { const int nr_pages = pcpu_group_sizes[0] >> PAGE_SHIFT; struct pcpu_chunk *chunk; @@ -52,7 +53,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) unsigned long flags; int i; - chunk = pcpu_alloc_chunk(gfp); + chunk = pcpu_alloc_chunk(type, gfp); if (!chunk) return NULL; diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index 32558063c3f9..c8400a2adbc2 100644 --- a/mm/percpu-stats.c +++ b/mm/percpu-stats.c @@ -34,11 +34,15 @@ static int find_max_nr_alloc(void) { struct pcpu_chunk *chunk; int slot, max_nr_alloc; + enum pcpu_chunk_type type; max_nr_alloc = 0; - for (slot = 0; slot < pcpu_nr_slots; slot++) - list_for_each_entry(chunk, &pcpu_slot[slot], list) - max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc); + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) + for (slot = 0; slot < pcpu_nr_slots; slot++) + list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot], + list) + max_nr_alloc = max(max_nr_alloc, + chunk->nr_alloc); return max_nr_alloc; } @@ -129,6 +133,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, P("cur_min_alloc", cur_min_alloc); P("cur_med_alloc", cur_med_alloc); P("cur_max_alloc", cur_max_alloc); +#ifdef CONFIG_MEMCG_KMEM + P("memcg_aware", pcpu_is_memcg_chunk(pcpu_chunk_type(chunk))); +#endif seq_putc(m, '\n'); } @@ -137,6 +144,7 @@ static int percpu_stats_show(struct seq_file *m, void *v) struct pcpu_chunk *chunk; int slot, max_nr_alloc; int *buffer; + enum pcpu_chunk_type type; alloc_buffer: spin_lock_irq(&pcpu_lock); @@ -202,18 +210,18 @@ alloc_buffer: chunk_map_stats(m, pcpu_reserved_chunk, buffer); } - for (slot = 0; slot < pcpu_nr_slots; slot++) { - list_for_each_entry(chunk, &pcpu_slot[slot], list) { - if (chunk == pcpu_first_chunk) { - seq_puts(m, "Chunk: <- First Chunk\n"); - chunk_map_stats(m, chunk, buffer); - - - } else { - seq_puts(m, "Chunk:\n"); - chunk_map_stats(m, chunk, buffer); + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) { + for (slot = 0; slot < pcpu_nr_slots; slot++) { + list_for_each_entry(chunk, &pcpu_chunk_list(type)[slot], + list) { + if (chunk == pcpu_first_chunk) { + seq_puts(m, "Chunk: <- First Chunk\n"); + chunk_map_stats(m, chunk, buffer); + } else { + seq_puts(m, "Chunk:\n"); + chunk_map_stats(m, chunk, buffer); + } } - } } diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c index a2b395acef89..e46f7a6917f9 100644 --- a/mm/percpu-vm.c +++ b/mm/percpu-vm.c @@ -328,12 +328,13 @@ static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, pcpu_free_pages(chunk, pages, page_start, page_end); } -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp) { struct pcpu_chunk *chunk; struct vm_struct **vms; - chunk = pcpu_alloc_chunk(gfp); + chunk = pcpu_alloc_chunk(type, gfp); if (!chunk) return NULL; diff --git a/mm/percpu.c b/mm/percpu.c index b626766160ce..f4709629e6de 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -37,9 +37,14 @@ * takes care of normal allocations. * * The allocator organizes chunks into lists according to free size and - * tries to allocate from the fullest chunk first. Each chunk is managed - * by a bitmap with metadata blocks. The allocation map is updated on - * every allocation and free to reflect the current state while the boundary + * memcg-awareness. To make a percpu allocation memcg-aware the __GFP_ACCOUNT + * flag should be passed. All memcg-aware allocations are sharing one set + * of chunks and all unaccounted allocations and allocations performed + * by processes belonging to the root memory cgroup are using the second set. + * + * The allocator tries to allocate from the fullest chunk first. Each chunk + * is managed by a bitmap with metadata blocks. The allocation map is updated + * on every allocation and free to reflect the current state while the boundary * map is only updated on allocation. Each metadata block contains * information to help mitigate the need to iterate over large portions * of the bitmap. The reverse mapping from page to chunk is stored in @@ -81,6 +86,7 @@ #include <linux/kmemleak.h> #include <linux/sched.h> #include <linux/sched/mm.h> +#include <linux/memcontrol.h> #include <asm/cacheflush.h> #include <asm/sections.h> @@ -160,7 +166,7 @@ struct pcpu_chunk *pcpu_reserved_chunk __ro_after_init; DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */ +struct list_head *pcpu_chunk_lists __ro_after_init; /* chunk list slots */ /* chunks which need their map areas extended, protected by pcpu_lock */ static LIST_HEAD(pcpu_map_extend_chunks); @@ -500,6 +506,9 @@ static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot, bool move_front) { if (chunk != pcpu_reserved_chunk) { + struct list_head *pcpu_slot; + + pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk)); if (move_front) list_move(&chunk->list, &pcpu_slot[slot]); else @@ -1211,11 +1220,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits, * * This function determines the size of an allocation to free using * the boundary bitmap and clears the allocation map. + * + * RETURNS: + * Number of freed bytes. */ -static void pcpu_free_area(struct pcpu_chunk *chunk, int off) +static int pcpu_free_area(struct pcpu_chunk *chunk, int off) { struct pcpu_block_md *chunk_md = &chunk->chunk_md; - int bit_off, bits, end, oslot; + int bit_off, bits, end, oslot, freed; lockdep_assert_held(&pcpu_lock); pcpu_stats_area_dealloc(chunk); @@ -1230,8 +1242,10 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off) bits = end - bit_off; bitmap_clear(chunk->alloc_map, bit_off, bits); + freed = bits * PCPU_MIN_ALLOC_SIZE; + /* update metadata */ - chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE; + chunk->free_bytes += freed; /* update first free bit */ chunk_md->first_free = min(chunk_md->first_free, bit_off); @@ -1239,6 +1253,8 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off) pcpu_block_update_hint_free(chunk, bit_off, bits); pcpu_chunk_relocate(chunk, oslot); + + return freed; } static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits) @@ -1334,6 +1350,10 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr, panic("%s: Failed to allocate %zu bytes\n", __func__, alloc_size); +#ifdef CONFIG_MEMCG_KMEM + /* first chunk isn't memcg-aware */ + chunk->obj_cgroups = NULL; +#endif pcpu_init_md_blocks(chunk); /* manage populated page bitmap */ @@ -1373,7 +1393,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr, return chunk; } -static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp) +static struct pcpu_chunk *pcpu_alloc_chunk(enum pcpu_chunk_type type, gfp_t gfp) { struct pcpu_chunk *chunk; int region_bits; @@ -1401,6 +1421,16 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp) if (!chunk->md_blocks) goto md_blocks_fail; +#ifdef CONFIG_MEMCG_KMEM + if (pcpu_is_memcg_chunk(type)) { + chunk->obj_cgroups = + pcpu_mem_zalloc(pcpu_chunk_map_bits(chunk) * + sizeof(struct obj_cgroup *), gfp); + if (!chunk->obj_cgroups) + goto objcg_fail; + } +#endif + pcpu_init_md_blocks(chunk); /* init metadata */ @@ -1408,6 +1438,10 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp) return chunk; +#ifdef CONFIG_MEMCG_KMEM +objcg_fail: + pcpu_mem_free(chunk->md_blocks); +#endif md_blocks_fail: pcpu_mem_free(chunk->bound_map); bound_map_fail: @@ -1422,6 +1456,9 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk) { if (!chunk) return; +#ifdef CONFIG_MEMCG_KMEM + pcpu_mem_free(chunk->obj_cgroups); +#endif pcpu_mem_free(chunk->md_blocks); pcpu_mem_free(chunk->bound_map); pcpu_mem_free(chunk->alloc_map); @@ -1498,7 +1535,8 @@ static int pcpu_populate_chunk(struct pcpu_chunk *chunk, int page_start, int page_end, gfp_t gfp); static void pcpu_depopulate_chunk(struct pcpu_chunk *chunk, int page_start, int page_end); -static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp); +static struct pcpu_chunk *pcpu_create_chunk(enum pcpu_chunk_type type, + gfp_t gfp); static void pcpu_destroy_chunk(struct pcpu_chunk *chunk); static struct page *pcpu_addr_to_page(void *addr); static int __init pcpu_verify_alloc_info(const struct pcpu_alloc_info *ai); @@ -1540,6 +1578,87 @@ static struct pcpu_chunk *pcpu_chunk_addr_search(void *addr) return pcpu_get_page_chunk(pcpu_addr_to_page(addr)); } +#ifdef CONFIG_MEMCG_KMEM +static enum pcpu_chunk_type pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, + struct obj_cgroup **objcgp) +{ + struct obj_cgroup *objcg; + + if (!memcg_kmem_enabled() || !(gfp & __GFP_ACCOUNT) || + memcg_kmem_bypass()) + return PCPU_CHUNK_ROOT; + + objcg = get_obj_cgroup_from_current(); + if (!objcg) + return PCPU_CHUNK_ROOT; + + if (obj_cgroup_charge(objcg, gfp, size * num_possible_cpus())) { + obj_cgroup_put(objcg); + return PCPU_FAIL_ALLOC; + } + + *objcgp = objcg; + return PCPU_CHUNK_MEMCG; +} + +static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg, + struct pcpu_chunk *chunk, int off, + size_t size) +{ + if (!objcg) + return; + + if (chunk) { + chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = objcg; + + rcu_read_lock(); + mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B, + size * num_possible_cpus()); + rcu_read_unlock(); + } else { + obj_cgroup_uncharge(objcg, size * num_possible_cpus()); + obj_cgroup_put(objcg); + } +} + +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) +{ + struct obj_cgroup *objcg; + + if (!pcpu_is_memcg_chunk(pcpu_chunk_type(chunk))) + return; + + objcg = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT]; + chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL; + + obj_cgroup_uncharge(objcg, size * num_possible_cpus()); + + rcu_read_lock(); + mod_memcg_state(obj_cgroup_memcg(objcg), MEMCG_PERCPU_B, + -(size * num_possible_cpus())); + rcu_read_unlock(); + + obj_cgroup_put(objcg); +} + +#else /* CONFIG_MEMCG_KMEM */ +static enum pcpu_chunk_type +pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp, struct obj_cgroup **objcgp) +{ + return PCPU_CHUNK_ROOT; +} + +static void pcpu_memcg_post_alloc_hook(struct obj_cgroup *objcg, + struct pcpu_chunk *chunk, int off, + size_t size) +{ +} + +static void pcpu_memcg_free_hook(struct pcpu_chunk *chunk, int off, size_t size) +{ +} +#endif /* CONFIG_MEMCG_KMEM */ + /** * pcpu_alloc - the percpu allocator * @size: size of area to allocate in bytes @@ -1561,6 +1680,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, gfp_t pcpu_gfp; bool is_atomic; bool do_warn; + enum pcpu_chunk_type type; + struct list_head *pcpu_slot; + struct obj_cgroup *objcg = NULL; static int warn_limit = 10; struct pcpu_chunk *chunk, *next; const char *err; @@ -1595,16 +1717,23 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, return NULL; } + type = pcpu_memcg_pre_alloc_hook(size, gfp, &objcg); + if (unlikely(type == PCPU_FAIL_ALLOC)) + return NULL; + pcpu_slot = pcpu_chunk_list(type); + if (!is_atomic) { /* * pcpu_balance_workfn() allocates memory under this mutex, * and it may wait for memory reclaim. Allow current task * to become OOM victim, in case of memory pressure. */ - if (gfp & __GFP_NOFAIL) + if (gfp & __GFP_NOFAIL) { mutex_lock(&pcpu_alloc_mutex); - else if (mutex_lock_killable(&pcpu_alloc_mutex)) + } else if (mutex_lock_killable(&pcpu_alloc_mutex)) { + pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size); return NULL; + } } spin_lock_irqsave(&pcpu_lock, flags); @@ -1659,7 +1788,7 @@ restart: } if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) { - chunk = pcpu_create_chunk(pcpu_gfp); + chunk = pcpu_create_chunk(type, pcpu_gfp); if (!chunk) { err = "failed to allocate new chunk"; goto fail; @@ -1716,6 +1845,8 @@ area_found: trace_percpu_alloc_percpu(reserved, is_atomic, size, align, chunk->base_addr, off, ptr); + pcpu_memcg_post_alloc_hook(objcg, chunk, off, size); + return ptr; fail_unlock: @@ -1737,6 +1868,9 @@ fail: } else { mutex_unlock(&pcpu_alloc_mutex); } + + pcpu_memcg_post_alloc_hook(objcg, NULL, 0, size); + return NULL; } @@ -1796,8 +1930,8 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align) } /** - * pcpu_balance_workfn - manage the amount of free chunks and populated pages - * @work: unused + * __pcpu_balance_workfn - manage the amount of free chunks and populated pages + * @type: chunk type * * Reclaim all fully free chunks except for the first one. This is also * responsible for maintaining the pool of empty populated pages. However, @@ -1806,11 +1940,12 @@ void __percpu *__alloc_reserved_percpu(size_t size, size_t align) * allocation causes the failure as it is possible that requests can be * serviced from already backed regions. */ -static void pcpu_balance_workfn(struct work_struct *work) +static void __pcpu_balance_workfn(enum pcpu_chunk_type type) { /* gfp flags passed to underlying allocators */ const gfp_t gfp = GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN; LIST_HEAD(to_free); + struct list_head *pcpu_slot = pcpu_chunk_list(type); struct list_head *free_head = &pcpu_slot[pcpu_nr_slots - 1]; struct pcpu_chunk *chunk, *next; int slot, nr_to_pop, ret; @@ -1908,7 +2043,7 @@ retry_pop: if (nr_to_pop) { /* ran out of chunks to populate, create a new one and retry */ - chunk = pcpu_create_chunk(gfp); + chunk = pcpu_create_chunk(type, gfp); if (chunk) { spin_lock_irq(&pcpu_lock); pcpu_chunk_relocate(chunk, -1); @@ -1921,6 +2056,20 @@ retry_pop: } /** + * pcpu_balance_workfn - manage the amount of free chunks and populated pages + * @work: unused + * + * Call __pcpu_balance_workfn() for each chunk type. + */ +static void pcpu_balance_workfn(struct work_struct *work) +{ + enum pcpu_chunk_type type; + + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) + __pcpu_balance_workfn(type); +} + +/** * free_percpu - free percpu area * @ptr: pointer to area to free * @@ -1934,8 +2083,9 @@ void free_percpu(void __percpu *ptr) void *addr; struct pcpu_chunk *chunk; unsigned long flags; - int off; + int size, off; bool need_balance = false; + struct list_head *pcpu_slot; if (!ptr) return; @@ -1949,7 +2099,11 @@ void free_percpu(void __percpu *ptr) chunk = pcpu_chunk_addr_search(addr); off = addr - chunk->base_addr; - pcpu_free_area(chunk, off); + size = pcpu_free_area(chunk, off); + + pcpu_slot = pcpu_chunk_list(pcpu_chunk_type(chunk)); + + pcpu_memcg_free_hook(chunk, off, size); /* if there are more than one fully free chunks, wake up grim reaper */ if (chunk->free_bytes == pcpu_unit_size) { @@ -2260,6 +2414,7 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, int map_size; unsigned long tmp_addr; size_t alloc_size; + enum pcpu_chunk_type type; #define PCPU_SETUP_BUG_ON(cond) do { \ if (unlikely(cond)) { \ @@ -2377,13 +2532,18 @@ void __init pcpu_setup_first_chunk(const struct pcpu_alloc_info *ai, * empty chunks. */ pcpu_nr_slots = __pcpu_size_to_slot(pcpu_unit_size) + 2; - pcpu_slot = memblock_alloc(pcpu_nr_slots * sizeof(pcpu_slot[0]), - SMP_CACHE_BYTES); - if (!pcpu_slot) + pcpu_chunk_lists = memblock_alloc(pcpu_nr_slots * + sizeof(pcpu_chunk_lists[0]) * + PCPU_NR_CHUNK_TYPES, + SMP_CACHE_BYTES); + if (!pcpu_chunk_lists) panic("%s: Failed to allocate %zu bytes\n", __func__, - pcpu_nr_slots * sizeof(pcpu_slot[0])); - for (i = 0; i < pcpu_nr_slots; i++) - INIT_LIST_HEAD(&pcpu_slot[i]); + pcpu_nr_slots * sizeof(pcpu_chunk_lists[0]) * + PCPU_NR_CHUNK_TYPES); + + for (type = 0; type < PCPU_NR_CHUNK_TYPES; type++) + for (i = 0; i < pcpu_nr_slots; i++) + INIT_LIST_HEAD(&pcpu_chunk_list(type)[i]); /* * The end of the static region needs to be aligned with the diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c index cc85ce81914a..29c052099aff 100644 --- a/mm/process_vm_access.c +++ b/mm/process_vm_access.c @@ -105,7 +105,7 @@ static int process_vm_rw_single_vec(unsigned long addr, * current/current->mm */ mmap_read_lock(mm); - pinned_pages = pin_user_pages_remote(task, mm, pa, pinned_pages, + pinned_pages = pin_user_pages_remote(mm, pa, pinned_pages, flags, process_pages, NULL, &locked); if (locked) diff --git a/mm/rmap.c b/mm/rmap.c index 5fe2dedce1fc..6cce9ef06753 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1469,7 +1469,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, * do this outside rmap routines. */ VM_BUG_ON(!(flags & TTU_RMAP_LOCKED)); - if (huge_pmd_unshare(mm, &address, pvmw.pte)) { + if (huge_pmd_unshare(mm, vma, &address, pvmw.pte)) { /* * huge_pmd_unshare unmapped an entire PMD * page. There is no way of knowing exactly diff --git a/mm/shmem.c b/mm/shmem.c index eb6b36d89722..271548ca20f3 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1434,7 +1434,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) list_add(&info->swaplist, &shmem_swaplist); if (add_to_swap_cache(page, swap, - __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN) == 0) { + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN, + NULL) == 0) { spin_lock_irq(&info->lock); shmem_recalc_inode(inode); info->swapped++; @@ -1685,7 +1686,7 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp, * Swap in the page pointed to by *pagep. * Caller has to make sure that *pagep contains a valid swapped page. * Returns 0 and the page in pagep if success. On failure, returns the - * the error code and NULL in *pagep. + * error code and NULL in *pagep. */ static int shmem_swapin_page(struct inode *inode, pgoff_t index, struct page **pagep, enum sgp_type sgp, diff --git a/mm/slab_common.c b/mm/slab_common.c index a513f3237155..f9ccd5dc13f3 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -419,7 +419,7 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work) /* * On destruction, SLAB_TYPESAFE_BY_RCU kmem_caches are put on the * @slab_caches_to_rcu_destroy list. The slab pages are freed - * through RCU and and the associated kmem_cache are dereferenced + * through RCU and the associated kmem_cache are dereferenced * while freeing the pages, so the kmem_caches should be freed only * after the pending RCU operations are finished. As rcu_barrier() * is a pretty slow operation, we batch all pending destructions diff --git a/mm/swap.c b/mm/swap.c index de257c0a89b1..9285e60c7d6e 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -476,23 +476,24 @@ void lru_cache_add(struct page *page) EXPORT_SYMBOL(lru_cache_add); /** - * lru_cache_add_active_or_unevictable + * lru_cache_add_inactive_or_unevictable * @page: the page to be added to LRU * @vma: vma in which page is mapped for determining reclaimability * - * Place @page on the active or unevictable LRU list, depending on its + * Place @page on the inactive or unevictable LRU list, depending on its * evictability. Note that if the page is not evictable, it goes * directly back onto it's zone's unevictable list, it does NOT use a * per cpu pagevec. */ -void lru_cache_add_active_or_unevictable(struct page *page, +void lru_cache_add_inactive_or_unevictable(struct page *page, struct vm_area_struct *vma) { + bool unevictable; + VM_BUG_ON_PAGE(PageLRU(page), page); - if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) - SetPageActive(page); - else if (!TestSetPageMlocked(page)) { + unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED; + if (unlikely(unevictable) && !TestSetPageMlocked(page)) { /* * We use the irq-unsafe __mod_zone_page_stat because this * counter is not modified from interrupt context, and the pte diff --git a/mm/swap_state.c b/mm/swap_state.c index e82f4f8b1f63..b73aabdfd35a 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -106,16 +106,32 @@ void show_swap_cache_info(void) printk("Total swap = %lukB\n", total_swap_pages << (PAGE_SHIFT - 10)); } +void *get_shadow_from_swap_cache(swp_entry_t entry) +{ + struct address_space *address_space = swap_address_space(entry); + pgoff_t idx = swp_offset(entry); + struct page *page; + + page = find_get_entry(address_space, idx); + if (xa_is_value(page)) + return page; + if (page) + put_page(page); + return NULL; +} + /* * add_to_swap_cache resembles add_to_page_cache_locked on swapper_space, * but sets SwapCache flag and private instead of mapping and index. */ -int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp) +int add_to_swap_cache(struct page *page, swp_entry_t entry, + gfp_t gfp, void **shadowp) { struct address_space *address_space = swap_address_space(entry); pgoff_t idx = swp_offset(entry); XA_STATE_ORDER(xas, &address_space->i_pages, idx, compound_order(page)); unsigned long i, nr = hpage_nr_pages(page); + void *old; VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(PageSwapCache(page), page); @@ -125,16 +141,25 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp) SetPageSwapCache(page); do { + unsigned long nr_shadows = 0; + xas_lock_irq(&xas); xas_create_range(&xas); if (xas_error(&xas)) goto unlock; for (i = 0; i < nr; i++) { VM_BUG_ON_PAGE(xas.xa_index != idx + i, page); + old = xas_load(&xas); + if (xa_is_value(old)) { + nr_shadows++; + if (shadowp) + *shadowp = old; + } set_page_private(page + i, entry.val + i); xas_store(&xas, page); xas_next(&xas); } + address_space->nrexceptional -= nr_shadows; address_space->nrpages += nr; __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr); ADD_CACHE_INFO(add_total, nr); @@ -154,7 +179,8 @@ unlock: * This must be called only on pages that have * been verified to be in the swap cache. */ -void __delete_from_swap_cache(struct page *page, swp_entry_t entry) +void __delete_from_swap_cache(struct page *page, + swp_entry_t entry, void *shadow) { struct address_space *address_space = swap_address_space(entry); int i, nr = hpage_nr_pages(page); @@ -166,12 +192,14 @@ void __delete_from_swap_cache(struct page *page, swp_entry_t entry) VM_BUG_ON_PAGE(PageWriteback(page), page); for (i = 0; i < nr; i++) { - void *entry = xas_store(&xas, NULL); + void *entry = xas_store(&xas, shadow); VM_BUG_ON_PAGE(entry != page, entry); set_page_private(page + i, 0); xas_next(&xas); } ClearPageSwapCache(page); + if (shadow) + address_space->nrexceptional += nr; address_space->nrpages -= nr; __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr); ADD_CACHE_INFO(del_total, nr); @@ -208,7 +236,7 @@ int add_to_swap(struct page *page) * Add it to the swap cache. */ err = add_to_swap_cache(page, entry, - __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); + __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN, NULL); if (err) /* * add_to_swap_cache() doesn't return -EEXIST, so we can safely @@ -246,13 +274,44 @@ void delete_from_swap_cache(struct page *page) struct address_space *address_space = swap_address_space(entry); xa_lock_irq(&address_space->i_pages); - __delete_from_swap_cache(page, entry); + __delete_from_swap_cache(page, entry, NULL); xa_unlock_irq(&address_space->i_pages); put_swap_page(page, entry); page_ref_sub(page, hpage_nr_pages(page)); } +void clear_shadow_from_swap_cache(int type, unsigned long begin, + unsigned long end) +{ + unsigned long curr = begin; + void *old; + + for (;;) { + unsigned long nr_shadows = 0; + swp_entry_t entry = swp_entry(type, curr); + struct address_space *address_space = swap_address_space(entry); + XA_STATE(xas, &address_space->i_pages, curr); + + xa_lock_irq(&address_space->i_pages); + xas_for_each(&xas, old, end) { + if (!xa_is_value(old)) + continue; + xas_store(&xas, NULL); + nr_shadows++; + } + address_space->nrexceptional -= nr_shadows; + xa_unlock_irq(&address_space->i_pages); + + /* search the next swapcache until we meet end */ + curr >>= SWAP_ADDRESS_SPACE_SHIFT; + curr++; + curr <<= SWAP_ADDRESS_SPACE_SHIFT; + if (curr > end) + break; + } +} + /* * If we are the only user, then try to free up the swap cache. * @@ -361,6 +420,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, { struct swap_info_struct *si; struct page *page; + void *shadow = NULL; *new_page_allocated = false; @@ -429,7 +489,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, __SetPageSwapBacked(page); /* May fail (-ENOMEM) if XArray node allocation failed. */ - if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK)) { + if (add_to_swap_cache(page, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow)) { put_swap_page(page, entry); goto fail_unlock; } @@ -439,10 +499,8 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, goto fail_unlock; } - /* XXX: Move to lru_cache_add() when it supports new vs putback */ - spin_lock_irq(&page_pgdat(page)->lru_lock); - lru_note_cost_page(page); - spin_unlock_irq(&page_pgdat(page)->lru_lock); + if (shadow) + workingset_refault(page, shadow); /* Caller will initiate read into locked page */ SetPageWorkingset(page); diff --git a/mm/swapfile.c b/mm/swapfile.c index 6c26916e95fd..e653eea1eb88 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -696,6 +696,7 @@ static void add_to_avail_list(struct swap_info_struct *p) static void swap_range_free(struct swap_info_struct *si, unsigned long offset, unsigned int nr_entries) { + unsigned long begin = offset; unsigned long end = offset + nr_entries - 1; void (*swap_slot_free_notify)(struct block_device *, unsigned long); @@ -721,6 +722,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, swap_slot_free_notify(si->bdev, offset); offset++; } + clear_shadow_from_swap_cache(si->type, begin, end); } static void set_cluster_next(struct swap_info_struct *si, unsigned long next) @@ -1915,7 +1917,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd, page_add_anon_rmap(page, vma, addr, false); } else { /* ksm created a completely new copy */ page_add_new_anon_rmap(page, vma, addr, false); - lru_cache_add_active_or_unevictable(page, vma); + lru_cache_add_inactive_or_unevictable(page, vma); } swap_free(entry); /* diff --git a/mm/usercopy.c b/mm/usercopy.c index 660717a1ea5c..b3de3c4eefba 100644 --- a/mm/usercopy.c +++ b/mm/usercopy.c @@ -43,7 +43,7 @@ static noinline int check_stack_object(const void *obj, unsigned long len) /* * Reject: object partially overlaps the stack (passing the - * the check above means at least one end is within the stack, + * check above means at least one end is within the stack, * so if this check fails, the other end is outside the stack). */ if (obj < stack || stackend < obj + len) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index b80419320c7d..9a3d451402d7 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -123,7 +123,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm, inc_mm_counter(dst_mm, MM_ANONPAGES); page_add_new_anon_rmap(page, dst_vma, dst_addr, false); - lru_cache_add_active_or_unevictable(page, dst_vma); + lru_cache_add_inactive_or_unevictable(page, dst_vma); set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); diff --git a/mm/vmscan.c b/mm/vmscan.c index 72da290b171b..738115ed75e2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -854,6 +854,7 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, { unsigned long flags; int refcount; + void *shadow = NULL; BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); @@ -896,13 +897,13 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (PageSwapCache(page)) { swp_entry_t swap = { .val = page_private(page) }; mem_cgroup_swapout(page, swap); - __delete_from_swap_cache(page, swap); + if (reclaimed && !mapping_exiting(mapping)) + shadow = workingset_eviction(page, target_memcg); + __delete_from_swap_cache(page, swap, shadow); xa_unlock_irqrestore(&mapping->i_pages, flags); put_swap_page(page, swap); - workingset_eviction(page, target_memcg); } else { void (*freepage)(struct page *); - void *shadow = NULL; freepage = mapping->a_ops->freepage; /* @@ -998,8 +999,6 @@ static enum page_references page_check_references(struct page *page, return PAGEREF_RECLAIM; if (referenced_ptes) { - if (PageSwapBacked(page)) - return PAGEREF_ACTIVATE; /* * All mapped pages start out with page table * references from the instantiating fault, so we need @@ -1022,7 +1021,7 @@ static enum page_references page_check_references(struct page *page, /* * Activate file-backed executable pages after first usage. */ - if (vm_flags & VM_EXEC) + if ((vm_flags & VM_EXEC) && !PageSwapBacked(page)) return PAGEREF_ACTIVATE; return PAGEREF_KEEP; @@ -2685,7 +2684,10 @@ again: if (!sc->force_deactivate) { unsigned long refaults; - if (inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) + refaults = lruvec_page_state(target_lruvec, + WORKINGSET_ACTIVATE_ANON); + if (refaults != target_lruvec->refaults[0] || + inactive_is_low(target_lruvec, LRU_INACTIVE_ANON)) sc->may_deactivate |= DEACTIVATE_ANON; else sc->may_deactivate &= ~DEACTIVATE_ANON; @@ -2696,8 +2698,8 @@ again: * rid of any stale active pages quickly. */ refaults = lruvec_page_state(target_lruvec, - WORKINGSET_ACTIVATE); - if (refaults != target_lruvec->refaults || + WORKINGSET_ACTIVATE_FILE); + if (refaults != target_lruvec->refaults[1] || inactive_is_low(target_lruvec, LRU_INACTIVE_FILE)) sc->may_deactivate |= DEACTIVATE_FILE; else @@ -2796,7 +2798,7 @@ again: set_bit(PGDAT_DIRTY, &pgdat->flags); /* - * If kswapd scans pages marked marked for immediate + * If kswapd scans pages marked for immediate * reclaim and under writeback (nr_immediate), it * implies that pages are cycling through the LRU * faster than they are written so also forcibly stall. @@ -2974,8 +2976,10 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat) unsigned long refaults; target_lruvec = mem_cgroup_lruvec(target_memcg, pgdat); - refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE); - target_lruvec->refaults = refaults; + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_ANON); + target_lruvec->refaults[0] = refaults; + refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE); + target_lruvec->refaults[1] = refaults; } /* @@ -3369,7 +3373,7 @@ static bool pgdat_watermark_boosted(pg_data_t *pgdat, int highest_zoneidx) /* * Check for watermark boosts top-down as the higher zones * are more likely to be boosted. Both watermarks and boosts - * should not be checked at the time time as reclaim would + * should not be checked at the same time as reclaim would * start prematurely when there is no boosting and a lower * zone is balanced. */ diff --git a/mm/vmstat.c b/mm/vmstat.c index 2b866cbab11d..727a26d1ec1d 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1096,6 +1096,24 @@ static int __fragmentation_index(unsigned int order, struct contig_page_info *in return 1000 - div_u64( (1000+(div_u64(info->free_pages * 1000ULL, requested))), info->free_blocks_total); } +/* + * Calculates external fragmentation within a zone wrt the given order. + * It is defined as the percentage of pages found in blocks of size + * less than 1 << order. It returns values in range [0, 100]. + */ +unsigned int extfrag_for_order(struct zone *zone, unsigned int order) +{ + struct contig_page_info info; + + fill_contig_page_info(zone, order, &info); + if (info.free_pages == 0) + return 0; + + return div_u64((info.free_pages - + (info.free_blocks_suitable << order)) * 100, + info.free_pages); +} + /* Same as __fragmentation index but allocs contig_page_info on stack */ int fragmentation_index(struct zone *zone, unsigned int order) { @@ -1167,9 +1185,12 @@ const char * const vmstat_text[] = { "nr_isolated_anon", "nr_isolated_file", "workingset_nodes", - "workingset_refault", - "workingset_activate", - "workingset_restore", + "workingset_refault_anon", + "workingset_refault_file", + "workingset_activate_anon", + "workingset_activate_file", + "workingset_restore_anon", + "workingset_restore_file", "workingset_nodereclaim", "nr_anon_pages", "nr_mapped", @@ -1256,6 +1277,9 @@ const char * const vmstat_text[] = { #ifdef CONFIG_MIGRATION "pgmigrate_success", "pgmigrate_fail", + "thp_migration_success", + "thp_migration_fail", + "thp_migration_split", #endif #ifdef CONFIG_COMPACTION "compact_migrate_scanned", diff --git a/mm/workingset.c b/mm/workingset.c index b199726924dd..8cbe4e3cbe5c 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -6,6 +6,7 @@ */ #include <linux/memcontrol.h> +#include <linux/mm_inline.h> #include <linux/writeback.h> #include <linux/shmem_fs.h> #include <linux/pagemap.h> @@ -280,6 +281,7 @@ void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg) */ void workingset_refault(struct page *page, void *shadow) { + bool file = page_is_file_lru(page); struct mem_cgroup *eviction_memcg; struct lruvec *eviction_lruvec; unsigned long refault_distance; @@ -346,27 +348,34 @@ void workingset_refault(struct page *page, void *shadow) memcg = page_memcg(page); lruvec = mem_cgroup_lruvec(memcg, pgdat); - inc_lruvec_state(lruvec, WORKINGSET_REFAULT); + inc_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file); /* * Compare the distance to the existing workingset size. We * don't activate pages that couldn't stay resident even if - * all the memory was available to the page cache. Whether - * cache can compete with anon or not depends on having swap. + * all the memory was available to the workingset. Whether + * workingset competition needs to consider anon or not depends + * on having swap. */ workingset_size = lruvec_page_state(eviction_lruvec, NR_ACTIVE_FILE); - if (mem_cgroup_get_nr_swap_pages(memcg) > 0) { + if (!file) { workingset_size += lruvec_page_state(eviction_lruvec, - NR_INACTIVE_ANON); + NR_INACTIVE_FILE); + } + if (mem_cgroup_get_nr_swap_pages(memcg) > 0) { workingset_size += lruvec_page_state(eviction_lruvec, NR_ACTIVE_ANON); + if (file) { + workingset_size += lruvec_page_state(eviction_lruvec, + NR_INACTIVE_ANON); + } } if (refault_distance > workingset_size) goto out; SetPageActive(page); workingset_age_nonresident(lruvec, hpage_nr_pages(page)); - inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE); + inc_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + file); /* Page was active prior to eviction */ if (workingset) { @@ -375,7 +384,7 @@ void workingset_refault(struct page *page, void *shadow) spin_lock_irq(&page_pgdat(page)->lru_lock); lru_note_cost_page(page); spin_unlock_irq(&page_pgdat(page)->lru_lock); - inc_lruvec_state(lruvec, WORKINGSET_RESTORE); + inc_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file); } out: rcu_read_unlock(); diff --git a/mm/zpool.c b/mm/zpool.c index 863669212070..3744a2d1a624 100644 --- a/mm/zpool.c +++ b/mm/zpool.c @@ -239,15 +239,15 @@ const char *zpool_get_type(struct zpool *zpool) } /** - * zpool_malloc_support_movable() - Check if the zpool support - * allocate movable memory + * zpool_malloc_support_movable() - Check if the zpool supports + * allocating movable memory * @zpool: The zpool to check * - * This returns if the zpool support allocate movable memory. + * This returns if the zpool supports allocating movable memory. * * Implementations must guarantee this to be thread-safe. * - * Returns: true if if the zpool support allocate movable memory, false if not + * Returns: true if the zpool supports allocating movable memory, false if not */ bool zpool_malloc_support_movable(struct zpool *zpool) { diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index 952a01e45c6a..c36fdff9a371 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -79,7 +79,7 @@ /* * Object location (<PFN>, <obj_idx>) is encoded as - * as single (unsigned long) handle value. + * a single (unsigned long) handle value. * * Note that object index <obj_idx> starts from 0. * diff --git a/scripts/checkpatch.pl b/scripts/checkpatch.pl index 599b8c4933a7..2cbeae6d9aee 100755 --- a/scripts/checkpatch.pl +++ b/scripts/checkpatch.pl @@ -59,7 +59,7 @@ my $spelling_file = "$D/spelling.txt"; my $codespell = 0; my $codespellfile = "/usr/share/codespell/dictionary.txt"; my $conststructsfile = "$D/const_structs.checkpatch"; -my $typedefsfile = ""; +my $typedefsfile; my $color = "auto"; my $allow_c99_comments = 1; # Can be overridden by --ignore C99_COMMENT_TOLERANCE # git output parsing needs US English output, so first set backtick child process LANGUAGE @@ -588,6 +588,8 @@ our @mode_permission_funcs = ( ["__ATTR", 2], ); +my $word_pattern = '\b[A-Z]?[a-z]{2,}\b'; + #Create a search pattern for all these functions to speed up a loop below our $mode_perms_search = ""; foreach my $entry (@mode_permission_funcs) { @@ -756,7 +758,7 @@ sub read_words { next; } - $$wordsRef .= '|' if ($$wordsRef ne ""); + $$wordsRef .= '|' if (defined $$wordsRef); $$wordsRef .= $line; } close($file); @@ -766,16 +768,18 @@ sub read_words { return 0; } -my $const_structs = ""; -read_words(\$const_structs, $conststructsfile) - or warn "No structs that should be const will be found - file '$conststructsfile': $!\n"; +my $const_structs; +if (show_type("CONST_STRUCT")) { + read_words(\$const_structs, $conststructsfile) + or warn "No structs that should be const will be found - file '$conststructsfile': $!\n"; +} -my $typeOtherTypedefs = ""; -if (length($typedefsfile)) { +if (defined($typedefsfile)) { + my $typeOtherTypedefs; read_words(\$typeOtherTypedefs, $typedefsfile) or warn "No additional types will be considered - file '$typedefsfile': $!\n"; + $typeTypedefs .= '|' . $typeOtherTypedefs if (defined $typeOtherTypedefs); } -$typeTypedefs .= '|' . $typeOtherTypedefs if ($typeOtherTypedefs ne ""); sub build_types { my $mods = "(?x: \n" . join("|\n ", (@modifierList, @modifierListFile)) . "\n)"; @@ -3310,6 +3314,42 @@ sub process { } } +# check for repeated words separated by a single space + if ($rawline =~ /^\+/) { + while ($rawline =~ /\b($word_pattern) (?=($word_pattern))/g) { + + my $first = $1; + my $second = $2; + + if ($first =~ /(?:struct|union|enum)/) { + pos($rawline) += length($first) + length($second) + 1; + next; + } + + next if ($first ne $second); + next if ($first eq 'long'); + + if (WARN("REPEATED_WORD", + "Possible repeated word: '$first'\n" . $herecurr) && + $fix) { + $fixed[$fixlinenr] =~ s/\b$first $second\b/$first/; + } + } + + # if it's a repeated word on consecutive lines in a comment block + if ($prevline =~ /$;+\s*$/ && + $prevrawline =~ /($word_pattern)\s*$/) { + my $last_word = $1; + if ($rawline =~ /^\+\s*\*\s*$last_word /) { + if (WARN("REPEATED_WORD", + "Possible repeated word: '$last_word'\n" . $hereprev) && + $fix) { + $fixed[$fixlinenr] =~ s/(\+\s*\*\s*)$last_word /$1/; + } + } + } + } + # check for space before tabs. if ($rawline =~ /^\+/ && $rawline =~ / \t/) { my $herevet = "$here\n" . cat_vet($rawline) . "\n"; @@ -5020,8 +5060,30 @@ sub process { my ($s, $c) = ($stat, $cond); if ($c =~ /\bif\s*\(.*[^<>!=]=[^=].*/s) { - ERROR("ASSIGN_IN_IF", - "do not use assignment in if condition\n" . $herecurr); + if (ERROR("ASSIGN_IN_IF", + "do not use assignment in if condition\n" . $herecurr) && + $fix && $perl_version_ok) { + if ($rawline =~ /^\+(\s+)if\s*\(\s*(\!)?\s*\(\s*(($Lval)\s*=\s*$LvalOrFunc)\s*\)\s*(?:($Compare)\s*($FuncArg))?\s*\)\s*(\{)?\s*$/) { + my $space = $1; + my $not = $2; + my $statement = $3; + my $assigned = $4; + my $test = $8; + my $against = $9; + my $brace = $15; + fix_delete_line($fixlinenr, $rawline); + fix_insert_line($fixlinenr, "$space$statement;"); + my $newline = "${space}if ("; + $newline .= '!' if defined($not); + $newline .= '(' if (defined $not && defined($test) && defined($against)); + $newline .= "$assigned"; + $newline .= " $test $against" if (defined($test) && defined($against)); + $newline .= ')' if (defined $not && defined($test) && defined($against)); + $newline .= ')'; + $newline .= " {" if (defined($brace)); + fix_insert_line($fixlinenr + 1, $newline); + } + } } # Find out what is on the end of the line after the @@ -6465,6 +6527,12 @@ sub process { } } +# check for IS_ENABLED() without CONFIG_<FOO> ($rawline for comments too) + if ($rawline =~ /\bIS_ENABLED\s*\(\s*(\w+)\s*\)/ && $1 !~ /^CONFIG_/) { + WARN("IS_ENABLED_CONFIG", + "IS_ENABLED($1) is normally used as IS_ENABLED(CONFIG_$1)\n" . $herecurr); + } + # check for #if defined CONFIG_<FOO> || defined CONFIG_<FOO>_MODULE if ($line =~ /^\+\s*#\s*if\s+defined(?:\s*\(?\s*|\s+)(CONFIG_[A-Z_]+)\s*\)?\s*\|\|\s*defined(?:\s*\(?\s*|\s+)\1_MODULE\s*\)?\s*$/) { my $config = $1; @@ -6475,31 +6543,6 @@ sub process { } } -# check for case / default statements not preceded by break/fallthrough/switch - if ($line =~ /^.\s*(?:case\s+(?:$Ident|$Constant)\s*|default):/) { - my $has_break = 0; - my $has_statement = 0; - my $count = 0; - my $prevline = $linenr; - while ($prevline > 1 && ($file || $count < 3) && !$has_break) { - $prevline--; - my $rline = $rawlines[$prevline - 1]; - my $fline = $lines[$prevline - 1]; - last if ($fline =~ /^\@\@/); - next if ($fline =~ /^\-/); - next if ($fline =~ /^.(?:\s*(?:case\s+(?:$Ident|$Constant)[\s$;]*|default):[\s$;]*)*$/); - $has_break = 1 if ($rline =~ /fall[\s_-]*(through|thru)/i); - next if ($fline =~ /^.[\s$;]*$/); - $has_statement = 1; - $count++; - $has_break = 1 if ($fline =~ /\bswitch\b|\b(?:break\s*;[\s$;]*$|exit\s*\(\b|return\b|goto\b|continue\b)/); - } - if (!$has_break && $has_statement) { - WARN("MISSING_BREAK", - "Possible switch case/default not preceded by break or fallthrough comment\n" . $herecurr); - } - } - # check for /* fallthrough */ like comment, prefer fallthrough; my @fallthroughs = ( 'fallthrough', @@ -6615,7 +6658,8 @@ sub process { # check for various structs that are normally const (ops, kgdb, device_tree) # and avoid what seem like struct definitions 'struct foo {' - if ($line !~ /\bconst\b/ && + if (defined($const_structs) && + $line !~ /\bconst\b/ && $line =~ /\bstruct\s+($const_structs)\b(?!\s*\{)/) { WARN("CONST_STRUCT", "struct $1 should normally be const\n" . $herecurr); diff --git a/scripts/gdb/linux/rbtree.py b/scripts/gdb/linux/rbtree.py index c4b991607917..fe462855eefd 100644 --- a/scripts/gdb/linux/rbtree.py +++ b/scripts/gdb/linux/rbtree.py @@ -17,7 +17,7 @@ def rb_first(root): raise gdb.GdbError("Must be struct rb_root not {}".format(root.type)) node = root['rb_node'] - if node is 0: + if node == 0: return None while node['rb_left']: @@ -33,7 +33,7 @@ def rb_last(root): raise gdb.GdbError("Must be struct rb_root not {}".format(root.type)) node = root['rb_node'] - if node is 0: + if node == 0: return None while node['rb_right']: diff --git a/security/tomoyo/domain.c b/security/tomoyo/domain.c index 53b3e1f5f227..dc4ecc0b2038 100644 --- a/security/tomoyo/domain.c +++ b/security/tomoyo/domain.c @@ -914,7 +914,7 @@ bool tomoyo_dump_page(struct linux_binprm *bprm, unsigned long pos, * (represented by bprm). 'current' is the process doing * the execve(). */ - if (get_user_pages_remote(current, bprm->mm, pos, 1, + if (get_user_pages_remote(bprm->mm, pos, 1, FOLL_FORCE, &page, NULL, NULL) <= 0) return false; #else diff --git a/tools/testing/selftests/cgroup/test_kmem.c b/tools/testing/selftests/cgroup/test_kmem.c index 5224dae216e5..0941aa16157e 100644 --- a/tools/testing/selftests/cgroup/test_kmem.c +++ b/tools/testing/selftests/cgroup/test_kmem.c @@ -18,6 +18,15 @@ #include "cgroup_util.h" +/* + * Memory cgroup charging and vmstat data aggregation is performed using + * percpu batches 32 pages big (look at MEMCG_CHARGE_BATCH). So the maximum + * discrepancy between charge and vmstat entries is number of cpus multiplied + * by 32 pages multiplied by 2. + */ +#define MAX_VMSTAT_ERROR (4096 * 32 * 2 * get_nprocs()) + + static int alloc_dcache(const char *cgroup, void *arg) { unsigned long i; @@ -180,7 +189,7 @@ static int test_kmem_memcg_deletion(const char *root) goto cleanup; sum = slab + anon + file + kernel_stack; - if (abs(sum - current) < 4096 * 32 * 2 * get_nprocs()) { + if (abs(sum - current) < MAX_VMSTAT_ERROR) { ret = KSFT_PASS; } else { printf("memory.current = %ld\n", current); @@ -331,6 +340,64 @@ cleanup: return ret; } +/* + * This test creates a sub-tree with 1000 memory cgroups. + * Then it checks that the memory.current on the parent level + * is greater than 0 and approximates matches the percpu value + * from memory.stat. + */ +static int test_percpu_basic(const char *root) +{ + int ret = KSFT_FAIL; + char *parent, *child; + long current, percpu; + int i; + + parent = cg_name(root, "percpu_basic_test"); + if (!parent) + goto cleanup; + + if (cg_create(parent)) + goto cleanup; + + if (cg_write(parent, "cgroup.subtree_control", "+memory")) + goto cleanup; + + for (i = 0; i < 1000; i++) { + child = cg_name_indexed(parent, "child", i); + if (!child) + return -1; + + if (cg_create(child)) + goto cleanup_children; + + free(child); + } + + current = cg_read_long(parent, "memory.current"); + percpu = cg_read_key_long(parent, "memory.stat", "percpu "); + + if (current > 0 && percpu > 0 && abs(current - percpu) < + MAX_VMSTAT_ERROR) + ret = KSFT_PASS; + else + printf("memory.current %ld\npercpu %ld\n", + current, percpu); + +cleanup_children: + for (i = 0; i < 1000; i++) { + child = cg_name_indexed(parent, "child", i); + cg_destroy(child); + free(child); + } + +cleanup: + cg_destroy(parent); + free(parent); + + return ret; +} + #define T(x) { x, #x } struct kmem_test { int (*fn)(const char *root); @@ -341,6 +408,7 @@ struct kmem_test { T(test_kmem_proc_kpagecgroup), T(test_kmem_kernel_stacks), T(test_kmem_dead_cgroups), + T(test_percpu_basic), }; #undef T diff --git a/tools/testing/selftests/kmod/kmod.sh b/tools/testing/selftests/kmod/kmod.sh index ea2147248ebe..afd42387e8b2 100755 --- a/tools/testing/selftests/kmod/kmod.sh +++ b/tools/testing/selftests/kmod/kmod.sh @@ -343,7 +343,7 @@ kmod_test_0001_driver() kmod_defaults_driver config_num_threads 1 - printf '\000' >"$DIR"/config_test_driver + printf $NAME >"$DIR"/config_test_driver config_trigger ${FUNCNAME[0]} config_expect_result ${FUNCNAME[0]} MODULE_NOT_FOUND } @@ -354,7 +354,7 @@ kmod_test_0001_fs() kmod_defaults_fs config_num_threads 1 - printf '\000' >"$DIR"/config_test_fs + printf $NAME >"$DIR"/config_test_fs config_trigger ${FUNCNAME[0]} config_expect_result ${FUNCNAME[0]} -EINVAL } diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c index 91d38a29956b..93fc5cadce61 100644 --- a/tools/testing/selftests/vm/hmm-tests.c +++ b/tools/testing/selftests/vm/hmm-tests.c @@ -942,6 +942,41 @@ TEST_F(hmm, migrate_fault) } /* + * Migrate anonymous shared memory to device private memory. + */ +TEST_F(hmm, migrate_shared) +{ + struct hmm_buffer *buffer; + unsigned long npages; + unsigned long size; + int ret; + + npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift; + ASSERT_NE(npages, 0); + size = npages << self->page_shift; + + buffer = malloc(sizeof(*buffer)); + ASSERT_NE(buffer, NULL); + + buffer->fd = -1; + buffer->size = size; + buffer->mirror = malloc(size); + ASSERT_NE(buffer->mirror, NULL); + + buffer->ptr = mmap(NULL, size, + PROT_READ | PROT_WRITE, + MAP_SHARED | MAP_ANONYMOUS, + buffer->fd, 0); + ASSERT_NE(buffer->ptr, MAP_FAILED); + + /* Migrate memory to device. */ + ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages); + ASSERT_EQ(ret, -ENOENT); + + hmm_buffer_free(buffer); +} + +/* * Try to migrate various memory types to device private memory. */ TEST_F(hmm2, migrate_mixed) diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c index 390f758d5a27..dd777688d14a 100644 --- a/virt/kvm/async_pf.c +++ b/virt/kvm/async_pf.c @@ -61,7 +61,7 @@ static void async_pf_execute(struct work_struct *work) * access remotely. */ mmap_read_lock(mm); - get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL, + get_user_pages_remote(mm, addr, 1, FOLL_WRITE, NULL, NULL, &locked); if (locked) mmap_read_unlock(mm); diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2c2c0254c2d8..737666db02de 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1893,7 +1893,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, * not call the fault handler, so do it here. */ bool unlocked = false; - r = fixup_user_fault(current, current->mm, addr, + r = fixup_user_fault(current->mm, addr, (write_fault ? FAULT_FLAG_WRITE : 0), &unlocked); if (unlocked) |