diff options
Diffstat (limited to 'Documentation/admin-guide')
25 files changed, 711 insertions, 364 deletions
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst index caa3c09a5c3f..9a969c0157f1 100644 --- a/Documentation/admin-guide/README.rst +++ b/Documentation/admin-guide/README.rst @@ -1,9 +1,9 @@ .. _readme: -Linux kernel release 5.x <http://kernel.org/> +Linux kernel release 6.x <http://kernel.org/> ============================================= -These are the release notes for Linux version 5. Read them carefully, +These are the release notes for Linux version 6. Read them carefully, as they tell you what this is all about, explain how to install the kernel, and what to do if something goes wrong. @@ -63,7 +63,7 @@ Installing the kernel source directory where you have permissions (e.g. your home directory) and unpack it:: - xz -cd linux-5.x.tar.xz | tar xvf - + xz -cd linux-6.x.tar.xz | tar xvf - Replace "X" with the version number of the latest kernel. @@ -72,12 +72,12 @@ Installing the kernel source files. They should match the library, and not get messed up by whatever the kernel-du-jour happens to be. - - You can also upgrade between 5.x releases by patching. Patches are + - You can also upgrade between 6.x releases by patching. Patches are distributed in the xz format. To install by patching, get all the newer patch files, enter the top level directory of the kernel source - (linux-5.x) and execute:: + (linux-6.x) and execute:: - xz -cd ../patch-5.x.xz | patch -p1 + xz -cd ../patch-6.x.xz | patch -p1 Replace "x" for all versions bigger than the version "x" of your current source tree, **in_order**, and you should be ok. You may want to remove @@ -85,13 +85,13 @@ Installing the kernel source that there are no failed patches (some-file-name# or some-file-name.rej). If there are, either you or I have made a mistake. - Unlike patches for the 5.x kernels, patches for the 5.x.y kernels + Unlike patches for the 6.x kernels, patches for the 6.x.y kernels (also known as the -stable kernels) are not incremental but instead apply - directly to the base 5.x kernel. For example, if your base kernel is 5.0 - and you want to apply the 5.0.3 patch, you must not first apply the 5.0.1 - and 5.0.2 patches. Similarly, if you are running kernel version 5.0.2 and - want to jump to 5.0.3, you must first reverse the 5.0.2 patch (that is, - patch -R) **before** applying the 5.0.3 patch. You can read more on this in + directly to the base 6.x kernel. For example, if your base kernel is 6.0 + and you want to apply the 6.0.3 patch, you must not first apply the 6.0.1 + and 6.0.2 patches. Similarly, if you are running kernel version 6.0.2 and + want to jump to 6.0.3, you must first reverse the 6.0.2 patch (that is, + patch -R) **before** applying the 6.0.3 patch. You can read more on this in :ref:`Documentation/process/applying-patches.rst <applying_patches>`. Alternatively, the script patch-kernel can be used to automate this @@ -114,7 +114,7 @@ Installing the kernel source Software requirements --------------------- - Compiling and running the 5.x kernels requires up-to-date + Compiling and running the 6.x kernels requires up-to-date versions of various software packages. Consult :ref:`Documentation/process/changes.rst <changes>` for the minimum version numbers required and how to get updates for these packages. Beware that using @@ -132,12 +132,12 @@ Build directory for the kernel place for the output files (including .config). Example:: - kernel source code: /usr/src/linux-5.x + kernel source code: /usr/src/linux-6.x build directory: /home/name/build/kernel To configure and build the kernel, use:: - cd /usr/src/linux-5.x + cd /usr/src/linux-6.x make O=/home/name/build/kernel menuconfig make O=/home/name/build/kernel sudo make O=/home/name/build/kernel modules_install install @@ -262,8 +262,6 @@ Compiling the kernel - Make sure you have at least gcc 5.1 available. For more information, refer to :ref:`Documentation/process/changes.rst <changes>`. - Please note that you can still run a.out user programs with this kernel. - - Do a ``make`` to create a compressed kernel image. It is also possible to do ``make install`` if you have lilo installed to suit the kernel makefiles, but you may want to check your particular lilo setup first. @@ -332,85 +330,10 @@ Compiling the kernel If something goes wrong ----------------------- - - If you have problems that seem to be due to kernel bugs, please check - the file MAINTAINERS to see if there is a particular person associated - with the part of the kernel that you are having trouble with. If there - isn't anyone listed there, then the second best thing is to mail - them to me (torvalds@linux-foundation.org), and possibly to any other - relevant mailing-list or to the newsgroup. - - - In all bug-reports, *please* tell what kernel you are talking about, - how to duplicate the problem, and what your setup is (use your common - sense). If the problem is new, tell me so, and if the problem is - old, please try to tell me when you first noticed it. - - - If the bug results in a message like:: - - unable to handle kernel paging request at address C0000010 - Oops: 0002 - EIP: 0010:XXXXXXXX - eax: xxxxxxxx ebx: xxxxxxxx ecx: xxxxxxxx edx: xxxxxxxx - esi: xxxxxxxx edi: xxxxxxxx ebp: xxxxxxxx - ds: xxxx es: xxxx fs: xxxx gs: xxxx - Pid: xx, process nr: xx - xx xx xx xx xx xx xx xx xx xx - - or similar kernel debugging information on your screen or in your - system log, please duplicate it *exactly*. The dump may look - incomprehensible to you, but it does contain information that may - help debugging the problem. The text above the dump is also - important: it tells something about why the kernel dumped code (in - the above example, it's due to a bad kernel pointer). More information - on making sense of the dump is in Documentation/admin-guide/bug-hunting.rst - - - If you compiled the kernel with CONFIG_KALLSYMS you can send the dump - as is, otherwise you will have to use the ``ksymoops`` program to make - sense of the dump (but compiling with CONFIG_KALLSYMS is usually preferred). - This utility can be downloaded from - https://www.kernel.org/pub/linux/utils/kernel/ksymoops/ . - Alternatively, you can do the dump lookup by hand: - - - In debugging dumps like the above, it helps enormously if you can - look up what the EIP value means. The hex value as such doesn't help - me or anybody else very much: it will depend on your particular - kernel setup. What you should do is take the hex value from the EIP - line (ignore the ``0010:``), and look it up in the kernel namelist to - see which kernel function contains the offending address. - - To find out the kernel function name, you'll need to find the system - binary associated with the kernel that exhibited the symptom. This is - the file 'linux/vmlinux'. To extract the namelist and match it against - the EIP from the kernel crash, do:: - - nm vmlinux | sort | less - - This will give you a list of kernel addresses sorted in ascending - order, from which it is simple to find the function that contains the - offending address. Note that the address given by the kernel - debugging messages will not necessarily match exactly with the - function addresses (in fact, that is very unlikely), so you can't - just 'grep' the list: the list will, however, give you the starting - point of each kernel function, so by looking for the function that - has a starting address lower than the one you are searching for but - is followed by a function with a higher address you will find the one - you want. In fact, it may be a good idea to include a bit of - "context" in your problem report, giving a few lines around the - interesting one. - - If you for some reason cannot do the above (you have a pre-compiled - kernel image or similar), telling me as much about your setup as - possible will help. Please read - 'Documentation/admin-guide/reporting-issues.rst' for details. - - - Alternatively, you can use gdb on a running kernel. (read-only; i.e. you - cannot change values or set break points.) To do this, first compile the - kernel with -g; edit arch/x86/Makefile appropriately, then do a ``make - clean``. You'll also need to enable CONFIG_PROC_FS (via ``make config``). - - After you've rebooted with the new kernel, do ``gdb vmlinux /proc/kcore``. - You can now use all the usual gdb commands. The command to look up the - point where your system crashed is ``l *0xXXXXXXXX``. (Replace the XXXes - with the EIP value.) - - gdb'ing a non-running kernel currently fails because ``gdb`` (wrongly) - disregards the starting offset for which the kernel is compiled. +If you have problems that seem to be due to kernel bugs, please follow the +instructions at 'Documentation/admin-guide/reporting-issues.rst'. + +Hints on understanding kernel bug reports are in +'Documentation/admin-guide/bug-hunting.rst'. More on debugging the kernel +with gdb is in 'Documentation/dev-tools/gdb-kernel-debugging.rst' and +'Documentation/dev-tools/kgdb.rst'. diff --git a/Documentation/admin-guide/acpi/dsdt-override.rst b/Documentation/admin-guide/acpi/dsdt-override.rst deleted file mode 100644 index 50bd7f194bf4..000000000000 --- a/Documentation/admin-guide/acpi/dsdt-override.rst +++ /dev/null @@ -1,13 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -=============== -Overriding DSDT -=============== - -Linux supports a method of overriding the BIOS DSDT: - -CONFIG_ACPI_CUSTOM_DSDT - builds the image into the kernel. - -When to use this method is described in detail on the -Linux/ACPI home page: -https://01.org/linux-acpi/documentation/overriding-dsdt diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst index 2cc502a75ef6..5b86245450bd 100644 --- a/Documentation/admin-guide/cgroup-v1/memory.rst +++ b/Documentation/admin-guide/cgroup-v1/memory.rst @@ -299,7 +299,7 @@ Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by lruvec->lru_lock; PG_lru bit of page->flags is cleared before isolating a page from its LRU under lruvec->lru_lock. -2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM) +2.7 Kernel Memory Extension ----------------------------------------------- With the Kernel memory extension, the Memory Controller is able to limit @@ -386,8 +386,6 @@ U != 0, K >= U: a. Enable CONFIG_CGROUPS b. Enable CONFIG_MEMCG -c. Enable CONFIG_MEMCG_SWAP (to use swap extension) -d. Enable CONFIG_MEMCG_KMEM (to use kmem extension) 3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?) ------------------------------------------------------------------- diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index be4a77baf784..7bcfb38498c6 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1355,6 +1355,11 @@ PAGE_SIZE multiple when read back. pagetables Amount of memory allocated for page tables. + sec_pagetables + Amount of memory allocated for secondary page tables, + this currently includes KVM mmu allocations on x86 + and arm64. + percpu (npn) Amount of memory used for storing per-cpu kernel data structures. @@ -2185,75 +2190,93 @@ Cpuset Interface Files It accepts only the following input values when written to. - ======== ================================ - "root" a partition root - "member" a non-root member of a partition - ======== ================================ - - When set to be a partition root, the current cgroup is the - root of a new partition or scheduling domain that comprises - itself and all its descendants except those that are separate - partition roots themselves and their descendants. The root - cgroup is always a partition root. - - There are constraints on where a partition root can be set. - It can only be set in a cgroup if all the following conditions - are true. - - 1) The "cpuset.cpus" is not empty and the list of CPUs are - exclusive, i.e. they are not shared by any of its siblings. - 2) The parent cgroup is a partition root. - 3) The "cpuset.cpus" is also a proper subset of the parent's - "cpuset.cpus.effective". - 4) There is no child cgroups with cpuset enabled. This is for - eliminating corner cases that have to be handled if such a - condition is allowed. - - Setting it to partition root will take the CPUs away from the - effective CPUs of the parent cgroup. Once it is set, this - file cannot be reverted back to "member" if there are any child - cgroups with cpuset enabled. - - A parent partition cannot distribute all its CPUs to its - child partitions. There must be at least one cpu left in the - parent partition. - - Once becoming a partition root, changes to "cpuset.cpus" is - generally allowed as long as the first condition above is true, - the change will not take away all the CPUs from the parent - partition and the new "cpuset.cpus" value is a superset of its - children's "cpuset.cpus" values. - - Sometimes, external factors like changes to ancestors' - "cpuset.cpus" or cpu hotplug can cause the state of the partition - root to change. On read, the "cpuset.sched.partition" file - can show the following values. - - ============== ============================== - "member" Non-root member of a partition - "root" Partition root - "root invalid" Invalid partition root - ============== ============================== - - It is a partition root if the first 2 partition root conditions - above are true and at least one CPU from "cpuset.cpus" is - granted by the parent cgroup. - - A partition root can become invalid if none of CPUs requested - in "cpuset.cpus" can be granted by the parent cgroup or the - parent cgroup is no longer a partition root itself. In this - case, it is not a real partition even though the restriction - of the first partition root condition above will still apply. - The cpu affinity of all the tasks in the cgroup will then be - associated with CPUs in the nearest ancestor partition. - - An invalid partition root can be transitioned back to a - real partition root if at least one of the requested CPUs - can now be granted by its parent. In this case, the cpu - affinity of all the tasks in the formerly invalid partition - will be associated to the CPUs of the newly formed partition. - Changing the partition state of an invalid partition root to - "member" is always allowed even if child cpusets are present. + ========== ===================================== + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing + ========== ===================================== + + The root cgroup is always a partition root and its state + cannot be changed. All other non-root cgroups start out as + "member". + + When set to "root", the current cgroup is the root of a new + partition or scheduling domain that comprises itself and all + its descendants except those that are separate partition roots + themselves and their descendants. + + When set to "isolated", the CPUs in that partition root will + be in an isolated state without any load balancing from the + scheduler. Tasks placed in such a partition with multiple + CPUs should be carefully distributed and bound to each of the + individual CPUs for optimal performance. + + The value shown in "cpuset.cpus.effective" of a partition root + is the CPUs that the partition root can dedicate to a potential + new child partition root. The new child subtracts available + CPUs from its parent "cpuset.cpus.effective". + + A partition root ("root" or "isolated") can be in one of the + two possible states - valid or invalid. An invalid partition + root is in a degraded state where some state information may + be retained, but behaves more like a "member". + + All possible state transitions among "member", "root" and + "isolated" are allowed. + + On read, the "cpuset.cpus.partition" file can show the following + values. + + ============================= ===================================== + "member" Non-root member of a partition + "root" Partition root + "isolated" Partition root without load balancing + "root invalid (<reason>)" Invalid partition root + "isolated invalid (<reason>)" Invalid isolated partition root + ============================= ===================================== + + In the case of an invalid partition root, a descriptive string on + why the partition is invalid is included within parentheses. + + For a partition root to become valid, the following conditions + must be met. + + 1) The "cpuset.cpus" is exclusive with its siblings , i.e. they + are not shared by any of its siblings (exclusivity rule). + 2) The parent cgroup is a valid partition root. + 3) The "cpuset.cpus" is not empty and must contain at least + one of the CPUs from parent's "cpuset.cpus", i.e. they overlap. + 4) The "cpuset.cpus.effective" cannot be empty unless there is + no task associated with this partition. + + External events like hotplug or changes to "cpuset.cpus" can + cause a valid partition root to become invalid and vice versa. + Note that a task cannot be moved to a cgroup with empty + "cpuset.cpus.effective". + + For a valid partition root with the sibling cpu exclusivity + rule enabled, changes made to "cpuset.cpus" that violate the + exclusivity rule will invalidate the partition as well as its + sibiling partitions with conflicting cpuset.cpus values. So + care must be taking in changing "cpuset.cpus". + + A valid non-root parent partition may distribute out all its CPUs + to its child partitions when there is no task associated with it. + + Care must be taken to change a valid partition root to + "member" as all its child partitions, if present, will become + invalid causing disruption to tasks running in those child + partitions. These inactivated partitions could be recovered if + their parent is switched back to a partition root with a proper + set of "cpuset.cpus". + + Poll and inotify events are triggered whenever the state of + "cpuset.cpus.partition" changes. That includes changes caused + by write to "cpuset.cpus.partition", cpu hotplug or other + changes that modify the validity status of the partition. + This will allow user space agents to monitor unexpected changes + to "cpuset.cpus.partition" without the need to do continuous + polling. Device controller diff --git a/Documentation/admin-guide/dynamic-debug-howto.rst b/Documentation/admin-guide/dynamic-debug-howto.rst index a89cfa083155..faa22f77847a 100644 --- a/Documentation/admin-guide/dynamic-debug-howto.rst +++ b/Documentation/admin-guide/dynamic-debug-howto.rst @@ -5,143 +5,115 @@ Dynamic debug Introduction ============ -This document describes how to use the dynamic debug (dyndbg) feature. +Dynamic debug allows you to dynamically enable/disable kernel +debug-print code to obtain additional kernel information. -Dynamic debug is designed to allow you to dynamically enable/disable -kernel code to obtain additional kernel information. Currently, if -``CONFIG_DYNAMIC_DEBUG`` is set, then all ``pr_debug()``/``dev_dbg()`` and -``print_hex_dump_debug()``/``print_hex_dump_bytes()`` calls can be dynamically -enabled per-callsite. +If ``/proc/dynamic_debug/control`` exists, your kernel has dynamic +debug. You'll need root access (sudo su) to use this. -If you do not want to enable dynamic debug globally (i.e. in some embedded -system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` as basic support of dynamic -debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` into the Makefile of any -modules which you'd like to dynamically debug later. - -If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is just -shortcut for ``print_hex_dump(KERN_DEBUG)``. - -For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is -its ``prefix_str`` argument, if it is constant string; or ``hexdump`` -in case ``prefix_str`` is built dynamically. +Dynamic debug provides: -Dynamic debug has even more useful features: + * a Catalog of all *prdbgs* in your kernel. + ``cat /proc/dynamic_debug/control`` to see them. - * Simple query language allows turning on and off debugging - statements by matching any combination of 0 or 1 of: + * a Simple query/command language to alter *prdbgs* by selecting on + any combination of 0 or 1 of: - source filename - function name - line number (including ranges of line numbers) - module name - format string - - * Provides a debugfs control file: ``<debugfs>/dynamic_debug/control`` - which can be read to display the complete list of known debug - statements, to help guide you - -Controlling dynamic debug Behaviour -=================================== - -The behaviour of ``pr_debug()``/``dev_dbg()`` are controlled via writing to a -control file in the 'debugfs' filesystem. Thus, you must first mount -the debugfs filesystem, in order to make use of this feature. -Subsequently, we refer to the control file as: -``<debugfs>/dynamic_debug/control``. For example, if you want to enable -printing from source file ``svcsock.c``, line 1603 you simply do:: - - nullarbor:~ # echo 'file svcsock.c line 1603 +p' > - <debugfs>/dynamic_debug/control - -If you make a mistake with the syntax, the write will fail thus:: - - nullarbor:~ # echo 'file svcsock.c wtf 1 +p' > - <debugfs>/dynamic_debug/control - -bash: echo: write error: Invalid argument - -Note, for systems without 'debugfs' enabled, the control file can be -found in ``/proc/dynamic_debug/control``. + - class name (as known/declared by each module) Viewing Dynamic Debug Behaviour =============================== -You can view the currently configured behaviour of all the debug -statements via:: +You can view the currently configured behaviour in the *prdbg* catalog:: - nullarbor:~ # cat <debugfs>/dynamic_debug/control + :#> head -n7 /proc/dynamic_debug/control # filename:lineno [module]function flags format - net/sunrpc/svc_rdma.c:323 [svcxprt_rdma]svc_rdma_cleanup =_ "SVCRDMA Module Removed, deregister RPC RDMA transport\012" - net/sunrpc/svc_rdma.c:341 [svcxprt_rdma]svc_rdma_init =_ "\011max_inline : %d\012" - net/sunrpc/svc_rdma.c:340 [svcxprt_rdma]svc_rdma_init =_ "\011sq_depth : %d\012" - net/sunrpc/svc_rdma.c:338 [svcxprt_rdma]svc_rdma_init =_ "\011max_requests : %d\012" - ... + init/main.c:1179 [main]initcall_blacklist =_ "blacklisting initcall %s\012 + init/main.c:1218 [main]initcall_blacklisted =_ "initcall %s blacklisted\012" + init/main.c:1424 [main]run_init_process =_ " with arguments:\012" + init/main.c:1426 [main]run_init_process =_ " %s\012" + init/main.c:1427 [main]run_init_process =_ " with environment:\012" + init/main.c:1429 [main]run_init_process =_ " %s\012" +The 3rd space-delimited column shows the current flags, preceded by +a ``=`` for easy use with grep/cut. ``=p`` shows enabled callsites. -You can also apply standard Unix text manipulation filters to this -data, e.g.:: +Controlling dynamic debug Behaviour +=================================== - nullarbor:~ # grep -i rdma <debugfs>/dynamic_debug/control | wc -l - 62 +The behaviour of *prdbg* sites are controlled by writing +query/commands to the control file. Example:: - nullarbor:~ # grep -i tcp <debugfs>/dynamic_debug/control | wc -l - 42 + # grease the interface + :#> alias ddcmd='echo $* > /proc/dynamic_debug/control' -The third column shows the currently enabled flags for each debug -statement callsite (see below for definitions of the flags). The -default value, with no flags enabled, is ``=_``. So you can view all -the debug statement callsites with any non-default flags:: + :#> ddcmd '-p; module main func run* +p' + :#> grep =p /proc/dynamic_debug/control + init/main.c:1424 [main]run_init_process =p " with arguments:\012" + init/main.c:1426 [main]run_init_process =p " %s\012" + init/main.c:1427 [main]run_init_process =p " with environment:\012" + init/main.c:1429 [main]run_init_process =p " %s\012" - nullarbor:~ # awk '$3 != "=_"' <debugfs>/dynamic_debug/control - # filename:lineno [module]function flags format - net/sunrpc/svcsock.c:1603 [sunrpc]svc_send p "svc_process: st_sendto returned %d\012" +Error messages go to console/syslog:: + + :#> ddcmd mode foo +p + dyndbg: unknown keyword "mode" + dyndbg: query parse failed + bash: echo: write error: Invalid argument + +If debugfs is also enabled and mounted, ``dynamic_debug/control`` is +also under the mount-dir, typically ``/sys/kernel/debug/``. Command Language Reference ========================== -At the lexical level, a command comprises a sequence of words separated +At the basic lexical level, a command is a sequence of words separated by spaces or tabs. So these are all equivalent:: - nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > - <debugfs>/dynamic_debug/control - nullarbor:~ # echo -n ' file svcsock.c line 1603 +p ' > - <debugfs>/dynamic_debug/control - nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd file svcsock.c line 1603 +p + :#> ddcmd "file svcsock.c line 1603 +p" + :#> ddcmd ' file svcsock.c line 1603 +p ' Command submissions are bounded by a write() system call. Multiple commands can be written together, separated by ``;`` or ``\n``:: - ~# echo "func pnpacpi_get_resources +p; func pnp_assign_mem +p" \ - > <debugfs>/dynamic_debug/control + :#> ddcmd "func pnpacpi_get_resources +p; func pnp_assign_mem +p" + :#> ddcmd <<"EOC" + func pnpacpi_get_resources +p + func pnp_assign_mem +p + EOC + :#> cat query-batch-file > /proc/dynamic_debug/control -If your query set is big, you can batch them too:: +You can also use wildcards in each query term. The match rule supports +``*`` (matches zero or more characters) and ``?`` (matches exactly one +character). For example, you can match all usb drivers:: - ~# cat query-batch-file > <debugfs>/dynamic_debug/control + :#> ddcmd file "drivers/usb/*" +p # "" to suppress shell expansion -Another way is to use wildcards. The match rule supports ``*`` (matches -zero or more characters) and ``?`` (matches exactly one character). For -example, you can match all usb drivers:: - - ~# echo "file drivers/usb/* +p" > <debugfs>/dynamic_debug/control - -At the syntactical level, a command comprises a sequence of match -specifications, followed by a flags change specification:: +Syntactically, a command is pairs of keyword values, followed by a +flags change or setting:: command ::= match-spec* flags-spec -The match-spec's are used to choose a subset of the known pr_debug() -callsites to which to apply the flags-spec. Think of them as a query -with implicit ANDs between each pair. Note that an empty list of -match-specs will select all debug statement callsites. +The match-spec's select *prdbgs* from the catalog, upon which to apply +the flags-spec, all constraints are ANDed together. An absent keyword +is the same as keyword "*". + -A match specification comprises a keyword, which controls the -attribute of the callsite to be compared, and a value to compare -against. Possible keywords are::: +A match specification is a keyword, which selects the attribute of +the callsite to be compared, and a value to compare against. Possible +keywords are::: match-spec ::= 'func' string | 'file' string | 'module' string | 'format' string | + 'class' string | 'line' line-range line-range ::= lineno | @@ -203,6 +175,16 @@ format format "nfsd: SETATTR" // a neater way to match a format with whitespace format 'nfsd: SETATTR' // yet another way to match a format with whitespace +class + The given class_name is validated against each module, which may + have declared a list of known class_names. If the class_name is + found for a module, callsite & class matching and adjustment + proceeds. Examples:: + + class DRM_UT_KMS # a DRM.debug category + class JUNK # silent non-match + // class TLD_* # NOTICE: no wildcard in class names + line The given line number or range of line numbers is compared against the line number of each ``pr_debug()`` callsite. A single @@ -228,17 +210,16 @@ of the characters:: The flags are:: p enables the pr_debug() callsite. - f Include the function name in the printed message - l Include line number in the printed message - m Include module name in the printed message - t Include thread ID in messages not generated from interrupt context - _ No flags are set. (Or'd with others on input) + _ enables no flags. -For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only ``p`` flag -have meaning, other flags ignored. + Decorator flags add to the message-prefix, in order: + t Include thread ID, or <intr> + m Include module name + f Include the function name + l Include line number -For display, the flags are preceded by ``=`` -(mnemonic: what the flags are currently equal to). +For ``print_hex_dump_debug()`` and ``print_hex_dump_bytes()``, only +the ``p`` flag has meaning, other flags are ignored. Note the regexp ``^[-+=][flmpt_]+$`` matches a flags specification. To clear all flags at once, use ``=_`` or ``-flmpt``. @@ -313,7 +294,7 @@ For ``CONFIG_DYNAMIC_DEBUG`` kernels, any settings given at boot-time (or enabled by ``-DDEBUG`` flag during compilation) can be disabled later via the debugfs interface if the debug messages are no longer needed:: - echo "module module_name -p" > <debugfs>/dynamic_debug/control + echo "module module_name -p" > /proc/dynamic_debug/control Examples ======== @@ -321,37 +302,31 @@ Examples :: // enable the message at line 1603 of file svcsock.c - nullarbor:~ # echo -n 'file svcsock.c line 1603 +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'file svcsock.c line 1603 +p' // enable all the messages in file svcsock.c - nullarbor:~ # echo -n 'file svcsock.c +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'file svcsock.c +p' // enable all the messages in the NFS server module - nullarbor:~ # echo -n 'module nfsd +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'module nfsd +p' // enable all 12 messages in the function svc_process() - nullarbor:~ # echo -n 'func svc_process +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'func svc_process +p' // disable all 12 messages in the function svc_process() - nullarbor:~ # echo -n 'func svc_process -p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'func svc_process -p' // enable messages for NFS calls READ, READLINK, READDIR and READDIR+. - nullarbor:~ # echo -n 'format "nfsd: READ" +p' > - <debugfs>/dynamic_debug/control + :#> ddcmd 'format "nfsd: READ" +p' // enable messages in files of which the paths include string "usb" - nullarbor:~ # echo -n 'file *usb* +p' > <debugfs>/dynamic_debug/control + :#> ddcmd 'file *usb* +p' > /proc/dynamic_debug/control // enable all messages - nullarbor:~ # echo -n '+p' > <debugfs>/dynamic_debug/control + :#> ddcmd '+p' > /proc/dynamic_debug/control // add module, function to all enabled messages - nullarbor:~ # echo -n '+mf' > <debugfs>/dynamic_debug/control + :#> ddcmd '+mf' > /proc/dynamic_debug/control // boot-args example, with newlines and comments for readability Kernel command line: ... @@ -364,3 +339,38 @@ Examples dyndbg="file init/* +p #cmt ; func parse_one +p" // enable pr_debugs in 2 functions in a module loaded later pc87360.dyndbg="func pc87360_init_device +p; func pc87360_find +p" + +Kernel Configuration +==================== + +Dynamic Debug is enabled via kernel config items:: + + CONFIG_DYNAMIC_DEBUG=y # build catalog, enables CORE + CONFIG_DYNAMIC_DEBUG_CORE=y # enable mechanics only, skip catalog + +If you do not want to enable dynamic debug globally (i.e. in some embedded +system), you may set ``CONFIG_DYNAMIC_DEBUG_CORE`` as basic support of dynamic +debug and add ``ccflags := -DDYNAMIC_DEBUG_MODULE`` into the Makefile of any +modules which you'd like to dynamically debug later. + + +Kernel *prdbg* API +================== + +The following functions are cataloged and controllable when dynamic +debug is enabled:: + + pr_debug() + dev_dbg() + print_hex_dump_debug() + print_hex_dump_bytes() + +Otherwise, they are off by default; ``ccflags += -DDEBUG`` or +``#define DEBUG`` in a source file will enable them appropriately. + +If ``CONFIG_DYNAMIC_DEBUG`` is not set, ``print_hex_dump_debug()`` is +just a shortcut for ``print_hex_dump(KERN_DEBUG)``. + +For ``print_hex_dump_debug()``/``print_hex_dump_bytes()``, format string is +its ``prefix_str`` argument, if it is constant string; or ``hexdump`` +in case ``prefix_str`` is built dynamically. diff --git a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst index 9393c50b5afc..c98fd11907cc 100644 --- a/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst +++ b/Documentation/admin-guide/hw-vuln/processor_mmio_stale_data.rst @@ -230,6 +230,20 @@ The possible values in this file are: * - 'Mitigation: Clear CPU buffers' - The processor is vulnerable and the CPU buffer clearing mitigation is enabled. + * - 'Unknown: No mitigations' + - The processor vulnerability status is unknown because it is + out of Servicing period. Mitigation is not attempted. + +Definitions: +------------ + +Servicing period: The process of providing functional and security updates to +Intel processors or platforms, utilizing the Intel Platform Update (IPU) +process or other similar mechanisms. + +End of Servicing Updates (ESU): ESU is the date at which Intel will no +longer provide Servicing, such as through IPU or other similar update +processes. ESU dates will typically be aligned to end of quarter. If the processor is vulnerable then the following information is appended to the above information: diff --git a/Documentation/admin-guide/hw-vuln/spectre.rst b/Documentation/admin-guide/hw-vuln/spectre.rst index 2ce2a38cdd55..c4dcdb3d0d45 100644 --- a/Documentation/admin-guide/hw-vuln/spectre.rst +++ b/Documentation/admin-guide/hw-vuln/spectre.rst @@ -613,6 +613,7 @@ kernel command line. eibrs enhanced IBRS eibrs,retpoline enhanced IBRS + Retpolines eibrs,lfence enhanced IBRS + LFENCE + ibrs use IBRS to protect kernel Not specifying this option is equivalent to spectre_v2=auto. diff --git a/Documentation/admin-guide/kdump/vmcoreinfo.rst b/Documentation/admin-guide/kdump/vmcoreinfo.rst index 8419019b6a88..6726f439958c 100644 --- a/Documentation/admin-guide/kdump/vmcoreinfo.rst +++ b/Documentation/admin-guide/kdump/vmcoreinfo.rst @@ -200,7 +200,7 @@ prb A pointer to the printk ringbuffer (struct printk_ringbuffer). This may be pointing to the static boot ringbuffer or the dynamically -allocated ringbuffer, depending on when the the core dump occurred. +allocated ringbuffer, depending on when the core dump occurred. Used by user-space tools to read the active kernel log buffer. printk_rb_static diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index d7f30902fda0..a465d5242774 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -321,6 +321,8 @@ force_enable - Force enable the IOMMU on platforms known to be buggy with IOMMU enabled. Use this option with care. + pgtbl_v1 - Use v1 page table for DMA-API (Default). + pgtbl_v2 - Use v2 page table for DMA-API. amd_iommu_dump= [HW,X86-64] Enable AMD IOMMU driver option to dump the ACPI table @@ -966,10 +968,6 @@ debugpat [X86] Enable PAT debugging - decnet.addr= [HW,NET] - Format: <area>[,<node>] - See also Documentation/networking/decnet.rst. - default_hugepagesz= [HW] The size of the default HugeTLB page. This is the size represented by the legacy /proc/ hugepages @@ -1471,6 +1469,14 @@ Permit 'security.evm' to be updated regardless of current integrity status. + early_page_ext [KNL] Enforces page_ext initialization to earlier + stages so cover more early boot allocations. + Please note that as side effect some optimizations + might be disabled to achieve that (e.g. parallelized + memory initialization is disabled) so the boot process + might take longer, especially on systems with a lot of + memory. Available with CONFIG_PAGE_EXTENSION=y. + failslab= fail_usercopy= fail_page_alloc= @@ -2436,6 +2442,12 @@ 0: force disabled 1: force enabled + kunit.enable= [KUNIT] Enable executing KUnit tests. Requires + CONFIG_KUNIT to be set to be fully enabled. The + default value can be overridden via + KUNIT_DEFAULT_ENABLED. + Default is 1 (enabled) + kvm.ignore_msrs=[KVM] Ignore guest accesses to unhandled MSRs. Default is 0 (don't ignore, but inject #GP) @@ -3207,6 +3219,7 @@ spectre_v2_user=off [X86] spec_store_bypass_disable=off [X86,PPC] ssbd=force-off [ARM64] + nospectre_bhb [ARM64] l1tf=off [X86] mds=off [X86] tsx_async_abort=off [X86] @@ -3613,7 +3626,7 @@ nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings. - nohugevmalloc [PPC] Disable kernel huge vmalloc mappings. + nohugevmalloc [KNL,X86,PPC,ARM64] Disable kernel huge vmalloc mappings. nosmt [KNL,S390] Disable symmetric multithreading (SMT). Equivalent to smt=1. @@ -3626,11 +3639,15 @@ (bounds check bypass). With this option data leaks are possible in the system. - nospectre_v2 [X86,PPC_FSL_BOOK3E,ARM64] Disable all mitigations for + nospectre_v2 [X86,PPC_E500,ARM64] Disable all mitigations for the Spectre variant 2 (indirect branch prediction) vulnerability. System may allow data leaks with this option. + nospectre_bhb [ARM64] Disable all mitigations for Spectre-BHB (branch + history injection) vulnerability. System may allow data leaks + with this option. + nospec_store_bypass_disable [HW] Disable all mitigations for the Speculative Store Bypass vulnerability @@ -3741,9 +3758,9 @@ [X86,PV_OPS] Disable paravirtualized VMware scheduler clock and use the default one. - no-steal-acc [X86,PV_OPS,ARM64] Disable paravirtualized steal time - accounting. steal time is computed, but won't - influence scheduler behaviour + no-steal-acc [X86,PV_OPS,ARM64,PPC/PSERIES] Disable paravirtualized + steal time accounting. steal time is computed, but + won't influence scheduler behaviour nolapic [X86-32,APIC] Do not enable or use the local APIC. @@ -3805,6 +3822,10 @@ nox2apic [X86-64,APIC] Do not enable x2APIC mode. + NOTE: this parameter will be ignored on systems with the + LEGACY_XAPIC_DISABLED bit set in the + IA32_XAPIC_DISABLE_STATUS MSR. + nps_mtm_hs_ctr= [KNL,ARC] This parameter sets the maximum duration, in cycles, each HW thread of the CTOP can run @@ -5331,6 +5352,8 @@ rodata= [KNL] on Mark read-only kernel memory as read-only (default). off Leave read-only kernel memory writable for debugging. + full Mark read-only kernel memory and aliases as read-only + [arm64] rockchip.usb_uart Enable the uart passthrough on the designated usb port @@ -6026,12 +6049,6 @@ This parameter controls use of the Protected Execution Facility on pSeries. - swapaccount= [KNL] - Format: [0|1] - Enable accounting of swap in memory resource - controller if no parameter or 1 is given or disable - it if 0 is given (See Documentation/admin-guide/cgroup-v1/memory.rst) - swiotlb= [ARM,IA-64,PPC,MIPS,X86] Format: { <int> [,<int>] | force | noforce } <int> -- Number of I/O TLB slabs @@ -6834,6 +6851,12 @@ Crash from Xen panic notifier, without executing late panic() code such as dumping handler. + xen_msr_safe= [X86,XEN] + Format: <bool> + Select whether to always use non-faulting (safe) MSR + access functions when running as Xen PV guest. The + default value is controlled by CONFIG_XEN_PV_MSR_SAFE. + xen_nopvspin [X86,XEN] Disables the qspinlock slowpath using Xen PV optimizations. This parameter is obsoleted by "nopvspin" parameter, which diff --git a/Documentation/admin-guide/mm/cma_debugfs.rst b/Documentation/admin-guide/mm/cma_debugfs.rst index 4e06ffabd78a..7367e6294ef6 100644 --- a/Documentation/admin-guide/mm/cma_debugfs.rst +++ b/Documentation/admin-guide/mm/cma_debugfs.rst @@ -5,10 +5,10 @@ CMA Debugfs Interface The CMA debugfs interface is useful to retrieve basic information out of the different CMA areas and to test allocation/release in each of the areas. -Each CMA zone represents a directory under <debugfs>/cma/, indexed by the -kernel's CMA index. So the first CMA zone would be: +Each CMA area represents a directory under <debugfs>/cma/, represented by +its CMA name like below: - <debugfs>/cma/cma-0 + <debugfs>/cma/<cma_name> The structure of the files created under that directory is as follows: @@ -18,8 +18,8 @@ The structure of the files created under that directory is as follows: - [RO] bitmap: The bitmap of page states in the zone. - [WO] alloc: Allocate N pages from that CMA area. For example:: - echo 5 > <debugfs>/cma/cma-2/alloc + echo 5 > <debugfs>/cma/<cma_name>/alloc -would try to allocate 5 pages from the cma-2 area. +would try to allocate 5 pages from the 'cma_name' area. - [WO] free: Free N pages from that CMA area, similar to the above. diff --git a/Documentation/admin-guide/mm/damon/index.rst b/Documentation/admin-guide/mm/damon/index.rst index 05500042f777..33d37bb2fb4e 100644 --- a/Documentation/admin-guide/mm/damon/index.rst +++ b/Documentation/admin-guide/mm/damon/index.rst @@ -1,8 +1,8 @@ .. SPDX-License-Identifier: GPL-2.0 -======================== -Monitoring Data Accesses -======================== +========================== +DAMON: Data Access MONitor +========================== :doc:`DAMON </mm/damon/index>` allows light-weight data access monitoring. Using DAMON, users can analyze the memory access patterns of their systems and diff --git a/Documentation/admin-guide/mm/damon/start.rst b/Documentation/admin-guide/mm/damon/start.rst index 4d5ca2c46288..9f88afc734da 100644 --- a/Documentation/admin-guide/mm/damon/start.rst +++ b/Documentation/admin-guide/mm/damon/start.rst @@ -29,16 +29,9 @@ called DAMON Operator (DAMO). It is available at https://github.com/awslabs/damo. The examples below assume that ``damo`` is on your ``$PATH``. It's not mandatory, though. -Because DAMO is using the debugfs interface (refer to :doc:`usage` for the -detail) of DAMON, you should ensure debugfs is mounted. Mount it manually as -below:: - - # mount -t debugfs none /sys/kernel/debug/ - -or append the following line to your ``/etc/fstab`` file so that your system -can automatically mount debugfs upon booting:: - - debugfs /sys/kernel/debug debugfs defaults 0 0 +Because DAMO is using the sysfs interface (refer to :doc:`usage` for the +detail) of DAMON, you should ensure :doc:`sysfs </filesystems/sysfs>` is +mounted. Recording Data Access Patterns diff --git a/Documentation/admin-guide/mm/damon/usage.rst b/Documentation/admin-guide/mm/damon/usage.rst index d52f572a9029..b47b0cbbd491 100644 --- a/Documentation/admin-guide/mm/damon/usage.rst +++ b/Documentation/admin-guide/mm/damon/usage.rst @@ -50,10 +50,10 @@ For a short example, users can monitor the virtual address space of a given workload as below. :: # cd /sys/kernel/mm/damon/admin/ - # echo 1 > kdamonds/nr && echo 1 > kdamonds/0/contexts/nr + # echo 1 > kdamonds/nr_kdamonds && echo 1 > kdamonds/0/contexts/nr_contexts # echo vaddr > kdamonds/0/contexts/0/operations - # echo 1 > kdamonds/0/contexts/0/targets/nr - # echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid + # echo 1 > kdamonds/0/contexts/0/targets/nr_targets + # echo $(pidof <workload>) > kdamonds/0/contexts/0/targets/0/pid_target # echo on > kdamonds/0/state Files Hierarchy @@ -366,12 +366,12 @@ memory rate becomes larger than 60%, or lower than 30%". :: # echo 1 > kdamonds/0/contexts/0/schemes/nr_schemes # cd kdamonds/0/contexts/0/schemes/0 # # set the basic access pattern and the action - # echo 4096 > access_patterns/sz/min - # echo 8192 > access_patterns/sz/max - # echo 0 > access_patterns/nr_accesses/min - # echo 5 > access_patterns/nr_accesses/max - # echo 10 > access_patterns/age/min - # echo 20 > access_patterns/age/max + # echo 4096 > access_pattern/sz/min + # echo 8192 > access_pattern/sz/max + # echo 0 > access_pattern/nr_accesses/min + # echo 5 > access_pattern/nr_accesses/max + # echo 10 > access_pattern/age/min + # echo 20 > access_pattern/age/max # echo pageout > action # # set quotas # echo 10 > quotas/ms @@ -393,6 +393,11 @@ the files as above. Above is only for an example. debugfs Interface ================= +.. note:: + + DAMON debugfs interface will be removed after next LTS kernel is released, so + users should move to the :ref:`sysfs interface <sysfs_interface>`. + DAMON exports eight files, ``attrs``, ``target_ids``, ``init_regions``, ``schemes``, ``monitor_on``, ``kdamond_pid``, ``mk_contexts`` and ``rm_contexts`` under its debugfs directory, ``<debugfs>/damon/``. diff --git a/Documentation/admin-guide/mm/hugetlbpage.rst b/Documentation/admin-guide/mm/hugetlbpage.rst index 8e2727dc18d4..19f27c0d92e0 100644 --- a/Documentation/admin-guide/mm/hugetlbpage.rst +++ b/Documentation/admin-guide/mm/hugetlbpage.rst @@ -65,7 +65,7 @@ HugePages_Surp may be temporarily larger than the maximum number of surplus huge pages when the system is under memory pressure. Hugepagesize - is the default hugepage size (in Kb). + is the default hugepage size (in kB). Hugetlb is the total amount of memory (in kB), consumed by huge pages of all sizes. diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst index 1bd11118dfb1..d1064e0ba34a 100644 --- a/Documentation/admin-guide/mm/index.rst +++ b/Documentation/admin-guide/mm/index.rst @@ -32,6 +32,7 @@ the Linux memory management. idle_page_tracking ksm memory-hotplug + multigen_lru nommu-mmap numa_memory_policy numaperf diff --git a/Documentation/admin-guide/mm/ksm.rst b/Documentation/admin-guide/mm/ksm.rst index b244f0202a03..fb6ba2002a4b 100644 --- a/Documentation/admin-guide/mm/ksm.rst +++ b/Documentation/admin-guide/mm/ksm.rst @@ -184,6 +184,42 @@ The maximum possible ``pages_sharing/pages_shared`` ratio is limited by the ``max_page_sharing`` tunable. To increase the ratio ``max_page_sharing`` must be increased accordingly. +Monitoring KSM profit +===================== + +KSM can save memory by merging identical pages, but also can consume +additional memory, because it needs to generate a number of rmap_items to +save each scanned page's brief rmap information. Some of these pages may +be merged, but some may not be abled to be merged after being checked +several times, which are unprofitable memory consumed. + +1) How to determine whether KSM save memory or consume memory in system-wide + range? Here is a simple approximate calculation for reference:: + + general_profit =~ pages_sharing * sizeof(page) - (all_rmap_items) * + sizeof(rmap_item); + + where all_rmap_items can be easily obtained by summing ``pages_sharing``, + ``pages_shared``, ``pages_unshared`` and ``pages_volatile``. + +2) The KSM profit inner a single process can be similarly obtained by the + following approximate calculation:: + + process_profit =~ ksm_merging_pages * sizeof(page) - + ksm_rmap_items * sizeof(rmap_item). + + where ksm_merging_pages is shown under the directory ``/proc/<pid>/``, + and ksm_rmap_items is shown in ``/proc/<pid>/ksm_stat``. + +From the perspective of application, a high ratio of ``ksm_rmap_items`` to +``ksm_merging_pages`` means a bad madvise-applied policy, so developers or +administrators have to rethink how to change madvise policy. Giving an example +for reference, a page's size is usually 4K, and the rmap_item's size is +separately 32B on 32-bit CPU architecture and 64B on 64-bit CPU architecture. +so if the ``ksm_rmap_items/ksm_merging_pages`` ratio exceeds 64 on 64-bit CPU +or exceeds 128 on 32-bit CPU, then the app's madvise policy should be dropped, +because the ksm profit is approximately zero or negative. + Monitoring KSM events ===================== diff --git a/Documentation/admin-guide/mm/multigen_lru.rst b/Documentation/admin-guide/mm/multigen_lru.rst new file mode 100644 index 000000000000..33e068830497 --- /dev/null +++ b/Documentation/admin-guide/mm/multigen_lru.rst @@ -0,0 +1,162 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Quick start +=========== +Build the kernel with the following configurations. + +* ``CONFIG_LRU_GEN=y`` +* ``CONFIG_LRU_GEN_ENABLED=y`` + +All set! + +Runtime options +=============== +``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the +following subsections. + +Kill switch +----------- +``enabled`` accepts different values to enable or disable the +following components. Its default value depends on +``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled +unless some of them have unforeseen side effects. Writing to +``enabled`` has no effect when a component is not supported by the +hardware, and valid values will be accepted even when the main switch +is off. + +====== =============================================================== +Values Components +====== =============================================================== +0x0001 The main switch for the multi-gen LRU. +0x0002 Clearing the accessed bit in leaf page table entries in large + batches, when MMU sets it (e.g., on x86). This behavior can + theoretically worsen lock contention (mmap_lock). If it is + disabled, the multi-gen LRU will suffer a minor performance + degradation for workloads that contiguously map hot pages, + whose accessed bits can be otherwise cleared by fewer larger + batches. +0x0004 Clearing the accessed bit in non-leaf page table entries as + well, when MMU sets it (e.g., on x86). This behavior was not + verified on x86 varieties other than Intel and AMD. If it is + disabled, the multi-gen LRU will suffer a negligible + performance degradation. +[yYnN] Apply to all the components above. +====== =============================================================== + +E.g., +:: + + echo y >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0007 + echo 5 >/sys/kernel/mm/lru_gen/enabled + cat /sys/kernel/mm/lru_gen/enabled + 0x0005 + +Thrashing prevention +-------------------- +Personal computers are more sensitive to thrashing because it can +cause janks (lags when rendering UI) and negatively impact user +experience. The multi-gen LRU offers thrashing prevention to the +majority of laptop and desktop users who do not have ``oomd``. + +Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of +``N`` milliseconds from getting evicted. The OOM killer is triggered +if this working set cannot be kept in memory. In other words, this +option works as an adjustable pressure relief valve, and when open, it +terminates applications that are hopefully not being used. + +Based on the average human detectable lag (~100ms), ``N=1000`` usually +eliminates intolerable janks due to thrashing. Larger values like +``N=3000`` make janks less noticeable at the risk of premature OOM +kills. + +The default value ``0`` means disabled. + +Experimental features +===================== +``/sys/kernel/debug/lru_gen`` accepts commands described in the +following subsections. Multiple command lines are supported, so does +concatenation with delimiters ``,`` and ``;``. + +``/sys/kernel/debug/lru_gen_full`` provides additional stats for +debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from +evicted generations in this file. + +Working set estimation +---------------------- +Working set estimation measures how much memory an application needs +in a given time interval, and it is usually done with little impact on +the performance of the application. E.g., data centers want to +optimize job scheduling (bin packing) to improve memory utilizations. +When a new job comes in, the job scheduler needs to find out whether +each server it manages can allocate a certain amount of memory for +this new job before it can pick a candidate. To do so, the job +scheduler needs to estimate the working sets of the existing jobs. + +When it is read, ``lru_gen`` returns a histogram of numbers of pages +accessed over different time intervals for each memcg and node. +``MAX_NR_GENS`` decides the number of bins for each histogram. The +histograms are noncumulative. +:: + + memcg memcg_id memcg_path + node node_id + min_gen_nr age_in_ms nr_anon_pages nr_file_pages + ... + max_gen_nr age_in_ms nr_anon_pages nr_file_pages + +Each bin contains an estimated number of pages that have been accessed +within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages +and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of +the former is the largest and that of the latter is the smallest. + +Users can write the following command to ``lru_gen`` to create a new +generation ``max_gen_nr+1``: + + ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` + +``can_swap`` defaults to the swap setting and, if it is set to ``1``, +it forces the scan of anon pages when swap is off, and vice versa. +``force_scan`` defaults to ``1`` and, if it is set to ``0``, it +employs heuristics to reduce the overhead, which is likely to reduce +the coverage as well. + +A typical use case is that a job scheduler runs this command at a +certain time interval to create new generations, and it ranks the +servers it manages based on the sizes of their cold pages defined by +this time interval. + +Proactive reclaim +----------------- +Proactive reclaim induces page reclaim when there is no memory +pressure. It usually targets cold pages only. E.g., when a new job +comes in, the job scheduler wants to proactively reclaim cold pages on +the server it selected, to improve the chance of successfully landing +this new job. + +Users can write the following command to ``lru_gen`` to evict +generations less than or equal to ``min_gen_nr``. + + ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` + +``min_gen_nr`` should be less than ``max_gen_nr-1``, since +``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to +the active list) and therefore cannot be evicted. ``swappiness`` +overrides the default value in ``/proc/sys/vm/swappiness``. +``nr_to_reclaim`` limits the number of pages to evict. + +A typical use case is that a job scheduler runs this command before it +tries to land a new job on a server. If it fails to materialize enough +cold pages because of the overestimation, it retries on the next +server according to the ranking result obtained from the working set +estimation step. This less forceful approach limits the impacts on the +existing jobs. diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index c9c37f16eef8..8ee78ec232eb 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -191,7 +191,14 @@ allocation failure to throttle the next allocation attempt:: /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs -The khugepaged progress can be seen in the number of pages collapsed:: +The khugepaged progress can be seen in the number of pages collapsed (note +that this counter may not be an exact count of the number of pages +collapsed, since "collapsed" could mean multiple things: (1) A PTE mapping +being replaced by a PMD mapping, or (2) All 4K physical pages replaced by +one 2M hugepage. Each may happen independently, or together, depending on +the type of memory and the failures that occur. As such, this value should +be interpreted roughly as a sign of progress, and counters in /proc/vmstat +consulted for more accurate accounting):: /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed @@ -366,10 +373,9 @@ thp_split_pmd page table entry. thp_zero_page_alloc - is incremented every time a huge zero page is - successfully allocated. It includes allocations which where - dropped due race with other allocation. Note, it doesn't count - every map of the huge zero page, only its allocation. + is incremented every time a huge zero page used for thp is + successfully allocated. Note, it doesn't count every map of + the huge zero page, only its allocation. thp_zero_page_alloc_failed is incremented if kernel fails to allocate diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst index 6528036093e1..83f31919ebb3 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -17,7 +17,10 @@ of the ``PROT_NONE+SIGSEGV`` trick. Design ====== -Userfaults are delivered and resolved through the ``userfaultfd`` syscall. +Userspace creates a new userfaultfd, initializes it, and registers one or more +regions of virtual memory with it. Then, any page faults which occur within the +region(s) result in a message being delivered to the userfaultfd, notifying +userspace of the fault. The ``userfaultfd`` (aside from registering and unregistering virtual memory ranges) provides two primary functionalities: @@ -34,12 +37,11 @@ The real advantage of userfaults if compared to regular virtual memory management of mremap/mprotect is that the userfaults in all their operations never involve heavyweight structures like vmas (in fact the ``userfaultfd`` runtime load never takes the mmap_lock for writing). - Vmas are not suitable for page- (or hugepage) granular fault tracking when dealing with virtual address spaces that could span Terabytes. Too many vmas would be needed for that. -The ``userfaultfd`` once opened by invoking the syscall, can also be +The ``userfaultfd``, once created, can also be passed using unix domain sockets to a manager process, so the same manager process could handle the userfaults of a multitude of different processes without them being aware about what is going on @@ -50,6 +52,39 @@ is a corner case that would currently return ``-EBUSY``). API === +Creating a userfaultfd +---------------------- + +There are two ways to create a new userfaultfd, each of which provide ways to +restrict access to this functionality (since historically userfaultfds which +handle kernel page faults have been a useful tool for exploiting the kernel). + +The first way, supported since userfaultfd was introduced, is the +userfaultfd(2) syscall. Access to this is controlled in several ways: + +- Any user can always create a userfaultfd which traps userspace page faults + only. Such a userfaultfd can be created using the userfaultfd(2) syscall + with the flag UFFD_USER_MODE_ONLY. + +- In order to also trap kernel page faults for the address space, either the + process needs the CAP_SYS_PTRACE capability, or the system must have + vm.unprivileged_userfaultfd set to 1. By default, vm.unprivileged_userfaultfd + is set to 0. + +The second way, added to the kernel more recently, is by opening +/dev/userfaultfd and issuing a USERFAULTFD_IOC_NEW ioctl to it. This method +yields equivalent userfaultfds to the userfaultfd(2) syscall. + +Unlike userfaultfd(2), access to /dev/userfaultfd is controlled via normal +filesystem permissions (user/group/mode), which gives fine grained access to +userfaultfd specifically, without also granting other unrelated privileges at +the same time (as e.g. granting CAP_SYS_PTRACE would do). Users who have access +to /dev/userfaultfd can always create userfaultfds that trap kernel page faults; +vm.unprivileged_userfaultfd is not considered. + +Initializing a userfaultfd +-------------------------- + When first opened the ``userfaultfd`` must be enabled invoking the ``UFFDIO_API`` ioctl specifying a ``uffdio_api.api`` value set to ``UFFD_API`` (or a later API version) which will specify the ``read/POLLIN`` protocol diff --git a/Documentation/admin-guide/perf/alibaba_pmu.rst b/Documentation/admin-guide/perf/alibaba_pmu.rst new file mode 100644 index 000000000000..11de998bb480 --- /dev/null +++ b/Documentation/admin-guide/perf/alibaba_pmu.rst @@ -0,0 +1,100 @@ +============================================================= +Alibaba's T-Head SoC Uncore Performance Monitoring Unit (PMU) +============================================================= + +The Yitian 710, custom-built by Alibaba Group's chip development business, +T-Head, implements uncore PMU for performance and functional debugging to +facilitate system maintenance. + +DDR Sub-System Driveway (DRW) PMU Driver +========================================= + +Yitian 710 employs eight DDR5/4 channels, four on each die. Each DDR5 channel +is independent of others to service system memory requests. And one DDR5 +channel is split into two independent sub-channels. The DDR Sub-System Driveway +implements separate PMUs for each sub-channel to monitor various performance +metrics. + +The Driveway PMU devices are named as ali_drw_<sys_base_addr> with perf. +For example, ali_drw_21000 and ali_drw_21080 are two PMU devices for two +sub-channels of the same channel in die 0. And the PMU device of die 1 is +prefixed with ali_drw_400XXXXX, e.g. ali_drw_40021000. + +Each sub-channel has 36 PMU counters in total, which is classified into +four groups: + +- Group 0: PMU Cycle Counter. This group has one pair of counters + pmu_cycle_cnt_low and pmu_cycle_cnt_high, that is used as the cycle count + based on DDRC core clock. + +- Group 1: PMU Bandwidth Counters. This group has 8 counters that are used + to count the total access number of either the eight bank groups in a + selected rank, or four ranks separately in the first 4 counters. The base + transfer unit is 64B. + +- Group 2: PMU Retry Counters. This group has 10 counters, that intend to + count the total retry number of each type of uncorrectable error. + +- Group 3: PMU Common Counters. This group has 16 counters, that are used + to count the common events. + +For now, the Driveway PMU driver only uses counters in group 0 and group 3. + +The DDR Controller (DDRCTL) and DDR PHY combine to create a complete solution +for connecting an SoC application bus to DDR memory devices. The DDRCTL +receives transactions Host Interface (HIF) which is custom-defined by Synopsys. +These transactions are queued internally and scheduled for access while +satisfying the SDRAM protocol timing requirements, transaction priorities, and +dependencies between the transactions. The DDRCTL in turn issues commands on +the DDR PHY Interface (DFI) to the PHY module, which launches and captures data +to and from the SDRAM. The driveway PMUs have hardware logic to gather +statistics and performance logging signals on HIF, DFI, etc. + +By counting the READ, WRITE and RMW commands sent to the DDRC through the HIF +interface, we could calculate the bandwidth. Example usage of counting memory +data bandwidth:: + + perf stat \ + -e ali_drw_21000/hif_wr/ \ + -e ali_drw_21000/hif_rd/ \ + -e ali_drw_21000/hif_rmw/ \ + -e ali_drw_21000/cycle/ \ + -e ali_drw_21080/hif_wr/ \ + -e ali_drw_21080/hif_rd/ \ + -e ali_drw_21080/hif_rmw/ \ + -e ali_drw_21080/cycle/ \ + -e ali_drw_23000/hif_wr/ \ + -e ali_drw_23000/hif_rd/ \ + -e ali_drw_23000/hif_rmw/ \ + -e ali_drw_23000/cycle/ \ + -e ali_drw_23080/hif_wr/ \ + -e ali_drw_23080/hif_rd/ \ + -e ali_drw_23080/hif_rmw/ \ + -e ali_drw_23080/cycle/ \ + -e ali_drw_25000/hif_wr/ \ + -e ali_drw_25000/hif_rd/ \ + -e ali_drw_25000/hif_rmw/ \ + -e ali_drw_25000/cycle/ \ + -e ali_drw_25080/hif_wr/ \ + -e ali_drw_25080/hif_rd/ \ + -e ali_drw_25080/hif_rmw/ \ + -e ali_drw_25080/cycle/ \ + -e ali_drw_27000/hif_wr/ \ + -e ali_drw_27000/hif_rd/ \ + -e ali_drw_27000/hif_rmw/ \ + -e ali_drw_27000/cycle/ \ + -e ali_drw_27080/hif_wr/ \ + -e ali_drw_27080/hif_rd/ \ + -e ali_drw_27080/hif_rmw/ \ + -e ali_drw_27080/cycle/ -- sleep 10 + +The average DRAM bandwidth can be calculated as follows: + +- Read Bandwidth = perf_hif_rd * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle +- Write Bandwidth = (perf_hif_wr + perf_hif_rmw) * DDRC_WIDTH * DDRC_Freq / DDRC_Cycle + +Here, DDRC_WIDTH = 64 bytes. + +The current driver does not support sampling. So "perf record" is +unsupported. Also attach to a task is unsupported as the events are all +uncore. diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst index 9c9ece88ce53..793e1970bc05 100644 --- a/Documentation/admin-guide/perf/index.rst +++ b/Documentation/admin-guide/perf/index.rst @@ -18,3 +18,4 @@ Performance monitor support xgene-pmu arm_dsu_pmu thunderx2-pmu + alibaba_pmu diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index ee6572b1edad..98d1b198b2b4 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -65,6 +65,11 @@ combining the following values: 4 s3_beep = ======= +arch +==== + +The machine hardware name, the same output as ``uname -m`` +(e.g. ``x86_64`` or ``aarch64``). auto_msgmni =========== @@ -635,6 +640,17 @@ different types of memory (represented as different NUMA nodes) to place the hot pages in the fast memory. This is implemented based on unmapping and page fault too. +numa_balancing_promote_rate_limit_MBps +====================================== + +Too high promotion/demotion throughput between different memory types +may hurt application latency. This can be used to rate limit the +promotion throughput. The per-node max promotion throughput in MB/s +will be limited to be no more than the set value. + +A rule of thumb is to set this to less than 1/10 of the PMEM node +write bandwidth. + oops_all_cpu_backtrace ====================== diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst index 805f2281e000..6394f5dc2303 100644 --- a/Documentation/admin-guide/sysctl/net.rst +++ b/Documentation/admin-guide/sysctl/net.rst @@ -31,17 +31,18 @@ see only some of them, depending on your kernel's configuration. Table : Subdirectories in /proc/sys/net - ========= =================== = ========== ================== + ========= =================== = ========== =================== Directory Content Directory Content - ========= =================== = ========== ================== - core General parameter appletalk Appletalk protocol - unix Unix domain sockets netrom NET/ROM - 802 E802 protocol ax25 AX25 - ethernet Ethernet protocol rose X.25 PLP layer + ========= =================== = ========== =================== + 802 E802 protocol mptcp Multipath TCP + appletalk Appletalk protocol netfilter Network Filter + ax25 AX25 netrom NET/ROM + bridge Bridging rose X.25 PLP layer + core General parameter tipc TIPC + ethernet Ethernet protocol unix Unix domain sockets ipv4 IP version 4 x25 X.25 protocol - bridge Bridging decnet DEC net - ipv6 IP version 6 tipc TIPC - ========= =================== = ========== ================== + ipv6 IP version 6 + ========= =================== = ========== =================== 1. /proc/sys/net/core - Network core options ============================================ @@ -101,6 +102,9 @@ Values: - 1 - enable JIT hardening for unprivileged users only - 2 - enable JIT hardening for all users +where "privileged user" in this context means a process having +CAP_BPF or CAP_SYS_ADMIN in the root user name space. + bpf_jit_kallsyms ---------------- @@ -271,7 +275,7 @@ poll cycle or the number of packets processed reaches netdev_budget. netdev_max_backlog ------------------ -Maximum number of packets, queued on the INPUT side, when the interface +Maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them. netdev_rss_key diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 9b833e439f09..988f6a4c8084 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -926,6 +926,9 @@ calls without any restrictions. The default value is 0. +Another way to control permissions for userfaultfd is to use +/dev/userfaultfd instead of userfaultfd(2). See +Documentation/admin-guide/mm/userfaultfd.rst. user_reserve_kbytes =================== diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst index 7d80e8c307d1..92a8a07f5c43 100644 --- a/Documentation/admin-guide/tainted-kernels.rst +++ b/Documentation/admin-guide/tainted-kernels.rst @@ -134,6 +134,12 @@ More detailed explanation for tainting scsi/snic on something else than x86_64, scsi/ips on non x86/x86_64/itanium, have broken firmware settings for the irqchip/irq-gic on arm64 ...). + - x86/x86_64: Microcode late loading is dangerous and will result in + tainting the kernel. It requires that all CPUs rendezvous to make sure + the update happens when the system is as quiescent as possible. However, + a higher priority MCE/SMI/NMI can move control flow away from that + rendezvous and interrupt the update, which can be detrimental to the + machine. 3) ``R`` if a module was force unloaded by ``rmmod -f``, ``' '`` if all modules were unloaded normally. |