author | Linus Torvalds | 2020-10-12 16:21:29 -0700 |
---|---|---|
committer | Linus Torvalds | 2020-10-12 16:21:29 -0700 |
commit | 50d228345a03c882dfe11928ab41b42458b3f922 (patch) | |
tree | 31a8894ec4986f02802be9daac29c36839df084e /Documentation | |
parent | ced3a9eb3cd0d07462cdbaa8a0f3d46e5aaeadec (diff) | |
parent | 4fb220da0dd03d3699776220d86ac84b38941c0c (diff) |
Merge tag 'docs-5.10' of git://git.lwn.net/linux
Pull documentation updates from Jonathan Corbet:
"As hoped, things calmed down for docs this cycle; fewer changes and
almost no conflicts at all. This includes:
- A reworked and expanded user-mode Linux document
- Some simplifications and improvements for submitting-patches.rst
- An emergency fix for (some) problems with Sphinx 3.x
- Some welcome automarkup improvements to automatically generate
cross-references to struct definitions and other documents
- The usual collection of translation updates, typo fixes, etc"
* tag 'docs-5.10' of git://git.lwn.net/linux: (81 commits)
gpiolib: Update indentation in driver.rst for code excerpts
Documentation/admin-guide: tainted-kernels: Fix typo occured
Documentation: better locations for sysfs-pci, sysfs-tagging
docs: programming-languages: refresh blurb on clang support
Documentation: kvm: fix a typo
Documentation: Chinese translation of Documentation/arm64/amu.rst
doc: zh_CN: index files in arm64 subdirectory
mailmap: add entry for <mstarovoitov@marvell.com>
doc: seq_file: clarify role of *pos in ->next()
docs: trace: ring-buffer-design.rst: use the new SPDX tag
Documentation: kernel-parameters: clarify "module." parameters
Fix references to nommu-mmap.rst
docs: rewrite admin-guide/sysctl/abi.rst
docs: fb: Remove vesafb scrollback boot option
docs: fb: Remove sstfb scrollback boot option
docs: fb: Remove matroxfb scrollback boot option
docs: fb: Remove framebuffer scrollback boot option
docs: replace the old User Mode Linux HowTo with a new one
Documentation/admin-guide: blockdev/ramdisk: remove use of "rdev"
Documentation/admin-guide: README & svga: remove use of "rdev"
...
Diffstat (limited to 'Documentation')
74 files changed, 2055 insertions, 5247 deletions
diff --git a/Documentation/ABI/stable/sysfs-kernel-notes b/Documentation/ABI/stable/sysfs-kernel-notes
new file mode 100644
index 000000000000..2c76ee9e67f7
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-kernel-notes
@@ -0,0 +1,5 @@
+What: /sys/kernel/notes
+Date: July 2009
+Contact: <linux-kernel@vger.kernel.org>
+Description: The /sys/kernel/notes file contains the binary representation
+ of the running vmlinux's .notes section.
diff --git a/Documentation/PCI/index.rst b/Documentation/PCI/index.rst
index 8f66feaafd4f..c17c87af1968 100644
--- a/Documentation/PCI/index.rst
+++ b/Documentation/PCI/index.rst
@@ -12,6 +12,7 @@ Linux PCI Bus Subsystem
 pciebus-howto
 pci-iov-howto
 msi-howto
+ sysfs-pci
 acpi-info
 pci-error-recovery
 pcieaer-howto
diff --git a/Documentation/filesystems/sysfs-pci.rst b/Documentation/PCI/sysfs-pci.rst
index 742fbd21dc1f..742fbd21dc1f 100644
--- a/Documentation/filesystems/sysfs-pci.rst
+++ b/Documentation/PCI/sysfs-pci.rst
diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
index 5aad534233cd..95a28f47ac30 100644
--- a/Documentation/admin-guide/README.rst
+++ b/Documentation/admin-guide/README.rst
@@ -322,9 +322,9 @@ Compiling the kernel
 reboot, and enjoy!
 If you ever need to change the default root device, video mode,
- ramdisk size, etc. in the kernel image, use the ``rdev`` program (or
- alternatively the LILO boot options when appropriate). No need to
- recompile the kernel to change these parameters.
+ etc. in the kernel image, use your bootloader's boot options
+ where appropriate. No need to recompile the kernel to change
+ these parameters.
 - Reboot with the new kernel and enjoy.
diff --git a/Documentation/admin-guide/bcache.rst b/Documentation/admin-guide/bcache.rst
index 1eccf952876d..8d3a2d045c0a 100644
--- a/Documentation/admin-guide/bcache.rst
+++ b/Documentation/admin-guide/bcache.rst
@@ -5,11 +5,14 @@ A block layer cache (bcache)
 Say you've got a big slow raid 6, and an ssd or three.
 Wouldn't it be nice if you could use them as cache... Hence bcache.
-Wiki and git repositories are at:
+The bcache wiki can be found at:
+ https://bcache.evilpiepirate.org
- - https://bcache.evilpiepirate.org
- - http://evilpiepirate.org/git/linux-bcache.git
- - https://evilpiepirate.org/git/bcache-tools.git
+This is the git repository of bcache-tools:
+ https://git.kernel.org/pub/scm/linux/kernel/git/colyli/bcache-tools.git/
+
+The latest bcache kernel code can be found from mainline Linux kernel:
+ https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/
 It's designed around the performance characteristics of SSDs - it only allocates in erase block sized buckets, and it uses a hybrid btree/log to track cached
@@ -41,17 +44,21 @@ in the cache it first disables writeback caching and waits for all dirty data
 to be flushed.
 Getting started:
-You'll need make-bcache from the bcache-tools repository. Both the cache device
+You'll need bcache util from the bcache-tools repository. Both the cache device
 and backing device must be formatted before use::
- make-bcache -B /dev/sdb
- make-bcache -C /dev/sdc
+ bcache make -B /dev/sdb
+ bcache make -C /dev/sdc
-make-bcache has the ability to format multiple devices at the same time - if
+`bcache make` has the ability to format multiple devices at the same time - if
 you format your backing devices and cache device at the same time, you won't have to manually attach::
- make-bcache -B /dev/sda /dev/sdb -C /dev/sdc
+ bcache make -B /dev/sda /dev/sdb -C /dev/sdc
+
+If your bcache-tools is not updated to latest version and does not have the
+unified `bcache` utility, you may use the legacy `make-bcache` utility to format
+bcache device with same -B and -C parameters.
 bcache-tools now ships udev rules, and bcache devices are known to the kernel immediately.
 Without udev, you can manually register devices like this::
@@ -188,7 +195,7 @@ D) Recovering data without bcache:
 If bcache is not available in the kernel, a filesystem on the backing device is still available at an 8KiB offset. So either via a loopdev of the backing device created with --offset 8K, or any value defined by
---data-offset when you originally formatted bcache with `make-bcache`.
+--data-offset when you originally formatted bcache with `bcache make`.
 For example::
@@ -210,7 +217,7 @@ E) Wiping a cache device
 After you boot back with bcache enabled, you recreate the cache and attach it::
- host:~# make-bcache -C /dev/sdh2
+ host:~# bcache make -C /dev/sdh2
 UUID: 7be7e175-8f4c-4f99-94b2-9c904d227045
 Set UUID: 5bc072a8-ab17-446d-9744-e247949913c1
 version: 0
@@ -318,7 +325,7 @@ want for getting the best possible numbers when benchmarking.
 The default metadata size in bcache is 8k. If your backing device is RAID based, then be sure to align this by a multiple of your stride
- width using `make-bcache --data-offset`. If you intend to expand your
+ width using `bcache make --data-offset`. If you intend to expand your
 disk array in the future, then multiply a series of primes by your raid stripe size to get the disk multiples that you would like.
diff --git a/Documentation/admin-guide/blockdev/ramdisk.rst b/Documentation/admin-guide/blockdev/ramdisk.rst
index b7c2268f8dec..9ce6101e8dd9 100644
--- a/Documentation/admin-guide/blockdev/ramdisk.rst
+++ b/Documentation/admin-guide/blockdev/ramdisk.rst
@@ -6,7 +6,7 @@ Using the RAM disk block device with Linux
 1) Overview
 2) Kernel Command Line Parameters
- 3) Using "rdev -r"
+ 3) Using "rdev"
 4) An Example of Creating a Compressed RAM Disk
@@ -59,51 +59,27 @@ default is 4096 (4 MB).
 rd_size See ramdisk_size.
-3) Using "rdev -r"
--------------------
+3) Using "rdev"
+---------------
-The usage of the word (two bytes) that "rdev -r" sets in the kernel image is
-as follows. The low 11 bits (0 -> 10) specify an offset (in 1 k blocks) of up
-to 2 MB (2^11) of where to find the RAM disk (this used to be the size). Bit
-14 indicates that a RAM disk is to be loaded, and bit 15 indicates whether a
-prompt/wait sequence is to be given before trying to read the RAM disk. Since
-the RAM disk dynamically grows as data is being written into it, a size field
-is not required. Bits 11 to 13 are not currently used and may as well be zero.
-These numbers are no magical secrets, as seen below::
+"rdev" is an obsolete, deprecated, antiquated utility that could be used
+to set the boot device in a Linux kernel image.
- ./arch/x86/kernel/setup.c:#define RAMDISK_IMAGE_START_MASK 0x07FF
- ./arch/x86/kernel/setup.c:#define RAMDISK_PROMPT_FLAG 0x8000
- ./arch/x86/kernel/setup.c:#define RAMDISK_LOAD_FLAG 0x4000
+Instead of using rdev, just place the boot device information on the
+kernel command line and pass it to the kernel from the bootloader.
-Consider a typical two floppy disk setup, where you will have the
-kernel on disk one, and have already put a RAM disk image onto disk #2.
+You can also pass arguments to the kernel by setting FDARGS in
+arch/x86/boot/Makefile and specify in initrd image by setting FDINITRD in
+arch/x86/boot/Makefile.
-Hence you want to set bits 0 to 13 as 0, meaning that your RAM disk
-starts at an offset of 0 kB from the beginning of the floppy.
-The command line equivalent is: "ramdisk_start=0"
+Some of the kernel command line boot options that may apply here are::
-You want bit 14 as one, indicating that a RAM disk is to be loaded.
-The command line equivalent is: "load_ramdisk=1"
-
-You want bit 15 as one, indicating that you want a prompt/keypress
-sequence so that you have a chance to switch floppy disks.
-The command line equivalent is: "prompt_ramdisk=1"
-
-Putting that together gives 2^15 + 2^14 + 0 = 49152 for an rdev word.
-So to create disk one of the set, you would do::
-
- /usr/src/linux# cat arch/x86/boot/zImage > /dev/fd0
- /usr/src/linux# rdev /dev/fd0 /dev/fd0
- /usr/src/linux# rdev -r /dev/fd0 49152
+ ramdisk_start=N
+ ramdisk_size=M
 If you make a boot disk that has LILO, then for the above, you would use::
- append = "ramdisk_start=0 load_ramdisk=1 prompt_ramdisk=1"
-
-Since the default start = 0 and the default prompt = 1, you could use::
-
- append = "load_ramdisk=1"
-
+ append = "ramdisk_start=N ramdisk_size=M"
 4) An Example of Creating a Compressed RAM Disk
 -----------------------------------------------
@@ -151,12 +127,9 @@ f) Put the RAM disk image onto the floppy, after the kernel. Use an offset
 dd if=/tmp/ram_image.gz of=/dev/fd0 bs=1k seek=400
-g) Use "rdev" to set the boot device, RAM disk offset, prompt flag, etc.
- For prompt_ramdisk=1, load_ramdisk=1, ramdisk_start=400, one would
- have 2^15 + 2^14 + 400 = 49552::
-
- rdev /dev/fd0 /dev/fd0
- rdev -r /dev/fd0 49552
+g) Make sure that you have already specified the boot information in
+ FDARGS and FDINITRD or that you use a bootloader to pass kernel
+ command line boot options to the kernel.
 That is it. You now have your boot/root compressed RAM disk floppy. Some users may wish to combine steps (d) and (f) by using a pipe.
@@ -167,11 +140,14 @@ users may wish to combine steps (d) and (f) by using a pipe.
 Changelog:
 ----------
+SEPT-2020 :
+
+ Removed usage of "rdev"
+
 10-22-04 :
 Updated to reflect changes in command line options, remove obsolete references, general cleanup. James Nelson (james4765@gmail.com)
-
 12-95 :
 Original Document
diff --git a/Documentation/admin-guide/cgroup-v1/cpusets.rst b/Documentation/admin-guide/cgroup-v1/cpusets.rst
index 7ade3abd342a..5d844ed4df69 100644
--- a/Documentation/admin-guide/cgroup-v1/cpusets.rst
+++ b/Documentation/admin-guide/cgroup-v1/cpusets.rst
@@ -1,3 +1,5 @@
+.. _cpusets:
+
 =======
 CPUSETS
 =======
diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 2da65fef2a1c..75a9dd98e76e 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -509,9 +509,12 @@ ELF32-format headers using the --elf32-core-headers kernel option on the
 dump kernel.
 You can also use the Crash utility to analyze dump files in Kdump
-format. Crash is available on Dave Anderson's site at the following URL:
+format. Crash is available at the following URL:
- http://people.redhat.com/~anderson/
+ https://github.com/crash-utility/crash
+
+Crash document can be found at:
+ https://crash-utility.github.io/
 Trigger Kdump on WARN()
 =======================
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ffe864390c5a..0fa47ddf4c46 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -591,7 +591,7 @@
 some critical bits.
 cma=nn[MG]@[start[MG][-end[MG]]]
- [ARM,X86,KNL]
+ [KNL,CMA]
 Sets the size of kernel global memory area for contiguous memory allocations and optionally the placement constraint by the physical address range of
@@ -940,7 +940,7 @@
 Arch Perfmon v4 (Skylake and newer).
 disable_ddw [PPC/PSERIES]
- Disable Dynamic DMA Window support. Use this if
+ Disable Dynamic DMA Window support. Use this
 to workaround buggy firmware.
 disable_ipv6= [IPV6]
@@ -1019,7 +1019,7 @@
 what data is available or for reverse-engineering.
 dyndbg[="val"] [KNL,DYNAMIC_DEBUG]
- module.dyndbg[="val"]
+ <module>.dyndbg[="val"]
 Enable debug messages at boot time. See Documentation/admin-guide/dynamic-debug-howto.rst for details.
@@ -1027,7 +1027,7 @@
 nopku [X86] Disable Memory Protection Keys CPU feature found in some Intel CPUs.
- module.async_probe [KNL]
+ <module>.async_probe [KNL]
 Enable asynchronous probe on this module.
 early_ioremap_debug [KNL]
@@ -1956,7 +1956,7 @@
 1 - Bypass the IOMMU for DMA. unset - Use value of CONFIG_IOMMU_DEFAULT_PASSTHROUGH.
- io7= [HW] IO7 for Marvel based alpha systems
+ io7= [HW] IO7 for Marvel-based Alpha systems
 See comment before marvel_specify_io7 in arch/alpha/kernel/core_marvel.c.
@@ -2177,7 +2177,7 @@
 kgdbwait [KGDB] Stop kernel execution and enter the kernel debugger at the earliest opportunity.
- kmac= [MIPS] korina ethernet MAC address.
+ kmac= [MIPS] Korina ethernet MAC address.
 Configure the RouterBoard 532 series on-chip Ethernet adapter MAC address.
@@ -2258,6 +2258,14 @@
 [KVM,ARM] Allow use of GICv4 for direct injection of LPIs.
+ kvm_cma_resv_ratio=n [PPC]
+ Reserves given percentage from system memory area for
+ contiguous memory allocation for KVM hash pagetable
+ allocation.
+ By default it reserves 5% of total system memory.
+ Format: <integer>
+ Default: 5
+
 kvm-intel.ept= [KVM,Intel] Disable extended page tables (virtualized MMU) support on capable Intel chips. Default is 1 (enabled)
@@ -2367,9 +2375,10 @@
 lapic [X86-32,APIC] Enable the local APIC even if BIOS disabled it.
- lapic= [X86,APIC] "notscdeadline" Do not use TSC deadline
+ lapic= [X86,APIC] Do not use TSC deadline
 value for LAPIC timer one-shot implementation. Default back to the programmable timer unit in the LAPIC.
+ Format: notscdeadline
 lapic_timer_c2_ok [X86,APIC] trust the local apic timer in C2 power state.
@@ -2441,8 +2450,7 @@
 memblock=debug [KNL] Enable memblock debug messages.
- load_ramdisk= [RAM] List of ramdisks to load from floppy
- See Documentation/admin-guide/blockdev/ramdisk.rst.
+ load_ramdisk= [RAM] [Deprecated]
 lockd.nlm_grace_period=P [NFS] Assign grace period. Format: <integer>
@@ -2579,8 +2587,8 @@
 (machvec) in a generic kernel. Example: machvec=hpzx1
- machtype= [Loongson] Share the same kernel image file between different
- yeeloong laptop.
+ machtype= [Loongson] Share the same kernel image file between
+ different yeeloong laptops.
 Example: machtype=lemote-yeeloong-2f-7inch
 max_addr=nn[KMG] [KNL,BOOT,ia64] All physical memory greater
@@ -3185,7 +3193,7 @@
 register save and restore. The kernel will only save legacy floating-point registers on task switch.
- nohugeiomap [KNL,X86,PPC] Disable kernel huge I/O mappings.
+ nohugeiomap [KNL,X86,PPC,ARM64] Disable kernel huge I/O mappings.
 nosmt [KNL,S390] Disable symmetric multithreading (SMT). Equivalent to smt=1.
@@ -3921,9 +3929,7 @@
 Param: <number> - step/bucket size as a power of 2 for statistical time based profiling.
- prompt_ramdisk= [RAM] List of RAM disks to prompt for floppy disk
- before loading.
- See Documentation/admin-guide/blockdev/ramdisk.rst.
+ prompt_ramdisk= [RAM] [Deprecated]
 prot_virt= [S390] enable hosting protected virtual machines isolated from the hypervisor (if hardware supports
@@ -3981,6 +3987,8 @@
 ramdisk_size= [RAM] Sizes of RAM disks in kilobytes See Documentation/admin-guide/blockdev/ramdisk.rst.
+ ramdisk_start= [RAM] RAM disk image start address
+
 random.trust_cpu={on,off} [KNL] Enable or disable trusting the use of the CPU's random number generator (if available) to
diff --git a/Documentation/admin-guide/svga.rst b/Documentation/admin-guide/svga.rst
index b6c2f9acca92..9eb1e0738e84 100644
--- a/Documentation/admin-guide/svga.rst
+++ b/Documentation/admin-guide/svga.rst
@@ -12,7 +12,8 @@ Intro
 This small document describes the "Video Mode Selection" feature which allows the use of various special video modes supported by the video BIOS. Due to usage of the BIOS, the selection is limited to boot time (before the
-kernel decompression starts) and works only on 80X86 machines.
+kernel decompression starts) and works only on 80X86 machines that are
+booted through BIOS firmware (as opposed to through UEFI, kexec, etc.).
 .. note::
@@ -23,7 +24,7 @@ kernel decompression starts) and works only on 80X86 machines.
 The video mode to be used is selected by a kernel parameter which can be specified in the kernel Makefile (the SVGA_MODE=... line) or by the "vga=..."
-option of LILO (or some other boot loader you use) or by the "vidmode" utility
+option of LILO (or some other boot loader you use) or by the "xrandr" utility
 (present in standard Linux utility packages). You can use the following values of this parameter::
@@ -41,7 +42,7 @@ of this parameter::
 better to use absolute mode numbers instead.
 0x.... - Hexadecimal video mode ID (also displayed on the menu, see below
- for exact meaning of the ID). Warning: rdev and LILO don't support
+ for exact meaning of the ID). Warning: LILO doesn't support
 hexadecimal numbers -- you have to convert it to decimal manually.
 Menu
diff --git a/Documentation/admin-guide/sysctl/abi.rst b/Documentation/admin-guide/sysctl/abi.rst
index 599bcde7f0b7..ac87eafdb54f 100644
--- a/Documentation/admin-guide/sysctl/abi.rst
+++ b/Documentation/admin-guide/sysctl/abi.rst
@@ -1,67 +1,34 @@
+.. SPDX-License-Identifier: GPL-2.0+
+
 ================================
 Documentation for /proc/sys/abi/
 ================================
-kernel version 2.6.0.test2
+.. See scripts/check-sysctl-docs to keep this up to date:
+.. scripts/check-sysctl-docs -vtable="abi" \
+.. Documentation/admin-guide/sysctl/abi.rst \
+.. $(git grep -l register_sysctl_)
-Copyright (c) 2003, Fabian Frederick <ffrederick@users.sourceforge.net>
+Copyright (c) 2020, Stephen Kitt
-For general info: index.rst.
+For general info, see :doc:`index`.
 ------------------------------------------------------------------------------
-This path is binary emulation relevant aka personality types aka abi.
-When a process is executed, it's linked to an exec_domain whose
-personality is defined using values available from /proc/sys/abi.
-You can find further details about abi in include/linux/personality.h.
-
-Here are the files featuring in 2.6 kernel:
-
-- defhandler_coff
-- defhandler_elf
-- defhandler_lcall7
-- defhandler_libcso
-- fake_utsname
-- trace
-
-defhandler_coff
----------------
-
-defined value:
- PER_SCOSVR3::
-
- 0x0003 | STICKY_TIMEOUTS | WHOLE_SECONDS | SHORT_INODE
-
-defhandler_elf
---------------
-
-defined value:
- PER_LINUX::
-
- 0
-
-defhandler_lcall7
------------------
-
-defined value :
- PER_SVR4::
-
- 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
-
-defhandler_libsco
------------------
-
-defined value:
- PER_SVR4::
+The files in ``/proc/sys/abi`` can be used to see and modify
+ABI-related settings.
- 0x0001 | STICKY_TIMEOUTS | MMAP_PAGE_ZERO,
+Currently, these files might (depending on your configuration)
+show up in ``/proc/sys/kernel``:
-fake_utsname
-------------
+.. contents:: :local:
-Unused
+vsyscall32 (x86)
+================
-trace
------
+Determines whether the kernels maps a vDSO page into 32-bit processes;
+can be set to 1 to enable, or 0 to disable. Defaults to enabled if
+``CONFIG_COMPAT_VDSO`` is set, disabled otherwide.
-Unused
+This controls the same setting as the ``vdso32`` kernel boot
+parameter.
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index abf804719890..f718a2eaf1f6 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -130,7 +130,7 @@ More detailed explanation for tainting
 5) ``B`` If a page-release function has found a bad page reference or some unexpected page flags. This indicates a hardware problem or a kernel bug; there should be other information in the log indicating why this tainting
- occured.
+ occurred.
 6) ``U`` if a user or user application specifically requested that the Tainted flag be set, ``' '`` otherwise.
diff --git a/Documentation/arm/sunxi.rst b/Documentation/arm/sunxi.rst
index b037428aee98..62b533d0ba94 100644
--- a/Documentation/arm/sunxi.rst
+++ b/Documentation/arm/sunxi.rst
@@ -108,7 +108,7 @@ SunXi family
 * Datasheet
- http://dl.linux-sunxi.org/H3/Allwinner_H3_Datasheet_V1.0.pdf
+ https://linux-sunxi.org/images/4/4b/Allwinner_H3_Datasheet_V1.2.pdf
 - Allwinner R40 (sun8i)
diff --git a/Documentation/arm64/amu.rst b/Documentation/arm64/amu.rst
index 452ec8b115c2..01f2de2b0450 100644
--- a/Documentation/arm64/amu.rst
+++ b/Documentation/arm64/amu.rst
@@ -1,3 +1,5 @@
+.. _amu_index:
+
 =======================================================
 Activity Monitors Unit (AMU) extension in AArch64 Linux
 =======================================================
diff --git a/Documentation/arm64/index.rst b/Documentation/arm64/index.rst
index 43b0939d384e..937634c49979 100644
--- a/Documentation/arm64/index.rst
+++ b/Documentation/arm64/index.rst
@@ -1,3 +1,5 @@
+.. _arm64_index:
+
 ==================
 ARM64 Architecture
 ==================
diff --git a/Documentation/conf.py b/Documentation/conf.py
index c503188880d9..0a102d57437d 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -36,10 +36,23 @@ needs_sphinx = '1.3'
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
 # ones.
-extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include', 'cdomain',
+extensions = ['kerneldoc', 'rstFlatTable', 'kernel_include',
 'kfigure', 'sphinx.ext.ifconfig', 'automarkup',
 'maintainers_include', 'sphinx.ext.autosectionlabel' ]
+#
+# cdomain is badly broken in Sphinx 3+. Leaving it out generates *most*
+# of the docs correctly, but not all. Scream bloody murder but allow
+# the process to proceed; hopefully somebody will fix this properly soon.
+#
+if major >= 3:
+ sys.stderr.write('''WARNING: The kernel documentation build process
+ does not work correctly with Sphinx v3.0 and above. Expect errors
+ in the generated output.
+ ''')
+else:
+ extensions.append('cdomain')
+
 # Ensure that autosectionlabel will produce unique names
 autosectionlabel_prefix_document = True
 autosectionlabel_maxdepth = 2
diff --git a/Documentation/core-api/cpu_hotplug.rst b/Documentation/core-api/cpu_hotplug.rst
index 298c9c8bea9a..a2c96bec5ee8 100644
--- a/Documentation/core-api/cpu_hotplug.rst
+++ b/Documentation/core-api/cpu_hotplug.rst
@@ -30,7 +30,7 @@ which didn't support these methods.
 Command Line Switches
 =====================
 ``maxcpus=n``
- Restrict boot time CPUs to *n*. Say if you have fourV CPUs, using
+ Restrict boot time CPUs to *n*. Say if you have four CPUs, using
 ``maxcpus=2`` will only boot two. You can choose to bring the other CPUs later online.
diff --git a/Documentation/doc-guide/kernel-doc.rst b/Documentation/doc-guide/kernel-doc.rst
index fff6604631ea..4fd86c21397b 100644
--- a/Documentation/doc-guide/kernel-doc.rst
+++ b/Documentation/doc-guide/kernel-doc.rst
@@ -387,22 +387,23 @@ Domain`_ references.
 Cross-referencing from reStructuredText
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-To cross-reference the functions and types defined in the kernel-doc comments
-from reStructuredText documents, please use the `Sphinx C Domain`_
-references. For example::
-
- See function :c:func:`foo` and struct/union/enum/typedef :c:type:`bar`.
-
-While the type reference works with just the type name, without the
-struct/union/enum/typedef part in front, you may want to use::
-
- See :c:type:`struct foo <foo>`.
- See :c:type:`union bar <bar>`.
- See :c:type:`enum baz <baz>`.
- See :c:type:`typedef meh <meh>`.
-
-This will produce prettier links, and is in line with how kernel-doc does the
-cross-references.
+No additional syntax is needed to cross-reference the functions and types
+defined in the kernel-doc comments from reStructuredText documents.
+Just end function names with ``()`` and write ``struct``, ``union``, ``enum``
+or ``typedef`` before types.
+For example::
+
+ See foo().
+ See struct foo.
+ See union bar.
+ See enum baz.
+ See typedef meh.
+
+However, if you want custom text in the cross-reference link, that can be done
+through the following syntax::
+
+ See :c:func:`my custom link text for function foo <foo>`.
+ See :c:type:`my custom link text for struct bar <bar>`.
 For further details, please refer to the `Sphinx C Domain`_ documentation.
diff --git a/Documentation/doc-guide/sphinx.rst b/Documentation/doc-guide/sphinx.rst
index f71ddd592aaa..896478baf570 100644
--- a/Documentation/doc-guide/sphinx.rst
+++ b/Documentation/doc-guide/sphinx.rst
@@ -337,6 +337,23 @@ Rendered as:
 - column 3
+Cross-referencing
+-----------------
+
+Cross-referencing from one documentation page to another can be done by passing
+the path to the file starting from the Documentation folder.
+For example, to cross-reference to this page (the .rst extension is optional)::
+
+ See Documentation/doc-guide/sphinx.rst.
+
+If you want to use a relative path, you need to use Sphinx's ``doc`` directive.
+For example, referencing this page from the same directory would be done as::
+
+ See :doc:`sphinx`.
+
+For information on cross-referencing to kernel-doc functions or types, see
+Documentation/doc-guide/kernel-doc.rst.
+
 .. _sphinx_kfigure:
 Figures & Images
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 13ea0cc0a3fa..4144b669e80c 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -85,7 +85,7 @@ consider though:
 - Memory mapping the contents of the DMA buffer is also supported. See the discussion below on `CPU Access to DMA Buffer Objects`_ for the full details.
-- The DMA buffer FD is also pollable, see `Fence Poll Support`_ below for
+- The DMA buffer FD is also pollable, see `Implicit Fence Poll Support`_ below for
 details.
 Basic Operation and Device DMA Access
diff --git a/Documentation/driver-api/gpio/driver.rst b/Documentation/driver-api/gpio/driver.rst
index 9809f593c0ab..072a7455044e 100644
--- a/Documentation/driver-api/gpio/driver.rst
+++ b/Documentation/driver-api/gpio/driver.rst
@@ -342,12 +342,12 @@ Cascaded GPIO irqchips usually fall in one of three categories:
 forced to a thread. The "fake?" raw lock can be used to work around this problem::
- raw_spinlock_t wa_lock;
- static irqreturn_t omap_gpio_irq_handler(int irq, void *gpiobank)
- unsigned long wa_lock_flags;
- raw_spin_lock_irqsave(&bank->wa_lock, wa_lock_flags);
- generic_handle_irq(irq_find_mapping(bank->chip.irq.domain, bit));
- raw_spin_unlock_irqrestore(&bank->wa_lock, wa_lock_flags);
+ raw_spinlock_t wa_lock;
+ static irqreturn_t omap_gpio_irq_handler(int irq, void *gpiobank)
+ unsigned long wa_lock_flags;
+ raw_spin_lock_irqsave(&bank->wa_lock, wa_lock_flags);
+ generic_handle_irq(irq_find_mapping(bank->chip.irq.domain, bit));
+ raw_spin_unlock_irqrestore(&bank->wa_lock, wa_lock_flags);
 - GENERIC CHAINED GPIO IRQCHIPS: these are the same as "CHAINED GPIO irqchips", but chained IRQ handlers are not used. Instead GPIO IRQs dispatching is
diff --git a/Documentation/driver-api/nvdimm/index.rst b/Documentation/driver-api/nvdimm/index.rst
index a4f8f98aeb94..5863bd04f056 100644
--- a/Documentation/driver-api/nvdimm/index.rst
+++ b/Documentation/driver-api/nvdimm/index.rst
@@ -10,3 +10,4 @@ Non-Volatile Memory Device (NVDIMM)
 nvdimm
 btt
 security
+ firmware-activate
diff --git a/Documentation/driver-api/soundwire/stream.rst b/Documentation/driver-api/soundwire/stream.rst
index 8858cea7bfe0..b432a2de45d3 100644
--- a/Documentation/driver-api/soundwire/stream.rst
+++ b/Documentation/driver-api/soundwire/stream.rst
@@ -518,10 +518,10 @@ typically called during a dailink .shutdown() callback, which clears
 the stream pointer for all DAIS connected to a stream and releases the memory allocated for the stream.
- Not Supported
+Not Supported
 =============
 1. A single port with multiple channels supported cannot be used between two
-streams or across stream. For example a port with 4 channels cannot be used
-to handle 2 independent stereo streams even though it's possible in theory
-in SoundWire.
+ streams or across stream. For example a port with 4 channels cannot be used
+ to handle 2 independent stereo streams even though it's possible in theory
+ in SoundWire.
diff --git a/Documentation/fb/fbcon.rst b/Documentation/fb/fbcon.rst
index e57a3d1d085a..328f6980698c 100644
--- a/Documentation/fb/fbcon.rst
+++ b/Documentation/fb/fbcon.rst
@@ -87,15 +87,8 @@ C. Boot options
 Note, not all drivers can handle font with widths not divisible by 8, such as vga16fb.
-2. fbcon=scrollback:<value>[k]
- The scrollback buffer is memory that is used to preserve display
- contents that has already scrolled past your view. This is accessed
- by using the Shift-PageUp key combination. The value 'value' is any
- integer. It defaults to 32KB. The 'k' suffix is optional, and will
- multiply the 'value' by 1024.
-
-3. fbcon=map:<0123>
+2. fbcon=map:<0123>
 This is an interesting option. It tells which driver gets mapped to which console. The value '0123' is a sequence that gets repeated until
@@ -116,7 +109,7 @@ C. Boot options
 Later on, when you want to map the console the to the framebuffer device, you can use the con2fbmap utility.
-4. fbcon=vc:<n1>-<n2>
+3. fbcon=vc:<n1>-<n2>
 This option tells fbcon to take over only a range of consoles as specified by the values 'n1' and 'n2'. The rest of the consoles
@@ -127,7 +120,7 @@ C. Boot options
 is typically located on the same video card. Thus, the consoles that are controlled by the VGA console will be garbled.
-5. fbcon=rotate:<n>
+4. fbcon=rotate:<n>
 This option changes the orientation angle of the console display. The value 'n' accepts the following:
@@ -152,21 +145,21 @@ C. Boot options
 Actually, the underlying fb driver is totally ignorant of console rotation.
-6. fbcon=margin:<color>
+5. fbcon=margin:<color>
 This option specifies the color of the margins. The margins are the leftover area at the right and the bottom of the screen that are not used by text. By default, this area will be black. The 'color' value is an integer number that depends on the framebuffer driver being used.
-7. fbcon=nodefer
+6. fbcon=nodefer
 If the kernel is compiled with deferred fbcon takeover support, normally the framebuffer contents, left in place by the firmware/bootloader, will be preserved until there actually is some text is output to the console. This option causes fbcon to bind immediately to the fbdev device.
-8. fbcon=logo-pos:<location>
+7. fbcon=logo-pos:<location>
 The only possible 'location' is 'center' (without quotes), and when given, the bootup logo is moved from the default top-left corner
@@ -174,7 +167,7 @@ C. Boot options
 displayed due to multiple CPUs, the collected line of logos is moved as a whole.
-9. fbcon=logo-count:<n>
+8. fbcon=logo-count:<n>
 The value 'n' overrides the number of bootup logos. 0 disables the logo, and -1 gives the default which is the number of online CPUs.
diff --git a/Documentation/fb/matroxfb.rst b/Documentation/fb/matroxfb.rst
index f1859d98606e..6158c49c8571 100644
--- a/Documentation/fb/matroxfb.rst
+++ b/Documentation/fb/matroxfb.rst
@@ -317,8 +317,6 @@ Currently there are following known bugs:
 - interlaced text mode is not supported; it looks like hardware limitation, but I'm not sure.
 - Gxx0 SGRAM/SDRAM is not autodetected.
- - If you are using more than one framebuffer device, you must boot kernel
- with 'video=scrollback:0'.
 - maybe more...
 And following misfeatures:
diff --git a/Documentation/fb/sstfb.rst b/Documentation/fb/sstfb.rst
index 8e8c1b940359..42466ff49c58 100644
--- a/Documentation/fb/sstfb.rst
+++ b/Documentation/fb/sstfb.rst
@@ -185,9 +185,6 @@ Bugs
 contact me.
- The 24/32 is not likely to work anytime soon, knowing that the hardware does ... unusual things in 24/32 bpp. -- When used with another video board, current limitations of the linux - console subsystem can cause some troubles, specifically, you should - disable software scrollback, as it can oops badly ... Todo ==== diff --git a/Documentation/fb/vesafb.rst b/Documentation/fb/vesafb.rst index 6821c87b7893..f890a4f5623b 100644 --- a/Documentation/fb/vesafb.rst +++ b/Documentation/fb/vesafb.rst @@ -135,8 +135,6 @@ ypan enable display panning using the VESA protected mode * scrolling (fullscreen) is fast, because there is no need to copy around data. - * You'll get scrollback (the Shift-PgUp thing), - the video memory can be used as scrollback buffer kontra: diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index 4c536e66dc4c..98f59a864242 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -34,8 +34,6 @@ algorithms work. quota seq_file sharedsubtree - sysfs-pci - sysfs-tagging automount-support diff --git a/Documentation/filesystems/mount_api.rst b/Documentation/filesystems/mount_api.rst index 29c169c68961..d7f53d62b5bb 100644 --- a/Documentation/filesystems/mount_api.rst +++ b/Documentation/filesystems/mount_api.rst @@ -1,7 +1,7 @@ .. SPDX-License-Identifier: GPL-2.0 ==================== -fILESYSTEM Mount API +Filesystem Mount API ==================== .. CONTENTS @@ -479,7 +479,7 @@ returned. int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param); - Supply a single mount parameter to the filesystem context. This include + Supply a single mount parameter to the filesystem context. This includes the specification of the source/device which is specified as the "source" parameter (which may be specified multiple times if the filesystem supports that). @@ -592,8 +592,7 @@ The following helpers all wrap sget_fc(): one. 
-===================== -PARAMETER DESCRIPTION +Parameter Description ===================== Parameters are described using structures defined in linux/fs_parser.h. diff --git a/Documentation/filesystems/seq_file.rst b/Documentation/filesystems/seq_file.rst index 7f7ee06b2693..56856481dc8d 100644 --- a/Documentation/filesystems/seq_file.rst +++ b/Documentation/filesystems/seq_file.rst @@ -129,7 +129,9 @@ also a special value which can be returned by the start() function called SEQ_START_TOKEN; it can be used if you wish to instruct your show() function (described below) to print a header at the top of the output. SEQ_START_TOKEN should only be used if the offset is zero, -however. +however. SEQ_START_TOKEN has no special meaning to the core seq_file +code. It is provided as a convenience for a start() funciton to +communicate with the next() and show() functions. The next function to implement is called, amazingly, next(); its job is to move the iterator forward to the next position in the sequence. The @@ -145,6 +147,22 @@ complete. Here's the example version:: return spos; } +The next() function should set ``*pos`` to a value that start() can use +to find the new location in the sequence. When the iterator is being +stored in the private data area, rather than being reinitialized on each +start(), it might seem sufficient to simply set ``*pos`` to any non-zero +value (zero always tells start() to restart the sequence). This is not +sufficient due to historical problems. + +Historically, many next() functions have *not* updated ``*pos`` at +end-of-file. If the value is then used by start() to initialise the +iterator, this can result in corner cases where the last entry in the +sequence is reported twice in the file. In order to discourage this bug +from being resurrected, the core seq_file code now produces a warning if +a next() function does not change the value of ``*pos``. 
Consequently a +next() function *must* change the value of ``*pos``, and of course must +set it to a non-zero value. + The stop() function closes a session; its job, of course, is to clean up. If dynamic memory is allocated for the iterator, stop() is the place to free it; if a lock was taken by start(), stop() must release diff --git a/Documentation/filesystems/sysfs.rst b/Documentation/filesystems/sysfs.rst index ab0f7795792b..5a3209a4cebf 100644 --- a/Documentation/filesystems/sysfs.rst +++ b/Documentation/filesystems/sysfs.rst @@ -172,14 +172,13 @@ calls the associated methods. To illustrate:: - #define to_dev(obj) container_of(obj, struct device, kobj) #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, char *buf) { struct device_attribute *dev_attr = to_dev_attr(attr); - struct device *dev = to_dev(kobj); + struct device *dev = kobj_to_dev(kobj); ssize_t ret = -EIO; if (dev_attr->show) diff --git a/Documentation/filesystems/ubifs-authentication.rst b/Documentation/filesystems/ubifs-authentication.rst index 1f39c8cea702..5210aed2afbc 100644 --- a/Documentation/filesystems/ubifs-authentication.rst +++ b/Documentation/filesystems/ubifs-authentication.rst @@ -1,11 +1,13 @@ .. SPDX-License-Identifier: GPL-2.0 -:orphan: - .. UBIFS Authentication .. sigma star gmbh .. 
2018 +============================ +UBIFS Authentication Support +============================ + Introduction ============ diff --git a/Documentation/firmware-guide/acpi/index.rst b/Documentation/firmware-guide/acpi/index.rst index ad3b5afdae77..f72b5f1769fb 100644 --- a/Documentation/firmware-guide/acpi/index.rst +++ b/Documentation/firmware-guide/acpi/index.rst @@ -26,3 +26,4 @@ ACPI Support lpit video_extension extcon-intel-int3496 + intel-pmc-mux diff --git a/Documentation/hwmon/index.rst b/Documentation/hwmon/index.rst index 750d3a975d82..77a1ae975037 100644 --- a/Documentation/hwmon/index.rst +++ b/Documentation/hwmon/index.rst @@ -158,6 +158,7 @@ Hardware Monitoring Kernel Drivers smsc47b397 smsc47m192 smsc47m1 + sparx5-temp tc654 tc74 thmc50 diff --git a/Documentation/ia64/index.rst b/Documentation/ia64/index.rst index 0436e1034115..4bdfe28067ee 100644 --- a/Documentation/ia64/index.rst +++ b/Documentation/ia64/index.rst @@ -15,4 +15,3 @@ IA-64 Architecture irq-redir mca serial - xen diff --git a/Documentation/ia64/xen.rst b/Documentation/ia64/xen.rst deleted file mode 100644 index 831339c74441..000000000000 --- a/Documentation/ia64/xen.rst +++ /dev/null @@ -1,206 +0,0 @@ -******************************************************** -Recipe for getting/building/running Xen/ia64 with pv_ops -******************************************************** -This recipe describes how to get xen-ia64 source and build it, -and run domU with pv_ops. - -Requirements -============ - - - python - - mercurial - it (aka "hg") is an open-source source code - management software. See the below. - http://www.selenic.com/mercurial/wiki/ - - git - - bridge-utils - -Getting and Building Xen and Dom0 -================================= - - My environment is: - - - Machine : Tiger4 - - Domain0 OS : RHEL5 - - DomainU OS : RHEL5 - - 1. 
Download source:: - - # hg clone http://xenbits.xensource.com/ext/ia64/xen-unstable.hg - # cd xen-unstable.hg - # hg clone http://xenbits.xensource.com/ext/ia64/linux-2.6.18-xen.hg - - 2. # make world - - 3. # make install-tools - - 4. copy kernels and xen:: - - # cp xen/xen.gz /boot/efi/efi/redhat/ - # cp build-linux-2.6.18-xen_ia64/vmlinux.gz \ - /boot/efi/efi/redhat/vmlinuz-2.6.18.8-xen - - 5. make initrd for Dom0/DomU:: - - # make -C linux-2.6.18-xen.hg ARCH=ia64 modules_install \ - O=$(pwd)/build-linux-2.6.18-xen_ia64 - # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6.18.8-xen.img \ - 2.6.18.8-xen --builtin mptspi --builtin mptbase \ - --builtin mptscsih --builtin uhci-hcd --builtin ohci-hcd \ - --builtin ehci-hcd - -Making a disk image for guest OS -================================ - - 1. make file:: - - # dd if=/dev/zero of=/root/rhel5.img bs=1M seek=4096 count=0 - # mke2fs -F -j /root/rhel5.img - # mount -o loop /root/rhel5.img /mnt - # cp -ax /{dev,var,etc,usr,bin,sbin,lib} /mnt - # mkdir /mnt/{root,proc,sys,home,tmp} - - Note: You may miss some device files. If so, please create them - with mknod. Or you can use tar instead of cp. - - 2. modify DomU's fstab:: - - # vi /mnt/etc/fstab - /dev/xvda1 / ext3 defaults 1 1 - none /dev/pts devpts gid=5,mode=620 0 0 - none /dev/shm tmpfs defaults 0 0 - none /proc proc defaults 0 0 - none /sys sysfs defaults 0 0 - - 3. modify inittab - - set runlevel to 3 to avoid X trying to start:: - - # vi /mnt/etc/inittab - id:3:initdefault: - - Start a getty on the hvc0 console:: - - X0:2345:respawn:/sbin/mingetty hvc0 - - tty1-6 mingetty can be commented out - - 4. add hvc0 into /etc/securetty:: - - # vi /mnt/etc/securetty (add hvc0) - - 5. umount:: - - # umount /mnt - -FYI, virt-manager can also make a disk image for guest OS. -It's GUI tools and easy to make it. - -Boot Xen & Domain0 -================== - - 1. replace elilo - elilo of RHEL5 can boot Xen and Dom0. 
- If you use old elilo (e.g RHEL4), please download from the below - http://elilo.sourceforge.net/cgi-bin/blosxom - and copy into /boot/efi/efi/redhat/:: - - # cp elilo-3.6-ia64.efi /boot/efi/efi/redhat/elilo.efi - - 2. modify elilo.conf (like the below):: - - # vi /boot/efi/efi/redhat/elilo.conf - prompt - timeout=20 - default=xen - relocatable - - image=vmlinuz-2.6.18.8-xen - label=xen - vmm=xen.gz - initrd=initrd-2.6.18.8-xen.img - read-only - append=" -- rhgb root=/dev/sda2" - -The append options before "--" are for xen hypervisor, -the options after "--" are for dom0. - -FYI, your machine may need console options like -"com1=19200,8n1 console=vga,com1". For example, -append="com1=19200,8n1 console=vga,com1 -- rhgb console=tty0 \ -console=ttyS0 root=/dev/sda2" - -Getting and Building domU with pv_ops -===================================== - - 1. get pv_ops tree:: - - # git clone http://people.valinux.co.jp/~yamahata/xen-ia64/linux-2.6-xen-ia64.git/ - - 2. git branch (if necessary):: - - # cd linux-2.6-xen-ia64/ - # git checkout -b your_branch origin/xen-ia64-domu-minimal-2008may19 - - Note: - The current branch is xen-ia64-domu-minimal-2008may19. - But you would find the new branch. You can see with - "git branch -r" to get the branch lists. - - http://people.valinux.co.jp/~yamahata/xen-ia64/for_eagl/linux-2.6-ia64-pv-ops.git/ - - is also available. - - The tree is based on - - git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 test) - - 3. copy .config for pv_ops of domU:: - - # cp arch/ia64/configs/xen_domu_wip_defconfig .config - - 4. make kernel with pv_ops:: - - # make oldconfig - # make - - 5. 
install the kernel and initrd:: - - # cp vmlinux.gz /boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU - # make modules_install - # mkinitrd -f /boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img \ - 2.6.26-rc3xen-ia64-08941-g1b12161 --builtin mptspi \ - --builtin mptbase --builtin mptscsih --builtin uhci-hcd \ - --builtin ohci-hcd --builtin ehci-hcd - -Boot DomainU with pv_ops -======================== - - 1. make config of DomU:: - - # vi /etc/xen/rhel5 - kernel = "/boot/efi/efi/redhat/vmlinuz-2.6-pv_ops-xenU" - ramdisk = "/boot/efi/efi/redhat/initrd-2.6-pv_ops-xenU.img" - vcpus = 1 - memory = 512 - name = "rhel5" - disk = [ 'file:/root/rhel5.img,xvda1,w' ] - root = "/dev/xvda1 ro" - extra= "rhgb console=hvc0" - - 2. After boot xen and dom0, start xend:: - - # /etc/init.d/xend start - - ( In the debugging case, `# XEND_DEBUG=1 xend trace_start` ) - - 3. start domU:: - - # xm create -c rhel5 - -Reference -========= -- Wiki of Xen/IA64 upstream merge - http://wiki.xensource.com/xenwiki/XenIA64/UpstreamMerge - -Written by Akio Takebe <takebe_akio@jp.fujitsu.com> on 28 May 2008 diff --git a/Documentation/iio/iio_configfs.rst b/Documentation/iio/iio_configfs.rst index 6e38cbbd2981..3a5d76f9e2b9 100644 --- a/Documentation/iio/iio_configfs.rst +++ b/Documentation/iio/iio_configfs.rst @@ -53,7 +53,7 @@ kernel module following the interface in include/linux/iio/sw_trigger.h:: */ } - static int iio_trig_hrtimer_remove(struct iio_sw_trigger *swt) + static int iio_trig_sample_remove(struct iio_sw_trigger *swt) { /* * This undoes the actions in iio_trig_sample_probe diff --git a/Documentation/kbuild/llvm.rst b/Documentation/kbuild/llvm.rst index dae90c21aed3..cf3ca236d2cc 100644 --- a/Documentation/kbuild/llvm.rst +++ b/Documentation/kbuild/llvm.rst @@ -1,3 +1,5 @@ +.. 
_kbuild_llvm:
+
 ==============================
 Building Linux with Clang/LLVM
 ==============================
@@ -73,6 +75,8 @@ Getting Help
 - `Wiki <https://github.com/ClangBuiltLinux/linux/wiki>`_
 - `Beginner Bugs <https://github.com/ClangBuiltLinux/linux/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22>`_
 
+.. _getting_llvm:
+
 Getting LLVM
 -------------
 
diff --git a/Documentation/maintainer/index.rst b/Documentation/maintainer/index.rst
index d904e74e1159..f0a60435b124 100644
--- a/Documentation/maintainer/index.rst
+++ b/Documentation/maintainer/index.rst
@@ -13,4 +13,5 @@ additions to this manual.
    rebasing-and-merging
    pull-requests
    maintainer-entry-profile
+   modifying-patches
 
diff --git a/Documentation/maintainer/modifying-patches.rst b/Documentation/maintainer/modifying-patches.rst
new file mode 100644
index 000000000000..58385d2e8065
--- /dev/null
+++ b/Documentation/maintainer/modifying-patches.rst
@@ -0,0 +1,50 @@
+.. _modifyingpatches:
+
+Modifying Patches
+=================
+
+If you are a subsystem or branch maintainer, sometimes you need to slightly
+modify patches you receive in order to merge them, because the code is not
+exactly the same in your tree and the submitters'. If you stick strictly to
+rule (c) of the developers certificate of origin, you should ask the submitter
+to rediff, but this is a totally counter-productive waste of time and energy.
+Rule (b) allows you to adjust the code, but then it is very impolite to change
+one submitters code and make him endorse your bugs. To solve this problem, it
+is recommended that you add a line between the last Signed-off-by header and
+yours, indicating the nature of your changes. While there is nothing mandatory
+about this, it seems like prepending the description with your mail and/or
+name, all enclosed in square brackets, is noticeable enough to make it obvious
+that you are responsible for last-minute changes. Example::
+
+	Signed-off-by: Random J Developer <random@developer.example.org>
+	[lucky@maintainer.example.org: struct foo moved from foo.c to foo.h]
+	Signed-off-by: Lucky K Maintainer <lucky@maintainer.example.org>
+
+This practice is particularly helpful if you maintain a stable branch and
+want at the same time to credit the author, track changes, merge the fix,
+and protect the submitter from complaints. Note that under no circumstances
+can you change the author's identity (the From header), as it is the one
+which appears in the changelog.
+
+Special note to back-porters: It seems to be a common and useful practice
+to insert an indication of the origin of a patch at the top of the commit
+message (just after the subject line) to facilitate tracking. For instance,
+here's what we see in a 3.x-stable release::
+
+    Date: Tue Oct 7 07:26:38 2014 -0400
+
+    libata: Un-break ATA blacklist
+
+    commit 1c40279960bcd7d52dbdf1d466b20d24b99176c8 upstream.
+
+And here's what might appear in an older kernel once a patch is backported::
+
+    Date: Tue May 13 22:12:27 2008 +0200
+
+    wireless, airo: waitbusy() won't delay
+
+    [backport of 2.6 commit b7acbdfbd1f277c1eb23f344f899cfa4cd0bf36a]
+
+Whatever the format, this information provides a valuable help to people
+tracking your trees, and to people trying to troubleshoot bugs in your
+tree.
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 96186332e5f4..17c8e0c2deb4 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -546,8 +546,8 @@ There are certain things that the Linux kernel memory barriers do not guarantee:
 	[*] For information on bus mastering DMA and coherency please read:
 
 	    Documentation/driver-api/pci/pci.rst
-	    Documentation/DMA-API-HOWTO.txt
-	    Documentation/DMA-API.txt
+	    Documentation/core-api/dma-api-howto.rst
+	    Documentation/core-api/dma-api.rst
 
 
 DATA DEPENDENCY BARRIERS (HISTORICAL)
@@ -1932,8 +1932,8 @@ There are some more advanced barrier functions:
      here.
 
      See the subsection "Kernel I/O barrier effects" for more information on
-     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
-     information on consistent memory.
+     relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for
+     more information on consistent memory.
 
  (*) pmem_wmb();
 
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index c29496fff81c..611e4b130c1e 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -95,6 +95,7 @@ Contents:
    seg6-sysctl
    strparser
    switchdev
+   sysfs-tagging
    tc-actions-env-rules
    tcp-thin
    team
diff --git a/Documentation/filesystems/sysfs-tagging.rst b/Documentation/networking/sysfs-tagging.rst
index 83647e10c207..83647e10c207 100644
--- a/Documentation/filesystems/sysfs-tagging.rst
+++ b/Documentation/networking/sysfs-tagging.rst
diff --git a/Documentation/process/2.Process.rst b/Documentation/process/2.Process.rst
index 4ae1e0f600c1..e05fb1b8f8b6 100644
--- a/Documentation/process/2.Process.rst
+++ b/Documentation/process/2.Process.rst
@@ -405,7 +405,7 @@ be found at:
 	http://vger.kernel.org/vger-lists.html
 
 There are lists hosted elsewhere, though; a number of them are at
-lists.redhat.com.
+redhat.com/mailman/listinfo.
The core mailing list for kernel development is, of course, linux-kernel. This list is an intimidating place to be; volume can reach 500 messages per diff --git a/Documentation/process/changes.rst b/Documentation/process/changes.rst index ee741763a3fc..dac17711dc11 100644 --- a/Documentation/process/changes.rst +++ b/Documentation/process/changes.rst @@ -30,6 +30,7 @@ you probably needn't concern yourself with pcmciautils. Program Minimal version Command to check the version ====================== =============== ======================================== GNU C 4.9 gcc --version +Clang/LLVM (optional) 10.0.1 clang --version GNU make 3.81 make --version binutils 2.23 ld -v flex 2.5.35 flex --version @@ -68,6 +69,15 @@ GCC The gcc version requirements may vary depending on the type of CPU in your computer. +Clang/LLVM (optional) +--------------------- + +The latest formal release of clang and LLVM utils (according to +`releases.llvm.org <https://releases.llvm.org>`_) are supported for building +kernels. Older releases aren't guaranteed to work, and we may drop workarounds +from the kernel that were used to support older versions. Please see additional +docs on :ref:`Building Linux with Clang/LLVM <kbuild_llvm>`. + Make ---- @@ -331,6 +341,11 @@ gcc - <ftp://ftp.gnu.org/gnu/gcc/> +Clang/LLVM +---------- + +- :ref:`Getting LLVM <getting_llvm>`. + Make ---- diff --git a/Documentation/process/deprecated.rst b/Documentation/process/deprecated.rst index 918e32d76fc4..ff71d802b53d 100644 --- a/Documentation/process/deprecated.rst +++ b/Documentation/process/deprecated.rst @@ -51,24 +51,6 @@ to make sure their systems do not continue running in the face of "unreachable" conditions. (For example, see commits like `this one <https://git.kernel.org/linus/d4689846881d160a4d12a514e991a740bcb5d65a>`_.) -uninitialized_var() -------------------- -For any compiler warnings about uninitialized variables, just add -an initializer. 
Using the uninitialized_var() macro (or similar -warning-silencing tricks) is dangerous as it papers over `real bugs -<https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/>`_ -(or can in the future), and suppresses unrelated compiler warnings -(e.g. "unused variable"). If the compiler thinks it is uninitialized, -either simply initialize the variable or make compiler changes. Keep in -mind that in most cases, if an initialization is obviously redundant, -the compiler's dead-store elimination pass will make sure there are no -needless variable writes. - -As Linus has said, this macro -`must <https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/>`_ -`be <https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/>`_ -`removed <https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/>`_. - open-coded arithmetic in allocator arguments -------------------------------------------- Dynamic size calculations (especially multiplication) should not be @@ -322,7 +304,8 @@ to allocate for a structure containing an array of this kind as a member:: In the example above, we had to remember to calculate ``count - 1`` when using the struct_size() helper, otherwise we would have --unintentionally-- allocated memory for one too many ``items`` objects. 
The cleanest and least error-prone way -to implement this is through the use of a `flexible array member`:: +to implement this is through the use of a `flexible array member`, together with +struct_size() and flex_array_size() helpers:: struct something { size_t count; @@ -334,5 +317,4 @@ to implement this is through the use of a `flexible array member`:: instance = kmalloc(struct_size(instance, items, count), GFP_KERNEL); instance->count = count; - size = sizeof(instance->items[0]) * instance->count; - memcpy(instance->items, source, size); + memcpy(instance->items, source, flex_array_size(instance, items, instance->count)); diff --git a/Documentation/process/email-clients.rst b/Documentation/process/email-clients.rst index c9e4ce2613c0..16586f6cc888 100644 --- a/Documentation/process/email-clients.rst +++ b/Documentation/process/email-clients.rst @@ -25,6 +25,11 @@ attachments, but then the attachments should have content-type it makes quoting portions of the patch more difficult in the patch review process. +It's also strongly recommended that you use plain text in your email body, +for patches and other emails alike. https://useplaintext.email may be useful +for information on how to configure your preferred email client, as well as +listing recommended email clients should you not already have a preference. + Email clients that are used for Linux kernel patches should send the patch text untouched. For example, they should not modify or delete tabs or spaces, even at the beginning or end of lines. diff --git a/Documentation/process/programming-language.rst b/Documentation/process/programming-language.rst index e5f5f065dc24..ec474a70a02f 100644 --- a/Documentation/process/programming-language.rst +++ b/Documentation/process/programming-language.rst @@ -6,14 +6,15 @@ Programming Language The kernel is written in the C programming language [c-language]_. 
More precisely, the kernel is typically compiled with ``gcc`` [gcc]_ under ``-std=gnu89`` [gcc-c-dialect-options]_: the GNU dialect of ISO C90 -(including some C99 features). +(including some C99 features). ``clang`` [clang]_ is also supported, see +docs on :ref:`Building Linux with Clang/LLVM <kbuild_llvm>`. This dialect contains many extensions to the language [gnu-extensions]_, and many of them are used within the kernel as a matter of course. -There is some support for compiling the kernel with ``clang`` [clang]_ -and ``icc`` [icc]_ for several of the architectures, although at the time -of writing it is not completed, requiring third-party patches. +There is some support for compiling the kernel with ``icc`` [icc]_ for several +of the architectures, although at the time of writing it is not completed, +requiring third-party patches. Attributes ---------- diff --git a/Documentation/process/submit-checklist.rst b/Documentation/process/submit-checklist.rst index 3f8e9d5d95c2..b681e862a335 100644 --- a/Documentation/process/submit-checklist.rst +++ b/Documentation/process/submit-checklist.rst @@ -24,6 +24,10 @@ and elsewhere regarding submitting Linux kernel patches. c) Builds successfully when using ``O=builddir`` + d) Any Documentation/ changes build successfully without new warnings/errors. + Use ``make htmldocs`` or ``make pdfdocs`` to check the build and + fix any issues. + 3) Builds on multiple CPU architectures by using local cross-compile tools or some other build farm. diff --git a/Documentation/process/submitting-drivers.rst b/Documentation/process/submitting-drivers.rst index 74b35bfc6623..3861887e0ca5 100644 --- a/Documentation/process/submitting-drivers.rst +++ b/Documentation/process/submitting-drivers.rst @@ -60,10 +60,11 @@ What Criteria Determine Acceptance Licensing: The code must be released to us under the - GNU General Public License. 
We don't insist on any kind - of exclusive GPL licensing, and if you wish the driver - to be useful to other communities such as BSD you may well - wish to release under multiple licenses. + GNU General Public License. If you wish the driver to be + useful to other communities such as BSD you may release + under multiple licenses. If you choose to release under + licenses other than the GPL, you should include your + rationale for your license choices in your cover letter. See accepted licenses at include/linux/module.h Copyright: diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst index 5219bf3cddfc..58586ffe2808 100644 --- a/Documentation/process/submitting-patches.rst +++ b/Documentation/process/submitting-patches.rst @@ -10,22 +10,18 @@ can greatly increase the chances of your change being accepted. This document contains a large number of suggestions in a relatively terse format. For detailed information on how the kernel development process -works, see :ref:`Documentation/process <development_process_main>`. -Also, read :ref:`Documentation/process/submit-checklist.rst <submitchecklist>` -for a list of items to check before -submitting code. If you are submitting a driver, also read -:ref:`Documentation/process/submitting-drivers.rst <submittingdrivers>`; -for device tree binding patches, read -Documentation/devicetree/bindings/submitting-patches.rst. - -Many of these steps describe the default behavior of the ``git`` version -control system; if you use ``git`` to prepare your patches, you'll find much -of the mechanical work done for you, though you'll still need to prepare -and document a sensible set of patches. In general, use of ``git`` will make -your life as a kernel developer easier. - -0) Obtain a current source tree -------------------------------- +works, see :doc:`development-process`. Also, read :doc:`submit-checklist` +for a list of items to check before submitting code. 
If you are submitting +a driver, also read :doc:`submitting-drivers`; for device tree binding patches, +read :doc:`submitting-patches`. + +This documentation assumes that you're using ``git`` to prepare your patches. +If you're unfamiliar with ``git``, you would be well-advised to learn how to +use it, it will make your life as a kernel developer and in general much +easier. + +Obtain a current source tree +---------------------------- If you do not have a repository with the current kernel source handy, use ``git`` to obtain one. You'll want to start with the mainline repository, @@ -39,68 +35,10 @@ patches prepared against those trees. See the **T:** entry for the subsystem in the MAINTAINERS file to find that tree, or simply ask the maintainer if the tree is not listed there. -It is still possible to download kernel releases via tarballs (as described -in the next section), but that is the hard way to do kernel development. - -1) ``diff -up`` ---------------- - -If you must generate your patches by hand, use ``diff -up`` or ``diff -uprN`` -to create patches. Git generates patches in this form by default; if -you're using ``git``, you can skip this section entirely. - -All changes to the Linux kernel occur in the form of patches, as -generated by :manpage:`diff(1)`. When creating your patch, make sure to -create it in "unified diff" format, as supplied by the ``-u`` argument -to :manpage:`diff(1)`. -Also, please use the ``-p`` argument which shows which C function each -change is in - that makes the resultant ``diff`` a lot easier to read. -Patches should be based in the root kernel source directory, -not in any lower subdirectory. - -To create a patch for a single file, it is often sufficient to do:: - - SRCTREE=linux - MYFILE=drivers/net/mydriver.c - - cd $SRCTREE - cp $MYFILE $MYFILE.orig - vi $MYFILE # make your change - cd .. 
- diff -up $SRCTREE/$MYFILE{.orig,} > /tmp/patch - -To create a patch for multiple files, you should unpack a "vanilla", -or unmodified kernel source tree, and generate a ``diff`` against your -own source tree. For example:: - - MYSRC=/devel/linux - - tar xvfz linux-3.19.tar.gz - mv linux-3.19 linux-3.19-vanilla - diff -uprN -X linux-3.19-vanilla/Documentation/dontdiff \ - linux-3.19-vanilla $MYSRC > /tmp/patch - -``dontdiff`` is a list of files which are generated by the kernel during -the build process, and should be ignored in any :manpage:`diff(1)`-generated -patch. - -Make sure your patch does not include any extra files which do not -belong in a patch submission. Make sure to review your patch -after- -generating it with :manpage:`diff(1)`, to ensure accuracy. - -If your changes produce a lot of deltas, you need to split them into -individual patches which modify things in logical stages; see -:ref:`split_changes`. This will facilitate review by other kernel developers, -very important if you want your patch accepted. - -If you're using ``git``, ``git rebase -i`` can help you with this process. If -you're not using ``git``, ``quilt`` <https://savannah.nongnu.org/projects/quilt> -is another popular alternative. - .. _describe_changes: -2) Describe your changes ------------------------- +Describe your changes +--------------------- Describe your problem. Whether your patch is a one-line bug fix or 5000 lines of a new feature, there must be an underlying problem that @@ -203,8 +141,8 @@ An example call:: .. _split_changes: -3) Separate your changes ------------------------- +Separate your changes +--------------------- Separate each **logical change** into a separate patch. @@ -236,8 +174,8 @@ then only post say 15 or so at a time and wait for review and integration. 
-4) Style-check your changes ---------------------------- +Style-check your changes +------------------------ Check your patch for basic style violations, details of which can be found in @@ -267,8 +205,8 @@ You should be able to justify all violations that remain in your patch. -5) Select the recipients for your patch ---------------------------------------- +Select the recipients for your patch +------------------------------------ You should always copy the appropriate subsystem maintainer(s) on any patch to code that they maintain; look through the MAINTAINERS file and the @@ -299,7 +237,8 @@ sending him e-mail. If you have a patch that fixes an exploitable security bug, send that patch to security@kernel.org. For severe bugs, a short embargo may be considered to allow distributors to get the patch out to users; in such cases, -obviously, the patch should not be sent to any public lists. +obviously, the patch should not be sent to any public lists. See also +:doc:`/admin-guide/security-bugs`. Patches that fix a severe bug in a released kernel should be directed toward the stable maintainers by putting a line like this:: @@ -342,15 +281,20 @@ Trivial patches must qualify for one of the following rules: -6) No MIME, no links, no compression, no attachments. Just plain text ----------------------------------------------------------------------- +No MIME, no links, no compression, no attachments. Just plain text +------------------------------------------------------------------- Linus and other kernel developers need to be able to read and comment on the changes you are submitting. It is important for a kernel developer to be able to "quote" your changes, using standard e-mail tools, so that they may comment on specific portions of your code. -For this reason, all patches should be submitted by e-mail "inline". +For this reason, all patches should be submitted by e-mail "inline". 
The +easiest way to do this is with ``git send-email``, which is strongly +recommended. An interactive tutorial for ``git send-email`` is available at +https://git-send-email.io. + +If you choose not to use ``git send-email``: .. warning:: @@ -366,27 +310,17 @@ decreasing the likelihood of your MIME-attached change being accepted. Exception: If your mailer is mangling patches then someone may ask you to re-send them using MIME. -See :ref:`Documentation/process/email-clients.rst <email_clients>` -for hints about configuring your e-mail client so that it sends your patches -untouched. - -7) E-mail size --------------- +See :doc:`/process/email-clients` for hints about configuring your e-mail +client so that it sends your patches untouched. -Large changes are not appropriate for mailing lists, and some -maintainers. If your patch, uncompressed, exceeds 300 kB in size, -it is preferred that you store your patch on an Internet-accessible -server, and provide instead a URL (link) pointing to your patch. But note -that if your patch exceeds 300 kB, it almost certainly needs to be broken up -anyway. - -8) Respond to review comments ------------------------------ +Respond to review comments +-------------------------- Your patch will almost certainly get comments from reviewers on ways in -which the patch can be improved. You must respond to those comments; -ignoring reviewers is a good way to get ignored in return. Review comments -or questions that do not lead to a code change should almost certainly +which the patch can be improved, in the form of a reply to your email. You must +respond to those comments; ignoring reviewers is a good way to get ignored in +return. You can simply reply to their emails to answer their comments. Review +comments or questions that do not lead to a code change should almost certainly bring about a comment or changelog entry so that the next reviewer better understands what is going on. @@ -395,9 +329,12 @@ for their time. 
Code review is a tiring and time-consuming process, and reviewers sometimes get grumpy. Even in that case, though, respond politely and address the problems they have pointed out. +See :doc:`email-clients` for recommendations on email +clients and mailing list etiquette. -9) Don't get discouraged - or impatient ---------------------------------------- + +Don't get discouraged - or impatient +------------------------------------ After you have submitted your change, be patient and wait. Reviewers are busy people and may not get to your patch right away. @@ -410,18 +347,19 @@ one week before resubmitting or pinging reviewers - possibly longer during busy times like merge windows. -10) Include PATCH in the subject --------------------------------- +Include PATCH in the subject +----------------------------- Due to high e-mail traffic to Linus, and to linux-kernel, it is common convention to prefix your subject line with [PATCH]. This lets Linus and other kernel developers more easily distinguish patches from other e-mail discussions. +``git send-email`` will do this for you automatically. -11) Sign your work - the Developer's Certificate of Origin ----------------------------------------------------------- +Sign your work - the Developer's Certificate of Origin +------------------------------------------------------ To improve tracking of who did what, especially with patches that can percolate to their final resting place in the kernel through several @@ -465,60 +403,15 @@ then you just add a line saying:: Signed-off-by: Random J Developer <random@developer.example.org> using your real name (sorry, no pseudonyms or anonymous contributions.) +This will be done for you automatically if you use ``git commit -s``. Some people also put extra tags at the end. They'll just be ignored for now, but you can do this to mark internal company procedures or just point out some special detail about the sign-off. 
-If you are a subsystem or branch maintainer, sometimes you need to slightly -modify patches you receive in order to merge them, because the code is not -exactly the same in your tree and the submitters'. If you stick strictly to -rule (c), you should ask the submitter to rediff, but this is a totally -counter-productive waste of time and energy. Rule (b) allows you to adjust -the code, but then it is very impolite to change one submitter's code and -make him endorse your bugs. To solve this problem, it is recommended that -you add a line between the last Signed-off-by header and yours, indicating -the nature of your changes. While there is nothing mandatory about this, it -seems like prepending the description with your mail and/or name, all -enclosed in square brackets, is noticeable enough to make it obvious that -you are responsible for last-minute changes. Example:: - Signed-off-by: Random J Developer <random@developer.example.org> - [lucky@maintainer.example.org: struct foo moved from foo.c to foo.h] - Signed-off-by: Lucky K Maintainer <lucky@maintainer.example.org> - -This practice is particularly helpful if you maintain a stable branch and -want at the same time to credit the author, track changes, merge the fix, -and protect the submitter from complaints. Note that under no circumstances -can you change the author's identity (the From header), as it is the one -which appears in the changelog. - -Special note to back-porters: It seems to be a common and useful practice -to insert an indication of the origin of a patch at the top of the commit -message (just after the subject line) to facilitate tracking. For instance, -here's what we see in a 3.x-stable release:: - - Date: Tue Oct 7 07:26:38 2014 -0400 - - libata: Un-break ATA blacklist - - commit 1c40279960bcd7d52dbdf1d466b20d24b99176c8 upstream. 
- -And here's what might appear in an older kernel once a patch is backported:: - - Date: Tue May 13 22:12:27 2008 +0200 - - wireless, airo: waitbusy() won't delay - - [backport of 2.6 commit b7acbdfbd1f277c1eb23f344f899cfa4cd0bf36a] - -Whatever the format, this information provides a valuable help to people -tracking your trees, and to people trying to troubleshoot bugs in your -tree. - - -12) When to use Acked-by:, Cc:, and Co-developed-by: -------------------------------------------------------- +When to use Acked-by:, Cc:, and Co-developed-by: +------------------------------------------------ The Signed-off-by: tag indicates that the signer was involved in the development of the patch, or that he/she was in the patch's delivery path. @@ -586,8 +479,8 @@ Example of a patch submitted by a Co-developed-by: author:: Signed-off-by: Submitting Co-Author <sub@coauthor.example.org> -13) Using Reported-by:, Tested-by:, Reviewed-by:, Suggested-by: and Fixes: --------------------------------------------------------------------------- +Using Reported-by:, Tested-by:, Reviewed-by:, Suggested-by: and Fixes: +---------------------------------------------------------------------- The Reported-by tag gives credit to people who find bugs and report them and it hopefully inspires them to help us again in the future. Please note that if @@ -650,8 +543,8 @@ for more details. .. _the_canonical_patch_format: -14) The canonical patch format ------------------------------- +The canonical patch format +-------------------------- This section describes how the patch itself should be formatted. Note that, if you have your patches stored in a ``git`` repository, proper patch @@ -773,8 +666,8 @@ references. .. 
_explicit_in_reply_to: -15) Explicit In-Reply-To headers --------------------------------- +Explicit In-Reply-To headers +---------------------------- It can be helpful to manually add In-Reply-To: headers to a patch (e.g., when using ``git send-email``) to associate the patch with @@ -787,8 +680,8 @@ helpful, you can use the https://lkml.kernel.org/ redirector (e.g., in the cover email text) to link to an earlier version of the patch series. -16) Providing base tree information ------------------------------------ +Providing base tree information +------------------------------- When other developers receive your patches and start the review process, it is often useful for them to know where in the tree history they @@ -838,61 +731,6 @@ either below the ``---`` line or at the very bottom of all other content, right before your email signature. -17) Sending ``git pull`` requests ---------------------------------- - -If you have a series of patches, it may be most convenient to have the -maintainer pull them directly into the subsystem repository with a -``git pull`` operation. Note, however, that pulling patches from a developer -requires a higher degree of trust than taking patches from a mailing list. -As a result, many subsystem maintainers are reluctant to take pull -requests, especially from new, unknown developers. If in doubt you can use -the pull request as the cover letter for a normal posting of the patch -series, giving the maintainer the option of using either. - -A pull request should have [GIT PULL] in the subject line. 
The -request itself should include the repository name and the branch of -interest on a single line; it should look something like:: - - Please pull from - - git://jdelvare.pck.nerim.net/jdelvare-2.6 i2c-for-linus - - to get these changes: - -A pull request should also include an overall message saying what will be -included in the request, a ``git shortlog`` listing of the patches -themselves, and a ``diffstat`` showing the overall effect of the patch series. -The easiest way to get all this information together is, of course, to let -``git`` do it for you with the ``git request-pull`` command. - -Some maintainers (including Linus) want to see pull requests from signed -commits; that increases their confidence that the request actually came -from you. Linus, in particular, will not pull from public hosting sites -like GitHub in the absence of a signed tag. - -The first step toward creating such tags is to make a GNUPG key and get it -signed by one or more core kernel developers. This step can be hard for -new developers, but there is no way around it. Attending conferences can -be a good way to find developers who can sign your key. - -Once you have prepared a patch series in ``git`` that you wish to have somebody -pull, create a signed tag with ``git tag -s``. This will create a new tag -identifying the last commit in the series and containing a signature -created with your private key. You will also have the opportunity to add a -changelog-style message to the tag; this is an ideal place to describe the -effects of the pull request as a whole. - -If the tree the maintainer will be pulling from is not the repository you -are working from, don't forget to push the signed tag explicitly to the -public tree. - -When generating your pull request, use the signed tag as the target. 
A -command like this will do the trick:: - - git request-pull master git://my.public.tree/linux.git my-signed-tag - - References ---------- diff --git a/Documentation/scheduler/sched-capacity.rst b/Documentation/scheduler/sched-capacity.rst index 00bf0d011e2a..9b7cbe43b2d1 100644 --- a/Documentation/scheduler/sched-capacity.rst +++ b/Documentation/scheduler/sched-capacity.rst @@ -365,7 +365,7 @@ giving it a high uclamp.min value. .. note:: Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling - (EAS), which is described in Documentation/scheduling/sched-energy.rst. + (EAS), which is described in Documentation/scheduler/sched-energy.rst. 5.1.3 Load balancing ~~~~~~~~~~~~~~~~~~~~ diff --git a/Documentation/scheduler/sched-energy.rst b/Documentation/scheduler/sched-energy.rst index 78f850778982..001e09c95e1d 100644 --- a/Documentation/scheduler/sched-energy.rst +++ b/Documentation/scheduler/sched-energy.rst @@ -331,7 +331,7 @@ asymmetric CPU topologies for now. This requirement is checked at run-time by looking for the presence of the SD_ASYM_CPUCAPACITY flag when the scheduling domains are built. -See Documentation/sched/sched-capacity.rst for requirements to be met for this +See Documentation/scheduler/sched-capacity.rst for requirements to be met for this flag to be set in the sched_domain hierarchy. 
 Please note that EAS is not fundamentally incompatible with SMP, but no
diff --git a/Documentation/security/credentials.rst b/Documentation/security/credentials.rst
index d9387209d143..357328d566c8 100644
--- a/Documentation/security/credentials.rst
+++ b/Documentation/security/credentials.rst
@@ -323,7 +323,6 @@ credentials (the value is simply returned in each case)::
     uid_t current_fsuid(void)       Current's file access UID
     gid_t current_fsgid(void)       Current's file access GID
     kernel_cap_t current_cap(void)  Current's effective capabilities
-    void *current_security(void)    Current's LSM security pointer
     struct user_struct *current_user(void)  Current's user account

 There are also convenience wrappers for retrieving specific associated pairs of
diff --git a/Documentation/security/keys/trusted-encrypted.rst b/Documentation/security/keys/trusted-encrypted.rst
index 9483a7425ad5..1da879a68640 100644
--- a/Documentation/security/keys/trusted-encrypted.rst
+++ b/Documentation/security/keys/trusted-encrypted.rst
@@ -39,10 +39,9 @@ With the IBM TSS 2 stack::

 Or with the Intel TSS 2 stack::

-  #> tpm2_createprimary --hierarchy o -G rsa2048 -o key.ctxt
+  #> tpm2_createprimary --hierarchy o -G rsa2048 -c key.ctxt
   [...]
-  handle: 0x800000FF
-  #> tpm2_evictcontrol -c key.ctxt -p 0x81000001
+  #> tpm2_evictcontrol -c key.ctxt 0x81000001
   persistentHandle: 0x81000001

 Usage::
diff --git a/Documentation/sphinx/automarkup.py b/Documentation/sphinx/automarkup.py
index b18236370742..a1b0f554cd82 100644
--- a/Documentation/sphinx/automarkup.py
+++ b/Documentation/sphinx/automarkup.py
@@ -13,6 +13,7 @@ if sphinx.version_info[0] < 2 or \
 else:
     from sphinx.errors import NoUri
 import re
+from itertools import chain

 #
 # Regex nastiness. Of course.
@@ -21,7 +22,13 @@ import re
 # :c:func: block (i.e. ":c:func:`mmap()`s" flakes out), so the last
 # bit tries to restrict matches to things that won't create trouble.
 #
-RE_function = re.compile(r'([\w_][\w\d_]+\(\))')
+RE_function = re.compile(r'(([\w_][\w\d_]+)\(\))')
+RE_type = re.compile(r'(struct|union|enum|typedef)\s+([\w_][\w\d_]+)')
+#
+# Detects a reference to a documentation page of the form Documentation/... with
+# an optional extension
+#
+RE_doc = re.compile(r'Documentation(/[\w\-_/]+)(\.\w+)*')

 #
 # Many places in the docs refer to common system calls. It is
@@ -34,56 +41,110 @@ Skipfuncs = [ 'open', 'close', 'read', 'write', 'fcntl', 'mmap',
               'select', 'poll', 'fork', 'execve', 'clone', 'ioctl',
               'socket' ]

-#
-# Find all occurrences of function() and try to replace them with
-# appropriate cross references.
-#
-def markup_funcs(docname, app, node):
-    cdom = app.env.domains['c']
+def markup_refs(docname, app, node):
     t = node.astext()
     done = 0
     repl = [ ]
-    for m in RE_function.finditer(t):
+    #
+    # Associate each regex with the function that will markup its matches
+    #
+    markup_func = {RE_type: markup_c_ref,
+                   RE_function: markup_c_ref,
+                   RE_doc: markup_doc_ref}
+    match_iterators = [regex.finditer(t) for regex in markup_func]
+    #
+    # Sort all references by the starting position in text
+    #
+    sorted_matches = sorted(chain(*match_iterators), key=lambda m: m.start())
+    for m in sorted_matches:
         #
-        # Include any text prior to function() as a normal text node.
+        # Include any text prior to match as a normal text node.
         #
         if m.start() > done:
             repl.append(nodes.Text(t[done:m.start()]))
+
         #
-        # Go through the dance of getting an xref out of the C domain
-        #
-        target = m.group(1)[:-2]
-        target_text = nodes.Text(target + '()')
-        xref = None
-        if target not in Skipfuncs:
-            lit_text = nodes.literal(classes=['xref', 'c', 'c-func'])
-            lit_text += target_text
-            pxref = addnodes.pending_xref('', refdomain = 'c',
-                                          reftype = 'function',
-                                          reftarget = target, modname = None,
-                                          classname = None)
-            #
-            # XXX The Latex builder will throw NoUri exceptions here,
-            # work around that by ignoring them.
-            #
-            try:
-                xref = cdom.resolve_xref(app.env, docname, app.builder,
-                                         'function', target, pxref, lit_text)
-            except NoUri:
-                xref = None
-        #
-        # Toss the xref into the list if we got it; otherwise just put
-        # the function text.
+        # Call the function associated with the regex that matched this text and
+        # append its return to the text
         #
-        if xref:
-            repl.append(xref)
-        else:
-            repl.append(target_text)
+        repl.append(markup_func[m.re](docname, app, m))
+
         done = m.end()
     if done < len(t):
         repl.append(nodes.Text(t[done:]))
     return repl

+#
+# Try to replace a C reference (function() or struct/union/enum/typedef
+# type_name) with an appropriate cross reference.
+#
+def markup_c_ref(docname, app, match):
+    class_str = {RE_function: 'c-func', RE_type: 'c-type'}
+    reftype_str = {RE_function: 'function', RE_type: 'type'}
+
+    cdom = app.env.domains['c']
+    #
+    # Go through the dance of getting an xref out of the C domain
+    #
+    target = match.group(2)
+    target_text = nodes.Text(match.group(0))
+    xref = None
+    if not (match.re == RE_function and target in Skipfuncs):
+        lit_text = nodes.literal(classes=['xref', 'c', class_str[match.re]])
+        lit_text += target_text
+        pxref = addnodes.pending_xref('', refdomain = 'c',
+                                      reftype = reftype_str[match.re],
+                                      reftarget = target, modname = None,
+                                      classname = None)
+        #
+        # XXX The Latex builder will throw NoUri exceptions here,
+        # work around that by ignoring them.
+        #
+        try:
+            xref = cdom.resolve_xref(app.env, docname, app.builder,
+                                     reftype_str[match.re], target, pxref,
+                                     lit_text)
+        except NoUri:
+            xref = None
+    #
+    # Return the xref if we got it; otherwise just return the plain text.
+    #
+    if xref:
+        return xref
+    else:
+        return target_text
+
+#
+# Try to replace a documentation reference of the form Documentation/... with a
+# cross reference to that page
+#
+def markup_doc_ref(docname, app, match):
+    stddom = app.env.domains['std']
+    #
+    # Go through the dance of getting an xref out of the std domain
+    #
+    target = match.group(1)
+    xref = None
+    pxref = addnodes.pending_xref('', refdomain = 'std', reftype = 'doc',
+                                  reftarget = target, modname = None,
+                                  classname = None, refexplicit = False)
+    #
+    # XXX The Latex builder will throw NoUri exceptions here,
+    # work around that by ignoring them.
+    #
+    try:
+        xref = stddom.resolve_xref(app.env, docname, app.builder, 'doc',
+                                   target, pxref, None)
+    except NoUri:
+        xref = None
+    #
+    # Return the xref if we got it; otherwise just return the plain text.
+    #
+    if xref:
+        return xref
+    else:
+        return nodes.Text(match.group(0))
+
 def auto_markup(app, doctree, name):
     #
     # This loop could eventually be improved on. Someday maybe we
@@ -97,7 +158,7 @@ def auto_markup(app, doctree, name):
     for para in doctree.traverse(nodes.paragraph):
         for node in para.traverse(nodes.Text):
             if not isinstance(node.parent, nodes.literal):
-                node.parent.replace(node, markup_funcs(name, app, node))
+                node.parent.replace(node, markup_refs(name, app, node))

 def setup(app):
     app.connect('doctree-resolved', auto_markup)
diff --git a/Documentation/trace/kprobetrace.rst b/Documentation/trace/kprobetrace.rst
index c1709165c553..10850a9e9af3 100644
--- a/Documentation/trace/kprobetrace.rst
+++ b/Documentation/trace/kprobetrace.rst
@@ -40,7 +40,7 @@ Synopsis of kprobe_events
  MEMADDR        : Address where the probe is inserted.
  MAXACTIVE      : Maximum number of instances of the specified function that
                   can be probed simultaneously, or 0 for the default value
-                  as defined in Documentation/staging/kprobes.rst section 1.3.1.
+                  as defined in Documentation/trace/kprobes.rst section 1.3.1.

  FETCHARGS      : Arguments. Each probe can have up to 128 args.
%REG : Fetch register REG diff --git a/Documentation/trace/ring-buffer-design.rst b/Documentation/trace/ring-buffer-design.rst index 9c8d22a53d6c..c5d77fcbb5bc 100644 --- a/Documentation/trace/ring-buffer-design.rst +++ b/Documentation/trace/ring-buffer-design.rst @@ -1,28 +1,4 @@ -.. This file is dual-licensed: you can use it either under the terms -.. of the GPL 2.0 or the GFDL 1.2 license, at your option. Note that this -.. dual licensing only applies to this file, and not this project as a -.. whole. -.. -.. a) This file is free software; you can redistribute it and/or -.. modify it under the terms of the GNU General Public License as -.. published by the Free Software Foundation version 2 of -.. the License. -.. -.. This file is distributed in the hope that it will be useful, -.. but WITHOUT ANY WARRANTY; without even the implied warranty of -.. MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the -.. GNU General Public License for more details. -.. -.. Or, alternatively, -.. -.. b) Permission is granted to copy, distribute and/or modify this -.. document under the terms of the GNU Free Documentation License, -.. Version 1.2 version published by the Free Software -.. Foundation, with no Invariant Sections, no Front-Cover Texts -.. and no Back-Cover Texts. A copy of the license is included at -.. Documentation/userspace-api/media/fdl-appendix.rst. -.. -.. TODO: replace it to GPL-2.0 OR GFDL-1.2 WITH no-invariant-sections +.. SPDX-License-Identifier: GPL-2.0 OR GFDL-1.2-no-invariants-only =========================== Lockless Ring Buffer Design diff --git a/Documentation/translations/ko_KR/howto.rst b/Documentation/translations/ko_KR/howto.rst index 71d4823e41e1..240d29be38f2 100644 --- a/Documentation/translations/ko_KR/howto.rst +++ b/Documentation/translations/ko_KR/howto.rst @@ -284,9 +284,10 @@ Andrew Mortonì˜ ê¸€ì´ ìžˆë‹¤. 
여러 ë©”ì´ì € 넘버를 갖는 다양한 ì•ˆì •ëœ ì»¤ë„ íŠ¸ë¦¬ë“¤ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -3 ìžë¦¬ 숫ìžë¡œ ì´ë£¨ì–´ì§„ ë²„ì ¼ì˜ ì»¤ë„ë“¤ì€ -stable 커ë„들ì´ë‹¤. ê·¸ê²ƒë“¤ì€ í•´ë‹¹ ë©”ì´ì € -ë©”ì¸ë¼ì¸ 릴리즈ì—ì„œ ë°œê²¬ëœ í° íšŒê·€ë“¤ì´ë‚˜ 보안 ë¬¸ì œë“¤ 중 비êµì ìž‘ê³ ì¤‘ìš”í•œ -ìˆ˜ì •ë“¤ì„ í¬í•¨í•˜ë©°, ì•žì˜ ë‘ ë²„ì „ 넘버는 ê°™ì€ ê¸°ë°˜ ë²„ì „ì„ ì˜ë¯¸í•œë‹¤. +ì„¸ê°œì˜ ë²„ì ¼ 넘버로 ì´ë£¨ì–´ì§„ ë²„ì ¼ì˜ ì»¤ë„ë“¤ì€ -stable 커ë„들ì´ë‹¤. ê·¸ê²ƒë“¤ì€ í•´ë‹¹ +ë©”ì´ì € ë©”ì¸ë¼ì¸ 릴리즈ì—ì„œ ë°œê²¬ëœ í° íšŒê·€ë“¤ì´ë‚˜ 보안 ë¬¸ì œë“¤ 중 비êµì ìž‘ê³ +중요한 ìˆ˜ì •ë“¤ì„ í¬í•¨í•œë‹¤. 주요 stable 시리즈 릴리즈는 세번째 ë²„ì ¼ 넘버를 +ì¦ê°€ì‹œí‚¤ë©° ì•žì˜ ë‘ ë²„ì ¼ 넘버는 그대로 ìœ ì§€í•œë‹¤. ì´ê²ƒì€ 가장 ìµœê·¼ì˜ ì•ˆì •ì ì¸ ì»¤ë„ì„ ì›í•˜ëŠ” 사용ìžì—게 추천ë˜ëŠ” 브랜치ì´ë©°, 개발/실험ì ë²„ì ¼ì„ í…ŒìŠ¤íŠ¸í•˜ëŠ” ê²ƒì„ ë•ê³ ìž í•˜ëŠ” 사용ìžë“¤ê³¼ëŠ” 별로 ê´€ë ¨ì´ ì—†ë‹¤. @@ -316,7 +317,7 @@ Andrew Mortonì˜ ê¸€ì´ ìžˆë‹¤. ì œì•ˆëœ íŒ¨ì¹˜ëŠ” 서브시스템 íŠ¸ë¦¬ì— ì»¤ë°‹ë˜ê¸° ì „ì— ë©”ì¼ë§ 리스트를 통해 리뷰ëœë‹¤(ì•„ëž˜ì˜ ê´€ë ¨ ì„¹ì…˜ì„ ì°¸ê³ í•˜ê¸° 바란다). ì¼ë¶€ ì»¤ë„ ì„œë¸Œì‹œìŠ¤í…œì˜ ê²½ìš°, ì´ ë¦¬ë·° 프로세스는 patchworkë¼ëŠ” ë„구를 통해 추ì ëœë‹¤. patchworkì€ ë“±ë¡ëœ 패치와 -íŒ¨ì¹˜ì— ëŒ€í•œ 코멘트, íŒ¨ì¹˜ì˜ ë²„ì „ì„ ë³¼ 수 있는 웹 ì¸í„°íŽ˜ì´ìŠ¤ë¥¼ ì œê³µí•˜ê³ , +íŒ¨ì¹˜ì— ëŒ€í•œ 코멘트, íŒ¨ì¹˜ì˜ ë²„ì ¼ì„ ë³¼ 수 있는 웹 ì¸í„°íŽ˜ì´ìŠ¤ë¥¼ ì œê³µí•˜ê³ , ë©”ì¸í…Œì´ë„ˆëŠ” 패치를 리뷰 중, 리뷰 통과, ë˜ëŠ” ë°˜ë ¤ë¨ìœ¼ë¡œ í‘œì‹œí• ìˆ˜ 있다. ëŒ€ë¶€ë¶„ì˜ ì´ëŸ¬í•œ patchwork 사ì´íŠ¸ëŠ” https://patchwork.kernel.org/ ì— ë‚˜ì—´ë˜ì–´ 있다. diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt index 9dcc7c9d52e6..64d932f5dc77 100644 --- a/Documentation/translations/ko_KR/memory-barriers.txt +++ b/Documentation/translations/ko_KR/memory-barriers.txt @@ -91,7 +91,6 @@ Documentation/memory-barriers.txt - 컴파ì¼ëŸ¬ 배리어. - CPU 메모리 배리어. - - MMIO 쓰기 배리어. (*) 암묵ì ì»¤ë„ ë©”ëª¨ë¦¬ 배리어. @@ -103,7 +102,6 @@ Documentation/memory-barriers.txt (*) CPU ê°„ ACQUIRING ë°°ë¦¬ì–´ì˜ íš¨ê³¼. - Acquire vs 메모리 액세스. - - Acquire vs I/O 액세스. 
(*) 메모리 배리어가 필요한 ê³³ @@ -515,14 +513,13 @@ CPU ì—게 ê¸°ëŒ€í• ìˆ˜ 있는 ìµœì†Œí•œì˜ ë³´ìž¥ì‚¬í• ëª‡ê°€ì§€ê°€ 있습니 완료ë˜ê¸° ì „ì— í–‰í•´ì§„ 것처럼 ë³´ì¼ ìˆ˜ 있습니다. ACQUIRE 와 RELEASE 오í¼ë ˆì´ì…˜ì˜ ì‚¬ìš©ì€ ì¼ë°˜ì 으로 다른 메모리 ë°°ë¦¬ì–´ì˜ - í•„ìš”ì„±ì„ ì—†ì•±ë‹ˆë‹¤ (하지만 "MMIO 쓰기 배리어" 서브섹션ì—ì„œ 설명ë˜ëŠ” 예외를 - 알아ë‘세요). ë˜í•œ, RELEASE+ACQUIRE ì¡°í•©ì€ ë²”ìš© 메모리 배리어처럼 ë™ìž‘í• - ê²ƒì„ ë³´ìž¥í•˜ì§€ -않습니다-. 하지만, ì–´ë–¤ ë³€ìˆ˜ì— ëŒ€í•œ RELEASE 오í¼ë ˆì´ì…˜ì„ - 앞서는 메모리 ì•¡ì„¸ìŠ¤ë“¤ì˜ ìˆ˜í–‰ 결과는 ì´ RELEASE 오í¼ë ˆì´ì…˜ì„ ë’¤ì´ì–´ ê°™ì€ - ë³€ìˆ˜ì— ëŒ€í•´ ìˆ˜í–‰ëœ ACQUIRE 오í¼ë ˆì´ì…˜ì„ 뒤따르는 메모리 액세스ì—는 보여질 - ê²ƒì´ ë³´ìž¥ë©ë‹ˆë‹¤. 다르게 ë§í•˜ìžë©´, 주어진 ë³€ìˆ˜ì˜ í¬ë¦¬í‹°ì»¬ 섹션ì—서는, 해당 - ë³€ìˆ˜ì— ëŒ€í•œ ì•žì˜ í¬ë¦¬í‹°ì»¬ 섹션ì—ì„œì˜ ëª¨ë“ ì•¡ì„¸ìŠ¤ë“¤ì´ ì™„ë£Œë˜ì—ˆì„ ê²ƒì„ - 보장합니다. + í•„ìš”ì„±ì„ ì—†ì•±ë‹ˆë‹¤. ë˜í•œ, RELEASE+ACQUIRE ì¡°í•©ì€ ë²”ìš© 메모리 배리어처럼 + ë™ìž‘í• ê²ƒì„ ë³´ìž¥í•˜ì§€ -않습니다-. 하지만, ì–´ë–¤ ë³€ìˆ˜ì— ëŒ€í•œ RELEASE + 오í¼ë ˆì´ì…˜ì„ 앞서는 메모리 ì•¡ì„¸ìŠ¤ë“¤ì˜ ìˆ˜í–‰ 결과는 ì´ RELEASE 오í¼ë ˆì´ì…˜ì„ + ë’¤ì´ì–´ ê°™ì€ ë³€ìˆ˜ì— ëŒ€í•´ ìˆ˜í–‰ëœ ACQUIRE 오í¼ë ˆì´ì…˜ì„ 뒤따르는 메모리 + 액세스ì—는 보여질 ê²ƒì´ ë³´ìž¥ë©ë‹ˆë‹¤. 다르게 ë§í•˜ìžë©´, 주어진 ë³€ìˆ˜ì˜ + í¬ë¦¬í‹°ì»¬ 섹션ì—서는, 해당 ë³€ìˆ˜ì— ëŒ€í•œ ì•žì˜ í¬ë¦¬í‹°ì»¬ 섹션ì—ì„œì˜ ëª¨ë“ + ì•¡ì„¸ìŠ¤ë“¤ì´ ì™„ë£Œë˜ì—ˆì„ ê²ƒì„ ë³´ìž¥í•©ë‹ˆë‹¤. 즉, ACQUIRE 는 ìµœì†Œí•œì˜ "ì·¨ë“" ë™ìž‘처럼, ê·¸ë¦¬ê³ RELEASE 는 ìµœì†Œí•œì˜ "공개" 처럼 ë™ìž‘한다는 ì˜ë¯¸ìž…니다. @@ -1501,8 +1498,6 @@ u ë¡œì˜ ìŠ¤í† ì–´ë¥¼ cpu1() ì˜ v ë¡œë¶€í„°ì˜ ë¡œë“œ ë’¤ì— ì¼ì–´ë‚œ ê²ƒìœ¼ë¡ (*) CPU 메모리 배리어. - (*) MMIO 쓰기 배리어. - 컴파ì¼ëŸ¬ 배리어 --------------- @@ -1909,6 +1904,19 @@ Mandatory ë°°ë¦¬ì–´ë“¤ì€ SMP 시스템ì—ì„œë„ UP 시스템ì—ì„œë„ SMP íš¨ê³ "ì»¤ë„ I/O ë°°ë¦¬ì–´ì˜ íš¨ê³¼" 섹션ì„, consistent memory ì— ëŒ€í•œ ìžì„¸í•œ ë‚´ìš©ì„ ìœ„í•´ì„ Documentation/core-api/dma-api.rst 문서를 ì°¸ê³ í•˜ì„¸ìš”. + (*) pmem_wmb(); + + ì´ê²ƒì€ persistent memory 를 위한 것으로, persistent ì €ìž¥ì†Œì— ê°€í•´ì§„ 변경 + 사í•ì´ í”Œëž«í¼ ì—°ì†ì„± ë„ë©”ì¸ì— ë„ë‹¬í–ˆì„ ê²ƒì„ ë³´ìž¥í•˜ê¸° 위한 것입니다. 
+ + 예를 들어, ìž„ì‹œì ì´ì§€ ì•Šì€ pmem ì˜ì—ìœ¼ë¡œì˜ ì“°ê¸° 후, 우리는 쓰기가 í”Œëž«í¼ + ì—°ì†ì„± ë„ë©”ì¸ì— ë„ë‹¬í–ˆì„ ê²ƒì„ ë³´ìž¥í•˜ê¸° 위해 pmem_wmb() 를 사용합니다. + ì´ëŠ” 쓰기가 뒤따르는 instruction ë“¤ì´ ìœ ë°œí•˜ëŠ” ì–´ë– í•œ ë°ì´í„° 액세스나 + ë°ì´í„° ì „ì†¡ì˜ ì‹œìž‘ ì „ì— persistent ì €ìž¥ì†Œë¥¼ ì—…ë°ì´íŠ¸ í–ˆì„ ê²ƒì„ ë³´ìž¥í•©ë‹ˆë‹¤. + ì´ëŠ” wmb() ì— ì˜í•´ ì´ë¤„지는 순서 ê·œì¹™ì„ í¬í•¨í•©ë‹ˆë‹¤. + + Persistent memory ì—ì„œì˜ ë¡œë“œë¥¼ ìœ„í•´ì„ í˜„ìž¬ì˜ ì½ê¸° 메모리 ë°°ë¦¬ì–´ë¡œë„ ì½ê¸° + 순서를 ë³´ìž¥í•˜ëŠ”ë° ì¶©ë¶„í•©ë‹ˆë‹¤. ========================= 암묵ì ì»¤ë„ ë©”ëª¨ë¦¬ 배리어 diff --git a/Documentation/translations/zh_CN/arm64/amu.rst b/Documentation/translations/zh_CN/arm64/amu.rst new file mode 100644 index 000000000000..bd875f221330 --- /dev/null +++ b/Documentation/translations/zh_CN/arm64/amu.rst @@ -0,0 +1,100 @@ +.. include:: ../disclaimer-zh_CN.rst + +:Original: :ref:`Documentation/arm64/amu.rst <amu_index>` + +Translator: Bailu Lin <bailu.lin@vivo.com> + +================================= +AArch64 Linux ä¸æ‰©å±•çš„活动监控å•å…ƒ +================================= + +作者: Ionela Voinescu <ionela.voinescu@arm.com> + +日期: 2019-09-10 + +本文档简è¦æ述了 AArch64 Linux 支æŒçš„活动监控å•å…ƒçš„规范。 + + +架构总述 +-------- + +活动监控是 ARMv8.4 CPU 架构引入的一个å¯é€‰æ‰©å±•ç‰¹æ€§ã€‚ + +活动监控å•å…ƒ(在æ¯ä¸ª CPU ä¸å®žçŽ°)为系统管ç†æ供了性能计数器。既å¯ä»¥é€š +过系统寄å˜å™¨çš„æ–¹å¼è®¿é—®è®¡æ•°å™¨ï¼ŒåŒæ—¶ä¹Ÿæ”¯æŒå¤–部内å˜æ˜ å°„çš„æ–¹å¼è®¿é—®è®¡æ•°å™¨ã€‚ + +AMUv1 架构实现了一个由4个固定的64ä½äº‹ä»¶è®¡æ•°å™¨ç»„æˆçš„计数器组。 + + - CPU å‘¨æœŸè®¡æ•°å™¨ï¼šåŒ CPU 的频率增长 + - 常é‡è®¡æ•°å™¨ï¼šåŒå›ºå®šçš„系统时钟频率增长 + - 淘汰指令计数器: åŒæ¯æ¬¡æž¶æž„指令执行增长 + - 内å˜åœé¡¿å‘¨æœŸè®¡æ•°å™¨ï¼šè®¡ç®—由在时钟域内的最åŽä¸€çº§ç¼“å˜ä¸æœªå‘½ä¸è€Œå¼•èµ· + 的指令调度åœé¡¿å‘¨æœŸæ•° + +当处于 WFI 或者 WFE 状æ€æ—¶ï¼Œè®¡æ•°å™¨ä¸ä¼šå¢žé•¿ã€‚ + +AMU 架构æ供了一个高达16ä½çš„事件计数器空间,未æ¥æ–°çš„ AMU 版本ä¸å¯èƒ½ +用它æ¥å®žçŽ°æ–°å¢žçš„事件计数器。 + +å¦å¤–,AMUv1 实现了一个多达16个64ä½è¾…助事件计数器的计数器组。 + +冷å¤ä½æ—¶æ‰€æœ‰çš„计数器会清零。 + + +åŸºæœ¬æ”¯æŒ +-------- + +å†…æ ¸å¯ä»¥å®‰å…¨åœ°è¿è¡Œåœ¨æ”¯æŒ AMU å’Œä¸æ”¯æŒ AMU çš„ CPU 组åˆä¸ã€‚ +å› æ¤ï¼Œå½“é…ç½® CONFIG_ARM64_AMU_EXTN 
åŽæˆ‘ä»¬æ— æ¡ä»¶ä½¿èƒ½åŽç» +(secondary or hotplugged) CPU 检测和使用这个特性。 + +当在 CPU ä¸Šæ£€æµ‹åˆ°è¯¥ç‰¹æ€§æ—¶ï¼Œæˆ‘ä»¬ä¼šæ ‡è®°ä¸ºç‰¹æ€§å¯ç”¨ä½†æ˜¯ä¸èƒ½ä¿è¯è®¡æ•°å™¨çš„功能, +仅表明有扩展属性。 + +固件(代ç è¿è¡Œåœ¨é«˜å¼‚常级别,例如 arm-tf )需支æŒä»¥ä¸‹åŠŸèƒ½ï¼š + + - æ供低异常级别(EL2 å’Œ EL1)访问 AMU 寄å˜å™¨çš„能力。 + - 使能计数器。如果未使能,它的值应为 0。 + - 在从电æºå…³é—状æ€å¯åŠ¨ CPU å‰æˆ–åŽä¿å˜æˆ–者æ¢å¤è®¡æ•°å™¨ã€‚ + +å½“ä½¿ç”¨ä½¿èƒ½äº†è¯¥ç‰¹æ€§çš„å†…æ ¸å¯åŠ¨ä½†å›ºä»¶æŸå时,访问计数器寄å˜å™¨å¯èƒ½ä¼šéé‡ +panic 或者æ»é”。å³ä½¿æœªå‘现这些症状,计数器寄å˜å™¨è¿”回的数æ®ç»“果并ä¸ä¸€ +定能åæ˜ çœŸå®žæƒ…å†µã€‚é€šå¸¸ï¼Œè®¡æ•°å™¨ä¼šè¿”å›ž 0,表明他们未被使能。 + +如果固件没有æ供适当的支æŒæœ€å¥½å…³é— CONFIG_ARM64_AMU_EXTN。 +值得注æ„çš„æ˜¯ï¼Œå‡ºäºŽå®‰å…¨åŽŸå› ï¼Œä¸è¦ç»•è¿‡ AMUSERRENR_EL0 设置而æ•èŽ·ä»Ž +EL0(用户空间) 访问 EL1(å†…æ ¸ç©ºé—´)。 å› æ¤ï¼Œå›ºä»¶åº”该确ä¿è®¿é—® AMU寄å˜å™¨ +ä¸ä¼šå›°åœ¨ EL2或EL3。 + +AMUv1 的固定计数器å¯ä»¥é€šè¿‡å¦‚下系统寄å˜å™¨è®¿é—®ï¼š + + - SYS_AMEVCNTR0_CORE_EL0 + - SYS_AMEVCNTR0_CONST_EL0 + - SYS_AMEVCNTR0_INST_RET_EL0 + - SYS_AMEVCNTR0_MEM_STALL_EL0 + +特定辅助计数器å¯ä»¥é€šè¿‡ SYS_AMEVCNTR1_EL0(n) 访问,其ä¸n介于0到15。 + +详细信æ¯å®šä¹‰åœ¨ç›®å½•ï¼šarch/arm64/include/asm/sysreg.h。 + + +用户空间访问 +------------ + +ç”±äºŽä»¥ä¸‹åŽŸå› ï¼Œå½“å‰ç¦æ¢ä»Žç”¨æˆ·ç©ºé—´è®¿é—® AMU 的寄å˜å™¨ï¼š + + - å®‰å…¨å› æ•°ï¼šå¯èƒ½ä¼šæš´éœ²å¤„于安全模å¼æ‰§è¡Œçš„代ç ä¿¡æ¯ã€‚ + - æ„愿:AMU 是用于系统管ç†çš„。 + +åŒæ ·ï¼Œè¯¥åŠŸèƒ½å¯¹ç”¨æˆ·ç©ºé—´ä¸å¯è§ã€‚ + + +虚拟化 +------ + +ç”±äºŽä»¥ä¸‹åŽŸå› ï¼Œå½“å‰ç¦æ¢ä»Ž KVM 客户端的用户空间(EL0)å’Œå†…æ ¸ç©ºé—´(EL1) +访问 AMU 的寄å˜å™¨ï¼š + + - å®‰å…¨å› æ•°ï¼šå¯èƒ½ä¼šæš´éœ²ç»™å…¶ä»–客户端或主机端执行的代ç ä¿¡æ¯ã€‚ + +任何试图访问 AMU 寄å˜å™¨çš„行为都会触å‘一个注册在客户端的未定义异常。 diff --git a/Documentation/translations/zh_CN/arm64/index.rst b/Documentation/translations/zh_CN/arm64/index.rst new file mode 100644 index 000000000000..646ed1f7aea3 --- /dev/null +++ b/Documentation/translations/zh_CN/arm64/index.rst @@ -0,0 +1,16 @@ +.. include:: ../disclaimer-zh_CN.rst + +:Original: :ref:`Documentation/arm64/index.rst <arm64_index>` +:Translator: Bailu Lin <bailu.lin@vivo.com> + +.. _cn_arm64_index: + + +========== +ARM64 架构 +========== + +.. 
toctree:: + :maxdepth: 2 + + amu diff --git a/Documentation/translations/zh_CN/filesystems/sysfs.txt b/Documentation/translations/zh_CN/filesystems/sysfs.txt index 9481e3ed2a06..046cc1d52058 100644 --- a/Documentation/translations/zh_CN/filesystems/sysfs.txt +++ b/Documentation/translations/zh_CN/filesystems/sysfs.txt @@ -154,14 +154,13 @@ sysfs ä¼šä¸ºè¿™ä¸ªç±»åž‹è°ƒç”¨é€‚å½“çš„æ–¹æ³•ã€‚å½“ä¸€ä¸ªæ–‡ä»¶è¢«è¯»å†™æ—¶ï¼Œè¿ ç¤ºä¾‹: -#define to_dev(obj) container_of(obj, struct device, kobj) #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, char *buf) { struct device_attribute *dev_attr = to_dev_attr(attr); - struct device *dev = to_dev(kobj); + struct device *dev = kobj_to_dev(kobj); ssize_t ret = -EIO; if (dev_attr->show) diff --git a/Documentation/translations/zh_CN/index.rst b/Documentation/translations/zh_CN/index.rst index 85643e46e308..be6f11176200 100644 --- a/Documentation/translations/zh_CN/index.rst +++ b/Documentation/translations/zh_CN/index.rst @@ -19,6 +19,7 @@ admin-guide/index process/index filesystems/index + arm64/index ç›®å½•å’Œè¡¨æ ¼ ---------- diff --git a/Documentation/virt/index.rst b/Documentation/virt/index.rst index de1ab81df958..d20490292642 100644 --- a/Documentation/virt/index.rst +++ b/Documentation/virt/index.rst @@ -8,7 +8,7 @@ Linux Virtualization Support :maxdepth: 2 kvm/index - uml/user_mode_linux + uml/user_mode_linux_howto_v2 paravirt_ops guest-halt-polling diff --git a/Documentation/virt/kvm/amd-memory-encryption.rst b/Documentation/virt/kvm/amd-memory-encryption.rst index 2d44388438cc..09a8f2a34e39 100644 --- a/Documentation/virt/kvm/amd-memory-encryption.rst +++ b/Documentation/virt/kvm/amd-memory-encryption.rst @@ -53,11 +53,11 @@ key management interface to perform common hypervisor activities such as encrypting bootstrap code, snapshot, migrating and debugging the guest. 
 For more information, see the SEV Key Management spec [api-spec]_

-The main ioctl to access SEV is KVM_MEM_ENCRYPT_OP. If the argument
-to KVM_MEM_ENCRYPT_OP is NULL, the ioctl returns 0 if SEV is enabled
+The main ioctl to access SEV is KVM_MEMORY_ENCRYPT_OP. If the argument
+to KVM_MEMORY_ENCRYPT_OP is NULL, the ioctl returns 0 if SEV is enabled
 and ``ENOTTY` if it is disabled (on some older versions of Linux,
 the ioctl runs normally even with a NULL argument, and therefore will
-likely return ``EFAULT``). If non-NULL, the argument to KVM_MEM_ENCRYPT_OP
+likely return ``EFAULT``). If non-NULL, the argument to KVM_MEMORY_ENCRYPT_OP
 must be a struct kvm_sev_cmd::

        struct kvm_sev_cmd {
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 51191b56e61c..1f26d83e6b16 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -4211,7 +4211,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.

 :Capability: basic
 :Architectures: x86
-:Type: system
+:Type: vm
 :Parameters: an opaque platform specific structure (in/out)
 :Returns: 0 on success; -1 on error

@@ -4343,7 +4343,7 @@ Errors:
   #define KVM_STATE_NESTED_VMX_SMM_GUEST_MODE 0x00000001
   #define KVM_STATE_NESTED_VMX_SMM_VMXON 0x00000002

-#define KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE 0x00000001
+  #define KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE 0x00000001

   struct kvm_vmx_nested_state_hdr {
         __u64 vmxon_pa;
diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index a7dff9186bed..9150e9d1c39b 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -78,7 +78,7 @@ KVM_FEATURE_PV_SEND_IPI           11          guest checks this feature bit
                                               before enabling paravirtualized
                                               sebd IPIs

-KVM_FEATURE_PV_POLL_CONTROL       12          host-side polling on HLT can
+KVM_FEATURE_POLL_CONTROL          12          host-side polling on HLT can
                                               be disabled by writing
                                               to msr 0x4b564d05.
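The automarkup.py hunks earlier in this merge make ``markup_refs()`` collect matches from three patterns, sort them by start offset, and dispatch on ``m.re``. A minimal sketch of that flow, assuming nothing beyond the standard library, might look like the following. The three regexes are copied from the diff; ``label_c_ref()``, ``label_doc_ref()`` and ``split_refs()`` are hypothetical stand-ins that return plain tuples instead of docutils cross-reference nodes, and the ``Skipfuncs`` check is left out.

```python
import re
from itertools import chain

# Patterns copied verbatim from the automarkup.py hunk in this merge.
RE_function = re.compile(r'(([\w_][\w\d_]+)\(\))')
RE_type = re.compile(r'(struct|union|enum|typedef)\s+([\w_][\w\d_]+)')
RE_doc = re.compile(r'Documentation(/[\w\-_/]+)(\.\w+)*')

# Hypothetical stand-ins for markup_c_ref() and markup_doc_ref(): they only
# label the match rather than resolving a Sphinx cross reference.
def label_c_ref(match):
    return ('c', match.group(2))      # group(2) is the bare identifier

def label_doc_ref(match):
    return ('doc', match.group(1))    # group(1) is the path, extension dropped

def split_refs(text):
    """Split text into plain strings and labelled references, in order."""
    handlers = {RE_type: label_c_ref,
                RE_function: label_c_ref,
                RE_doc: label_doc_ref}
    # Gather matches from every pattern, then order them by start offset.
    matches = sorted(chain(*(rx.finditer(text) for rx in handlers)),
                     key=lambda m: m.start())
    out, done = [], 0
    for m in matches:
        # Keep any text before the match as a plain string.
        if m.start() > done:
            out.append(text[done:m.start()])
        # m.re is the compiled pattern that produced this match, so it can
        # be used directly as the dispatch key.
        out.append(handlers[m.re](m))
        done = m.end()
    if done < len(text):
        out.append(text[done:])
    return out
```

Calling ``split_refs("See struct device and kobj_to_dev() in Documentation/trace/kprobes.rst section 1.3.1.")`` should yield the plain-text pieces interleaved with ``('c', 'device')``, ``('c', 'kobj_to_dev')`` and ``('doc', '/trace/kprobes')``. Keying the handler table on the compiled pattern is what lets one loop serve all three reference kinds.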
diff --git a/Documentation/virt/uml/user_mode_linux.rst b/Documentation/virt/uml/user_mode_linux.rst deleted file mode 100644 index de0f0b2c9d5b..000000000000 --- a/Documentation/virt/uml/user_mode_linux.rst +++ /dev/null @@ -1,4403 +0,0 @@ -.. SPDX-License-Identifier: GPL-2.0 - -===================== -User Mode Linux HOWTO -===================== - -:Author: User Mode Linux Core Team -:Last-updated: Sat Jan 25 16:07:55 CET 2020 - -This document describes the use and abuse of Jeff Dike's User Mode -Linux: a port of the Linux kernel as a normal Intel Linux process. - - -.. Table of Contents - - 1. Introduction - - 1.1 How is User Mode Linux Different? - 1.2 Why Would I Want User Mode Linux? - - 2. Compiling the kernel and modules - - 2.1 Compiling the kernel - 2.2 Compiling and installing kernel modules - 2.3 Compiling and installing uml_utilities - - 3. Running UML and logging in - - 3.1 Running UML - 3.2 Logging in - 3.3 Examples - - 4. UML on 2G/2G hosts - - 4.1 Introduction - 4.2 The problem - 4.3 The solution - - 5. Setting up serial lines and consoles - - 5.1 Specifying the device - 5.2 Specifying the channel - 5.3 Examples - - 6. Setting up the network - - 6.1 General setup - 6.2 Userspace daemons - 6.3 Specifying ethernet addresses - 6.4 UML interface setup - 6.5 Multicast - 6.6 TUN/TAP with the uml_net helper - 6.7 TUN/TAP with a preconfigured tap device - 6.8 Ethertap - 6.9 The switch daemon - 6.10 Slip - 6.11 Slirp - 6.12 pcap - 6.13 Setting up the host yourself - - 7. Sharing Filesystems between Virtual Machines - - 7.1 A warning - 7.2 Using layered block devices - 7.3 Note! - 7.4 Another warning - 7.5 uml_moo : Merging a COW file with its backing file - - 8. Creating filesystems - - 8.1 Create the filesystem file - 8.2 Assign the file to a UML device - 8.3 Creating and mounting the filesystem - - 9. Host file access - - 9.1 Using hostfs - 9.2 hostfs as the root filesystem - 9.3 Building hostfs - - 10. 
The Management Console - 10.1 version - 10.2 halt and reboot - 10.3 config - 10.4 remove - 10.5 sysrq - 10.6 help - 10.7 cad - 10.8 stop - 10.9 go - - 11. Kernel debugging - - 11.1 Starting the kernel under gdb - 11.2 Examining sleeping processes - 11.3 Running ddd on UML - 11.4 Debugging modules - 11.5 Attaching gdb to the kernel - 11.6 Using alternate debuggers - - 12. Kernel debugging examples - - 12.1 The case of the hung fsck - 12.2 Episode 2: The case of the hung fsck - - 13. What to do when UML doesn't work - - 13.1 Strange compilation errors when you build from source - 13.2 (obsolete) - 13.3 A variety of panics and hangs with /tmp on a reiserfs filesystem - 13.4 The compile fails with errors about conflicting types for 'open', 'dup', and 'waitpid' - 13.5 UML doesn't work when /tmp is an NFS filesystem - 13.6 UML hangs on boot when compiled with gprof support - 13.7 syslogd dies with a SIGTERM on startup - 13.8 TUN/TAP networking doesn't work on a 2.4 host - 13.9 You can network to the host but not to other machines on the net - 13.10 I have no root and I want to scream - 13.11 UML build conflict between ptrace.h and ucontext.h - 13.12 The UML BogoMips is exactly half the host's BogoMips - 13.13 When you run UML, it immediately segfaults - 13.14 xterms appear, then immediately disappear - 13.15 Any other panic, hang, or strange behavior - - 14. Diagnosing Problems - - 14.1 Case 1 : Normal kernel panics - 14.2 Case 2 : Tracing thread panics - 14.3 Case 3 : Tracing thread panics caused by other threads - 14.4 Case 4 : Hangs - - 15. Thanks - - 15.1 Code and Documentation - 15.2 Flushing out bugs - 15.3 Buglets and clean-ups - 15.4 Case Studies - 15.5 Other contributions - - -1. Introduction -================ - - Welcome to User Mode Linux. It's going to be fun. - - - -1.1. How is User Mode Linux Different? 
---------------------------------------- - - Normally, the Linux Kernel talks straight to your hardware (video - card, keyboard, hard drives, etc), and any programs which run ask the - kernel to operate the hardware, like so:: - - - - +-----------+-----------+----+ - | Process 1 | Process 2 | ...| - +-----------+-----------+----+ - | Linux Kernel | - +----------------------------+ - | Hardware | - +----------------------------+ - - - - - The User Mode Linux Kernel is different; instead of talking to the - hardware, it talks to a `real` Linux kernel (called the `host kernel` - from now on), like any other program. Programs can then run inside - User-Mode Linux as if they were running under a normal kernel, like - so:: - - - - +----------------+ - | Process 2 | ...| - +-----------+----------------+ - | Process 1 | User-Mode Linux| - +----------------------------+ - | Linux Kernel | - +----------------------------+ - | Hardware | - +----------------------------+ - - - - - -1.2. Why Would I Want User Mode Linux? ---------------------------------------- - - - 1. If User Mode Linux crashes, your host kernel is still fine. - - 2. You can run a usermode kernel as a non-root user. - - 3. You can debug the User Mode Linux like any normal process. - - 4. You can run gprof (profiling) and gcov (coverage testing). - - 5. You can play with your kernel without breaking things. - - 6. You can use it as a sandbox for testing new apps. - - 7. You can try new development kernels safely. - - 8. You can run different distributions simultaneously. - - 9. It's extremely fun. - - - -.. _Compiling_the_kernel_and_modules: - -2. Compiling the kernel and modules -==================================== - - - - -2.1. Compiling the kernel --------------------------- - - - Compiling the user mode kernel is just like compiling any other - kernel. - - - 1. 
Download the latest kernel from your favourite kernel mirror, - such as: - - https://mirrors.edge.kernel.org/pub/linux/kernel/v5.x/linux-5.4.14.tar.xz - - 2. Make a directory and unpack the kernel into it:: - - host% - mkdir ~/uml - - host% - cd ~/uml - - host% - tar xvf linux-5.4.14.tar.xz - - - 3. Run your favorite config; ``make xconfig ARCH=um`` is the most - convenient. ``make config ARCH=um`` and ``make menuconfig ARCH=um`` - will work as well. The defaults will give you a useful kernel. If - you want to change something, go ahead, it probably won't hurt - anything. - - - Note: If the host is configured with a 2G/2G address space split - rather than the usual 3G/1G split, then the packaged UML binaries - will not run. They will immediately segfault. See - :ref:`UML_on_2G/2G_hosts` for the scoop on running UML on your system. - - - - 4. Finish with ``make linux ARCH=um``: the result is a file called - ``linux`` in the top directory of your source tree. - - -2.2. Compiling and installing kernel modules ---------------------------------------------- - - UML modules are built in the same way as the native kernel (with the - exception of the 'ARCH=um' that you always need for UML):: - - - host% make modules ARCH=um - - - - - Any modules that you want to load into this kernel need to be built in - the user-mode pool. Modules from the native kernel won't work. - - You can install them by using ftp or something to copy them into the - virtual machine and dropping them into ``/lib/modules/$(uname -r)``. - - You can also get the kernel build process to install them as follows: - - 1. with the kernel not booted, mount the root filesystem in the top - level of the kernel pool:: - - - host% mount root_fs mnt -o loop - - - - - - - 2. run:: - - - host% - make modules_install INSTALL_MOD_PATH=`pwd`/mnt ARCH=um - - - - - - - 3. unmount the filesystem:: - - - host% umount mnt - - - - - - - 4. 
boot the kernel on it - - - When the system is booted, you can use insmod as usual to get the - modules into the kernel. A number of things have been loaded into UML - as modules, especially filesystems and network protocols and filters, - so most symbols which need to be exported probably already are. - However, if you do find symbols that need exporting, let us - know at http://user-mode-linux.sourceforge.net/, and - they'll be "taken care of". - - - -2.3. Compiling and installing uml_utilities --------------------------------------------- - - Many features of the UML kernel require a user-space helper program, - so a uml_utilities package is distributed separately from the kernel - patch which provides these helpers. Included within this is: - - - port-helper - Used by consoles which connect to xterms or ports - - - tunctl - Configuration tool to create and delete tap devices - - - uml_net - Setuid binary for automatic tap device configuration - - - uml_switch - User-space virtual switch required for daemon - transport - - The uml_utilities tree is compiled with:: - - - host# - make && make install - - - - - Note that UML kernel patches may require a specific version of the - uml_utilities distribution. If you don't keep up with the mailing - lists, ensure that you have the latest release of uml_utilities if you - are experiencing problems with your UML kernel, particularly when - dealing with consoles or command-line switches to the helper programs - - - - - - - - -3. Running UML and logging in -============================== - - - -3.1. Running UML ------------------ - - It runs on 2.2.15 or later, and all kernel versions since 2.4. - - - Booting UML is straightforward. Simply run 'linux': it will try to - mount the file ``root_fs`` in the current directory. You do not need to - run it as root. If your root filesystem is not named ``root_fs``, then - you need to put a ``ubd0=root_fs_whatever`` switch on the linux command - line. 
- - - You will need a filesystem to boot UML from. There are a number - available for download from http://user-mode-linux.sourceforge.net. - There are also several tools at - http://user-mode-linux.sourceforge.net/ which can be - used to generate UML-compatible filesystem images from media. - The kernel will boot up and present you with a login prompt. - - -Note: - If the host is configured with a 2G/2G address space split - rather than the usual 3G/1G split, then the packaged UML binaries will - not run. They will immediately segfault. See :ref:`UML_on_2G/2G_hosts` - for the scoop on running UML on your system. - - - -3.2. Logging in ----------------- - - - - The prepackaged filesystems have a root account with password 'root' - and a user account with password 'user'. The login banner will - generally tell you how to log in. So, you log in and you will find - yourself inside a little virtual machine. Our filesystems have a - variety of commands and utilities installed (and it is fairly easy to - add more), so you will have a lot of tools with which to poke around - the system. - - There are a couple of other ways to log in: - - - On a virtual console - - - - Each virtual console that is configured (i.e. the device exists in - /dev and /etc/inittab runs a getty on it) will come up in its own - xterm. If you get tired of the xterms, read - :ref:`setting_up_serial_lines_and_consoles` to see how to attach - the consoles to something else, like host ptys. - - - - - Over the serial line - - - In the boot output, find a line that looks like:: - - - - serial line 0 assigned pty /dev/ptyp1 - - - - - Attach your favorite terminal program to the corresponding tty. I.e. - for minicom, the command would be:: - - - host% minicom -o -p /dev/ttyp1 - - - - - - - - Over the net - - - If the network is running, then you can telnet to the virtual - machine and log in to it. See :ref:`Setting_up_the_network` to learn - about setting up a virtual network. 
- - When you're done using it, run halt, and the kernel will bring itself - down and the process will exit. - - -3.3. Examples --------------- - - Here are some examples of UML in action: - - - A login session http://user-mode-linux.sourceforge.net/old/login.html - - - A virtual network http://user-mode-linux.sourceforge.net/old/net.html - - - - - -.. _UML_on_2G/2G_hosts: - -4. UML on 2G/2G hosts -====================== - - - - -4.1. Introduction ------------------- - - - Most Linux machines are configured so that the kernel occupies the - upper 1G (0xc0000000 - 0xffffffff) of the 4G address space and - processes use the lower 3G (0x00000000 - 0xbfffffff). However, some - machine are configured with a 2G/2G split, with the kernel occupying - the upper 2G (0x80000000 - 0xffffffff) and processes using the lower - 2G (0x00000000 - 0x7fffffff). - - - - -4.2. The problem ------------------ - - - The prebuilt UML binaries on this site will not run on 2G/2G hosts - because UML occupies the upper .5G of the 3G process address space - (0xa0000000 - 0xbfffffff). Obviously, on 2G/2G hosts, this is right - in the middle of the kernel address space, so UML won't even load - it - will immediately segfault. - - - - -4.3. The solution ------------------- - - - The fix for this is to rebuild UML from source after enabling - CONFIG_HOST_2G_2G (under 'General Setup'). This will cause UML to - load itself in the top .5G of that smaller process address space, - where it will run fine. See :ref:`Compiling_the_kernel_and_modules` if - you need help building UML from source. - - - - - - - -.. _setting_up_serial_lines_and_consoles: - - -5. Setting up serial lines and consoles -======================================== - - - It is possible to attach UML serial lines and consoles to many types - of host I/O channels by specifying them on the command line. - - - You can attach them to host ptys, ttys, file descriptors, and ports. 
- This allows you to do things like: - - - have a UML console appear on an unused host console, - - - hook two virtual machines together by having one attach to a pty - and having the other attach to the corresponding tty - - - make a virtual machine accessible from the net by attaching a - console to a port on the host. - - - The general format of the command line option is ``device=channel``. - - - -5.1. Specifying the device ---------------------------- - - Devices are specified with "con" or "ssl" (console or serial line, - respectively), optionally with a device number if you are talking - about a specific device. - - - Using just "con" or "ssl" describes all of the consoles or serial - lines. If you want to talk about console #3 or serial line #10, they - would be "con3" and "ssl10", respectively. - - - A specific device name will override a less general "con=" or "ssl=". - So, for example, you can assign a pty to each of the serial lines - except for the first two like this:: - - - ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1 - - - - - The specificity of the device name is all that matters; order on the - command line is irrelevant. - - - -5.2. Specifying the channel ----------------------------- - - There are a number of different types of channels to attach a UML - device to, each with a different way of specifying exactly what to - attach to. - - - pseudo-terminals - device=pty pts terminals - device=pts - - - This will cause UML to allocate a free host pseudo-terminal for the - device. The terminal that it got will be announced in the boot - log. 
You access it by attaching a terminal program to the - corresponding tty: - - - screen /dev/pts/n - - - screen /dev/ttyxx - - - minicom -o -p /dev/ttyxx - minicom seems not able to handle pts - devices - - - kermit - start it up, 'open' the device, then 'connect' - - - - - - - terminals - device=tty:tty device file - - - This will make UML attach the device to the specified tty (i.e:: - - - con1=tty:/dev/tty3 - - - - - will attach UML's console 1 to the host's /dev/tty3). If the tty that - you specify is the slave end of a tty/pty pair, something else must - have already opened the corresponding pty in order for this to work. - - - - - - - xterms - device=xterm - - - UML will run an xterm and the device will be attached to it. - - - - - - - Port - device=port:port number - - - This will attach the UML devices to the specified host port. - Attaching console 1 to the host's port 9000 would be done like - this:: - - - con1=port:9000 - - - - - Attaching all the serial lines to that port would be done similarly:: - - - ssl=port:9000 - - - - - You access these devices by telnetting to that port. Each active - telnet session gets a different device. If there are more telnets to a - port than UML devices attached to it, then the extra telnet sessions - will block until an existing telnet detaches, or until another device - becomes active (i.e. by being activated in /etc/inittab). - - This channel has the advantage that you can both attach multiple UML - devices to it and know how to access them without reading the UML boot - log. It is also unique in allowing access to a UML from remote - machines without requiring that the UML be networked. This could be - useful in allowing public access to UMLs because they would be - accessible from the net, but wouldn't need any kind of network - filtering or access control because they would have no network access. - - - If you attach the main console to a portal, then the UML boot will - appear to hang. 
In reality, it's waiting for a telnet to connect, at - which point the boot will proceed. - - - - - - - already-existing file descriptors - device=file descriptor - - - If you set up a file descriptor on the UML command line, you can - attach a UML device to it. This is most commonly used to put the - main console back on stdin and stdout after assigning all the other - consoles to something else:: - - - con0=fd:0,fd:1 con=pts - - - - - - - - - - Nothing - device=null - - - This allows the device to be opened, in contrast to 'none', but - reads will block, and writes will succeed and the data will be - thrown out. - - - - - - - None - device=none - - - This causes the device to disappear. - - - - You can also specify different input and output channels for a device - by putting a comma between them:: - - - ssl3=tty:/dev/tty2,xterm - - - - - will cause serial line 3 to accept input on the host's /dev/tty2 and - display output on an xterm. That's a silly example - the most common - use of this syntax is to reattach the main console to stdin and stdout - as shown above. - - - If you decide to move the main console away from stdin/stdout, the - initial boot output will appear in the terminal that you're running - UML in. However, once the console driver has been officially - initialized, then the boot output will start appearing wherever you - specified that console 0 should be. That device will receive all - subsequent output. - - - -5.3. Examples --------------- - - There are a number of interesting things you can do with this - capability. 
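One illustration first, as a sketch only: the override rule from section 5.1 (a specific name such as "ssl1=" beats the generic "ssl=", and order on the command line does not matter) can be modelled in a few lines of Python. The function name ``resolve_channel`` is hypothetical and this is not UML's actual option parser; it only mirrors the rule as the text states it.

```python
import re


def resolve_channel(device, options):
    """Return the channel chosen for `device` (e.g. "ssl3" or "con0").

    `options` is a list of "device=channel" strings as they would appear
    on the UML command line. Hypothetical helper for illustration only.
    """
    match = re.fullmatch(r"(con|ssl)(\d+)", device)
    if match is None:
        raise ValueError(f"not a numbered con/ssl device: {device!r}")
    generic = match.group(1)  # "con" or "ssl"

    # Build a lookup table; position in the list plays no role afterwards.
    table = dict(opt.split("=", 1) for opt in options)

    # The most specific name wins; fall back to the generic one, if any.
    if device in table:
        return table[device]
    return table.get(generic)


if __name__ == "__main__":
    opts = "ssl=pty ssl0=tty:/dev/tty0 ssl1=tty:/dev/tty1".split()
    for dev in ("ssl0", "ssl1", "ssl5"):
        print(dev, "->", resolve_channel(dev, opts))
```

With the option string from section 5.1, serial lines 0 and 1 resolve to their ttys and every other serial line falls back to ``pty``, whichever order the switches are given in.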
- - - First, this is how you get rid of those bleeding console xterms by - attaching them to host ptys:: - - - con=pty con0=fd:0,fd:1 - - - - - This will make a UML console take over an unused host virtual console, - so that when you switch to it, you will see the UML login prompt - rather than the host login prompt:: - - - con1=tty:/dev/tty6 - - - - - You can attach two virtual machines together with what amounts to a - serial line as follows: - - Run one UML with a serial line attached to a pty:: - - - ssl1=pty - - - - - Look at the boot log to see what pty it got (this example will assume - that it got /dev/ptyp1). - - Boot the other UML with a serial line attached to the corresponding - tty:: - - - ssl1=tty:/dev/ttyp1 - - - - - Log in, make sure that it has no getty on that serial line, attach a - terminal program like minicom to it, and you should see the login - prompt of the other virtual machine. - - -.. _setting_up_the_network: - -6. Setting up the network -========================== - - - - This page describes how to set up the various transports and to - provide a UML instance with network access to the host, other machines - on the local net, and the rest of the net. - - - As of 2.4.5, UML networking has been completely redone to make it much - easier to set up, fix bugs, and add new features. - - - There is a new helper, uml_net, which does the host setup that - requires root privileges. - - - There are currently five transport types available for a UML virtual - machine to exchange packets with other hosts: - - - ethertap - - - TUN/TAP - - - Multicast - - - a switch daemon - - - slip - - - slirp - - - pcap - - The TUN/TAP, ethertap, slip, and slirp transports allow a UML - instance to exchange packets with the host. They may be directed - to the host or the host may just act as a router to provide access - to other physical or virtual machines. 
- - - The pcap transport is a synthetic read-only interface, using the - libpcap binary to collect packets from interfaces on the host and - filter them. This is useful for building preconfigured traffic - monitors or sniffers. - - - The daemon and multicast transports provide a completely virtual - network to other virtual machines. This network is completely - disconnected from the physical network unless one of the virtual - machines on it is acting as a gateway. - - - With so many host transports, which one should you use? Here's when - you should use each one: - - - ethertap - if you want access to the host networking and it is - running 2.2 - - - TUN/TAP - if you want access to the host networking and it is - running 2.4. Also, the TUN/TAP transport is able to use a - preconfigured device, allowing it to avoid using the setuid uml_net - helper, which is a security advantage. - - - Multicast - if you want a purely virtual network and you don't want - to set up anything but the UML - - - a switch daemon - if you want a purely virtual network and you - don't mind running the daemon in order to get somewhat better - performance - - - slip - there is no particular reason to run the slip backend unless - ethertap and TUN/TAP are just not available for some reason - - - slirp - if you don't have root access on the host to setup - networking, or if you don't want to allocate an IP to your UML - - - pcap - not much use for actual network connectivity, but great for - monitoring traffic on the host - - Ethertap is available on 2.4 and works fine. TUN/TAP is preferred - to it because it has better performance and ethertap is officially - considered obsolete in 2.4. Also, the root helper only needs to - run occasionally for TUN/TAP, rather than handling every packet, as - it does with ethertap. This is a slight security advantage since - it provides fewer opportunities for a nasty UML user to somehow - exploit the helper's root privileges. - - -6.1. 
General setup -------------------- - - First, you must have the virtual network enabled in your UML. If are - running a prebuilt kernel from this site, everything is already - enabled. If you build the kernel yourself, under the "Network device - support" menu, enable "Network device support", and then the three - transports. - - - The next step is to provide a network device to the virtual machine. - This is done by describing it on the kernel command line. - - The general format is:: - - - eth <n> = <transport> , <transport args> - - - - - For example, a virtual ethernet device may be attached to a host - ethertap device as follows:: - - - eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254 - - - - - This sets up eth0 inside the virtual machine to attach itself to the - host /dev/tap0, assigns it an ethernet address, and assigns the host - tap0 interface an IP address. - - - - Note that the IP address you assign to the host end of the tap device - must be different than the IP you assign to the eth device inside UML. - If you are short on IPs and don't want to consume two per UML, then - you can reuse the host's eth IP address for the host ends of the tap - devices. Internally, the UMLs must still get unique IPs for their eth - devices. You can also give the UMLs non-routable IPs (192.168.x.x or - 10.x.x.x) and have the host masquerade them. This will let outgoing - connections work, but incoming connections won't without more work, - such as port forwarding from the host. - Also note that when you configure the host side of an interface, it is - only acting as a gateway. It will respond to pings sent to it - locally, but is not useful to do that since it's a host interface. - You are not talking to the UML when you ping that interface and get a - response. - - - You can also add devices to a UML and remove them at runtime. See the - :ref:`The_Management_Console` page for details. - - - The sections below describe this in more detail. 
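As a rough illustration of the ``eth<n>=<transport>,<transport args>`` format described above, the following Python sketch splits such a switch into its parts. The helper name ``parse_eth_option`` and the returned dictionary layout are assumptions made for this example; the sketch does not validate transport names or addresses and is not how the kernel parses its command line.

```python
def parse_eth_option(option):
    """Split an "eth<n>=<transport>,<args...>" switch into its fields.

    Hypothetical helper for illustration; empty fields (as in
    "eth0=tuntap,,,192.168.0.254") are kept as empty strings.
    """
    device, sep, rest = option.partition("=")
    device = device.strip()
    if not sep or not device.startswith("eth") or not device[3:].isdigit():
        raise ValueError(f"expected eth<n>=..., got {option!r}")

    # First comma-separated field is the transport, the rest are its args.
    transport, *args = [field.strip() for field in rest.split(",")]
    if not transport:
        raise ValueError(f"no transport given in {option!r}")
    return {"device": device, "transport": transport, "args": args}


if __name__ == "__main__":
    print(parse_eth_option("eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254"))
```

For the ethertap example, the three arguments come back in order: the host tap device, the ethernet address, and the IP address for the host end of the tap device.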
- - - Once you've decided how you're going to set up the devices, you boot - UML, log in, configure the UML side of the devices, and set up routes - to the outside world. At that point, you will be able to talk to any - other machines, physical or virtual, on the net. - - - If ifconfig inside UML fails and the network refuses to come up, run - tell you what went wrong. - - - -6.2. Userspace daemons ------------------------ - - You will likely need the setuid helper, or the switch daemon, or both. - They are both installed with the RPM and deb, so if you've installed - either, you can skip the rest of this section. - - - If not, then you need to check them out of CVS, build them, and - install them. The helper is uml_net, in CVS /tools/uml_net, and the - daemon is uml_switch, in CVS /tools/uml_router. They are both built - with a plain 'make'. Both need to be installed in a directory that's - in your path - /usr/bin is recommend. On top of that, uml_net needs - to be setuid root. - - - -6.3. Specifying ethernet addresses ------------------------------------ - - Below, you will see that the TUN/TAP, ethertap, and daemon interfaces - allow you to specify hardware addresses for the virtual ethernet - devices. This is generally not necessary. If you don't have a - specific reason to do it, you probably shouldn't. If one is not - specified on the command line, the driver will assign one based on the - device IP address. It will provide the address fe:fd:nn:nn:nn:nn - where nn.nn.nn.nn is the device IP address. This is nearly always - sufficient to guarantee a unique hardware address for the device. 
A - couple of exceptions are: - - - Another set of virtual ethernet devices are on the same network and - they are assigned hardware addresses using a different scheme which - may conflict with the UML IP address-based scheme - - - You aren't going to use the device for IP networking, so you don't - assign the device an IP address - - If you let the driver provide the hardware address, you should make - sure that the device IP address is known before the interface is - brought up. So, inside UML, this will guarantee that:: - - - - UML# - ifconfig eth0 192.168.0.250 up - - - - - If you decide to assign the hardware address yourself, make sure that - the first byte of the address is even. Addresses with an odd first - byte are broadcast addresses, which you don't want assigned to a - device. - - - -6.4. UML interface setup -------------------------- - - Once the network devices have been described on the command line, you - should boot UML and log in. - - - The first thing to do is bring the interface up:: - - - UML# ifconfig ethn ip-address up - - - - - You should be able to ping the host at this point. - - - To reach the rest of the world, you should set a default route to the - host:: - - - UML# route add default gw host ip - - - - - Again, with host ip of 192.168.0.4:: - - - UML# route add default gw 192.168.0.4 - - - - - This page used to recommend setting a network route to your local net. - This is wrong, because it will cause UML to try to figure out hardware - addresses of the local machines by arping on the interface to the - host. Since that interface is basically a single strand of ethernet - with two nodes on it (UML and the host) and arp requests don't cross - networks, they will fail to elicit any responses. So, what you want - is for UML to just blindly throw all packets at the host and let it - figure out what to do with them, which is what leaving out the network - route and adding the default route does. 
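Going back to the hardware addresses of section 6.3 for a moment: the default scheme (fe:fd:nn:nn:nn:nn, where nn.nn.nn.nn is the device IP address) and the "first byte must be even" rule can be sketched in Python as below. The function names are hypothetical, and the two-digit lowercase hex formatting is an assumption of this example rather than something the driver is documented to print.

```python
import ipaddress


def default_hwaddr(ip):
    """Build the fe:fd:nn:nn:nn:nn address described in section 6.3.

    Assumption: each IP octet is written as two lowercase hex digits.
    """
    octets = ipaddress.IPv4Address(ip).packed  # four raw bytes
    return "fe:fd:" + ":".join(f"{byte:02x}" for byte in octets)


def first_byte_is_even(mac):
    """True if the first byte of `mac` is even, as the text requires
    for an address you assign to a device yourself."""
    return int(mac.split(":")[0], 16) % 2 == 0


if __name__ == "__main__":
    mac = default_hwaddr("192.168.0.250")
    print(mac, first_byte_is_even(mac))
```

Since 0xfe is even, any address generated this way passes the check; a hand-picked address starting with, say, 01 would not.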
- - - Note: If you can't communicate with other hosts on your physical - ethernet, it's probably because of a network route that's - automatically set up. If you run 'route -n' and see a route that - looks like this:: - - - - - Destination Gateway Genmask Flags Metric Ref Use Iface - 192.168.0.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0 - - - - - with a mask that's not 255.255.255.255, then replace it with a route - to your host:: - - - UML# - route del -net 192.168.0.0 dev eth0 netmask 255.255.255.0 - - - UML# - route add -host 192.168.0.4 dev eth0 - - - - - This, plus the default route to the host, will allow UML to exchange - packets with any machine on your ethernet. - - - -6.5. Multicast ---------------- - - The simplest way to set up a virtual network between multiple UMLs is - to use the mcast transport. This was written by Harald Welte and is - present in UML version 2.4.5-5um and later. Your system must have - multicast enabled in the kernel and there must be a multicast-capable - network device on the host. Normally, this is eth0, but if there is - no ethernet card on the host, then you will likely get strange error - messages when you bring the device up inside UML. - - - To use it, run two UMLs with:: - - - eth0=mcast - - - - - on their command lines. Log in, configure the ethernet device in each - machine with different IP addresses:: - - - UML1# ifconfig eth0 192.168.0.254 - - - UML2# ifconfig eth0 192.168.0.253 - - - - - and they should be able to talk to each other. - - The full set of command line options for this transport are:: - - - - ethn=mcast,ethernet address,multicast - address,multicast port,ttl - - - - There is also a related point-to-point only "ucast" transport. - This is useful when your network does not support multicast, and - all network connections are simple point to point links. - - The full set of command line options for this transport are:: - - - ethn=ucast,ethernet address,remote address,listen port,remote port - - - - -6.6. 
TUN/TAP with the uml_net helper -------------------------------------- - - TUN/TAP is the preferred mechanism on 2.4 to exchange packets with the - host. The TUN/TAP backend has been in UML since 2.4.9-3um. - - - The easiest way to get up and running is to let the setuid uml_net - helper do the host setup for you. This involves insmod-ing the tun.o - module if necessary, configuring the device, and setting up IP - forwarding, routing, and proxy arp. If you are new to UML networking, - do this first. If you're concerned about the security implications of - the setuid helper, use it to get up and running, then read the next - section to see how to have UML use a preconfigured tap device, which - avoids the use of uml_net. - - - If you specify an IP address for the host side of the device, the - uml_net helper will do all necessary setup on the host - the only - requirement is that TUN/TAP be available, either built in to the host - kernel or as the tun.o module. - - The format of the command line switch to attach a device to a TUN/TAP - device is:: - - - eth <n> =tuntap,,, <IP address> - - - - - For example, this argument will attach the UML's eth0 to the next - available tap device and assign an ethernet address to it based on its - IP address:: - - - eth0=tuntap,,,192.168.0.254 - - - - - - - Note that the IP address that must be used for the eth device inside - UML is fixed by the routing and proxy arp that is set up on the - TUN/TAP device on the host. You can use a different one, but it won't - work because reply packets won't reach the UML. This is a feature. - It prevents a nasty UML user from doing things like setting the UML IP - to the same as the network's nameserver or mail server. - - - There are a couple potential problems with running the TUN/TAP - transport on a 2.4 host kernel - - - TUN/TAP seems not to work on 2.4.3 and earlier. Upgrade the host - kernel or use the ethertap transport. 
- - - With an upgraded kernel, TUN/TAP may fail with:: - - - File descriptor in bad state - - - - - This is due to a header mismatch between the upgraded kernel and the - kernel that was originally installed on the machine. The fix is to - make sure that /usr/src/linux points to the headers for the running - kernel. - - These were pointed out by Tim Robinson <timro at trkr dot net> in the past. - - - -6.7. TUN/TAP with a preconfigured tap device ---------------------------------------------- - - If you prefer not to have UML use uml_net (which is somewhat - insecure), with UML 2.4.17-11, you can set up a TUN/TAP device - beforehand. The setup needs to be done as root, but once that's done, - there is no need for root assistance. Setting up the device is done - as follows: - - - Create the device with tunctl (available from the UML utilities - tarball):: - - - - - host# tunctl -u uid - - - - - where uid is the user id or username that UML will be run as. This - will tell you what device was created. - - - Configure the device IP (change IP addresses and device name to - suit):: - - - - - host# ifconfig tap0 192.168.0.254 up - - - - - - - Set up routing and arping if desired - this is my recipe, there are - other ways of doing the same thing:: - - - host# - bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward' - - host# - route add -host 192.168.0.253 dev tap0 - - host# - bash -c 'echo 1 > /proc/sys/net/ipv4/conf/tap0/proxy_arp' - - host# - arp -Ds 192.168.0.253 eth0 pub - - - - - Note that this must be done every time the host boots - this configu- - ration is not stored across host reboots. So, it's probably a good - idea to stick it in an rc file. An even better idea would be a little - utility which reads the information from a config file and sets up - devices at boot time. 
-
-  -  Rather than using up two IPs and ARPing for one of them, you can
-     also provide direct access to your LAN by the UML by using a
-     bridge::
-
-       host# brctl addbr br0
-       host# ifconfig eth0 0.0.0.0 promisc up
-       host# ifconfig tap0 0.0.0.0 promisc up
-       host# ifconfig br0 192.168.0.1 netmask 255.255.255.0 up
-       host# brctl stp br0 off
-       host# brctl setfd br0 1
-       host# brctl sethello br0 1
-       host# brctl addif br0 eth0
-       host# brctl addif br0 tap0
-
-     Note that 'br0' should be set up using ifconfig with the existing IP
-     address of eth0, as eth0 no longer has its own IP.
-
-  -  Also, the /dev/net/tun device must be writable by the user running
-     UML in order for the UML to use the device that's been configured
-     for it.  The simplest thing to do is::
-
-       host# chmod 666 /dev/net/tun
-
-     Making it world-writable looks bad, but it seems not to be
-     exploitable as a security hole.  However, it does allow anyone to
-     create useless tap devices (useless because they can't configure
-     them), which is a DoS attack.  A somewhat more secure alternative
-     would be to create a group containing all the users who have
-     preconfigured tap devices and chgrp /dev/net/tun to that group with
-     mode 664 or 660.
-
-  -  Once the device is set up, run UML with 'eth0=tuntap,device name'
-     (e.g. 'eth0=tuntap,tap0') on the command line (or do it with the
-     mconsole config command).
-
-  -  Bring the eth device up in UML and you're in business.
-
-  If you don't want that tap device any more, you can make it
-  non-persistent with::
-
-       host# tunctl -d tap device
-
-  Finally, tunctl has a -b (for brief mode) switch which causes it to
-  output only the name of the tap device it created.  This makes it
-  suitable for capture by a script::
-
-       host# TAP=`tunctl -u 1000 -b`
-
-6.8.
Ethertap --------------- - - Ethertap is the general mechanism on 2.2 for userspace processes to - exchange packets with the kernel. - - - - To use this transport, you need to describe the virtual network device - on the UML command line. The general format for this is:: - - - eth <n> =ethertap, <device> , <ethernet address> , <tap IP address> - - - - - So, the previous example:: - - - eth0=ethertap,tap0,fe:fd:0:0:0:1,192.168.0.254 - - - - - attaches the UML eth0 device to the host /dev/tap0, assigns it the - ethernet address fe:fd:0:0:0:1, and assigns the IP address - 192.168.0.254 to the tap device. - - - - The tap device is mandatory, but the others are optional. If the - ethernet address is omitted, one will be assigned to it. - - - The presence of the tap IP address will cause the helper to run and do - whatever host setup is needed to allow the virtual machine to - communicate with the outside world. If you're not sure you know what - you're doing, this is the way to go. - - - If it is absent, then you must configure the tap device and whatever - arping and routing you will need on the host. However, even in this - case, the uml_net helper still needs to be in your path and it must be - setuid root if you're not running UML as root. This is because the - tap device doesn't support SIGIO, which UML needs in order to use - something as a source of input. So, the helper is used as a - convenient asynchronous IO thread. - - If you're using the uml_net helper, you can ignore the following host - setup - uml_net will do it for you. You just need to make sure you - have ethertap available, either built in to the host kernel or - available as a module. - - - If you want to set things up yourself, you need to make sure that the - appropriate /dev entry exists. 
If it doesn't, become root and create - it as follows:: - - - mknod /dev/tap <minor> c 36 <minor> + 16 - - - - - For example, this is how to create /dev/tap0:: - - - mknod /dev/tap0 c 36 0 + 16 - - - - - You also need to make sure that the host kernel has ethertap support. - If ethertap is enabled as a module, you apparently need to insmod - ethertap once for each ethertap device you want to enable. So,:: - - - host# - insmod ethertap - - - - - will give you the tap0 interface. To get the tap1 interface, you need - to run:: - - - host# - insmod ethertap unit=1 -o ethertap1 - - - - - - - -6.9. The switch daemon ------------------------ - - Note: This is the daemon formerly known as uml_router, but which was - renamed so the network weenies of the world would stop growling at me. - - - The switch daemon, uml_switch, provides a mechanism for creating a - totally virtual network. By default, it provides no connection to the - host network (but see -tap, below). - - - The first thing you need to do is run the daemon. Running it with no - arguments will make it listen on a default pair of unix domain - sockets. - - - If you want it to listen on a different pair of sockets, use:: - - - -unix control socket data socket - - - - - - If you want it to act as a hub rather than a switch, use:: - - - -hub - - - - - - If you want the switch to be connected to host networking (allowing - the umls to get access to the outside world through the host), use:: - - - -tap tap0 - - - - - - Note that the tap device must be preconfigured (see "TUN/TAP with a - preconfigured tap device", above). If you're using a different tap - device than tap0, specify that instead of tap0. - - - uml_switch can be backgrounded as follows:: - - - host% - uml_switch [ options ] < /dev/null > /dev/null - - - - - The reason it doesn't background by default is that it listens to - stdin for EOF. When it sees that, it exits. 
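The uml_switch options described above can be put together in a small wrapper. This is a minimal sketch, not part of the UML utilities: the function name `uml_switch_cmd` is hypothetical, it only prints the command line rather than running it, and it assumes that `-hub` and `-tap` may be given in the same invocation.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a uml_switch command line.
# Usage: uml_switch_cmd [hub] [tap=DEVICE]
uml_switch_cmd() {
    cmd="uml_switch"
    for opt in "$@"; do
        case "$opt" in
            hub)   cmd="$cmd -hub" ;;                   # act as a hub, not a switch
            tap=*) cmd="$cmd -tap ${opt#tap=}" ;;       # attach a preconfigured tap device
            *)     echo "unknown option: $opt" >&2; return 1 ;;
        esac
    done
    # Redirections as given in the text for backgrounding the daemon.
    printf '%s < /dev/null > /dev/null\n' "$cmd"
}

uml_switch_cmd tap=tap0
```

Printing the command first makes it easy to review before it is run on the host.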
- - - The general format of the kernel command line switch is:: - - - - ethn=daemon,ethernet address,socket - type,control socket,data socket - - - - - You can leave off everything except the 'daemon'. You only need to - specify the ethernet address if the one that will be assigned to it - isn't acceptable for some reason. The rest of the arguments describe - how to communicate with the daemon. You should only specify them if - you told the daemon to use different sockets than the default. So, if - you ran the daemon with no arguments, running the UML on the same - machine with:: - - eth0=daemon - - - - - will cause the eth0 driver to attach itself to the daemon correctly. - - - -6.10. Slip ------------ - - Slip is another, less general, mechanism for a process to communicate - with the host networking. In contrast to the ethertap interface, - which exchanges ethernet frames with the host and can be used to - transport any higher-level protocol, it can only be used to transport - IP. - - - The general format of the command line switch is:: - - - - ethn=slip,slip IP - - - - - The slip IP argument is the IP address that will be assigned to the - host end of the slip device. If it is specified, the helper will run - and will set up the host so that the virtual machine can reach it and - the rest of the network. - - - There are some oddities with this interface that you should be aware - of. You should only specify one slip device on a given virtual - machine, and its name inside UML will be 'umn', not 'eth0' or whatever - you specified on the command line. These problems will be fixed at - some point. - - - -6.11. Slirp ------------- - - slirp uses an external program, usually /usr/bin/slirp, to provide IP - only networking connectivity through the host. This is similar to IP - masquerading with a firewall, although the translation is performed in - user-space, rather than by the kernel. 
As slirp does not set up any
-  interfaces on the host or change routing, it does not require
-  root access or setuid binaries on the host.
-
-  The general format of the command line switch for slirp is::
-
-       ethn=slirp,ethernet address,slirp path
-
-  The ethernet address is optional, as UML will set up the interface
-  with an ethernet address based upon the initial IP address of the
-  interface.  The slirp path is generally /usr/bin/slirp, although it
-  will depend on distribution.
-
-  The slirp program can have a number of options passed to the command
-  line and we can't add them to the UML command line, as they will be
-  parsed incorrectly.  Instead, a wrapper shell script can be written or
-  the options inserted into the ~/.slirprc file.  More information on
-  all of the slirp options can be found in its man pages.
-
-  The eth0 interface on UML should be set up with the IP 10.0.2.15,
-  although you can use anything as long as it is not used by a network
-  you will be connecting to.  The default route on UML should be set to
-  use::
-
-       UML# route add default dev eth0
-
-  slirp provides a number of useful IP addresses which can be used by
-  UML, such as 10.0.2.3 which is an alias for the DNS server specified
-  in /etc/resolv.conf on the host or the IP given in the 'dns' option
-  for slirp.
-
-  Even with a baudrate setting higher than 115200, the slirp connection
-  is limited to 115200.  If you need it to go faster, the slirp binary
-  needs to be compiled with FULL_BOLT defined in config.h.
-
-6.12. pcap
------------
-
-  The pcap transport is attached to a UML ethernet device on the command
-  line or with uml_mconsole with the following syntax::
-
-       ethn=pcap,host interface,filter expression,option1,option2
-
-  The expression and options are optional.
-
-  The interface is whatever network device on the host you want to
-  sniff.
The expression is a pcap filter expression, which is also what
-  tcpdump uses, so if you know how to specify tcpdump filters, you will
-  use the same expressions here.  The options are up to two of
-  'promisc', 'optimize', and 'nooptimize'.  'promisc' controls whether
-  pcap puts the host interface into promiscuous mode.  'optimize' and
-  'nooptimize' control whether the pcap expression optimizer is used.
-
-  Example::
-
-       eth0=pcap,eth0,tcp
-       eth1=pcap,eth0,!tcp
-
-  will cause the UML eth0 to emit all tcp packets on the host eth0 and
-  the UML eth1 to emit all non-tcp packets on the host eth0.
-
-6.13. Setting up the host yourself
------------------------------------
-
-  If you don't specify an address for the host side of the ethertap or
-  slip device, UML won't do any setup on the host.  So this is what is
-  needed to get things working (the examples use a host-side IP of
-  192.168.0.251 and a UML-side IP of 192.168.0.250 - adjust to suit your
-  own network):
-
-  -  The device needs to be configured with its IP address.  Tap devices
-     are also configured with an mtu of 1484.  Slip devices are
-     configured with a point-to-point address pointing at the UML IP
-     address::
-
-       host# ifconfig tap0 arp mtu 1484 192.168.0.251 up
-       host# ifconfig sl0 192.168.0.251 pointopoint 192.168.0.250 up
-
-  -  If a tap device is being set up, a route is set to the UML IP::
-
-       host# route add -host 192.168.0.250 gw 192.168.0.251
-
-  -  To allow other hosts on your network to see the virtual machine,
-     proxy arp is set up for it::
-
-       host# arp -Ds 192.168.0.250 eth0 pub
-
-  -  Finally, the host is set up to route packets::
-
-       host# echo 1 > /proc/sys/net/ipv4/ip_forward
-
-7. Sharing Filesystems between Virtual Machines
-================================================
-
-7.1. A warning
----------------
-
-  Don't attempt to share filesystems simply by booting two UMLs from the
-  same file.
That's the same thing as booting two physical machines - from a shared disk. It will result in filesystem corruption. - - - -7.2. Using layered block devices ---------------------------------- - - The way to share a filesystem between two virtual machines is to use - the copy-on-write (COW) layering capability of the ubd block driver. - As of 2.4.6-2um, the driver supports layering a read-write private - device over a read-only shared device. A machine's writes are stored - in the private device, while reads come from either device - the - private one if the requested block is valid in it, the shared one if - not. Using this scheme, the majority of data which is unchanged is - shared between an arbitrary number of virtual machines, each of which - has a much smaller file containing the changes that it has made. With - a large number of UMLs booting from a large root filesystem, this - leads to a huge disk space saving. It will also help performance, - since the host will be able to cache the shared data using a much - smaller amount of memory, so UML disk requests will be served from the - host's memory rather than its disks. - - - - - To add a copy-on-write layer to an existing block device file, simply - add the name of the COW file to the appropriate ubd switch:: - - - ubd0=root_fs_cow,root_fs_debian_22 - - - - - where 'root_fs_cow' is the private COW file and 'root_fs_debian_22' is - the existing shared filesystem. The COW file need not exist. If it - doesn't, the driver will create and initialize it. Once the COW file - has been initialized, it can be used on its own on the command line:: - - - ubd0=root_fs_cow - - - - - The name of the backing file is stored in the COW file header, so it - would be redundant to continue specifying it on the command line. - - - -7.3. Note! 
------------ - - When checking the size of the COW file in order to see the gobs of - space that you're saving, make sure you use 'ls -ls' to see the actual - disk consumption rather than the length of the file. The COW file is - sparse, so the length will be very different from the disk usage. - Here is a 'ls -l' of a COW file and backing file from one boot and - shutdown:: - - host% ls -l cow.debian debian2.2 - -rw-r--r-- 1 jdike jdike 492504064 Aug 6 21:16 cow.debian - -rwxrw-rw- 1 jdike jdike 537919488 Aug 6 20:42 debian2.2 - - - - - Doesn't look like much saved space, does it? Well, here's 'ls -ls':: - - - host% ls -ls cow.debian debian2.2 - 880 -rw-r--r-- 1 jdike jdike 492504064 Aug 6 21:16 cow.debian - 525832 -rwxrw-rw- 1 jdike jdike 537919488 Aug 6 20:42 debian2.2 - - - - - Now, you can see that the COW file has less than a meg of disk, rather - than 492 meg. - - - -7.4. Another warning ---------------------- - - Once a filesystem is being used as a readonly backing file for a COW - file, do not boot directly from it or modify it in any way. Doing so - will invalidate any COW files that are using it. The mtime and size - of the backing file are stored in the COW file header at its creation, - and they must continue to match. If they don't, the driver will - refuse to use the COW file. - - - - - If you attempt to evade this restriction by changing either the - backing file or the COW header by hand, you will get a corrupted - filesystem. - - - - - Among other things, this means that upgrading the distribution in a - backing file and expecting that all of the COW files using it will see - the upgrade will not work. - - - - -7.5. uml_moo : Merging a COW file with its backing file --------------------------------------------------------- - - Depending on how you use UML and COW devices, it may be advisable to - merge the changes in the COW file into the backing file every once in - a while. - - - - - The utility that does this is uml_moo. 
Its usage is:: - - - host% uml_moo COW file new backing file - - - - - There's no need to specify the backing file since that information is - already in the COW file header. If you're paranoid, boot the new - merged file, and if you're happy with it, move it over the old backing - file. - - - - - uml_moo creates a new backing file by default as a safety measure. It - also has a destructive merge option which will merge the COW file - directly into its current backing file. This is really only usable - when the backing file only has one COW file associated with it. If - there are multiple COWs associated with a backing file, a -d merge of - one of them will invalidate all of the others. However, it is - convenient if you're short of disk space, and it should also be - noticeably faster than a non-destructive merge. - - - - - uml_moo is installed with the UML deb and RPM. If you didn't install - UML from one of those packages, you can also get it from the UML - utilities http://user-mode-linux.sourceforge.net/utilities tar file - in tools/moo. - - - - - - - - -8. Creating filesystems -======================== - - - You may want to create and mount new UML filesystems, either because - your root filesystem isn't large enough or because you want to use a - filesystem other than ext2. - - - This was written on the occasion of reiserfs being included in the - 2.4.1 kernel pool, and therefore the 2.4.1 UML, so the examples will - talk about reiserfs. This information is generic, and the examples - should be easy to translate to the filesystem of your choice. - - -8.1. Create the filesystem file -================================ - - dd is your friend. All you need to do is tell dd to create an empty - file of the appropriate size. I usually make it sparse to save time - and to avoid allocating disk space until it's actually used. 
For
-  example, the following command will create a sparse file of about 100
-  meg (101 MiB exactly, since dd seeks past 100 blocks of 1 MiB and then
-  writes one more) full of zeroes::
-
-       host% dd if=/dev/zero of=new_filesystem seek=100 count=1 bs=1M
-
-8.2. Assign the file to a UML device
--------------------------------------
-
-  Add an argument like the following to the UML command line::
-
-       ubd4=new_filesystem
-
-  making sure that you use an unassigned ubd device number.
-
-8.3. Creating and mounting the filesystem
-------------------------------------------
-
-  Make sure that the filesystem is available, either by being built into
-  the kernel, or available as a module, then boot up UML and log in.  If
-  the root filesystem doesn't have the filesystem utilities (mkfs, fsck,
-  etc), then get them into UML by way of the net or hostfs.
-
-  Make the new filesystem on the device assigned to the new file::
-
-       UML# mkreiserfs /dev/ubd/4
-
-       <----------- MKREISERFSv2 ----------->
-
-       ReiserFS version 3.6.25
-       Block size 4096 bytes
-       Block count 25856
-       Used blocks 8212
-       Journal - 8192 blocks (18-8209), journal header is in block 8210
-       Bitmaps: 17
-       Root block 8211
-       Hash function "r5"
-       ATTENTION: ALL DATA WILL BE LOST ON '/dev/ubd/4'! (y/n)y
-       journal size 8192 (from 18)
-       Initializing journal - 0%....20%....40%....60%....80%....100%
-       Syncing..done.
-
-  Now, mount it::
-
-       UML# mount /dev/ubd/4 /mnt
-
-  and you're in business.
-
-9. Host file access
-====================
-
-  If you want to access files on the host machine from inside UML, you
-  can treat it as a separate machine and either nfs mount directories
-  from the host or copy files into the virtual machine with scp or rcp.
-  However, since UML is running on the host, it can access those
-  files just like any other process and make them available inside the
-  virtual machine without needing to use the network.
-
-  This is now possible with the hostfs virtual filesystem.  With it, you
-  can mount a host directory into the UML filesystem and access the
-  files contained in it just as you would on the host.
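As a rough illustration of mounting a host directory through hostfs, the sketch below checks a /proc/filesystems-style listing for hostfs and prints the matching mount command. The helper names `has_hostfs` and `hostfs_mount_cmd` are hypothetical, the mount syntax is the `mount none <mount point> -t hostfs [-o <host directory>]` form used in this document, and nothing is actually mounted.

```shell
#!/bin/sh
# Hypothetical helper: succeed if hostfs appears in a
# /proc/filesystems-style listing (last field of a line is the fs name).
has_hostfs() {
    # $1: file to read; defaults to /proc/filesystems
    awk '$NF == "hostfs" { found = 1 } END { exit !found }' "${1:-/proc/filesystems}"
}

# Hypothetical helper: print (not run) the hostfs mount command.
hostfs_mount_cmd() {
    # $1: mount point inside UML; $2: optional host directory
    if [ -n "${2:-}" ]; then
        printf 'mount none %s -t hostfs -o %s\n' "$1" "$2"
    else
        printf 'mount none %s -t hostfs\n' "$1"
    fi
}

hostfs_mount_cmd /mnt/home /home
```

Inside UML, `has_hostfs` with no argument reads the real /proc/filesystems; passing a file name makes the check easy to try out on saved output.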
-
-9.1. Using hostfs
-------------------
-
-  To begin with, make sure that hostfs is available inside the virtual
-  machine with::
-
-       UML# cat /proc/filesystems
-
-  hostfs should be listed.  If it's not, either rebuild the kernel
-  with hostfs configured into it or make sure that hostfs is built as a
-  module and available inside the virtual machine, and insmod it.
-
-  Now all you need to do is run mount::
-
-       UML# mount none /mnt/host -t hostfs
-
-  will mount the host's / on the virtual machine's /mnt/host.
-
-  If you don't want to mount the host root directory, then you can
-  specify a subdirectory to mount with the -o switch to mount::
-
-       UML# mount none /mnt/home -t hostfs -o /home
-
-  will mount the host's /home on the virtual machine's /mnt/home.
-
-9.2. hostfs as the root filesystem
------------------------------------
-
-  It's possible to boot from a directory hierarchy on the host using
-  hostfs rather than using the standard filesystem in a file.
-
-  To start, you need that hierarchy.  The easiest way is to loop mount
-  an existing root_fs file::
-
-       host# mount root_fs uml_root_dir -o loop
-
-  You need to change the filesystem type of / in etc/fstab to be
-  'hostfs', so that line looks like this::
-
-       /dev/ubd/0 / hostfs defaults 1 1
-
-  Then you need to chown to yourself all the files in that directory
-  that are owned by root.  This worked for me::
-
-       host# find . -uid 0 -exec chown jdike {} \;
-
-  Next, make sure that your UML kernel has hostfs compiled in, not as a
-  module.  Then run UML with the boot device pointing at that directory::
-
-       ubd0=/path/to/uml/root/directory
-
-  UML should then boot as it does normally.
-
-9.3.
Building hostfs
----------------------
-
-  If you need to build hostfs because it's not in your kernel, you have
-  two choices:
-
-  -  Compiling hostfs into the kernel:
-
-     Reconfigure the kernel and set the 'Host filesystem' option under
-
-  -  Compiling hostfs as a module:
-
-     Reconfigure the kernel and set the 'Host filesystem' option under
-     be in arch/um/fs/hostfs/hostfs.o.  Install that in
-     ``/lib/modules/$(uname -r)/fs`` in the virtual machine, boot it up, and::
-
-       UML# insmod hostfs
-
-.. _The_Management_Console:
-
-10. The Management Console
-===========================
-
-  The UML management console is a low-level interface to the kernel,
-  somewhat like the i386 SysRq interface.  Since there is a full-blown
-  operating system under UML, there is much greater flexibility possible
-  than with the SysRq mechanism.
-
-  There are a number of things you can do with the mconsole interface:
-
-  -  get the kernel version
-
-  -  add and remove devices
-
-  -  halt or reboot the machine
-
-  -  send SysRq commands
-
-  -  pause and resume the UML
-
-  You need the mconsole client (uml_mconsole) which is present in CVS
-  (/tools/mconsole) in 2.4.5-9um and later, and will be in the RPM in
-  2.4.6.
-
-  You also need CONFIG_MCONSOLE (under 'General Setup') enabled in UML.
-  When you boot UML, you'll see a line like::
-
-       mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole
-
-  If you specify a unique machine id on the UML command line, e.g.::
-
-       umid=debian
-
-  you'll see this::
-
-       mconsole initialized on /home/jdike/.uml/debian/mconsole
-
-  That file is the socket that uml_mconsole will use to communicate with
-  UML.
Run it with either the umid or the full path as its argument:: - - - host% uml_mconsole debian - - - - - or:: - - - host% uml_mconsole /home/jdike/.uml/debian/mconsole - - - - - You'll get a prompt, at which you can run one of these commands: - - - version - - - halt - - - reboot - - - config - - - remove - - - sysrq - - - help - - - cad - - - stop - - - go - - -10.1. version --------------- - - This takes no arguments. It prints the UML version:: - - - (mconsole) version - OK Linux usermode 2.4.5-9um #1 Wed Jun 20 22:47:08 EDT 2001 i686 - - - - - There are a couple actual uses for this. It's a simple no-op which - can be used to check that a UML is running. It's also a way of - sending an interrupt to the UML. This is sometimes useful on SMP - hosts, where there's a bug which causes signals to UML to be lost, - often causing it to appear to hang. Sending such a UML the mconsole - version command is a good way to 'wake it up' before networking has - been enabled, as it does not do anything to the function of the UML. - - - -10.2. halt and reboot ----------------------- - - These take no arguments. They shut the machine down immediately, with - no syncing of disks and no clean shutdown of userspace. So, they are - pretty close to crashing the machine:: - - - (mconsole) halt - OK - - - - - - -10.3. config -------------- - - "config" adds a new device to the virtual machine. Currently the ubd - and network drivers support this. It takes one argument, which is the - device to add, with the same syntax as the kernel command line:: - - - - - (mconsole) - config ubd3=/home/jdike/incoming/roots/root_fs_debian22 - - OK - (mconsole) config eth1=mcast - OK - - - - - - -10.4. remove -------------- - - "remove" deletes a device from the system. Its argument is just the - name of the device to be removed. The device must be idle in whatever - sense the driver considers necessary. 
In the case of the ubd driver, - the removed block device must not be mounted, swapped on, or otherwise - open, and in the case of the network driver, the device must be down:: - - - (mconsole) remove ubd3 - OK - (mconsole) remove eth1 - OK - - - - - - -10.5. sysrq ------------- - - This takes one argument, which is a single letter. It calls the - generic kernel's SysRq driver, which does whatever is called for by - that argument. See the SysRq documentation in - Documentation/admin-guide/sysrq.rst in your favorite kernel tree to - see what letters are valid and what they do. - - - -10.6. help ------------ - - "help" returns a string listing the valid commands and what each one - does. - - - -10.7. cad ----------- - - This invokes the Ctl-Alt-Del action on init. What exactly this ends - up doing is up to /etc/inittab. Normally, it reboots the machine. - With UML, this is usually not desired, so if a halt would be better, - then find the section of inittab that looks like this:: - - - # What to do when CTRL-ALT-DEL is pressed. - ca:12345:ctrlaltdel:/sbin/shutdown -t1 -a -r now - - - - - and change the command to halt. - - - -10.8. stop ------------ - - This puts the UML in a loop reading mconsole requests until a 'go' - mconsole command is received. This is very useful for making backups - of UML filesystems, as the UML can be stopped, then synced via 'sysrq - s', so that everything is written to the filesystem. You can then copy - the filesystem and then send the UML 'go' via mconsole. - - - Note that a UML running with more than one CPU will have problems - after you send the 'stop' command, as only one CPU will be held in a - mconsole loop and all others will continue as normal. This is a bug, - and will be fixed. - - - -10.9. go ---------- - - This resumes a UML after being paused by a 'stop' command. 
Note that
-  when the UML has resumed, TCP connections may have timed out and if
-  the UML is paused for a long period of time, crond might go a little
-  crazy, running all the jobs it didn't do earlier.
-
-.. _Kernel_debugging:
-
-11. Kernel debugging
-=====================
-
-  Note: The interface that makes debugging, as described here, possible
-  is present in 2.4.0-test6 kernels and later.
-
-  Since the user-mode kernel runs as a normal Linux process, it is
-  possible to debug it with gdb almost like any other process.  It is
-  slightly different because the kernel's threads are already being
-  ptraced for system call interception, so gdb can't ptrace them.
-  However, a mechanism has been added to work around that problem.
-
-  In order to debug the kernel, you need to build it from source.  See
-  :ref:`Compiling_the_kernel_and_modules` for information on doing that.
-  Make sure that you enable CONFIG_DEBUGSYM and CONFIG_PT_PROXY during
-  the config.  These will compile the kernel with ``-g``, and enable the
-  ptrace proxy so that gdb works with UML, respectively.
-
-11.1. Starting the kernel under gdb
-------------------------------------
-
-  You can have the kernel running under the control of gdb from the
-  beginning by putting 'debug' on the command line.  You will get an
-  xterm with gdb running inside it.  The kernel will send some commands
-  to gdb which will leave it stopped at the beginning of start_kernel.
-  At this point, you can get things going with 'next', 'step', or
-  'cont'.
-
-  There is a transcript of a debugging session here
-  <debug-session.html>, with breakpoints being set in the scheduler and
-  in an interrupt handler.
-
-11.2. Examining sleeping processes
------------------------------------
-
-  Not every bug is evident in the currently running process.  Sometimes,
-  processes hang in the kernel when they shouldn't because they've
-  deadlocked on a semaphore or something similar.
In this case, when - you ^C gdb and get a backtrace, you will see the idle thread, which - isn't very relevant. - - - What you want is the stack of whatever process is sleeping when it - shouldn't be. You need to figure out which process that is, which is - generally fairly easy. Then you need to get its host process id, - which you can do either by looking at ps on the host or at - task.thread.extern_pid in gdb. - - - Now what you do is this: - - - detach from the current thread:: - - - (UML gdb) det - - - - - - - attach to the thread you are interested in:: - - - (UML gdb) att <host pid> - - - - - - - look at its stack and anything else of interest:: - - - (UML gdb) bt - - - - - Note that you can't do anything at this point that requires that a - process execute, e.g. calling a function - - - when you're done looking at that process, reattach to the current - thread and continue it:: - - - (UML gdb) - att 1 - - - (UML gdb) - c - - - - - Here, specifying any pid which is not the process id of a UML thread - will cause gdb to reattach to the current thread. I commonly use 1, - but any other invalid pid would work. - - - -11.3. Running ddd on UML -------------------------- - - ddd works on UML, but requires a special kludge. The process goes - like this: - - - Start ddd:: - - - host% ddd linux - - - - - - - With ps, get the pid of the gdb that ddd started. You can ask the - gdb to tell you, but for some reason that confuses things and - causes a hang. - - - run UML with 'debug=parent gdb-pid=<pid>' added to the command line - - it will just sit there after you hit return - - - type 'att 1' to the ddd gdb and you will see something like:: - - - 0xa013dc51 in __kill () - - - (gdb) - - - - - - - At this point, type 'c', UML will boot up, and you can use ddd just - as you do on any other process. - - - -11.4. Debugging modules ------------------------- - - - gdb has support for debugging code which is dynamically loaded into - the process. 
This support is what is needed to debug kernel modules
-  under UML.
-
-  Using that support is somewhat complicated.  You have to tell gdb what
-  object file you just loaded into UML and where in memory it is.  Then,
-  it can read the symbol table, and figure out where all the symbols are
-  from the load address that you provided.  It gets more interesting
-  when you load the module again (i.e. after an rmmod).  You have to
-  tell gdb to forget about all its symbols, including the main UML ones
-  for some reason, then load them all back in again.
-
-  There's an easy way and a hard way to do this.  The easy way is to use
-  the umlgdb expect script written by Chandan Kudige.  It basically
-  automates the process for you.
-
-  First, you must tell it where your modules are.  There is a list in
-  the script that looks like this::
-
-       set MODULE_PATHS {
-       "fat" "/usr/src/uml/linux-2.4.18/fs/fat/fat.o"
-       "isofs" "/usr/src/uml/linux-2.4.18/fs/isofs/isofs.o"
-       "minix" "/usr/src/uml/linux-2.4.18/fs/minix/minix.o"
-       }
-
-  You change that to list the names and paths of the modules that you
-  are going to debug.  Then you run it from the toplevel directory of
-  your UML pool and it basically tells you what to do::
-
-       ******** GDB pid is 21903 ********
-       Start UML as: ./linux <kernel switches> debug gdb-pid=21903
-
-       GNU gdb 5.0rh-5 Red Hat Linux 7.1
-       Copyright 2001 Free Software Foundation, Inc.
-       GDB is free software, covered by the GNU General Public License, and you are
-       welcome to change it and/or distribute copies of it under certain conditions.
-       Type "show copying" to see the conditions.
-       There is absolutely no warranty for GDB. Type "show warranty" for details.
-       This GDB was configured as "i386-redhat-linux"...
-       (gdb) b sys_init_module
-       Breakpoint 1 at 0xa0011923: file module.c, line 349.
- (gdb) att 1 - - - - - After you run UML and it sits there doing nothing, you hit return at - the 'att 1' and continue it:: - - - Attaching to program: /home/jdike/linux/2.4/um/./linux, process 1 - 0xa00f4221 in __kill () - (UML gdb) c - Continuing. - - - - - At this point, you debug normally. When you insmod something, the - expect magic will kick in and you'll see something like:: - - - *** Module hostfs loaded *** - Breakpoint 1, sys_init_module (name_user=0x805abb0 "hostfs", - mod_user=0x8070e00) at module.c:349 - 349 char *name, *n_name, *name_tmp = NULL; - (UML gdb) finish - Run till exit from #0 sys_init_module (name_user=0x805abb0 "hostfs", - mod_user=0x8070e00) at module.c:349 - 0xa00e2e23 in execute_syscall (r=0xa8140284) at syscall_kern.c:411 - 411 else res = EXECUTE_SYSCALL(syscall, regs); - Value returned is $1 = 0 - (UML gdb) - p/x (int)module_list + module_list->size_of_struct - - $2 = 0xa9021054 - (UML gdb) symbol-file ./linux - Load new symbol table from "./linux"? (y or n) y - Reading symbols from ./linux... - done. - (UML gdb) - add-symbol-file /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o 0xa9021054 - - add symbol table from file "/home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o" at - .text_addr = 0xa9021054 - (y or n) y - - Reading symbols from /home/jdike/linux/2.4/um/arch/um/fs/hostfs/hostfs.o... - done. 
- (UML gdb) p *module_list - $1 = {size_of_struct = 84, next = 0xa0178720, name = 0xa9022de0 "hostfs", - size = 9016, uc = {usecount = {counter = 0}, pad = 0}, flags = 1, - nsyms = 57, ndeps = 0, syms = 0xa9023170, deps = 0x0, refs = 0x0, - init = 0xa90221f0 <init_hostfs>, cleanup = 0xa902222c <exit_hostfs>, - ex_table_start = 0x0, ex_table_end = 0x0, persist_start = 0x0, - persist_end = 0x0, can_unload = 0, runsize = 0, kallsyms_start = 0x0, - kallsyms_end = 0x0, - archdata_start = 0x1b855 <Address 0x1b855 out of bounds>, - archdata_end = 0xe5890000 <Address 0xe5890000 out of bounds>, - kernel_data = 0xf689c35d <Address 0xf689c35d out of bounds>} - >> Finished loading symbols for hostfs ... - - - - - That's the easy way. It's highly recommended. The hard way is - described below in case you're interested in what's going on. - - - Boot the kernel under the debugger and load the module with insmod or - modprobe. With gdb, do:: - - - (UML gdb) p module_list - - - - - This is a list of modules that have been loaded into the kernel, with - the most recently loaded module first. Normally, the module you want - is at module_list. If it's not, walk down the next links, looking at - the name fields until you find the module you want to debug. Take the - address of that structure, and add module.size_of_struct (which in - 2.4.10 kernels is 96 (0x60)) to it. Gdb can make this hard addition - for you :-):: - - - - (UML gdb) - printf "%#x\n", (int)module_list + module_list->size_of_struct - - - - - The offset from the module start occasionally changes (before 2.4.0, - it was module.size_of_struct + 4), so it's a good idea to check the - init and cleanup addresses once in a while, as described below. Now - do:: - - - (UML gdb) - add-symbol-file /path/to/module/on/host that_address - - - - - Tell gdb you really want to do it, and you're in business. 
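The addition that gdb performs in the hard way can also be sketched outside the debugger. This is only an illustration, not part of umlgdb or the kernel: the function name `module_text_address` is made up here, the sum is wrapped to 32 bits on the assumption of an i386 host, and the structure address 0xa9021000 is not printed in the session above but derived from it (0xa9021054 minus the size_of_struct of 84).

```python
def module_text_address(module_addr: int, size_of_struct: int) -> int:
    """Return the address to pass to add-symbol-file.

    Mirrors the gdb expression
        (int)module_list + module_list->size_of_struct
    with the result wrapped to 32 bits (assumption: 32-bit host).
    """
    return (module_addr + size_of_struct) & 0xFFFFFFFF


if __name__ == "__main__":
    # Values from the hostfs session: size_of_struct = 84; the structure
    # address 0xa9021000 is inferred from the printed sum, not shown directly.
    addr = module_text_address(0xA9021000, 84)
    print(hex(addr))
```

If the value printed here differs from the one gdb computes for the same module, the structure address or size was read off incorrectly.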
- - - If there's any doubt that you got the offset right, like breakpoints - appear not to work, or they're appearing in the wrong place, you can - check it by looking at the module structure. The init and cleanup - fields should look like:: - - - init = 0x588066b0 <init_hostfs>, cleanup = 0x588066c0 <exit_hostfs> - - - - - with no offsets on the symbol names. If the names are right, but they - are offset, then the offset tells you how much you need to add to the - address you gave to add-symbol-file. - - - When you want to load in a new version of the module, you need to get - gdb to forget about the old one. The only way I've found to do that - is to tell gdb to forget about all symbols that it knows about:: - - - (UML gdb) symbol-file - - - - - Then reload the symbols from the kernel binary:: - - - (UML gdb) symbol-file /path/to/kernel - - - - - and repeat the process above. You'll also need to re-enable - breakpoints. They were disabled when you dumped all the symbols because - gdb couldn't figure out where they should go. - - - -11.5. Attaching gdb to the kernel ----------------------------------- - - If you don't have the kernel running under gdb, you can attach gdb to - it later by sending the tracing thread a SIGUSR1. The first line of - the console output identifies its pid:: - - tracing thread pid = 20093 - - - - - When you send it the signal:: - - - host% kill -USR1 20093 - - - - - you will get an xterm with gdb running in it. - - - If you have the mconsole compiled into UML, then the mconsole client - can be used to start gdb:: - - - (mconsole) config gdb=xterm - - - - - will fire up an xterm with gdb running in it. - - - -11.6. Using alternate debuggers --------------------------------- - - UML has support for attaching to an already running debugger rather - than starting gdb itself. This is present in CVS as of 17 Apr 2001. - I sent it to Alan for inclusion in the ac tree, and it will be in my - 2.4.4 release. 
- - - This is useful when gdb is a subprocess of some UI, such as emacs or - ddd. It can also be used to run debuggers other than gdb on UML. - Below is an example of using strace as an alternate debugger. - - - To do this, you need to get the pid of the debugger and pass it in - with the 'gdb-pid=<pid>' switch along with the 'debug' switch. - - - If you are using gdb under some UI, then tell it to 'att 1', and - you'll find yourself attached to UML. - - - If you are using something other than gdb as your debugger, then - you'll need to get it to do the equivalent of 'att 1' if it doesn't do - it automatically. - - - An example of an alternate debugger is strace. You can strace the - actual kernel as follows: - - - Run the following in a shell:: - - - host% - sh -c 'echo pid=$$; echo -n hit return; read x; exec strace -p 1 -o strace.out' - - - - - Run UML with 'debug' and 'gdb-pid=<pid>' with the pid printed out - by the previous command - - - Hit return in the shell, and UML will start running, and strace - output will start accumulating in the output file. - - Note that this is different from running:: - - - host% strace ./linux - - - - - That will strace only the main UML thread, the tracing thread, which - doesn't do any of the actual kernel work. It just oversees the - virtual machine. In contrast, using strace as described above will show - you the low-level activity of the virtual machine. - - - - - -12. Kernel debugging examples -============================== - -12.1. The case of the hung fsck --------------------------------- - - When booting up the kernel, fsck failed, and dropped me into a shell - to fix things up. I ran fsck -y, which hung:: - - - Setting hostname uml [ OK ] - Checking root filesystem - /dev/fhd0 was not cleanly unmounted, check forced. - Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. - - /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. 
- (i.e., without -a or -p options) - [ FAILED ] - - *** An error occurred during the file system check. - *** Dropping you to a shell; the system will reboot - *** when you leave the shell. - Give root password for maintenance - (or type Control-D for normal startup): - - [root@uml /root]# fsck -y /dev/fhd0 - fsck -y /dev/fhd0 - Parallelizing fsck version 1.14 (9-Jan-1999) - e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09 - /dev/fhd0 contains a file system with errors, check forced. - Pass 1: Checking inodes, blocks, and sizes - Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes - - Inode 19780, i_blocks is 1548, should be 540. Fix? yes - - Pass 2: Checking directory structure - Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes - - Directory inode 11858, block 0, offset 0: directory corrupted - Salvage? yes - - Missing '.' in directory inode 11858. - Fix? yes - - Missing '..' in directory inode 11858. - Fix? yes - - - The standard drill in this sort of situation is to fire up gdb on the - signal thread, which, in this case, was pid 1935. In another window, - I run gdb and attach pid 1935:: - - - ~/linux/2.3.26/um 1016: gdb linux - GNU gdb 4.17.0.11 with Linux support - Copyright 1998 Free Software Foundation, Inc. - GDB is free software, covered by the GNU General Public License, and you are - welcome to change it and/or distribute copies of it under certain conditions. - Type "show copying" to see the conditions. - There is absolutely no warranty for GDB. Type "show warranty" for details. - This GDB was configured as "i386-redhat-linux"... 
- - (gdb) att 1935 - Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1935 - 0x100756d9 in __wait4 () - - - Let's see what's currently running:: - - - - (gdb) p current_task.pid - $1 = 0 - - - - - - It's the idle thread, which means that fsck went to sleep for some - reason and never woke up. - - - Let's guess that the last process in the process list is fsck:: - - - - (gdb) p current_task.prev_task.comm - $13 = "fsck.ext2\000\000\000\000\000\000" - - - - - - It is, so let's see what it thinks it's up to:: - - - - (gdb) p current_task.prev_task.thread - $14 = {extern_pid = 1980, tracing = 0, want_tracing = 0, forking = 0, - kernel_stack_page = 0, signal_stack = 1342627840, syscall = {id = 4, args = { - 3, 134973440, 1024, 0, 1024}, have_result = 0, result = 50590720}, - request = {op = 2, u = {exec = {ip = 1350467584, sp = 2952789424}, fork = { - regs = {1350467584, 2952789424, 0 <repeats 15 times>}, sigstack = 0, - pid = 0}, switch_to = 0x507e8000, thread = {proc = 0x507e8000, - arg = 0xaffffdb0, flags = 0, new_pid = 0}, input_request = { - op = 1350467584, fd = -1342177872, proc = 0, pid = 0}}}} - - - - The interesting things here are the fact that its .thread.syscall.id - is __NR_write (see the big switch in arch/um/kernel/syscall_kern.c or - the defines in include/asm-um/arch/unistd.h), and that it never - returned. Also, its .request.op is OP_SWITCH (see - arch/um/include/user_util.h). These mean that it went into a write, - and, for some reason, called schedule(). - - - The fact that it never returned from write means that its stack should - be fairly interesting. Its pid is 1980 (.thread.extern_pid). That - process is being ptraced by the signal thread, so it must be detached - before gdb can attach it:: - - - - (gdb) call detach(1980) - - Program received signal SIGSEGV, Segmentation fault. - <function called from gdb> - The program being debugged stopped while in a function called from GDB. 
- When the function (detach) is done executing, GDB will silently - stop (instead of continuing to evaluate the expression containing - the function call). - (gdb) call detach(1980) - $15 = 0 - - - The first detach segfaults for some reason, and the second one - succeeds. - - - Now I detach from the signal thread, attach to the fsck thread, and - look at its stack:: - - - (gdb) det - Detaching from program: /home/dike/linux/2.3.26/um/linux Pid 1935 - (gdb) att 1980 - Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 1980 - 0x10070451 in __kill () - (gdb) bt - #0 0x10070451 in __kill () - #1 0x10068ccd in usr1_pid (pid=1980) at process.c:30 - #2 0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000) - at process_kern.c:156 - #3 0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000) - at process_kern.c:161 - #4 0x10001d12 in schedule () at core.c:777 - #5 0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71 - #6 0x1006aa10 in __down_failed () at semaphore.c:157 - #7 0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174 - #8 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182 - #9 <signal handler called> - #10 0x10155404 in errno () - #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50 - #12 0x1006c5d8 in segv_handler (sc=0x5006eaf8) at trap_user.c:174 - #13 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182 - #14 <signal handler called> - #15 0xc0fd in ?? () - #16 0x10016647 in sys_write (fd=3, - buf=0x80b8800 <Address 0x80b8800 out of bounds>, count=1024) - at read_write.c:159 - #17 0x1006d5b3 in execute_syscall (syscall=4, args=0x5006ef08) - at syscall_kern.c:254 - #18 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35 - #19 <signal handler called> - #20 0x400dc8b0 in ?? 
() - - - - - - The interesting things here are: - - - There are two segfaults on this stack (frames 9 and 14) - - - The first faulting address (frame 11) is 0x50000800:: - - (gdb) p (void *)1342179328 - $16 = (void *) 0x50000800 - - - - - - The initial faulting address is interesting because it is on the idle - thread's stack. I had been seeing the idle thread segfault for no - apparent reason, and the cause looked like stack corruption. In hopes - of catching the culprit in the act, I had turned off all protections - to that stack while the idle thread wasn't running. This apparently - tripped that trap. - - - However, the more immediate problem is that second segfault and I'm - going to concentrate on that. First, I want to see where the fault - happened, so I have to go look at the sigcontext struct in frame 8:: - - - - (gdb) up - #1 0x10068ccd in usr1_pid (pid=1980) at process.c:30 - 30 kill(pid, SIGUSR1); - (gdb) - #2 0x1006a03f in _switch_to (prev=0x50072000, next=0x507e8000) - at process_kern.c:156 - 156 usr1_pid(getpid()); - (gdb) - #3 0x1006a052 in switch_to (prev=0x50072000, next=0x507e8000, last=0x50072000) - at process_kern.c:161 - 161 _switch_to(prev, next); - (gdb) - #4 0x10001d12 in schedule () at core.c:777 - 777 switch_to(prev, next, prev); - (gdb) - #5 0x1006a744 in __down (sem=0x507d241c) at semaphore.c:71 - 71 schedule(); - (gdb) - #6 0x1006aa10 in __down_failed () at semaphore.c:157 - 157 } - (gdb) - #7 0x1006c5d8 in segv_handler (sc=0x5006e940) at trap_user.c:174 - 174 segv(sc->cr2, sc->err & 2); - (gdb) - #8 0x1006c5ec in kern_segv_handler (sig=11) at trap_user.c:182 - 182 segv_handler(sc); - (gdb) p *sc - Cannot access memory at address 0x0. 
- - - - - That's not very useful, so I'll try a more manual method:: - - - (gdb) p *((struct sigcontext *) (&sig + 1)) - $19 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43, - __dsh = 0, edi = 1342179328, esi = 1350378548, ebp = 1342630440, - esp = 1342630420, ebx = 1348150624, edx = 1280, ecx = 0, eax = 0, - trapno = 14, err = 4, eip = 268480945, cs = 35, __csh = 0, eflags = 66118, - esp_at_signal = 1342630420, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0, - cr2 = 1280} - - - - The ip is in handle_mm_fault:: - - - (gdb) p (void *)268480945 - $20 = (void *) 0x1000b1b1 - (gdb) i sym $20 - handle_mm_fault + 57 in section .text - - - - - - Specifically, it's in pte_alloc:: - - - (gdb) i line *$20 - Line 124 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h" - starts at address 0x1000b1b1 <handle_mm_fault+57> - and ends at 0x1000b1b7 <handle_mm_fault+63>. - - - - - - To find where in handle_mm_fault this is, I'll jump forward in the - code until I see an address in that procedure:: - - - - (gdb) i line *0x1000b1c0 - Line 126 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h" - starts at address 0x1000b1b7 <handle_mm_fault+63> - and ends at 0x1000b1c3 <handle_mm_fault+75>. - (gdb) i line *0x1000b1d0 - Line 131 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h" - starts at address 0x1000b1d0 <handle_mm_fault+88> - and ends at 0x1000b1da <handle_mm_fault+98>. - (gdb) i line *0x1000b1e0 - Line 61 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h" - starts at address 0x1000b1da <handle_mm_fault+98> - and ends at 0x1000b1e1 <handle_mm_fault+105>. - (gdb) i line *0x1000b1f0 - Line 134 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h" - starts at address 0x1000b1f0 <handle_mm_fault+120> - and ends at 0x1000b200 <handle_mm_fault+136>. - (gdb) i line *0x1000b200 - Line 135 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h" - starts at address 0x1000b200 <handle_mm_fault+136> - and ends at 0x1000b208 <handle_mm_fault+144>. 
- (gdb) i line *0x1000b210 - Line 139 of "/home/dike/linux/2.3.26/um/include/asm/pgalloc.h" - starts at address 0x1000b210 <handle_mm_fault+152> - and ends at 0x1000b219 <handle_mm_fault+161>. - (gdb) i line *0x1000b220 - Line 1168 of "memory.c" starts at address 0x1000b21e <handle_mm_fault+166> - and ends at 0x1000b222 <handle_mm_fault+170>. - - - - - - Something is apparently wrong with the page tables or vma_structs, so - let's go back to frame 11 and have a look at them:: - - - - #11 0x1006c0aa in segv (address=1342179328, is_write=2) at trap_kern.c:50 - 50 handle_mm_fault(current, vma, address, is_write); - (gdb) call pgd_offset_proc(vma->vm_mm, address) - $22 = (pgd_t *) 0x80a548c - - - - - - That's pretty bogus. Page tables aren't supposed to be in process - text or data areas. Let's see what's in the vma:: - - - (gdb) p *vma - $23 = {vm_mm = 0x507d2434, vm_start = 0, vm_end = 134512640, - vm_next = 0x80a4f8c, vm_page_prot = {pgprot = 0}, vm_flags = 31200, - vm_avl_height = 2058, vm_avl_left = 0x80a8c94, vm_avl_right = 0x80d1000, - vm_next_share = 0xaffffdb0, vm_pprev_share = 0xaffffe63, - vm_ops = 0xaffffe7a, vm_pgoff = 2952789626, vm_file = 0xafffffec, - vm_private_data = 0x62} - (gdb) p *vma.vm_mm - $24 = {mmap = 0x507d2434, mmap_avl = 0x0, mmap_cache = 0x8048000, - pgd = 0x80a4f8c, mm_users = {counter = 0}, mm_count = {counter = 134904288}, - map_count = 134909076, mmap_sem = {count = {counter = 135073792}, - sleepers = -1342177872, wait = {lock = <optimized out or zero length>, - task_list = {next = 0xaffffe63, prev = 0xaffffe7a}, - __magic = -1342177670, __creator = -1342177300}, __magic = 98}, - page_table_lock = {}, context = 138, start_code = 0, end_code = 0, - start_data = 0, end_data = 0, start_brk = 0, brk = 0, start_stack = 0, - arg_start = 0, arg_end = 0, env_start = 0, env_end = 0, rss = 1350381536, - total_vm = 0, locked_vm = 0, def_flags = 0, cpu_vm_mask = 0, swap_cnt = 0, - swap_address = 0, segments = 0x0} - - - - This is also pretty bogus. 
With all of the 0x80xxxxx and 0xaffffxxx - addresses, this is looking like a stack was plonked down on top of - these structures. Maybe it's a stack overflow from the next page:: - - - (gdb) p vma - $25 = (struct vm_area_struct *) 0x507d2434 - - - - That's towards the lower quarter of the page, so that would have to - have been pretty heavy stack overflow:: - - - (gdb) x/100x $25 - 0x507d2434: 0x507d2434 0x00000000 0x08048000 0x080a4f8c - 0x507d2444: 0x00000000 0x080a79e0 0x080a8c94 0x080d1000 - 0x507d2454: 0xaffffdb0 0xaffffe63 0xaffffe7a 0xaffffe7a - 0x507d2464: 0xafffffec 0x00000062 0x0000008a 0x00000000 - 0x507d2474: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2484: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2494: 0x00000000 0x00000000 0x507d2fe0 0x00000000 - 0x507d24a4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24b4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24c4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24d4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24e4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d24f4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2504: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2514: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2524: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2534: 0x00000000 0x00000000 0x507d25dc 0x00000000 - 0x507d2544: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2554: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2564: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2574: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2584: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d2594: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d25a4: 0x00000000 0x00000000 0x00000000 0x00000000 - 0x507d25b4: 0x00000000 0x00000000 0x00000000 0x00000000 - - - - It's not stack overflow. The only "stack-like" piece of this data is - the vma_struct itself. 
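The remark about where the vma sits within its page comes down to masking the address. A minimal sketch of that arithmetic follows, assuming 4 KiB pages as on i386; the helper names `page_base` and `page_offset` are invented for this illustration, and the two addresses are the vma pointer and the first faulting address from the sessions above.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on an i386 host


def page_base(addr: int) -> int:
    """Address of the first byte of the page containing addr."""
    return addr & ~(PAGE_SIZE - 1)


def page_offset(addr: int) -> int:
    """Distance of addr from the start of its page, in bytes."""
    return addr & (PAGE_SIZE - 1)


if __name__ == "__main__":
    vma = 0x507D2434          # value of `p vma` in the session
    fault = 1342179328        # faulting address passed to segv()
    for name, addr in (("vma", vma), ("fault", fault)):
        print(name, hex(addr), hex(page_base(addr)), page_offset(addr))
```

For the vma this gives an offset of 0x434 (1076 bytes) into the page at 0x507d2000, which is roughly a quarter of the way in; the faulting address 0x50000800 lies 2048 bytes into the page at 0x50000000.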
- - - At this point, I don't see any avenues to pursue, so I just have to - admit that I have no idea what's going on. What I will do, though, is - stick a trap on the segfault handler which will stop if it sees any - writes to the idle thread's stack. That was the thing that happened - first, and it may be that if I can catch it immediately, what's going - on will be somewhat clearer. - - -12.2. Episode 2: The case of the hung fsck -------------------------------------------- - - After setting a trap in the SEGV handler for accesses to the signal - thread's stack, I reran the kernel. - - - fsck hung again, this time by hitting the trap:: - - - - Setting hostname uml [ OK ] - Checking root filesystem - /dev/fhd0 contains a file system with errors, check forced. - Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. - - /dev/fhd0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY. - (i.e., without -a or -p options) - [ FAILED ] - - *** An error occurred during the file system check. - *** Dropping you to a shell; the system will reboot - *** when you leave the shell. - Give root password for maintenance - (or type Control-D for normal startup): - - [root@uml /root]# fsck -y /dev/fhd0 - fsck -y /dev/fhd0 - Parallelizing fsck version 1.14 (9-Jan-1999) - e2fsck 1.14, 9-Jan-1999 for EXT2 FS 0.5b, 95/08/09 - /dev/fhd0 contains a file system with errors, check forced. - Pass 1: Checking inodes, blocks, and sizes - Error reading block 86894 (Attempt to read block from filesystem resulted in short read) while reading indirect blocks of inode 19780. Ignore error? yes - - Pass 2: Checking directory structure - Error reading block 49405 (Attempt to read block from filesystem resulted in short read). Ignore error? yes - - Directory inode 11858, block 0, offset 0: directory corrupted - Salvage? yes - - Missing '.' in directory inode 11858. - Fix? yes - - Missing '..' in directory inode 11858. - Fix? 
yes - - Untested (4127) [100fe44c]: trap_kern.c line 31 - - - - - - I need to get the signal thread to detach from pid 4127 so that I can - attach to it with gdb. This is done by sending it a SIGUSR1, which is - caught by the signal thread, which detaches the process:: - - - kill -USR1 4127 - - - - - - Now I can run gdb on it:: - - - ~/linux/2.3.26/um 1034: gdb linux - GNU gdb 4.17.0.11 with Linux support - Copyright 1998 Free Software Foundation, Inc. - GDB is free software, covered by the GNU General Public License, and you are - welcome to change it and/or distribute copies of it under certain conditions. - Type "show copying" to see the conditions. - There is absolutely no warranty for GDB. Type "show warranty" for details. - This GDB was configured as "i386-redhat-linux"... - (gdb) att 4127 - Attaching to program `/home/dike/linux/2.3.26/um/linux', Pid 4127 - 0x10075891 in __libc_nanosleep () - - - - - - The backtrace shows that it was in a write and that the fault address - (address in frame 3) is 0x50000800, which is right in the middle of - the signal thread's stack page:: - - - (gdb) bt - #0 0x10075891 in __libc_nanosleep () - #1 0x1007584d in __sleep (seconds=1000000) - at ../sysdeps/unix/sysv/linux/sleep.c:78 - #2 0x1006ce9a in stop () at user_util.c:191 - #3 0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31 - #4 0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174 - #5 0x1006c63c in kern_segv_handler (sig=11) at trap_user.c:182 - #6 <signal handler called> - #7 0xc0fd in ?? () - #8 0x10016647 in sys_write (fd=3, buf=0x80b8800 "R.", count=1024) - at read_write.c:159 - #9 0x1006d603 in execute_syscall (syscall=4, args=0x5006ef08) - at syscall_kern.c:254 - #10 0x1006af87 in really_do_syscall (sig=12) at syscall_user.c:35 - #11 <signal handler called> - #12 0x400dc8b0 in ?? () - #13 <signal handler called> - #14 0x400dc8b0 in ?? () - #15 0x80545fd in ?? () - #16 0x804daae in ?? () - #17 0x8054334 in ?? () - #18 0x804d23e in ?? 
() - #19 0x8049632 in ?? () - #20 0x80491d2 in ?? () - #21 0x80596b5 in ?? () - (gdb) p (void *)1342179328 - $3 = (void *) 0x50000800 - - - - Going up the stack to the segv_handler frame and looking at where in - the code the access happened shows that it happened near line 110 of - block_dev.c:: - - - - (gdb) up - #1 0x1007584d in __sleep (seconds=1000000) - at ../sysdeps/unix/sysv/linux/sleep.c:78 - ../sysdeps/unix/sysv/linux/sleep.c:78: No such file or directory. - (gdb) - #2 0x1006ce9a in stop () at user_util.c:191 - 191 while(1) sleep(1000000); - (gdb) - #3 0x1006bf88 in segv (address=1342179328, is_write=2) at trap_kern.c:31 - 31 KERN_UNTESTED(); - (gdb) - #4 0x1006c628 in segv_handler (sc=0x5006eaf8) at trap_user.c:174 - 174 segv(sc->cr2, sc->err & 2); - (gdb) p *sc - $1 = {gs = 0, __gsh = 0, fs = 0, __fsh = 0, es = 43, __esh = 0, ds = 43, - __dsh = 0, edi = 1342179328, esi = 134973440, ebp = 1342631484, - esp = 1342630864, ebx = 256, edx = 0, ecx = 256, eax = 1024, trapno = 14, - err = 6, eip = 268550834, cs = 35, __csh = 0, eflags = 66070, - esp_at_signal = 1342630864, ss = 43, __ssh = 0, fpstate = 0x0, oldmask = 0, - cr2 = 1342179328} - (gdb) p (void *)268550834 - $2 = (void *) 0x1001c2b2 - (gdb) i sym $2 - block_write + 1090 in section .text - (gdb) i line *$2 - Line 209 of "/home/dike/linux/2.3.26/um/include/asm/arch/string.h" - starts at address 0x1001c2a1 <block_write+1073> - and ends at 0x1001c2bf <block_write+1103>. - (gdb) i line *0x1001c2c0 - Line 110 of "block_dev.c" starts at address 0x1001c2bf <block_write+1103> - and ends at 0x1001c2e3 <block_write+1139>. - - - - Looking at the source shows that the fault happened during a call to - copy_from_user to copy the data into the kernel:: - - - 107 count -= chars; - 108 copy_from_user(p,buf,chars); - 109 p += chars; - 110 buf += chars; - - - - p is the pointer which must contain 0x50000800, since buf contains - 0x80b8800 (frame 8 above). 
It is defined as:: - - - p = offset + bh->b_data; - - - - - - I need to figure out what bh is, and it just so happens that bh is - passed as an argument to mark_buffer_uptodate and mark_buffer_dirty a - few lines later, so I do a little disassembly:: - - - (gdb) disas 0x1001c2bf 0x1001c2e0 - Dump of assembler code from 0x1001c2bf to 0x1001c2d0: - 0x1001c2bf <block_write+1103>: addl %eax,0xc(%ebp) - 0x1001c2c2 <block_write+1106>: movl 0xfffffdd4(%ebp),%edx - 0x1001c2c8 <block_write+1112>: btsl $0x0,0x18(%edx) - 0x1001c2cd <block_write+1117>: btsl $0x1,0x18(%edx) - 0x1001c2d2 <block_write+1122>: sbbl %ecx,%ecx - 0x1001c2d4 <block_write+1124>: testl %ecx,%ecx - 0x1001c2d6 <block_write+1126>: jne 0x1001c2e3 <block_write+1139> - 0x1001c2d8 <block_write+1128>: pushl $0x0 - 0x1001c2da <block_write+1130>: pushl %edx - 0x1001c2db <block_write+1131>: call 0x1001819c <__mark_buffer_dirty> - End of assembler dump. - - - - - - At that point, bh is in %edx (address 0x1001c2da), which is calculated - at 0x1001c2c2 as %ebp + 0xfffffdd4, so I figure exactly what that is, - taking %ebp from the sigcontext_struct above:: - - - (gdb) p (void *)1342631484 - $5 = (void *) 0x5006ee3c - (gdb) p 0x5006ee3c+0xfffffdd4 - $6 = 1342630928 - (gdb) p (void *)$6 - $7 = (void *) 0x5006ec10 - (gdb) p *((void **)$7) - $8 = (void *) 0x50100200 - - - - - - Now, I look at the structure to see what's in it, and particularly, - what its b_data field contains:: - - - (gdb) p *((struct buffer_head *)0x50100200) - $13 = {b_next = 0x50289380, b_blocknr = 49405, b_size = 1024, b_list = 0, - b_dev = 15872, b_count = {counter = 1}, b_rdev = 15872, b_state = 24, - b_flushtime = 0, b_next_free = 0x501001a0, b_prev_free = 0x50100260, - b_this_page = 0x501001a0, b_reqnext = 0x0, b_pprev = 0x507fcf58, - b_data = 0x50000800 "", b_page = 0x50004000, - b_end_io = 0x10017f60 <end_buffer_io_sync>, b_dev_id = 0x0, - b_rsector = 98810, b_wait = {lock = <optimized out or zero length>, - task_list = {next = 0x50100248, prev 
= 0x50100248}, __magic = 1343226448, - __creator = 0}, b_kiobuf = 0x0} - - - - - - The b_data field is indeed 0x50000800, so the question becomes how - that happened. The rest of the structure looks fine, so this probably - is not a case of data corruption. It happened on purpose somehow. - - - The b_page field is a pointer to the page_struct representing the - 0x50000000 page. Looking at it shows the kernel's idea of the state - of that page:: - - - - (gdb) p *$13.b_page - $17 = {list = {next = 0x50004a5c, prev = 0x100c5174}, mapping = 0x0, - index = 0, next_hash = 0x0, count = {counter = 1}, flags = 132, lru = { - next = 0x50008460, prev = 0x50019350}, wait = { - lock = <optimized out or zero length>, task_list = {next = 0x50004024, - prev = 0x50004024}, __magic = 1342193708, __creator = 0}, - pprev_hash = 0x0, buffers = 0x501002c0, virtual = 1342177280, - zone = 0x100c5160} - - - - - - Some sanity-checking: the virtual field shows the "virtual" address of - this page, which in this kernel is the same as its "physical" address, - and the page_struct itself should be mem_map[0], since it represents - the first page of memory:: - - - - (gdb) p (void *)1342177280 - $18 = (void *) 0x50000000 - (gdb) p mem_map - $19 = (mem_map_t *) 0x50004000 - - - - - - These check out fine. - - - Now to check out the page_struct itself. In particular, the flags - field shows whether the page is considered free or not:: - - - (gdb) p (void *)132 - $21 = (void *) 0x84 - - - - - - The "reserved" bit is the high bit, which is definitely not set, so - the kernel considers the signal stack page to be free and available to - be used. - - - At this point, I jump to conclusions and start looking at my early - boot code, because that's where that page is supposed to be reserved. 
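The flags check can be reproduced as a small bit test. This is a sketch under stated assumptions: the text only says the "reserved" bit is the high bit, so bit 31 of a 32-bit flags word is assumed here, and the names `PG_RESERVED_BIT`, `set_bits` and `is_reserved` are illustrative rather than taken from the kernel source.

```python
PG_RESERVED_BIT = 31  # assumption: "the high bit" of a 32-bit flags word


def set_bits(flags: int) -> list:
    """Return the positions of all bits set in a 32-bit flags value."""
    return [bit for bit in range(32) if (flags >> bit) & 1]


def is_reserved(flags: int) -> bool:
    """True if the assumed reserved bit is set in flags."""
    return bool((flags >> PG_RESERVED_BIT) & 1)


if __name__ == "__main__":
    flags = 132  # the page's flags field from the gdb session (0x84)
    print(hex(flags), set_bits(flags), is_reserved(flags))
```

With flags = 0x84 only bits 2 and 7 are set, so under this assumption the page is not marked reserved, which matches the conclusion drawn in the text.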
- - - In my setup_arch procedure, I have the following code which looks just - fine:: - - - - bootmap_size = init_bootmem(start_pfn, end_pfn - start_pfn); - free_bootmem(__pa(low_physmem) + bootmap_size, high_physmem - low_physmem); - - - - - - Two stack pages have already been allocated, and low_physmem points to - the third page, which is the beginning of free memory. - The init_bootmem call declares the entire memory to the boot memory - manager, which marks it all reserved. The free_bootmem call frees up - all of it, except for the first two pages. This looks correct to me. - - - So, I decide to see init_bootmem run and make sure that it is marking - those first two pages as reserved. I never get that far. - - - Stepping into init_bootmem, and looking at bootmem_map before looking - at what it contains shows the following:: - - - - (gdb) p bootmem_map - $3 = (void *) 0x50000000 - - - - - - Aha! The light dawns. That first page is doing double duty as a - stack and as the boot memory map. The last thing that the boot memory - manager does is to free the pages used by its memory map, so this page - is getting freed even though it's marked as reserved. - - - The fix was to initialize the boot memory manager before allocating - those two stack pages, and then allocate them through the boot memory - manager. After doing this, and fixing a couple of subsequent buglets, - the stack corruption problem disappeared. - - - - - -13. What to do when UML doesn't work -===================================== - - - - -13.1. Strange compilation errors when you build from source ------------------------------------------------------------- - - As of test11, it is necessary to have "ARCH=um" in the environment or - on the make command line for all steps in building UML, including - clean, distclean, or mrproper, config, menuconfig, or xconfig, dep, - and linux. If you forget for any of them, the i386 build seems to - contaminate the UML build. 
If this happens, start from scratch with:: - - - host% - make mrproper ARCH=um - - - - - and repeat the build process with ARCH=um on all the steps. - - - See :ref:`Compiling_the_kernel_and_modules` for more details. - - - Another cause of strange compilation errors is building UML in - /usr/src/linux. If you do this, the first thing you need to do is - clean up the mess you made. The /usr/src/linux/asm link will now - point to /usr/src/linux/asm-um. Make it point back to - /usr/src/linux/asm-i386. Then, move your UML pool someplace else and - build it there. Also see below, where a more specific set of symptoms - is described. - - - -13.3. A variety of panics and hangs with /tmp on a reiserfs filesystem ------------------------------------------------------------------------ - - I saw this on reiserfs 3.5.21 and it seems to be fixed in 3.5.27. - Panics preceded by:: - - - Detaching pid nnnn - - - - are diagnostic of this problem. This is a reiserfs bug which causes a - thread to occasionally read stale data from a mmapped page shared with - another thread. The fix is to upgrade the filesystem or to have /tmp - be an ext2 filesystem. - - - -13.4. The compile fails with errors about conflicting types for 'open', 'dup', and 'waitpid' ---------------------------------------------------------------------------------------------- - - This happens when you build in /usr/src/linux. The UML build makes - the include/asm link point to include/asm-um. /usr/include/asm points - to /usr/src/linux/include/asm, so when that link gets moved, files - which need to include the asm-i386 versions of headers get the - incompatible asm-um versions. The fix is to move the include/asm link - back to include/asm-i386 and to do UML builds someplace else. - - - -13.5. UML doesn't work when /tmp is an NFS filesystem ------------------------------------------------------- - - This seems to be similar to the ReiserFS problem above. - Some versions of NFS seem not to handle mmap correctly, which UML - depends on. The workaround is to have /tmp be a non-NFS directory. 
- - -13.6. UML hangs on boot when compiled with gprof support ---------------------------------------------------------- - - If you build UML with gprof support and, early in the boot, it does - this:: - - - kernel BUG at page_alloc.c:100! - - - - - you have a buggy gcc. You can work around the problem by removing - UM_FASTCALL from CFLAGS in arch/um/Makefile-i386. This will open up - another bug, but that one is fairly hard to reproduce. - - - -13.7. syslogd dies with a SIGTERM on startup ---------------------------------------------- - - The exact boot error depends on the distribution that you're booting, - but Debian produces this:: - - - /etc/rc2.d/S10sysklogd: line 49: 93 Terminated - start-stop-daemon --start --quiet --exec /sbin/syslogd -- $SYSLOGD - - - - - This is a syslogd bug. There's a race between a parent process - installing a signal handler and its child sending the signal. - - - -13.8. TUN/TAP networking doesn't work on a 2.4 host ----------------------------------------------------- - - There are a couple of problems which were reported by - Tim Robinson <timro at trkr dot net> - - - It doesn't work on hosts running 2.4.7 (or thereabouts) or earlier. - The fix is to upgrade to something more recent and then read the - next item. - - - If you see:: - - - File descriptor in bad state - - - - when you bring up the device inside UML, you have a header mismatch - between the original kernel and the upgraded one. Make /usr/src/linux - point at the new headers. This will only be a problem if you build - uml_net yourself. - - - -13.9. You can network to the host but not to other machines on the net -======================================================================= - - If you can connect to the host, and the host can connect to UML, but - you cannot connect to any other machines, then you may need to enable - IP Masquerading on the host. 
Usually this is only experienced when - using private IP addresses (192.168.x.x or 10.x.x.x) for host/UML - networking, rather than the public address space that your host is - connected to. UML does not enable IP Masquerading, so you will need - to create a static rule to enable it:: - - - host% - iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE - - - - - Replace eth0 with the interface that you use to talk to the rest of - the world. - - - Documentation on IP Masquerading, and SNAT, can be found at - http://www.netfilter.org. - - - If you can reach the local net, but not the outside Internet, then - that is usually a routing problem. The UML needs a default route:: - - - UML# - route add default gw gateway IP - - - - - The gateway IP can be any machine on the local net that knows how to - reach the outside world. Usually, this is the host or the local net- - work's gateway. - - - Occasionally, we hear from someone who can reach some machines, but - not others on the same net, or who can reach some ports on other - machines, but not others. These are usually caused by strange - firewalling somewhere between the UML and the other box. You track - this down by running tcpdump on every interface the packets travel - over and see where they disappear. When you find a machine that takes - the packets in, but does not send them onward, that's the culprit. - - - -13.10. I have no root and I want to scream -=========================================== - - Thanks to Birgit Wahlich for telling me about this strange one. It - turns out that there's a limit of six environment variables on the - kernel command line. When that limit is reached or exceeded, argument - processing stops, which means that the 'root=' argument that UML - usually adds is not seen. So, the filesystem has no idea what the - root device is, so it panics. - - - The fix is to put less stuff on the command line. Glomming all your - setup variables into one is probably the best way to go. - - - -13.11. 
UML build conflict between ptrace.h and ucontext.h -========================================================== - - On some older systems, /usr/include/asm/ptrace.h and - /usr/include/sys/ucontext.h define the same names. So, when they're - included together, the defines from one completely mess up the parsing - of the other, producing errors like:: - - /usr/include/sys/ucontext.h:47: parse error before - `10` - - - - - plus a pile of warnings. - - - This is a libc botch, which has since been fixed, and I don't see any - way around it besides upgrading. - - - -13.12. The UML BogoMips is exactly half the host's BogoMips ------------------------------------------------------------- - - On i386 kernels, there are two ways of running the loop that is used - to calculate the BogoMips rating, using the TSC if it's there or using - a one-instruction loop. The TSC produces twice the BogoMips as the - loop. UML uses the loop, since it has nothing resembling a TSC, and - will get almost exactly the same BogoMips as a host using the loop. - However, on a host with a TSC, its BogoMips will be double the loop - BogoMips, and therefore double the UML BogoMips. - - - -13.13. When you run UML, it immediately segfaults --------------------------------------------------- - - If the host is configured with the 2G/2G address space split, that's - why. See ref:`UML_on_2G/2G_hosts` for the details on getting UML to - run on your host. - - - -13.14. xterms appear, then immediately disappear -------------------------------------------------- - - If you're running an up to date kernel with an old release of - uml_utilities, the port-helper program will not work properly, so - xterms will exit straight after they appear. The solution is to - upgrade to the latest release of uml_utilities. Usually this problem - occurs when you have installed a packaged release of UML then compiled - your own development kernel without upgrading the uml_utilities from - the source distribution. - - - -13.15. 
Any other panic, hang, or strange behavior --------------------------------------------------- - - If you're seeing truly strange behavior, such as hangs or panics that - happen in random places, or you try running the debugger to see what's - happening and it acts strangely, then it could be a problem in the - host kernel. If you're not running a stock Linus or -ac kernel, then - try that. An early version of the preemption patch and a 2.4.10 SuSE - kernel have caused very strange problems in UML. - - - Otherwise, let me know about it. Send a message to one of the UML - mailing lists - either the developer list - user-mode-linux-devel at - lists dot sourceforge dot net (subscription info) or the user list - - user-mode-linux-user at lists dot sourceforge do net (subscription - info), whichever you prefer. Don't assume that everyone knows about - it and that a fix is imminent. - - - If you want to be super-helpful, read :ref:`Diagnosing_Problems` and - follow the instructions contained therein. - -.. _Diagnosing_Problems: - -14. Diagnosing Problems -======================== - - - If you get UML to crash, hang, or otherwise misbehave, you should - report this on one of the project mailing lists, either the developer - list - user-mode-linux-devel at lists dot sourceforge dot net - (subscription info) or the user list - user-mode-linux-user at lists - dot sourceforge dot net (subscription info). When you do, it is - likely that I will want more information. So, it would be helpful to - read the stuff below, do whatever is applicable in your case, and - report the results to the list. - - - For any diagnosis, you're going to need to build a debugging kernel. - The binaries from this site aren't debuggable. If you haven't done - this before, read about :ref:`Compiling_the_kernel_and_modules` and - :ref:`Kernel_debugging` UML first. - - -14.1. Case 1 : Normal kernel panics ------------------------------------- - - The most common case is for a normal thread to panic. 
To debug this, - you will need to run it under the debugger (add 'debug' to the command - line). An xterm will start up with gdb running inside it. Continue - it when it stops in start_kernel and make it crash. Now ``^C gdb`` and - - - If the panic was a "Kernel mode fault", then there will be a segv - frame on the stack and I'm going to want some more information. The - stack might look something like this:: - - - (UML gdb) backtrace - #0 0x1009bf76 in __sigprocmask (how=1, set=0x5f347940, oset=0x0) - at ../sysdeps/unix/sysv/linux/sigprocmask.c:49 - #1 0x10091411 in change_sig (signal=10, on=1) at process.c:218 - #2 0x10094785 in timer_handler (sig=26) at time_kern.c:32 - #3 0x1009bf38 in __restore () - at ../sysdeps/unix/sysv/linux/i386/sigaction.c:125 - #4 0x1009534c in segv (address=8, ip=268849158, is_write=2, is_user=0) - at trap_kern.c:66 - #5 0x10095c04 in segv_handler (sig=11) at trap_user.c:285 - #6 0x1009bf38 in __restore () - - - - - I'm going to want to see the symbol and line information for the value - of ip in the segv frame. In this case, you would do the following:: - - - (UML gdb) i sym 268849158 - - - - - and:: - - - (UML gdb) i line *268849158 - - - - - The reason for this is the __restore frame right above the segv_han- - dler frame is hiding the frame that actually segfaulted. So, I have - to get that information from the faulting ip. - - -14.2. Case 2 : Tracing thread panics -------------------------------------- - - The less common and more painful case is when the tracing thread - panics. In this case, the kernel debugger will be useless because it - needs a healthy tracing thread in order to work. The first thing to - do is get a backtrace from the tracing thread. This is done by - figuring out what its pid is, firing up gdb, and attaching it to that - pid. 
You can figure out the tracing thread pid by looking at the - first line of the console output, which will look like this:: - - - tracing thread pid = 15851 - - - - - or by running ps on the host and finding the line that looks like - this:: - - - jdike 15851 4.5 0.4 132568 1104 pts/0 S 21:34 0:05 ./linux [(tracing thread)] - - - - - If the panic was 'segfault in signals', then follow the instructions - above for collecting information about the location of the seg fault. - - - If the tracing thread flaked out all by itself, then send that - backtrace in and wait for our crack debugging team to fix the problem. - - - 14.3. Case 3 : Tracing thread panics caused by other threads - - However, there are cases where the misbehavior of another thread - caused the problem. The most common panic of this type is:: - - - wait_for_stop failed to wait for <pid> to stop with <signal number> - - - - - In this case, you'll need to get a backtrace from the process men- - tioned in the panic, which is complicated by the fact that the kernel - debugger is defunct and without some fancy footwork, another gdb can't - attach to it. So, this is how the fancy footwork goes: - - In a shell:: - - - host% kill -STOP pid - - - - - Run gdb on the tracing thread as described in case 2 and do:: - - - (host gdb) call detach(pid) - - - If you get a segfault, do it again. It always works the second time. - - Detach from the tracing thread and attach to that other thread:: - - - (host gdb) detach - - - - - - - (host gdb) attach pid - - - - - If gdb hangs when attaching to that process, go back to a shell and - do:: - - - host% - kill -CONT pid - - - - - And then get the backtrace:: - - - (host gdb) backtrace - - - - - -14.4. Case 4 : Hangs ---------------------- - - Hangs seem to be fairly rare, but they sometimes happen. When a hang - happens, we need a backtrace from the offending process. Run the - kernel debugger as described in case 1 and get a backtrace. 
If the - current process is not the idle thread, then send in the backtrace. - You can tell that it's the idle thread if the stack looks like this:: - - - #0 0x100b1401 in __libc_nanosleep () - #1 0x100a2885 in idle_sleep (secs=10) at time.c:122 - #2 0x100a546f in do_idle () at process_kern.c:445 - #3 0x100a5508 in cpu_idle () at process_kern.c:471 - #4 0x100ec18f in start_kernel () at init/main.c:592 - #5 0x100a3e10 in start_kernel_proc (unused=0x0) at um_arch.c:71 - #6 0x100a383f in signal_tramp (arg=0x100a3dd8) at trap_user.c:50 - - - - - If this is the case, then some other process is at fault, and went to - sleep when it shouldn't have. Run ps on the host and figure out which - process should not have gone to sleep and stayed asleep. Then attach - to it with gdb and get a backtrace as described in case 3. - - - - - - -15. Thanks -=========== - - - A number of people have helped this project in various ways, and this - page gives recognition where recognition is due. - - - If you're listed here and you would prefer a real link on your name, - or no link at all, instead of the despammed email address pseudo-link, - let me know. - - - If you're not listed here and you think maybe you should be, please - let me know that as well. I try to get everyone, but sometimes my - bookkeeping lapses and I forget about contributions. - - -15.1. 
Code and Documentation ------------------------------ - - Rusty Russell <rusty at linuxcare.com.au> - - - - wrote the HOWTO - http://user-mode-linux.sourceforge.net/old/UserModeLinux-HOWTO.html - - - prodded me into making this project official and putting it on - SourceForge - - - came up with the way cool UML logo - http://user-mode-linux.sourceforge.net/uml-small.png - - - redid the config process - - - Peter Moulder <reiter at netspace.net.au> - Fixed my config and build - processes, and added some useful code to the block driver - - - Bill Stearns <wstearns at pobox.com> - - - - HOWTO updates - - - lots of bug reports - - - lots of testing - - - dedicated a box (uml.ists.dartmouth.edu) to support UML development - - - wrote the mkrootfs script, which allows bootable filesystems of - RPM-based distributions to be cranked out - - - cranked out a large number of filesystems with said script - - - Jim Leu <jleu at mindspring.com> - Wrote the virtual ethernet driver - and associated usermode tools - - Lars Brinkhoff http://lars.nocrew.org/ - Contributed the ptrace - proxy from his own project to allow easier kernel debugging - - - Andrea Arcangeli <andrea at suse.de> - Redid some of the early boot - code so that it would work on machines with Large File Support - - - Chris Emerson - Did the first UML port to Linux/ppc - - - Harald Welte <laforge at gnumonks.org> - Wrote the multicast - transport for the network driver - - - Jorgen Cederlof - Added special file support to hostfs - - - Greg Lonnon <glonnon at ridgerun dot com> - Changed the ubd driver - to allow it to layer a COW file on a shared read-only filesystem and - wrote the iomem emulation support - - - Henrik Nordstrom http://hem.passagen.se/hno/ - Provided a variety - of patches, fixes, and clues - - - Lennert Buytenhek - Contributed various patches, a rewrite of the - network driver, the first implementation of the mconsole driver, and - did the bulk of the work needed to get SMP working again. 
- - - Yon Uriarte - Fixed the TUN/TAP network backend while I slept. - - - Adam Heath - Made a bunch of nice cleanups to the initialization code, - plus various other small patches. - - - Matt Zimmerman - Matt volunteered to be the UML Debian maintainer and - is doing a real nice job of it. He also noticed and fixed a number of - actually and potentially exploitable security holes in uml_net. Plus - the occasional patch. I like patches. - - - James McMechan - James seems to have taken over maintenance of the ubd - driver and is doing a nice job of it. - - - Chandan Kudige - wrote the umlgdb script which automates the reloading - of module symbols. - - - Steve Schmidtke - wrote the UML slirp transport and hostaudio drivers, - enabling UML processes to access audio devices on the host. He also - submitted patches for the slip transport and lots of other things. - - - David Coulson http://davidcoulson.net - - - - Set up the http://usermodelinux.org site, - which is a great way of keeping the UML user community on top of - UML goings-on. - - - Site documentation and updates - - - Nifty little UML management daemon UMLd - - - Lots of testing and bug reports - - - - -15.2. Flushing out bugs ------------------------- - - - - - Yuri Pudgorodsky - - - Gerald Britton - - - Ian Wehrman - - - Gord Lamb - - - Eugene Koontz - - - John H. Hartman - - - Anders Karlsson - - - Daniel Phillips - - - John Fremlin - - - Rainer Burgstaller - - - James Stevenson - - - Matt Clay - - - Cliff Jefferies - - - Geoff Hoff - - - Lennert Buytenhek - - - Al Viro - - - Frank Klingenhoefer - - - Livio Baldini Soares - - - Jon Burgess - - - Petru Paler - - - Paul - - - Chris Reahard - - - Sverker Nilsson - - - Gong Su - - - johan verrept - - - Bjorn Eriksson - - - Lorenzo Allegrucci - - - Muli Ben-Yehuda - - - David Mansfield - - - Howard Goff - - - Mike Anderson - - - John Byrne - - - Sapan J. Batia - - - Iris Huang - - - Jan Hudec - - - Voluspa - - - - -15.3. 
Buglets and clean-ups ----------------------------- - - - - - Dave Zarzycki - - - Adam Lazur - - - Boria Feigin - - - Brian J. Murrell - - - JS - - - Roman Zippel - - - Wil Cooley - - - Ayelet Shemesh - - - Will Dyson - - - Sverker Nilsson - - - dvorak - - - v.naga srinivas - - - Shlomi Fish - - - Roger Binns - - - johan verrept - - - MrChuoi - - - Peter Cleve - - - Vincent Guffens - - - Nathan Scott - - - Patrick Caulfield - - - jbearce - - - Catalin Marinas - - - Shane Spencer - - - Zou Min - - - - Ryan Boder - - - Lorenzo Colitti - - - Gwendal Grignou - - - Andre' Breiler - - - Tsutomu Yasuda - - - -15.4. Case Studies -------------------- - - - - Jon Wright - - - William McEwan - - - Michael Richardson - - - -15.5. Other contributions --------------------------- - - - Bill Carr <Bill.Carr at compaq.com> made the Red Hat mkrootfs script - work with RH 6.2. - - Michael Jennings <mikejen at hevanet.com> sent in some material which - is now gracing the top of the index page - http://user-mode-linux.sourceforge.net/ of this site. - - SGI (and more specifically Ralf Baechle <ralf at - uni-koblenz.de> ) gave me an account on oss.sgi.com. - The bandwidth there made it possible to - produce most of the filesystems available on the project download - page. - - Laurent Bonnaud <Laurent.Bonnaud at inpg.fr> took the old grotty - Debian filesystem that I've been distributing and updated it to 2.2. - It is now available by itself here. - - Rik van Riel gave me some ftp space on ftp.nl.linux.org so I can make - releases even when Sourceforge is broken. - - Rodrigo de Castro looked at my broken pte code and told me what was - wrong with it, letting me fix a long-standing (several weeks) and - serious set of bugs. - - Chris Reahard built a specialized root filesystem for running a DNS - server jailed inside UML. It's available from the download - http://user-mode-linux.sourceforge.net/old/dl-sf.html page in the Jail - Filesystems section. 
diff --git a/Documentation/virt/uml/user_mode_linux_howto_v2.rst b/Documentation/virt/uml/user_mode_linux_howto_v2.rst new file mode 100644 index 000000000000..f70e6f5873c6 --- /dev/null +++ b/Documentation/virt/uml/user_mode_linux_howto_v2.rst @@ -0,0 +1,1208 @@ +.. SPDX-License-Identifier: GPL-2.0 + +######### +UML HowTo +######### + +.. contents:: :local: + +************ +Introduction +************ + +Welcome to User Mode Linux + +User Mode Linux is the first Open Source virtualization platform (first +release date 1991) and second virtualization platform for an x86 PC. + +How is UML Different from a VM using Virtualization package X? +============================================================== + +We have come to assume that virtualization also means some level of +hardware emulation. In fact, it does not. As long as a virtualization +package provides the OS with devices which the OS can recognize and +has a driver for, the devices do not need to emulate real hardware. +Most OSes today have built-in support for a number of "fake" +devices used only under virtualization. +User Mode Linux takes this concept to the ultimate extreme - there +is not a single real device in sight. It is 100% artificial or if +we use the correct term 100% paravirtual. All UML devices are abstract +concepts which map onto something provided by the host - files, sockets, +pipes, etc. + +The other major difference between UML and various virtualization +packages is that there is a distinct difference between the way the UML +kernel and the UML programs operate. +The UML kernel is just a process running on Linux - same as any other +program. It can be run by an unprivileged user and it does not require +anything in terms of special CPU features. +The UML userspace, however, is a bit different. The Linux kernel on the +host machine assists UML in intercepting everything the program running +on a UML instance is trying to do and making the UML kernel handle all +of its requests. 
+This is different from other virtualization packages which do not make any +difference between the guest kernel and guest programs. This difference +results in a number of advantages and disadvantages of UML over let's say +QEMU which we will cover later in this document. + + +Why Would I Want User Mode Linux? +================================= + + +* If User Mode Linux kernel crashes, your host kernel is still fine. It + is not accelerated in any way (vhost, kvm, etc) and it is not trying to + access any devices directly. It is, in fact, a process like any other. + +* You can run a usermode kernel as a non-root user (you may need to + arrange appropriate permissions for some devices). + +* You can run a very small VM with a minimal footprint for a specific + task (for example 32M or less). + +* You can get extremely high performance for anything which is a "kernel + specific task" such as forwarding, firewalling, etc while still being + isolated from the host kernel. + +* You can play with kernel concepts without breaking things. + +* You are not bound by "emulating" hardware, so you can try weird and + wonderful concepts which are very difficult to support when emulating + real hardware such as time travel and making your system clock + dependent on what UML does (very useful for things like tests). + +* It's fun. + +Why not to run UML +================== + +* The syscall interception technique used by UML makes it inherently + slower for any userspace applications. While it can do kernel tasks + on par with most other virtualization packages, its userspace is + **slow**. The root cause is that UML has a very high cost of creating + new processes and threads (something most Unix/Linux applications + take for granted). + +* UML is strictly uniprocessor at present. If you want to run an + application which needs many CPUs to function, it is clearly the + wrong choice. 
+ +*********************** +Building a UML instance +*********************** + +There is no UML installer in any distribution. While you can use off +the shelf install media to install into a blank VM using a virtualization +package, there is no UML equivalent. You have to use appropriate tools on +your host to build a viable filesystem image. + +This is extremely easy on Debian - you can do it using debootstrap. It is +also easy on OpenWRT - the build process can build UML images. All other +distros - YMMV. + +Creating an image +================= + +Create a sparse raw disk image:: + + # dd if=/dev/zero of=disk_image_name bs=1 count=1 seek=16G + +This will create a 16G disk image. The OS will initially allocate only one +block and will allocate more as they are written by UML. As of kernel +version 4.19 UML fully supports TRIM (as usually used by flash drives). +Using TRIM inside the UML image by specifying discard as a mount option +or by running ``tune2fs -o discard /dev/ubdXX`` will request UML to +return any unused blocks to the OS. + +Create a filesystem on the disk image and mount it:: + + # mkfs.ext4 ./disk_image_name && mount ./disk_image_name /mnt + +This example uses ext4; any other filesystem such as ext3, btrfs, xfs, +jfs, etc will work too. + +Create a minimal OS installation on the mounted filesystem:: + + # debootstrap buster /mnt http://deb.debian.org/debian + +debootstrap does not set up the root password, fstab, hostname or +anything related to networking. It is up to the user to do that. + +Set the root password - the easiest way to do that is to chroot into the +mounted image:: + + # chroot /mnt + # passwd + # exit + +Edit key system files +===================== + +UML block devices are called ubds. The fstab created by debootstrap +will be empty and it needs an entry for the root file system:: + + /dev/ubd0 / ext4 discard,errors=remount-ro 0 1 + +The image hostname will be set to the same as the host on which you +are creating the image. 
It is a good idea to change that to avoid +"Oh, bummer, I rebooted the wrong machine". + +UML supports two classes of network devices. The older uml_net ones, +which are scheduled for obsoletion, are called ethX. The newer vector IO +devices, which are significantly faster and support some standard +virtual network encapsulations like Ethernet over GRE and Ethernet over +L2TPv3, are called vecX. + +Depending on which one is in use, ``/etc/network/interfaces`` will +need entries like:: + + # legacy UML network devices + auto eth0 + iface eth0 inet dhcp + + # vector UML network devices + auto vec0 + iface vec0 inet dhcp + +We now have a UML image which is nearly ready to run; all we need is a +UML kernel and modules for it. + +Most distributions have a UML package. Even if you intend to use your own +kernel, testing the image with a stock one is always a good start. These +packages come with a set of modules which should be copied to the target +filesystem. The location is distribution dependent. For Debian these +reside under /usr/lib/uml/modules. Recursively copy the contents of this +directory to the mounted UML filesystem:: + + # cp -rax /usr/lib/uml/modules /mnt/lib/modules + +If you have compiled your own kernel, you need to use the usual "install +modules to a location" procedure by running:: + + # make install MODULES_DIR=/mnt/lib/modules + +At this point the image is ready to be brought up. + +************************* +Setting Up UML Networking +************************* + +UML networking is designed to emulate an Ethernet connection. This +connection may be either point-to-point (similar to a connection +between machines using a back-to-back cable) or a connection to a +switch. UML supports a wide variety of means to build these +connections to all of: local machine, remote machine(s), local and +remote UML and other VM instances. 
+
+
++-----------+--------+------------------------------------+------------+
+| Transport | Type   | Capabilities                       | Throughput |
++===========+========+====================================+============+
+| tap       | vector | checksum, tso                      | > 8Gbit    |
++-----------+--------+------------------------------------+------------+
+| hybrid    | vector | checksum, tso, multipacket rx      | > 6Gbit    |
++-----------+--------+------------------------------------+------------+
+| raw       | vector | checksum, tso, multipacket rx, tx  | > 6Gbit    |
++-----------+--------+------------------------------------+------------+
+| EoGRE     | vector | multipacket rx, tx                 | > 3Gbit    |
++-----------+--------+------------------------------------+------------+
+| Eol2tpv3  | vector | multipacket rx, tx                 | > 3Gbit    |
++-----------+--------+------------------------------------+------------+
+| bess      | vector | multipacket rx, tx                 | > 3Gbit    |
++-----------+--------+------------------------------------+------------+
+| fd        | vector | dependent on fd type               | varies     |
++-----------+--------+------------------------------------+------------+
+| tuntap    | legacy | none                               | ~ 500Mbit  |
++-----------+--------+------------------------------------+------------+
+| daemon    | legacy | none                               | ~ 450Mbit  |
++-----------+--------+------------------------------------+------------+
+| socket    | legacy | none                               | ~ 450Mbit  |
++-----------+--------+------------------------------------+------------+
+| pcap      | legacy | rx only                            | ~ 450Mbit  |
++-----------+--------+------------------------------------+------------+
+| ethertap  | legacy | obsolete                           | ~ 500Mbit  |
++-----------+--------+------------------------------------+------------+
+| vde       | legacy | obsolete                           | ~ 500Mbit  |
++-----------+--------+------------------------------------+------------+
+
+* All transports which have tso and checksum offloads can deliver speeds
+  approaching 10G on TCP streams.
+ +* All transports which have multi-packet rx and/or tx can deliver pps + rates of up to 1Mpps or more. + +* All legacy transports are generally limited to ~600-700MBit and 0.05Mpps. + +* GRE and L2TPv3 allow connections to all of: local machine, remote + machines, remote network devices and remote UML instances. + +* Socket allows connections only between UML instances. + +* Daemon and bess require running a local switch. This switch may be + connected to the host as well. + + +Network configuration privileges +================================ + +The majority of the supported networking modes need ``root`` privileges. +For example, in the legacy tuntap networking mode, users were required +to be part of the group associated with the tunnel device. + +For newer network drivers like the vector transports, ``root`` privilege +is required to fire an ioctl to set up the tun interface and/or use +raw sockets where needed. + +This can be achieved by granting the user a particular capability instead +of running UML as root. In case of vector transport, a user can add the +capability ``CAP_NET_ADMIN`` or ``CAP_NET_RAW`` to the uml binary. +Thenceforth, UML can be run with normal user privileges, along with +full networking. + +For example:: + + # sudo setcap cap_net_raw,cap_net_admin+ep linux + +Configuring vector transports +=============================== + +All vector transports support a similar syntax: + +If X is the interface number as in vec0, vec1, vec2, etc, the general +syntax for options is:: + + vecX:transport="Transport Name",option=value,option=value,...,option=value + +Common options +-------------- + +These options are common for all transports: + +* ``depth=int`` - sets the queue depth for vector IO. This is the + number of packets UML will attempt to read or write in a single + system call. The default number is 64 and is generally sufficient + for most applications that need throughput in the 2-4 Gbit range. + Higher speeds may require larger values. 
+ +* ``mac=XX:XX:XX:XX:XX:XX`` - sets the interface MAC address value. + +* ``gro=[0,1]`` - sets GRO on or off. Enables receive/transmit offloads. + The effect of this option depends on the host side support in the transport + which is being configured. In most cases it will enable TCP segmentation and + RX/TX checksumming offloads. The setting must be identical on the host side + and the UML side. The UML kernel will produce warnings if it is not. + For example, GRO is enabled by default on local machine interfaces + (e.g. veth pairs, bridge, etc), so it should be enabled in UML in the + corresponding UML transports (raw, tap, hybrid) in order for networking to + operate correctly. + +* ``mtu=int`` - sets the interface MTU + +* ``headroom=int`` - adjusts the default headroom (32 bytes) reserved + if a packet will need to be re-encapsulated into, for instance, VXLAN. + +* ``vec=0`` - disable multipacket io and fall back to packet at a + time mode + +Shared Options +-------------- + +* ``ifname=str`` Transports which bind to a local network interface + have a shared option - the name of the interface to bind to. + +* ``src, dst, src_port, dst_port`` - all transports which use sockets + which have the notion of source and destination and/or source port + and destination port use these to specify them. + +* ``v6=[0,1]`` to specify if a v6 connection is desired for all + transports which operate over IP. Additionally, for transports that + have some differences in the way they operate over v4 and v6 (for example + EoL2TPv3), sets the correct mode of operation. In the absence of this + option, the socket type is determined based on what the src and dst + arguments resolve/parse to. + +tap transport +------------- + +Example:: + + vecX:transport=tap,ifname=tap0,depth=128,gro=1 + +This will connect vec0 to tap0 on the host. Tap0 must already exist (for example +created using tunctl) and be UP. 
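The requirement that tap0 exists and is UP can also be met with iproute2 rather than tunctl. The following is a minimal sketch, assuming the ``ip`` tool from iproute2 is available on the host; the user name ``uml-user`` and the address ``192.168.4.1/30`` are illustrative placeholders, not values mandated by UML. The script only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch only: print the host-side commands that prepare a persistent tap
# device for a UML instance. All three values below are assumptions chosen
# for illustration; override them through the environment as needed.
TAP_DEV="${TAP_DEV:-tap0}"
UML_USER="${UML_USER:-uml-user}"
HOST_ADDR="${HOST_ADDR:-192.168.4.1/30}"

tap_setup_cmds() {
    # Create the tap device and hand ownership to the unprivileged UML user.
    echo "ip tuntap add dev $TAP_DEV mode tap user $UML_USER"
    # Give the host side of the point-to-point link an address.
    echo "ip addr add $HOST_ADDR dev $TAP_DEV"
    # The interface must be UP before UML attaches to it.
    echo "ip link set dev $TAP_DEV up"
}

# Print the commands; nothing is changed on the host at this point.
tap_setup_cmds
```

After reviewing the output, the commands could be applied with something like ``tap_setup_cmds | sudo sh -x``. Printing first is a deliberate choice here: creating network devices needs root, and a dry run makes it easy to adapt the names to the local setup.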
+ +tap0 can be configured as a point-to-point interface and given an ip +address so that UML can talk to the host. Alternatively, it is possible +to connect UML to a tap interface which is connected to a bridge. + +While tap relies on the vector infrastructure, it is not a true vector +transport at this point, because Linux does not support multi-packet +IO on tap file descriptors for normal userspace apps like UML. This +is a privilege which is offered only to something which can hook up +to it at kernel level via specialized interfaces like vhost-net. A +vhost-net like helper for UML is planned at some point in the future. + +Privileges required: tap transport requires either: + +* tap interface to exist and be created persistent and owned by the + UML user using tunctl. Example ``tunctl -u uml-user -t tap0`` + +* binary to have ``CAP_NET_ADMIN`` privilege + +hybrid transport +---------------- + +Example:: + + vecX:transport=hybrid,ifname=tap0,depth=128,gro=1 + +This is an experimental/demo transport which couples tap for transmit +and a raw socket for receive. The raw socket allows multi-packet +receive resulting in significantly higher packet rates than normal tap + +Privileges required: hybrid requires ``CAP_NET_RAW`` capability by +the UML user as well as the requirements for the tap transport. + +raw socket transport +-------------------- + +Example:: + + vecX:transport=raw,ifname=p-veth0,depth=128,gro=1 + + +This transport uses vector IO on raw sockets. While you can bind to any +interface including a physical one, the most common use is to bind to +the "peer" side of a veth pair with the other side configured on the +host. 
+ +Example host configuration for Debian: + +**/etc/network/interfaces**:: + + auto veth0 + iface veth0 inet static + address 192.168.4.1 + netmask 255.255.255.252 + broadcast 192.168.4.3 + pre-up ip link add veth0 type veth peer name p-veth0 && \ + ifconfig p-veth0 up + +UML can now bind to p-veth0 like this:: + + vec0:transport=raw,ifname=p-veth0,depth=128,gro=1 + + +If the UML guest is configured with 192.168.4.2 and netmask 255.255.255.0 +it can talk to the host on 192.168.4.1 + +The raw transport also provides some support for offloading some of the +filtering to the host. The two options to control it are: + +* ``bpffile=str`` filename of raw bpf code to be loaded as a socket filter + +* ``bpfflash=int`` 0/1 allow loading of bpf from inside User Mode Linux. + This option allows the use of the ethtool load firmware command to + load bpf code. + +In either case the bpf code is loaded into the host kernel. While this is +presently limited to legacy bpf syntax (not ebpf), it is still a security +risk. It is not recommended to allow this unless the User Mode Linux +instance is considered trusted. + +Privileges required: raw socket transport requires ``CAP_NET_RAW`` +capability. + +GRE socket transport +-------------------- + +Example:: + + vecX:transport=gre,src=$src_host,dst=$dst_host + + +This will configure an Ethernet over ``GRE`` (aka ``GRETAP`` or +``GREIRB``) tunnel which will connect the UML instance to a ``GRE`` +endpoint at host dst_host. 
``GRE`` supports the following additional +options: + +* ``rx_key=int`` - GRE 32 bit integer key for rx packets, if set, + ``tx_key`` must be set too + +* ``tx_key=int`` - GRE 32 bit integer key for tx packets, if set + ``rx_key`` must be set too + +* ``sequence=[0,1]`` - enable GRE sequence + +* ``pin_sequence=[0,1]`` - pretend that the sequence is always reset + on each packet (needed to interoperate with some really broken + implementations) + +* ``v6=[0,1]`` - force IPv4 or IPv6 sockets respectively + +* GRE checksum is not presently supported + +GRE has a number of caveats: + +* You can use only one GRE connection per ip address. There is no way to + multiplex connections as each GRE tunnel is terminated directly on + the UML instance. + +* The key is not really a security feature. While it was intended as such + its "security" is laughable. It is, however, a useful feature to + ensure that the tunnel is not misconfigured. + +An example configuration for a Linux host with a local address of +192.168.128.1 to connect to a UML instance at 192.168.129.1 + +**/etc/network/interfaces**:: + + auto gt0 + iface gt0 inet static + address 10.0.0.1 + netmask 255.255.255.0 + broadcast 10.0.0.255 + mtu 1500 + pre-up ip link add gt0 type gretap local 192.168.128.1 \ + remote 192.168.129.1 || true + down ip link del gt0 || true + +Additionally, GRE has been tested versus a variety of network equipment. + +Privileges required: GRE requires ``CAP_NET_RAW`` + +l2tpv3 socket transport +----------------------- + +_Warning_. L2TPv3 has a "bug". It is the "bug" known as "has more +options than GNU ls". While it has some advantages, there are usually +easier (and less verbose) ways to connect a UML instance to something. +For example, most devices which support L2TPv3 also support GRE. 
+ +Example:: + + vec0:transport=l2tpv3,udp=1,src=$src_host,dst=$dst_host,srcport=$src_port,dstport=$dst_port,depth=128,rx_session=0xffffffff,tx_session=0xffff + +This will configure an Ethernet over L2TPv3 fixed tunnel which will +connect the UML instance to a L2TPv3 endpoint at host $dst_host using +the L2TPv3 UDP flavour and UDP destination port $dst_port. + +L2TPv3 always requires the following additional options: + +* ``rx_session=int`` - l2tpv3 32 bit integer session for rx packets + +* ``tx_session=int`` - l2tpv3 32 bit integer session for tx packets + +As the tunnel is fixed these are not negotiated and they are +preconfigured on both ends. + +Additionally, L2TPv3 supports the following optional parameters + +* ``rx_cookie=int`` - l2tpv3 32 bit integer cookie for rx packets - same + functionality as GRE key, more to prevent misconfiguration than provide + actual security + +* ``tx_cookie=int`` - l2tpv3 32 bit integer cookie for tx packets + +* ``cookie64=[0,1]`` - use 64 bit cookies instead of 32 bit. + +* ``counter=[0,1]`` - enable l2tpv3 counter + +* ``pin_counter=[0,1]`` - pretend that the counter is always reset on + each packet (needed to interoperate with some really broken + implementations) + +* ``v6=[0,1]`` - force v6 sockets + +* ``udp=[0,1]`` - use raw sockets (0) or UDP (1) version of the protocol + +L2TPv3 has a number of caveats: + +* you can use only one connection per ip address in raw mode. There is + no way to multiplex connections as each L2TPv3 tunnel is terminated + directly on the UML instance. UDP mode can use different ports for + this purpose. 
+ +Here is an example of how to configure a linux host to connect to UML +via L2TPv3: + +**/etc/network/interfaces**:: + + auto l2tp1 + iface l2tp1 inet static + address 192.168.126.1 + netmask 255.255.255.0 + broadcast 192.168.126.255 + mtu 1500 + pre-up ip l2tp add tunnel remote 127.0.0.1 \ + local 127.0.0.1 encap udp tunnel_id 2 \ + peer_tunnel_id 2 udp_sport 1706 udp_dport 1707 && \ + ip l2tp add session name l2tp1 tunnel_id 2 \ + session_id 0xffffffff peer_session_id 0xffffffff + down ip l2tp del session tunnel_id 2 session_id 0xffffffff && \ + ip l2tp del tunnel tunnel_id 2 + + +Privileges required: L2TPv3 requires ``CAP_NET_RAW`` for raw IP mode and +no special privileges for the UDP mode. + +BESS socket transport +--------------------- + +BESS is a high performance modular network switch. + +https://github.com/NetSys/bess + +It has support for a simple sequential packet socket mode which in the +more recent versions is using vector IO for high performance. + +Example:: + + vecX:transport=bess,src=$unix_src,dst=$unix_dst + +This will configure a BESS transport using the unix_src Unix domain +socket address as source and unix_dst socket address as destination. + +For BESS configuration and how to allocate a BESS Unix domain socket port +please see the BESS documentation. + +https://github.com/NetSys/bess/wiki/Built-In-Modules-and-Ports + +BESS transport does not require any special privileges. + +Configuring Legacy transports +============================= + +Legacy transports are now considered obsolete. Please use the vector +versions. + +*********** +Running UML +*********** + +This section assumes that either the user-mode-linux package from the +distribution or a custom built kernel has been installed on the host. + +These add an executable called linux to the system. This is the UML +kernel. It can be run just like any other executable. +It will take most normal linux kernel arguments as command line +arguments. 
Additionally, it will need some UML specific arguments +in order to do something useful. + +Arguments +========= + +Mandatory Arguments: +-------------------- + +* ``mem=int[K,M,G]`` - amount of memory. By default bytes. It will + also accept K, M or G qualifiers. + +* ``ubdX[s,d,c,t]=`` virtual disk specification. This is not really + mandatory, but it is likely to be needed in nearly all cases so we can + specify a root file system. + The simplest possible image specification is the name of the image + file for the filesystem (created using one of the methods described + in `Creating an image`_) + + * UBD devices support copy on write (COW). The changes are kept in + a separate file which can be discarded allowing a rollback to the + original pristine image. If COW is desired, the UBD image is + specified as: ``cow_file,master_image``. + Example: ``ubd0=Filesystem.cow,Filesystem.img`` + + * UBD devices can be set to use synchronous IO. Any writes are + immediately flushed to disk. This is done by adding ``s`` after + the ``ubdX`` specification + + * UBD performs some heuristics on devices specified as a single + filename to make sure that a COW file has not been specified as + the image. To turn them off, use the ``d`` flag after ``ubdX`` + + * UBD supports TRIM - asking the Host OS to reclaim any unused + blocks in the image. To turn it off, specify the ``t`` flag after + ``ubdX`` + +* ``root=`` root device - most likely ``/dev/ubd0`` (this is a Linux + filesystem image) + +Important Optional Arguments +---------------------------- + +If UML is run as "linux" with no extra arguments, it will try to start an +xterm for every console configured inside the image (up to 6 in most +linux distributions). Each console is started inside an +xterm. This makes it nice and easy to use UML on a host with a GUI. It is, +however, the wrong approach if UML is to be used as a testing harness or run +in a text-only environment. 
+ +In order to change this behaviour we need to specify an alternative console +and wire it to one of the supported "line" channels. For this we need to map a +console to use something different from the default xterm. + +Example which will divert console number 1 to stdin/stdout:: + + con1=fd:0,fd:1 + +UML supports a wide variety of serial line channels which are specified using +the following syntax + + conX=channel_type:options[,channel_type:options] + + +If the channel specification contains two parts separated by comma, the first +one is input, the second one output. + +* The null channel - Discard all input or output. Example ``con=null`` will set + all consoles to null by default. + +* The fd channel - use file descriptor numbers for input/output. Example: + ``con1=fd:0,fd:1``. + +* The port channel - listen on tcp port number. Example: ``con1=port:4321`` + +* The pty and pts channels - use system pty/pts. + +* The tty channel - bind to an existing system tty. Example: ``con1=tty:/dev/tty8`` + will make UML use the host 8th console (usually unused). + +* The xterm channel - this is the default - bring up an xterm on this channel + and direct IO to it. Note, that in order for xterm to work, the host must + have the UML distribution package installed. This usually contains the + port-helper and other utilities needed for UML to communicate with the xterm. + Alternatively, these need to be compiled and installed from source. All + options applicable to consoles also apply to UML serial lines which are + presented as ttyS inside UML. + +Starting UML +============ + +We can now run UML. +:: + # linux mem=2048M umid=TEST \ + ubd0=Filesystem.img \ + vec0:transport=tap,ifname=tap0,depth=128,gro=1 \ + root=/dev/ubda con=null con0=null,fd:2 con1=fd:0,fd:1 + +This will run an instance with ``2048M RAM``, try to use the image file +called ``Filesystem.img`` as root. It will connect to the host using tap0. 
+All consoles except ``con1`` will be disabled and console 1 will +use standard input/output making it appear in the same terminal it was started. + +Logging in +============ + +If you have not set up a password when generating the image, you will have to +shut down the UML instance, mount the image, chroot into it and set it - as +described in the Generating an Image section. If the password is already set, +you can just log in. + +The UML Management Console +============================ + +In addition to managing the image from "the inside" using normal sysadmin tools, +it is possible to perform a number of low level operations using the UML +management console. The UML management console is a low-level interface to the +kernel on a running UML instance, somewhat like the i386 SysRq interface. Since +there is a full-blown operating system under UML, there is much greater +flexibility possible than with the SysRq mechanism. + +There are a number of things you can do with the mconsole interface: + +* get the kernel version +* add and remove devices +* halt or reboot the machine +* Send SysRq commands +* Pause and resume the UML +* Inspect processes running inside UML +* Inspect UML internal /proc state + +You need the mconsole client (uml\_mconsole) which is a part of the UML +tools package available in most Linux distributions. + +You also need ``CONFIG_MCONSOLE`` (under 'General Setup') enabled in the UML +kernel. When you boot UML, you'll see a line like:: + + mconsole initialized on /home/jdike/.uml/umlNJ32yL/mconsole + +If you specify a unique machine id on the UML command line, i.e. +``umid=debian``, you'll see this:: + + mconsole initialized on /home/jdike/.uml/debian/mconsole + + +That file is the socket that uml_mconsole will use to communicate with +UML. 
Run it with either the umid or the full path as its argument:: + + # uml_mconsole debian + +or + + # uml_mconsole /home/jdike/.uml/debian/mconsole + + +You'll get a prompt, at which you can run one of these commands: + +* version +* help +* halt +* reboot +* config +* remove +* sysrq +* cad +* stop +* go +* proc +* stack + +version +------- + +This command takes no arguments. It prints the UML version:: + + (mconsole) version + OK Linux OpenWrt 4.14.106 #0 Tue Mar 19 08:19:41 2019 x86_64 + + +There are a couple of actual uses for this. It's a simple no-op which +can be used to check that a UML is running. It's also a way of +sending a device interrupt to the UML. UML mconsole is treated internally as +a UML device. + +help +---- + +This command takes no arguments. It prints a short help screen with the +supported mconsole commands. + + +halt and reboot +--------------- + +These commands take no arguments. They shut the machine down immediately, with +no syncing of disks and no clean shutdown of userspace. So, they are +pretty close to crashing the machine:: + + (mconsole) halt + OK + +config +------ + +"config" adds a new device to the virtual machine. This is supported +by most UML device drivers. It takes one argument, which is the +device to add, with the same syntax as the kernel command line:: + + (mconsole) config ubd3=/home/jdike/incoming/roots/root_fs_debian22 + +remove +------ + +"remove" deletes a device from the system. Its argument is just the +name of the device to be removed. The device must be idle in whatever +sense the driver considers necessary. In the case of the ubd driver, +the removed block device must not be mounted, swapped on, or otherwise +open, and in the case of the network driver, the device must be down:: + + (mconsole) remove ubd3 + +sysrq +----- + +This command takes one argument, which is a single letter. It calls the +generic kernel's SysRq driver, which does whatever is called for by +that argument. 
See the SysRq documentation in +Documentation/admin-guide/sysrq.rst in your favorite kernel tree to +see what letters are valid and what they do. + +cad +--- + +This invokes the ``Ctl-Alt-Del`` action in the running image. What exactly +this ends up doing is up to init, systemd, etc. Normally, it reboots the +machine. + +stop +---- + +This puts the UML in a loop reading mconsole requests until a 'go' +mconsole command is received. This is very useful as a +debugging/snapshotting tool. + +go +-- + +This resumes a UML after being paused by a 'stop' command. Note that +when the UML has resumed, TCP connections may have timed out and if +the UML is paused for a long period of time, crond might go a little +crazy, running all the jobs it didn't do earlier. + +proc +---- + +This takes one argument - the name of a file in /proc which is printed +to the mconsole standard output + +stack +----- + +This takes one argument - the pid number of a process. Its stack is +printed to a standard output. + +******************* +Advanced UML Topics +******************* + +Sharing Filesystems between Virtual Machines +============================================ + +Don't attempt to share filesystems simply by booting two UMLs from the +same file. That's the same thing as booting two physical machines +from a shared disk. It will result in filesystem corruption. + +Using layered block devices +--------------------------- + +The way to share a filesystem between two virtual machines is to use +the copy-on-write (COW) layering capability of the ubd block driver. +Any changed blocks are stored in the private COW file, while reads come +from either device - the private one if the requested block is valid in +it, the shared one if not. Using this scheme, the majority of data +which is unchanged is shared between an arbitrary number of virtual +machines, each of which has a much smaller file containing the changes +that it has made. 
With a large number of UMLs booting from a large root +filesystem, this leads to a huge disk space saving. + +Sharing file system data will also help performance, since the host will +be able to cache the shared data using a much smaller amount of memory, +so UML disk requests will be served from the host's memory rather than +its disks. There is a major caveat in doing this on multisocket NUMA +machines. On such hardware, running many UML instances with a shared +master image and COW changes may cause issues like NMIs from excess of +inter-socket traffic. + +If you are running UML on high end hardware like this, make sure to +bind UML to a set of logical cpus residing on the same socket using the +``taskset`` command or have a look at the "tuning" section. + +To add a copy-on-write layer to an existing block device file, simply +add the name of the COW file to the appropriate ubd switch:: + + ubd0=root_fs_cow,root_fs_debian_22 + +where ``root_fs_cow`` is the private COW file and ``root_fs_debian_22`` is +the existing shared filesystem. The COW file need not exist. If it +doesn't, the driver will create and initialize it. + +Disk Usage +---------- + +UML has TRIM support which will release any unused space in its disk +image files to the underlying OS. It is important to use either ls -ls +or du to verify the actual file size. + +COW validity. +------------- + +Any changes to the master image will invalidate all COW files. If this +happens, UML will *NOT* automatically delete any of the COW files and +will refuse to boot. In this case the only solution is to either +restore the old image (including its last modified timestamp) or remove +all COW files which will result in their recreation. Any changes in +the COW files will be lost. 
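The Disk Usage advice above can be illustrated with a small shell sketch. It assumes GNU coreutils (``truncate`` and ``du --apparent-size``) and uses a throwaway sparse file as a stand-in for a real UML image. The first number is the apparent size, the second is the space actually allocated on the host. On a filesystem that supports sparse files the second is expected to be much smaller.

```shell
#!/bin/sh
# Sketch only: compare apparent size with allocated size for a sparse file.
img="$(mktemp)"                 # hypothetical stand-in for a UML disk image
truncate -s 1M "$img"           # 1 MiB apparent size, no data blocks written
apparent_kb="$(du -k --apparent-size "$img" | cut -f1)"
allocated_kb="$(du -k "$img" | cut -f1)"
echo "apparent: ${apparent_kb} KiB, allocated: ${allocated_kb} KiB"
rm -f "$img"
```

The same two ``du`` invocations can be pointed at a COW file or a master image to see how much host disk space it really occupies.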
+ +Cows can moo - uml_moo : Merging a COW file with its backing file +----------------------------------------------------------------- + +Depending on how you use UML and COW devices, it may be advisable to +merge the changes in the COW file into the backing file every once in +a while. + +The utility that does this is uml_moo. Its usage is:: + + uml_moo COW_file new_backing_file + + +There's no need to specify the backing file since that information is +already in the COW file header. If you're paranoid, boot the new +merged file, and if you're happy with it, move it over the old backing +file. + +``uml_moo`` creates a new backing file by default as a safety measure. +It also has a destructive merge option which will merge the COW file +directly into its current backing file. This is really only usable +when the backing file only has one COW file associated with it. If +there are multiple COWs associated with a backing file, a -d merge of +one of them will invalidate all of the others. However, it is +convenient if you're short of disk space, and it should also be +noticeably faster than a non-destructive merge. + +``uml_moo`` is installed with the UML distribution packages and is +available as a part of UML utilities. + +Host file access +================== + +If you want to access files on the host machine from inside UML, you +can treat it as a separate machine and either nfs mount directories +from the host or copy files into the virtual machine with scp. +However, since UML is running on the host, it can access those +files just like any other process and make them available inside the +virtual machine without the need to use the network. +This is possible with the hostfs virtual filesystem. With it, you +can mount a host directory into the UML filesystem and access the +files contained in it just as you would on the host. 
+ +*SECURITY WARNING* + +Hostfs without any parameters to the UML Image will allow the image +to mount any part of the host filesystem and write to it. Always +confine hostfs to a specific "harmless" directory (for example ``/var/tmp``) +if running UML. This is especially important if UML is being run as root. + +Using hostfs +------------ + +To begin with, make sure that hostfs is available inside the virtual +machine with:: + + # cat /proc/filesystems + +``hostfs`` should be listed. If it's not, either rebuild the kernel +with hostfs configured into it or make sure that hostfs is built as a +module and available inside the virtual machine, and insmod it. + + +Now all you need to do is run mount:: + + # mount none /mnt/host -t hostfs + +will mount the host's ``/`` on the virtual machine's ``/mnt/host``. +If you don't want to mount the host root directory, then you can +specify a subdirectory to mount with the -o switch to mount:: + + # mount none /mnt/home -t hostfs -o /home + +will mount the host's /home on the virtual machine's /mnt/home. + +hostfs as the root filesystem +----------------------------- + +It's possible to boot from a directory hierarchy on the host using +hostfs rather than using the standard filesystem in a file. +To start, you need that hierarchy. The easiest way is to loop mount +an existing root_fs file:: + + # mount root_fs uml_root_dir -o loop + + +You need to change the filesystem type of ``/`` in ``etc/fstab`` to be +'hostfs', so that line looks like this:: + + /dev/ubd/0 / hostfs defaults 1 1 + +Then you need to chown to yourself all the files in that directory +that are owned by root. This worked for me:: + + # find . -uid 0 -exec chown jdike {} \; + +Next, make sure that your UML kernel has hostfs compiled in, not as a +module. Then run UML with the boot device pointing at that directory:: + + ubd0=/path/to/uml/root/directory + +UML should then boot as it does normally. 
+ +Hostfs Caveats +-------------- + +Hostfs does not support keeping track of host filesystem changes on the +host (outside UML). As a result, if a file is changed without UML's +knowledge, UML will not know about it and its own in-memory cache of +the file may be corrupt. While it is possible to fix this, it is not +something which is being worked on at present. + +Tuning UML +============ + +UML at present is strictly uniprocessor. It will, however spin up a +number of threads to handle various functions. + +The UBD driver, SIGIO and the MMU emulation do that. If the system is +idle, these threads will be migrated to other processors on a SMP host. +This, unfortunately, will usually result in LOWER performance because of +all of the cache/memory synchronization traffic between cores. As a +result, UML will usually benefit from being pinned on a single CPU +especially on a large system. This can result in performance differences +of 5 times or higher on some benchmarks. + +Similarly, on large multi-node NUMA systems UML will benefit if all of +its memory is allocated from the same NUMA node it will run on. The +OS will *NOT* do that by default. In order to do that, the sysadmin +needs to create a suitable tmpfs ramdisk bound to a particular node +and use that as the source for UML RAM allocation by specifying it +in the TMP or TEMP environment variables. UML will look at the values +of ``TMPDIR``, ``TMP`` or ``TEMP`` for that. If that fails, it will +look for shmfs mounted under ``/dev/shm``. If everything else fails use +``/tmp/`` regardless of the filesystem type used for it:: + + mount -t tmpfs -ompol=bind:X none /mnt/tmpfs-nodeX + TEMP=/mnt/tmpfs-nodeX taskset -cX linux options options options.. + +******************************************* +Contributing to UML and Developing with UML +******************************************* + +UML is an excellent platform to develop new Linux kernel concepts - +filesystems, devices, virtualization, etc. 
It provides unrivalled +opportunities to create and test them without being constrained to +emulating specific hardware. + +Example - want to try how linux will work with 4096 "proper" network +devices? + +Not an issue with UML. At the same time, this is something which +is difficult with other virtualization packages - they are +constrained by the number of devices allowed on the hardware bus +they are trying to emulate (for example 16 on a PCI bus in qemu). + +If you have something to contribute such as a patch, a bugfix, a +new feature, please send it to ``linux-um@lists.infradead.org`` + +Please follow all standard Linux patch guidelines such as cc-ing +relevant maintainers and running ``./scripts/checkpatch.pl`` on your patch. +For more details see ``Documentation/process/submitting-patches.rst`` + +Note - the list does not accept HTML or attachments, all emails must +be formatted as plain text. + +Developing always goes hand in hand with debugging. First of all, +you can always run UML under gdb and there will be a whole section +later on how to do that. That, however, is not the only way to +debug a linux kernel. Quite often adding tracing statements and/or +using UML specific approaches such as ptracing the UML kernel process +are significantly more informative. + +Tracing UML +============= + +When running, UML consists of a main kernel thread and a number of +helper threads. The ones of interest for tracing are NOT the ones +that are already ptraced by UML as a part of its MMU emulation. + +These are usually the first three threads visible in a ps display. +The one with the lowest PID number and using most CPU is usually the +kernel thread. The other threads are the disk +(ubd) device helper thread and the sigio helper thread. 
+Running ptrace on this thread usually results in the following picture:: + + host$ strace -p 16566 + --- SIGIO {si_signo=SIGIO, si_code=POLL_IN, si_band=65} --- + epoll_wait(4, [{EPOLLIN, {u32=3721159424, u64=3721159424}}], 64, 0) = 1 + epoll_wait(4, [], 64, 0) = 0 + rt_sigreturn({mask=[PIPE]}) = 16967 + ptrace(PTRACE_GETREGS, 16967, NULL, 0xd5f34f38) = 0 + ptrace(PTRACE_GETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=832}]) = 0 + ptrace(PTRACE_GETSIGINFO, 16967, NULL, {si_signo=SIGTRAP, si_code=0x85, si_pid=16967, si_uid=0}) = 0 + ptrace(PTRACE_SETREGS, 16967, NULL, 0xd5f34f38) = 0 + ptrace(PTRACE_SETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=2696}]) = 0 + ptrace(PTRACE_SYSEMU, 16967, NULL, 0) = 0 + --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_TRAPPED, si_pid=16967, si_uid=0, si_status=SIGTRAP, si_utime=65, si_stime=89} --- + wait4(16967, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP | 0x80}], WSTOPPED|__WALL, NULL) = 16967 + ptrace(PTRACE_GETREGS, 16967, NULL, 0xd5f34f38) = 0 + ptrace(PTRACE_GETREGSET, 16967, NT_X86_XSTATE, [{iov_base=0xd5f35010, iov_len=832}]) = 0 + ptrace(PTRACE_GETSIGINFO, 16967, NULL, {si_signo=SIGTRAP, si_code=0x85, si_pid=16967, si_uid=0}) = 0 + timer_settime(0, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=2830912}}, NULL) = 0 + getpid() = 16566 + clock_nanosleep(CLOCK_MONOTONIC, 0, {tv_sec=1, tv_nsec=0}, NULL) = ? 
 ERESTART_RESTARTBLOCK (Interrupted by signal) + --- SIGALRM {si_signo=SIGALRM, si_code=SI_TIMER, si_timerid=0, si_overrun=0, si_value={int=1631716592, ptr=0x614204f0}} --- + rt_sigreturn({mask=[PIPE]}) = -1 EINTR (Interrupted system call) + +This is a typical picture from a mostly idle UML instance + +* UML interrupt controller uses epoll - this is UML waiting for IO + interrupts: + + epoll_wait(4, [{EPOLLIN, {u32=3721159424, u64=3721159424}}], 64, 0) = 1 + +* The sequence of ptrace calls is part of MMU emulation and running the + UML userspace +* ``timer_settime`` is part of the UML high res timer subsystem mapping + timer requests from inside UML onto the host high resolution timers. +* ``clock_nanosleep`` is UML going into idle (similar to the way a PC + will execute an ACPI idle). + +As you can see UML will generate quite a bit of output even in idle. The output +can be very informative when observing IO. It shows the actual IO calls, their +arguments and return values. + +Kernel debugging +================ + +You can run UML under gdb now, though it will not necessarily agree to +be started under it. If you are trying to track a runtime bug, it is +much better to attach gdb to a running UML instance and let UML run. + +Assuming the same PID number as in the previous example, this would be:: + + # gdb -p 16566 + +This will STOP the UML instance, so you must enter `cont` at the GDB +command line to request it to continue. It may be a good idea to make +this into a gdb script and pass it to gdb as an argument. + +Developing Device Drivers +========================= + +Nearly all UML drivers are monolithic. While it is possible to build a +UML driver as a kernel module, that limits the possible functionality +to in-kernel only and non-UML specific. The reason for this is that +in order to really leverage UML, one needs to write a piece of +userspace code which maps driver concepts onto actual userspace host +calls. 
+ +This forms the so called "user" portion of the driver. While it can +reuse a lot of kernel concepts, it is generally just another piece of +userspace code. This portion needs some matching "kernel" code which +resides inside the UML image and which implements the Linux kernel part. + +*Note: There are very few limitations in the way "kernel" and "user" interact*. + +UML does not have a strictly defined kernel to host API. It does not +try to emulate a specific architecture or bus. UML's "kernel" and +"user" can share memory, code and interact as needed to implement +whatever design the software developer has in mind. The only +limitations are purely technical. Due to a lot of functions and +variables having the same names, the developer should be careful +which includes and libraries they are trying to refer to. + +As a result a lot of userspace code consists of simple wrappers. +F.e. ``os_close_file()`` is just a wrapper around ``close()`` +which ensures that the userspace function close does not clash +with similarly named function(s) in the kernel part. + +Security Considerations +----------------------- + +Drivers or any new functionality should default to not +accepting arbitrary filename, bpf code or other parameters +which can affect the host from inside the UML instance. +For example, specifying the socket used for IPC communication +between a driver and the host at the UML command line is OK +security-wise. Allowing it as a loadable module parameter +isn't. + +If such functionality is desirable for a particular application +(e.g. loading BPF "firmware" for raw socket network transports), +it should be off by default and should be explicitly turned on +as a command line parameter at startup. + +Even with this in mind, the level of isolation between UML +and the host is relatively weak. If the UML userspace is +allowed to load arbitrary kernel drivers, an attacker can +use this to break out of UML. 
Thus, if UML is used in
+a production application, it is recommended that all modules
+are loaded at boot and kernel module loading is disabled
+afterwards.
diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 6f9e000757fa..dd9f76a4ef29 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -1,4 +1,4 @@
-.. hmm:
+.. _hmm:

 =====================================
 Heterogeneous Memory Management (HMM)
@@ -271,10 +271,139 @@ map those pages from the CPU side.
 Migration to and from device memory
 ===================================

-Because the CPU cannot access device memory, migration must use the device DMA
-engine to perform copy from and to device memory. For this we need to use
-migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize() helpers.
-
+Because the CPU cannot access device memory directly, the device driver must
+use hardware DMA or device specific load/store instructions to migrate data.
+The migrate_vma_setup(), migrate_vma_pages(), and migrate_vma_finalize()
+functions are designed to make drivers easier to write and to centralize common
+code across drivers.
+
+Before migrating pages to device private memory, special device private
+``struct page`` need to be created. These will be used as special "swap"
+page table entries so that a CPU process will fault if it tries to access
+a page that has been migrated to device private memory.
+
+These can be allocated and freed with::
+
+    struct resource *res;
+    struct dev_pagemap pagemap;
+
+    res = request_free_mem_region(&iomem_resource, /* number of bytes */,
+                                  "name of driver resource");
+    pagemap.type = MEMORY_DEVICE_PRIVATE;
+    pagemap.range.start = res->start;
+    pagemap.range.end = res->end;
+    pagemap.nr_range = 1;
+    pagemap.ops = &device_devmem_ops;
+    memremap_pages(&pagemap, numa_node_id());
+
+    memunmap_pages(&pagemap);
+    release_mem_region(pagemap.range.start, range_len(&pagemap.range));
+
+There are also devm_request_free_mem_region(), devm_memremap_pages(),
+devm_memunmap_pages(), and devm_release_mem_region() when the resources can
+be tied to a ``struct device``.
+
+The overall migration steps are similar to migrating NUMA pages within system
+memory (see :ref:`Page migration <page_migration>`) but the steps are split
+between device driver specific code and shared common code:
+
+1. ``mmap_read_lock()``
+
+   The device driver has to pass a ``struct vm_area_struct`` to
+   migrate_vma_setup() so the mmap_read_lock() or mmap_write_lock() needs to
+   be held for the duration of the migration.
+
+2. ``migrate_vma_setup(struct migrate_vma *args)``
+
+   The device driver initializes the ``struct migrate_vma`` fields and passes
+   the pointer to migrate_vma_setup(). The ``args->flags`` field is used to
+   filter which source pages should be migrated. For example, setting
+   ``MIGRATE_VMA_SELECT_SYSTEM`` will only migrate system memory and
+   ``MIGRATE_VMA_SELECT_DEVICE_PRIVATE`` will only migrate pages residing in
+   device private memory. If the latter flag is set, the ``args->pgmap_owner``
+   field is used to identify device private pages owned by the driver. This
+   avoids trying to migrate device private pages residing in other devices.
+   Currently only anonymous private VMA ranges can be migrated to or from
+   system memory and device private memory.
+
+   One of the first steps migrate_vma_setup() does is to invalidate other
+   devices' MMUs with the ``mmu_notifier_invalidate_range_start()`` and
+   ``mmu_notifier_invalidate_range_end()`` calls around the page table
+   walks to fill in the ``args->src`` array with PFNs to be migrated.
+   The ``invalidate_range_start()`` callback is passed a
+   ``struct mmu_notifier_range`` with the ``event`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This
+   allows the device driver to skip the invalidation callback and only
+   invalidate device private MMU mappings that are actually migrating.
+   This is explained more in the next section.
+
+   While walking the page tables, a ``pte_none()`` or ``is_zero_pfn()``
+   entry results in a valid "zero" PFN stored in the ``args->src`` array.
+   This lets the driver allocate device private memory and clear it instead
+   of copying a page of zeros. Valid PTE entries to system memory or
+   device private struct pages will be locked with ``lock_page()``, isolated
+   from the LRU (if system memory since device private pages are not on
+   the LRU), unmapped from the process, and a special migration PTE is
+   inserted in place of the original PTE.
+   migrate_vma_setup() also clears the ``args->dst`` array.
+
+3. The device driver allocates destination pages and copies source pages to
+   destination pages.
+
+   The driver checks each ``src`` entry to see if the ``MIGRATE_PFN_MIGRATE``
+   bit is set and skips entries that are not migrating. The device driver
+   can also choose to skip migrating a page by not filling in the ``dst``
+   array for that page.
+
+   The driver then allocates either a device private struct page or a
+   system memory page, locks the page with ``lock_page()``, and fills in the
+   ``dst`` array entry with::
+
+     dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+
+   Now that the driver knows that this page is being migrated, it can
+   invalidate device private MMU mappings and copy device private memory
+   to system memory or another device private page. The core Linux kernel
+   handles CPU page table invalidations so the device driver only has to
+   invalidate its own MMU mappings.
+
+   The driver can use ``migrate_pfn_to_page(src[i])`` to get the
+   ``struct page`` of the source and either copy the source page to the
+   destination or clear the destination device private memory if the pointer
+   is ``NULL`` meaning the source page was not populated in system memory.
+
+4. ``migrate_vma_pages()``
+
+   This step is where the migration is actually "committed".
+
+   If the source page was a ``pte_none()`` or ``is_zero_pfn()`` page, this
+   is where the newly allocated page is inserted into the CPU's page table.
+   This can fail if a CPU thread faults on the same page. However, the page
+   table is locked and only one of the new pages will be inserted.
+   The device driver will see that the ``MIGRATE_PFN_MIGRATE`` bit is cleared
+   if it loses the race.
+
+   If the source page was locked, isolated, etc. the source ``struct page``
+   information is now copied to destination ``struct page`` finalizing the
+   migration on the CPU side.
+
+5. Device driver updates device MMU page tables for pages still migrating,
+   rolling back pages not migrating.
+
+   If the ``src`` entry still has ``MIGRATE_PFN_MIGRATE`` bit set, the device
+   driver can update the device MMU and set the write enable bit if the
+   ``MIGRATE_PFN_WRITE`` bit is set.
+
+6.
``migrate_vma_finalize()``
+
+   This step replaces the special migration page table entry with the new
+   page's page table entry and releases the reference to the source and
+   destination ``struct page``.
+
+7. ``mmap_read_unlock()``
+
+   The lock can now be released.

 Memory cgroup (memcg) and rss accounting
 ========================================
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index 611140ffef7e..eff5fbd492d0 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -29,6 +29,7 @@ descriptions of data structures and algorithms.
    :maxdepth: 1

    active_mm
+   arch_pgtable_helpers
    balance
    cleancache
    free_page_reporting
diff --git a/Documentation/vm/page_migration.rst b/Documentation/vm/page_migration.rst
index 68883ac485fa..91a98a6b43bb 100644
--- a/Documentation/vm/page_migration.rst
+++ b/Documentation/vm/page_migration.rst
@@ -4,25 +4,28 @@ Page migration
 ==============

-Page migration allows the moving of the physical location of pages between
-nodes in a numa system while the process is running. This means that the
+Page migration allows moving the physical location of pages between
+nodes in a NUMA system while the process is running. This means that the
 virtual addresses that the process sees do not change. However, the
 system rearranges the physical location of those pages.

-The main intend of page migration is to reduce the latency of memory access
+Also see :ref:`Heterogeneous Memory Management (HMM) <hmm>`
+for migrating pages to or from device private memory.
+
+The main intent of page migration is to reduce the latency of memory accesses
 by moving pages near to the processor where the process accessing that memory
 is running.

 Page migration allows a process to manually relocate the node on which its
 pages are located through the MF_MOVE and MF_MOVE_ALL options while setting
-a new memory policy via mbind(). The pages of process can also be relocated
+a new memory policy via mbind().
The pages of a process can also be relocated
 from another process using the sys_migrate_pages() function call. The
-migrate_pages function call takes two sets of nodes and moves pages of a
+migrate_pages() function call takes two sets of nodes and moves pages of a
 process that are located on the from nodes to the destination nodes.
 Page migration functions are provided by the numactl package by Andi Kleen
 (a version later than 0.9.3 is required. Get it from
-ftp://oss.sgi.com/www/projects/libnuma/download/). numactl provides libnuma
-which provides an interface similar to other numa functionality for page
+https://github.com/numactl/numactl.git). numactl provides libnuma
+which provides an interface similar to other NUMA functionality for page
 migration. cat ``/proc/<pid>/numa_maps`` allows an easy review of where the
 pages of a process are located. See also the numa_maps documentation in the
 proc(5) man page.
@@ -30,19 +33,19 @@ proc(5) man page.
 Manual migration is useful if for example the scheduler has relocated
 a process to a processor on a distant node. A batch scheduler or an
 administrator may detect the situation and move the pages of the process
-nearer to the new processor. The kernel itself does only provide
+nearer to the new processor. The kernel itself only provides
 manual page migration support. Automatic page migration may be implemented
 through user space processes that move pages. A special function call
 "move_pages" allows the moving of individual pages within a process.
-A NUMA profiler may f.e. obtain a log showing frequent off node
+For example, a NUMA profiler may obtain a log showing frequent off-node
 accesses and may use the result to move pages to more advantageous
 locations.

 Larger installations usually partition the system using cpusets into
 sections of nodes. Paul Jackson has equipped cpusets with the ability to
 move pages when a task is moved to another cpuset (See
-Documentation/admin-guide/cgroup-v1/cpusets.rst).
-Cpusets allows the automation of process locality. If a task is moved to
+:ref:`CPUSETS <cpusets>`).
+Cpusets allow the automation of process locality. If a task is moved to
 a new cpuset then also all its pages are moved with it so that the
 performance of the process does not sink dramatically. Also the pages
 of processes in a cpuset are moved if the allowed memory nodes of a
@@ -67,9 +70,9 @@ In kernel use of migrate_pages()
    Lists of pages to be migrated are generated by scanning over
    pages and moving them into lists. This is done by
    calling isolate_lru_page().
-   Calling isolate_lru_page increases the references to the page
+   Calling isolate_lru_page() increases the references to the page
    so that it cannot vanish while the page migration occurs.
-   It also prevents the swapper or other scans to encounter
+   It also prevents the swapper or other scans from encountering
    the page.

 2. We need to have a function of type new_page_t that can be
@@ -91,23 +94,24 @@ is increased so that the page cannot be freed while page migration occurs.

 Steps:

-1. Lock the page to be migrated
+1. Lock the page to be migrated.

 2. Ensure that writeback is complete.

 3. Lock the new page that we want to move to. It is locked so that accesses to
-   this (not yet uptodate) page immediately lock while the move is in progress.
+   this (not yet uptodate) page immediately block while the move is in progress.

 4. All the page table references to the page are converted to migration
    entries. This decreases the mapcount of a page. If the resulting
    mapcount is not zero then we do not migrate the page. All user space
-   processes that attempt to access the page will now wait on the page lock.
+   processes that attempt to access the page will now wait on the page lock
+   or wait for the migration page table entry to be removed.

 5. The i_pages lock is taken. This will cause all processes trying
    to access the page via the mapping to block on the spinlock.

-6.
The refcount of the page is examined and we back out if references remain
-   otherwise we know that we are the only one referencing this page.
+6. The refcount of the page is examined and we back out if references remain.
+   Otherwise, we know that we are the only one referencing this page.

 7. The radix tree is checked and if it does not contain the pointer to this
    page then we back out because someone else modified the radix tree.
@@ -134,124 +138,124 @@ Steps:

 15. Queued up writeback on the new page is triggered.

-16. If migration entries were page then replace them with real ptes. Doing
-    so will enable access for user space processes not already waiting for
-    the page lock.
+16. If migration entries were inserted into the page table, then replace them
+    with real ptes. Doing so will enable access for user space processes not
+    already waiting for the page lock.

-19. The page locks are dropped from the old and new page.
+17. The page locks are dropped from the old and new page.
     Processes waiting on the page lock will redo their page faults
     and will reach the new page.

-20. The new page is moved to the LRU and can be scanned by the swapper
-    etc again.
+18. The new page is moved to the LRU and can be scanned by the swapper,
+    etc. again.

 Non-LRU page migration
 ======================

-Although original migration aimed for reducing the latency of memory access
-for NUMA, compaction who want to create high-order page is also main customer.
+Although migration originally aimed for reducing the latency of memory accesses
+for NUMA, compaction also uses migration to create high-order pages.

 Current problem of the implementation is that it is designed to migrate only
-*LRU* pages. However, there are potential non-lru pages which can be migrated
+*LRU* pages. However, there are potential non-LRU pages which can be migrated
 in drivers, for example, zsmalloc, virtio-balloon pages.
For virtio-balloon pages, some parts of migration code path have been hooked
 up and added virtio-balloon specific functions to intercept migration logics.
 It's too specific to a driver so other drivers who want to make their pages
-movable would have to add own specific hooks in migration path.
+movable would have to add their own specific hooks in the migration path.

-To overclome the problem, VM supports non-LRU page migration which provides
+To overcome the problem, VM supports non-LRU page migration which provides
 generic functions for non-LRU movable pages without driver specific hooks
-migration path.
+in the migration path.

-If a driver want to make own pages movable, it should define three functions
+If a driver wants to make its pages movable, it should define three functions
 which are function pointers of struct address_space_operations.

 1. ``bool (*isolate_page) (struct page *page, isolate_mode_t mode);``

-   What VM expects on isolate_page function of driver is to return *true*
-   if driver isolates page successfully. On returing true, VM marks the page
+   What VM expects from isolate_page() function of driver is to return *true*
+   if driver isolates the page successfully. On returning true, VM marks the page
    as PG_isolated so concurrent isolation in several CPUs skip the page
    for isolation. If a driver cannot isolate the page, it should return *false*.

    Once page is successfully isolated, VM uses page.lru fields so driver
-   shouldn't expect to preserve values in that fields.
+   shouldn't expect to preserve values in those fields.

 2. ``int (*migratepage) (struct address_space *mapping,``
 |	``struct page *newpage, struct page *oldpage, enum migrate_mode);``

-   After isolation, VM calls migratepage of driver with isolated page.
-   The function of migratepage is to move content of the old page to new page
+   After isolation, VM calls migratepage() of driver with the isolated page.
+   The function of migratepage() is to move the contents of the old page to the
+   new page
    and set up fields of struct page newpage. Keep in mind that you should
    indicate to the VM the oldpage is no longer movable via __ClearPageMovable()
-   under page_lock if you migrated the oldpage successfully and returns
+   under page_lock if you migrated the oldpage successfully and returned
    MIGRATEPAGE_SUCCESS. If driver cannot migrate the page at the moment, driver
    can return -EAGAIN. On -EAGAIN, VM will retry page migration in a short time
-   because VM interprets -EAGAIN as "temporal migration failure". On returning
-   any error except -EAGAIN, VM will give up the page migration without retrying
-   in this time.
+   because VM interprets -EAGAIN as "temporary migration failure". On returning
+   any error except -EAGAIN, VM will give up the page migration without
+   retrying.

-   Driver shouldn't touch page.lru field VM using in the functions.
+   Driver shouldn't touch the page.lru field while in the migratepage() function.

 3. ``void (*putback_page)(struct page *);``

-   If migration fails on isolated page, VM should return the isolated page
-   to the driver so VM calls driver's putback_page with migration failed page.
-   In this function, driver should put the isolated page back to the own data
+   If migration fails on the isolated page, VM should return the isolated page
+   to the driver so VM calls the driver's putback_page() with the isolated page.
+   In this function, the driver should put the isolated page back into its own data
    structure.

-4. non-lru movable page flags
+4. non-LRU movable page flags

-   There are two page flags for supporting non-lru movable page.
+   There are two page flags for supporting non-LRU movable page.
   * PG_movable

-     Driver should use the below function to make page movable under page_lock::
+     Driver should use the function below to make page movable under page_lock::

 	void __SetPageMovable(struct page *page, struct address_space *mapping)

      It needs argument of address_space for registering migration
      family functions which will be called by VM. Exactly speaking,
-     PG_movable is not a real flag of struct page. Rather than, VM
-     reuses page->mapping's lower bits to represent it.
+     PG_movable is not a real flag of struct page. Rather, VM
+     reuses the page->mapping's lower bits to represent it::

-::
 	#define PAGE_MAPPING_MOVABLE 0x2
 	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

      so driver shouldn't access page->mapping directly. Instead, driver should
-     use page_mapping which mask off the low two bits of page->mapping under
-     page lock so it can get right struct address_space.
-
-     For testing of non-lru movable page, VM supports __PageMovable function.
-     However, it doesn't guarantee to identify non-lru movable page because
-     page->mapping field is unified with other variables in struct page.
-     As well, if driver releases the page after isolation by VM, page->mapping
-     doesn't have stable value although it has PAGE_MAPPING_MOVABLE
-     (Look at __ClearPageMovable). But __PageMovable is cheap to catch whether
-     page is LRU or non-lru movable once the page has been isolated. Because
-     LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
-     good for just peeking to test non-lru movable pages before more expensive
-     checking with lock_page in pfn scanning to select victim.
-
-     For guaranteeing non-lru movable page, VM provides PageMovable function.
-     Unlike __PageMovable, PageMovable functions validates page->mapping and
-     mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
-     destroying of page->mapping.
-
-     Driver using __SetPageMovable should clear the flag via __ClearMovablePage
-     under page_lock before the releasing the page.
+     use page_mapping() which masks off the low two bits of page->mapping under
+     page lock so it can get the right struct address_space.
+
+     For testing of non-LRU movable pages, VM supports __PageMovable() function.
+     However, it doesn't guarantee to identify non-LRU movable pages because
+     the page->mapping field is unified with other variables in struct page.
+     If the driver releases the page after isolation by VM, page->mapping
+     doesn't have a stable value although it has PAGE_MAPPING_MOVABLE set
+     (look at __ClearPageMovable). But __PageMovable() is cheap to call whether
+     page is LRU or non-LRU movable once the page has been isolated because LRU
+     pages can never have PAGE_MAPPING_MOVABLE set in page->mapping. It is also
+     good for just peeking to test non-LRU movable pages before more expensive
+     checking with lock_page() in pfn scanning to select a victim.
+
+     For guaranteeing non-LRU movable page, VM provides PageMovable() function.
+     Unlike __PageMovable(), PageMovable() validates page->mapping and
+     mapping->a_ops->isolate_page under lock_page(). The lock_page() prevents
+     sudden destroying of page->mapping.
+
+     Drivers using __SetPageMovable() should clear the flag via
+     __ClearPageMovable() under page_lock() before releasing the page.

    * PG_isolated

      To prevent concurrent isolation among several CPUs, VM marks isolated page
-     as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru
-     movable page, it can skip it. Driver doesn't need to manipulate the flag
-     because VM will set/clear it automatically. Keep in mind that if driver
-     sees PG_isolated page, it means the page have been isolated by VM so it
-     shouldn't touch page.lru field.
-     PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
-     for own purpose.
+     as PG_isolated under lock_page().
So if a CPU encounters PG_isolated
+     non-LRU movable page, it can skip it. Driver doesn't need to manipulate the
+     flag because VM will set/clear it automatically. Keep in mind that if the
+     driver sees a PG_isolated page, it means the page has been isolated by the
+     VM so it shouldn't touch the page.lru field.
+     The PG_isolated flag is aliased with the PG_reclaim flag so drivers
+     shouldn't use PG_isolated for their own purposes.

 Monitoring Migration
 =====================
@@ -266,8 +270,8 @@ The following events (counters) can be used to monitor page migration.
    512.

 2. PGMIGRATE_FAIL: Normal page migration failure. Same counting rules as for
-   _SUCCESS, above: this will be increased by the number of subpages, if it was
-   a THP.
+   PGMIGRATE_SUCCESS, above: this will be increased by the number of subpages,
+   if it was a THP.

 3. THP_MIGRATION_SUCCESS: A THP was migrated without being split.

diff --git a/Documentation/watch_queue.rst b/Documentation/watch_queue.rst
index 849fad6893ef..54f13ad5fc17 100644
--- a/Documentation/watch_queue.rst
+++ b/Documentation/watch_queue.rst
@@ -103,8 +103,10 @@ watch that specific key).

 To manage a watch list, the following functions are provided:

-  * ``void init_watch_list(struct watch_list *wlist,
-			    void (*release_watch)(struct watch *wlist));``
+  * ::
+
+	void init_watch_list(struct watch_list *wlist,
+			     void (*release_watch)(struct watch *wlist));

     Initialise a watch list. If ``release_watch`` is not NULL, then this
     indicates a function that should be called when the watch_list object is
@@ -179,9 +181,11 @@ The following functions are provided to manage watches:
     driver-settable fields in the watch struct must have been set before this
     is called.
-  * ``int remove_watch_from_object(struct watch_list *wlist,
-				    struct watch_queue *wqueue,
-				    u64 id, false);``
+  * ::
+
+	int remove_watch_from_object(struct watch_list *wlist,
+				     struct watch_queue *wqueue,
+				     u64 id, false);

     Remove a watch from a watch list, where the watch must match the specified
     watch queue (``wqueue``) and object identifier (``id``). A notification |