aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2020-04-09crypto: curve25519 - do not pollute dispatcher based on assemblerJason A. Donenfeld
Since we're doing a static inline dispatch here, we normally branch based on whether or not there's an arch implementation. That would have been fine in general, except the crypto Makefile prior used to turn things off -- despite the Kconfig -- resulting in us needing to also hard code various assembler things into the dispatcher too. The horror! Now that the assembler config options are done by Kconfig, we can get rid of the inconsistency. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-04-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input Pull input updates from Dmitry Torokhov: "An update to the Goodix touchscreen driver to enable it work properly on various Bay Trail and Cherry Trail devices, and a few other assorted changes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (26 commits) Input: update SPDX tag for input-event-codes.h Input: i8042 - add Acer Aspire 5738z to nomux list Input: goodix - fix compilation when ACPI support is disabled dt-bindings: touchscreen: Convert edt-ft5x06 to json-schema Input: of_touchscreen - explicitly choose axis Input: goodix - support gt9147 touchpanel dt-bindings: touchscreen: goodix: support of gt9147 Input: goodix - add support for Goodix GT917S Input: goodix - use string-based chip ID dt-bindings: input: touchscreen: add compatible string for Goodix GT917S Input: goodix - add support for more then one touch-key Input: goodix - fix spurious key release events Input: goodix - try to reset the controller if the i2c-test fails Input: goodix - restore config on resume if necessary Input: goodix - make goodix_send_cfg() take a raw buffer as argument Input: goodix - add minimum firmware size check Input: goodix - save a copy of the config from goodix_read_config() Input: goodix - move defines to above struct goodix_ts_data declaration Input: goodix - add support for controlling the IRQ pin through ACPI methods Input: goodix - add support for getting IRQ + reset GPIOs on Bay Trail devices ...
2020-04-07Merge tag 'thermal-v5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux Pull thermal updates from Daniel Lezcano: - Convert tsens configuration DT binding to yaml (Rajeshwari) - Add interrupt support on the rcar sensor (Niklas Söderlund) - Add a new Spreadtrum thermal driver (Baolin Wang) - Add thermal binding for the fsl scu board, a new API to retrieve the sensor id bound to the thermal zone and i.MX system controller sensor (Anson Huang)) - Remove warning log when a deferred probe is requested on Exynos (Marek Szyprowski) - Add the thermal monitoring unit support for imx8mm with its DT bindings (Anson Huang) - Rephrase the Kconfig text for clarity (Linus Walleij) - Use the gpio descriptor for the ti-soc-thermal (Linus Walleij) - Align msg structure to 4 bytes for i.MX SC, fix the Kconfig dependency, add the __may_be unused annotation for PM functions and the COMPILE_TEST option for imx8mm (Anson Huang) - Fix a dependency on regmap in Kconfig for qoriq (Yuantian Tang) - Add DT binding and support for the rcar gen3 r8a77961 and improve the error path on the rcar init function (Niklas Söderlund) - Cleanup and improvements for the tsens Qcom sensor (Amit Kucheria) - Improve code by removing lock and caching values in the rcar thermal sensor (Niklas Söderlund) - Cleanup in the qoriq drivers and add a call to imx_thermal_unregister_legacy_cooling in the removal function (Anson Huang) - Remove redundant 'maxItems' in tsens and sprd DT bindings (Rob Herring) - Change the thermal DT bindings by making the cooling-maps optional (Yuantian Tang) - Add Tiger Lake support (Sumeet Pawnikar) - Use scnprintf() for avoiding potential buffer overflow (Takashi Iwai) - Make pkg_temp_lock a raw_spinlock_t(Clark Williams) - Fix incorrect data types by changing them to signed on i.MX SC (Anson Huang) - Replace zero-length array with flexible-array member (Gustavo A. R. Silva) - Add support for i.MX8MP in the driver and in the DT bindings (Anson Huang) - Fix return value of the cpufreq_set_cur_state() function (Willy Wolff) - Remove abusing and scary WARN_ON in the cpufreq cooling device (Daniel Lezcano) - Fix build warning of incorrect argument type reported by sparse on imx8mm (Anson Huang) - Fix stub for the devfreq cooling device (Martin Blumenstingl) - Fix cpu idle cooling documentation (Sergey Vidishev) * tag 'thermal-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: (52 commits) Documentation: cpu-idle-cooling: Fix diagram for 33% duty cycle thermal: devfreq_cooling: inline all stubs for CONFIG_DEVFREQ_THERMAL=n thermal: imx8mm: Fix build warning of incorrect argument type thermal/drivers/cpufreq_cooling: Remove abusing WARN_ON thermal/drivers/cpufreq_cooling: Fix return of cpufreq_set_cur_state thermal: imx8mm: Add i.MX8MP support dt-bindings: thermal: imx8mm-thermal: Add support for i.MX8MP thermal: qcom: tsens.h: Replace zero-length array with flexible-array member thermal: imx_sc_thermal: Fix incorrect data type thermal: int340x_thermal: Use scnprintf() for avoiding potential buffer overflow thermal: int340x: processor_thermal: Add Tiger Lake support thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t dt-bindings: thermal: make cooling-maps property optional dt-bindings: thermal: qcom-tsens: Remove redundant 'maxItems' dt-bindings: thermal: sprd: Remove redundant 'maxItems' thermal: imx: Calling imx_thermal_unregister_legacy_cooling() in .remove thermal: qoriq: Sort includes alphabetically thermal: qoriq: Use devm_add_action_or_reset() to handle all cleanups thermal: rcar_thermal: Remove lock in rcar_thermal_get_current_temp() thermal: rcar_thermal: Do not store ctemp in rcar_thermal_priv ...
2020-04-07Merge tag 'mfd-next-5.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd Pull mfd updates from Lee Jones: "New Drivers: - Add support for IQS620A/621/622/624/625 Azoteq IQS62X Sensors New Device Support: - Add support for ADC, IRQ, Regulator, RTC and WDT to Ricoh RN5T618 PMIC - Add support for Comet Lake to Intel LPSS New Functionality: - Add support for Charger Detection to Spreadtrum SC27xx PMICs - Add support for Interrupt Polarity to Dialog Semi DA9062/61 PMIC - Add ACPI enumeration support to Diolan DLN2 USB Adaptor Fix-ups: - Device Tree; iqs62x, rn5t618, cros_ec_dev, stm32-lptimer, rohm,bd71837, rohm,bd71847 - I2C registration; rn5t618 - Kconfig; MFD_CPCAP, AB8500_CORE, MFD_WM8994, MFD_WM97xx, MFD_STPMIC1 - Use flexible-array members; omap-usb-tll, qcom-pm8xxx - Remove unnecessary casts; omap-usb-host, omap-usb-tll - Power (suspend/resume/poweroff) enhancements; rk808 - Improve error/sanity checking; dln2 - Use snprintf(); aat2870-core Bug Fixes: - Fix PCI IDs in intel-lpss-pci" * tag 'mfd-next-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (33 commits) mfd: intel-lpss: Fix Intel Elkhart Lake LPSS I2C input clock mfd: aat2870: Use scnprintf() for avoiding potential buffer overflow mfd: dln2: Allow to be enumerated via ACPI mfd: da9062: Add support for interrupt polarity defined in device tree dt-bindings: bd718x7: Yamlify and add BD71850 mfd: dln2: Fix sanity checking for endpoints mfd: intel-lpss: Add Intel Comet Lake PCH-V PCI IDs mfd: sc27xx: Add USB charger type detection support dt-bindings: mfd: Document STM32 low power timer bindings mfd: rk808: Convert RK805 to shutdown/suspend hooks mfd: rk808: Reduce shutdown duplication mfd: rk808: Stop using syscore ops mfd: rk808: Ensure suspend/resume hooks always work mfd: rk808: Always use poweroff when requested mfd: omap: Remove useless cast for driver.name mfd: Kconfig: Fix some misspelling of the word functionality mfd: pm8xxx: Replace zero-length array with flexible-array member mfd: omap-usb-tll: Replace zero-length array with flexible-array member mfd: cpcap: Fix compile if MFD_CORE is not selected mfd: cros_ec: Check DT node for usbpd-notify add ...
2020-04-07Merge tag 'backlight-next-5.7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight Pull backlight updates from Lee Jones: "Switch pwm_bl and corgi_lcd drivers to use GPIO descriptors" * tag 'backlight-next-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight: backlight: corgi: Convert to use GPIO descriptors backlight: pwm_bl: Switch to full GPIO descriptor
2020-04-07Merge tag 'leds-5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds Pull LED updates from Pavel Machek: "One new driver, some driver changes, and some late minute cleanups -- but those are just whitespace so should be okay. There are some major changes being prepared (multicolor, triggers) so the next release likely will be more interesting" * tag 'leds-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/pavel/linux-leds: leds: core: Fix warning message when init_data leds: make functions easier to understand leds: sort Makefile entries leds: old enums are not really applicable to new code leds: ip30: label power LED as such leds: lm3532: make bitfield 'enabled' unsigned leds: leds-pwm: Replace zero-length array with flexible-array member leds: leds-is31fl32xx: Replace zero-length array with flexible-array member leds: pwm: remove useless pwm_period_ns leds: pwm: remove header leds: pwm: convert to atomic PWM API leds: pwm: simplify if condition leds: add SGI IP30 led support leds: lm3697: fix spelling mistake "To" -> "Too" leds: leds-bd2802: remove set but not used variable 'pdata' leds: ns2: Convert to GPIO descriptors leds: ns2: Absorb platform data
2020-04-07Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge more updates from Andrew Morton: - a lot more of MM, quite a bit more yet to come: (memcg, pagemap, vmalloc, pagealloc, migration, thp, ksm, madvise, virtio, userfaultfd, memory-hotplug, shmem, rmap, zswap, zsmalloc, cleanups) - various other subsystems (procfs, misc, MAINTAINERS, bitops, lib, checkpatch, epoll, binfmt, kallsyms, reiserfs, kmod, gcov, kconfig, ubsan, fault-injection, ipc) * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (158 commits) ipc/shm.c: make compat_ksys_shmctl() static ipc/mqueue.c: fix a brace coding style issue lib/Kconfig.debug: fix a typo "capabilitiy" -> "capability" ubsan: include bug type in report header kasan: unset panic_on_warn before calling panic() ubsan: check panic_on_warn drivers/misc/lkdtm/bugs.c: add arithmetic overflow and array bounds checks ubsan: split "bounds" checker from other options ubsan: add trap instrumentation option init/Kconfig: clean up ANON_INODES and old IO schedulers options kernel/gcov/fs.c: replace zero-length array with flexible-array member gcov: gcc_3_4: replace zero-length array with flexible-array member gcov: gcc_4_7: replace zero-length array with flexible-array member kernel/kmod.c: fix a typo "assuems" -> "assumes" reiserfs: clean up several indentation issues kallsyms: unexport kallsyms_lookup_name() and kallsyms_on_each_symbol() samples/hw_breakpoint: drop use of kallsyms_lookup_name() samples/hw_breakpoint: drop HW_BREAKPOINT_R when reporting writes fs/binfmt_elf.c: don't free interpreter's ELF pheaders on common path fs/binfmt_elf.c: allocate less for static executable ...
2020-04-07Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates from Trond Myklebust: "Highlights include: Stable fixes: - Fix a page leak in nfs_destroy_unlinked_subrequests() - Fix use-after-free issues in nfs_pageio_add_request() - Fix new mount code constant_table array definitions - finish_automount() requires us to hold 2 refs to the mount record Features: - Improve the accuracy of telldir/seekdir by using 64-bit cookies when possible. - Allow one RDMA active connection and several zombie connections to prevent blocking if the remote server is unresponsive. - Limit the size of the NFS access cache by default - Reduce the number of references to credentials that are taken by NFS - pNFS files and flexfiles drivers now support per-layout segment COMMIT lists. - Enable partial-file layout segments in the pNFS/flexfiles driver. - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from the DS using the layouterror mechanism. Bugfixes and cleanups: - SUNRPC: Fix krb5p regressions - Don't specify NFS version in "UDP not supported" error - nfsroot: set tcp as the default transport protocol - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid() - alloc_nfs_open_context() must use the file cred when available - Fix locking when dereferencing the delegation cred - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails - Various clean ups of the NFS O_DIRECT commit code - Clean up RDMA connect/disconnect - Replace zero-length arrays with C99-style flexible arrays" * tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits) NFS: Clean up process of marking inode stale. SUNRPC: Don't start a timer on an already queued rpc task NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn() NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode() NFS: Beware when dereferencing the delegation cred NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout NFS: finish_automount() requires us to hold 2 refs to the mount record NFS: Fix a few constant_table array definitions NFS: Try to join page groups before an O_DIRECT retransmission NFS: Refactor nfs_lock_and_join_requests() NFS: Reverse the submission order of requests in __nfs_pageio_add_request() NFS: Clean up nfs_lock_and_join_requests() NFS: Remove the redundant function nfs_pgio_has_mirroring() NFS: Fix memory leaks in nfs_pageio_stop_mirroring() NFS: Fix a request reference leak in nfs_direct_write_clear_reqs() NFS: Fix use-after-free issues in nfs_pageio_add_request() NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests() NFS: Fix a page leak in nfs_destroy_unlinked_subrequests() NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio() pNFS/flexfiles: Specify the layout segment range in LAYOUTGET ...
2020-04-07Merge tag 'f2fs-for-5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs Pull f2fs updates from Jaegeuk Kim: "In this round, we've mainly focused on fixing bugs and addressing issues in recently introduced compression support. Enhancement: - add zstd support, and set LZ4 by default - add ioctl() to show # of compressed blocks - show mount time in debugfs - replace rwsem with spinlock - avoid lock contention in DIO reads Some major bug fixes wrt compression: - compressed block count - memory access and leak - remove obsolete fields - flag controls Other bug fixes and clean ups: - fix overflow when handling .flags in inode_info - fix SPO issue during resize FS flow - fix compression with fsverity enabled - potential deadlock when writing compressed pages - show missing mount options" * tag 'f2fs-for-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs: (66 commits) f2fs: keep inline_data when compression conversion f2fs: fix to disable compression on directory f2fs: add missing CONFIG_F2FS_FS_COMPRESSION f2fs: switch discard_policy.timeout to bool type f2fs: fix to verify tpage before releasing in f2fs_free_dic() f2fs: show compression in statx f2fs: clean up dic->tpages assignment f2fs: compress: support zstd compress algorithm f2fs: compress: add .{init,destroy}_decompress_ctx callback f2fs: compress: fix to call missing destroy_compress_ctx() f2fs: change default compression algorithm f2fs: clean up {cic,dic}.ref handling f2fs: fix to use f2fs_readpage_limit() in f2fs_read_multi_pages() f2fs: xattr.h: Make stub helpers inline f2fs: fix to avoid double unlock f2fs: fix potential .flags overflow on 32bit architecture f2fs: fix NULL pointer dereference in f2fs_verity_work() f2fs: fix to clear PG_error if fsverity failed f2fs: don't call fscrypt_get_encryption_info() explicitly in f2fs_tmpfile() f2fs: don't trigger data flush in foreground operation ...
2020-04-07Merge tag 'for-linus-5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml Pull UML updates from Richard Weinberger: - New mode for time travel, external via virtio - Fixes for ubd to make sure no requests can get lost - Fixes for vector networking - Allow CONFIG_STATIC_LINK only when possible - Minor cleanups and fixes * tag 'for-linus-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: um: Remove some unnecessary NULL checks in vector_user.c um: vector: Avoid NULL ptr deference if transport is unset um: Make CONFIG_STATIC_LINK actually static um: Implement cpu_relax() as ndelay(1) for time-travel um: Implement ndelay/udelay in time-travel mode um: Implement time-travel=ext um: virtio: Implement VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS um: time-travel: Rewrite as an event scheduler um: Move timer-internal.h to non-shared hostfs: Use kasprintf() instead of fixed buffer formatting um: falloc.h needs to be directly included for older libc um: ubd: Retry buffer read on any kind of error um: ubd: Prevent buffer overrun on command completion um: Fix overlapping ELF segments when statically linked um: Delete never executed timer um: Don't overwrite ethtool driver version um: Fix len of file in create_pid_file um: Don't use console_drivers directly um: Cleanup CONFIG_IOSCHED_CFQ
2020-04-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Slave bond and team devices should not be assigned ipv6 link local addresses, from Jarod Wilson. 2) Fix clock sink config on some at803x PHY devices, from Oleksij Rempel. 3) Uninitialized stack space transmitted in slcan frames, fix from Richard Palethorpe. 4) Guard HW VLAN ops properly in stmmac driver, from Jose Abreu. 5) "=" --> "|=" fix in aquantia driver, from Colin Ian King. 6) Fix TCP fallback in mptcp, from Florian Westphal. (accessing a plain tcp_sk as if it were an mptcp socket). 7) Fix cavium driver in some configurations wrt. PTP, from Yue Haibing. 8) Make ipv6 and ipv4 consistent in the lower bound allowed for neighbour entry retrans_time, from Hangbin Liu. 9) Don't use private workqueue in pegasus usb driver, from Petko Manolov. 10) Fix integer overflow in mlxsw, from Colin Ian King. 11) Missing refcnt init in cls_tcindex, from Cong Wang. 12) One too many loop iterations when processing cmpri entries in ipv6 rpl code, from Alexander Aring. 13) Disable SG and TSO by default in r8169, from Heiner Kallweit. 14) NULL deref in macsec, from Davide Caratti. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits) macsec: fix NULL dereference in macsec_upd_offload() skbuff.h: Improve the checksum related comments net: dsa: bcm_sf2: Ensure correct sub-node is parsed qed: remove redundant assignment to variable 'rc' wimax: remove some redundant assignments to variable result mlxsw: spectrum_flower: Do not stop at FLOW_ACTION_VLAN_MANGLE mlxsw: spectrum_flower: Do not stop at FLOW_ACTION_PRIORITY r8169: change back SG and TSO to be disabled by default net: dsa: bcm_sf2: Do not register slave MDIO bus with OF ipv6: rpl: fix loop iteration tun: Don't put_page() for all negative return values from XDP program net: dsa: mt7530: fix null pointer dereferencing in port5 setup mptcp: add some missing pr_fmt defines net: phy: micrel: kszphy_resume(): add delay after genphy_resume() before accessing PHY registers net_sched: fix a missing refcnt in tcindex_init() net: stmmac: dwmac1000: fix out-of-bounds mac address reg setting mlxsw: spectrum_trap: fix unintention integer overflow on left shift pegasus: Remove pegasus' own workqueue neigh: support smaller retrans_time settting net: openvswitch: use hlist_for_each_entry_rcu instead of hlist_for_each_entry ...
2020-04-07linux/bits.h: add compile time sanity check of GENMASK inputsRikard Falkeborn
GENMASK() and GENMASK_ULL() are supposed to be called with the high bit as the first argument and the low bit as the second argument. Mixing them will return a mask with zero bits set. Recent commits show getting this wrong is not uncommon, see e.g. commit aa4c0c9091b0 ("net: stmmac: Fix misuses of GENMASK macro") and commit 9bdd7bb3a844 ("clocksource/drivers/npcm: Fix misuse of GENMASK macro"). To prevent such mistakes from appearing again, add compile time sanity checking to the arguments of GENMASK() and GENMASK_ULL(). If both arguments are known at compile time, and the low bit is higher than the high bit, break the build to detect the mistake immediately. Since GENMASK() is used in declarations, BUILD_BUG_ON_ZERO() must be used instead of BUILD_BUG_ON(). __builtin_constant_p does not evaluate is argument, it only checks if it is a constant or not at compile time, and __builtin_choose_expr does not evaluate the expression that is not chosen. Therefore, GENMASK(x++, 0) does only evaluate x++ once. Commit 95b980d62d52 ("linux/bits.h: make BIT(), GENMASK(), and friends available in assembly") made the macros in linux/bits.h available in assembly. Since BUILD_BUG_OR_ZERO() is not asm compatible, disable the checks if the file is included in an asm file. Due to bugs in GCC versions before 4.9 [0], disable the check if building with a too old GCC compiler. [0]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=19449 Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kees Cook <keescook@chromium.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Haren Myneni <haren@us.ibm.com> Cc: Joe Perches <joe@perches.com> Cc: Johannes Berg <johannes@sipsolutions.net> Cc: lkml <linux-kernel@vger.kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200308193954.2372399-1-rikard.falkeborn@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07percpu_counter: fix a data race at vm_committed_asQian Cai
"vm_committed_as.count" could be accessed concurrently as reported by KCSAN, BUG: KCSAN: data-race in __vm_enough_memory / percpu_counter_add_batch write to 0xffffffff9451c538 of 8 bytes by task 65879 on cpu 35: percpu_counter_add_batch+0x83/0xd0 percpu_counter_add_batch at lib/percpu_counter.c:91 __vm_enough_memory+0xb9/0x260 dup_mm+0x3a4/0x8f0 copy_process+0x2458/0x3240 _do_fork+0xaa/0x9f0 __do_sys_clone+0x125/0x160 __x64_sys_clone+0x70/0x90 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe read to 0xffffffff9451c538 of 8 bytes by task 66773 on cpu 19: __vm_enough_memory+0x199/0x260 percpu_counter_read_positive at include/linux/percpu_counter.h:81 (inlined by) __vm_enough_memory at mm/util.c:839 mmap_region+0x1b2/0xa10 do_mmap+0x45c/0x700 vm_mmap_pgoff+0xc0/0x130 ksys_mmap_pgoff+0x6e/0x300 __x64_sys_mmap+0x33/0x40 do_syscall_64+0x91/0xb05 entry_SYSCALL_64_after_hwframe+0x49/0xbe The read is outside percpu_counter::lock critical section which results in a data race. Fix it by adding a READ_ONCE() in percpu_counter_read_positive() which could also service as the existing compiler memory barrier. Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Marco Elver <elver@google.com> Link: http://lkml.kernel.org/r/1582302724-2804-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07kasan: stackdepot: move filter_irq_stacks() to stackdepot.cAlexander Potapenko
filter_irq_stacks() can be used by other tools (e.g. KMSAN), so it needs to be moved to a common location. lib/stackdepot.c seems a good place, as filter_irq_stacks() is usually applied to the output of stack_trace_save(). This patch has been previously mailed as part of KMSAN RFC patch series. [glider@google.co: nds32: linker script: add SOFTIRQENTRY_TEXT\ Link: http://lkml.kernel.org/r/20200311121002.241430-1-glider@google.com [glider@google.com: add IRQENTRY_TEXT and SOFTIRQENTRY_TEXT to linker script] Link: http://lkml.kernel.org/r/20200311121124.243352-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vegard Nossum <vegard.nossum@oracle.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Link: http://lkml.kernel.org/r/20200220141916.55455-3-glider@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07bitops: always inline sign extension helpersJosh Poimboeuf
With CONFIG_CC_OPTIMIZE_FOR_SIZE, objtool reports: drivers/gpu/drm/i915/gem/i915_gem_execbuffer.o: warning: objtool: i915_gem_execbuffer2_ioctl()+0x5b7: call to gen8_canonical_addr() with UACCESS enabled This means i915_gem_execbuffer2_ioctl() is calling gen8_canonical_addr() from the user_access_begin/end critical region (i.e, with SMAP disabled). While it's probably harmless in this case, in general we like to avoid extra function calls in SMAP-disabled regions because it can open up inadvertent security holes. Fix the warning by changing the sign extension helpers to __always_inline. This convinces GCC to inline gen8_canonical_addr(). The sign extension functions are trivial anyway, so it makes sense to always inline them. With my test optimize-for-size-based config, this actually shrinks the text size of i915_gem_execbuffer.o by 45 bytes -- and no change for vmlinux. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Chris Wilson <chris@chris-wilson.co.uk> Link: http://lkml.kernel.org/r/740179324b2b18b750b16295c48357f00b5fa9ed.1582982020.git.jpoimboe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07compiler.h: fix error in BUILD_BUG_ON() reportingVegard Nossum
compiletime_assert() uses __LINE__ to create a unique function name. This means that if you have more than one BUILD_BUG_ON() in the same source line (which can happen if they appear e.g. in a macro), then the error message from the compiler might output the wrong condition. For this source file: #include <linux/build_bug.h> #define macro() \ BUILD_BUG_ON(1); \ BUILD_BUG_ON(0); void foo() { macro(); } gcc would output: ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_9' declared with attribute error: BUILD_BUG_ON failed: 0 _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__) However, it was not the BUILD_BUG_ON(0) that failed, so it should say 1 instead of 0. With this patch, we use __COUNTER__ instead of __LINE__, so each BUILD_BUG_ON() gets a different function name and the correct condition is printed: ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_0' declared with attribute error: BUILD_BUG_ON failed: 1 _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Daniel Santos <daniel.santos@pobox.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Ian Abbott <abbotti@mev.co.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200331112637.25047-1-vegard.nossum@oracle.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07compiler: remove CONFIG_OPTIMIZE_INLINING entirelyMasahiro Yamada
Commit ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING forcibly") made this always-on option. We released v5.4 and v5.5 including that commit. Remove the CONFIG option and clean up the code now. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: David Miller <davem@davemloft.net> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07seq_file: remove m->versionMatthew Wilcox (Oracle)
The process maps file was the only user of version (introduced back in 2005). Now that it uses ppos instead, we can remove it. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20200317193201.9924-4-adobriyan@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07proc: faster open/read/close with "permanent" filesAlexey Dobriyan
Now that "struct proc_ops" exist we can start putting there stuff which could not fly with VFS "struct file_operations"... Most of fs/proc/inode.c file is dedicated to make open/read/.../close reliable in the event of disappearing /proc entries which usually happens if module is getting removed. Files like /proc/cpuinfo which never disappear simply do not need such protection. Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such "permanent" files. Enable "permanent" flag for /proc/cpuinfo /proc/kmsg /proc/modules /proc/slabinfo /proc/stat /proc/sysvipc/* /proc/swaps More will come once I figure out foolproof way to prevent out module authors from marking their stuff "permanent" for performance reasons when it is not. This should help with scalability: benchmark is "read /proc/cpuinfo R times by N threads scattered over the system". N R t, s (before) t, s (after) ----------------------------------------------------- 64 4096 1.582458 1.530502 -3.2% 256 4096 6.371926 6.125168 -3.9% 1024 4096 25.64888 24.47528 -4.6% Benchmark source: #include <chrono> #include <iostream> #include <thread> #include <vector> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN); int N; const char *filename; int R; int xxx = 0; int glue(int n) { cpu_set_t m; CPU_ZERO(&m); CPU_SET(n, &m); return sched_setaffinity(0, sizeof(cpu_set_t), &m); } void f(int n) { glue(n % NR_CPUS); while (*(volatile int *)&xxx == 0) { } for (int i = 0; i < R; i++) { int fd = open(filename, O_RDONLY); char buf[4096]; ssize_t rv = read(fd, buf, sizeof(buf)); asm volatile ("" :: "g" (rv)); close(fd); } } int main(int argc, char *argv[]) { if (argc < 4) { std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R "; return 1; } N = atoi(argv[1]); filename = argv[2]; R = atoi(argv[3]); for (int i = 0; i < NR_CPUS; i++) { if (glue(i) == 0) break; } std::vector<std::thread> T; T.reserve(N); for (int i = 0; i < N; i++) { T.emplace_back(f, i); } auto t0 = std::chrono::system_clock::now(); { *(volatile int *)&xxx = 1; for (auto& t: T) { t.join(); } } auto t1 = std::chrono::system_clock::now(); std::chrono::duration<double> dt = t1 - t0; std::cout << dt.count() << ' '; return 0; } P.S.: Explicit randomization marker is added because adding non-function pointer will silently disable structure layout randomization. [akpm@linux-foundation.org: coding style fixes] Reported-by: kbuild test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Joe Perches <joe@perches.com> Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2 Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: remove dummy struct bootmem_data/bootmem_data_tWaiman Long
Both bootmem_data and bootmem_data_t structures are no longer defined. Remove the dummy forward declarations. Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Link: http://lkml.kernel.org/r/20200326022617.26208-1-longman@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07include/linux/memremap.h: remove stale commentsIra Weiny
Fixes: 80a72d0af05a ("memremap: remove the data field in struct dev_pagemap") Fixes: fdc029b19dfd ("memremap: remove the dev field in struct dev_pagemap") Signed-off-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dan Williams <dan.j.williams@intel.com> Link: http://lkml.kernel.org/r/20200316213205.145333-1-ira.weiny@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07include/linux/swapops.h: correct guards for non_swap_entry()Steven Price
If CONFIG_DEVICE_PRIVATE is defined, but neither CONFIG_MEMORY_FAILURE nor CONFIG_MIGRATION, then non_swap_entry() will return 0, meaning that the condition (non_swap_entry(entry) && is_device_private_entry(entry)) in zap_pte_range() will never be true even if the entry is a device private one. Equally any other code depending on non_swap_entry() will not function as expected. I originally spotted this just by looking at the code, I haven't actually observed any problems. Looking a bit more closely it appears that actually this situation (currently at least) cannot occur: DEVICE_PRIVATE depends on ZONE_DEVICE ZONE_DEVICE depends on MEMORY_HOTREMOVE MEMORY_HOTREMOVE depends on MIGRATION Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory") Signed-off-by: Steven Price <steven.price@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jérôme Glisse <jglisse@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: John Hubbard <jhubbard@nvidia.com> Link: http://lkml.kernel.org/r/20200305130550.22693-1-steven.price@arm.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: fix ambiguous comments for better code readabilitychenqiwu
The parameter of remap_pfn_range() @pfn passed from the caller is actually a page-frame number converted by corresponding physical address of kernel memory, the original comment is ambiguous that may mislead the users. Meanwhile, there is an ambiguous typo "VMM" in the comment of vm_area_struct. So fixing them will make the code more readable. Signed-off-by: chenqiwu <chenqiwu@xiaomi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm/memory_hotplug: allow to specify a default online_typeDavid Hildenbrand
For now, distributions implement advanced udev rules to essentially - Don't online any hotplugged memory (s390x) - Online all memory to ZONE_NORMAL (e.g., most virt environments like hyperv) - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken care of (e.g., bare metal, special virt environments) In summary: All memory is usually onlined the same way, however, the kernel always has to ask user space to come up with the same answer. E.g., Hyper-V always waits for a memory block to get onlined before continuing, otherwise it might end up adding memory faster than onlining it, which can result in strange OOM situations. This waiting slows down adding of a bigger amount of memory. Let's allow to specify a default online_type, not just "online" and "offline". This allows distributions to configure the default online_type when booting up and be done with it. We can now specify "offline", "online", "online_movable" and "online_kernel" via - "memhp_default_state=" on the kernel cmdline - /sys/devices/system/memory/auto_online_blocks just like we are able to specify for a single memory block via /sys/devices/system/memory/memoryX/state Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm/memory_hotplug: convert memhp_auto_online to store an online_typeDavid Hildenbrand
... and rename it to memhp_default_online_type. This is a preparation for more detailed default online behavior. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07drivers/base/memory: map MMOP_OFFLINE to 0David Hildenbrand
Historically, we used the value -1. Just treat 0 as the special case now. Clarify a comment (which was wrong, when we come via device_online() the first time, the online_type would have been 0 / MEM_ONLINE). The default is now always MMOP_OFFLINE. This removes the last user of the manual "-1", which didn't use the enum value. This is a preparation to use the online_type as an array index. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Wei Liu <wei.liu@kernel.org> Cc: Yumei Huang <yuhuang@redhat.com> Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07drivers/base/memory: rename MMOP_ONLINE_KEEP to MMOP_ONLINEDavid Hildenbrand
Patch series "mm/memory_hotplug: allow to specify a default online_type", v3. Distributions nowadays use udev rules ([1] [2]) to specify if and how to online hotplugged memory. The rules seem to get more complex with many special cases. Due to the various special cases, CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug is handled via udev rules. Every time we hotplug memory, the udev rule will come to the same conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of memory in separate memory blocks and wait for memory to get onlined by user space before continuing to add more memory blocks (to not add memory faster than it is getting onlined). This of course slows down the whole memory hotplug process. To make the job of distributions easier and to avoid udev rules that get more and more complicated, let's extend the mechanism provided by - /sys/devices/system/memory/auto_online_blocks - "memhp_default_state=" on the kernel cmdline to be able to specify also "online_movable" as well as "online_kernel" === Example /usr/libexec/config-memhotplug === #!/bin/bash VIRT=`systemd-detect-virt --vm` ARCH=`uname -p` sense_virtio_mem() { if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l` if [ $DEVICES != "0" ]; then return 0 fi fi return 1 } if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then echo "Memory hotplug configuration support missing in the kernel" exit 1 fi if grep "memhp_default_state=" /proc/cmdline > /dev/null; then echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)" exit 1 fi if [ $VIRT == "microsoft" ]; then echo "Detected Hyper-V on $ARCH" # Hyper-V wants all memory in ZONE_NORMAL ONLINE_TYPE="online_kernel" elif sense_virtio_mem; then echo "Detected virtio-mem on $ARCH" # virtio-mem wants all memory in ZONE_NORMAL ONLINE_TYPE="online_kernel" elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then echo "Detected $ARCH" # standby memory should not be onlined automatically ONLINE_TYPE="offline" elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then echo "Detected" $ARCH # PPC64 onlines all hotplugged memory right from the kernel ONLINE_TYPE="offline" elif [ $VIRT == "none" ]; then echo "Detected bare-metal on $ARCH" # Bare metal users expect hotplugged memory to be unpluggable. We assume # that ZONE imbalances on such enterpise servers cannot happen and is # properly documented ONLINE_TYPE="online_movable" else # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE # imbalances won't happen echo "Detected $VIRT on $ARCH" # Usually, ballooning is used in virtual environments, so memory should go to # ZONE_NORMAL. However, sometimes "movable_node" is relevant. ONLINE_TYPE="online" fi echo "Selected online_type:" $ONLINE_TYPE # Configure what to do with memory that will be hotplugged in the future echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks if [ $? != "0" ]; then echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)" # A backup udev rule should handle old kernels if necessary exit 1 fi # Process all already pluggedd blocks (e.g., DIMMs, but also Hyper-V or virtio-mem) if [ $ONLINE_TYPE != "offline" ]; then for MEMORY in /sys/devices/system/memory/memory*; do STATE=`cat $MEMORY/state` if [ $STATE == "offline" ]; then echo $ONLINE_TYPE > $MEMORY/state fi done fi === Example /usr/lib/systemd/system/config-memhotplug.service === [Unit] Description=Configure memory hotplug behavior DefaultDependencies=no Conflicts=shutdown.target Before=sysinit.target shutdown.target After=systemd-modules-load.service ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks [Service] ExecStart=/usr/libexec/config-memhotplug Type=oneshot TimeoutSec=0 RemainAfterExit=yes [Install] WantedBy=sysinit.target === Example modification to the 40-redhat.rules [2] === : diff --git a/40-redhat.rules b/40-redhat.rules-new : index 2c690e5..168fd03 100644 : --- a/40-redhat.rules : +++ b/40-redhat.rules-new : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online} : # Memory hotadd request : SUBSYSTEM!="memory", GOTO="memory_hotplug_end" : ACTION!="add", GOTO="memory_hotplug_end" : +# memory hotplug behavior configured : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end" : + : PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end" : : ENV{.state}="online" === [1] https://github.com/lnykryn/systemd-rhel/pull/281 [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules This patch (of 8): The name is misleading and it's not really clear what is "kept". Let's just name it like the online_type name we expose to user space ("online"). Add some documentation to the types. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Baoquan He <bhe@redhat.com> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Yumei Huang <yuhuang@redhat.com> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Eduardo Habkost <ehabkost@redhat.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Paul Mackerras <paulus@samba.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Wei Liu <wei.liu@kernel.org> Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm/sparse.c: only use subsection map in VMEMMAP caseBaoquan He
Currently, to support subsection aligned memory region adding for pmem, subsection map is added to track which subsection is present. However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. It means subsection map only makes sense when SPARSEMEM_VMEMMAP enabled. For the classic sparse, it's meaningless. Even worse, it may confuse people when checking code related to the classic sparse. About the classic sparse which doesn't support subsection hotplug, Dan said it's more because the effort and maintenance burden outweighs the benefit. Besides, the current 64 bit ARCHes all enable SPARSEMEM_VMEMMAP_ENABLE by default. Combining the above reasons, no need to provide subsection map and the relevant handling for the classic sparse. Let's remove them. Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07drivers/base/memory.c: drop section_countDavid Hildenbrand
Patch series "mm: drop superfluous section checks when onlining/offlining". Let's drop some superfluous section checks on the onlining/offlining path. This patch (of 3): Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with holes") we have a generic check in offline_pages() that disallows offlining memory blocks with holes. Memory blocks with missing sections are just another variant of these type of blocks. We can stop checking (and especially storing) present sections. A proper error message is now printed why offlining failed. section_count was initially introduced in commit 07681215975e ("Driver core: Add section count to memory_block struct") in order to detect when it is okay to remove a memory block. It was used in commit 26bbe7ef6d5c ("drivers/base/memory.c: prohibit offlining of memory blocks with missing sections") to disallow offlining memory blocks with missing sections. As we refactored creation/removal of memory devices and have a proper check for holes in place, we can drop the section_count. This also removes a leftover comment regarding the mem_sysfs_mutex, which was removed in commit 848e19ad3c33 ("drivers/base/memory.c: drop the mem_sysfs_mutex"). Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: enabled write protection in userfaultfd APIShaohua Li
Now it's safe to enable write protection in userfaultfd API Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: add the writeprotect API to userfaultfd ioctlAndrea Arcangeli
Introduce the new uffd-wp APIs for userspace. Firstly, we'll allow to do UFFDIO_REGISTER with write protection tracking using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case the userspace program can not only resolve missing page faults, and at the same time tracking page data changes along the way. Secondly, we introduced the new UFFDIO_WRITEPROTECT API to do page level write protection tracking. Note that we will need to register the memory region with UFFDIO_REGISTER_MODE_WP before that. [peterx@redhat.com: write up the commit message] [peterx@redhat.com: remove useless block, write commit message, check against VM_MAYWRITE rather than VM_WRITE when register] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: support write protection for userfault vma rangeShaohua Li
Add API to enable/disable writeprotect a vma range. Unlike mprotect, this doesn't split/merge vmas. [peterx@redhat.com: - use the helper to find VMA; - return -ENOENT if not found to match mcopy case; - use the new MM_CP_UFFD_WP* flags for change_protection - check against mmap_changing for failures - replace find_dst_vma with vma_find_uffd] Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07khugepaged: skip collapse if uffd-wp detectedPeter Xu
Don't collapse the huge PMD if there is any userfault write protected small PTEs. The problem is that the write protection is in small page granularity and there's no way to keep all these write protection information if the small pages are going to be merged into a huge PMD. The same thing needs to be considered for swap entries and migration entries. So do the check as well disregarding khugepaged_max_ptes_swap. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: support swap and page migrationPeter Xu
For either swap and page migration, we all use the bit 2 of the entry to identify whether this entry is uffd write-protected. It plays a similar role as the existing soft dirty bit in swap entries but only for keeping the uffd-wp tracking for a specific PTE/PMD. Something special here is that when we want to recover the uffd-wp bit from a swap/migration entry to the PTE bit we'll also need to take care of the _PAGE_RW bit and make sure it's cleared, otherwise even with the _PAGE_UFFD_WP bit we can't trap it at all. In change_pte_range() we do nothing for uffd if the PTE is a swap entry. That can lead to data mismatch if the page that we are going to write protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch also applies/removes the uffd-wp bit even for the swap entries. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: add pmd_swp_*uffd_wp() helpersPeter Xu
Adding these missing helpers for uffd-wp operations with pmd swap/migration entries. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: apply _PAGE_UFFD_WP bitPeter Xu
Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for change_protection() when used with uffd-wp and make sure the two new flags are exclusively used. Then, - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW when a range of memory is write protected by uffd - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover _PAGE_RW when write protection is resolved from userspace And use this new interface in mwriteprotect_range() to replace the old MM_CP_DIRTY_ACCT. Do this change for both PTEs and huge PMDs. Then we can start to identify which PTE/PMD is write protected by general (e.g., COW or soft dirty tracking), and which is for userfaultfd-wp. Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we can be even more strict when detecting uffd-wp page faults in either do_wp_page() or wp_huge_pmd(). After we're with _PAGE_UFFD_WP, a special case is when a page is both protected by the general COW logic and also userfault-wp. Here the userfault-wp will have higher priority and will be handled first. Only after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle the general COW. These are the steps on what will happen with such a page: 1. CPU accesses write protected shared page (so both protected by general COW and uffd-wp), blocked by uffd-wp first because in do_wp_page we'll handle uffd-wp first, so it has higher priority than general COW. 2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT to remove the uffd-wp bit upon the PTE/PMD. However here we still keep the write bit cleared. Notify the blocked CPU. 3. The blocked CPU resumes the page fault process with a fault retry, during retry it'll notice it was not with the uffd-wp bit this time but it is still write protected by general COW, then it'll go though the COW path in the fault handler, copy the page, apply write bit where necessary, and retry again. 4. The CPU will be able to access this page with write bit set. Suggested-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Brian Geffon <bgeffon@google.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Martin Cracauer <cracauer@cons.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: Hugh Dickins <hughd@google.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: merge parameters for change_protection()Peter Xu
change_protection() was used by either the NUMA or mprotect() code, there's one parameter for each of the callers (dirty_accountable and prot_numa). Further, these parameters are passed along the calls: - change_protection_range() - change_p4d_range() - change_pud_range() - change_pmd_range() - ... Now we introduce a flag for change_protect() and all these helpers to replace these parameters. Then we can avoid passing multiple parameters multiple times along the way. More importantly, it'll greatly simplify the work if we want to introduce any new parameters to change_protection(). In the follow up patches, a new parameter for userfaultfd write protection will be introduced. No functional change at all. Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: add UFFDIO_COPY_MODE_WPAndrea Arcangeli
This allows UFFDIO_COPY to map pages write-protected. [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and commit messages] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: userfaultfd_pte/huge_pmd_wp() helpersAndrea Arcangeli
Implement helpers methods to invoke userfaultfd wp faults more selectively: not only when a wp fault triggers on a vma with vma->vm_flags VM_UFFD_WP set, but only if the _PAGE_UFFD_WP bit is set in the pagetable too. Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-5-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: add WP pagetable tracking to x86Andrea Arcangeli
Accurate userfaultfd WP tracking is possible by tracking exactly which virtual memory ranges were writeprotected by userland. We can't relay only on the RW bit of the mapped pagetable because that information is destroyed by fork() or KSM or swap. If we were to relay on that, we'd need to stay on the safe side and generate false positive wp faults for every swapped out page. [peterx@redhat.com: append _PAGE_UFD_WP to _PAGE_CHG_MASK] Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@fb.com> Link: http://lkml.kernel.org/r/20200220163112.11409-4-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07userfaultfd: wp: add helper for writeprotect checkShaohua Li
Patch series "userfaultfd: write protection support", v6. Overview ======== The uffd-wp work was initialized by Shaohua Li [1], and later continued by Andrea [2]. This series is based upon Andrea's latest userfaultfd tree, and it is a continuous works from both Shaohua and Andrea. Many of the follow up ideas come from Andrea too. Besides the old MISSING register mode of userfaultfd, the new uffd-wp support provides another alternative register mode called UFFDIO_REGISTER_MODE_WP that can be used to listen to not only missing page faults but also write protection page faults, or even they can be registered together. At the same time, the new feature also provides a new userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows the userspace to write protect a range or memory or fixup write permission of faulted pages. Please refer to the document patch "userfaultfd: wp: UFFDIO_REGISTER_MODE_WP documentation update" for more information on the new interface and what it can do. The major workflow of an uffd-wp program should be: 1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP 2. Write protect part of the whole registered region using UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to show that we want to write protect the range. 3. Start a working thread that modifies the protected pages, meanwhile listening to UFFD messages. 4. When a write is detected upon the protected range, page fault happens, a UFFD message will be generated and reported to the page fault handling thread 5. The page fault handler thread resolves the page fault using the new UFFDIO_WRITEPROTECT ioctl, but this time passing in !UFFDIO_WRITEPROTECT_MODE_WP instead showing that we want to recover the write permission. Before this operation, the fault handler thread can do anything it wants, e.g., dumps the page to a persistent storage. 6. The worker thread will continue running with the correctly applied write permission from step 5. Currently there are already two projects that are based on this new userfaultfd feature. QEMU Live Snapshot: The project provides a way to allow the QEMU hypervisor to take snapshot of VMs without stopping the VM [3]. LLNL umap library: The project provides a mmap-like interface and "allow to have an application specific buffer of pages cached from a large file, i.e. out-of-core execution using memory map" [4][5]. Before posting the patchset, this series was smoke tested against QEMU live snapshot and the LLNL umap library (by doing parallel quicksort using 128 sorting threads + 80 uffd servicing threads). My sincere thanks to Marty Mcfadden and Denis Plotnikov for the help along the way. TODO ==== - hugetlbfs/shmem support - performance - more architectures - cooperate with mprotect()-allowed processes (???) - ... References ========== [1] https://lwn.net/Articles/666187/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault [3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm [4] https://github.com/LLNL/umap [5] https://llnl-umap.readthedocs.io/en/develop/ [6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5 [7] https://lkml.org/lkml/2018/11/21/370 [8] https://lkml.org/lkml/2018/12/30/64 This patch (of 19): Add helper for writeprotect check. Will use it later. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Jerome Glisse <jglisse@redhat.com> Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Bobby Powers <bobbypowers@gmail.com> Cc: Brian Geffon <bgeffon@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Denis Plotnikov <dplotnikov@virtuozzo.com> Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com> Cc: Martin Cracauer <cracauer@cons.org> Cc: Marty McFadden <mcfadden8@llnl.gov> Cc: Maya Gokhale <gokhale2@llnl.gov> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@openvz.org> Link: http://lkml.kernel.org/r/20200220163112.11409-2-peterx@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm/page_reporting: add budget limit on how many pages can be reported per passAlexander Duyck
In order to keep ourselves from reporting pages that are just going to be reused again in the case of heavy churn we can put a limit on how many total pages we will process per pass. Doing this will allow the worker thread to go into idle much more quickly so that we avoid competing with other threads that might be allocating or freeing pages. The logic added here will limit the worker thread to no more than one sixteenth of the total free pages in a given area per list. Once that limit is reached it will update the state so that at the end of the pass we will reschedule the worker to try again in 2 seconds when the memory churn has hopefully settled down. Again this optimization doesn't show much of a benefit in the standard case as the memory churn is minmal. However with page allocator shuffling enabled the gain is quite noticeable. Below are the results with a THP enabled version of the will-it-scale page_fault1 test showing the improvement in iterations for 16 processes or threads. Without: tasks processes processes_idle threads threads_idle 16 8283274.75 0.17 5594261.00 38.15 With: tasks processes processes_idle threads threads_idle 16 8767010.50 0.21 5791312.75 36.98 Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07virtio-balloon: add support for providing free page reports to hostAlexander Duyck
Add support for the page reporting feature provided by virtio-balloon. Reporting differs from the regular balloon functionality in that is is much less durable than a standard memory balloon. Instead of creating a list of pages that cannot be accessed the pages are only inaccessible while they are being indicated to the virtio interface. Once the interface has acknowledged them they are placed back into their respective free lists and are once again accessible by the guest system. Unlike a standard balloon we don't inflate and deflate the pages. Instead we perform the reporting, and once the reporting is completed it is assumed that the page has been dropped from the guest and will be faulted back in the next time the page is accessed. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: introduce Reported pagesAlexander Duyck
In order to pave the way for free page reporting in virtualized environments we will need a way to get pages out of the free lists and identify those pages after they have been returned. To accomplish this, this patch adds the concept of a Reported Buddy, which is essentially meant to just be the Uptodate flag used in conjunction with the Buddy page type. To prevent the reported pages from leaking outside of the buddy lists I added a check to clear the PageReported bit in the del_page_from_free_list function. As a result any reported page that is split, merged, or allocated will have the flag cleared prior to the PageBuddy value being cleared. The process for reporting pages is fairly simple. Once we free a page that meets the minimum order for page reporting we will schedule a worker thread to start 2s or more in the future. That worker thread will begin working from the lowest supported page reporting order up to MAX_ORDER - 1 pulling unreported pages from the free list and storing them in the scatterlist. When processing each individual free list it is necessary for the worker thread to release the zone lock when it needs to stop and report the full scatterlist of pages. To reduce the work of the next iteration the worker thread will rotate the free list so that the first unreported page in the free list becomes the first entry in the list. It will then call a reporting function providing information on how many entries are in the scatterlist. Once the function completes it will return the pages to the free area from which they were allocated and start over pulling more pages from the free areas until there are no longer enough pages to report on to keep the worker busy, or we have processed as many pages as were contained in the free area when we started processing the list. The worker thread will work in a round-robin fashion making its way though each zone requesting reporting, and through each reportable free list within that zone. Once all free areas within the zone have been processed it will check to see if there have been any requests for reporting while it was processing. If so it will reschedule the worker thread to start up again in roughly 2s and exit. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: use zone and order instead of free area in free_list manipulatorsAlexander Duyck
In order to enable the use of the zone from the list manipulator functions I will need access to the zone pointer. As it turns out most of the accessors were always just being directly passed &zone->free_area[order] anyway so it would make sense to just fold that into the function itself and pass the zone and order as arguments instead of the free area. In order to be able to reference the zone we need to move the declaration of the functions down so that we have the zone defined before we define the list manipulation functions. Since the functions are only used in the file mm/page_alloc.c we can just move them there to reduce noise in the header. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pankaj Gupta <pagupta@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: adjust shuffle code to allow for future coalescingAlexander Duyck
Patch series "mm / virtio: Provide support for free page reporting", v17. This series provides an asynchronous means of reporting free guest pages to a hypervisor so that the memory associated with those pages can be dropped and reused by other processes and/or guests on the host. Using this it is possible to avoid unnecessary I/O to disk and greatly improve performance in the case of memory overcommit on the host. When enabled we will be performing a scan of free memory every 2 seconds while pages of sufficiently high order are being freed. In each pass at least one sixteenth of each free list will be reported. By doing this we avoid racing against other threads that may be causing a high amount of memory churn. The lowest page order currently scanned when reporting pages is pageblock_order so that this feature will not interfere with the use of Transparent Huge Pages in the case of virtualization. Currently this is only in use by virtio-balloon however there is the hope that at some point in the future other hypervisors might be able to make use of it. In the virtio-balloon/QEMU implementation the hypervisor is currently using MADV_DONTNEED to indicate to the host kernel that the page is currently free. It will be zeroed and faulted back into the guest the next time the page is accessed. To track if a page is reported or not the Uptodate flag was repurposed and used as a Reported flag for Buddy pages. We walk though the free list isolating pages and adding them to the scatterlist until we either encounter the end of the list or have processed at least one sixteenth of the pages that were listed in nr_free prior to us starting. If we fill the scatterlist before we reach the end of the list we rotate the list so that the first unreported page we encounter is moved to the head of the list as that is where we will resume after we have freed the reported pages back into the tail of the list. Below are the results from various benchmarks. I primarily focused on two tests. The first is the will-it-scale/page_fault2 test, and the other is a modified version of will-it-scale/page_fault1 that was enabled to use THP. I did this as it allows for better visibility into different parts of the memory subsystem. The guest is running with 32G for RAM on one node of a E5-2630 v3. The host has had some features such as CPU turbo disabled in the BIOS. Test page_fault1 (THP) page_fault2 Name tasks Process Iter STDEV Process Iter STDEV Baseline 1 1012402.50 0.14% 361855.25 0.81% 16 8827457.25 0.09% 3282347.00 0.34% Patches Applied 1 1007897.00 0.23% 361887.00 0.26% 16 8784741.75 0.39% 3240669.25 0.48% Patches Enabled 1 1010227.50 0.39% 359749.25 0.56% 16 8756219.00 0.24% 3226608.75 0.97% Patches Enabled 1 1050982.00 4.26% 357966.25 0.14% page shuffle 16 8672601.25 0.49% 3223177.75 0.40% Patches enabled 1 1003238.00 0.22% 360211.00 0.22% shuffle w/ RFC 16 8767010.50 0.32% 3199874.00 0.71% The results above are for a baseline with a linux-next-20191219 kernel, that kernel with this patch set applied but page reporting disabled in virtio-balloon, the patches applied and page reporting fully enabled, the patches enabled with page shuffling enabled, and the patches applied with page shuffling enabled and an RFC patch that makes used of MADV_FREE in QEMU. These results include the deviation seen between the average value reported here versus the high and/or low value. I observed that during the test memory usage for the first three tests never dropped whereas with the patches fully enabled the VM would drop to using only a few GB of the host's memory when switching from memhog to page fault tests. Any of the overhead visible with this patch set enabled seems due to page faults caused by accessing the reported pages and the host zeroing the page before giving it back to the guest. This overhead is much more visible when using THP than with standard 4K pages. In addition page shuffling seemed to increase the amount of faults generated due to an increase in memory churn. The overehad is reduced when using MADV_FREE as we can avoid the extra zeroing of the pages when they are reintroduced to the host, as can be seen when the RFC is applied with shuffling enabled. The overall guest size is kept fairly small to only a few GB while the test is running. If the host memory were oversubscribed this patch set should result in a performance improvement as swapping memory in the host can be avoided. A brief history on the background of free page reporting can be found at: https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/ This patch (of 9): Move the head/tail adding logic out of the shuffle code and into the __free_one_page function since ultimately that is where it is really needed anyway. By doing this we should be able to reduce the overhead and can consolidate all of the list addition bits in one spot. Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Hildenbrand <david@redhat.com> Cc: Yang Zhang <yang.zhang.wz@gmail.com> Cc: Pankaj Gupta <pagupta@redhat.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Nitesh Narayan Lal <nitesh@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Wei Wang <wei.w.wang@intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: wei qi <weiqi4@huawei.com> Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: code cleanup for MADV_FREEHuang Ying
Some comments for MADV_FREE is revised and added to help people understand the MADV_FREE code, especially the page flag, PG_swapbacked. This makes page_is_file_cache() isn't consistent with its comments. So the function is renamed to page_is_file_lru() to make them consistent again. All these are put in one patch as one logical change. Suggested-by: David Hildenbrand <david@redhat.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: David Rientjes <rientjes@google.com> Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@kernel.org> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Minchan Kim <minchan@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@surriel.com> Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHEMatthew Wilcox (Oracle)
Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE") notes that it should be reverted when the PowerPC problem was fixed. The commit fixing the PowerPC problem (953c66c2b22a) did not revert the commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an oversight, so remove the Kconfig symbol and undo the work of commit e496cf3d7821. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07include/linux/pagemap.h: optimise find_subpage for !THPMatthew Wilcox (Oracle)
If THP is disabled, find_subpage() can become a no-op by using hpage_nr_pages() instead of compound_nr(). hpage_nr_pages() embeds a check for PageTail, so we can drop the check here. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Link: http://lkml.kernel.org/r/20200318140253.6141-5-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-04-07mm, thp: track fallbacks due to failed memcg charges separatelyDavid Rientjes
The thp_fault_fallback and thp_file_fallback vmstats are incremented if either the hugepage allocation fails through the page allocator or the hugepage charge fails through mem cgroup. This patch leaves this field untouched but adds two new fields, thp_{fault,file}_fallback_charge, which is incremented only when the mem cgroup charge fails. This distinguishes between attempted hugepage allocations that fail due to fragmentation (or low memory conditions) and those that fail due to mem cgroup limits. That can be used to determine the impact of fragmentation on the system by excluding faults that failed due to memcg usage. Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Jeremy Cline <jcline@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>