path: root/kernel
2020-03-30Merge tag 'irq-core-2020-03-30' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq updates from Thomas Gleixner: "Updates for the interrupt subsystem: Treewide: - Cleanup of setup_irq() which is no longer required because the memory allocator is available early. Most cleanup changes come through the various maintainer trees, so the final removal of setup_irq() is postponed towards the end of the merge window. Core: - Protection against unsafe invocation of interrupt handlers and unsafe interrupt injection including a fixup of the offending PCI/AER error injection mechanism. Invoking interrupt handlers from arbitrary contexts, i.e. outside of an actual interrupt, can cause inconsistent state on the fragile x86 interrupt affinity changing hardware trainwreck. Drivers: - Second wave of support for the new ARM GICv4.1 - Multi-instance support for Xilinx and PLIC interrupt controllers - CPU-Hotplug support for PLIC - The obligatory new driver for X1000 TCU - Enhancements, cleanups and fixes all over the place" * tag 'irq-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) unicore32: Replace setup_irq() by request_irq() sh: Replace setup_irq() by request_irq() hexagon: Replace setup_irq() by request_irq() c6x: Replace setup_irq() by request_irq() alpha: Replace setup_irq() by request_irq() irqchip/gic-v4.1: Eagerly vmap vPEs irqchip/gic-v4.1: Add VSGI property setup irqchip/gic-v4.1: Add VSGI allocation/teardown irqchip/gic-v4.1: Move doorbell management to the GICv4 abstraction layer irqchip/gic-v4.1: Plumb set_vcpu_affinity SGI callbacks irqchip/gic-v4.1: Plumb get/set_irqchip_state SGI callbacks irqchip/gic-v4.1: Plumb mask/unmask SGI callbacks irqchip/gic-v4.1: Add initial SGI configuration irqchip/gic-v4.1: Plumb skeletal VSGI irqchip irqchip/stm32: Retrigger both in eoi and unmask callbacks irqchip/gic-v3: Move irq_domain_update_bus_token to after checking for NULL domain irqchip/xilinx: Do not call irq_set_default_host() irqchip/xilinx: Enable generic irq multi handler irqchip/xilinx: Fill error code when irq domain registration fails irqchip/xilinx: Add support for multiple instances ...
2020-03-30Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The main changes in this cycle are: - Various NUMA scheduling updates: harmonize the load-balancer and NUMA placement logic to not work against each other. The intended result is better locality, better utilization and fewer migrations. - Introduce Thermal Pressure tracking and optimizations, to improve task placement on thermally overloaded systems. - Implement frequency invariant scheduler accounting on (some) x86 CPUs. This is done by observing and sampling the 'recent' CPU frequency average at ~tick boundaries. The CPU provides this data via the APERF/MPERF MSRs. This hopefully makes our capacity estimates more precise and keeps tasks on the same CPU better even if it might seem overloaded at a lower momentary frequency. (As usual, turbo mode is a complication that we resolve by observing the maximum frequency and renormalizing to it.) - Add asymmetric CPU capacity wakeup scan to improve capacity utilization on asymmetric topologies. (big.LITTLE systems) - PSI fixes and optimizations. - RT scheduling capacity awareness fixes & improvements. - Optimize the CONFIG_RT_GROUP_SCHED constraints code. - Misc fixes, cleanups and optimizations - see the changelog for details" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits) threads: Update PID limit comment according to futex UAPI change sched/fair: Fix condition of avg_load calculation sched/rt: cpupri_find: Trigger a full search as fallback kthread: Do not preempt current task if it is going to call schedule() sched/fair: Improve spreading of utilization sched: Avoid scale real weight down to zero psi: Move PF_MEMSTALL out of task->flags MAINTAINERS: Add maintenance information for psi psi: Optimize switching tasks inside shared cgroups psi: Fix cpu.pressure for cpu.max and competing cgroups sched/core: Distribute tasks within affinity masks sched/fair: Fix enqueue_task_fair warning thermal/cpu-cooling, sched/core: Move the arch_set_thermal_pressure() API to generic scheduler code sched/rt: Remove unnecessary push for unfit tasks sched/rt: Allow pulling unfitting task sched/rt: Optimize cpupri_find() on non-heterogenous systems sched/rt: Re-instate old behavior in select_task_rq_rt() sched/rt: cpupri_find: Implement fallback mechanism for !fit case sched/fair: Fix reordering of enqueue/dequeue_task_fair() sched/fair: Fix runnable_avg for throttled cfs ...
2020-03-30Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf updates from Ingo Molnar: "The main changes in this cycle were: Kernel side changes: - A couple of x86/cpu cleanups and changes were grandfathered in due to patch dependencies. These clean up the set of CPU model/family matching macros with a consistent namespace and C99 initializer style. - A bunch of updates to various low level PMU drivers: * AMD Family 19h L3 uncore PMU * Intel Tiger Lake uncore support * misc fixes to LBR TOS sampling - optprobe fixes - perf/cgroup: optimize cgroup event sched-in processing - misc cleanups and fixes Tooling side changes are to: - perf {annotate,expr,record,report,stat,test} - perl scripting - libapi, libperf and libtraceevent - vendor events on Intel and S390, ARM cs-etm - Intel PT updates - Documentation changes and updates to core facilities - misc cleanups, fixes and other enhancements" * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (89 commits) cpufreq/intel_pstate: Fix wrong macro conversion x86/cpu: Cleanup the now unused CPU match macros hwrng: via_rng: Convert to new X86 CPU match macros crypto: Convert to new CPU match macros ASoC: Intel: Convert to new X86 CPU match macros powercap/intel_rapl: Convert to new X86 CPU match macros PCI: intel-mid: Convert to new X86 CPU match macros mmc: sdhci-acpi: Convert to new X86 CPU match macros intel_idle: Convert to new X86 CPU match macros extcon: axp288: Convert to new X86 CPU match macros thermal: Convert to new X86 CPU match macros hwmon: Convert to new X86 CPU match macros platform/x86: Convert to new CPU match macros EDAC: Convert to new X86 CPU match macros cpufreq: Convert to new X86 CPU match macros ACPI: Convert to new X86 CPU match macros x86/platform: Convert to new CPU match macros x86/kernel: Convert to new CPU match macros x86/kvm: Convert to new CPU match macros x86/perf/events: Convert to new CPU match macros ...
2020-03-30Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - Continued user-access cleanups in the futex code. - percpu-rwsem rewrite that uses its own waitqueue and atomic_t instead of an embedded rwsem. This addresses a couple of weaknesses, but the primary motivation was complications on the -rt kernel. - Introduce raw lock nesting detection on lockdep (CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal lock differences. This too originates from -rt. - Reuse lockdep zapped chain_hlocks entries, to conserve RAM footprint on distro-ish kernels running into the "BUG: MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep chain-entries pool. - Misc cleanups, smaller fixes and enhancements - see the changelog for details" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits) fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t Documentation/locking/locktypes: Minor copy editor fixes Documentation/locking/locktypes: Further clarifications and wordsmithing m68knommu: Remove mm.h include from uaccess_no.h x86: get rid of user_atomic_cmpxchg_inatomic() generic arch_futex_atomic_op_inuser() doesn't need access_ok() x86: don't reload after cmpxchg in unsafe_atomic_op2() loop x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end() objtool: whitelist __sanitizer_cov_trace_switch() [parisc, s390, sparc64] no need for access_ok() in futex handling sh: no need of access_ok() in arch_futex_atomic_op_inuser() futex: arch_futex_atomic_op_inuser() calling conventions change completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() lockdep: Add posixtimer context tracing bits lockdep: Annotate irq_work lockdep: Add hrtimer context tracing bits lockdep: Introduce wait-type checks completion: Use simple wait queues sched/swait: Prepare usage in completions ...
2020-03-30Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: "The main changes in this cycle were: - Make kfree_rcu() use kfree_bulk() for added performance - RCU updates - Callback-overload handling updates - Tasks-RCU KCSAN and sparse updates - Locking torture test and RCU torture test updates - Documentation updates - Miscellaneous fixes" * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) rcu: Make rcu_barrier() account for offline no-CBs CPUs rcu: Mark rcu_state.gp_seq to detect concurrent writes Documentation/memory-barriers: Fix typos doc: Add rcutorture scripting to torture.txt doc/RCU/rcu: Use https instead of http if possible doc/RCU/rcu: Use absolute paths for non-rst files doc/RCU/rcu: Use ':ref:' for links to other docs doc/RCU/listRCU: Update example function name doc/RCU/listRCU: Fix typos in a example code snippets doc/RCU/Design: Remove remaining HTML tags in ReST files doc: Add some more RCU list patterns in the kernel rcutorture: Set KCSAN Kconfig options to detect more data races rcutorture: Manually clean up after rcu_barrier() failure rcutorture: Make rcu_torture_barrier_cbs() post from corresponding CPU rcuperf: Measure memory footprint during kfree_rcu() test rcutorture: Annotation lockless accesses to rcu_torture_current rcutorture: Add READ_ONCE() to rcu_torture_count and rcu_torture_batch rcutorture: Fix stray access to rcu_fwd_cb_nodelay rcutorture: Fix rcu_torture_one_read()/rcu_torture_writer() data race rcutorture: Make kvm-find-errors.sh abort on bad directory ...
2020-03-30Merge tag 'pm-5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These clean up and rework the PM QoS API, address a suspend-to-idle wakeup regression on some ACPI-based platforms, clean up and extend a few cpuidle drivers, update multiple cpufreq drivers and cpufreq documentation, and fix a number of issues in devfreq and several other things all over. Specifics: - Clean up and rework the PM QoS API to simplify the code and reduce the size of it (Rafael Wysocki). - Fix a suspend-to-idle wakeup regression on Dell XPS13 9370 and similar platforms where the USB plug/unplug events are handled by the EC (Rafael Wysocki). - Clean up the intel_idle and PSCI cpuidle drivers (Rafael Wysocki, Ulf Hansson). - Extend the haltpoll cpuidle driver so that it can be forced to run on some systems where it refused to load (Maciej Szmigiero). - Convert several cpufreq documents to the .rst format and move the legacy driver documentation into one common file (Mauro Carvalho Chehab, Rafael Wysocki). - Update several cpufreq drivers: * Extend and fix the imx-cpufreq-dt driver (Anson Huang). * Improve the -EPROBE_DEFER handling and fix unwanted CPU overclocking on i.MX6ULL in imx6q-cpufreq (Anson Huang, Christoph Niedermaier). * Add support for Krait based SoCs to the qcom driver (Ansuel Smith). * Add support for OPP_PLUS to ti-cpufreq (Lokesh Vutla). * Add platform specific intermediate callbacks support to cpufreq-dt and update the imx6q driver (Peng Fan). * Simplify and consolidate some pieces of the intel_pstate driver and update its documentation (Rafael Wysocki, Alex Hung). - Fix several devfreq issues: * Remove unneeded extern keyword from a devfreq header file and use the DEVFREQ_GOV_UPDATE_INTERVAL event name instead of DEVFREQ_GOV_INTERNAL (Chanwoo Choi). * Fix the handling of dev_pm_qos_remove_request() result (Leonard Crestez). * Use constant name for userspace governor (Pierre Kuo). * Get rid of doc warnings and fix a typo (Christophe JAILLET). - Use built-in RCU list checking in some places in the PM core to avoid false-positive RCU usage warnings (Madhuparna Bhowmik). - Add explicit READ_ONCE()/WRITE_ONCE() annotations to low-level PM QoS routines (Qian Cai). - Fix removal of wakeup sources to avoid NULL pointer dereferences in a corner case (Neeraj Upadhyay). - Clean up the handling of hibernate compat ioctls and fix the related documentation (Eric Biggers). - Update the idle_inject power capping driver to use variable-length arrays instead of zero-length arrays (Gustavo Silva). - Fix list format in a PM QoS document (Randy Dunlap). - Make the cpufreq stats module use scnprintf() to avoid potential buffer overflows (Takashi Iwai). - Add pm_runtime_get_if_active() to PM-runtime API (Sakari Ailus). - Allow no domain-idle-states DT property in generic PM domains (Ulf Hansson).
- Fix a broken y-axis scale in the intel_pstate_tracer utility (Doug Smythies)" * tag 'pm-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (78 commits) cpufreq: intel_pstate: Simplify intel_pstate_cpu_init() tools/power/x86/intel_pstate_tracer: fix a broken y-axis scale ACPI: PM: s2idle: Refine active GPEs check ACPICA: Allow acpi_any_gpe_status_set() to skip one GPE PM: sleep: wakeup: Skip wakeup_source_sysfs_remove() if device is not there PM / devfreq: Get rid of some doc warnings PM / devfreq: Fix handling dev_pm_qos_remove_request result PM / devfreq: Fix a typo in a comment PM / devfreq: Change to DEVFREQ_GOV_UPDATE_INTERVAL event name PM / devfreq: Remove unneeded extern keyword PM / devfreq: Use constant name of userspace governor ACPI: PM: s2idle: Fix comment in acpi_s2idle_prepare_late() cpufreq: qcom: Add support for krait based socs cpufreq: imx6q-cpufreq: Improve the logic of -EPROBE_DEFER handling cpufreq: Use scnprintf() for avoiding potential buffer overflow cpuidle: psci: Split psci_dt_cpu_init_idle() PM / Domains: Allow no domain-idle-states DT property in genpd when parsing PM / hibernate: Remove unnecessary compat ioctl overrides PM: hibernate: fix docs for ioctls that return loff_t via pointer Documentation: intel_pstate: update links for references ...
2020-03-30Merge tag 'seccomp-v5.7-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull seccomp updates from Kees Cook: "A couple of seccomp updates. They're both mostly bug fixes that I wanted to have sit in linux-next for a while: - allow TSYNC and USER_NOTIF together (Tycho Andersen) - add missing compat_ioctl for notify (Sven Schnelle)" * tag 'seccomp-v5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: seccomp: Add missing compat_ioctl for notify seccomp: allow TSYNC and USER_NOTIF together
2020-03-30Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "Here are the io_uring changes for this merge window. Light on new features this time around (just splice + buffer selection), lots of cleanups, fixes, and improvements to existing support. In particular, this contains: - Cleanup fixed file update handling for stack fallback (Hillf) - Re-work of how pollable async IO is handled, we no longer require thread offload to handle that. Instead we rely on using poll to drive this, with task_work execution. - In conjunction with the above, allow expendable buffer selection, so that poll+recv (for example) no longer has to be a split operation. - Make sure we honor RLIMIT_FSIZE for buffered writes - Add support for splice (Pavel) - Linked work inheritance fixes and optimizations (Pavel) - Async work fixes and cleanups (Pavel) - Improve io-wq locking (Pavel) - Hashed link write improvements (Pavel) - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)" * tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits) io_uring: cleanup io_alloc_async_ctx() io_uring: fix missing 'return' in comment io-wq: handle hashed writes in chains io-uring: drop 'free_pfile' in struct io_file_put io-uring: drop completion when removing file io_uring: Fix ->data corruption on re-enqueue io-wq: close cancel gap for hashed linked work io_uring: make spdxcheck.py happy io_uring: honor original task RLIMIT_FSIZE io-wq: hash dependent work io-wq: split hashing and enqueueing io-wq: don't resched if there is no work io-wq: remove duplicated cancel code io_uring: fix truncated async read/readv and write/writev retry io_uring: dual license io_uring.h uapi header io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled io_uring: Fix unused function warnings io_uring: add end-of-bits marker and build time verify it io_uring: provide means of removing buffers io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG ...
2020-03-30Merge branches 'pm-core', 'pm-sleep', 'pm-acpi' and 'pm-domains'Rafael J. Wysocki
* pm-core: PM: runtime: Add pm_runtime_get_if_active() * pm-sleep: PM: sleep: wakeup: Skip wakeup_source_sysfs_remove() if device is not there PM / hibernate: Remove unnecessary compat ioctl overrides PM: hibernate: fix docs for ioctls that return loff_t via pointer PM: sleep: wakeup: Use built-in RCU list checking PM: sleep: core: Use built-in RCU list checking * pm-acpi: ACPI: PM: s2idle: Refine active GPEs check ACPICA: Allow acpi_any_gpe_status_set() to skip one GPE ACPI: PM: s2idle: Fix comment in acpi_s2idle_prepare_late() * pm-domains: cpuidle: psci: Split psci_dt_cpu_init_idle() PM / Domains: Allow no domain-idle-states DT property in genpd when parsing
2020-03-30Merge branch 'pm-qos'Rafael J. Wysocki
* pm-qos: (30 commits) PM: QoS: annotate data races in pm_qos_*_value() Documentation: power: fix pm_qos_interface.rst format warning PM: QoS: Make CPU latency QoS depend on CONFIG_CPU_IDLE Documentation: PM: QoS: Update to reflect previous code changes PM: QoS: Update file information comments PM: QoS: Drop PM_QOS_CPU_DMA_LATENCY and rename related functions sound: Call cpu_latency_qos_*() instead of pm_qos_*() drivers: usb: Call cpu_latency_qos_*() instead of pm_qos_*() drivers: tty: Call cpu_latency_qos_*() instead of pm_qos_*() drivers: spi: Call cpu_latency_qos_*() instead of pm_qos_*() drivers: net: Call cpu_latency_qos_*() instead of pm_qos_*() drivers: mmc: Call cpu_latency_qos_*() instead of pm_qos_*() drivers: media: Call cpu_latency_qos_*() instead of pm_qos_*() drivers: hsi: Call cpu_latency_qos_*() instead of pm_qos_*() drm: i915: Call cpu_latency_qos_*() instead of pm_qos_*() x86: platform: iosf_mbi: Call cpu_latency_qos_*() instead of pm_qos_*() cpuidle: Call cpu_latency_qos_limit() instead of pm_qos_request() PM: QoS: Add CPU latency QoS API wrappers PM: QoS: Adjust pm_qos_request() signature and reorder pm_qos.h PM: QoS: Simplify definitions of CPU latency QoS trace events ...
2020-03-29seccomp: Add missing compat_ioctl for notifySven Schnelle
Executing the seccomp_bpf testsuite under a 64-bit kernel with 32-bit userland (both s390 and x86) doesn't work because there's no compat_ioctl handler defined. Add the handler. Signed-off-by: Sven Schnelle <svens@linux.ibm.com> Fixes: 6a21cc50f0c7 ("seccomp: add a return code to trap to userspace") Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20200310123332.42255-1-svens@linux.ibm.com Signed-off-by: Kees Cook <keescook@chromium.org>
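The shape of the fix, sketched — assuming the native seccomp_notify_ioctl() handler is compat-clean (the notify structures have the same layout for 32-bit and 64-bit callers), it can simply be wired up for compat callers too:

    static const struct file_operations seccomp_notify_ops = {
    	.poll		= seccomp_notify_poll,
    	.release	= seccomp_notify_release,
    	.unlocked_ioctl	= seccomp_notify_ioctl,
    	.compat_ioctl	= seccomp_notify_ioctl,	/* the added handler */
    };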
2020-03-29Merge branch 'akpm' (patches from Andrew)Linus Torvalds
Merge vm fixes from Andrew Morton: "5 fixes" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: mm/sparse: fix kernel crash with pfn_section_valid check mm: fork: fix kernel_stack memcg stats for various stack implementations hugetlb_cgroup: fix illegal access to memory drivers/base/memory.c: indicate all memory blocks as removable mm/swapfile.c: move inode_lock out of claim_swapfile
2020-03-29Merge tag 'irq-urgent-2020-03-29' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq fix from Thomas Gleixner: "A single bugfix to prevent reference leaks in irq affinity notifiers" * tag 'irq-urgent-2020-03-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Fix reference leaks on irq affinity notifiers
2020-03-29mm: fork: fix kernel_stack memcg stats for various stack implementationsRoman Gushchin
Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the space for task stacks can be allocated using __vmalloc_node_range(), alloc_pages_node() and kmem_cache_alloc_node(). In the first and second cases the page->mem_cgroup pointer is set, but in the third it's not: memcg membership of a slab page should be determined using the memcg_from_slab_page() function, which looks at page->slab_cache->memcg_params.memcg. In this case, using mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: the page->mem_cgroup pointer is NULL even for pages charged to a non-root memory cgroup. It can lead to kernel_stack per-memcg counters permanently showing 0 on some architectures (depending on the configuration). In order to fix it, let's introduce a mod_memcg_obj_state() helper, which takes a pointer to a kernel object as a first argument, uses mem_cgroup_from_obj() to get a RCU-protected memcg pointer and calls mod_memcg_state(). It allows handling all possible configurations (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without spilling any memcg/kmem specifics into fork.c. Note: This is a special version of the patch created for stable backports. It contains code from the following two patches: - mm: memcg/slab: introduce mem_cgroup_from_obj() - mm: fork: fix kernel_stack memcg stats for various stack implementations [guro@fb.com: introduce mem_cgroup_from_obj()] Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages") Signed-off-by: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Bharata B Rao <bharata@linux.ibm.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
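Following the description above, the helper looks roughly like this (a sketch; it resolves the owning memcg from the object pointer regardless of how the stack was allocated):

    void mod_memcg_obj_state(void *p, int idx, int val)
    {
    	struct mem_cgroup *memcg;

    	rcu_read_lock();
    	/* works for vmalloc, page and slab backed objects alike */
    	memcg = mem_cgroup_from_obj(p);
    	if (memcg)
    		mod_memcg_state(memcg, idx, val);
    	rcu_read_unlock();
    }

account_kernel_stack() can then pass the stack pointer directly instead of assuming page->mem_cgroup is populated.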
2020-03-28Merge branch 'uaccess.futex' of ↵Thomas Gleixner
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into locking/core Pull uaccess futex cleanups for Al Viro: Consolidate access_ok() usage and the futex uaccess function zoo.
2020-03-27futex: arch_futex_atomic_op_inuser() calling conventions changeAl Viro
Move access_ok() in and pagefault_enable()/pagefault_disable() out. Mechanical conversion only - some instances don't really need a separate access_ok() at all (e.g. the ones only using get_user()/put_user(), or architectures where access_ok() is always true); we'll deal with that in followups. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
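A sketch of the new convention (the helper body is elided; only the division of responsibility is shown):

    	/* generic caller (kernel/futex.c) keeps the pagefault guards: */
    	pagefault_disable();
    	ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
    	pagefault_enable();

    	/* each arch helper now starts with its own access_ok(): */
    	static int arch_futex_atomic_op_inuser(int op, u32 oparg, int *oval,
    					       u32 __user *uaddr)
    	{
    		if (!access_ok(uaddr, sizeof(u32)))
    			return -EFAULT;
    		/* ... perform the atomic op and report the old value ... */
    	}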
2020-03-27Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2020-03-27 The following pull-request contains BPF updates for your *net* tree. We've added 3 non-merge commits during the last 4 day(s) which contain a total of 4 files changed, 25 insertions(+), 20 deletions(-). The main changes are: 1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid having to rely on compiler to do so. Issues have been noticed on some compilers with padding and other oddities where the request was then unexpectedly rejected, from Greg Kroah-Hartman. 2) Sanitize the bpf_struct_ops TCP congestion control name in order to avoid problematic characters such as whitespaces, from Martin KaFai Lau. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix deadlock in bpf_send_signal() from Yonghong Song. 2) Fix off by one in kTLS offload of mlx5, from Tariq Toukan. 3) Add missing locking in iwlwifi mvm code, from Avraham Stern. 4) Fix MSG_WAITALL handling in rxrpc, from David Howells. 5) Need to hold RTNL mutex in tcindex_partial_destroy_work(), from Cong Wang. 6) Fix producer race condition in AF_PACKET, from Willem de Bruijn. 7) cls_route removes the wrong filter during change operations, from Cong Wang. 8) Reject unrecognized request flags in ethtool netlink code, from Michal Kubecek. 9) Need to keep MAC in reset until PHY is up in bcmgenet driver, from Doug Berger. 10) Don't leak ct zone template in act_ct during replace, from Paul Blakey. 11) Fix flushing of offloaded netfilter flowtable flows, also from Paul Blakey. 12) Fix throughput drop during tx backpressure in cxgb4, from Rahul Lakkireddy. 13) Don't let a non-NULL skb->dev leave the TCP stack, from Eric Dumazet. 14) TCP_QUEUE_SEQ socket option has to update tp->copied_seq as well, also from Eric Dumazet. 15) Restrict macsec to ethernet devices, from Willem de Bruijn. 16) Fix reference leak in some ethtool *_SET handlers, from Michal Kubecek. 17) Fix accidental disabling of MSI for some r8169 chips, from Heiner Kallweit. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits) net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build net: ena: Add PCI shutdown handler to allow safe kexec selftests/net/forwarding: define libs as TEST_PROGS_EXTENDED selftests/net: add missing tests to Makefile r8169: re-enable MSI on RTL8168c net: phy: mdio-bcm-unimac: Fix clock handling cxgb4/ptp: pass the sign of offset delta in FW CMD net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop net: cbs: Fix software cbs to consider packet sending time net/mlx5e: Do not recover from a non-fatal syndrome net/mlx5e: Fix ICOSQ recovery flow with Striding RQ net/mlx5e: Fix missing reset of SW metadata in Striding RQ reset net/mlx5e: Enhance ICOSQ WQE info fields net/mlx5_core: Set IB capability mask1 to fix ib_srpt connection failure selftests: netfilter: add nfqueue test case netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress netfilter: nft_fwd_netdev: validate family and chain type netfilter: nft_set_rbtree: Detect partial overlaps on insertion netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start() netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion ...
2020-03-24Merge branch 'for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu Pull RCU changes from Paul E. McKenney: - Make kfree_rcu() use kfree_bulk() for added performance - RCU updates - Callback-overload handling updates - Tasks-RCU KCSAN and sparse updates - Locking torture test and RCU torture test updates - Documentation updates - Miscellaneous fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-03-23completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()Sebastian Siewior
The warning was intended to spot complete_all() users from hardirq context on PREEMPT_RT. The warning as-is will also trigger in interrupt handlers, which are threaded on PREEMPT_RT, which was not intended. Use lockdep_assert_RT_in_threaded_ctx() which triggers in non-preemptive context on PREEMPT_RT. Fixes: a5c6234e1028 ("completion: Use simple wait queues") Reported-by: kernel test robot <rong.a.chen@intel.com> Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200323152019.4qjwluldohuh3by5@linutronix.de
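A rough sketch of the resulting check in complete_all() after the swait conversion (see the "completion: Use simple wait queues" entry below):

    void complete_all(struct completion *x)
    {
    	unsigned long flags;

    	/* triggers only in non-preemptible context on PREEMPT_RT */
    	lockdep_assert_RT_in_threaded_ctx();

    	raw_spin_lock_irqsave(&x->wait.lock, flags);
    	x->done = UINT_MAX;
    	swake_up_all_locked(&x->wait);
    	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
    }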
2020-03-21x86/mm: split vmalloc_sync_all()Joerg Roedel
Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the vunmap() code-path. While this change was necessary to maintain correctness on x86-32-pae kernels, it also adds additional cycles for architectures that don't need it. Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported severe performance regressions in micro-benchmarks because it now also calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But the vmalloc_sync_all() implementation on x86-64 is only needed for newly created mappings. To avoid the unnecessary work on x86-64 and to gain the performance back, split up vmalloc_sync_all() into two functions: * vmalloc_sync_mappings(), and * vmalloc_sync_unmappings() Most call-sites to vmalloc_sync_all() only care about new mappings being synchronized. The only exception is the new call-site added in the above mentioned commit. Shile Zhang directed us to a report of an 80% regression in reaim throughput. Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()") Reported-by: kernel test robot <oliver.sang@intel.com> Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [GHES] Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/ Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
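The split, sketched; per the message above, the unmapping variant can be a no-op on x86-64 since its vmalloc_sync_all() work is only needed for newly created mappings:

    /* sync newly created mappings into all page tables */
    void vmalloc_sync_mappings(void);

    /* sync unmappings; needed for x86-32 PAE page-table handling */
    void vmalloc_sync_unmappings(void);

    	/* in the vunmap() path added by 3f8fd02b1bf1: */
    	vmalloc_sync_unmappings();	/* was: vmalloc_sync_all() */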
2020-03-21Merge branches 'doc.2020.02.27a', 'fixes.2020.03.21a', ↵Paul E. McKenney
'kfree_rcu.2020.02.20a', 'locktorture.2020.02.20a', 'ovld.2020.02.20a', 'rcu-tasks.2020.02.20a', 'srcu.2020.02.20a' and 'torture.2020.02.20a' into HEAD doc.2020.02.27a: Documentation updates. fixes.2020.03.21a: Miscellaneous fixes. kfree_rcu.2020.02.20a: Updates to kfree_rcu(). locktorture.2020.02.20a: Lock torture-test updates. ovld.2020.02.20a: Updates to callback-overload handling. rcu-tasks.2020.02.20a: RCU-tasks updates. srcu.2020.02.20a: SRCU updates. torture.2020.02.20a: Torture-test updates.
2020-03-21rcu: Make rcu_barrier() account for offline no-CBs CPUsPaul E. McKenney
Currently, rcu_barrier() ignores offline CPUs. However, it is possible for an offline no-CBs CPU to have callbacks queued, and rcu_barrier() must wait for those callbacks. This commit therefore makes rcu_barrier() directly invoke the rcu_barrier_func() with interrupts disabled for such CPUs. This requires passing the CPU number into this function so that it can entrain the rcu_barrier() callback onto the correct CPU's callback list, given that the code must instead execute on the current CPU. While in the area, this commit fixes a bug where the first CPU's callback might have been invoked before rcu_segcblist_entrain() returned, which would also result in an early wakeup. Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs") Signed-off-by: Paul E. McKenney <paulmck@kernel.org> [ paulmck: Apply optimization feedback from Boqun Feng. ] Cc: <stable@vger.kernel.org> # 5.5.x
2020-03-21rcu: Mark rcu_state.gp_seq to detect concurrent writesPaul E. McKenney
The rcu_state structure's gp_seq field is only to be modified by the RCU grace-period kthread, which is single-threaded. This commit therefore enlists KCSAN's help in enforcing this restriction. Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-03-21genirq: Fix reference leaks on irq affinity notifiersEdward Cree
The handling of notify->work did not properly maintain notify->kref in two cases: 1) where the work was already scheduled, another irq_set_affinity_locked() would get the ref and (no-op-ly) schedule the work. Thus when irq_affinity_notify() ran, it would drop the original ref but not the additional one. 2) when cancelling the (old) work in irq_set_affinity_notifier(), if there was outstanding work a ref had been got for it but was never put. Fix both by checking the return values of the work handling functions (schedule_work() for (1) and cancel_work_sync() for (2)) and put the extra ref if the return value indicates preexisting work. Fixes: cd7eab44e994 ("genirq: Add IRQ affinity notifiers") Fixes: 59c39840f5ab ("genirq: Prevent use-after-free and work list corruption") Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ben Hutchings <ben@decadent.org.uk> Link: https://lkml.kernel.org/r/24f5983f-2ab5-e83a-44ee-a45b5f9300f5@solarflare.com
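The two fixes, sketched; in both cases the return value tells us whether a reference got for previously queued work is now surplus:

    	/* (1) schedule_work() returns false if the work was already
    	 * queued; the ref taken for this call must then be dropped.
    	 */
    	if (!schedule_work(&notify->work))
    		kref_put(&notify->kref, notify->release);

    	/* (2) cancel_work_sync() returns true if it cancelled pending
    	 * work; that pending work held a ref which must be dropped
    	 * in addition to the regular one.
    	 */
    	if (old_notify) {
    		if (cancel_work_sync(&old_notify->work))
    			kref_put(&old_notify->kref, old_notify->release);
    		kref_put(&old_notify->kref, old_notify->release);
    	}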
2020-03-21lockdep: Add posixtimer context tracing bitsSebastian Andrzej Siewior
Splitting run_posix_cpu_timers() into two parts is work in progress which is stuck on other entry code related problems. The heavy lifting which involves locking of sighand lock will be moved into task context so the necessary execution time is burdened on the task and not on interrupt context. Until this work completes lockdep with the spinlock nesting rules enabled would emit warnings for this known context. Prevent it by setting "->irq_config = 1" for the invocation of run_posix_cpu_timers() so lockdep does not complain when sighand lock is acquired. This will be removed once the split is completed. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.751182723@linutronix.de
2020-03-21lockdep: Annotate irq_workSebastian Andrzej Siewior
Mark irq_work items with IRQ_WORK_HARD_IRQ which should be invoked in hardirq context even on PREEMPT_RT. IRQ_WORK without this flag will be invoked in softirq context on PREEMPT_RT. Set ->irq_config to 1 for the IRQ_WORK items which are invoked in softirq context so lockdep knows that these can safely acquire a spinlock_t. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.643576700@linutronix.de
2020-03-21lockdep: Add hrtimer context tracing bitsSebastian Andrzej Siewior
Set current->irq_config = 1 for hrtimers which are not marked to expire in hard interrupt context during hrtimer_init(). These timers will expire in softirq context on PREEMPT_RT. Setting this allows lockdep to differentiate these timers. If a timer is marked to expire in hard interrupt context, its expiry callback is not supposed to acquire a regular spinlock; only a raw_spinlock may be taken there. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.534508206@linutronix.de
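A sketch of how the tracing bit might be applied around expiry (helper names are illustrative; this assumes the timer carries the is_hard flag set up by hrtimer_init() from HRTIMER_MODE_HARD):

    static inline void lockdep_hrtimer_enter(struct hrtimer *timer)
    {
    	if (!timer->is_hard)
    		current->irq_config = 1;	/* softirq expiry on RT */
    }

    static inline void lockdep_hrtimer_exit(struct hrtimer *timer)
    {
    	if (!timer->is_hard)
    		current->irq_config = 0;
    }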
2020-03-21lockdep: Introduce wait-type checksPeter Zijlstra
Extend lockdep to validate lock wait-type context. The current wait-types are:

    LD_WAIT_FREE,	/* wait free, rcu etc.. */
    LD_WAIT_SPIN,	/* spin loops, raw_spinlock_t etc.. */
    LD_WAIT_CONFIG,	/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
    LD_WAIT_SLEEP,	/* sleeping locks, mutex_t etc.. */

Where lockdep validates that the current lock (the one being acquired) fits in the current wait-context (as generated by the held stack). This ensures that there is no attempt to acquire mutexes while holding spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In other words, it's a fancier might_sleep(). Obviously RCU made the entire ordeal more complex than a simple single value test because RCU can be acquired in (pretty much) any context and while it presents a context to nested locks it is not the same as it got acquired in. Therefore it's necessary to split the wait_type into two values, one representing the acquire (outer) and one representing the nested context (inner). For most 'normal' locks these two are the same. [ To make static initialization easier we have the rule that: .outer == INV means .outer == .inner; because INV == 0. ] It further means that it's required to find the minimal .inner of the held stack to compare against the outer of the new lock; because while 'normal' RCU presents a CONFIG type to nested locks, if it is taken while already holding a SPIN type it obviously doesn't relax the rules. Below is an example output generated by the trivial test code:

    raw_spin_lock(&foo);
    spin_lock(&bar);
    spin_unlock(&bar);
    raw_spin_unlock(&foo);

    [ BUG: Invalid wait context ]
    -----------------------------
    swapper/0/1 is trying to lock:
    ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
    other info that might help us debug this:
    1 lock held by swapper/0/1:
     #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187

The way to read it is to look at the new -{n:m} part in the lock description; -{3:3} for the attempted lock, and try and match that up to the held locks, which in this case is the one: -{2:2}. This tells that the acquiring lock requires a more relaxed environment than presented by the lock stack. Currently only the normal locks and RCU are converted; the rest of the lockdep users default to .inner = INV which is ignored. More conversions can be done when desired. The check for spinlock_t nesting is not enabled by default. It's a separate config option for now as there are known problems which are currently being addressed. The config option allows identifying these problems and verifying that the solutions found are indeed solving them. The config switch will be removed and the checks will be permanently enabled once the vast majority of issues have been addressed. [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP] [ tglx: Add the config option ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
2020-03-21completion: Use simple wait queuesThomas Gleixner
completion uses a wait_queue_head_t to enqueue waiters. wait_queue_head_t contains a spinlock_t to protect the list of waiters which excludes it from being used in truly atomic context on a PREEMPT_RT enabled kernel. The spinlock in the wait queue head cannot be replaced by a raw_spinlock because: - wait queues can have custom wakeup callbacks, which acquire other spinlock_t locks and have potentially long execution times - wake_up() walks an unbounded number of list entries during the wake up and may wake an unbounded number of waiters. For simplicity and performance reasons complete() should be usable on PREEMPT_RT enabled kernels. completions do not use custom wakeup callbacks and are usually single waiter, except for a few corner cases. Replace the wait queue in the completion with a simple wait queue (swait), which uses a raw_spinlock_t for protecting the waiter list and therefore is safe to use inside truly atomic regions on PREEMPT_RT. There is no semantic or functional change: - completions use the exclusive wait mode which is what swait provides - complete() wakes one exclusive waiter - complete_all() wakes all waiters while holding the lock which protects the wait queue against newly incoming waiters. The conversion to swait preserves this behaviour. complete_all() might cause unbound latencies with a large number of waiters being woken at once, but most complete_all() usage sites are either in testing or initialization code or have only a really small number of concurrent waiters which for now does not cause a latency problem. Keep it simple for now. The fixup of the warning check in the USB gadget driver is just a straightforward conversion of the lockless waiter check from one waitqueue type to the other. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lkml.kernel.org/r/20200321113242.317954042@linutronix.de
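After the conversion, the completion looks roughly like this (a sketch of the data structure and the single-waiter wakeup):

    struct completion {
    	unsigned int done;
    	struct swait_queue_head wait;	/* was: wait_queue_head_t */
    };

    void complete(struct completion *x)
    {
    	unsigned long flags;

    	/* raw_spinlock_t: safe in truly atomic context on RT */
    	raw_spin_lock_irqsave(&x->wait.lock, flags);
    	if (x->done != UINT_MAX)
    		x->done++;
    	swake_up_locked(&x->wait);	/* wake one exclusive waiter */
    	raw_spin_unlock_irqrestore(&x->wait.lock, flags);
    }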
2020-03-21sched/swait: Prepare usage in completionsThomas Gleixner
As a preparation to use simple wait queues for completions: - Provide swake_up_all_locked() to support complete_all() - Make __prepare_to_swait() public available This is done to enable the usage of complete() within truly atomic contexts on a PREEMPT_RT enabled kernel. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.228481202@linutronix.de
2020-03-21timekeeping: Split jiffies seqlockThomas Gleixner
seqlock consists of a sequence counter and a spinlock_t which is used to serialize the writers. spinlock_t is substituted by a "sleeping" spinlock on PREEMPT_RT enabled kernels which breaks the usage in the timekeeping code as the writers are executed in hard interrupt and therefore non-preemptible context even on PREEMPT_RT. The spinlock in seqlock cannot be unconditionally replaced by a raw_spinlock_t as many seqlock users have nesting spinlock sections or other code which is not suitable to run in truly atomic context on RT. Instead of providing a raw_seqlock API for a single use case, open code the seqlock for the jiffies use case and implement it with a raw_spinlock_t and a sequence counter. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113242.120587764@linutronix.de
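A sketch of the open-coded variant (the wrapper function names are illustrative; the real code keeps the writer serialization and read-side retry exactly like this):

    static __cacheline_aligned_in_smp DEFINE_RAW_SPINLOCK(jiffies_lock);
    static __cacheline_aligned_in_smp seqcount_t jiffies_seq;

    /* writer side, runs in hard interrupt context: */
    static void jiffies_advance(s64 ticks)
    {
    	raw_spin_lock(&jiffies_lock);
    	write_seqcount_begin(&jiffies_seq);
    	jiffies_64 += ticks;
    	write_seqcount_end(&jiffies_seq);
    	raw_spin_unlock(&jiffies_lock);
    }

    /* reader side retries on concurrent updates: */
    static u64 jiffies_snapshot(void)
    {
    	unsigned int seq;
    	u64 now;

    	do {
    		seq = read_seqcount_begin(&jiffies_seq);
    		now = jiffies_64;
    	} while (read_seqcount_retry(&jiffies_seq, seq));
    	return now;
    }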
2020-03-21rcuwait: Add @state argument to rcuwait_wait_event()Peter Zijlstra (Intel)
Extend rcuwait_wait_event() with a state variable so that it is not restricted to UNINTERRUPTIBLE waits. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200321113241.824030968@linutronix.de
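Usage sketch (the wait condition and the rcuwait instance are illustrative):

    	struct rcuwait w;

    	rcuwait_init(&w);

    	/* before: the wait was implicitly TASK_UNINTERRUPTIBLE */
    	rcuwait_wait_event(&w, READ_ONCE(done));

    	/* after: the caller passes the task state explicitly */
    	rcuwait_wait_event(&w, READ_ONCE(done), TASK_UNINTERRUPTIBLE);
    	rcuwait_wait_event(&w, READ_ONCE(done), TASK_INTERRUPTIBLE);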
2020-03-20bpf: Explicitly memset some bpf info structures declared on the stackGreg Kroah-Hartman
Trying to initialize a structure with "= {};" will not always clean out all padding locations in a structure. So be explicit and call memset to initialize everything for a number of bpf information structures that are then copied from userspace, sometimes from smaller memory locations than the size of the structure. Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200320162258.GA794295@kroah.com
2020-03-20bpf: Explicitly memset the bpf_attr structureGreg Kroah-Hartman
For the bpf syscall, we are relying on the compiler to properly zero out the bpf_attr union that we copy userspace data into. Unfortunately that doesn't always work properly, padding and other oddities might not be correctly zeroed, and in some tests odd things have been found when the stack is pre-initialized to other values. Fix this by explicitly memsetting the structure to 0 before using it. Reported-by: Maciej Żenczykowski <maze@google.com> Reported-by: John Stultz <john.stultz@linaro.org> Reported-by: Alexander Potapenko <glider@google.com> Reported-by: Alistair Delva <adelva@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://android-review.googlesource.com/c/kernel/common/+/1235490 Link: https://lore.kernel.org/bpf/20200320094813.GA421650@kroah.com
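The fix, sketched; the explicit memset guarantees padding is zeroed before copy_from_user() partially fills the union:

    	union bpf_attr attr;

    	/* "union bpf_attr attr = {};" may leave padding bytes
    	 * uninitialized; zero the whole thing explicitly.
    	 */
    	memset(&attr, 0, sizeof(attr));
    	if (copy_from_user(&attr, uattr, size) != 0)
    		return -EFAULT;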
2020-03-20lockdep: Teach lockdep about "USED" <- "IN-NMI" inversionsPeter Zijlstra
nmi_enter() does lockdep_off() and hence lockdep ignores everything. And NMI context makes it impossible to do full IN-NMI tracking like we do IN-HARDIRQ, that could result in graph_lock recursion. However, since look_up_lock_class() is lockless, we can find the class of a lock that has prior use and detect IN-NMI after USED, just not USED after IN-NMI. NOTE: By shifting the lockdep_off() recursion count to bit-16, we can easily differentiate between actual recursion and off. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Link: https://lkml.kernel.org/r/20200221134215.090538203@infradead.org
2020-03-20locking/lockdep: Rework lockdep_lockPeter Zijlstra
A few sites want to assert we own the graph_lock/lockdep_lock, provide a more conventional lock interface for it with a number of trivial debug checks. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200313102107.GX12561@hirez.programming.kicks-ass.net
2020-03-20locking/lockdep: Fix bad recursion patternPeter Zijlstra
There were two patterns for lockdep_recursion:

    Pattern-A:
    	if (current->lockdep_recursion)
    		return

    	current->lockdep_recursion = 1;
    	/* do stuff */
    	current->lockdep_recursion = 0;

    Pattern-B:
    	current->lockdep_recursion++;
    	/* do stuff */
    	current->lockdep_recursion--;

But a third pattern has emerged:

    Pattern-C:
    	current->lockdep_recursion = 1;
    	/* do stuff */
    	current->lockdep_recursion = 0;

And while this isn't broken per se, it is highly dangerous because it doesn't nest properly. Get rid of all Pattern-C instances and shore up Pattern-A with a warning.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200313093325.GW12561@hirez.programming.kicks-ass.net
2020-03-20locking/lockdep: Avoid recursion in lockdep_count_{for,back}ward_deps()Boqun Feng
Qian Cai reported a bug when PROVE_RCU_LIST=y, and read on /proc/lockdep triggered a warning:

    [ ] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled)
    ...
    [ ] Call Trace:
    [ ]  lock_is_held_type+0x5d/0x150
    [ ]  ? rcu_lockdep_current_cpu_online+0x64/0x80
    [ ]  rcu_read_lock_any_held+0xac/0x100
    [ ]  ? rcu_read_lock_held+0xc0/0xc0
    [ ]  ? __slab_free+0x421/0x540
    [ ]  ? kasan_kmalloc+0x9/0x10
    [ ]  ? __kmalloc_node+0x1d7/0x320
    [ ]  ? kvmalloc_node+0x6f/0x80
    [ ]  __bfs+0x28a/0x3c0
    [ ]  ? class_equal+0x30/0x30
    [ ]  lockdep_count_forward_deps+0x11a/0x1a0

The warning got triggered because lockdep_count_forward_deps() calls __bfs() without current->lockdep_recursion being set; as a result a lockdep internal function (__bfs()) is checked by lockdep, which is unexpected, and the inconsistency between the irq-off state and the state traced by lockdep caused the warning. Apart from this warning, lockdep internal functions like __bfs() should always be protected by current->lockdep_recursion to avoid potential deadlocks and data inconsistency; therefore add the current->lockdep_recursion on-and-off section to protect __bfs() in both lockdep_count_forward_deps() and lockdep_count_backward_deps().
Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Boqun Feng <boqun.feng@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200312151258.128036-1-boqun.feng@gmail.com
2020-03-20perf/core: Fix reversed NULL check in perf_event_groups_less()Dan Carpenter
This NULL check is reversed so it leads to a Smatch warning and presumably a NULL dereference. kernel/events/core.c:1598 perf_event_groups_less() error: we previously assumed 'right->cgrp->css.cgroup' could be null (see line 1590) Fixes: 95ed6c707f26 ("perf/cgroup: Order events in RB tree by cgroup id") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200312105637.GA8960@mwanda
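A sketch of the reversed check, reconstructed from the Smatch warning above (the surrounding comparison branches are illustrative):

    	if (!left->cgrp || !left->cgrp->css.cgroup)
    		return true;	/* left has no cgroup but right does */

    	/* buggy: the negation is missing, so a present cgroup takes
    	 * the "no cgroup" branch and a NULL one is dereferenced later
    	 */
    	if (!right->cgrp || right->cgrp->css.cgroup)
    		return false;	/* right has no cgroup but left does */

    	/* fixed: */
    	if (!right->cgrp || !right->cgrp->css.cgroup)
    		return false;	/* right has no cgroup but left does */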
2020-03-20perf/core: Fix endless multiplex timerPeter Zijlstra
Kan and Andi reported that we fail to kill rotation when the flexible events go empty, but the context does not. XXX moar Fixes: fd7d55172d1e ("perf/cgroups: Don't rotate events for cgroups unnecessarily") Reported-by: Andi Kleen <ak@linux.intel.com> Reported-by: Kan Liang <kan.liang@linux.intel.com> Tested-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200305123851.GX2596@hirez.programming.kicks-ass.net
2020-03-20sched/fair: Fix condition of avg_load calculationTao Zhou
In update_sg_wakeup_stats(), the comment says: Computing avg_load makes sense only when group is fully busy or overloaded. But the code below this comment does not perform that check. From reading the code about avg_load in other functions, I confirm that avg_load should be calculated in the fully busy or overloaded case. The comment is correct and the checking condition is wrong, so change that condition. Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()") Signed-off-by: Tao Zhou <ouwen210@hotmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Mel Gorman <mgorman@suse.de> Link: https://lkml.kernel.org/r/Message-ID:
2020-03-20sched/rt: cpupri_find: Trigger a full search as fallbackQais Yousef
If we failed to find a fitting CPU in cpupri_find(), we only fall back to the level we found a hit at. But Steve suggested falling back to a second full scan instead, as this could be a better effort. https://lore.kernel.org/lkml/20200304135404.146c56eb@gandalf.local.home/ We trigger the 2nd search unconditionally since the argument for triggering a full search is that the recorded fallback level might have become empty by then, which means storing any data about what happened would be meaningless and stale. I had a humble try at timing it and it seemed okay for the small 6-CPU system I was running on https://lore.kernel.org/lkml/20200305124324.42x6ehjxbnjkklnh@e107158-lin.cambridge.arm.com/ On a large system this second full scan could be expensive. But there are no users outside capacity awareness for this fitness function at the moment. Heterogeneous systems tend to be small, with 8 cores in total. Suggested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20200310142219.syxzn5ljpdxqtbgx@e107158-lin.cambridge.arm.com
2020-03-20kthread: Do not preempt current task if it is going to call schedule()Liang Chen
When we create a kthread with kthread_create_on_cpu(), the child thread entry is kthread.c:kthread(), which will be preempted by the parent after the call to complete(done) while schedule() has not been called yet. The parent will then call wait_task_inactive(child), but the child is still on the runqueue, so the parent will schedule_hrtimeout() for 1 jiffy. This wastes a lot of time, especially at startup.

    parent                               child
    kthread_create_on_cpu()
      wait_for_completion(&done)  ---->  kthread.c:kthread()
                       |------------------ complete(done); -- wakeup,
                       |                    preempted by parent
      kthread_bind() <-|                 |
                       |                 |-> schedule(); -- dequeue here
      wait_task_inactive(child)          |
      schedule_hrtimeout(1 jiffy)  ------|

So we want the child to just wake up the parent without being preempted by it, since the child is going to call schedule() soon anyway; then the parent will not call schedule_hrtimeout(1 jiffy) because the child has already been dequeued. The same issue exists for kthread_park() && kthread_parkme(). This patch can save 120ms on rk312x startup with CONFIG_HZ=300.
Signed-off-by: Liang Chen <cl@rock-chips.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lkml.kernel.org/r/20200306070133.18335-2-cl@rock-chips.com
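A sketch of the fix in kthread() (exact placement is illustrative): disable preemption across the completion so the creator's wakeup cannot preempt the child before it reaches schedule():

    	__set_current_state(TASK_UNINTERRUPTIBLE);
    	/*
    	 * Disable preemption so the wakeup of the creator cannot
    	 * preempt us before we call schedule(); otherwise the creator
    	 * sees us still on the runqueue and burns a 1-jiffy
    	 * schedule_hrtimeout() in wait_task_inactive().
    	 */
    	preempt_disable();
    	complete(done);
    	schedule_preempt_disabled();
    	preempt_enable();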
2020-03-20sched/fair: Improve spreading of utilizationVincent Guittot
During load_balancing, a group with spare capacity will try to pull some utilization from an overloaded group. In such a case, the load balance looks for the runqueue with the highest utilization. Nevertheless, it should also ensure that there are some pending tasks to pull, otherwise the load balance will fail to pull a task and the spreading of the load will be delayed. This situation is quite transient, but it's possible to highlight the effect with a short run of a sysbench test, so the time to spread tasks impacts the global result significantly. Below are the average results for 15 iterations on an arm64 octo core:

    sysbench --test=cpu --num-threads=8 --max-requests=1000 run

                               tip/sched/core  +patchset
    total time:                         172ms      158ms
    per-request statistics:
       avg:                           1.337ms    1.244ms
       max:                          21.191ms   10.753ms

The average max doesn't fully reflect the wide spread of the value, which ranges from 1.350ms to more than 41ms for tip/sched/core and from 1.350ms to 21ms with the patch. Other factors like waiting for an idle load balance or cache hotness can delay the spreading of the tasks, which explains why we can still have up to 21ms with the patch.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200312165429.990-1-vincent.guittot@linaro.org
2020-03-20sched: Avoid scale real weight down to zeroMichael Wang
During our testing, we found a case where shares were no longer working correctly. The cgroup topology is like:

    /sys/fs/cgroup/cpu/A         (shares=102400)
    /sys/fs/cgroup/cpu/A/B       (shares=2)
    /sys/fs/cgroup/cpu/A/B/C     (shares=1024)

    /sys/fs/cgroup/cpu/D         (shares=1024)
    /sys/fs/cgroup/cpu/D/E       (shares=1024)
    /sys/fs/cgroup/cpu/D/E/F     (shares=1024)

The same benchmark is running in groups C & F, no other tasks are running, and the benchmark is capable of consuming all the CPUs. We suppose group C will win more CPU resources since it could enjoy all the shares of group A, but it's F who wins much more. The reason is that we have group B with shares of 2; since A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus, A->cfs_rq.load.weight becomes very small. And in calc_group_shares() we calculate shares as:

    load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
    shares = (tg_shares * load) / tg_weight;

Since 'cfs_rq->load.weight' is too small, the load becomes 0 after scale-down; although 'tg_shares' is 102400, the shares of the se which stands for group A on the root cfs_rq become 2. The se of D on the root cfs_rq is far bigger than 2, so it wins the battle. Thus when scale_load_down() scales the real weight down to 0, it is no longer telling the real story; the caller will have wrong information and the calculation will be buggy. This patch adds a check in scale_load_down() so the real weight will be >= MIN_SHARES after scaling; after applying it, group C wins as expected.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
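A sketch of the fixed scale_load_down(), clamping a non-zero scaled weight to MIN_SHARES (assumed here to be 2 after scale-down):

    #define scale_load_down(w)					\
    ({								\
    	unsigned long __w = (w);				\
    	/* a tiny but non-zero weight must never become 0 */	\
    	if (__w)						\
    		__w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);	\
    	__w;							\
    })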
2020-03-20psi: Move PF_MEMSTALL out of task->flagsYafang Shao
task->flags is a 32-bit flags field in which 31 bits have already been consumed, so it is hard to introduce another new per-process flag. Currently there is still enough space in the bit-field section of task_struct, so we can define the memstall state as a single bit in task_struct instead. This patch also removes an out-of-date comment pointed out by Matthew. Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com
2020-03-20psi: Optimize switching tasks inside shared cgroupsJohannes Weiner
When switching tasks running on a CPU, the psi state of a cgroup containing both of these tasks does not change. Right now, we don't exploit that, and can perform many unnecessary state changes in nested hierarchies, especially when most activity comes from one leaf cgroup. This patch implements an optimization where we only update cgroups whose state actually changes during a task switch. These are all cgroups that contain one task but not the other, up to the first shared ancestor. When both tasks are in the same group, we don't need to update anything at all. We can identify the first shared ancestor by walking the groups of the incoming task until we see TSK_ONCPU set on the local CPU; that's the first group that also contains the outgoing task. The new psi_task_switch() is similar to psi_task_change(). To allow code reuse, move the task flag maintenance code into a new function and the poll/avg worker wakeups into the shared psi_group_change(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
2020-03-20psi: Fix cpu.pressure for cpu.max and competing cgroupsJohannes Weiner
For simplicity, cpu pressure is defined as having more than one runnable task on a given CPU. This works on the system-level, but it has limitations in a cgrouped reality: When cpu.max is in use, it doesn't capture the time in which a task is not executing on the CPU due to throttling. Likewise, it doesn't capture the time in which a competing cgroup is occupying the CPU - meaning it only reflects cgroup-internal competitive pressure, not outside pressure. Enable tracking of currently executing tasks, and then change the definition of cpu pressure in a cgroup from NR_RUNNING > 1 to NR_RUNNING > ON_CPU which will capture the effects of cpu.max as well as competition from outside the cgroup. After this patch, a cgroup running `stress -c 1` with a cpu.max setting of 5000 10000 shows ~50% continuous CPU pressure. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
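The redefinition, sketched as it might appear in psi's test_state() (NR_ONCPU being the newly tracked count of currently executing tasks):

    	case PSI_CPU_SOME:
    		/* was: tasks[NR_RUNNING] > 1 */
    		return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);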
2020-03-20sched/core: Distribute tasks within affinity masksPaul Turner
Currently, when updating the affinity of tasks via either cpusets.cpus or sched_setaffinity(), tasks not currently running within the newly specified mask will be arbitrarily assigned to the first CPU within the mask. This (particularly in the case that we are restricting masks) can result in many tasks being assigned to the first CPUs of their new masks. This: 1) Can induce scheduling delays while the load-balancer has a chance to spread them between their new CPUs. 2) Can antagonize a poor load-balancer behavior where it has a difficult time recognizing that a cross-socket imbalance has been forced by an affinity mask. This change adds a new cpumask interface (sketched below) to allow iterated calls to distribute within the intersection of the provided masks. The cases that this mainly affects are: - modifying cpuset.cpus - when tasks join a cpuset - when modifying a task's affinity via sched_setaffinity(2) Signed-off-by: Paul Turner <pjt@google.com> Signed-off-by: Josh Don <joshdon@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Tested-by: Qais Yousef <qais.yousef@arm.com> Link: https://lkml.kernel.org/r/20200311010113.136465-1-joshdon@google.com
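A sketch of the new helper following the description above: a per-CPU rotor spreads successive picks across the mask instead of always returning the first CPU:

    static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);

    int cpumask_any_and_distribute(const struct cpumask *src1p,
    			       const struct cpumask *src2p)
    {
    	int next, prev;

    	/* NOTE: our first selection will skip 0. */
    	prev = __this_cpu_read(distribute_cpu_mask_prev);

    	next = cpumask_next_and(prev, src1p, src2p);
    	if (next >= nr_cpu_ids)
    		next = cpumask_first_and(src1p, src2p);

    	if (next < nr_cpu_ids)
    		__this_cpu_write(distribute_cpu_mask_prev, next);

    	return next;
    }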