aboutsummaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2020-01-28Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "These were the main changes in this cycle: - More -rt motivated separation of CONFIG_PREEMPT and CONFIG_PREEMPTION. - Add more low level scheduling topology sanity checks and warnings to filter out nonsensical topologies that break scheduling. - Extend uclamp constraints to influence wakeup CPU placement - Make the RT scheduler more aware of asymmetric topologies and CPU capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y - Make idle CPU selection more consistent - Various fixes, smaller cleanups, updates and enhancements - please see the git log for details" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits) sched/fair: Define sched_idle_cpu() only for SMP configurations sched/topology: Assert non-NUMA topology masks don't (partially) overlap idle: fix spelling mistake "iterrupts" -> "interrupts" sched/fair: Remove redundant call to cpufreq_update_util() sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP sched/fair: calculate delta runnable load only when it's needed sched/cputime: move rq parameter in irqtime_account_process_tick stop_machine: Make stop_cpus() static sched/debug: Reset watchdog on all CPUs while processing sysrq-t sched/core: Fix size of rq::uclamp initialization sched/uclamp: Fix a bug in propagating uclamp value in new cgroups sched/fair: Load balance aggressively for SCHED_IDLE CPUs sched/fair : Improve update_sd_pick_busiest for spare capacity case watchdog: Remove soft_lockup_hrtimer_cnt and related code sched/rt: Make RT capacity-aware sched/fair: Make EAS wakeup placement consider uclamp restrictions sched/fair: Make task_fits_capacity() consider uclamp restrictions sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with() sched/uclamp: Make uclamp util helpers use and return UL values ...
2020-01-22genirq, sched/isolation: Isolate from handling managed interruptsMing Lei
The affinity of managed interrupts is completely handled in the kernel and cannot be changed via the /proc/irq/* interfaces from user space. As the kernel tries to spread out interrupts evenly accross CPUs on x86 to prevent vector exhaustion, it can happen that a managed interrupt whose affinity mask contains both isolated and housekeeping CPUs is routed to an isolated CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts on the isolated CPU. Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding logic in the interrupt affinity selection code. The subparameter indicates to the interrupt affinity selection logic that it should try to avoid the above scenario. This isolation is best effort and only effective if the automatically assigned interrupt mask of a device queue contains isolated and housekeeping CPUs. If housekeeping CPUs are online then such interrupts are directed to the housekeeping CPU so that IO submitted on the housekeeping CPU cannot disturb the isolated CPU. If a queue's affinity mask contains only isolated CPUs then this parameter has no effect on the interrupt routing decision, though interrupts are only happening when tasks running on those isolated CPUs submit IO. IO submitted on housekeeping CPUs has no influence on those queues. If the affinity mask contains both housekeeping and isolated CPUs, but none of the contained housekeeping CPUs is online, then the interrupt is also routed to an isolated CPU. Interrupts are only delivered when one of the isolated CPUs in the affinity mask submits IO. If one of the contained housekeeping CPUs comes online, the CPU hotplug logic migrates the interrupt automatically back to the upcoming housekeeping CPU. Depending on the type of interrupt controller, this can require that at least one interrupt is delivered to the isolated CPU in order to complete the migration. [ tglx: Removed unused parameter, added and edited comments/documentation and rephrased the changelog so it contains more details. ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com
2020-01-20sched/fair: Define sched_idle_cpu() only for SMP configurationsViresh Kumar
sched_idle_cpu() isn't used for non SMP configuration and with a recent change, we have started getting following warning: kernel/sched/fair.c:5221:12: warning: ‘sched_idle_cpu’ defined but not used [-Wunused-function] Fix that by defining sched_idle_cpu() only for SMP configurations. Fixes: 323af6deaf70 ("sched/fair: Load balance aggressively for SCHED_IDLE CPUs") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Link: https://lore.kernel.org/r/f0554f590687478b33914a4aff9f0e6a62886d44.1579499907.git.viresh.kumar@linaro.org
2020-01-17sched/topology: Assert non-NUMA topology masks don't (partially) overlapValentin Schneider
topology.c::get_group() relies on the assumption that non-NUMA domains do not partially overlap. Zeng Tao pointed out in [1] that such topology descriptions, while completely bogus, can end up being exposed to the scheduler. In his example (8 CPUs, 2-node system), we end up with: MC span for CPU3 == 3-7 MC span for CPU4 == 4-7 The first pass through get_group(3, sdd@MC) will result in the following sched_group list: 3 -> 4 -> 5 -> 6 -> 7 ^ / `----------------' And a later pass through get_group(4, sdd@MC) will "corrupt" that to: 3 -> 4 -> 5 -> 6 -> 7 ^ / `-----------' which will completely break things like 'while (sg != sd->groups)' when using CPU3's base sched_domain. There already are some architecture-specific checks in place such as x86/kernel/smpboot.c::topology.sane(), but this is something we can detect in the core scheduler, so it seems worthwhile to do so. Warn and abort the construction of the sched domains if such a broken topology description is detected. Note that this is somewhat expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be gated under SCHED_DEBUG if deemed necessary. Testing ======= Dietmar managed to reproduce this using the following qemu incantation: $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \ -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \ cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \ node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1 alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be needed if '-smp cores=X, sockets=Y' would work with qemu): 8<--- @@ -465,6 +465,9 @@ void update_siblings_masks(unsigned int cpuid) if (cpuid_topo->package_id != cpu_topo->package_id) continue; + if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4)) + continue; + cpumask_set_cpu(cpuid, &cpu_topo->core_sibling); cpumask_set_cpu(cpu, &cpuid_topo->core_sibling); 8<--- [1]: https://lkml.kernel.org/r/1577088979-8545-1-git-send-email-prime.zeng@hisilicon.com Reported-by: Zeng Tao <prime.zeng@hisilicon.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
2020-01-17idle: fix spelling mistake "iterrupts" -> "interrupts"Hewenliang
There is a spelling misake in comments of cpuidle_idle_call. Fix it. Signed-off-by: Hewenliang <hewenliang4@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20200110025604.34373-1-hewenliang4@huawei.com
2020-01-17sched/fair: Remove redundant call to cpufreq_update_util()Vincent Guittot
With commit bef69dd87828 ("sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()") update_load_avg() has become the central point for calling cpufreq (not including the update of blocked load). This change helps to simplify further the number of calls to cpufreq_update_util() and to remove last redundant ones. With update_load_avg(), we are now sure that cpufreq_update_util() will be called after every task attachment to a cfs_rq and especially after propagating this event down to the util_avg of the root cfs_rq, which is the level that is used by cpufreq governors like schedutil to set the frequency of a CPU. The SCHED_CPUFREQ_MIGRATION flag forces an early call to cpufreq when the migration happens in a cgroup whereas util_avg of root cfs_rq is not yet updated and this call is duplicated with the one that happens immediately after when the migration event reaches the root cfs_rq. The dedicated flag SCHED_CPUFREQ_MIGRATION is now useless and can be removed. The interface of attach_entity_load_avg() can also be simplified accordingly. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Link: https://lkml.kernel.org/r/1579083620-24943-1-git-send-email-vincent.guittot@linaro.org
2020-01-17sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only ↵Wang Long
when psi enabled when CONFIG_PSI_DEFAULT_DISABLED set to N or the command line set psi=0, I think we should not create /proc/pressure and /proc/pressure/{io|memory|cpu}. In the future, user maybe determine whether the psi feature is enabled by checking the existence of the /proc/pressure dir or /proc/pressure/{io|memory|cpu} files. Signed-off-by: Wang Long <w@laoqinren.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1576672698-32504-1-git-send-email-w@laoqinren.net
2020-01-17sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAPPeng Liu
commit bf475ce0a3dd ("sched/fair: Add per-CPU min capacity to sched_group_capacity") introduced per-cpu min_capacity. commit e3d6d0cb66f2 ("sched/fair: Add sched_group per-CPU max capacity") introduced per-cpu max_capacity. In the SD_OVERLAP case, the local variable 'capacity' represents the sum of CPU capacity of all CPUs in the first sched group (sg) of the sched domain (sd). It is erroneously used to calculate sg's min and max CPU capacity. To fix this use capacity_of(cpu) instead of 'capacity'. The code which achieves this via cpu_rq(cpu)->sd->groups->sgc->capacity (for rq->sd != NULL) can be removed since it delivers the same value as capacity_of(cpu) which is currently only used for the (!rq->sd) case (see update_cpu_capacity()). An sg of the lowest sd (rq->sd or sd->child == NULL) represents a single CPU (and hence sg->sgc->capacity == capacity_of(cpu)). Signed-off-by: Peng Liu <iwtbavbm@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20200104130828.GA7718@iZj6chx1xj0e0buvshuecpZ
2020-01-17sched/fair: calculate delta runnable load only when it's neededPeng Wang
Move the code of calculation for delta_sum/delta_avg to where it is really needed to be done. Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20200103114400.17668-1-rocking@linux.alibaba.com
2020-01-17sched/cputime: move rq parameter in irqtime_account_process_tickAlex Shi
Every time we call irqtime_account_process_tick() is in a interrupt, Every caller will get and assign a parameter rq = this_rq(), This is unnecessary and increase the code size a little bit. Move the rq getting action to irqtime_account_process_tick internally is better. base with this patch cputime.o 578792 bytes 577888 bytes Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1577959674-255537-1-git-send-email-alex.shi@linux.alibaba.com
2020-01-17sched/debug: Reset watchdog on all CPUs while processing sysrq-tWei Li
Lengthy output of sysrq-t may take a lot of time on slow serial console with lots of processes and CPUs. So we need to reset NMI-watchdog to avoid spurious lockup messages, and we also reset softlockup watchdogs on all other CPUs since another CPU might be blocked waiting for us to process an IPI or stop_machine. Add to sysrq_sched_debug_show() as what we did in show_state_filter(). Signed-off-by: Wei Li <liwei391@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com
2020-01-17sched/core: Fix size of rq::uclamp initializationLi Guanglei
rq::uclamp is an array of struct uclamp_rq, make sure we clear the whole thing. Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcountinga") Signed-off-by: Li Guanglei <guanglei.li@unisoc.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Qais Yousef <qais.yousef@arm.com> Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
2020-01-17sched/uclamp: Fix a bug in propagating uclamp value in new cgroupsQais Yousef
When a new cgroup is created, the effective uclamp value wasn't updated with a call to cpu_util_update_eff() that looks at the hierarchy and update to the most restrictive values. Fix it by ensuring to call cpu_util_update_eff() when a new cgroup becomes online. Without this change, the newly created cgroup uses the default root_task_group uclamp values, which is 1024 for both uclamp_{min, max}, which will cause the rq to to be clamped to max, hence cause the system to run at max frequency. The problem was observed on Ubuntu server and was reproduced on Debian and Buildroot rootfs. By default, Ubuntu and Debian create a cpu controller cgroup hierarchy and add all tasks to it - which creates enough noise to keep the rq uclamp value at max most of the time. Imitating this behavior makes the problem visible in Buildroot too which otherwise looks fine since it's a minimal userspace. Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps") Reported-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Doug Smythies <dsmythies@telus.net> Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
2020-01-17sched/fair: Load balance aggressively for SCHED_IDLE CPUsViresh Kumar
The fair scheduler performs periodic load balance on every CPU to check if it can pull some tasks from other busy CPUs. The duration of this periodic load balance is set to sd->balance_interval for the idle CPUs and is calculated by multiplying the sd->balance_interval with the sd->busy_factor (set to 32 by default) for the busy CPUs. The multiplication is done for busy CPUs to avoid doing load balance too often and rather spend more time executing actual task. While that is the right thing to do for the CPUs busy with SCHED_OTHER or SCHED_BATCH tasks, it may not be the optimal thing for CPUs running only SCHED_IDLE tasks. With the recent enhancements in the fair scheduler around SCHED_IDLE CPUs, we now prefer to enqueue a newly-woken task to a SCHED_IDLE CPU instead of other busy or idle CPUs. The same reasoning should be applied to the load balancer as well to make it migrate tasks more aggressively to a SCHED_IDLE CPU, as that will reduce the scheduling latency of the migrated (SCHED_OTHER) tasks. This patch makes minimal changes to the fair scheduler to do the next load balance soon after the last non SCHED_IDLE task is dequeued from a runqueue, i.e. making the CPU SCHED_IDLE. Also the sd->busy_factor is ignored while calculating the balance_interval for such CPUs. This is done to avoid delaying the periodic load balance by few hundred milliseconds for SCHED_IDLE CPUs. This is tested on ARM64 Hikey620 platform (octa-core) with the help of rt-app and it is verified, using kernel traces, that the newly SCHED_IDLE CPU does load balancing shortly after it becomes SCHED_IDLE and pulls tasks from other busy CPUs. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/e485827eb8fe7db0943d6f3f6e0f5a4a70272781.1578471925.git.viresh.kumar@linaro.org
2020-01-17sched/fair : Improve update_sd_pick_busiest for spare capacity caseVincent Guittot
Similarly to calculate_imbalance() and find_busiest_group(), using the number of idle CPUs when there is only 1 CPU in the group is not efficient because we can't make a difference between a CPU running 1 task and a CPU running dozens of small tasks competing for the same CPU but not enough to overload it. More generally speaking, we should use the number of running tasks when there is the same number of idle CPUs in a group instead of blindly select the 1st one. When the groups have spare capacity and the same number of idle CPUs, we compare the number of running tasks to select the busiest group. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1576839893-26930-1-git-send-email-vincent.guittot@linaro.org
2019-12-25sched/rt: Make RT capacity-awareQais Yousef
Capacity Awareness refers to the fact that on heterogeneous systems (like Arm big.LITTLE), the capacity of the CPUs is not uniform, hence when placing tasks we need to be aware of this difference of CPU capacities. In such scenarios we want to ensure that the selected CPU has enough capacity to meet the requirement of the running task. Enough capacity means here that capacity_orig_of(cpu) >= task.requirement. The definition of task.requirement is dependent on the scheduling class. For CFS, utilization is used to select a CPU that has >= capacity value than the cfs_task.util. capacity_orig_of(cpu) >= cfs_task.util DL isn't capacity aware at the moment but can make use of the bandwidth reservation to implement that in a similar manner CFS uses utilization. The following patchset implements that: https://lore.kernel.org/lkml/20190506044836.2914-1-luca.abeni@santannapisa.it/ capacity_orig_of(cpu)/SCHED_CAPACITY >= dl_deadline/dl_runtime For RT we don't have a per task utilization signal and we lack any information in general about what performance requirement the RT task needs. But with the introduction of uclamp, RT tasks can now control that by setting uclamp_min to guarantee a minimum performance point. ATM the uclamp value are only used for frequency selection; but on heterogeneous systems this is not enough and we need to ensure that the capacity of the CPU is >= uclamp_min. Which is what implemented here. capacity_orig_of(cpu) >= rt_task.uclamp_min Note that by default uclamp.min is 1024, which means that RT tasks will always be biased towards the big CPUs, which make for a better more predictable behavior for the default case. Must stress that the bias acts as a hint rather than a definite placement strategy. For example, if all big cores are busy executing other RT tasks we can't guarantee that a new RT task will be placed there. On non-heterogeneous systems the original behavior of RT should be retained. Similarly if uclamp is not selected in the config. [ mingo: Minor edits to comments. ] Signed-off-by: Qais Yousef <qais.yousef@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191009104611.15363-1-qais.yousef@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25sched/fair: Make EAS wakeup placement consider uclamp restrictionsValentin Schneider
task_fits_capacity() has just been made uclamp-aware, and find_energy_efficient_cpu() needs to go through the same treatment. Things are somewhat different here however - using the task max clamp isn't sufficient. Consider the following setup: The target runqueue, rq: rq.cpu_capacity_orig = 512 rq.cfs.avg.util_avg = 200 rq.uclamp.max = 768 // the max p.uclamp.max of all enqueued p's is 768 The waking task, p (not yet enqueued on rq): p.util_est = 600 p.uclamp.max = 100 Now, consider the following code which doesn't use the rq clamps: util = uclamp_task_util(p); // Does the task fit in the spare CPU capacity? cpu = cpu_of(rq); fits_capacity(util, cpu_capacity(cpu) - cpu_util(cpu)) This would lead to: util = 100; fits_capacity(100, 512 - 200) fits_capacity() would return true. However, enqueuing p on that CPU *will* cause it to become overutilized since rq clamp values are max-aggregated, so we'd remain with rq.uclamp.max = 768 which comes from the other tasks already enqueued on rq. Thus, we could select a high enough frequency to reach beyond 0.8 * 512 utilization (== overutilized) after enqueuing p on rq. What find_energy_efficient_cpu() needs here is uclamp_rq_util_with() which lets us peek at the future utilization landscape, including rq-wide uclamp values. Make find_energy_efficient_cpu() use uclamp_rq_util_with() for its fits_capacity() check. This is in line with what compute_energy() ends up using for estimating utilization. Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com> Suggested-by: Quentin Perret <qperret@google.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191211113851.24241-6-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25sched/fair: Make task_fits_capacity() consider uclamp restrictionsValentin Schneider
task_fits_capacity() drives CPU selection at wakeup time, and is also used to detect misfit tasks. Right now it does so by comparing task_util_est() with a CPU's capacity, but doesn't take into account uclamp restrictions. There's a few interesting uses that can come out of doing this. For instance, a low uclamp.max value could prevent certain tasks from being flagged as misfit tasks, so they could merrily remain on low-capacity CPUs. Similarly, a high uclamp.min value would steer tasks towards high capacity CPUs at wakeup (and, should that fail, later steered via misfit balancing), so such "boosted" tasks would favor CPUs of higher capacity. Introduce uclamp_task_util() and make task_fits_capacity() use it. Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Quentin Perret <qperret@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191211113851.24241-5-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()Valentin Schneider
The current helper returns (CPU) rq utilization with uclamp restrictions taken into account. A uclamp task utilization helper would be quite helpful, but this requires some renaming. Prepare the code for the introduction of a uclamp_task_util() by renaming the existing uclamp_util_with() to uclamp_rq_util_with(). Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Quentin Perret <qperret@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191211113851.24241-4-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25sched/uclamp: Make uclamp util helpers use and return UL valuesValentin Schneider
Vincent pointed out recently that the canonical type for utilization values is 'unsigned long'. Internally uclamp uses 'unsigned int' values for cache optimization, but this doesn't have to be exported to its users. Make the uclamp helpers that deal with utilization use and return unsigned long values. Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Quentin Perret <qperret@google.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191211113851.24241-3-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25sched/uclamp: Remove uclamp_util()Valentin Schneider
The sole user of uclamp_util(), schedutil_cpu_util(), was made to use uclamp_util_with() instead in commit: af24bde8df20 ("sched/uclamp: Add uclamp support to energy_compute()") From then on, uclamp_util() has remained unused. Being a simple wrapper around uclamp_util_with(), we can get rid of it and win back a few lines. Tested-By: Dietmar Eggemann <dietmar.eggemann@arm.com> Suggested-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191211113851.24241-2-valentin.schneider@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25sched/fair: Make sched-idle CPU selection consistent throughoutViresh Kumar
There are instances where we keep searching for an idle CPU despite already having a sched-idle CPU (in find_idlest_group_cpu(), select_idle_smt() and select_idle_cpu() and then there are places where we don't necessarily do that and return a sched-idle CPU as soon as we find one (in select_idle_sibling()). This looks a bit inconsistent and it may be worth having the same policy everywhere. On the other hand, choosing a sched-idle CPU over a idle one shall be beneficial from performance and power point of view as well, as we don't need to get the CPU online from a deep idle state which wastes quite a lot of time and energy and delays the scheduling of the newly woken up task. This patch tries to simplify code around sched-idle CPU selection and make it consistent throughout. Testing is done with the help of rt-app on hikey board (ARM64 octa-core, 2 clusters, 0-3 and 4-7). The cpufreq governor was set to performance to avoid any side affects from CPU frequency. Following are the tests performed: Test 1: 1-cfs-task: A single SCHED_NORMAL task is pinned to CPU5 which runs for 2333 us out of 7777 us (so gives time for the cluster to go in deep idle state). Test 2: 1-cfs-1-idle-task: A single SCHED_NORMAL task is pinned on CPU5 and single SCHED_IDLE task is pinned on CPU6 (to make sure cluster 1 doesn't go in deep idle state). Test 3: 1-cfs-8-idle-task: A single SCHED_NORMAL task is pinned on CPU5 and eight SCHED_IDLE tasks are created which run forever (not pinned anywhere, so they run on all CPUs). Checked with kernelshark that as soon as NORMAL task sleeps, the SCHED_IDLE task starts running on CPU5. And here are the results on mean latency (in us), using the "st" tool. $ st 1-cfs-task/rt-app-cfs_thread-0.log N min max sum mean stddev 642 90 592 197180 307.134 109.906 $ st 1-cfs-1-idle-task/rt-app-cfs_thread-0.log N min max sum mean stddev 642 67 311 113850 177.336 41.4251 $ st 1-cfs-8-idle-task/rt-app-cfs_thread-0.log N min max sum mean stddev 643 29 173 41364 64.3297 13.2344 The mean latency when we need to: - wakeup from deep idle state is 307 us. - wakeup from shallow idle state is 177 us. - preempt a SCHED_IDLE task is 64 us. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/b90cbcce608cef4e02a7bbfe178335f76d201bab.1573728344.git.viresh.kumar@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25sched/core: Remove unused variable from set_user_nice()Qian Cai
This commit left behind an unused variable: 5443a0be6121 ("sched: Use fair:prio_changed() instead of ad-hoc implementation") left behind an unused variable. kernel/sched/core.c: In function 'set_user_nice': kernel/sched/core.c:4507:16: warning: variable 'delta' set but not used int old_prio, delta; ^~~~~ Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 5443a0be6121 ("sched: Use fair:prio_changed() instead of ad-hoc implementation") Link: https://lkml.kernel.org/r/20191219140314.1252-1-cai@lca.pw Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-25Merge tag 'v5.5-rc3' into sched/core, to pick up fixesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-21Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes: a (rare) PSI crash fix, a CPU affinity related balancing fix, and a toning down of active migration attempts" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/cfs: fix spurious active migration sched/fair: Fix find_idlest_group() to handle CPU affinity psi: Fix a division error in psi poll() sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptime
2019-12-17schied/fair: Skip calculating @contrib without loadPeng Wang
Because of the: if (!load) runnable = running = 0; clause in ___update_load_sum(), all the actual users of @contrib in accumulate_sum(): if (load) sa->load_sum += load * contrib; if (runnable) sa->runnable_load_sum += runnable * contrib; if (running) sa->util_sum += contrib << SCHED_CAPACITY_SHIFT; don't happen, and therefore we don't care what @contrib actually is and calculating it is pointless. If we count the times when @load equals zero and not as below: if (load) { load_is_not_zero_count++; contrib = __accumulate_pelt_segments(periods, 1024 - sa->period_contrib,delta); } else load_is_zero_count++; As we can see, load_is_zero_count is much bigger than load_is_zero_count, and the gap is gradually widening: load_is_zero_count: 6016044 times load_is_not_zero_count: 244316 times 19:50:43 up 1 min, 1 user, load average: 0.09, 0.06, 0.02 load_is_zero_count: 7956168 times load_is_not_zero_count: 261472 times 19:51:42 up 2 min, 1 user, load average: 0.03, 0.05, 0.01 load_is_zero_count: 10199896 times load_is_not_zero_count: 278364 times 19:52:51 up 3 min, 1 user, load average: 0.06, 0.05, 0.01 load_is_zero_count: 14333700 times load_is_not_zero_count: 318424 times 19:54:53 up 5 min, 1 user, load average: 0.01, 0.03, 0.00 Perhaps we can gain some performance advantage by saving these unnecessary calculation. Signed-off-by: Peng Wang <rocking@linux.alibaba.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot < vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/1576208740-35609-1-git-send-email-rocking@linux.alibaba.com
2019-12-17sched/fair: Optimize select_idle_cpuCheng Jian
select_idle_cpu() will scan the LLC domain for idle CPUs, it's always expensive. so the next commit : 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()") introduces a way to limit how many CPUs we scan. But it consume some CPUs out of 'nr' that are not allowed for the task and thus waste our attempts. The function always return nr_cpumask_bits, and we can't find a CPU which our task is allowed to run. Cpumask may be too big, similar to select_idle_core(), use per_cpu_ptr 'select_idle_mask' to prevent stack overflow. Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()") Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
2019-12-17sched/wait: fix ___wait_var_event(exclusive)Oleg Nesterov
init_wait_var_entry() forgets to initialize wq_entry->flags. Currently not a problem, we don't have wait_var_event_exclusive(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Felipe Balbi <balbi@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Miklos Szeredi <miklos@szeredi.hu> Cc: Juri Lelli <juri.lelli@redhat.com> Link: https://lkml.kernel.org/r/20191210191902.GB14449@redhat.com
2019-12-17sched: Use fair:prio_changed() instead of ad-hoc implementationFrederic Weisbecker
set_user_nice() implements its own version of fair::prio_changed() and therefore misses a specific optimization towards nohz_full CPUs that avoid sending an resched IPI to a reniced task running alone. Use the proper callback instead. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20191203160106.18806-3-frederic@kernel.org
2019-12-17sched: Spare resched IPI when prio changes on a single fair taskFrederic Weisbecker
The runqueue of a fair task being remotely reniced is going to get a resched IPI in order to reassess which task should be the current running on the CPU. However that evaluation is useless if the fair task is running alone, in which case we can spare that IPI, preventing nohz_full CPUs from being disturbed. Reported-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20191203160106.18806-2-frederic@kernel.org
2019-12-17sched/cfs: fix spurious active migrationVincent Guittot
The load balance can fail to find a suitable task during the periodic check because the imbalance is smaller than half of the load of the waiting tasks. This results in the increase of the number of failed load balance, which can end up to start an active migration. This active migration is useless because the current running task is not a better choice than the waiting ones. In fact, the current task was probably not running but waiting for the CPU during one of the previous attempts and it had already not been selected. When load balance fails too many times to migrate a task, we should relax the contraint on the maximum load of the tasks that can be migrated similarly to what is done with cache hotness. Before the rework, load balance used to set the imbalance to the average load_per_task in order to mitigate such situation. This increased the likelihood of migrating a task but also of selecting a larger task than needed while more appropriate ones were in the list. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/1575036287-6052-1-git-send-email-vincent.guittot@linaro.org
2019-12-17sched/fair: Fix find_idlest_group() to handle CPU affinityVincent Guittot
Because of CPU affinity, the local group can be skipped which breaks the assumption that statistics are always collected for local group. With uninitialized local_sgs, the comparison is meaningless and the behavior unpredictable. This can even end up to use local pointer which is to NULL in this case. If the local group has been skipped because of CPU affinity, we return the idlest group. Fixes: 57abff067a08 ("sched/fair: Rework find_idlest_group()") Reported-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Valentin Schneider <valentin.schneider@arm.com> Tested-by: John Stultz <john.stultz@linaro.org> Cc: rostedt@goodmis.org Cc: valentin.schneider@arm.com Cc: mingo@redhat.com Cc: mgorman@suse.de Cc: juri.lelli@redhat.com Cc: dietmar.eggemann@arm.com Cc: bsegall@google.com Cc: qais.yousef@arm.com Link: https://lkml.kernel.org/r/1575483700-22153-1-git-send-email-vincent.guittot@linaro.org
2019-12-17psi: Fix a division error in psi poll()Johannes Weiner
The psi window size is a u64 an can be up to 10 seconds right now, which exceeds the lower 32 bits of the variable. We currently use div_u64 for it, which is meant only for 32-bit divisors. The result is garbage pressure sampling values and even potential div0 crashes. Use div64_u64. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Jingfeng Xie <xiejingfeng@linux.alibaba.com> Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
2019-12-17sched/psi: Fix sampling error and rare div0 crashes with cgroups and high uptimeJohannes Weiner
Jingfeng reports rare div0 crashes in psi on systems with some uptime: [58914.066423] divide error: 0000 [#1] SMP [58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O) [58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1 [58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019 [58914.102728] Workqueue: events psi_update_work [58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000 [58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330 [58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246 [58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640 [58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e [58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000 [58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000 [58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78 [58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000 [58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0 [58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [58914.200360] PKRU: 55555554 [58914.203221] Stack: [58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8 [58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000 [58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [58914.228734] Call Trace: [58914.231337] [] psi_update_work+0x22/0x60 [58914.237067] [] process_one_work+0x189/0x420 [58914.243063] [] worker_thread+0x4e/0x4b0 [58914.248701] [] ? process_one_work+0x420/0x420 [58914.254869] [] kthread+0xe6/0x100 [58914.259994] [] ? kthread_park+0x60/0x60 [58914.265640] [] ret_from_fork+0x39/0x50 [58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1 [58914.279691] RIP [] psi_update_stats+0x1c1/0x330 The crashing instruction is trying to divide the observed stall time by the sampling period. The period, stored in R8, is not 0, but we are dividing by the lower 32 bits only, which are all 0 in this instance. We could switch to a 64-bit division, but the period shouldn't be that big in the first place. It's the time between the last update and the next scheduled one, and so should always be around 2s and comfortably fit into 32 bits. The bug is in the initialization of new cgroups: we schedule the first sampling event in a cgroup as an offset of sched_clock(), but fail to initialize the last_update timestamp, and it defaults to 0. That results in a bogusly large sampling period the first time we run the sampling code, and consequently we underreport pressure for the first 2s of a cgroup's life. But worse, if sched_clock() is sufficiently advanced on the system, and the user gets unlucky, the period's lower 32 bits can all be 0 and the sampling division will crash. Fix this by initializing the last update timestamp to the creation time of the cgroup, thus correctly marking the start of the first pressure sampling period in a new cgroup. Reported-by: Jingfeng Xie <xiejingfeng@linux.alibaba.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
2019-12-12cpufreq: Avoid leaving stale IRQ work items during CPU offlineRafael J. Wysocki
The scheduler code calling cpufreq_update_util() may run during CPU offline on the target CPU after the IRQ work lists have been flushed for it, so the target CPU should be prevented from running code that may queue up an IRQ work item on it at that point. Unfortunately, that may not be the case if dvfs_possible_from_any_cpu is set for at least one cpufreq policy in the system, because that allows the CPU going offline to run the utilization update callback of the cpufreq governor on behalf of another (online) CPU in some cases. If that happens, the cpufreq governor callback may queue up an IRQ work on the CPU running it, which is going offline, and the IRQ work may not be flushed after that point. Moreover, that IRQ work cannot be flushed until the "offlining" CPU goes back online, so if any other CPU calls irq_work_sync() to wait for the completion of that IRQ work, it will have to wait until the "offlining" CPU is back online and that may not happen forever. In particular, a system-wide deadlock may occur during CPU online as a result of that. The failing scenario is as follows. CPU0 is the boot CPU, so it creates a cpufreq policy and becomes the "leader" of it (policy->cpu). It cannot go offline, because it is the boot CPU. Next, other CPUs join the cpufreq policy as they go online and they leave it when they go offline. The last CPU to go offline, say CPU3, may queue up an IRQ work while running the governor callback on behalf of CPU0 after leaving the cpufreq policy because of the dvfs_possible_from_any_cpu effect described above. Then, CPU0 is the only online CPU in the system and the stale IRQ work is still queued on CPU3. When, say, CPU1 goes back online, it will run irq_work_sync() to wait for that IRQ work to complete and so it will wait for CPU3 to go back online (which may never happen even in principle), but (worse yet) CPU0 is waiting for CPU1 at that point too and a system-wide deadlock occurs. To address this problem notice that CPUs which cannot run cpufreq utilization update code for themselves (for example, because they have left the cpufreq policies that they belonged to), should also be prevented from running that code on behalf of the other CPUs that belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so in that case the cpufreq_update_util_data pointer of the CPU running the code must not be NULL as well as for the CPU which is the target of the cpufreq utilization update in progress. Accordingly, change cpufreq_this_cpu_can_update() into a regular function in kernel/sched/cpufreq.c (instead of a static inline in a header file) and make it check the cpufreq_update_util_data pointer of the local CPU if dvfs_possible_from_any_cpu is set for the target cpufreq policy. Also update the schedutil governor to do the cpufreq_this_cpu_can_update() check in the non-fast-switch case too to avoid the stale IRQ work issues. Fixes: 99d14d0e16fa ("cpufreq: Process remote callbacks from any CPU if the platform permits") Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/ Reported-by: Anson Huang <anson.huang@nxp.com> Tested-by: Anson Huang <anson.huang@nxp.com> Cc: 4.14+ <stable@vger.kernel.org> # 4.14+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Peng Fan <peng.fan@nxp.com> (i.MX8QXP-MEK) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-12-08Merge branch 'linus' into sched/urgent, to pick up the latest before merging ↵Ingo Molnar
new patches Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-12-05Merge branch 'thermal/next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux Pull thermal management updates from Zhang Rui: - Fix a deadlock regression in thermal core framework, which was introduced in 5.3 (Wei Wang) - Initialize thermal control framework earlier to enable thermal mitigation during boot (Amit Kucheria) - Convert the Intelligent Power Allocator (IPA) thermal governor to follow the generic PM_EM instead of its own Energy Model (Quentin Perret) - Introduce a new Amlogic soc thermal driver (Guillaume La Roque) - Add interrupt support for tsens thermal driver (Amit Kucheria) - Add support for MSM8956/8976 in tsens thermal driver (AngeloGioacchino Del Regno) - Add support for r8a774b1 in rcar thermal driver (Biju Das) - Add support for Thermal Monitor Unit v2 in qoriq thermal driver (Yuantian Tang) - Some other fixes/cleanups on thermal core framework and soc thermal drivers (Colin Ian King, Daniel Lezcano, Hsin-Yi Wang, Tian Tao) * 'thermal/next' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: (32 commits) thermal: Fix deadlock in thermal thermal_zone_device_check thermal: cpu_cooling: Migrate to using the EM framework thermal: cpu_cooling: Make the power-related code depend on IPA PM / EM: Declare EM data types unconditionally arm64: defconfig: Enable CONFIG_ENERGY_MODEL drivers: thermal: tsens: fix potential integer overflow on multiply thermal: cpu_cooling: Reorder the header file thermal: cpu_cooling: Remove pointless dependency on CONFIG_OF thermal: no need to set .owner when using module_platform_driver thermal: qcom: tsens-v1: Fix kfree of a non-pointer value cpufreq: qcom-hw: Move driver initialization earlier clk: qcom: Initialize clock drivers earlier cpufreq: Initialize cpufreq-dt driver earlier cpufreq: Initialize the governors in core_initcall thermal: Initialize thermal subsystem earlier thermal: Remove netlink support dt: thermal: tsens: Document compatible for MSM8976/56 thermal: qcom: tsens-v1: Add support for MSM8956 and MSM8976 MAINTAINERS: add entry for Amlogic Thermal driver thermal: amlogic: Add thermal driver to support G12 SoCs ...
2019-11-30Merge tag 'notifications-pipe-prep-20191115' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs Pull pipe rework from David Howells: "This is my set of preparatory patches for building a general notification queue on top of pipes. It makes a number of significant changes: - It removes the nr_exclusive argument from __wake_up_sync_key() as this is always 1. This prepares for the next step: - Adds wake_up_interruptible_sync_poll_locked() so that poll can be woken up from a function that's holding the poll waitqueue spinlock. - Change the pipe buffer ring to be managed in terms of unbounded head and tail indices rather than bounded index and length. This means that reading the pipe only needs to modify one index, not two. - A selection of helper functions are provided to query the state of the pipe buffer, plus a couple to apply updates to the pipe indices. - The pipe ring is allowed to have kernel-reserved slots. This allows many notification messages to be spliced in by the kernel without allowing userspace to pin too many pages if it writes to the same pipe. - Advance the head and tail indices inside the pipe waitqueue lock and use wake_up_interruptible_sync_poll_locked() to poke poll without having to take the lock twice. - Rearrange pipe_write() to preallocate the buffer it is going to write into and then drop the spinlock. This allows kernel notifications to then be added the ring whilst it is filling the buffer it allocated. The read side is stalled because the pipe mutex is still held. - Don't wake up readers on a pipe if there was already data in it when we added more. - Don't wake up writers on a pipe if the ring wasn't full before we removed a buffer" * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: pipe: Remove sync on wake_ups pipe: Increase the writer-wakeup threshold to reduce context-switch count pipe: Check for ring full inside of the spinlock in pipe_write() pipe: Remove redundant wakeup from pipe_write() pipe: Rearrange sequence in pipe_write() to preallocate slot pipe: Conditionalise wakeup in pipe_read() pipe: Advance tail pointer inside of wait spinlock in pipe_read() pipe: Allow pipes to have kernel-reserved slots pipe: Use head and tail pointers for the ring, not cursor and length Add wake_up_interruptible_sync_poll_locked() Remove the nr_exclusive argument from __wake_up_sync_key() pipe: Reduce #inclusion of pipe_fs_i.h
2019-11-29sched/clock: Use static_branch_likely() with sched_clock_runningZhenzhong Duan
sched_clock_running is enabled early at bootup stage and never disabled. So hint that to the compiler by using static_branch_likely() rather than static_branch_unlikely(). The branch probability mis-annotation was introduced in the original commit that converted the plain sched_clock_running flag to a static key: 46457ea464f5 ("sched/clock: Use static key for sched_clock_running") Steve further notes: | Looks like the confusion was the moving of the "!": | | - if (unlikely(!sched_clock_running)) | + if (!static_branch_unlikely(&sched_clock_running)) | | Where, it was unlikely that !sched_clock_running would be true, but | because the "!" was moved outside the "unlikely()" it makes the test | "likely()". That is, if we added an intermediate step, it would have | been: | | if (!likely(sched_clock_running)) | | which would have prevented the mistake that this patch fixes. [ mingo: Edited the changelog. ] Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: juri.lelli@redhat.com Cc: mgorman@suse.de Cc: vincent.guittot@linaro.org Link: https://lkml.kernel.org/r/1574843848-26825-1-git-send-email-zhenzhong.duan@oracle.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-26Merge tag 'pm-5.5-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: "These include cpuidle changes to use nanoseconds (instead of microseconds) as the unit of time and to simplify checks for disabled idle states in the idle loop, some cpuidle fixes and governor updates, assorted cpufreq updates (driver updates mostly and a few core fixes and cleanups), devfreq updates (dominated by the tegra30 driver changes), new CPU IDs for the RAPL power capping driver, relatively minor updates of the generic power domains (genpd) and operation performance points (OPP) frameworks, and assorted fixes and cleanups. There are also two maintainer information updates: Chanwoo Choi will be maintaining the devfreq subsystem going forward and Todd Brandt is going to maintain the pm-graph utility (created by him). Specifics: - Use nanoseconds (instead of microseconds) as the unit of time in the cpuidle core and simplify checks for disabled idle states in the idle loop (Rafael Wysocki) - Fix and clean up the teo cpuidle governor (Rafael Wysocki) - Fix the cpuidle registration error code path (Zhenzhong Duan) - Avoid excessive vmexits in the ACPI cpuidle driver (Yin Fengwei) - Extend the idle injection infrastructure to be able to measure the requested duration in nanoseconds and to allow an exit latency limit for idle states to be specified (Daniel Lezcano) - Fix cpufreq driver registration and clarify a comment in the cpufreq core (Viresh Kumar) - Add NULL checks to the show() and store() methods of sysfs attributes exposed by cpufreq (Kai Shen) - Update cpufreq drivers: * Fix for a plain int as pointer warning from sparse in intel_pstate (Jamal Shareef) * Fix for a hardcoded number of CPUs and stack bloat in the powernv driver (John Hubbard) * Updates to the ti-cpufreq driver and DT files to support new platforms and migrate bindings from opp-v1 to opp-v2 (Adam Ford, H. Nikolaus Schaller) * Merging of the arm_big_little and vexpress-spc drivers and related cleanup (Sudeep Holla) * Fix for imx's default speed grade value (Anson Huang) * Minor cleanup of the s3c64xx driver (Nathan Chancellor) * CPU speed bin detection fix for sun50i (Ondrej Jirman) - Appoint Chanwoo Choi as the new devfreq maintainer. - Update the devfreq core: * Check NULL governor in available_governors_show sysfs to prevent showing wrong governor information and fix a race condition between devfreq_update_status() and trans_stat_show() (Leonard Crestez) * Add new 'interrupt-driven' flag for devfreq governors to allow interrupt-driven governors to prevent the devfreq core from polling devices for status (Dmitry Osipenko) * Improve an error message in devfreq_add_device() (Matthias Kaehlcke) - Update devfreq drivers: * tegra30 driver fixes and cleanups (Dmitry Osipenko) * Removal of unused property from dt-binding documentation for the exynos-bus driver (Kamil Konieczny) * exynos-ppmu cleanup and DT bindings update (Lukasz Luba, Marek Szyprowski) - Add new CPU IDs for CometLake Mobile and Desktop to the Intel RAPL power capping driver (Zhang Rui) - Allow device initialization in the generic power domains (genpd) framework to be more straightforward and clean it up (Ulf Hansson) - Add support for adjusting OPP voltages at run time to the OPP framework (Stephen Boyd) - Avoid freeing memory that has never been allocated in the hibernation core (Andy Whitcroft) - Clean up function headers in a header file and coding style in the wakeup IRQs handling code (Ulf Hansson, Xiaofei Tan) - Clean up the SmartReflex adaptive voltage scaling (AVS) driver for ARM (Ben Dooks, Geert Uytterhoeven) - Wrap power management documentation to fit in 80 columns (Bjorn Helgaas) - Add pm-graph utility entry to MAINTAINERS (Todd Brandt) - Update the cpupower utility: * Fix the handling of set and info subcommands (Abhishek Goel) * Fix build warnings (Nathan Chancellor) * Improve mperf_monitor handling (Janakarajan Natarajan)" * tag 'pm-5.5-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (83 commits) PM: Wrap documentation to fit in 80 columns cpuidle: Pass exit latency limit to cpuidle_use_deepest_state() cpuidle: Allow idle injection to apply exit latency limit cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks cpuidle: teo: Avoid code duplication in conditionals cpufreq: Register drivers only after CPU devices have been registered cpuidle: teo: Avoid using "early hits" incorrectly cpuidle: teo: Exclude cpuidle overhead from computations PM / Domains: Convert to dev_to_genpd_safe() in genpd_syscore_switch() mmc: tmio: Avoid boilerplate code in ->runtime_suspend() PM / Domains: Implement the ->start() callback for genpd PM / Domains: Introduce dev_pm_domain_start() ARM: OMAP2+: SmartReflex: add omap_sr_pdata definition PM / wakeirq: remove unnecessary parentheses power: avs: smartreflex: Remove superfluous cast in debugfs_create_file() call cpuidle: Use nanoseconds as the unit of time PM / OPP: Support adjusting OPP voltages at runtime PM / core: Clean up some function headers in power.h cpufreq: Add NULL checks to show() and store() methods of cpufreq cpufreq: intel_pstate: Fix plain int as pointer warning from sparse ...
2019-11-26Merge branch 'locking-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull locking updates from Ingo Molnar: "The main changes in this cycle were: - A comprehensive rewrite of the robust/PI futex code's exit handling to fix various exit races. (Thomas Gleixner et al) - Rework the generic REFCOUNT_FULL implementation using atomic_fetch_* operations so that the performance impact of the cmpxchg() loops is mitigated for common refcount operations. With these performance improvements the generic implementation of refcount_t should be good enough for everybody - and this got confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and REFCOUNT_FULL entirely, leaving the generic implementation enabled unconditionally. (Will Deacon) - Other misc changes, fixes, cleanups" * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) lkdtm: Remove references to CONFIG_REFCOUNT_FULL locking/refcount: Remove unused 'refcount_error_report()' function locking/refcount: Consolidate implementations of refcount_t locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions locking/refcount: Move saturation warnings out of line locking/refcount: Improve performance of generic REFCOUNT_FULL code locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header locking/refcount: Remove unused refcount_*_checked() variants locking/refcount: Ensure integer operands are treated as signed locking/refcount: Define constants for saturation and max refcount values futex: Prevent exit livelock futex: Provide distinct return value when owner is exiting futex: Add mutex around futex exit futex: Provide state handling for exec() as well futex: Sanitize exit state handling futex: Mark the begin of futex exit explicitly futex: Set task::futex_state to DEAD right after handling futex exit futex: Split futex_mm_release() for exit/exec exit/exec: Seperate mm_release() futex: Replace PF_EXITPIDONE with a state ...
2019-11-26Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "The biggest changes in this cycle were: - Make kcpustat vtime aware (Frederic Weisbecker) - Rework the CFS load_balance() logic (Vincent Guittot) - Misc cleanups, smaller enhancements, fixes. The load-balancing rework is the most intrusive change: it replaces the old heuristics that have become less meaningful after the introduction of the PELT metrics, with a grounds-up load-balancing algorithm. As such it's not really an iterative series, but replaces the old load-balancing logic with the new one. We hope there are no performance regressions left - but statistically it's highly probable that there *is* going to be some workload that is hurting from these chnages. If so then we'd prefer to have a look at that workload and fix its scheduling, instead of reverting the changes" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits) rackmeter: Use vtime aware kcpustat accessor leds: Use all-in-one vtime aware kcpustat accessor cpufreq: Use vtime aware kcpustat accessors for user time procfs: Use all-in-one vtime aware kcpustat accessor sched/vtime: Bring up complete kcpustat accessor sched/cputime: Support other fields on kcpustat_field() sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util() sched/fair: Add comments for group_type and balancing at SD_NUMA level sched/fair: Fix rework of find_idlest_group() sched/uclamp: Fix overzealous type replacement sched/Kconfig: Fix spelling mistake in user-visible help text sched/core: Further clarify sched_class::set_next_task() sched/fair: Use mul_u32_u32() sched/core: Simplify sched_class::pick_next_task() sched/core: Optimize pick_next_task() sched/core: Make pick_next_task_idle() more consistent sched/fair: Better document newidle_balance() leds: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM cpufreq: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM procfs: Use vtime aware kcpustat accessor to fetch CPUTIME_SYSTEM ...
2019-11-26Merge branch 'pm-cpuidle'Rafael J. Wysocki
* pm-cpuidle: cpuidle: Pass exit latency limit to cpuidle_use_deepest_state() cpuidle: Allow idle injection to apply exit latency limit cpuidle: Introduce cpuidle_driver_state_disabled() for driver quirks cpuidle: teo: Avoid code duplication in conditionals cpuidle: teo: Avoid using "early hits" incorrectly cpuidle: teo: Exclude cpuidle overhead from computations cpuidle: Use nanoseconds as the unit of time cpuidle: Consolidate disabled state checks ACPI: processor_idle: Skip dummy wait if kernel is in guest cpuidle: Do not unset the driver if it is there already cpuidle: teo: Fix "early hits" handling for disabled idle states cpuidle: teo: Consider hits and misses metrics of disabled states cpuidle: teo: Rename local variable in teo_select() cpuidle: teo: Ignore disabled idle states that are too deep
2019-11-25Merge tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull io_uring updates from Jens Axboe: "A lot of stuff has been going on this cycle, with improving the support for networked IO (and hence unbounded request completion times) being one of the major themes. There's been a set of fixes done this week, I'll send those out as well once we're certain we're fully happy with them. This contains: - Unification of the "normal" submit path and the SQPOLL path (Pavel) - Support for sparse (and bigger) file sets, and updating of those file sets without needing to unregister/register again. - Independently sized CQ ring, instead of just making it always 2x the SQ ring size. This makes it more flexible for networked applications. - Support for overflowed CQ ring, never dropping events but providing backpressure on submits. - Add support for absolute timeouts, not just relative ones. - Support for generic cancellations. This divorces io_uring from workqueues as well, which additionally gets us one step closer to generic async system call support. - With cancellations, we can support grabbing the process file table as well, just like we do mm context. This allows support for system calls that create file descriptors, like accept4() support that's built on top of that. - Support for io_uring tracing (Dmitrii) - Support for linked timeouts. These abort an operation if it isn't completed by the time noted in the linke timeout. - Speedup tracking of poll requests - Various cleanups making the coder easier to follow (Jackie, Pavel, Bob, YueHaibing, me) - Update MAINTAINERS with new io_uring list" * tag 'for-5.5/io_uring-20191121' of git://git.kernel.dk/linux-block: (64 commits) io_uring: make POLL_ADD/POLL_REMOVE scale better io-wq: remove now redundant struct io_wq_nulls_list io_uring: Fix getting file for non-fd opcodes io_uring: introduce req_need_defer() io_uring: clean up io_uring_cancel_files() io-wq: ensure free/busy list browsing see all items io-wq: ensure we have a stable view of ->cur_work for cancellations io_wq: add get/put_work handlers to io_wq_create() io_uring: check for validity of ->rings in teardown io_uring: fix potential deadlock in io_poll_wake() io_uring: use correct "is IO worker" helper io_uring: fix -ENOENT issue with linked timer with short timeout io_uring: don't do flush cancel under inflight_lock io_uring: flag SQPOLL busy condition to userspace io_uring: make ASYNC_CANCEL work with poll and timeout io_uring: provide fallback request for OOM situations io_uring: convert accept4() -ERESTARTSYS into -EINTR io_uring: fix error clear of ->file_table in io_sqe_files_register() io_uring: separate the io_free_req and io_free_req_find_next interface io_uring: keep io_put_req only responsible for release and put req ...
2019-11-21sched/vtime: Bring up complete kcpustat accessorFrederic Weisbecker
Many callsites want to fetch the values of system, user, user_nice, guest or guest_nice kcpustat fields altogether or at least a pair of these. In that case calling kcpustat_field() for each requested field brings unecessary overhead when we could fetch all of them in a row. So provide kcpustat_cpu_fetch() that fetches the whole kcpustat array in a vtime safe way under the same RCU and seqcount block. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191121024430.19938-3-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-21sched/cputime: Support other fields on kcpustat_field()Frederic Weisbecker
Provide support for user, nice, guest and guest_nice fields through kcpustat_field(). Whether we account the delta to a nice or not nice field is decided on top of the nice value snapshot taken at the time we call kcpustat_field(). If the nice value of the task has been changed since the last vtime update, we may have inacurrate distribution of the nice VS unnice cputime. However this is considered as a minor issue compared to the proper fix that would involve interrupting the target on nice updates, which is undesired on nohz_full CPUs. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: Yauheni Kaliuta <yauheni.kaliuta@redhat.com> Link: https://lkml.kernel.org/r/20191121024430.19938-2-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-20cpuidle: Pass exit latency limit to cpuidle_use_deepest_state()Daniel Lezcano
Modify cpuidle_use_deepest_state() to take an additional exit latency limit argument to be passed to find_deepest_idle_state() and make cpuidle_idle_call() pass dev->forced_idle_latency_limit_ns to it for forced idle. Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: Rebase and rearrange code, subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-20cpuidle: Allow idle injection to apply exit latency limitDaniel Lezcano
In some cases it may be useful to specify an exit latency limit for the idle state to be used during CPU idle time injection. Instead of duplicating the information in struct cpuidle_device or propagating the latency limit in the call stack, replace the use_deepest_state field with forced_latency_limit_ns to represent that limit, so that the deepest idle state with exit latency within that limit is forced (i.e. no governors) when it is set. A zero exit latency limit for forced idle means to use governors in the usual way (analogous to use_deepest_state equal to "false" before this change). Additionally, add play_idle_precise() taking two arguments, the duration of forced idle and the idle state exit latency limit, both in nanoseconds, and redefine play_idle() as a wrapper around that new function. This change is preparatory, no functional impact is expected. Suggested-by: Rafael J. Wysocki <rafael@kernel.org> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> [ rjw: Subject, changelog, cpuidle_use_deepest_state() kerneldoc, whitespace ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2019-11-18sched/cpufreq: Move the cfs_rq_util_change() call to cpufreq_update_util()Vincent Guittot
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays, which might be inefficient when the cpufreq driver has rate limitation. When a task is attached on a CPU, we have this call path: update_load_avg() update_cfs_rq_load_avg() cfs_rq_util_change -- > trig frequency update attach_entity_load_avg() cfs_rq_util_change -- > trig frequency update The 1st frequency update will not take into account the utilization of the newly attached task and the 2nd one might be discarded because of rate limitation of the cpufreq driver. update_cfs_rq_load_avg() is only called by update_blocked_averages() and update_load_avg() so we can move the call to cfs_rq_util_change/cpufreq_update_util() into these two functions. It's also interesting to note that update_load_avg() already calls cfs_rq_util_change() directly for the !SMP case. This change will also ensure that cpufreq_update_util() is called even when there is no more CFS rq in the leaf_cfs_rq_list to update, but only IRQ, RT or DL PELT signals. [ mingo: Minor updates. ] Reported-by: Doug Smythies <dsmythies@telus.net> Tested-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: juri.lelli@redhat.com Cc: linux-pm@vger.kernel.org Cc: mgorman@suse.de Cc: rostedt@goodmis.org Cc: sargun@sargun.me Cc: srinivas.pandruvada@linux.intel.com Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-11-18Merge tag 'v5.4-rc8' into sched/core, to pick up fixes and dependenciesIngo Molnar
Signed-off-by: Ingo Molnar <mingo@kernel.org>