aboutsummaryrefslogtreecommitdiff
path: root/block/bfq-iosched.h
AgeCommit message (Collapse)Author
2019-01-31block, bfq: consider also ioprio classes in symmetry detectionPaolo Valente
In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to be guaranteed a different bandwidth than other bfq_queues or bfq_groups, these service guaranteed can be provided only by plugging I/O dispatch, completely or partially, when the queue in service remains temporarily empty. A case where asymmetry is particularly strong is when some active bfq_queues belong to a higher-priority class than some other active bfq_queues. Unfortunately, this important case is not considered at all in the code for detecting asymmetric scenarios. This commit adds the missing logic. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-12-07block, bfq: fix decrement of num_active_groupsPaolo Valente
Since commit '2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")', if there are process groups with I/O requests waiting for completion, then BFQ tags the scenario as 'asymmetric'. This detection is needed for preserving service guarantees (for details, see comments on the computation * of the variable asymmetric_scenario in the function bfq_better_to_idle). Unfortunately, commit '2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")' contains an error exactly in the updating of the number of groups with I/O requests waiting for completion: if a group has more than one descendant process, then the above number of groups, which is renamed from num_active_groups to a more appropriate num_groups_with_pending_reqs by this commit, may happen to be wrongly decremented multiple times, namely every time one of the descendant processes gets all its pending I/O requests completed. A correct, complete solution should work as follows. Consider a group that is inactive, i.e., that has no descendant process with pending I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs is still accounting for this group, because the group still has some descendant process with some I/O request still in flight. num_groups_with_pending_reqs should be decremented when the in-flight request of the last descendant process is finally completed (assuming that nothing else has changed for the group in the meantime, in terms of composition of the group and active/inactive state of child groups and processes). To accomplish this, an additional pending-request counter must be added to entities, and must be updated correctly. To avoid this additional field and operations, this commit resorts to the following tradeoff between simplicity and accuracy: for an inactive group that is still counted in num_groups_with_pending_reqs, this commit decrements num_groups_with_pending_reqs when the first descendant process of the group remains with no request waiting for completion. This simplified scheme provides a fix to the unbalanced decrements introduced by 2d29c9f89fcd. Since this error was also caused by lack of comments on this non-trivial issue, this commit also adds related comments. Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection") Reported-by: Steven Barrett <steven@liquorix.net> Tested-by: Steven Barrett <steven@liquorix.net> Tested-by: Lucjan Lucjanov <lucjan.lucjanov@gmail.com> Reviewed-by: Federico Motta <federico@willer.it> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-10-13block, bfq: improve asymmetric scenarios detectionFederico Motta
bfq defines as asymmetric a scenario where an active entity, say E (representing either a single bfq_queue or a group of other entities), has a higher weight than some other entities. If the entity E does sync I/O in such a scenario, then bfq plugs the dispatch of the I/O of the other entities in the following situation: E is in service but temporarily has no pending I/O request. In fact, without this plugging, all the times that E stops being temporarily idle, it may find the internal queues of the storage device already filled with an out-of-control number of extra requests, from other entities. So E may have to wait for the service of these extra requests, before finally having its own requests served. This may easily break service guarantees, with E getting less than its fair share of the device throughput. Usually, the end result is that E gets the same fraction of the throughput as the other entities, instead of getting more, according to its higher weight. Yet there are two other more subtle cases where E, even if its weight is actually equal to or even lower than the weight of any other active entities, may get less than its fair share of the throughput in case the above I/O plugging is not performed: 1. other entities issue larger requests than E; 2. other entities contain more active child entities than E (or in general tend to have more backlog than E). In the first case, other entities may get more service than E because they get larger requests, than those of E, served during the temporary idle periods of E. In the second case, other entities get more service because, by having many child entities, they have many requests ready for dispatching while E is temporarily idle. This commit addresses this issue by extending the definition of asymmetric scenario: a scenario is asymmetric when - active entities representing bfq_queues have differentiated weights, as in the original definition or (inclusive) - one or more entities representing groups of entities are active. This broader definition makes sure that I/O plugging will be performed in all the above cases, provided that there is at least one active group. Of course, this definition is very coarse, so it will trigger I/O plugging also in cases where it is not needed, such as, e.g., multiple active entities with just one child each, and all with the same I/O-request size. The reason for this coarse definition is just that a finer-grained definition would be rather heavy to compute. On the opposite end, even this new definition does not trigger I/O plugging in all cases where there is no active group, and all bfq_queues have the same weight. So, in these cases some unfairness may occur if there are asymmetries in I/O-request sizes. We made this choice because I/O plugging may lower throughput, and probably a user that has not created any group cares more about throughput than about perfect fairness. At any rate, as for possible applications that may care about service guarantees, bfq already guarantees a high responsiveness and a low latency to soft real-time applications automatically. Signed-off-by: Federico Motta <federico@willer.it> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-09-14block, bfq: inject other-queue I/O into seeky idle queues on NCQ flashPaolo Valente
The Achilles' heel of BFQ is its failing to reach a high throughput with sync random I/O on flash storage with internal queueing, in case the processes doing I/O have differentiated weights. The cause of this failure is as follows. If at least two processes do sync I/O, and have a different weight from each other, then BFQ plugs I/O dispatching every time one of these processes, while it is being served, remains temporarily without pending I/O requests. This plugging is necessary to guarantee that every process enjoys a bandwidth proportional to its weight; but it empties the internal queue(s) of the drive. And this kills throughput with random I/O. So, if some processes have differentiated weights and do both sync and random I/O, the end result is a throughput collapse. This commit tries to counter this problem by injecting the service of other processes, in a controlled way, while the process in service happens to have no I/O. This injection is performed only if the medium is non rotational and performs internal queueing, and the process in service does random I/O (service injection might be beneficial for sequential I/O too, we'll work on that). As an example of the benefits of this commit, on a PLEXTOR PX-256M5S SSD, and with five processes having differentiated weights and doing sync random 4KB I/O, this commit makes the throughput with bfq grow by 400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than that reached with none. As some less random I/O is added to the mix, the throughput becomes equal to or higher than that with none. This commit is a very first attempt to recover throughput without losing control, and certainly has many limitations. One is, e.g., that the processes whose service is injected are not chosen so as to distribute the extra bandwidth they receive in accordance to their weights. Thus there might be loss of weighted fairness in some cases. Anyway, this loss concerns extra service, which would not have been received at all without this commit. Other limitations and issues will probably show up with usage. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-07-09block, bfq: add/remove entity weights correctlyPaolo Valente
To keep I/O throughput high as often as possible, BFQ performs I/O-dispatch plugging (aka device idling) only when beneficial exactly for throughput, or when needed for service guarantees (low latency, fairness). An important case where the latter condition holds is when the scenario is 'asymmetric' in terms of weights: i.e., when some bfq_queue or whole group of queues has a higher weight, and thus has to receive more service, than other queues or groups. Without dispatch plugging, lower-weight queues/groups may unjustly steal bandwidth to higher-weight queues/groups. To detect asymmetric scenarios, BFQ checks some sufficient conditions. One of these conditions is that active groups have different weights. BFQ controls this condition by maintaining a special set of unique weights of active groups (group_weights_tree). To this purpose, in the function bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a group to/from this set. Unfortunately, the function bfq_active_extract may happen to be invoked also for a group that is still active (to preserve the correct update of the next queue to serve, see comments in function bfq_no_longer_next_in_service() for details). In this case, removing the weight of the group makes the set group_weights_tree inconsistent. Service-guarantee violations follow. This commit addresses this issue by moving group_weights_tree insertions from their previous location (in bfq_active_insert) into the function __bfq_activate_entity, and by moving group_weights_tree extractions from bfq_active_extract to when the entity that represents a group remains throughly idle, i.e., with no request either enqueued or dispatched. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-31block, bfq: remove slow-system classPaolo Valente
BFQ computes the duration of weight raising for interactive applications automatically, using some reference parameters. In particular, BFQ uses the best durations (see comments in the code for how these durations have been assessed) for two classes of systems: slow and fast ones. Examples of slow systems are old phones or systems using micro HDDs. Fast systems are all the remaining ones. Using these parameters, BFQ computes the actual duration of the weight raising, for the system at hand, as a function of the relative speed of the system w.r.t. the speed of a reference system, belonging to the same class of systems as the system at hand. This slow vs fast differentiation proved to be useful in the past, but happens to have little meaning with current hardware. Even worse, it does cause problems in virtual systems, where the speed of the system can vary frequently, and so widely to just confuse the class-detection mechanism, and, as we have verified experimentally, to cause BFQ to compute non-sensical weight-raising durations. This commit addresses this issue by removing the slow class and the class-detection mechanism. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-10bfq-iosched: remove unused variableJens Axboe
bfqd->sb_shift was attempted used as a cache for the sbitmap queue shift, but we don't need it, as it never changes. Kill it with fire. Acked-by: Paolo Valente <paolo.valente@linaro.org> Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-05-09block: use ktime_get_ns() instead of sched_clock() for cfq and bfqOmar Sandoval
cfq and bfq have some internal fields that use sched_clock() which can trivially use ktime_get_ns() instead. Their timestamp fields in struct request can also use ktime_get_ns(), which resolves the 8 year old comment added by commit 28f4197e5d47 ("block: disable preemption before using sched_clock()"). Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-03-26block, bfq: lower-bound the estimated peak rate to 1Paolo Valente
If a storage device handled by BFQ happens to be slower than 7.5 KB/s for a certain amount of time (in the order of a second), then the estimated peak rate of the device, maintained in BFQ, becomes equal to 0. The reason is the limited precision with which the rate is represented (details on the range of representable values in the comments introduced by this commit). This leads to a division-by-zero error where the estimated peak rate is used as divisor. Such a type of failure has been reported in [1]. This commit addresses this issue by: 1. Lower-bounding the estimated peak rate to 1 2. Adding and improving comments on the range of rates representable [1] https://www.spinics.net/lists/kernel/msg2739205.html Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18block, bfq: limit sectors served with interactive weight raisingPaolo Valente
To maximise responsiveness, BFQ raises the weight, and performs device idling, for bfq_queues associated with processes deemed as interactive. In particular, weight raising has a maximum duration, equal to the time needed to start a large application. If a weight-raised process goes on doing I/O beyond this maximum duration, it loses weight-raising. This mechanism is evidently vulnerable to the following false positives: I/O-bound applications that will go on doing I/O for much longer than the duration of weight-raising. These applications have basically no benefit from being weight-raised at the beginning of their I/O. On the opposite end, while being weight-raised, these applications a) unjustly steal throughput to applications that may truly need low latency; b) make BFQ uselessly perform device idling; device idling results in loss of device throughput with most flash-based storage, and may increase latencies when used purposelessly. This commit adds a countermeasure to reduce both the above problems. To introduce this countermeasure, we provide the following extra piece of information (full details in the comments added by this commit). During the start-up of the large application used as a reference to set the duration of weight-raising, involved processes transfer at most ~110K sectors each. Accordingly, a process initially deemed as interactive has no right to be weight-raised any longer, once transferred 110K sectors or more. Basing on this consideration, this commit early-ends weight-raising for a bfq_queue if the latter happens to have received an amount of service at least equal to 110K sectors (actually, a little bit more, to keep a safety margin). I/O-bound applications that reach a high throughput, such as file copy, get to this threshold much before the allowed weight-raising period finishes. Thus this early ending of weight-raising reduces the amount of time during which these applications cause the problems described above. Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-18block, bfq: limit tags for writes and async I/OPaolo Valente
Asynchronous I/O can easily starve synchronous I/O (both sync reads and sync writes), by consuming all request tags. Similarly, storms of synchronous writes, such as those that sync(2) may trigger, can starve synchronous reads. In their turn, these two problems may also cause BFQ to loose control on latency for interactive and soft real-time applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice Writer takes 0.6 seconds to start if the device is idle, but it takes more than 45 seconds (!) if there are sequential writes in the background. This commit addresses this issue by limiting the maximum percentage of tags that asynchronous I/O requests and synchronous write requests can consume. In particular, this commit grants a higher threshold to synchronous writes, to prevent the latter from being starved by asynchronous I/O. According to the above test, LibreOffice Writer now starts in about 1.2 seconds on average, regardless of the background workload, and apart from some rare outlier. To check this improvement, run, e.g., sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init" for the comm_startup_lat benchmark in the S suite [1]. [1] https://github.com/Algodev-github/S Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2018-01-05block, bfq: let a queue be merged only shortly after starting I/OPaolo Valente
In BFQ and CFQ, two processes are said to be cooperating if they do I/O in such a way that the union of their I/O requests yields a sequential I/O pattern. To get such a sequential I/O pattern out of the non-sequential pattern of each cooperating process, BFQ and CFQ merge the queues associated with these processes. In more detail, cooperating processes, and thus their associated queues, usually start, or restart, to do I/O shortly after each other. This is the case, e.g., for the I/O threads of KVM/QEMU and of the dump utility. Basing on this assumption, this commit allows a bfq_queue to be merged only during a short time interval (100ms) after it starts, or re-starts, to do I/O. This filtering provides two important benefits. First, it greatly reduces the probability that two non-cooperating processes have their queues merged by mistake, if they just happen to do I/O close to each other for a short time interval. These spurious merges cause loss of service guarantees. A low-weight bfq_queue may unjustly get more than its expected share of the throughput: if such a low-weight queue is merged with a high-weight queue, then the I/O for the low-weight queue is served as if the queue had a high weight. This may damage other high-weight queues unexpectedly. For instance, because of this issue, lxterminal occasionally took 7.5 seconds to start, instead of 6.5 seconds, when some sequential readers and writers did I/O in the background on a FUJITSU MHX2300BT HDD. The reason is that the bfq_queues associated with some of the readers or the writers were merged with the high-weight queues of some processes that had to do some urgent but little I/O. The readers then exploited the inherited high weight for all or most of their I/O, during the start-up of terminal. The filtering introduced by this commit eliminated any outlier caused by spurious queue merges in our start-up time tests. This filtering also provides a little boost of the throughput sustainable by BFQ: 3-4%, depending on the CPU. The reason is that, once a bfq_queue cannot be merged any longer, this commit makes BFQ stop updating the data needed to handle merging for the queue. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-11-14block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUPLuca Miccio
BFQ currently creates, and updates, its own instance of the whole set of blkio statistics that cfq creates. Yet, from the comments of Tejun Heo in [1], it turned out that most of these statistics are meant/useful only for debugging. This commit makes BFQ create the latter, debugging statistics only if the option CONFIG_DEBUG_BLK_CGROUP is set. By doing so, this commit also enables BFQ to enjoy a high perfomance boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then BFQ has to update far fewer statistics, and, in particular, not the heaviest to update. To give an idea of the benefits, if CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and with 8 threads doing random I/O in parallel on null_blk (configured with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS (+30%). We have measured similar or even much higher boosts with other CPUs: e.g., +45% with an ARM CortexTM-A53 Octa-core. Our results have been obtained and can be reproduced very easily with the script in [1]. [1] https://www.spinics.net/lists/linux-block/msg18943.html Suggested-by: Tejun Heo <tj@kernel.org> Suggested-by: Ulf Hansson <ulf.hansson@linaro.org> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Luca Miccio <lucmiccio@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-31block, bfq: make lookup_next_entity push up vtime on expirationsPaolo Valente
To provide a very smooth service, bfq starts to serve a bfq_queue only if the queue is 'eligible', i.e., if the same queue would have started to be served in the ideal, perfectly fair system that bfq simulates internally. This is obtained by associating each queue with a virtual start time, and by computing a special system virtual time quantity: a queue is eligible only if the system virtual time has reached the virtual start time of the queue. Finally, bfq guarantees that, when a new queue must be set in service, there is always at least one eligible entity for each active parent entity in the scheduler. To provide this guarantee, the function __bfq_lookup_next_entity pushes up, for each parent entity on which it is invoked, the system virtual time to the minimum among the virtual start times of the entities in the active tree for the parent entity (more precisely, the push up occurs if the system virtual time happens to be lower than all such virtual start times). There is however a circumstance in which __bfq_lookup_next_entity cannot push up the system virtual time for a parent entity, even if the system virtual time is lower than the virtual start times of all the child entities in the active tree. It happens if one of the child entities is in service. In fact, in such a case, there is already an eligible entity, the in-service one, even if it may not be not present in the active tree (because in-service entities may be removed from the active tree). Unfortunately, in the last re-design of the hierarchical-scheduling engine, the reset of the pointer to the in-service entity for a given parent entity--reset to be done as a consequence of the expiration of the in-service entity--always happens after the function __bfq_lookup_next_entity has been invoked. This causes the function to think that there is still an entity in service for the parent entity, and then that the system virtual time cannot be pushed up, even if actually such a no-more-in-service entity has already been properly reinserted into the active tree (or in some other tree if no more active). Yet, the system virtual time *had* to be pushed up, to be ready to correctly choose the next queue to serve. Because of the lack of this push up, bfq may wrongly set in service a queue that had been speculatively pre-computed as the possible next-in-service queue, but that would no more be the one to serve after the expiration and the reinsertion into the active trees of the previously in-service entities. This commit addresses this issue by making __bfq_lookup_next_entity properly push up the system virtual time if an expiration is occurring. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-28Merge tag 'v4.13-rc7' into for-4.14/block-postmergeJens Axboe
Linux 4.13-rc7 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-08-11block,bfq: refactor device-idling logicPaolo Valente
The logic that decides whether to idle the device is scattered across three functions. Almost all of the logic is in the function bfq_bfqq_may_idle, but (1) part of the decision is made in bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may switch off idling regardless of the output of bfq_bfqq_may_idle. In addition, both bfq_update_idle_window and bfq_bfqq_must_idle make their decisions as a function of parameters that are used, for similar purposes, also in bfq_bfqq_may_idle. This commit addresses these issues by moving all the logic into bfq_bfqq_may_idle. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29block, bfq: consider also in_service_entity to state whether an entity is activePaolo Valente
Groups of BFQ queues are represented by generic entities in BFQ. When a queue belonging to a parent entity is deactivated, the parent entity may need to be deactivated too, in case the deactivated queue was the only active queue for the parent entity. This deactivation may need to be propagated upwards if the entity belongs, in its turn, to a further higher-level entity, and so on. In particular, the upward propagation of deactivation stops at the first parent entity that remains active even if one of its child entities has been deactivated. To decide whether the last non-deactivation condition holds for a parent entity, BFQ checks whether the field next_in_service is still not NULL for the parent entity, after the deactivation of one of its child entity. If it is not NULL, then there are certainly other active entities in the parent entity, and deactivations can stop. Unfortunately, this check misses a corner case: if in_service_entity is not NULL, then next_in_service may happen to be NULL, although the parent entity is evidently active. This happens if: 1) the entity pointed by in_service_entity is the only active entity in the parent entity, and 2) according to the definition of next_in_service, the in_service_entity cannot be considered as next_in_service. See the comments on the definition of next_in_service for details on this second point. Hitting the above corner case causes crashes. To address this issue, this commit: 1) Extends the above check on only next_in_service to controlling both next_in_service and in_service_entity (if any of them is not NULL, then no further deactivation is performed) 2) Improves the (important) comments on how next_in_service is defined and updated; in particular it fixes a few rather obscure paragraphs Reported-by: Eric Wheeler <bfq-sched@lists.ewheeler.net> Reported-by: Rick Yiu <rick_yiu@htc.com> Reported-by: Tom X Nguyen <tom81094@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Eric Wheeler <bfq-sched@lists.ewheeler.net> Tested-by: Rick Yiu <rick_yiu@htc.com> Tested-by: Laurentiu Nicola <lnicola@dend.ro> Tested-by: Tom X Nguyen <tom81094@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-29block: use standard blktrace API to output cgroup info for debug notesShaohua Li
Currently cfq/bfq/blk-throttle output cgroup info in trace in their own way. Now we have standard blktrace API for this, so convert them to use it. Note, this changes the behavior a little bit. cgroup info isn't output by default, we only do this with 'blk_cgroup' option enabled. cgroup info isn't output as a string by default too, we only do this with 'blk_cgname' option enabled. Also cgroup info is output in different position of the note string. I think these behavior changes aren't a big issue (actually we make trace data shorter which is good), since the blktrace note is solely for debugging. Signed-off-by: Shaohua Li <shli@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-12bfq: fix typos in comments about B-WF2Q+ algorithmHou Tao
The start time of eligible entity should be less than or equal to the current virtual time, and the entity in idle tree has a finish time being greater than the current virtual time. Signed-off-by: Hou Tao <houtao1@huawei.com> Reviewed-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-07-03block, bfq: don't change ioprio class for a bfq_queue on a service treePaolo Valente
On each deactivation or re-scheduling (after being served) of a bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(), to perform pending updates of ioprio, weight and ioprio class for the bfq_queue. BFQ also invokes this function on I/O-request dispatches, to raise or lower weights more quickly when needed, thereby improving latency. However, the entity representing the bfq_queue may be on the active (sub)tree of a service tree when this happens, and, although with a very low probability, the bfq_queue may happen to also have a pending change of its ioprio class. If both conditions hold when __bfq_entity_update_weight_prio() is invoked, then the entity moves to a sort of hybrid state: the new service tree for the entity, as returned by bfq_entity_service_tree(), differs from service tree on which the entity still is. The functions that handle activations and deactivations of entities do not cope with such a hybrid state (and would need to become more complex to cope). This commit addresses this issue by just making __bfq_entity_update_weight_prio() not perform also a possible pending change of ioprio class, when invoked on an I/O-request dispatch for a bfq_queue. Such a change is thus postponed to when __bfq_entity_update_weight_prio() is invoked on deactivation or re-scheduling of the bfq_queue. Reported-by: Marco Piazza <mpiazza@gmail.com> Reported-by: Laurentiu Nicola <lnicola@dend.ro> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Marco Piazza <mpiazza@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2017-06-08block, bfq: access and cache blkg data only when safePaolo Valente
In blk-cgroup, operations on blkg objects are protected with the request_queue lock. This is no more the lock that protects I/O-scheduler operations in blk-mq. In fact, the latter are now protected with a finer-grained per-scheduler-instance lock. As a consequence, although blkg lookups are also rcu-protected, blk-mq I/O schedulers may see inconsistent data when they access blkg and blkg-related objects. BFQ does access these objects, and does incur this problem, in the following case. The blkg_lookup performed in bfq_get_queue, being protected (only) through rcu, may happen to return the address of a copy of the original blkg. If this is the case, then the blkg_get performed in bfq_get_queue, to pin down the blkg, is useless: it does not prevent blk-cgroup code from destroying both the original blkg and all objects directly or indirectly referred by the copy of the blkg. BFQ accesses these objects, which typically causes a crash for NULL-pointer dereference of memory-protection violation. Some additional protection mechanism should be added to blk-cgroup to address this issue. In the meantime, this commit provides a quick temporary fix for BFQ: cache (when safe) blkg data that might disappear right after a blkg_lookup. In particular, this commit exploits the following facts to achieve its goal without introducing further locks. Destroy operations on a blkg invoke, as a first step, hooks of the scheduler associated with the blkg. And these hooks are executed with bfqd->lock held for BFQ. As a consequence, for any blkg associated with the request queue an instance of BFQ is attached to, we are guaranteed that such a blkg is not destroyed, and that all the pointers it contains are consistent, while that instance is holding its bfqd->lock. A blkg_lookup performed with bfqd->lock held then returns a fully consistent blkg, which remains consistent until this lock is held. In more detail, this holds even if the returned blkg is a copy of the original one. Finally, also the object describing a group inside BFQ needs to be protected from destruction on the blkg_free of the original blkg (which invokes bfq_pd_free). This commit adds private refcounting for this object, to let it disappear only after no bfq_queue refers to it any longer. This commit also removes or updates some stale comments on locking issues related to blk-cgroup operations. Reported-by: Tomas Konir <tomas.konir@gmail.com> Reported-by: Lee Tibbert <lee.tibbert@gmail.com> Reported-by: Marco Piazza <mpiazza@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Tested-by: Tomas Konir <tomas.konir@gmail.com> Tested-by: Lee Tibbert <lee.tibbert@gmail.com> Tested-by: Marco Piazza <mpiazza@gmail.com> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-20bfq: fix compile error if CONFIG_CGROUPS=nJens Axboe
If we don't have CGROUPS enabled, the compile ends in the following misery: In file included from ../block/bfq-iosched.c:105:0: ../block/bfq-iosched.h:819:22: error: array type has incomplete element type extern struct cftype bfq_blkcg_legacy_files[]; ^ ../block/bfq-iosched.h:820:22: error: array type has incomplete element type extern struct cftype bfq_blkg_files[]; ^ Move the declarations under the right ifdef. Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jens Axboe <axboe@fb.com>
2017-04-19block, bfq: split bfq-iosched.c into multiple source filesPaolo Valente
The BFQ I/O scheduler features an optimal fair-queuing (proportional-share) scheduling algorithm, enriched with several mechanisms to boost throughput and reduce latency for interactive and real-time applications. This makes BFQ a large and complex piece of code. This commit addresses this issue by splitting BFQ into three main, independent components, and by moving each component into a separate source file: 1. Main algorithm: handles the interaction with the kernel, and decides which requests to dispatch; it uses the following two further components to achieve its goals. 2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm): computes the schedule, using weights and budgets provided by the above component. 3. cgroups support: handles group operations (creation, destruction, move, ...). Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jens Axboe <axboe@fb.com>