* [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32
From: Mel Gorman @ 2009-11-26 12:19 UTC
To: Jens Axboe, Andrew Morton, Linus Torvalds
Cc: Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel,
Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki,
linux-kernel, linux-mm
(cc'ing the people from the page allocator failure thread as this might be
relevant to some of their problems)
I know this is very last minute, but I believe we should consider disabling
the "low_latency" tunable for block devices by default for 2.6.32. There was
evidence last week that low_latency was implicated in the page allocation
failure reports, but the reproduction case was unusual and involved high-order
atomic allocations in low-memory conditions. It took another few days to
accurately show the problem for more normal workloads, and it is a bit more
widespread than just allocation failures.
Basically, low_latency looks great as long as you have plenty of memory,
but in low-memory situations it appears to cause problems that manifest as
reduced performance, desktop stalls and, in some cases, page allocation
failures. I think most kernel developers are not seeing the problem because
they tend to test on beefier machines and, for the most part, without hitting
swap or low-memory conditions. When they do hit low-memory situations, it
tends to be during stress tests where stalls and low performance are expected.
To show the problem, I used an x86-64 machine booted with 512MB of memory.
This is a small amount of RAM, but the bug reports related to page allocation
failures were on smallish machines and the disks in the system are not very
high-performance.
I used three tests. The first was sysbench on postgres running an IO-heavy
test against a large database with 10,000,000 rows. The second was IOZone
running most of the automatic tests with a record length of 4KB, and the
last was a simulated launch of gitk with a music player running in the
background to act as a desktop-like scenario. The final test was similar
to the test described at http://lwn.net/Articles/362184/ except that
dm-crypt was not used, as it has its own problems.
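(A sysbench OLTP invocation along the lines of "sysbench --test=oltp
--db-driver=pgsql --oltp-table-size=10000000 --num-threads=<n> run" after the
matching prepare step, with <n> swept from 1 to 16 as in the table below,
should approximate the setup described above; the exact options are not
recorded in this mail, so treat that command as illustrative only.)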
The sysbench results look as follows:
Threads    sysbench-with-low-latency    sysbench-without-low-latency
1 1266.02 ( 0.00%) 1278.55 ( 0.98%)
2 1182.58 ( 0.00%) 1379.25 (14.26%)
3 1257.08 ( 0.00%) 1580.08 (20.44%)
4 1212.11 ( 0.00%) 1534.17 (20.99%)
5 1046.77 ( 0.00%) 1552.48 (32.57%)
6 1187.14 ( 0.00%) 1661.19 (28.54%)
7 1179.37 ( 0.00%) 790.26 (-49.24%)
8 1164.62 ( 0.00%) 854.10 (-36.36%)
9 1125.04 ( 0.00%) 1655.04 (32.02%)
10 1147.52 ( 0.00%) 1653.89 (30.62%)
11 823.38 ( 0.00%) 1627.45 (49.41%)
12 813.73 ( 0.00%) 1494.63 (45.56%)
13 898.22 ( 0.00%) 1521.64 (40.97%)
14 873.50 ( 0.00%) 1311.09 (33.38%)
15 808.32 ( 0.00%) 1009.70 (19.94%)
16 758.17 ( 0.00%) 725.17 (-4.55%)
The first column is the number of threads. Disabling low_latency performs
much better for the most part. I should point out that with plenty of memory,
sysbench tends to perform better *with* low_latency, but as the page
allocation failure reports and desktop stalls are happening in low-memory
situations, the low-memory case is also important.
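As an aside on reading the figures: in the throughput tables (sysbench and
IOZone), the percentage in brackets appears to be the gain relative to the
without-low_latency figure, i.e. (without - with) / without * 100; for one
thread, for example, (1278.55 - 1266.02) / 1278.55 works out at roughly
0.98%. In the gitk table further down, where lower is better, the percentage
is instead the reduction in time relative to the with-low_latency figure.
Either way, positive numbers mean that disabling low_latency did better.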
The IOZone results are long, I'm afraid.
operation-sizeInKB    iozone-with-low-latency    iozone-without-low-latency
write-64 151212 ( 0.00%) 159856 ( 5.41%)
write-128 189357 ( 0.00%) 206233 ( 8.18%)
write-256 219883 ( 0.00%) 223174 ( 1.47%)
write-512 224932 ( 0.00%) 220227 (-2.14%)
write-1024 227738 ( 0.00%) 226155 (-0.70%)
write-2048 227564 ( 0.00%) 224848 (-1.21%)
write-4096 208556 ( 0.00%) 223430 ( 6.66%)
write-8192 219484 ( 0.00%) 219389 (-0.04%)
write-16384 206670 ( 0.00%) 206295 (-0.18%)
write-32768 203023 ( 0.00%) 201852 (-0.58%)
write-65536 162134 ( 0.00%) 189173 (14.29%)
write-131072 68534 ( 0.00%) 67417 (-1.66%)
write-262144 32936 ( 0.00%) 27750 (-18.69%)
write-524288 24044 ( 0.00%) 23759 (-1.20%)
rewrite-64 755681 ( 0.00%) 755681 ( 0.00%)
rewrite-128 581518 ( 0.00%) 799840 (27.30%)
rewrite-256 639427 ( 0.00%) 659861 ( 3.10%)
rewrite-512 669577 ( 0.00%) 684954 ( 2.24%)
rewrite-1024 680960 ( 0.00%) 686182 ( 0.76%)
rewrite-2048 685263 ( 0.00%) 692780 ( 1.09%)
rewrite-4096 631352 ( 0.00%) 643266 ( 1.85%)
rewrite-8192 442146 ( 0.00%) 442624 ( 0.11%)
rewrite-16384 428641 ( 0.00%) 432613 ( 0.92%)
rewrite-32768 425361 ( 0.00%) 430568 ( 1.21%)
rewrite-65536 405183 ( 0.00%) 389242 (-4.10%)
rewrite-131072 66110 ( 0.00%) 58472 (-13.06%)
rewrite-262144 29254 ( 0.00%) 29306 ( 0.18%)
rewrite-524288 23812 ( 0.00%) 24543 ( 2.98%)
read-64 934589 ( 0.00%) 840903 (-11.14%)
read-128 1601534 ( 0.00%) 1280633 (-25.06%)
read-256 1255511 ( 0.00%) 1310683 ( 4.21%)
read-512 1291158 ( 0.00%) 1319723 ( 2.16%)
read-1024 1319408 ( 0.00%) 1347557 ( 2.09%)
read-2048 1316016 ( 0.00%) 1347393 ( 2.33%)
read-4096 1253710 ( 0.00%) 1251882 (-0.15%)
read-8192 995149 ( 0.00%) 1011794 ( 1.65%)
read-16384 883156 ( 0.00%) 897458 ( 1.59%)
read-32768 844368 ( 0.00%) 856364 ( 1.40%)
read-65536 816099 ( 0.00%) 826473 ( 1.26%)
read-131072 818055 ( 0.00%) 824351 ( 0.76%)
read-262144 827225 ( 0.00%) 835693 ( 1.01%)
read-524288 24653 ( 0.00%) 22519 (-9.48%)
reread-64 2329708 ( 0.00%) 1985134 (-17.36%)
reread-128 1446222 ( 0.00%) 2137031 (32.33%)
reread-256 1828508 ( 0.00%) 1879725 ( 2.72%)
reread-512 1521718 ( 0.00%) 1579934 ( 3.68%)
reread-1024 1347557 ( 0.00%) 1375171 ( 2.01%)
reread-2048 1340664 ( 0.00%) 1350783 ( 0.75%)
reread-4096 1259592 ( 0.00%) 1284839 ( 1.96%)
reread-8192 1007285 ( 0.00%) 1011317 ( 0.40%)
reread-16384 891404 ( 0.00%) 905022 ( 1.50%)
reread-32768 850492 ( 0.00%) 862772 ( 1.42%)
reread-65536 836565 ( 0.00%) 847020 ( 1.23%)
reread-131072 844516 ( 0.00%) 853155 ( 1.01%)
reread-262144 851524 ( 0.00%) 860653 ( 1.06%)
reread-524288 24927 ( 0.00%) 22487 (-10.85%)
randread-64 1605256 ( 0.00%) 1775099 ( 9.57%)
randread-128 1179358 ( 0.00%) 1528576 (22.85%)
randread-256 1421755 ( 0.00%) 1310683 (-8.47%)
randread-512 1306873 ( 0.00%) 1281909 (-1.95%)
randread-1024 1201314 ( 0.00%) 1231629 ( 2.46%)
randread-2048 1179413 ( 0.00%) 1190529 ( 0.93%)
randread-4096 1107005 ( 0.00%) 1116792 ( 0.88%)
randread-8192 894337 ( 0.00%) 899487 ( 0.57%)
randread-16384 783760 ( 0.00%) 791341 ( 0.96%)
randread-32768 740498 ( 0.00%) 743511 ( 0.41%)
randread-65536 721640 ( 0.00%) 728139 ( 0.89%)
randread-131072 715284 ( 0.00%) 720825 ( 0.77%)
randread-262144 709855 ( 0.00%) 714943 ( 0.71%)
randread-524288 394 ( 0.00%) 431 ( 8.58%)
randwrite-64 730988 ( 0.00%) 730988 ( 0.00%)
randwrite-128 746459 ( 0.00%) 742331 (-0.56%)
randwrite-256 695778 ( 0.00%) 727850 ( 4.41%)
randwrite-512 666253 ( 0.00%) 691126 ( 3.60%)
randwrite-1024 651223 ( 0.00%) 659625 ( 1.27%)
randwrite-2048 655558 ( 0.00%) 664073 ( 1.28%)
randwrite-4096 635556 ( 0.00%) 642400 ( 1.07%)
randwrite-8192 467357 ( 0.00%) 469734 ( 0.51%)
randwrite-16384 413188 ( 0.00%) 417282 ( 0.98%)
randwrite-32768 404161 ( 0.00%) 407580 ( 0.84%)
randwrite-65536 379372 ( 0.00%) 381273 ( 0.50%)
randwrite-131072 21780 ( 0.00%) 19758 (-10.23%)
randwrite-262144 6249 ( 0.00%) 6316 ( 1.06%)
randwrite-524288 2915 ( 0.00%) 2859 (-1.96%)
bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%)
bkwdread-128 1066865 ( 0.00%) 1101900 ( 3.18%)
bkwdread-256 877797 ( 0.00%) 1105556 (20.60%)
bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%)
bkwdread-1024 1163562 ( 0.00%) 1195962 ( 2.71%)
bkwdread-2048 1163439 ( 0.00%) 1204552 ( 3.41%)
bkwdread-4096 1116792 ( 0.00%) 1150600 ( 2.94%)
bkwdread-8192 912288 ( 0.00%) 934724 ( 2.40%)
bkwdread-16384 817707 ( 0.00%) 829152 ( 1.38%)
bkwdread-32768 775898 ( 0.00%) 787691 ( 1.50%)
bkwdread-65536 759643 ( 0.00%) 772174 ( 1.62%)
bkwdread-131072 763215 ( 0.00%) 773816 ( 1.37%)
bkwdread-262144 765491 ( 0.00%) 780021 ( 1.86%)
bkwdread-524288 3688 ( 0.00%) 3724 ( 0.97%)
The first column is "operation-sizeInKB". The other figures are measured
in operations (-O in iozone). It's a little less clear-cut but disabling
low_latency wins more often than not although many of the gains are small and
in the 1-3% range (or is that considered lots in iozone land?) There were
big gains and losses for some tests but the really big differences were
around 128 bytes so it might be a CPU caching effect.
Running a simulation of multiple instances of gitk and a music player results
in the following
          gitk-with-low-latency    gitk-without-low-latency
min 954.46 ( 0.00%) 640.65 (32.88%)
mean 964.79 ( 0.00%) 655.57 (32.05%)
stddev 10.01 ( 0.00%) 13.33 (-33.18%)
max 981.23 ( 0.00%) 675.65 (31.14%)
The measure is the time taken for the fake-gitk program to complete its job.
Disabling low_latency completes the test far faster. In previous tests, I had
rigged networking to do high-order atomic allocations to simulate wireless
cards, which are high-order happy. In those tests, disabling low_latency
performed better, produced more stable results, stalled less (which I think
would look like a desktop stall in a normal environment) and, critically, did
not fail high-order page allocations, i.e. enabling low_latency hurts reclaim
in some unspecified fashion.
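For reference, the main piece of CFQ logic gated by low_latency that could
plausibly affect reclaim is the async dispatch-depth ramp in
cfq_may_dispatch() in block/cfq-iosched.c. Quoting from 2.6.32, with two
short annotation comments of mine added inside the block:
	/*
	 * Async queues must wait a bit before being allowed dispatch.
	 * We also ramp up the dispatch depth gradually for async IO,
	 * based on the last sync IO we serviced
	 */
	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
		/* time since the last sync request completed */
		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
		unsigned int depth;

		/* one extra async dispatch allowed per elapsed sync slice */
		depth = last_sync / cfqd->cfq_slice[1];
		if (!depth && !cfqq->dispatched)
			depth = 1;
		if (depth < max_dispatch)
			max_dispatch = depth;
	}
While sync IO keeps arriving, last_sync stays below one sync slice
(cfq_slice[1], 100ms by default) and async dispatch is held to a single
outstanding request, so the writes that would clean pages for reclaim can
dribble out far below what the device could handle. If writeback is being
held back like this, it would fit the symptoms above, but I have not proven
it.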
On my laptop (2GB RAM), I find the desktop stalls less when I disable
low_latency in the situation where something kicks off a lot of IO. For
example, if I do a large git operation and switch to a browser while that
is doing its thing, I notice that the desktop sometimes stalls for almost a
second. I do not see this with low_latency disabled but I cannot quantify
this better and it's tricky to reproduce. I also might be fooling myself
because I expect to see problems with low_latency enabled.
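(In case anyone wants to try the same comparison without patching: with CFQ
as the active scheduler, the tunable can be flipped per device at runtime
with something like "echo 0 > /sys/block/sdX/queue/iosched/low_latency",
substituting sdX for the device in question.)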
I regret that I do not have an explanation as to why low_latency causes
problems, other than a hunch that low_latency is preventing page writeback
from happening fast enough and that this causes stalls later. Theories and
patches are welcome but, if it cannot be resolved, should the following be
applied?
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
block/cfq-iosched.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aa1e953..dc33045 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -2543,7 +2543,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
-	cfqd->cfq_latency = 1;
+	cfqd->cfq_latency = 0;
 	cfqd->hw_tag = 1;
 	cfqd->last_end_sync_rq = jiffies;
 	return cfqd;
--
^ permalink raw reply [flat|nested] 23+ messages in thread* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-26 12:19 [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 Mel Gorman @ 2009-11-26 13:08 ` Mike Galbraith 2009-11-26 13:20 ` Bartlomiej Zolnierkiewicz 2009-11-26 13:47 ` Corrado Zoccolo 2009-11-27 4:36 ` KOSAKI Motohiro 2 siblings, 1 reply; 23+ messages in thread From: Mike Galbraith @ 2009-11-26 13:08 UTC (permalink / raw) To: Mel Gorman Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Thu, 2009-11-26 at 12:19 +0000, Mel Gorman wrote: > (cc'ing the people from the page allocator failure thread as this might be > relevant to some of their problems) > > I know this is very last minute but I believe we should consider disabling > the "low_latency" tunable for block devices by default for 2.6.32. There was > evidence that low_latency was a problem last week for page allocation failure > reports but the reproduction-case was unusual and involved high-order atomic > allocations in low-memory conditions. It took another few days to accurately > show the problem for more normal workloads and it's a bit more wide-spread > than just allocation failures. > > Basically, low_latency looks great as long as you have plenty of memory > but in low memory situations, it appears to cause problems that manifest > as reduced performance, desktop stalls and in some cases, page allocation > failures. I think most kernel developers are not seeing the problem as they > tend to test on beefier machines and without hitting swap or low-memory > situations for the most part. When they are hitting low-memory situations, > it tends to be for stress tests where stalls and low performance are expected. Ouch. It was bad desktop stalls under heavy write that kicked the whole thing off. -Mike -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-26 13:08 ` Mike Galbraith @ 2009-11-26 13:20 ` Bartlomiej Zolnierkiewicz 2009-11-26 13:37 ` Mike Galbraith 0 siblings, 1 reply; 23+ messages in thread From: Bartlomiej Zolnierkiewicz @ 2009-11-26 13:20 UTC (permalink / raw) To: Mike Galbraith Cc: Mel Gorman, Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Thursday 26 November 2009 02:08:57 pm Mike Galbraith wrote: > On Thu, 2009-11-26 at 12:19 +0000, Mel Gorman wrote: > > (cc'ing the people from the page allocator failure thread as this might be > > relevant to some of their problems) > > > > I know this is very last minute but I believe we should consider disabling > > the "low_latency" tunable for block devices by default for 2.6.32. There was > > evidence that low_latency was a problem last week for page allocation failure > > reports but the reproduction-case was unusual and involved high-order atomic > > allocations in low-memory conditions. It took another few days to accurately > > show the problem for more normal workloads and it's a bit more wide-spread > > than just allocation failures. > > > > Basically, low_latency looks great as long as you have plenty of memory > > but in low memory situations, it appears to cause problems that manifest > > as reduced performance, desktop stalls and in some cases, page allocation > > failures. I think most kernel developers are not seeing the problem as they > > tend to test on beefier machines and without hitting swap or low-memory > > situations for the most part. When they are hitting low-memory situations, > > it tends to be for stress tests where stalls and low performance are expected. > > Ouch. It was bad desktop stalls under heavy write that kicked the whole > thing off. The problem is that 'desktop' means different things for different people (for some kernel developers 'desktop' is more like 'a workstation' and for others it is more like 'an embedded device'). -- Bartlomiej Zolnierkiewicz -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-26 13:20 ` Bartlomiej Zolnierkiewicz @ 2009-11-26 13:37 ` Mike Galbraith 2009-11-26 13:56 ` Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Mike Galbraith @ 2009-11-26 13:37 UTC (permalink / raw) To: Bartlomiej Zolnierkiewicz Cc: Mel Gorman, Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Thu, 2009-11-26 at 14:20 +0100, Bartlomiej Zolnierkiewicz wrote: > On Thursday 26 November 2009 02:08:57 pm Mike Galbraith wrote: > > On Thu, 2009-11-26 at 12:19 +0000, Mel Gorman wrote: > > > (cc'ing the people from the page allocator failure thread as this might be > > > relevant to some of their problems) > > > > > > I know this is very last minute but I believe we should consider disabling > > > the "low_latency" tunable for block devices by default for 2.6.32. There was > > > evidence that low_latency was a problem last week for page allocation failure > > > reports but the reproduction-case was unusual and involved high-order atomic > > > allocations in low-memory conditions. It took another few days to accurately > > > show the problem for more normal workloads and it's a bit more wide-spread > > > than just allocation failures. > > > > > > Basically, low_latency looks great as long as you have plenty of memory > > > but in low memory situations, it appears to cause problems that manifest > > > as reduced performance, desktop stalls and in some cases, page allocation > > > failures. I think most kernel developers are not seeing the problem as they > > > tend to test on beefier machines and without hitting swap or low-memory > > > situations for the most part. When they are hitting low-memory situations, > > > it tends to be for stress tests where stalls and low performance are expected. > > > > Ouch. It was bad desktop stalls under heavy write that kicked the whole > > thing off. > > The problem is that 'desktop' means different things for different people > (for some kernel developers 'desktop' is more like 'a workstation' and for > others it is more like 'an embedded device'). The stalls I'm talking about were reported for garden variety desktop PC. I reproduced them on my supermarket special Q6600 desktop PC. That problem has been with us roughly forever, but I'd hoped it had been cured. Guess not. As an idle speculation, I wonder if the sync vs async slice ratios may not have been knocked out of kilter a bit by giving more to sync. -Mike -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-26 13:37 ` Mike Galbraith @ 2009-11-26 13:56 ` Mel Gorman 0 siblings, 0 replies; 23+ messages in thread From: Mel Gorman @ 2009-11-26 13:56 UTC (permalink / raw) To: Mike Galbraith Cc: Bartlomiej Zolnierkiewicz, Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Thu, Nov 26, 2009 at 02:37:31PM +0100, Mike Galbraith wrote: > On Thu, 2009-11-26 at 14:20 +0100, Bartlomiej Zolnierkiewicz wrote: > > On Thursday 26 November 2009 02:08:57 pm Mike Galbraith wrote: > > > On Thu, 2009-11-26 at 12:19 +0000, Mel Gorman wrote: > > > > (cc'ing the people from the page allocator failure thread as this might be > > > > relevant to some of their problems) > > > > > > > > I know this is very last minute but I believe we should consider disabling > > > > the "low_latency" tunable for block devices by default for 2.6.32. There was > > > > evidence that low_latency was a problem last week for page allocation failure > > > > reports but the reproduction-case was unusual and involved high-order atomic > > > > allocations in low-memory conditions. It took another few days to accurately > > > > show the problem for more normal workloads and it's a bit more wide-spread > > > > than just allocation failures. > > > > > > > > Basically, low_latency looks great as long as you have plenty of memory > > > > but in low memory situations, it appears to cause problems that manifest > > > > as reduced performance, desktop stalls and in some cases, page allocation > > > > failures. I think most kernel developers are not seeing the problem as they > > > > tend to test on beefier machines and without hitting swap or low-memory > > > > situations for the most part. When they are hitting low-memory situations, > > > > it tends to be for stress tests where stalls and low performance are expected. > > > > > > Ouch. It was bad desktop stalls under heavy write that kicked the whole > > > thing off. > > > > The problem is that 'desktop' means different things for different people > > (for some kernel developers 'desktop' is more like 'a workstation' and for > > others it is more like 'an embedded device'). Will concede that - the term "desktop" is fuzzy at best. The characteristics of note are a mid-range machine running workloads that are not steady, have abupt phase changes and are not very well sized to the available memory. "Desktops" fall into this category but it's also possible that badly-or-borderline-provisioned servers would also fall into it. > > The stalls I'm talking about were reported for garden variety desktop > PC. The stalls I'm seeing on the laptop are tiny but there. It's prefectly possible a whole host of stalls for people have been resolved but there is one corner case. > I reproduced them on my supermarket special Q6600 desktop PC. That > problem has been with us roughly forever, but I'd hoped it had been > cured. Guess not. > It's possible the corner case causing stalls is specific to low-memory rather than writes. Conceivably, what is going wrong is that writes need to complete for pages to be clean so pages can be reclaimed. The cleaning of pages is getting pre-empted by sync IO until such point as pages cannot be reclaimed and they stall allowing writes to complete. I'll prototype something to disable low_latency if kswapd is awake. 
If it makes as difference, this might be plausible. As Jens would say though, this is "mostly hand-wavy nonsense". > As an idle speculation, I wonder if the sync vs async slice ratios may > not have been knocked out of kilter a bit by giving more to sync. > I don't know enough to speculate. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-26 12:19 [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 Mel Gorman 2009-11-26 13:08 ` Mike Galbraith @ 2009-11-26 13:47 ` Corrado Zoccolo 2009-11-26 14:17 ` Mel Gorman 2009-11-27 4:36 ` KOSAKI Motohiro 2 siblings, 1 reply; 23+ messages in thread From: Corrado Zoccolo @ 2009-11-26 13:47 UTC (permalink / raw) To: Mel Gorman Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Thu, Nov 26, 2009 at 1:19 PM, Mel Gorman <mel@csn.ul.ie> wrote: > (cc'ing the people from the page allocator failure thread as this might be > relevant to some of their problems) > > I know this is very last minute but I believe we should consider disabling > the "low_latency" tunable for block devices by default for 2.6.32. There was > evidence that low_latency was a problem last week for page allocation failure > reports but the reproduction-case was unusual and involved high-order atomic > allocations in low-memory conditions. It took another few days to accurately > show the problem for more normal workloads and it's a bit more wide-spread > than just allocation failures. > > Basically, low_latency looks great as long as you have plenty of memory > but in low memory situations, it appears to cause problems that manifest > as reduced performance, desktop stalls and in some cases, page allocation > failures. I think most kernel developers are not seeing the problem as they > tend to test on beefier machines and without hitting swap or low-memory > situations for the most part. When they are hitting low-memory situations, > it tends to be for stress tests where stalls and low performance are expected. The low latency tunable controls various policies inside cfq. The one that could affect memory reclaim is: /* * Async queues must wait a bit before being allowed dispatch. * We also ramp up the dispatch depth gradually for async IO, * based on the last sync IO we serviced */ if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) { unsigned long last_sync = jiffies - cfqd->last_end_sync_rq; unsigned int depth; depth = last_sync / cfqd->cfq_slice[1]; if (!depth && !cfqq->dispatched) depth = 1; if (depth < max_dispatch) max_dispatch = depth; } here the async queues max depth is limited to 1 for up to 200 ms after a sync I/O is completed. Note: dirty page writeback goes through an async queue, so it is penalized by this. This can affect both low and high end hardware. My non-NCQ sata disk can handle a depth of 2 when writing. NCQ sata disks can handle a depth up to 31, so limiting depth to 1 can cause write performance drop, and this in turn will slow down dirty page reclaim, and cause allocation failures. It would be good to re-test the OOM conditions with that code commented out. > > To show the problem, I used an x86-64 machine booting booted with 512MB of > memory. This is a small amount of RAM but the bug reports related to page > allocation failures were on smallish machines and the disks in the system > are not very high-performance. > > I used three tests. The first was sysbench on postgres running an IO-heavy > test against a large database with 10,000,000 rows. 
The second was IOZone > running most of the automatic tests with a record length of 4KB and the > last was a simulated launching of gitk with a music player running in the > background to act as a desktop-like scenario. The final test was similar > to the test described here http://lwn.net/Articles/362184/ except that > dm-crypt was not used as it has its own problems. low_latency was tested on other scenarios: http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html where it improved actual and perceived performance, so disabling it completely may not be good. Thanks, Corrado -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-26 13:47 ` Corrado Zoccolo @ 2009-11-26 14:17 ` Mel Gorman 2009-11-26 15:18 ` Corrado Zoccolo 2009-11-27 5:58 ` KOSAKI Motohiro 0 siblings, 2 replies; 23+ messages in thread From: Mel Gorman @ 2009-11-26 14:17 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Thu, Nov 26, 2009 at 02:47:10PM +0100, Corrado Zoccolo wrote: > On Thu, Nov 26, 2009 at 1:19 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > (cc'ing the people from the page allocator failure thread as this might be > > relevant to some of their problems) > > > > I know this is very last minute but I believe we should consider disabling > > the "low_latency" tunable for block devices by default for 2.6.32. There was > > evidence that low_latency was a problem last week for page allocation failure > > reports but the reproduction-case was unusual and involved high-order atomic > > allocations in low-memory conditions. It took another few days to accurately > > show the problem for more normal workloads and it's a bit more wide-spread > > than just allocation failures. > > > > Basically, low_latency looks great as long as you have plenty of memory > > but in low memory situations, it appears to cause problems that manifest > > as reduced performance, desktop stalls and in some cases, page allocation > > failures. I think most kernel developers are not seeing the problem as they > > tend to test on beefier machines and without hitting swap or low-memory > > situations for the most part. When they are hitting low-memory situations, > > it tends to be for stress tests where stalls and low performance are expected. > > The low latency tunable controls various policies inside cfq. > The one that could affect memory reclaim is: > /* > * Async queues must wait a bit before being allowed dispatch. > * We also ramp up the dispatch depth gradually for async IO, > * based on the last sync IO we serviced > */ > if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) { > unsigned long last_sync = jiffies - cfqd->last_end_sync_rq; > unsigned int depth; > > depth = last_sync / cfqd->cfq_slice[1]; > if (!depth && !cfqq->dispatched) > depth = 1; > if (depth < max_dispatch) > max_dispatch = depth; > } > > here the async queues max depth is limited to 1 for up to 200 ms after > a sync I/O is completed. > Note: dirty page writeback goes through an async queue, so it is > penalized by this. > > This can affect both low and high end hardware. My non-NCQ sata disk > can handle a depth of 2 when writing. NCQ sata disks can handle a > depth up to 31, so limiting depth to 1 can cause write performance > drop, and this in turn will slow down dirty page reclaim, and cause > allocation failures. > > It would be good to re-test the OOM conditions with that code commented out. > All of it or just the cfq_latency part? As it turns out the test machine does report for the disk NCQ (depth 31/32) and it's the same on the laptop so slowing down dirty page cleaning could be impacting reclaim. > > > > To show the problem, I used an x86-64 machine booting booted with 512MB of > > memory. This is a small amount of RAM but the bug reports related to page > > allocation failures were on smallish machines and the disks in the system > > are not very high-performance. 
> > > > I used three tests. The first was sysbench on postgres running an IO-heavy > > test against a large database with 10,000,000 rows. The second was IOZone > > running most of the automatic tests with a record length of 4KB and the > > last was a simulated launching of gitk with a music player running in the > > background to act as a desktop-like scenario. The final test was similar > > to the test described here http://lwn.net/Articles/362184/ except that > > dm-crypt was not used as it has its own problems. > > low_latency was tested on other scenarios: > http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html > where it improved actual and perceived performance, so disabling it > completely may not be good. > It may not indeed. In case you mean a partial disabling of cfq_latency, I'm try the following patch. The intention is to disable the low_latency logic if kswapd is at work and presumably needs clean pages. Alternative suggestions welcome. ====== cfq: Do not limit the async queue depth while kswapd is awake diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index aa1e953..dcab74e 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -1308,7 +1308,7 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) * We also ramp up the dispatch depth gradually for async IO, * based on the last sync IO we serviced */ - if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) { + if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency && !kswapd_awake()) { unsigned long last_sync = jiffies - cfqd->last_end_sync_rq; unsigned int depth; diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6f75617..b593aff 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -655,6 +655,7 @@ typedef struct pglist_data { void get_zone_counts(unsigned long *active, unsigned long *inactive, unsigned long *free); void build_all_zonelists(void); +int kswapd_awake(void); void wakeup_kswapd(struct zone *zone, int order); int zone_watermark_ok(struct zone *z, int order, unsigned long mark, int classzone_idx, int alloc_flags); diff --git a/mm/vmscan.c b/mm/vmscan.c index 777af57..75cdd9a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2201,6 +2201,15 @@ static int kswapd(void *p) return 0; } +int kswapd_awake(void) +{ + pg_data_t *pgdat; + for_each_online_pgdat(pgdat) + if (!waitqueue_active(&pgdat->kswapd_wait)) + return 1; + return 0; +} + /* * A zone is low on free memory, so wake its kswapd task to service it. */ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-26 14:17 ` Mel Gorman @ 2009-11-26 15:18 ` Corrado Zoccolo 2009-11-27 11:44 ` Mel Gorman 2009-11-27 5:58 ` KOSAKI Motohiro 1 sibling, 1 reply; 23+ messages in thread From: Corrado Zoccolo @ 2009-11-26 15:18 UTC (permalink / raw) To: Mel Gorman Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Thu, Nov 26, 2009 at 3:17 PM, Mel Gorman <mel@csn.ul.ie> wrote: > On Thu, Nov 26, 2009 at 02:47:10PM +0100, Corrado Zoccolo wrote: >> On Thu, Nov 26, 2009 at 1:19 PM, Mel Gorman <mel@csn.ul.ie> wrote: >> > (cc'ing the people from the page allocator failure thread as this might be >> > relevant to some of their problems) >> > >> > I know this is very last minute but I believe we should consider disabling >> > the "low_latency" tunable for block devices by default for 2.6.32. There was >> > evidence that low_latency was a problem last week for page allocation failure >> > reports but the reproduction-case was unusual and involved high-order atomic >> > allocations in low-memory conditions. It took another few days to accurately >> > show the problem for more normal workloads and it's a bit more wide-spread >> > than just allocation failures. >> > >> > Basically, low_latency looks great as long as you have plenty of memory >> > but in low memory situations, it appears to cause problems that manifest >> > as reduced performance, desktop stalls and in some cases, page allocation >> > failures. I think most kernel developers are not seeing the problem as they >> > tend to test on beefier machines and without hitting swap or low-memory >> > situations for the most part. When they are hitting low-memory situations, >> > it tends to be for stress tests where stalls and low performance are expected. >> >> The low latency tunable controls various policies inside cfq. >> The one that could affect memory reclaim is: >> /* >> * Async queues must wait a bit before being allowed dispatch. >> * We also ramp up the dispatch depth gradually for async IO, >> * based on the last sync IO we serviced >> */ >> if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) { >> unsigned long last_sync = jiffies - cfqd->last_end_sync_rq; >> unsigned int depth; >> >> depth = last_sync / cfqd->cfq_slice[1]; >> if (!depth && !cfqq->dispatched) >> depth = 1; >> if (depth < max_dispatch) >> max_dispatch = depth; >> } >> >> here the async queues max depth is limited to 1 for up to 200 ms after >> a sync I/O is completed. >> Note: dirty page writeback goes through an async queue, so it is >> penalized by this. >> >> This can affect both low and high end hardware. My non-NCQ sata disk >> can handle a depth of 2 when writing. NCQ sata disks can handle a >> depth up to 31, so limiting depth to 1 can cause write performance >> drop, and this in turn will slow down dirty page reclaim, and cause >> allocation failures. >> >> It would be good to re-test the OOM conditions with that code commented out. >> > > All of it or just the cfq_latency part? The whole if, that is enabled only with cfq_latency. > > As it turns out the test machine does report for the disk NCQ (depth 31/32) > and it's the same on the laptop so slowing down dirty page cleaning > could be impacting reclaim. Yes, I think so. 
> >> > >> > To show the problem, I used an x86-64 machine booting booted with 512MB of >> > memory. This is a small amount of RAM but the bug reports related to page >> > allocation failures were on smallish machines and the disks in the system >> > are not very high-performance. >> > >> > I used three tests. The first was sysbench on postgres running an IO-heavy >> > test against a large database with 10,000,000 rows. The second was IOZone >> > running most of the automatic tests with a record length of 4KB and the >> > last was a simulated launching of gitk with a music player running in the >> > background to act as a desktop-like scenario. The final test was similar >> > to the test described here http://lwn.net/Articles/362184/ except that >> > dm-crypt was not used as it has its own problems. >> >> low_latency was tested on other scenarios: >> http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html >> http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html >> where it improved actual and perceived performance, so disabling it >> completely may not be good. >> > > It may not indeed. > > In case you mean a partial disabling of cfq_latency, I'm try the > following patch. The intention is to disable the low_latency logic if > kswapd is at work and presumably needs clean pages. Alternative > suggestions welcome. Yes, I meant exactly to disable that part, and doing it when kswapd is active is probably a good choice. I have a different idea for 2.6.33, though. If you have a reliable reproducer of the issue, can you test it on git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33? It may already be unaffected, since we had various performance improvements there, but I think a better way to boost writeback is possible. Thanks, Corrado > > ====== > cfq: Do not limit the async queue depth while kswapd is awake > > diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c > index aa1e953..dcab74e 100644 > --- a/block/cfq-iosched.c > +++ b/block/cfq-iosched.c > @@ -1308,7 +1308,7 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq) > * We also ramp up the dispatch depth gradually for async IO, > * based on the last sync IO we serviced > */ > - if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) { > + if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency && !kswapd_awake()) { > unsigned long last_sync = jiffies - cfqd->last_end_sync_rq; > unsigned int depth; > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 6f75617..b593aff 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -655,6 +655,7 @@ typedef struct pglist_data { > void get_zone_counts(unsigned long *active, unsigned long *inactive, > unsigned long *free); > void build_all_zonelists(void); > +int kswapd_awake(void); > void wakeup_kswapd(struct zone *zone, int order); > int zone_watermark_ok(struct zone *z, int order, unsigned long mark, > int classzone_idx, int alloc_flags); > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 777af57..75cdd9a 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2201,6 +2201,15 @@ static int kswapd(void *p) > return 0; > } > > +int kswapd_awake(void) > +{ > + pg_data_t *pgdat; > + for_each_online_pgdat(pgdat) > + if (!waitqueue_active(&pgdat->kswapd_wait)) > + return 1; > + return 0; > +} > + > /* > * A zone is low on free memory, so wake its kswapd task to service it. > */ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-26 15:18 ` Corrado Zoccolo @ 2009-11-27 11:44 ` Mel Gorman 2009-11-27 12:03 ` Corrado Zoccolo 0 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2009-11-27 11:44 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Thu, Nov 26, 2009 at 04:18:18PM +0100, Corrado Zoccolo wrote: > > <SNIP> > > > > In case you mean a partial disabling of cfq_latency, I'm try the > > following patch. The intention is to disable the low_latency logic if > > kswapd is at work and presumably needs clean pages. Alternative > > suggestions welcome. As it turned out, that patch sucked so I aborted the test and I need to think about it a lot more. > Yes, I meant exactly to disable that part, and doing it when kswapd is > active is probably a good choice. > I have a different idea for 2.6.33, though. > If you have a reliable reproducer of the issue, can you test it on > git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33? > It may already be unaffected, since we had various performance > improvements there, but I think a better way to boost writeback is > possible. > I haven't tested the high-order allocation scenario yet but the results as thing stands are below. There are four kernels being compared 1. with-low-latency is 2.6.32-rc8 vanilla 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula" 4. without-low-latency is with low_latency disabled SYSBENCH sysbench-with low-latency low-latency sysbench-without low-latency block-2.6.33 async-rampup low-latency 1 1266.02 ( 0.00%) 824.08 (-53.63%) 1265.15 (-0.07%) 1278.55 ( 0.98%) 2 1182.58 ( 0.00%) 1226.42 ( 3.57%) 1223.03 ( 3.31%) 1379.25 (14.26%) 3 1218.64 ( 0.00%) 1271.38 ( 4.15%) 1246.42 ( 2.23%) 1580.08 (22.87%) 4 1212.11 ( 0.00%) 1257.84 ( 3.64%) 1325.17 ( 8.53%) 1534.17 (20.99%) 5 1046.77 ( 0.00%) 981.71 (-6.63%) 1008.44 (-3.80%) 1552.48 (32.57%) 6 1187.14 ( 0.00%) 1132.89 (-4.79%) 1147.18 (-3.48%) 1661.19 (28.54%) 7 1179.37 ( 0.00%) 1183.61 ( 0.36%) 1202.49 ( 1.92%) 790.26 (-49.24%) 8 1164.62 ( 0.00%) 1143.54 (-1.84%) 1184.56 ( 1.68%) 854.10 (-36.36%) 9 1095.22 ( 0.00%) 1178.72 ( 7.08%) 1002.42 (-9.26%) 1655.04 (33.83%) 10 1147.52 ( 0.00%) 1153.46 ( 0.52%) 1151.73 ( 0.37%) 1653.89 (30.62%) 11 823.38 ( 0.00%) 820.64 (-0.33%) 754.15 (-9.18%) 1627.45 (49.41%) 12 813.73 ( 0.00%) 791.44 (-2.82%) 848.32 ( 4.08%) 1494.63 (45.56%) 13 898.22 ( 0.00%) 789.63 (-13.75%) 931.47 ( 3.57%) 1521.64 (40.97%) 14 873.50 ( 0.00%) 938.90 ( 6.97%) 875.75 ( 0.26%) 1311.09 (33.38%) 15 808.32 ( 0.00%) 979.88 (17.51%) 877.87 ( 7.92%) 1009.70 (19.94%) 16 758.17 ( 0.00%) 1096.81 (30.87%) 881.23 (13.96%) 725.17 (-4.55%) sysbench is helped by both both block-2.6.33 and async-rampup to some extent. For many of the results, plain old disabling low_latency still helps the most. 
desktop-net-gitk gitk-with low-latency low-latency gitk-without low-latency block-2.6.33 async-rampup low-latency min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%) mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%) stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%) max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%) The changes for block in 2.6.33 make a massive difference here, notably beating the disabling of low_latency. IOZone iozone-with low-latency low-latency iozone-without low-latency block-2.6.33 async-rampup low-latency write-64 151212 ( 0.00%) 163359 ( 7.44%) 163359 ( 7.44%) 159856 ( 5.41%) write-128 189357 ( 0.00%) 184922 (-2.40%) 202805 ( 6.63%) 206233 ( 8.18%) write-256 219883 ( 0.00%) 211232 (-4.10%) 189867 (-15.81%) 223174 ( 1.47%) write-512 224932 ( 0.00%) 222601 (-1.05%) 204459 (-10.01%) 220227 (-2.14%) write-1024 227738 ( 0.00%) 226728 (-0.45%) 216009 (-5.43%) 226155 (-0.70%) write-2048 227564 ( 0.00%) 224167 (-1.52%) 229387 ( 0.79%) 224848 (-1.21%) write-4096 208556 ( 0.00%) 227707 ( 8.41%) 216908 ( 3.85%) 223430 ( 6.66%) write-8192 219484 ( 0.00%) 222365 ( 1.30%) 217737 (-0.80%) 219389 (-0.04%) write-16384 206670 ( 0.00%) 209355 ( 1.28%) 204146 (-1.24%) 206295 (-0.18%) write-32768 203023 ( 0.00%) 205097 ( 1.01%) 199766 (-1.63%) 201852 (-0.58%) write-65536 162134 ( 0.00%) 196670 (17.56%) 189975 (14.66%) 189173 (14.29%) write-131072 68534 ( 0.00%) 69145 ( 0.88%) 64519 (-6.22%) 67417 (-1.66%) write-262144 32936 ( 0.00%) 28587 (-15.21%) 31470 (-4.66%) 27750 (-18.69%) write-524288 24044 ( 0.00%) 23560 (-2.05%) 23116 (-4.01%) 23759 (-1.20%) rewrite-64 755681 ( 0.00%) 800767 ( 5.63%) 469931 (-60.81%) 755681 ( 0.00%) rewrite-128 581518 ( 0.00%) 639723 ( 9.10%) 591774 ( 1.73%) 799840 (27.30%) rewrite-256 639427 ( 0.00%) 710511 (10.00%) 666414 ( 4.05%) 659861 ( 3.10%) rewrite-512 669577 ( 0.00%) 743788 ( 9.98%) 692017 ( 3.24%) 684954 ( 2.24%) rewrite-1024 680960 ( 0.00%) 755195 ( 9.83%) 701422 ( 2.92%) 686182 ( 0.76%) rewrite-2048 685263 ( 0.00%) 743123 ( 7.79%) 703445 ( 2.58%) 692780 ( 1.09%) rewrite-4096 631352 ( 0.00%) 686776 ( 8.07%) 640007 ( 1.35%) 643266 ( 1.85%) rewrite-8192 442146 ( 0.00%) 474089 ( 6.74%) 457768 ( 3.41%) 442624 ( 0.11%) rewrite-16384 428641 ( 0.00%) 454857 ( 5.76%) 442896 ( 3.22%) 432613 ( 0.92%) rewrite-32768 425361 ( 0.00%) 444206 ( 4.24%) 434472 ( 2.10%) 430568 ( 1.21%) rewrite-65536 405183 ( 0.00%) 433898 ( 6.62%) 419843 ( 3.49%) 389242 (-4.10%) rewrite-131072 66110 ( 0.00%) 58370 (-13.26%) 54342 (-21.66%) 58472 (-13.06%) rewrite-262144 29254 ( 0.00%) 24665 (-18.61%) 25710 (-13.78%) 29306 ( 0.18%) rewrite-524288 23812 ( 0.00%) 20742 (-14.80%) 22490 (-5.88%) 24543 ( 2.98%) read-64 934589 ( 0.00%) 1160938 (19.50%) 1004538 ( 6.96%) 840903 (-11.14%) read-128 1601534 ( 0.00%) 1869179 (14.32%) 1681806 ( 4.77%) 1280633 (-25.06%) read-256 1255511 ( 0.00%) 1526887 (17.77%) 1304314 ( 3.74%) 1310683 ( 4.21%) read-512 1291158 ( 0.00%) 1377278 ( 6.25%) 1336145 ( 3.37%) 1319723 ( 2.16%) read-1024 1319408 ( 0.00%) 1306564 (-0.98%) 1368162 ( 3.56%) 1347557 ( 2.09%) read-2048 1316016 ( 0.00%) 1394645 ( 5.64%) 1339827 ( 1.78%) 1347393 ( 2.33%) read-4096 1253710 ( 0.00%) 1307525 ( 4.12%) 1247519 (-0.50%) 1251882 (-0.15%) read-8192 995149 ( 0.00%) 1033337 ( 3.70%) 1016944 ( 2.14%) 1011794 ( 1.65%) read-16384 883156 ( 0.00%) 905213 ( 2.44%) 905213 ( 2.44%) 897458 ( 1.59%) read-32768 844368 ( 0.00%) 855213 ( 1.27%) 849609 ( 0.62%) 856364 ( 1.40%) read-65536 816099 ( 0.00%) 839262 ( 2.76%) 835019 ( 2.27%) 
826473 ( 1.26%) read-131072 818055 ( 0.00%) 837369 ( 2.31%) 828230 ( 1.23%) 824351 ( 0.76%) read-262144 827225 ( 0.00%) 839635 ( 1.48%) 840538 ( 1.58%) 835693 ( 1.01%) read-524288 24653 ( 0.00%) 21387 (-15.27%) 20602 (-19.66%) 22519 (-9.48%) reread-64 2329708 ( 0.00%) 2251544 (-3.47%) 1985134 (-17.36%) 1985134 (-17.36%) reread-128 1446222 ( 0.00%) 1979446 (26.94%) 2009076 (28.02%) 2137031 (32.33%) reread-256 1828508 ( 0.00%) 2006158 ( 8.86%) 1892980 ( 3.41%) 1879725 ( 2.72%) reread-512 1521718 ( 0.00%) 1642783 ( 7.37%) 1508887 (-0.85%) 1579934 ( 3.68%) reread-1024 1347557 ( 0.00%) 1422540 ( 5.27%) 1384034 ( 2.64%) 1375171 ( 2.01%) reread-2048 1340664 ( 0.00%) 1413929 ( 5.18%) 1372364 ( 2.31%) 1350783 ( 0.75%) reread-4096 1259592 ( 0.00%) 1324868 ( 4.93%) 1273788 ( 1.11%) 1284839 ( 1.96%) reread-8192 1007285 ( 0.00%) 1033710 ( 2.56%) 1027159 ( 1.93%) 1011317 ( 0.40%) reread-16384 891404 ( 0.00%) 910828 ( 2.13%) 916562 ( 2.74%) 905022 ( 1.50%) reread-32768 850492 ( 0.00%) 859341 ( 1.03%) 856385 ( 0.69%) 862772 ( 1.42%) reread-65536 836565 ( 0.00%) 852664 ( 1.89%) 852315 ( 1.85%) 847020 ( 1.23%) reread-131072 844516 ( 0.00%) 862590 ( 2.10%) 854067 ( 1.12%) 853155 ( 1.01%) reread-262144 851524 ( 0.00%) 860559 ( 1.05%) 864921 ( 1.55%) 860653 ( 1.06%) reread-524288 24927 ( 0.00%) 21300 (-17.03%) 19748 (-26.23%) 22487 (-10.85%) randread-64 1605256 ( 0.00%) 1605256 ( 0.00%) 1605256 ( 0.00%) 1775099 ( 9.57%) randread-128 1179358 ( 0.00%) 1582649 (25.48%) 1511363 (21.97%) 1528576 (22.85%) randread-256 1421755 ( 0.00%) 1599680 (11.12%) 1460430 ( 2.65%) 1310683 (-8.47%) randread-512 1306873 ( 0.00%) 1278855 (-2.19%) 1243315 (-5.11%) 1281909 (-1.95%) randread-1024 1201314 ( 0.00%) 1254656 ( 4.25%) 1190657 (-0.90%) 1231629 ( 2.46%) randread-2048 1179413 ( 0.00%) 1227971 ( 3.95%) 1185272 ( 0.49%) 1190529 ( 0.93%) randread-4096 1107005 ( 0.00%) 1160862 ( 4.64%) 1110727 ( 0.34%) 1116792 ( 0.88%) randread-8192 894337 ( 0.00%) 924264 ( 3.24%) 912676 ( 2.01%) 899487 ( 0.57%) randread-16384 783760 ( 0.00%) 800299 ( 2.07%) 793351 ( 1.21%) 791341 ( 0.96%) randread-32768 740498 ( 0.00%) 743720 ( 0.43%) 741233 ( 0.10%) 743511 ( 0.41%) randread-65536 721640 ( 0.00%) 727692 ( 0.83%) 726984 ( 0.74%) 728139 ( 0.89%) randread-131072 715284 ( 0.00%) 722094 ( 0.94%) 717746 ( 0.34%) 720825 ( 0.77%) randread-262144 709855 ( 0.00%) 706770 (-0.44%) 709133 (-0.10%) 714943 ( 0.71%) randread-524288 394 ( 0.00%) 421 ( 6.41%) 418 ( 5.74%) 431 ( 8.58%) randwrite-64 730988 ( 0.00%) 764288 ( 4.36%) 723111 (-1.09%) 730988 ( 0.00%) randwrite-128 746459 ( 0.00%) 799840 ( 6.67%) 746459 ( 0.00%) 742331 (-0.56%) randwrite-256 695778 ( 0.00%) 752329 ( 7.52%) 720041 ( 3.37%) 727850 ( 4.41%) randwrite-512 666253 ( 0.00%) 722760 ( 7.82%) 667081 ( 0.12%) 691126 ( 3.60%) randwrite-1024 651223 ( 0.00%) 697776 ( 6.67%) 663292 ( 1.82%) 659625 ( 1.27%) randwrite-2048 655558 ( 0.00%) 691887 ( 5.25%) 665720 ( 1.53%) 664073 ( 1.28%) randwrite-4096 635556 ( 0.00%) 662721 ( 4.10%) 643170 ( 1.18%) 642400 ( 1.07%) randwrite-8192 467357 ( 0.00%) 491364 ( 4.89%) 476720 ( 1.96%) 469734 ( 0.51%) randwrite-16384 413188 ( 0.00%) 427521 ( 3.35%) 417353 ( 1.00%) 417282 ( 0.98%) randwrite-32768 404161 ( 0.00%) 411721 ( 1.84%) 404942 ( 0.19%) 407580 ( 0.84%) randwrite-65536 379372 ( 0.00%) 397312 ( 4.52%) 386853 ( 1.93%) 381273 ( 0.50%) randwrite-131072 21780 ( 0.00%) 16924 (-28.69%) 21177 (-2.85%) 19758 (-10.23%) randwrite-262144 6249 ( 0.00%) 5548 (-12.64%) 6370 ( 1.90%) 6316 ( 1.06%) randwrite-524288 2915 ( 0.00%) 2582 (-12.90%) 2871 (-1.53%) 2859 (-1.96%) 
bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%) 1004538 (-13.60%) 1141196 ( 0.00%) bkwdread-128 1066865 ( 0.00%) 1386465 (23.05%) 1400936 (23.85%) 1101900 ( 3.18%) bkwdread-256 877797 ( 0.00%) 1105556 (20.60%) 1105556 (20.60%) 1105556 (20.60%) bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%) 1175271 ( 3.59%) 1162547 ( 2.53%) bkwdread-1024 1163562 ( 0.00%) 1206714 ( 3.58%) 1213534 ( 4.12%) 1195962 ( 2.71%) bkwdread-2048 1163439 ( 0.00%) 1218910 ( 4.55%) 1204552 ( 3.41%) 1204552 ( 3.41%) bkwdread-4096 1116792 ( 0.00%) 1175477 ( 4.99%) 1159922 ( 3.72%) 1150600 ( 2.94%) bkwdread-8192 912288 ( 0.00%) 935233 ( 2.45%) 944695 ( 3.43%) 934724 ( 2.40%) bkwdread-16384 817707 ( 0.00%) 824140 ( 0.78%) 832527 ( 1.78%) 829152 ( 1.38%) bkwdread-32768 775898 ( 0.00%) 773714 (-0.28%) 785494 ( 1.22%) 787691 ( 1.50%) bkwdread-65536 759643 ( 0.00%) 769924 ( 1.34%) 778780 ( 2.46%) 772174 ( 1.62%) bkwdread-131072 763215 ( 0.00%) 769634 ( 0.83%) 773707 ( 1.36%) 773816 ( 1.37%) bkwdread-262144 765491 ( 0.00%) 768992 ( 0.46%) 780876 ( 1.97%) 780021 ( 1.86%) bkwdread-524288 3688 ( 0.00%) 3595 (-2.59%) 3577 (-3.10%) 3724 ( 0.97%) The upcoming changes for 2.6.33 also help iozone in many cases, often by more than just disabling low_latency. It has the occasional massive gain or loss for the larger file sizes. I don't know why this is but as the big losses appear to be mostly in the write-tests, I would guess that it's differences in heavy-writer-throttling. The only downside with block-2.6.33 is that there are a lot of patches in there and doesn't help with the 2.6.32 release as such. I could do a reverse bisect to see what helps the most in there but under ideal conditions, it'll take 3 days to complete and I wouldn't be able to start until Monday as I'm out of the country for the weekend. That's a bit late. p.s. As a consequence of being out of the country, I also won't be able to respond to mail over the weekend. -- Mel Gorman -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-27 11:44 ` Mel Gorman @ 2009-11-27 12:03 ` Corrado Zoccolo 2009-11-27 15:58 ` Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Corrado Zoccolo @ 2009-11-27 12:03 UTC (permalink / raw) To: Mel Gorman Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <mel@csn.ul.ie> wrote: > On Thu, Nov 26, 2009 at 04:18:18PM +0100, Corrado Zoccolo wrote: >> > <SNIP> >> > >> > In case you mean a partial disabling of cfq_latency, I'm try the >> > following patch. The intention is to disable the low_latency logic if >> > kswapd is at work and presumably needs clean pages. Alternative >> > suggestions welcome. > > As it turned out, that patch sucked so I aborted the test and I need to > think about it a lot more. What about using the dirty ratio, instead of checking if kswapd is running? >> Yes, I meant exactly to disable that part, and doing it when kswapd is >> active is probably a good choice. >> I have a different idea for 2.6.33, though. >> If you have a reliable reproducer of the issue, can you test it on >> git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33? >> It may already be unaffected, since we had various performance >> improvements there, but I think a better way to boost writeback is >> possible. >> > > I haven't tested the high-order allocation scenario yet but the results > as thing stands are below. There are four kernels being compared > > 1. with-low-latency is 2.6.32-rc8 vanilla > 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied > 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula" > 4. without-low-latency is with low_latency disabled > > SYSBENCH > sysbench-with low-latency low-latency sysbench-without > low-latency block-2.6.33 async-rampup low-latency > 1 1266.02 ( 0.00%) 824.08 (-53.63%) 1265.15 (-0.07%) 1278.55 ( 0.98%) > 2 1182.58 ( 0.00%) 1226.42 ( 3.57%) 1223.03 ( 3.31%) 1379.25 (14.26%) > 3 1218.64 ( 0.00%) 1271.38 ( 4.15%) 1246.42 ( 2.23%) 1580.08 (22.87%) > 4 1212.11 ( 0.00%) 1257.84 ( 3.64%) 1325.17 ( 8.53%) 1534.17 (20.99%) > 5 1046.77 ( 0.00%) 981.71 (-6.63%) 1008.44 (-3.80%) 1552.48 (32.57%) > 6 1187.14 ( 0.00%) 1132.89 (-4.79%) 1147.18 (-3.48%) 1661.19 (28.54%) > 7 1179.37 ( 0.00%) 1183.61 ( 0.36%) 1202.49 ( 1.92%) 790.26 (-49.24%) > 8 1164.62 ( 0.00%) 1143.54 (-1.84%) 1184.56 ( 1.68%) 854.10 (-36.36%) > 9 1095.22 ( 0.00%) 1178.72 ( 7.08%) 1002.42 (-9.26%) 1655.04 (33.83%) > 10 1147.52 ( 0.00%) 1153.46 ( 0.52%) 1151.73 ( 0.37%) 1653.89 (30.62%) > 11 823.38 ( 0.00%) 820.64 (-0.33%) 754.15 (-9.18%) 1627.45 (49.41%) > 12 813.73 ( 0.00%) 791.44 (-2.82%) 848.32 ( 4.08%) 1494.63 (45.56%) > 13 898.22 ( 0.00%) 789.63 (-13.75%) 931.47 ( 3.57%) 1521.64 (40.97%) > 14 873.50 ( 0.00%) 938.90 ( 6.97%) 875.75 ( 0.26%) 1311.09 (33.38%) > 15 808.32 ( 0.00%) 979.88 (17.51%) 877.87 ( 7.92%) 1009.70 (19.94%) > 16 758.17 ( 0.00%) 1096.81 (30.87%) 881.23 (13.96%) 725.17 (-4.55%) > > sysbench is helped by both both block-2.6.33 and async-rampup to some > extent. For many of the results, plain old disabling low_latency still > helps the most. 
> > desktop-net-gitk > gitk-with low-latency low-latency gitk-without > low-latency block-2.6.33 async-rampup low-latency > min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%) > mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%) > stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%) > max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%) > > The changes for block in 2.6.33 make a massive difference here, notably > beating the disabling of low_latency. Yes. These are read of lots of small files, so the improvements for seeky workload we introduced in 2.6.33 helps a lot here. > > IOZone > iozone-with low-latency low-latency iozone-without > low-latency block-2.6.33 async-rampup low-latency > write-64 151212 ( 0.00%) 163359 ( 7.44%) 163359 ( 7.44%) 159856 ( 5.41%) > write-128 189357 ( 0.00%) 184922 (-2.40%) 202805 ( 6.63%) 206233 ( 8.18%) > write-256 219883 ( 0.00%) 211232 (-4.10%) 189867 (-15.81%) 223174 ( 1.47%) > write-512 224932 ( 0.00%) 222601 (-1.05%) 204459 (-10.01%) 220227 (-2.14%) > write-1024 227738 ( 0.00%) 226728 (-0.45%) 216009 (-5.43%) 226155 (-0.70%) > write-2048 227564 ( 0.00%) 224167 (-1.52%) 229387 ( 0.79%) 224848 (-1.21%) > write-4096 208556 ( 0.00%) 227707 ( 8.41%) 216908 ( 3.85%) 223430 ( 6.66%) > write-8192 219484 ( 0.00%) 222365 ( 1.30%) 217737 (-0.80%) 219389 (-0.04%) > write-16384 206670 ( 0.00%) 209355 ( 1.28%) 204146 (-1.24%) 206295 (-0.18%) > write-32768 203023 ( 0.00%) 205097 ( 1.01%) 199766 (-1.63%) 201852 (-0.58%) > write-65536 162134 ( 0.00%) 196670 (17.56%) 189975 (14.66%) 189173 (14.29%) > write-131072 68534 ( 0.00%) 69145 ( 0.88%) 64519 (-6.22%) 67417 (-1.66%) > write-262144 32936 ( 0.00%) 28587 (-15.21%) 31470 (-4.66%) 27750 (-18.69%) > write-524288 24044 ( 0.00%) 23560 (-2.05%) 23116 (-4.01%) 23759 (-1.20%) > rewrite-64 755681 ( 0.00%) 800767 ( 5.63%) 469931 (-60.81%) 755681 ( 0.00%) > rewrite-128 581518 ( 0.00%) 639723 ( 9.10%) 591774 ( 1.73%) 799840 (27.30%) > rewrite-256 639427 ( 0.00%) 710511 (10.00%) 666414 ( 4.05%) 659861 ( 3.10%) > rewrite-512 669577 ( 0.00%) 743788 ( 9.98%) 692017 ( 3.24%) 684954 ( 2.24%) > rewrite-1024 680960 ( 0.00%) 755195 ( 9.83%) 701422 ( 2.92%) 686182 ( 0.76%) > rewrite-2048 685263 ( 0.00%) 743123 ( 7.79%) 703445 ( 2.58%) 692780 ( 1.09%) > rewrite-4096 631352 ( 0.00%) 686776 ( 8.07%) 640007 ( 1.35%) 643266 ( 1.85%) > rewrite-8192 442146 ( 0.00%) 474089 ( 6.74%) 457768 ( 3.41%) 442624 ( 0.11%) > rewrite-16384 428641 ( 0.00%) 454857 ( 5.76%) 442896 ( 3.22%) 432613 ( 0.92%) > rewrite-32768 425361 ( 0.00%) 444206 ( 4.24%) 434472 ( 2.10%) 430568 ( 1.21%) > rewrite-65536 405183 ( 0.00%) 433898 ( 6.62%) 419843 ( 3.49%) 389242 (-4.10%) > rewrite-131072 66110 ( 0.00%) 58370 (-13.26%) 54342 (-21.66%) 58472 (-13.06%) > rewrite-262144 29254 ( 0.00%) 24665 (-18.61%) 25710 (-13.78%) 29306 ( 0.18%) > rewrite-524288 23812 ( 0.00%) 20742 (-14.80%) 22490 (-5.88%) 24543 ( 2.98%) > read-64 934589 ( 0.00%) 1160938 (19.50%) 1004538 ( 6.96%) 840903 (-11.14%) > read-128 1601534 ( 0.00%) 1869179 (14.32%) 1681806 ( 4.77%) 1280633 (-25.06%) > read-256 1255511 ( 0.00%) 1526887 (17.77%) 1304314 ( 3.74%) 1310683 ( 4.21%) > read-512 1291158 ( 0.00%) 1377278 ( 6.25%) 1336145 ( 3.37%) 1319723 ( 2.16%) > read-1024 1319408 ( 0.00%) 1306564 (-0.98%) 1368162 ( 3.56%) 1347557 ( 2.09%) > read-2048 1316016 ( 0.00%) 1394645 ( 5.64%) 1339827 ( 1.78%) 1347393 ( 2.33%) > read-4096 1253710 ( 0.00%) 1307525 ( 4.12%) 1247519 (-0.50%) 1251882 (-0.15%) > read-8192 995149 ( 0.00%) 1033337 ( 3.70%) 1016944 ( 2.14%) 
1011794 ( 1.65%) > read-16384 883156 ( 0.00%) 905213 ( 2.44%) 905213 ( 2.44%) 897458 ( 1.59%) > read-32768 844368 ( 0.00%) 855213 ( 1.27%) 849609 ( 0.62%) 856364 ( 1.40%) > read-65536 816099 ( 0.00%) 839262 ( 2.76%) 835019 ( 2.27%) 826473 ( 1.26%) > read-131072 818055 ( 0.00%) 837369 ( 2.31%) 828230 ( 1.23%) 824351 ( 0.76%) > read-262144 827225 ( 0.00%) 839635 ( 1.48%) 840538 ( 1.58%) 835693 ( 1.01%) > read-524288 24653 ( 0.00%) 21387 (-15.27%) 20602 (-19.66%) 22519 (-9.48%) > reread-64 2329708 ( 0.00%) 2251544 (-3.47%) 1985134 (-17.36%) 1985134 (-17.36%) > reread-128 1446222 ( 0.00%) 1979446 (26.94%) 2009076 (28.02%) 2137031 (32.33%) > reread-256 1828508 ( 0.00%) 2006158 ( 8.86%) 1892980 ( 3.41%) 1879725 ( 2.72%) > reread-512 1521718 ( 0.00%) 1642783 ( 7.37%) 1508887 (-0.85%) 1579934 ( 3.68%) > reread-1024 1347557 ( 0.00%) 1422540 ( 5.27%) 1384034 ( 2.64%) 1375171 ( 2.01%) > reread-2048 1340664 ( 0.00%) 1413929 ( 5.18%) 1372364 ( 2.31%) 1350783 ( 0.75%) > reread-4096 1259592 ( 0.00%) 1324868 ( 4.93%) 1273788 ( 1.11%) 1284839 ( 1.96%) > reread-8192 1007285 ( 0.00%) 1033710 ( 2.56%) 1027159 ( 1.93%) 1011317 ( 0.40%) > reread-16384 891404 ( 0.00%) 910828 ( 2.13%) 916562 ( 2.74%) 905022 ( 1.50%) > reread-32768 850492 ( 0.00%) 859341 ( 1.03%) 856385 ( 0.69%) 862772 ( 1.42%) > reread-65536 836565 ( 0.00%) 852664 ( 1.89%) 852315 ( 1.85%) 847020 ( 1.23%) > reread-131072 844516 ( 0.00%) 862590 ( 2.10%) 854067 ( 1.12%) 853155 ( 1.01%) > reread-262144 851524 ( 0.00%) 860559 ( 1.05%) 864921 ( 1.55%) 860653 ( 1.06%) > reread-524288 24927 ( 0.00%) 21300 (-17.03%) 19748 (-26.23%) 22487 (-10.85%) > randread-64 1605256 ( 0.00%) 1605256 ( 0.00%) 1605256 ( 0.00%) 1775099 ( 9.57%) > randread-128 1179358 ( 0.00%) 1582649 (25.48%) 1511363 (21.97%) 1528576 (22.85%) > randread-256 1421755 ( 0.00%) 1599680 (11.12%) 1460430 ( 2.65%) 1310683 (-8.47%) > randread-512 1306873 ( 0.00%) 1278855 (-2.19%) 1243315 (-5.11%) 1281909 (-1.95%) > randread-1024 1201314 ( 0.00%) 1254656 ( 4.25%) 1190657 (-0.90%) 1231629 ( 2.46%) > randread-2048 1179413 ( 0.00%) 1227971 ( 3.95%) 1185272 ( 0.49%) 1190529 ( 0.93%) > randread-4096 1107005 ( 0.00%) 1160862 ( 4.64%) 1110727 ( 0.34%) 1116792 ( 0.88%) > randread-8192 894337 ( 0.00%) 924264 ( 3.24%) 912676 ( 2.01%) 899487 ( 0.57%) > randread-16384 783760 ( 0.00%) 800299 ( 2.07%) 793351 ( 1.21%) 791341 ( 0.96%) > randread-32768 740498 ( 0.00%) 743720 ( 0.43%) 741233 ( 0.10%) 743511 ( 0.41%) > randread-65536 721640 ( 0.00%) 727692 ( 0.83%) 726984 ( 0.74%) 728139 ( 0.89%) > randread-131072 715284 ( 0.00%) 722094 ( 0.94%) 717746 ( 0.34%) 720825 ( 0.77%) > randread-262144 709855 ( 0.00%) 706770 (-0.44%) 709133 (-0.10%) 714943 ( 0.71%) > randread-524288 394 ( 0.00%) 421 ( 6.41%) 418 ( 5.74%) 431 ( 8.58%) > randwrite-64 730988 ( 0.00%) 764288 ( 4.36%) 723111 (-1.09%) 730988 ( 0.00%) > randwrite-128 746459 ( 0.00%) 799840 ( 6.67%) 746459 ( 0.00%) 742331 (-0.56%) > randwrite-256 695778 ( 0.00%) 752329 ( 7.52%) 720041 ( 3.37%) 727850 ( 4.41%) > randwrite-512 666253 ( 0.00%) 722760 ( 7.82%) 667081 ( 0.12%) 691126 ( 3.60%) > randwrite-1024 651223 ( 0.00%) 697776 ( 6.67%) 663292 ( 1.82%) 659625 ( 1.27%) > randwrite-2048 655558 ( 0.00%) 691887 ( 5.25%) 665720 ( 1.53%) 664073 ( 1.28%) > randwrite-4096 635556 ( 0.00%) 662721 ( 4.10%) 643170 ( 1.18%) 642400 ( 1.07%) > randwrite-8192 467357 ( 0.00%) 491364 ( 4.89%) 476720 ( 1.96%) 469734 ( 0.51%) > randwrite-16384 413188 ( 0.00%) 427521 ( 3.35%) 417353 ( 1.00%) 417282 ( 0.98%) > randwrite-32768 404161 ( 0.00%) 411721 ( 1.84%) 404942 ( 0.19%) 407580 ( 
0.84%) > randwrite-65536 379372 ( 0.00%) 397312 ( 4.52%) 386853 ( 1.93%) 381273 ( 0.50%) > randwrite-131072 21780 ( 0.00%) 16924 (-28.69%) 21177 (-2.85%) 19758 (-10.23%) > randwrite-262144 6249 ( 0.00%) 5548 (-12.64%) 6370 ( 1.90%) 6316 ( 1.06%) > randwrite-524288 2915 ( 0.00%) 2582 (-12.90%) 2871 (-1.53%) 2859 (-1.96%) > bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%) 1004538 (-13.60%) 1141196 ( 0.00%) > bkwdread-128 1066865 ( 0.00%) 1386465 (23.05%) 1400936 (23.85%) 1101900 ( 3.18%) > bkwdread-256 877797 ( 0.00%) 1105556 (20.60%) 1105556 (20.60%) 1105556 (20.60%) > bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%) 1175271 ( 3.59%) 1162547 ( 2.53%) > bkwdread-1024 1163562 ( 0.00%) 1206714 ( 3.58%) 1213534 ( 4.12%) 1195962 ( 2.71%) > bkwdread-2048 1163439 ( 0.00%) 1218910 ( 4.55%) 1204552 ( 3.41%) 1204552 ( 3.41%) > bkwdread-4096 1116792 ( 0.00%) 1175477 ( 4.99%) 1159922 ( 3.72%) 1150600 ( 2.94%) > bkwdread-8192 912288 ( 0.00%) 935233 ( 2.45%) 944695 ( 3.43%) 934724 ( 2.40%) > bkwdread-16384 817707 ( 0.00%) 824140 ( 0.78%) 832527 ( 1.78%) 829152 ( 1.38%) > bkwdread-32768 775898 ( 0.00%) 773714 (-0.28%) 785494 ( 1.22%) 787691 ( 1.50%) > bkwdread-65536 759643 ( 0.00%) 769924 ( 1.34%) 778780 ( 2.46%) 772174 ( 1.62%) > bkwdread-131072 763215 ( 0.00%) 769634 ( 0.83%) 773707 ( 1.36%) 773816 ( 1.37%) > bkwdread-262144 765491 ( 0.00%) 768992 ( 0.46%) 780876 ( 1.97%) 780021 ( 1.86%) > bkwdread-524288 3688 ( 0.00%) 3595 (-2.59%) 3577 (-3.10%) 3724 ( 0.97%) > > The upcoming changes for 2.6.33 also help iozone in many cases, often by more > than just disabling low_latency. It has the occasional massive gain or loss > for the larger file sizes. I don't know why this is but as the big losses > appear to be mostly in the write-tests, I would guess that it's differences > in heavy-writer-throttling. I wonder if 2.6.33 + my async rampup patch will improve still further, maybe reaching the low_latency=0 performance also for writing tests. > > The only downside with block-2.6.33 is that there are a lot of patches in > there and doesn't help with the 2.6.32 release as such. I could do a reverse > bisect to see what helps the most in there but under ideal conditions, it'll > take 3 days to complete and I wouldn't be able to start until Monday as I'm > out of the country for the weekend. That's a bit late. Bisect will likely not help, since we have several patch series with heavy internal dependencies in that tree. If one of the patch series is found to bring the improvement, you have to backport the entire series, that is not advisable for a rc8 or for stable. > > p.s. As a consequence of being out of the country, I also won't be able to > respond to mail over the weekend. > > -- > Mel Gorman > Thanks for the detailed report Corrado ^ permalink raw reply [flat|nested] 23+ messages in thread
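For anyone who wants to reproduce these comparisons, the tunable under discussion is per-device and lives in sysfs under the CFQ iosched directory. A minimal sketch of flipping it off from a C program follows; the device name (sda) and the assumption that the device is actually using CFQ are mine, and running "echo 0 > /sys/block/sda/queue/iosched/low_latency" as root does the same thing.

/*
 * Minimal sketch: disable CFQ's low_latency tunable for one device.
 * Assumes /dev/sda is managed by the CFQ I/O scheduler and that we run
 * as root; adjust the device name for your system.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *path = "/sys/block/sda/queue/iosched/low_latency";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);	/* not CFQ, no such device, or not root */
		return EXIT_FAILURE;
	}
	if (fputs("0\n", f) == EOF || fclose(f) == EOF) {
		perror(path);
		return EXIT_FAILURE;
	}
	return EXIT_SUCCESS;
}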
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-27 12:03 ` Corrado Zoccolo @ 2009-11-27 15:58 ` Mel Gorman 2009-11-27 18:14 ` Corrado Zoccolo 0 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2009-11-27 15:58 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote: > On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > On Thu, Nov 26, 2009 at 04:18:18PM +0100, Corrado Zoccolo wrote: > >> > <SNIP> > >> > > >> > In case you mean a partial disabling of cfq_latency, I'm try the > >> > following patch. The intention is to disable the low_latency logic if > >> > kswapd is at work and presumably needs clean pages. Alternative > >> > suggestions welcome. > > > > As it turned out, that patch sucked so I aborted the test and I need to > > think about it a lot more. > > What about using the dirty ratio, instead of checking if kswapd is running? > How would one go about selecting the proper ratio at which to disable the low_latency logic? > >> Yes, I meant exactly to disable that part, and doing it when kswapd is > >> active is probably a good choice. > >> I have a different idea for 2.6.33, though. > >> If you have a reliable reproducer of the issue, can you test it on > >> git://git.kernel.dk/linux-2.6-block.git branch for-2.6.33? > >> It may already be unaffected, since we had various performance > >> improvements there, but I think a better way to boost writeback is > >> possible. > >> > > > > I haven't tested the high-order allocation scenario yet but the results > > as thing stands are below. There are four kernels being compared > > > > 1. with-low-latency is 2.6.32-rc8 vanilla > > 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied > > 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula" > > 4. 
without-low-latency is with low_latency disabled > > > > SYSBENCH > > sysbench-with low-latency low-latency sysbench-without > > low-latency block-2.6.33 async-rampup low-latency > > 1 1266.02 ( 0.00%) 824.08 (-53.63%) 1265.15 (-0.07%) 1278.55 ( 0.98%) > > 2 1182.58 ( 0.00%) 1226.42 ( 3.57%) 1223.03 ( 3.31%) 1379.25 (14.26%) > > 3 1218.64 ( 0.00%) 1271.38 ( 4.15%) 1246.42 ( 2.23%) 1580.08 (22.87%) > > 4 1212.11 ( 0.00%) 1257.84 ( 3.64%) 1325.17 ( 8.53%) 1534.17 (20.99%) > > 5 1046.77 ( 0.00%) 981.71 (-6.63%) 1008.44 (-3.80%) 1552.48 (32.57%) > > 6 1187.14 ( 0.00%) 1132.89 (-4.79%) 1147.18 (-3.48%) 1661.19 (28.54%) > > 7 1179.37 ( 0.00%) 1183.61 ( 0.36%) 1202.49 ( 1.92%) 790.26 (-49.24%) > > 8 1164.62 ( 0.00%) 1143.54 (-1.84%) 1184.56 ( 1.68%) 854.10 (-36.36%) > > 9 1095.22 ( 0.00%) 1178.72 ( 7.08%) 1002.42 (-9.26%) 1655.04 (33.83%) > > 10 1147.52 ( 0.00%) 1153.46 ( 0.52%) 1151.73 ( 0.37%) 1653.89 (30.62%) > > 11 823.38 ( 0.00%) 820.64 (-0.33%) 754.15 (-9.18%) 1627.45 (49.41%) > > 12 813.73 ( 0.00%) 791.44 (-2.82%) 848.32 ( 4.08%) 1494.63 (45.56%) > > 13 898.22 ( 0.00%) 789.63 (-13.75%) 931.47 ( 3.57%) 1521.64 (40.97%) > > 14 873.50 ( 0.00%) 938.90 ( 6.97%) 875.75 ( 0.26%) 1311.09 (33.38%) > > 15 808.32 ( 0.00%) 979.88 (17.51%) 877.87 ( 7.92%) 1009.70 (19.94%) > > 16 758.17 ( 0.00%) 1096.81 (30.87%) 881.23 (13.96%) 725.17 (-4.55%) > > > > sysbench is helped by both both block-2.6.33 and async-rampup to some > > extent. For many of the results, plain old disabling low_latency still > > helps the most. > > > > desktop-net-gitk > > gitk-with low-latency low-latency gitk-without > > low-latency block-2.6.33 async-rampup low-latency > > min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%) > > mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%) > > stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%) > > max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%) > > > > The changes for block in 2.6.33 make a massive difference here, notably > > beating the disabling of low_latency. > > Yes. These are read of lots of small files, so the improvements for > seeky workload we introduced in 2.6.33 helps a lot here. 
Ok, good to know > > > > IOZone > > iozone-with low-latency low-latency iozone-without > > low-latency block-2.6.33 async-rampup low-latency > > write-64 151212 ( 0.00%) 163359 ( 7.44%) 163359 ( 7.44%) 159856 ( 5.41%) > > write-128 189357 ( 0.00%) 184922 (-2.40%) 202805 ( 6.63%) 206233 ( 8.18%) > > write-256 219883 ( 0.00%) 211232 (-4.10%) 189867 (-15.81%) 223174 ( 1.47%) > > write-512 224932 ( 0.00%) 222601 (-1.05%) 204459 (-10.01%) 220227 (-2.14%) > > write-1024 227738 ( 0.00%) 226728 (-0.45%) 216009 (-5.43%) 226155 (-0.70%) > > write-2048 227564 ( 0.00%) 224167 (-1.52%) 229387 ( 0.79%) 224848 (-1.21%) > > write-4096 208556 ( 0.00%) 227707 ( 8.41%) 216908 ( 3.85%) 223430 ( 6.66%) > > write-8192 219484 ( 0.00%) 222365 ( 1.30%) 217737 (-0.80%) 219389 (-0.04%) > > write-16384 206670 ( 0.00%) 209355 ( 1.28%) 204146 (-1.24%) 206295 (-0.18%) > > write-32768 203023 ( 0.00%) 205097 ( 1.01%) 199766 (-1.63%) 201852 (-0.58%) > > write-65536 162134 ( 0.00%) 196670 (17.56%) 189975 (14.66%) 189173 (14.29%) > > write-131072 68534 ( 0.00%) 69145 ( 0.88%) 64519 (-6.22%) 67417 (-1.66%) > > write-262144 32936 ( 0.00%) 28587 (-15.21%) 31470 (-4.66%) 27750 (-18.69%) > > write-524288 24044 ( 0.00%) 23560 (-2.05%) 23116 (-4.01%) 23759 (-1.20%) > > rewrite-64 755681 ( 0.00%) 800767 ( 5.63%) 469931 (-60.81%) 755681 ( 0.00%) > > rewrite-128 581518 ( 0.00%) 639723 ( 9.10%) 591774 ( 1.73%) 799840 (27.30%) > > rewrite-256 639427 ( 0.00%) 710511 (10.00%) 666414 ( 4.05%) 659861 ( 3.10%) > > rewrite-512 669577 ( 0.00%) 743788 ( 9.98%) 692017 ( 3.24%) 684954 ( 2.24%) > > rewrite-1024 680960 ( 0.00%) 755195 ( 9.83%) 701422 ( 2.92%) 686182 ( 0.76%) > > rewrite-2048 685263 ( 0.00%) 743123 ( 7.79%) 703445 ( 2.58%) 692780 ( 1.09%) > > rewrite-4096 631352 ( 0.00%) 686776 ( 8.07%) 640007 ( 1.35%) 643266 ( 1.85%) > > rewrite-8192 442146 ( 0.00%) 474089 ( 6.74%) 457768 ( 3.41%) 442624 ( 0.11%) > > rewrite-16384 428641 ( 0.00%) 454857 ( 5.76%) 442896 ( 3.22%) 432613 ( 0.92%) > > rewrite-32768 425361 ( 0.00%) 444206 ( 4.24%) 434472 ( 2.10%) 430568 ( 1.21%) > > rewrite-65536 405183 ( 0.00%) 433898 ( 6.62%) 419843 ( 3.49%) 389242 (-4.10%) > > rewrite-131072 66110 ( 0.00%) 58370 (-13.26%) 54342 (-21.66%) 58472 (-13.06%) > > rewrite-262144 29254 ( 0.00%) 24665 (-18.61%) 25710 (-13.78%) 29306 ( 0.18%) > > rewrite-524288 23812 ( 0.00%) 20742 (-14.80%) 22490 (-5.88%) 24543 ( 2.98%) > > read-64 934589 ( 0.00%) 1160938 (19.50%) 1004538 ( 6.96%) 840903 (-11.14%) > > read-128 1601534 ( 0.00%) 1869179 (14.32%) 1681806 ( 4.77%) 1280633 (-25.06%) > > read-256 1255511 ( 0.00%) 1526887 (17.77%) 1304314 ( 3.74%) 1310683 ( 4.21%) > > read-512 1291158 ( 0.00%) 1377278 ( 6.25%) 1336145 ( 3.37%) 1319723 ( 2.16%) > > read-1024 1319408 ( 0.00%) 1306564 (-0.98%) 1368162 ( 3.56%) 1347557 ( 2.09%) > > read-2048 1316016 ( 0.00%) 1394645 ( 5.64%) 1339827 ( 1.78%) 1347393 ( 2.33%) > > read-4096 1253710 ( 0.00%) 1307525 ( 4.12%) 1247519 (-0.50%) 1251882 (-0.15%) > > read-8192 995149 ( 0.00%) 1033337 ( 3.70%) 1016944 ( 2.14%) 1011794 ( 1.65%) > > read-16384 883156 ( 0.00%) 905213 ( 2.44%) 905213 ( 2.44%) 897458 ( 1.59%) > > read-32768 844368 ( 0.00%) 855213 ( 1.27%) 849609 ( 0.62%) 856364 ( 1.40%) > > read-65536 816099 ( 0.00%) 839262 ( 2.76%) 835019 ( 2.27%) 826473 ( 1.26%) > > read-131072 818055 ( 0.00%) 837369 ( 2.31%) 828230 ( 1.23%) 824351 ( 0.76%) > > read-262144 827225 ( 0.00%) 839635 ( 1.48%) 840538 ( 1.58%) 835693 ( 1.01%) > > read-524288 24653 ( 0.00%) 21387 (-15.27%) 20602 (-19.66%) 22519 (-9.48%) > > reread-64 2329708 ( 0.00%) 2251544 (-3.47%) 
1985134 (-17.36%) 1985134 (-17.36%) > > reread-128 1446222 ( 0.00%) 1979446 (26.94%) 2009076 (28.02%) 2137031 (32.33%) > > reread-256 1828508 ( 0.00%) 2006158 ( 8.86%) 1892980 ( 3.41%) 1879725 ( 2.72%) > > reread-512 1521718 ( 0.00%) 1642783 ( 7.37%) 1508887 (-0.85%) 1579934 ( 3.68%) > > reread-1024 1347557 ( 0.00%) 1422540 ( 5.27%) 1384034 ( 2.64%) 1375171 ( 2.01%) > > reread-2048 1340664 ( 0.00%) 1413929 ( 5.18%) 1372364 ( 2.31%) 1350783 ( 0.75%) > > reread-4096 1259592 ( 0.00%) 1324868 ( 4.93%) 1273788 ( 1.11%) 1284839 ( 1.96%) > > reread-8192 1007285 ( 0.00%) 1033710 ( 2.56%) 1027159 ( 1.93%) 1011317 ( 0.40%) > > reread-16384 891404 ( 0.00%) 910828 ( 2.13%) 916562 ( 2.74%) 905022 ( 1.50%) > > reread-32768 850492 ( 0.00%) 859341 ( 1.03%) 856385 ( 0.69%) 862772 ( 1.42%) > > reread-65536 836565 ( 0.00%) 852664 ( 1.89%) 852315 ( 1.85%) 847020 ( 1.23%) > > reread-131072 844516 ( 0.00%) 862590 ( 2.10%) 854067 ( 1.12%) 853155 ( 1.01%) > > reread-262144 851524 ( 0.00%) 860559 ( 1.05%) 864921 ( 1.55%) 860653 ( 1.06%) > > reread-524288 24927 ( 0.00%) 21300 (-17.03%) 19748 (-26.23%) 22487 (-10.85%) > > randread-64 1605256 ( 0.00%) 1605256 ( 0.00%) 1605256 ( 0.00%) 1775099 ( 9.57%) > > randread-128 1179358 ( 0.00%) 1582649 (25.48%) 1511363 (21.97%) 1528576 (22.85%) > > randread-256 1421755 ( 0.00%) 1599680 (11.12%) 1460430 ( 2.65%) 1310683 (-8.47%) > > randread-512 1306873 ( 0.00%) 1278855 (-2.19%) 1243315 (-5.11%) 1281909 (-1.95%) > > randread-1024 1201314 ( 0.00%) 1254656 ( 4.25%) 1190657 (-0.90%) 1231629 ( 2.46%) > > randread-2048 1179413 ( 0.00%) 1227971 ( 3.95%) 1185272 ( 0.49%) 1190529 ( 0.93%) > > randread-4096 1107005 ( 0.00%) 1160862 ( 4.64%) 1110727 ( 0.34%) 1116792 ( 0.88%) > > randread-8192 894337 ( 0.00%) 924264 ( 3.24%) 912676 ( 2.01%) 899487 ( 0.57%) > > randread-16384 783760 ( 0.00%) 800299 ( 2.07%) 793351 ( 1.21%) 791341 ( 0.96%) > > randread-32768 740498 ( 0.00%) 743720 ( 0.43%) 741233 ( 0.10%) 743511 ( 0.41%) > > randread-65536 721640 ( 0.00%) 727692 ( 0.83%) 726984 ( 0.74%) 728139 ( 0.89%) > > randread-131072 715284 ( 0.00%) 722094 ( 0.94%) 717746 ( 0.34%) 720825 ( 0.77%) > > randread-262144 709855 ( 0.00%) 706770 (-0.44%) 709133 (-0.10%) 714943 ( 0.71%) > > randread-524288 394 ( 0.00%) 421 ( 6.41%) 418 ( 5.74%) 431 ( 8.58%) > > randwrite-64 730988 ( 0.00%) 764288 ( 4.36%) 723111 (-1.09%) 730988 ( 0.00%) > > randwrite-128 746459 ( 0.00%) 799840 ( 6.67%) 746459 ( 0.00%) 742331 (-0.56%) > > randwrite-256 695778 ( 0.00%) 752329 ( 7.52%) 720041 ( 3.37%) 727850 ( 4.41%) > > randwrite-512 666253 ( 0.00%) 722760 ( 7.82%) 667081 ( 0.12%) 691126 ( 3.60%) > > randwrite-1024 651223 ( 0.00%) 697776 ( 6.67%) 663292 ( 1.82%) 659625 ( 1.27%) > > randwrite-2048 655558 ( 0.00%) 691887 ( 5.25%) 665720 ( 1.53%) 664073 ( 1.28%) > > randwrite-4096 635556 ( 0.00%) 662721 ( 4.10%) 643170 ( 1.18%) 642400 ( 1.07%) > > randwrite-8192 467357 ( 0.00%) 491364 ( 4.89%) 476720 ( 1.96%) 469734 ( 0.51%) > > randwrite-16384 413188 ( 0.00%) 427521 ( 3.35%) 417353 ( 1.00%) 417282 ( 0.98%) > > randwrite-32768 404161 ( 0.00%) 411721 ( 1.84%) 404942 ( 0.19%) 407580 ( 0.84%) > > randwrite-65536 379372 ( 0.00%) 397312 ( 4.52%) 386853 ( 1.93%) 381273 ( 0.50%) > > randwrite-131072 21780 ( 0.00%) 16924 (-28.69%) 21177 (-2.85%) 19758 (-10.23%) > > randwrite-262144 6249 ( 0.00%) 5548 (-12.64%) 6370 ( 1.90%) 6316 ( 1.06%) > > randwrite-524288 2915 ( 0.00%) 2582 (-12.90%) 2871 (-1.53%) 2859 (-1.96%) > > bkwdread-64 1141196 ( 0.00%) 1141196 ( 0.00%) 1004538 (-13.60%) 1141196 ( 0.00%) > > bkwdread-128 1066865 ( 0.00%) 1386465 
(23.05%) 1400936 (23.85%) 1101900 ( 3.18%) > > bkwdread-256 877797 ( 0.00%) 1105556 (20.60%) 1105556 (20.60%) 1105556 (20.60%) > > bkwdread-512 1133103 ( 0.00%) 1162547 ( 2.53%) 1175271 ( 3.59%) 1162547 ( 2.53%) > > bkwdread-1024 1163562 ( 0.00%) 1206714 ( 3.58%) 1213534 ( 4.12%) 1195962 ( 2.71%) > > bkwdread-2048 1163439 ( 0.00%) 1218910 ( 4.55%) 1204552 ( 3.41%) 1204552 ( 3.41%) > > bkwdread-4096 1116792 ( 0.00%) 1175477 ( 4.99%) 1159922 ( 3.72%) 1150600 ( 2.94%) > > bkwdread-8192 912288 ( 0.00%) 935233 ( 2.45%) 944695 ( 3.43%) 934724 ( 2.40%) > > bkwdread-16384 817707 ( 0.00%) 824140 ( 0.78%) 832527 ( 1.78%) 829152 ( 1.38%) > > bkwdread-32768 775898 ( 0.00%) 773714 (-0.28%) 785494 ( 1.22%) 787691 ( 1.50%) > > bkwdread-65536 759643 ( 0.00%) 769924 ( 1.34%) 778780 ( 2.46%) 772174 ( 1.62%) > > bkwdread-131072 763215 ( 0.00%) 769634 ( 0.83%) 773707 ( 1.36%) 773816 ( 1.37%) > > bkwdread-262144 765491 ( 0.00%) 768992 ( 0.46%) 780876 ( 1.97%) 780021 ( 1.86%) > > bkwdread-524288 3688 ( 0.00%) 3595 (-2.59%) 3577 (-3.10%) 3724 ( 0.97%) > > > > The upcoming changes for 2.6.33 also help iozone in many cases, often by more > > than just disabling low_latency. It has the occasional massive gain or loss > > for the larger file sizes. I don't know why this is but as the big losses > > appear to be mostly in the write-tests, I would guess that it's differences > > in heavy-writer-throttling. > > I wonder if 2.6.33 + my async rampup patch will improve still further, > maybe reaching the low_latency=0 performance also for writing tests. It might, I didn't test yet as the machine is tied up. However, even if it does, it will not help the 2.6.32 if the patches for 2.6.33 are being considered. > > > > The only downside with block-2.6.33 is that there are a lot of patches in > > there and doesn't help with the 2.6.32 release as such. I could do a reverse > > bisect to see what helps the most in there but under ideal conditions, it'll > > take 3 days to complete and I wouldn't be able to start until Monday as I'm > > out of the country for the weekend. That's a bit late. > > Bisect will likely not help, since we have several patch series with > heavy internal dependencies in that tree. > If one of the patch series is found to bring the improvement, you have > to backport the entire series, that is not advisable for a rc8 or for > stable. Scratch that then. I did a quick test for when high-order-atomic-allocations-for-network are happening but the results are not great. By quick test, I mean I only did the gitk tests as there wasn't time to do the sysbench and iozone tests as well before I'd go offline. desktop-net-gitk high-with low-latency low-latency high-without low-latency block-2.6.33 async-rampup low-latency min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 (-37.69%) 303.43 (64.76%) mean 866.60 ( 0.00%) 616.28 (28.89%) 1201.82 (-38.68%) 459.69 (46.96%) stddev 4.39 ( 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%) max 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 (38.42%) pgalloc-fail 25 ( 0.00%) 10 (50.00%) 39 (-95.00%) 20 ( 0.00%) The patches for 2.6.33 help a little all right but the async-rampup patches both make the performance worse and causes more page allocation failures to occur. In other words, on most machines it'll appear fine but people with wireless cards doing high-order allocations may run into trouble. Disabling low_latency again helps performance significantly in this scenario. 
There were still page allocation failures because not all the patches related
to that problem made it to mainline.

I was somewhat aggravated by the page allocation failures until I remembered
that there are three patches in -mm that I failed to convince either Jens or
Andrew were suitable for mainline. When they are added to the mix, the results
are as follows;

desktop-net-gitk
                atomics-with      low-latency       low-latency   atomics-without
                 low-latency     block-2.6.33      async-rampup       low-latency
min          641.12 ( 0.00%)  627.91 ( 2.06%) 1254.75 (-95.71%)   375.05 (41.50%)
mean         743.61 ( 0.00%)  631.20 (15.12%) 1272.70 (-71.15%)   389.71 (47.59%)
stddev        60.30 ( 0.00%)    2.53 (95.80%)   10.64 (82.35%)     22.38 (62.89%)
max          793.85 ( 0.00%)  633.76 (20.17%) 1281.65 (-61.45%)   428.41 (46.03%)
pgalloc-fail      3 ( 0.00%)       2 ( 0.00%)      23 ( 0.00%)         0 ( 0.00%)

Again, plain old disabling low_latency both performs the best and fails page
allocations the least. The three patches for page allocation failures that are
in -mm but not mainline are;

[PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim
[PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep
[PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble

It still seems that the route of least damage is to disable low_latency by
default for 2.6.32. It's very unfortunate that I wasn't able to fully justify
the 3 patches for page allocation failures in time, but all that can be done
there is to consider them for -stable, I suppose.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 23+ messages in thread
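For context on what the pgalloc-fail column is counting: the failing allocations in this scenario are high-order GFP_ATOMIC requests from the network receive path, which cannot sleep and therefore cannot wait for kswapd or direct reclaim. The snippet below is a schematic kernel-side sketch of that pattern, not code from any real driver.

/*
 * Schematic sketch only -- not taken from any real driver.  The failures
 * counted above are of this shape: an order-2 request (four contiguous
 * pages, 16KB with 4KB pages) made with GFP_ATOMIC from a context that
 * cannot sleep, so it cannot trigger reclaim and simply fails when no
 * contiguous order-2 block is free.
 */
#include <linux/gfp.h>
#include <linux/mm.h>

static void *grab_rx_buffer(void)
{
	/* order 2, atomic: no reclaim possible from this context */
	struct page *page = alloc_pages(GFP_ATOMIC, 2);

	if (!page)
		return NULL;	/* shows up in dmesg as a page allocation failure */
	return page_address(page);
}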
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-27 15:58 ` Mel Gorman @ 2009-11-27 18:14 ` Corrado Zoccolo 2009-11-27 18:52 ` Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Corrado Zoccolo @ 2009-11-27 18:14 UTC (permalink / raw) To: Mel Gorman Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <mel@csn.ul.ie> wrote: > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote: >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <mel@csn.ul.ie> wrote: > How would one go about selecting the proper ratio at which to disable > the low_latency logic? Can we measure the dirty ratio when the allocation failures start to happen? >> > >> > I haven't tested the high-order allocation scenario yet but the results >> > as thing stands are below. There are four kernels being compared >> > >> > 1. with-low-latency is 2.6.32-rc8 vanilla >> > 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied >> > 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula" >> > 4. without-low-latency is with low_latency disabled >> > >> > desktop-net-gitk >> > gitk-with low-latency low-latency gitk-without >> > low-latency block-2.6.33 async-rampup low-latency >> > min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%) >> > mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%) >> > stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%) >> > max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%) >> > >> > The changes for block in 2.6.33 make a massive difference here, notably >> > beating the disabling of low_latency. >> > I did a quick test for when high-order-atomic-allocations-for-network > are happening but the results are not great. By quick test, I mean I > only did the gitk tests as there wasn't time to do the sysbench and > iozone tests as well before I'd go offline. > > desktop-net-gitk > high-with low-latency low-latency high-without > low-latency block-2.6.33 async-rampup low-latency > min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 (-37.69%) 303.43 (64.76%) > mean 866.60 ( 0.00%) 616.28 (28.89%) 1201.82 (-38.68%) 459.69 (46.96%) > stddev 4.39 ( 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%) > max 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 (38.42%) > pgalloc-fail 25 ( 0.00%) 10 (50.00%) 39 (-95.00%) 20 ( 0.00%) > > The patches for 2.6.33 help a little all right but the async-rampup > patches both make the performance worse and causes more page allocation > failures to occur. In other words, on most machines it'll appear fine > but people with wireless cards doing high-order allocations may run into > trouble. > > Disabling low_latency again helps performance significantly in this > scenario. There were still page allocation failures because not all the > patches related to that problem made it to mainline. I'm puzzled how almost all kernels, excluding the async rampup, perform better when high order allocations are enabled, than in previous test. > I was somewhat aggrevated by the page allocation failures until I remembered > that there are three patches in -mm that I failed to convince either Jens or > Andrew of them being suitable for mainline. 
When they are added to the mix, > the results are as follows; > > desktop-net-gitk > atomics-with low-latency low-latency atomics-without > low-latency block-2.6.33 async-rampup low-latency > min 641.12 ( 0.00%) 627.91 ( 2.06%) 1254.75 (-95.71%) 375.05 (41.50%) > mean 743.61 ( 0.00%) 631.20 (15.12%) 1272.70 (-71.15%) 389.71 (47.59%) > stddev 60.30 ( 0.00%) 2.53 (95.80%) 10.64 (82.35%) 22.38 (62.89%) > max 793.85 ( 0.00%) 633.76 (20.17%) 1281.65 (-61.45%) 428.41 (46.03%) > pgalloc-fail 3 ( 0.00%) 2 ( 0.00%) 23 ( 0.00%) 0 ( 0.00%) > Those patches penalize block-2.6.33, that was the one with lowest number of failures in previous test. I think the heuristics were tailored to 2.6.32. They need to be re-tuned for 2.6.33. > Again, plain old disabling low_latency both performs the best and fails page > allocations the least. The three patches for page allocation failures are > in -mm but not mainline are; > > [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim > [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep > [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble > > It still seems to be that the route of least damage is to disable low_latency > by default for 2.6.32. It's very unfortunate that I wasn't able to fully > justify the 3 patches for page allocation failures in time but all that > can be done there is consider them for -stable I suppose. Just disabling low_latency will not solve the allocation issues (20 instead of 25). Moreover, it will improve some workloads, but penalize others. Your 3 patches, though, seem to improve the situation also for low_latency enabled, both for performance and allocation failures (25 to 3). Having those 3 patches with low_latency enabled seems better, since it won't penalize the workloads that are benefited by low_latency (if you add a sequential read to your test, you should see a big difference). Thanks, Corrado -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
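On the question of measuring the dirty ratio when the failures start: a cheap way to watch it from userspace is to sample /proc/meminfo once a second. The sketch below prints Dirty and Writeback as a percentage of MemTotal; this is only an approximation of the ratio the VM uses internally (which is computed against dirtyable memory rather than MemTotal), so treat it as a rough monitor.

/*
 * Rough dirty-memory monitor: print Dirty and Writeback from
 * /proc/meminfo as a percentage of MemTotal, once per second.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long meminfo_kb(const char *key)
{
	char line[256];
	long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, key, strlen(key))) {
			sscanf(line + strlen(key), " %ld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	for (;;) {
		long total = meminfo_kb("MemTotal:");
		long dirty = meminfo_kb("Dirty:");
		long wback = meminfo_kb("Writeback:");

		if (total > 0 && dirty >= 0 && wback >= 0)
			printf("dirty %5.2f%%  writeback %5.2f%%\n",
			       100.0 * dirty / total, 100.0 * wback / total);
		fflush(stdout);
		sleep(1);
	}
	return 0;
}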
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-27 18:14 ` Corrado Zoccolo @ 2009-11-27 18:52 ` Mel Gorman 2009-11-29 15:11 ` Corrado Zoccolo 0 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2009-11-27 18:52 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote: > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote: > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > How would one go about selecting the proper ratio at which to disable > > the low_latency logic? > > Can we measure the dirty ratio when the allocation failures start to happen? > Would the number of dirty pages in the page allocation failure message to kern.log be enough? You won't get them all because of printk suppress but it's something. Alternatively, tell me exactly what stats from /proc you want and I'll stick a monitor on there. Assuming you want nr_dirty vs total number of pages though, the monitor tends to execute too late to be useful. > >> > > >> > I haven't tested the high-order allocation scenario yet but the results > >> > as thing stands are below. There are four kernels being compared > >> > > >> > 1. with-low-latency is 2.6.32-rc8 vanilla > >> > 2. with-low-latency-block-2.6.33 is with the for-2.6.33 from linux-block applied > >> > 3. with-low-latency-async-rampup is with "[RFC,PATCH] cfq-iosched: improve async queue ramp up formula" > >> > 4. without-low-latency is with low_latency disabled > >> > > >> > desktop-net-gitk > >> > gitk-with low-latency low-latency gitk-without > >> > low-latency block-2.6.33 async-rampup low-latency > >> > min 954.46 ( 0.00%) 570.06 (40.27%) 796.22 (16.58%) 640.65 (32.88%) > >> > mean 964.79 ( 0.00%) 573.96 (40.51%) 798.01 (17.29%) 655.57 (32.05%) > >> > stddev 10.01 ( 0.00%) 2.65 (73.55%) 1.91 (80.95%) 13.33 (-33.18%) > >> > max 981.23 ( 0.00%) 577.21 (41.17%) 800.91 (18.38%) 675.65 (31.14%) > >> > > >> > The changes for block in 2.6.33 make a massive difference here, notably > >> > beating the disabling of low_latency. > >> > > I did a quick test for when high-order-atomic-allocations-for-network > > are happening but the results are not great. By quick test, I mean I > > only did the gitk tests as there wasn't time to do the sysbench and > > iozone tests as well before I'd go offline. > > > > desktop-net-gitk > > high-with low-latency low-latency high-without > > low-latency block-2.6.33 async-rampup low-latency > > min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 (-37.69%) 303.43 (64.76%) > > mean 866.60 ( 0.00%) 616.28 (28.89%) 1201.82 (-38.68%) 459.69 (46.96%) > > stddev 4.39 ( 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%) > > max 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 (38.42%) > > pgalloc-fail 25 ( 0.00%) 10 (50.00%) 39 (-95.00%) 20 ( 0.00%) > > > > The patches for 2.6.33 help a little all right but the async-rampup > > patches both make the performance worse and causes more page allocation > > failures to occur. In other words, on most machines it'll appear fine > > but people with wireless cards doing high-order allocations may run into > > trouble. 
> > > > Disabling low_latency again helps performance significantly in this > > scenario. There were still page allocation failures because not all the > > patches related to that problem made it to mainline. > > I'm puzzled how almost all kernels, excluding the async rampup, > perform better when high order allocations are enabled, than in > previous test. > Two major differences. 1, the previous non-high-order tests had also run sysbench and iozone so the starting conditions are different. I had disabled those tests to get some of the high-order figures before I went offline. However, the starting conditions are probably not as important as the fact that kswapd is working to free order-2 pages and staying awake until watermarks are reached. kswapd working harder is probably making a big difference. > > I was somewhat aggrevated by the page allocation failures until I remembered > > that there are three patches in -mm that I failed to convince either Jens or > > Andrew of them being suitable for mainline. When they are added to the mix, > > the results are as follows; > > > > desktop-net-gitk > > atomics-with low-latency low-latency atomics-without > > low-latency block-2.6.33 async-rampup low-latency > > min 641.12 ( 0.00%) 627.91 ( 2.06%) 1254.75 (-95.71%) 375.05 (41.50%) > > mean 743.61 ( 0.00%) 631.20 (15.12%) 1272.70 (-71.15%) 389.71 (47.59%) > > stddev 60.30 ( 0.00%) 2.53 (95.80%) 10.64 (82.35%) 22.38 (62.89%) > > max 793.85 ( 0.00%) 633.76 (20.17%) 1281.65 (-61.45%) 428.41 (46.03%) > > pgalloc-fail 3 ( 0.00%) 2 ( 0.00%) 23 ( 0.00%) 0 ( 0.00%) > > > > Those patches penalize block-2.6.33, that was the one with lowest > number of failures in previous test. > I think the heuristics were tailored to 2.6.32. They need to be > re-tuned for 2.6.33. > I made a mistake in the script that was generating the summary. I neglected to take into account printk rate suppressions. When they are taken into account, the first round of figures look like desktop-net-gitk high-with low-latency low-latency high-without low-latency block-2.6.33 async-rampup low-latency min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 (-37.69%) 303.43 (64.76%) mean 866.60 ( 0.00%) 616.28 (28.89%) 1201.82 (-38.68%) 459.69 (46.96%) stddev 4.39 ( 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%) max 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 (38.42%) pgalloc-fail 65 ( 0.00%) 10 (84.62%) 293 (-350.77%) 20 (69.23%) So the async-rampup is getting smacked very hard with allocation failures in the high-order case. With the three additional applied for allocation failures, the figures look like desktop-net-gitk atomics-with low-latency low-latency atomics-without low-latency block-2.6.33 async-rampup low-latency min 641.12 ( 0.00%) 627.91 ( 2.06%) 1254.75 (-95.71%) 375.05 (41.50%) mean 743.61 ( 0.00%) 631.20 (15.12%) 1272.70 (-71.15%) 389.71 (47.59%) stddev 60.30 ( 0.00%) 2.53 (95.80%) 10.64 (82.35%) 22.38 (62.89%) max 793.85 ( 0.00%) 633.76 (20.17%) 1281.65 (-61.45%) 428.41 (46.03%) pgalloc-fail 3 ( 0.00%) 2 ( 0.00%) 27 ( 0.00%) 0 ( 0.00%) So again, async-rampup is getting smacked in terms of allocation failures although the three additional patches help a lot. This is a real pity because it looked nice in the tests involving no high-order allocations for the network. > > Again, plain old disabling low_latency both performs the best and fails page > > allocations the least. 
The three patches for page allocation failures are > > in -mm but not mainline are; > > > > [PATCH 3/5] page allocator: Wait on both sync and async congestion after direct reclaim > > [PATCH 4/5] vmscan: Have kswapd sleep for a short interval and double check it should be asleep > > [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble > > > > It still seems to be that the route of least damage is to disable low_latency > > by default for 2.6.32. It's very unfortunate that I wasn't able to fully > > justify the 3 patches for page allocation failures in time but all that > > can be done there is consider them for -stable I suppose. > > Just disabling low_latency will not solve the allocation issues (20 > instead of 25). 20 instead of 65 and I know it doesn't fully help the problem with high-order allocations. The patches that do help that problem aren't in mainline but they do exist. > Moreover, it will improve some workloads, but penalize others. > It really does appear to hurt a lot when the machine is kinda low on memory though. That is a fairly common situation with a desktop loaded up with random apps. Well..... by common, I mean I hit that situation a lot on my laptop. I don't hit it on server workloads because I make sure the machines are not overloaded. > Your 3 patches, though, seem to improve the situation also for > low_latency enabled, both for performance and allocation failures (25 > to 3). Having those 3 patches with low_latency enabled seems better, > since it won't penalize the workloads that are benefited by > low_latency (if you add a sequential read to your test, you should see > a big difference). > This is true and I would like to see them merged. However, this close to release, with Jens unhappiness with the explanation of why congestion_wait() changes made a difference and Andrew feeling there wasn't enough cause to merge them, I'm doubtful it'll happen. Will see Monday what the story is. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
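For readers rechecking the tables in this thread, the percentage columns appear to follow one convention for the time-based results (gitk, lower is better) and another for the throughput results (sysbench, higher is better), with positive values always meaning an improvement over the low_latency baseline. This is inferred from the numbers themselves, not taken from the reporting scripts, which are not shown.

/*
 * Sketch of how the percentage columns appear to be derived (inferred
 * from the published numbers):
 *   elapsed-time metrics (gitk):     100 * (1 - candidate / baseline)
 *   throughput metrics  (sysbench):  100 * (1 - baseline / candidate)
 */
#include <stdio.h>

static double gain_time(double baseline, double candidate)
{
	return 100.0 * (1.0 - candidate / baseline);
}

static double gain_throughput(double baseline, double candidate)
{
	return 100.0 * (1.0 - baseline / candidate);
}

int main(void)
{
	/* spot-checks against the tables above */
	printf("gitk min:       %7.2f%%\n", gain_time(954.46, 570.06));        /* ~40.27 */
	printf("sysbench t=1:   %7.2f%%\n", gain_throughput(1266.02, 1278.55)); /* ~0.98  */
	printf("sysbench t=7:   %7.2f%%\n", gain_throughput(1179.37, 790.26));  /* ~-49.24 */
	return 0;
}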
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-27 18:52 ` Mel Gorman @ 2009-11-29 15:11 ` Corrado Zoccolo 2009-11-30 12:04 ` Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Corrado Zoccolo @ 2009-11-29 15:11 UTC (permalink / raw) To: Mel Gorman Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Fri, Nov 27, 2009 19:52:34, Mel Gorman wrote: : > On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote: > > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote: > > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > > > > How would one go about selecting the proper ratio at which to disable > > > the low_latency logic? > > > > Can we measure the dirty ratio when the allocation failures start to > > happen? > > Would the number of dirty pages in the page allocation failure message to > kern.log be enough? You won't get them all because of printk suppress but > it's something. Alternatively, tell me exactly what stats from /proc you > want and I'll stick a monitor on there. Assuming you want nr_dirty vs total > number of pages though, the monitor tends to execute too late to be useful. > Since I wanted to go deeper in the understanding, but my system is healty, I devised a measure of fragmentation, and wanted to chart it to understand what was going wrong. A perl script that produces gnuplot compatible output is provided: use strict; select(STDOUT); $|=1; do { open (my $bf, "< /proc/buddyinfo") or die; open (my $up, "< /proc/uptime") or die; my $now = <$up>; chomp $now; print $now; while(<$bf>) { next unless /Node (\d+), zone\s+([a-zA-Z]+)\s+(.+)$/; my ($frag, $tot, $val) = (0,0,1); map { $frag += $_; $tot += $val * $_; $val <<= 1;} ($3 =~ /\d+/g); print "\t", $frag/$tot; } print "\n"; sleep 1; } while(1); My definition of fragmentation is just the number of fragments / the number of pages: * It is 1 only when all pages are of order 0 * it is 2/3 on a random marking of used pages (each page has probability 0.5 of being used) * to be sure that a order k allocation succeeds, the fragmentation should be <= 2^-k I observed the mainline kernel during normal usage, and found that: * the fragmentation is very low after boot (< 1%). * it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations). * high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment. * when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8). * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high. > > Two major differences. 1, the previous non-high-order tests had also > run sysbench and iozone so the starting conditions are different. I had > disabled those tests to get some of the high-order figures before I went > offline. However, the starting conditions are probably not as important as > the fact that kswapd is working to free order-2 pages and staying awake > until watermarks are reached. 
kswapd working harder is probably making a > big difference. > From my observation, having run a program that fills page cache before a test has a lot of impact to the fragmentation. We (block layer guys) tend to do a sync & drop cache before starting any test, so this can explain why our optimizations work best when machine has plenty of free memory. On the other hand, machines with plenty of memory should be the norm now, even for desktops. > > I made a mistake in the script that was generating the summary. I neglected > to take into account printk rate suppressions. When they are taken into > account, the first round of figures look like > > desktop-net-gitk > high-with low-latency low-latency > high-without low-latency block-2.6.33 async-rampup > low-latency min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 > (-37.69%) 303.43 (64.76%) mean 866.60 ( 0.00%) 616.28 > (28.89%) 1201.82 (-38.68%) 459.69 (46.96%) stddev 4.39 ( > 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%) max > 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 > (38.42%) pgalloc-fail 65 ( 0.00%) 10 (84.62%) 293 > (-350.77%) 20 (69.23%) > > So the async-rampup is getting smacked very hard with allocation failures > in the high-order case. With the three additional applied for allocation > failures, the figures look like > > desktop-net-gitk > atomics-with low-latency low-latency > atomics-without low-latency block-2.6.33 async-rampup > low-latency min 641.12 ( 0.00%) 627.91 ( 2.06%) 1254.75 > (-95.71%) 375.05 (41.50%) mean 743.61 ( 0.00%) 631.20 > (15.12%) 1272.70 (-71.15%) 389.71 (47.59%) stddev 60.30 ( > 0.00%) 2.53 (95.80%) 10.64 (82.35%) 22.38 (62.89%) max > 793.85 ( 0.00%) 633.76 (20.17%) 1281.65 (-61.45%) 428.41 (46.03%) > pgalloc-fail 3 ( 0.00%) 2 ( 0.00%) 27 ( 0.00%) 0 > ( 0.00%) > > So again, async-rampup is getting smacked in terms of allocation failures > although the three additional patches help a lot. This is a real pity > because it looked nice in the tests involving no high-order allocations for > the network. Ok. Forget that patch for now. Maybe we can test it with 2.6.33 to see if it fits. On the other hand, I saw that the problems with high order allocations started around 2.6.31, where we didn't have any low_latency patch. So I don't think the solution to the problem is in the block layer. A slightly slower or faster writeback shouldn't cause a DoS like situation as the one encountered with your network driver. > > Moreover, it will improve some workloads, but penalize others. > > It really does appear to hurt a lot when the machine is kinda low on > memory though. That is a fairly common situation with a desktop loaded > up with random apps. Well..... by common, I mean I hit that situation a > lot on my laptop. I don't hit it on server workloads because I make sure > the machines are not overloaded. This is why we have it as a tunable. If your workload is negatively affected, you can switch it off. But make sure to test it thoroughly, because even if you found a 2x slowdown in a particular circumstance, it can gain 10x speedup (see http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html) in others. > > > Your 3 patches, though, seem to improve the situation also for > > low_latency enabled, both for performance and allocation failures (25 > > to 3). Having those 3 patches with low_latency enabled seems better, > > since it won't penalize the workloads that are benefited by > > low_latency (if you add a sequential read to your test, you should see > > a big difference). 
> > This is true and I would like to see them merged. However, this close to > release, with Jens unhappiness with the explanation of why > congestion_wait() changes made a difference and Andrew feeling there > wasn't enough cause to merge them, I'm doubtful it'll happen. Will see > Monday what the story is. After a 1day study of the VM, I found an other way to improve the fragmentation. With the patch below, the fragmentation stays below 2/3 even when memory pressure is high, and decreases overtime, if the system is lightly used, even without dropping caches. Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order allocations are usually serviced by the other zones (more likely than with mainline allocator). The idea is to have 2 freelists for each zone. The free_list_0 has the pages that are less likely to cause an higher-order merge, since the buddy of their compound is not free. The free_list_1 contains the other ones. When expanding, we put pages into free_list_1. When freeing, we put them in the proper one by checking the buddy of the compound. And when extracting, we always extract from free_list_0 first, and fall back on the other if the first is empty. In this way, we keep free longer the pages that are more likely to cause a big merge. Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation. It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look. Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 6f75617..6427361 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page) } struct free_area { - struct list_head free_list[MIGRATE_TYPES]; + struct list_head free_list_0[MIGRATE_TYPES]; + struct list_head free_list_1[MIGRATE_TYPES]; unsigned long nr_free; }; diff --git a/kernel/kexec.c b/kernel/kexec.c index f336e21..aee5ef5 100644 --- a/kernel/kexec.c +++ b/kernel/kexec.c @@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void) VMCOREINFO_OFFSET(zone, free_area); VMCOREINFO_OFFSET(zone, vm_stat); VMCOREINFO_OFFSET(zone, spanned_pages); - VMCOREINFO_OFFSET(free_area, free_list); + VMCOREINFO_OFFSET(free_area, free_list_0); + VMCOREINFO_OFFSET(free_area, free_list_1); VMCOREINFO_OFFSET(list_head, next); VMCOREINFO_OFFSET(list_head, prev); VMCOREINFO_OFFSET(vm_struct, addr); VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER); log_buf_kexec_setup(); - VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES); + VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES); + VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES); VMCOREINFO_NUMBER(NR_FREE_PAGES); VMCOREINFO_NUMBER(PG_lru); VMCOREINFO_NUMBER(PG_private); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index cdcedf6..5f488d8 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page, int migratetype) { unsigned long page_idx; + unsigned long combined_idx; + bool high_order_free = false; if (unlikely(PageCompound(page))) if (unlikely(destroy_compound_page(page, order))) @@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page, VM_BUG_ON(bad_range(zone, page)); while (order < MAX_ORDER-1) { - unsigned long combined_idx; struct page *buddy; buddy = __page_find_buddy(page, page_idx, order); @@ -481,8 +482,21 @@ static inline void 
__free_one_page(struct page *page, order++; } set_page_order(page, order); - list_add(&page->lru, - &zone->free_area[order].free_list[migratetype]); + + if (order < MAX_ORDER-1) { + struct page *parent_page, *ppage_buddy; + combined_idx = __find_combined_index(page_idx, order); + parent_page = page + combined_idx - page_idx; + ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1); + high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1); + } + + if (high_order_free) + list_add(&page->lru, + &zone->free_area[order].free_list_1[migratetype]); + else + list_add(&page->lru, + &zone->free_area[order].free_list_0[migratetype]); zone->free_area[order].nr_free++; } @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page, high--; size >>= 1; VM_BUG_ON(bad_range(zone, &page[size])); - list_add(&page[size].lru, &area->free_list[migratetype]); + list_add(&page[size].lru, &area->free_list_1[migratetype]); area->nr_free++; set_page_order(&page[size], high); } @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, /* Find a page of the appropriate size in the preferred list */ for (current_order = order; current_order < MAX_ORDER; ++current_order) { + bool fl0, fl1; area = &(zone->free_area[current_order]); - if (list_empty(&area->free_list[migratetype])) + fl0 = list_empty(&area->free_list_0[migratetype]); + fl1 = list_empty(&area->free_list_1[migratetype]); + if (fl0 && fl1) continue; - page = list_entry(area->free_list[migratetype].next, - struct page, lru); + if (fl0) + page = list_entry(area->free_list_1[migratetype].next, + struct page, lru); + else + page = list_entry(area->free_list_0[migratetype].next, + struct page, lru); list_del(&page->lru); rmv_page_order(page); area->nr_free--; @@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone, order = page_order(page); list_del(&page->lru); list_add(&page->lru, - &zone->free_area[order].free_list[migratetype]); + &zone->free_area[order].free_list_0[migratetype]); page += 1 << order; pages_moved += 1 << order; } @@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) for (current_order = MAX_ORDER-1; current_order >= order; --current_order) { for (i = 0; i < MIGRATE_TYPES - 1; i++) { + bool fl0, fl1; migratetype = fallbacks[start_migratetype][i]; /* MIGRATE_RESERVE handled later if necessary */ @@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) continue; area = &(zone->free_area[current_order]); - if (list_empty(&area->free_list[migratetype])) + + + fl0 = list_empty(&area->free_list_0[migratetype]); + fl1 = list_empty(&area->free_list_1[migratetype]); + + if (fl0 && fl1) continue; - page = list_entry(area->free_list[migratetype].next, - struct page, lru); + if (fl0) + page = list_entry(area->free_list_1[migratetype].next, + struct page, lru); + else + page = list_entry(area->free_list_0[migratetype].next, + struct page, lru); area->nr_free--; /* @@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone) } for_each_migratetype_order(order, t) { - list_for_each(curr, &zone->free_area[order].free_list[t]) { + list_for_each(curr, &zone->free_area[order].free_list_0[t]) { + unsigned long i; + + pfn = page_to_pfn(list_entry(curr, struct page, lru)); + for (i = 0; i < (1UL << order); i++) + swsusp_set_page_free(pfn_to_page(pfn + i)); + } + list_for_each(curr, &zone->free_area[order].free_list_1[t]) { unsigned long i; pfn = page_to_pfn(list_entry(curr, struct page, lru)); @@ 
-2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone) { int order, t; for_each_migratetype_order(order, t) { - INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); + INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]); + INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]); zone->free_area[order].nr_free = 0; } } diff --git a/mm/vmstat.c b/mm/vmstat.c index c81321f..613ef1e 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m, area = &(zone->free_area[order]); - list_for_each(curr, &area->free_list[mtype]) + list_for_each(curr, &area->free_list_0[mtype]) + freecount++; + list_for_each(curr, &area->free_list_1[mtype]) freecount++; seq_printf(m, "%6lu ", freecount); } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
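The fragmentation metric defined above (free fragments divided by free pages, per zone, computed from /proc/buddyinfo) can also be collected with a small C monitor; the sketch below mirrors the perl script's arithmetic, while the /proc/buddyinfo parsing details are assumptions about the usual one-line-per-zone format.

/*
 * Rough C equivalent of the perl monitor above.  Same definition of
 * fragmentation: (number of free blocks) / (number of free pages) per
 * zone, where a free block of order k accounts for 2^k pages.  Values
 * near 1 mean free memory is almost all order-0; a value above 2^-k
 * means an order-k allocation is not guaranteed to succeed.
 */
#include <stdio.h>

int main(void)
{
	char line[512];
	FILE *f = fopen("/proc/buddyinfo", "r");

	if (!f) {
		perror("/proc/buddyinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		int node, off, used;
		char zone[32];
		unsigned long count, weight = 1;	/* 2^order pages per block */
		double blocks = 0.0, pages = 0.0;
		const char *p;

		if (sscanf(line, "Node %d, zone %31s%n", &node, zone, &off) < 2)
			continue;
		p = line + off;
		while (sscanf(p, "%lu%n", &count, &used) == 1) {
			blocks += count;
			pages += count * weight;
			weight <<= 1;
			p += used;
		}
		if (pages > 0)
			printf("node %d zone %-8s frag %.3f\n",
			       node, zone, blocks / pages);
	}
	fclose(f);
	return 0;
}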
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-29 15:11 ` Corrado Zoccolo @ 2009-11-30 12:04 ` Mel Gorman 2009-11-30 12:54 ` Corrado Zoccolo 0 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2009-11-30 12:04 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Sun, Nov 29, 2009 at 04:11:15PM +0100, Corrado Zoccolo wrote: > On Fri, Nov 27, 2009 19:52:34, Mel Gorman wrote: > : > On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote: > > > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote: > > > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > > > > > > How would one go about selecting the proper ratio at which to disable > > > > the low_latency logic? > > > > > > Can we measure the dirty ratio when the allocation failures start to > > > happen? > > > > Would the number of dirty pages in the page allocation failure message to > > kern.log be enough? You won't get them all because of printk suppress but > > it's something. Alternatively, tell me exactly what stats from /proc you > > want and I'll stick a monitor on there. Assuming you want nr_dirty vs total > > number of pages though, the monitor tends to execute too late to be useful. > > > Since I wanted to go deeper in the understanding, but my system is healty, > I devised a measure of fragmentation, and wanted to chart it to understand > what was going wrong. A perl script that produces gnuplot compatible output is provided: > > use strict; > select(STDOUT); > $|=1; > do { > open (my $bf, "< /proc/buddyinfo") or die; > open (my $up, "< /proc/uptime") or die; > my $now = <$up>; > chomp $now; > print $now; > while(<$bf>) { > next unless /Node (\d+), zone\s+([a-zA-Z]+)\s+(.+)$/; > my ($frag, $tot, $val) = (0,0,1); > map { $frag += $_; $tot += $val * $_; $val <<= 1;} ($3 =~ /\d+/g); > print "\t", $frag/$tot; > } > print "\n"; > sleep 1; > } while(1); > > My definition of fragmentation is just the number of fragments / the number of pages: > * It is 1 only when all pages are of order 0 > * it is 2/3 on a random marking of used pages (each page has probability 0.5 of being used) > * to be sure that a order k allocation succeeds, the fragmentation should be <= 2^-k > In practice, the ordering of page allocations and frees are not random but it's ok for the purposes here. Also when considering fragmentation, I'd take into account the order of the desired allocation as fragmentations at or over that size are not contributing to fragmentation in a negative way. I'd usually express it in terms of free pages instead of total pages as well to avoid large fluctuations when reclaim is working. We can work with this measure for the moment though to avoid getting side-tracked on what fragmentation is. > I observed the mainline kernel during normal usage, and found that: > * the fragmentation is very low after boot (< 1%). > * it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations). > * high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment. All three of these observations are expected. 
> * when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8). Again, this is expected. Page cache pages stay resident until reclaimed. If they are clean, they are not really contributing to fragmentation in any way that matters as they should be quickly found and discarded in most cases. In the networking case, it's depending on kswapd to find and reclaim the pages fast enough. > * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high. > > > > Two major differences. 1, the previous non-high-order tests had also > > run sysbench and iozone so the starting conditions are different. I had > > disabled those tests to get some of the high-order figures before I went > > offline. However, the starting conditions are probably not as important as > > the fact that kswapd is working to free order-2 pages and staying awake > > until watermarks are reached. kswapd working harder is probably making a > > big difference. > > > > From my observation, having run a program that fills page cache before a test has a lot of impact to the fragmentation. While this is true, during the course of the test, the old page cache should be discarded quickly. It's not as abrupt as dropping the page cache but the end result should be similar in the majority of cases - the exception being when atomic allocations are a major factor. > We (block layer guys) tend to do a sync & drop cache before starting any test, so this can explain why our optimizations work best when machine has plenty of free memory. > On the other hand, machines with plenty of memory should be the norm now, even for desktops. > Even large memory machines will eventually use the bulk of their memory on old page cache. There is no problem with this as such. > > > > I made a mistake in the script that was generating the summary. I neglected > > to take into account printk rate suppressions. When they are taken into > > account, the first round of figures look like > > > > desktop-net-gitk > > high-with low-latency low-latency > > high-without low-latency block-2.6.33 async-rampup > > low-latency min 861.03 ( 0.00%) 467.83 (45.67%) 1185.51 > > (-37.69%) 303.43 (64.76%) mean 866.60 ( 0.00%) 616.28 > > (28.89%) 1201.82 (-38.68%) 459.69 (46.96%) stddev 4.39 ( > > 0.00%) 86.90 (-1877.46%) 23.63 (-437.75%) 92.75 (-2010.76%) max > > 872.56 ( 0.00%) 679.36 (22.14%) 1242.63 (-42.41%) 537.31 > > (38.42%) pgalloc-fail 65 ( 0.00%) 10 (84.62%) 293 > > (-350.77%) 20 (69.23%) > > > > So the async-rampup is getting smacked very hard with allocation failures > > in the high-order case. With the three additional applied for allocation > > failures, the figures look like > > > > desktop-net-gitk > > atomics-with low-latency low-latency > > atomics-without low-latency block-2.6.33 async-rampup > > low-latency min 641.12 ( 0.00%) 627.91 ( 2.06%) 1254.75 > > (-95.71%) 375.05 (41.50%) mean 743.61 ( 0.00%) 631.20 > > (15.12%) 1272.70 (-71.15%) 389.71 (47.59%) stddev 60.30 ( > > 0.00%) 2.53 (95.80%) 10.64 (82.35%) 22.38 (62.89%) max > > 793.85 ( 0.00%) 633.76 (20.17%) 1281.65 (-61.45%) 428.41 (46.03%) > > pgalloc-fail 3 ( 0.00%) 2 ( 0.00%) 27 ( 0.00%) 0 > > ( 0.00%) > > > > So again, async-rampup is getting smacked in terms of allocation failures > > although the three additional patches help a lot. 
This is a real pity > > because it looked nice in the tests involving no high-order allocations for > > the network. > > Ok. Forget that patch for now. Maybe we can test it with 2.6.33 to see if it fits. Sounds reasonable. > On the other hand, I saw that the problems with high order allocations started > around 2.6.31, where we didn't have any low_latency patch. While this is true, there appear to be many sources of the high order allocation failures. While low_latency is not the original source, it does not appear to have helped either. Even without high-order allocations being involved, disabling low_latency performs much better in low-memory situations. > So I don't think the > solution to the problem is in the block layer. A slightly slower or faster writeback > shouldn't cause a DoS like situation as the one encountered with your network driver. > > > > Moreover, it will improve some workloads, but penalize others. > > > > It really does appear to hurt a lot when the machine is kinda low on > > memory though. That is a fairly common situation with a desktop loaded > > up with random apps. Well..... by common, I mean I hit that situation a > > lot on my laptop. I don't hit it on server workloads because I make sure > > the machines are not overloaded. > > This is why we have it as a tunable. If your workload is negatively affected, > you can switch it off. True, although it's hard to spot. > But make sure to test it thoroughly, because even if > you found a 2x slowdown in a particular circumstance, it can gain 10x > speedup (see http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html) > in others. > Ok. > > > > > Your 3 patches, though, seem to improve the situation also for > > > low_latency enabled, both for performance and allocation failures (25 > > > to 3). Having those 3 patches with low_latency enabled seems better, > > > since it won't penalize the workloads that are benefited by > > > low_latency (if you add a sequential read to your test, you should see > > > a big difference). > > > > This is true and I would like to see them merged. However, this close to > > release, with Jens unhappiness with the explanation of why > > congestion_wait() changes made a difference and Andrew feeling there > > wasn't enough cause to merge them, I'm doubtful it'll happen. Will see > > Monday what the story is. > > After a 1day study of the VM, I found an other way to improve the fragmentation. > With the patch below, the fragmentation stays below 2/3 even when memory pressure is high, > and decreases overtime, if the system is lightly used, even without dropping caches. > Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order > allocations are usually serviced by the other zones (more likely than with mainline allocator). > > The idea is to have 2 freelists for each zone. > The free_list_0 has the pages that are less likely to cause an higher-order merge, since the buddy of their compound is not free. > The free_list_1 contains the other ones. > When expanding, we put pages into free_list_1.When freeing, we put them in the proper one by checking the buddy of the compound. > And when extracting, we always extract from free_list_0 first, This is subtle, but as well as increased overhead in the page allocator, I'd expect this to break the page-ordering when a caller is allocation many numbers of order-0 pages. 
Some IO controllers get a boost by the pages coming back in physically contiguous order which happens if a high-order page is being split towards the beginning of the stream of requests. Previous attempts at altering how coalescing and splitting to reduce fragmentation with methods similar to yours have fallen foul of this. > and fall back on the other if the first is empty. > In this way, we keep free longer the pages that are more likely to cause a big merge. > Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation. > > It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look. > > Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > index 6f75617..6427361 100644 > --- a/include/linux/mmzone.h > +++ b/include/linux/mmzone.h > @@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page) > } > > struct free_area { > - struct list_head free_list[MIGRATE_TYPES]; > + struct list_head free_list_0[MIGRATE_TYPES]; > + struct list_head free_list_1[MIGRATE_TYPES]; > unsigned long nr_free; > }; > > diff --git a/kernel/kexec.c b/kernel/kexec.c > index f336e21..aee5ef5 100644 > --- a/kernel/kexec.c > +++ b/kernel/kexec.c > @@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void) > VMCOREINFO_OFFSET(zone, free_area); > VMCOREINFO_OFFSET(zone, vm_stat); > VMCOREINFO_OFFSET(zone, spanned_pages); > - VMCOREINFO_OFFSET(free_area, free_list); > + VMCOREINFO_OFFSET(free_area, free_list_0); > + VMCOREINFO_OFFSET(free_area, free_list_1); > VMCOREINFO_OFFSET(list_head, next); > VMCOREINFO_OFFSET(list_head, prev); > VMCOREINFO_OFFSET(vm_struct, addr); > VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER); > log_buf_kexec_setup(); > - VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES); > + VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES); > + VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES); > VMCOREINFO_NUMBER(NR_FREE_PAGES); > VMCOREINFO_NUMBER(PG_lru); > VMCOREINFO_NUMBER(PG_private); > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index cdcedf6..5f488d8 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page, > int migratetype) > { > unsigned long page_idx; > + unsigned long combined_idx; > + bool high_order_free = false; > > if (unlikely(PageCompound(page))) > if (unlikely(destroy_compound_page(page, order))) > @@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page, > VM_BUG_ON(bad_range(zone, page)); > > while (order < MAX_ORDER-1) { > - unsigned long combined_idx; > struct page *buddy; > > buddy = __page_find_buddy(page, page_idx, order); > @@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page, > order++; > } > set_page_order(page, order); > - list_add(&page->lru, > - &zone->free_area[order].free_list[migratetype]); > + > + if (order < MAX_ORDER-1) { > + struct page *parent_page, *ppage_buddy; > + combined_idx = __find_combined_index(page_idx, order); > + parent_page = page + combined_idx - page_idx; parent_page is a bad name here. It's not the parent of anything. What I think you're looking for is the lowest page of the pair of buddies that was last considered for merging. 
> + ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1); > + high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1); > + } And you are checking if when one buddy of this pair frees, will it then be merged with the next-highest order. If so, you want to delay reusing that page for allocation. > + > + if (high_order_free) > + list_add(&page->lru, > + &zone->free_area[order].free_list_1[migratetype]); > + else > + list_add(&page->lru, > + &zone->free_area[order].free_list_0[migratetype]); You could have avoided the extra list to some extent by altering whether it was the head or tail of the list the page was added to. It would have had a similar effect of the page not being used for longer with slightly less overhead. > zone->free_area[order].nr_free++; > } > > @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page, > high--; > size >>= 1; > VM_BUG_ON(bad_range(zone, &page[size])); > - list_add(&page[size].lru, &area->free_list[migratetype]); > + list_add(&page[size].lru, &area->free_list_1[migratetype]); I think this here will damage the contiguous ordering of pages being returned to callers. > area->nr_free++; > set_page_order(&page[size], high); > } > @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, > > /* Find a page of the appropriate size in the preferred list */ > for (current_order = order; current_order < MAX_ORDER; ++current_order) { > + bool fl0, fl1; > area = &(zone->free_area[current_order]); > - if (list_empty(&area->free_list[migratetype])) > + fl0 = list_empty(&area->free_list_0[migratetype]); > + fl1 = list_empty(&area->free_list_1[migratetype]); > + if (fl0 && fl1) > continue; > > - page = list_entry(area->free_list[migratetype].next, > - struct page, lru); > + if (fl0) > + page = list_entry(area->free_list_1[migratetype].next, > + struct page, lru); > + else > + page = list_entry(area->free_list_0[migratetype].next, > + struct page, lru); By altering whether it's the head or tail free pages are added to, you can achieve a similar effect. 
> list_del(&page->lru); > rmv_page_order(page); > area->nr_free--; > @@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone, > order = page_order(page); > list_del(&page->lru); > list_add(&page->lru, > - &zone->free_area[order].free_list[migratetype]); > + &zone->free_area[order].free_list_0[migratetype]); > page += 1 << order; > pages_moved += 1 << order; > } > @@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > for (current_order = MAX_ORDER-1; current_order >= order; > --current_order) { > for (i = 0; i < MIGRATE_TYPES - 1; i++) { > + bool fl0, fl1; > migratetype = fallbacks[start_migratetype][i]; > > /* MIGRATE_RESERVE handled later if necessary */ > @@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > continue; > > area = &(zone->free_area[current_order]); > - if (list_empty(&area->free_list[migratetype])) > + > + > + fl0 = list_empty(&area->free_list_0[migratetype]); > + fl1 = list_empty(&area->free_list_1[migratetype]); > + > + if (fl0 && fl1) > continue; > > - page = list_entry(area->free_list[migratetype].next, > - struct page, lru); > + if (fl0) > + page = list_entry(area->free_list_1[migratetype].next, > + struct page, lru); > + else > + page = list_entry(area->free_list_0[migratetype].next, > + struct page, lru); > area->nr_free--; > > /* > @@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone) > } > > for_each_migratetype_order(order, t) { > - list_for_each(curr, &zone->free_area[order].free_list[t]) { > + list_for_each(curr, &zone->free_area[order].free_list_0[t]) { > + unsigned long i; > + > + pfn = page_to_pfn(list_entry(curr, struct page, lru)); > + for (i = 0; i < (1UL << order); i++) > + swsusp_set_page_free(pfn_to_page(pfn + i)); > + } > + list_for_each(curr, &zone->free_area[order].free_list_1[t]) { > unsigned long i; > > pfn = page_to_pfn(list_entry(curr, struct page, lru)); > @@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone) > { > int order, t; > for_each_migratetype_order(order, t) { > - INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); > + INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]); > + INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]); > zone->free_area[order].nr_free = 0; > } > } > diff --git a/mm/vmstat.c b/mm/vmstat.c > index c81321f..613ef1e 100644 > --- a/mm/vmstat.c > +++ b/mm/vmstat.c > @@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m, > > area = &(zone->free_area[order]); > > - list_for_each(curr, &area->free_list[mtype]) > + list_for_each(curr, &area->free_list_0[mtype]) > + freecount++; > + list_for_each(curr, &area->free_list_1[mtype]) > freecount++; > seq_printf(m, "%6lu ", freecount); > } No more than the low_latency switch, I think this will help some workloads in terms of fragmentation but hurt others that depend on the ordering of pages being returned. There is a fair amount of overhead introduced here as well with branches and a lot of extra lists although I believe that could be mitigated. What are the results if you just alter whether it's the head or tail of the list that is used in __free_one_page()? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
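For reference, the review comments above hinge on how the buddy allocator derives a block's buddy and the combined block it would merge into. The minimal userspace sketch below assumes the 2.6.32-era helpers __page_find_buddy() and __find_combined_index() reduce to an XOR and a mask of the page index; the function names here are illustrative stand-ins, not the kernel code itself.

#include <stdio.h>

/* Stand-ins for the kernel helpers: at a given order, a block's buddy
 * index differs only in bit 'order', and the merged (combined) block
 * starts at the index with that bit cleared. */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
	return page_idx ^ (1UL << order);
}

static unsigned long combined_index(unsigned long page_idx, unsigned int order)
{
	return page_idx & ~(1UL << order);
}

int main(void)
{
	unsigned long page_idx = 12;	/* freeing an order-2 block at index 12 */
	unsigned int order = 2;

	unsigned long buddy = buddy_index(page_idx, order);	   /* 8 */
	unsigned long combined = combined_index(page_idx, order); /* 8 */

	/* The extra test in the patch: does the order-(order+1) block that
	 * starts at 'combined' itself have a free buddy one order up?  If so,
	 * this page is likely to take part in a higher-order merge later and
	 * is kept out of circulation (free_list_1, or tail insertion). */
	printf("order-%u block %lu: buddy %lu, combines into %lu, whose order-%u buddy is %lu\n",
	       order, page_idx, buddy, combined, order + 1,
	       buddy_index(combined, order + 1));
	return 0;
}

In this example, the freed order-2 block at index 12 would be parked on the reuse-last list only if the order-3 block starting at index 0 is currently free.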
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-30 12:04 ` Mel Gorman @ 2009-11-30 12:54 ` Corrado Zoccolo 2009-11-30 15:48 ` Mel Gorman 0 siblings, 1 reply; 23+ messages in thread From: Corrado Zoccolo @ 2009-11-30 12:54 UTC (permalink / raw) To: Mel Gorman Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Mon, Nov 30, 2009 at 1:04 PM, Mel Gorman <mel@csn.ul.ie> wrote: > On Sun, Nov 29, 2009 at 04:11:15PM +0100, Corrado Zoccolo wrote: >> On Fri, Nov 27, 2009 19:52:34, Mel Gorman wrote: >> : > On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote: >> > > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <mel@csn.ul.ie> wrote: >> > > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote: >> > > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <mel@csn.ul.ie> wrote: >> > > > >> > > > How would one go about selecting the proper ratio at which to disable >> > > > the low_latency logic? >> > > >> > > Can we measure the dirty ratio when the allocation failures start to >> > > happen? >> > >> > Would the number of dirty pages in the page allocation failure message to >> > kern.log be enough? You won't get them all because of printk suppress but >> > it's something. Alternatively, tell me exactly what stats from /proc you >> > want and I'll stick a monitor on there. Assuming you want nr_dirty vs total >> > number of pages though, the monitor tends to execute too late to be useful. >> > >> Since I wanted to go deeper in the understanding, but my system is healty, >> I devised a measure of fragmentation, and wanted to chart it to understand >> what was going wrong. A perl script that produces gnuplot compatible output is provided: >> >> use strict; >> select(STDOUT); >> $|=1; >> do { >> open (my $bf, "< /proc/buddyinfo") or die; >> open (my $up, "< /proc/uptime") or die; >> my $now = <$up>; >> chomp $now; >> print $now; >> while(<$bf>) { >> next unless /Node (\d+), zone\s+([a-zA-Z]+)\s+(.+)$/; >> my ($frag, $tot, $val) = (0,0,1); >> map { $frag += $_; $tot += $val * $_; $val <<= 1;} ($3 =~ /\d+/g); >> print "\t", $frag/$tot; >> } >> print "\n"; >> sleep 1; >> } while(1); >> >> My definition of fragmentation is just the number of fragments / the number of pages: >> * It is 1 only when all pages are of order 0 >> * it is 2/3 on a random marking of used pages (each page has probability 0.5 of being used) >> * to be sure that a order k allocation succeeds, the fragmentation should be <= 2^-k >> > > In practice, the ordering of page allocations and frees are not random > but it's ok for the purposes here. > > Also when considering fragmentation, I'd take into account the order of the > desired allocation as fragmentations at or over that size are not contributing > to fragmentation in a negative way. I'd usually express it in terms of free > pages instead of total pages as well to avoid large fluctuations when reclaim > is working. We can work with this measure for the moment though to avoid > getting side-tracked on what fragmentation is. > >> I observed the mainline kernel during normal usage, and found that: >> * the fragmentation is very low after boot (< 1%). >> * it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations). 
>> * high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment. > > All three of these observations are expected. > >> * when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8). > > Again, this is expected. Page cache pages stay resident until > reclaimed. If they are clean, they are not really contributing to > fragmentation in any way that matters as they should be quickly found > and discarded in most cases. In the networking case, it's depending on > kswapd to find and reclaim the pages fast enough. If you need an order 5 page, how would kswapd work? Will it free randomly some order 0 pages until a merge magically happens? Unless the dirty ratio is really high, there should already be plenty of contiguous non-dirty pages in the page cache that could be freed, but if you use an LRU policy to evict, you can go through a lot of freeing before a merge will happen. >> * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high. >> > >> > Two major differences. 1, the previous non-high-order tests had also >> > run sysbench and iozone so the starting conditions are different. I had >> > disabled those tests to get some of the high-order figures before I went >> > offline. However, the starting conditions are probably not as important as >> > the fact that kswapd is working to free order-2 pages and staying awake >> > until watermarks are reached. kswapd working harder is probably making a >> > big difference. >> > >> >> From my observation, having run a program that fills page cache before a test has a lot of impact to the fragmentation. > > While this is true, during the course of the test, the old page cache > should be discarded quickly. It's not as abrupt as dropping the page > cache but the end result should be similar in the majority of cases - > the exception being when atomic allocations are a major factor. For my I/O scheduler tests I use an external disk, to be able to monitor exactly what is happening. If I don't do a sync & drop cache before starting a test, I usually see writeback happening on the main disk, even if the only activity on the machine is writing a sequential file to my external disk. If that writeback is done in the context of my test process, this will alter the result. And with high order allocations, depending on how do you free page cache, it can be even worse than that. > >> On the other hand, I saw that the problems with high order allocations started >> around 2.6.31, where we didn't have any low_latency patch. > > While this is true, there appear to be many sources of the high order > allocation failures. While low_latency is not the original source, it > does not appear to have helped either. Even without high-order > allocations being involved, disabling low_latency performs much better > in low-memory situations. Can you try reproducing: http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html in a low memory scenario, to substantiate your claim? >> After a 1day study of the VM, I found an other way to improve the fragmentation. >> With the patch below, the fragmentation stays below 2/3 even when memory pressure is high, >> and decreases overtime, if the system is lightly used, even without dropping caches. 
>> Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order >> allocations are usually serviced by the other zones (more likely than with mainline allocator). >> >> The idea is to have 2 freelists for each zone. >> The free_list_0 has the pages that are less likely to cause an higher-order merge, since the buddy of their compound is not free. >> The free_list_1 contains the other ones. >> When expanding, we put pages into free_list_1.When freeing, we put them in the proper one by checking the buddy of the compound. >> And when extracting, we always extract from free_list_0 first, > > This is subtle, but as well as increased overhead in the page allocator, I'd > expect this to break the page-ordering when a caller is allocation many numbers > of order-0 pages. Some IO controllers get a boost by the pages coming back > in physically contiguous order which happens if a high-order page is being > split towards the beginning of the stream of requests. Previous attempts at > altering how coalescing and splitting to reduce fragmentation with methods > similar to yours have fallen foul of this. I took extreme care in not disrupting the page ordering. In fact, I thought, too, to a single list solution, but it could cause page reordering (since I would have used add_tail to add to the other list). > >> and fall back on the other if the first is empty. >> In this way, we keep free longer the pages that are more likely to cause a big merge. >> Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation. >> >> It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look. >> >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> >> >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h >> index 6f75617..6427361 100644 >> --- a/include/linux/mmzone.h >> +++ b/include/linux/mmzone.h >> @@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page) >> } >> >> struct free_area { >> - struct list_head free_list[MIGRATE_TYPES]; >> + struct list_head free_list_0[MIGRATE_TYPES]; >> + struct list_head free_list_1[MIGRATE_TYPES]; >> unsigned long nr_free; >> }; >> >> diff --git a/kernel/kexec.c b/kernel/kexec.c >> index f336e21..aee5ef5 100644 >> --- a/kernel/kexec.c >> +++ b/kernel/kexec.c >> @@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void) >> VMCOREINFO_OFFSET(zone, free_area); >> VMCOREINFO_OFFSET(zone, vm_stat); >> VMCOREINFO_OFFSET(zone, spanned_pages); >> - VMCOREINFO_OFFSET(free_area, free_list); >> + VMCOREINFO_OFFSET(free_area, free_list_0); >> + VMCOREINFO_OFFSET(free_area, free_list_1); >> VMCOREINFO_OFFSET(list_head, next); >> VMCOREINFO_OFFSET(list_head, prev); >> VMCOREINFO_OFFSET(vm_struct, addr); >> VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER); >> log_buf_kexec_setup(); >> - VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES); >> + VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES); >> + VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES); >> VMCOREINFO_NUMBER(NR_FREE_PAGES); >> VMCOREINFO_NUMBER(PG_lru); >> VMCOREINFO_NUMBER(PG_private); >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index cdcedf6..5f488d8 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page, >> int migratetype) >> { >> unsigned long page_idx; >> + unsigned long combined_idx; >> + bool high_order_free = false; >> >> if 
(unlikely(PageCompound(page))) >> if (unlikely(destroy_compound_page(page, order))) >> @@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page, >> VM_BUG_ON(bad_range(zone, page)); >> >> while (order < MAX_ORDER-1) { >> - unsigned long combined_idx; >> struct page *buddy; >> >> buddy = __page_find_buddy(page, page_idx, order); >> @@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page, >> order++; >> } >> set_page_order(page, order); >> - list_add(&page->lru, >> - &zone->free_area[order].free_list[migratetype]); >> + >> + if (order < MAX_ORDER-1) { >> + struct page *parent_page, *ppage_buddy; >> + combined_idx = __find_combined_index(page_idx, order); >> + parent_page = page + combined_idx - page_idx; > > parent_page is a bad name here. It's not the parent of anything. What I > think you're looking for is the lowest page of the pair of buddies that > was last considered for merging. Right, this should be the combined page, to keep naming consistent with combined_idx. > >> + ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1); >> + high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1); >> + } > > And you are checking if when one buddy of this pair frees, will it then > be merged with the next-highest order. If so, you want to delay reusing > that page for allocation. Exactly. If you have two streams of allocations, with different average lifetime (and with the long lifetime allocations having a slower rate), this will make very probable that the long lifetime allocations span a smaller set of compounds. > >> + >> + if (high_order_free) >> + list_add(&page->lru, >> + &zone->free_area[order].free_list_1[migratetype]); >> + else >> + list_add(&page->lru, >> + &zone->free_area[order].free_list_0[migratetype]); > > You could have avoided the extra list to some extent by altering whether > it was the head or tail of the list the page was added to. It would have > had a similar effect of the page not being used for longer with slightly > less overhead. Right, but the order of insertions at the tail would be reversed. >> zone->free_area[order].nr_free++; >> } >> >> @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page, >> high--; >> size >>= 1; >> VM_BUG_ON(bad_range(zone, &page[size])); >> - list_add(&page[size].lru, &area->free_list[migratetype]); >> + list_add(&page[size].lru, &area->free_list_1[migratetype]); > > I think this here will damage the contiguous ordering of pages being > returned to callers. This shouldn't damage the order. In fact, expand always inserts in the free_list_1, in the same order as the original code inserted in the free_list. And if we hit expand, then the free_list_0 is empty, so all allocations will be serviced from free_list_1 in the same order as the original code. 
> >> area->nr_free++; >> set_page_order(&page[size], high); >> } >> @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, >> >> /* Find a page of the appropriate size in the preferred list */ >> for (current_order = order; current_order < MAX_ORDER; ++current_order) { >> + bool fl0, fl1; >> area = &(zone->free_area[current_order]); >> - if (list_empty(&area->free_list[migratetype])) >> + fl0 = list_empty(&area->free_list_0[migratetype]); >> + fl1 = list_empty(&area->free_list_1[migratetype]); >> + if (fl0 && fl1) >> continue; >> >> - page = list_entry(area->free_list[migratetype].next, >> - struct page, lru); >> + if (fl0) >> + page = list_entry(area->free_list_1[migratetype].next, >> + struct page, lru); >> + else >> + page = list_entry(area->free_list_0[migratetype].next, >> + struct page, lru); > > By altering whether it's the head or tail free pages are added to, you > can achieve a similar effect. > >> list_del(&page->lru); >> rmv_page_order(page); >> area->nr_free--; >> @@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone, >> order = page_order(page); >> list_del(&page->lru); >> list_add(&page->lru, >> - &zone->free_area[order].free_list[migratetype]); >> + &zone->free_area[order].free_list_0[migratetype]); >> page += 1 << order; >> pages_moved += 1 << order; >> } >> @@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) >> for (current_order = MAX_ORDER-1; current_order >= order; >> --current_order) { >> for (i = 0; i < MIGRATE_TYPES - 1; i++) { >> + bool fl0, fl1; >> migratetype = fallbacks[start_migratetype][i]; >> >> /* MIGRATE_RESERVE handled later if necessary */ >> @@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) >> continue; >> >> area = &(zone->free_area[current_order]); >> - if (list_empty(&area->free_list[migratetype])) >> + >> + >> + fl0 = list_empty(&area->free_list_0[migratetype]); >> + fl1 = list_empty(&area->free_list_1[migratetype]); >> + >> + if (fl0 && fl1) >> continue; >> >> - page = list_entry(area->free_list[migratetype].next, >> - struct page, lru); >> + if (fl0) >> + page = list_entry(area->free_list_1[migratetype].next, >> + struct page, lru); >> + else >> + page = list_entry(area->free_list_0[migratetype].next, >> + struct page, lru); >> area->nr_free--; >> >> /* >> @@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone) >> } >> >> for_each_migratetype_order(order, t) { >> - list_for_each(curr, &zone->free_area[order].free_list[t]) { >> + list_for_each(curr, &zone->free_area[order].free_list_0[t]) { >> + unsigned long i; >> + >> + pfn = page_to_pfn(list_entry(curr, struct page, lru)); >> + for (i = 0; i < (1UL << order); i++) >> + swsusp_set_page_free(pfn_to_page(pfn + i)); >> + } >> + list_for_each(curr, &zone->free_area[order].free_list_1[t]) { >> unsigned long i; >> >> pfn = page_to_pfn(list_entry(curr, struct page, lru)); >> @@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone) >> { >> int order, t; >> for_each_migratetype_order(order, t) { >> - INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); >> + INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]); >> + INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]); >> zone->free_area[order].nr_free = 0; >> } >> } >> diff --git a/mm/vmstat.c b/mm/vmstat.c >> index c81321f..613ef1e 100644 >> --- a/mm/vmstat.c >> +++ b/mm/vmstat.c >> @@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m, >> >> area = 
&(zone->free_area[order]); >> >> - list_for_each(curr, &area->free_list[mtype]) >> + list_for_each(curr, &area->free_list_0[mtype]) >> + freecount++; >> + list_for_each(curr, &area->free_list_1[mtype]) >> freecount++; >> seq_printf(m, "%6lu ", freecount); >> } > > No more than the low_latency switch, I think this will help some > workloads in terms of fragmentation but hurt others that depend on the > ordering of pages being returned. Hopefully not, if my considerations above are correct. > There is a fair amount of overhead > introduced here as well with branches and a lot of extra lists although > I believe that could be mitigated. > > What are the results if you just alter whether it's the head or tail of > the list that is used in __free_one_page()? In that case, it would alter the ordering, but not the one of the pages returned by expand. In fact, only the order of the pages returned by free will be affected, and in that case maybe it is already quite disordered. If that order is not needed to be kept, I can prepare a new version with a single list. BTW, if we only guarantee that pages returned by expand are well ordered, this patch will increase the ordered-ness of the stream of allocated pages, since it will increase the probability that allocations go into expand (since frees will more likely create high order combined pages). So it will also improve the workloads that prefer ordered allocations. > > -- > Mel Gorman > Part-time Phd Student Linux Technology Center > University of Limerick IBM Dublin Software Lab > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
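As an aside, the fragmentation measure that both mails refer to (computed by the Perl script earlier in the thread) can be stated compactly as below. This restatement is not from the thread itself and ignores per-zone, per-migratetype and watermark constraints; n_k is the count of free blocks of order k reported by /proc/buddyinfo.

% Free fragments per free page, i.e. the reciprocal of the mean free-block
% size in pages; an order-m request is only guaranteed a block when the
% mean free-block size is at least 2^m.
F = \frac{\sum_{k=0}^{\mathrm{MAX\_ORDER}-1} n_k}
         {\sum_{k=0}^{\mathrm{MAX\_ORDER}-1} n_k \, 2^{k}},
\qquad
F \le 2^{-m} \implies \text{at least one free block of order} \ge m .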
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-30 12:54 ` Corrado Zoccolo @ 2009-11-30 15:48 ` Mel Gorman 2009-11-30 17:21 ` Corrado Zoccolo 0 siblings, 1 reply; 23+ messages in thread From: Mel Gorman @ 2009-11-30 15:48 UTC (permalink / raw) To: Corrado Zoccolo Cc: Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Mon, Nov 30, 2009 at 01:54:04PM +0100, Corrado Zoccolo wrote: > On Mon, Nov 30, 2009 at 1:04 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > On Sun, Nov 29, 2009 at 04:11:15PM +0100, Corrado Zoccolo wrote: > >> On Fri, Nov 27, 2009 19:52:34, Mel Gorman wrote: > >> : > On Fri, Nov 27, 2009 at 07:14:41PM +0100, Corrado Zoccolo wrote: > >> > > On Fri, Nov 27, 2009 at 4:58 PM, Mel Gorman <mel@csn.ul.ie> wrote: > >> > > > On Fri, Nov 27, 2009 at 01:03:29PM +0100, Corrado Zoccolo wrote: > >> > > >> On Fri, Nov 27, 2009 at 12:44 PM, Mel Gorman <mel@csn.ul.ie> wrote: > >> > > > > >> > > > How would one go about selecting the proper ratio at which to disable > >> > > > the low_latency logic? > >> > > > >> > > Can we measure the dirty ratio when the allocation failures start to > >> > > happen? > >> > > >> > Would the number of dirty pages in the page allocation failure message to > >> > kern.log be enough? You won't get them all because of printk suppress but > >> > it's something. Alternatively, tell me exactly what stats from /proc you > >> > want and I'll stick a monitor on there. Assuming you want nr_dirty vs total > >> > number of pages though, the monitor tends to execute too late to be useful. > >> > > >> Since I wanted to go deeper in the understanding, but my system is healty, > >> I devised a measure of fragmentation, and wanted to chart it to understand > >> what was going wrong. A perl script that produces gnuplot compatible output is provided: > >> > >> use strict; > >> select(STDOUT); > >> $|=1; > >> do { > >> open (my $bf, "< /proc/buddyinfo") or die; > >> open (my $up, "< /proc/uptime") or die; > >> my $now = <$up>; > >> chomp $now; > >> print $now; > >> while(<$bf>) { > >> next unless /Node (\d+), zone\s+([a-zA-Z]+)\s+(.+)$/; > >> my ($frag, $tot, $val) = (0,0,1); > >> map { $frag += $_; $tot += $val * $_; $val <<= 1;} ($3 =~ /\d+/g); > >> print "\t", $frag/$tot; > >> } > >> print "\n"; > >> sleep 1; > >> } while(1); > >> > >> My definition of fragmentation is just the number of fragments / the number of pages: > >> * It is 1 only when all pages are of order 0 > >> * it is 2/3 on a random marking of used pages (each page has probability 0.5 of being used) > >> * to be sure that a order k allocation succeeds, the fragmentation should be <= 2^-k > >> > > > > In practice, the ordering of page allocations and frees are not random > > but it's ok for the purposes here. > > > > Also when considering fragmentation, I'd take into account the order of the > > desired allocation as fragmentations at or over that size are not contributing > > to fragmentation in a negative way. I'd usually express it in terms of free > > pages instead of total pages as well to avoid large fluctuations when reclaim > > is working. We can work with this measure for the moment though to avoid > > getting side-tracked on what fragmentation is. > > > >> I observed the mainline kernel during normal usage, and found that: > >> * the fragmentation is very low after boot (< 1%). 
> >> * it tends to increase when memory is freed, and to decrease when memory is allocated (since the kernel usually performs order 0 allocations). > >> * high memory fragmentation increases first, and only when all high memory is used, normal memory starts to fragment. > > > > All three of these observations are expected. > > > >> * when the page cache is big enough (so memory pressure is high for the allocator), the fragmentation starts to fluctuate a lot, sometimes exceeding 2/3 (up to 0.8). > > > > Again, this is expected. Page cache pages stay resident until > > reclaimed. If they are clean, they are not really contributing to > > fragmentation in any way that matters as they should be quickly found > > and discarded in most cases. In the networking case, it's depending on > > kswapd to find and reclaim the pages fast enough. > > If you need an order 5 page, how would kswapd work? > Will it free randomly some order 0 pages until a merge magically happens? No, it won't. There is contiguity-aware reclaim logic called "lumpy reclaim" which is used for high-order pages. The next LRU page for reclaiming is a cursor page and the naturally-aligned block of pages around it are also considered for reclaim so that a high-order page gets freed. > Unless the dirty ratio is really high, there should already be plenty > of contiguous non-dirty pages in the page cache that could be freed, > but if you use an LRU policy to evict, you can go through a lot of > freeing before a merge will happen. > Indeed. There is no need to go into details but if it was order-0 pages being reclaimed, an extremely large percentage of memory would have to be freed to get a order-5 page. > >> * the only way to make the fragmentation return to sane values after it enters fluctuation is to do a sync & drop caches. Even in this case, it will go around 14%, that is still quite high. > >> > > >> > Two major differences. 1, the previous non-high-order tests had also > >> > run sysbench and iozone so the starting conditions are different. I had > >> > disabled those tests to get some of the high-order figures before I went > >> > offline. However, the starting conditions are probably not as important as > >> > the fact that kswapd is working to free order-2 pages and staying awake > >> > until watermarks are reached. kswapd working harder is probably making a > >> > big difference. > >> > > >> > >> From my observation, having run a program that fills page cache before a test has a lot of impact to the fragmentation. > > > > While this is true, during the course of the test, the old page cache > > should be discarded quickly. It's not as abrupt as dropping the page > > cache but the end result should be similar in the majority of cases - > > the exception being when atomic allocations are a major factor. > > For my I/O scheduler tests I use an external disk, to be able to > monitor exactly what is happening. > If I don't do a sync & drop cache before starting a test, I usually > see writeback happening on the main disk, even if the only activity on > the machine is writing a sequential file to my external disk. If that > writeback is done in the context of my test process, this will alter > the result. Why does the writeback kick in late? I thought pages were meant to be written back after a contigurable interval of time had passed. > And with high order allocations, depending on how do you free page > cache, it can be even worse than that. 
> > > > >> On the other hand, I saw that the problems with high order allocations started > >> around 2.6.31, where we didn't have any low_latency patch. > > > > While this is true, there appear to be many sources of the high order > > allocation failures. While low_latency is not the original source, it > > does not appear to have helped either. Even without high-order > > allocations being involved, disabling low_latency performs much better > > in low-memory situations. > > Can you try reproducing: > http://lkml.indiana.edu/hypermail/linux/kernel/0911.1/01848.html > in a low memory scenario, to substantiate your claim? > I can try but it'll take a few days to get around to. I'm still trying to identify other sources of the problems from between 2.6.30 and 2.6.32-rc8. It'll be tricky to test what you ask because it might not just be low-memory that is the problem but low memory + enough pressure that processes are stalling waiting on reclaim. > >> After a 1day study of the VM, I found an other way to improve the fragmentation. > >> With the patch below, the fragmentation stays below 2/3 even when memory pressure is high, > >> and decreases overtime, if the system is lightly used, even without dropping caches. > >> Moreover, the precious zones (Normal, DMA) are kept at a lower fragmentation, since high order > >> allocations are usually serviced by the other zones (more likely than with mainline allocator). > >> > >> The idea is to have 2 freelists for each zone. > >> The free_list_0 has the pages that are less likely to cause an higher-order merge, since the buddy of their compound is not free. > >> The free_list_1 contains the other ones. > >> When expanding, we put pages into free_list_1.When freeing, we put them in the proper one by checking the buddy of the compound. > >> And when extracting, we always extract from free_list_0 first, > > > > This is subtle, but as well as increased overhead in the page allocator, I'd > > expect this to break the page-ordering when a caller is allocation many numbers > > of order-0 pages. Some IO controllers get a boost by the pages coming back > > in physically contiguous order which happens if a high-order page is being > > split towards the beginning of the stream of requests. Previous attempts at > > altering how coalescing and splitting to reduce fragmentation with methods > > similar to yours have fallen foul of this. > > I took extreme care in not disrupting the page ordering. In fact, I > thought, too, to a single list solution, but it could cause page > reordering (since I would have used add_tail to add to the other > list). > You're right. this way does preserve the page ordering. > > > >> and fall back on the other if the first is empty. > >> In this way, we keep free longer the pages that are more likely to cause a big merge. > >> Consequently we tend to aggregate the long-living allocations on a subset of the compounds, reducing the fragmentation. > >> > >> It can, though, slow down allocation and reclaim, so someone more knowledgeable than me should have a look. 
> >> > >> Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com> > >> > >> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > >> index 6f75617..6427361 100644 > >> --- a/include/linux/mmzone.h > >> +++ b/include/linux/mmzone.h > >> @@ -55,7 +55,8 @@ static inline int get_pageblock_migratetype(struct page *page) > >> } > >> > >> struct free_area { > >> - struct list_head free_list[MIGRATE_TYPES]; > >> + struct list_head free_list_0[MIGRATE_TYPES]; > >> + struct list_head free_list_1[MIGRATE_TYPES]; > >> unsigned long nr_free; > >> }; > >> > >> diff --git a/kernel/kexec.c b/kernel/kexec.c > >> index f336e21..aee5ef5 100644 > >> --- a/kernel/kexec.c > >> +++ b/kernel/kexec.c > >> @@ -1404,13 +1404,15 @@ static int __init crash_save_vmcoreinfo_init(void) > >> VMCOREINFO_OFFSET(zone, free_area); > >> VMCOREINFO_OFFSET(zone, vm_stat); > >> VMCOREINFO_OFFSET(zone, spanned_pages); > >> - VMCOREINFO_OFFSET(free_area, free_list); > >> + VMCOREINFO_OFFSET(free_area, free_list_0); > >> + VMCOREINFO_OFFSET(free_area, free_list_1); > >> VMCOREINFO_OFFSET(list_head, next); > >> VMCOREINFO_OFFSET(list_head, prev); > >> VMCOREINFO_OFFSET(vm_struct, addr); > >> VMCOREINFO_LENGTH(zone.free_area, MAX_ORDER); > >> log_buf_kexec_setup(); > >> - VMCOREINFO_LENGTH(free_area.free_list, MIGRATE_TYPES); > >> + VMCOREINFO_LENGTH(free_area.free_list_0, MIGRATE_TYPES); > >> + VMCOREINFO_LENGTH(free_area.free_list_1, MIGRATE_TYPES); > >> VMCOREINFO_NUMBER(NR_FREE_PAGES); > >> VMCOREINFO_NUMBER(PG_lru); > >> VMCOREINFO_NUMBER(PG_private); > >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >> index cdcedf6..5f488d8 100644 > >> --- a/mm/page_alloc.c > >> +++ b/mm/page_alloc.c > >> @@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page, > >> int migratetype) > >> { > >> unsigned long page_idx; > >> + unsigned long combined_idx; > >> + bool high_order_free = false; > >> > >> if (unlikely(PageCompound(page))) > >> if (unlikely(destroy_compound_page(page, order))) > >> @@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page, > >> VM_BUG_ON(bad_range(zone, page)); > >> > >> while (order < MAX_ORDER-1) { > >> - unsigned long combined_idx; > >> struct page *buddy; > >> > >> buddy = __page_find_buddy(page, page_idx, order); > >> @@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page, > >> order++; > >> } > >> set_page_order(page, order); > >> - list_add(&page->lru, > >> - &zone->free_area[order].free_list[migratetype]); > >> + > >> + if (order < MAX_ORDER-1) { > >> + struct page *parent_page, *ppage_buddy; > >> + combined_idx = __find_combined_index(page_idx, order); > >> + parent_page = page + combined_idx - page_idx; > > > > parent_page is a bad name here. It's not the parent of anything. What I > > think you're looking for is the lowest page of the pair of buddies that > > was last considered for merging. > > Right, this should be the combined page, to keep naming consistent > with combined_idx. > > > > >> + ppage_buddy = __page_find_buddy(parent_page, combined_idx, order + 1); > >> + high_order_free = page_is_buddy(parent_page, ppage_buddy, order + 1); > >> + } > > > > And you are checking if when one buddy of this pair frees, will it then > > be merged with the next-highest order. If so, you want to delay reusing > > that page for allocation. > > Exactly. 
> If you have two streams of allocations, with different average > lifetime (and with the long lifetime allocations having a slower > rate), this will make very probable that the long lifetime allocations > span a smaller set of compounds. I see the logic. > > > >> + > >> + if (high_order_free) > >> + list_add(&page->lru, > >> + &zone->free_area[order].free_list_1[migratetype]); > >> + else > >> + list_add(&page->lru, > >> + &zone->free_area[order].free_list_0[migratetype]); > > > > You could have avoided the extra list to some extent by altering whether > > it was the head or tail of the list the page was added to. It would have > > had a similar effect of the page not being used for longer with slightly > > less overhead. > > Right, but the order of insertions at the tail would be reversed. > True but maybe it doesn't matter. What's important is that the order the pages are returned during allocation and after a high-order page is split is what is important. > >> zone->free_area[order].nr_free++; > >> } > >> > >> @@ -663,7 +677,7 @@ static inline void expand(struct zone *zone, struct page *page, > >> high--; > >> size >>= 1; > >> VM_BUG_ON(bad_range(zone, &page[size])); > >> - list_add(&page[size].lru, &area->free_list[migratetype]); > >> + list_add(&page[size].lru, &area->free_list_1[migratetype]); > > > > I think this here will damage the contiguous ordering of pages being > > returned to callers. > > This shouldn't damage the order. In fact, expand always inserts in the > free_list_1, in the same order as the original code inserted in the > free_list. And if we hit expand, then the free_list_0 is empty, so all > allocations will be serviced from free_list_1 in the same order as the > original code. > > > > >> area->nr_free++; > >> set_page_order(&page[size], high); > >> } > >> @@ -723,12 +737,19 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, > >> > >> /* Find a page of the appropriate size in the preferred list */ > >> for (current_order = order; current_order < MAX_ORDER; ++current_order) { > >> + bool fl0, fl1; > >> area = &(zone->free_area[current_order]); > >> - if (list_empty(&area->free_list[migratetype])) > >> + fl0 = list_empty(&area->free_list_0[migratetype]); > >> + fl1 = list_empty(&area->free_list_1[migratetype]); > >> + if (fl0 && fl1) > >> continue; > >> > >> - page = list_entry(area->free_list[migratetype].next, > >> - struct page, lru); > >> + if (fl0) > >> + page = list_entry(area->free_list_1[migratetype].next, > >> + struct page, lru); > >> + else > >> + page = list_entry(area->free_list_0[migratetype].next, > >> + struct page, lru); > > > > By altering whether it's the head or tail free pages are added to, you > > can achieve a similar effect. 
> > > >> list_del(&page->lru); > >> rmv_page_order(page); > >> area->nr_free--; > >> @@ -792,7 +813,7 @@ static int move_freepages(struct zone *zone, > >> order = page_order(page); > >> list_del(&page->lru); > >> list_add(&page->lru, > >> - &zone->free_area[order].free_list[migratetype]); > >> + &zone->free_area[order].free_list_0[migratetype]); > >> page += 1 << order; > >> pages_moved += 1 << order; > >> } > >> @@ -845,6 +866,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > >> for (current_order = MAX_ORDER-1; current_order >= order; > >> --current_order) { > >> for (i = 0; i < MIGRATE_TYPES - 1; i++) { > >> + bool fl0, fl1; > >> migratetype = fallbacks[start_migratetype][i]; > >> > >> /* MIGRATE_RESERVE handled later if necessary */ > >> @@ -852,11 +874,20 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype) > >> continue; > >> > >> area = &(zone->free_area[current_order]); > >> - if (list_empty(&area->free_list[migratetype])) > >> + > >> + > >> + fl0 = list_empty(&area->free_list_0[migratetype]); > >> + fl1 = list_empty(&area->free_list_1[migratetype]); > >> + > >> + if (fl0 && fl1) > >> continue; > >> > >> - page = list_entry(area->free_list[migratetype].next, > >> - struct page, lru); > >> + if (fl0) > >> + page = list_entry(area->free_list_1[migratetype].next, > >> + struct page, lru); > >> + else > >> + page = list_entry(area->free_list_0[migratetype].next, > >> + struct page, lru); > >> area->nr_free--; > >> > >> /* > >> @@ -1061,7 +1092,14 @@ void mark_free_pages(struct zone *zone) > >> } > >> > >> for_each_migratetype_order(order, t) { > >> - list_for_each(curr, &zone->free_area[order].free_list[t]) { > >> + list_for_each(curr, &zone->free_area[order].free_list_0[t]) { > >> + unsigned long i; > >> + > >> + pfn = page_to_pfn(list_entry(curr, struct page, lru)); > >> + for (i = 0; i < (1UL << order); i++) > >> + swsusp_set_page_free(pfn_to_page(pfn + i)); > >> + } > >> + list_for_each(curr, &zone->free_area[order].free_list_1[t]) { > >> unsigned long i; > >> > >> pfn = page_to_pfn(list_entry(curr, struct page, lru)); > >> @@ -2993,7 +3031,8 @@ static void __meminit zone_init_free_lists(struct zone *zone) > >> { > >> int order, t; > >> for_each_migratetype_order(order, t) { > >> - INIT_LIST_HEAD(&zone->free_area[order].free_list[t]); > >> + INIT_LIST_HEAD(&zone->free_area[order].free_list_0[t]); > >> + INIT_LIST_HEAD(&zone->free_area[order].free_list_1[t]); > >> zone->free_area[order].nr_free = 0; > >> } > >> } > >> diff --git a/mm/vmstat.c b/mm/vmstat.c > >> index c81321f..613ef1e 100644 > >> --- a/mm/vmstat.c > >> +++ b/mm/vmstat.c > >> @@ -468,7 +468,9 @@ static void pagetypeinfo_showfree_print(struct seq_file *m, > >> > >> area = &(zone->free_area[order]); > >> > >> - list_for_each(curr, &area->free_list[mtype]) > >> + list_for_each(curr, &area->free_list_0[mtype]) > >> + freecount++; > >> + list_for_each(curr, &area->free_list_1[mtype]) > >> freecount++; > >> seq_printf(m, "%6lu ", freecount); > >> } > > > > No more than the low_latency switch, I think this will help some > > workloads in terms of fragmentation but hurt others that depend on the > > ordering of pages being returned. > > Hopefully not, if my considerations above are correct. Right, it doesn't affect the ordering of pages returned. The impact is additional branches and a lot more lists but it's still very interesting. 
> > There is a fair amount of overhead > > introduced here as well with branches and a lot of extra lists although > > I believe that could be mitigated. > > > > What are the results if you just alter whether it's the head or tail of > > the list that is used in __free_one_page()? > > In that case, it would alter the ordering, but not the one of the > pages returned by expand. > In fact, only the order of the pages returned by free will be > affected, and in that case maybe it is already quite disordered. > If that order is not needed to be kept, I can prepare a new version > with a single list. > The ordering of free does not need to be preserved. The important property is that if a high-order page is split by expand() that subsequent allocations use the contiguous pages. > BTW, if we only guarantee that pages returned by expand are well > ordered, this patch will increase the ordered-ness of the stream of > allocated pages, since it will increase the probability that > allocations go into expand (since frees will more likely create high > order combined pages). So it will also improve the workloads that > prefer ordered allocations. > That's a distinct possibility. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 23+ messages in thread
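To illustrate the lumpy reclaim behaviour Mel describes in the message above, here is a simplified sketch of how the naturally-aligned target block around the LRU cursor page is computed. The real code of that era (isolate_lru_pages() in mm/vmscan.c) also checks zone membership and whether each page in the range can actually be isolated, so treat this as an outline of the idea only.

#include <stdio.h>

/* Lumpy reclaim: given the pfn of the page the LRU scan happened to land
 * on, also consider the rest of the naturally aligned 2^order block, so
 * that reclaiming them all yields a single free page of that order. */
static void lumpy_target_range(unsigned long cursor_pfn, unsigned int order,
			       unsigned long *start_pfn, unsigned long *end_pfn)
{
	*start_pfn = cursor_pfn & ~((1UL << order) - 1);
	*end_pfn = *start_pfn + (1UL << order);
}

int main(void)
{
	unsigned long start, end;

	/* e.g. kswapd needs an order-5 page and the cursor landed on pfn 1000003 */
	lumpy_target_range(1000003, 5, &start, &end);
	printf("try to reclaim pfns [%lu, %lu) to assemble one order-5 page\n",
	       start, end);
	return 0;
}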
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 2009-11-30 15:48 ` Mel Gorman @ 2009-11-30 17:21 ` Corrado Zoccolo 0 siblings, 0 replies; 23+ messages in thread From: Corrado Zoccolo @ 2009-11-30 17:21 UTC (permalink / raw) To: Mel Gorman Cc: Corrado Zoccolo, Jens Axboe, Andrew Morton, Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski, Tobias Oetiker, KOSAKI Motohiro, Pekka Enberg, Rik van Riel, Christoph Lameter, Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm On Mon, Nov 30 2009 at 16:48:32, Mel Gorman wrote: > On Mon, Nov 30, 2009 at 01:54:04PM +0100, Corrado Zoccolo wrote: > > On Mon, Nov 30, 2009 at 1:04 PM, Mel Gorman <mel@csn.ul.ie> wrote: > > > On Sun, Nov 29, 2009 at 04:11:15PM +0100, Corrado Zoccolo wrote: > > For my I/O scheduler tests I use an external disk, to be able to > > monitor exactly what is happening. > > If I don't do a sync & drop cache before starting a test, I usually > > see writeback happening on the main disk, even if the only activity on > > the machine is writing a sequential file to my external disk. If that > > writeback is done in the context of my test process, this will alter > > the result. > > Why does the writeback kick in late? I thought pages were meant to be > written back after a contigurable interval of time had passed. That is a good question. Maybe when dirty ratio goes high, something is being written to swap? > > I can try but it'll take a few days to get around to. I'm still trying > to identify other sources of the problems from between 2.6.30 and > 2.6.32-rc8. It'll be tricky to test what you ask because it might not just > be low-memory that is the problem but low memory + enough pressure that > processes are stalling waiting on reclaim. Ok. > > > Right, but the order of insertions at the tail would be reversed. > > True but maybe it doesn't matter. What's important is that the order the > pages are returned during allocation and after a high-order page is split > is what is important. > > > > There is a fair amount of overhead > > > introduced here as well with branches and a lot of extra lists although > > > I believe that could be mitigated. > > > > > > What are the results if you just alter whether it's the head or tail of > > > the list that is used in __free_one_page()? > > > > In that case, it would alter the ordering, but not the one of the > > pages returned by expand. > > In fact, only the order of the pages returned by free will be > > affected, and in that case maybe it is already quite disordered. > > If that order is not needed to be kept, I can prepare a new version > > with a single list. > > The ordering of free does not need to be preserved. The important > property is that if a high-order page is split by expand() that > subsequent allocations use the contiguous pages. Then, a solution with a single list is possible. It removes the overhead of the branches when allocating, and also the additional lists. What about: From b792ce5afff2e7a28ec3db41baaf93c3200ee5fc Mon Sep 17 00:00:00 2001 From: Corrado Zoccolo <czoccolo@gmail.com> Date: Mon, 30 Nov 2009 17:42:05 +0100 Subject: [PATCH] page allocator: heuristic to reduce fragmentation in buddy allocator In order to reduce fragmentation, we classify freed pages in two groups, according to their probability of being part of a high order merge. Pages belonging to a compound whose buddy is free are more likely to be part of a high order merge, so they will be added at the tail of the freelist. 
The remaining pages will, instead, be put at the front of the
freelist. In this way, the pages that are more likely to cause a big
merge are kept free longer. Consequently we tend to aggregate the
long-living allocations on a subset of the compounds, reducing the
fragmentation.

Signed-off-by: Corrado Zoccolo <czoccolo@gmail.com>
---
 mm/page_alloc.c |   20 +++++++++++++++++---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2bc2ac6..0f273af 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -451,6 +451,8 @@ static inline void __free_one_page(struct page *page,
 		int migratetype)
 {
 	unsigned long page_idx;
+	unsigned long combined_idx;
+	bool combined_free = false;
 
 	if (unlikely(PageCompound(page)))
 		if (unlikely(destroy_compound_page(page, order)))
@@ -464,7 +466,6 @@ static inline void __free_one_page(struct page *page,
 	VM_BUG_ON(bad_range(zone, page));
 
 	while (order < MAX_ORDER-1) {
-		unsigned long combined_idx;
 		struct page *buddy;
 
 		buddy = __page_find_buddy(page, page_idx, order);
@@ -481,8 +482,21 @@ static inline void __free_one_page(struct page *page,
 		order++;
 	}
 	set_page_order(page, order);
-	list_add(&page->lru,
-		&zone->free_area[order].free_list[migratetype]);
+
+	if (order < MAX_ORDER-1) {
+		struct page *combined_page, *combined_buddy;
+		combined_idx = __find_combined_index(page_idx, order);
+		combined_page = page + combined_idx - page_idx;
+		combined_buddy = __page_find_buddy(combined_page, combined_idx, order + 1);
+		combined_free = page_is_buddy(combined_page, combined_buddy, order + 1);
+	}
+
+	if (combined_free)
+		list_add_tail(&page->lru,
+			&zone->free_area[order].free_list[migratetype]);
+	else
+		list_add(&page->lru,
+			&zone->free_area[order].free_list[migratetype]);
 	zone->free_area[order].nr_free++;
 }
 
-- 
1.6.2.5

^ permalink raw reply	[flat|nested] 23+ messages in thread
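The heuristic above rests on the buddy index arithmetic: the buddy of a block at index i and order o is i ^ (1 << o), and the merged (o+1) block would start at i & ~(1 << o). A stand-alone sketch of the check the patch performs (user-space code with hypothetical helper names, not the kernel's):

/*
 * Illustration of the buddy/combined-index arithmetic behind the patch
 * above.  User-space sketch; helper names are hypothetical.
 */
#include <stdio.h>

/* Index of the buddy of the block starting at page_idx, at this order. */
static unsigned long buddy_index(unsigned long page_idx, unsigned int order)
{
	return page_idx ^ (1UL << order);
}

/* Index where the merged (order + 1) block would start. */
static unsigned long combined_index(unsigned long page_idx, unsigned int order)
{
	return page_idx & ~(1UL << order);
}

int main(void)
{
	unsigned long page_idx = 12;	/* freeing an order-2 block at index 12 */
	unsigned int order = 2;

	unsigned long buddy = buddy_index(page_idx, order);		/* 8 */
	unsigned long combined = combined_index(page_idx, order);	/* 8 */

	/*
	 * The patch asks one extra question: is the buddy of the combined
	 * (order + 1) block also free?  If so, this page is likely to take
	 * part in an even larger merge later, so it goes to the list tail.
	 */
	unsigned long next_buddy = buddy_index(combined, order + 1);	/* 0 */

	printf("buddy=%lu combined=%lu next_buddy=%lu\n",
	       buddy, combined, next_buddy);
	return 0;
}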
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32
2009-11-26 14:17 ` Mel Gorman
2009-11-26 15:18 ` Corrado Zoccolo
@ 2009-11-27 5:58 ` KOSAKI Motohiro
2009-11-27 6:29 ` KOSAKI Motohiro
2009-11-27 12:16 ` Mel Gorman
1 sibling, 2 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2009-11-27 5:58 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Corrado Zoccolo, Jens Axboe, Andrew Morton,
Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Pekka Enberg, Rik van Riel, Christoph Lameter,
Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm

> On Thu, Nov 26, 2009 at 02:47:10PM +0100, Corrado Zoccolo wrote:
> > On Thu, Nov 26, 2009 at 1:19 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > > (cc'ing the people from the page allocator failure thread as this might be
> > > relevant to some of their problems)
> > >
> > > I know this is very last minute but I believe we should consider disabling
> > > the "low_latency" tunable for block devices by default for 2.6.32. There was
> > > evidence that low_latency was a problem last week for page allocation failure
> > > reports but the reproduction-case was unusual and involved high-order atomic
> > > allocations in low-memory conditions. It took another few days to accurately
> > > show the problem for more normal workloads and it's a bit more wide-spread
> > > than just allocation failures.
> > >
> > > Basically, low_latency looks great as long as you have plenty of memory
> > > but in low memory situations, it appears to cause problems that manifest
> > > as reduced performance, desktop stalls and in some cases, page allocation
> > > failures. I think most kernel developers are not seeing the problem as they
> > > tend to test on beefier machines and without hitting swap or low-memory
> > > situations for the most part. When they are hitting low-memory situations,
> > > it tends to be for stress tests where stalls and low performance are expected.
> >
> > The low latency tunable controls various policies inside cfq.
> > The one that could affect memory reclaim is:
> > 	/*
> > 	 * Async queues must wait a bit before being allowed dispatch.
> > 	 * We also ramp up the dispatch depth gradually for async IO,
> > 	 * based on the last sync IO we serviced
> > 	 */
> > 	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> > 		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
> > 		unsigned int depth;
> >
> > 		depth = last_sync / cfqd->cfq_slice[1];
> > 		if (!depth && !cfqq->dispatched)
> > 			depth = 1;
> > 		if (depth < max_dispatch)
> > 			max_dispatch = depth;
> > 	}
> >
> > here the async queues max depth is limited to 1 for up to 200 ms after
> > a sync I/O is completed.
> > Note: dirty page writeback goes through an async queue, so it is
> > penalized by this.
> >
> > This can affect both low and high end hardware. My non-NCQ sata disk
> > can handle a depth of 2 when writing. NCQ sata disks can handle a
> > depth up to 31, so limiting depth to 1 can cause write performance
> > drop, and this in turn will slow down dirty page reclaim, and cause
> > allocation failures.
> >
> > It would be good to re-test the OOM conditions with that code commented out.
> >
>
> All of it or just the cfq_latency part?
>
> As it turns out the test machine does report for the disk NCQ (depth 31/32)
> and it's the same on the laptop so slowing down dirty page cleaning
> could be impacting reclaim.
>
> > >
> > > To show the problem, I used an x86-64 machine booting booted with 512MB of
> > > memory. This is a small amount of RAM but the bug reports related to page
> > > allocation failures were on smallish machines and the disks in the system
> > > are not very high-performance.
> > >
> > > I used three tests. The first was sysbench on postgres running an IO-heavy
> > > test against a large database with 10,000,000 rows. The second was IOZone
> > > running most of the automatic tests with a record length of 4KB and the
> > > last was a simulated launching of gitk with a music player running in the
> > > background to act as a desktop-like scenario. The final test was similar
> > > to the test described here http://lwn.net/Articles/362184/ except that
> > > dm-crypt was not used as it has its own problems.
> >
> > low_latency was tested on other scenarios:
> > http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
> > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
> > where it improved actual and perceived performance, so disabling it
> > completely may not be good.
> >
>
> It may not indeed.
>
> In case you mean a partial disabling of cfq_latency, I'll try the
> following patch. The intention is to disable the low_latency logic if
> kswapd is at work and presumably needs clean pages. Alternative
> suggestions welcome.

I like treating vmscan writeout as special, because:
- vmscan runs in various process contexts, but it doesn't write out its own
  process's pages. IOW, it doesn't really match cfq's I/O fairness logic.
- Plus, as a consequence, vmscan writeout doesn't need good I/O latency.
- vmscan maintains a page-granularity LRU list. That means vmscan generates
  awfully seeky I/O; it assumes the block layer buffers many I/O requests.
- Plus, as a consequence, vmscan writeout needs good I/O throughput; otherwise
  the system might hang up.

However, I don't think kswapd_awake is a good choice, because:
- zone reclaim runs before kswapd wakes up; IOW, this patch doesn't help HPC
  machines. BTW, some Core i7 boxes (at least Intel's reference box) also use
  zone reclaim.
- On large (many memory node) machines, at least one of the many kswapds is
  always running.

Instead, is PF_MEMALLOC a good idea?

Subject: [PATCH] cfq: Do not limit the async queue depth while memory reclaim

Not-Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> (I haven't tested this)
---
 block/cfq-iosched.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index aa1e953..9546f64 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -1308,7 +1308,8 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * We also ramp up the dispatch depth gradually for async IO,
 	 * based on the last sync IO we serviced
 	 */
-	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
+	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency &&
+	    !(current->flags & PF_MEMALLOC)) {
 		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
 		unsigned int depth;
 
-- 
1.6.5.2

^ permalink raw reply	[flat|nested] 23+ messages in thread
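To make the "depth limited to 1 for up to 200 ms" observation from the cfq snippet quoted earlier in this message concrete, here is a small user-space sketch of the same ramp (all names and the 100 ms sync slice are assumptions, not the kernel code itself): the async depth is the time since the last sync I/O divided by the sync slice, floored at 1, and capped by the hardware queue depth.

/*
 * User-space sketch of the async dispatch-depth ramp discussed above.
 * The 100 ms sync slice and the NCQ depth of 31 are assumptions.
 */
#include <stdio.h>

static unsigned int async_depth(unsigned long ms_since_last_sync,
				unsigned long slice_sync_ms,
				unsigned int max_dispatch,
				unsigned int already_dispatched)
{
	unsigned int depth = ms_since_last_sync / slice_sync_ms;

	if (!depth && !already_dispatched)
		depth = 1;	/* allow at least one request if none are queued */
	if (depth < max_dispatch)
		max_dispatch = depth;
	return max_dispatch;
}

int main(void)
{
	const unsigned int hw_depth = 31;	/* NCQ disk could take 31 requests */

	for (unsigned long ms = 0; ms <= 800; ms += 100)
		printf("%3lu ms after last sync I/O -> async depth %u\n",
		       ms, async_depth(ms, 100, hw_depth, 0));
	return 0;
}

With a 100 ms sync slice the computed depth only reaches 2 once 200 ms have passed since the last sync completion, which is where the "up to 200 ms at depth 1" figure comes from.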
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32
2009-11-27 5:58 ` KOSAKI Motohiro
@ 2009-11-27 6:29 ` KOSAKI Motohiro
2009-11-27 12:16 ` Mel Gorman
1 sibling, 0 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2009-11-27 6:29 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Mel Gorman, Corrado Zoccolo, Jens Axboe, Andrew Morton,
Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Pekka Enberg, Rik van Riel, Christoph Lameter,
Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm

> Instead, is PF_MEMALLOC a good idea?

This patch was obviously wrong. Please forget it. I'm sorry.

>
> Subject: [PATCH] cfq: Do not limit the async queue depth while memory reclaim
>
> Not-Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> (I haven't tested this)
> ---
>  block/cfq-iosched.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index aa1e953..9546f64 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1308,7 +1308,8 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  	 * We also ramp up the dispatch depth gradually for async IO,
>  	 * based on the last sync IO we serviced
>  	 */
> -	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> +	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency &&
> +	    !(current->flags & PF_MEMALLOC)) {
>  		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
>  		unsigned int depth;
>
> --
> 1.6.5.2

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32
2009-11-27 5:58 ` KOSAKI Motohiro
2009-11-27 6:29 ` KOSAKI Motohiro
@ 2009-11-27 12:16 ` Mel Gorman
2009-11-30 10:18 ` KOSAKI Motohiro
1 sibling, 1 reply; 23+ messages in thread
From: Mel Gorman @ 2009-11-27 12:16 UTC (permalink / raw)
To: KOSAKI Motohiro
Cc: Corrado Zoccolo, Jens Axboe, Andrew Morton, Linus Torvalds,
Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Pekka Enberg, Rik van Riel, Christoph Lameter,
Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm

On Fri, Nov 27, 2009 at 02:58:26PM +0900, KOSAKI Motohiro wrote:
> > > <SNIP>
> > > low_latency was tested on other scenarios:
> > > http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
> > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
> > > where it improved actual and perceived performance, so disabling it
> > > completely may not be good.
> > >
> >
> > It may not indeed.
> >
> > In case you mean a partial disabling of cfq_latency, I'll try the
> > following patch. The intention is to disable the low_latency logic if
> > kswapd is at work and presumably needs clean pages. Alternative
> > suggestions welcome.
>
> I like treating vmscan writeout as special, because:
> - vmscan runs in various process contexts, but it doesn't write out its own
>   process's pages. IOW, it doesn't really match cfq's I/O fairness logic.
> - Plus, as a consequence, vmscan writeout doesn't need good I/O latency.

While it might not need good latency as such, it does need pages to be
clean because direct reclaim has trouble cleaning pages on its own
behalf.

> - vmscan maintains a page-granularity LRU list. That means vmscan generates
>   awfully seeky I/O; it assumes the block layer buffers many I/O requests.
> - Plus, as a consequence, vmscan writeout needs good I/O throughput; otherwise
>   the system might hang up.
>
> However, I don't think kswapd_awake is a good choice, because:
> - zone reclaim runs before kswapd wakes up; IOW, this patch doesn't help HPC
>   machines. BTW, some Core i7 boxes (at least Intel's reference box) also use
>   zone reclaim.

Good point.

> - On large (many memory node) machines, at least one of the many kswapds is
>   always running.

Also true.

> Instead, is PF_MEMALLOC a good idea?

It doesn't work out either because a process with PF_MEMALLOC is in
direct reclaim and, like kswapd, it may not be able to clean the pages at
all, let alone in a small period of time.

> Subject: [PATCH] cfq: Do not limit the async queue depth while memory reclaim
>
> Not-Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> (I haven't tested this)
> ---
>  block/cfq-iosched.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index aa1e953..9546f64 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -1308,7 +1308,8 @@ static bool cfq_may_dispatch(struct cfq_data *cfqd, struct cfq_queue *cfqq)
>  	 * We also ramp up the dispatch depth gradually for async IO,
>  	 * based on the last sync IO we serviced
>  	 */
> -	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency) {
> +	if (!cfq_cfqq_sync(cfqq) && cfqd->cfq_latency &&
> +	    !(current->flags & PF_MEMALLOC)) {
>  		unsigned long last_sync = jiffies - cfqd->last_end_sync_rq;
>  		unsigned int depth;
>
> --
> 1.6.5.2

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32
2009-11-27 12:16 ` Mel Gorman
@ 2009-11-30 10:18 ` KOSAKI Motohiro
0 siblings, 0 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2009-11-30 10:18 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Corrado Zoccolo, Jens Axboe, Andrew Morton,
Linus Torvalds, Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Pekka Enberg, Rik van Riel, Christoph Lameter,
Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm

> On Fri, Nov 27, 2009 at 02:58:26PM +0900, KOSAKI Motohiro wrote:
> > > > <SNIP>
> > > > low_latency was tested on other scenarios:
> > > > http://lkml.indiana.edu/hypermail/linux/kernel/0910.0/01410.html
> > > > http://linux.derkeiler.com/Mailing-Lists/Kernel/2009-11/msg04855.html
> > > > where it improved actual and perceived performance, so disabling it
> > > > completely may not be good.
> > > >
> > >
> > > It may not indeed.
> > >
> > > In case you mean a partial disabling of cfq_latency, I'll try the
> > > following patch. The intention is to disable the low_latency logic if
> > > kswapd is at work and presumably needs clean pages. Alternative
> > > suggestions welcome.
> >
> > I like treating vmscan writeout as special, because:
> > - vmscan runs in various process contexts, but it doesn't write out its own
> >   process's pages. IOW, it doesn't really match cfq's I/O fairness logic.
> > - Plus, as a consequence, vmscan writeout doesn't need good I/O latency.
>
> While it might not need good latency as such, it does need pages to be
> clean because direct reclaim has trouble cleaning pages on its own
> behalf.

Well, if direct reclaim needs lumpy reclaim, you are right. In the
non-lumpy case, vmscan starts pageout and typically moves the page to
the tail of the list; the cleaned page will be used by another task.

---------------------------------------------------------------------------------------
static unsigned long shrink_page_list(struct list_head *page_list,
					struct list_head *freed_pages_list,
					struct scan_control *sc,
					enum pageout_io sync_writeback)
{
(snip)

		switch (pageout(page, mapping, sync_writeback)) {
		case PAGE_KEEP:
			goto keep_locked;
		case PAGE_ACTIVATE:
			goto activate_locked;
		case PAGE_SUCCESS:
			if (PageWriteback(page) || PageDirty(page))
				goto keep;	/////// HERE
---------------------------------------------------------------------------------------

> > - vmscan maintains a page-granularity LRU list. That means vmscan generates
> >   awfully seeky I/O; it assumes the block layer buffers many I/O requests.
> > - Plus, as a consequence, vmscan writeout needs good I/O throughput; otherwise
> >   the system might hang up.
> >
> > However, I don't think kswapd_awake is a good choice, because:
> > - zone reclaim runs before kswapd wakes up; IOW, this patch doesn't help HPC
> >   machines. BTW, some Core i7 boxes (at least Intel's reference box) also use
> >   zone reclaim.
>
> Good point.
>
> > - On large (many memory node) machines, at least one of the many kswapds is
> >   always running.
>
> Also true.
>
> > Instead, is PF_MEMALLOC a good idea?
>
> It doesn't work out either because a process with PF_MEMALLOC is in
> direct reclaim and, like kswapd, it may not be able to clean the pages at
> all, let alone in a small period of time.

Please forget this idea ;)

^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32
2009-11-26 12:19 [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 Mel Gorman
2009-11-26 13:08 ` Mike Galbraith
2009-11-26 13:47 ` Corrado Zoccolo
@ 2009-11-27 4:36 ` KOSAKI Motohiro
2 siblings, 0 replies; 23+ messages in thread
From: KOSAKI Motohiro @ 2009-11-27 4:36 UTC (permalink / raw)
To: Mel Gorman
Cc: kosaki.motohiro, Jens Axboe, Andrew Morton, Linus Torvalds,
Frans Pop, Jiri Kosina, Sven Geggus, Karol Lewandowski,
Tobias Oetiker, Pekka Enberg, Rik van Riel, Christoph Lameter,
Stephan von Krawczynski, Rafael J. Wysocki, linux-kernel, linux-mm

> Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> ---
>  block/cfq-iosched.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
> index aa1e953..dc33045 100644
> --- a/block/cfq-iosched.c
> +++ b/block/cfq-iosched.c
> @@ -2543,7 +2543,7 @@ static void *cfq_init_queue(struct request_queue *q)
>  	cfqd->cfq_slice[1] = cfq_slice_sync;
>  	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
>  	cfqd->cfq_slice_idle = cfq_slice_idle;
> -	cfqd->cfq_latency = 1;
> +	cfqd->cfq_latency = 0;
>  	cfqd->hw_tag = 1;
>  	cfqd->last_end_sync_rq = jiffies;
>  	return cfqd;

Great. We can probably re-enable this feature in 2.6.33, but there is
no reason to take any risk in 2.6.32, i.e. this simple disabling is
best. I like this.

Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

^ permalink raw reply	[flat|nested] 23+ messages in thread
end of thread, other threads:[~2009-11-30 17:25 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-11-26 12:19 [PATCH-RFC] cfq: Disable low_latency by default for 2.6.32 Mel Gorman
2009-11-26 13:08 ` Mike Galbraith
2009-11-26 13:20 ` Bartlomiej Zolnierkiewicz
2009-11-26 13:37 ` Mike Galbraith
2009-11-26 13:56 ` Mel Gorman
2009-11-26 13:47 ` Corrado Zoccolo
2009-11-26 14:17 ` Mel Gorman
2009-11-26 15:18 ` Corrado Zoccolo
2009-11-27 11:44 ` Mel Gorman
2009-11-27 12:03 ` Corrado Zoccolo
2009-11-27 15:58 ` Mel Gorman
2009-11-27 18:14 ` Corrado Zoccolo
2009-11-27 18:52 ` Mel Gorman
2009-11-29 15:11 ` Corrado Zoccolo
2009-11-30 12:04 ` Mel Gorman
2009-11-30 12:54 ` Corrado Zoccolo
2009-11-30 15:48 ` Mel Gorman
2009-11-30 17:21 ` Corrado Zoccolo
2009-11-27  5:58 ` KOSAKI Motohiro
2009-11-27  6:29 ` KOSAKI Motohiro
2009-11-27 12:16 ` Mel Gorman
2009-11-30 10:18 ` KOSAKI Motohiro
2009-11-27  4:36 ` KOSAKI Motohiro
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox