Re: [PATCH 1/5] mm: kswapd: Stop high-order balancing when any suitable zone is balanced

From: Minchan Kim <minchan.kim@gmail.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: Simon Kirby <sim@hostway.ca>,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Shaohua Li <shaohua.li@intel.com>,
	Dave Hansen <dave@linux.vnet.ibm.com>,
	linux-mm <linux-mm@kvack.org>,
	linux-kernel <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/5] mm: kswapd: Stop high-order balancing when any suitable zone is balanced
Date: Tue, 7 Dec 2010 10:32:45 +0900
Message-ID: <AANLkTimvmbvZ-9RcLsefTqbq1ktm6=-XD1N6z4JHBh=v@mail.gmail.com>
In-Reply-To: <20101206105558.GA21406@csn.ul.ie>

On Mon, Dec 6, 2010 at 7:55 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> On Mon, Dec 06, 2010 at 08:35:18AM +0900, Minchan Kim wrote:
>> Hi Mel,
>>
>> On Fri, Dec 3, 2010 at 8:45 PM, Mel Gorman <mel@csn.ul.ie> wrote:
>> > When the allocator enters its slow path, kswapd is woken up to balance the
>> > node. It continues working until all zones within the node are balanced. For
>> > order-0 allocations, this makes perfect sense but for higher orders it can
>> > have unintended side-effects. If the zone sizes are imbalanced, kswapd may
>> > reclaim heavily within a smaller zone discarding an excessive number of
>> > pages. The user-visible behaviour is that kswapd is awake and reclaiming
>> > even though plenty of pages are free from a suitable zone.
>> >
>> > This patch alters the "balance" logic for high-order reclaim allowing kswapd
>> > to stop if any suitable zone becomes balanced to reduce the number of pages
>> > it reclaims from other zones. kswapd still tries to ensure that order-0
>> > watermarks for all zones are met before sleeping.
>> >
>> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
>>
>> <snip>
>>
>> > -       if (!all_zones_ok) {
>> > +       if (!(all_zones_ok || (order && any_zone_ok))) {
>> >                cond_resched();
>> >
>> >                try_to_freeze();
>> > @@ -2361,6 +2366,31 @@ out:
>> >                goto loop_again;
>> >        }
>> >
>> > +       /*
>> > +        * If kswapd was reclaiming at a higher order, it has the option of
>> > +        * sleeping without all zones being balanced. Before it does, it must
>> > +        * ensure that the watermarks for order-0 on *all* zones are met and
>> > +        * that the congestion flags are cleared
>> > +        */
>> > +       if (order) {
>> > +               for (i = 0; i <= end_zone; i++) {
>> > +                       struct zone *zone = pgdat->node_zones + i;
>> > +
>> > +                       if (!populated_zone(zone))
>> > +                               continue;
>> > +
>> > +                       if (zone->all_unreclaimable && priority != DEF_PRIORITY)
>> > +                               continue;
>> > +
>> > +                       zone_clear_flag(zone, ZONE_CONGESTED);
>>
>> Why clear ZONE_CONGESTED?
>> If you have a cause, please, write down the comment.
>>
>
> It's because kswapd is the only mechanism that clears the congestion
> flag. If it's not cleared and kswapd goes to sleep, the flag could be
> left set causing hard-to-diagnose stalls. I'll add a comment.

Seems good.

>
>> <snip>
>>
>> First impression on this patch is that it changes scanning behavior as
>> well as reclaiming on high order reclaim.
>
> It does affect scanning behaviour for high-order reclaim. Specifically,
> it may stop scanning once a zone is balanced within the node. Previously
> it would continue scanning until all zones were balanced. Is this what
> you are thinking of or something else?

Yes. I mean the page aging of the higher zones.

>
>> I can't say the old behavior is right, but we can't say this behavior is
>> right either, although this patch solves the problem. At least, we might
>> need some data that shows this patch doesn't cause a regression.
>
> How do you suggest it be tested and this data be gathered? I tested a number of
> workloads that keep kswapd awake but found no differences of major significance
> even though they were using high-order allocations. The problem with identifying
> small regressions for high-order allocations is that the state of the system
> when lumpy reclaim starts is very important as it determines how much work
> has to be done. I did not find major regressions in performance.
>
> For the tests I did run;
>
> fsmark showed nothing useful. iozone showed nothing useful either as it didn't
> even wake kswapd. sysbench showed minor performance gains and losses but it
> is not useful as it typically does not wake kswapd unless the database is
> badly configured.
>
> I ran postmark because it was the closest benchmark to a mail simulator I
> had access to. This sucks because it's no longer representative of a mail
> server and is more like a crappy filesystem benchmark. To get it closer to a
> real server, there was also a program running in the background that mapped
> a large anonymous segment and scanned it in blocks.
>
> POSTMARK
>            postmark-traceonly-v3r1-postmarkpostmark-kanyzone-v2r6-postmark
>                traceonly-v3r1     kanyzone-v2r6
> Transactions per second:                2.00 ( 0.00%)     2.00 ( 0.00%)
> Data megabytes read per second:         8.14 ( 0.00%)     8.59 ( 5.24%)
> Data megabytes written per second:     18.94 ( 0.00%)    19.98 ( 5.21%)
> Files created alone per second:         4.00 ( 0.00%)     4.00 ( 0.00%)
> Files create/transact per second:       1.00 ( 0.00%)     1.00 ( 0.00%)
> Files deleted alone per second:        34.00 ( 0.00%)    30.00 (-13.33%)

Do you know why only file deletion shows a big regression?

> Files delete/transact per second:       1.00 ( 0.00%)     1.00 ( 0.00%)
>
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         152.4    152.92
> Total Elapsed Time (seconds)               5110.96   4847.22
>
> FTrace Reclaim Statistics: vmscan
>            postmark-traceonly-v3r1-postmarkpostmark-kanyzone-v2r6-postmark
>                traceonly-v3r1     kanyzone-v2r6
> Direct reclaims                                  0          0
> Direct reclaim pages scanned                     0          0
> Direct reclaim pages reclaimed                   0          0
> Direct reclaim write file async I/O              0          0
> Direct reclaim write anon async I/O              0          0
> Direct reclaim write file sync I/O               0          0
> Direct reclaim write anon sync I/O               0          0
> Wake kswapd requests                             0          0
> Kswapd wakeups                                2177       2174
> Kswapd pages scanned                      34690766   34691473

Perhaps, in your workload, any_zone is the highest zone.
If any_zone were a low zone, kswapd pages scanned would differ a lot,
because the old behavior tries to balance all zones.
Could we evaluate this situation? I have no idea how to set it up,
though. :(
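
To make the scenario concrete, here is a minimal sketch of the two stop
rules as a simplified model. It is not the kernel code: struct model_zone,
may_stop_old() and may_stop_new() are hypothetical names, and "balanced"
stands in for the real watermark checks. Only the condition
all_zones_ok || (order && any_zone_ok) is taken from the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a zone; not the kernel's struct zone. */
struct model_zone {
	bool populated;	/* zone has pages at all */
	bool balanced;	/* zone meets its watermark for the requested order */
};

/* Old rule: keep reclaiming until every populated zone is balanced. */
bool may_stop_old(const struct model_zone *zones, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (zones[i].populated && !zones[i].balanced)
			return false;
	return true;
}

/* Rule from the patch: all_zones_ok || (order && any_zone_ok). */
bool may_stop_new(const struct model_zone *zones, size_t nr, int order)
{
	bool any_zone_ok = false;

	assert(order >= 0);
	for (size_t i = 0; i < nr; i++)
		if (zones[i].populated && zones[i].balanced)
			any_zone_ok = true;

	return may_stop_old(zones, nr) || (order > 0 && any_zone_ok);
}
```

With only one zone balanced, may_stop_old() returns false while
may_stop_new() returns true for any order above 0, whichever zone it is.
In this model the remaining zones are then left unscanned, which is the
aging question raised here.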

> Kswapd pages reclaimed                    34511965   34513478
> Kswapd reclaim write file async I/O             32          0
> Kswapd reclaim write anon async I/O           2357       2561
> Kswapd reclaim write file sync I/O               0          0
> Kswapd reclaim write anon sync I/O               0          0
> Time stalled direct reclaim (seconds)         0.00       0.00
> Time kswapd awake (seconds)                 632.10     683.34
>
> Total pages scanned                       34690766  34691473
> Total pages reclaimed                     34511965  34513478
> %age total pages scanned/reclaimed          99.48%    99.49%
> %age total pages scanned/written             0.01%     0.01%
> %age  file pages scanned/written             0.00%     0.00%
> Percentage Time Spent Direct Reclaim         0.00%     0.00%
> Percentage Time kswapd Awake                12.37%    14.10%

Is "kswapd Awake" correct?
AFAIR, your implementation seems to account kswapd time even
while kswapd is scheduled out.
I mean, for example,

kswapd
-> time stamp start
-> balance_pgdat
-> cond_resched(kswapd schedule out)
-> app 1 start
-> app 2 start
-> kswapd schedule in
-> time stamp end.

If that's right, the kswapd awake figure doesn't mean much.
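
The difference between the two ways of counting can be sketched as below.
This is only an illustration under assumed names (struct run_slice,
awake_wall_clock(), awake_on_cpu()); it is not how the tracing scripts
are known to work, which is exactly the open question.

```c
#include <assert.h>
#include <stddef.h>

/* One stretch of time during which the modelled thread was on the CPU. */
struct run_slice {
	double sched_in;	/* seconds */
	double sched_out;	/* seconds */
};

/* Wall-clock accounting: first time stamp to last, gaps included. */
double awake_wall_clock(const struct run_slice *s, size_t nr)
{
	assert(nr > 0);
	return s[nr - 1].sched_out - s[0].sched_in;
}

/* On-CPU accounting: sum only the slices the thread actually ran. */
double awake_on_cpu(const struct run_slice *s, size_t nr)
{
	double total = 0.0;

	for (size_t i = 0; i < nr; i++) {
		assert(s[i].sched_out >= s[i].sched_in);
		total += s[i].sched_out - s[i].sched_in;
	}
	return total;
}
```

For a thread that runs from 0 to 1 second, is scheduled out while other
applications run, and runs again from 5 to 6 seconds, the wall-clock
figure is 6 seconds but the on-CPU figure is only 2.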

>
> proc vmstat: Faults
>            postmark-traceonly-v3r1-postmarkpostmark-kanyzone-v2r6-postmark
>                traceonly-v3r1     kanyzone-v2r6
> Major Faults                                  1979      1741
> Minor Faults                              13660834  13587939
> Page ins                                     89060     74704
> Page outs                                    69800     58884
> Swap ins                                      1193      1499
> Swap outs                                     2403      2562
>
> Still, IO performance was improved (higher rates of read/write) and the test
> completed significantly faster with this patch series applied.  kswapd was
> awake for longer and reclaimed marginally more pages with more swap-ins and

The longer awake time may be due to the time accounting problem I described above.

> swap-outs which is unfortunate but it's somewhat balanced by fewer faults
> and fewer page-ins. Basically, in terms of reclaim the figures are so close
> that it is within the performance variations lumpy reclaim has depending on
> the exact state of the system when reclaim starts.

What I wanted to see is how system performance is affected when the
zones above any_zone aren't aged.
This patch changes the balancing mechanism of kswapd, so I think the
experiment is valuable.
I don't want to be a bad reviewer who wears contributors out.
What do you think?

>
>> It's
>> not easy but I believe you can do very well, as you have done until
>> now. I didn't see the whole series so I might have missed something.
>>
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab
>



-- 
Kind regards,
Minchan Kim
