linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Shaohua Li <shaohua.li@intel.com>
To: Minchan Kim <minchan.kim@gmail.com>
Cc: David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>, mel <mel@csn.ul.ie>,
	Rik van Riel <riel@redhat.com>, Michal Hocko <mhocko@suse.cz>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [patch v2]vmscan: correctly detect GFP_ATOMIC allocation failure
Date: Wed, 12 Oct 2011 10:48:59 +0800	[thread overview]
Message-ID: <1318387739.22361.109.camel@sli10-conroe> (raw)
In-Reply-To: <20111011065401.GA4415@barrios-desktop>

On Tue, 2011-10-11 at 14:54 +0800, Minchan Kim wrote:
> On Tue, Oct 11, 2011 at 01:30:10PM +0800, Shaohua Li wrote:
> > On Mon, 2011-10-10 at 23:42 +0800, Minchan Kim wrote:
> > > On Mon, Oct 10, 2011 at 03:28:13PM +0800, Shaohua Li wrote:
> > > > On Sun, 2011-10-09 at 23:10 +0800, Minchan Kim wrote:
> > > > > On Sun, Oct 09, 2011 at 04:17:51PM +0800, Shaohua Li wrote:
> > > > > > On Sun, 2011-10-09 at 16:01 +0800, Minchan Kim wrote:
> > > > > > > On Sun, Oct 09, 2011 at 01:53:11PM +0800, Shaohua Li wrote:
> > > > > > > > On Sat, 2011-10-08 at 18:25 +0800, Minchan Kim wrote:
> > > > > > > > > On Sat, Oct 08, 2011 at 01:56:52PM +0800, Shaohua Li wrote:
> > > > > > > > > > On Sat, 2011-10-08 at 11:35 +0800, Shaohua Li wrote:
> > > > > > > > > > > On Sat, 2011-10-08 at 11:19 +0800, David Rientjes wrote:
> > > > > > > > > > > > On Sat, 8 Oct 2011, Shaohua Li wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > has_under_min_watermark_zone is used to detect if there is GFP_ATOMIC allocation
> > > > > > > > > > > > > failure risk. For a high end_zone, if any zone below or equal to it has min
> > > > > > > > > > > > > matermark ok, we have no risk. But current logic is any zone has min watermark
> > > > > > > > > > > > > not ok, then we have risk. This is wrong to me.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Shaohua Li <shaohua.li@intel.com>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  mm/vmscan.c |    7 ++++---
> > > > > > > > > > > > >  1 file changed, 4 insertions(+), 3 deletions(-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Index: linux/mm/vmscan.c
> > > > > > > > > > > > > ===================================================================
> > > > > > > > > > > > > --- linux.orig/mm/vmscan.c      2011-09-27 15:09:29.000000000 +0800
> > > > > > > > > > > > > +++ linux/mm/vmscan.c   2011-09-27 15:14:45.000000000 +0800
> > > > > > > > > > > > > @@ -2463,7 +2463,7 @@ loop_again:
> > > > > > > > > > > > >
> > > > > > > > > > > > >         for (priority = DEF_PRIORITY; priority >= 0; priority--) {
> > > > > > > > > > > > >                 unsigned long lru_pages = 0;
> > > > > > > > > > > > > -               int has_under_min_watermark_zone = 0;
> > > > > > > > > > > > > +               int has_under_min_watermark_zone = 1;
> > > > > > > > > > > >
> > > > > > > > > > > > bool
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >                 /* The swap token gets in the way of swapout... */
> > > > > > > > > > > > >                 if (!priority)
> > > > > > > > > > > > > @@ -2594,9 +2594,10 @@ loop_again:
> > > > > > > > > > > > >                                  * means that we have a GFP_ATOMIC allocation
> > > > > > > > > > > > >                                  * failure risk. Hurry up!
> > > > > > > > > > > > >                                  */
> > > > > > > > > > > > > -                               if (!zone_watermark_ok_safe(zone, order,
> > > > > > > > > > > > > +                               if (has_under_min_watermark_zone &&
> > > > > > > > > > > > > +                                           zone_watermark_ok_safe(zone, order,
> > > > > > > > > > > > >                                             min_wmark_pages(zone), end_zone, 0))
> > > > > > > > > > > > > -                                       has_under_min_watermark_zone = 1;
> > > > > > > > > > > > > +                                       has_under_min_watermark_zone = 0;
> > > > > > > > > > > > >                         } else {
> > > > > > > > > > > > >                                 /*
> > > > > > > > > > > > >                                  * If a zone reaches its high watermark,
> > > > > > > > > > > >
> > > > > > > > > > > > Ignore checking the min watermark for a moment and consider if all zones
> > > > > > > > > > > > are above the high watermark (a situation where kswapd does not need to
> > > > > > > > > > > > do aggressive reclaim), then has_under_min_watermark_zone doesn't get
> > > > > > > > > > > > cleared and never actually stalls on congestion_wait().  Notice this is
> > > > > > > > > > > > congestion_wait() and not wait_iff_congested(), so the clearing of
> > > > > > > > > > > > ZONE_CONGESTED doesn't prevent this.
> > > > > > > > > > > if all zones are above the high watermark, we will have i < 0 when
> > > > > > > > > > > detecting the highest imbalanced zone, and the whole loop will end
> > > > > > > > > > > without run into congestion_wait().
> > > > > > > > > > > or I can add a clearing has_under_min_watermark_zone in the else block
> > > > > > > > > > > to be safe.
> > > > > > > > > > Subject: vmscan: correctly detect GFP_ATOMIC allocation failure -v2
> > > > > > > > > >
> > > > > > > > > > has_under_min_watermark_zone is used to detect if there is GFP_ATOMIC allocation
> > > > > > > > > > failure risk. For a high end_zone, if any zone below or equal to it has min
> > > > > > > > > > matermark ok, we have no risk. But current logic is any zone has min watermark
> > > > > > > > > > not ok, then we have risk. This is wrong to me.
> > > > > > > > >
> > > > > > > > > I think it's not a right or wrong problem but a policy stuff.
> > > > > > > > > If we are going to start busy reclaiming for atomic allocation
> > > > > > > > > after we see all lower zones' min water mark pages are already consumed
> > > > > > > > > It could make you go through long latency and is likely to fail atomic allocation
> > > > > > > > > stream(Because, there is nothing to do for aotmic allocation fail in direct reclaim
> > > > > > > > > so kswapd should always do best effort for it)
> > > > > > > > >
> > > > > > > > > I don't mean you are wrong but we are very careful about this
> > > > > > > > > and at least need some experiments with atomic allocaion stream, I think.
> > > > > > > > yes. this is a policy problem. I just don't want the kswapd keep running
> > > > > > > > even there is no immediate risk of atomic allocation fail.
> > > > > > > > One problem here is end_zone could be high, and low zone always doesn't
> > > > > > > > meet min watermark. So kswapd keeps running without a wait and builds
> > > > > > > > big priority.
> > > > > > >
> > > > > > > It could be but I think it's a mistake of admin if he handles such rare system.
> > > > > > > Couldn't he lower the reserved pages for highmem?
> > > > > > not because admin changes reserved pages. we still have the
> > > > > > zone->lowmem_reserve[] issue for zone_watermark_ok here.
> > > > >
> > > > > Sorry I couldn't understand your point.
> > > > > I mean if min watermark is too high, you could lower min_free_kbytes.
> > > > > If reserved pages is too high, you could handle lowmem_reserve_ratio.
> > > > > Could we solve the problem with those knobs?
> > > > I mean a system with default setting. Changing the knobs might solve the
> > > > problem, but few people know it, so not a right solution.
> > >
> > > We couldn't cover whole cases so that we should focus on common case.
> > > That's why we need knobs.
> > > Do you think your case is general one we should handle it by default?
> > > If you really think, we have to fix watermark setting logic rathe than kswapd.
> > it's pretty common a system has big high zone and small low zone these
> > days. Why current watermark setting is wrong?
> 
> You said as follows,
> 
> > > > > > > > One problem here is end_zone could be high, and low zone always doesn't
> > > > > > > > meet min watermark. So kswapd keeps running without a wait and builds
> 
> You should have said not "meet min watermark" but "meet min_watermak + reserved pages[HIGH_ZONE]"
> for the clearness. Sorry for misleading your statement. Now I got realized your intention.
I should have made it clear, sorry.

> > > > > In addition, kswapd could easily set all_unreclaimable of the zone in your example.
> > > > > Then, kswapd should just peek the zone once in a while if the zone is all_unreclaimable.
> > > > > It should be no problem in CPU overhead.
> > > > even the low zone has all_unreclaimable, the high zone will be scanned
> > > > again with low priority (because its high watermark might not be ok, but
> > > > min watermark is ok), and does not do congestion_wait. This will make a
> > >
> > > As you said, high zone is below high watermark so it's natural since it's
> > > one of kswapd goal. Why it doesn't sleep is that one zone of zonelists is
> > > below min_watmark. So it's natural in current policy, too.
> > there are two cases one zone is below min_watermark.
> > 1. the zone is below min_watermark for allocation in the zone. in this
> > case we need hurry up.
> > 2. the zone is below min_watermark for allocation from high zone. we
> > don't really need hurry up if other zone is above min_watermark.
> > Since low zone need to reserve pages for high zone, the second case
> > could be common.
> 
> You mean "lowmem_reserve"?
> It means opposite. It is a mechanism to defend using of lowmem pages from high zone allocation
> because it could be fatal to allow process pages to be allocated from low zone.
> Also, We could set each ratio for reserved pages of zones.
> How could we make sure lower zones have enough free pages for higher zone?
lowmem_reserve causes the problem, but it's not a fault of
lowmem_reserve. I'm thinking changing it.

> > Yes, keeping kswapd running in this case can reduce the chance
> > GFP_ATOMIC failure. But my patch will not cause immediate failure
> > because there is still some zones which are above min_watermark and can
> > meet the GFP_ATOMIC allocation. And keeping kswapd running has some
> 
> True. It was why I said "I don't mean you are wrong but we are very careful about this."
> Normally, it could handle but might fail on sudden peak of atomic allocation stream.
> Recently, we have suffered from many reporting of GFP_AOTMIC allocation than olded.
> So I would like to be very careful and that's why I suggest we need at least some experiment.
> Through it, Andrew could make final call.
sure.

> > drawbacks:
> > 1. cpu overhead
> > 2. extend isolate window size, so trash working set.
> > Considering DMA zone, we almost always have DMA zone min_watermark not
> > ok for any allocation from high zone. So we will always have such
> > drawbacks.
> 
> I agree with you in that it's a problem.
> I think the real solution is to remove the zone from allocation fallback list in such case
> because the lower zone could never meet any allocation for the higher zone.
> But it makes code rather complicated as we have to consider admin who can change
> reserved pages anytime.
Not worthy the complication.

> So how about this?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8913374..f71ed2f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2693,8 +2693,16 @@ loop_again:
>                                  * failure risk. Hurry up!
>                                  */
>                                 if (!zone_watermark_ok_safe(zone, order,
> -                                           min_wmark_pages(zone), end_zone, 0))
> -                                       has_under_min_watermark_zone = 1;
> +                                           min_wmark_pages(zone), end_zone, 0)) {
> +                                       /*
> +                                        * In case of big reserved page for higher zone,
> +                                        * it is pointless to try reclaimaing pages quickly
> +                                        * because it could never meet the requirement.
> +                                        */
> +                                       if (zone->present_pages >
> +                                               min_wmark_pages(zone) + zone->lowmem_reserve[end_zone])
> +                                               has_under_min_watermark_zone = 1;
> +                               }
>                         } else {
>                                 /*
>                                  * If a zone reaches its high watermark,
This looks like a workaround just for DMA zone. present_pages could be
bigger than min_mwark+lowmem_reserve. And We still suffered from the
issue, for example, a DMA32 zone with some pages allocated for DMA, or a
zone which has some lru pages, but still much smaller than high zone.

> Even, we could apply this at starting of the loop so that we can avoid unnecessary scanning st the beginning.
> In that case, we have to apply zone->lowmem_reserve[end_zone] only because we have to consider NO_WATERMARK alloc case.
yes, we can do this to avoid unnecessary scan. but DMA zone hasn't lru
pages, so not sure how big the benefit is here.

> > Or is something below better? we can avoid the big reserved pages
> > accounting to the min_wmark_pages for low zone. if high zone is under
> > min_wmark, kswapd will not sleep.
> >                                if (!zone_watermark_ok_safe(zone, order,
> > -                                            min_wmark_pages(zone), end_zone, 0))
> > +                                            min_wmark_pages(zone), 0, 0))
> >                                         has_under_min_watermark_zone = 1;
> 
> I think it's not a good idea since page allocator always considers classzone_idx.
> So although we fix kswapd issue through your changes, page allocator still can't allocate memory
> and wakes up kswapd, again.
why kswapd will be waked up again? The high zone itself still has
min_wark+low_reserve ok for the allocation(classzone_idx 0 means
checking the low_reserve for allocation from the zone itself), so the
allocation can be met.

Thanks,
Shaohua

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2011-10-12  2:42 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-09-27  7:23 [patch 2/2]vmscan: " Shaohua Li
2011-09-27 11:28 ` Michal Hocko
2011-09-28  0:48   ` Shaohua Li
2011-09-28  9:27     ` Michal Hocko
2011-10-08  3:14       ` Shaohua Li
2011-10-08  3:19         ` David Rientjes
2011-10-08  3:35           ` Shaohua Li
2011-10-08  5:56             ` [patch v2]vmscan: " Shaohua Li
2011-10-08 10:25               ` Minchan Kim
2011-10-09  5:53                 ` Shaohua Li
2011-10-09  8:01                   ` Minchan Kim
2011-10-09  8:17                     ` Shaohua Li
2011-10-09 15:10                       ` Minchan Kim
2011-10-10  7:28                         ` Shaohua Li
2011-10-10 15:42                           ` Minchan Kim
2011-10-11  5:30                             ` Shaohua Li
2011-10-11  6:54                               ` Minchan Kim
2011-10-12  2:48                                 ` Shaohua Li [this message]
2011-10-12  7:59                                   ` Minchan Kim
2011-10-18  2:13                                     ` [patch v3]vmscan: " Shaohua Li
2011-10-27 22:50                                       ` Minchan Kim
2011-10-28  5:15                                         ` Shaohua Li
2011-11-07  5:15                                           ` Shaohua Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1318387739.22361.109.camel@sli10-conroe \
    --to=shaohua.li@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=mhocko@suse.cz \
    --cc=minchan.kim@gmail.com \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox