Re: [patch v3]vmscan: correctly detect GFP_ATOMIC allocation failure

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Minchan Kim <minchan.kim@gmail.com>
To: Shaohua Li <shaohua.li@intel.com>
Cc: David Rientjes <rientjes@google.com>,
	Andrew Morton <akpm@linux-foundation.org>, mel <mel@csn.ul.ie>,
	Rik van Riel <riel@redhat.com>, Michal Hocko <mhocko@suse.cz>,
	linux-mm <linux-mm@kvack.org>
Subject: Re: [patch v3]vmscan: correctly detect GFP_ATOMIC allocation failure
Date: Fri, 28 Oct 2011 07:50:42 +0900	[thread overview]
Message-ID: <20111027225042.GA29407@barrios-laptop.redhat.com> (raw)
In-Reply-To: <1318904032.22361.113.camel@sli10-conroe>

On Tue, Oct 18, 2011 at 10:13:52AM +0800, Shaohua Li wrote:
> On Wed, 2011-10-12 at 15:59 +0800, Minchan Kim wrote:
> > On Wed, Oct 12, 2011 at 10:48:59AM +0800, Shaohua Li wrote:
> >  > > there are two cases one zone is below min_watermark.
> > > > > 1. the zone is below min_watermark for allocation in the zone. in this
> > > > > case we need hurry up.
> > > > > 2. the zone is below min_watermark for allocation from high zone. we
> > > > > don't really need hurry up if other zone is above min_watermark.
> > > > > Since low zone need to reserve pages for high zone, the second case
> > > > > could be common.
> > > > 
> > > > You mean "lowmem_reserve"?
> > > > It means opposite. It is a mechanism to defend using of lowmem pages from high zone allocation
> > > > because it could be fatal to allow process pages to be allocated from low zone.
> > > > Also, We could set each ratio for reserved pages of zones.
> > > > How could we make sure lower zones have enough free pages for higher zone?
> > > lowmem_reserve causes the problem, but it's not a fault of
> > > lowmem_reserve. I'm thinking changing it.
> > > 
> > > > > Yes, keeping kswapd running in this case can reduce the chance
> > > > > GFP_ATOMIC failure. But my patch will not cause immediate failure
> > > > > because there is still some zones which are above min_watermark and can
> > > > > meet the GFP_ATOMIC allocation. And keeping kswapd running has some
> > > > 
> > > > True. It was why I said "I don't mean you are wrong but we are very careful about this."
> > > > Normally, it could handle but might fail on sudden peak of atomic allocation stream.
> > > > Recently, we have suffered from many reporting of GFP_AOTMIC allocation than olded.
> > > > So I would like to be very careful and that's why I suggest we need at least some experiment.
> > > > Through it, Andrew could make final call.
> > > sure.
> > > 
> > > > > drawbacks:
> > > > > 1. cpu overhead
> > > > > 2. extend isolate window size, so trash working set.
> > > > > Considering DMA zone, we almost always have DMA zone min_watermark not
> > > > > ok for any allocation from high zone. So we will always have such
> > > > > drawbacks.
> > > > 
> > > > I agree with you in that it's a problem.
> > > > I think the real solution is to remove the zone from allocation fallback list in such case
> > > > because the lower zone could never meet any allocation for the higher zone.
> > > > But it makes code rather complicated as we have to consider admin who can change
> > > > reserved pages anytime.
> > > Not worthy the complication.
> > > 
> > > > So how about this?
> > > > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 8913374..f71ed2f 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2693,8 +2693,16 @@ loop_again:
> > > >                                  * failure risk. Hurry up!
> > > >                                  */
> > > >                                 if (!zone_watermark_ok_safe(zone, order,
> > > > -                                           min_wmark_pages(zone), end_zone, 0))
> > > > -                                       has_under_min_watermark_zone = 1;
> > > > +                                           min_wmark_pages(zone), end_zone, 0)) {
> > > > +                                       /*
> > > > +                                        * In case of big reserved page for higher zone,
> > > > +                                        * it is pointless to try reclaimaing pages quickly
> > > > +                                        * because it could never meet the requirement.
> > > > +                                        */
> > > > +                                       if (zone->present_pages >
> > > > +                                               min_wmark_pages(zone) + zone->lowmem_reserve[end_zone])
> > > > +                                               has_under_min_watermark_zone = 1;
> > > > +                               }
> > > >                         } else {
> > > >                                 /*
> > > >                                  * If a zone reaches its high watermark,
> > > This looks like a workaround just for DMA zone. present_pages could be
> > > bigger than min_mwark+lowmem_reserve. And We still suffered from the
> > > issue, for example, a DMA32 zone with some pages allocated for DMA, or a
> > > zone which has some lru pages, but still much smaller than high zone.
> > 
> > Right. I thought about it but couldn't have a good idea for it. :(
> > 
> > > 
> > > > Even, we could apply this at starting of the loop so that we can avoid unnecessary scanning st the beginning.
> > > > In that case, we have to apply zone->lowmem_reserve[end_zone] only because we have to consider NO_WATERMARK alloc case.
> > > yes, we can do this to avoid unnecessary scan. but DMA zone hasn't lru
> > > pages, so not sure how big the benefit is here.
> > 
> > At least, we can prevent has_under_min_watermark_zone from set.
> > But it still have a problem you pointed out earlier.
> > 
> > > 
> > > > > Or is something below better? we can avoid the big reserved pages
> > > > > accounting to the min_wmark_pages for low zone. if high zone is under
> > > > > min_wmark, kswapd will not sleep.
> > > > >                                if (!zone_watermark_ok_safe(zone, order,
> > > > > -                                            min_wmark_pages(zone), end_zone, 0))
> > > > > +                                            min_wmark_pages(zone), 0, 0))
> > > > >                                         has_under_min_watermark_zone = 1;
> > > > 
> > > > I think it's not a good idea since page allocator always considers classzone_idx.
> > > > So although we fix kswapd issue through your changes, page allocator still can't allocate memory
> > > > and wakes up kswapd, again.
> > > why kswapd will be waked up again? The high zone itself still has
> > > min_wark+low_reserve ok for the allocation(classzone_idx 0 means
> > > checking the low_reserve for allocation from the zone itself), so the
> > > allocation can be met.
> > 
> > You're absolutely right.
> > I got confused. Sorry about that.
> > 
> > I like this than your old version.
> > That's because it could rush if one of zonelist is consumed as below min_watermak.
> > It could mitigate GFP_ALLOC fail than yours old version but still would be higher than now.
> > So, we need the number.
> > 
> > Could you repost this as formal patch with good comment and number?
> > Personally, I like description based on scenario with kind step-by-step.
> > Feel free to use my description in my patch if you want.
> > 
> > Thanks for patient discussion in spite of my irregular reply, Shaohua.
> Thanks for your time. Here is patch. I'm trying to get some number, but
> didn't find a proper workload to demonstrate the atomic allocation
> failure. Any suggestion for the workload?

Sorry for really really late response.

Hmm.. I didn't have a such test.
But it seems recently we had a such report, again.
https://lkml.org/lkml/2011/10/18/131
If you want it, you could ask him.

> 
> Thanks,
> Shaohua
> 
> 
> Subject: vmscan: correctly detect GFP_ATOMIC allocation failure -v3
> 
> has_under_min_watermark_zone is used to detect if there is GFP_ATOMIC allocation
> failure risk. Current logic is if any zone has min watermark not ok, we have
> risk.
> 
> Low zone needs reserve memory to avoid fallback from high zone. The reserved
> memory is zone->lowmem_reserve[]. If high zone is big, low zone's
> min_wmark + lowmem_reserve[] usually is big. Sometimes min_wmark +
> lowmem_reserve[] could even be higher than zone->present_pages. An example is
> DMA zone. Other low zones could have the similar high reserved memory though
> might still have margins between reserved pages and present pages. So in kswapd
> loop, if end_zone is a high zone, has_under_min_watermark_zone could be easily
> set or always set for DMA.
> 
> Let's consider end_zone is a high zone and it has high_mwark not ok, but
> min_mwark ok. A DMA zone always has present_pages less than reserved pages, so
> has_under_min_watermark_zone is always set. When kswapd is running, there are
> some drawbacks:
> 1. kswapd can keep unnecessary running without congestion_wait. high zone
> already can meet GFP_ATOMIC. The running will waste some CPU.
> 2. kswapd can scan much more pages to trash working set. congestion_wait can
> slow down scan if kswapd has trouble. Now congestion_wait is skipped, kswapd
> will keep scanning unnecessary pages.
> 
> So since DMA zone always set has_under_min_watermark_zone, current logic actually
> equals to that kswapd keeps running without congestion_wait till high zone has
> high wmark ok when it has trouble. This is not intended.
> 
> In this path, we test the min_mwark against the zone itself. This doesn't change
> the behavior of high zone. For low zone, we now exclude lowmem_reserve for high
> zone to avoid unnecessary running.
> 
> Note: With this patch, we could have potential higher risk of GFP_ATOMIC failure.
> 
> v3: Uses a less intrusive method to determine if has_under_min_watermark_zone
> should be set after discussion with Minchan.
> v2: use bool and clear has_under_min_watermark_zone for zone with watermark ok
> as suggested by David Rientjes.
> 
> Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

As you know, I give my reviewed-by as a token of "code looks good to me".
But still, we see atomic allocation failure messasge recently.
So you need to get a acked-by as a toekn of "I support this idea" of
others(ie, Mel/Rik are right person).

I hope it would be better to write down it in description that
it's real problem of your system and the patch solves it.

Thanks!

> ---
>  mm/vmscan.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> Index: linux/mm/vmscan.c
> ===================================================================
> --- linux.orig/mm/vmscan.c	2011-10-17 16:02:30.000000000 +0800
> +++ linux/mm/vmscan.c	2011-10-18 09:32:23.000000000 +0800
> @@ -2463,7 +2463,7 @@ loop_again:
>  
>  	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
>  		unsigned long lru_pages = 0;
> -		int has_under_min_watermark_zone = 0;
> +		bool has_under_min_watermark_zone = false;
>  
>  		/* The swap token gets in the way of swapout... */
>  		if (!priority)
> @@ -2593,8 +2593,8 @@ loop_again:
>  				 * failure risk. Hurry up!
>  				 */
>  				if (!zone_watermark_ok_safe(zone, order,
> -					    min_wmark_pages(zone), end_zone, 0))
> -					has_under_min_watermark_zone = 1;
> +					    min_wmark_pages(zone), 0, 0))
> +					has_under_min_watermark_zone = true;
>  			} else {
>  				/*
>  				 * If a zone reaches its high watermark,
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2011-10-27 22:51 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-09-27  7:23 [patch 2/2]vmscan: " Shaohua Li
2011-09-27 11:28 ` Michal Hocko
2011-09-28  0:48   ` Shaohua Li
2011-09-28  9:27     ` Michal Hocko
2011-10-08  3:14       ` Shaohua Li
2011-10-08  3:19         ` David Rientjes
2011-10-08  3:35           ` Shaohua Li
2011-10-08  5:56             ` [patch v2]vmscan: " Shaohua Li
2011-10-08 10:25               ` Minchan Kim
2011-10-09  5:53                 ` Shaohua Li
2011-10-09  8:01                   ` Minchan Kim
2011-10-09  8:17                     ` Shaohua Li
2011-10-09 15:10                       ` Minchan Kim
2011-10-10  7:28                         ` Shaohua Li
2011-10-10 15:42                           ` Minchan Kim
2011-10-11  5:30                             ` Shaohua Li
2011-10-11  6:54                               ` Minchan Kim
2011-10-12  2:48                                 ` Shaohua Li
2011-10-12  7:59                                   ` Minchan Kim
2011-10-18  2:13                                     ` [patch v3]vmscan: " Shaohua Li
2011-10-27 22:50                                       ` Minchan Kim [this message]
2011-10-28  5:15                                         ` Shaohua Li
2011-11-07  5:15                                           ` Shaohua Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20111027225042.GA29407@barrios-laptop.redhat.com \
    --to=minchan.kim@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=linux-mm@kvack.org \
    --cc=mel@csn.ul.ie \
    --cc=mhocko@suse.cz \
    --cc=riel@redhat.com \
    --cc=rientjes@google.com \
    --cc=shaohua.li@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox