From: Mel Gorman <mel@csn.ul.ie>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Frans Pop <elendil@planet.nl>, Jiri Kosina <jkosina@suse.cz>,
Sven Geggus <lists@fuchsschwanzdomain.de>,
Karol Lewandowski <karol.k.lewandowski@gmail.com>,
Tobias Oetiker <tobi@oetiker.ch>,
linux-kernel@vger.kernel.org,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Pekka Enberg <penberg@cs.helsinki.fi>,
Rik van Riel <riel@redhat.com>,
Christoph Lameter <cl@linux-foundation.org>,
Stephan von Krawczynski <skraw@ithnet.com>,
"Rafael J. Wysocki" <rjw@sisk.pl>,
Kernel Testers List <kernel-testers@vger.kernel.org>
Subject: Re: [PATCH 5/5] vmscan: Take order into consideration when deciding if kswapd is in trouble
Date: Fri, 13 Nov 2009 13:54:43 +0000
Message-ID: <20091113135443.GF29804@csn.ul.ie>
In-Reply-To: <20091113142608.33B9.A69D9226@jp.fujitsu.com>
On Fri, Nov 13, 2009 at 06:54:29PM +0900, KOSAKI Motohiro wrote:
> > If reclaim fails to make sufficient progress, the priority is raised.
> > Once the priority is higher, kswapd starts waiting on congestion.
> > However, on systems with large numbers of high-order atomics due to
> > crappy network cards, it's important that kswapd keep working in
> > parallel to save their sorry ass.
> >
> > This patch takes into account the order kswapd is reclaiming at before
> > waiting on congestion. The higher the order, the longer it is before
> > kswapd considers itself to be in trouble. The impact is that kswapd
> > works harder in parallel rather than depending on direct reclaimers or
> > atomic allocations to fail.
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> > ---
> > mm/vmscan.c | 14 ++++++++++++--
> > 1 files changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ffa1766..5e200f1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1946,7 +1946,7 @@ static int sleeping_prematurely(int order, long remaining)
> > static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > {
> > int all_zones_ok;
> > - int priority;
> > + int priority, congestion_priority;
> > int i;
> > unsigned long total_scanned;
> > struct reclaim_state *reclaim_state = current->reclaim_state;
> > @@ -1967,6 +1967,16 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
> > */
> > int temp_priority[MAX_NR_ZONES];
> >
> > + /*
> > + * When priority reaches congestion_priority, kswapd will sleep
> > + * for a short time while congestion clears. The higher the
> > + * order being reclaimed, the less likely kswapd will go to
> > + * sleep as high-order allocations are harder to reclaim and
> > + * stall direct reclaimers longer
> > + */
> > + congestion_priority = DEF_PRIORITY - 2;
> > + congestion_priority -= min(congestion_priority, sc.order);
>
> This calculation means:
>
> sc.order    congestion_priority    scan-pages
> ---------------------------------------------------------
> 0           10                     1/1024 * zone-mem
> 1           9                      1/512  * zone-mem
> 2           8                      1/256  * zone-mem
> 3           7                      1/128  * zone-mem
> 4           6                      1/64   * zone-mem
> 5           5                      1/32   * zone-mem
> 6           4                      1/16   * zone-mem
> 7           3                      1/8    * zone-mem
> 8           2                      1/4    * zone-mem
> 9           1                      1/2    * zone-mem
> 10          0                      1      * zone-mem
> 11+         0                      1      * zone-mem
>
> I feel this is too aggressive. The intention of this congestion_wait()
> is to prevent kswapd from using 100% CPU time.
Ok, I thought the intention might be to avoid dumping too many pages on
the queue but it was already waiting on congestion elsewhere.
> but the above promotion seems
> to break it.
>
> Example 1:
> ia64 has 256MB hugepages (i.e. order=14), which means kswapd never sleeps.
>
> Example 2:
> order-3 (i.e. PAGE_ALLOC_COSTLY_ORDER) makes for one of the most inefficient
> reclaims, because it doesn't use lumpy reclaim.
> I've seen a 128GB zone, where 1/128 means 1GB. Oh well, kswapd will
> definitely waste 100% CPU time.
>
>
> > +
> > loop_again:
> > total_scanned = 0;
> > sc.nr_reclaimed = 0;
> > @@ -2092,7 +2102,7 @@ loop_again:
> > * OK, kswapd is getting into trouble. Take a nap, then take
> > * another pass across the zones.
> > */
> > - if (total_scanned && priority < DEF_PRIORITY - 2)
> > + if (total_scanned && priority < congestion_priority)
> > congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> Instead, How about this?
>
This makes a lot of sense. Tests look good and I added stats to make sure
the logic was triggering. On X86, kswapd avoided a congestion_wait 11723
times and X86-64 avoided it 5084 times. I think we should hold onto the
stats temporarily until all these bugs are ironed out.
Would you like to sign off the following?
If you are ok to sign off, this patch should replace my patch 5 in
the series.
==== CUT HERE ====
vmscan: Stop kswapd waiting on congestion when the min watermark is not being met

If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion. However,
if the zone is below the min watermark then kswapd needs to continue working
without delay as there is a danger of an increased rate of GFP_ATOMIC
allocation failure.

This patch changes the conditions under which kswapd waits on
congestion by only going to sleep if the min watermarks are being met.

[mel@csn.ul.ie: Add stats to track how relevant the logic is]
Needs-signed-off-by-original-author
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9716003..7d66695 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -41,6 +41,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_STEAL, KSWAPD_INODESTEAL,
KSWAPD_PREMATURE_FAST, KSWAPD_PREMATURE_SLOW,
+ KSWAPD_NO_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ffa1766..70967e1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1966,6 +1966,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
* free_pages == high_wmark_pages(zone).
*/
int temp_priority[MAX_NR_ZONES];
+ int has_under_min_watermark_zone = 0;
loop_again:
total_scanned = 0;
@@ -2085,6 +2086,15 @@ loop_again:
if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
sc.may_writepage = 1;
+
+ /*
+ * We are still under the min watermark, which means there is
+ * a risk of GFP_ATOMIC allocation failure. Hurry up!
+ */
+ if (!zone_watermark_ok(zone, order, min_wmark_pages(zone),
+ end_zone, 0))
+ has_under_min_watermark_zone = 1;
+
}
if (all_zones_ok)
break; /* kswapd: all done */
@@ -2092,8 +2102,13 @@ loop_again:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ if (total_scanned && (priority < DEF_PRIORITY - 2)) {
+
+ if (has_under_min_watermark_zone)
+ count_vm_event(KSWAPD_NO_CONGESTION_WAIT);
+ else
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ }
/*
* We do this so kswapd doesn't build up large priorities for
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 90b11e4..bc09547 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -685,6 +685,7 @@ static const char * const vmstat_text[] = {
"kswapd_inodesteal",
"kswapd_slept_prematurely_fast",
"kswapd_slept_prematurely_slow",
+ "kswapd_no_congestion_wait",
"pageoutrun",
"allocstall",