From: Wu Fengguang <fengguang.wu@intel.com>
To: Mel Gorman <mel@csn.ul.ie>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
Rik van Riel <riel@redhat.com>,
Christoph Lameter <cl@linux-foundation.org>,
"Zhang, Yanmin" <yanmin.zhang@intel.com>,
"linuxram@us.ibm.com" <linuxram@us.ibm.com>,
linux-mm <linux-mm@kvack.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 1/3] Reintroduce zone_reclaim_interval for when zone_reclaim() scans and fails to avoid CPU spinning at 100% on NUMA
Date: Tue, 9 Jun 2009 21:38:04 +0800
Message-ID: <20090609133804.GB6583@localhost>
In-Reply-To: <20090609094050.GL18380@csn.ul.ie>

On Tue, Jun 09, 2009 at 05:40:50PM +0800, Mel Gorman wrote:
> On Tue, Jun 09, 2009 at 05:07:35PM +0800, Wu Fengguang wrote:
> > On Tue, Jun 09, 2009 at 04:31:54PM +0800, Mel Gorman wrote:
> > > On Tue, Jun 09, 2009 at 04:25:39PM +0800, Wu Fengguang wrote:
> > > > On Tue, Jun 09, 2009 at 04:14:25PM +0800, Mel Gorman wrote:
> > > > > On Tue, Jun 09, 2009 at 09:58:22AM +0800, Wu Fengguang wrote:
> > > > > > On Mon, Jun 08, 2009 at 09:01:28PM +0800, Mel Gorman wrote:
> > > > > > > On NUMA machines, the administrator can configure zone_reclaim_mode, which
> > > > > > > is a more targeted form of direct reclaim. On machines with large NUMA
> > > > > > > distances, zone_reclaim_mode defaults to 1, meaning that clean unmapped pages
> > > > > > > will be reclaimed if the zone watermarks are not being met. The problem is
> > > > > > > that zone_reclaim() can be in a situation where it scans excessively without
> > > > > > > making progress.
> > > > > > >
> > > > > > > One such situation is where a large tmpfs mount is occupying a large
> > > > > > > percentage of memory overall. The pages do not get cleaned or reclaimed by
> > > > > > > zone_reclaim(), but the lists are uselessly scanned frequently, making the
> > > > > > > CPU spin at 100%. The scanning occurs because zone_reclaim() cannot tell
> > > > > > > in advance that the scan is pointless, because the counters do not distinguish
> > > > > > > between pagecache pages backed by disk and by RAM. The observation in
> > > > > > > the field is that malloc() stalls for a long time (minutes in some cases)
> > > > > > > when this situation occurs.
> > > > > > >
> > > > > > > Accounting for RAM-backed file pages was considered but not implemented on
> > > > > > > the grounds that it would introduce new branches and expensive checks into
> > > > > > > the page cache add/remove paths and increase the number of statistics
> > > > > > > needed in the zone. As zone_reclaim() failing is currently considered a
> > > > > > > corner case, this seemed like overkill. Note, if there are a large number
> > > > > > > of reports about CPU spinning at 100% on NUMA that is fixed by disabling
> > > > > > > zone_reclaim, then this assumption is false and zone_reclaim() scanning
> > > > > > > and failing is not a corner case but a common occurrence.
> > > > > > >
> > > > > > > This patch reintroduces zone_reclaim_interval which was removed by commit
> > > > > > > 34aa1330f9b3c5783d269851d467326525207422 [zoned vm counters: zone_reclaim:
> > > > > > > remove /proc/sys/vm/zone_reclaim_interval] because the zone counters were
> > > > > > > considered sufficient to determine in advance if the scan would succeed.
> > > > > > > As unsuccessful scans can still occur, zone_reclaim_interval is still
> > > > > > > required.
> > > > > >
> > > > > > Can we avoid the user visible parameter zone_reclaim_interval?
> > > > > >
> > > > >
> > > > > You could, but then there is no way of disabling it by setting it to 0
> > > > > either. I can't imagine why but the desired behaviour might really be to
> > > > > spin and never go off-node unless there is no other option. They might
> > > > > want to set it to 0 for example when determining what the right value for
> > > > > zone_reclaim_mode is for their workloads.
> > > > >
> > > > > > That would mean introducing some heuristics for it.
> > > > >
> > > > > I suspect the vast majority of users will ignore it unless they are running
> > > > > zone_reclaim_mode at the same time and even then will probably just leave
> > > > > it as 30, as an LRU scan every 30 seconds worst case is not going to show up
> > > > > on many profiles.
> > > > >
> > > > > > Since the whole point
> > > > > > is to avoid 100% CPU usage, we can record the time spent in this
> > > > > > failed zone reclaim (T) and forbid zone reclaim until (NOW + 100*T).
> > > > > >
> > > > >
> > > > > i.e. just fix it internally at 100 seconds? How is that better than
> > > > > having an obscure tunable? I think if this heuristic exists at all, it's
> > > > > important that an administrator be able to turn it off if absolutely
> > > > > necessary and so something must be user-visible.
> > > >
> > > > That 100*T doesn't mean 100 seconds. It means to keep CPU usage under 1%:
> > > > after busy scanning for time T, let's go relax for 100*T.
> > > >
> > >
> > > Do I have a means of calculating what my CPU usage is as a result of
> > > scanning the LRU list?
> > >
> > > If I don't and the machine is busy, would I not avoid scanning even in
> > > situations where it should have been scanned?
> >
> > I guess we don't really care about the exact number for the ratio 100.
> > If the box is busy, it automatically scales the effective ratio to 200
> > or more, which I think is reasonable behavior.
> >
> > Something like this.
> >
> > Thanks,
> > Fengguang
> >
> > ---
> > include/linux/mmzone.h | 2 ++
> > mm/vmscan.c | 11 +++++++++++
> > 2 files changed, 13 insertions(+)
> >
> > --- linux.orig/include/linux/mmzone.h
> > +++ linux/include/linux/mmzone.h
> > @@ -334,6 +334,8 @@ struct zone {
> >  	/* Zone statistics */
> >  	atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
> >
> > +	unsigned long zone_reclaim_relax;
> > +
> >  	/*
> >  	 * prev_priority holds the scanning priority for this zone. It is
> >  	 * defined as the scanning priority at which we achieved our reclaim
> > --- linux.orig/mm/vmscan.c
> > +++ linux/mm/vmscan.c
> > @@ -2453,6 +2453,7 @@ int zone_reclaim(struct zone *zone, gfp_
> >  	int ret;
> >  	long nr_unmapped_file_pages;
> >  	long nr_slab_reclaimable;
> > +	unsigned long t;
> >
> >  	/*
> >  	 * Zone reclaim reclaims unmapped file backed pages and
> > @@ -2475,6 +2476,11 @@ int zone_reclaim(struct zone *zone, gfp_
> >  	if (zone_is_all_unreclaimable(zone))
> >  		return 0;
> >
> > +	if (time_in_range(jiffies,
> > +			  zone->zone_reclaim_relax - 10000 * HZ,
> > +			  zone->zone_reclaim_relax))
> > +		return 0;
> > +
>
> So. zone_reclaim_relax is some value between now and 100 times the approximate
> time it takes to scan the LRU list. This check ensures that we do not scan
> multiple times within the same interval. Is that right?

Yes and no: zone_reclaim_relax is the *absolute* time for that.
This check ensures that if we wasted T seconds doing a fruitless
zone reclaim, zone reclaim won't be repeated in the following 100*T
seconds - which is a coarse relax period.

Its simpler form is: time_before(jiffies, zone_reclaim_relax),
if we ignore wraparound issues.

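To make the check and the back-off concrete, here is a minimal user-space sketch of the idea, not the kernel code itself. The names zone_sketch, ticks_before(), ticks_in_range(), reclaim_should_relax() and reclaim_note_failure() are hypothetical stand-ins, and HZ is fixed at an illustrative value. The range helper follows the argument order of the kernel's time_in_range(a, b, c), which tests whether a lies in [b, c], so the current time is passed first.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Illustrative values only; HZ is configuration dependent in the kernel. */
#define HZ          1000UL
#define RELAX_RATIO 100UL            /* relax for 100*T after scanning for T */
#define RELAX_MAX   (10000UL * HZ)   /* upper bound on the relax period */

/* Hypothetical stand-in for the new field in struct zone. */
struct zone_sketch {
	unsigned long reclaim_relax; /* absolute tick until which scans are skipped */
};

/* Simple form: "a is before b". Relies on the signed difference, so it is
 * only meaningful while a and b are less than half the counter range apart. */
static bool ticks_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* "a lies in [b, c]", inclusive on both ends. */
static bool ticks_in_range(unsigned long a, unsigned long b, unsigned long c)
{
	return (long)(a - b) >= 0 && (long)(a - c) <= 0;
}

/* True if a reclaim attempt at tick 'now' should be skipped. Bounding the
 * window below by RELAX_MAX keeps a very stale deadline from matching. */
static bool reclaim_should_relax(const struct zone_sketch *z, unsigned long now)
{
	return ticks_in_range(now, z->reclaim_relax - RELAX_MAX, z->reclaim_relax);
}

/* Record a fruitless scan that ran from tick 'start' to tick 'now'. */
static void reclaim_note_failure(struct zone_sketch *z,
				 unsigned long start, unsigned long now)
{
	unsigned long t = RELAX_RATIO * (now - start);

	if (t > RELAX_MAX)
		t = RELAX_MAX;
	z->reclaim_relax = now + t;
}

/* Small self-check covering a scan that straddles counter wraparound. */
static void relax_selfcheck(void)
{
	struct zone_sketch z = { .reclaim_relax = 0 };
	unsigned long start = ULONG_MAX - 2;
	unsigned long now = start + 5;   /* wraps around to 2 */

	reclaim_note_failure(&z, start, now);
	assert(z.reclaim_relax == 502UL);       /* 2 + 100 * 5 */
	assert(reclaim_should_relax(&z, now));
	assert(!reclaim_should_relax(&z, 503UL));
	assert(ticks_before(now, z.reclaim_relax));
}
```

In this sketch a caller would test reclaim_should_relax() before scanning and call reclaim_note_failure() only when the scan reclaimed nothing, which is the role the two added hunks play in zone_reclaim().
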
> >  	/*
> >  	 * Do not scan if the allocation should not be delayed.
> >  	 */
> > @@ -2493,7 +2499,12 @@ int zone_reclaim(struct zone *zone, gfp_
> >
> >  	if (zone_test_and_set_flag(zone, ZONE_RECLAIM_LOCKED))
> >  		return 0;
> > +	t = jiffies;
> >  	ret = __zone_reclaim(zone, gfp_mask, order);
> > +	if (sc.nr_reclaimed == 0) {
> > +		t = min_t(unsigned long, 10000 * HZ, 100 * (jiffies - t));
> > +		zone->zone_reclaim_relax = jiffies + t;
> > +	}
>
> This appears to be a way of automatically selecting a value for
> zone_reclaim_interval, but is 100 times the length of time it takes to scan the
> LRU list enough to avoid excessive scanning of the LRU lists by zone_reclaim()?

Exactly.

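As a rough check of the "under 1%" figure, assume every fruitless scan takes the same time T and that a new attempt arrives as soon as the relax period ends (both are simplifying assumptions; the symbols R and f are introduced only for this estimate). With the cap from the patch, 10000 * HZ jiffies, i.e. 10000 seconds:

```latex
R = \min(10000\,\mathrm{s},\; 100\,T), \qquad
f = \frac{T}{T + R}

T \le 100\,\mathrm{s}: \quad
f = \frac{T}{T + 100\,T} = \frac{1}{101} \approx 0.99\%

T > 100\,\mathrm{s}: \quad
f = \frac{T}{T + 10000\,\mathrm{s}} > \frac{1}{101}
```

So the scanning share stays just under 1% unless a single scan takes longer than 100 seconds, where the cap lets it grow. If attempts arrive less often than assumed, the share is lower.
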
> I don't know and unlike zone_reclaim_interval, we have no way for the
> administrator to intervene in the event we get the calculation wrong.
>
> Conceivably though, zone_reclaim_interval could automatically tune
> itself based on a heuristic like this if the administrator does not give
> a specific value. I think that would be an interesting follow on once
> we've brought back zone_reclaim_interval and get a feeling for how often
> it is actually used.

Well I don't think that's good practice. There are heuristic
calculations all over the kernel. Shall we export parameters to
user space just because we are not absolutely sure? Or shall we ship
the heuristics, adjust them based on feedback, and only export
parameters when we find _known cases_ that cannot be covered by pure
heuristics?

Thanks,
Fengguang

> >  	zone_clear_flag(zone, ZONE_RECLAIM_LOCKED);
> >
> >  	return ret;
> >
>
> --
> Mel Gorman
> Part-time Phd Student                          Linux Technology Center
> University of Limerick                         IBM Dublin Software Lab