From: mel@skynet.ie (Mel Gorman)
To: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>,
linux-mm@kvack.org, ak@suse.de,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
akpm@linux-foundation.org, pj@sgi.com
Subject: Re: NUMA policy issues with ZONE_MOVABLE
Date: Thu, 2 Aug 2007 18:10:59 +0100
Message-ID: <20070802171059.GC23133@skynet.ie>
In-Reply-To: <1185994779.5059.87.camel@localhost>
On (01/08/07 14:59), Lee Schermerhorn didst pronounce:
> <snip>
> > This patch filters only when MPOL_BIND is in use. On non-NUMA, the
> > checks do not exist, and in the NUMA case the filtering usually does
> > not take place. I'd like this to be the bug fix for policy + ZONE_MOVABLE
> > and then deal with reducing zonelists to see if there is any performance
> > gain as well as a simplification in how policies and cpusets are
> > implemented.
> >
> > Testing shows no difference on non-NUMA, as you'd expect. On NUMA
> > machines there are only very small differences (kernbench figures
> > range from -0.02% to 0.15%). Lee, can you test this patch in relation
> > to MPOL_BIND? I'll look at the numactl tests tomorrow as well.
> >
>
> The patches look OK to me. I got around to testing them today,
> both atop the Memoryless Nodes series and directly on 23-rc1-mm1.
>
Excellent. Thanks for the test. I hadn't seen memtoy in use before; it
looks great for investigating this sort of thing.
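
To recap the mechanism for anyone skimming the thread: with MPOL_BIND in
effect, the zonelist walk skips any zone on a node outside the bound
nodemask while still considering ZONE_MOVABLE on the allowed nodes.
Roughly, the idea looks like the sketch below (a userspace toy with
made-up names, not the actual patch):

	/*
	 * Toy userspace sketch of the MPOL_BIND filtering idea -- the
	 * struct and the names are made up, not the kernel code.
	 */
	#include <stdio.h>

	struct zone {
		int node;	/* node the zone belongs to */
		int zone_idx;	/* 0 = DMA, 1 = Normal, 2 = Movable */
	};

	/* Walk the zonelist, skipping zones on nodes outside the mask. */
	static struct zone *first_allowed_zone(struct zone *zl, int n,
					       unsigned long nodemask)
	{
		for (int i = 0; i < n; i++) {
			if (!(nodemask & (1UL << zl[i].node)))
				continue;	/* node not in MPOL_BIND mask */
			return &zl[i];
		}
		return NULL;
	}

	int main(void)
	{
		/* Zone-ordered list for a toy two-node machine. */
		struct zone zl[] = {
			{ 0, 2 }, { 1, 2 },	/* Movable on nodes 0, 1 */
			{ 0, 1 }, { 1, 1 },	/* Normal on nodes 0, 1 */
		};
		/* Bind to node 1 only: node 0's zones are filtered out. */
		struct zone *z = first_allowed_zone(zl, 4, 1UL << 1);

		if (z)
			printf("allocate from node %d, zone_idx %d\n",
			       z->node, z->zone_idx);
		return 0;
	}

The real check happens during the allocator's zonelist walk; the sketch
just shows the skip.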
> Test System: 32GB 4-node ia64, booted with kernelcore=24G.
> Yields about 2G Movable and 6G Normal per node.
>
> Filtered zoneinfo:
>
> Node 0, zone   Normal
>   pages free     416464
>         spanned  425984
>         present  424528
> Node 0, zone  Movable
>   pages free     47195
>         spanned  60416
>         present  60210
> Node 1, zone   Normal
>   pages free     388011
>         spanned  393216
>         present  391871
> Node 1, zone  Movable
>   pages free     125940
>         spanned  126976
>         present  126542
> Node 2, zone   Normal
>   pages free     387849
>         spanned  393216
>         present  391872
> Node 2, zone  Movable
>   pages free     126285
>         spanned  126976
>         present  126542
> Node 3, zone   Normal
>   pages free     388256
>         spanned  393216
>         present  391872
> Node 3, zone  Movable
>   pages free     126575
>         spanned  126966
>         present  126490
> Node 4, zone      DMA
>   pages free     31689
>         spanned  32767
>         present  32656
> ---
> Attempt to allocate a 12G segment--i.e., more than the 4*2G of
> Movable--interleaved across nodes 0-3 with memtoy. I figured this
> would use up all of ZONE_MOVABLE on each node and then dip into
> NORMAL.
>
> root@gwydyr(root):memtoy
> memtoy pid: 6558
> memtoy>anon a1 12g
> memtoy>map a1
> memtoy>mbind a1 interleave 0,1,2,3
> memtoy>touch a1 w
> memtoy: touched 786432 pages in 10.542 secs
>
> Yields:
>
> Node 0, zone   Normal
>   pages free     328392
>         spanned  425984
>         present  424528
> Node 0, zone  Movable
>   pages free     37
>         spanned  60416
>         present  60210
> Node 1, zone   Normal
>   pages free     300293
>         spanned  393216
>         present  391871
> Node 1, zone  Movable
>   pages free     91
>         spanned  126976
>         present  126542
> Node 2, zone   Normal
>   pages free     300193
>         spanned  393216
>         present  391872
> Node 2, zone  Movable
>   pages free     49
>         spanned  126976
>         present  126542
> Node 3, zone   Normal
>   pages free     300448
>         spanned  393216
>         present  391872
> Node 3, zone  Movable
>   pages free     56
>         spanned  126966
>         present  126490
> Node 4, zone      DMA
>   pages free     31689
>         spanned  32767
>         present  32656
>
> Looks like it used most of the movable zone on each node [~8G total]
> and took the remainder from the normal zones--should be ~1G from
> zone Normal on each node. However, memtoy shows something
> weird, looking at the location of the 1st 64 pages at each
> 1G boundary. Most pages are located as I "expect" [well, I'm
> not sure why we start with node 2 at offset 0, instead of
> node 0].
Could it simply be because the process started on node 2?
alloc_page_interleave() would then have used the zonelist for that node.
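
Simplified, the interleave step behaves like the sketch below: each
allocation takes the current node and advances round-robin through the
policy's nodemask, so wherever the task happens to start seeds the
pattern (a toy model with made-up names, not the exact kernel code):

	/*
	 * Toy model of MPOL_INTERLEAVE node selection: advance
	 * round-robin through the allowed (non-empty) nodemask,
	 * wrapping at the end.
	 */
	#include <stdio.h>

	#define MAX_NODES 8

	static int next_interleave_node(int cur, unsigned long nodemask)
	{
		do {
			cur = (cur + 1) % MAX_NODES;
		} while (!(nodemask & (1UL << cur)));
		return cur;
	}

	int main(void)
	{
		unsigned long mask = 0xf;	/* nodes 0-3 allowed */
		int node = 2;			/* task started on node 2 */

		for (int i = 0; i < 8; i++) {
			printf("%d ", node);
			node = next_interleave_node(node, mask);
		}
		printf("\n");	/* prints 2 3 0 1 2 3 0 1, as in the trace */
		return 0;
	}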
>
> memtoy>where a1
> a 0x2000000003c08000 0x000300000000 0x000000000000 rw- private a1
> page offset +00 +01 +02 +03 +04 +05 +06 +07
> 0: 2 3 0 1 2 3 0 1
> 8: 2 3 0 1 2 3 0 1
> 10: 2 3 0 1 2 3 0 1
> 18: 2 3 0 1 2 3 0 1
> 20: 2 3 0 1 2 3 0 1
> 28: 2 3 0 1 2 3 0 1
> 30: 2 3 0 1 2 3 0 1
> 38: 2 3 0 1 2 3 0 1
>
> Same at 1G, 2G and 3G.
> But offsets between ~4G and 6+G [I didn't check any finer
> granularity and didn't want to watch > 780K pages scroll
> by] show:
>
> memtoy>where a1 4g 64p
> a 0x2000000003c08000 0x000300000000 0x000000000000 rw- private a1
> page offset +00 +01 +02 +03 +04 +05 +06 +07
> 40000: 2 3 1 1 2 3 1 1
> 40008: 2 3 1 1 2 3 1 1
> 40010: 2 3 1 1 2 3 1 1
> 40018: 2 3 1 1 2 3 1 1
> 40020: 2 3 1 1 2 3 1 1
> 40028: 2 3 1 1 2 3 1 1
> 40030: 2 3 1 1 2 3 1 1
> 40038: 2 3 1 1 2 3 1 1
>
> Same at 5G, then:
>
> memtoy>where a1 6g 64p
> a 0x2000000003c08000 0x000300000000 0x000000000000 rw- private a1
> page offset +00 +01 +02 +03 +04 +05 +06 +07
> 60000: 2 3 2 2 2 3 2 2
> 60008: 2 3 2 2 2 3 2 2
> 60010: 2 3 2 2 2 3 2 2
> 60018: 2 3 2 2 2 3 2 2
> 60020: 2 3 2 2 2 3 2 2
> 60028: 2 3 2 2 2 3 2 2
> 60030: 2 3 2 2 2 3 2 2
> 60038: 2 3 2 2 2 3 2 2
>
> 7G, 8G, ... 11G back to expected pattern.
>
> Thought this might be due to interaction with the memoryless node
> patches, so I backed those out and tested Mel's patch again. This time
> I ran memtoy in batch mode and dumped the entire segment's page
> locations to a file. I did this twice; both runs looked pretty much
> the same--i.e., the change in pattern occurs at around the same offset
> into the segment. Note that here the interleave starts at node 3 at
> offset zero.
>
> memtoy>where a1 0 0
> a 0x200000000047c000 0x000300000000 0x000000000000 rw- private a1
> page offset +00 +01 +02 +03 +04 +05 +06 +07
> 0: 3 0 1 2 3 0 1 2
> 8: 3 0 1 2 3 0 1 2
> 10: 3 0 1 2 3 0 1 2
> ...
> 38c20: 3 0 1 2 3 0 1 2
> 38c28: 3 0 1 2 3 0 1 2
> 38c30: 3 1 1 2 3 1 1 2
> 38c38: 3 1 1 2 3 1 1 2
> 38c40: 3 1 1 2 3 1 1 2
> ...
> 5a0c0: 3 1 1 2 3 1 1 2
> 5a0c8: 3 1 1 2 3 1 1 2
> 5a0d0: 3 1 1 2 3 2 2 2
> 5a0d8: 3 2 2 2 3 2 2 2
> 5a0e0: 3 2 2 2 3 2 2 2
> ...
> 65230: 3 2 2 2 3 2 2 2
> 65238: 3 2 2 2 3 2 2 2
> 65240: 3 2 2 2 3 3 3 3
> 65248: 3 3 3 3 3 3 3 3
> 65250: 3 3 3 3 3 3 3 3
> ...
> 6ab60: 3 3 3 3 3 3 3 3
> 6ab68: 3 3 3 3 3 3 3 3
> 6ab70: 3 3 3 2 3 0 1 2
> 6ab78: 3 0 1 2 3 0 1 2
> 6ab80: 3 0 1 2 3 0 1 2
> ...
> and so on to the end of the segment:
> bffe8: 3 0 1 2 3 0 1 2
> bfff0: 3 0 1 2 3 0 1 2
> bfff8: 3 0 1 2 3 0 1 2
>
> The pattern changes occur at about page offsets:
>
> 0x38800 = ~ 3.6G
> 0x5a000 = ~ 5.8G
> 0x65000 = ~ 6.4G
> 0x6aa00 = ~ 6.8G
>
> Then I checked zonelist order:
> Built 5 zonelists in Zone order, mobility grouping on. Total pages: 2072583
>
> Looks like we're falling back to ZONE_MOVABLE on the next node when
> ZONE_MOVABLE on the target node overflows.
>
Ok. That might have been unexpected, but it's behaving as advertised
for zone-ordered zonelists.
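
In zone order, every node's ZONE_MOVABLE sorts ahead of every node's
ZONE_NORMAL, so once the local movable zone fills, the next fallback is
the next node's movable zone rather than the local normal zone. The two
orderings node 0 would see look roughly like this (a toy printout, not
the kernel's zonelist builder):

	/* Toy illustration of zone order vs node order for node 0. */
	#include <stdio.h>

	#define NR_NODES 4

	static const char *zname[] = { "Normal", "Movable" };

	int main(void)
	{
		printf("zone order:");
		/* Highest zone type across all nodes first. */
		for (int z = 1; z >= 0; z--)
			for (int n = 0; n < NR_NODES; n++)
				printf(" node%d/%s", n, zname[z]);
		printf("\nnode order:");
		/* All of each node's zones before the next node's. */
		for (int n = 0; n < NR_NODES; n++)
			for (int z = 1; z >= 0; z--)
				printf(" node%d/%s", n, zname[z]);
		printf("\n");
		return 0;
	}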
> Rebooted with "Node order" [the numa_zonelist_order sysctl is missing
> in 23-rc1-mm1] and tried again. Saw the "expected" interleave pattern
> across the entire 12G segment.
>
> Kame-san's patch to just exclude the DMA zones from the zonelists is
> looking better than changing the zonelist order when ZONE_MOVABLE is
> populated!
>
> But, Mel's patch seems to work OK. I'll keep it in my stack for later
> stress testing.
>
Great. As this has passed your tests and it passes the numactl
regression tests (when patched for timing problems) with and without
kernelcore, I reckon it's good as a bugfix.
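
For the archives, the memtoy steps above boil down to roughly the
following (an untested sketch using mbind() from libnuma's numaif.h;
build with -lnuma on a 64-bit box):

	/*
	 * Map a 12G anonymous segment, interleave it across nodes 0-3
	 * and write-touch every page, like "anon/map/mbind/touch" in
	 * memtoy.  Assumes a 64-bit build for the 12G length.
	 */
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <numaif.h>

	int main(void)
	{
		size_t len = 12UL << 30;		/* 12G segment */
		unsigned long nodemask = 0xf;		/* nodes 0-3 */
		long pagesize = sysconf(_SC_PAGESIZE);

		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* Set the policy before the pages are faulted in. */
		if (mbind(p, len, MPOL_INTERLEAVE, &nodemask,
			  8 * sizeof(nodemask), 0) != 0) {
			perror("mbind");
			return 1;
		}

		/* Write-touch each page, like "touch a1 w". */
		for (size_t off = 0; off < len; off += pagesize)
			p[off] = 1;

		printf("touched %zu pages\n", len / (size_t)pagesize);
		return 0;
	}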
Thanks, Lee
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab