From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 25 Sep 2006 23:08:17 -0700 (PDT)
From: David Rientjes
Subject: Re: [RFC] another way to speed up fake numa node page_alloc
In-Reply-To: <20060925091452.14277.9236.sendpatchset@v0>
Message-ID: 
References: <20060925091452.14277.9236.sendpatchset@v0>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Sender: owner-linux-mm@kvack.org
Return-Path: 
To: Paul Jackson
Cc: linux-mm@kvack.org, akpm@osdl.org, Nick Piggin, Andi Kleen,
	mbligh@google.com, rohitseth@google.com, menage@google.com,
	clameter@sgi.com
List-ID: 

On Mon, 25 Sep 2006, Paul Jackson wrote:

> - Some per-node data in the struct zonelist is now modified frequently,
>   with no locking. Multiple CPU cores on a node could hit and mangle
>   this data. The theory is that this is just performance hint data,
>   and the memory allocator will work just fine despite any such mangling.
>   The fields at risk are the struct 'zonelist_faster' fields 'fullnodes'
>   (a nodemask_t) and 'last_full_zap' (unsigned long jiffies). It should
>   all be self correcting after at most a one second delay.
>

If there is mangling of 'last_full_zap' in the scenario with multiple
CPUs on one node, that means we might be clearing 'fullnodes' more often
than every 1*HZ, and that clear is always done by one CPU.  Since the
only purpose of the delay is to let a certain period of time go by during
which these hints actually serve a purpose, the entire speed-up is then
degraded.  I agree that adding locking around 'zonelist_faster' is
probably going too far for performance hint data, but it seems necessary
for 'last_full_zap' if the goal is to preserve the 1*HZ delay.

> - I pay no attention to the various watermarks and such in this performance
>   hint. A node could be marked full for one watermark, and then skipped
>   over when searching for a page using a different watermark. I think
>   that's actually quite ok, as it will tend to slightly increase the
>   spreading of memory over other nodes, away from a memory stressed node.
>

Since we currently lack support for dynamically allocating nodes with a
node hotplug API, it actually seems advantageous to have a
memory-stressed node in a pool or cpuset of 'mems'.  Now, when another
cpuset is facing memory pressure, I can cherry-pick an untouched node
from a less bogged down cpuset for my own use.

It seems like an immutable time interval embedded in the page alloc code
may not be the best way to decide when a full zap should occur.  A more
appropriate metric might be to do the full zap after a certain threshold
of pages have been freed.  Done that way, the zap happens in a more
appropriate place (when pages are freed) rather than when pages are
allocated.  The overhead we incur by zapping the nodemask every second
and then being forced to recheck all the nodes again would be eliminated
in the case where nothing has changed, and based on the benchmarks I ran
earlier, that is a popular case.  It is also more appropriate because
when we are freeing pages we know for sure that we are getting memory
back somewhere.  (A rough sketch of what I mean is below.)

Note to self: in 2.6.18-rc7-mm1, NUMA_BUILD is just a synonym for
CONFIG_NUMA.  And since both this and CONFIG_NUMA_EMU are defined by
default on x86_64, we are going to have overhead on a single processor
system.  In my earlier patch I started extracting a macro that could be
tested against in generic kernel code to determine at least whether NUMA
emulation was being _used_.  This might need to make a comeback if this
type of implementation is considered later.
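
Something along these lines is what I have in mind (the names below are
invented just for illustration, not necessarily what the earlier patch
used; the flag would be set once by the numa=fake= setup path, which is
not shown):

	/* arch/x86_64/mm/numa.c: set while parsing numa=fake= */
	#ifdef CONFIG_NUMA_EMU
	int numa_emu_in_use __read_mostly;
	#endif

	/* include/linux/numa.h, or wherever generic code can see it */
	#ifdef CONFIG_NUMA_EMU
	extern int numa_emu_in_use;
	#define numa_emulation_used()	(numa_emu_in_use)
	#else
	#define numa_emulation_used()	0
	#endif

Generic code could then check numa_emulation_used() ||
num_online_nodes() > 1 before touching the hint data, so a single real
node pays nothing just because CONFIG_NUMA_EMU happens to be compiled in.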
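
And going back to the zap-after-N-frees idea above, here is roughly what
I mean.  This is only a sketch: zlf_note_free() and ZLF_FREE_THRESHOLD
are made-up names, the threshold is arbitrary, and it stays just as
lock-free and racy as the rest of the hint data.  It also glosses over
which zonelists the free path would have to poke, since each node has a
zonelist per zone type.

	#include <linux/nodemask.h>
	#include <asm/atomic.h>

	/* replace the 'last_full_zap' jiffies stamp with a count of frees */
	struct zonelist_faster {
		nodemask_t fullnodes;	/* nodes recently found to be full */
		atomic_t pages_freed;	/* pages freed since the last zap */
	};

	#define ZLF_FREE_THRESHOLD	1024	/* arbitrary, would need tuning */

	/*
	 * Called from the page free path (e.g. __free_pages_ok()) instead
	 * of checking jiffies in the allocation path.  Once enough pages
	 * have come back, forget which nodes we thought were full.
	 */
	static inline void zlf_note_free(struct zonelist_faster *zlf, int count)
	{
		if (atomic_add_return(count, &zlf->pages_freed) < ZLF_FREE_THRESHOLD)
			return;
		atomic_set(&zlf->pages_freed, 0);
		nodes_clear(zlf->fullnodes);
	}

That way a workload that never frees anything never zaps the nodemask,
and a workload that frees a lot gets its 'fullnodes' hints refreshed at a
point where the refresh can actually find something.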

This is a creative solution, especially considering the use of a
statically-sized zlfast_ptr to find zlfast hidden away in struct
zonelist.  This definitely seems to be headed in the right direction
because it works in both the real NUMA case and the fake NUMA case.  I
would really like to run benchmarks on this implementation as I have
done for the others, but I no longer have access to a 64-bit machine.  I
don't see how it could cause a performance degradation in the non-NUMA
case.

		David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org