Date: Thu, 7 Apr 2005 21:34:36 -0500
From: Jack Steiner
Subject: Re: Excessive memory trapped in pageset lists
To: Dave Hansen
Cc: linux-mm, clameter@sgi.com
Message-ID: <20050408023436.GA1927@sgi.com>
In-Reply-To: <1112923481.21749.88.camel@localhost>
References: <20050407211101.GA29069@sgi.com> <1112923481.21749.88.camel@localhost>

On Thu, Apr 07, 2005 at 06:24:41PM -0700, Dave Hansen wrote:
> On Thu, 2005-04-07 at 16:11 -0500, Jack Steiner wrote:
> > 28 pages/node/cpu * 512 cpus * 256 nodes * 16384 bytes/page = 60GB (Yikes!!!)
> ...
> > I have a couple of ideas for fixing this, but it looks like Christoph is
> > actively making changes in this area. Christoph, do you want to address
> > this issue, or should I wait for your patch to stabilize?
>
> What about only keeping the page lists populated for cpus which can
> locally allocate from the zone?
>
> cpu_to_node(cpu) == page_nid(pfn_to_page(zone->zone_start_pfn))

Exactly. That is at the top of my list. What I haven't decided is whether to:

 - Leave the list_heads for offnode pages in the per_cpu_pages struct.
   The offnode lists would be unused, but the amount of wasted space is
   small - probably zero, given the cacheline alignment of the
   per_cpu_pageset. This is the simplest solution, but it is not clean
   because of the unused fields - unless some architectures want to
   control whether offnode pages are kept in the lists (???).

OR

 - Remove the list_heads from the per_cpu_pageset and make them a
   standalone array in the zone struct. The array size would be
   MAX_CPUS_PER_NODE. I don't recall any notion of MAX_CPUS_PER_NODE, or
   of a relative cpu number on a node (have I overlooked this?). This
   solution is cleaner in the long run but may involve more
   infrastructure than I wanted to get into at this point.

OR

 - Same as above, but with a SINGLE list_head per zone. The list would be
   used by all cpus on the node. This avoids the page coloring issues I
   ran into earlier (see previous posting). Obviously, this requires a
   lock, but only on-node cpus would normally take it. Another advantage
   of this scheme is that an offnode shaker could acquire the lock and
   drain the lists if memory became low. (A rough sketch follows below.)

I haven't fully thought these ideas through. Maybe other alternatives
would be even better....

Suggestions????
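To make the third alternative a little more concrete, here is a rough
sketch of the kind of thing I have in mind. This is not a patch - the
struct and function names are made up for illustration only, and the
locality test is just Dave's one-liner from above (the exact helper
spellings may differ in the tree):

	#include <linux/list.h>
	#include <linux/spinlock.h>
	#include <linux/mm.h>

	/*
	 * Illustrative sketch only.  One shared page cache per zone,
	 * used by all cpus on the zone's node instead of per-cpu
	 * offnode lists.
	 */
	struct zone_shared_pages {
		spinlock_t		lock;	/* normally taken only by on-node cpus */
		int			count;	/* pages currently on the list */
		int			high;	/* drain back to the buddy lists above this */
		struct list_head	list;
	};

	/*
	 * Populate the lists only for cpus that can allocate locally
	 * from the zone - the check Dave suggested:
	 *
	 *	cpu_to_node(cpu) == page_nid(pfn_to_page(zone->zone_start_pfn))
	 */

	/*
	 * An offnode "shaker" could grab the lock and push the cached
	 * pages back to the page allocator when memory runs low.
	 */
	static void drain_zone_shared_pages(struct zone_shared_pages *zsp)
	{
		struct page *page, *next;

		spin_lock(&zsp->lock);
		list_for_each_entry_safe(page, next, &zsp->list, lru) {
			list_del(&page->lru);
			zsp->count--;
			__free_page(page);
		}
		spin_unlock(&zsp->lock);
	}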
> There certainly aren't a lot of cases where frequent, persistent
> single-page allocations are occurring off-node, unless a node is empty.

Hmmmm. True, but one of our popular configurations consists of memory-only
nodes. I know of one site that has 240 memory-only nodes and 16 nodes with
both cpus and memory. In that configuration, most memory is offnode to
EVERY cpu. (But I still don't want to cache offnode pages.)

> If you go to an off-node 'struct zone', you're probably bouncing so many
> cachelines that you don't get any benefit from per-cpu-pages anyway.

Agreed, although on the SGI systems we set a global policy to round-robin
all file pages across all nodes. However, I'm not suggesting that we cache
offnode pages in the per_cpu_pageset - that gets us back to where we
started, with too much memory in the percpu page lists. Also, creating a
file page already bounces a lot of cachelines around.

> Maybe there could be a per-cpu-pages miss rate that's required to occur
> before the lists are even populated. That would probably account better
> for cases where nodes are disproportionately populated with memory.
> This, along with the occasional flushing of the pages back into the
> general allocator if the miss rate isn't satisfied, should give some good
> self-tuning behavior.

Makes sense.

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.