linux-mm.kvack.org archive mirror
* Excessive memory trapped in pageset lists
@ 2005-04-07 21:11 Jack Steiner
  2005-04-08  1:24 ` Dave Hansen
  0 siblings, 1 reply; 4+ messages in thread
From: Jack Steiner @ 2005-04-07 21:11 UTC (permalink / raw)
  To: linux-mm; +Cc: clameter

The zone structure has 2 per-cpu lists in each per_cpu_pageset structure.
These lists are used for quickly allocating & freeing pages:

        struct zone {
                ...
                struct per_cpu_pageset  pageset[NR_CPUS];  // one pageset per possible cpu
        }

        struct per_cpu_pageset {
                ...
                struct per_cpu_pages pcp[2];    // [0] = hot pages, [1] = cold pages
        }

        struct per_cpu_pages {
                ...
                struct list_head list;          // list head for free pages
        }

Since the lists are private to a cpu, no global locks are required to
allocate or free pages.  This is likely a performance win for many benchmarks.
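
The fast path for an order-0 allocation looks roughly like this (a
simplified sketch in the spirit of buffered_rmqueue(), not the actual
mm/page_alloc.c code):

        /*
         * Sketch: pull a page off the local cpu's hot or cold list.
         * Only local interrupts are disabled; zone->lock is never taken.
         * cold: 0 = hot list, 1 = cold list.
         */
        static struct page *pcp_alloc_sketch(struct zone *zone, int cold)
        {
                struct per_cpu_pages *pcp;
                struct page *page = NULL;
                unsigned long flags;

                local_irq_save(flags);
                pcp = &zone->pageset[smp_processor_id()].pcp[cold];
                if (!list_empty(&pcp->list)) {
                        page = list_entry(pcp->list.next, struct page, lru);
                        list_del(&page->lru);
                        pcp->count--;
                }
                local_irq_restore(flags);
                return page;    /* NULL means fall back to the buddy lists */
        }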

However, memory in the lists is trapped, ie. not easily available
for allocation by any cpu except the owner of the list. In addition,
there is no "shaker" for this memory.


So how much memory can be in the lists.... Lots!

There is 1 zone per node (on SN). Each zone has 2 lists per cpu.
One list is for "hot" pages, the other is for "cold" pages.

A big SN system has 512 cpus and 256 nodes:

        512 cpus * 256 zones * 2 lists/cpu/zone = 256K lists

On any system with more than 256MB/node (i.e., all SN systems), the hot list
will contain 4 to 24 pages. The cold list will contain 0 to 4 pages.
Assuming the worst case, on a 512p system with 256K lists, there can be
a lot of memory trapped in these lists:

   28 pages/node/cpu * 512 cpus * 256 nodes * 16384 bytes/page = 60GB  (Yikes!!!)

In practice, there will be a lot less memory in the lists, but even a
lot less is still way too much.


I have a couple of ideas for fixing this, but it looks like Christoph is
actively making changes in this area. Christoph, do you want to address
this issue, or should I wait for your patch to stabilize?

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.



* Re: Excessive memory trapped in pageset lists
  2005-04-07 21:11 Excessive memory trapped in pageset lists Jack Steiner
@ 2005-04-08  1:24 ` Dave Hansen
  2005-04-08  2:34   ` Jack Steiner
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Hansen @ 2005-04-08  1:24 UTC (permalink / raw)
  To: Jack Steiner; +Cc: linux-mm, clameter

On Thu, 2005-04-07 at 16:11 -0500, Jack Steiner wrote:
>    28 pages/node/cpu * 512 cpus * 256nodes * 16384 bytes/page = 60GB  (Yikes!!!)
...
> I have a couple of ideas for fixing this but it looks like Christoph is
> actively making changes in this area. Christoph do you want to address
> this issue or should I wait for your patch to stabilize?

What about only keeping the page lists populated for cpus which can
locally allocate from the zone?

	cpu_to_node(cpu) == page_nid(pfn_to_page(zone->zone_start_pfn)) 

There certainly aren't a lot of cases where frequent, persistent
single-page allocations are occurring off-node, unless a node is empty.
If you go to an off-node 'struct zone', you're probably bouncing so many
cachelines that you don't get any benefit from per-cpu-pages anyway.
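
In code that might look something like this (just a sketch; it uses the
zone's pgdat node id rather than the pfn lookup above):

        /*
         * Sketch only: a cpu's pageset for a zone is populated only when
         * the cpu sits on the zone's node.
         */
        static inline int pageset_is_local(struct zone *zone, int cpu)
        {
                return cpu_to_node(cpu) == zone->zone_pgdat->node_id;
        }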

Maybe a certain per-cpu-pages miss rate could be required before the
lists are even populated.  That would probably account better
for cases where nodes are disproportionately populated with memory.
This, along with occasional flushing of the pages back into the
general allocator if the miss rate isn't sustained, should give some good
self-tuning behavior.
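
Something along these lines, perhaps (purely a sketch; the "misses"
counter and the threshold are made up and don't exist in per_cpu_pages
today):

        #define PCP_POPULATE_THRESHOLD  16      /* misses before caching starts */

        /* Sketch: only start filling a list once it has missed often enough. */
        static int pcp_should_populate(struct per_cpu_pages *pcp)
        {
                if (pcp->count)                 /* already populated */
                        return 1;
                if (++pcp->misses >= PCP_POPULATE_THRESHOLD) {
                        pcp->misses = 0;        /* hypothetical new field */
                        return 1;               /* start filling the list */
                }
                return 0;       /* keep going straight to the buddy lists */
        }

The periodic flush could then simply drain any list whose miss counter has
not reached the threshold again since the last pass.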

-- Dave


* Re: Excessive memory trapped in pageset lists
  2005-04-08  1:24 ` Dave Hansen
@ 2005-04-08  2:34   ` Jack Steiner
  2005-04-08  5:18     ` Christoph Lameter
  0 siblings, 1 reply; 4+ messages in thread
From: Jack Steiner @ 2005-04-08  2:34 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-mm, clameter

On Thu, Apr 07, 2005 at 06:24:41PM -0700, Dave Hansen wrote:
> On Thu, 2005-04-07 at 16:11 -0500, Jack Steiner wrote:
> >    28 pages/node/cpu * 512 cpus * 256nodes * 16384 bytes/page = 60GB  (Yikes!!!)
> ...
> > I have a couple of ideas for fixing this but it looks like Christoph is
> > actively making changes in this area. Christoph do you want to address
> > this issue or should I wait for your patch to stabilize?
> 
> What about only keeping the page lists populated for cpus which can
> locally allocate from the zone?
> 
> 	cpu_to_node(cpu) == page_nid(pfn_to_page(zone->zone_start_pfn)) 

Exactly. That is at the top of my list. What I haven't decided is whether to:

	- leave the list_heads for offnode pages in the per_cpu_pages
	  struct. Offnode lists would be unused but the amount of wasted space
	  is small - probably 0 because of the cacheline alignment 
	  of the per_cpu_pageset. This is the simplest solution
	  but is not clean because of the unused fields. Unless some
	  architectures want to control whether offnode pages
	  are kept in the lists (???).

	  	OR

	- remove the list_heads from the per_cpu_pageset and make it
	  a standalone array in the zone struct. Array size would be
	  MAX_CPUS_PER_NODE. I don't recall any notion of MAX_CPUS_PER_NODE
	  or a relative cpu number on a node (have I overlooked this?). 
	  This solution is cleaner in the long run but may involve more 
	  infrastructure than I wanted to get into at this point.

	  	OR

	- same as above but have a SINGLE list_head per zone. The list
	  would be used by all cpus on the node. This avoids the page coloring
	  issues I ran into earlier (see prev posting). Obviously, this requires
	  a lock. However, only on-node cpus would normally take the lock.
	  Another advantage of this scheme is that an offnode shaker could
	  acquire the lock & drain the lists if memory became low.

I haven't fully thought these ideas through (a rough sketch of the third
option is below). Maybe other alternatives would be even better.... Suggestions????
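
Rough sketch of the third alternative (names are illustrative only, this
is not a patch):

        struct zone_quicklist {
                spinlock_t              lock;
                struct list_head        list;
                int                     count;
        };

        /* A shaker running on any node can drain the list under memory pressure: */
        static void drain_zone_quicklist(struct zone_quicklist *ql)
        {
                struct page *page;
                LIST_HEAD(tmp);

                spin_lock(&ql->lock);
                list_splice_init(&ql->list, &tmp);
                ql->count = 0;
                spin_unlock(&ql->lock);

                while (!list_empty(&tmp)) {
                        page = list_entry(tmp.next, struct page, lru);
                        list_del(&page->lru);
                        __free_page(page);      /* back to the buddy allocator */
                }
        }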


> 
> There certainly aren't a lot of cases where frequent, persistent
> single-page allocations are occurring off-node, unless a node is empty.

Hmmmm. True, but one of our popular configurations consists of memory-only nodes.
I know of one site that has 240 memory-only nodes & 16 nodes with
both cpus & memory. For this configuration, most memory is offnode
to EVERY cpu. (But I still don't want to cache offnode pages).


> If you go to an off-node 'struct zone', you're probably bouncing so many
> cachelines that you don't get any benefit from per-cpu-pages anyway.

Agree, although on the SGI systems, we set a global policy to roundrobin
all file pages across all nodes. However, I'm not suggesting we cache
offnode pages in the per_cpu_pageset. That gets us back to where we 
started - too much memory in percpu page lists. Also, creating a file
page already bounces a lot of cachelines around.

> 
> Maybe there could be a per-cpu-pages miss rate that's required to occur
> before the lists are even populated.  That would probably account better
> for cases where nodes are disproportionately populated with memory.
> This, along with the occasional flushing of the pages back into the
> general allocator if the miss rate isn't satisfied should give some good
> self-tuning behavior.

Makes sense.

-- 
Thanks

Jack Steiner (steiner@sgi.com)          651-683-5302
Principal Engineer                      SGI - Silicon Graphics, Inc.




* Re: Excessive memory trapped in pageset lists
  2005-04-08  2:34   ` Jack Steiner
@ 2005-04-08  5:18     ` Christoph Lameter
  0 siblings, 0 replies; 4+ messages in thread
From: Christoph Lameter @ 2005-04-08  5:18 UTC (permalink / raw)
  To: Jack Steiner; +Cc: Dave Hansen, linux-mm, clameter

On Thu, 7 Apr 2005, Jack Steiner wrote:

> On Thu, Apr 07, 2005 at 06:24:41PM -0700, Dave Hansen wrote:
> > On Thu, 2005-04-07 at 16:11 -0500, Jack Steiner wrote:
> > >    28 pages/node/cpu * 512 cpus * 256nodes * 16384 bytes/page = 60GB  (Yikes!!!)
> > ...
> > > I have a couple of ideas for fixing this but it looks like Christoph is
> > > actively making changes in this area. Christoph do you want to address
> > > this issue or should I wait for your patch to stabilize?
> >
> > What about only keeping the page lists populated for cpus which can
> > locally allocate from the zone?
> >
> > 	cpu_to_node(cpu) == page_nid(pfn_to_page(zone->zone_start_pfn))
>
> Exactly. That is at the top of my list. What I haven't decided is whether to:

<list of options where to keep the pagesets....,>

Maybe it's best to keep only a pageset for each cpu for the zone that is
local to that cpu? That may allow simplified locking.

The pageset could be defined as a per cpu variable.
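
Something like this, perhaps (sketch only; it assumes exactly one local
zone per cpu and ignores multi-zone nodes):

        static DEFINE_PER_CPU(struct per_cpu_pageset, local_pageset);

        /* at an allocation site: */
        struct per_cpu_pageset *pset = &get_cpu_var(local_pageset);
        /* ... pull a page off pset->pcp[0].list, no zone->lock needed ... */
        put_cpu_var(local_pageset);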

I would like to add a list of zeroed pages alongside the hot and cold
lists. If a page can be obtained with some inline code from the per-cpu
lists of the local zone, then we would be able to bypass the unlock, page
alloc, page zero, relock, verify-pte-not-changed sequence during page faults.

E.g. do_anonymous_page() could try to obtain an entry from the quicklist
and only drop the lock if the allocation is off-node or no pages are on
the quicklist of zeroed pages.
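
In outline (sketch only; pcp_get_zeroed_page() and the per-cpu list of
pre-zeroed pages are hypothetical):

        /* in do_anonymous_page(), with mm->page_table_lock still held: */
        page = pcp_get_zeroed_page();   /* local cpu, pre-zeroed, no zone->lock */
        if (!page) {
                /* slow path: today's drop-lock / alloc / zero / relock dance */
                pte_unmap(page_table);
                spin_unlock(&mm->page_table_lock);
                page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
                if (page)
                        clear_user_highpage(page, addr);
                spin_lock(&mm->page_table_lock);
                /* ... re-map the pte and verify it has not changed ... */
        }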


> > There certainly aren't a lot of cases where frequent, persistent
> > single-page allocations are occurring off-node, unless a node is empty.
>
> Hmmmm. True, but one of our popular configurations consists of memory-only nodes.
> I know of one site that has 240 memory-only nodes & 16 nodes with
> both cpus & memory. For this configuration, most memory is offnode
> to EVERY cpu. (But I still don't want to cache offnode pages).

We could have an additional pageset in the remote zone for remote
accesses only. This means we would have to manage one remote pageset
and a set of cpu-local pagesets for the cpus to which the zone is
the primary local node.

The off-node pageset would require a spinlock,
whereas the node-local pagesets could work very quickly w/o locking. We
would need some easy way to distinguish off-node accesses from on-node ones.
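
In rough terms (again only a sketch, with made-up names):

        /*
         * One locked pageset serves all off-node cpus for a zone; the
         * node-local cpus keep their lock-free per-cpu pagesets.
         */
        struct remote_pageset {
                spinlock_t              lock;
                struct per_cpu_pages    pcp[2];         /* hot and cold */
        };

        static struct page *remote_pcp_alloc(struct remote_pageset *rps, int cold)
        {
                struct per_cpu_pages *pcp = &rps->pcp[cold];
                struct page *page = NULL;

                spin_lock(&rps->lock);
                if (!list_empty(&pcp->list)) {
                        page = list_entry(pcp->list.next, struct page, lru);
                        list_del(&page->lru);
                        pcp->count--;
                }
                spin_unlock(&rps->lock);
                return page;
        }

Whether an access is off-node could be decided at the call site by
comparing cpu_to_node(smp_processor_id()) with zone->zone_pgdat->node_id.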


