* [RFC] Removing page->flags
@ 2006-02-08 6:46 Magnus Damm
2006-02-08 11:54 ` Nick Piggin
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Magnus Damm @ 2006-02-08 6:46 UTC (permalink / raw)
To: linux-mm; +Cc: Magnus Damm
[RFC] Removing page-flags
Removing page->flags might not be the right way to put this idea, but it
sums it up pretty well IMO. The idea is to save memory on smaller
machines and also improve scalability on large SMP systems. Maybe too
much overhead is introduced; hopefully some of you can tell.
Today each page->flags contains two types of information:
A) 21 bits defined in linux/page-flags.h
B) Zone, node and sparsemem section bit fields, covered in linux/mm.h
On smaller systems (like my laptop), type B is only used to determine
which zone a given struct page belongs to. At least 8 bits
per struct page are unused in that case.
Large NUMA systems use type B more efficiently, but the fact that type A
contains a mix of bits might be suboptimal. Especially since some bits
may require atomic operations while others are already protected and
don't require atomicity. The fact that the bits share the same word
forces us to use atomic-only operations, which may result in unnecessary
cache line bouncing.
Moving type A bits:
Instead of keeping the bits together, we spread them out and store a
pointer to them from pg_data_t.
To be more exact, pg_data_t is extended to include an array of pointers,
one pointer per bit defined in linux/page-flags.h. Today that would be
21 pointers. Each pointer is pointing to a bitmap, and the bitmap
contains one bit per page in the node. The bitmap should be indexed
using (pfn - node_start_pfn). Each one of these (21) bitmaps may be
accessed using atomic or non-atomic operations, all depending on how the
flag is used. This hopefully improves scalability.
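A minimal userspace sketch of the per-flag bitmap layout described above,
not kernel code. struct node_flags, node_flags_init(), flag_set() and
flag_test() are hypothetical names standing in for the extended pg_data_t;
only the count of 21 flags and the (pfn - node_start_pfn) indexing come
from the proposal. The setter is deliberately non-atomic, which is only
safe for a flag that some lock already serialises.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define NR_PAGE_FLAGS 21 /* flag count quoted in the proposal */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the extended pg_data_t. */
struct node_flags {
	unsigned long start_pfn;              /* node_start_pfn */
	unsigned long nr_pages;               /* pages spanned by the node */
	unsigned long *bitmap[NR_PAGE_FLAGS]; /* one bitmap per flag */
};

/* Allocate one zeroed bitmap per flag; returns 0 on success, -1 on failure. */
int node_flags_init(struct node_flags *nf, unsigned long start_pfn,
		    unsigned long nr_pages)
{
	size_t words = (nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD;
	int i;

	nf->start_pfn = start_pfn;
	nf->nr_pages = nr_pages;
	for (i = 0; i < NR_PAGE_FLAGS; i++) {
		nf->bitmap[i] = calloc(words, sizeof(unsigned long));
		if (!nf->bitmap[i]) {
			/* Undo the allocations made so far. */
			while (i-- > 0)
				free(nf->bitmap[i]);
			return -1;
		}
	}
	return 0;
}

/* Non-atomic set: the caller must already hold whatever lock guards this flag. */
void flag_set(struct node_flags *nf, int flag, unsigned long pfn)
{
	unsigned long idx = pfn - nf->start_pfn; /* index within the node */

	assert(flag >= 0 && flag < NR_PAGE_FLAGS && idx < nf->nr_pages);
	nf->bitmap[flag][idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

/* Returns 1 if the flag is set for this pfn, 0 otherwise. */
int flag_test(const struct node_flags *nf, int flag, unsigned long pfn)
{
	unsigned long idx = pfn - nf->start_pfn;

	assert(flag >= 0 && flag < NR_PAGE_FLAGS && idx < nf->nr_pages);
	return (int)((nf->bitmap[flag][idx / BITS_PER_WORD] >>
		      (idx % BITS_PER_WORD)) & 1UL);
}
```

Each flag lives in its own allocation, so a flag that needs atomic updates
could use a different accessor without affecting the others.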
Removing type B bits:
Instead of using the highest bits of page->flags to locate zones, nodes
or sparsemem sections, let's remove them and locate them using alignment!
To locate which zone, node and sparsemem section a page belongs to, just
use its struct page (source_page) and alignment! The struct page of the
page that contains source_page (and other parts of mem_map) is located
using something like this:
memmap_page = virt_to_page(source_page)
This memmap_page should be unused today. Maybe it is reserved. Anyway,
memmap_page could be used to do all sorts of tricks, like misusing
mapping to point to the zone, index to point to the sparsemem section,
and while at it why not use lru.next to point to the node. One drawback
with this idea is that it adds some extra limitations to the sizes of
zones and sparsemem sections. One example is that a DMA zone of 4096
pages works very well, but 4097 pages might force a certain page
containing a part of mem_map to point to two different zones which of
course does not work at all.
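The 4096 versus 4097 example can be checked with a little arithmetic. This
is a sketch under assumed sizes: 4096-byte pages and a 32-byte struct page
(so 128 struct pages per mem_map page), with the zone's first struct page
sitting at the start of a mem_map page. The function names are made up for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Index of the mem_map page that holds the struct page for page number
 * idx, counted from the start of the zone.
 */
unsigned long memmap_page_index(unsigned long idx, size_t struct_page_size,
				size_t page_size)
{
	assert(page_size > 0);
	return (unsigned long)(idx * struct_page_size / page_size);
}

/*
 * True if the struct pages of a zone with nr_pages pages end exactly on a
 * mem_map page boundary, so no mem_map page is shared with the next zone.
 */
bool zone_ends_on_memmap_page(unsigned long nr_pages, size_t struct_page_size,
			      size_t page_size)
{
	assert(page_size > 0);
	return (nr_pages * struct_page_size) % page_size == 0;
}
```

With these sizes a 4096-page zone fills mem_map pages 0 to 31 exactly,
while a 4097-page zone puts its last struct page into mem_map page 32,
which it then shares with the first struct pages of the next zone.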
Much work, no gain? Comments?
/ magnus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: [RFC] Removing page->flags
2006-02-08 6:46 [RFC] Removing page->flags Magnus Damm
@ 2006-02-08 11:54 ` Nick Piggin
2006-02-09 2:35 ` Magnus Damm
2006-02-08 19:37 ` Dave Hansen
2006-02-09 1:55 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2006-02-08 11:54 UTC (permalink / raw)
To: Magnus Damm; +Cc: linux-mm, Magnus Damm
Magnus Damm wrote:
> [RFC] Removing page-flags
>
> Removing page->flags might not be the right way to put this idea, but it
> sums it up pretty good IMO. The idea is to save memory for smaller
> machines and also improve scalability for large SMP systems. Maybe too
> much overhead is introduced, hopefully someone of you can tell.
>
> Today each page->flags contain two types of information:
> A) 21 bits defined in linux/page-flags.h
> B) Zone, node and sparsemem section bit fields, covered in linux/mm.h
>
> On smaller systems (like my laptop), type B is only used to determine
> which zone it belongs to using any given struct page. At least 8 bits
> per struct page are unused in that case.
>
> Large NUMA systems use type B more efficiently, but the fact that type A
> contains a mix of bits might be suboptimal. Especially since some bits
> may require atomic operations while others are already protected and
> don't require atomicity. The fact that the bits share the same word
> forces us to use atomic-only operations, which may result in unnecessary
> cache line bouncing.
>
> Moving type A bits:
>
> Instead of keeping the bits together, we spread them out and store a
> pointer to them from pg_data_t.
>
> To be more exact, pg_data_t is extended to include an array of pointers,
> one pointer per bit defined in linux/page-flags.h. Today that would be
> 21 pointers. Each pointer is pointing to a bitmap, and the bitmap
> contains one bit per page in the node. The bitmap should be indexed
> using (pfn - node_start_pfn). Each one of these (21) bitmaps may be
> accessed using atomic or non-atomic operations, all depending on how the
> flag is used. This hopefully improves scalability.
>
There are a large number of paths which access essentially random struct
pages (any memory allocation / deallocation, many pagecache operations).
Your proposal basically guarantees at least an extra cache miss on such
paths. On most modern machines the struct page should be less than or
equal to a cacheline I think.
Also, would you mind explaining how you'd allow non-atomic access to
bits which are already protected somewhere else? Without adding extra
cache misses for each different type of bit that is manipulated? Which
bits do you have in mind, exactly?
I don't think operations on page flags should ever inhibit scalability
just due to the fact they are atomic. Atomic bitops will hurt single
threaded performance, but scalability would probably be impacted more
by the extra cache misses and memory traffic.
The real hit to scalability is when there are multiple accesses to the
same flags, but in that case the problem remains.
> Removing type B bits:
>
> Instead of using the highest bits of page->flags to locate zones, nodes
> or sparsemem section, let's remove them and locate them using alignment!
>
If we accept that type A bits are a good idea, then there is no point in
removing just type B. Sometimes the more complex memory layouts will
require more than just arithmetic (i.e. memory loads), so I doubt that is
worthwhile either.
> To locate which zone, node and sparsemem section a page belongs to, just
> use its struct page (source_page) and alignment! The struct page of the
> page that contains source_page (and other parts of mem_map) is located
> using something like this:
>
> memmap_page = virt_to_page(source_page)
>
> This memmap_page should be unused today. Maybe it is reserved. Anyway,
> memmap_page could be used to do all sorts of tricks, like misusing
> mapping to point to the zone, index to point to the sparsemem section,
> and while at it why not use lru.next to point to the node. One drawback
> with this idea is that it adds some extra limitations to the sizes of
> zones and sparsemem sections. One example is that a DMA zone of 4096
> pages works very well, but 4097 pages might force a certain page
> containing a part of mem_map to point to two different zones which of
> course does not work at all.
>
> Much work, no gain? Comments?
>
> / magnus
>
--
SUSE Labs, Novell Inc.
* Re: [RFC] Removing page->flags
2006-02-08 11:54 ` Nick Piggin
@ 2006-02-09 2:35 ` Magnus Damm
2006-02-09 4:19 ` Nick Piggin
2006-02-10 15:03 ` Rik van Riel
0 siblings, 2 replies; 17+ messages in thread
From: Magnus Damm @ 2006-02-09 2:35 UTC (permalink / raw)
To: Nick Piggin; +Cc: Magnus Damm, linux-mm, Magnus Damm
On 2/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Magnus Damm wrote:
> > [RFC] Removing page-flags
> >
> > Removing page->flags might not be the right way to put this idea, but it
> > sums it up pretty good IMO. The idea is to save memory for smaller
> > machines and also improve scalability for large SMP systems. Maybe too
> > much overhead is introduced, hopefully someone of you can tell.
> >
> > Today each page->flags contain two types of information:
> > A) 21 bits defined in linux/page-flags.h
> > B) Zone, node and sparsemem section bit fields, covered in linux/mm.h
> >
> > On smaller systems (like my laptop), type B is only used to determine
> > which zone it belongs to using any given struct page. At least 8 bits
> > per struct page are unused in that case.
> >
> > Large NUMA systems use type B more efficiently, but the fact that type A
> > contains a mix of bits might be suboptimal. Especially since some bits
> > may require atomic operations while others are already protected and
> > don't require atomicity. The fact that the bits share the same word
> > forces us to use atomic-only operations, which may result in unnecessary
> > cache line bouncing.
> >
> > Moving type A bits:
> >
> > Instead of keeping the bits together, we spread them out and store a
> > pointer to them from pg_data_t.
> >
> > To be more exact, pg_data_t is extended to include an array of pointers,
> > one pointer per bit defined in linux/page-flags.h. Today that would be
> > 21 pointers. Each pointer is pointing to a bitmap, and the bitmap
> > contains one bit per page in the node. The bitmap should be indexed
> > using (pfn - node_start_pfn). Each one of these (21) bitmaps may be
> > accessed using atomic or non-atomic operations, all depending on how the
> > flag is used. This hopefully improves scalability.
> >
>
> There are a large number of paths which access essentially random struct
> pages (any memory allocation / deallocation, many pagecache operations).
> Your proposal basically guarantees at least an extra cache miss on such
> paths. On most modern machines the struct page should be less than or
> equal to a cacheline I think.
And this extra cache miss comes from accessing the flags in a
different cache line than the rest of the struct page, right? OTOH,
maybe it is more likely that a certain struct page is in the cache if
struct page became smaller.
> Also, would you mind explaining how you'd allow non-atomic access to
> bits which are already protected somewhere else? Without adding extra
> cache misses for each different type of bit that is manipulated? Which
> bits do you have in mind, exactly?
I'm thinking about PG_lru and PG_active. PG_lru is always modified
under zone->lru_lock, and the same goes for PG_active (except in
shrink_list(), grr). But as you say above, breaking out the page flags
may result in an extra cache miss...
Also, I think it would be interesting to break out the page
replacement policy code and make it pluggable. Different page
replacement algorithms need different flags/information associated
with each page, so moving the flags from struct page was my way of
solving that. Page replacement flags would in that case be stored
somewhere else than the rest of the flags.
> I don't think operations on page flags should ever inhibit scalability
> just due to the fact they are atomic. Atomic bitops will hurt single
> threaded performance, but scalability would probably be impacted more
> by the extra cache misses and memory traffic.
Ok, so scalability is probably not improved by my proposal.
> The real hit to scalability is when there is multiple access to the same
> flags, but in that case the problem remains.
I'm not going to try to solve that one! =)
> > Removing type B bits:
> >
> > Instead of using the highest bits of page->flags to locate zones, nodes
> > or sparsemem section, let's remove them and locate them using alignment!
> >
>
> If we accept that type A bits are a good idea, then removing just type B
> is no point. Sometimes the more complex memory layouts will require more
> than just arithmetic (ie. memory loads) so I doubt that is worthwhile
> either.
Yes, there is no point in removing only the type B bits. But I wonder how
the performance would be affected by using the "parent" struct page
instead of type B bits.
Thanks for the comments!
/ magnus
* Re: [RFC] Removing page->flags
2006-02-09 2:35 ` Magnus Damm
@ 2006-02-09 4:19 ` Nick Piggin
2006-02-09 5:19 ` Magnus Damm
2006-02-10 15:03 ` Rik van Riel
1 sibling, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2006-02-09 4:19 UTC (permalink / raw)
To: Magnus Damm; +Cc: Magnus Damm, linux-mm, Magnus Damm
Magnus Damm wrote:
> On 2/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>There are a large number of paths which access essentially random struct
>>pages (any memory allocation / deallocation, many pagecache operations).
>>Your proposal basically guarantees at least an extra cache miss on such
>>paths. On most modern machines the struct page should be less than or
>>equal to a cacheline I think.
>
>
> And this extra cache miss comes from accessing the flags in a
> different cache line than the rest of the struct page, right? OTOH,
Yes
> maybe it is more likely that a certain struct page is in the cache if
> struct page would become smaller.
>
In some very select cases, yes. Most of the time I'd say it would
be more likely that you'll actually have to take two cache misses
(basic operations like page allocation and freeing touch flags).
>
>>Also, would you mind explaining how you'd allow non-atomic access to
>>bits which are already protected somewhere else? Without adding extra
>>cache misses for each different type of bit that is manipulated? Which
>>bits do you have in mind, exactly?
>
>
> I'm thinking about PG_lru and PG_active. PG_lru is always modified
> under zone->lru_lock, and the same goes for PG_active (except in
> shrink_list(), grr). But as you say above, breaking out the page flags
> may result in an extra cache miss...
>
Also, it will still be difficult to enable non-atomic operations on
them while still keeping overhead to just a single cache miss:
If your flag bits are arranged as an array of flag words, e.g.
| page 0 flags | page 1 flags | page 2 flags | ...
then obviously you can't use non-atomic operations.
Otherwise if they are arranged as bits
| PG_lru bits for pages 0..n | PG_active bits | PG_locked bits |
Then you take 3 extra cache misses when locking the page, then
looking at PG_lru and PG_active.
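The difference between the two layouts can be made concrete by counting
the cache lines that one page's flags land in. This is a userspace sketch,
assuming 64-byte cache lines and made-up base addresses; none of the names
below exist in the kernel.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define CACHE_LINE 64UL /* assumed cache line size in bytes */

/* Layout 1: one flags word per page, so all flags of a page share a word. */
unsigned long word_array_addr(unsigned long base, unsigned long page_idx)
{
	return base + page_idx * sizeof(unsigned long);
}

/* Layout 2: one bitmap per flag; bitmap_base is where that flag's bitmap starts. */
unsigned long bitmap_addr(unsigned long bitmap_base, unsigned long page_idx)
{
	return bitmap_base + page_idx / CHAR_BIT;
}

/* Count the distinct cache lines covered by n byte addresses. */
size_t distinct_cache_lines(const unsigned long *addr, size_t n)
{
	size_t i, j, count = 0;

	assert(addr != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Look for an earlier address on the same line. */
		for (j = 0; j < i; j++)
			if (addr[j] / CACHE_LINE == addr[i] / CACHE_LINE)
				break;
		if (j == i)
			count++;
	}
	return count;
}
```

Looking at PG_locked, PG_lru and PG_active for one page gives a single
line in the first layout, and three lines in the second whenever the three
bitmaps sit in different cache lines.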
> Also, I think it would be interesting to break out the page
> replacement policy code and make it pluggable. Different page
> replacement algorithms need different flags/information associated
> with each page, so moving the flags from struct page was my way of
> solving that. Page replacement flags would in that case be stored
> somewhere else than the rest of the flags.
>
It seems pretty unlikely that we'll get a pluggable replacement
policy in mainline any time soon though.
>>If we accept that type A bits are a good idea, then removing just type B
>>is no point. Sometimes the more complex memory layouts will require more
>>than just arithmetic (ie. memory loads) so I doubt that is worthwhile
>>either.
>
>
> Yes, removing type B bits only is no point. But I wonder how the
> performance would be affected by using the "parent" struct page
> instead of type B bits.
>
An essentially random memory access is going to be worth hundreds or
thousands of integer ops though, and you'd increase cache footprint
of 'struct page' operations by 50-100% on most architectures.
I don't see the problem with type B bits in flags?
--
SUSE Labs, Novell Inc.
* Re: [RFC] Removing page->flags
2006-02-09 4:19 ` Nick Piggin
@ 2006-02-09 5:19 ` Magnus Damm
2006-02-09 5:37 ` Nick Piggin
0 siblings, 1 reply; 17+ messages in thread
From: Magnus Damm @ 2006-02-09 5:19 UTC (permalink / raw)
To: Nick Piggin; +Cc: Magnus Damm, linux-mm, Magnus Damm
On 2/9/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Magnus Damm wrote:
> > On 2/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
> >>There are a large number of paths which access essentially random struct
> >>pages (any memory allocation / deallocation, many pagecache operations).
> >>Your proposal basically guarantees at least an extra cache miss on such
> >>paths. On most modern machines the struct page should be less than or
> >>equal to a cacheline I think.
> >
> >
> > And this extra cache miss comes from accessing the flags in a
> > different cache line than the rest of the struct page, right? OTOH,
>
> Yes
>
> > maybe it is more likely that a certain struct page is in the cache if
> > struct page would become smaller.
> >
>
> In some very select cases, yes. Most of the time I'd say it would
> be more likely that you'll actually have to take two cache misses
> (basic operations like page allocation and freeing touch flags).
Yes, that makes sense.
> >>Also, would you mind explaining how you'd allow non-atomic access to
> >>bits which are already protected somewhere else? Without adding extra
> >>cache misses for each different type of bit that is manipulated? Which
> >>bits do you have in mind, exactly?
> >
> >
> > I'm thinking about PG_lru and PG_active. PG_lru is always modified
> > under zone->lru_lock, and the same goes for PG_active (except in
> > shrink_list(), grr). But as you say above, breaking out the page flags
> > may result in an extra cache miss...
> >
>
> Also, it will still be difficult to enable non-atomic operations on
> them while still keeping overhead to just a single cache miss:
>
> If your flags bits are arranged as an array of flag words, eg
> | page 0 flags | page 1 flags | page 2 flags | ... then obviously
> you can't use non atomic operations.
>
> Otherwise if they are arranged as bits
>
> | PG_lru bits for pages 0..n | PG_active bits | PG_locked bits |
>
> Then you take 3 extra cache misses when locking the page, then
> looking at PG_lru and PG_active.
So, if the optimization is to allow non-atomic operations, then a
better way would be to arrange the flags like this:
| page 0 atomic flags | page 1 atomic flags | ....
together with
| page 0 non-atomic flags | page 1 non-atomic flags | ...
But introducing a second page->flags is out of the question, and
breaking out flags and placing a pointer to them in the node data
structure will introduce more cache misses. So it is probably not
worth it.
> > Also, I think it would be interesting to break out the page
> > replacement policy code and make it pluggable. Different page
> > replacement algorithms need different flags/information associated
> > with each page, so moving the flags from struct page was my way of
> > solving that. Page replacement flags would in that case be stored
> > somewhere else than the rest of the flags.
> >
>
> It seems pretty unlikely that we'll get a pluggable replacement
> policy in mainline any time soon though.
So, do you think it is more likely that a ClockPro implementation will
be accepted then? Or is Linux "doomed" to LRU forever?
> >>If we accept that type A bits are a good idea, then removing just type B
> >>is no point. Sometimes the more complex memory layouts will require more
> >>than just arithmetic (ie. memory loads) so I doubt that is worthwhile
> >>either.
> >
> >
> > Yes, removing type B bits only is no point. But I wonder how the
> > performance would be affected by using the "parent" struct page
> > instead of type B bits.
> >
>
> An essentially random memory access is going to be worth hundreds or
> thousands of integer ops though, and you'd increase cache footprint
> of 'struct page' operations by 50-100% on most architectures.
Yeah, that sounds like a bad idea. =)
> I don't see the problem with type B bits in flags?
There are no problems, especially since we already have the bits available in
page->flags.
Thanks for the comments!
/ magnus
* Re: [RFC] Removing page->flags
2006-02-09 5:19 ` Magnus Damm
@ 2006-02-09 5:37 ` Nick Piggin
2006-02-11 5:30 ` Marcelo Tosatti
0 siblings, 1 reply; 17+ messages in thread
From: Nick Piggin @ 2006-02-09 5:37 UTC (permalink / raw)
To: Magnus Damm; +Cc: Magnus Damm, linux-mm, Magnus Damm
Magnus Damm wrote:
> But introducing a second page->flags is out of the question, and
> breaking out flags and placing a pointer to them in the node data
> structure will introduce more cache misses. So it is probably not
> worth it.
>
Yep. Even then, you can't simply have a single non-atomic flags word,
unless _all_ flags are protected by the same lock.
>>
>>It seems pretty unlikely that we'll get a pluggable replacement
>>policy in mainline any time soon though.
>
>
> So, do you think it is more likely that a ClockPro implementation will
> be accepted then? Or is Linux "doomed" to LRU forever?
>
I think (hope) that Linux eventually (if slowly) moves toward the best
implementation available. I just don't think there will be sufficient
justification for a pluggable page reclaim infrastructure in the mainline
kernel.
Cheers,
Nick
--
SUSE Labs, Novell Inc.
* Re: [RFC] Removing page->flags
2006-02-09 5:37 ` Nick Piggin
@ 2006-02-11 5:30 ` Marcelo Tosatti
0 siblings, 0 replies; 17+ messages in thread
From: Marcelo Tosatti @ 2006-02-11 5:30 UTC (permalink / raw)
To: Nick Piggin
Cc: Magnus Damm, Magnus Damm, linux-mm, Magnus Damm, Peter Zijlstra
On Thu, Feb 09, 2006 at 04:37:40PM +1100, Nick Piggin wrote:
> Magnus Damm wrote:
>
> >But introducing a second page->flags is out of the question, and
> >breaking out flags and placing a pointer to them in the node data
> >structure will introduce more cache misses. So it is probably not
> >worth it.
> >
>
> Yep. Even then, you can't simply have a single non-atomic flags word,
> unless _all_ flags are protected by the same lock.
>
> >>
> >>It seems pretty unlikely that we'll get a pluggable replacement
> >>policy in mainline any time soon though.
> >
> >
> >So, do you think it is more likely that a ClockPro implementation will
> >be accepted then? Or is Linux "doomed" to LRU forever?
> >
>
> I think (hope) that Linux eventually (if slowly) moves toward the best
> implementation available. I just don't think there will be sufficient
> justification for a pluggable page reclaim infrastructure in the mainline
> kernel.
Hi Nick,
There is no such thing as "best implementation available" given that
page replacement policy is nothing more than a set of heuristics
assuming certain characteristics of the underlying workload, and
optimizing for that.
Please refer to
http://programming.kicks-ass.net/kernel-patches/clockpro-2/dev/2.6.16-rc2-1/
Peter's patchset implements a pluggable page reclaim infrastructure
which is used by CLOCK-Pro and CART.
* Re: [RFC] Removing page->flags
2006-02-09 2:35 ` Magnus Damm
2006-02-09 4:19 ` Nick Piggin
@ 2006-02-10 15:03 ` Rik van Riel
1 sibling, 0 replies; 17+ messages in thread
From: Rik van Riel @ 2006-02-10 15:03 UTC (permalink / raw)
To: Magnus Damm; +Cc: Nick Piggin, Magnus Damm, linux-mm, Magnus Damm
On Thu, 9 Feb 2006, Magnus Damm wrote:
> OTOH, maybe it is more likely that a certain struct page is in the cache
> if struct page would become smaller.
No. If struct page is no longer equal to the size of
a cache line, most of the struct page structures will end
up straddling two cache lines, instead of each being on
its own cache line.
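The straddling effect is easy to quantify. A small sketch, assuming an
array of structs packed back to back from a cache-line-aligned start; the
sizes used in the example (28 and 32 bytes) are illustrative, not measured
struct page sizes, and count_straddlers() is a made-up helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count how many of n structs of struct_size bytes, packed from offset 0,
 * span more than one cache line of line_size bytes.
 */
size_t count_straddlers(size_t n, size_t struct_size, size_t line_size)
{
	size_t i, count = 0;

	assert(struct_size > 0 && line_size > 0);
	for (i = 0; i < n; i++) {
		size_t first = i * struct_size;        /* first byte of struct i */
		size_t last = first + struct_size - 1; /* last byte of struct i */

		if (first / line_size != last / line_size)
			count++;
	}
	return count;
}
```

With 32-byte structs on 32-byte lines none of the first eight straddle;
shrink the struct to 28 bytes and six of the eight do.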
--
All Rights Reversed
* Re: [RFC] Removing page->flags
2006-02-08 6:46 [RFC] Removing page->flags Magnus Damm
2006-02-08 11:54 ` Nick Piggin
@ 2006-02-08 19:37 ` Dave Hansen
2006-02-09 2:50 ` Magnus Damm
2006-02-09 1:55 ` KAMEZAWA Hiroyuki
2 siblings, 1 reply; 17+ messages in thread
From: Dave Hansen @ 2006-02-08 19:37 UTC (permalink / raw)
To: Magnus Damm; +Cc: linux-mm, Magnus Damm, Andy Whitcroft
On Wed, 2006-02-08 at 15:46 +0900, Magnus Damm wrote:
> Removing type B bits:
>
> Instead of using the highest bits of page->flags to locate zones, nodes
> or sparsemem section, let's remove them and locate them using alignment!
>
> To locate which zone, node and sparsemem section a page belongs to, just
> use its struct page (source_page) and alignment! The struct page of the
> page that contains source_page (and other parts of mem_map) is located
> using something like this:
>
> memmap_page = virt_to_page(source_page)
We actually discussed this a number of times when developing sparsemem
and its predecessors. It does seem silly to store stuff like the node
information in *so* *many* copies all over the place.
Andy's argument at the time (if I remember correctly) was that nobody
was using those particular page flags for anything, so why shouldn't we
use them? Plus, this gives better cache locality.
You hinted at it, but you are completely right that the 'struct pages'
backing other 'struct pages' aren't used for anything. They are often
bootmem-allocated, so they probably have PageReserved set, and have
never seen the allocator. All of their fields are basically free for
any use that we'd like.
The biggest killer for this idea for me is not when the zones or section
edges are not aligned on big powers of 2, but when the 'struct page' is
oddly sized. When it is 32 or 64 bytes, you get a nice, even number of
them in a full page. But, when you have a 40-byte 'struct page', things
go downhill in a hurry. This can be affected by things like spinlock
debugging, so it is hard to predict and handle in advance.
The really basic implementation (without the odd page size handling) is
here, if you care:
http://www.sr71.net/patches/2.6.10/2.6.10-rc2-mm4-mhp3/broken-out/C6-nonlinear-no-page-section.patch
-- Dave
* Re: [RFC] Removing page->flags
2006-02-08 19:37 ` Dave Hansen
@ 2006-02-09 2:50 ` Magnus Damm
2006-02-09 17:27 ` Dave Hansen
0 siblings, 1 reply; 17+ messages in thread
From: Magnus Damm @ 2006-02-09 2:50 UTC (permalink / raw)
To: Dave Hansen; +Cc: Magnus Damm, linux-mm, Magnus Damm, Andy Whitcroft
On 2/9/06, Dave Hansen <haveblue@us.ibm.com> wrote:
> On Wed, 2006-02-08 at 15:46 +0900, Magnus Damm wrote:
> > Removing type B bits:
> >
> > Instead of using the highest bits of page->flags to locate zones, nodes
> > or sparsemem section, let's remove them and locate them using alignment!
> >
> > To locate which zone, node and sparsemem section a page belongs to, just
> > use its struct page (source_page) and alignment! The struct page of the
> > page that contains source_page (and other parts of mem_map) is located
> > using something like this:
> >
> > memmap_page = virt_to_page(source_page)
>
> We actually discussed this a number of times when developing sparsemem
> and its predecessors. It does seem silly to store stuff like the node
> information in *so* *many* copies all over the place.
Exactly!
> Andy's argument at the time (if I remember correctly) was that nobody
> was using those particular page flags for anything, so why shouldn't we
> use them? Plus, this gives better cache locality.
Sure, that makes sense.
> You hinted at it, but you are completely right that the 'struct pages'
> backing other 'struct pages' aren't used for anything. They are often
> bootmem-allocated, so they probably have PageReserved set, and have
> never seen the allocator. All of their fields are basically free for
> any use that we'd like.
Yes, and there is probably quite a lot of free space in those struct pages too.
> The biggest killer for this idea for me is not when the zones or section
> edges are not aligned on big powers of 2, but when the 'struct page' is
> oddly sized. When it is 32 or 64 bytes, you get a nice, even number of
> them in a full page. But, when you have a 40-byte 'struct page', things
> go downhill in a hurry. This can be affected by things like spinlock
> debugging, so it is hard to predict and handle in advance.
I realize that if struct page size is not a power of two we will end
up with struct page elements that cross a lot of page boundaries. But
is that really a problem? I thought we were safe if:
1) struct page could be any size
2) zones have to start and end at pfn:s that are a multiple of PAGE_SIZE
3) for sparsemem, the smallest section size is 1 << (PAGE_SIZE * 2).
> The really basic implementation (without the odd page size handling) is
> here, if you care:
>
> http://www.sr71.net/patches/2.6.10/2.6.10-rc2-mm4-mhp3/broken-out/C6-nonlinear-no-page-section.patch
Cool, I will check it out.
Thanks!
/ magnus
* Re: [RFC] Removing page->flags
2006-02-09 2:50 ` Magnus Damm
@ 2006-02-09 17:27 ` Dave Hansen
0 siblings, 0 replies; 17+ messages in thread
From: Dave Hansen @ 2006-02-09 17:27 UTC (permalink / raw)
To: Magnus Damm; +Cc: Magnus Damm, linux-mm, Magnus Damm, Andy Whitcroft
On Thu, 2006-02-09 at 11:50 +0900, Magnus Damm wrote:
> I realize that if struct page size is not a power of two we will end
> up with struct page elements that cross a lot of page boundaries. But
> is that really a problem? I thought we were safe if:
>
> 1) struct page could be any size
> 2) zones have to start and end at pfn:s that are a multiple of
> PAGE_SIZE
> 3) for sparsemem, the smallest section size is 1 << (PAGE_SIZE * 2).
Yeah, I've thought through some scenarios and I can't think of any where
it breaks unless the section size is really small, or a
non-power-of-two. But, I don't think it is as feasible for DISCONTIGMEM
or normal FLATMEM configurations.
-- Dave
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Removing page->flags
2006-02-08 6:46 [RFC] Removing page->flags Magnus Damm
2006-02-08 11:54 ` Nick Piggin
2006-02-08 19:37 ` Dave Hansen
@ 2006-02-09 1:55 ` KAMEZAWA Hiroyuki
2006-02-09 2:57 ` Magnus Damm
2 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-02-09 1:55 UTC (permalink / raw)
To: Magnus Damm; +Cc: linux-mm, Magnus Damm
Magnus Damm wrote:
> [RFC] Removing page-flags
>
> Moving type A bits:
>
> Instead of keeping the bits together, we spread them out and store a
> pointer to them from pg_data_t.
>
This will annoy people whose job is to look into crash-dump vmcores... like me ;)
So, I don't like this idea.
BTW, did you see Nigel's dynamic page-flags idea?
I think temporary page-flags can be replaced by some page tracking
infrastructure.
-- Kame
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Removing page->flags
2006-02-09 1:55 ` KAMEZAWA Hiroyuki
@ 2006-02-09 2:57 ` Magnus Damm
2006-02-09 3:14 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 17+ messages in thread
From: Magnus Damm @ 2006-02-09 2:57 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Magnus Damm, linux-mm, Magnus Damm
Hi Kamezawa-san,
On 2/9/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Magnus Damm wrote:
> > [RFC] Removing page-flags
> >
> > Moving type A bits:
> >
> > Instead of keeping the bits together, we spread them out and store a
> > pointer to them from pg_data_t.
> >
> This will annoy people who has a job to look into crash-dump's vmcore..like me ;)
> so, I don't like this idea.
Hehe, gotcha. =) I also wonder how well it would work with your zone patches.
> BTW, did you see Nigel's dynamic page-flags idea ?
> I think temporal page-flags can be replaced by some page tracking
> infrastructure.
I'm not familiar with that patch yet, but I will be soon. =) Thanks!
/ magnus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Removing page->flags
2006-02-09 2:57 ` Magnus Damm
@ 2006-02-09 3:14 ` KAMEZAWA Hiroyuki
2006-02-09 3:38 ` Magnus Damm
0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-02-09 3:14 UTC (permalink / raw)
To: Magnus Damm; +Cc: Magnus Damm, linux-mm, Magnus Damm
Magnus Damm wrote:
> Hi Kamezawa-san,
>
> On 2/9/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>> Magnus Damm wrote:
>>> [RFC] Removing page-flags
>>>
>>> Moving type A bits:
>>>
>>> Instead of keeping the bits together, we spread them out and store a
>>> pointer to them from pg_data_t.
>>>
>> This will annoy people who has a job to look into crash-dump's vmcore..like me ;)
>> so, I don't like this idea.
>
> Hehe, gotcha. =) I also wonder how well it would work with your zone patches.
>
My layout-free-zone patches are not affected by this if you use pgdat/section to
preserve page-flags.
To be honest, I'd like to do this:
==
struct zone *page_zone(struct page *page)
{
	return page->zone;
}
==
But this increases the size of memmap awfully ;( so I can't.
The current zone indexing in page->flags saves memory space well, I think.
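For comparison, the space-saving approach can be sketched in userspace as below: a small zone index is packed into the top bits of the flags word, so no extra field is needed per page. This is only an illustration. The names page_sketch, SKETCH_ZONE_BITS, set_zone_index and zone_index are hypothetical, and the two-bit width is an assumption rather than the kernel's actual layout.

```c
#include <limits.h>

#define SKETCH_ZONE_BITS	2UL	/* assumed width of the zone index */
#define SKETCH_ZONE_SHIFT \
	(sizeof(unsigned long) * CHAR_BIT - SKETCH_ZONE_BITS)
#define SKETCH_ZONE_MASK	((1UL << SKETCH_ZONE_BITS) - 1)

/* Hypothetical stand-in for struct page: only the flags word. */
struct page_sketch {
	unsigned long flags;
};

/* Store a zone index in the top bits, leaving the low flag bits alone. */
static void set_zone_index(struct page_sketch *p, unsigned long idx)
{
	p->flags &= ~(SKETCH_ZONE_MASK << SKETCH_ZONE_SHIFT);
	p->flags |= (idx & SKETCH_ZONE_MASK) << SKETCH_ZONE_SHIFT;
}

/* Recover the zone index with one shift and one mask. */
static unsigned long zone_index(const struct page_sketch *p)
{
	return (p->flags >> SKETCH_ZONE_SHIFT) & SKETCH_ZONE_MASK;
}
```

A direct page->zone pointer would instead add one pointer-sized field to every entry in memmap, which is the cost objected to above.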
-- Kame
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Removing page->flags
2006-02-09 3:14 ` KAMEZAWA Hiroyuki
@ 2006-02-09 3:38 ` Magnus Damm
2006-02-09 3:51 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 17+ messages in thread
From: Magnus Damm @ 2006-02-09 3:38 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Magnus Damm, linux-mm, Magnus Damm
On 2/9/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Magnus Damm wrote:
> > Hi Kamezawa-san,
> >
> > On 2/9/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> >> Magnus Damm wrote:
> >>> [RFC] Removing page-flags
> >>>
> >>> Moving type A bits:
> >>>
> >>> Instead of keeping the bits together, we spread them out and store a
> >>> pointer to them from pg_data_t.
> >>>
> >> This will annoy people who has a job to look into crash-dump's vmcore..like me ;)
> >> so, I don't like this idea.
> >
> > Hehe, gotcha. =) I also wonder how well it would work with your zone patches.
> >
> My layout-free-zone patches are not affected by this if you use pgdat/section to
> preserve page-flags.
Ok, good.
> To be honest, I'd like to do this
> ==
> struct zone *page_zone(struct page *page)
> {
> return page->zone;
> }
> ==
> But this increases size of memmap awfully ;( and I can't.
> Current zone-indexing in page-flags is well saving memory space, I think.
With my proposal (Removing type B bits), if you can guarantee that all
your zones have a start address and a size that is aligned to (1 <<
(PAGE_SHIFT * 2)), then the following code should be possible:
struct zone *page_zone(struct page *page)
{
	/* struct page of the page frame that holds this mem_map entry */
	struct page *parent = virt_to_page(page);

	return (struct zone *)parent->mapping;
}
This assumes that the first entry in mem_map is aligned to PAGE_SIZE,
and that some code has set up parent->mapping to point to the correct
zone. =)
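The alignment argument can be checked with a rough userspace simulation. It is a sketch under stated assumptions, not kernel code: the table memmap_page_zone stands in for parent->mapping, the 40-byte struct page is deliberately not a power of two, and the helper names setup_memmap and check_all_pages are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SHIFT	12
#define PAGE_SIZE	(1UL << PAGE_SHIFT)
#define NR_ZONES	2UL
#define PAGES_PER_ZONE	(PAGE_SIZE)	/* zone size: PAGE_SIZE pfns */
#define NR_PAGES	(NR_ZONES * PAGES_PER_ZONE)

struct zone { int id; };
struct page { unsigned long pad[5]; };	/* 40 bytes on LP64 */

static struct zone zones[NR_ZONES];
static struct page *mem_map;
/* One zone pointer per page of mem_map: stand-in for parent->mapping. */
static struct zone **memmap_page_zone;

/* Look up the zone from the address of the struct page alone. */
static struct zone *page_zone(const struct page *page)
{
	uintptr_t off = (uintptr_t)page - (uintptr_t)mem_map;

	return memmap_page_zone[off >> PAGE_SHIFT];
}

/* Returns the number of mem_map pages shared by two zones, or -1. */
static int setup_memmap(void)
{
	size_t bytes = NR_PAGES * sizeof(struct page);
	unsigned long pfn;
	int conflicts = 0;

	mem_map = aligned_alloc(PAGE_SIZE, bytes);
	memmap_page_zone = calloc(bytes >> PAGE_SHIFT,
				  sizeof(*memmap_page_zone));
	if (!mem_map || !memmap_page_zone)
		return -1;

	for (pfn = 0; pfn < NR_PAGES; pfn++) {
		uintptr_t off = (uintptr_t)&mem_map[pfn] - (uintptr_t)mem_map;
		struct zone **slot = &memmap_page_zone[off >> PAGE_SHIFT];
		struct zone *z = &zones[pfn / PAGES_PER_ZONE];

		if (*slot && *slot != z)
			conflicts++;
		*slot = z;
	}
	return conflicts;
}

/* Returns how many pfns resolve to the wrong zone. */
static unsigned long check_all_pages(void)
{
	unsigned long pfn, bad = 0;

	for (pfn = 0; pfn < NR_PAGES; pfn++)
		if (page_zone(&mem_map[pfn]) != &zones[pfn / PAGES_PER_ZONE])
			bad++;
	return bad;
}
```

Because each zone spans a multiple of PAGE_SIZE pfns, its slice of mem_map is a multiple of PAGE_SIZE bytes whatever sizeof(struct page) is, so no mem_map page is shared between zones even though individual entries cross page boundaries.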
/ magnus
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Removing page->flags
2006-02-09 3:38 ` Magnus Damm
@ 2006-02-09 3:51 ` KAMEZAWA Hiroyuki
2006-02-09 5:24 ` Magnus Damm
0 siblings, 1 reply; 17+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-02-09 3:51 UTC (permalink / raw)
To: Magnus Damm; +Cc: Magnus Damm, linux-mm, Magnus Damm
Magnus Damm wrote:
> With my proposal (Removing type B bits), if you can guarantee that all
> your zones have a start address and a size that is aligned to (1 <<
> (PAGE_SHIFT * 2)), then the following code should be possible:
>
> struct zone *page_zone(struct page *page)
> {
> struct page *parent = virt_to_page(page);
>
> return (struct zone *)parent->mapping;
> }
>
I think "why do this" is important. Just increasing the space in page->flags
is not attractive to me. And I think your proposal will add an extra limitation
to memmap and the page<->zone linkage.
IMHO, it will add more complexity to the kernel.
-- Kame
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [RFC] Removing page->flags
2006-02-09 3:51 ` KAMEZAWA Hiroyuki
@ 2006-02-09 5:24 ` Magnus Damm
0 siblings, 0 replies; 17+ messages in thread
From: Magnus Damm @ 2006-02-09 5:24 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki; +Cc: Magnus Damm, linux-mm, Magnus Damm
On 2/9/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> Magnus Damm wrote:
> > With my proposal (Removing type B bits), if you can guarantee that all
> > your zones have a start address and a size that is aligned to (1 <<
> > (PAGE_SHIFT * 2)), then the following code should be possible:
> >
> > struct zone *page_zone(struct page *page)
> > {
> > struct page *parent = virt_to_page(page);
> >
> > return (struct zone *)parent->mapping;
> > }
> >
>
> I think "Why do this" is important. Just for increasing space of page->flags
> is not attractive to me. And I think your proposal will adds a extra limitation
> to memmap and page<->zone linkage.
> IMHO, it will adds another complexity to the kernel.
Yes, it will make the relationship between zones and memmap more
complex. The only reason to implement this idea would be to save space
by removing page->flags. But the move of type A bits will probably
result in more cache misses so it is probably not worth it.
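The cache concern can be seen in a rough userspace sketch of the type A layout from the RFC: one bitmap per flag, indexed by (pfn - node_start_pfn). The struct node_sketch and the helpers below are hypothetical names, the operations are plain non-atomic ones, and the count of 21 flags is taken from the RFC text.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_PAGE_FLAGS		21	/* count taken from the RFC text */
#define SKETCH_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical stand-in for the part of pg_data_t the RFC extends. */
struct node_sketch {
	unsigned long node_start_pfn;
	unsigned long nr_pages;
	unsigned long *flag_bitmap[NR_PAGE_FLAGS];	/* one bitmap per flag */
};

static int node_init(struct node_sketch *n, unsigned long start,
		     unsigned long nr)
{
	size_t words = (nr + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
	int i;

	n->node_start_pfn = start;
	n->nr_pages = nr;
	for (i = 0; i < NR_PAGE_FLAGS; i++) {
		n->flag_bitmap[i] = calloc(words, sizeof(unsigned long));
		if (!n->flag_bitmap[i])
			return -1;
	}
	return 0;
}

/* Non-atomic set: only safe where the flag is otherwise protected. */
static void set_page_flag(struct node_sketch *n, int flag, unsigned long pfn)
{
	unsigned long idx = pfn - n->node_start_pfn;

	n->flag_bitmap[flag][idx / SKETCH_BITS_PER_LONG] |=
		1UL << (idx % SKETCH_BITS_PER_LONG);
}

static int test_page_flag(const struct node_sketch *n, int flag,
			  unsigned long pfn)
{
	unsigned long idx = pfn - n->node_start_pfn;

	return (n->flag_bitmap[flag][idx / SKETCH_BITS_PER_LONG] >>
		(idx % SKETCH_BITS_PER_LONG)) & 1UL;
}
```

Testing several flags of the same page reads one word from each of several separate bitmaps, whereas a single page->flags word serves all of them from one cache line.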
Thanks for your input,
/ magnus
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2006-02-11 5:30 UTC | newest]
Thread overview: 17+ messages
2006-02-08 6:46 [RFC] Removing page->flags Magnus Damm
2006-02-08 11:54 ` Nick Piggin
2006-02-09 2:35 ` Magnus Damm
2006-02-09 4:19 ` Nick Piggin
2006-02-09 5:19 ` Magnus Damm
2006-02-09 5:37 ` Nick Piggin
2006-02-11 5:30 ` Marcelo Tosatti
2006-02-10 15:03 ` Rik van Riel
2006-02-08 19:37 ` Dave Hansen
2006-02-09 2:50 ` Magnus Damm
2006-02-09 17:27 ` Dave Hansen
2006-02-09 1:55 ` KAMEZAWA Hiroyuki
2006-02-09 2:57 ` Magnus Damm
2006-02-09 3:14 ` KAMEZAWA Hiroyuki
2006-02-09 3:38 ` Magnus Damm
2006-02-09 3:51 ` KAMEZAWA Hiroyuki
2006-02-09 5:24 ` Magnus Damm