From: Christoph Lameter <clameter@sgi.com>
To: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>, linux-mm@kvack.org
Subject: Re: [PATCH] Get rid of zone_table
Date: Thu, 14 Sep 2006 14:46:02 -0700 (PDT)
Message-ID: <Pine.LNX.4.64.0609141431560.5688@schroedinger.engr.sgi.com>
In-Reply-To: <45092FE6.3060706@shadowen.org>

On Thu, 14 Sep 2006, Andy Whitcroft wrote:

> Proposed implementation:
> 
>     | Node | Zone | [Section] | xxxxx |      Flags       |
>      \____/ \____/
>         |      |__________________
>   .- - -|- - - - - - - -.         |

Right. There is one lookup here in the node_data array. The combination
with the zone is an address calculation and does not require a lookup.
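
To illustrate the point, here is a minimal user-space sketch of that path. The field widths (NODE_BITS, ZONE_BITS), the helper make_page_flags() and the cut-down structures are assumptions made for the example, not the kernel's definitions. It only shows that, once the node is in the page flags, reaching the zone costs one load from node_data[] plus pointer arithmetic.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical field widths, chosen only for illustration. */
#define NODE_BITS     6
#define ZONE_BITS     2
#define FLAG_BITS     (sizeof(unsigned long) * CHAR_BIT)
#define NODE_PGSHIFT  (FLAG_BITS - NODE_BITS)
#define ZONE_PGSHIFT  (NODE_PGSHIFT - ZONE_BITS)
#define MAX_NODES     (1 << NODE_BITS)
#define MAX_NR_ZONES  (1 << ZONE_BITS)

/* Simplified stand-ins for the kernel structures. */
struct zone { unsigned long free_pages; };
struct pglist_data { struct zone node_zones[MAX_NR_ZONES]; };
struct page { unsigned long flags; };

/* One pointer per node; this is the only table that gets consulted. */
static struct pglist_data *node_data[MAX_NODES];

/* Hypothetical helper: pack node and zone number into the top flag bits. */
static unsigned long make_page_flags(unsigned long nid, unsigned long zonenum)
{
	return (nid << NODE_PGSHIFT) | (zonenum << ZONE_PGSHIFT);
}

static unsigned long page_to_nid(const struct page *page)
{
	return (page->flags >> NODE_PGSHIFT) & (MAX_NODES - 1);
}

static unsigned long page_zonenum(const struct page *page)
{
	return (page->flags >> ZONE_PGSHIFT) & (MAX_NR_ZONES - 1);
}

static struct zone *page_zone(const struct page *page)
{
	/* The single lookup: fetch the node's pglist_data pointer. */
	struct pglist_data *pgdat = node_data[page_to_nid(page)];

	assert(pgdat != NULL);
	/* No second table: the zone is at a fixed offset inside the node. */
	return &pgdat->node_zones[page_zonenum(page)];
}
```

With node_data[3] pointing at some pglist_data, a page whose flags were built with make_page_flags(3, 1) resolves to that node's node_zones[1].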

>   .     v               .         v
>   . +-----------+       .  +-----------+
>   . | node_data |--&node-->| NODE_DATA |----> &zone
>   . +-----------+       .  +-----------+
>   .     ^               .         ^
>    - - -|- - - - - - - -A         |
>         |                         |
>     +---------------+             |
>     | section_table |             |
>     +---------------+             |

Right here is the second lookup for the case in which the section does not 
fit.

>         ^                         |
>         |                         |
>       __|_____________________   _|__
>      /                        \ /    \
>     |         Section          | Zone |      Flags       |
> 
> 
> Christoph Lameter wrote:
> > The zone table is mostly not needed. If we have a node in the page flags 
> > then we can get to the zone via NODE_DATA(). In case of SMP and UP 
> > NODE_DATA() is a constant pointer which allows us to access an exact 
> > replica of zonetable in the node_zones field. In all of the above cases 
> > there will be no need at all for the zone table.
> 
> Ok here we are talking about the segment of the second diagram ringed
> and marked A.  Yes the compiler/we should be able to optimise this case
> to directly use the zonelist.  However, this is also true of the current
> scheme and would be a fairly trivial change in that framework.

What would the compiler optimize? You mean the zonelist in the node 
structure or the zonetable?

> 
> Something like the below.
> 
> @@ -477,7 +477,10 @@ static inline int page_zone_id(struct pa
>  }
>  static inline struct zone *page_zone(struct page *page)
>  {
> -       return zone_table[page_zone_id(page)];
> +       if (NODE_SHIFT)
> +               return zone_table[page_zone_id(page)];
> +       else
> +               return NODE_DATA(0)->node_zones[page_zonenum(page)];
>  }

Yes that code was proposed in the RFC. See linux-mm. Dave suggested that 
we can eliminate the zone_table or the section_to_node_table completely
because we can actually fit the node into the page flags with some 
adjustments to sparsemem.

> A similar thing could be done for page_to_nid which should always be zero.

page_to_nid already uses page_zone in that case.

> > The section_to_node table (if we still need it) is still the size of the 
> > number of sections but the individual elements are integers (which already 
> > saves 50% on 64 bit platforms) and we do not need to duplicate the entries 
> > per zone type. So even if we have to keep the table then we shrink it to 
> > 1/4th (32bit) or 1/8th (64bit).
> 
> Ok, this is based on half for moving from a pointer to an integer.  The
> rest is based on the fact we have 4 zones.  Given most sane
> architectures only have ZONE_DMA we should be able to get a large
> percentage of this saving just from knowing the highest 'valid' zone per
> architecture.

NUMAQ only populates HIGHMEM on nodes other than zero. You will get 
no benefit with such a scheme.

> Let us consider the likely sizes of the zone_table for a SPARSEMEM
> configuration:
> 
> 1) the 32bit case.  Here we have a limitation of a maximum of 6 bits
> worth of sections (64 of them).  So the maximum zone_table size is 4 *
> 64 * 4 == 1024, so 1KB of zone_table.

Can we fit the node in there for all possible 32 bit NUMA machines?
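
The sizes quoted above can be checked with a few lines of C. This is only a sketch of the arithmetic; the function names are made up for the example, and the pointer and integer sizes are passed in explicitly so that the 32 bit and 64 bit cases can be compared on any machine.

```c
#include <stddef.h>

/*
 * zone_table: one pointer per (section, zone type) pair.
 * Hypothetical helper, not a kernel function.
 */
static size_t zone_table_bytes(size_t sections, size_t zones, size_t ptr_size)
{
	return sections * zones * ptr_size;
}

/*
 * section_to_node table: one integer per section, with no
 * duplication per zone type. Hypothetical helper as well.
 */
static size_t section_to_node_bytes(size_t sections, size_t int_size)
{
	return sections * int_size;
}
```

For 64 sections, 4 zone types and 4 byte pointers, zone_table_bytes(64, 4, 4) gives the 1024 bytes mentioned above, while section_to_node_bytes(64, 4) gives 256 bytes, a quarter of that. With 8 byte pointers and 4 byte integers the ratio becomes one eighth for the same number of sections.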

> General comments.  Although this may seem of the same order of
> complexity and therefore a performance drop in, there does seem to be a
> significant number of additional indirections on a NUMA system.

Could you tell me where the "indirections" come from? AFAIK there is only
one additional indirection that is offset by the NODE_DATA array being
cache hot. page_to_nid goes from 3 indirections to one with this scheme.
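
A rough sketch of the two page_to_nid paths, for comparison. The structures are cut down to the fields involved, the bit widths and make_page_flags() are assumptions for the example, and the way the loads are counted in the comments is one reading of the statement above rather than a measurement.

```c
#include <limits.h>

/* Hypothetical layout: node in the top bits, zone number right below it. */
#define NODE_BITS      6
#define ZONE_BITS      2
#define ZONEID_BITS    (NODE_BITS + ZONE_BITS)
#define FLAG_BITS      (sizeof(unsigned long) * CHAR_BIT)
#define NODE_PGSHIFT   (FLAG_BITS - NODE_BITS)
#define ZONEID_PGSHIFT (FLAG_BITS - ZONEID_BITS)

/* Simplified stand-ins for the kernel structures. */
struct pglist_data { int node_id; };
struct zone { struct pglist_data *zone_pgdat; };
struct page { unsigned long flags; };

/* One pointer per (node, zone) combination. */
static struct zone *zone_table[1 << ZONEID_BITS];

/* Hypothetical helper: pack node and zone number into the top flag bits. */
static unsigned long make_page_flags(unsigned long nid, unsigned long zonenum)
{
	return ((nid << ZONE_BITS) | zonenum) << ZONEID_PGSHIFT;
}

static unsigned long page_zone_id(const struct page *page)
{
	return (page->flags >> ZONEID_PGSHIFT) & ((1UL << ZONEID_BITS) - 1);
}

/*
 * Through the zone table: after decoding the flags, three dependent
 * loads (zone_table[] entry, zone->zone_pgdat, pgdat->node_id).
 */
static int page_to_nid_via_zone_table(const struct page *page)
{
	return zone_table[page_zone_id(page)]->zone_pgdat->node_id;
}

/* Node kept in the flags: a shift and a mask on page->flags, no table. */
static int page_to_nid_from_flags(const struct page *page)
{
	return (int)((page->flags >> NODE_PGSHIFT) & ((1UL << NODE_BITS) - 1));
}
```

Both functions return the same node id for a consistently populated zone_table; the second one simply never leaves the struct page.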

> I can see a very valid case for optimising the UP/SMP case where
> NODE_DATA is a constant.  But that could be optimised as I indicate
> above without a complete rewrite.

Could you have a look at the RFC which does exactly that?
 
> Finally, if the change here was a valid one benchmark wise or whatever,
> I think it would be nicer to push this in through the same interface we
> currently have as that would allow other shaped zone_tables to be
> brought back should a new memory layout come along.

It would be best to eliminate the zone_table or my section_to_node_table 
completely. The section_to_node_table does not require maintenance
in the page allocator as the zone_table does.

> > Index: linux-2.6.18-rc6-mm2/include/linux/mm.h
> > ===================================================================

I could not find any comments in here. Please cut down emails as much as 
possible.
