From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 20 Apr 2006 23:17:51 -0700
From: Paul Jackson
Subject: Re: [RFC] split zonelist and use nodemask for page allocation [1/4]
Message-Id: <20060420231751.f1068112.pj@sgi.com>
In-Reply-To: <20060421131147.81477c93.kamezawa.hiroyu@jp.fujitsu.com>
References: <20060421131147.81477c93.kamezawa.hiroyu@jp.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: KAMEZAWA Hiroyuki
Cc: linux-mm@kvack.org, clameter@sgi.com

Interesting ... maybe?

Doesn't this change the semantics of the kernel page allocator?

If I read correctly: the existing code scans the entire system's
zonelists multiple times.  First it looks on all nodes in the system
for easy memory, and if that fails, it tries again, looking for less
easy (lower threshold) memory.

Your code takes one node at a time, in the alloc_pages_nodemask()
loop, and calls __alloc_pages() for that node, which will exhaust
that node before giving up.

In particular, the low memory failure cases, such as when the system
starts to swap on a node, or a task is forced to sleep waiting for
memory, or the out-of-memory killer is invoked, would seem to be
quite different with your patch.  This could cause some serious
problems, I suspect.

Some of the other advantages of this change look nice, but I suspect
it would take a radical rewrite of __alloc_pages(), moving the
multiple scans at increasingly aggressive free memory settings up
into your __alloc_pages_nodemask() routine, and moving the
cpuset_zone_allowed() check from get_page_from_freelist() up as well.
This would be a major rewrite of mm/page_alloc.c, perhaps a very
interesting one, but I don't think it would be an easy one.

Or, just perhaps, the above change in semantics is a -good- one.
I'll wager that my colleague Christoph will consider it such (I see
he has already heartily endorsed your patch).  Essentially your patch
would seem to increase the locality of allocations -- beating one
node to death before considering the next.  Sometimes this will be a
good improvement.  And sometimes not.

In my ideal world, there would be a per-cpuset option, perhaps just a
boolean, choosing between two policies:

 1) look on all allowed nodes for easy memory, before reconsidering
    each allowed node for one of the last free pages, or
 2) beat all zones on one node hard, before going off-node.

I believe that the existing code does (1), and your patch does (2).
(A rough sketch of the two orderings is appended below.)

In any event, layering yet another control loop on top of the nested
conditional fallback loops we have now is a concern.  It is getting
harder and harder for mere mortals to understand this code.  Perhaps
there are opportunities here for much more cleanup, though that would
not be easy.

My apologies for wasting your time if I misread this.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson  1.925.600.0401
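Appended sketch: a toy userspace C program contrasting the two
orderings discussed above.  The node counts, "watermark" levels, and
helper names are made up for illustration; this is not the actual
mm/page_alloc.c code, only a rough model of the two scan orders.

    /* Toy model: 4 nodes, each with free pages at two "difficulty"
     * levels: 0 = easy (above watermark), 1 = harder (needs lower
     * threshold).  All numbers are invented for illustration. */
    #include <stdio.h>
    #include <stdbool.h>

    #define NR_NODES 4

    static int free_pages[NR_NODES][2] = {
        { 0, 3 },   /* node 0: no easy pages, a few hard ones   */
        { 5, 8 },   /* node 1: plenty of easy memory            */
        { 0, 1 },   /* node 2: nearly exhausted                 */
        { 2, 4 },   /* node 3: some of each                     */
    };

    /* stand-in for the cpuset_zone_allowed() style check */
    static bool node_allowed(int node)
    {
        (void)node;
        return true;
    }

    /* (1) existing allocator, as described above: scan all allowed
     *     nodes at the easy level first, then rescan them all at
     *     the more aggressive level. */
    static int alloc_scan_all_nodes(void)
    {
        for (int level = 0; level < 2; level++)
            for (int node = 0; node < NR_NODES; node++)
                if (node_allowed(node) && free_pages[node][level] > 0) {
                    free_pages[node][level]--;
                    return node;
                }
        return -1;  /* only now would we sleep / reclaim / OOM */
    }

    /* (2) the patch as read above: work one node over completely
     *     (all levels) before even looking at the next node. */
    static int alloc_one_node_at_a_time(void)
    {
        for (int node = 0; node < NR_NODES; node++) {
            if (!node_allowed(node))
                continue;
            for (int level = 0; level < 2; level++)
                if (free_pages[node][level] > 0) {
                    free_pages[node][level]--;
                    return node;
                }
            /* a real __alloc_pages() call would swap/sleep/OOM
             * against this node before the outer loop advances */
        }
        return -1;
    }

    int main(void)
    {
        printf("scan-all-nodes picks node %d\n", alloc_scan_all_nodes());
        printf("one-node-at-a-time picks node %d\n", alloc_one_node_at_a_time());
        return 0;
    }

With these made-up numbers, ordering (1) skips node 0 and takes easy
memory from node 1, while ordering (2) digs into node 0's harder
memory before ever going off-node -- the locality versus memory
pressure difference discussed above.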