From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
To: Christoph Lameter <clameter@sgi.com>
Cc: linux-mm <linux-mm@kvack.org>,
Andrew Morton <akpm@linux-foundation.org>,
Andi Kleen <ak@suse.de>, Gleb Natapov <glebn@voltaire.com>
Subject: Re: [PATCH] Document Linux Memory Policy
Date: Thu, 31 May 2007 15:29:04 -0400
Message-ID: <1180639745.5091.186.camel@localhost>
In-Reply-To: <Pine.LNX.4.64.0705311130010.11008@schroedinger.engr.sgi.com>
On Thu, 2007-05-31 at 11:35 -0700, Christoph Lameter wrote:
> On Thu, 31 May 2007, Lee Schermerhorn wrote:
>
> > > It seems that you are creating some artificial problems here.
> >
> > Christoph: Let me assure you, I'm not persisting in this exchange
> > because I'm enjoying it. Quite the opposite, actually. However, like
> > you, I'm asked by my employer to address our customers' requirements. I'm
> > trying to understand and play within the rules of the community. I
> > attempted this documentation patch to address what I saw as missing
> > documentation and to provide context for further discussion of my patch
> > set.
>
> Could you explain to us what kind of user scenario you are addressing? We
> have repeatedly asked you for that information. I am happy to hear that
> there is an actual customer requirement.
And I've tried to explain without "naming names". Let me try it this
way: a multi-task application mmap()s a large file shared--think O(1TB)
or larger. You can think of it as an in-memory database, but the size
of the file could exceed physical memory. Various tasks of the
application will access various portions of the memory area in
different ways and with different frequencies [sort of like Gleb
described]. The memory region is large enough and cache locality poor
enough that "locality matters".
Why not just use shmem and read the file in at startup? In those cases
where it would fit, it takes quite a while to read in a file of this
size, and processing can't start until it's all in. Perhaps one could
modify the application to carefully sequence the load so other tasks
could get started before it's all in, but that only works if the access
pattern is known a priori. And, if the file is larger than memory,
you'll need swap space to back it.
You want persistence across runs of the application--e.g., so that you
could suspend it and continue later. You could just write the entire
shmem out at the end, but again, that takes a long time. The
application could keep track of which regions of memory have been
modified and write them out incrementally, but with a mapped file, the
kernel does this automatically [I won't say "for free" ;-)] with an
occasional msync() or if reclaim becomes necessary.
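In code, that incremental writeback is just an occasional msync() on
the mapping--sketch, with p/len being the mapping from the example
above:

#include <sys/mman.h>

/* Start writeback of whatever is dirty without blocking the
 * application; at suspend time, MS_SYNC instead would wait for
 * the data to reach the file. */
static void checkpoint(void *p, size_t len)
{
        msync(p, len, MS_ASYNC);
}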
Granted, these last two paragraphs describe how a number of large
enterprise databases work. So, it's not impossible. It IS a lot of
work if you don't need the type of guarantees that those systems
provide.
Why not just use task policy to place the pages? Task policy affects
all task allocations, including stack, heap, ... Better to let those
default to local. Well, why not place the pages, lock them down and
then change the task policy back to default/local? The file might not
fit; even if it did, you might not want to commit that much memory.
And, yes, it seems unnatural to have to jump through these hoops--at
least for customers bringing applications from environments where they
didn't have to. [I know, I know. Functional parity with other
systems... Not a valid reason... Yada yada. ;-)]
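Spelled out, the hoops look something like this. Untested sketch; I'm
assuming a 4KB page size and a nodemask that fits in one long, and the
helper name is mine:

#include <numaif.h>             /* set_mempolicy(); link with -lnuma */
#include <sys/mman.h>

static void place_and_lock(const char *p, size_t len,
                           unsigned long nodemask)
{
        volatile char sink;
        size_t i;

        /* Temporarily bind the whole task... */
        set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);

        /* ...touch the file pages so they're allocated on the bound
         * nodes, and pin them so they stay put... */
        for (i = 0; i < len; i += 4096)
                sink = p[i];
        mlock(p, len);

        /* ...then let stack/heap/etc. go back to local allocation. */
        set_mempolicy(MPOL_DEFAULT, NULL, 0);
}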
>
> > My point was that the description of MPOL_DEFAULT made reference to the
> > zonelists built at boot time, to distinguish them from the custom
> > zonelists built for an MPOL_BIND. Since the zonelist reorder patch
> > hasn't made it out of Andrew's tree yet, I didn't want to refer to it
> > in this round of the doc. If it makes it into the tree, I had planned
> > to say something like: "at boot time or on request". I should probably
> > add "or on memory hotplug".
>
> Hmmm... The zonelists for MPOL_BIND are never rebuilt by Kame-san's
> patches. That is a concern.
Yes. And as we noted earlier, even the initial ones don't consider
distance. The latter should be relatively easy to fix, as we have code
that does it for the node zonelists; it would require some
generalization. Rebuilding policy zonelists would require finding them
all somehow: either an expensive system [cpuset] wide scan, or a
system-wide/per-cpuset list of [MPOL_BIND] policies. A per-cpuset list
might reduce the scope of the rebuild, but you'd have to scan tasks and
reparent their policies when moving them between cpusets. Not pretty
either way.
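[Aside, purely to illustrate what "consider distance" means: a
userspace sketch using libnuma's numa_distance(). The real fix would
of course live in the kernel's zonelist-building code, much as it does
for the boot-time node zonelists.]

#include <numa.h>               /* numa_distance(); link with -lnuma */
#include <stdlib.h>

static int home_node;           /* node whose viewpoint we order from */

static int by_distance(const void *a, const void *b)
{
        return numa_distance(home_node, *(const int *)a) -
               numa_distance(home_node, *(const int *)b);
}

/* Order the nodes of a bind mask nearest-first from 'from'. */
static void order_bind_nodes(int *nodes, int count, int from)
{
        home_node = from;
        qsort(nodes, count, sizeof(*nodes), by_distance);
}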
>
> > But, after I see what gets accepted into the man pages that I've agreed
> > to update, I'll consider dropping this section altogether. Maybe the
> > entire document.
>
> I'd be very thankful if you could upgrade the manpages. Andi has some
> patches from me against numactl pending that include manpage
> updates. I can forward that to you.
>
> > > page cache pages are subject to a task's memory policy regardless of how we
> > > get to the page cache page. I think that is pretty consistent.
> >
> > Oh, it's consistent, alright. Just not pretty [;-)] when it's not what
> > the application wants.
>
> I sure hope that we can at some point figure out what your application is
> doing. It's been a hard road to that information so far.
>
I thought I'd explained before. I guess just too abstractly. Maybe the
description above is too abstract as well. However, Gleb's
application has similar requirements--he wants to control the location
of pages in a shared, mmap'ed file using explicit policies. He's even
willing to issue the identical policy calls from each task--something I
don't think he should need to do--to accomplish it. But, it still won't
work for him...
Lee