From: William Lee Irwin III <wli@holomorphy.com>
To: Andrea Arcangeli <andrea@suse.de>
Cc: Rik van Riel <riel@redhat.com>,
"Martin J. Bligh" <mbligh@aracnet.com>,
Mel Gorman <mel@csn.ul.ie>,
Linux Memory Management List <linux-mm@kvack.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: What to expect with the 2.6 VM
Date: Thu, 3 Jul 2003 18:46:24 -0700
Message-ID: <20030704014624.GN20413@holomorphy.com>
In-Reply-To: <20030704004000.GQ23578@dualathlon.random>
On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> If it's such a strong feature, go ahead and show me a patch to a
> real-life application (there are plenty of things using hashes or
> btrees; feel free to choose the one that according to you will behave
> closest to the "exploit" [1]) using remap_file_pages to avoid the pte
> overhead, and show the huge improvement in the numbers compared to
> changing the design of the code to have some sort of locality in the
> I/O access patterns (NOTE: I don't care about ram, I care about
> speed). Given how hard you advocate for this, I assume you at least
> expect a 1% improvement, right? Once you make the patch I can
> volunteer to benchmark it if you don't have time or hardware for
> that. After you've made the patch and shown a >=1% improvement by not
> keeping the file mapped linearly, but by mapping it nonlinearly using
> remap_file_pages, I reserve the right to fix the app to have some
> sort of locality of information so that the I/O side will be able to
> get a boost too.
You're obviously determined to write the issue this thing addresses out
of the cost/benefit analysis. That issue is tremendously obvious:
conserve RAM and you don't go into page (or pagetable) replacement. I'm
sick of hearing that it's somehow legitimate for an app to explode when
presented with random access patterns (and this is not the only
instance of that argument).
And I'd love to write an app that uses it for this; unfortunately for
both of us, I'm generally fully booked with kernelspace tasks.
On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> the fact is, no matter the VM side, your app has no way to perform
> even remotely well in terms of I/O seeks if you're filling one page
> per pmd, due to the huge seeks it will generate on the major faults.
> And even for the minor faults, if it has no locality at all and seeks
> all over the place in a non-predictable manner, the tlb flushes will
> kill performance compared to keeping the file mapped statically, and
> it'll make it even slower than touching a new pte every time.
What minor faults? remap_file_pages() does pagecache lookups and
instantiates the ptes directly in the system call.
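To make that concrete: the entire fault-free path is a few lines. A
minimal sketch (untested; error handling is thin, and the filename and
file offset are invented for illustration):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long pagesz = sysconf(_SC_PAGESIZE);
        int fd = open("bigfile.db", O_RDONLY);  /* hypothetical file */

        /* remap_file_pages() only works on MAP_SHARED mappings. */
        char *win = mmap(NULL, 16 * pagesz, PROT_READ, MAP_SHARED,
                         fd, 0);

        /*
         * Rebind the window to file page 12345 (invented offset).
         * The pagecache lookup and pte instantiation happen inside
         * the syscall itself, so the load below takes no fault.
         */
        if (remap_file_pages(win, 16 * pagesz, 0, 12345, 0))
                return 1;

        volatile char c = win[0];       /* already mapped: no fault */
        (void)c;
        return 0;
}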
Also, the assumption of being seek-bound presumes a cache turnover rate
high enough to resort to I/O often. If that were the case, the workload
would be unoptimizable apart from doubling the amount of pagecache it's
feasible to keep (which would provide only a slight advantage anyway).
The obvious but unstated assumption I made is that the virtual arena
(and hence the physical pagecache backing it) is large enough to
provide a high hit rate against the mapped on-disk data structures; the
virtualspace compaction is there to keep pagetables from competing with
pagecache for RAM.
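The shape that implies is a fixed arena of windows rebound on demand.
A sketch (untested; the struct, the names, and the rebind policy are
all invented here, and initialization is elided):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define NSLOTS  256     /* arena size is fixed, however big the file */

struct arena {
        char    *base;          /* NSLOTS*win_sz, mmap()ed MAP_SHARED */
        size_t  win_sz;         /* window size in bytes, page-aligned */
        size_t  pgoff[NSLOTS];  /* file page each slot currently maps */
};

/*
 * Map file page pgoff at the given slot and return its address.
 * Pagetable cost stays bounded by the arena no matter how large the
 * file is or how scattered the access pattern gets.
 */
static void *arena_bind(struct arena *a, int slot, size_t pgoff)
{
        char *va = a->base + (size_t)slot * a->win_sz;

        if (a->pgoff[slot] != pgoff) {
                if (remap_file_pages(va, a->win_sz, 0, pgoff, 0))
                        return NULL;
                a->pgoff[slot] = pgoff;
        }
        return va;
}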
On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> Until you produce practical results, IMHO the usage you advocated
> (using remap_file_pages to avoid doing big linear mappings that may
> allocate more ptes) sounds like completely vapourware overkill
> overdesign that won't last past emails. All in my humble opinion, of
> course. I've no problem with being wrong, I just don't buy what you
> say, since it is not obvious at all given the huge cost of entering
> and exiting the kernel, reaching the pagetables in software, mangling
> them, and flushing the tlb (on all common archs I assume this means
> flushing not just a range but the whole tlb, though it'd be slower
> even with a range-flush operation), compared to doing nothing with a
> static linear mapping (regardless of the fact that there are more
> ptes with a big linear mapping; I don't care about saving ram).
Space _does_ matter. Everywhere. All the time. And it's not just
virtualspace this time. Throw away space, and you go into page or
pagetable replacement.
I'd love to write an app that uses it properly to conserve space. And
saving space is saving time: time otherwise spent in page replacement
and waiting on I/O.
On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> If you really go and change the app to use remap_file_pages, rather
> than just compacting the vm side with remap_file_pages (which will
> waste lots of cpu power and run slower IMHO), you'd better introduce
> locality knowledge so the I/O side will have a chance to perform too,
> and the VM side will be improved as well, potentially sharing not
> only the same pmd but the same page (and after you do that, if you
> really need to save ram [not cpu], you can munmap/mmap at the same
> cost, but this is just a side note; I said I don't care about saving
> ram, I care about performing the fastest). reiserfs and other huge
> btree users have to do this locality stuff all the time with their
> trees, for example to avoid a directory being completely scattered
> across the tree, which would trigger a huge number of I/O seeks that
> may not even fit in buffercache. Without the locality information
> there would be no way for reiserfs to perform with big filesystems
> and many directories; this is just the simplest example I can think
> of among the huge btrees we use every day.
Filesystems don't use mmap() to get at their B-trees; they deal with
virtual discontiguity via lookup structures. The trees essentially are
scattered on-disk, though not so badly as general-purpose lookup
structures would be. The VM space cost of using mmap() on such on-disk
structures is exactly what this API addresses.
Also, mmap()/munmap() do not have costs equivalent to
remap_file_pages(): they don't instantiate ptes the way
remap_file_pages() does, so accesses afterward incur minor faults. So
the API addresses minor fault costs as well. There is no unprivileged
API that speculatively populates the ptes of a mapping and thereby
prevents minor faults the way remap_file_pages() does.
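For contrast, the unprivileged alternative looks like this (sketch,
reusing the invented names from the arena fragment above and assuming
the fd and page size are at hand):

/*
 * Rebinding arena_bind()'s window the mmap()/munmap() way: vma
 * teardown plus setup, and since mmap() instantiates no ptes, the
 * first access to each page afterward drops back into the kernel
 * with a minor fault.
 */
munmap(va, a->win_sz);
va = mmap(va, a->win_sz, PROT_READ, MAP_SHARED | MAP_FIXED,
          fd, (off_t)(pgoff * pagesz));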
Another advantage of the remap_file_pages() approach is that the
virtual arena can be made somewhat smaller than the actual pagecache,
which lets the VM freely reclaim the cold bits of the cache and thereby
avoid overcompeting with other applications. I.e., I take the exact
opposite tack: apps should not hog every resource of the machine they
can for raw speed, but should rather be "good citizens".
In principle, there are other ways to do these things in some cases,
e.g. O_DIRECT. It's not truly an adequate substitute, of course: not
only is the app then forced to deal with on-disk coherency itself, but
the mappings aren't shared with the pagecache (i.e. evictable via mere
writeback), and the cost of I/O is incurred on every miss of the
process' own cache of the on-disk data.
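i.e. the O_DIRECT version of the same access comes out as something
like this (sketch; the filename and offset are invented, and the
alignment constraints are device- and filesystem-dependent):

/* Fragment; O_DIRECT needs _GNU_SOURCE. */
int fd = open("bigfile.db", O_RDONLY | O_DIRECT);
size_t pgoff = 12345;           /* invented offset */
void *buf;

/* O_DIRECT wants an aligned buffer, offset, and length. */
posix_memalign(&buf, 4096, 4096);

/*
 * The app now owns the caching problem: every miss of its private
 * cache is a real device read, the buffer is anonymous memory shared
 * with nothing, and none of it can be trimmed by mere writeback.
 */
pread(fd, buf, 4096, (off_t)(pgoff * 4096));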
On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> Again, I don't care about saving ram, we're talking 64bit, I care about
> speed, I hope I already made this clear enough in the previous email.
This is a fundamentally mistaken set of priorities.
On Fri, Jul 04, 2003 at 02:40:00AM +0200, Andrea Arcangeli wrote:
> My arguments all sound pretty straightforward to me.
Sorry, this line of argument is specious.
As for the security issue, I'm not terribly interested, as rlimits on
the number of processes and on VSZ suffice (to get an idea of how much
pagetable memory you've entitled a user to, check
RLIMIT_NPROC*RLIMIT_AS*sizeof(pte_t)/PAGE_SIZE). This is primarily
about resource scalability and functionality, not benchmark-mania or
security.
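Spelled out as a userspace sketch (sizeof(pte_t) is a kernel-internal
type, so 8 bytes is assumed here, and RLIM_INFINITY handling is
elided):

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
        struct rlimit nproc, as;
        long pagesz = sysconf(_SC_PAGESIZE);

        getrlimit(RLIMIT_NPROC, &nproc);
        getrlimit(RLIMIT_AS, &as);

        /* RLIMIT_NPROC * RLIMIT_AS * sizeof(pte_t) / PAGE_SIZE */
        unsigned long long bound = (unsigned long long)nproc.rlim_cur
                        * (as.rlim_cur / pagesz) * 8;

        printf("pagetable bytes one user can demand: %llu\n", bound);
        return 0;
}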
-- wli