From: Andrea Arcangeli <andrea@suse.de>
To: Ingo Molnar <mingo@redhat.com>
Cc: Andrew Morton <akpm@digeo.com>,
mbligh@aracnet.com, mingo@elte.hu, hugh@veritas.com,
dmccr@us.ibm.com, Linus Torvalds <torvalds@transmeta.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: objrmap and vmtruncate
Date: Tue, 22 Apr 2003 14:37:19 +0200
Message-ID: <20030422123719.GH23320@dualathlon.random>
In-Reply-To: <Pine.LNX.4.44.0304220618190.24063-100000@devserv.devel.redhat.com>
On Tue, Apr 22, 2003 at 07:00:05AM -0400, Ingo Molnar wrote:
>
> On Sat, 5 Apr 2003, Andrew Morton wrote:
>
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > >
> > > I see what you mean, you're right. That's because all the 10,000 vma
> > > belongs to the same inode.
> >
> > I see two problems with objrmap - this search, and the complexity of the
> > interworking with nonlinear mappings.
> >
> > There is talk going around about implementing some more sophisticated
> > search structure than a linear list.
> >
> > And treating the nonlinear mappings as being mlocked is a great
> > simplification - I'd be interested in Ingo's views on that.
>
> i believe the right direction is the one that is currently happening: to
> make nonlinear mappings more generic. sys_remap_file_pages() started off
> as a special hack mostly usable for locked down pages. Now it's directly
> encoded in the pte and thus swappable, and uses up a fraction of the vma
> cost for finegrained mappings.
>
> (i believe the next step should be to encode permission bits into the pte
> as well, and thus enable eg. mprotect() to work without splitting up vmas.
> On 32-bit ptes this is not realistic due to the file size limit imposed,
> but once 64-bit ptes become commonplace it's a step worth taking i
> believe.)
>
> the O(N^2) property of objrmap where N is the 'inode sharing factor' is a
> serious design problem i believe. 100 mappings in 100 contexts on the same
> inode is not uncommon at all - still it totally DoS-es the VM's scanning
> code, if it uses objrmap. Sure, rmap is O(N) - after all we do have 100
> users of that mapping.
>
> If the O(N^2) can be optimized away then i'm all for it. If not, then i
> dont really understand how the same people who call sys_remap_file_pages()
> a 'hack' [i believe they are not understanding the current state of the

it's a hack primarily because you're mixing linear with nonlinear, and
incidentally that breaks truncate as well. In the current state truncate
is malfunctioning. To make truncate work in the current state you
would need to check page->index for every page pointed to by the
pagetables belonging to each vma linked in the objrmap.

I don't think anybody wants to slow down truncate like that (I mean, with
partial truncates and huge vmas).
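
To make the cost difference concrete, here is a toy userspace model, not
kernel code. The struct and function names are invented for illustration
and pte_index[] merely stands in for page->index. In a linear vma the
pages past the new end of file sit in one virtual range that can be
computed arithmetically, while a nonlinear vma forces a look at every
pte.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a file mapping; all names here are hypothetical. */
struct toy_vma {
	size_t npages;           /* virtual pages covered by the vma */
	size_t pgoff;            /* file page at virtual page 0 (linear case) */
	const size_t *pte_index; /* per-pte file page index (nonlinear case) */
};

/* Linear vma: the range to unmap follows from pgoff, no pte is inspected. */
size_t truncate_linear(const struct toy_vma *vma, size_t new_pages,
		       size_t *inspected)
{
	size_t end = vma->pgoff + vma->npages;

	*inspected = 0;
	if (new_pages >= end)
		return 0;               /* truncation point is past the vma */
	if (new_pages <= vma->pgoff)
		return vma->npages;     /* whole vma is beyond the new size */
	return end - new_pages;         /* tail of the vma goes away */
}

/* Nonlinear vma: any pte may point past the new size, so scan them all. */
size_t truncate_nonlinear(const struct toy_vma *vma, size_t new_pages,
			  size_t *inspected)
{
	size_t i, zapped = 0;

	for (i = 0; i < vma->npages; i++)
		if (vma->pte_index[i] >= new_pages)
			zapped++;
	*inspected = vma->npages;
	return zapped;
}
```

Both functions return how many ptes would be zapped; *inspected shows
the work done, which grows with the vma size only in the nonlinear case.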

Fixing it so truncate still works at the current speed (when you don't
use sys_remap_file_pages) means changing the API to be sane and at the
very least to stop mixing linear with nonlinear vmas.

And I find it very unclean anyway that you can mangle a linear vma and
have it partly linear and partly nonlinear. nonlinear vmas are
special; if they were not special we would not break anything with
the nonlinear behaviour inside a linear vma.

At the very least you need a mmap(VM_NONLINEAR) to allocate the
nonlinear virtual space, and to have sys_remap_file_pages work only
inside this space.
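
A minimal sketch of that rule, as a userspace toy rather than a patch:
TOY_VM_NONLINEAR, struct toy_vma and toy_remap_file_pages() are names
made up for this example, and the real check would of course live in
the kernel next to the vma lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_VM_NONLINEAR 0x1u	/* hypothetical flag set at mmap time */

/* Toy vma: [start, end) plus flags; not the kernel structure. */
struct toy_vma {
	size_t start, end;
	unsigned int flags;
};

/*
 * Accept a remap request only if it lies fully inside a vma that was
 * created as nonlinear; a linear vma is never mangled.
 */
int toy_remap_file_pages(const struct toy_vma *vma, size_t addr, size_t len)
{
	if (len == 0 || addr < vma->start || addr >= vma->end ||
	    len > vma->end - addr)
		return -EINVAL;		/* outside the vma */
	if (!(vma->flags & TOY_VM_NONLINEAR))
		return -EINVAL;		/* linear vmas stay linear */
	return 0;
}
```

With this split truncate can keep its fast path for every vma without
the flag, and only the explicitly nonlinear ones need special handling.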

This was one of my first conditions for considering letting
sys_remap_file_pages stay in the kernel as a sane API. The other points
are lower prio actually.

As for the other points I still think the whole purpose of
sys_remap_file_pages is to bypass the VM entirely so it should have the
least possible hardware cost associated with it. It is meant only to
mangle pagetables from userspace. And sys_remap_file_pages has nothing
to do with rmap or objrmap btw (that is an issue for everything, not
just this). But since the whole purpose of sys_remap_file_pages is to
bypass the VM entirely and to make it as fast as possible, we should
also turn off paging to let people get the biggest advantage out of
sys_remap_file_pages, and allow passing the file descriptor to
sys_remap_file_pages as well, so that you can map multiple files in the
same vma. I think allowing multiple files makes perfect sense and the
lack of this additional important feature is a concern to me.

Also sys_remap_file_pages should try to use largepages to map the
pagecache as well, as far as the alignment and the largepage pool allow
it. That makes perfect sense.

As for bochs it will have no problem enabling a system-wide sysctl
before running; that's much cleaner than loading two kernel modules.

Overall trying to make nonlinear a usable-by-default generic API looks
wrong to me; sys_remap_file_pages has to be a VM bypass or it has to go.
If you want it to stay as a possibly default generic API then drop the
vma entirely and have mmap(), mprotect() and mlock() not generate any
vma overhead, but have them generate nonlinear stuff inside a single
whole vma for the whole address space. If you can do everything
generically (as you seem to want to reach) with sys_remap_file_pages,
then do it with the current API w/o generating a new non-standard API.
It's a matter of functionality inside the kernel; if you can do
everything w/o vma, then drop the vma from mmap, that's all.
sys_remap_file_pages is equivalent to a mmap(MAP_FIXED) anyway.
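
For reference, this is roughly what the mmap(MAP_FIXED) equivalent looks
like from userspace, using only standard calls. The function name and
the temp file path are arbitrary choices for the example, and it
assumes /tmp is writable. It rebinds the first virtual page of a
two-page window to the second page of the file.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if the rebinding worked as expected, -1 otherwise. */
int remap_with_map_fixed(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	char path[] = "/tmp/remap-sketch-XXXXXX";
	char *win, *p;
	int fd, ok;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);			/* file goes away when fd is closed */
	if (ftruncate(fd, 2 * psz) != 0) {
		close(fd);
		return -1;
	}

	/* Linear window over file pages 0 and 1. */
	win = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (win == MAP_FAILED) {
		close(fd);
		return -1;
	}
	win[0] = 'A';			/* first byte of file page 0 */
	win[psz] = 'B';			/* first byte of file page 1 */

	/* Rebind virtual page 0 to file page 1, in place. */
	p = mmap(win, psz, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_FIXED,
		 fd, psz);
	ok = (p == win) && win[0] == 'B' && win[psz] == 'B';

	munmap(win, 2 * psz);
	close(fd);
	return ok ? 0 : -1;
}
```

The visible result is the same as a remap of that page; the difference
is on the kernel side, where the MAP_FIXED call typically leaves two
vmas behind instead of one.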
I'm not against making mmap faster or whatever, but sys_remap_file_pages
makes sense to me only as a VM bypass, something that will always be
faster than the regular mmap or whatever by bypassing the VM. If you
don't bypass the VM you should make mmap run as fast as
sys_remap_file_pages instead IMHO.
Andrea