From: Ingo Molnar <mingo@redhat.com>
To: Andrew Morton <akpm@digeo.com>
Cc: Andrea Arcangeli <andrea@suse.de>,
mbligh@aracnet.com, mingo@elte.hu, hugh@veritas.com,
dmccr@us.ibm.com, Linus Torvalds <torvalds@transmeta.com>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: objrmap and vmtruncate
Date: Tue, 22 Apr 2003 07:00:05 -0400 (EDT)
Message-ID: <Pine.LNX.4.44.0304220618190.24063-100000@devserv.devel.redhat.com>
In-Reply-To: <20030405143138.27003289.akpm@digeo.com>
On Sat, 5 Apr 2003, Andrew Morton wrote:
> Andrea Arcangeli <andrea@suse.de> wrote:
> >
> > I see what you mean, you're right. That's because all the 10,000 vmas
> > belong to the same inode.
>
> I see two problems with objrmap - this search, and the complexity of the
> interworking with nonlinear mappings.
>
> There is talk going around about implementing some more sophisticated
> search structure than a linear list.
>
> And treating the nonlinear mappings as being mlocked is a great
> simplification - I'd be interested in Ingo's views on that.
i believe the right direction is the one that is currently happening: to
make nonlinear mappings more generic. sys_remap_file_pages() started off
as a special hack mostly usable for locked down pages. Now it's directly
encoded in the pte and thus swappable, and uses up a fraction of the vma
cost for finegrained mappings.
(i believe the next step should be to encode permission bits into the pte
as well, and thus enable eg. mprotect() to work without splitting up vmas.
With 32-bit ptes this is not realistic, because bits spent on permissions
come out of the file offset and shrink the file size limit further, but
once 64-bit ptes become commonplace it's a step worth taking i believe.)
the O(N^2) property of objrmap where N is the 'inode sharing factor' is a
serious design problem i believe. 100 mappings in 100 contexts on the same
inode is not uncommon at all - still it totally DoS-es the VM's scanning
code, if it uses objrmap. Sure, rmap is O(N) - after all we do have 100
users of that mapping.
If the O(N^2) can be optimized away then i'm all for it. If not, then i
don't really understand how the same people who call sys_remap_file_pages()
a 'hack' [i believe they are not understanding the current state of the
API] can argue for objrmap in the same paragraph.
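One way to read the O(N^2) claim above is as a simple counting model. The sketch below is not kernel code; it assumes N contexts that each map the same inode through N vmas, with every vma covering a different window of the file, so that a given page is mapped at most once per context. The function names are hypothetical.

```c
#include <assert.h>

/* Model only: objrmap has no per-page list, so finding the ptes of one
 * page means walking every vma attached to the inode and checking
 * whether that vma covers the page. */
static unsigned long objrmap_visits(unsigned contexts, unsigned mappings)
{
    unsigned long visits = 0;

    assert(contexts > 0 && mappings > 0);
    for (unsigned c = 0; c < contexts; c++)
        for (unsigned m = 0; m < mappings; m++)
            visits++;           /* one "does this vma cover the page?" check */
    return visits;
}

/* Model only: a pte chain lists exactly the ptes that map the page,
 * which under the stated assumption is one per context. */
static unsigned long rmap_visits(unsigned contexts)
{
    return contexts;
}
```

Under these assumptions, 100 mappings in each of 100 contexts give 10,000 vma visits per page for objrmap against 100 chain entries for rmap.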
i believe the main problem wrt. rmap is the pte_chain lowmem overhead on
32-bit systems. (it also causes some fork() runtime overhead, but i doubt
anyone these days should argue that fork() latency is a commanding
parameter to optimize the VM for. We have vfork() and good threading, and
any fork()-sensitive app uses preforking anyway.)
to solve this problem i believe the pte chains should be made
double-linked lists, and should be organized in a completely different
(and much simpler) way: in a 'companion page' to the actual pte page. The
companion page stores the pte-chain links, corresponding directly to the
pte in the pagetable. Ie. if we have pte #100 in the pagetable, then we
look at entry #100 in the companion page. [the size of the companion area
is platform-dependent, eg. on PAE x86 it's a single page, on 64-bit platforms
it's two pages most of the time.] That entry then points to the 'next' and
'previous' pte in the pte chain. [the pte pagetable page itself has
pointers towards the companion page(s) in the struct page itself, existing
fields can be reused for this.]
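The index correspondence described above can be sketched in a few lines of userspace C. This is an illustration under stated assumptions, not the proposed kernel implementation: the type and helper names are hypothetical, the sizes assume a 4K page with 32-bit ptes, and the links point at other companion entries rather than at ptes directly to keep the sketch self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_PAGE 1024      /* assumption: 4K page, 32-bit ptes */

typedef uint32_t pte_t;

/* One companion entry per pte slot: neighbours in the pte chain.
 * Two pointers, so 8 bytes on 32-bit and 16 bytes on 64-bit. */
struct chain_link {
    struct chain_link *next;
    struct chain_link *prev;
};

/* A pte page plus its companion. In the proposal the pointer to the
 * companion would live in the pte page's struct page; here it is a
 * plain field for simplicity. */
struct pte_page {
    pte_t ptes[PTES_PER_PAGE];
    struct chain_link *companion;       /* PTES_PER_PAGE entries */
};

/* pte #idx in the pagetable page maps to entry #idx in the companion. */
static struct chain_link *pte_to_link(struct pte_page *pp, size_t idx)
{
    assert(idx < PTES_PER_PAGE);
    return &pp->companion[idx];
}

/* Reverse direction: recover the pte slot from a companion entry. */
static size_t link_to_index(struct pte_page *pp, struct chain_link *l)
{
    return (size_t)(l - pp->companion);
}
```

Because the mapping is pure index arithmetic, no per-entry allocation or lookup structure is needed once the companion has been allocated alongside the pagetable page.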
This simpler pte chain construct also makes it easy to high-map the pte
chains: whenever we high-map the pte page, we can high-map the pte chain
page(s) as well. No more lowmem overhead for pte chains.
It also makes it easy to calculate the overhead of the pte chains: twice
the amount of pagetable overhead. Ie. with 32-bit ptes it's +8 bytes
overhead, or +0.2% of RAM overhead per mapped page, using a 4K page. With
64-bit ptes on 32-bit platforms (PAE), the overhead is still 8 bytes. On
64-bit platforms using 8K pages the overhead is still +0.2% of RAM, in
addition to the 0.1% of RAM overhead for the pte itself. The worst case
is 64-bit platforms with a 4K pagesize; there the overhead is +0.4% of
RAM, in addition to the 0.2% overhead caused by the pte itself.
(as a comparison, for finegrained mappings, if a single page is mapped by
a single vma, the 64-byte overhead of the vma causes a +1.5% overhead.)
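The percentages above follow from dividing the per-pte metadata size by the page size. A minimal sketch of that arithmetic, with hypothetical helper names and assuming the chain entry is exactly two pointers (next and prev):

```c
#include <assert.h>

/* Two chain pointers (next and prev) per pte slot. */
static unsigned chain_bytes_per_pte(unsigned pointer_size)
{
    return 2 * pointer_size;
}

/* Overhead of `bytes` of metadata per mapped page, as a percentage. */
static double overhead_percent(unsigned bytes, unsigned page_size)
{
    assert(page_size > 0);
    return 100.0 * bytes / page_size;
}
```

With 4-byte pointers and 4K pages this gives 8 bytes, roughly 0.2%. With 8-byte pointers it gives 16 bytes, roughly 0.2% on 8K pages and 0.4% on 4K pages. The vma comparison is overhead_percent(64, 4096), roughly 1.56%.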
so i think it's doable, and it solves many of the hairy allocation
deadlock issues wrt. pte-chains - the 'companion pages' hosting the pte
chain back and forward pointers can be allocated at the same time a
pagetable page is allocated. I believe this approach also greatly reduces
the complexity of pte chains, plus it makes unmap-time O(1) unlinking of
pte chains possible, if we can live with the RAM overhead (which would
scale linearly with the already existing pagetable overhead).
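The O(1) unlinking mentioned above is the standard property of a doubly-linked list: an entry can remove itself without walking the chain. A minimal sketch, assuming a circular list whose head would sit with the mapped page; the names are hypothetical and this is not taken from any kernel patch.

```c
#include <assert.h>
#include <stddef.h>

struct chain_link {
    struct chain_link *next;
    struct chain_link *prev;
};

/* An empty chain is a head that points at itself. */
static void chain_init(struct chain_link *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert a pte's link right after the head. */
static void chain_add(struct chain_link *head, struct chain_link *l)
{
    l->next = head->next;
    l->prev = head;
    head->next->prev = l;
    head->next = l;
}

/* Unlink in constant time: only the two neighbours are touched. */
static void chain_del(struct chain_link *l)
{
    assert(l->next != NULL && l->prev != NULL);
    l->prev->next = l->next;
    l->next->prev = l->prev;
    l->next = l;
    l->prev = l;
}
```

At unmap time the pte's index gives its companion entry directly, and chain_del() on that entry needs no search through the other mappers of the page.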
Ingo