* 2.4 / 2.5 VM plans
@ 2000-06-25 3:51 Rik van Riel
2000-06-28 17:45 ` vii
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Rik van Riel @ 2000-06-25 3:51 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Stephen C. Tweedie, linux-mm
Hi,
since I've heard some rumours of you folks having come
up with nice VM ideas at USENIX and since I've been
working on various VM things (and experimental 2.5 things)
for the last months, maybe it's a good idea to see which
of your ideas have already been put into code and to see
which ideas fit together or are mutually exclusive. :)
To start the discussion, here's my flameba^Wlist of ideas:
2.4:
1) re-introduce page aging, my small and simple experiments
seem to indicate that page aging takes *less* cpu time
than copying pages to/from highmem all the time (let alone
making your applications wait for disk because we replaced
the wrong page last time)
2) fix the latency problems of applications calling shrink_mmap
and flushing infinite amounts of pages (mostly fixed)
3) separate page replacement (page aging) and page flushing,
currently we'll happily free a referenced clean page just
because the unreferenced pages haven't been flushed to disk
yet ... this is very bad since the unreferenced pages often
turn out to be things like executable code
we could achieve this by augmenting the current MM subsystem
with an inactive and scavenge list, in the process splitting
shrink_mmap() into three better readable functions ... I have
this mostly done
4) fix balance_dirty() to include inactive pages and have kflushd
help kswapd by proactively flushing some of the inactive pages
_before_ we run into trouble
5) implement some form of write throttling for VMAs so it'll be
impossible for big mmap()s, etc, to completely fill memory
with dirty pages
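
A minimal sketch of items 1 and 3 taken together, written as a userspace
Python toy rather than kernel code. The constants (AGE_START, AGE_ADV,
AGE_MAX), the Page class and both function names are assumptions made for
illustration; none of them come from an actual patch. It shows aging
deciding which pages go inactive, while dirty inactive pages are only
queued for flushing, so replacement never has to free a referenced page
just because flushing is behind.

```python
from dataclasses import dataclass

AGE_START = 2   # assumed initial age of a page
AGE_ADV = 3     # assumed bonus when the page was referenced since the last scan
AGE_MAX = 64    # assumed cap on the age counter


@dataclass
class Page:
    name: str
    age: int = AGE_START
    referenced: bool = False
    dirty: bool = False


def age_pages(active, inactive):
    """One aging pass: reward referenced pages, halve the age of the rest.

    Pages whose age reaches zero move from the active to the inactive list.
    """
    still_active = []
    for page in active:
        if page.referenced:
            page.age = min(page.age + AGE_ADV, AGE_MAX)
            page.referenced = False
        else:
            page.age //= 2
        (inactive if page.age == 0 else still_active).append(page)
    active[:] = still_active


def reclaim(inactive, flush_queue):
    """Free clean inactive pages; hand dirty ones to a separate flush queue."""
    freed = []
    for page in inactive:
        if page.dirty:
            flush_queue.append(page)
        else:
            freed.append(page.name)
    inactive.clear()
    return freed


active = [
    Page("libc text", referenced=True),
    Page("old log", dirty=True),
    Page("tmp file"),
]
inactive, flush_queue = [], []

age_pages(active, inactive)   # libc text is rewarded, the others decay
age_pages(active, inactive)   # the unreferenced pages reach age zero
freed = reclaim(inactive, flush_queue)

print("still active:", [p.name for p in active])
print("freed:", freed)
print("queued for flush:", [p.name for p in flush_queue])
```

The referenced page survives both passes, the clean unreferenced page is
freed at once, and the dirty one waits on the flush queue instead of
blocking replacement.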
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
* Re: 2.4 / 2.5 VM plans
2000-06-25 3:51 2.4 / 2.5 VM plans Rik van Riel
@ 2000-06-28 17:45 ` vii
2000-06-28 21:04 ` Juan J. Quintela
2000-06-28 21:17 ` Juan J. Quintela
2000-06-29 13:44 ` Stephen C. Tweedie
2 siblings, 1 reply; 8+ messages in thread
From: vii @ 2000-06-28 17:45 UTC (permalink / raw)
To: linux-mm
Rik van Riel <riel@conectiva.com.br> writes:
[...]
> To start the discussion, here's my flameba^Wlist of ideas:
Seeing as not much discussion has resulted (if so it missed my
mailbox), I'll stick my neck out to agree.
[...]
> 3) separate page replacement (page aging) and page flushing,
Definitely!
> currently we'll happily free a referenced clean page just
> because the unreferenced pages haven't been flushed to disk
> yet ... this is very bad since the unreferenced pages often
> turn out to be things like executable code
>
> we could achieve this by augmenting the current MM subsystem
> with an inactive and scavenge list, in the process splitting
Yes! Please!
IMHO another really cool side-effect will be getting rid of the
vmscan.c:swap_out algorithm (at least as far as I understand).
> shrink_mmap() into three better readable functions ... I have
> this mostly done
[...]
BTW, Is there any timescale for integrating page coloring? Someone
produced a patch somewhere (IIRC specifically for the alpha, sorry to
be so vague).
--
http://altern.org/vii
* Re: 2.4 / 2.5 VM plans
2000-06-28 17:45 ` vii
@ 2000-06-28 21:04 ` Juan J. Quintela
0 siblings, 0 replies; 8+ messages in thread
From: Juan J. Quintela @ 2000-06-28 21:04 UTC (permalink / raw)
To: vii; +Cc: linux-mm
>>>>> "vii" == vii <vii@penguinpowered.com> writes:
Hi
>> 3) separate page replacement (page aging) and page flushing,
vii> Definitely!
I have done part of this work with my write deferred swap (I will port
it to test3 ASAP). Deferring the swap write also helps. It is related
to your question about removing the swap_out function, which in turn is
tied to the scanning and to how the several lists are set up.
vii> BTW, Is there any timescale for integrating page coloring? Someone
vii> produced a patch somewhere (IIRC specifically for the alpha, sorry to
vii> be so vague).
There was a page colouring patch from someone at DEC^WCompaq, and
another one from David Miller. The one from Compaq appeared to have
some problems with some workloads (see the comments from Dave Miller,
I think on this list). I haven't seen David's, so I can't comment
on it. But I suppose that the integration will be a 2.5 thing
(Wild, wild guess).
Later, Juan.
--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy
* Re: 2.4 / 2.5 VM plans
2000-06-25 3:51 2.4 / 2.5 VM plans Rik van Riel
2000-06-28 17:45 ` vii
@ 2000-06-28 21:17 ` Juan J. Quintela
2000-06-29 13:45 ` Stephen C. Tweedie
2000-06-29 13:44 ` Stephen C. Tweedie
2 siblings, 1 reply; 8+ messages in thread
From: Juan J. Quintela @ 2000-06-28 21:17 UTC (permalink / raw)
To: Rik van Riel; +Cc: Linus Torvalds, Stephen C. Tweedie, linux-mm
>>>>> "rik" == Rik van Riel <riel@conectiva.com.br> writes:
Hi
rik> 2.4:
6) Integrate the shm code in the page cache, to avoid having Yet
Another Cache to balance.
2.5:
7) Make a ->flush method in the address_space operations, Rik
mentioned it in some previous mail, it should return the number of
pages that it has flushed. That would make shrink_mmap code (or
its successor) more readable, as we don't have to add new code each
time that we add a new type of page to the page cache.
8) This one is related with the FS, not MM specific, but FS people
want to be able to allocate MultiPage buffers (see pagebuf from
XFS) and people want similar functionality for other things.
Perhaps we need to find some clean way to do that. For instance,
if the FS tells us that it wants a buffer of 4 pages, it is quite
obvious how to do write clustering for a page in that buffer; we
can use that information.
9) We need also to implement write clustering for fs/page cache/swap.
Right now we have _no_ limit on the amount of IO that we start,
which means that if all of memory is full of dirty pages, we
can have a _big_ stall while we wait for all the pages to be
written to disk, and yes, that happens with the current code.
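
A rough sketch of how items 7 and 9 could fit together, as a Python toy
and not kernel code. The AddressSpace class, the `budget` argument to
flush() and the flush_some() helper are all hypothetical; the only part
taken from the text above is that ->flush returns the number of pages it
flushed. The generic loop never needs to know what kind of pages a
mapping holds, and the returned counts let it cap the IO it starts.

```python
class AddressSpace:
    """Hypothetical stand-in for a mapping that provides a ->flush operation."""

    def __init__(self, name, dirty_pages):
        self.name = name
        self.dirty_pages = list(dirty_pages)

    def flush(self, budget):
        """Start writeback for at most `budget` dirty pages.

        Returns the number of pages flushed, as item 7 asks for.
        """
        count = min(budget, len(self.dirty_pages))
        del self.dirty_pages[:count]
        return count


def flush_some(mappings, io_limit):
    """Generic flusher: call ->flush on each mapping until io_limit is reached."""
    started = 0
    for mapping in mappings:
        if started >= io_limit:
            break
        started += mapping.flush(io_limit - started)
    return started


file_cache = AddressSpace("file", range(5))
shm = AddressSpace("shm", range(4))

started = flush_some([file_cache, shm], io_limit=6)
print("IO started:", started)
print("left dirty:", len(file_cache.dirty_pages), len(shm.dirty_pages))
```

A new page type would only have to supply its own flush(); flush_some()
stays unchanged.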
Later, Juan.
--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy
* Re: 2.4 / 2.5 VM plans
2000-06-25 3:51 2.4 / 2.5 VM plans Rik van Riel
2000-06-28 17:45 ` vii
2000-06-28 21:17 ` Juan J. Quintela
@ 2000-06-29 13:44 ` Stephen C. Tweedie
2000-07-06 7:51 ` page_table_lock problem [was: Re: 2.4 / 2.5 VM plans] Andrey Savochkin
2 siblings, 1 reply; 8+ messages in thread
From: Stephen C. Tweedie @ 2000-06-29 13:44 UTC (permalink / raw)
To: Rik van Riel; +Cc: Linus Torvalds, Stephen C. Tweedie, linux-mm
Hi,
On Sun, Jun 25, 2000 at 12:51:42AM -0300, Rik van Riel wrote:
>
> since I've heard some rumours of you folks having come
> up with nice VM ideas at USENIX and since I've been
> working on various VM things (and experimental 2.5 things)
> for the last months, maybe it's a good idea to see which
> of your ideas have already been put into code and to see
> which ideas fit together or are mutually exclusive. :)
Right. :-) The following includes a lot of the stuff that Ben and I
bashed out at Usenix.
I don't count this as new feature stuff --- most of what follows is
just identifying places where the current VM is plain broken!
> 1) re-introduce page aging,
OK.
> 2) fix the latency problems of applications calling shrink_mmap
> and flushing infinite amounts of pages (mostly fixed)
Right, but it can't be _that_ hard to keep a persistent track of how
much of the cache has changed since the last time you looked at it.
We ought to be able to be much more aggressive about pruning
unnecessary lru list walks.
> 3) separate page replacement (page aging) and page flushing,
YES!!!. But then again I just said as much on linux-mm in reply to
another recent post. :-)
> 4) fix balance_dirty() to include inactive pages
No. balance_dirty() and page cache dirty page management are
completely different. Utterly different. balance_dirty() only has
business doing early flush and/or flow control on buffer_heads,
nothing else. (At least not until we have a write-behind mechanism
for pages which is independent of the buffer cache; say, if NFS
write-behind gets integrated into the mainstream write-behind code.)
> 5) implement some form of write throttling for VMAs so it'll be
> impossible for big mmap()s, etc, to completely fill memory
> with dirty pages
Right. This is necessary, but is orthogonal to the other problems. A
large part of (5) comes for free, however, if we are strict about
keeping a minimum (load-dependent) number of clean, unmapped pages
around on the VM's clean lru-list; separating out page aging and
unmapping from the flushing code fixes a lot of this anyway by
preventing dirty pages from occupying the whole of memory.
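
To make the "minimum number of clean pages" point concrete, here is a
small Python model, not kernel code. The Memory class, the fixed
min_clean floor (the text asks for a load-dependent one) and the
writeback batch size are assumptions chosen for illustration. A writer
that would push the clean reserve below its floor has to wait for
writeback first, which bounds the number of dirty pages.

```python
class Memory:
    """Toy model: every page is either clean or dirty."""

    def __init__(self, total_pages, min_clean):
        self.clean = total_pages
        self.dirty = 0
        self.min_clean = min_clean   # assumed fixed floor
        self.throttled = 0           # times a writer had to wait for writeback

    def writeback(self, count):
        """Flush up to `count` dirty pages, making them clean again."""
        count = min(count, self.dirty)
        self.dirty -= count
        self.clean += count
        return count

    def dirty_page(self, batch=4):
        """Dirty one page, forcing writeback first if the reserve is at its floor."""
        if self.clean <= self.min_clean:
            self.throttled += 1
            self.writeback(batch)
        self.clean -= 1
        self.dirty += 1


mem = Memory(total_pages=16, min_clean=4)
peak_dirty = 0
lowest_clean = mem.clean
for _ in range(100):              # a big mmap() writer dirtying 100 pages
    mem.dirty_page()
    peak_dirty = max(peak_dirty, mem.dirty)
    lowest_clean = min(lowest_clean, mem.clean)

print("peak dirty:", peak_dirty, "lowest clean:", lowest_clean)
print("writer throttled", mem.throttled, "times")
```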
Other things to consider:
* The page aging loops need to have early break-out when
the number of free pages suddenly increases (exit, munmap,
whatever);
* The page stealer shouldn't block just because kswapd is blocked on
synchronous swapping (this comes for free if we have separate page
flushing)
* shrink_dentry should probably skip inodes which have still got pages
attached, as otherwise we get a lot of unnecessary cache flushes
* We MUST quantify the current VM pressure as a way of controlling
page aging. That way aging can be proactive under load, but we
don't necessarily have to evict pages from memory too early (we can
age pages without flushing them).
* RSS accounting needs to be audited. Right now, the per-mm rss isn't
an atomic type, and it doesn't seem to be consistently protected by
the page table locks.
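
The RSS point can be illustrated with a userspace analogy in Python; it
is not kernel code. The MmStruct class and its add_rss() helper are
hypothetical names. The sketch shows the discipline being asked for:
every path that changes rss, whether fault or swapout, takes the same
lock around the update.

```python
import threading


class MmStruct:
    """Toy stand-in for a per-process mm with an rss counter."""

    def __init__(self, rss=0):
        self.rss = rss
        self.page_table_lock = threading.Lock()

    def add_rss(self, delta):
        # Every updater takes the same lock, so no update can be lost.
        with self.page_table_lock:
            self.rss += delta


def fault_in(mm, pages):
    """Page fault path: each fault maps one more page."""
    for _ in range(pages):
        mm.add_rss(1)


def swap_out(mm, pages):
    """Swapout path: each eviction unmaps one page."""
    for _ in range(pages):
        mm.add_rss(-1)


mm = MmStruct(rss=3000)
threads = [
    threading.Thread(target=fault_in, args=(mm, 5000)),
    threading.Thread(target=fault_in, args=(mm, 5000)),
    threading.Thread(target=swap_out, args=(mm, 3000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("final rss:", mm.rss)
```

With the lock held around each update the final count is exactly the
starting value plus faults minus evictions, whatever the interleaving.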
A few other ideas Ben and I threw about are much more long-term.
1) We think it should be possible to share page tables for
large shared mmaps (think of libc and big sysv shm segments).
2) We can do reverse pte maps pretty cheaply by the following:
* Reverse maps for shared mmaps are easy enough by following the
per-inode vma list
* The pte for unshared anon pages can be encoded in the page struct
easily.
* Shared anon pages are the tricky ones; but it's simple to maintain a
hash list of all such ptes, and there aren't many in a typical
system. Fork() is, of course, the one place where lots of these
occur, but we can minimise the number of shared anon pages over
fork by implementing COW on page tables (that way, we share the page
tables but NOT the pages!)
3) Think about having a list of all page tables in memory. With
that, we can do aging in the VM without *EVER* having to walk
through vmas at all: we can walk through the ptes in the system
performing atomic bitops on the ptes and age counts without caring
about the higher level layers until a given page's age reaches
zero. Only at that point do we care about invoking the swapper
for that page's vma.
Food for thought. 3) in particular seems to open up a whole new set
of possibilities, but it's definitely something for an experimental
post-2.4 branch. :-)
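
A toy illustration of idea 3), in Python rather than kernel code. The Pte
class, the AGE_ADV and AGE_MAX constants and the age_all_ptes() function
are assumptions for the sake of the example. The walker visits every
page table directly, never looks at a vma while aging, and only reports
a (vma, page) pair once that pte's age has dropped to zero.

```python
from dataclasses import dataclass

AGE_ADV = 3   # assumed bonus for a pte whose accessed bit was set
AGE_MAX = 8   # assumed cap on the age counter


@dataclass
class Pte:
    page: str
    vma: str              # only consulted once the age hits zero
    accessed: bool = False
    age: int = 2


def age_all_ptes(page_tables):
    """Walk every pte in every page table and age it.

    Returns the (vma, page) pairs whose age just reached zero, which is
    the point where the swapper would be invoked for that vma.
    """
    victims = []
    for table in page_tables:
        for pte in table:
            if pte.accessed:
                pte.accessed = False   # stands in for an atomic test-and-clear
                pte.age = min(pte.age + AGE_ADV, AGE_MAX)
            elif pte.age > 0:
                pte.age -= 1
                if pte.age == 0:
                    victims.append((pte.vma, pte.page))
    return victims


page_tables = [
    [Pte("A", "libc", accessed=True), Pte("B", "heap")],
    [Pte("C", "shm")],
]

first_pass = age_all_ptes(page_tables)
second_pass = age_all_ptes(page_tables)
print("first pass:", first_pass)
print("second pass:", second_pass)
```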
Cheers,
Stephen
* Re: 2.4 / 2.5 VM plans
2000-06-28 21:17 ` Juan J. Quintela
@ 2000-06-29 13:45 ` Stephen C. Tweedie
0 siblings, 0 replies; 8+ messages in thread
From: Stephen C. Tweedie @ 2000-06-29 13:45 UTC (permalink / raw)
To: Juan J. Quintela
Cc: Rik van Riel, Linus Torvalds, Stephen C. Tweedie, linux-mm
Hi,
On Wed, Jun 28, 2000 at 11:17:57PM +0200, Juan J. Quintela wrote:
> 2.5:
>
> 7) Make a ->flush method in the address_space operations
OK
> 8) This one is related with the FS, not MM specific, but FS people
> want to be able to allocate MultiPage buffers (see pagebuf from
> XFS) and people want similar functionality for other things.
Yes, but this should be layered on top of the page handling ---
there's no need to integrate it into the low levels of the page cache.
> 9) We need also to implement write clustering for fs/page cache/swap.
Same as above. When the pagebuf layer or whatever gets a write
request for a given page, it is perfectly at liberty to write out
adjacent pages too if it wants to. The VM doesn't have to enforce
that itself.
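
As a sketch of what "write out adjacent pages too" could look like in
the layer above the VM (Python, illustrative only): write_cluster(), its
max_pages limit and the set of dirty page indices are all hypothetical
and are not taken from pagebuf or any real interface.

```python
def write_cluster(index, dirty, max_pages=4):
    """Write page `index` plus adjacent dirty pages, at most max_pages in total.

    `dirty` is a set of dirty page indices; written pages are removed from it.
    Returns the list of page indices written, in ascending order.
    """
    if index not in dirty:
        return []
    lo = hi = index
    # Grow the run while a neighbour is dirty and the size limit allows it.
    while hi - lo + 1 < max_pages:
        if hi + 1 in dirty:
            hi += 1
        elif lo - 1 in dirty:
            lo -= 1
        else:
            break
    cluster = list(range(lo, hi + 1))
    dirty.difference_update(cluster)
    return cluster


dirty = {3, 4, 5, 6, 7, 20}
first = write_cluster(5, dirty)     # the VM only asked for page 5
second = write_cluster(20, dirty)   # an isolated page goes out alone
print(first, second, sorted(dirty))
```

The VM asks for one page at a time; the clustering decision stays with
whoever owns the layout on disk.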
Cheers,
Stephen
* page_table_lock problem [was: Re: 2.4 / 2.5 VM plans]
2000-06-29 13:44 ` Stephen C. Tweedie
@ 2000-07-06 7:51 ` Andrey Savochkin
2000-07-06 13:32 ` Stephen C. Tweedie
0 siblings, 1 reply; 8+ messages in thread
From: Andrey Savochkin @ 2000-07-06 7:51 UTC (permalink / raw)
To: Stephen C. Tweedie, Rik van Riel; +Cc: linux-mm
On Thu, Jun 29, 2000 at 02:44:08PM +0100, Stephen C. Tweedie wrote:
> * RSS accounting needs to be audited. Right now, the per-mm rss isn't
> an atomic type, and it doesn't seem to be consistently protected by
> the page table locks.
Stephen,
I've looked at RSS updates in 2.4.0 kernels.
You're right, they are not protected enough from
concurrent updates from mm paths (mmap, page fault handler) and swapout
path. Moreover, I found that page_table_lock which is supposed to serialize
page table updates from mm and swapout paths isn't taken in the latter at all!
Is it a bug or am I missing something?
Best regards
Andrey
* Re: page_table_lock problem [was: Re: 2.4 / 2.5 VM plans]
2000-07-06 7:51 ` page_table_lock problem [was: Re: 2.4 / 2.5 VM plans] Andrey Savochkin
@ 2000-07-06 13:32 ` Stephen C. Tweedie
0 siblings, 0 replies; 8+ messages in thread
From: Stephen C. Tweedie @ 2000-07-06 13:32 UTC (permalink / raw)
To: Andrey Savochkin; +Cc: Stephen C. Tweedie, Rik van Riel, linux-mm
Hi,
On Thu, Jul 06, 2000 at 03:51:23PM +0800, Andrey Savochkin wrote:
>
> I've looked at RSS updates in 2.4.0 kernels.
> You're right, they are not protected enough from
> concurrent updates from mm paths (mmap, page fault handler) and swapout
> path. Moreover, I found that page_table_lock which is supposed to serialize
> page table updates from mm and swapout paths isn't taken in the latter at all!
> Is it a bug or am I missing something?
Sorry, I don't have time to look closely at this right now --- I'm
swamped with travel and ext3 work, and I've just moved house...
Cheers,
Stephen