* 2.4 / 2.5 VM plans
From: Rik van Riel @ 2000-06-25 3:51 UTC
To: Linus Torvalds; +Cc: Stephen C. Tweedie, linux-mm
Hi,
since I've heard some rumours of you folks having come
up with nice VM ideas at USENIX and since I've been
working on various VM things (and experimental 2.5 things)
for the last months, maybe it's a good idea to see which
of your ideas have already been put into code and to see
which ideas fit together or are mutually exclusive. :)
To start the discussion, here's my flameba^Wlist of ideas:
2.4:
1) re-introduce page aging, my small and simple experiments
seem to indicate that page aging takes *less* cpu time
than copying pages to/from highmem all the time (let alone
making your applications wait for disk because we replaced
the wrong page last time)
2) fix the latency problems of applications calling shrink_mmap
and flushing infinite amounts of pages (mostly fixed)
3) separate page replacement (page aging) and page flushing,
currently we'll happily free a referenced clean page just
because the unreferenced pages haven't been flushed to disk
yet ... this is very bad since the unreferenced pages often
turn out to be things like executable code
we could achieve this by augmenting the current MM subsystem
with an inactive and scavenge list, in the process splitting
shrink_mmap() into three better readable functions ... I have
this mostly done
4) fix balance_dirty() to include inactive pages and have kflushd
help kswapd by proactively flushing some of the inactive pages
_before_ we run into trouble
5) implement some form of write throttling for VMAs so it'll be
impossible for big mmap()s, etc, to completely fill memory
with dirty pages
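Point 3 above can be sketched as a toy model; this is an illustration, not kernel code. Unreferenced pages leave the active list and are sorted onto an inactive-dirty list (waiting for a flush) or an inactive-clean "scavenge" list (reclaimable at once), so a referenced page is never freed merely because the unreferenced ones are still dirty. The names Page, deactivate, flush_some and reclaim are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    name: str
    referenced: bool = False  # touched since the last scan
    dirty: bool = False       # must be written back before reuse


def deactivate(active, inactive_dirty, inactive_clean):
    """Replacement decision only: move unreferenced pages off the active list."""
    for page in list(active):
        if page.referenced:
            page.referenced = False  # keep it active for another round
            continue
        active.remove(page)
        (inactive_dirty if page.dirty else inactive_clean).append(page)


def flush_some(inactive_dirty, inactive_clean, max_pages):
    """Flushing only: write back a bounded number of inactive dirty pages."""
    flushed = 0
    while inactive_dirty and flushed < max_pages:
        page = inactive_dirty.popleft()
        page.dirty = False  # stand-in for real write-back
        inactive_clean.append(page)
        flushed += 1
    return flushed


def reclaim(inactive_clean, wanted):
    """Free pages from the clean inactive list only; active pages are untouched."""
    freed = []
    while inactive_clean and len(freed) < wanted:
        freed.append(inactive_clean.popleft())
    return freed


active = deque([
    Page("libc text", referenced=True),
    Page("old log", dirty=True),
    Page("stale file"),
])
inactive_dirty, inactive_clean = deque(), deque()

deactivate(active, inactive_dirty, inactive_clean)
freed = reclaim(inactive_clean, wanted=2)  # comes up short rather than steal libc
flushed = flush_some(inactive_dirty, inactive_clean, max_pages=1)
freed_later = reclaim(inactive_clean, wanted=1)
```

In this model the first reclaim frees only the clean, unreferenced page; the dirty page becomes reclaimable after the separate flush step, and the referenced page stays on the active list throughout.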
regards,
Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.
Wanna talk about the kernel? irc.openprojects.net / #kernelnewbies
http://www.conectiva.com/ http://www.surriel.com/
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
* Re: 2.4 / 2.5 VM plans
From: vii @ 2000-06-28 17:45 UTC
To: linux-mm

Rik van Riel <riel@conectiva.com.br> writes:

[...]
> To start the discussion, here's my flameba^Wlist of ideas:

Seeing as not much discussion has resulted (if so it missed my
mailbox), I'll stick my neck out to agree.

[...]
> 3) separate page replacement (page aging) and page flushing,

Definitely!

> currently we'll happily free a referenced clean page just
> because the unreferenced pages haven't been flushed to disk
> yet ... this is very bad since the unreferenced pages often
> turn out to be things like executable code
>
> we could achieve this by augmenting the current MM subsystem
> with an inactive and scavenge list, in the process splitting

Yes! Please! IMHO another really cool side-effect will be getting rid
of the vmscan.c:swap_out algorithm (at least as far as I understand).

> shrink_mmap() into three better readable functions ... I have
> this mostly done

[...]

BTW, Is there any timescale for integrating page coloring? Someone
produced a patch somewhere (IIRC specifically for the alpha, sorry to
be so vague).

--
http://altern.org/vii
* Re: 2.4 / 2.5 VM plans
From: Juan J. Quintela @ 2000-06-28 21:04 UTC
To: vii; +Cc: linux-mm

>>>>> "vii" == vii <vii@penguinpowered.com> writes:

Hi

>> 3) separate page replacement (page aging) and page flushing,

vii> Definitely!

I have done part of this work with my write deferred swap (I will port
it to test3 ASAP). The deferred swap write also helps. It is related
to your question about removing the swap_out function; it is tied to
the scanning and the several-lists setup.

vii> BTW, Is there any timescale for integrating page coloring? Someone
vii> produced a patch somewhere (IIRC specifically for the alpha, sorry to
vii> be so vague).

There was a page colouring patch from someone at DEC^WCompaq, and
another one from David Miller. The one from Compaq appeared to have
some problems with some workloads (see the comments from Dave Miller,
I think in this list). I haven't seen David's, so I can't comment on
it. But I suppose that the integration will be a 2.5 thing (Wild,
wild guess).

Later, Juan.

--
In theory, practice and theory are the same, but in practice they
are different -- Larry McVoy
* Re: 2.4 / 2.5 VM plans
From: Juan J. Quintela @ 2000-06-28 21:17 UTC
To: Rik van Riel; +Cc: Linus Torvalds, Stephen C. Tweedie, linux-mm

>>>>> "rik" == Rik van Riel <riel@conectiva.com.br> writes:

Hi

rik> 2.4:

6) Integrate the shm code in the page cache, to avoid having yet
   another cache to balance.

2.5:

7) Make a ->flush method in the address_space operations. Rik
   mentioned it in some previous mail; it should return the number of
   pages that it has flushed. That would make the shrink_mmap code (or
   its successor) more readable, as we don't have to add new code each
   time that we add a new type of page to the page cache.

8) This one is related to the FS, not MM specific, but FS people
   want to be able to allocate MultiPage buffers (see pagebuf from
   XFS) and people want similar functionality for other things.
   Perhaps we need to find some solution/way to do that in a clean
   way. For instance, if the FS tells us that it wants a buffer of 4
   pages, it is quite obvious how to do write clustering for a page
   in that buffer; we can use that information.

9) We also need to implement write clustering for fs/page cache/swap.
   Right now we have _no_ limit on the amount of IO that we start.
   That means that if we have all the memory full of dirty pages, we
   can have a _big_ stall while we wait for all the pages to be
   written to disk, and yes, that happens with the current code.

Later, Juan.
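Point 7 can be sketched as an interface, assuming a toy model rather than the real address_space operations: every kind of page-cache owner supplies its own flush method that returns how many pages it wrote, so the generic caller needs no per-type cases. AddressSpace, FileSpace, SwapSpace and flush_until are hypothetical names, and the max_pages bound (in the spirit of the I/O limit asked for in point 9) is an added assumption.

```python
from abc import ABC, abstractmethod


class AddressSpace(ABC):
    def __init__(self, dirty_pages):
        self.dirty_pages = dirty_pages

    @abstractmethod
    def flush(self, max_pages):
        """Write back up to max_pages dirty pages; return the number flushed."""


class FileSpace(AddressSpace):
    def flush(self, max_pages):
        n = min(self.dirty_pages, max_pages)
        self.dirty_pages -= n  # stand-in for writing to the file system
        return n


class SwapSpace(AddressSpace):
    PER_CALL_LIMIT = 2  # assumed: this backend accepts only 2 pages per call

    def flush(self, max_pages):
        n = min(self.dirty_pages, max_pages, self.PER_CALL_LIMIT)
        self.dirty_pages -= n  # stand-in for writing to swap
        return n


def flush_until(spaces, target, batch=4):
    """Generic caller: ask each mapping to flush without knowing its type."""
    total = 0
    for space in spaces:
        if total >= target:
            break
        total += space.flush(min(batch, target - total))
    return total


spaces = [FileSpace(dirty_pages=3), SwapSpace(dirty_pages=10)]
total = flush_until(spaces, target=6)
```

Because each flush reports what it really wrote, the caller can account for a backend that flushes fewer pages than requested instead of assuming the whole batch went out.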
* Re: 2.4 / 2.5 VM plans
From: Stephen C. Tweedie @ 2000-06-29 13:45 UTC
To: Juan J. Quintela
Cc: Rik van Riel, Linus Torvalds, Stephen C. Tweedie, linux-mm

Hi,

On Wed, Jun 28, 2000 at 11:17:57PM +0200, Juan J. Quintela wrote:

> 2.5:
>
> 7) Make a ->flush method in the address_space operations

OK

> 8) This one is related to the FS, not MM specific, but FS people
> want to be able to allocate MultiPage buffers (see pagebuf from
> XFS) and people want similar functionality for other things.

Yes, but this should be layered on top of the page handling ---
there's no need to integrate it into the low levels of the page cache.

> 9) We also need to implement write clustering for fs/page cache/swap.

Same as above. When the pagebuf layer or whatever gets a write request
for a given page, it is perfectly at liberty to write out adjacent
pages too if it wants to. The VM doesn't have to enforce that itself.

Cheers,
Stephen
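A rough sketch of that division of labour, under assumed names: the VM requests a write of one page, and the layer that owns the page decides by itself to include dirty neighbours from the same aligned cluster. cluster_for_write is hypothetical, and the default cluster size of 4 is borrowed from the "buffer of 4 pages" example rather than from any real pagebuf code.

```python
def cluster_for_write(page_index, dirty, cluster_size=4):
    """Return the page indices to write when page_index is requested.

    The requested page is always written; dirty neighbours are added
    only while they are contiguous with it inside its aligned cluster.
    """
    start = (page_index // cluster_size) * cluster_size
    end = start + cluster_size

    lo = page_index
    while lo - 1 >= start and (lo - 1) in dirty:
        lo -= 1

    hi = page_index
    while hi + 1 < end and (hi + 1) in dirty:
        hi += 1

    return list(range(lo, hi + 1))


dirty = {4, 5, 6, 9}
batch = cluster_for_write(5, dirty)   # neighbours 4 and 6 ride along
single = cluster_for_write(9, dirty)  # no dirty neighbours in cluster 8..11
```

The caller only ever names a single page; whether one page or a whole run goes to disk is a policy of the owning layer, which is the point being made about keeping clustering out of the VM.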
* Re: 2.4 / 2.5 VM plans
From: Stephen C. Tweedie @ 2000-06-29 13:44 UTC
To: Rik van Riel; +Cc: Linus Torvalds, Stephen C. Tweedie, linux-mm

Hi,

On Sun, Jun 25, 2000 at 12:51:42AM -0300, Rik van Riel wrote:
>
> since I've heard some rumours of you folks having come
> up with nice VM ideas at USENIX and since I've been
> working on various VM things (and experimental 2.5 things)
> for the last months, maybe it's a good idea to see which
> of your ideas have already been put into code and to see
> which ideas fit together or are mutually exclusive. :)

Right. :-)

The following includes a lot of the stuff that Ben and I bashed out at
Usenix. I don't count this as new feature stuff --- most of what
follows is just identifying places where the current VM is plain
broken!

> 1) re-introduce page aging,

OK.

> 2) fix the latency problems of applications calling shrink_mmap
> and flushing infinite amounts of pages (mostly fixed)

Right, but it can't be _that_ hard to keep a persistent track of how
much of the cache has changed since the last time you looked at it.
We ought to be able to be much more aggressive about pruning
unnecessary lru list walks.

> 3) separate page replacement (page aging) and page flushing,

YES!!!. But then again I just said as much on linux-mm in reply to
another recent post. :-)

> 4) fix balance_dirty() to include inactive pages

No. balance_dirty() and page cache dirty page management are
completely different. Utterly different. balance_dirty() only has
business doing early flush and/or flow control on buffer_heads,
nothing else. (At least not until we have a write-behind mechanism
for pages which is independent of the buffer cache; say, if NFS
write-behind gets integrated into the mainstream write-behind code.)

> 5) implement some form of write throttling for VMAs so it'll be
> impossible for big mmap()s, etc, to completely fill memory
> with dirty pages

Right. This is necessary, but is orthogonal to the other problems.
A large part of (5) comes for free, however, if we are strict about
keeping a minimum (load-dependent) number of clean, unmapped pages
around on the VM's clean lru-list; separating out page aging and
unmapping from the flushing code fixes a lot of this anyway by
preventing dirty pages from occupying the whole of memory.

Other things to consider:

* The page aging loops need to have early break-out when the number
  of free pages suddenly increases (exit, munmap, whatever);

* The page stealer shouldn't block just because kswapd is blocked on
  synchronous swapping (this comes for free if we have separate page
  flushing)

* shrink_dentry should probably skip inodes which have still got
  pages attached, as otherwise we get a lot of unnecessary cache
  flushes

* We MUST quantify the current VM pressure as a way of controlling
  page aging. That way aging can be proactive under load, but we
  don't necessarily have to evict pages from memory too early (we can
  age pages without flushing them).

* RSS accounting needs to be audited. Right now, the per-mm rss isn't
  an atomic type, and it doesn't seem to be consistently protected by
  the page table locks.

A few other ideas Ben and I threw about are much more long-term.

1) We think it should be possible to share page tables for large
   shared mmaps (think of libc and big sysv shm segments).

2) We can do reverse pte maps pretty cheaply by the following:

   * Reverse maps for shared mmaps are easy enough by following the
     per-inode vma list

   * The pte for unshared anon pages can be encoded in the page
     struct easily.

   * Shared anon pages are the tricky ones; but it's simple to
     maintain a hash list of all such ptes, and there aren't many in
     a typical system. Fork() is, of course, the one place where
     lots of these occur, but we can minimise the number of shared
     anon pages over fork by implementing COW on page tables (that
     way, we share the page tables but NOT the pages!)

3) Think about having a list of all page tables in memory. With
   that, we can do aging in the VM without *EVER* having to walk
   through vmas at all: we can walk through the ptes in the system
   performing atomic bitops on the ptes and age counts without caring
   about the higher level layers until a given page's age reaches
   zero. Only at that point do we care about invoking the swapper
   for that page's vma.

Food for thought. 3) in particular seems to open up a whole new set
of possibilities, but it's definitely something for an experimental
post-2.4 branch. :-)

Cheers,
Stephen
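Idea 3) might look roughly like the following toy model, which makes no claim about how it would be done in the kernel. A global list of page tables is scanned directly; each pte's accessed bit is tested and cleared and an age count adjusted, and the higher layers are called only when an age reaches zero. Pte, age_page_tables, the age bonus of 3, the cap of 20 and the halving decay are all assumptions chosen for illustration, not taken from the thread.

```python
from dataclasses import dataclass

AGE_ADVANCE = 3  # assumed bonus for a referenced page
AGE_MAX = 20     # assumed upper bound on the age count


@dataclass
class Pte:
    page: str
    accessed: bool = False
    age: int = AGE_ADVANCE


def age_page_tables(page_tables, on_zero):
    """One aging pass over every pte in every page table; no vma walk."""
    for table in page_tables:
        for pte in table:
            if pte.accessed:
                pte.accessed = False  # test-and-clear the accessed bit
                pte.age = min(pte.age + AGE_ADVANCE, AGE_MAX)
            elif pte.age > 0:
                pte.age //= 2  # unreferenced pages decay
                if pte.age == 0:
                    on_zero(pte)  # only now involve the vma-level swapper


hot, cold = Pte("hot"), Pte("cold")
page_tables = [[hot], [cold]]
candidates = []

for _ in range(3):
    hot.accessed = True  # the hot page is touched between passes
    age_page_tables(page_tables, candidates.append)
```

After three passes only the untouched page has been handed to the callback, and it is handed over exactly once; the frequently used page has only gained age.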
* page_table_lock problem [was: Re: 2.4 / 2.5 VM plans]
From: Andrey Savochkin @ 2000-07-06 7:51 UTC
To: Stephen C. Tweedie, Rik van Riel; +Cc: linux-mm

On Thu, Jun 29, 2000 at 02:44:08PM +0100, Stephen C. Tweedie wrote:
> * RSS accounting needs to be audited. Right now, the per-mm rss isn't
> an atomic type, and it doesn't seem to be consistently protected by
> the page table locks.

Stephen,

I've looked at RSS updates in 2.4.0 kernels.
You're right, they are not protected enough from
concurrent updates from mm paths (mmap, page fault handler) and swapout
path. Moreover, I found that page_table_lock, which is supposed to
serialize page table updates from mm and swapout paths, isn't taken in
the latter at all!
Is it a bug or am I missing something?

Best regards
Andrey
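The kind of lost update described here can be reproduced deterministically in a toy model, assuming two paths that each do an unlocked read-modify-write of one counter. Generators stand in for the two kernel paths so the interleaving is fixed; Mm, bump_rss, run_interleaved and run_serialized are hypothetical names, and nothing below is taken from the kernel source.

```python
class Mm:
    def __init__(self):
        self.rss = 0


def bump_rss(mm, delta):
    """Unlocked read-modify-write, split so another path can run in between."""
    value = mm.rss          # read
    yield                   # the other path may run here
    mm.rss = value + delta  # write back a possibly stale value


def run_interleaved(mm):
    fault = bump_rss(mm, +1)    # page fault path maps a page
    swapout = bump_rss(mm, -1)  # swapout path unmaps a page
    next(fault)                 # fault reads rss
    next(swapout)               # swapout reads the same rss
    for path in (fault, swapout):
        next(path, None)        # each writes back; the last write wins


def run_serialized(mm):
    for delta in (+1, -1):
        for _ in bump_rss(mm, delta):  # one path runs to completion
            pass                       # before the next one starts


racy, safe = Mm(), Mm()
racy.rss = safe.rss = 5
run_interleaved(racy)
run_serialized(safe)
```

One page mapped and one unmapped should leave the count unchanged. The serialized run does, while the interleaved run ends one too low because the fault path's increment is overwritten. Serializing the two paths under one lock, or using an atomic counter, would close that window.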
* Re: page_table_lock problem [was: Re: 2.4 / 2.5 VM plans]
From: Stephen C. Tweedie @ 2000-07-06 13:32 UTC
To: Andrey Savochkin; +Cc: Stephen C. Tweedie, Rik van Riel, linux-mm

Hi,

On Thu, Jul 06, 2000 at 03:51:23PM +0800, Andrey Savochkin wrote:
>
> I've looked at RSS updates in 2.4.0 kernels.
> You're right, they are not protected enough from
> concurrent updates from mm paths (mmap, page fault handler) and swapout
> path. Moreover, I found that page_table_lock which is supposed to serialize
> page table updates from mm and swapout paths isn't taken in the latter at all!
> Is it a bug or am I missing something?

Sorry, I don't have time to look closely at this right now --- I'm
swamped with travel and ext3 work, and I've just moved house...

Cheers,
Stephen