From mboxrd@z Thu Jan 1 00:00:00 1970 Date: Tue, 21 Mar 2000 16:15:07 +0100 From: Jamie Lokier Subject: Re: Extensions to mincore Message-ID: <20000321161507.D5291@pcep-jamie.cern.ch> References: <20000320135939.A3390@pcep-jamie.cern.ch> <20000321024731.C4271@pcep-jamie.cern.ch> <20000321113448.A6991@dukat.scot.redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <20000321113448.A6991@dukat.scot.redhat.com>; from Stephen C. Tweedie on Tue, Mar 21, 2000 at 11:34:48AM +0000 Sender: owner-linux-mm@kvack.org Return-Path: To: "Stephen C. Tweedie" Cc: "Eric W. Biederman" , Chuck Lever , linux-mm@kvack.org List-ID: Eric W. Biederman wrote: > > > > [Aside: is there the possibility to have mincore return the > > > > "!accessed" and "!dirty" bits of each page, perhaps as bits 1 > > > > and 2 of the returned bytes? I can imagine a bunch of garbage > > > > collection algorithms that could make good use of those bits. > > > > Currently some GC systems mprotect() regions and unprotect them > > > > on SEGV -- simply reading the !dirty status would obviously be > > > > much simpler and faster.] > > No it wouldn't. Yes it would. > Dirty kernel wise means the page needs to be swapped out. Clean kernel > wise mean the page is in the swap cache, and hasn't been written > since it was swapped in. > > Dirty GC wise the page has changes since the last GC pass over it. Of course, I thought that was obvious :-) You're right, that for GC the "!dirty" bit has to mean "since the last time we called mincore". To get the correct behaviour without maintaining extra state in the kernel (apart from a bit or two per struct page), you'd say that mincore returns "!dirty since the last time _anyone_ called mincore on this page", and you'd disallow it for shared mappings. It works for threads too. All threads sharing a page have to synchronise their mincore calls for that page, but that situation is no different to the SEGV method: all threads have to synchronise with the information collected from that, too. Stephen C. Tweedie wrote: > Worse than that, returning dirty status bits in mincore() just wouldn't > work for threads. mincore() is a valid optimisation when you just treat > it as a hint: if a page gets swapped out between calling mincore() and > using the page, nothing breaks, you just get an extra page fault. [Aside: I regard this as a bug. mincore() should have an option to set the accessed bit on each page that is in core, to avoid the "just missed" condition. If it sets the accessed bit, then under most circumstances the just missed condition will never happen. If it does not (it doesn't now), the just missed condition will always happen sometimes under the slightest non-zero paging load. The difference for an application that does "call mincore; if not in core, spawn thread to pull in page" under low system load will be between no stalls and occasional stalls. Thus mincore() is missing a flag parameter IMO] > The same is not true for the sort of garbage collection or distributed > memory mechanisms which use mprotect(). If you find that a page is clean > via mincore() and discard the data based on that, there is nothing to > stop another thread from dirtying the data after the mincore() and losing > its modification. In general, you have to be very careful about what you allow other threads to modify during GC. For a full collection, some kind of synchronisation point with everyone is usually required. (Disclaimer: I am not a GC expert so if you know of GC mechanisms that use mprotect and don't require threads to be synchronised, please speak up!) 1. Stop all the other threads, copy the state of their roots (i.e. processor registers, individual stack roots), call mprotect(), restart the threads, and let SEGVs mprotect() pages back to writable status while putting them on a list. Watch out for concurrent SEGVs on the same page! Disadvantage: lots of SEGV handling, SEGV code is processor specific (until siginfo is reliable), lots of individual page mprotect calls, lots of vmas, page fault slowdown even for non-GC-using threads due to all the tiny vmas. 1a. Using mincore(): call mincore() instead of mprotect() in method 1. Threads are stopped so it just works :-) Advantage: everything runs faster and the code is more portable (among Linux systems). 2. Method 1 has a large mprotect() call. Quite apart from the slowness of all that mprotect/SEGV processing, the single large mprotect may take a while during which all threads are blocked, and it also prevents any threads not involved in GC from faulting. (As you say, it grabs the page table lock). You can call mprotect() first to protect the GC arena, with threads still running. At this point, you're _not_ using it to collect dirty page information. When mprotect() returns, you synchronise all threads to gather local GC roots, and then start collecting dirty page info via SEGVs. If a thread gets a SEGV before the synchronisation point, it is blocked until the synchronisation point. In this way, threads not writing to the arena don't get stopped for long even if mprotect() itself takes a long time. 2a. Method 2 using mincore(). Now you do do mprotect() at the beginning -- remember it is not for collecting dirty page info here, but for blocking threads writing to the arena while permitting others to continue. After synchronisation, call mincore() and then mprotect() to make the entire arena writable. Then restart all blocked threads. Any SEGVs from the start of the first mprotect() to the end of the second one block the faulting thread prior to synchronisation; any that block are restarted afterwards. Obviously there are plenty of other ways to arrange this, with multiple arenas etc. But I hope you can see that mincore() can be used reliably without requiring the overhead of individual-page mprotect and SEGVs. > mprotect() has the advantage of holding page table locks so it can do > an atomic read-modify-write on the page table entries. Without that > locking, you just can't reliably use dirty/accessed information. mprotect() has the major disadvantage of creating a million tiny vmas when you are using it to track dirty pages. And as far as I can see, mprotect/SEGV gives no advantage over the dirty bit method: in both cases, you always need synchronisations points between threads to share the dirty page information. mprotect has another disadvantage: it holds the page table lock. Great for atomic operations; terrible when you do a large mprotect and you _don't_ want to stop concurrent threads (that are not using the GC arena) from page faulting their stuff. Interestingly, neither GC synchronisation method I described depends on mprotect() being atomic w.r.t. the whole protection change, and method 2 would actually benefit from concurrent page faults being allowed during the mprotect(). The atomicity you mention is important. Consider this implementation: 1. Only private mappings allowed. 2. A page is considered dirty "since the last mincore call" if the pte dirty bit is set, or if a struct page flag PageMincoreDirty is set. To read this, you must atomically read and clear the pte's dirty bit. (Not difficult on x86 or any UP system; I'm not sure about other SMP systems). mincore() calls are assumed to be protected w.r.t. each other. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/