* Re: MADV_SPACEAVAIL and MADV_FREE in pre2-3
  [not found] <20000320135939.A3390@pcep-jamie.cern.ch>
From: Chuck Lever @ 2000-03-20 19:09 UTC
To: Jamie Lokier; +Cc: linux-mm

jamie-

i've moved this discussion to linux-mm where we were just discussing the
madvise() implementation.

On Mon, 20 Mar 2000, Jamie Lokier wrote:
> Chuck Lever wrote:
> > > Besides, MADV_FREE would be quite useful.  MADV_DONTNEED doesn't do the
> > > right thing for free(3) and similar things.

ok, i don't understand why you think this.  and besides, free(3) doesn't
shrink the heap currently, i believe.  this would work if free(3) used
sbrk() to shrink the heap in an intelligent fashion, freeing kernel VM
resources along the way.  if you want something to help free(3), i would
favor this design instead.

> No idea.  Didn't you see my message about the collected meanings of
> different MADV_ flags on different systems?

yes, i saw it, but perhaps didn't understand it completely.

> In particular, using the name MADV_DONTNEED is a really bad idea.  It
> means completely different things on different OSes.  For example your
> meaning of MADV_DONTNEED is different to BSD's: a program that assumes
> the BSD behaviour may well crash with your implementation and will
> almost certainly give invalid results if it doesn't crash.

i'm more concerned about portability from operating systems like Solaris,
because there are many more server applications there than on *BSD that
have been designed to use these interfaces.  i'm not saying the *BSD way
is wrong, but i think it would be a more useful compromise to make *BSD
functionality available via some other interface (like MADV_ZERO).

> [Aside: is there the possibility to have mincore return the "!accessed"
> and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned
> bytes?  I can imagine a bunch of garbage collection algorithms that
> could make good use of those bits.  Currently some GC systems mprotect()
> regions and unprotect them on SEGV -- simply reading the !dirty status
> would obviously be much simpler and faster.]

you could add that; the question is how to do it while not breaking
applications that do this:

   if (!byte) {
      page not present
   }

rather than checking the LSB specifically.  i think using "dirty" instead
of "!dirty" would help.

the "accessed" bit is only used by the shrink_mmap logic to "time out" a
page as memory gets short; i'm not sure that's a semantic that is useful
to a user-level garbage collector?  and it probably isn't very portable.

[ jamie's earlier summary included below for context, with commentary ]

> 1. A hint to the VM system: I've finished using this data.  If it's
> modified, you can write it back right away.  If not, you can discard
> it.  FreeBSD's MADV_DONTNEED does this, but DU's doesn't.
>
> FreeBSD:
> > MADV_DONTNEED   Allows the VM system to decrease the in-memory priority
> >                 of pages in the specified range.  Additionally future
> >                 references to this address range will incur a page
> >                 fault.
>
> To avoid ambiguity, perhaps we could call this one MADV_DONE?
>
> In BSD compatibility mode, Glibc would define MADV_DONTNEED to be
> MADV_DONE.  In standard mode it would not define MADV_DONTNEED at all.

my preference is for the DU semantic of tossing dirty data instead of
flushing onto backing store, simply because that's what so many
applications expect DONTNEED to do.  as far as i can tell, linux's
msync(MS_INVALIDATE) behaves like freeBSD's MADV_DONTNEED.

> 2. Zeroing a range in a private map.  DU's MADV_DONTNEED does this --
> that's my reading of the man page.
>
> Digital Unix: (?yes)
> > MADV_DONTNEED   Do not need these pages
> >                 The system will free any whole pages in the specified
> >                 region.  All modifications will be lost and any swapped
> >                 out pages will be discarded.  Subsequent access to the
> >                 region will result in a zero-fill-on-demand fault as
> >                 though it is being accessed for the first time.
> >                 Reserved swap space is not affected by this call.
>
> For Linux, simply read /dev/zero into the selected range.  The kernel
> already optimises this case for anonymous mappings.
>
> If doing it in general turns out to be too hard to implement, I
> propose MADV_ZERO should have this effect: exactly like reading
> /dev/zero into the range, but always efficient.

linux's MADV_DONTNEED currently doesn't clear the MADV_DONTNEED area.  but
it would be easy to add, perhaps as a separate MADV_ZERO as you describe
below.

> 3. Zeroing a range in a shared map.
>
> I have no idea if DU's MADV_DONTNEED has this effect, or whether it
> only has this effect on shared anonymous mappings.
>
> In any case, reading /dev/zero into the range will always have the
> desired effect, and Stephen's work will eventually make this
> efficient on Linux.
>
> Again, if the kiobuf work doesn't have the desired effect, I propose
> MADV_ZERO should be exactly like reading /dev/zero into the range,
> and efficiently if the underlying mapped object can do so
> efficiently.

MADV_ZERO makes sense to me as an efficient way to zero a range of
addresses in a mapping.  but i think it's useful as a *separate* function,
not as combined with, say, MADV_DONTNEED.

> 4. Deferred freeing of pages.  FreeBSD's MADV_FREE does this, according
> to the posted manual snippet.  I like this very much -- it is perfect
> for a wide variety of memory allocators.
>
> FreeBSD:
> > MADV_FREE   Gives the VM system the freedom to free pages, and tells
> >             the system that information in the specified page range
> >             is no longer important.  This is an efficient way of
> >             allowing malloc(3) to free pages anywhere in the address
> >             space, while keeping the address space valid.  The next
> >             time that the page is referenced, the page might be
> >             demand zeroed, or might contain the data that was there
> >             before the MADV_FREE call.  References made to that
> >             address space range will not make the VM system page the
> >             information back in from backing store until the page is
> >             modified again.
>
> I like this so much I started coding it a long time ago, as an
> mdiscard syscall.  But then I got onto something else.
>
> The principle here is very simple: MADV_FREE marks all the pages in
> the region as "discardable", and clears the accessed and dirty bits
> of those pages.
>
> Later when the kernel needs to free some memory, it is permitted to
> free "discardable" pages immediately provided they are still not
> accessed or dirty.  When vmscan is clearing the accessed and dirty
> bits on pages, if they were set it must clear the "discardable" bit.
>
> This allows malloc() and other user space allocators to free pages
> back to the system.  Unlike DU's MADV_DONTNEED, or mmapping
> /dev/zero, if the system does not need the page there is no
> inefficient zero-copy.  If there was, malloc() would be better off
> not bothering to return the pages.

unless i've completely misunderstood what you are proposing, this is what
MADV_DONTNEED does today, except it doesn't schedule the "freed" pages for
disposal ahead of other pages in the system.  but that should be easy
enough to add once the semantics are nailed down and the bugs have been
eliminated.

	- Chuck Lever
--
corporate:	<chuckl@netscape.com>
personal:	<chucklever@netscape.net> or <cel@monkey.org>

The Linux Scalability project:
	http://www.citi.umich.edu/projects/linux-scalability/

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux.eu.org/Linux-MM/
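
[Editorial sketch] The mincore() compatibility concern above comes down to which bits a caller tests. On Linux only the least significant bit of each returned byte is documented as the residency flag, so a caller that tests `vec[i] & 1` instead of `!vec[i]` would keep working if other bits were ever given a meaning. The following is a minimal sketch of that defensive style, not code from the thread; the helper name `count_resident` is hypothetical and error handling is reduced to returning -1.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose mincore() and MAP_ANONYMOUS in glibc */
#endif
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: count the resident pages in [addr, addr + len).
 * addr must be page aligned.  Returns -1 on error. */
long count_resident(void *addr, size_t len)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (len + pagesz - 1) / pagesz;
    unsigned char *vec = malloc(npages);
    long resident = 0;

    if (vec == NULL)
        return -1;
    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }
    for (size_t i = 0; i < npages; i++) {
        /* Test bit 0 only; do not assume the rest of the byte is zero. */
        if (vec[i] & 1)
            resident++;
    }
    free(vec);
    return resident;
}
```

A caller would map a region, touch some pages, and pass the page-aligned base and length to `count_resident()`.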
* madvise (MADV_FREE)
From: Jamie Lokier @ 2000-03-21 1:20 UTC
To: Chuck Lever; +Cc: linux-mm

Hi Chuck

About MADV_FREE
---------------

> > The principle here is very simple: MADV_FREE marks all the pages in
> > the region as "discardable", and clears the accessed and dirty bits
> > of those pages.
> >
> > Later when the kernel needs to free some memory, it is permitted to
> > free "discardable" pages immediately provided they are still not
> > accessed or dirty.  When vmscan is clearing the accessed and dirty
> > bits on pages, if they were set it must clear the "discardable" bit.
> >
> > This allows malloc() and other user space allocators to free pages
> > back to the system.  Unlike DU's MADV_DONTNEED, or mmapping
> > /dev/zero, if the system does not need the page there is no
> > inefficient zero-copy.  If there was, malloc() would be better off
> > not bothering to return the pages.
>
> unless i've completely misunderstood what you are proposing, this is what
> MADV_DONTNEED does today,

No, your MADV_DONTNEED _always_ discards the data in those pages.  That
makes it too inefficient for application memory allocators, because they
will often want to reuse some of the pages soon after.  You don't want
redundant page zeroing, and you don't want to give up memory which is
still nice and warm in the CPU's cache.  Unless the kernel has a better
use for it than you.

MADV_FREE on the other hand simply permits the kernel to reclaim those
pages, if it is under memory pressure.

If there is no pressure, the pages are reused by the application
unchanged.  In this way different subsystems competing for memory get to
share it out -- essentially the fairness mechanisms in the kernel are
extending to application page management.  And the application hardly
knows a thing about it.

Here's why MADV_FREE works, and the other things don't:

A typical memory allocator creates holes in its heap, which the kernel
has to swap out if it needs memory.  I guess about 1/4 of all data in
swap is this kind of junk (but it's just a guess).

But it's quite inefficient for an allocator to unconditionally give
pages back to the kernel.  The cost-benefit is "cost of giving page to
kernel" vs. "cost of maybe paging out".  The cost of giving up pages is
significant: each one implies a COW fault, clear_page when you reuse the
page, and loss of cache-warm memory.

You assume a page is not likely to swap, because there's a reasonable
chance the application will reallocate it before that happens.  So on
balance, giving pages unconditionally to the kernel is a loss.

--> No sane free(3) would call MADV_DONTNEED or msync(MS_INVALIDATE).

A better application allocator would base decisions about when to return
pages to the kernel on the likelihood of swapping and measured cost of
swapping vs. retaining pages.  Of course that's very difficult and
system specific.  And really only the kernel has access to all the
information on memory pressure.

So the best arrangement is to let the kernel make page reclamation
decisions.  And if a page is not reclaimed before it is reused, let the
application reuse the page unchanged and cache-warm.

MADV_FREE is the mechanism for doing that.  And it's a very nice, simple
one to use.  Paging decisions stay in the kernel where they belong.
Applications run fast if they have enough memory.  Everything is happy.

> ... except it doesn't schedule the "freed" pages for
> disposal ahead of other pages in the system.  but that should be easy
> enough to add once the semantics are nailed down and the bugs have been
> eliminated.

It's not clear you'd want to do that.  There is a cost for every "freed"
page disposed of, so you don't want to dispose of them ahead of other
pages.

> ok, i don't understand why you think this.  and besides, free(3) doesn't
> shrink the heap currently, i believe.  this would work if free(3) used
> sbrk() to shrink the heap in an intelligent fashion, freeing kernel VM
> resources along the way.  if you want something to help free(3), i would
> favor this design instead.

free(3) already uses sbrk() to shrink the heap at the end.  It's not
usable for the typical 1/3 of memory which becomes holes in the heap.

Yes the idea is to modify free(3) to permit the kernel to reclaim memory
that is free in the application.  However, none of sbrk() _or_
MADV_DONTNEED _or_ MADV_ZERO _or_ mmap(/dev/zero) have the desired
effect.  It has to be a win for the application to call this function --
and it's a loss to zero pages as soon as you free them.  But it's
relatively cheap to just mark the pages as "reclaimable" without losing
them.

enjoy,
-- Jamie
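
[Editorial sketch] For readers on a current system: Linux later gained an MADV_FREE flag (kernel 4.5 onwards, for private anonymous memory) with the deferred semantics argued for above. The following is a minimal sketch of how an allocator might hand back a page-aligned hole under that assumption. `release_hole` is a hypothetical name, and falling back to MADV_DONTNEED when MADV_FREE is unavailable is a choice made for this sketch, not something proposed in the thread.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose madvise() and MAP_ANONYMOUS in glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: tell the kernel that [addr, addr + len) holds no
 * useful data.  addr must be page aligned.  Returns 0 on success. */
int release_hole(void *addr, size_t len)
{
#ifdef MADV_FREE
    /* Lazy: the pages are reclaimed only under memory pressure.  Until
     * then a read may still see the old contents. */
    if (madvise(addr, len, MADV_FREE) == 0)
        return 0;
    /* Kernels without the flag reject it; fall through. */
#endif
    /* Eager: the pages are dropped now and refault as zeros. */
    return madvise(addr, len, MADV_DONTNEED);
}
```

After a successful call the caller must treat the range as uninitialised: it may read back either the old bytes or zeros, and writing to a page makes it ordinary memory again.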
* Re: madvise (MADV_FREE)
From: William J. Earl @ 2000-03-21 2:24 UTC
To: Jamie Lokier; +Cc: Chuck Lever, linux-mm

Jamie Lokier writes:
...
> You assume a page is not likely to swap, because there's a reasonable
> chance the application will reallocate it before that happens.  So on
> balance, giving pages unconditionally to the kernel is a loss.
>
> --> No sane free(3) would call MADV_DONTNEED or msync(MS_INVALIDATE).
>
> A better application allocator would base decisions about when to return
> pages to the kernel on the likelihood of swapping and measured cost of
> swapping vs. retaining pages.  Of course that's very difficult and
> system specific.  And really only the kernel has access to all the
> information on memory pressure.
...

I have been asked by some application people to have free() use
MADV_DONTNEED or the equivalent in selected cases, specifically when
the memory allocated is large, in order to free up the physical and
virtual (swap space) memory for other uses.  If the application uses
very large chunks of memory, giving it back entirely is a win.  The
application could be recoded to do its own mmap() of /dev/zero and
munmap(), but would prefer that this behavior be automatic.  Of course,
MADV_DONTNEED does not apply in the case of mmap()/munmap() of /dev/zero,
but it is not implausible to give up virtual memory.  Note that
I am not claiming one should do anything of the sort for small
allocations.

If you have, say, 256 MB of memory and 256 MB of swap, and you use
384 MB of memory in your application, you cannot even fork() without
giving up some of it.  Many serious applications at least reserve large
amounts of memory (even if they do not touch all of it on every run).
* Re: madvise (MADV_FREE)
From: Jamie Lokier @ 2000-03-21 14:08 UTC
To: William J. Earl; +Cc: Chuck Lever, linux-mm

William J. Earl wrote:
> I have been asked by some application people to have free() use
> MADV_DONTNEED or the equivalent in selected cases, specifically when
> the memory allocated is large, in order to free up the physical and
> virtual (swap space) memory for other uses.  If the application uses
> very large chunks of memory, giving it back entirely is a win.  The
> application could be recoded to do its own mmap() of /dev/zero and
> munmap(), but would prefer that this behavior be automatic.  Of course,
> MADV_DONTNEED does not apply in the case of mmap()/munmap() of /dev/zero,
> but it is not implausible to give up virtual memory.  Note that
> I am not claiming one should do anything of the sort for small
> allocations.

Take a look at Glibc's malloc/free, which is the only one we care about
for Linux.

Glibc's malloc uses mmap() of /dev/zero for large allocations
automatically.  You can change the threshold if you like.

However, assuming this was not the case, even your application would
benefit more from MADV_FREE than MADV_DONTNEED.  MADV_DONTNEED forces a
non-trivial minimum recycling cost, whereas MADV_FREE allows the cost to
be balanced between the kernel and the application, according to the
current paging situation.

-- Jamie
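
[Editorial sketch] Changing the threshold mentioned above is a single call in glibc. A minimal sketch, assuming glibc's `mallopt()` and its `M_MMAP_THRESHOLD` parameter; the wrapper name `set_mmap_threshold` is hypothetical.

```c
#include <malloc.h>
#include <stddef.h>

/* Hypothetical wrapper: ask glibc to serve requests of at least `bytes`
 * with mmap() instead of the brk heap.  Returns 1 on success and 0 on
 * error, following mallopt()'s own convention. */
int set_mmap_threshold(size_t bytes)
{
    return mallopt(M_MMAP_THRESHOLD, (int)bytes);
}
```

A program would call this once at start-up, for example `set_mmap_threshold(64 * 1024)`, before it begins allocating. The trade-offs of a low threshold are argued later in the thread.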
* Re: madvise (MADV_FREE)
From: Chuck Lever @ 2000-03-22 16:24 UTC
To: Jamie Lokier; +Cc: linux-mm

hi jamie-

ok, i think i'm getting a more clear picture of what you are thinking.

On Tue, 21 Mar 2000, Jamie Lokier wrote:
> > > The principle here is very simple: MADV_FREE marks all the pages in
> > > the region as "discardable", and clears the accessed and dirty bits
> > > of those pages.
> > >
> > > Later when the kernel needs to free some memory, it is permitted to
> > > free "discardable" pages immediately provided they are still not
> > > accessed or dirty.  When vmscan is clearing the accessed and dirty
> > > bits on pages, if they were set it must clear the "discardable" bit.
> > >
> > > This allows malloc() and other user space allocators to free pages
> > > back to the system.  Unlike DU's MADV_DONTNEED, or mmapping
> > > /dev/zero, if the system does not need the page there is no
> > > inefficient zero-copy.  If there was, malloc() would be better off
> > > not bothering to return the pages.
> >
> > unless i've completely misunderstood what you are proposing, this is what
> > MADV_DONTNEED does today,
>
> No, your MADV_DONTNEED _always_ discards the data in those pages.  That
> makes it too inefficient for application memory allocators, because they
> will often want to reuse some of the pages soon after.  You don't want
> redundant page zeroing, and you don't want to give up memory which is
> still nice and warm in the CPU's cache.  Unless the kernel has a better
> use for it than you.
>
> MADV_FREE on the other hand simply permits the kernel to reclaim those
> pages, if it is under memory pressure.
>
> If there is no pressure, the pages are reused by the application
> unchanged.  In this way different subsystems competing for memory get to
> share it out -- essentially the fairness mechanisms in the kernel are
> extending to application page management.  And the application hardly
> knows a thing about it.

ok, so you're asking for a lite(TM) version of DONTNEED that provides the
following hint to the kernel: "i may be finished with this page, but i may
also want to reuse it immediately."

memory allocation studies i've read show that dynamically allocated memory
objects are often re-used immediately after they are freed.  even if the
memory is being freed just before a process exits, it will be recycled
immediately by the kernel, so why use MADV_FREE if you are about to
munmap() it anyway?  finally, as you point out, the heap is generally too
fragmented to return page-sized chunks of it to the kernel, especially if
you consider that glibc uses *multiple* subheaps to reduce lock contention
in multithreaded applications.

it seems to me that normal page aging will adequately identify these pages
and flush them out.  if the application needs to recycle areas of a
virtual address space immediately, why should the kernel be involved at
all?  i think even doing an MADV_FREE during arbitrary free() operations
would be more overhead than you really want.  in other words, i don't
think free() as it exists today harms performance in the ways you
describe.

thus, either the application keeps the memory, or it is really completely
finished with it -- MADV_DONTNEED.

	- Chuck Lever
* Re: madvise (MADV_FREE)
From: Jamie Lokier @ 2000-03-22 18:05 UTC
To: Chuck Lever; +Cc: linux-mm

Hi Chuck,

Think of this scenario:

    Allocate 20 x 20k blocks for images.
    Process images.
    Free 20 x 20k blocks (-> 100 page sized holes)
    Wait for user input.
    ...
    Allocate 20 x 20k blocks for images.
    Process images.
    Free 20 x 20k blocks.

Now, if the rest of your system (not just this app) is busy paging, the
best thing the app can do at "wait" is call MADV_DONTNEED.  But if the
rest of your system is not paging at all, the best thing the app can do
is _not_ call MADV_DONTNEED.

You see?  It doesn't matter whether you're going to reuse the pages soon.

The decision to use MADV_DONTNEED or not depends on overall system
behaviour, which the application doesn't know about.

Chuck Lever wrote:
> ok, so you're asking for a lite(TM) version of DONTNEED that provides the
> following hint to the kernel: "i may be finished with this page, but i may
> also want to reuse it immediately."

It does *not* mean "i may have finished with this page".
For free() it looks that way, but that is a special case.

It means "if you decide to swap this page out, you can skip the I/O".

The page age remains the same.  (You have MADV_WONTNEED if you want to
change the page age as well).

We let applications decide for themselves when it's best used.  It's for
long-lived holes after memory allocation, and cached objects such as
Netscape's in-memory image and document cache.

> memory allocation studies i've read show that dynamically allocated memory
> objects are often re-used immediately after they are freed.

True for programs which are continuously allocating and freeing memory.
Not true for interactive programs waiting for the user (for example).
See the scenario I wrote at the start of this message.

> even if the memory is being freed just before a process exits, it will
> be recycled immediately by the kernel, so why use MADV_FREE if you are
> about to munmap() it anyway?

You wouldn't use it in that situation.  I am thinking of long lived
processes that aren't actively allocating and have holes in their heap.
For example Emacs, Netscape etc.

My motivation for MADV_FREE is the observation that the optimal
behaviour for programs like Emacs and Netscape is to allocate and use
lots of memory (without changing it much) if there is no swapping, but
to release memory aggressively if there is swapping.

> finally, as you point out, the heap is generally too fragmented to
> return page-sized chunks of it to the kernel, especially if you
> consider that glibc uses *multiple* subheaps to reduce lock contention
> in multithreaded applications.

Multiple subheaps helps to produce page sized holes.  Larger allocations
(but not large enough to use mmap), when freed, leave page sized holes.
The holes aren't blocked because tiny allocations go on different
subheaps.

> it seems to me that normal page aging will adequately identify these
> pages and flush them out.

Exactly!  In fact page ageing is required for MADV_FREE to have any
effect.

The only effect of MADV_FREE is to eliminate the write to swap, after
page ageing has decided to flush a page.  It doesn't change the page
reclamation policy.

> if the application needs to recycle areas of a virtual address space
> immediately, why should the kernel be involved at all?

It is for long lived applications that have holes in their heap, who
aren't actively recycling.  Some memory allocators don't know if they
are about to be recycled, but some do.  It depends on the application.

> i think even doing an MADV_FREE during arbitrary free() operations
> would be more overhead than you really want.  in other words, i don't
> think free() as it exists today harms performance in the ways you
> describe.

You're right, you wouldn't call MADV_FREE on every free().  Just when
you have a set of pages to free, every so often.  There are lots of
systems which can do that -- even a timer signal will do with a generic
malloc.

See for example GCC's ggc-page allocator -- every so often it decides to
free a set of pages.  And any GC system.  And any system which caches
objects in memory, for example Netscape.

> thus, either the application keeps the memory, or it is really completely
> finished with it -- MADV_DONTNEED.

MADV_FREE is, speaking generally, not for either of those situations.

It's for when the application has memory that it's _willing_ to give up,
at some cost to application performance.  For example cached objects
that can be recalculated or reread over the network.  Memory allocators
are a special case of this.  Not just malloc/free, but also garbage
collecting systems.

At the moment, the kernel has a number of subsystems, and when memory is
required, it asks each subsystem to release some memory.  MADV_FREE is a
way for the kernel to include applications in memory balancing
decisions.

-- Jamie
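
[Editorial sketch] The "20k blocks leave page sized holes" arithmetic can be made concrete. A heap block rarely starts on a page boundary, so only the whole pages lying inside it can be offered to the kernel. The following is a minimal sketch of that calculation; `whole_pages_inside` is a hypothetical helper, and the page size is passed in and assumed to be a power of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: find the page-aligned sub-range that lies wholly
 * inside [block, block + size).  pagesz must be a power of two.
 * Stores the aligned start in *start and returns the length in bytes,
 * or 0 if the block contains no whole page. */
size_t whole_pages_inside(uintptr_t block, size_t size, size_t pagesz,
                          uintptr_t *start)
{
    uintptr_t mask = ~(uintptr_t)(pagesz - 1);
    uintptr_t first = (block + pagesz - 1) & mask;  /* round start up */
    uintptr_t last = (block + size) & mask;         /* round end down */

    *start = first;
    return last > first ? (size_t)(last - first) : 0;
}
```

With 4 KiB pages, a 20 KiB block that starts 16 bytes into a page contains four whole pages and a page-aligned one contains five, which is presumably where the figure of about 100 holes for 20 blocks comes from.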
* Re: madvise (MADV_FREE)
From: Chuck Lever @ 2000-03-22 21:39 UTC
To: Jamie Lokier; +Cc: linux-mm

On Wed, 22 Mar 2000, Jamie Lokier wrote:
> Think of this scenario:
>
>     Allocate 20 x 20k blocks for images.
>     Process images.
>     Free 20 x 20k blocks (-> 100 page sized holes)
>     Wait for user input.
>     ...
>     Allocate 20 x 20k blocks for images.
>     Process images.
>     Free 20 x 20k blocks.
>
> Now, if the rest of your system (not just this app) is busy paging, the
> best thing the app can do at "wait" is call MADV_DONTNEED.  But if the
> rest of your system is not paging at all, the best thing the app can do
> is _not_ call MADV_DONTNEED.
>
> You see?  It doesn't matter whether you're going to reuse the pages soon.
>
> The decision to use MADV_DONTNEED or not depends on overall system
> behaviour, which the application doesn't know about.
>
> > ok, so you're asking for a lite(TM) version of DONTNEED that provides the
> > following hint to the kernel: "i may be finished with this page, but i may
> > also want to reuse it immediately."
>
> It does *not* mean "i may have finished with this page".
> For free() it looks that way, but that is a special case.
>
> It means "if you decide to swap this page out, you can skip the I/O".
>
> The page age remains the same.  (You have MADV_WONTNEED if you want to
> change the page age as well).
>
> We let applications decide for themselves when it's best used.  It's for
> long-lived holes after memory allocation, and cached objects such as
> Netscape's in-memory image and document cache.

we have several generic applications we are interested in optimizing:

1.  memory allocators can indicate pages that are not in use

2.  applications that need to cache large files or big pieces of data that
    can be regenerated relatively cheaply

3.  applications that need to buffer data to control precisely its
    movement to and from permanent storage.

now, for 1:

several studies i've read indicate that the average size of a dynamically
allocated object is in the range of 40 bytes.  if an application is
screwing with much bigger objects, it should probably manage the objects
differently (use mmap explicitly, tweak malloc, or something like that).

in fact, i'd say it is safe in general to lower DEFAULT_MMAP_THRESHOLD to
the system page size.  that way you'd get closer to the behavior you're
after, and you'd also win a much bigger effective heap size when
allocating large objects, because you can only allocate up to 960M of a
process's address space with sbrk().

on Linux with glibc, you can use mallopt to do this.  something like:

	mallopt(M_MMAP_THRESHOLD, getpagesize());

for 2:

note carefully that my implementation of MADV_DONTNEED doesn't evict data
from memory.  it simply tears down page mappings.  this will result in a
minor fault if the application immediately reaccesses the address, or a
major fault if the application accesses the address after the page
contents have finally been evicted from physical memory.

to say this another way, the page mapping binds a virtual address to a
page in the page cache.  MADV_DONTNEED simply removes that binding.
normal page aging will discover the unbound pages in the page cache and
remove them.  so really, MADV_DONTNEED is actually disconnected from the
mechanism of swapping or discarding the page's data.  there are probably
nicer ways to do this, but there it is.

i think this is exactly what you want for cached files.  the application
can say "DONTNEED" this data, and the system is free to reclaim it as
necessary.  if the application accesses it again later, it will get the
old data back.  just be sure that if you change data in the file, you
explicitly sync it back to disk.

for 3:

this area of memory is probably going to be mapped from /dev/zero, and
pinned.  it's a nice way to get a clear page if you just re-read /dev/zero
into that page.

> > it seems to me that normal page aging will adequately identify these
> > pages and flush them out.
>
> Exactly!  In fact page ageing is required for MADV_FREE to have any
> effect.
>
> The only effect of MADV_FREE is to eliminate the write to swap, after
> page ageing has decided to flush a page.  It doesn't change the page
> reclamation policy.

ok, here is where i'm confused.  i don't think MADV_DONTNEED and MADV_FREE
are different -- they both work this way.

> > i think even doing an MADV_FREE during arbitrary free() operations
> > would be more overhead than you really want.  in other words, i don't
> > think free() as it exists today harms performance in the ways you
> > describe.
>
> You're right, you wouldn't call MADV_FREE on every free().  Just when
> you have a set of pages to free, every so often.  There are lots of
> systems which can do that -- even a timer signal will do with a generic
> malloc.

nah, i still say a better way to handle this case is to lower malloc's
"use an anon map instead of the heap" threshold to 4K or 8K.  right now
it's 32K by default.

> At the moment, the kernel has a number of subsystems, and when memory is
> required, it asks each subsystem to release some memory.  MADV_FREE is a
> way for the kernel to include applications in memory balancing
> decisions.

like adding another separate call in do_try_to_free_pages that trolls
applications for free-able pages; except with MADV_FREE and MADV_DONTNEED,
you're causing shrink_mmap to do this for you automatically.

	- Chuck Lever
* Re: madvise (MADV_FREE) 2000-03-22 21:39 ` Chuck Lever @ 2000-03-22 22:31 ` Jamie Lokier 2000-03-22 22:44 ` Stephen C. Tweedie 2000-03-23 18:53 ` Chuck Lever 2000-03-22 22:33 ` Stephen C. Tweedie 1 sibling, 2 replies; 55+ messages in thread From: Jamie Lokier @ 2000-03-22 22:31 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-mm > > > it seems to me that normal page aging will adequately identify these > > > pages and flush them out. > > > > Exactly! In fact page ageing is required for MADV_FREE to have any > > effect. > > > > The only effect of MADV_FREE is to eliminate the write to swap, after > > page ageing has decided to flush a page. It doesn't change the page > > reclamation policy. > > ok, here is where i'm confused. i don't think MADV_DONTNEED and MADV_FREE > are different -- they both work this way. No they don't. MADV_DONTNEED always discards private modifications. (BTW I think it should be flushing the swap cache while it's at it). MADV_FREE only discards private modifications when there is paging pressure to do so. The decisions to do so are deferred, for architectures that support this. (Includes x86). Chuck Lever wrote: > 1. memory allocators can indicate pages that are not in use > > now, for 1: > > several studies i've read indicate that the average size of a dynamically > allocated object is in the range of 40 bytes. if an application is > screwing with much bigger objects, it should probably manage the objects > differently (use mmap explicitly, tweak malloc, or something like that). The average object size is skewed towards small numbers because there are usually many more small objects, allocated at a higher rate. It only takes a few larger objects to lead to holes, but they don't count in the "average size" statistic because the time spent in the memory allocator for larger objects isn't significant. MADV_FREE isn't to optimise the time spent in a memory allocator. It's to optimise overall system performance. 
And that is for a subset of applications. Yes, by all means tweak malloc. Tweak it to call MADV_FREE :-) > in fact, i'd say it is safe in general to lower DEFAULT_MMAP_THRESHOLD to > the system page size. that way you'd get closer to the behavior you're > after, and you'd also win a much bigger effective heap size when > allocating large objects, because you can only allocate up to 960M of a > process's address space with sbrk(). A fine way to make performance suck. Application heap fragmentation now appears as vma fragmentation, so expect to see hundreds of vmas or more. Memory lost to rounding up to a page size is now also unusable. Even if you manage to save memory, performance sucks. A system call for every medium size allocation and deallocation? You gotta be kidding. And now even normal page faults take longer because of the extra vmas. You've just optimised for the minimum RAM, maximum paging case. > > You're right, you wouldn't call MADV_FREE on every free(). Just when > > you have a set of pages to free, every so often. There are lots of > > systems which can do that -- even a timer signal will do with a generic > > malloc. > > nah, i still say a better way to handle this case is to lower malloc's > "use an anon map instead of the heap" threshold to 4K or 8K. right now > it's 32K by default. Try it. I expect the malloc author chose a high threshold after extensive measurements -- that malloc implementation is the result of a series of implementations and studies. Do you know that Glibc's malloc also limits the total number of mmaps? I believe that's because performance plummets when you have too many vmas. And even if we didn't use vmas or system calls, even if mmap were a straightforward function call to ultra-fast code, explicitly returning the memory to the kernel implies a significant overhead -- you're forcing unnecessary clear_page() calls. > 2.
applications that need to cache large files or big pieces of data that > can be regenerated relatively cheaply > > note carefully that my implementation of MADV_DONTNEED doesn't evict data > from memory. it simply tears down page mappings. this will result in a > minor fault if the application immediately reaccesses the address, or a > major fault if the application accesses the address after the page > contents have finally been evicted from physical memory. > > to say this another way, the page mapping binds a virtual address to a > page in the page cache. MADV_DONTNEED simply removes that binding. > normal page aging will discover the unbound pages in the page cache and > remove them. so really, MADV_DONTNEED is actually disconnected from the > mechanism of swapping or discarding the page's data. Let's see... zap_page_range. That looks like the private modification is discarded. That's not what MADV_FREE does. MADV_FREE does _not_ discard private modifications unless they're reclaimed due to memory pressure. And that decision is magically deferred. And that's what you want for caching calculated structures in an application. They are private mappings which will be zeroed _if_ (and only if) the kernel decides there is pressure to use the memory elsewhere. > i think this is exactly what you want for cached files. For reading a file, yes. For a locally generated structure, such as a parsed file, no. BTW, I am sure that Netscape's "memory cache" is the latter -- because they have "disk cache" for the former. > the application can say "DONTNEED" this data, and the system is free > to reclaim it as necessary. if the application accesses it again > later, it will get the old data back. just be sure that if you change > data in the file, you explicitly sync it back to disk. You say "the system is free to reclaim it". MADV_DONTNEED _forces_ the system to reclaim the data, if it is not in swap cache at the time. 
For a locally calculated structure in an anonymous mapping, you don't get the data back. (Yes, this means "cached files". Sorry if I made it sound like mapped files). > 3. applications that need to buffer data to control precisely its > movement to and from permanent storage. > > for 3: > > this area of memory is probably going to be mapped from /dev/zero, and > pinned. it's a nice way to get a clear page if you just re-read /dev/zero > into that page. Um. I don't see how that response has anything to do with 3 :-) > > At the moment, the kernel has a number of subsystems, and when memory is > > required, it asks each subsystem to release some memory. MADV_FREE is a > > way for the kernel to include applications in memory balancing > > decisions. > > like adding another separate call in do_try_to_free_pages that trolls > applications for free-able pages; except with MADV_FREE and MADV_DONTNEED, > you're causing shrink_mmap to do this for you automatically. It should be added to vmscan and/or shrink_mmap. The rough outline is: MADV_FREE clears the pte accessed bit and marks the page as freeable. Later, on finding one of these pages during the normal scans, just dump the page if it is still not accessed. If it has been accessed, it's no longer freeable. There are some interactions with the swap cache and vmscan algorithm I have glossed over... -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
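The allocator-side use of MADV_FREE that Jamie outlines can be sketched as follows. This is only an illustration under stated assumptions: it targets a present-day Linux whose headers define MADV_FREE (the flag was still a proposal when this thread was written), and the helper names release_range() and demo_lazy_free() are hypothetical, not part of any library. Where MADV_FREE is unavailable the sketch falls back to MADV_DONTNEED, which gives the immediate-discard behaviour discussed above.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: tell the kernel that [addr, addr+len) holds no
 * live data.  Tries MADV_FREE (lazy reclaim) where the headers define
 * it and falls back to MADV_DONTNEED (immediate discard) otherwise.
 * Returns 0 on success, -1 on failure. */
static int release_range(void *addr, size_t len)
{
#ifdef MADV_FREE
    if (madvise(addr, len, MADV_FREE) == 0)
        return 0;
    /* The running kernel may not accept MADV_FREE; fall through. */
#endif
    return madvise(addr, len, MADV_DONTNEED);
}

/* Map npages of private anonymous memory, dirty them, release them,
 * then reuse the first page.  Returns 0 on success, -1 on failure. */
static int demo_lazy_free(size_t npages)
{
    assert(npages > 0);

    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * pagesz;
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);               /* private modifications */

    if (release_range(p, len) != 0) {
        munmap(p, len);
        return -1;
    }

    /* From here on the old contents must be treated as undefined:
     * a read may see the old bytes or zeroes.  A fresh write makes
     * the page live again, so the stored value can be relied upon. */
    p[0] = 1;
    int ok = (p[0] == 1);

    munmap(p, len);
    return ok ? 0 : -1;
}
```

An allocator would call something like release_range() on whole free pages every so often, not on every free(), which matches the batching Jamie describes.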
* Re: madvise (MADV_FREE) 2000-03-22 22:31 ` Jamie Lokier @ 2000-03-22 22:44 ` Stephen C. Tweedie 2000-03-23 18:53 ` Chuck Lever 1 sibling, 0 replies; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-22 22:44 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chuck Lever, linux-mm, Stephen C. Tweedie Hi, On Wed, Mar 22, 2000 at 11:31:47PM +0100, Jamie Lokier wrote: > > No they don't. MADV_DONTNEED always discards private modifications. > (BTW I think it should be flushing the swap cache while it's at it). If it is the last user of the page --- ie. if PG_SwapCache is set and the refcount of the page is one --- then it will do so anyway, because when I added that swap cache code I made sure that zap_page_range() does a free_page_and_swap_cache() when freeing pages. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-22 22:31 ` Jamie Lokier 2000-03-22 22:44 ` Stephen C. Tweedie @ 2000-03-23 18:53 ` Chuck Lever 2000-03-24 0:00 ` /dev/recycle Jamie Lokier 2000-03-24 0:21 ` madvise (MADV_FREE) Jamie Lokier 1 sibling, 2 replies; 55+ messages in thread From: Chuck Lever @ 2000-03-23 18:53 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-mm On Wed, 22 Mar 2000, Jamie Lokier wrote: > > > The only effect of MADV_FREE is to eliminate the write to swap, after > > > page ageing has decided to flush a page. It doesn't change the page > > > reclamation policy. > > > > ok, here is where i'm confused. i don't think MADV_DONTNEED and MADV_FREE > > are different -- they both work this way. > > No they don't. MADV_DONTNEED always discards private modifications. > (BTW I think it should be flushing the swap cache while it's at it). > > MADV_FREE only discards private modifications when there is paging > pressure to do so. The decisions to do so are deferred, for > architectures that support this. (Includes x86). i still don't see a big difference. the private modifications, in both cases, won't be written to swap. in both cases, the application cannot rely on the contents of these pages after the madvise call. for private mappings, pages are freed immediately by DONTNEED; FREE will cause the pages to be freed later if the system is low on memory. that's six of one, half dozen of the other. freeing later may mean the application saves a little time now, but freeing immediately could mean postponing a low memory scenario, and would allow the system to reuse a page that is still in hardware caches. > > nah, i still say a better way to handle this case is to lower malloc's > > "use an anon map instead of the heap" threshold to 4K or 8K. right now > > it's 32K by default. > > Try it. I expect the malloc author chose a high threshold after > extensive measurements -- that malloc implementation is the result of a > series of implementations and studies. 
Do you know that Glibc's malloc > also limits the total number of mmaps? I believe that's because > performance plummets when you have too many vmas. the AVL tree structure helps this. there is still a linear search in the number of vmas to find unused areas in a virtual address space. this makes mmap significantly slower when there are a large number of vmas. i'll bet some clever person on this list could create a data structure that fixes this problem. but you said before that the number of small dynamically allocated objects dwarfs the number of large objects. so either there is a problem here, or there isn't! :) can this be any worse than mprotect? - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
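The two malloc knobs argued over here, the size above which malloc switches to mmap and the cap on the number of such mappings, are exposed by glibc through mallopt(). A minimal sketch, assuming glibc (mallopt and the M_MMAP_* parameters are not portable); the helper name tune_malloc() and the values used in the usage note are hypothetical and only for illustration:

```c
#include <assert.h>
#include <malloc.h>
#include <stdlib.h>

/* Hypothetical helper: ask glibc malloc to serve requests of at least
 * `threshold` bytes with mmap(), and to keep at most `max_maps` such
 * mappings alive at once.  Follows mallopt's own convention:
 * returns 1 on success, 0 on failure. */
static int tune_malloc(int threshold, int max_maps)
{
    assert(threshold >= 0 && max_maps >= 0);

    if (!mallopt(M_MMAP_THRESHOLD, threshold))
        return 0;
    return mallopt(M_MMAP_MAX, max_maps);
}
```

Calling tune_malloc(4096, 1024) early in main() is a cheap way to "try it": the 4K threshold is Chuck's proposal, the cap is Jamie's vma concern, and the effect on the number of mappings can then be observed in /proc/self/maps under a real workload.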
* /dev/recycle 2000-03-23 18:53 ` Chuck Lever @ 2000-03-24 0:00 ` Jamie Lokier 2000-03-24 9:14 ` /dev/recycle Christoph Rohland 2000-03-28 0:48 ` /dev/recycle Chuck Lever 2000-03-24 0:21 ` madvise (MADV_FREE) Jamie Lokier 1 sibling, 2 replies; 55+ messages in thread From: Jamie Lokier @ 2000-03-24 0:00 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-mm This discussion needs to split into two: one about memory allocators responding to overall system memory pressure, and another about applications caching recomputable objects, which also want to respond to system memory pressure. The issues are different and the requirements are different. Perhaps trying to use the name MADV_FREE to cover them both is just confusing. For the record, I'm going to talk about memory allocators, and the subject has changed to reflect that. So hi Chuck! I've thought of something maybe better than MADV_FREE for memory allocators. It's neat, it's simple, it's cute... But first I'll explain MADV_FREE a bit more. Chuck Lever wrote: > > MADV_FREE only discards private modifications when there is paging > > pressure to do so. The decisions to do so are deferred, for > > architectures that support this. (Includes x86). > > i still don't see a big difference. the private modifications, in both > cases, won't be written to swap. in both cases, the application cannot > rely on the contents of these pages after the madvise call. Correct. The difference is that with MADV_FREE, clear_page() operations are skipped when there's no memory pressure from the kernel. > for private mappings, pages are freed immediately by DONTNEED; FREE will > cause the pages to be freed later if the system is low on memory. that's > six of one, half dozen of the other. freeing later may mean the > application saves a little time now, It may save the time overall -- if the page is next reused by the application before the kernel recycles it.
Note that nobody, neither the application nor the kernel, knows in advance if this will be the case. > but freeing immediately could mean postponing a low memory scenario, > and would allow the system to reuse a page that is still in hardware > caches. The system is free to reuse MADV_FREE pages immediately if it wishes -- the system doesn't lose here. In fact if you're already low on memory at the time madvise() is called, the kernel would reclaim as many pages as it needs immediately, just as if you'd called MADV_DONTNEED for those pages. The remainder get marked reclaimable. Look at it from the point of view of an application writer. Why would I ever call MADV_DONTNEED for anything but large memory areas? It penalises my application on systems that aren't swapping. (MADV_FREE is also a penalty, but a smaller one.) > but you said before that the number of small dynamically allocated objects > dwarfs the number of large objects. so either there is a problem here, or > there isn't! :) We're talking about free areas, not objects :-) Think of the kernel, specifically only the memory managed by kmalloc/slab. It handles lots of small allocations, but nevertheless produces free pages which the kernel can use when there's memory pressure. But anyway... Better than MADV_FREE: /dev/recycle -------------------------------------------------- What about this whacky idea? MAP_RECYCLE|MAP_ANON initially allocates pages like MAP_ANON. Mapping /dev/recycle is similar (but subtly different). MADV_DONTNEED or munmap discard private modifications, but record this process as the page owner. If the process later accesses the page, a page is allocated again but the MAP_RECYCLE means it may return a page already marked as belonging to this process without clearing it. That's better for app allocators than MADV_FREE: they're giving the kernel more freedom with not much loss in performance.
And the kernel likes this too -- no need for vmscan to release references, as the pages are free already. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: /dev/recycle 2000-03-24 0:00 ` /dev/recycle Jamie Lokier @ 2000-03-24 9:14 ` Christoph Rohland 2000-03-24 13:10 ` /dev/recycle Jamie Lokier 2000-03-28 0:48 ` /dev/recycle Chuck Lever 1 sibling, 1 reply; 55+ messages in thread From: Christoph Rohland @ 2000-03-24 9:14 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chuck Lever, linux-mm Jamie Lokier <lk@tantalophile.demon.co.uk> writes: > Better than MADV_FREE: /dev/recycle > -------------------------------------------------- > > What about this whacky idea? > > MAP_RECYCLE|MAP_ANON initially allocates pages like MAP_ANON. Mapping > /dev/recycle is similar (but subtly different). > > MADV_DONTNEED or munmap discard private modifications, but record this > process as the page owner. If the process later accesses the page, a > page is allocated again but the MAP_RECYCLE means it may return a page > already marked as belonging to this process without clearing it. > > That's better for app allocators than MADV_FREE: they're giving the > kernel more freedom with not much loss in performance. And the kernel > likes this too -- no need for vmscan to release references, as the pages > are free already. This would only work for /dev/zero like mappings. I need it for shm mappings. Greetings Christoph -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: /dev/recycle 2000-03-24 9:14 ` /dev/recycle Christoph Rohland @ 2000-03-24 13:10 ` Jamie Lokier 2000-03-24 13:54 ` /dev/recycle Christoph Rohland 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-24 13:10 UTC (permalink / raw) To: Christoph Rohland; +Cc: Chuck Lever, linux-mm Christoph Rohland wrote: > > MAP_RECYCLE|MAP_ANON initially allocates pages like MAP_ANON. Mapping > > /dev/recycle is similar (but subtly different). > > This would only work for /dev/zero like mappings. I need it for shm > mappings. Open /dev/recycle several times and map it shared -- it's the same as anonymous shared mappings. The owner of pages is considered to be the filehandle itself in that case. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: /dev/recycle 2000-03-24 13:10 ` /dev/recycle Jamie Lokier @ 2000-03-24 13:54 ` Christoph Rohland 2000-03-24 14:17 ` /dev/recycle Jamie Lokier 0 siblings, 1 reply; 55+ messages in thread From: Christoph Rohland @ 2000-03-24 13:54 UTC (permalink / raw) To: Jamie Lokier; +Cc: Christoph Rohland, Chuck Lever, linux-mm Jamie Lokier <lk@tantalophile.demon.co.uk> writes: > Christoph Rohland wrote: > > This would only work for /dev/zero like mappings. I need it for shm > > mappings. > > Open /dev/recycle several times and map it shared -- it's the same as > anonymous shared mappings. The owner of pages is considered to be the > filehandle itself in that case. It's not the same as posix shared mem. Greetings Christoph -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: /dev/recycle 2000-03-24 13:54 ` /dev/recycle Christoph Rohland @ 2000-03-24 14:17 ` Jamie Lokier 2000-03-24 17:40 ` /dev/recycle Christoph Rohland 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-24 14:17 UTC (permalink / raw) To: Christoph Rohland; +Cc: Chuck Lever, linux-mm Christoph Rohland wrote: > > Open /dev/recycle several times and map it shared -- it's the same as > > anonymous shared mappings. The owner of pages is considered to be the > > filehandle itself in that case. > > It's not the same as posix shared mem. What's the difference? -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: /dev/recycle 2000-03-24 14:17 ` /dev/recycle Jamie Lokier @ 2000-03-24 17:40 ` Christoph Rohland 2000-03-24 18:13 ` /dev/recycle Jamie Lokier 0 siblings, 1 reply; 55+ messages in thread From: Christoph Rohland @ 2000-03-24 17:40 UTC (permalink / raw) To: Jamie Lokier; +Cc: Christoph Rohland, Chuck Lever, linux-mm Jamie Lokier <lk@tantalophile.demon.co.uk> writes: > Christoph Rohland wrote: > > > Open /dev/recycle several times and map it shared -- it's the same as > > > anonymous shared mappings. The owner of pages is considered to be the > > > filehandle itself in that case. > > > > It's not the same as posix shared mem. > > What's the difference? 1) /dev/{zero,recycle} shared mappings only work between children of the same parent and the parent. Also they do not survive an exec. 2) You cannot unmap and remap the same area. Greetings Christoph -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
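Christoph's first point, that a /dev/zero-style shared mapping is only reachable by a parent and the children that inherit it, can be illustrated with a shared anonymous mapping. A minimal sketch, assuming a system where MAP_SHARED | MAP_ANONYMOUS is available as the modern spelling of mapping /dev/zero shared; the helper name share_with_child() is hypothetical:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Map one int as shared anonymous memory, fork, let the child store
 * `value` into it, and return what the parent reads back afterwards.
 * Returns -1 on any setup failure. */
static int share_with_child(int value)
{
    assert(value >= 0);                 /* keep -1 free as the error code */

    int *cell = mmap(NULL, sizeof *cell, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (cell == MAP_FAILED)
        return -1;
    *cell = 0;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }
    if (pid == 0) {
        /* Child: the mapping was inherited across fork(), so this
         * store lands in the page the parent also sees. */
        *cell = value;
        _exit(0);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(cell, sizeof *cell);
        return -1;
    }

    int seen = *cell;                   /* parent observes the child's store */
    munmap(cell, sizeof *cell);
    return seen;
}
```

The mapping has no name, so an unrelated process has nothing to open, and it is gone after exec. That is the gap with respect to POSIX shared memory that Christoph is pointing at.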
* Re: /dev/recycle 2000-03-24 17:40 ` /dev/recycle Christoph Rohland @ 2000-03-24 18:13 ` Jamie Lokier 2000-03-25 8:35 ` /dev/recycle Christoph Rohland 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-24 18:13 UTC (permalink / raw) To: Christoph Rohland; +Cc: Chuck Lever, linux-mm Christoph Rohland wrote: > 1) /dev/{zero,recycle} shared mappings only work between children of > the same parent and the parent. Also they do not survive an exec. Use file handle passing -- another process can then share the mapping. This is what shared anonymous mapping means, and it was added to the kernel recently just after posix shm (because posix shm made it easy to implement). > 2) You cannot unmap and remap the same area. You can if someone else holds it open. Anyway, you can use MAP_RECYCLE when you're mapping posix shm. That could be made to work :-) -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: /dev/recycle 2000-03-24 18:13 ` /dev/recycle Jamie Lokier @ 2000-03-25 8:35 ` Christoph Rohland 0 siblings, 0 replies; 55+ messages in thread From: Christoph Rohland @ 2000-03-25 8:35 UTC (permalink / raw) To: Jamie Lokier; +Cc: Christoph Rohland, Chuck Lever, linux-mm Jamie Lokier <lk@tantalophile.demon.co.uk> writes: > Christoph Rohland wrote: > > 1) /dev/{zero,recycle} shared mappings only work between children of > > the same parent and the parent. Also they do not survive an exec. > > Use file handle passing -- another process can then share the mapping. > This is what shared anonymous mapping means, and it was added to the > kernel recently just after posix shm (because posix shm made it easy to > implement). That's not how /dev/zero works. Check the implementation. AFAIK it also does not work this way on other platforms. > > 2) You cannot unmap and remap the same area. > > You can if someone else holds it open. See above. Greetings Christoph -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: /dev/recycle 2000-03-24 0:00 ` /dev/recycle Jamie Lokier 2000-03-24 9:14 ` /dev/recycle Christoph Rohland @ 2000-03-28 0:48 ` Chuck Lever 1 sibling, 0 replies; 55+ messages in thread From: Chuck Lever @ 2000-03-28 0:48 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-mm On Fri, 24 Mar 2000, Jamie Lokier wrote: > Chuck Lever wrote: > > > MADV_FREE only discards private modifications when there is paging > > > pressure to do so. The decisions to do so are deferred, for > > > architectures that support this. (Includes x86). > > > > i still don't see a big difference. the private modifications, in both > > cases, won't be written to swap. in both cases, the application cannot > > rely on the contents of these pages after the madvise call. > > Correct. The difference is that with MADV_FREE, clear_page() operations > are skipped when there's no memory pressure from the kernel. > > > for private mappings, pages are freed immediately by DONTNEED; FREE will > > cause the pages to be freed later if the system is low on memory. that's > > six of one, half dozen of the other. freeing later may mean the > > application saves a little time now, > > It may save the time overall -- if the page is next reused by the > application before the kernel recycles it. Note that nobody, neither > the application nor the kernel, knows in advance if this will be the > case. > > > but freeing immediately could mean postponing a low memory scenario, > > and would allow the system to reuse a page that is still in hardware > > caches. > > The system is free to reuse MADV_FREE pages immediately if it wishes -- > the system doesn't lose here. In fact if you're already low on memory > at the time madvise() is called, the kernel would reclaim as many pages > as it needs immediately, just as if you'd called MADV_DONTNEED for those > pages. The remainder get marked reclaimable. 
ok, i just want to make sure we really are talking about the same thing, at least from the point of view of the semantics that the application will depend on. the only difference is how/when the kernel disposes of the pages. reducing the number of clear_page() operations and reducing the amount of page table jiggling on SMP are both good goals. is it your view that MADV_FREE is a better implementation of MADV_DONTNEED? should we replace the current implementation of MADV_DONTNEED with one that behaves more like MADV_FREE? is there a reason to have both behaviors available to applications? - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
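The application-visible semantic both sides agree on for MADV_DONTNEED on a private anonymous mapping (the modifications are gone and the next access sees a cleared page) can be checked with a few lines. A minimal sketch, assuming Linux behaviour as discussed in this thread; other systems give MADV_DONTNEED different meanings, as noted earlier, and the helper name dontneed_zeroes() is hypothetical:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirty one private anonymous page, discard it with MADV_DONTNEED and
 * report whether it reads back as zeroes.
 * Returns 1 if zero-filled, 0 if not, -1 on setup failure. */
static int dontneed_zeroes(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, pagesz);            /* private modification */
    assert(p[0] == 0x5A);

    if (madvise(p, pagesz, MADV_DONTNEED) != 0) {
        munmap(p, pagesz);
        return -1;
    }

    /* Touching the page again faults in a fresh page; on Linux it is
     * zero-filled, which is where the clear_page() cost comes from. */
    int zeroed = (p[0] == 0 && p[pagesz - 1] == 0);

    munmap(p, pagesz);
    return zeroed;
}
```

With MADV_FREE the same read could legitimately return either the old 0x5A bytes or zeroes, which is the only difference an application may observe between the two.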
* Re: madvise (MADV_FREE) 2000-03-23 18:53 ` Chuck Lever 2000-03-24 0:00 ` /dev/recycle Jamie Lokier @ 2000-03-24 0:21 ` Jamie Lokier 2000-03-24 7:21 ` lars brinkhoff 1 sibling, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-24 0:21 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-mm On the dirty bit ................ And then Chuck moved onto a different topic, mincore... > can this be any worse than mprotect? Do you really imagine an application having to handle 1000 SEGV signals, and call mprotect() for one page per SEGV, and the kernel locking the mm thereby causing soft fault contention for other threads, is fast? :-) <ahem>, but enough sensationalism from me. I went and looked at some papers -- and found a rather annoying problem with mprotect, for general purpose GCs[1]: "The resulting write faults were caught as UNIX signals and recorded. Various Portable Common Runtime interfaces to SunOS system calls were modified so as to preclude unrecoverable faults in system calls." Ouch! You can't use the mprotect() method with read(). mincore would be just fine. So you can't make a conservative collector that works with a third-party library unless you're willing to write wrappers for all the system calls that touch user memory. ioctl() for a hairy example. On the matter of timing, when that paper was written (1991), continued from above: "The primary cost of this is that the first time a page in the heap is written after a garbage collection, a signal must be caught and a system call must be executed to unprotect the page. The cost of this is variable, but in our environment appears to be somewhat less then half a millisecond per page written." As a counterpoint, Boehm has this to say on the subject of getting dirty bits from the OS[2]. See 3: "We keep track of modified pages using one of three distinct mechanisms: 1. Through explicit mutator cooperation. Currently this requires the use of GC_malloc_stubborn. 2. 
By write-protecting physical pages and catching write faults. This is implemented for many Unix-like systems and for win32. It is not possible in a few environments. 3. By retrieving dirty bit information from /proc. (Currently only Sun's Solaris supports this. Though this is considerably cleaner, performance may actually be better with mprotect and signals.)" Well, I guess we will never know until it has been tried, but it looks like it should be experimented with by someone writing a garbage collector before it becomes a standard kernel feature. I really don't like the way mprotect breaks syscalls though, even if it performs well. On the accessed bit ................... In [3], Boehm says: "Paging locality A common concern about garbage collection, or any form of dynamic memory allocation, is its interaction with a virtual memory system. Accesses to virtual memory should be such that the traffic between disk and memory is small, i.e. most access should be to pages that were already recently accessed. On modern computers, where disks are so much slower than CPUs, many programs page very little, and most of their heaps reside in the working set. But even for those programs in which significant parts of the heap do not reside in the working set, there are a number of techniques which dramatically increase the locality of reference of a mark-and-sweep collector. The fundamental problem is that all memory that may possibly contain pointers has to be examined during every full collection." Boehm then goes on to summarise methods used to avoid this problem. In particular, generational collection. This is something that mincore could perhaps help with. Pages that haven't been accessed since certain GC checkpoints can gather in a set of pages that don't need to be scanned, or at least not scanned particularly often. Again, somebody working on a real GC implementation would be the right person to experiment with extensions to mincore.
My summary from this is: no point adding mincore extensions until we know what would be useful. But do reserve the space in those bits 1-7. enjoy, -- Jamie [1] http://reality.sgi.com/boehm/papers/pldi91.ps.Z [2] http://reality.sgi.com/boehm/gcdescr.html [3] http://reality.sgi.com/boehm/issues.html -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
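The "mprotect breaks syscalls" problem raised above is easy to reproduce: when the kernel itself writes into a write-protected page on behalf of read(), no SIGSEGV reaches the collector's handler, and the call simply fails. A minimal sketch, assuming Linux, where the failure is reported as EFAULT; the helper name read_into_protected_page() is hypothetical:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write-protect a page the way an mprotect-based write barrier would,
 * then ask the kernel to read() one byte from a pipe into it.
 * Returns the errno reported by read(), 0 if read() unexpectedly
 * succeeded, or -1 on setup failure. */
static int read_into_protected_page(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    int fds[2];

    char *page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    if (pipe(fds) != 0) {
        munmap(page, pagesz);
        return -1;
    }

    page[0] = 0;                                  /* fault the page in */
    if (write(fds[1], "x", 1) != 1 ||             /* data waiting in the pipe */
        mprotect(page, pagesz, PROT_READ) != 0) { /* arm the "write barrier" */
        close(fds[0]);
        close(fds[1]);
        munmap(page, pagesz);
        return -1;
    }

    errno = 0;
    ssize_t n = read(fds[0], page, 1);   /* the kernel does the store here */
    int err = (n < 0) ? errno : 0;

    close(fds[0]);
    close(fds[1]);
    munmap(page, pagesz);
    return err;
}
```

A collector relying on write faults therefore has to wrap every system call that stores into user memory, which is the cost the quoted paper describes.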
* Re: madvise (MADV_FREE) 2000-03-24 0:21 ` madvise (MADV_FREE) Jamie Lokier @ 2000-03-24 7:21 ` lars brinkhoff 2000-03-24 17:42 ` Jeff Dike 0 siblings, 1 reply; 55+ messages in thread From: lars brinkhoff @ 2000-03-24 7:21 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chuck Lever, linux-mm, jdike Jamie Lokier wrote: > Well, I guess we will never know until it has been tried, but it looks > like it should be experimented with by someone writing a garbage > collector before it becomes a standard kernel feature. I really don't > like the way mprotect breaks syscalls though, even if it performs well. And please remember that not only garbage collectors can benefit from dirty and accessed bits. There are a number of applications doing paging in user space. For example, the Brown Simulator (http://www.cs.brown.edu/software/brownsim/) and a386 (http://a386.nocrew.org/) both provide virtual CPUs with MMUs which can run operating system kernels. Per-page accessed and dirty information from the hosting kernel would ease the implementation of a simulated MMU. Perhaps also the user-mode Linux kernel would benefit, but I'm not sure. Jeff? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
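For comparison with the per-page accessed and dirty information being asked for, this is roughly what mincore() offers as it stands: one byte per page, with only the least significant bit defined (page resident or not). A minimal sketch, assuming the Linux/BSD mincore() prototype; the helper names page_is_resident() and touched_page_resident() are hypothetical. Note that it tests the low bit explicitly instead of the whole byte, which is the habit that would keep applications working if bits 1-7 were ever given a meaning.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page starting at page-aligned `addr` is resident,
 * 0 if it is not, -1 if mincore() fails. */
static int page_is_resident(void *addr)
{
    unsigned char vec = 0;
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);

    if (mincore(addr, pagesz, &vec) != 0)
        return -1;
    return vec & 1;        /* only bit 0 is defined; ignore bits 1-7 */
}

/* Map one anonymous page, write to it so that it is backed by memory,
 * and report what mincore() says about it. */
static int touched_page_resident(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 1;                           /* the write faults the page in */
    assert(p[0] == 1);

    int resident = page_is_resident(p);
    munmap(p, pagesz);
    return resident;
}
```

Nothing here says whether the page has been written or referenced since some checkpoint, which is the information a garbage collector or a simulated MMU would actually want.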
* Re: madvise (MADV_FREE) 2000-03-24 7:21 ` lars brinkhoff @ 2000-03-24 17:42 ` Jeff Dike 2000-03-24 16:49 ` Jamie Lokier 2000-03-24 17:08 ` Stephen C. Tweedie 0 siblings, 2 replies; 55+ messages in thread From: Jeff Dike @ 2000-03-24 17:42 UTC (permalink / raw) To: lars brinkhoff; +Cc: lk, cel, linux-mm > Per-page accessed and dirty information from the hosting kernel would > ease the implementation of a simulated MMU. > Perhaps also the user-mode Linux kernel would benefit, but I'm not > sure. Jeff? The user-mode kernel doesn't expect to get any mm bits from the hosting kernel and I don't see any use for them. It lives in its own happy world keeping track of its own bits. Maybe on arches where the hardware provides those bits and the kernel uses them, but the i386 kernel doesn't. Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-24 17:42 ` Jeff Dike @ 2000-03-24 16:49 ` Jamie Lokier 2000-03-24 17:08 ` Stephen C. Tweedie 1 sibling, 0 replies; 55+ messages in thread From: Jamie Lokier @ 2000-03-24 16:49 UTC (permalink / raw) To: Jeff Dike; +Cc: lars brinkhoff, cel, linux-mm Jeff Dike wrote: > Maybe on arches where the hardware provides those bits and the kernel uses > them, but the i386 kernel doesn't. The i386 not-user-mode kernel certainly uses the accessed and dirty bits. What do you think pte_young does? -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-24 17:42 ` Jeff Dike 2000-03-24 16:49 ` Jamie Lokier @ 2000-03-24 17:08 ` Stephen C. Tweedie 2000-03-24 19:58 ` Jeff Dike 1 sibling, 1 reply; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-24 17:08 UTC (permalink / raw) To: Jeff Dike; +Cc: lars brinkhoff, lk, cel, linux-mm, Stephen Tweedie Hi, On Fri, Mar 24, 2000 at 12:42:18PM -0500, Jeff Dike wrote: > > Maybe on arches where the hardware provides those bits and the kernel uses > them, but the i386 kernel doesn't. Sure it does. It relies utterly on them. It uses the accessed bit to perform page aging, and it uses the dirty bit to distinguish between private and shared pages on writable private vmas, or to mark dirty shared pages on shared vmas. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-24 17:08 ` Stephen C. Tweedie @ 2000-03-24 19:58 ` Jeff Dike 2000-03-25 0:30 ` Stephen C. Tweedie 0 siblings, 1 reply; 55+ messages in thread From: Jeff Dike @ 2000-03-24 19:58 UTC (permalink / raw) To: Stephen C. Tweedie, lk; +Cc: linux-mm > The i386 not-user-mode kernel I usually call that the native kernel :-) lk@tantalophile.demon.co.uk said: > certainly uses the accessed and dirty bits. What do you think > pte_young does? sct@redhat.com said: > It uses the accessed bit to perform page aging, and it uses the dirty > bit to distinguish between private and shared pages on writable > private vmas, or to mark dirty shared pages on shared vmas. I should have thought a little before making that post. When I did the user-mode port, I didn't have to provide any special support for maintaining the non-protection bits (should I be?). I essentially stole the i386 pgtable.h and pgalloc.h to get the bits and macros, and that's about it. Everything appears to work fine, so my conclusion (without delving into the i386 code too deeply) was that the upper kernel maintained them itself without any particular help from the hardware. Is this correct? Should I be dealing with the non-protection bits in the arch layer? Jeff -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-24 19:58 ` Jeff Dike @ 2000-03-25 0:30 ` Stephen C. Tweedie 0 siblings, 0 replies; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-25 0:30 UTC (permalink / raw) To: Jeff Dike; +Cc: Stephen C. Tweedie, lk, linux-mm Hi, On Fri, Mar 24, 2000 at 02:58:10PM -0500, Jeff Dike wrote: > > Everything appears to work fine, so my conclusion (without delving into the > i386 code too deeply) was that the upper kernel maintained them itself without > any particular help from the hardware. > > Is this correct? Should I be dealing with the non-protection bits in the arch > layer? You probably should. It is impossible to do MAP_SHARED, PROT_WRITE regions correctly without dirty bit support, and you don't get efficient paging without accessed bit support. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
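For context on the per-page information discussed in this subthread, here is a minimal sketch of what user space can read through mincore(2) as documented on Linux: only residency, in the least significant bit of each vector byte. The accessed and dirty bits proposed earlier in the thread are not part of that interface. The function name `first_page_resident` is hypothetical and exists only for this illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the touched page is reported resident, 0 if not,
 * and -1 if mmap() or mincore() fails. */
int first_page_resident(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 1;                       /* touch the page so it is faulted in */

    unsigned char vec[1];           /* one status byte per page */
    if (mincore(p, len, vec) != 0) {
        munmap(p, len);
        return -1;
    }

    /* Test the least significant bit explicitly rather than "if (!byte)",
     * so that other bits could carry meaning without breaking the caller. */
    int resident = vec[0] & 1;
    munmap(p, len);
    return resident;
}
```

Checking the low bit rather than the whole byte is the defensive choice here: it keeps the caller correct even if the remaining bits were ever given a meaning.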
* Re: madvise (MADV_FREE) 2000-03-22 21:39 ` Chuck Lever 2000-03-22 22:31 ` Jamie Lokier @ 2000-03-22 22:33 ` Stephen C. Tweedie 2000-03-22 22:45 ` Jamie Lokier 1 sibling, 1 reply; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-22 22:33 UTC (permalink / raw) To: Chuck Lever; +Cc: Jamie Lokier, linux-mm, Stephen C. Tweedie Hi, On Wed, Mar 22, 2000 at 04:39:12PM -0500, Chuck Lever wrote: > > in fact, i'd say it is safe in general to lower DEFAULT_MMAP_THRESHOLD to > the system page size. that way you'd get closer to the behavior you're > after, and you'd also win a much bigger effective heap size when > allocating large objects, because you can only allocate up to 960M of a > process's address space with sbrk(). You can use MADV_DONTNEED to reclaim demand-zero pages below sbrk() even without using memory map in the first place, and I understand that recent versions of glibc will resort to extending the heap with mmap() automatically once sbrk() reaches its limit. So, I don't think that decreasing DEFAULT_MMAP_THRESHOLD really gains that much. > > to say this another way, the page mapping binds a virtual address to a > page in the page cache. MADV_DONTNEED simply removes that binding. > normal page aging will discover the unbound pages in the page cache and > remove them. so really, MADV_DONTNEED is actually disconnected from the > mechanism of swapping or discarding the page's data. Not for anonymous pages, where the pte reference is the _only_ reference to the page (except for swap-cached pages). In this case, MADV_DONTNEED will genuinely free the page. > nah, i still say a better way to handle this case is to lower malloc's > "use an anon map instead of the heap" threshold to 4K or 8K. right now > it's 32K by default. 
No, it's much cheaper to do a MADV_DONTNEED when freeing an anonymous page: that way the pageout and subsequent demand-zero pagein all happen entirely within the page tables, without having to perform lots of operations on the vma tree of the process. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
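The behaviour Stephen describes for anonymous pages can be sketched as follows, assuming the Linux semantics of MADV_DONTNEED on a private anonymous mapping (the page is dropped and the next access demand-zeroes a fresh one). The helper name `dontneed_anon_first_byte` is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Dirties one anonymous page, discards it with MADV_DONTNEED and returns
 * the first byte read back afterwards, or -1 on error. */
int dontneed_anon_first_byte(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);           /* make the page dirty */

    /* No munmap()/mmap() pair is needed: the vma stays in place and only
     * the page table entries are cleared. */
    if (madvise(p, len, MADV_DONTNEED) != 0) {
        munmap(p, len);
        return -1;
    }

    int first = p[0];               /* faults in a zero-filled page */
    munmap(p, len);
    return first;
}
```

An allocator could issue the same call on whole free pages inside its heap, which is the usage being argued for in this message.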
* Re: madvise (MADV_FREE) 2000-03-22 22:33 ` Stephen C. Tweedie @ 2000-03-22 22:45 ` Jamie Lokier 2000-03-22 22:48 ` Stephen C. Tweedie 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-22 22:45 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-mm Stephen C. Tweedie wrote: > > to say this another way, the page mapping binds a virtual address to a > > page in the page cache. MADV_DONTNEED simply removes that binding. > > normal page aging will discover the unbound pages in the page cache and > > remove them. so really, MADV_DONTNEED is actually disconnected from the > > mechanism of swapping or discarding the page's data. > > Not for anonymous pages, where the pte reference is the _only_ reference > to the page (except for swap-cached pages). In this case, MADV_DONTNEED > will genuinely free the page. Doesn't this also result in a swap-cache leak, or are orphan swap-cache pages reclaimed eventually? > > nah, i still say a better way to handle this case is to lower malloc's > > "use an anon map instead of the heap" threshold to 4K or 8K. right now > > it's 32K by default. > > No, it's much cheaper to do a MADV_DONTNEED when freeing an anonymous > page: that way the pageout and subsequent demand-zero pagein all happen > entirely within the page tables, without having to perform lots of > operations on the vma tree of the process. And it's even cheaper to do MADV_FREE so you skip demand-zeroing if memory pressure doesn't require that. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-22 22:45 ` Jamie Lokier @ 2000-03-22 22:48 ` Stephen C. Tweedie 2000-03-22 22:55 ` Q. about swap-cache orphans Jamie Lokier 0 siblings, 1 reply; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-22 22:48 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chuck Lever, linux-mm, Stephen C. Tweedie Hi, On Wed, Mar 22, 2000 at 11:45:31PM +0100, Jamie Lokier wrote: > > Doesn't this also result in a swap-cache leak, or are orphan swap-cache > pages reclaimed eventually? The shrink_mmap() page cache reclaimer is able to pick up any orphaned swap cache pages. > And it's even cheaper to do MADV_FREE so you skip demand-zeroing if > memory pressure doesn't require that. Right. --Stephen
* Q. about swap-cache orphans 2000-03-22 22:48 ` Stephen C. Tweedie @ 2000-03-22 22:55 ` Jamie Lokier 2000-03-22 22:58 ` Stephen C. Tweedie 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-22 22:55 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-mm [This is just a question to help my understanding, not relevant to madvise] Stephen C. Tweedie wrote: > If it is the last user of the page --- ie. if PG_SwapCache is set and > the refcount of the page is one --- then it will do so anyway, because > when I added that swap cache code I made sure that zap_page_range() > does a free_page_and_swap_cache() when freeing pages. I.e., zap_page_range makes sure that MADV_DONTNEED won't leave orphan swap-cache pages. > > Doesn't this also result in a swap-cache leak, or are orphan swap-cache > > pages reclaimed eventually? > > The shrink_mmap() page cache reclaimer is able to pick up any orphaned > swap cache pages. But there won't be any orphans, will there? Or do they appear due to async. swapping situations? thanks, -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: Q. about swap-cache orphans 2000-03-22 22:55 ` Q. about swap-cache orphans Jamie Lokier @ 2000-03-22 22:58 ` Stephen C. Tweedie 0 siblings, 0 replies; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-22 22:58 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chuck Lever, linux-mm, Stephen C. Tweedie Hi, On Wed, Mar 22, 2000 at 11:55:45PM +0100, Jamie Lokier wrote: > [This is just a question to help my understanding, not relevant to madvise] > > Stephen C. Tweedie wrote: > > If it is the last user of the page --- ie. if PG_SwapCache is set and > > the refcount of the page is one --- then it will do so anyway, because > > when I added that swap cache code I made sure that zap_page_range() > > does a free_page_and_swap_cache() when freeing pages. > > I.e., zap_page_range makes sure that MADV_DONTNEED won't leave orphan > swap-cache pages. Not quite, but very nearly. There are a few minor places where the refcount on a page is bumped up temporarily, so zap_page_range is theoretically able to be confused into thinking that there are extra references, and that the swap cache should remain. However, that is still correct behaviour, because the shrink_mmap() code will seek and destroy the remaining swap cache references if that happens. > > The shrink_mmap() page cache reclaimer is able to pick up any orphaned > > swap cache pages. > > But there won't be any orphans, will there? > Or do they appear due to async. swapping situations? Yes, but it's harmless. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-22 16:24 ` Chuck Lever 2000-03-22 18:05 ` Jamie Lokier @ 2000-03-22 18:15 ` Christoph Rohland 2000-03-22 18:30 ` Jamie Lokier 1 sibling, 1 reply; 55+ messages in thread From: Christoph Rohland @ 2000-03-22 18:15 UTC (permalink / raw) To: Chuck Lever; +Cc: Jamie Lokier, linux-mm Hi Chuck Chuck Lever <cel@monkey.org> writes: > ok, so you're asking for a lite(TM) version of DONTNEED that > provides the following hint to the kernel: "i may be finished with > this page, but i may also want to reuse it immediately." I would say "... reuse this address space immediately and you can give me _any_ data the next time". "Any data" means probably either the old or a zero page. That's the optimal strategy for the memory management modules of SAP R/3. > function 1 (could be MADV_DISCARD; currently MADV_DONTNEED): > discard pages. if they are referenced again, the process causes page > faults to read original data (zero page for anonymous maps). That would be also good. > i'm interested to hear what big database folks have to say about this. R/3 is not a database but probably the biggest database client. Often much bigger than the database itself. Greetings Christoph -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-22 18:15 ` madvise (MADV_FREE) Christoph Rohland @ 2000-03-22 18:30 ` Jamie Lokier 2000-03-23 16:56 ` Christoph Rohland 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-22 18:30 UTC (permalink / raw) To: Christoph Rohland; +Cc: Chuck Lever, linux-mm Christoph Rohland wrote: > > ok, so you're asking for a lite(TM) version of DONTNEED that > > provides the following hint to the kernel: "i may be finished with > > this page, but i may also want to reuse it immediately." > > I would say "... reuse this address space immediately and you can give > me _any_ data the next time". "Any data" means probably either the old > or a zero page. For maximum performance that's right. But Linux normally has to provide some minimal security, so an application should only see its own data or zeros, not an arbitrary page. Zeroing has another advantage: you can efficiently detect it. So you can use it for cached memory objects too in a number of cases, not just free memory. (A bit from mincore would also allow detection, but not nearly as efficiently). > That's the optimal strategy for the memory management modules of SAP R/3. Excellent! A hard core recommendation :-) -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: madvise (MADV_FREE) 2000-03-22 18:30 ` Jamie Lokier @ 2000-03-23 16:56 ` Christoph Rohland 0 siblings, 0 replies; 55+ messages in thread From: Christoph Rohland @ 2000-03-23 16:56 UTC (permalink / raw) To: Jamie Lokier; +Cc: Christoph Rohland, Chuck Lever, linux-mm Jamie Lokier <jamie.lokier@cern.ch> writes: > Christoph Rohland wrote: > > > ok, so you're asking for a lite(TM) version of DONTNEED that > > > provides the following hint to the kernel: "i may be finished > > > with this page, but i may also want to reuse it immediately." > > > > I would say "... reuse this address space immediately and you can > > give me _any_ data the next time". "Any data" means probably > > either the old or a zero page. > > For maximum performance that's right. But Linux normally has to > provide some minimal security, so an application should only see its > own data or zeros, not an arbitrary page. That was the reason for "...probably either the old or a zero page" > Zeroing has another advantage: you can efficiently detect it. So > you can use it for cached memory objects too in a number of cases, > not just free memory. (A bit from mincore would also allow > detection, but not nearly as efficiently). > > > That's the optimal strategy for the memory management modules of > > SAP R/3. > > Excellent! A hard core recommendation :-) :-) Greetings Christoph -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
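The "old data or a zero page" contract discussed above can be sketched like this. It assumes a system whose headers define MADV_FREE (Linux gained a flag of that name much later than this thread, in kernel 4.5 per the madvise(2) manual page) and falls back to MADV_DONTNEED otherwise. The function name `madv_free_old_or_zero` is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if, after the advice call, the page reads back as either its
 * old contents or zero; 0 if it reads as anything else; -1 on error. */
int madv_free_old_or_zero(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, len);           /* the "old" data */

    int rc = -1;
#ifdef MADV_FREE
    rc = madvise(p, len, MADV_FREE);        /* lazy: freed only under pressure */
#endif
    if (rc != 0)
        rc = madvise(p, len, MADV_DONTNEED); /* fallback: always zero-fills */
    if (rc != 0) {
        munmap(p, len);
        return -1;
    }

    unsigned char v = p[0];         /* the application may see either value */
    munmap(p, len);
    return (v == 0x5A || v == 0x00) ? 1 : 0;
}
```

The caller must not rely on which of the two values it sees; that is the price of skipping the demand-zero fault when memory is not short.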
* MADV_DONTNEED 2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever 2000-03-21 1:20 ` madvise (MADV_FREE) Jamie Lokier @ 2000-03-21 1:29 ` Jamie Lokier 2000-03-22 17:04 ` MADV_DONTNEED Chuck Lever 2000-03-21 1:47 ` Extensions to mincore Jamie Lokier 2000-03-21 1:50 ` MADV flags as mmap options Jamie Lokier 3 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-21 1:29 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-mm Hi Chuck About MADV_DONTNEED ------------------- > > In particular, using the name MADV_DONTNEED is a really bad idea. It > > means completely different things on different OSes. For example your > > meaning of MADV_DONTNEED is different to BSD's: a program that assumes > > the BSD behaviour may well crash with your implementation and will > > almost certainly give invalid results if it doesn't crash. > > i'm more concerned about portability from operating systems like Solaris, > because there are many more server applications there than on *BSD that > have been designed to use these interfaces. ... > my preference is for the DU semantic of tossing dirty data instead of > flushing onto backing store, simply because that's what so many > applications expect DONTNEED to do. That's interesting. When I saw MADV_DONTNEED, I immediately assumed it was the natural counterpoint to MADV_WILLNEED. Useful even for sequential accesses, to say "my streaming window has moved beyond this point". Do you agree that a counterpoint to MADV_WILLNEED is useful? The names are so similar, I consider using MADV_DONTNEED to mean "trash this memory" quite misleading. (If there was no MADV_WILLNEED I wouldn't mind). > i'm not saying the *BSD way is wrong, but i think it would be a more > useful compromise to make *BSD functionality available via some other > interface (like MADV_ZERO). You got it the wrong way around. MADV_ZERO is more like what your implementation of MADV_DONTNEED does. The BSD behaviour is nothing like MADV_ZERO. 
BSD simply means "increment the paging priority" -- the page contents are unchanged. BSD's behaviour is the obvious counterpoint to MADV_WILLNEED afaict. > as far as i can tell, linux's msync(MS_INVALIDATE) behaves like freeBSD's > MADV_DONTNEED. Doesn't look like that. 1. MS_INVALIDATE only works on file mappings -- BSD's MADV_DONTNEED is defined (if you believe the documentation) for any mapping. 2. The msync() manual page doesn't agree with you, but I'm not sure about the implementation. The manual says: MS_INVALIDATE asks to invalidate other mappings of the same file (so that they can be updated with the fresh values just written). The implementation seems to invalidate _this_ mapping. Either way, they are different from BSD's MADV_DONTNEED. 3. Your MADV_DONTNEED does different things to msync(MS_INVALIDATE) Actually I like what MADV_DONTNEED does, but I would like it to have a different name to avoid potentially dangerous ambiguity with BSD's meaning. If Linux MADV_DONTNEED were just a hint it would be fine, but it actively trashes memory. By the way, Linux MADV_DONTNEED does some of the things msync(MS_INVALIDATE) does but not others (in the implementation -- ignore the man page). Can you explain how the two things differ? I.e., why does MS_INVALIDATE fiddle with swap cache pages. Does this indicate a bug in your MADV_DONTNEED implementation? > MADV_ZERO makes sense to me as an efficient way to zero a range of > addresses in a mapping. but i think it's useful as a *separate* function, > not as combined with, say, MADV_DONTNEED. Agreed. I mention DONTNEED only because some OS's documentation of DONTNEED appears to be equivalent to MADV_ZERO. And of course, on a mapping of /dev/zero they are equivalent. To be honest, the MADV_DONTNEED behaviour on private mappings is probably much more useful than zeroing a range anyway. You've always got read(/dev/zero) for the latter. 
enjoy, -- Jamie
* Re: MADV_DONTNEED 2000-03-21 1:29 ` MADV_DONTNEED Jamie Lokier @ 2000-03-22 17:04 ` Chuck Lever 2000-03-22 17:10 ` MADV_DONTNEED Stephen C. Tweedie 2000-03-22 17:43 ` MADV_DONTNEED Jamie Lokier 0 siblings, 2 replies; 55+ messages in thread From: Chuck Lever @ 2000-03-22 17:04 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-mm hi jamie- On Tue, 21 Mar 2000, Jamie Lokier wrote: > > > In particular, using the name MADV_DONTNEED is a really bad idea. It > > > means completely different things on different OSes. For example your > > > meaning of MADV_DONTNEED is different to BSD's: a program that assumes > > > the BSD behaviour may well crash with your implementation and will > > > almost certainly give invalid results if it doesn't crash. > > > > i'm more concerned about portability from operating systems like Solaris, > > because there are many more server applications there than on *BSD that > > have been designed to use these interfaces. > ... > > my preference is for the DU semantic of tossing dirty data instead of > > flushing onto backing store, simply because that's what so many > > applications expect DONTNEED to do. > > That's interesting. When I saw MADV_DONTNEED, I immediately assumed it > was the natural counterpoint to MADV_WILLNEED. yes, i did too. but i realized later that "will" is *not* the opposite of "dont". > Useful even for > sequential accesses, to say "my streaming window has moved beyond this > point". Do you agree that a counterpoint to MADV_WILLNEED is useful? if you look at the implementation of nopage_sequential_readahead, you'll see that it doesn't use MADV_DONTNEED, but the internal implementation of msync(MS_INVALIDATE). i'm not completely confident in this implementation, but my intent was to release behind, not discard data. so, yes, a counterpoint to WILLNEED is a good idea. perhaps that *was* the original intent of MADV_DONTNEED, but i don't see any documentation that ties WILLNEED and DONTNEED together, semantically. 
> > i'm not saying the *BSD way is wrong, but i think it would be a more > > useful compromise to make *BSD functionality available via some other > > interface (like MADV_ZERO). > > You got it the wrong way around. MADV_ZERO is more like what your > implementation of MADV_DONTNEED does. The BSD behaviour is nothing like > MADV_ZERO. BSD simply means "increment the paging priority" -- the > page contents are unchanged. > > BSD's behaviour is the obvious counterpoint to MADV_WILLNEED afaict. it is, but it's not the behavior that most applications expect. i'd like to have something like this, but it should probably be named MADV_FREE, or how about MADV_WONTNEED ? :) so we agree that both behaviors might be useful to expose to an application. the only question is what to name them. function 1 (could be MADV_DISCARD; currently MADV_DONTNEED): discard pages. if they are referenced again, the process causes page faults to read original data (zero page for anonymous maps). function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)): release pages, syncing dirty data. if they are referenced again, the process causes page faults to read in latest data. function 3 (could be MADV_ZERO): discard pages. if they are referenced again, the process sees C-O-W zeroed pages. function 4 (for comparison; currently munmap): release pages, syncing dirty data. if they are referenced again, the process causes invalid memory access faults. i'm interested to hear what big database folks have to say about this. > By the way, Linux MADV_DONTNEED does some of the things > msync(MS_INVALIDATE) does but not others (in the implementation -- > ignore the man page). > > Can you explain how the two things differ? I.e., why does MS_INVALIDATE > fiddle with swap cache pages. Does this indicate a bug in your > MADV_DONTNEED implementation? for MADV_DONTNEED, i re-used code. i'm not convinced that it's correct, though, as i stated when i submitted the patch. 
it may abandon swap cache pages, and there may be some undefined interaction between file truncation and MADV_DONTNEED. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
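As a rough illustration of the "syncing dirty data" half of function 2 above, the sketch below writes through a shared file mapping, calls msync() with MS_SYNC, and reads the byte back through the file descriptor. It deliberately does not use MS_INVALIDATE, whose exact behaviour is what the thread is disputing. The temporary file path and the function name `shared_write_visible_after_msync` are assumptions made for this example.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the byte read from the file after a write through a MAP_SHARED
 * mapping followed by msync(MS_SYNC), or -1 on error. */
int shared_write_visible_after_msync(void)
{
    char path[] = "/tmp/msync_demo_XXXXXX";   /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                   /* file disappears once fd is closed */

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;
    int result = -1;

    if (ftruncate(fd, (off_t)len) == 0) {
        unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'Z';             /* dirty the shared page */
            char c = 0;
            if (msync(p, len, MS_SYNC) == 0 && pread(fd, &c, 1, 0) == 1)
                result = c;         /* data as seen through the file */
            munmap(p, len);
        }
    }
    close(fd);
    return result;
}
```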
* Re: MADV_DONTNEED 2000-03-22 17:04 ` MADV_DONTNEED Chuck Lever @ 2000-03-22 17:10 ` Stephen C. Tweedie 2000-03-22 17:32 ` MADV_DONTNEED Jamie Lokier 2000-03-22 17:33 ` MADV_DONTNEED Jamie Lokier 2000-03-22 17:43 ` MADV_DONTNEED Jamie Lokier 1 sibling, 2 replies; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-22 17:10 UTC (permalink / raw) To: Chuck Lever; +Cc: Jamie Lokier, linux-mm, Stephen C. Tweedie Hi, On Wed, Mar 22, 2000 at 12:04:58PM -0500, Chuck Lever wrote: > > so we agree that both behaviors might be useful to expose to an > application. the only question is what to name them. > > function 1 (could be MADV_DISCARD; currently MADV_DONTNEED): > discard pages. if they are referenced again, the process causes page > faults to read original data (zero page for anonymous maps). > > function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)): > release pages, syncing dirty data. if they are referenced again, the > process causes page faults to read in latest data. > > function 3 (could be MADV_ZERO): > discard pages. if they are referenced again, the process sees C-O-W > zeroed pages. > > function 4 (for comparison; currently munmap): > release pages, syncing dirty data. if they are referenced again, the > process causes invalid memory access faults. > > i'm interested to hear what big database folks have to say about this. The requests I've seen from database vendors are specifically for function 1 above. I'd expect that they could live with function 3 too, though --- perhaps the main reason they asked for 1 is that this is what they are used to working with on some other systems (I don't know offhand of anybody who implements 3: it seems an odd thing to want to do for shared pages, and is equivalent to 1 for private mappings.) --Stephen
* Re: MADV_DONTNEED 2000-03-22 17:10 ` MADV_DONTNEED Stephen C. Tweedie @ 2000-03-22 17:32 ` Jamie Lokier 2000-03-22 17:33 ` MADV_DONTNEED Jamie Lokier 1 sibling, 0 replies; 55+ messages in thread From: Jamie Lokier @ 2000-03-22 17:32 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-mm Stephen C. Tweedie wrote: > The requests I've seen from database vendors are specifically for > function 1 above. I'd expect that they could live with function 3 > too, though --- perhaps the main reason they asked for 1 is that > this is what they are used to working with on some other systems > (I don't know offhand of anybody who implements 3: it seems an odd > thing to want to do for shared pages, and is equivalent to 1 for > private mappings.) For private file mappings, 1 and 3 are different. 1 reverts pages to the underlying object. 3 is equivalent to writing zeros over the page. It's only for /dev/zero mappings that they are the same. Probably nobody implements 3, but some documentation suggests otherwise. Digital Unix: MADV_DONTNEED Do not need these pages The system will free any whole pages in the specified region. All modifications will be lost and any swapped out pages will be discarded. Subsequent access to the region will result in a zero-fill-on-demand fault ~~~~~~~~~~~~~~~~~~~ as though it is being accessed for the first time. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Reserved swap space is not affected by this call. Clearly for non-anonymous mappings, the two underlined phrases contradict one another. Does MADV_DONTNEED on DU zero pages in private file mappings, or does it revert to the original file pages? -- Jamie
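The difference between function 1 and function 3 on a private file mapping can be checked with a small experiment. This is a sketch assuming Linux behaviour, where MADV_DONTNEED on a MAP_PRIVATE file mapping discards the private copy and the next access re-reads the file; it says nothing about what Digital Unix does. The scratch file path and the function name `dontneed_private_file_first_byte` are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* File holds 'A'; the private mapping is modified to 'B'; after
 * MADV_DONTNEED the first byte is returned (or -1 on error).
 * 'A' means "revert to the underlying object", 0 would mean "zeroed". */
int dontneed_private_file_first_byte(void)
{
    char path[] = "/tmp/madv_demo_XXXXXX";    /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);

    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;
    int result = -1;

    if (ftruncate(fd, (off_t)len) == 0 && pwrite(fd, "A", 1, 0) == 1) {
        unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 'B';             /* copy-on-write: private page now differs */
            if (madvise(p, len, MADV_DONTNEED) == 0)
                result = p[0];      /* re-faulted from the file */
            munmap(p, len);
        }
    }
    close(fd);
    return result;
}
```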
* Re: MADV_DONTNEED 2000-03-22 17:10 ` MADV_DONTNEED Stephen C. Tweedie 2000-03-22 17:32 ` MADV_DONTNEED Jamie Lokier @ 2000-03-22 17:33 ` Jamie Lokier 2000-03-22 17:37 ` MADV_DONTNEED Stephen C. Tweedie 1 sibling, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-22 17:33 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Chuck Lever, linux-mm Stephen C. Tweedie wrote: > > function 3 (could be MADV_ZERO): > > discard pages. if they are referenced again, the process sees C-O-W > > zeroed pages. Fwiw, I don't think MADV_ZERO is particularly useful. You can just read /dev/zero over that memory range. -- Jamie
* Re: MADV_DONTNEED 2000-03-22 17:33 ` MADV_DONTNEED Jamie Lokier @ 2000-03-22 17:37 ` Stephen C. Tweedie 0 siblings, 0 replies; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-22 17:37 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chuck Lever, linux-mm Hi, On Wed, Mar 22, 2000 at 06:33:07PM +0100, Jamie Lokier wrote: > > Fwiw, I don't think MADV_ZERO is particularly useful. > You can just read /dev/zero over that memory range. Exactly. --Stephen
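The "read /dev/zero over that memory range" idiom agreed on above might look like the following. It is a plain read(2) loop and makes no claim about whether the kernel services it by copying or by remapping zero pages. The helper names `zero_range_via_dev_zero` and `zero_demo` are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Overwrites [addr, addr + len) with zeros by reading from /dev/zero.
 * Returns 0 on success, -1 on failure. */
int zero_range_via_dev_zero(void *addr, size_t len)
{
    assert(addr != NULL);
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0)
        return -1;

    unsigned char *dst = addr;
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, dst + done, len - done);
        if (n < 0 && errno == EINTR)
            continue;               /* interrupted, retry */
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    close(fd);
    return 0;
}

/* Fills a buffer with non-zero bytes, zeroes it through /dev/zero and
 * returns the number of bytes that are still non-zero (-1 on error). */
int zero_demo(void)
{
    static unsigned char buf[8192];
    memset(buf, 0x7F, sizeof buf);
    if (zero_range_via_dev_zero(buf, sizeof buf) != 0)
        return -1;

    int nonzero = 0;
    for (size_t i = 0; i < sizeof buf; i++)
        nonzero += (buf[i] != 0);
    return nonzero;
}
```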
* Re: MADV_DONTNEED 2000-03-22 17:04 ` MADV_DONTNEED Chuck Lever 2000-03-22 17:10 ` MADV_DONTNEED Stephen C. Tweedie @ 2000-03-22 17:43 ` Jamie Lokier 2000-03-22 21:54 ` MADV_DONTNEED Chuck Lever 1 sibling, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-22 17:43 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-mm Chuck Lever wrote: > > That's interesting. When I saw MADV_DONTNEED, I immediately assumed it > > was the natural counterpoint to MADV_WILLNEED. > > yes, i did too. but i realized later that "will" is *not* the opposite of > "dont". Agreed. > if you look at the implementation of nopage_sequential_readahead, you'll > see that it doesn't use MADV_DONTNEED, but the internal implementation of > msync(MS_INVALIDATE). i'm not completely confident in this > implementation, but my intent was to release behind, not discard data. If I knew what msync(MS_INVALIDATE) did I could think about this! :-) But the msync documentation is unhelpful and possibly misleading. > it is, but it's not the behavior that most applications expect. i'd like > to have something like this, but it should probably be named MADV_FREE, or > how about MADV_WONTNEED ? :) I like the name MADV_WONTNEED. Thanks for thinking of it :-) With that, even keeping the name MADV_DONTNEED is ok because there is a distinction. (But I'd prefer to rename MADV_DONTNEED to MADV_DISCARD, to catch potential misuses). > function 1 (could be MADV_DISCARD; currently MADV_DONTNEED): > discard pages. if they are referenced again, the process causes page > faults to read original data (zero page for anonymous maps). I like the name MADV_DISCARD too. :-) > function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)): > release pages, syncing dirty data. if they are referenced again, the > process causes page faults to read in latest data. Oh, I see, this is what msync(MS_INVALIDATE) does :-) > function 4 (for comparison; currently munmap): > release pages, syncing dirty data. 
if they are referenced again, the > process causes invalid memory access faults. > for MADV_DONTNEED, i re-used code. From where? > i'm not convinced that it's correct, though, as i stated when i > submitted the patch. it may abandon swap cache pages, and there may > be some undefined interaction between file truncation and > MADV_DONTNEED. Oh dear -- because it's in pre2.4 already :-) Better work out what it's supposed to do and fix it :-) -- Jamie
* Re: MADV_DONTNEED 2000-03-22 17:43 ` MADV_DONTNEED Jamie Lokier @ 2000-03-22 21:54 ` Chuck Lever 2000-03-22 22:41 ` MADV_DONTNEED Jamie Lokier 0 siblings, 1 reply; 55+ messages in thread From: Chuck Lever @ 2000-03-22 21:54 UTC (permalink / raw) To: Jamie Lokier; +Cc: linux-mm On Wed, 22 Mar 2000, Jamie Lokier wrote: > > if you look at the implementation of nopage_sequential_readahead, you'll > > see that it doesn't use MADV_DONTNEED, but the internal implementation of > > msync(MS_INVALIDATE). i'm not completely confident in this > > implementation, but my intent was to release behind, not discard data. > > If I knew what msync(MS_INVALIDATE) did I could think about this! :-) > But the msync documentation is unhelpful and possibly misleading. well, the doc's accurate, as far as i can tell. but my use of it is a side-effect of the behavior described in the man page. > > function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)): > > release pages, syncing dirty data. if they are referenced again, the > > process causes page faults to read in latest data. > > Oh, I see, this is what msync(MS_INVALIDATE) does :-) more or less. it removes the mappings, but also schedules writes for any dirty pages it finds. > > function 4 (for comparison; currently munmap): > > release pages, syncing dirty data. if they are referenced again, the > > process causes invalid memory access faults. > > > for MADV_DONTNEED, i re-used code. > > From where? you can find logic that invokes zap_page_range throughout the mm code, but especially in do_munmap. if my implementation is broken in this regard, then i'd bet do_munmap is broken too. > > i'm not convinced that it's correct, though, as i stated when i > > submitted the patch. it may abandon swap cache pages, and there may > > be some undefined interaction between file truncation and > > MADV_DONTNEED. 
> > Oh dear -- because it's in pre2.4 already :-) > Better work out what it's supposed to do and fix it :-) it's not too serious, i hope, since madvise is not used by any existing Linux apps. this area of the kernel has been changing so much in the past 6-9 months that it's been difficult to know what is the blessed way to get my implementation to work. it now works in the simple cases. i'm waiting to hear about real world usage. - Chuck Lever -- corporate: <chuckl@netscape.com> personal: <chucklever@netscape.net> or <cel@monkey.org> The Linux Scalability project: http://www.citi.umich.edu/projects/linux-scalability/ -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: MADV_DONTNEED 2000-03-22 21:54 ` MADV_DONTNEED Chuck Lever @ 2000-03-22 22:41 ` Jamie Lokier 2000-03-23 19:13 ` MADV_DONTNEED James Antill 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-22 22:41 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-mm Chuck Lever wrote: > > If I knew what msync(MS_INVALIDATE) did I could think about this! :-) > > But the msync documentation is unhelpful and possibly misleading. > > well, the doc's accurate, as far as i can tell. but my use of it is a > side-effect of the behavior described in the man page. "MS_INVALIDATE asks to invalidate other mappings of the same file (so that they can be updated with the fresh values just written)." Oh I see. It means the locally modified but in principle shared mapping is copied back to the underlying object. For a page aligned mapping that shouldn't need to do anything. Since the MS_INVALIDATE code doesn't modify other ptes, we must assume the other mappings are all page aligned or they wouldn't see the update. So why does MS_INVALIDATE have any code? :-) > > > function 2 (could be MADV_FREE; currently msync(MS_INVALIDATE)): > > > release pages, syncing dirty data. if they are referenced again, the > > > process causes page faults to read in latest data. > > > > Oh, I see, this is what msync(MS_INVALIDATE) does :-) > > more or less. it removes the mappings, but also schedules writes for any > dirty pages it finds. I think "schedules writes" is what MS_ASYNC and MS_SYNC do, independently of MS_INVALIDATE. > > > function 4 (for comparison; currently munmap): > > > release pages, syncing dirty data. if they are referenced again, the > > > process causes invalid memory access faults. > > > > > for MADV_DONTNEED, i re-used code. > > > > From where? > > you can find logic that invokes zap_page_range throughout the mm code, but > especially in do_munmap. if my implementation is broken in this regard, > then i'd bet do_munmap is broken too.
do_munmap also calls vm_ops->unmap before the zap_page_range, which has potentially important side effects for files... Like actually writing the data :-) That's not what, say, MADV_DISCARD would do, but it's what "release pages, syncing dirty data" should do. > > > i'm not convinced that it's correct, though, as i stated when i > > > submitted the patch. it may abandon swap cache pages, and there may > > > be some undefined interaction between file truncation and > > > MADV_DONTNEED. > > > > Oh dear -- because it's in pre2.4 already :-) > > Better work out what it's supposed to do and fix it :-) > > it's not too serious, i hope, since madvise is not used by any existing > Linux apps. this area of the kernel has been changing so much in the past > 6-9 months that it's been difficult to know what is the blessed way to get > my implementation to work. Quite. I'm not so concerned about the implementation at this stage as getting agreement on the right semantics! -- Jamie
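For reference, the "discard" semantic (drop the pages without writing anything back) can be observed from user space. The following is a minimal sketch, assuming a present-day Linux kernel, where MADV_DONTNEED on a private anonymous mapping makes the next access see a zero-filled page; other systems discussed in this thread document different behaviour. The function name dontneed_zeroes is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a private anonymous page reads back
 * as zero after MADV_DONTNEED, 0 if the old contents survived, and -1 on
 * error.  The zero-fill result is an assumption about Linux only. */
int dontneed_zeroes(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    volatile char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int zeroed;

    if (p == MAP_FAILED)
        return -1;

    p[0] = 42;                       /* dirty the page */

    if (madvise((void *)p, len, MADV_DONTNEED) != 0) {
        munmap((void *)p, len);
        return -1;
    }

    zeroed = (p[0] == 0);            /* refault: fresh zero page on Linux */
    munmap((void *)p, len);
    return zeroed;
}
```

Nothing is written to backing store here, which is the difference from the "release pages, syncing dirty data" behaviour attributed to munmap above.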
* Re: MADV_DONTNEED 2000-03-22 22:41 ` MADV_DONTNEED Jamie Lokier @ 2000-03-23 19:13 ` James Antill 0 siblings, 0 replies; 55+ messages in thread From: James Antill @ 2000-03-23 19:13 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chuck Lever, linux-mm > Chuck Lever wrote: > > > If I knew what msync(MS_INVALIDATE) did I could think about this! :-) > > > But the msync documentation is unhelpful and possibly misleading. > > > > well, the doc's accurate, as far as i can tell. but my use of it is a > > side-effect of the behavior described in the man page. > > "MS_INVALIDATE asks to invalidate other mappings of the > same file (so that they can be updated with the fresh values > just written)." > > Oh I see. It means the locally modified but in principle shared mapping > is copied back to the underlying object. For a page aligned mapping > that shouldn't need to do anything. > > Since the MS_INVALIDATE code doesn't modify other ptes, we must assume > the other mappings are all page aligned or they wouldn't see the > update. > > So why does MS_INVALIDATE have any code? :-) I've used this in Solaris when mmap()'ing over NFS. I.e. you'd msync(MS_SYNC) on the NFS writer, and msync(MS_INVALIDATE) on the readers. The Linux documentation I have is the same as Jamie's and says _other_ mappings, but maybe that's just a typo (I'm pretty sure INVALIDATE on Solaris guaranteed that your mapping was invalid as well). -- James Antill -- james@and.org "If we can't keep this sort of thing out of the kernel, we might as well pack it up and go run Solaris." -- Larry McVoy.
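The writer/reader pattern described above might look roughly like this. It is only a sketch under assumptions: the function names are hypothetical, the demo uses two mappings of a local temporary file in one process rather than two NFS clients, and the reader passes MS_ASYNC together with MS_INVALIDATE because POSIX asks for one of MS_ASYNC or MS_SYNC to be given (the message above mentions plain MS_INVALIDATE).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Writer side: push modified pages of a shared mapping to the file. */
int writer_publish(void *map, size_t len)
{
    return msync(map, len, MS_SYNC);
}

/* Reader side: ask that cached copies be invalidated so later reads
 * fetch fresh data.  What this really does is system-specific. */
int reader_refresh(void *map, size_t len)
{
    return msync(map, len, MS_ASYNC | MS_INVALIDATE);
}

/* Hypothetical local demo: returns 0 if the reader mapping sees the
 * writer's data after publish + refresh, -1 otherwise. */
int demo_roundtrip(void)
{
    char path[] = "/tmp/msync-demo-XXXXXX";
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    int rc = -1;
    char *w, *r;

    if (fd < 0)
        return -1;
    unlink(path);                          /* file lives until close() */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    w = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    r = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (w != MAP_FAILED && r != MAP_FAILED) {
        strcpy(w, "fresh");
        if (writer_publish(w, len) == 0 &&
            reader_refresh(r, len) == 0 &&
            strcmp(r, "fresh") == 0)
            rc = 0;
    }

    if (w != MAP_FAILED)
        munmap(w, len);
    if (r != MAP_FAILED)
        munmap(r, len);
    close(fd);
    return rc;
}
```

On a single machine both mappings share the page cache, so the refresh step is not what makes the data visible here; the demo only shows the call sequence.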
* Extensions to mincore 2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever 2000-03-21 1:20 ` madvise (MADV_FREE) Jamie Lokier 2000-03-21 1:29 ` MADV_DONTNEED Jamie Lokier @ 2000-03-21 1:47 ` Jamie Lokier 2000-03-21 9:11 ` Eric W. Biederman 2000-03-21 1:50 ` MADV flags as mmap options Jamie Lokier 3 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-21 1:47 UTC (permalink / raw) To: Chuck Lever; +Cc: linux-mm > > [Aside: is there the possibility to have mincore return the "!accessed" > > and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned > > bytes? I can imagine a bunch of garbage collection algorithms that > > could make good use of those bits. Currently some GC systems mprotect() > > regions and unprotect them on SEGV -- simply reading the !dirty status > > would obviously be much simpler and faster.] > > you could add that; the question is how to do it while not breaking > applications that do this: > > if (!byte) { > page not present > } > > rather than checking the LSB specifically. The comment says: The status is returned in a vector of bytes. The least significant bit of each byte is 1 if the referenced page is in memory, otherwise it is zero. Solaris (SunOS 5.6) extends this with: The settings of other bits in each character are undefined and may contain other information in future implementations. So I think you're quite safe extending the information. > i think using "dirty" instead of "!dirty" would help. In a GC system you're looking to skip pages which are "definitely clean". "Definitely dirty" isn't very interesting, however "maybe dirty" is. Given that the default value from mincore is 0 (say for an older kernel), it should mean "maybe dirty". Hence !dirty. > the "accessed" bit is only used by the shrink_mmap logic to "time out" > a page as memory gets short; i'm not sure that's a semantic that is > useful to a user-level garbarge collector? and it probably isn't very > portable. 
For a garbage collector that can move objects, it has uses in suggesting how to efficiently repack objects, to reduce the resident set size of the process. There are also a number of user-space paging systems (e.g. one was once proposed for the special relocated .exe mappings in Wine), which would benefit from this information the same way as the kernel does. You could indicate that these values are "exact" by another bit which is always set if you are able to provide dirty and accessed bits. Then the polarity doesn't really matter. -- Jamie
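The compatibility concern in this exchange is whether callers test the whole byte or only the least significant bit. A minimal sketch of the bit-0 form, assuming Linux's mincore(2) on an anonymous private mapping; the helper name count_resident is hypothetical, and no extra status bits are defined by the interface as it stands.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map npages anonymous pages, touch the first
 * ntouch of them, and return how many pages mincore() reports as
 * resident.  Returns -1 on error. */
long count_resident(size_t npages, size_t ntouch)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * psz;
    size_t i;
    long resident = -1;
    unsigned char *vec = malloc(npages);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED || vec == NULL)
        goto out;
    if (ntouch > npages)
        ntouch = npages;

    for (i = 0; i < ntouch; i++)
        p[i * psz] = 1;              /* fault the page in */

    if (mincore(p, len, vec) == 0) {
        resident = 0;
        for (i = 0; i < npages; i++)
            if (vec[i] & 1)          /* test bit 0 only, not "if (vec[i])" */
                resident++;
    }

out:
    if (p != MAP_FAILED)
        munmap(p, len);
    free(vec);
    return resident;
}
```

Code written this way keeps working if other bits of each byte ever carry extra information, which is the case the Solaris wording quoted above allows for.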
* Re: Extensions to mincore 2000-03-21 1:47 ` Extensions to mincore Jamie Lokier @ 2000-03-21 9:11 ` Eric W. Biederman 2000-03-21 9:40 ` lars brinkhoff 2000-03-21 11:34 ` Stephen C. Tweedie 0 siblings, 2 replies; 55+ messages in thread From: Eric W. Biederman @ 2000-03-21 9:11 UTC (permalink / raw) To: Jamie Lokier; +Cc: Chuck Lever, linux-mm Jamie Lokier <jamie.lokier@cern.ch> writes: > > > [Aside: is there the possibility to have mincore return the "!accessed" > > > and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned > > > bytes? I can imagine a bunch of garbage collection algorithms that > > > could make good use of those bits. Currently some GC systems mprotect() > > > regions and unprotect them on SEGV -- simply reading the !dirty status > > > would obviously be much simpler and faster.] No it wouldn't. Dirty kernel wise means the page needs to be swapped out. Clean kernel wise mean the page is in the swap cache, and hasn't been written since it was swapped in. Dirty GC wise the page has changes since the last GC pass over it. It is very easy to conceive of a case where a dirty GC'd page swapped out, and then swapped in before someone got to looking at it. So kernel Clean/Dirty has no connection with GC Clean/Dirty. Please, please don't mess with this for a 2.4 timeframe. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: Extensions to mincore 2000-03-21 9:11 ` Eric W. Biederman @ 2000-03-21 9:40 ` lars brinkhoff 2000-03-21 11:34 ` Stephen C. Tweedie 1 sibling, 0 replies; 55+ messages in thread From: lars brinkhoff @ 2000-03-21 9:40 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Jamie Lokier, Chuck Lever, linux-mm "Eric W. Biederman" wrote: > Jamie Lokier <jamie.lokier@cern.ch> writes: > > > > [Aside: is there the possibility to have mincore return the "!accessed" > > > > and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned > > > > bytes? I can imagine a bunch of garbage collection algorithms that > > > > could make good use of those bits. Currently some GC systems mprotect() > > > > regions and unprotect them on SEGV -- simply reading the !dirty status > > > > would obviously be much simpler and faster.] > > Dirty kernel wise means the page needs to be swapped out. Clean kernel > wise mean the page is in the swap cache, and hasn't been written > since it was swapped in. > > Dirty GC wise the page has changes since the last GC pass over it. > > It is very easy to conceive of a case where a dirty GC'd page swapped > out, and then swapped in before someone got to looking at it. So > kernel Clean/Dirty has no connection with GC Clean/Dirty. > > Please, please don't mess with this for a 2.4 timeframe. For user-space paging, it would be great to know the kernel sense of the clean/dirty status of pages. Perhaps something to be considered for 2.5. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: Extensions to mincore 2000-03-21 9:11 ` Eric W. Biederman 2000-03-21 9:40 ` lars brinkhoff @ 2000-03-21 11:34 ` Stephen C. Tweedie 2000-03-21 15:15 ` Jamie Lokier 1 sibling, 1 reply; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-21 11:34 UTC (permalink / raw) To: Eric W. Biederman; +Cc: Jamie Lokier, Chuck Lever, linux-mm Hi, On Tue, Mar 21, 2000 at 03:11:16AM -0600, Eric W. Biederman wrote: > Jamie Lokier <jamie.lokier@cern.ch> writes: > > > > > [Aside: is there the possibility to have mincore return the "!accessed" > > > > and "!dirty" bits of each page, perhaps as bits 1 and 2 of the returned > > > > bytes? I can imagine a bunch of garbage collection algorithms that > > > > could make good use of those bits. Currently some GC systems mprotect() > > > > regions and unprotect them on SEGV -- simply reading the !dirty status > > > > would obviously be much simpler and faster.] > > Dirty kernel wise means the page needs to be swapped out. Clean kernel > wise mean the page is in the swap cache, and hasn't been written > since it was swapped in. Worse than that, returning dirty status bits in mincore() just wouldn't work for threads. mincore() is a valid optimisation when you just treat it as a hint: if a page gets swapped out between calling mincore() and using the page, nothing breaks, you just get an extra page fault. The same is not true for the sort of garbage collection or distributed memory mechanisms which use mprotect(). If you find that a page is clean via mincore() and discard the data based on that, there is nothing to stop another thread from dirtying the data after the mincore() and losing its modification. mprotect() has the advantage of holding page table locks so it can do an atomic read-modify-write on the page table entries. Without that locking, you just can't reliably use dirty/accessed information. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
* Re: Extensions to mincore 2000-03-21 11:34 ` Stephen C. Tweedie @ 2000-03-21 15:15 ` Jamie Lokier 2000-03-21 15:41 ` Stephen C. Tweedie 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-21 15:15 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Eric W. Biederman, Chuck Lever, linux-mm Eric W. Biederman wrote: > > > > [Aside: is there the possibility to have mincore return the > > > > "!accessed" and "!dirty" bits of each page, perhaps as bits 1 > > > > and 2 of the returned bytes? I can imagine a bunch of garbage > > > > collection algorithms that could make good use of those bits. > > > > Currently some GC systems mprotect() regions and unprotect them > > > > on SEGV -- simply reading the !dirty status would obviously be > > > > much simpler and faster.] > > No it wouldn't. Yes it would. > Dirty kernel wise means the page needs to be swapped out. Clean kernel > wise mean the page is in the swap cache, and hasn't been written > since it was swapped in. > > Dirty GC wise the page has changes since the last GC pass over it. Of course, I thought that was obvious :-) You're right, that for GC the "!dirty" bit has to mean "since the last time we called mincore". To get the correct behaviour without maintaining extra state in the kernel (apart from a bit or two per struct page), you'd say that mincore returns "!dirty since the last time _anyone_ called mincore on this page", and you'd disallow it for shared mappings. It works for threads too. All threads sharing a page have to synchronise their mincore calls for that page, but that situation is no different to the SEGV method: all threads have to synchronise with the information collected from that, too. Stephen C. Tweedie wrote: > Worse than that, returning dirty status bits in mincore() just wouldn't > work for threads. 
mincore() is a valid optimisation when you just treat > it as a hint: if a page gets swapped out between calling mincore() and > using the page, nothing breaks, you just get an extra page fault. [Aside: I regard this as a bug. mincore() should have an option to set the accessed bit on each page that is in core, to avoid the "just missed" condition. If it sets the accessed bit, then under most circumstances the just missed condition will never happen. If it does not (it doesn't now), the just missed condition will always happen sometimes under the slightest non-zero paging load. The difference for an application that does "call mincore; if not in core, spawn thread to pull in page" under low system load will be between no stalls and occasional stalls. Thus mincore() is missing a flag parameter IMO] > The same is not true for the sort of garbage collection or distributed > memory mechanisms which use mprotect(). If you find that a page is clean > via mincore() and discard the data based on that, there is nothing to > stop another thread from dirtying the data after the mincore() and losing > its modification. In general, you have to be very careful about what you allow other threads to modify during GC. For a full collection, some kind of synchronisation point with everyone is usually required. (Disclaimer: I am not a GC expert so if you know of GC mechanisms that use mprotect and don't require threads to be synchronised, please speak up!) 1. Stop all the other threads, copy the state of their roots (i.e. processor registers, individual stack roots), call mprotect(), restart the threads, and let SEGVs mprotect() pages back to writable status while putting them on a list. Watch out for concurrent SEGVs on the same page! Disadvantage: lots of SEGV handling, SEGV code is processor specific (until siginfo is reliable), lots of individual page mprotect calls, lots of vmas, page fault slowdown even for non-GC-using threads due to all the tiny vmas. 1a. 
Using mincore(): call mincore() instead of mprotect() in method 1. Threads are stopped so it just works :-) Advantage: everything runs faster and the code is more portable (among Linux systems). 2. Method 1 has a large mprotect() call. Quite apart from the slowness of all that mprotect/SEGV processing, the single large mprotect may take a while during which all threads are blocked, and it also prevents any threads not involved in GC from faulting. (As you say, it grabs the page table lock). You can call mprotect() first to protect the GC arena, with threads still running. At this point, you're _not_ using it to collect dirty page information. When mprotect() returns, you synchronise all threads to gather local GC roots, and then start collecting dirty page info via SEGVs. If a thread gets a SEGV before the synchronisation point, it is blocked until the synchronisation point. In this way, threads not writing to the arena don't get stopped for long even if mprotect() itself takes a long time. 2a. Method 2 using mincore(). Now you do do mprotect() at the beginning -- remember it is not for collecting dirty page info here, but for blocking threads writing to the arena while permitting others to continue. After synchronisation, call mincore() and then mprotect() to make the entire arena writable. Then restart all blocked threads. Any SEGVs from the start of the first mprotect() to the end of the second one block the faulting thread prior to synchronisation; any that block are restarted afterwards. Obviously there are plenty of other ways to arrange this, with multiple arenas etc. But I hope you can see that mincore() can be used reliably without requiring the overhead of individual-page mprotect and SEGVs. > mprotect() has the advantage of holding page table locks so it can do > an atomic read-modify-write on the page table entries. Without that > locking, you just can't reliably use dirty/accessed information. 
mprotect() has the major disadvantage of creating a million tiny vmas when you are using it to track dirty pages. And as far as I can see, mprotect/SEGV gives no advantage over the dirty bit method: in both cases, you always need synchronisation points between threads to share the dirty page information. mprotect has another disadvantage: it holds the page table lock. Great for atomic operations; terrible when you do a large mprotect and you _don't_ want to stop concurrent threads (that are not using the GC arena) from page faulting their stuff. Interestingly, neither GC synchronisation method I described depends on mprotect() being atomic w.r.t. the whole protection change, and method 2 would actually benefit from concurrent page faults being allowed during the mprotect(). The atomicity you mention is important. Consider this implementation: 1. Only private mappings allowed. 2. A page is considered dirty "since the last mincore call" if the pte dirty bit is set, or if a struct page flag PageMincoreDirty is set. To read this, you must atomically read and clear the pte's dirty bit. (Not difficult on x86 or any UP system; I'm not sure about other SMP systems). mincore() calls are assumed to be protected w.r.t. each other. -- Jamie
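For readers unfamiliar with the mprotect/SEGV technique that method 1 refers to, here is a rough single-threaded sketch. Everything in it is an assumption for illustration: the arena_* names, the fixed 16-page arena, and calling mprotect() from inside the signal handler (which works on Linux but is not guaranteed async-signal-safe by POSIX). It does not stop other threads or guard against concurrent faults on the same page, both of which the message above says a real collector must handle.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define ARENA_PAGES 16

static char *arena;
static size_t page_size;
static volatile sig_atomic_t dirty[ARENA_PAGES];

/* First write to a protected page lands here: record it, unprotect it. */
static void on_segv(int sig, siginfo_t *si, void *ctx)
{
    uintptr_t addr = (uintptr_t)si->si_addr;
    uintptr_t base = (uintptr_t)arena;
    size_t idx;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + ARENA_PAGES * page_size)
        _exit(1);                    /* a genuine crash, not our arena */

    idx = (addr - base) / page_size;
    dirty[idx] = 1;
    mprotect(arena + idx * page_size, page_size, PROT_READ | PROT_WRITE);
}

/* Map the arena and install the handler.  Returns 0 on success. */
int arena_init(void)
{
    struct sigaction sa;

    page_size = (size_t)sysconf(_SC_PAGESIZE);
    arena = mmap(NULL, ARENA_PAGES * page_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED)
        return -1;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Begin a tracking interval: forget old results, write-protect it all. */
int arena_protect(void)
{
    size_t i;

    for (i = 0; i < ARENA_PAGES; i++)
        dirty[i] = 0;
    return mprotect(arena, ARENA_PAGES * page_size, PROT_READ);
}

/* Address of the first byte of page idx. */
char *arena_page(size_t idx)
{
    return arena + idx * page_size;
}

/* Number of pages written since the last arena_protect(). */
int arena_dirty_count(void)
{
    int n = 0;
    size_t i;

    for (i = 0; i < ARENA_PAGES; i++)
        n += dirty[i] ? 1 : 0;
    return n;
}
```

Each dirtied page costs one signal and one single-page mprotect() call, and each of those calls can split a vma, which is the overhead being objected to above.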
* Re: Extensions to mincore 2000-03-21 15:15 ` Jamie Lokier @ 2000-03-21 15:41 ` Stephen C. Tweedie 2000-03-21 15:55 ` Jamie Lokier 0 siblings, 1 reply; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-21 15:41 UTC (permalink / raw) To: Jamie Lokier; +Cc: Stephen C. Tweedie, Eric W. Biederman, Chuck Lever, linux-mm On Tue, Mar 21, 2000 at 04:15:07PM +0100, Jamie Lokier wrote: > > Dirty GC wise the page has changes since the last GC pass over it. > > Of course, I thought that was obvious :-) > > You're right, that for GC the "!dirty" bit has to mean "since the last > time we called mincore". And that information is not maintained anywhere. In fact, it basically _can't_ be maintained, since the hardware only maintains one bit and we already use that dirty bit. The only way round this is to use mprotect-style munging. > All threads sharing a page have to synchronise their mincore calls for > that page, but that situation is no different to the SEGV method: all > threads have to synchronise with the information collected from that, > too. It's not about synchronising between mincore calls, it's about synchronising mincore calls on one CPU with direct memory references modifying page tables on another CPU. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: Extensions to mincore 2000-03-21 15:41 ` Stephen C. Tweedie @ 2000-03-21 15:55 ` Jamie Lokier 2000-03-21 16:08 ` Stephen C. Tweedie 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-21 15:55 UTC (permalink / raw) To: Stephen C. Tweedie Cc: Stephen C. Tweedie, Eric W. Biederman, Chuck Lever, linux-mm Stephen C. Tweedie wrote: > > You're right, that for GC the "!dirty" bit has to mean "since the last > > time we called mincore". > > And that information is not maintained anywhere. In fact, it basically > _can't_ be maintained, since the hardware only maintains one bit and > we already use that dirty bit. The only way round this is to use > mprotect-style munging. Didn't you read a few paragraphs down, where I explain how to implement this? You've got struct page. It is enough for private mappings, and we don't need this feature for shared mappings. > > All threads sharing a page have to synchronise their mincore calls for > > that page, but that situation is no different to the SEGV method: all > > threads have to synchronise with the information collected from that, > > too. > > It's not about synchronising between mincore calls, it's about > synchronising mincore calls on one CPU with direct memory references > modifying page tables on another CPU. Note, for both GC synchronisation methods I described, the mincore() call does not happen concurrently with other processors updating the page flags. In the first case all threads accessing the GC arena are blocked, and in the second the entire area is write-protected during the mincore() call. So the synchronisation you say isn't possible isn't a required feature. (I know it's quite easy on x86, but probably not some other CPUs). It would be enough the say "the mincore accessed/dirty bits are not guaranteed to be accurate if pages are accessed by concurrent threads during the mincore call". 
-- Jamie
* Re: Extensions to mincore 2000-03-21 15:55 ` Jamie Lokier @ 2000-03-21 16:08 ` Stephen C. Tweedie 2000-03-21 16:48 ` Jamie Lokier 0 siblings, 1 reply; 55+ messages in thread From: Stephen C. Tweedie @ 2000-03-21 16:08 UTC (permalink / raw) To: Jamie Lokier; +Cc: Eric W. Biederman, Chuck Lever, linux-mm Hi, On Tue, Mar 21, 2000 at 04:55:32PM +0100, Jamie Lokier wrote: > > Didn't you read a few paragraphs down, where I explain how to implement > this? You've got struct page. It is enough for private mappings, and > we don't need this feature for shared mappings. Umm, yes, but just saying "we'll solve synchronisation problems by stopping all the other threads" hardly seems like a "solution" to me: more of a workaround of the problem! mprotect() does work correctly without stopping other threads. > It would be enough the say "the mincore accessed/dirty bits are not > guaranteed to be accurate if pages are accessed by concurrent threads > during the mincore call". Exactly why you need mprotect, which _does_ make the necessary guarantees. Oh, and suggesting that we can obtain the dirty bit by assuming all mappings are private doesn't work either. Private mappings *need* a per-pte (NOT per-page, but per-pte) dirty bit to distinguish between pages shared with the underlying mapped object, and pages which have been modified by the local process. --Stephen -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: Extensions to mincore 2000-03-21 16:08 ` Stephen C. Tweedie @ 2000-03-21 16:48 ` Jamie Lokier 2000-03-22 7:36 ` Eric W. Biederman 0 siblings, 1 reply; 55+ messages in thread From: Jamie Lokier @ 2000-03-21 16:48 UTC (permalink / raw) To: Stephen C. Tweedie; +Cc: Eric W. Biederman, Chuck Lever, linux-mm Stephen C. Tweedie wrote: > > Didn't you read a few paragraphs down, where I explain how to implement > > this? You've got struct page. It is enough for private mappings, and > > we don't need this feature for shared mappings. > > Umm, yes, but just saying "we'll solve synchronisation problems by > stopping all the other threads" hardly seems like a "solution" to me: > more of a workaround of the problem! mprotect() does work correctly > without stopping other threads. It is a limitation on mincore (at present). But I haven't thought of a GC implementation that will work without synchronising the threads anyway. So the limitation may not be a problem for GC, and only GC would use this feature. That said, the synchronisation issue is really separate from the dirty page issue. They're orthogonal. There's no reason why mincore should not have an option to synchronise with other processors, in just the same way that mprotect does. User space SEGV processing is horrible, per-page mprotect() write-enabling is slow and a resource hog, and the mprotect works on vmas instead of pages unfortunately so you get zillions of vmas. zillions of vmas isn't good. Try cat /proc/self/maps when you have 25000 entries :-) Oops, I also forgot to mention that each per-page mprotect to write-enable the page on SEGV causes horrendous SMP behaviour too. > > It would be enough to say "the mincore accessed/dirty bits are not > > guaranteed to be accurate if pages are accessed by concurrent threads > > during the mincore call". > > Exactly why you need mprotect, which _does_ make the necessary > guarantees. It does so with utterly sucking performance too.
And not because of the synchronisation -- but because you need 2500 separate mprotect calls and to handle 2500 SEGV signals to detect that 10MB of pages have been dirtied between GC runs. mincore() can gather that info in one relatively fast system call. It does have synchronisation issues -- on _some_ architectures. But they can be either documented (where they may not be a problem for GC), or explicit synchronisation can be added for architectures that need it. > Oh, and suggesting that we can obtain the dirty bit by assuming all > mappings are private doesn't work either. Private mappings *need* a > per-pte (NOT per-page, but per-pte) dirty bit to distinguish between > pages shared with the underlying mapped object, and pages which have > been modified by the local process. For private mappings, any page pointing to the underlying mapped object is by definition clean. That's easy enough to check. Any other page has either a struct page or a swap entry that's local to its pte. So the mincore-dirty flag can be stored in the struct page or the swap entry. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
* Re: Extensions to mincore 2000-03-21 16:48 ` Jamie Lokier @ 2000-03-22 7:36 ` Eric W. Biederman 0 siblings, 0 replies; 55+ messages in thread From: Eric W. Biederman @ 2000-03-22 7:36 UTC (permalink / raw) To: Jamie Lokier; +Cc: Stephen C. Tweedie, Chuck Lever, linux-mm Jamie Lokier <jamie.lokier@cern.ch> writes: > Stephen C. Tweedie wrote: > > > Didn't you read a few paragraphs down, where I explain how to implement > > > this? You've got struct page. It is enough for private mappings, and > > > we don't need this feature for shared mappings. > > > > Umm, yes, but just saying "we'll solve synchronisation problems by > > stopping all the other threads" hardly seems like a "solution" to me: > > more of a workaround of the problem! mprotect() does work correctly > > without stopping other threads. > > It is a limitation on mincore (at present). > > But I haven't thought of a GC implementation that will work without > synchronising the threads anyway. So the limitation may not be a > problem for GC, and only GC would use this feature. Nope. In dosemu we do the mprotect style of munging with mappings as well. This allows us to detect which parts of a virtual frame buffer have been changed pretty cheaply. I think it is actually implemented with mmap & munmap though. Same story.... Doing mprotect tricks in a GC algorithm is actually a pretty stupid way to go. Upon occasion it might be the only solution where you can't get in and modify the code the GC algorithm is cooperating with. But it still won't work great. And only the slower GC algorithms, that need backwards compatibility with languages like C. Anyway as you have mentioned to make this work you have to add additional state from what is already kept, and it isn't clear exactly what would make efficient use of this state. I won't argue that in the long run this is a bad idea. But in the short run of the upcoming 2.4, I see no clear win.
For a GC that works with a SMP threaded heap you should never need to do that crap anyway. You have the cost of the write lock per object or group of objects anyway. And it shouldn't be hard to instrument the lock acquiring paths to mark the object dirty as well. > User space SEGV processing is horrible, per-page mprotect() > write-enabling is slow and a resource hog, and the mprotect works on > vmas instead of pages unfortunately so you get zillions of vmas. > zillions of vmas isn't good. Try cat /proc/self/maps when you have > 25000 entries :-) That's at least 97 meg of RAM being managed, and given that we combine adjacent vmas with the same permissions probably a lot more. While not unthinkable I suspect that is a pretty unlikely case. > Oops, I also forgot to mention that each per-page mprotect to > write-enable the page on SEGV causes horrendous SMP behaviour too. > > > It would be enough to say "the mincore accessed/dirty bits are not > > > guaranteed to be accurate if pages are accessed by concurrent threads > > > during the mincore call". > > > > Exactly why you need mprotect, which _does_ make the necessary > > guarantees. > > It does so with utterly sucking performance too. And not because of the > synchronisation -- but because you need 2500 separate mprotect calls and > to handle 2500 SEGV signals to detect that 10MB of pages have been > dirtied between GC runs. > > mincore() can gather that info in one relatively fast system call. mincore has to use exactly the same implementation except it might be able to get lucky, and not need to juggle vmas. In which case it probably makes more sense to figure out how to store the page writeable flag in the page table of a swapped out page so mprotect does not need to break vmas.... All GC's that use mprotect & co will have sucky performance period. They are definitely compromise solutions. > It does have synchronisation issues -- on _some_ architectures.
But > they can be either documented (where they may not be a problem for GC), > or explicit synchronisation can be added for architectures that need it. > > > Oh, and suggesting that we can obtain the dirty bit by assuming all > > mappings are private doesn't work either. Private mappings *need* a > > per-pte (NOT per-page, but per-pte) dirty bit to distinguish between > > pages shared with the underlying mapped object, and pages which have > > been modified by the local process. > > For private mappings, any page pointing to the underlying mapped object > is by definition clean. That's easy enough to check. > > Any other page has either a struct page or a swap entry that's local to > its pte. So the mincore-dirty flag can be stored in the struct page or > the swap entry. Again if you must please look at optimising mprotect. If we can find 3 bits in a pte of a swapped out page we don't need to split the vma's. Nor do we need to change existing applications. Plus the shared case is handled as well. At the cost of a slightly higher miss penalty for a page. That sound like a much more reasonable thing to do then what you are proposing now. Please feel free to tell me I'm an idiot but I think I just stumbled upon a pretty decent idea. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux.eu.org/Linux-MM/ ^ permalink raw reply [flat|nested] 55+ messages in thread
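
For readers following the argument, the mprotect-and-SEGV scheme that both sides criticise can be sketched roughly as below. This is only a minimal illustration, assuming Linux and a single-threaded caller; the names track_start, track_count_dirty, track_on_segv and TRACK_MAX_PAGES are hypothetical and do not come from the thread or from any GC. It write-protects a region once, then takes one SEGV plus one per-page mprotect() for every page that gets written, which is the per-page cost Jamie describes. Calling mprotect() from a signal handler is not formally async-signal-safe under POSIX, although it is commonly relied on for this purpose.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define TRACK_MAX_PAGES 1024    /* arbitrary cap for this sketch */

static unsigned char *track_base;   /* start of the tracked region */
static size_t track_pages;          /* number of tracked pages */
static size_t track_page_size;
static volatile unsigned char track_dirty[TRACK_MAX_PAGES];

/* SEGV handler: record the faulting page as dirty, then write-enable
 * that one page so the faulting store is retried and succeeds. */
static void track_on_segv(int sig, siginfo_t *info, void *ctx)
{
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)track_base;
    size_t idx;

    (void)sig;
    (void)ctx;
    if (addr < base || addr >= base + track_pages * track_page_size)
        _exit(1);               /* a genuine crash, not a tracked page */
    idx = (addr - base) / track_page_size;
    track_dirty[idx] = 1;
    if (mprotect(track_base + idx * track_page_size, track_page_size,
                 PROT_READ | PROT_WRITE) != 0)
        _exit(1);
}

/* Write-protect `pages` pages starting at the page-aligned `base` and
 * install the handler.  Returns 0 on success, -1 on failure. */
int track_start(void *base, size_t pages)
{
    struct sigaction sa;
    size_t i;

    assert(base != NULL);
    if (pages > TRACK_MAX_PAGES)
        return -1;
    track_page_size = (size_t)sysconf(_SC_PAGESIZE);
    track_base = base;
    track_pages = pages;
    for (i = 0; i < pages; i++)
        track_dirty[i] = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = track_on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return mprotect(base, pages * track_page_size, PROT_READ);
}

/* Number of pages written to since track_start(). */
size_t track_count_dirty(void)
{
    size_t i, n = 0;

    for (i = 0; i < track_pages; i++)
        n += track_dirty[i];
    return n;
}
```

A collector built this way would call track_start() after each collection and read the dirty array at the next one. Every first write to a page costs a signal delivery and a system call, and each per-page mprotect() may split a vma, which is the behaviour the /proc/self/maps remark refers to.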
* MADV flags as mmap options
  2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever
                     ` (2 preceding siblings ...)
  2000-03-21  1:47   ` Extensions to mincore Jamie Lokier
@ 2000-03-21  1:50   ` Jamie Lokier
From: Jamie Lokier @ 2000-03-21 1:50 UTC
To: Chuck Lever; +Cc: linux-mm

While we're here :-)

It seems to me that a lot of the time, madvise() will be called
immediately after mmap() on the same region.

How about making the MADV_ flags distinct from the MAP_ flags, and
arranging that you may pass MADV_ flags to mmap()?  If it sees any, it
does the mapping and follows it by the corresponding madvise_vma call.

(Only really useful for MADV_RANDOM and MADV_SEQUENTIAL).

-- Jamie
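
The two-call pattern that this proposal would fold into mmap() can be approximated in user space today. The sketch below assumes a POSIX-like system that provides madvise(); the wrapper name mmap_with_advice is hypothetical, and this is not the in-kernel change being suggested, only the equivalent sequence of calls.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical wrapper: create a mapping and immediately apply `advice`
 * (for example MADV_SEQUENTIAL or MADV_RANDOM) to the whole of it.
 * Returns the mapping, or MAP_FAILED if mmap() itself fails. */
void *mmap_with_advice(size_t len, int prot, int flags, int fd, int advice)
{
    void *p;

    assert(len > 0);            /* mmap() rejects zero-length mappings */
    p = mmap(NULL, len, prot, flags, fd, 0);
    if (p == MAP_FAILED)
        return MAP_FAILED;
    /* The advice is only a hint, so a failure here is not treated as
     * fatal: the mapping is still valid without it. */
    (void)madvise(p, len, advice);
    return p;
}
```

A caller that scans a file front to back might use mmap_with_advice(len, PROT_READ, MAP_PRIVATE, fd, MADV_SEQUENTIAL). Folding the advice into mmap() itself, as proposed, would save the second system call but would require the MADV_ and MAP_ values not to overlap.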
Thread overview: 55+ messages
[not found] <20000320135939.A3390@pcep-jamie.cern.ch>
2000-03-20 19:09 ` MADV_SPACEAVAIL and MADV_FREE in pre2-3 Chuck Lever
2000-03-21 1:20 ` madvise (MADV_FREE) Jamie Lokier
2000-03-21 2:24 ` William J. Earl
2000-03-21 14:08 ` Jamie Lokier
2000-03-22 16:24 ` Chuck Lever
2000-03-22 18:05 ` Jamie Lokier
2000-03-22 21:39 ` Chuck Lever
2000-03-22 22:31 ` Jamie Lokier
2000-03-22 22:44 ` Stephen C. Tweedie
2000-03-23 18:53 ` Chuck Lever
2000-03-24 0:00 ` /dev/recycle Jamie Lokier
2000-03-24 9:14 ` /dev/recycle Christoph Rohland
2000-03-24 13:10 ` /dev/recycle Jamie Lokier
2000-03-24 13:54 ` /dev/recycle Christoph Rohland
2000-03-24 14:17 ` /dev/recycle Jamie Lokier
2000-03-24 17:40 ` /dev/recycle Christoph Rohland
2000-03-24 18:13 ` /dev/recycle Jamie Lokier
2000-03-25 8:35 ` /dev/recycle Christoph Rohland
2000-03-28 0:48 ` /dev/recycle Chuck Lever
2000-03-24 0:21 ` madvise (MADV_FREE) Jamie Lokier
2000-03-24 7:21 ` lars brinkhoff
2000-03-24 17:42 ` Jeff Dike
2000-03-24 16:49 ` Jamie Lokier
2000-03-24 17:08 ` Stephen C. Tweedie
2000-03-24 19:58 ` Jeff Dike
2000-03-25 0:30 ` Stephen C. Tweedie
2000-03-22 22:33 ` Stephen C. Tweedie
2000-03-22 22:45 ` Jamie Lokier
2000-03-22 22:48 ` Stephen C. Tweedie
2000-03-22 22:55 ` Q. about swap-cache orphans Jamie Lokier
2000-03-22 22:58 ` Stephen C. Tweedie
2000-03-22 18:15 ` madvise (MADV_FREE) Christoph Rohland
2000-03-22 18:30 ` Jamie Lokier
2000-03-23 16:56 ` Christoph Rohland
2000-03-21 1:29 ` MADV_DONTNEED Jamie Lokier
2000-03-22 17:04 ` MADV_DONTNEED Chuck Lever
2000-03-22 17:10 ` MADV_DONTNEED Stephen C. Tweedie
2000-03-22 17:32 ` MADV_DONTNEED Jamie Lokier
2000-03-22 17:33 ` MADV_DONTNEED Jamie Lokier
2000-03-22 17:37 ` MADV_DONTNEED Stephen C. Tweedie
2000-03-22 17:43 ` MADV_DONTNEED Jamie Lokier
2000-03-22 21:54 ` MADV_DONTNEED Chuck Lever
2000-03-22 22:41 ` MADV_DONTNEED Jamie Lokier
2000-03-23 19:13 ` MADV_DONTNEED James Antill
2000-03-21 1:47 ` Extensions to mincore Jamie Lokier
2000-03-21 9:11 ` Eric W. Biederman
2000-03-21 9:40 ` lars brinkhoff
2000-03-21 11:34 ` Stephen C. Tweedie
2000-03-21 15:15 ` Jamie Lokier
2000-03-21 15:41 ` Stephen C. Tweedie
2000-03-21 15:55 ` Jamie Lokier
2000-03-21 16:08 ` Stephen C. Tweedie
2000-03-21 16:48 ` Jamie Lokier
2000-03-22 7:36 ` Eric W. Biederman
2000-03-21 1:50 ` MADV flags as mmap options Jamie Lokier