* Re: vma_merge issue
       [not found] <a1b36c3a0908101347t796dedbat2ecb0535c32f325b@mail.gmail.com>
From: Hugh Dickins @ 2009-08-12 18:26 UTC
To: Bill Speirs; +Cc: Nick Piggin, linux-kernel, linux-mm

On Mon, 10 Aug 2009, Bill Speirs wrote:
>
> I came across an issue where adjacent pages are not properly coalesced
> together when changing protections on them. This can be shown by doing
> the following:
>
> 1) Map 3 pages with PROT_NONE and MAP_PRIVATE | MAP_ANONYMOUS
> 2) Set the middle page's protection to PROT_READ | PROT_WRITE
> 3) Set the middle page's protection back to PROT_NONE
>
> You are left with 3 entries in /proc/self/maps where you should only
> have 1. If you only change the protection to PROT_READ in step 2, then
> it is properly merged together. I noticed in mprotect.c the following
> comment in the function mprotect_fixup; I'm not sure if it applies or
> not:
>     /*
>      * If we make a private mapping writable we increase our commit;
>      * but (without finer accounting) cannot reduce our commit if we
>      * make it unwritable again.
[ the following lines of the comment are not relevant here so I'll delete ]
>      */
>
> I think this only applies to setting charged = nrpages; however,
> VM_ACCOUNT is also added to newflags. Could it be that the adjacent
> blocks don't have VM_ACCOUNT and so the call to vma_merge cannot merge
> because the flags for the adjacent vma are not the same?

That's right, and it is working as intended.

To allow people to set up enormous PROT_READ,MAP_PRIVATE mappings
"for free", we don't account those initially, but only as parts
are mprotected writable later: at that point they're accounted,
and marked VM_ACCOUNT so that we know it's been done (and don't
double account later on).

So your middle page has been accounted (one page added to
/proc/meminfo's Committed_AS, which isn't allowed to exceed CommitLimit
if /proc/sys/vm/overcommit_memory is 2 to disable overcommit), but the
neighbouring pages have not been accounted: so we need separate vmas
for them, I'm afraid, since that accounting is done per vma.

>
> Can anyone shed some light on this? While it isn't an issue for 3
> pages, I'm mmaping 200K+ pages and changing the perms on random pages
> throughout and then back but I quickly run into the max_map_count when
> I don't actually need that many mappings.

But that's easily dealt with: just make your mmap PROT_READ|PROT_WRITE,
which will account for the whole extent; then mprotect it all PROT_NONE,
which will take you to your previous starting position; then proceed as
before - the vmas should get merged as they are reset back to PROT_NONE.
That works, doesn't it?

(I must offer a big thank you: replying to your mail just after writing
a mail about the ZERO_PAGE, brings me to realize - if I'm not mistaken -
that we broke the accounting of initially non-writable anonymous areas
when we stopped using the ZERO_PAGE there, but marked readfaulted pages
as dirty.  Looks like another argument to bring them back.)

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
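
[Editorial sketch, not part of the original mail.] The three steps from
the report, and the "account the whole extent first" workaround suggested
in the reply, can be tried with something like the following. The function
names count_vmas and repro_split are made up for this illustration. Going
by the explanation in this thread, one would expect 3 vmas without the
workaround and 1 with it, but the result depends on the kernel version, so
the code only reports the count and asserts nothing about it.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the /proc/self/maps entries overlapping [start, start + len).
 * Returns -1 if the file cannot be opened. (Hypothetical helper.) */
int count_vmas(const void *start, size_t len)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    unsigned long lo = (unsigned long)start, hi = lo + len;
    unsigned long a, b;
    char line[4096];
    int n = 0;

    while (fgets(line, sizeof line, f)) {
        /* Each entry starts with "start-end" in hex. */
        if (sscanf(line, "%lx-%lx", &a, &b) == 2 && a < hi && b > lo)
            n++;
    }
    fclose(f);
    return n;
}

/* Steps 1-3 from the report. If account_first is nonzero, map the whole
 * range PROT_READ|PROT_WRITE and then mprotect it all to PROT_NONE before
 * proceeding, as suggested in the reply. Returns the number of vmas
 * covering the three pages afterwards, or -1 on error. */
int repro_split(int account_first)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 3 * (size_t)page;
    int prot = account_first ? (PROT_READ | PROT_WRITE) : PROT_NONE;
    int n = -1;

    char *p = mmap(NULL, len, prot, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    if (account_first && mprotect(p, len, PROT_NONE) != 0)
        goto out;

    /* Step 2: middle page writable; step 3: back to PROT_NONE. */
    if (mprotect(p + page, page, PROT_READ | PROT_WRITE) == 0 &&
        mprotect(p + page, page, PROT_NONE) == 0)
        n = count_vmas(p, len);
out:
    munmap(p, len);
    return n;
}
```

Printing repro_split(0) and repro_split(1) from a small main() is enough
to compare the two cases on a given machine. Note that the workaround
needs the whole extent to fit under the commit limit at mmap time.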
* Re: vma_merge issue
From: Bill Speirs @ 2009-08-12 19:04 UTC
To: Hugh Dickins; +Cc: Nick Piggin, linux-kernel, linux-mm

On Wed, Aug 12, 2009 at 2:26 PM, Hugh Dickins<hugh.dickins@tiscali.co.uk> wrote:
> On Mon, 10 Aug 2009, Bill Speirs wrote:
>>
>> I came across an issue where adjacent pages are not properly coalesced
>> together when changing protections on them. This can be shown by doing
>> the following:
>>
>> 1) Map 3 pages with PROT_NONE and MAP_PRIVATE | MAP_ANONYMOUS
>> 2) Set the middle page's protection to PROT_READ | PROT_WRITE
>> 3) Set the middle page's protection back to PROT_NONE
>>
>> You are left with 3 entries in /proc/self/maps where you should only
>> have 1. If you only change the protection to PROT_READ in step 2, then
>> it is properly merged together. I noticed in mprotect.c the following
>> comment in the function mprotect_fixup; I'm not sure if it applies or
>> not:
>>     /*
>>      * If we make a private mapping writable we increase our commit;
>>      * but (without finer accounting) cannot reduce our commit if we
>>      * make it unwritable again.
> [ the following lines of the comment are not relevant here so I'll delete ]
>>      */
>>
>> I think this only applies to setting charged = nrpages; however,
>> VM_ACCOUNT is also added to newflags. Could it be that the adjacent
>> blocks don't have VM_ACCOUNT and so the call to vma_merge cannot merge
>> because the flags for the adjacent vma are not the same?
>
> That's right, and it is working as intended.
>
> To allow people to set up enormous PROT_READ,MAP_PRIVATE mappings
> "for free", we don't account those initially, but only as parts
> are mprotected writable later: at that point they're accounted,
> and marked VM_ACCOUNT so that we know it's been done (and don't
> double account later on).
>
> So your middle page has been accounted (one page added to
> /proc/meminfo's Committed_AS, which isn't allowed to exceed CommitLimit
> if /proc/sys/vm/overcommit_memory is 2 to disable overcommit), but the
> neighbouring pages have not been accounted: so we need separate vmas
> for them, I'm afraid, since that accounting is done per vma.
>
>>
>> Can anyone shed some light on this? While it isn't an issue for 3
>> pages, I'm mmaping 200K+ pages and changing the perms on random pages
>> throughout and then back but I quickly run into the max_map_count when
>> I don't actually need that many mappings.
>
> But that's easily dealt with: just make your mmap PROT_READ|PROT_WRITE,
> which will account for the whole extent; then mprotect it all PROT_NONE,
> which will take you to your previous starting position; then proceed as
> before - the vmas should get merged as they are reset back to PROT_NONE.
> That works, doesn't it?

Unfortunately, that doesn't work. When I mmap pages as PROT_WRITE it
is checked against the CommitLimit and returns with ENOMEM as I'm
mmaping a lot of pages. However, I don't actually want to be charged
for that memory, as I won't be using all of it. This is why I mmap as
PROT_NONE as I'm not charged for it. Then when I set a page to
PROT_WRITE I get charged (which is expected and OK), but then going
back to PROT_NONE I don't get "uncharged". This makes sense as I could
simply PROT_WRITE that page again and I should be charged. However, I
have no way (that I know of) to tell the kernel "I'm done with this
page, don't charge me for it, and set its protection to PROT_NONE."
I've tried madvise with MADV_DONTNEED but that doesn't seem to remove
the VM_ACCOUNT flag.

I have seen an mm patch that introduces MADV_FREE, which I believe
removes the VM_ACCOUNT flag and decrements the commit charge. Does it
make sense to have this type of functionality? Can I get this same
type of functionality (start without being charged for a page, use it,
then un-use it and remove the charge for it?) currently?

> (I must offer a big thank you: replying to your mail just after writing
> a mail about the ZERO_PAGE, brings me to realize - if I'm not mistaken -
> that we broke the accounting of initially non-writable anonymous areas
> when we stopped using the ZERO_PAGE there, but marked readfaulted pages
> as dirty.  Looks like another argument to bring them back.)

I'm not 100% sure what you're talking about with respect to ZERO_PAGE,
but I'm happy to help :-)

Bill-
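
[Editorial sketch, not part of the original mail.] The commit charge
discussed above is visible as Committed_AS in /proc/meminfo, next to
CommitLimit. A minimal way to watch both values while experimenting might
look like this. The helper names meminfo_kb and commit_headroom_kb are
invented for this example, and the "headroom" figure is only meaningful
when /proc/sys/vm/overcommit_memory is 2.

```c
#include <stdio.h>
#include <string.h>

/* Look up "Key:   12345 kB" in meminfo-style text and return the value
 * in kB, or -1 if the key is not present. (Hypothetical helper.) */
long meminfo_kb(const char *text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = text;

    while (line && *line) {
        /* Match the key only at the start of a line, followed by ':'. */
        if (strncmp(line, key, klen) == 0 && line[klen] == ':') {
            long v;
            return sscanf(line + klen + 1, "%ld", &v) == 1 ? v : -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}

/* Read /proc/meminfo and store CommitLimit - Committed_AS (in kB) in
 * *out. Returns 0 on success, -1 if the file or either field is missing. */
int commit_headroom_kb(long *out)
{
    char buf[8192];
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f)
        return -1;

    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';

    long limit = meminfo_kb(buf, "CommitLimit");
    long used = meminfo_kb(buf, "Committed_AS");
    if (limit < 0 || used < 0)
        return -1;

    *out = limit - used;
    return 0;
}
```

Calling commit_headroom_kb() before and after an mprotect to PROT_WRITE
should show the charge being taken. Other processes change Committed_AS
too, so small differences are noisy.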
* Re: vma_merge issue
From: Hugh Dickins @ 2009-08-12 20:23 UTC
To: Bill Speirs; +Cc: Nick Piggin, linux-kernel, linux-mm

On Wed, 12 Aug 2009, Bill Speirs wrote:
> On Wed, Aug 12, 2009 at 2:26 PM, Hugh Dickins<hugh.dickins@tiscali.co.uk> wrote:
> > On Mon, 10 Aug 2009, Bill Speirs wrote:
> >>
> >> Can anyone shed some light on this? While it isn't an issue for 3
> >> pages, I'm mmaping 200K+ pages and changing the perms on random pages
> >> throughout and then back but I quickly run into the max_map_count when
> >> I don't actually need that many mappings.
> >
> > But that's easily dealt with: just make your mmap PROT_READ|PROT_WRITE,
> > which will account for the whole extent; then mprotect it all PROT_NONE,
> > which will take you to your previous starting position; then proceed as
> > before - the vmas should get merged as they are reset back to PROT_NONE.
> > That works, doesn't it?
>
> Unfortunately, that doesn't work. When I mmap pages as PROT_WRITE it
> is checked against the CommitLimit and returns with ENOMEM as I'm
> mmaping a lot of pages. However, I don't actually want to be charged
> for that memory, as I won't be using all of it. This is why I mmap as
> PROT_NONE as I'm not charged for it.

I'm sorry, I hadn't realized you're working in an overcommit_memory 2
environment.  And it's not single user, so you don't have the freedom
to adjust /proc/sys/vm/overcommit_ratio to suit your needs?

> Then when I set a page to
> PROT_WRITE I get charged (which is expected and OK), but then going
> back to PROT_NONE I don't get "uncharged". This makes sense as I could
> simply PROT_WRITE that page again and I should be charged.

Even if you never wrote to it again, PROT_READ would have to show you
the same content as was in there before, so you definitely still need
to be charged for it.

> However, I
> have no way (that I know of) to tell the kernel "I'm done with this
> page, don't charge me for it, and set its protection to PROT_NONE."
> I've tried madvise with MADV_DONTNEED but that doesn't seem to remove
> the VM_ACCOUNT flag.

MADV_DONTNEED: brilliant idea, what a shame it doesn't work for you.
I'd been on the point of volunteering a bugfix to it to do what you
want, it would make sense; but there's a big but... we have sold
MADV_DONTNEED as an madvise that only needs non-exclusive access
to the mmap_sem, which means it can be used concurrently with faulting,
which has made it much more useful to glibc (I believe).  If we were
to fiddle with vmas and accounting and merging in there, it would go
back to needing exclusive mmap_sem, which would hurt important users.

There could be a MADV_BILL_SPEIRS_WONTNEED, but even if we could
agree on a more impartial name for it, it might be hard to justify,
and tiresome to write the man page explaining when to use this and
when to use that.  Could be done, but...

Oh, I've somehow missed your next paragraph...

>
> I have seen an mm patch that introduces MADV_FREE, which I believe
> removes the VM_ACCOUNT flag and decrements the commit charge. Does it
> make sense to have this type of functionality? Can I get this same
> type of functionality (start without being charged for a page, use it,
> then un-use it and remove the charge for it?) currently?

The name MADV_FREE is vaguely familiar, let's see, Rik, 2007.
Looking at that patch, no, it didn't remove the commit charge:
it kept quite close to MADV_DONTNEED in that respect.  I think
Nick's non-exclusive mmap_sem mod to MADV_DONTNEED solved the
particular problem which MADV_FREE was proposed for, in a much
simpler way, so MADV_FREE didn't get any further.

What could you do?  Some variously unsatisfactory solutions,
all of which you've probably rejected already:

Raise max_map_count via /proc/sys/vm/max_map_count
(but probably you don't have access to do so)

Don't mmap the arena in the first place, or mmap it and then munmap
all but start and end, use MAP_FIXED within the arena for your pages,
and pray that no library might be mmap'ing in there while you're
running (and maybe the architecture's address choices will help you).

Don't use anonymous memory, have a 1GB sparse file to back this,
and mmap it MAP_SHARED, then you won't get charged for RAM+swap.

All of them copouts, but the last maybe the best.

>
> > (I must offer a big thank you: replying to your mail just after writing
> > a mail about the ZERO_PAGE, brings me to realize - if I'm not mistaken -
> > that we broke the accounting of initially non-writable anonymous areas
> > when we stopped using the ZERO_PAGE there, but marked readfaulted pages
> > as dirty.  Looks like another argument to bring them back.)
>
> I'm not 100% sure what you're talking about with respect to ZERO_PAGE,
> but I'm happy to help :-)

I was rather talking to myself and a few others there, but the
important thing is, that I have helped to make you happy :-)

(It is spooky that the mail about ZERO_PAGE that I refer to,
also involved comments about MADV_DONTNEED and alternatives.)

Hugh
* Re: vma_merge issue
From: Hugh Dickins @ 2009-08-12 20:53 UTC
To: Bill Speirs; +Cc: Nick Piggin, linux-kernel, linux-mm

On Wed, 12 Aug 2009, Hugh Dickins wrote:
>
> Don't use anonymous memory, have a 1GB sparse file to back this,
> and mmap it MAP_SHARED, then you won't get charged for RAM+swap.

A "refinement" to that suggestion is to put the file on tmpfs:
you will then get charged for RAM+swap as you use it, but you can
use madvise MADV_REMOVE to unmap pages, punching holes in the file,
freeing up those charges.  A little baroque, but I think it does
amount to a way of doing exactly what you wanted in the first place.

(Note: we do insist on PROT_WRITE access at the time of MADV_REMOVE:
I've even a feeling it was me who insisted on that.)

Hugh
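
[Editorial sketch, not part of the original mail.] The tmpfs refinement
described above could be tried roughly as follows. This is a sketch under
assumptions, not the code Bill later reported writing: the function name
tmpfs_arena_demo is invented, and the caller is assumed to pass a directory
that really is on tmpfs (often /dev/shm, but that is system-dependent).
The page is made writable before MADV_REMOVE, in line with the note about
PROT_WRITE access.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Back an arena of npages with an unlinked sparse file in dir, use one
 * page, then give it back with MADV_REMOVE.
 * Returns 0 if the page reads back as zero after MADV_REMOVE,
 *         1 if the old contents survived,
 *        -1 on any setup or system call error. */
int tmpfs_arena_demo(const char *dir, size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = npages * (size_t)page;
    char path[256];
    int rc = -1;

    if (snprintf(path, sizeof path, "%s/arena-XXXXXX", dir) >= (int)sizeof path)
        return -1;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* the mapping keeps the file alive */

    if (ftruncate(fd, (off_t)len) != 0) {   /* sparse: no blocks allocated yet */
        close(fd);
        return -1;
    }

    /* Whole arena starts inaccessible, shared and file-backed. */
    char *arena = mmap(NULL, len, PROT_NONE, MAP_SHARED, fd, 0);
    close(fd);
    if (arena == MAP_FAILED)
        return -1;

    char *pg = arena + (npages / 2) * (size_t)page;

    /* "Use" a page: make it writable and dirty it. */
    if (mprotect(pg, page, PROT_READ | PROT_WRITE) != 0)
        goto out;
    memset(pg, 0xAB, page);

    /* "Un-use" it: punch a hole while the page is still writable... */
    if (madvise(pg, page, MADV_REMOVE) != 0)
        goto out;
    rc = (pg[0] == 0) ? 0 : 1;

    /* ...then return it to PROT_NONE. */
    mprotect(pg, page, PROT_NONE);
out:
    munmap(arena, len);
    return rc;
}
```

A call such as tmpfs_arena_demo("/dev/shm", 16) exercises the sequence
once. Whether the vmas merge again afterwards can be checked separately by
looking at /proc/self/maps.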
* Re: vma_merge issue
From: William R Speirs @ 2009-08-13 1:15 UTC
To: Hugh Dickins; +Cc: Nick Piggin, linux-kernel, linux-mm

Hugh Dickins wrote:
>> Unfortunately, that doesn't work. When I mmap pages as PROT_WRITE it
>> is checked against the CommitLimit and returns with ENOMEM as I'm
>> mmaping a lot of pages. However, I don't actually want to be charged
>> for that memory, as I won't be using all of it. This is why I mmap as
>> PROT_NONE as I'm not charged for it.
>
> I'm sorry, I hadn't realized you're working in an overcommit_memory 2
> environment.  And it's not single user, so you don't have the freedom
> to adjust /proc/sys/vm/overcommit_ratio to suit your needs?

I could maybe change these things, but I'd have to fight with the sys
admin... not a battle I want to engage in.

>> Then when I set a page to
>> PROT_WRITE I get charged (which is expected and OK), but then going
>> back to PROT_NONE I don't get "uncharged". This makes sense as I could
>> simply PROT_WRITE that page again and I should be charged.
>
> Even if you never wrote to it again, PROT_READ would have to show you
> the same content as was in there before, so you definitely still need
> to be charged for it.

Good point. In my world (program) once a page goes PROT_NONE I will never
need the memory again. But alas not everyone lives in my world...

>> However, I
>> have no way (that I know of) to tell the kernel "I'm done with this
>> page, don't charge me for it, and set its protection to PROT_NONE."
>> I've tried madvise with MADV_DONTNEED but that doesn't seem to remove
>> the VM_ACCOUNT flag.
>
> MADV_DONTNEED: brilliant idea, what a shame it doesn't work for you.
> I'd been on the point of volunteering a bugfix to it to do what you
> want, it would make sense; but there's a big but... we have sold
> MADV_DONTNEED as an madvise that only needs non-exclusive access
> to the mmap_sem, which means it can be used concurrently with faulting,
> which has made it much more useful to glibc (I believe).  If we were
> to fiddle with vmas and accounting and merging in there, it would go
> back to needing exclusive mmap_sem, which would hurt important users.

For my own edification, hurt these users how? Performance? Serializing
access during a MADV_DONTNEED? I wonder how big the "hurt" would be?

> There could be a MADV_BILL_SPEIRS_WONTNEED, but even if we could
> agree on a more impartial name for it, it might be hard to justify,
> and tiresome to write the man page explaining when to use this and
> when to use that.  Could be done, but...

While my ego would love that constant...

> Oh, I've somehow missed your next paragraph...
>
>> I have seen an mm patch that introduces MADV_FREE, which I believe
>> removes the VM_ACCOUNT flag and decrements the commit charge. Does it
>> make sense to have this type of functionality? Can I get this same
>> type of functionality (start without being charged for a page, use it,
>> then un-use it and remove the charge for it?) currently?
>
> The name MADV_FREE is vaguely familiar, let's see, Rik, 2007.
> Looking at that patch, no, it didn't remove the commit charge:
> it kept quite close to MADV_DONTNEED in that respect.  I think
> Nick's non-exclusive mmap_sem mod to MADV_DONTNEED solved the
> particular problem which MADV_FREE was proposed for, in a much
> simpler way, so MADV_FREE didn't get any further.

Yeah, I apologize, I didn't study exactly what the proposed MADV_FREE was
to do before suggesting it. Informative, thanks!

> What could you do?  Some variously unsatisfactory solutions,
> all of which you've probably rejected already:
>
> Raise max_map_count via /proc/sys/vm/max_map_count
> (but probably you don't have access to do so)

Been here. Again, I'd have to fight with the sys admins...

> Don't mmap the arena in the first place, or mmap it and then munmap
> all but start and end, use MAP_FIXED within the arena for your pages,
> and pray that no library might be mmap'ing in there while you're
> running (and maybe the architecture's address choices will help you).

Interesting idea, but slightly too risky for me.

> Don't use anonymous memory, have a 1GB sparse file to back this,
> and mmap it MAP_SHARED, then you won't get charged for RAM+swap.
>
> On Wed, 12 Aug 2009, Hugh Dickins wrote:
>
> A "refinement" to that suggestion is to put the file on tmpfs:
> you will then get charged for RAM+swap as you use it, but you can
> use madvise MADV_REMOVE to unmap pages, punching holes in the file,
> freeing up those charges.  A little baroque, but I think it does
> amount to a way of doing exactly what you wanted in the first place.

I like this (the refined) idea a lot. I coded it up and it works as
expected, the way I initially wanted.

Thanks for taking the time and providing the solution... I appreciate it.

Bill-
* Re: vma_merge issue
From: Hugh Dickins @ 2009-08-13 17:33 UTC
To: William R Speirs; +Cc: Nick Piggin, linux-kernel, linux-mm

On Wed, 12 Aug 2009, William R Speirs wrote:
> Hugh Dickins wrote:
> >
> > MADV_DONTNEED: brilliant idea, what a shame it doesn't work for you.
> > I'd been on the point of volunteering a bugfix to it to do what you
> > want, it would make sense; but there's a big but... we have sold
> > MADV_DONTNEED as an madvise that only needs non-exclusive access
> > to the mmap_sem, which means it can be used concurrently with faulting,
> > which has made it much more useful to glibc (I believe).  If we were
> > to fiddle with vmas and accounting and merging in there, it would go
> > back to needing exclusive mmap_sem, which would hurt important users.
>
> For my own edification, hurt these users how? Performance? Serializing
> access during a MADV_DONTNEED? I wonder how big the "hurt" would be?

Performance, yes: serializing, yes.  I forget the details, others will
have paid closer attention, I may be making this up!  But it was
something like garbage collection when freeing mallocs: it pays off
if faults elsewhere in the address space can occur concurrently, but
bad news if exclusive mmap_sem locks out those faults.  Big enough hurt
to show up very badly in some real-life multithreaded apps, and
benchmarks hitting the issue.

> > A "refinement" to that suggestion is to put the file on tmpfs:
> > you will then get charged for RAM+swap as you use it, but you can
> > use madvise MADV_REMOVE to unmap pages, punching holes in the file,
> > freeing up those charges.  A little baroque, but I think it does
> > amount to a way of doing exactly what you wanted in the first place.
>
> I like this (the refined) idea a lot. I coded it up and it works as
> expected, the way I initially wanted.
>
> Thanks for taking the time and providing the solution... I appreciate it.

I'm very glad to hear that worked out: thanks for reporting back.

Hugh
Thread overview: 6+ messages
[not found] <a1b36c3a0908101347t796dedbat2ecb0535c32f325b@mail.gmail.com>
2009-08-12 18:26 ` vma_merge issue Hugh Dickins
2009-08-12 19:04 ` Bill Speirs
2009-08-12 20:23 ` Hugh Dickins
2009-08-12 20:53 ` Hugh Dickins
2009-08-13 1:15 ` William R Speirs
2009-08-13 17:33 ` Hugh Dickins