* FADV_DONTNEED on hugetlbfs files broken
@ 2007-03-17 5:13 Nishanth Aravamudan
2007-03-17 6:13 ` William Lee Irwin III
0 siblings, 1 reply; 6+ messages in thread
From: Nishanth Aravamudan @ 2007-03-17 5:13 UTC (permalink / raw)
To: kenchen; +Cc: linux-mm, agl, dwg, wli
Hi Ken,
git commit 6649a3863232eb2e2f15ea6c622bd8ceacf96d76 "[PATCH] hugetlb:
preserve hugetlb pte dirty state" fixed one bug and caused another (or,
at least, a regression): FADV_DONTNEED no longer works on hugetlbfs
files. git-bisect revealed this commit to be the cause. I'm still trying
to figure out what the solution is (but it is also the start of the
weekend :) Maybe it's not a bug, but it is a change in behavior, and I
don't think it was clear from the commit message.
Thanks,
Nish
---
Background:
I found this while trying to add some code to libhugetlbfs to minimize
the number of hugepages used by the segment remapping code
(git://ozlabs.org/~dgibson/git/libhugetlbfs.git). The general sequence
of code is:
1) Map hugepage-backed file in, MAP_SHARED.
2) Copy smallpage-backed segment data into hugepage-backed file.
3) Unmap hugepage-backed file.
4) Unmap smallpage-backed segment.
5) Map hugepage-backed file in its place, MAP_PRIVATE.
(From what I understand, step 5) will take advantage of the fact that
the mapping from step 1) is still in the page cache and thus not
actually use any more huge pages)
Now, if this segment is writable, we are going to take a COW fault on
the PRIVATE mapping at some point (most likely) and then have twice as
many hugepages in use as needed. So, I added some code to the
remapping to add two more steps:
6) If the segment is writable, for each hugepage in the hugepage-backed
file, force a COW.
7) Invoke posix_fadvise(fd, 0, 0, FADV_DONTNEED) on the hugepage-backed
file to drop the SHARED mapping out of the page cache.
Now, the problem I'm seeing on a very dummy program, test.c:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>

int array[8*1024*1024];

int main() {
	getchar();
	return 0;
}
relinked with libhugetlbfs
gcc -o test -B/path/to/ld/symlinked/to/ld.hugetlbfs -Wl,--hugetlbfs-link=BDT -L/path/to/libhugetlbfs.so
resulted in different behavior with 2.6.19 and 2.6.21-rc4 (on an x86_64
and on a powerpc):
2.6.19: Start out using 2 hugepages (one for each segment), the COW
causes us to go to 3 hugepages, and the fadvise drops us back down to 2
pages.
2.6.21-rc4: Start out using 2 hugepages, the COW causes us to go to 3
hugepages, and the fadvise has no effect.
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: FADV_DONTNEED on hugetlbfs files broken
@ 2007-03-17 6:13 William Lee Irwin III
From: William Lee Irwin III @ 2007-03-17 6:13 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: kenchen, linux-mm, agl, dwg

On Fri, Mar 16, 2007 at 10:13:09PM -0700, Nishanth Aravamudan wrote:
> git commit 6649a3863232eb2e2f15ea6c622bd8ceacf96d76 "[PATCH] hugetlb:
> preserve hugetlb pte dirty state" fixed one bug and caused another (or,
> at least, a regression): FADV_DONTNEED no longer works on hugetlbfs
> files. git-bisect revealed this commit to be the cause. I'm still trying
> to figure out what the solution is (but it is also the start of the
> weekend :) Maybe it's not a bug, but it is a change in behavior, and I
> don't think it was clear from the commit message.

Well, setting the pages always dirty like that will prevent things from
dropping them, because they think the pages still need to be written
back. It is, however, legitimate and/or permissible to ignore fadvise()
and/or madvise(); they are by definition only advisory. I think this is
more of a "please add back FADV_DONTNEED support" affair.

Perhaps we should ask what ramfs, tmpfs, et al. would do. Or, for that
matter, whether they suffer from the same issue Ken Chen identified for
hugetlbfs. Perhaps the issue is not hugetlb's dirty state, but
drop_pagecache_sb() failing to check the bdi for BDI_CAP_NO_WRITEBACK.
Or perhaps it is a question of what safety guarantees
drop_pagecache_sb() is supposed to have or lack.

-- wli
* Re: FADV_DONTNEED on hugetlbfs files broken
@ 2007-03-17 19:37 Nishanth Aravamudan
From: Nishanth Aravamudan @ 2007-03-17 19:37 UTC (permalink / raw)
To: William Lee Irwin III; +Cc: kenchen, linux-mm, agl, dwg

On 16.03.2007 [23:13:22 -0700], William Lee Irwin III wrote:
> Well, setting the pages always dirty like that will prevent things
> from dropping them because they think they still need to be written
> back. It is, however, legitimate and/or permissible to ignore
> fadvise() and/or madvise(); they are by definition only advisory. I
> think this is more of a "please add back FADV_DONTNEED support"
> affair.

Yes, that could be :) Sorry if my e-mail indicated I was asking
otherwise. I don't want Ken's commit to be reverted, as that would make
hugepages very nearly unusable on x86 and x86_64. But I had found a
functional change and wanted it documented. If hugepages can no longer
be dropped from the page cache, then we should make sure that is clear
(and expected/desired).

Now, even if I call fsync() on the file descriptor, I still don't get
the pages out of the page cache. It seems to me that fsync() would
clear the dirty state -- although perhaps with Ken's patch, writable
hugetlbfs pages will *always* be dirty? I'm still trying to figure out
what ever clears that dirty state (in hugetlbfs or anywhere else). It
seems hugetlbfs truncates call cancel_dirty_page(), but the comment
there indicates it's only for truncates.

> Perhaps we should ask what ramfs, tmpfs, et al would do. Or, for that
> matter, if they suffer from the same issue as Ken Chen identified for
> hugetlbfs. Perhaps the issue is not hugetlb's dirty state, but
> drop_pagecache_sb() failing to check the bdi for BDI_CAP_NO_WRITEBACK.
> Or perhaps what safety guarantees drop_pagecache_sb() is supposed to
> have or lack.

A good point, and one I hadn't considered. I'm less concerned by the
drop_pagecache_sb() path (which is /proc/sys/vm/drop_caches, yes?),
although it appears that it and the FADV_DONTNEED code both end up
calling into invalidate_mapping_pages(). I'm still pretty new to this
part of the kernel code, and am trying to follow along as best I can.
In any case, if the problem were in drop_pagecache_sb(), it seems like
it wouldn't help the DONTNEED case, since that's a level above the call
to invalidate_mapping_pages().

I'll keep looking through the code and thinking, and if anyone has any
patches they'd like me to test, I'll be glad to.

Thanks,
Nish

--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
* Re: FADV_DONTNEED on hugetlbfs files broken
@ 2007-03-18 2:13 William Lee Irwin III
From: William Lee Irwin III @ 2007-03-18 2:13 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: kenchen, linux-mm, agl, dwg

On 16.03.2007 [23:13:22 -0700], William Lee Irwin III wrote:
>> Well, setting the pages always dirty like that will prevent things
>> from dropping them because they think they still need to be written
>> back. It is, however, legitimate and/or permissible to ignore
>> fadvise() and/or madvise(); they are by definition only advisory. I
>> think this is more of a "please add back FADV_DONTNEED support"
>> affair.

On Sat, Mar 17, 2007 at 12:37:29PM -0700, Nishanth Aravamudan wrote:
> Yes, that could be :) Sorry if my e-mail indicated I was asking
> otherwise. I don't want Ken's commit to be reverted, as that would
> make hugepages very nearly unusable on x86 and x86_64. [...]
> Now, even if I call fsync() on the file descriptor, I still don't get
> the pages out of the page cache. [...]

I'm not so convinced drop_pagecache_sb() semantics have such drastic
effects on usability. It's not a standard API, and it is as yet unclear
to me how "safe" its semantics are intended to be as a root-only back
door into kernel internals.

On 16.03.2007 [23:13:22 -0700], William Lee Irwin III wrote:
>> Perhaps we should ask what ramfs, tmpfs, et al would do. Or, for that
>> matter, if they suffer from the same issue as Ken Chen identified for
>> hugetlbfs. Perhaps the issue is not hugetlb's dirty state, but
>> drop_pagecache_sb() failing to check the bdi for BDI_CAP_NO_WRITEBACK.
>> Or perhaps what safety guarantees drop_pagecache_sb() is supposed to
>> have or lack.

On Sat, Mar 17, 2007 at 12:37:29PM -0700, Nishanth Aravamudan wrote:
> A good point, and one I hadn't considered. [...]
> I'll keep looking through the code and thinking, and if anyone has
> any patches they'd like me to test, I'll be glad to.

Well, ramfs, tmpfs, et al. don't do this sort of false dirtiness. So
there must be some other method they have of coping; otherwise, they
let drop_pagecache_sb() have the rather user-hostile semantics our fix
was intended to repair, possibly even intentionally. Best to wait until
Monday so Ken Chen can chime in. Flagging down whoever has some notion
of drop_pagecache_sb()'s intended semantics, esp. wrt "safety", would
also be a good idea here.

It should be clear that the actual code surrounding all this is not so
involved; it's more an issue of clarifying intentions and/or what
should be done in the first place.

-- wli
* Re: FADV_DONTNEED on hugetlbfs files broken
@ 2007-03-18 7:43 Ken Chen
From: Ken Chen @ 2007-03-18 7:43 UTC (permalink / raw)
To: Nishanth Aravamudan; +Cc: William Lee Irwin III, linux-mm, agl, dwg

On 3/17/07, Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> Yes, that could be :) Sorry if my e-mail indicated I was asking
> otherwise. I don't want Ken's commit to be reverted, as that would
> make hugepages very nearly unusable on x86 and x86_64. But I had
> found a functional change and wanted it to be documented. If
> hugepages can no longer be dropped from the page cache, then we
> should make sure that is clear (and expected/desired).

Oh gosh, I think you are really abusing the buggy hugetlb behavior of
the dark age of 2.6.19. A hugetlb file does not have a disk-based
backing store. The in-core page that resides in the page cache is the
only copy of the file. For pages that are dirty, there is no place to
sync them to, and thus they have to stay in the page cache for the
life of the file.

And currently, there is no way to allocate a hugetlb page in a "clean"
state, because we can't mmap a hugetlb page onto a disk file. So pages
for a live file in hugetlbfs are always written to initially, and it
is just not possible to drop them out of the page cache; otherwise we
would suffer data corruption.

> Now, even if I call fsync() on the file descriptor, I still don't get
> the pages out of the page cache. It seems to me like fsync() would
> clear the dirty state -- although perhaps with Ken's patch, writable
> hugetlbfs pages will *always* be dirty? I'm still trying to figure
> out what ever clears that dirty state (in hugetlbfs or anywhere
> else). Seems like hugetlbfs truncates call cancel_dirty_page(), but
> the comment there indicates it's only for truncates.

fsync cannot drop dirty pages out of the page cache because there is
no backing store. I believe truncate is the only way to remove a
hugetlb page from the page cache.

> > Perhaps we should ask what ramfs, tmpfs, et al would do. Or, for
> > that matter, if they suffer from the same issue as Ken Chen
> > identified for hugetlbfs. Perhaps the issue is not hugetlb's dirty
> > state, but drop_pagecache_sb() failing to check the bdi for
> > BDI_CAP_NO_WRITEBACK. Or perhaps what safety guarantees
> > drop_pagecache_sb() is supposed to have or lack.

I looked: ramfs and tmpfs do the same thing. fadvise(DONTNEED) doesn't
do anything to live files.

- Ken
* Re: FADV_DONTNEED on hugetlbfs files broken
@ 2007-03-18 17:27 Nishanth Aravamudan
From: Nishanth Aravamudan @ 2007-03-18 17:27 UTC (permalink / raw)
To: Ken Chen; +Cc: William Lee Irwin III, linux-mm, agl, dwg

On 18.03.2007 [00:43:01 -0700], Ken Chen wrote:
> Oh gosh, I think you are really abusing the buggy hugetlb behavior of
> the dark age of 2.6.19. Hugetlb file does not have disk based backing
> store. The in-core page that resides in the page cache is the only
> copy of the file. For pages that are dirty, there are no place to
> sync them to and thus they have to stay in the page cache for the
> life of the file.

And 2.6.20, fwiw. Your explanation makes sense. Frustrating, though,
since it means segment remapping uses twice as many huge pages as it
needs to for each writable segment.

> And currently, there is no way to allocate hugetlb page in "clean"
> state because we can't mmap hugetlb page onto a disk file. So pages
> for live file in hugetlbfs are always being written to initially and
> it is just not possible to drop them out of page cache, otherwise we
> suffer from data corruption.

Let's be clear, for the sake of the archives of the world: this is
only for *writable* allocations. In make_huge_pte():

	if (writable) {
		entry = pte_mkwrite(pte_mkdirty(mk_pte(page,
						vma->vm_page_prot)));
	} else {
		entry = pte_wrprotect(mk_pte(page, vma->vm_page_prot));
	}

Probably obvious to anyone, since you need to be able to dirty the
page to have it in a dirty state.

> fsync can not drop dirty pages out of page cache because there are no
> backing store. I believe truncate is the only way to remove hugetlb
> page out of page cache.

Which won't work here, because we don't want to lose the data. We just
want to drop the original MAP_SHARED copy of the file out of the page
cache. I tried ftruncate()'ing the file down to 0 after we've mapped
it PRIVATE and COW'd each hugepage, but then the process (obviously)
SEGVs. We lose all hugepages in the page cache.

> I looked, ramfs and tmpfs does the same thing. fadvice(DONTNEED)
> doesn't do anything to live files.

Ok, thanks for looking into it, Ken.

--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
end of thread, other threads: [~2007-03-18 17:27 UTC | newest]

Thread overview: 6+ messages:
2007-03-17 5:13 FADV_DONTNEED on hugetlbfs files broken Nishanth Aravamudan
2007-03-17 6:13 ` William Lee Irwin III
2007-03-17 19:37 ` Nishanth Aravamudan
2007-03-18 2:13 ` William Lee Irwin III
2007-03-18 7:43 ` Ken Chen
2007-03-18 17:27 ` Nishanth Aravamudan