* [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Peter Zijlstra @ 2007-12-20 13:05 UTC
To: linux-kernel, linux-mm; +Cc: hugh, Nick Piggin, riel, Lennart Poettering

Hi,

Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
to use this to pre-fault pages. He currently uses: mlock/munlock for
this purpose.

[ compile tested only ]

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/mm/madvise.c b/mm/madvise.c
index 93ee375..eff60ce 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -100,6 +100,24 @@ out:
 	return error;
 }
 
+static long madvice_willneed_anon(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	int ret, len;
+
+	*prev = vma;
+	if (end > vma->vm_end)
+		end = vma->vm_end;
+
+	len = end - start;
+	ret = get_user_pages(current, current->mm, start, len,
+			0, 0, NULL, NULL);
+	if (ret < 0)
+		return ret;
+	return ret == len ? 0 : -1;
+}
+
 /*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
@@ -110,7 +128,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
 	struct file *file = vma->vm_file;
 
 	if (!file)
-		return -EBADF;
+		return madvice_willneed_anon(vma, prev, start, end);
 
 	if (file->f_mapping->a_ops->get_xip_page) {
 		/* no bad return value, but ignore advice */

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

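For readers who want to see the two userspace approaches side by side, here is a minimal sketch. It assumes a Linux system; the helper names map_anon(), prefault_willneed() and prefault_mlock() are made up for illustration and are not part of the patch. On a kernel without this change the madvise() call on an anonymous mapping is expected to fail with EBADF, which is why the helper reports a negative errno instead of aborting.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map len bytes of private anonymous memory. */
static void *map_anon(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: hint that [buf, buf + len) will be needed soon.
 * Returns 0 on success or a negative errno (for example -EBADF on a
 * kernel that refuses MADV_WILLNEED for anonymous memory).
 */
static int prefault_willneed(void *buf, size_t len)
{
	assert(buf != NULL);
	if (madvise(buf, len, MADV_WILLNEED) == 0)
		return 0;
	return -errno;
}

/*
 * The workaround mentioned above: lock the range so every page becomes
 * resident, then unlock it again. Synchronous, and subject to
 * RLIMIT_MEMLOCK. Returns 0 on success or a negative errno.
 */
static int prefault_mlock(void *buf, size_t len)
{
	assert(buf != NULL);
	if (mlock(buf, len) != 0)
		return -errno;
	if (munlock(buf, len) != 0)
		return -errno;
	return 0;
}
```

A caller would typically try prefault_willneed() first and fall back to prefault_mlock() only when it returns -EBADF.
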
* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Hugh Dickins @ 2007-12-20 14:09 UTC
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering

On Thu, 20 Dec 2007, Peter Zijlstra wrote:
>
> Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
> to use this to pre-fault pages. He currently uses: mlock/munlock for
> this purpose.

I certainly agree with this in principle: it just seems an unnecessary
and surprising restriction to refuse on anonymous vmas; I guess the only
reason for not adding this was not having anyone asking for it until now.
Though, does Lennart realize he could use MAP_POPULATE in the mmap?

>
> [ compile tested only ]

I haven't tried it either, but generally it looks plausible.

>
> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> ---
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 93ee375..eff60ce 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -100,6 +100,24 @@ out:
>  	return error;
>  }
>
> +static long madvice_willneed_anon(struct vm_area_struct *vma,
> +				   struct vm_area_struct **prev,
> +				   unsigned long start, unsigned long end)

madvise.c uses "madvise_" rather than "madvice_" throughout,
so please go with the flow.

> +{
> +	int ret, len;
> +
> +	*prev = vma;
> +	if (end > vma->vm_end)
> +		end = vma->vm_end;

Please check, but I think the upper level ensures end is within range.

> +
> +	len = end - start;
> +	ret = get_user_pages(current, current->mm, start, len,
> +			0, 0, NULL, NULL);
> +	if (ret < 0)
> +		return ret;
> +	return ret == len ? 0 : -1;

It's not good to return -1 as an alternative to a real errno:
it'll look like -EPERM.  If you copied that from somewhere, better
send a patch to fix the somewhere!  Ah, yes, make_pages_present: it
happens that nobody is interested in its return value, so we could
make it a void; but that'd just be a cleanup.  What to do here if
non-negative ret less than len?  Oh, just return 0, that's good
enough in this case (the file case always returns 0).

Hmm, might it be better to use make_pages_present itself,
fixing its retval, rather than using get_user_pages directly?
(I'd hope the caching makes its repeat of find_vma not an overhead.)

Interesting divergence: make_pages_present faults in writable pages
in a writable vma, whereas the file case's force_page_cache_readahead
doesn't even insert the pages into the mm.

> +}
> +
>  /*
>   * Schedule all required I/O operations.  Do not wait for completion.
>   */
> @@ -110,7 +128,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
>  	struct file *file = vma->vm_file;
>
>  	if (!file)
> -		return -EBADF;
> +		return madvice_willneed_anon(vma, prev, start, end);
>
>  	if (file->f_mapping->a_ops->get_xip_page) {
>  		/* no bad return value, but ignore advice */

And there's a correctly invisible hunk to the patch too: this
extension of MADV_WILLNEED also does not require down_write of
mmap_sem, so madvise_need_mmap_write can remain unchanged.

Hugh

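The point about -1 reading as -EPERM can be checked in a few lines of plain C. This is only an illustration, not kernel code: willneed_result() and looks_like_eperm() are hypothetical names, and the "short count is still success" rule is the one suggested in the review above.

```c
#include <assert.h>
#include <errno.h>

/* EPERM is errno 1 on Linux, so a bare -1 cannot be told from -EPERM. */
static_assert(EPERM == 1, "EPERM is expected to be errno 1");

/*
 * Hypothetical helper: map a get_user_pages()-style result (negative
 * errno, or number of pages handled) to an madvise()-style result.
 * The advice is best effort, so a short page count still returns 0.
 */
static long willneed_result(long ret)
{
	if (ret < 0)
		return ret;	/* pass a real errno through unchanged */
	return 0;		/* full or partial success */
}

/* Returns non-zero if a caller would read ret as "permission denied". */
static int looks_like_eperm(long ret)
{
	return ret == -EPERM;
}
```

With this shape the function never produces a value that userspace would misread as a permission problem.
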
* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Peter Zijlstra @ 2007-12-20 14:47 UTC
To: Hugh Dickins
Cc: linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering

On Thu, 2007-12-20 at 14:09 +0000, Hugh Dickins wrote:
> On Thu, 20 Dec 2007, Peter Zijlstra wrote:
> >
> > Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
> > to use this to pre-fault pages. He currently uses: mlock/munlock for
> > this purpose.
>
> I certainly agree with this in principle: it just seems an unnecessary
> and surprising restriction to refuse on anonymous vmas; I guess the only
> reason for not adding this was not having anyone asking for it until now.
> Though, does Lennart realize he could use MAP_POPULATE in the mmap?

I think he's trying to get his data swapped-in.

> >
> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
> > ---
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 93ee375..eff60ce 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -100,6 +100,24 @@ out:
> >  	return error;
> >  }
> >
> > +static long madvice_willneed_anon(struct vm_area_struct *vma,
> > +				   struct vm_area_struct **prev,
> > +				   unsigned long start, unsigned long end)
>
> madvise.c uses "madvise_" rather than "madvice_" throughout,
> so please go with the flow.

Ah, quite. I hadn't noticed this, will fix.

> > +{
> > +	int ret, len;
> > +
> > +	*prev = vma;
> > +	if (end > vma->vm_end)
> > +		end = vma->vm_end;
>
> Please check, but I think the upper level ensures end is within range.

It certainly looks like it, but since the file case did this check I
thought it prudent to also do it. I guess I might as well remove both.

> > +
> > +	len = end - start;
> > +	ret = get_user_pages(current, current->mm, start, len,
> > +			0, 0, NULL, NULL);
> > +	if (ret < 0)
> > +		return ret;
> > +	return ret == len ? 0 : -1;
>
> It's not good to return -1 as an alternative to a real errno:
> it'll look like -EPERM.  If you copied that from somewhere, better
> send a patch to fix the somewhere!  Ah, yes, make_pages_present: it
> happens that nobody is interested in its return value, so we could
> make it a void; but that'd just be a cleanup.  What to do here if
> non-negative ret less than len?  Oh, just return 0, that's good
> enough in this case (the file case always returns 0).

ok, return 0; it is.

> Hmm, might it be better to use make_pages_present itself,
> fixing its retval, rather than using get_user_pages directly?
> (I'd hope the caching makes its repeat of find_vma not an overhead.)
>
> Interesting divergence: make_pages_present faults in writable pages
> in a writable vma, whereas the file case's force_page_cache_readahead
> doesn't even insert the pages into the mm.

Yeah, the find_vma and write fault thing are the reason I didn't use
make_pages_present.

I had noticed the difference in pte population between
force_page_cache_readahead and make_pages_present, but it seemed to me
that writing a function to walk the page tables and populate the
swapcache but not populate the ptes wasn't worth the effort.

> > +}
> > +
> >  /*
> >   * Schedule all required I/O operations.  Do not wait for completion.
> >   */
> > @@ -110,7 +128,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
> >  	struct file *file = vma->vm_file;
> >
> >  	if (!file)
> > -		return -EBADF;
> > +		return madvice_willneed_anon(vma, prev, start, end);
> >
> >  	if (file->f_mapping->a_ops->get_xip_page) {
> >  		/* no bad return value, but ignore advice */
>
> And there's a correctly invisible hunk to the patch too: this
> extension of MADV_WILLNEED also does not require down_write of
> mmap_sem, so madvise_need_mmap_write can remain unchanged.

Indeed, I did check that :-)

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/mm/madvise.c b/mm/madvise.c
index 93ee375..563bf00 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -100,6 +100,21 @@ out:
 	return error;
 }
 
+static long madvise_willneed_anon(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev,
+				   unsigned long start, unsigned long end)
+{
+	int ret;
+
+	*prev = vma;
+	ret = get_user_pages(current, current->mm, start, end - start,
+			0, 0, NULL, NULL);
+	if (ret < 0)
+		return ret;
+
+	return 0;
+}
+
 /*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
@@ -110,7 +125,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
 	struct file *file = vma->vm_file;
 
 	if (!file)
-		return -EBADF;
+		return madvise_willneed_anon(vma, prev, start, end);
 
 	if (file->f_mapping->a_ops->get_xip_page) {
 		/* no bad return value, but ignore advice */
@@ -119,8 +134,6 @@ static long madvise_willneed(struct vm_area_struct * vma,
 	*prev = vma;
 	start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	if (end > vma->vm_end)
-		end = vma->vm_end;
 	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
 	force_page_cache_readahead(file->f_mapping,

* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Peter Zijlstra @ 2007-12-20 14:56 UTC
To: Hugh Dickins
Cc: linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering

On Thu, 2007-12-20 at 15:47 +0100, Peter Zijlstra wrote:
> On Thu, 2007-12-20 at 14:09 +0000, Hugh Dickins wrote:

> > Interesting divergence: make_pages_present faults in writable pages
> > in a writable vma, whereas the file case's force_page_cache_readahead
> > doesn't even insert the pages into the mm.
>
> Yeah, the find_vma and write fault thing are the reason I didn't use
> make_pages_present.
>
> I had noticed the difference in pte population between
> force_page_cache_readahead and make_pages_present, but it seemed to me
> that writing a function to walk the page tables and populate the
> swapcache but not populate the ptes wasn't worth the effort.

Ah, another, more important difference:

force_page_cache_readahead will not wait for the read to complete,
whereas get_user_pages() will be fully synchronous.

I think I'd better come up with something else then,..

* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Peter Zijlstra @ 2007-12-20 15:18 UTC
To: Hugh Dickins
Cc: linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering

On Thu, 2007-12-20 at 15:56 +0100, Peter Zijlstra wrote:
> On Thu, 2007-12-20 at 15:47 +0100, Peter Zijlstra wrote:
> > On Thu, 2007-12-20 at 14:09 +0000, Hugh Dickins wrote:
>
> > > Interesting divergence: make_pages_present faults in writable pages
> > > in a writable vma, whereas the file case's force_page_cache_readahead
> > > doesn't even insert the pages into the mm.
> >
> > Yeah, the find_vma and write fault thing are the reason I didn't use
> > make_pages_present.
> >
> > I had noticed the difference in pte population between
> > force_page_cache_readahead and make_pages_present, but it seemed to me
> > that writing a function to walk the page tables and populate the
> > swapcache but not populate the ptes wasn't worth the effort.
>
> Ah, another, more important difference:
>
> force_page_cache_readahead will not wait for the read to complete,
> whereas get_user_pages() will be fully synchronous.
>
> I think I'd better come up with something else then,..

Depending on the page table walk from -mm

---
A best effort implementation of madvise(WILLNEED) for anonymous pages.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/mm/madvise.c b/mm/madvise.c
index 93ee375..e6f772a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -11,6 +11,8 @@
 #include <linux/mempolicy.h>
 #include <linux/hugetlb.h>
 #include <linux/sched.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -100,6 +102,34 @@ out:
 	return error;
 }
 
+static int madvise_willneed_anon_pte(pte_t *ptep,
+		unsigned long start, unsigned long end, void *arg)
+{
+	struct vm_area_struct *vma = arg;
+	struct page *page;
+
+	page = read_swap_cache_async(pte_to_swp_entry(*ptep), GFP_KERNEL,
+			vma, start);
+	if (page)
+		page_cache_release(page);
+
+	return 0;
+}
+
+static long madvise_willneed_anon(struct vm_area_struct * vma,
+				  struct vm_area_struct ** prev,
+				  unsigned long start, unsigned long end)
+{
+	struct mm_walk walk = {
+		.pte_entry = madvise_willneed_anon_pte,
+	};
+
+	*prev = vma;
+	walk_page_range(vma->vm_mm, start, end, &walk, vma);
+
+	return 0;
+}
+
 /*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
@@ -110,7 +140,7 @@ static long madvise_willneed(struct vm_area_struct * vma,
 	struct file *file = vma->vm_file;
 
 	if (!file)
-		return -EBADF;
+		return madvise_willneed_anon(vma, prev, start, end);
 
 	if (file->f_mapping->a_ops->get_xip_page) {
 		/* no bad return value, but ignore advice */

* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Peter Zijlstra @ 2007-12-20 15:23 UTC
To: Hugh Dickins
Cc: linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering

On Thu, 2007-12-20 at 16:18 +0100, Peter Zijlstra wrote:

> +static int madvise_willneed_anon_pte(pte_t *ptep,
> +		unsigned long start, unsigned long end, void *arg)
> +{
> +	struct vm_area_struct *vma = arg;
> +	struct page *page;
> +
> +	page = read_swap_cache_async(pte_to_swp_entry(*ptep), GFP_KERNEL,

Argh, with HIGHPTE this is done inside a kmap_atomic.

/me goes complicate the code with page pre-allocation..

* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Hugh Dickins @ 2007-12-20 15:26 UTC
To: Peter Zijlstra
Cc: linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering

On Thu, 20 Dec 2007, Peter Zijlstra wrote:
> On Thu, 2007-12-20 at 14:09 +0000, Hugh Dickins wrote:
> > On Thu, 20 Dec 2007, Peter Zijlstra wrote:
> >
> > I certainly agree with this in principle: it just seems an unnecessary
> > and surprising restriction to refuse on anonymous vmas; I guess the only
> > reason for not adding this was not having anyone asking for it until now.
> > Though, does Lennart realize he could use MAP_POPULATE in the mmap?
>
> I think he's trying to get his data swapped-in.

That's perfectly reasonable, fair enough.

> > > +{
> > > +	int ret, len;
> > > +
> > > +	*prev = vma;
> > > +	if (end > vma->vm_end)
> > > +		end = vma->vm_end;
> >
> > Please check, but I think the upper level ensures end is within range.
>
> It certainly looks like it, but since the file case did this check I
> thought it prudent to also do it. I guess I might as well remove both.

Ah, so it does.  Yes, please do remove both.

> > Hmm, might it be better to use make_pages_present itself,
> > fixing its retval, rather than using get_user_pages directly?
> > (I'd hope the caching makes its repeat of find_vma not an overhead.)
> >
> > Interesting divergence: make_pages_present faults in writable pages
> > in a writable vma, whereas the file case's force_page_cache_readahead
> > doesn't even insert the pages into the mm.
>
> Yeah, the find_vma and write fault thing are the reason I didn't use
> make_pages_present.

The write fault thing is irrelevant now, actually: now do_anonymous_page
doesn't use ZERO_PAGE, it puts in a writable page if the vma flags permit,
even when it's just a read fault (and its write_access arg is redundant).

>
> I had noticed the difference in pte population between
> force_page_cache_readahead and make_pages_present, but it seemed to me
> that writing a function to walk the page tables and populate the
> swapcache but not populate the ptes wasn't worth the effort.

I was about to agree with you, when you made the observation:

> Ah, another, more important difference:
>
> force_page_cache_readahead will not wait for the read to complete,
> whereas get_user_pages() will be fully synchronous.
>
> I think I'd better come up with something else then,..

Yes, that's an interesting point.  Maybe first put in what you have,
to stop it from saying -EBADF on anon; then make it asynch later.

The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
but might prove useful for more general use when swapping in.
Not really the same as Con's swap prefetch, but worth looking
at that for reference.  But I guess this becomes a much bigger
issue than you were intending to get into here.

Hugh

* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Peter Zijlstra @ 2007-12-20 16:53 UTC
To: Hugh Dickins
Cc: linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering, mpm

On Thu, 2007-12-20 at 15:26 +0000, Hugh Dickins wrote:

> The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
> but might prove useful for more general use when swapping in.
> Not really the same as Con's swap prefetch, but worth looking
> at that for reference.  But I guess this becomes a much bigger
> issue than you were intending to get into here.

heh, yeah, got somewhat more complex than I'd hoped for.

last patch for today (not even compile tested), will do a proper patch
and test it tomorrow.

---
A best effort MADV_WILLNEED implementation for anonymous memory.

It adds a batch method to the page table walk routines so we can
copy a few ptes while holding the kmap, which makes it possible to
allocate the backing pages using GFP_KERNEL.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5c3655f..391a453 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -726,6 +726,7 @@ unsigned long unmap_vmas(struct mmu_gather **tlb,
  * @pmd_entry: if set, called for each non-empty PMD (3rd-level) entry
  * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
  * @pte_hole: if set, called for each hole at all levels
+ * @pte_batch: if set, called for each %WALK_BATCH_SIZE PTE entries.
  *
  * (see walk_page_range for more details)
  */
@@ -735,8 +736,16 @@ struct mm_walk {
 	int (*pmd_entry)(pmd_t *, unsigned long, unsigned long, void *);
 	int (*pte_entry)(pte_t *, unsigned long, unsigned long, void *);
 	int (*pte_hole)(unsigned long, unsigned long, void *);
+	int (*pte_batch)(unsigned long, unsigned long, void *);
 };
 
+#define WALK_BATCH_SIZE	32
+
+static inline int walk_addr_index(unsigned long addr)
+{
+	return (addr >> PAGE_SHIFT) % WALK_BATCH_SIZE;
+}
+
 int walk_page_range(const struct mm_struct *, unsigned long addr,
 		    unsigned long end, const struct mm_walk *walk,
 		    void *private);
diff --git a/mm/madvise.c b/mm/madvise.c
index 93ee375..86610a0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -11,6 +11,8 @@
 #include <linux/mempolicy.h>
 #include <linux/hugetlb.h>
 #include <linux/sched.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
 
 /*
  * Any behaviour which results in changes to the vma->vm_flags needs to
@@ -100,17 +102,71 @@ out:
 	return error;
 }
 
+struct madvise_willneed_anon_data {
+	pte_t entries[WALK_BATCH_SIZE];
+	struct vm_area_struct *vma;
+};
+
+static int madvise_willneed_anon_pte(pte_t *ptep,
+		unsigned long addr, unsigned long end, void *arg)
+{
+	struct madvise_willneed_anon_data *data = arg;
+
+	data->entries[walk_addr_index(addr)] = *ptep;
+
+	return 0;
+}
+
+static int madvise_willneed_anon_batch(unsigned long addr,
+		unsigned long end, void *arg)
+{
+	struct madvise_willneed_anon_data *data = arg;
+	unsigned int i;
+
+	for (; addr != end; addr += PAGE_SIZE) {
+		pte_t pte = data->entries[walk_addr_index(addr)];
+
+		if (is_swap_pte(pte)) {
+			struct page *page =
+				read_swap_cache_async(pte_to_swp_entry(pte),
+						GFP_KERNEL, data->vma, addr);
+			if (page)
+				page_cache_release(page);
+		}
+	}
+
+	return 0;
+}
+
+static long madvise_willneed_anon(struct vm_area_struct *vma,
+				  struct vm_area_struct **prev,
+				  unsigned long start, unsigned long end)
+{
+	struct madvise_willneed_anon_data data = {
+		.vma = vma,
+	};
+	struct mm_walk walk = {
+		.pte_entry = madvise_willneed_anon_pte,
+		.pte_batch = madvise_willneed_anon_batch,
+	};
+
+	*prev = vma;
+	walk_page_range(vma->vm_mm, start, end, &walk, &data);
+
+	return 0;
+}
+
 /*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
-static long madvise_willneed(struct vm_area_struct * vma,
-			     struct vm_area_struct ** prev,
+static long madvise_willneed(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
 			     unsigned long start, unsigned long end)
 {
 	struct file *file = vma->vm_file;
 
 	if (!file)
-		return -EBADF;
+		return madvise_willneed_anon(vma, prev, start, end);
 
 	if (file->f_mapping->a_ops->get_xip_page) {
 		/* no bad return value, but ignore advice */
@@ -119,8 +175,6 @@ static long madvise_willneed(struct vm_area_struct * vma,
 	*prev = vma;
 	start = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
-	if (end > vma->vm_end)
-		end = vma->vm_end;
 	end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
 	force_page_cache_readahead(file->f_mapping,
@@ -147,8 +201,8 @@ static long madvise_willneed(struct vm_area_struct * vma,
  * An interface that causes the system to free clean pages and flush
  * dirty pages is already available as msync(MS_INVALIDATE).
  */
-static long madvise_dontneed(struct vm_area_struct * vma,
-			     struct vm_area_struct ** prev,
+static long madvise_dontneed(struct vm_area_struct *vma,
+			     struct vm_area_struct **prev,
 			     unsigned long start, unsigned long end)
 {
 	*prev = vma;
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index b4f27d2..25fc656 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -2,12 +2,45 @@
 #include <linux/highmem.h>
 #include <linux/sched.h>
 
+static int walk_pte_range_batch(pmd_t *pmd, unsigned long addr, unsigned long end,
+				const struct mm_walk *walk, void *private)
+{
+	int err = 0;
+
+	do {
+		unsigned int i;
+		pte_t *pte;
+		unsigned long start = addr;
+		int err2;
+
+		pte = pte_offset_map(pmd, addr);
+		for (i = 0; i < WALK_BATCH_SIZE && addr != end;
+				i++, pte++, addr += PAGE_SIZE) {
+			err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
+			if (err)
+				break;
+		}
+		pte_unmap(pte);
+
+		err2 = walk->pte_batch(start, addr, private);
+		if (!err)
+			err = err2;
+		if (err)
+			break;
+	} while (addr != end);
+
+	return err;
+}
+
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			  const struct mm_walk *walk, void *private)
 {
 	pte_t *pte;
 	int err = 0;
 
+	if (walk->pte_batch)
+		return walk_pte_range_batch(pmd, addr, end, walk, private);
+
 	pte = pte_offset_map(pmd, addr);
 	do {
 		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);

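The batching idea in this patch (copy a handful of entries while the page table is mapped atomically, then do the blocking work on the private copy) can be modelled in plain userspace C. The sketch below is only an illustration of that pattern under simplifying assumptions: entry_t, struct batch_walk, process_batch() and walk_batched() are hypothetical names, an array stands in for the page table, and "non-zero" stands in for "is a swap entry".

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_SIZE 32	/* mirrors WALK_BATCH_SIZE from the patch */

/* Hypothetical stand-in for a page table entry. */
typedef unsigned long entry_t;

struct batch_walk {
	entry_t copy[BATCH_SIZE];	/* snapshot taken while "mapped" */
	size_t processed;		/* entries handled by the slow path */
	size_t max_batch;		/* largest batch handed over so far */
};

/* Slow path: may block, so it only looks at the private snapshot. */
static void process_batch(struct batch_walk *w, size_t n)
{
	assert(n <= BATCH_SIZE);
	for (size_t i = 0; i < n; i++)
		if (w->copy[i] != 0)	/* pretend non-zero means swapped out */
			w->processed++;
	if (n > w->max_batch)
		w->max_batch = n;
}

/* Walk count entries in chunks: copy first, then process the copy. */
static void walk_batched(const entry_t *table, size_t count,
			 struct batch_walk *w)
{
	size_t pos = 0;

	while (pos < count) {
		size_t n = 0;

		/* "Atomic" section: nothing here may sleep, so only copy. */
		while (n < BATCH_SIZE && pos < count)
			w->copy[n++] = table[pos++];

		/* Mapping dropped: blocking work is allowed from here on. */
		process_batch(w, n);
	}
}
```

Note that the slow path is only ever told about the n entries that were actually copied, never about the rest of the range.
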
* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Matt Mackall @ 2007-12-20 17:11 UTC
To: Peter Zijlstra
Cc: Hugh Dickins, linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering

On Thu, Dec 20, 2007 at 05:53:41PM +0100, Peter Zijlstra wrote:
>
> On Thu, 2007-12-20 at 15:26 +0000, Hugh Dickins wrote:
>
> > The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
> > but might prove useful for more general use when swapping in.
> > Not really the same as Con's swap prefetch, but worth looking
> > at that for reference.  But I guess this becomes a much bigger
> > issue than you were intending to get into here.
>
> heh, yeah, got somewhat more complex than I'd hoped for.
>
> last patch for today (not even compile tested), will do a proper patch
> and test it tomorrow.
>
> ---
> A best effort MADV_WILLNEED implementation for anonymous memory.
>
> It adds a batch method to the page table walk routines so we can
> copy a few ptes while holding the kmap, which makes it possible to
> allocate the backing pages using GFP_KERNEL.

Yuck. We actually need to just fix the atomic kmap issue in the
existing pagemap code rather than add a new method, I think.

If performance of map/unmap is too slow at a granularity of 1, we can
add some internal batching in the CONFIG_HIGHPTE case.

--
Mathematics is the supreme nostalgia of our time.

* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Peter Zijlstra @ 2007-12-20 17:15 UTC
To: Matt Mackall
Cc: Hugh Dickins, linux-kernel, linux-mm, Nick Piggin, riel, Lennart Poettering

On Thu, 2007-12-20 at 11:11 -0600, Matt Mackall wrote:
> On Thu, Dec 20, 2007 at 05:53:41PM +0100, Peter Zijlstra wrote:
> >
> > On Thu, 2007-12-20 at 15:26 +0000, Hugh Dickins wrote:
> >
> > > The asynch code: perhaps not worth doing for MADV_WILLNEED alone,
> > > but might prove useful for more general use when swapping in.
> > > Not really the same as Con's swap prefetch, but worth looking
> > > at that for reference.  But I guess this becomes a much bigger
> > > issue than you were intending to get into here.
> >
> > heh, yeah, got somewhat more complex than I'd hoped for.
> >
> > last patch for today (not even compile tested), will do a proper patch
> > and test it tomorrow.
> >
> > ---
> > A best effort MADV_WILLNEED implementation for anonymous memory.
> >
> > It adds a batch method to the page table walk routines so we can
> > copy a few ptes while holding the kmap, which makes it possible to
> > allocate the backing pages using GFP_KERNEL.
>
> Yuck. We actually need to just fix the atomic kmap issue in the
> existing pagemap code rather than add a new method, I think.
>
> If performance of map/unmap is too slow at a granularity of 1, we can
> add some internal batching in the CONFIG_HIGHPTE case.

OK, sounds like a much better idea indeed. Will implement that.

* Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory
From: Lennart Poettering @ 2007-12-20 16:29 UTC
To: Hugh Dickins; +Cc: Peter Zijlstra, linux-kernel, linux-mm, Nick Piggin, riel

On Thu, 20.12.07 14:09, Hugh Dickins (hugh@veritas.com) wrote:

> > Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
> > to use this to pre-fault pages. He currently uses: mlock/munlock for
> > this purpose.
>
> I certainly agree with this in principle: it just seems an unnecessary
> and surprising restriction to refuse on anonymous vmas; I guess the only
> reason for not adding this was not having anyone asking for it until now.
> Though, does Lennart realize he could use MAP_POPULATE in the mmap?

Not really. First, if the mmap() is hidden somewhere in glibc
(i.e. as part of malloc() or whatever) it's not really possible to do
MAP_POPULATE.

Also, I need this for some memory that is allocated during the whole
runtime but only seldom used. Thus I am happy if it is swapped out,
but every time I want to use it I want to make sure it is paged in
before I pass it on to the RT thread. So, there's a mmap() during
startup only, and then, during the whole runtime of my program I want
to page in the memory again and again, with long intervals in between,
but with no call to mmap()/munmap().

Lennart

--
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net         ICQ# 11060553
http://0pointer.net/lennart/           GnuPG 0x1A015CC4

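The usage pattern described here (allocate once, then page the buffer back in right before each hand-off to the real-time thread) can also be approximated without any kernel support by touching every page from a non-real-time thread. The following is a minimal sketch of that fallback, not something proposed in the thread: touch_pages() is a hypothetical helper, and unlike the asynchronous readahead discussed above it blocks until each page is resident.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: read one byte from every page of [buf, buf + len)
 * so that swapped-out pages are faulted back in before the buffer is
 * passed on. Returns the number of pages touched.
 */
static size_t touch_pages(const void *buf, size_t len)
{
	long ps = sysconf(_SC_PAGESIZE);
	const volatile unsigned char *p = buf;
	size_t pages = 0;

	assert(ps > 0);
	for (size_t off = 0; off < len; off += (size_t)ps) {
		(void)p[off];	/* volatile read forces a real access */
		pages++;
	}
	return pages;
}
```

This works on memory obtained through malloc() as well, which covers the case where the mmap() call is hidden inside glibc.
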
Thread overview: 11+ messages
2007-12-20 13:05 [rfc][patch] mm: madvise(WILLNEED) for anonymous memory Peter Zijlstra
2007-12-20 14:09 ` Hugh Dickins
2007-12-20 14:47   ` Peter Zijlstra
2007-12-20 14:56     ` Peter Zijlstra
2007-12-20 15:18       ` Peter Zijlstra
2007-12-20 15:23         ` Peter Zijlstra
2007-12-20 15:26     ` Hugh Dickins
2007-12-20 16:53       ` Peter Zijlstra
2007-12-20 17:11         ` Matt Mackall
2007-12-20 17:15           ` Peter Zijlstra
2007-12-20 16:29   ` Lennart Poettering