* [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible @ 2025-12-15 5:30 Barry Song 2025-12-18 13:01 ` David Hildenbrand (Red Hat) 2025-12-18 14:00 ` Uladzislau Rezki 0 siblings, 2 replies; 8+ messages in thread From: Barry Song @ 2025-12-15 5:30 UTC (permalink / raw) To: akpm, linux-mm Cc: dri-devel, jstultz, linaro-mm-sig, linux-kernel, linux-media, Barry Song, David Hildenbrand, Uladzislau Rezki, Sumit Semwal, Maxime Ripard, Tangquan Zheng From: Barry Song <v-songbaohua@oppo.com> In many cases, the pages passed to vmap() may include high-order pages allocated with __GFP_COMP flags. For example, the systemheap often allocates pages in descending order: order 8, then 4, then 0. Currently, vmap() iterates over every page individually—even pages inside a high-order block are handled one by one. This patch detects high-order pages and maps them as a single contiguous block whenever possible. An alternative would be to implement a new API, vmap_sg(), but that change seems to be large in scope. When vmapping a 128MB dma-buf using the systemheap, this patch makes system_heap_do_vmap() roughly 17× faster. W/ patch: [ 10.404769] system_heap_do_vmap took 2494000 ns [ 12.525921] system_heap_do_vmap took 2467008 ns [ 14.517348] system_heap_do_vmap took 2471008 ns [ 16.593406] system_heap_do_vmap took 2444000 ns [ 19.501341] system_heap_do_vmap took 2489008 ns W/o patch: [ 7.413756] system_heap_do_vmap took 42626000 ns [ 9.425610] system_heap_do_vmap took 42500992 ns [ 11.810898] system_heap_do_vmap took 42215008 ns [ 14.336790] system_heap_do_vmap took 42134992 ns [ 16.373890] system_heap_do_vmap took 42750000 ns Cc: David Hildenbrand <david@kernel.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: John Stultz <jstultz@google.com> Cc: Maxime Ripard <mripard@kernel.org> Tested-by: Tangquan Zheng <zhengtangquan@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> --- * diff with rfc: Many code refinements based on David's suggestions, thanks! Refine comment and changelog according to Uladzislau, thanks! rfc link: https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/ mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------ 1 file changed, 39 insertions(+), 6 deletions(-) diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 41dd01e8430c..8d577767a9e5 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, return err; } +static inline int get_vmap_batch_order(struct page **pages, + unsigned int stride, unsigned int max_steps, unsigned int idx) +{ + int nr_pages = 1; + + /* + * Currently, batching is only supported in vmap_pages_range + * when page_shift == PAGE_SHIFT. + */ + if (stride != 1) + return 0; + + nr_pages = compound_nr(pages[idx]); + if (nr_pages == 1) + return 0; + if (max_steps < nr_pages) + return 0; + + if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages) + return compound_order(pages[idx]); + return 0; +} + /* * vmap_pages_range_noflush is similar to vmap_pages_range, but does not * flush caches. @@ -655,23 +678,33 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end, pgprot_t prot, struct page **pages, unsigned int page_shift) { unsigned int i, nr = (end - addr) >> PAGE_SHIFT; + unsigned int stride; WARN_ON(page_shift < PAGE_SHIFT); + /* + * For vmap(), users may allocate pages from high orders down to + * order 0, while always using PAGE_SHIFT as the page_shift. 
+ * We first check whether the initial page is a compound page. If so, + * there may be an opportunity to batch multiple pages together. + */ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || - page_shift == PAGE_SHIFT) + (page_shift == PAGE_SHIFT && !PageCompound(pages[0]))) return vmap_small_pages_range_noflush(addr, end, prot, pages); - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { - int err; + stride = 1U << (page_shift - PAGE_SHIFT); + for (i = 0; i < nr; ) { + int err, order; - err = vmap_range_noflush(addr, addr + (1UL << page_shift), + order = get_vmap_batch_order(pages, stride, nr - i, i); + err = vmap_range_noflush(addr, addr + (1UL << (page_shift + order)), page_to_phys(pages[i]), prot, - page_shift); + page_shift + order); if (err) return err; - addr += 1UL << page_shift; + addr += 1UL << (page_shift + order); + i += 1U << (order + page_shift - PAGE_SHIFT); } return 0; -- 2.39.3 (Apple Git-146) ^ permalink raw reply [flat|nested] 8+ messages in thread
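To make the input to vmap() concrete: below is a simplified, illustrative allocator in the style of the system heap (the function name and the error handling are assumptions, not the actual drivers/dma-buf/heaps/system_heap.c code). It fills a pages[] array from descending orders with __GFP_COMP, which is why the array later handed to vmap() contains long runs of physically contiguous subpages that the patch can map with one call per compound page.

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Illustrative only: fill pages[] for nr_pages base pages from orders 8, 4, 0. */
static struct page **demo_alloc_pages(unsigned long nr_pages)
{
	static const unsigned int orders[] = { 8, 4, 0 };
	struct page **pages;
	unsigned long filled = 0;
	unsigned int o = 0;

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return NULL;

	while (filled < nr_pages) {
		struct page *page = NULL;
		unsigned int i, nr;

		/* Try the largest order that still fits, fall back on failure. */
		for (; o < ARRAY_SIZE(orders); o++) {
			if ((1UL << orders[o]) > nr_pages - filled)
				continue;
			page = alloc_pages(GFP_KERNEL | __GFP_COMP, orders[o]);
			if (page)
				break;
		}
		if (!page) {
			/* Freeing already-allocated pages is omitted for brevity. */
			kvfree(pages);
			return NULL;
		}

		/* Subpages of one high-order allocation are consecutive in the memmap. */
		nr = 1U << orders[o];
		for (i = 0; i < nr; i++)
			pages[filled++] = page + i;
	}

	return pages;
}

The caller then maps the whole array in one go, e.g. vaddr = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL), which is the path the numbers above measure.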
* Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible 2025-12-15 5:30 [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible Barry Song @ 2025-12-18 13:01 ` David Hildenbrand (Red Hat) 2025-12-18 13:54 ` Uladzislau Rezki 2025-12-18 14:00 ` Uladzislau Rezki 1 sibling, 1 reply; 8+ messages in thread From: David Hildenbrand (Red Hat) @ 2025-12-18 13:01 UTC (permalink / raw) To: Barry Song, akpm, linux-mm Cc: dri-devel, jstultz, linaro-mm-sig, linux-kernel, linux-media, Barry Song, Uladzislau Rezki, Sumit Semwal, Maxime Ripard, Tangquan Zheng On 12/15/25 06:30, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > In many cases, the pages passed to vmap() may include high-order > pages allocated with __GFP_COMP flags. For example, the systemheap > often allocates pages in descending order: order 8, then 4, then 0. > Currently, vmap() iterates over every page individually—even pages > inside a high-order block are handled one by one. > > This patch detects high-order pages and maps them as a single > contiguous block whenever possible. > > An alternative would be to implement a new API, vmap_sg(), but that > change seems to be large in scope. > > When vmapping a 128MB dma-buf using the systemheap, this patch > makes system_heap_do_vmap() roughly 17× faster. > > W/ patch: > [ 10.404769] system_heap_do_vmap took 2494000 ns > [ 12.525921] system_heap_do_vmap took 2467008 ns > [ 14.517348] system_heap_do_vmap took 2471008 ns > [ 16.593406] system_heap_do_vmap took 2444000 ns > [ 19.501341] system_heap_do_vmap took 2489008 ns > > W/o patch: > [ 7.413756] system_heap_do_vmap took 42626000 ns > [ 9.425610] system_heap_do_vmap took 42500992 ns > [ 11.810898] system_heap_do_vmap took 42215008 ns > [ 14.336790] system_heap_do_vmap took 42134992 ns > [ 16.373890] system_heap_do_vmap took 42750000 ns > That's quite a speedup. > Cc: David Hildenbrand <david@kernel.org> > Cc: Uladzislau Rezki <urezki@gmail.com> > Cc: Sumit Semwal <sumit.semwal@linaro.org> > Cc: John Stultz <jstultz@google.com> > Cc: Maxime Ripard <mripard@kernel.org> > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > * diff with rfc: > Many code refinements based on David's suggestions, thanks! > Refine comment and changelog according to Uladzislau, thanks! > rfc link: > https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/ > > mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 39 insertions(+), 6 deletions(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 41dd01e8430c..8d577767a9e5 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, > return err; > } > > +static inline int get_vmap_batch_order(struct page **pages, > + unsigned int stride, unsigned int max_steps, unsigned int idx) > +{ > + int nr_pages = 1; unsigned int, maybe Why are you initializing nr_pages when you overwrite it below? > + > + /* > + * Currently, batching is only supported in vmap_pages_range > + * when page_shift == PAGE_SHIFT. I don't know the code so realizing how we go from page_shift to stride too me a second. Maybe only talk about stride here? OTOH, is "stride" really the right terminology? we calculate it as stride = 1U << (page_shift - PAGE_SHIFT); page_shift - PAGE_SHIFT should give us an "order". So is this a "granularity" in nr_pages? 
Again, I don't know this code, so sorry for the question. > + */ > + if (stride != 1) > + return 0; > + > + nr_pages = compound_nr(pages[idx]); > + if (nr_pages == 1) > + return 0; > + if (max_steps < nr_pages) > + return 0; Might combine these simple checks if (nr_pages == 1 || max_steps < nr_pages) return 0; -- Cheers David ^ permalink raw reply [flat|nested] 8+ messages in thread
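For readers trying to parse the quantity under discussion, concrete numbers make it easier to see (4K base pages assumed, so PAGE_SHIFT == 12):

	unsigned int stride = 1U << (page_shift - PAGE_SHIFT);

	/*
	 * page_shift == PAGE_SHIFT      -> stride = 1:   each step maps one
	 *                                  pages[] entry.
	 * page_shift == PMD_SHIFT (21)  -> stride = 512: each step maps one
	 *                                  PMD-sized block and consumes 512
	 *                                  consecutive pages[] entries.
	 */

In other words, "stride" here is the number of pages[] entries consumed per mapping step, i.e. 2^order in David's terms.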
* Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible 2025-12-18 13:01 ` David Hildenbrand (Red Hat) @ 2025-12-18 13:54 ` Uladzislau Rezki 2025-12-18 21:24 ` Barry Song 0 siblings, 1 reply; 8+ messages in thread From: Uladzislau Rezki @ 2025-12-18 13:54 UTC (permalink / raw) To: David Hildenbrand (Red Hat), Barry Song Cc: Barry Song, akpm, linux-mm, dri-devel, jstultz, linaro-mm-sig, linux-kernel, linux-media, Barry Song, Uladzislau Rezki, Sumit Semwal, Maxime Ripard, Tangquan Zheng On Thu, Dec 18, 2025 at 02:01:56PM +0100, David Hildenbrand (Red Hat) wrote: > On 12/15/25 06:30, Barry Song wrote: > > From: Barry Song <v-songbaohua@oppo.com> > > > > In many cases, the pages passed to vmap() may include high-order > > pages allocated with __GFP_COMP flags. For example, the systemheap > > often allocates pages in descending order: order 8, then 4, then 0. > > Currently, vmap() iterates over every page individually—even pages > > inside a high-order block are handled one by one. > > > > This patch detects high-order pages and maps them as a single > > contiguous block whenever possible. > > > > An alternative would be to implement a new API, vmap_sg(), but that > > change seems to be large in scope. > > > > When vmapping a 128MB dma-buf using the systemheap, this patch > > makes system_heap_do_vmap() roughly 17× faster. > > > > W/ patch: > > [ 10.404769] system_heap_do_vmap took 2494000 ns > > [ 12.525921] system_heap_do_vmap took 2467008 ns > > [ 14.517348] system_heap_do_vmap took 2471008 ns > > [ 16.593406] system_heap_do_vmap took 2444000 ns > > [ 19.501341] system_heap_do_vmap took 2489008 ns > > > > W/o patch: > > [ 7.413756] system_heap_do_vmap took 42626000 ns > > [ 9.425610] system_heap_do_vmap took 42500992 ns > > [ 11.810898] system_heap_do_vmap took 42215008 ns > > [ 14.336790] system_heap_do_vmap took 42134992 ns > > [ 16.373890] system_heap_do_vmap took 42750000 ns > > > > That's quite a speedup. > > > Cc: David Hildenbrand <david@kernel.org> > > Cc: Uladzislau Rezki <urezki@gmail.com> > > Cc: Sumit Semwal <sumit.semwal@linaro.org> > > Cc: John Stultz <jstultz@google.com> > > Cc: Maxime Ripard <mripard@kernel.org> > > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > --- > > * diff with rfc: > > Many code refinements based on David's suggestions, thanks! > > Refine comment and changelog according to Uladzislau, thanks! > > rfc link: > > https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/ > > > > mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------ > > 1 file changed, 39 insertions(+), 6 deletions(-) > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > index 41dd01e8430c..8d577767a9e5 100644 > > --- a/mm/vmalloc.c > > +++ b/mm/vmalloc.c > > @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, > > return err; > > } > > +static inline int get_vmap_batch_order(struct page **pages, > > + unsigned int stride, unsigned int max_steps, unsigned int idx) > > +{ > > + int nr_pages = 1; > > unsigned int, maybe > > Why are you initializing nr_pages when you overwrite it below? > > > + > > + /* > > + * Currently, batching is only supported in vmap_pages_range > > + * when page_shift == PAGE_SHIFT. > > I don't know the code so realizing how we go from page_shift to stride too > me a second. Maybe only talk about stride here? > > OTOH, is "stride" really the right terminology? 
> > we calculate it as > > stride = 1U << (page_shift - PAGE_SHIFT); > > page_shift - PAGE_SHIFT should give us an "order". So is this a > "granularity" in nr_pages? > > Again, I don't know this code, so sorry for the question. > To me "stride" also sounds unclear. -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible 2025-12-18 13:54 ` Uladzislau Rezki @ 2025-12-18 21:24 ` Barry Song 2025-12-22 13:08 ` Uladzislau Rezki 0 siblings, 1 reply; 8+ messages in thread From: Barry Song @ 2025-12-18 21:24 UTC (permalink / raw) To: urezki Cc: 21cnbao, akpm, david, dri-devel, jstultz, linaro-mm-sig, linux-kernel, linux-media, linux-mm, mripard, sumit.semwal, v-songbaohua, zhengtangquan On Thu, Dec 18, 2025 at 9:55 PM Uladzislau Rezki <urezki@gmail.com> wrote: > > On Thu, Dec 18, 2025 at 02:01:56PM +0100, David Hildenbrand (Red Hat) wrote: > > On 12/15/25 06:30, Barry Song wrote: > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > > In many cases, the pages passed to vmap() may include high-order > > > pages allocated with __GFP_COMP flags. For example, the systemheap > > > often allocates pages in descending order: order 8, then 4, then 0. > > > Currently, vmap() iterates over every page individually—even pages > > > inside a high-order block are handled one by one. > > > > > > This patch detects high-order pages and maps them as a single > > > contiguous block whenever possible. > > > > > > An alternative would be to implement a new API, vmap_sg(), but that > > > change seems to be large in scope. > > > > > > When vmapping a 128MB dma-buf using the systemheap, this patch > > > makes system_heap_do_vmap() roughly 17× faster. > > > > > > W/ patch: > > > [ 10.404769] system_heap_do_vmap took 2494000 ns > > > [ 12.525921] system_heap_do_vmap took 2467008 ns > > > [ 14.517348] system_heap_do_vmap took 2471008 ns > > > [ 16.593406] system_heap_do_vmap took 2444000 ns > > > [ 19.501341] system_heap_do_vmap took 2489008 ns > > > > > > W/o patch: > > > [ 7.413756] system_heap_do_vmap took 42626000 ns > > > [ 9.425610] system_heap_do_vmap took 42500992 ns > > > [ 11.810898] system_heap_do_vmap took 42215008 ns > > > [ 14.336790] system_heap_do_vmap took 42134992 ns > > > [ 16.373890] system_heap_do_vmap took 42750000 ns > > > > > > > That's quite a speedup. > > > > > Cc: David Hildenbrand <david@kernel.org> > > > Cc: Uladzislau Rezki <urezki@gmail.com> > > > Cc: Sumit Semwal <sumit.semwal@linaro.org> > > > Cc: John Stultz <jstultz@google.com> > > > Cc: Maxime Ripard <mripard@kernel.org> > > > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > > --- > > > * diff with rfc: > > > Many code refinements based on David's suggestions, thanks! > > > Refine comment and changelog according to Uladzislau, thanks! > > > rfc link: > > > https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/ > > > > > > mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------ > > > 1 file changed, 39 insertions(+), 6 deletions(-) > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > index 41dd01e8430c..8d577767a9e5 100644 > > > --- a/mm/vmalloc.c > > > +++ b/mm/vmalloc.c > > > @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, > > > return err; > > > } > > > +static inline int get_vmap_batch_order(struct page **pages, > > > + unsigned int stride, unsigned int max_steps, unsigned int idx) > > > +{ > > > + int nr_pages = 1; > > > > unsigned int, maybe Right > > > > Why are you initializing nr_pages when you overwrite it below? Right, initializing nr_pages can be dropped. > > > > > + > > > + /* > > > + * Currently, batching is only supported in vmap_pages_range > > > + * when page_shift == PAGE_SHIFT. 
> > > > I don't know the code so realizing how we go from page_shift to stride too > > me a second. Maybe only talk about stride here? > > > > OTOH, is "stride" really the right terminology? > > > > we calculate it as > > > > stride = 1U << (page_shift - PAGE_SHIFT); > > > > page_shift - PAGE_SHIFT should give us an "order". So is this a > > "granularity" in nr_pages? This is the case where vmalloc() may realize that it has high-order pages and therefore calls vmap_pages_range_noflush() with a page_shift larger than PAGE_SHIFT. For vmap(), we take a pages array, so page_shift is always PAGE_SHIFT. > > > > Again, I don't know this code, so sorry for the question. > > > To me "stride" also sounds unclear. Thanks, David and Uladzislau. On second thought, this stride may be redundant, and it should be possible to drop it entirely. This results in the code below: diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 41dd01e8430c..3962bdcb43e5 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -642,6 +642,20 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, return err; } +static inline int get_vmap_batch_order(struct page **pages, + unsigned int max_steps, unsigned int idx) +{ + unsigned int nr_pages = compound_nr(pages[idx]); + + if (nr_pages == 1 || max_steps < nr_pages) + return 0; + + if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages) + return compound_order(pages[idx]); + return 0; +} + /* * vmap_pages_range_noflush is similar to vmap_pages_range, but does not * flush caches. @@ -658,20 +672,35 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end, WARN_ON(page_shift < PAGE_SHIFT); + /* + * For vmap(), users may allocate pages from high orders down to + * order 0, while always using PAGE_SHIFT as the page_shift. + * We first check whether the initial page is a compound page. If so, + * there may be an opportunity to batch multiple pages together. + */ if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || - page_shift == PAGE_SHIFT) + (page_shift == PAGE_SHIFT && !PageCompound(pages[0]))) return vmap_small_pages_range_noflush(addr, end, prot, pages); - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { + for (i = 0; i < nr; ) { + unsigned int shift = page_shift; int err; - err = vmap_range_noflush(addr, addr + (1UL << page_shift), + /* + * For vmap() cases, page_shift is always PAGE_SHIFT, even + * if the pages are physically contiguous, they may still + * be mapped in a batch. + */ + if (page_shift == PAGE_SHIFT) + shift += get_vmap_batch_order(pages, nr - i, i); + err = vmap_range_noflush(addr, addr + (1UL << shift), page_to_phys(pages[i]), prot, - page_shift); + shift); if (err) return err; - addr += 1UL << page_shift; + addr += 1UL << shift; + i += 1U << shift; } return 0; Does this look clearer? Thanks Barry ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible 2025-12-18 21:24 ` Barry Song @ 2025-12-22 13:08 ` Uladzislau Rezki 2025-12-23 21:23 ` Barry Song 0 siblings, 1 reply; 8+ messages in thread From: Uladzislau Rezki @ 2025-12-22 13:08 UTC (permalink / raw) To: Barry Song Cc: urezki, akpm, david, dri-devel, jstultz, linaro-mm-sig, linux-kernel, linux-media, linux-mm, mripard, sumit.semwal, v-songbaohua, zhengtangquan On Fri, Dec 19, 2025 at 05:24:36AM +0800, Barry Song wrote: > On Thu, Dec 18, 2025 at 9:55 PM Uladzislau Rezki <urezki@gmail.com> wrote: > > > > On Thu, Dec 18, 2025 at 02:01:56PM +0100, David Hildenbrand (Red Hat) wrote: > > > On 12/15/25 06:30, Barry Song wrote: > > > > From: Barry Song <v-songbaohua@oppo.com> > > > > > > > > In many cases, the pages passed to vmap() may include high-order > > > > pages allocated with __GFP_COMP flags. For example, the systemheap > > > > often allocates pages in descending order: order 8, then 4, then 0. > > > > Currently, vmap() iterates over every page individually—even pages > > > > inside a high-order block are handled one by one. > > > > > > > > This patch detects high-order pages and maps them as a single > > > > contiguous block whenever possible. > > > > > > > > An alternative would be to implement a new API, vmap_sg(), but that > > > > change seems to be large in scope. > > > > > > > > When vmapping a 128MB dma-buf using the systemheap, this patch > > > > makes system_heap_do_vmap() roughly 17× faster. > > > > > > > > W/ patch: > > > > [ 10.404769] system_heap_do_vmap took 2494000 ns > > > > [ 12.525921] system_heap_do_vmap took 2467008 ns > > > > [ 14.517348] system_heap_do_vmap took 2471008 ns > > > > [ 16.593406] system_heap_do_vmap took 2444000 ns > > > > [ 19.501341] system_heap_do_vmap took 2489008 ns > > > > > > > > W/o patch: > > > > [ 7.413756] system_heap_do_vmap took 42626000 ns > > > > [ 9.425610] system_heap_do_vmap took 42500992 ns > > > > [ 11.810898] system_heap_do_vmap took 42215008 ns > > > > [ 14.336790] system_heap_do_vmap took 42134992 ns > > > > [ 16.373890] system_heap_do_vmap took 42750000 ns > > > > > > > > > > That's quite a speedup. > > > > > > > Cc: David Hildenbrand <david@kernel.org> > > > > Cc: Uladzislau Rezki <urezki@gmail.com> > > > > Cc: Sumit Semwal <sumit.semwal@linaro.org> > > > > Cc: John Stultz <jstultz@google.com> > > > > Cc: Maxime Ripard <mripard@kernel.org> > > > > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com> > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > > > > --- > > > > * diff with rfc: > > > > Many code refinements based on David's suggestions, thanks! > > > > Refine comment and changelog according to Uladzislau, thanks! 
> > > > rfc link: > > > > https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/ > > > > > > > > mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------ > > > > 1 file changed, 39 insertions(+), 6 deletions(-) > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > index 41dd01e8430c..8d577767a9e5 100644 > > > > --- a/mm/vmalloc.c > > > > +++ b/mm/vmalloc.c > > > > @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, > > > > return err; > > > > } > > > > +static inline int get_vmap_batch_order(struct page **pages, > > > > + unsigned int stride, unsigned int max_steps, unsigned int idx) > > > > +{ > > > > + int nr_pages = 1; > > > > > > unsigned int, maybe > > Right > > > > > > > Why are you initializing nr_pages when you overwrite it below? > > Right, initializing nr_pages can be dropped. > > > > > > > > + > > > > + /* > > > > + * Currently, batching is only supported in vmap_pages_range > > > > + * when page_shift == PAGE_SHIFT. > > > > > > I don't know the code so realizing how we go from page_shift to stride too > > > me a second. Maybe only talk about stride here? > > > > > > OTOH, is "stride" really the right terminology? > > > > > > we calculate it as > > > > > > stride = 1U << (page_shift - PAGE_SHIFT); > > > > > > page_shift - PAGE_SHIFT should give us an "order". So is this a > > > "granularity" in nr_pages? > > This is the case where vmalloc() may realize that it has > high-order pages and therefore calls > vmap_pages_range_noflush() with a page_shift larger than > PAGE_SHIFT. For vmap(), we take a pages array, so > page_shift is always PAGE_SHIFT. > > > > > > > Again, I don't know this code, so sorry for the question. > > > > > To me "stride" also sounds unclear. > > Thanks, David and Uladzislau. On second thought, this stride may be > redundant, and it should be possible to drop it entirely. This results > in the code below: > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 41dd01e8430c..3962bdcb43e5 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -642,6 +642,20 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, > return err; > } > > +static inline int get_vmap_batch_order(struct page **pages, > + unsigned int max_steps, unsigned int idx) > +{ > + unsigned int nr_pages = compound_nr(pages[idx]); > + > + if (nr_pages == 1 || max_steps < nr_pages) > + return 0; > + > + if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages) > + return compound_order(pages[idx]); > + return 0; > +} > + > > /* > * vmap_pages_range_noflush is similar to vmap_pages_range, but does not > * flush caches. > @@ -658,20 +672,35 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end, > > WARN_ON(page_shift < PAGE_SHIFT); > > + /* > + * For vmap(), users may allocate pages from high orders down to > + * order 0, while always using PAGE_SHIFT as the page_shift. > + * We first check whether the initial page is a compound page. If so, > + * there may be an opportunity to batch multiple pages together. > + */ > if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || > - page_shift == PAGE_SHIFT) > + (page_shift == PAGE_SHIFT && !PageCompound(pages[0]))) > return vmap_small_pages_range_noflush(addr, end, prot, pages); Hm.. If first few pages are order-0 and the rest are compound then we do nothing. 
> > - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { > + for (i = 0; i < nr; ) { > + unsigned int shift = page_shift; > int err; > > - err = vmap_range_noflush(addr, addr + (1UL << page_shift), > + /* > + * For vmap() cases, page_shift is always PAGE_SHIFT, even > + * if the pages are physically contiguous, they may still > + * be mapped in a batch. > + */ > + if (page_shift == PAGE_SHIFT) > + shift += get_vmap_batch_order(pages, nr - i, i); > + err = vmap_range_noflush(addr, addr + (1UL << shift), > page_to_phys(pages[i]), prot, > - page_shift); > + shift); > if (err) > return err; > > - addr += 1UL << page_shift; > + addr += 1UL << shift; > + i += 1U << shift; > } > > return 0; > > Does this look clearer? > The concern is we mix it with a huge page mapping path. If we want to batch v-mapping for page_shift == PAGE_SHIFT case, where "pages" array may contain compound pages(folio)(corner case to me), i think we should split it. -- Uladzislau Rezki ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible 2025-12-22 13:08 ` Uladzislau Rezki @ 2025-12-23 21:23 ` Barry Song 0 siblings, 0 replies; 8+ messages in thread From: Barry Song @ 2025-12-23 21:23 UTC (permalink / raw) To: urezki Cc: 21cnbao, akpm, david, dri-devel, jstultz, linaro-mm-sig, linux-kernel, linux-media, linux-mm, mripard, sumit.semwal, v-songbaohua, zhengtangquan > > /* > > * vmap_pages_range_noflush is similar to vmap_pages_range, but does not > > * flush caches. > > @@ -658,20 +672,35 @@ int __vmap_pages_range_noflush(unsigned long addr, unsigned long end, > > > > WARN_ON(page_shift < PAGE_SHIFT); > > > > + /* > > + * For vmap(), users may allocate pages from high orders down to > > + * order 0, while always using PAGE_SHIFT as the page_shift. > > + * We first check whether the initial page is a compound page. If so, > > + * there may be an opportunity to batch multiple pages together. > > + */ > > if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) || > > - page_shift == PAGE_SHIFT) > > + (page_shift == PAGE_SHIFT && !PageCompound(pages[0]))) > > return vmap_small_pages_range_noflush(addr, end, prot, pages); > Hm.. If first few pages are order-0 and the rest are compound > then we do nothing. Now the dma-buf is allocated in descending order. If page0 is not huge, page1 will not be either. However, I agree that we may extend support for this case. > > > > > - for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { > > + for (i = 0; i < nr; ) { > > + unsigned int shift = page_shift; > > int err; > > > > - err = vmap_range_noflush(addr, addr + (1UL << page_shift), > > + /* > > + * For vmap() cases, page_shift is always PAGE_SHIFT, even > > + * if the pages are physically contiguous, they may still > > + * be mapped in a batch. > > + */ > > + if (page_shift == PAGE_SHIFT) > > + shift += get_vmap_batch_order(pages, nr - i, i); > > + err = vmap_range_noflush(addr, addr + (1UL << shift), > > page_to_phys(pages[i]), prot, > > - page_shift); > > + shift); > > if (err) > > return err; > > > > - addr += 1UL << page_shift; > > + addr += 1UL << shift; > > + i += 1U << shift; > > } > > > > return 0; > > > > Does this look clearer? > > > The concern is we mix it with a huge page mapping path. If we want to batch > v-mapping for page_shift == PAGE_SHIFT case, where "pages" array may contain > compound pages(folio)(corner case to me), i think we should split it. I agree this might not be common when the vmap buffer is only used by the CPU. However, for GPUs, NPUs, and similar devices, benefiting from larger mappings may be quite common. Does the code below, which moves batched mapping to vmap(), address both of your concerns? 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c index ecbac900c35f..782f2eac8a63 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -3501,6 +3501,20 @@ void vunmap(const void *addr) } EXPORT_SYMBOL(vunmap); +static inline int get_vmap_batch_order(struct page **pages, + unsigned int max_steps, unsigned int idx) +{ + unsigned int nr_pages; + + nr_pages = compound_nr(pages[idx]); + if (nr_pages == 1 || max_steps < nr_pages) + return 0; + + if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages) + return compound_order(pages[idx]); + return 0; +} + /** * vmap - map an array of pages into virtually contiguous space * @pages: array of page pointers @@ -3544,10 +3558,21 @@ void *vmap(struct page **pages, unsigned int count, return NULL; addr = (unsigned long)area->addr; - if (vmap_pages_range(addr, addr + size, pgprot_nx(prot), - pages, PAGE_SHIFT) < 0) { - vunmap(area->addr); - return NULL; + for (unsigned int i = 0; i < count; ) { + unsigned int shift = PAGE_SHIFT; + int err; + + shift += get_vmap_batch_order(pages, count - i, i); + err = vmap_range_noflush(addr, addr + (1UL << shift), + page_to_phys(pages[i]), pgprot_nx(prot), + shift); + if (err) { + vunmap(area->addr); + return NULL; + } + + addr += 1UL << shift; + i += 1U << shift; } if (flags & VM_MAP_PUT_PAGES) { -- 2.48.1 Thanks Barry ^ permalink raw reply [flat|nested] 8+ messages in thread
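For reference, a minimal form of the batched loop from the sketches in this subthread (illustrative only, not a drop-in replacement). The address advances in bytes, while the index into pages[] advances in base-page entries, i.e. by 1U << (shift - PAGE_SHIFT), matching the arithmetic of the original patch:

	/*
	 * Illustrative excerpt, assuming the surrounding vmap() context from
	 * the diff above (area, pages, count, prot); error unwinding and the
	 * flush_cache_vmap() normally done by vmap_pages_range() are omitted.
	 */
	unsigned long addr = (unsigned long)area->addr;
	unsigned int i;
	int err;

	for (i = 0; i < count; ) {
		unsigned int shift = PAGE_SHIFT +
				     get_vmap_batch_order(pages, count - i, i);

		err = vmap_range_noflush(addr, addr + (1UL << shift),
					 page_to_phys(pages[i]),
					 pgprot_nx(prot), shift);
		if (err)
			break;

		addr += 1UL << shift;			/* advances in bytes */
		i += 1U << (shift - PAGE_SHIFT);	/* advances in pages[] entries */
	}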
* Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible 2025-12-15 5:30 [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible Barry Song 2025-12-18 13:01 ` David Hildenbrand (Red Hat) @ 2025-12-18 14:00 ` Uladzislau Rezki 2025-12-18 20:05 ` Barry Song 1 sibling, 1 reply; 8+ messages in thread From: Uladzislau Rezki @ 2025-12-18 14:00 UTC (permalink / raw) To: Barry Song Cc: akpm, linux-mm, dri-devel, jstultz, linaro-mm-sig, linux-kernel, linux-media, Barry Song, David Hildenbrand, Uladzislau Rezki, Sumit Semwal, Maxime Ripard, Tangquan Zheng On Mon, Dec 15, 2025 at 01:30:50PM +0800, Barry Song wrote: > From: Barry Song <v-songbaohua@oppo.com> > > In many cases, the pages passed to vmap() may include high-order > pages allocated with __GFP_COMP flags. For example, the systemheap > often allocates pages in descending order: order 8, then 4, then 0. > Currently, vmap() iterates over every page individually—even pages > inside a high-order block are handled one by one. > > This patch detects high-order pages and maps them as a single > contiguous block whenever possible. > > An alternative would be to implement a new API, vmap_sg(), but that > change seems to be large in scope. > > When vmapping a 128MB dma-buf using the systemheap, this patch > makes system_heap_do_vmap() roughly 17× faster. > > W/ patch: > [ 10.404769] system_heap_do_vmap took 2494000 ns > [ 12.525921] system_heap_do_vmap took 2467008 ns > [ 14.517348] system_heap_do_vmap took 2471008 ns > [ 16.593406] system_heap_do_vmap took 2444000 ns > [ 19.501341] system_heap_do_vmap took 2489008 ns > > W/o patch: > [ 7.413756] system_heap_do_vmap took 42626000 ns > [ 9.425610] system_heap_do_vmap took 42500992 ns > [ 11.810898] system_heap_do_vmap took 42215008 ns > [ 14.336790] system_heap_do_vmap took 42134992 ns > [ 16.373890] system_heap_do_vmap took 42750000 ns > > Cc: David Hildenbrand <david@kernel.org> > Cc: Uladzislau Rezki <urezki@gmail.com> > Cc: Sumit Semwal <sumit.semwal@linaro.org> > Cc: John Stultz <jstultz@google.com> > Cc: Maxime Ripard <mripard@kernel.org> > Tested-by: Tangquan Zheng <zhengtangquan@oppo.com> > Signed-off-by: Barry Song <v-songbaohua@oppo.com> > --- > * diff with rfc: > Many code refinements based on David's suggestions, thanks! > Refine comment and changelog according to Uladzislau, thanks! > rfc link: > https://lore.kernel.org/linux-mm/20251122090343.81243-1-21cnbao@gmail.com/ > > mm/vmalloc.c | 45 +++++++++++++++++++++++++++++++++++++++------ > 1 file changed, 39 insertions(+), 6 deletions(-) > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > index 41dd01e8430c..8d577767a9e5 100644 > --- a/mm/vmalloc.c > +++ b/mm/vmalloc.c > @@ -642,6 +642,29 @@ static int vmap_small_pages_range_noflush(unsigned long addr, unsigned long end, > return err; > } > > +static inline int get_vmap_batch_order(struct page **pages, > + unsigned int stride, unsigned int max_steps, unsigned int idx) > +{ > + int nr_pages = 1; > + > + /* > + * Currently, batching is only supported in vmap_pages_range > + * when page_shift == PAGE_SHIFT. > + */ > + if (stride != 1) > + return 0; > + > + nr_pages = compound_nr(pages[idx]); > + if (nr_pages == 1) > + return 0; > + if (max_steps < nr_pages) > + return 0; > + > + if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages) > + return compound_order(pages[idx]); > + return 0; > +} > + Can we instead look at this as: it can be that we have continues set of pages let's find out. I mean if we do not stick just to compound pages. 
-- Uladzislau Rezki ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH] mm/vmalloc: map contiguous pages in batches for vmap() whenever possible 2025-12-18 14:00 ` Uladzislau Rezki @ 2025-12-18 20:05 ` Barry Song 0 siblings, 0 replies; 8+ messages in thread From: Barry Song @ 2025-12-18 20:05 UTC (permalink / raw) To: Uladzislau Rezki Cc: akpm, linux-mm, dri-devel, jstultz, linaro-mm-sig, linux-kernel, linux-media, Barry Song, David Hildenbrand, Sumit Semwal, Maxime Ripard, Tangquan Zheng [...] > > > > +static inline int get_vmap_batch_order(struct page **pages, > > + unsigned int stride, unsigned int max_steps, unsigned int idx) > > +{ > > + int nr_pages = 1; > > + > > + /* > > + * Currently, batching is only supported in vmap_pages_range > > + * when page_shift == PAGE_SHIFT. > > + */ > > + if (stride != 1) > > + return 0; > > + > > + nr_pages = compound_nr(pages[idx]); > > + if (nr_pages == 1) > > + return 0; > > + if (max_steps < nr_pages) > > + return 0; > > + > > + if (num_pages_contiguous(&pages[idx], nr_pages) == nr_pages) > > + return compound_order(pages[idx]); > > + return 0; > > +} > > + > Can we instead look at this as: it can be that we have continues > set of pages let's find out. I mean if we do not stick just to > compound pages. We use PageCompound(pages[0]) and compound_nr() as quick filters to skip checking the contiguous count, and this is now the intended use case. Always checking contiguity might cause a slight regression, I guess. BTW, do we have a strong use case where GFP_COMP or folio is not used, yet the pages are physically contiguous? Thanks Barry ^ permalink raw reply [flat|nested] 8+ messages in thread
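To illustrate the trade-off in this exchange: a generic run detector (name assumed, not proposed code) has to inspect every entry before it can decide, whereas the compound-page check in get_vmap_batch_order() bails out after a single compound_nr() test when the first page is a plain order-0 page.

	/* Illustrative only: length of the physically contiguous run at pages[0..max). */
	static unsigned int count_contig_run(struct page **pages, unsigned int max)
	{
		unsigned int i;

		for (i = 1; i < max; i++)
			if (page_to_pfn(pages[i]) != page_to_pfn(pages[i - 1]) + 1)
				break;
		return i;
	}

In the intended system heap case, contiguous runs only come from compound allocations anyway, which is the rationale for the cheap filter above.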