linux-mm.kvack.org archive mirror
* [PATCH 0/2] optimization of dma-buf system_heap allocation
@ 2025-10-14  8:32 zhaoyang.huang
  2025-10-14  8:32 ` [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful zhaoyang.huang
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: zhaoyang.huang @ 2025-10-14  8:32 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, Christian König, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, Zhaoyang Huang,
	steve.kang

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

This series brings back alloc_pages_bulk_list and uses it for the
order-0 page allocation in dma-buf's system_heap.

Zhaoyang Huang (2):
  mm: call back alloc_pages_bulk_list since it is useful
  driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation

 drivers/dma-buf/heaps/system_heap.c | 33 +++++++++++++++---------
 include/linux/gfp.h                 |  9 +++++--
 mm/mempolicy.c                      | 14 +++++------
 mm/page_alloc.c                     | 39 ++++++++++++++++++++---------
 4 files changed, 62 insertions(+), 33 deletions(-)

-- 
2.25.1




* [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful
  2025-10-14  8:32 [PATCH 0/2] optimization of dma-buf system_heap allocation zhaoyang.huang
@ 2025-10-14  8:32 ` zhaoyang.huang
  2025-10-14  9:41   ` Petr Tesarik
  2025-10-15 12:16   ` David Hildenbrand
  2025-10-14  8:32 ` [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation zhaoyang.huang
  2025-10-14 14:06 ` [PATCH 0/2] optimization of dma-buf system_heap allocation Matthew Wilcox
  2 siblings, 2 replies; 19+ messages in thread
From: zhaoyang.huang @ 2025-10-14  8:32 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, Christian König, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, Zhaoyang Huang,
	steve.kang

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

commit c8b979530f27 ("mm: alloc_pages_bulk_noprof: drop page_list
argument") dropped alloc_pages_bulk_list. This commit brings it back,
since it has proved helpful to drivers that allocate a bulk of pages
(see patch 2 of this series).
I am aware of Matthew's comment about the cost of iterating a list.
However, I also observed in our tests that allocating the extra
page_array can be more expensive than the CPU iteration when it
triggers direct reclaim while RAM is low[1]. IMHO, we could keep the
API here and let users choose between the array and the list
according to their scenario.

[1]
android.hardwar-728     [002] .....   334.573875: system_heap_do_allocate: Execution time: order 0 1 us
android.hardwar-728     [002] .....   334.573879: system_heap_do_allocate: Execution time: order 0 2 us
android.hardwar-728     [002] .....   334.574239: system_heap_do_allocate: Execution time: order 0 354 us
android.hardwar-728     [002] .....   334.574247: system_heap_do_allocate: Execution time: order 0 4 us
android.hardwar-728     [002] .....   334.574250: system_heap_do_allocate: Execution time: order 0 2 us
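
For illustration only (not part of this patch), a minimal sketch of the
two calling conventions that would then coexist; the demo_* helpers are
hypothetical:

/*
 * Sketch only. The list variant needs no auxiliary storage for the
 * result, while the array variant requires the caller to provide a
 * NULL-initialized array up front, and allocating that array may
 * itself enter reclaim when RAM is low.
 */
static unsigned long demo_bulk_to_list(unsigned int nr, struct list_head *pages)
{
	/* Pages come back chained through page->lru. */
	return alloc_pages_bulk_list(GFP_KERNEL, nr, pages);
}

static unsigned long demo_bulk_to_array(unsigned int nr, struct page **array)
{
	/* array must hold nr entries, zeroed e.g. with kvcalloc(). */
	return alloc_pages_bulk(GFP_KERNEL, nr, array);
}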

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 include/linux/gfp.h |  9 +++++++--
 mm/mempolicy.c      | 14 +++++++-------
 mm/page_alloc.c     | 39 +++++++++++++++++++++++++++------------
 3 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 5ebf26fcdcfa..f1540c9fcd87 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -231,6 +231,7 @@ struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_
 
 unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 				nodemask_t *nodemask, int nr_pages,
+				struct list_head *page_list,
 				struct page **page_array);
 #define __alloc_pages_bulk(...)			alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
 
@@ -242,7 +243,11 @@ unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
 
 /* Bulk allocate order-0 pages */
 #define alloc_pages_bulk(_gfp, _nr_pages, _page_array)		\
-	__alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _page_array)
+	__alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)
+
+#define alloc_pages_bulk_list(_gfp, _nr_pages, _list)			\
+	__alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)
+
 
 static inline unsigned long
 alloc_pages_bulk_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
@@ -251,7 +256,7 @@ alloc_pages_bulk_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
 
-	return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, page_array);
+	return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, NULL, page_array);
 }
 
 #define alloc_pages_bulk_node(...)				\
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eb83cff7db8c..26274302ee01 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2537,13 +2537,13 @@ static unsigned long alloc_pages_bulk_interleave(gfp_t gfp,
 		if (delta) {
 			nr_allocated = alloc_pages_bulk_noprof(gfp,
 					interleave_nodes(pol), NULL,
-					nr_pages_per_node + 1,
+					nr_pages_per_node + 1, NULL,
 					page_array);
 			delta--;
 		} else {
 			nr_allocated = alloc_pages_bulk_noprof(gfp,
 					interleave_nodes(pol), NULL,
-					nr_pages_per_node, page_array);
+					nr_pages_per_node, NULL, page_array);
 		}
 
 		page_array += nr_allocated;
@@ -2593,7 +2593,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 	if (weight && node_isset(node, nodes)) {
 		node_pages = min(rem_pages, weight);
 		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
-						  page_array);
+						  NULL, page_array);
 		page_array += nr_allocated;
 		total_allocated += nr_allocated;
 		/* if that's all the pages, no need to interleave */
@@ -2658,7 +2658,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 		if (!node_pages)
 			break;
 		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
-						  page_array);
+						  NULL, page_array);
 		page_array += nr_allocated;
 		total_allocated += nr_allocated;
 		if (total_allocated == nr_pages)
@@ -2682,11 +2682,11 @@ static unsigned long alloc_pages_bulk_preferred_many(gfp_t gfp, int nid,
 	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
 
 	nr_allocated  = alloc_pages_bulk_noprof(preferred_gfp, nid, &pol->nodes,
-					   nr_pages, page_array);
+					   nr_pages, NULL, page_array);
 
 	if (nr_allocated < nr_pages)
 		nr_allocated += alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
-				nr_pages - nr_allocated,
+				nr_pages - nr_allocated, NULL,
 				page_array + nr_allocated);
 	return nr_allocated;
 }
@@ -2722,7 +2722,7 @@ unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
 	nid = numa_node_id();
 	nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
 	return alloc_pages_bulk_noprof(gfp, nid, nodemask,
-				       nr_pages, page_array);
+				       nr_pages, NULL, page_array);
 }
 
 int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d1d037f97c5f..a95bdd8cbf5b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4940,23 +4940,28 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
 }
 
 /*
- * __alloc_pages_bulk - Allocate a number of order-0 pages to an array
+ * __alloc_pages_bulk - Allocate a number of order-0 pages to a list or array
  * @gfp: GFP flags for the allocation
  * @preferred_nid: The preferred NUMA node ID to allocate from
  * @nodemask: Set of nodes to allocate from, may be NULL
- * @nr_pages: The number of pages desired in the array
- * @page_array: Array to store the pages
+ * @nr_pages: The number of pages desired on the list or array
+ * @page_list: Optional list to store the allocated pages
+ * @page_array: Optional array to store the pages
  *
  * This is a batched version of the page allocator that attempts to
- * allocate nr_pages quickly. Pages are added to the page_array.
+ * allocate nr_pages quickly. Pages are added to page_list if page_list
+ * is not NULL, otherwise it is assumed that the page_array is valid.
  *
- * Note that only NULL elements are populated with pages and nr_pages
+ * For lists, nr_pages is the number of pages that should be allocated.
+ *
+ * For arrays, only NULL elements are populated with pages and nr_pages
  * is the maximum number of pages that will be stored in the array.
  *
- * Returns the number of pages in the array.
+ * Returns the number of pages on the list or array.
  */
 unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 			nodemask_t *nodemask, int nr_pages,
+			struct list_head *page_list,
 			struct page **page_array)
 {
 	struct page *page;
@@ -4974,7 +4979,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	 * Skip populated array elements to determine if any pages need
 	 * to be allocated before disabling IRQs.
 	 */
-	while (nr_populated < nr_pages && page_array[nr_populated])
+	while (page_array && nr_populated < nr_pages && page_array[nr_populated])
 		nr_populated++;
 
 	/* No pages requested? */
@@ -4982,7 +4987,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 		goto out;
 
 	/* Already populated array? */
-	if (unlikely(nr_pages - nr_populated == 0))
+	if (unlikely(page_array && nr_pages - nr_populated == 0))
 		goto out;
 
 	/* Bulk allocator does not support memcg accounting. */
@@ -5064,7 +5069,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 	while (nr_populated < nr_pages) {
 
 		/* Skip existing pages */
-		if (page_array[nr_populated]) {
+		if (page_array && page_array[nr_populated]) {
 			nr_populated++;
 			continue;
 		}
@@ -5083,7 +5088,11 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 
 		prep_new_page(page, 0, gfp, 0);
 		set_page_refcounted(page);
-		page_array[nr_populated++] = page;
+		if (page_list)
+			list_add(&page->lru, page_list);
+		else
+			page_array[nr_populated] = page;
+		nr_populated++;
 	}
 
 	pcp_spin_unlock(pcp);
@@ -5100,8 +5109,14 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
 
 failed:
 	page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
-	if (page)
-		page_array[nr_populated++] = page;
+	if (page) {
+		if (page_list)
+			list_add(&page->lru, page_list);
+		else
+			page_array[nr_populated] = page;
+		nr_populated++;
+	}
+
 	goto out;
 }
 EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
-- 
2.25.1




* [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-14  8:32 [PATCH 0/2] optimization of dma-buf system_heap allocation zhaoyang.huang
  2025-10-14  8:32 ` [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful zhaoyang.huang
@ 2025-10-14  8:32 ` zhaoyang.huang
  2025-10-14 11:59   ` Christian König
  2025-10-14 12:45   ` Petr Tesarik
  2025-10-14 14:06 ` [PATCH 0/2] optimization of dma-buf system_heap allocation Matthew Wilcox
  2 siblings, 2 replies; 19+ messages in thread
From: zhaoyang.huang @ 2025-10-14  8:32 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Matthew Wilcox, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, Christian König, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, Zhaoyang Huang,
	steve.kang

From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

A single dma-buf allocation can be dozens of MB or more, which turns
into a loop allocating several thousand order-0 pages. Furthermore,
concurrent allocations can push a dma-buf allocation into direct
reclaim during that loop. This commit addresses both issues by using
alloc_pages_bulk_list for dma-buf's order-0 allocations. The patch
proved conditionally helpful for an 18MB allocation, reducing the
time from 24604us to 6555us, and does no harm when bulk allocation
cannot be done (it falls back to single-page allocation).

Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
---
 drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
index bbe7881f1360..71b028c63bd8 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
 	.release = system_heap_dma_buf_release,
 };
 
-static struct page *alloc_largest_available(unsigned long size,
-					    unsigned int max_order)
+static void alloc_largest_available(unsigned long size,
+		    unsigned int max_order, unsigned int *num_pages, struct list_head *list)
 {
 	struct page *page;
 	int i;
@@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
 		if (max_order < orders[i])
 			continue;
 
-		page = alloc_pages(order_flags[i], orders[i]);
-		if (!page)
+		if (orders[i]) {
+			page = alloc_pages(order_flags[i], orders[i]);
+			if (page) {
+				list_add(&page->lru, list);
+				*num_pages = 1;
+			}
+		} else
+			*num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
+
+		if (list_empty(list))
 			continue;
-		return page;
+		return;
 	}
-	return NULL;
 }
 
 static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
@@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
 	struct list_head pages;
 	struct page *page, *tmp_page;
 	int i, ret = -ENOMEM;
+	unsigned int num_pages;
+	LIST_HEAD(head);
 
 	buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
 	if (!buffer)
@@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
 	INIT_LIST_HEAD(&pages);
 	i = 0;
 	while (size_remaining > 0) {
+		num_pages = 0;
+		INIT_LIST_HEAD(&head);
 		/*
 		 * Avoid trying to allocate memory if the process
 		 * has been killed by SIGKILL
@@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
 			goto free_buffer;
 		}
 
-		page = alloc_largest_available(size_remaining, max_order);
-		if (!page)
+		alloc_largest_available(size_remaining, max_order, &num_pages, &head);
+		if (!num_pages)
 			goto free_buffer;
 
-		list_add_tail(&page->lru, &pages);
-		size_remaining -= page_size(page);
-		max_order = compound_order(page);
-		i++;
+		list_splice_tail(&head, &pages);
+		max_order = folio_order(lru_to_folio(&head));
+		size_remaining -= PAGE_SIZE * (num_pages << max_order);
+		i += num_pages;
+
 	}
 
 	table = &buffer->sg_table;
-- 
2.25.1




* Re: [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful
  2025-10-14  8:32 ` [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful zhaoyang.huang
@ 2025-10-14  9:41   ` Petr Tesarik
  2025-10-14 12:46     ` Zhaoyang Huang
  2025-10-15 12:16   ` David Hildenbrand
  1 sibling, 1 reply; 19+ messages in thread
From: Petr Tesarik @ 2025-10-14  9:41 UTC (permalink / raw)
  To: zhaoyang.huang
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, Christian König, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, Zhaoyang Huang,
	steve.kang

On Tue, 14 Oct 2025 16:32:29 +0800
"zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:

> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> commit c8b979530f27 ("mm: alloc_pages_bulk_noprof: drop page_list
> argument") dropped alloc_pages_bulk_list. This commit brings it back,
> since it has proved helpful to drivers that allocate a bulk of pages
> (see patch 2 of this series).
> I am aware of Matthew's comment about the cost of iterating a list.
> However, I also observed in our tests that allocating the extra
> page_array can be more expensive than the CPU iteration when it
> triggers direct reclaim while RAM is low[1]. IMHO, we could keep the
> API here and let users choose between the array and the list
> according to their scenario.

OK, so this is more or less a revert of commit c8b979530f27 ("mm:
alloc_pages_bulk_noprof: drop page_list argument")...

I cannot comment on the performance gains, but I dislike the fact that
the patch re-introduces alloc_pages_bulk_noprof() as a function with two
signatures (either page_list is used, or page_array is used).

If we can agree that allocations onto a linked list are useful, then I
suggest splitting the existing function so that the common bits end up
in helper functions called by both variants (one function using a
list, one using an array).
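
Roughly something like this, perhaps (untested sketch, all names below
are made up; the shared core stands in for the existing allocation
loop):

/*
 * The common core keeps today's bulk-allocation loop and hands each
 * freshly prepared page to a small "store" callback; thin entry points
 * adapt it to an array or a list. Array-specific details (skipping
 * already-populated slots) would stay on the array side.
 */
static unsigned long __alloc_pages_bulk_core(gfp_t gfp, int preferred_nid,
			nodemask_t *nodemask, int nr_pages,
			void (*store)(struct page *page, void *ctx), void *ctx);

static void bulk_store_array(struct page *page, void *ctx)
{
	struct page ***slot = ctx;

	*(*slot)++ = page;
}

static void bulk_store_list(struct page *page, void *ctx)
{
	list_add(&page->lru, ctx);
}

unsigned long alloc_pages_bulk_array_noprof(gfp_t gfp, int preferred_nid,
			nodemask_t *nodemask, int nr_pages,
			struct page **page_array)
{
	struct page **pos = page_array;

	return __alloc_pages_bulk_core(gfp, preferred_nid, nodemask,
				       nr_pages, bulk_store_array, &pos);
}

unsigned long alloc_pages_bulk_list_noprof(gfp_t gfp, int preferred_nid,
			nodemask_t *nodemask, int nr_pages,
			struct list_head *page_list)
{
	return __alloc_pages_bulk_core(gfp, preferred_nid, nodemask,
				       nr_pages, bulk_store_list, page_list);
}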

Petr T

> [1]
> android.hardwar-728     [002] .....   334.573875: system_heap_do_allocate: Execution time: order 0 1 us
> android.hardwar-728     [002] .....   334.573879: system_heap_do_allocate: Execution time: order 0 2 us
> android.hardwar-728     [002] .....   334.574239: system_heap_do_allocate: Execution time: order 0 354 us
> android.hardwar-728     [002] .....   334.574247: system_heap_do_allocate: Execution time: order 0 4 us
> android.hardwar-728     [002] .....   334.574250: system_heap_do_allocate: Execution time: order 0 2 us
> 
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  include/linux/gfp.h |  9 +++++++--
>  mm/mempolicy.c      | 14 +++++++-------
>  mm/page_alloc.c     | 39 +++++++++++++++++++++++++++------------
>  3 files changed, 41 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 5ebf26fcdcfa..f1540c9fcd87 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -231,6 +231,7 @@ struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_
>  
>  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  				nodemask_t *nodemask, int nr_pages,
> +				struct list_head *page_list,
>  				struct page **page_array);
>  #define __alloc_pages_bulk(...)			alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
>  
> @@ -242,7 +243,11 @@ unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
>  
>  /* Bulk allocate order-0 pages */
>  #define alloc_pages_bulk(_gfp, _nr_pages, _page_array)		\
> -	__alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _page_array)
> +	__alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)
> +
> +#define alloc_pages_bulk_list(_gfp, _nr_pages, _list)			\
> +	__alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)
> +
>  
>  static inline unsigned long
>  alloc_pages_bulk_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
> @@ -251,7 +256,7 @@ alloc_pages_bulk_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
>  	if (nid == NUMA_NO_NODE)
>  		nid = numa_mem_id();
>  
> -	return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, page_array);
> +	return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, NULL, page_array);
>  }
>  
>  #define alloc_pages_bulk_node(...)				\
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index eb83cff7db8c..26274302ee01 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2537,13 +2537,13 @@ static unsigned long alloc_pages_bulk_interleave(gfp_t gfp,
>  		if (delta) {
>  			nr_allocated = alloc_pages_bulk_noprof(gfp,
>  					interleave_nodes(pol), NULL,
> -					nr_pages_per_node + 1,
> +					nr_pages_per_node + 1, NULL,
>  					page_array);
>  			delta--;
>  		} else {
>  			nr_allocated = alloc_pages_bulk_noprof(gfp,
>  					interleave_nodes(pol), NULL,
> -					nr_pages_per_node, page_array);
> +					nr_pages_per_node, NULL, page_array);
>  		}
>  
>  		page_array += nr_allocated;
> @@ -2593,7 +2593,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  	if (weight && node_isset(node, nodes)) {
>  		node_pages = min(rem_pages, weight);
>  		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> -						  page_array);
> +						  NULL, page_array);
>  		page_array += nr_allocated;
>  		total_allocated += nr_allocated;
>  		/* if that's all the pages, no need to interleave */
> @@ -2658,7 +2658,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		if (!node_pages)
>  			break;
>  		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> -						  page_array);
> +						  NULL, page_array);
>  		page_array += nr_allocated;
>  		total_allocated += nr_allocated;
>  		if (total_allocated == nr_pages)
> @@ -2682,11 +2682,11 @@ static unsigned long alloc_pages_bulk_preferred_many(gfp_t gfp, int nid,
>  	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
>  
>  	nr_allocated  = alloc_pages_bulk_noprof(preferred_gfp, nid, &pol->nodes,
> -					   nr_pages, page_array);
> +					   nr_pages, NULL, page_array);
>  
>  	if (nr_allocated < nr_pages)
>  		nr_allocated += alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
> -				nr_pages - nr_allocated,
> +				nr_pages - nr_allocated, NULL,
>  				page_array + nr_allocated);
>  	return nr_allocated;
>  }
> @@ -2722,7 +2722,7 @@ unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
>  	nid = numa_node_id();
>  	nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
>  	return alloc_pages_bulk_noprof(gfp, nid, nodemask,
> -				       nr_pages, page_array);
> +				       nr_pages, NULL, page_array);
>  }
>  
>  int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d1d037f97c5f..a95bdd8cbf5b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4940,23 +4940,28 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
>  }
>  
>  /*
> - * __alloc_pages_bulk - Allocate a number of order-0 pages to an array
> + * __alloc_pages_bulk - Allocate a number of order-0 pages to a list or array
>   * @gfp: GFP flags for the allocation
>   * @preferred_nid: The preferred NUMA node ID to allocate from
>   * @nodemask: Set of nodes to allocate from, may be NULL
> - * @nr_pages: The number of pages desired in the array
> - * @page_array: Array to store the pages
> + * @nr_pages: The number of pages desired on the list or array
> + * @page_list: Optional list to store the allocated pages
> + * @page_array: Optional array to store the pages
>   *
>   * This is a batched version of the page allocator that attempts to
> - * allocate nr_pages quickly. Pages are added to the page_array.
> + * allocate nr_pages quickly. Pages are added to page_list if page_list
> + * is not NULL, otherwise it is assumed that the page_array is valid.
>   *
> - * Note that only NULL elements are populated with pages and nr_pages
> + * For lists, nr_pages is the number of pages that should be allocated.
> + *
> + * For arrays, only NULL elements are populated with pages and nr_pages
>   * is the maximum number of pages that will be stored in the array.
>   *
> - * Returns the number of pages in the array.
> + * Returns the number of pages on the list or array.
>   */
>  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  			nodemask_t *nodemask, int nr_pages,
> +			struct list_head *page_list,
>  			struct page **page_array)
>  {
>  	struct page *page;
> @@ -4974,7 +4979,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	 * Skip populated array elements to determine if any pages need
>  	 * to be allocated before disabling IRQs.
>  	 */
> -	while (nr_populated < nr_pages && page_array[nr_populated])
> +	while (page_array && nr_populated < nr_pages && page_array[nr_populated])
>  		nr_populated++;
>  
>  	/* No pages requested? */
> @@ -4982,7 +4987,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  		goto out;
>  
>  	/* Already populated array? */
> -	if (unlikely(nr_pages - nr_populated == 0))
> +	if (unlikely(page_array && nr_pages - nr_populated == 0))
>  		goto out;
>  
>  	/* Bulk allocator does not support memcg accounting. */
> @@ -5064,7 +5069,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  	while (nr_populated < nr_pages) {
>  
>  		/* Skip existing pages */
> -		if (page_array[nr_populated]) {
> +		if (page_array && page_array[nr_populated]) {
>  			nr_populated++;
>  			continue;
>  		}
> @@ -5083,7 +5088,11 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  
>  		prep_new_page(page, 0, gfp, 0);
>  		set_page_refcounted(page);
> -		page_array[nr_populated++] = page;
> +		if (page_list)
> +			list_add(&page->lru, page_list);
> +		else
> +			page_array[nr_populated] = page;
> +		nr_populated++;
>  	}
>  
>  	pcp_spin_unlock(pcp);
> @@ -5100,8 +5109,14 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
>  
>  failed:
>  	page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
> -	if (page)
> -		page_array[nr_populated++] = page;
> +	if (page) {
> +		if (page_list)
> +			list_add(&page->lru, page_list);
> +		else
> +			page_array[nr_populated] = page;
> +		nr_populated++;
> +	}
> +
>  	goto out;
>  }
>  EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);




* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-14  8:32 ` [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation zhaoyang.huang
@ 2025-10-14 11:59   ` Christian König
  2025-10-14 12:44     ` Zhaoyang Huang
  2025-10-14 12:45   ` Petr Tesarik
  1 sibling, 1 reply; 19+ messages in thread
From: Christian König @ 2025-10-14 11:59 UTC (permalink / raw)
  To: zhaoyang.huang, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Mel Gorman, Vlastimil Babka, Sumit Semwal, Benjamin Gaignard,
	Brian Starkey, John Stultz, T . J . Mercier, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, Zhaoyang Huang,
	steve.kang

On 14.10.25 10:32, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> A single dma-buf allocation can be dozens of MB or more, which turns
> into a loop allocating several thousand order-0 pages. Furthermore,
> concurrent allocations can push a dma-buf allocation into direct
> reclaim during that loop. This commit addresses both issues by using
> alloc_pages_bulk_list for dma-buf's order-0 allocations. The patch
> proved conditionally helpful for an 18MB allocation, reducing the
> time from 24604us to 6555us, and does no harm when bulk allocation
> cannot be done (it falls back to single-page allocation).

Well that sounds like an absolutely horrible idea.

See, the handling of allocating only specific orders is *exactly* there to avoid the behavior of bulk allocation.

What you seem to do with this patch is to add, on top of the behavior that avoids allocating large chunks from the buddy, the behavior of allocating large chunks from the buddy, because that is faster.

So this change doesn't look like it will fly very high. Please explain what you're actually trying to do, just optimize allocation time?

Regards,
Christian.

> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
>  1 file changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index bbe7881f1360..71b028c63bd8 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
>  	.release = system_heap_dma_buf_release,
>  };
>  
> -static struct page *alloc_largest_available(unsigned long size,
> -					    unsigned int max_order)
> +static void alloc_largest_available(unsigned long size,
> +		    unsigned int max_order, unsigned int *num_pages, struct list_head *list)
>  {
>  	struct page *page;
>  	int i;
> @@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
>  		if (max_order < orders[i])
>  			continue;
>  
> -		page = alloc_pages(order_flags[i], orders[i]);
> -		if (!page)
> +		if (orders[i]) {
> +			page = alloc_pages(order_flags[i], orders[i]);
> +			if (page) {
> +				list_add(&page->lru, list);
> +				*num_pages = 1;
> +			}
> +		} else
> +			*num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
> +
> +		if (list_empty(list))
>  			continue;
> -		return page;
> +		return;
>  	}
> -	return NULL;
>  }
>  
>  static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> @@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>  	struct list_head pages;
>  	struct page *page, *tmp_page;
>  	int i, ret = -ENOMEM;
> +	unsigned int num_pages;
> +	LIST_HEAD(head);
>  
>  	buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
>  	if (!buffer)
> @@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>  	INIT_LIST_HEAD(&pages);
>  	i = 0;
>  	while (size_remaining > 0) {
> +		num_pages = 0;
> +		INIT_LIST_HEAD(&head);
>  		/*
>  		 * Avoid trying to allocate memory if the process
>  		 * has been killed by SIGKILL
> @@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>  			goto free_buffer;
>  		}
>  
> -		page = alloc_largest_available(size_remaining, max_order);
> -		if (!page)
> +		alloc_largest_available(size_remaining, max_order, &num_pages, &head);
> +		if (!num_pages)
>  			goto free_buffer;
>  
> -		list_add_tail(&page->lru, &pages);
> -		size_remaining -= page_size(page);
> -		max_order = compound_order(page);
> -		i++;
> +		list_splice_tail(&head, &pages);
> +		max_order = folio_order(lru_to_folio(&head));
> +		size_remaining -= PAGE_SIZE * (num_pages << max_order);
> +		i += num_pages;
> +
>  	}
>  
>  	table = &buffer->sg_table;




* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-14 11:59   ` Christian König
@ 2025-10-14 12:44     ` Zhaoyang Huang
  2025-10-14 13:04       ` Christian König
  0 siblings, 1 reply; 19+ messages in thread
From: Zhaoyang Huang @ 2025-10-14 12:44 UTC (permalink / raw)
  To: Christian König
  Cc: zhaoyang.huang, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Mel Gorman, Vlastimil Babka, Sumit Semwal, Benjamin Gaignard,
	Brian Starkey, John Stultz, T . J . Mercier, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, steve.kang

On Tue, Oct 14, 2025 at 7:59 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 14.10.25 10:32, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > A single dma-buf allocation can be dozens of MB or more, which turns
> > into a loop allocating several thousand order-0 pages. Furthermore,
> > concurrent allocations can push a dma-buf allocation into direct
> > reclaim during that loop. This commit addresses both issues by using
> > alloc_pages_bulk_list for dma-buf's order-0 allocations. The patch
> > proved conditionally helpful for an 18MB allocation, reducing the
> > time from 24604us to 6555us, and does no harm when bulk allocation
> > cannot be done (it falls back to single-page allocation).
>
> Well that sounds like an absolutely horrible idea.
>
> See, the handling of allocating only specific orders is *exactly* there to avoid the behavior of bulk allocation.
>
> What you seem to do with this patch is to add, on top of the behavior that avoids allocating large chunks from the buddy, the behavior of allocating large chunks from the buddy, because that is faster.
emm, this patch doesn't change the order-8 and order-4 allocation
behaviour; it just replaces the loop of order-0 allocations in the
fallback path with a single bulk allocation. What is your concern
about this?
>
> So this change doesn't look like it will fly very high. Please explain what you're actually trying to do, just optimize allocation time?
>
> Regards,
> Christian.
>
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> >  drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
> >  1 file changed, 24 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> > index bbe7881f1360..71b028c63bd8 100644
> > --- a/drivers/dma-buf/heaps/system_heap.c
> > +++ b/drivers/dma-buf/heaps/system_heap.c
> > @@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
> >       .release = system_heap_dma_buf_release,
> >  };
> >
> > -static struct page *alloc_largest_available(unsigned long size,
> > -                                         unsigned int max_order)
> > +static void alloc_largest_available(unsigned long size,
> > +                 unsigned int max_order, unsigned int *num_pages, struct list_head *list)
> >  {
> >       struct page *page;
> >       int i;
> > @@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
> >               if (max_order < orders[i])
> >                       continue;
> >
> > -             page = alloc_pages(order_flags[i], orders[i]);
> > -             if (!page)
> > +             if (orders[i]) {
> > +                     page = alloc_pages(order_flags[i], orders[i]);
> > +                     if (page) {
> > +                             list_add(&page->lru, list);
> > +                             *num_pages = 1;
> > +                     }
> > +             } else
> > +                     *num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
> > +
> > +             if (list_empty(list))
> >                       continue;
> > -             return page;
> > +             return;
> >       }
> > -     return NULL;
> >  }
> >
> >  static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> > @@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> >       struct list_head pages;
> >       struct page *page, *tmp_page;
> >       int i, ret = -ENOMEM;
> > +     unsigned int num_pages;
> > +     LIST_HEAD(head);
> >
> >       buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
> >       if (!buffer)
> > @@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> >       INIT_LIST_HEAD(&pages);
> >       i = 0;
> >       while (size_remaining > 0) {
> > +             num_pages = 0;
> > +             INIT_LIST_HEAD(&head);
> >               /*
> >                * Avoid trying to allocate memory if the process
> >                * has been killed by SIGKILL
> > @@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> >                       goto free_buffer;
> >               }
> >
> > -             page = alloc_largest_available(size_remaining, max_order);
> > -             if (!page)
> > +             alloc_largest_available(size_remaining, max_order, &num_pages, &head);
> > +             if (!num_pages)
> >                       goto free_buffer;
> >
> > -             list_add_tail(&page->lru, &pages);
> > -             size_remaining -= page_size(page);
> > -             max_order = compound_order(page);
> > -             i++;
> > +             list_splice_tail(&head, &pages);
> > +             max_order = folio_order(lru_to_folio(&head));
> > +             size_remaining -= PAGE_SIZE * (num_pages << max_order);
> > +             i += num_pages;
> > +
> >       }
> >
> >       table = &buffer->sg_table;
>



* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-14  8:32 ` [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation zhaoyang.huang
  2025-10-14 11:59   ` Christian König
@ 2025-10-14 12:45   ` Petr Tesarik
  1 sibling, 0 replies; 19+ messages in thread
From: Petr Tesarik @ 2025-10-14 12:45 UTC (permalink / raw)
  To: zhaoyang.huang
  Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, Christian König, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, Zhaoyang Huang,
	steve.kang

On Tue, 14 Oct 2025 16:32:30 +0800
"zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:

> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> A single dma-buf allocation can be dozens of MB or more, which turns
> into a loop allocating several thousand order-0 pages. Furthermore,
> concurrent allocations can push a dma-buf allocation into direct
> reclaim during that loop. This commit addresses both issues by using
> alloc_pages_bulk_list for dma-buf's order-0 allocations. The patch
> proved conditionally helpful for an 18MB allocation, reducing the
> time from 24604us to 6555us, and does no harm when bulk allocation
> cannot be done (it falls back to single-page allocation).
> 
> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> ---
>  drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
>  1 file changed, 24 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index bbe7881f1360..71b028c63bd8 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
>  	.release = system_heap_dma_buf_release,
>  };
>  
> -static struct page *alloc_largest_available(unsigned long size,
> -					    unsigned int max_order)
> +static void alloc_largest_available(unsigned long size,
> +		    unsigned int max_order, unsigned int *num_pages, struct list_head *list)

This interface feels weird. Maybe you could return the number of pages
instead of making this a void function and passing a pointer to get that
number?
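
Something along these lines, perhaps (untested sketch, based on the
hunks above plus the existing size check at the top of the loop):

static unsigned int alloc_largest_available(unsigned long size,
					    unsigned int max_order,
					    struct list_head *list)
{
	struct page *page;
	int i;

	for (i = 0; i < NUM_ORDERS; i++) {
		if (size < (PAGE_SIZE << orders[i]))
			continue;
		if (max_order < orders[i])
			continue;

		if (orders[i]) {
			page = alloc_pages(order_flags[i], orders[i]);
			if (!page)
				continue;
			list_add(&page->lru, list);
			return 1;
		}

		/* Order 0: one bulk call covers the whole remaining size. */
		return alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE,
					     list);
	}

	return 0;
}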

>  {
>  	struct page *page;
>  	int i;
> @@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
>  		if (max_order < orders[i])
>  			continue;
>  
> -		page = alloc_pages(order_flags[i], orders[i]);
> -		if (!page)
> +		if (orders[i]) {
> +			page = alloc_pages(order_flags[i], orders[i]);

nitpick: Since the lowest order is special-cased now, you can simply
use HIGH_ORDER_GFP here and remove order_flags[] entirely.

> +			if (page) {
> +				list_add(&page->lru, list);
> +				*num_pages = 1;
> +			}
> +		} else
> +			*num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
> +
> +		if (list_empty(list))
>  			continue;
> -		return page;
> +		return;
>  	}
> -	return NULL;
>  }
>  
>  static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
> @@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>  	struct list_head pages;
>  	struct page *page, *tmp_page;
>  	int i, ret = -ENOMEM;
> +	unsigned int num_pages;
> +	LIST_HEAD(head);
>  
>  	buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
>  	if (!buffer)
> @@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>  	INIT_LIST_HEAD(&pages);
>  	i = 0;
>  	while (size_remaining > 0) {
> +		num_pages = 0;
> +		INIT_LIST_HEAD(&head);
>  		/*
>  		 * Avoid trying to allocate memory if the process
>  		 * has been killed by SIGKILL
> @@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>  			goto free_buffer;
>  		}
>  
> -		page = alloc_largest_available(size_remaining, max_order);
> -		if (!page)
> +		alloc_largest_available(size_remaining, max_order, &num_pages, &head);
> +		if (!num_pages)
>  			goto free_buffer;
>  
> -		list_add_tail(&page->lru, &pages);
> -		size_remaining -= page_size(page);
> -		max_order = compound_order(page);
> -		i++;
> +		list_splice_tail(&head, &pages);
> +		max_order = folio_order(lru_to_folio(&head));
> +		size_remaining -= PAGE_SIZE * (num_pages << max_order);

This looks complicated. What about changing alloc_largest_available()
to return the total number of pages and using PAGE_SIZE * num_pages?

Ah, you still have to look at the folio order to determine the new
value of max_order, so no big win. Hm. You could pass a pointer to
max_order down to alloc_largest_available(), but at that point I think
it's a matter of taste (aka bikeshedding).

Petr T

> +		i += num_pages;
> +
>  	}
>  
>  	table = &buffer->sg_table;




* Re: [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful
  2025-10-14  9:41   ` Petr Tesarik
@ 2025-10-14 12:46     ` Zhaoyang Huang
  0 siblings, 0 replies; 19+ messages in thread
From: Zhaoyang Huang @ 2025-10-14 12:46 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: zhaoyang.huang, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Mel Gorman, Vlastimil Babka, Sumit Semwal, Benjamin Gaignard,
	Brian Starkey, John Stultz, T . J . Mercier,
	Christian König, linux-media, dri-devel, linaro-mm-sig,
	linux-mm, linux-kernel, steve.kang

On Tue, Oct 14, 2025 at 5:41 PM Petr Tesarik <ptesarik@suse.com> wrote:
>
> On Tue, 14 Oct 2025 16:32:29 +0800
> "zhaoyang.huang" <zhaoyang.huang@unisoc.com> wrote:
>
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >
> > commit c8b979530f27 ("mm: alloc_pages_bulk_noprof: drop page_list
> > argument") dropped alloc_pages_bulk_list. This commit brings it back,
> > since it has proved helpful to drivers that allocate a bulk of pages
> > (see patch 2 of this series).
> > I am aware of Matthew's comment about the cost of iterating a list.
> > However, I also observed in our tests that allocating the extra
> > page_array can be more expensive than the CPU iteration when it
> > triggers direct reclaim while RAM is low[1]. IMHO, we could keep the
> > API here and let users choose between the array and the list
> > according to their scenario.
>
> OK, so this is more or less a revert of commit c8b979530f27 ("mm:
> alloc_pages_bulk_noprof: drop page_list argument")...
>
> I cannot comment on the performance gains, but I dislike the fact that
> the patch re-introduces alloc_pages_bulk_noprof() as a function with two
> signatures (either page_list is used, or page_array is used).
>
> If we can agree that allocations onto a linked list are useful, then I
> suggest to split the existing function so that the common bits end up
> in helper functions, called by both variants (one function using a
> list, one using an array).
Yes, that is also what I wanted to do in the beginning. I will
implement it if the dma-buf side is willing to take the change.
>
> Petr T
>
> > [1]
> > android.hardwar-728     [002] .....   334.573875: system_heap_do_allocate: Execution time: order 0 1 us
> > android.hardwar-728     [002] .....   334.573879: system_heap_do_allocate: Execution time: order 0 2 us
> > android.hardwar-728     [002] .....   334.574239: system_heap_do_allocate: Execution time: order 0 354 us
> > android.hardwar-728     [002] .....   334.574247: system_heap_do_allocate: Execution time: order 0 4 us
> > android.hardwar-728     [002] .....   334.574250: system_heap_do_allocate: Execution time: order 0 2 us
> >
> > Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> > ---
> >  include/linux/gfp.h |  9 +++++++--
> >  mm/mempolicy.c      | 14 +++++++-------
> >  mm/page_alloc.c     | 39 +++++++++++++++++++++++++++------------
> >  3 files changed, 41 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 5ebf26fcdcfa..f1540c9fcd87 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -231,6 +231,7 @@ struct folio *__folio_alloc_noprof(gfp_t gfp, unsigned int order, int preferred_
> >
> >  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >                               nodemask_t *nodemask, int nr_pages,
> > +                             struct list_head *page_list,
> >                               struct page **page_array);
> >  #define __alloc_pages_bulk(...)                      alloc_hooks(alloc_pages_bulk_noprof(__VA_ARGS__))
> >
> > @@ -242,7 +243,11 @@ unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
> >
> >  /* Bulk allocate order-0 pages */
> >  #define alloc_pages_bulk(_gfp, _nr_pages, _page_array)               \
> > -     __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _page_array)
> > +     __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, NULL, _page_array)
> > +
> > +#define alloc_pages_bulk_list(_gfp, _nr_pages, _list)                        \
> > +     __alloc_pages_bulk(_gfp, numa_mem_id(), NULL, _nr_pages, _list, NULL)
> > +
> >
> >  static inline unsigned long
> >  alloc_pages_bulk_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
> > @@ -251,7 +256,7 @@ alloc_pages_bulk_node_noprof(gfp_t gfp, int nid, unsigned long nr_pages,
> >       if (nid == NUMA_NO_NODE)
> >               nid = numa_mem_id();
> >
> > -     return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, page_array);
> > +     return alloc_pages_bulk_noprof(gfp, nid, NULL, nr_pages, NULL, page_array);
> >  }
> >
> >  #define alloc_pages_bulk_node(...)                           \
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index eb83cff7db8c..26274302ee01 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2537,13 +2537,13 @@ static unsigned long alloc_pages_bulk_interleave(gfp_t gfp,
> >               if (delta) {
> >                       nr_allocated = alloc_pages_bulk_noprof(gfp,
> >                                       interleave_nodes(pol), NULL,
> > -                                     nr_pages_per_node + 1,
> > +                                     nr_pages_per_node + 1, NULL,
> >                                       page_array);
> >                       delta--;
> >               } else {
> >                       nr_allocated = alloc_pages_bulk_noprof(gfp,
> >                                       interleave_nodes(pol), NULL,
> > -                                     nr_pages_per_node, page_array);
> > +                                     nr_pages_per_node, NULL, page_array);
> >               }
> >
> >               page_array += nr_allocated;
> > @@ -2593,7 +2593,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
> >       if (weight && node_isset(node, nodes)) {
> >               node_pages = min(rem_pages, weight);
> >               nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> > -                                               page_array);
> > +                                               NULL, page_array);
> >               page_array += nr_allocated;
> >               total_allocated += nr_allocated;
> >               /* if that's all the pages, no need to interleave */
> > @@ -2658,7 +2658,7 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
> >               if (!node_pages)
> >                       break;
> >               nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
> > -                                               page_array);
> > +                                               NULL, page_array);
> >               page_array += nr_allocated;
> >               total_allocated += nr_allocated;
> >               if (total_allocated == nr_pages)
> > @@ -2682,11 +2682,11 @@ static unsigned long alloc_pages_bulk_preferred_many(gfp_t gfp, int nid,
> >       preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
> >
> >       nr_allocated  = alloc_pages_bulk_noprof(preferred_gfp, nid, &pol->nodes,
> > -                                        nr_pages, page_array);
> > +                                        nr_pages, NULL, page_array);
> >
> >       if (nr_allocated < nr_pages)
> >               nr_allocated += alloc_pages_bulk_noprof(gfp, numa_node_id(), NULL,
> > -                             nr_pages - nr_allocated,
> > +                             nr_pages - nr_allocated, NULL,
> >                               page_array + nr_allocated);
> >       return nr_allocated;
> >  }
> > @@ -2722,7 +2722,7 @@ unsigned long alloc_pages_bulk_mempolicy_noprof(gfp_t gfp,
> >       nid = numa_node_id();
> >       nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
> >       return alloc_pages_bulk_noprof(gfp, nid, nodemask,
> > -                                    nr_pages, page_array);
> > +                                    nr_pages, NULL, page_array);
> >  }
> >
> >  int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index d1d037f97c5f..a95bdd8cbf5b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4940,23 +4940,28 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> >  }
> >
> >  /*
> > - * __alloc_pages_bulk - Allocate a number of order-0 pages to an array
> > + * __alloc_pages_bulk - Allocate a number of order-0 pages to a list or array
> >   * @gfp: GFP flags for the allocation
> >   * @preferred_nid: The preferred NUMA node ID to allocate from
> >   * @nodemask: Set of nodes to allocate from, may be NULL
> > - * @nr_pages: The number of pages desired in the array
> > - * @page_array: Array to store the pages
> > + * @nr_pages: The number of pages desired on the list or array
> > + * @page_list: Optional list to store the allocated pages
> > + * @page_array: Optional array to store the pages
> >   *
> >   * This is a batched version of the page allocator that attempts to
> > - * allocate nr_pages quickly. Pages are added to the page_array.
> > + * allocate nr_pages quickly. Pages are added to page_list if page_list
> > + * is not NULL, otherwise it is assumed that the page_array is valid.
> >   *
> > - * Note that only NULL elements are populated with pages and nr_pages
> > + * For lists, nr_pages is the number of pages that should be allocated.
> > + *
> > + * For arrays, only NULL elements are populated with pages and nr_pages
> >   * is the maximum number of pages that will be stored in the array.
> >   *
> > - * Returns the number of pages in the array.
> > + * Returns the number of pages on the list or array.
> >   */
> >  unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >                       nodemask_t *nodemask, int nr_pages,
> > +                     struct list_head *page_list,
> >                       struct page **page_array)
> >  {
> >       struct page *page;
> > @@ -4974,7 +4979,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >        * Skip populated array elements to determine if any pages need
> >        * to be allocated before disabling IRQs.
> >        */
> > -     while (nr_populated < nr_pages && page_array[nr_populated])
> > +     while (page_array && nr_populated < nr_pages && page_array[nr_populated])
> >               nr_populated++;
> >
> >       /* No pages requested? */
> > @@ -4982,7 +4987,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >               goto out;
> >
> >       /* Already populated array? */
> > -     if (unlikely(nr_pages - nr_populated == 0))
> > +     if (unlikely(page_array && nr_pages - nr_populated == 0))
> >               goto out;
> >
> >       /* Bulk allocator does not support memcg accounting. */
> > @@ -5064,7 +5069,7 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >       while (nr_populated < nr_pages) {
> >
> >               /* Skip existing pages */
> > -             if (page_array[nr_populated]) {
> > +             if (page_array && page_array[nr_populated]) {
> >                       nr_populated++;
> >                       continue;
> >               }
> > @@ -5083,7 +5088,11 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >
> >               prep_new_page(page, 0, gfp, 0);
> >               set_page_refcounted(page);
> > -             page_array[nr_populated++] = page;
> > +             if (page_list)
> > +                     list_add(&page->lru, page_list);
> > +             else
> > +                     page_array[nr_populated] = page;
> > +             nr_populated++;
> >       }
> >
> >       pcp_spin_unlock(pcp);
> > @@ -5100,8 +5109,14 @@ unsigned long alloc_pages_bulk_noprof(gfp_t gfp, int preferred_nid,
> >
> >  failed:
> >       page = __alloc_pages_noprof(gfp, 0, preferred_nid, nodemask);
> > -     if (page)
> > -             page_array[nr_populated++] = page;
> > +     if (page) {
> > +             if (page_list)
> > +                     list_add(&page->lru, page_list);
> > +             else
> > +                     page_array[nr_populated] = page;
> > +             nr_populated++;
> > +     }
> > +
> >       goto out;
> >  }
> >  EXPORT_SYMBOL_GPL(alloc_pages_bulk_noprof);
>



* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-14 12:44     ` Zhaoyang Huang
@ 2025-10-14 13:04       ` Christian König
  2025-10-14 15:10         ` Petr Tesarik
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2025-10-14 13:04 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: zhaoyang.huang, Andrew Morton, David Hildenbrand, Matthew Wilcox,
	Mel Gorman, Vlastimil Babka, Sumit Semwal, Benjamin Gaignard,
	Brian Starkey, John Stultz, T . J . Mercier, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, steve.kang

On 14.10.25 14:44, Zhaoyang Huang wrote:
> On Tue, Oct 14, 2025 at 7:59 PM Christian König
> <christian.koenig@amd.com> wrote:
>>
>> On 14.10.25 10:32, zhaoyang.huang wrote:
>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>
>>> A single dma-buf allocation can be dozens of MB or more, which turns
>>> into a loop allocating several thousand order-0 pages. Furthermore,
>>> concurrent allocations can push a dma-buf allocation into direct
>>> reclaim during that loop. This commit addresses both issues by using
>>> alloc_pages_bulk_list for dma-buf's order-0 allocations. The patch
>>> proved conditionally helpful for an 18MB allocation, reducing the
>>> time from 24604us to 6555us, and does no harm when bulk allocation
>>> cannot be done (it falls back to single-page allocation).
>>
>> Well that sounds like an absolutely horrible idea.
>>
>> See, the handling of allocating only specific orders is *exactly* there to avoid the behavior of bulk allocation.
>>
>> What you seem to do with this patch is to add, on top of the behavior that avoids allocating large chunks from the buddy, the behavior of allocating large chunks from the buddy, because that is faster.
> emm, this patch doesn't change the order-8 and order-4 allocation
> behaviour; it just replaces the loop of order-0 allocations in the
> fallback path with a single bulk allocation. What is your concern
> about this?

As far as I know, the bulk allocation favors splitting large pages into smaller ones instead of allocating smaller pages first. That's where the performance benefit comes from.

But that is exactly what we try to avoid here by allocating only certain orders of pages.

Regards,
Christian.

>>
>> So this change doesn't look like it will fly very high. Please explain what you're actually trying to do, just optimize allocation time?
>>
>> Regards,
>> Christian.
>>
>>> Signed-off-by: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>> ---
>>>  drivers/dma-buf/heaps/system_heap.c | 36 +++++++++++++++++++----------
>>>  1 file changed, 24 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
>>> index bbe7881f1360..71b028c63bd8 100644
>>> --- a/drivers/dma-buf/heaps/system_heap.c
>>> +++ b/drivers/dma-buf/heaps/system_heap.c
>>> @@ -300,8 +300,8 @@ static const struct dma_buf_ops system_heap_buf_ops = {
>>>       .release = system_heap_dma_buf_release,
>>>  };
>>>
>>> -static struct page *alloc_largest_available(unsigned long size,
>>> -                                         unsigned int max_order)
>>> +static void alloc_largest_available(unsigned long size,
>>> +                 unsigned int max_order, unsigned int *num_pages, struct list_head *list)
>>>  {
>>>       struct page *page;
>>>       int i;
>>> @@ -312,12 +312,19 @@ static struct page *alloc_largest_available(unsigned long size,
>>>               if (max_order < orders[i])
>>>                       continue;
>>>
>>> -             page = alloc_pages(order_flags[i], orders[i]);
>>> -             if (!page)
>>> +             if (orders[i]) {
>>> +                     page = alloc_pages(order_flags[i], orders[i]);
>>> +                     if (page) {
>>> +                             list_add(&page->lru, list);
>>> +                             *num_pages = 1;
>>> +                     }
>>> +             } else
>>> +                     *num_pages = alloc_pages_bulk_list(LOW_ORDER_GFP, size / PAGE_SIZE, list);
>>> +
>>> +             if (list_empty(list))
>>>                       continue;
>>> -             return page;
>>> +             return;
>>>       }
>>> -     return NULL;
>>>  }
>>>
>>>  static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>> @@ -335,6 +342,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>>       struct list_head pages;
>>>       struct page *page, *tmp_page;
>>>       int i, ret = -ENOMEM;
>>> +     unsigned int num_pages;
>>> +     LIST_HEAD(head);
>>>
>>>       buffer = kzalloc(sizeof(*buffer), GFP_KERNEL);
>>>       if (!buffer)
>>> @@ -348,6 +357,8 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>>       INIT_LIST_HEAD(&pages);
>>>       i = 0;
>>>       while (size_remaining > 0) {
>>> +             num_pages = 0;
>>> +             INIT_LIST_HEAD(&head);
>>>               /*
>>>                * Avoid trying to allocate memory if the process
>>>                * has been killed by SIGKILL
>>> @@ -357,14 +368,15 @@ static struct dma_buf *system_heap_allocate(struct dma_heap *heap,
>>>                       goto free_buffer;
>>>               }
>>>
>>> -             page = alloc_largest_available(size_remaining, max_order);
>>> -             if (!page)
>>> +             alloc_largest_available(size_remaining, max_order, &num_pages, &head);
>>> +             if (!num_pages)
>>>                       goto free_buffer;
>>>
>>> -             list_add_tail(&page->lru, &pages);
>>> -             size_remaining -= page_size(page);
>>> -             max_order = compound_order(page);
>>> -             i++;
>>> +             list_splice_tail(&head, &pages);
>>> +             max_order = folio_order(lru_to_folio(&head));
>>> +             size_remaining -= PAGE_SIZE * (num_pages << max_order);
>>> +             i += num_pages;
>>> +
>>>       }
>>>
>>>       table = &buffer->sg_table;
>>



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/2] optimization of dma-buf system_heap allocation
  2025-10-14  8:32 [PATCH 0/2] optimization of dma-buf system_heap allocation zhaoyang.huang
  2025-10-14  8:32 ` [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful zhaoyang.huang
  2025-10-14  8:32 ` [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation zhaoyang.huang
@ 2025-10-14 14:06 ` Matthew Wilcox
  2 siblings, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2025-10-14 14:06 UTC (permalink / raw)
  To: zhaoyang.huang
  Cc: Andrew Morton, David Hildenbrand, Mel Gorman, Vlastimil Babka,
	Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
	T . J . Mercier, Christian König, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-kernel, Zhaoyang Huang,
	steve.kang

On Tue, Oct 14, 2025 at 04:32:28PM +0800, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> 
> This series of patches would like to introduce alloc_pages_bulk_list in
> dma-buf which need to call back the API for page allocation.

Start with the problem you're trying to solve.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-14 13:04       ` Christian König
@ 2025-10-14 15:10         ` Petr Tesarik
  2025-10-14 15:52           ` Christian König
  0 siblings, 1 reply; 19+ messages in thread
From: Petr Tesarik @ 2025-10-14 15:10 UTC (permalink / raw)
  To: Christian König
  Cc: Zhaoyang Huang, zhaoyang.huang, Andrew Morton, David Hildenbrand,
	Matthew Wilcox, Mel Gorman, Vlastimil Babka, Sumit Semwal,
	Benjamin Gaignard, Brian Starkey, John Stultz, T . J . Mercier,
	linux-media, dri-devel, linaro-mm-sig, linux-mm, linux-kernel,
	steve.kang

On Tue, 14 Oct 2025 15:04:14 +0200
Christian König <christian.koenig@amd.com> wrote:

> On 14.10.25 14:44, Zhaoyang Huang wrote:
> > On Tue, Oct 14, 2025 at 7:59 PM Christian König
> > <christian.koenig@amd.com> wrote:  
> >>
> >> On 14.10.25 10:32, zhaoyang.huang wrote:  
> >>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>>
> >>> The size of once dma-buf allocation could be dozens MB or much more
> >>> which introduce a loop of allocating several thousands of order-0 pages.
> >>> Furthermore, the concurrent allocation could have dma-buf allocation enter
> >>> direct-reclaim during the loop. This commit would like to eliminate the
> >>> above two affections by introducing alloc_pages_bulk_list in dma-buf's
> >>> order-0 allocation. This patch is proved to be conditionally helpful
> >>> in 18MB allocation as decreasing the time from 24604us to 6555us and no
> >>> harm when bulk allocation can't be done(fallback to single page
> >>> allocation)  
> >>
> >> Well that sounds like an absolutely horrible idea.
> >>
> >> See the handling of allocating only from specific order is *exactly* there to avoid the behavior of bulk allocation.
> >>
> >> What you seem to do with this patch here is to add on top of the behavior to avoid allocating large chunks from the buddy the behavior to allocate large chunks from the buddy because that is faster.  
> > emm, this patch doesn't change order-8 and order-4's allocation
> > behaviour but just to replace the loop of order-0 allocations into
> > once bulk allocation in the fallback way. What is your concern about
> > this?  
> 
> As far as I know the bulk allocation favors splitting large pages into smaller ones instead of allocating smaller pages first. That's where the performance benefit comes from.
> 
> But that is exactly what we try to avoid here by allocating only certain order of pages.

This is a good question, actually. Yes, bulk alloc will split large
pages if there are insufficient pages on the pcp free list. But is
dma-buf indeed trying to avoid it, or is it merely using an inefficient
API? And does it need the extra speed? Even if it leads to increased
fragmentation?

Petr T


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-14 15:10         ` Petr Tesarik
@ 2025-10-14 15:52           ` Christian König
  2025-10-15  1:12             ` Zhaoyang Huang
  0 siblings, 1 reply; 19+ messages in thread
From: Christian König @ 2025-10-14 15:52 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Zhaoyang Huang, zhaoyang.huang, Andrew Morton, David Hildenbrand,
	Matthew Wilcox, Mel Gorman, Vlastimil Babka, Sumit Semwal,
	Benjamin Gaignard, Brian Starkey, John Stultz, T . J . Mercier,
	linux-media, dri-devel, linaro-mm-sig, linux-mm, linux-kernel,
	steve.kang

On 14.10.25 17:10, Petr Tesarik wrote:
> On Tue, 14 Oct 2025 15:04:14 +0200
> Christian König <christian.koenig@amd.com> wrote:
> 
>> On 14.10.25 14:44, Zhaoyang Huang wrote:
>>> On Tue, Oct 14, 2025 at 7:59 PM Christian König
>>> <christian.koenig@amd.com> wrote:  
>>>>
>>>> On 14.10.25 10:32, zhaoyang.huang wrote:  
>>>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>>>>>
>>>>> The size of once dma-buf allocation could be dozens MB or much more
>>>>> which introduce a loop of allocating several thousands of order-0 pages.
>>>>> Furthermore, the concurrent allocation could have dma-buf allocation enter
>>>>> direct-reclaim during the loop. This commit would like to eliminate the
>>>>> above two affections by introducing alloc_pages_bulk_list in dma-buf's
>>>>> order-0 allocation. This patch is proved to be conditionally helpful
>>>>> in 18MB allocation as decreasing the time from 24604us to 6555us and no
>>>>> harm when bulk allocation can't be done(fallback to single page
>>>>> allocation)  
>>>>
>>>> Well that sounds like an absolutely horrible idea.
>>>>
>>>> See the handling of allocating only from specific order is *exactly* there to avoid the behavior of bulk allocation.
>>>>
>>>> What you seem to do with this patch here is to add on top of the behavior to avoid allocating large chunks from the buddy the behavior to allocate large chunks from the buddy because that is faster.  
>>> emm, this patch doesn't change order-8 and order-4's allocation
>>> behaviour but just to replace the loop of order-0 allocations into
>>> once bulk allocation in the fallback way. What is your concern about
>>> this?  
>>
>> As far as I know the bulk allocation favors splitting large pages into smaller ones instead of allocating smaller pages first. That's where the performance benefit comes from.
>>
>> But that is exactly what we try to avoid here by allocating only certain order of pages.
> 
> This is a good question, actually. Yes, bulk alloc will split large
> pages if there are insufficient pages on the pcp free list. But is
> dma-buf indeed trying to avoid it, or is it merely using an inefficient
> API? And does it need the extra speed? Even if it leads to increased
> fragmentation?

DMA-buf-heaps is completely intentionally trying rather hard to avoid splitting large pages. That's why you have the distinction between HIGH_ORDER_GFP and LOW_ORDER_GFP as well.
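
To illustrate (a rough sketch of the flags, not necessarily the exact
upstream definitions): the high orders are opportunistic and opt out of
reclaim entirely, only order 0 is allowed to try hard:

	#define LOW_ORDER_GFP  (GFP_HIGHUSER | __GFP_ZERO | __GFP_COMP)
	#define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN | \
				  __GFP_NORETRY) & ~__GFP_RECLAIM) | __GFP_COMP)

A failing order-8 or order-4 attempt therefore just falls through to
the next smaller order instead of entering direct reclaim.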

Keep in mind that this is mostly used on embedded system with only small amounts of memory.

Not entering direct reclaim and instead preferring to split large pages until they are used up is an absolutely no-go for most use cases as far as I can see.

Could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.

Regards,
Christian.

> 
> Petr T



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-14 15:52           ` Christian König
@ 2025-10-15  1:12             ` Zhaoyang Huang
  2025-10-15  3:21               ` Matthew Wilcox
  0 siblings, 1 reply; 19+ messages in thread
From: Zhaoyang Huang @ 2025-10-15  1:12 UTC (permalink / raw)
  To: Christian König, Suren Baghdasaryan
  Cc: Petr Tesarik, zhaoyang.huang, Andrew Morton, David Hildenbrand,
	Matthew Wilcox, Mel Gorman, Vlastimil Babka, Sumit Semwal,
	Benjamin Gaignard, Brian Starkey, John Stultz, T . J . Mercier,
	linux-media, dri-devel, linaro-mm-sig, linux-mm, linux-kernel,
	steve.kang

On Tue, Oct 14, 2025 at 11:52 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 14.10.25 17:10, Petr Tesarik wrote:
> > On Tue, 14 Oct 2025 15:04:14 +0200
> > Christian König <christian.koenig@amd.com> wrote:
> >
> >> On 14.10.25 14:44, Zhaoyang Huang wrote:
> >>> On Tue, Oct 14, 2025 at 7:59 PM Christian König
> >>> <christian.koenig@amd.com> wrote:
> >>>>
> >>>> On 14.10.25 10:32, zhaoyang.huang wrote:
> >>>>> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
> >>>>>
> >>>>> The size of once dma-buf allocation could be dozens MB or much more
> >>>>> which introduce a loop of allocating several thousands of order-0 pages.
> >>>>> Furthermore, the concurrent allocation could have dma-buf allocation enter
> >>>>> direct-reclaim during the loop. This commit would like to eliminate the
> >>>>> above two affections by introducing alloc_pages_bulk_list in dma-buf's
> >>>>> order-0 allocation. This patch is proved to be conditionally helpful
> >>>>> in 18MB allocation as decreasing the time from 24604us to 6555us and no
> >>>>> harm when bulk allocation can't be done(fallback to single page
> >>>>> allocation)
> >>>>
> >>>> Well that sounds like an absolutely horrible idea.
> >>>>
> >>>> See the handling of allocating only from specific order is *exactly* there to avoid the behavior of bulk allocation.
> >>>>
> >>>> What you seem to do with this patch here is to add on top of the behavior to avoid allocating large chunks from the buddy the behavior to allocate large chunks from the buddy because that is faster.
> >>> emm, this patch doesn't change order-8 and order-4's allocation
> >>> behaviour but just to replace the loop of order-0 allocations into
> >>> once bulk allocation in the fallback way. What is your concern about
> >>> this?
> >>
> >> As far as I know the bulk allocation favors splitting large pages into smaller ones instead of allocating smaller pages first. That's where the performance benefit comes from.
> >>
> >> But that is exactly what we try to avoid here by allocating only certain order of pages.
> >
> > This is a good question, actually. Yes, bulk alloc will split large
> > pages if there are insufficient pages on the pcp free list. But is
> > dma-buf indeed trying to avoid it, or is it merely using an inefficient
> > API? And does it need the extra speed? Even if it leads to increased
> > fragmentation?
>
> DMA-buf-heaps is completely intentionally trying rather hard to avoid splitting large pages. That's why you have the distinction between HIGH_ORDER_GFP and LOW_ORDER_GFP as well.
Could you please check alloc_pages_bulk_noprof in patch 1/2 of this
series? According to my understanding, it tries to get order-0 pages
from the pcplist first and then falls back to a normal
alloc_pages(order-0), the same as what dma-buf currently does. That is
to say, no extra splitting of large pages is introduced by this commit.
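
Roughly, the core of the bulk path looks like this (a simplified sketch
from my reading of mm/page_alloc.c, not the literal code):

	/* take as many order-0 pages as possible from the pcp list */
	while (nr_populated < nr_pages) {
		page = __rmqueue_pcplist(zone, 0, ..., pcp, pcp_list);
		if (!page)
			break;
		/* link the page into the list / fill the array slot */
		nr_populated++;
	}
	/* if nothing could be taken, fall back to a single ordinary
	 * order-0 allocation, which behaves like today's loop body */

so the buddy splitting behaviour stays the same as for a plain
alloc_pages(order-0); only the pcp locking is batched.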
>
> Keep in mind that this is mostly used on embedded system with only small amounts of memory.
Actually, dma-buf is the main consumer of driver memory allocations in
the Android world (multimedia, camera, NPU, etc.) and can use up to
half of the system RAM, with dozens of MB in a single allocation.
>
> Not entering direct reclaim and instead preferring to split large pages until they are used up is an absolutely no-go for most use cases as far as I can see.
Yes. This behaviour would NOT be changed under this commit.
>
> Could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.
ok, should we use CONFIG_DMA_BUF_BULK_ALLOCATION or a variable
controlled by sysfs interface?
>
> Regards,
> Christian.
>
> >
> > Petr T
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-15  1:12             ` Zhaoyang Huang
@ 2025-10-15  3:21               ` Matthew Wilcox
  2025-10-15  5:52                 ` Zhaoyang Huang
  0 siblings, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2025-10-15  3:21 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Christian König, Suren Baghdasaryan, Petr Tesarik,
	zhaoyang.huang, Andrew Morton, David Hildenbrand, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-kernel, steve.kang

On Wed, Oct 15, 2025 at 09:12:07AM +0800, Zhaoyang Huang wrote:
> > Could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.
> ok, should we use CONFIG_DMA_BUF_BULK_ALLOCATION or a variable
> controlled by sysfs interface?

No.  Explain what you're trying to solve, because you haven't yet.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-15  3:21               ` Matthew Wilcox
@ 2025-10-15  5:52                 ` Zhaoyang Huang
  2025-10-15  8:40                   ` Petr Tesarik
  0 siblings, 1 reply; 19+ messages in thread
From: Zhaoyang Huang @ 2025-10-15  5:52 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Christian König, Suren Baghdasaryan, Petr Tesarik,
	zhaoyang.huang, Andrew Morton, David Hildenbrand, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-kernel, steve.kang

On Wed, Oct 15, 2025 at 11:21 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Oct 15, 2025 at 09:12:07AM +0800, Zhaoyang Huang wrote:
> > > Could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.
> > ok, should we use CONFIG_DMA_BUF_BULK_ALLOCATION or a variable
> > controlled by sysfs interface?
>
> No.  Explain what you're trying to solve, because you haven't yet.
Dma-buf, working as a memory allocation backend, can loop through
thousands of alloc_pages() calls for order-0 pages to fulfill a demand
of dozens of MB; this commit would like to replace that loop with a
single alloc_pages_bulk call. However, the array variant may introduce
an extra memory allocation (possibly triggering direct reclaim) for the
page array itself, which could be more expensive than iterating a list,
so the alloc_pages_bulk_list API is brought back as well.
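
For reference, the two variants look roughly like this (naming follows
this series; the exact current prototypes may differ):

	/* array variant: caller provides storage for nr_pages pointers */
	unsigned long alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages,
				       struct page **page_array);

	/* list variant: pages are linked through page->lru, no extra storage */
	unsigned long alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages,
					    struct list_head *list);

For a dozens-of-MB buffer the page_array itself would be tens of KB of
extra memory, which is the allocation I want to avoid when we are
already under memory pressure.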


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-15  5:52                 ` Zhaoyang Huang
@ 2025-10-15  8:40                   ` Petr Tesarik
  2025-10-15  9:14                     ` Zhaoyang Huang
  0 siblings, 1 reply; 19+ messages in thread
From: Petr Tesarik @ 2025-10-15  8:40 UTC (permalink / raw)
  To: Zhaoyang Huang
  Cc: Matthew Wilcox, Christian König, Suren Baghdasaryan,
	zhaoyang.huang, Andrew Morton, David Hildenbrand, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-kernel, steve.kang

On Wed, 15 Oct 2025 13:52:57 +0800
Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:

> On Wed, Oct 15, 2025 at 11:21 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, Oct 15, 2025 at 09:12:07AM +0800, Zhaoyang Huang wrote:  
> > > > Could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.  
> > > ok, should we use CONFIG_DMA_BUF_BULK_ALLOCATION or a variable
> > > controlled by sysfs interface?  
> >
> > No.  Explain what you're trying to solve, because you haven't yet.  
> Dma-buf, working as a memory allocation backend, can loop through
> thousands of alloc_pages() calls for order-0 pages to fulfill a demand
> of dozens of MB; this commit would like to replace that loop with a
> single alloc_pages_bulk call. However, the array variant may introduce
> an extra memory allocation (possibly triggering direct reclaim) for the
> page array itself, which could be more expensive than iterating a list,
> so the alloc_pages_bulk_list API is brought back as well.

This does not quite explain it. IIRC you mentioned allocating 18M as an
example. The ideal outcome in that case is:

- 16 order-8 compound pages
- 32 order-4 compound pages
-> total 48 calls to alloc_pages()
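(That is, 18 MiB = 4608 order-0 pages = 16 * 256 + 32 * 16.)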

But presumably, that's not what happens, because fragmentation makes
(some of) those order-8 allocations fail. Since you talk about
thousands of loop iterations, it looks like even order-4 allocations
fail in your case. Then I agree there's not much value in trying to
avoid further fragmentation, and after so many order-0 allocations,
it's probably also pointless to do memory reclaim.

OTOH I can see why the opposite approach is a bad idea in situations
where fragmentation can be avoided. To make things even worse,
alloc_pages_bulk() will rather split pages in the preferred zone than
try allocating from the next best zone.

To sum it up, Zhaoyang, can you please describe in more detail what
happens in your scenario and what you believe should happen instead?

Petr T


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation
  2025-10-15  8:40                   ` Petr Tesarik
@ 2025-10-15  9:14                     ` Zhaoyang Huang
  0 siblings, 0 replies; 19+ messages in thread
From: Zhaoyang Huang @ 2025-10-15  9:14 UTC (permalink / raw)
  To: Petr Tesarik
  Cc: Matthew Wilcox, Christian König, Suren Baghdasaryan,
	zhaoyang.huang, Andrew Morton, David Hildenbrand, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, linux-media, dri-devel,
	linaro-mm-sig, linux-mm, linux-kernel, steve.kang

On Wed, Oct 15, 2025 at 4:40 PM Petr Tesarik <ptesarik@suse.com> wrote:
>
> On Wed, 15 Oct 2025 13:52:57 +0800
> Zhaoyang Huang <huangzhaoyang@gmail.com> wrote:
>
> > On Wed, Oct 15, 2025 at 11:21 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Wed, Oct 15, 2025 at 09:12:07AM +0800, Zhaoyang Huang wrote:
> > > > > Could be that we need to make this behavior conditional, but somebody would need to come up with some really good arguments to justify the complexity.
> > > > ok, should we use CONFIG_DMA_BUF_BULK_ALLOCATION or a variable
> > > > controlled by sysfs interface?
> > >
> > > No.  Explain what you're trying to solve, because you haven't yet.
> > Dma-buf, working as a memory allocation backend, can loop through
> > thousands of alloc_pages() calls for order-0 pages to fulfill a demand
> > of dozens of MB; this commit would like to replace that loop with a
> > single alloc_pages_bulk call. However, the array variant may introduce
> > an extra memory allocation (possibly triggering direct reclaim) for the
> > page array itself, which could be more expensive than iterating a list,
> > so the alloc_pages_bulk_list API is brought back as well.
>
> This does not quite explain it. IIRC you mentioned allocating 18M as an
> example. The ideal outcome in that case is:
>
> - 16 order-8 compound pages
> - 32 order-4 compound pages
> -> total 48 calls to alloc_pages()
>
> But presumably, that's not what happens, because fragmentation makes
> (some of) those order-8 allocations fail. Since you talk about
> thousands of loop iterations, it looks like even order-4 allocations
> fail in your case. Then I agree there's not much value in trying to
> avoid further fragmentation, and after so many order-0 allocations,
> it's probably also pointless to do memory reclaim.

Thanks for the words

>
> OTOH I can see why the opposite approach is a bad idea in situations
> where fragmentation can be avoided. To make things even worse,
> alloc_pages_bulk() will rather split pages in the preferred zone than
> try allocating from the next best zone.

But the loop of alloc_pages(order-0) could also split high-order pages
one by one in the preferred zone, right?

>
> To sum it up, Zhaoyang, can you please describe in more detail what
> happens in your scenario and what you believe should happen instead?

My goal is as simple as eliminating the loop of alloc_pages to get
some performance gains.

>
> Petr T


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful
  2025-10-14  8:32 ` [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful zhaoyang.huang
  2025-10-14  9:41   ` Petr Tesarik
@ 2025-10-15 12:16   ` David Hildenbrand
  2025-10-15 12:35     ` Zhaoyang Huang
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2025-10-15 12:16 UTC (permalink / raw)
  To: zhaoyang.huang, Andrew Morton, Matthew Wilcox, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, Christian König, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, Zhaoyang Huang,
	steve.kang

On 14.10.25 10:32, zhaoyang.huang wrote:
> From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>

Probably the subject should be "mm: reintroduce alloc_pages_bulk_list()"

> 
> commit c8b979530f27 ("mm: alloc_pages_bulk_noprof: drop page_list
> argument") drops alloc_pages_bulk_list. This commit would like to call back
> it since it is proved to be helpful to the drivers which allocate a bulk of
> pages(see patch of 2 in this series ).

"Let's reintroduce it so we can us for bulk allocation in the context of 
XXX next."

> I do notice that Matthew's comment of the time cost of iterating a list.
> However, I also observed in our test that the extra page_array's allocation
> could be more expensive than cpu iteration when direct reclaiming happens
> when ram is low[1]. IMHO, could we leave the API here to have the users
> choose between the array or list according to their scenarios.

I'd prefer if we avoid reintroducing this interface.

How many pages are you intending to allocate? Wouldn't a smaller array 
on the stack be sufficient?
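
Something like the following, completely untested, just to illustrate
the idea (the batch size and nr_remaining are made up; the rest follows
the existing system_heap code):

	struct page *batch[64] = {};
	unsigned long want, got, j;

	while (nr_remaining) {
		want = min_t(unsigned long, nr_remaining, ARRAY_SIZE(batch));
		got = alloc_pages_bulk(LOW_ORDER_GFP, want, batch);
		if (!got)
			goto free_buffer;
		for (j = 0; j < got; j++)
			list_add_tail(&batch[j]->lru, &pages);
		/* clear the used slots so they can be refilled next round */
		memset(batch, 0, got * sizeof(batch[0]));
		nr_remaining -= got;
	}

That gives you the bulk fast path without bringing the list interface
back.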


-- 
Cheers

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful
  2025-10-15 12:16   ` David Hildenbrand
@ 2025-10-15 12:35     ` Zhaoyang Huang
  0 siblings, 0 replies; 19+ messages in thread
From: Zhaoyang Huang @ 2025-10-15 12:35 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: zhaoyang.huang, Andrew Morton, Matthew Wilcox, Mel Gorman,
	Vlastimil Babka, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
	John Stultz, T . J . Mercier, Christian König, linux-media,
	dri-devel, linaro-mm-sig, linux-mm, linux-kernel, steve.kang

On Wed, Oct 15, 2025 at 8:16 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.10.25 10:32, zhaoyang.huang wrote:
> > From: Zhaoyang Huang <zhaoyang.huang@unisoc.com>
>
> Probably the subject should be "mm: reintroduce alloc_pages_bulk_list()"
ok
>
> >
> > commit c8b979530f27 ("mm: alloc_pages_bulk_noprof: drop page_list
> > argument") drops alloc_pages_bulk_list. This commit would like to call back
> > it since it is proved to be helpful to the drivers which allocate a bulk of
> > pages(see patch of 2 in this series ).
>
> "Let's reintroduce it so we can us for bulk allocation in the context of
> XXX next."
>
> > I do notice that Matthew's comment of the time cost of iterating a list.
> > However, I also observed in our test that the extra page_array's allocation
> > could be more expensive than cpu iteration when direct reclaiming happens
> > when ram is low[1]. IMHO, could we leave the API here to have the users
> > choose between the array or list according to their scenarios.
>
> I'd prefer if we avoid reintroducing this interface.
>
> How many pages are you intending to allocate? Wouldn't a smaller array
> on the stack be sufficient?
Actually, dma-buf is the main memory consumer in Android and can occupy
half of the system RAM (mainly for multimedia buffers passed between
GPU and display, the camera driver, the NPU driver, etc.). Dozens of MB
per allocation is quite common, sometimes more. This commit proved
helpful in the camera-app cold-start scenario, which allocates around
300MB on a 6GB RAM Android system.
IMHO, a page_list could be more memory-efficient than a page_array,
whose allocation is a more uncertain cost than iterating the list.
>
>
> --
> Cheers
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-10-15 12:36 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-14  8:32 [PATCH 0/2] optimization of dma-buf system_heap allocation zhaoyang.huang
2025-10-14  8:32 ` [PATCH 1/2] mm: call back alloc_pages_bulk_list since it is useful zhaoyang.huang
2025-10-14  9:41   ` Petr Tesarik
2025-10-14 12:46     ` Zhaoyang Huang
2025-10-15 12:16   ` David Hildenbrand
2025-10-15 12:35     ` Zhaoyang Huang
2025-10-14  8:32 ` [PATCH 2/2] driver: dma-buf: use alloc_pages_bulk_list for order-0 allocation zhaoyang.huang
2025-10-14 11:59   ` Christian König
2025-10-14 12:44     ` Zhaoyang Huang
2025-10-14 13:04       ` Christian König
2025-10-14 15:10         ` Petr Tesarik
2025-10-14 15:52           ` Christian König
2025-10-15  1:12             ` Zhaoyang Huang
2025-10-15  3:21               ` Matthew Wilcox
2025-10-15  5:52                 ` Zhaoyang Huang
2025-10-15  8:40                   ` Petr Tesarik
2025-10-15  9:14                     ` Zhaoyang Huang
2025-10-14 12:45   ` Petr Tesarik
2025-10-14 14:06 ` [PATCH 0/2] optimization of dma-buf system_heap allocation Matthew Wilcox

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox