* [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
@ 2025-02-10 1:56 yangge1116
2025-02-10 3:44 ` Barry Song
` (3 more replies)
0 siblings, 4 replies; 8+ messages in thread
From: yangge1116 @ 2025-02-10 1:56 UTC (permalink / raw)
To: akpm
Cc: linux-mm, linux-kernel, 21cnbao, david, baolin.wang,
aisheng.dong, liuzixing, yangge
From: yangge <yangge1116@126.com>
For different CMAs, concurrent allocation of CMA memory ideally should not
require synchronization using locks. Currently, a global cma_mutex lock is
employed to synchronize all CMA allocations, which can impact the
performance of concurrent allocations across different CMAs.
To test the performance impact, follow these steps:
1. Boot the kernel with the command line argument hugetlb_cma=30G to
allocate a 30GB CMA area specifically for huge page allocations. (note:
on my machine, which has 3 nodes, each node is initialized with 10G of
CMA)
2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
count=30 to fully utilize the CMA area by writing zeroes to a file in
/dev/shm.
3. Open three terminals and execute the following commands simultaneously:
(Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
pages] of CMA memory.)
On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
We attempt to allocate pages through the CMA debug interface and use the
time command to measure the duration of each allocation.
Performance comparison:
                Without this patch    With this patch
Terminal1       ~7s                   ~7s
Terminal2       ~14s                  ~8s
Terminal3       ~21s                  ~7s
To solve the problem above, we could use per-CMA locks to improve concurrent
allocation performance. This would allow each CMA to be managed
independently, reducing the need for a global lock and thus improving
scalability and performance.
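With this change each CMA area ends up with two locks: cma->lock, a spinlock
that protects the allocation bitmap and counters, and the new cma->alloc_mutex,
which only serializes alloc_contig_range() calls within that area. A minimal
sketch of the resulting layout (the field comments here are illustrative, not
the actual mm/cma.h documentation):

	struct cma {
		...
		spinlock_t lock;		/* protects bitmap and counters */
		struct mutex alloc_mutex;	/* serializes alloc_contig_range() in this area */
		...
	};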
Signed-off-by: yangge <yangge1116@126.com>
---
V2:
- Update the code and commit message as suggested by Barry.
mm/cma.c | 7 ++++---
mm/cma.h | 1 +
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/cma.c b/mm/cma.c
index 34a4df2..a0d4d2f 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -34,7 +34,6 @@
 
 struct cma cma_areas[MAX_CMA_AREAS];
 unsigned int cma_area_count;
-static DEFINE_MUTEX(cma_mutex);
 
 static int __init __cma_declare_contiguous_nid(phys_addr_t base,
 			phys_addr_t size, phys_addr_t limit,
@@ -175,6 +174,8 @@ static void __init cma_activate_area(struct cma *cma)
 
 	spin_lock_init(&cma->lock);
 
+	mutex_init(&cma->alloc_mutex);
+
 #ifdef CONFIG_CMA_DEBUGFS
 	INIT_HLIST_HEAD(&cma->mem_head);
 	spin_lock_init(&cma->mem_head_lock);
@@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
 		spin_unlock_irq(&cma->lock);
 
 		pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
-		mutex_lock(&cma_mutex);
+		mutex_lock(&cma->alloc_mutex);
 		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
-		mutex_unlock(&cma_mutex);
+		mutex_unlock(&cma->alloc_mutex);
 		if (ret == 0) {
 			page = pfn_to_page(pfn);
 			break;
diff --git a/mm/cma.h b/mm/cma.h
index df7fc62..41a3ab0 100644
--- a/mm/cma.h
+++ b/mm/cma.h
@@ -39,6 +39,7 @@ struct cma {
 	unsigned long available_count;
 	unsigned int order_per_bit; /* Order of pages represented by one bit */
 	spinlock_t lock;
+	struct mutex alloc_mutex;
 #ifdef CONFIG_CMA_DEBUGFS
 	struct hlist_head mem_head;
 	spinlock_t mem_head_lock;
--
2.7.4
* Re: [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
2025-02-10 1:56 [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance yangge1116
@ 2025-02-10 3:44 ` Barry Song
2025-02-10 8:34 ` David Hildenbrand
` (2 subsequent siblings)
3 siblings, 0 replies; 8+ messages in thread
From: Barry Song @ 2025-02-10 3:44 UTC (permalink / raw)
To: yangge1116
Cc: akpm, linux-mm, linux-kernel, david, baolin.wang, aisheng.dong,
liuzixing
On Mon, Feb 10, 2025 at 2:56 PM <yangge1116@126.com> wrote:
>
> From: yangge <yangge1116@126.com>
>
> For different CMAs, concurrent allocation of CMA memory ideally should not
> require synchronization using locks. Currently, a global cma_mutex lock is
> employed to synchronize all CMA allocations, which can impact the
> performance of concurrent allocations across different CMAs.
>
> To test the performance impact, follow these steps:
> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
> allocate a 30GB CMA area specifically for huge page allocations. (note:
> on my machine, which has 3 nodes, each node is initialized with 10G of
> CMA)
> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
> count=30 to fully utilize the CMA area by writing zeroes to a file in
> /dev/shm.
> 3. Open three terminals and execute the following commands simultaneously:
> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
> pages] of CMA memory.)
> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>
> We attempt to allocate pages through the CMA debug interface and use the
> time command to measure the duration of each allocation.
> Performance comparison:
> Without this patch With this patch
> Terminal1 ~7s ~7s
> Terminal2 ~14s ~8s
> Terminal3 ~21s ~7s
>
> To solve the problem above, we could use per-CMA locks to improve concurrent
> allocation performance. This would allow each CMA to be managed
> independently, reducing the need for a global lock and thus improving
> scalability and performance.
>
> Signed-off-by: yangge <yangge1116@126.com>
An allocation from one CMA region should not be blocked by an allocation from
another CMA region, especially since we may have multiple CMA regions or
even per-NUMA CMA regions.
Reviewed-by: Barry Song <baohua@kernel.org>
> ---
>
> V2:
> - update code and message suggested by Barry.
>
> mm/cma.c | 7 ++++---
> mm/cma.h | 1 +
> 2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/cma.c b/mm/cma.c
> index 34a4df2..a0d4d2f 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -34,7 +34,6 @@
>
> struct cma cma_areas[MAX_CMA_AREAS];
> unsigned int cma_area_count;
> -static DEFINE_MUTEX(cma_mutex);
>
> static int __init __cma_declare_contiguous_nid(phys_addr_t base,
> phys_addr_t size, phys_addr_t limit,
> @@ -175,6 +174,8 @@ static void __init cma_activate_area(struct cma *cma)
>
> spin_lock_init(&cma->lock);
>
> + mutex_init(&cma->alloc_mutex);
> +
> #ifdef CONFIG_CMA_DEBUGFS
> INIT_HLIST_HEAD(&cma->mem_head);
> spin_lock_init(&cma->mem_head_lock);
> @@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
> spin_unlock_irq(&cma->lock);
>
> pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
> - mutex_lock(&cma_mutex);
> + mutex_lock(&cma->alloc_mutex);
> ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
> - mutex_unlock(&cma_mutex);
> + mutex_unlock(&cma->alloc_mutex);
> if (ret == 0) {
> page = pfn_to_page(pfn);
> break;
> diff --git a/mm/cma.h b/mm/cma.h
> index df7fc62..41a3ab0 100644
> --- a/mm/cma.h
> +++ b/mm/cma.h
> @@ -39,6 +39,7 @@ struct cma {
> unsigned long available_count;
> unsigned int order_per_bit; /* Order of pages represented by one bit */
> spinlock_t lock;
> + struct mutex alloc_mutex;
> #ifdef CONFIG_CMA_DEBUGFS
> struct hlist_head mem_head;
> spinlock_t mem_head_lock;
> --
> 2.7.4
>
>
Thanks
Barry
* Re: [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
2025-02-10 1:56 [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance yangge1116
2025-02-10 3:44 ` Barry Song
@ 2025-02-10 8:34 ` David Hildenbrand
2025-02-10 8:56 ` Ge Yang
2025-02-10 9:15 ` Oscar Salvador
2025-03-18 3:43 ` Andrew Morton
3 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2025-02-10 8:34 UTC (permalink / raw)
To: yangge1116, akpm
Cc: linux-mm, linux-kernel, 21cnbao, baolin.wang, aisheng.dong, liuzixing
On 10.02.25 02:56, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
>
> For different CMAs, concurrent allocation of CMA memory ideally should not
> require synchronization using locks. Currently, a global cma_mutex lock is
> employed to synchronize all CMA allocations, which can impact the
> performance of concurrent allocations across different CMAs.
>
> To test the performance impact, follow these steps:
> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
> allocate a 30GB CMA area specifically for huge page allocations. (note:
> on my machine, which has 3 nodes, each node is initialized with 10G of
> CMA)
> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
> count=30 to fully utilize the CMA area by writing zeroes to a file in
> /dev/shm.
> 3. Open three terminals and execute the following commands simultaneously:
> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
> pages] of CMA memory.)
> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>
Hi,
I'm curious, what is the real workload you are trying to optimize for? I
assume this example here is just to have some measurement.
Is concurrency within a single CMA area also a problem for your use case?
> We attempt to allocate pages through the CMA debug interface and use the
> time command to measure the duration of each allocation.
> Performance comparison:
> Without this patch With this patch
> Terminal1 ~7s ~7s
> Terminal2 ~14s ~8s
> Terminal3 ~21s ~7s
>
> To solve the problem above, we could use per-CMA locks to improve concurrent
> allocation performance. This would allow each CMA to be managed
> independently, reducing the need for a global lock and thus improving
> scalability and performance.
>
> Signed-off-by: yangge <yangge1116@126.com>
> ---
>
> V2:
> - update code and message suggested by Barry.
>
> mm/cma.c | 7 ++++---
> mm/cma.h | 1 +
> 2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/mm/cma.c b/mm/cma.c
> index 34a4df2..a0d4d2f 100644
> --- a/mm/cma.c
> +++ b/mm/cma.c
> @@ -34,7 +34,6 @@
>
> struct cma cma_areas[MAX_CMA_AREAS];
> unsigned int cma_area_count;
> -static DEFINE_MUTEX(cma_mutex);
>
> static int __init __cma_declare_contiguous_nid(phys_addr_t base,
> phys_addr_t size, phys_addr_t limit,
> @@ -175,6 +174,8 @@ static void __init cma_activate_area(struct cma *cma)
>
> spin_lock_init(&cma->lock);
>
> + mutex_init(&cma->alloc_mutex);
> +
> #ifdef CONFIG_CMA_DEBUGFS
> INIT_HLIST_HEAD(&cma->mem_head);
> spin_lock_init(&cma->mem_head_lock);
> @@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
> spin_unlock_irq(&cma->lock);
>
> pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
> - mutex_lock(&cma_mutex);
> + mutex_lock(&cma->alloc_mutex);
> ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
> - mutex_unlock(&cma_mutex);
> + mutex_unlock(&cma->alloc_mutex);
As raised, a better approach might be to return -EAGAIN in case we hit
an isolated pageblock and deal with that more gracefully here (e.g., try
another block, or retry this one if there are no others left, ...).
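Roughly, the caller side of that idea could look like the sketch below in
cma_range_alloc()'s retry loop (purely illustrative; the -EAGAIN return from
alloc_contig_range() for a range that overlaps an already-isolated pageblock
is an assumed extension, not current behaviour):

	/*
	 * Hypothetical variant of the retry loop body: no per-CMA mutex;
	 * instead, treat an assumed -EAGAIN (candidate range overlaps a
	 * pageblock already isolated by a concurrent allocation) like
	 * -EBUSY and simply move on to the next free range in the bitmap.
	 */
	ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
	if (ret == 0) {
		page = pfn_to_page(pfn);
		break;
	}

	cma_clear_bitmap(cma, cmr, pfn, count);
	if (ret != -EBUSY && ret != -EAGAIN)
		break;

	/* try again with a slightly different memory target */
	start = bitmap_no + mask + 1;

In that scheme the allocation mutex could potentially go away entirely, since
the bitmap spinlock already covers the bookkeeping.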
In any case, this change here looks like an improvement.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
2025-02-10 8:34 ` David Hildenbrand
@ 2025-02-10 8:56 ` Ge Yang
0 siblings, 0 replies; 8+ messages in thread
From: Ge Yang @ 2025-02-10 8:56 UTC (permalink / raw)
To: David Hildenbrand, akpm
Cc: linux-mm, linux-kernel, 21cnbao, baolin.wang, aisheng.dong, liuzixing
On 2025/2/10 16:34, David Hildenbrand wrote:
> On 10.02.25 02:56, yangge1116@126.com wrote:
>> From: yangge <yangge1116@126.com>
>>
>> For different CMAs, concurrent allocation of CMA memory ideally should not
>> require synchronization using locks. Currently, a global cma_mutex lock is
>> employed to synchronize all CMA allocations, which can impact the
>> performance of concurrent allocations across different CMAs.
>>
>> To test the performance impact, follow these steps:
>> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
>> allocate a 30GB CMA area specifically for huge page allocations. (note:
>> on my machine, which has 3 nodes, each node is initialized with 10G of
>> CMA)
>> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
>> count=30 to fully utilize the CMA area by writing zeroes to a file in
>> /dev/shm.
>> 3. Open three terminals and execute the following commands simultaneously:
>> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
>> pages] of CMA memory.)
>> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
>> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
>> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>>
>
> Hi,
>
> I'm curious, what is the real workload you are trying to optimize for? I
> assume this example here is just to have some measurement.
>
Some of our drivers require this optimization, but they have not been
merged into the mainline yet. Therefore, we demonstrate the optimization
through the CMA debug code instead.
> Is concurrency within a single CMA area also a problem for your use case?
>
Yes. We will optimize concurrent allocation across multiple CMAs first,
and then address concurrency within a single CMA later.
>
>> We attempt to allocate pages through the CMA debug interface and use the
>> time command to measure the duration of each allocation.
>> Performance comparison:
>> Without this patch With this patch
>> Terminal1 ~7s ~7s
>> Terminal2 ~14s ~8s
>> Terminal3 ~21s ~7s
>>
>> To solve the problem above, we could use per-CMA locks to improve concurrent
>> allocation performance. This would allow each CMA to be managed
>> independently, reducing the need for a global lock and thus improving
>> scalability and performance.
>>
>> Signed-off-by: yangge <yangge1116@126.com>
>> ---
>>
>> V2:
>> - update code and message suggested by Barry.
>>
>> mm/cma.c | 7 ++++---
>> mm/cma.h | 1 +
>> 2 files changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/cma.c b/mm/cma.c
>> index 34a4df2..a0d4d2f 100644
>> --- a/mm/cma.c
>> +++ b/mm/cma.c
>> @@ -34,7 +34,6 @@
>> struct cma cma_areas[MAX_CMA_AREAS];
>> unsigned int cma_area_count;
>> -static DEFINE_MUTEX(cma_mutex);
>> static int __init __cma_declare_contiguous_nid(phys_addr_t base,
>> phys_addr_t size, phys_addr_t limit,
>> @@ -175,6 +174,8 @@ static void __init cma_activate_area(struct cma *cma)
>> spin_lock_init(&cma->lock);
>> + mutex_init(&cma->alloc_mutex);
>> +
>> #ifdef CONFIG_CMA_DEBUGFS
>> INIT_HLIST_HEAD(&cma->mem_head);
>> spin_lock_init(&cma->mem_head_lock);
>> @@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr,
>> spin_unlock_irq(&cma->lock);
>> pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
>> - mutex_lock(&cma_mutex);
>> + mutex_lock(&cma->alloc_mutex);
>> ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
>> - mutex_unlock(&cma_mutex);
>> + mutex_unlock(&cma->alloc_mutex);
>
>
> As raised, a better approach might be to return -EAGAIN in case we hit
> an isolated pageblock and deal with that more gracefully here (e.g., try
> another block, or retry this one if there are not others left, ...)
>
> In any case, this change here looks like an improvement.
>
> Acked-by: David Hildenbrand <david@redhat.com>
>
* Re: [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
2025-02-10 1:56 [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance yangge1116
2025-02-10 3:44 ` Barry Song
2025-02-10 8:34 ` David Hildenbrand
@ 2025-02-10 9:15 ` Oscar Salvador
2025-03-18 3:43 ` Andrew Morton
3 siblings, 0 replies; 8+ messages in thread
From: Oscar Salvador @ 2025-02-10 9:15 UTC (permalink / raw)
To: yangge1116
Cc: akpm, linux-mm, linux-kernel, 21cnbao, david, baolin.wang,
aisheng.dong, liuzixing
On Mon, Feb 10, 2025 at 09:56:06AM +0800, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
>
> For different CMAs, concurrent allocation of CMA memory ideally should not
> require synchronization using locks. Currently, a global cma_mutex lock is
> employed to synchronize all CMA allocations, which can impact the
> performance of concurrent allocations across different CMAs.
>
> To test the performance impact, follow these steps:
> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
> allocate a 30GB CMA area specifically for huge page allocations. (note:
> on my machine, which has 3 nodes, each node is initialized with 10G of
> CMA)
> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
> count=30 to fully utilize the CMA area by writing zeroes to a file in
> /dev/shm.
> 3. Open three terminals and execute the following commands simultaneously:
> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
> pages] of CMA memory.)
> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>
> We attempt to allocate pages through the CMA debug interface and use the
> time command to measure the duration of each allocation.
> Performance comparison:
> Without this patch With this patch
> Terminal1 ~7s ~7s
> Terminal2 ~14s ~8s
> Terminal3 ~21s ~7s
>
> To solve the problem above, we could use per-CMA locks to improve concurrent
> allocation performance. This would allow each CMA to be managed
> independently, reducing the need for a global lock and thus improving
> scalability and performance.
>
> Signed-off-by: yangge <yangge1116@126.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
* Re: [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
2025-02-10 1:56 [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance yangge1116
` (2 preceding siblings ...)
2025-02-10 9:15 ` Oscar Salvador
@ 2025-03-18 3:43 ` Andrew Morton
2025-03-18 7:21 ` Ge Yang
2025-03-18 13:02 ` David Hildenbrand
3 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2025-03-18 3:43 UTC (permalink / raw)
To: yangge1116
Cc: linux-mm, linux-kernel, 21cnbao, david, baolin.wang,
aisheng.dong, liuzixing
On Mon, 10 Feb 2025 09:56:06 +0800 yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
>
> For different CMAs, concurrent allocation of CMA memory ideally should not
> require synchronization using locks. Currently, a global cma_mutex lock is
> employed to synchronize all CMA allocations, which can impact the
> performance of concurrent allocations across different CMAs.
>
> To test the performance impact, follow these steps:
> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
> allocate a 30GB CMA area specifically for huge page allocations. (note:
> on my machine, which has 3 nodes, each node is initialized with 10G of
> CMA)
> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
> count=30 to fully utilize the CMA area by writing zeroes to a file in
> /dev/shm.
> 3. Open three terminals and execute the following commands simultaneously:
> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
> pages] of CMA memory.)
> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>
> We attempt to allocate pages through the CMA debug interface and use the
> time command to measure the duration of each allocation.
> Performance comparison:
> Without this patch With this patch
> Terminal1 ~7s ~7s
> Terminal2 ~14s ~8s
> Terminal3 ~21s ~7s
>
> To solve the problem above, we could use per-CMA locks to improve concurrent
> allocation performance. This would allow each CMA to be managed
> independently, reducing the need for a global lock and thus improving
> scalability and performance.
This patch was in and out of mm-unstable for a while, as Frank's series
"hugetlb/CMA improvements for large systems" was being added and
dropped.
Consequently it hasn't received any testing for a while.
Below is the version which I've now re-added to mm-unstable. Can
you please check this and retest it?
Thanks.
From: Ge Yang <yangge1116@126.com>
Subject: mm/cma: using per-CMA locks to improve concurrent allocation performance
Date: Mon, 10 Feb 2025 09:56:06 +0800
For different CMAs, concurrent allocation of CMA memory ideally should not
require synchronization using locks. Currently, a global cma_mutex lock
is employed to synchronize all CMA allocations, which can impact the
performance of concurrent allocations across different CMAs.
To test the performance impact, follow these steps:
1. Boot the kernel with the command line argument hugetlb_cma=30G to
allocate a 30GB CMA area specifically for huge page allocations. (note:
on my machine, which has 3 nodes, each node is initialized with 10G of
CMA)
2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
count=30 to fully utilize the CMA area by writing zeroes to a file in
/dev/shm.
3. Open three terminals and execute the following commands simultaneously:
(Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
pages] of CMA memory.)
On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
We attempt to allocate pages through the CMA debug interface and use the
time command to measure the duration of each allocation.
Performance comparison:
                Without this patch    With this patch
Terminal1       ~7s                   ~7s
Terminal2       ~14s                  ~8s
Terminal3       ~21s                  ~7s
To solve the problem above, we could use per-CMA locks to improve concurrent
allocation performance. This would allow each CMA to be managed
independently, reducing the need for a global lock and thus improving
scalability and performance.
Link: https://lkml.kernel.org/r/1739152566-744-1-git-send-email-yangge1116@126.com
Signed-off-by: Ge Yang <yangge1116@126.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aisheng Dong <aisheng.dong@nxp.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
mm/cma.c | 7 ++++---
mm/cma.h | 1 +
2 files changed, 5 insertions(+), 3 deletions(-)
--- a/mm/cma.c~mm-cma-using-per-cma-locks-to-improve-concurrent-allocation-performance
+++ a/mm/cma.c
@@ -34,7 +34,6 @@
 
 struct cma cma_areas[MAX_CMA_AREAS];
 unsigned int cma_area_count;
-static DEFINE_MUTEX(cma_mutex);
 
 static int __init __cma_declare_contiguous_nid(phys_addr_t base,
 			phys_addr_t size, phys_addr_t limit,
@@ -175,6 +174,8 @@ static void __init cma_activate_area(str
 
 	spin_lock_init(&cma->lock);
 
+	mutex_init(&cma->alloc_mutex);
+
 #ifdef CONFIG_CMA_DEBUGFS
 	INIT_HLIST_HEAD(&cma->mem_head);
 	spin_lock_init(&cma->mem_head_lock);
@@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *c
 		spin_unlock_irq(&cma->lock);
 
 		pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
-		mutex_lock(&cma_mutex);
+		mutex_lock(&cma->alloc_mutex);
 		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
-		mutex_unlock(&cma_mutex);
+		mutex_unlock(&cma->alloc_mutex);
 		if (ret == 0) {
 			page = pfn_to_page(pfn);
 			break;
--- a/mm/cma.h~mm-cma-using-per-cma-locks-to-improve-concurrent-allocation-performance
+++ a/mm/cma.h
@@ -39,6 +39,7 @@ struct cma {
 	unsigned long available_count;
 	unsigned int order_per_bit; /* Order of pages represented by one bit */
 	spinlock_t lock;
+	struct mutex alloc_mutex;
 #ifdef CONFIG_CMA_DEBUGFS
 	struct hlist_head mem_head;
 	spinlock_t mem_head_lock;
_
* Re: [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
2025-03-18 3:43 ` Andrew Morton
@ 2025-03-18 7:21 ` Ge Yang
2025-03-18 13:02 ` David Hildenbrand
1 sibling, 0 replies; 8+ messages in thread
From: Ge Yang @ 2025-03-18 7:21 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, 21cnbao, david, baolin.wang,
aisheng.dong, liuzixing
On 2025/3/18 11:43, Andrew Morton wrote:
> On Mon, 10 Feb 2025 09:56:06 +0800 yangge1116@126.com wrote:
>
>> From: yangge <yangge1116@126.com>
>>
>> For different CMAs, concurrent allocation of CMA memory ideally should not
>> require synchronization using locks. Currently, a global cma_mutex lock is
>> employed to synchronize all CMA allocations, which can impact the
>> performance of concurrent allocations across different CMAs.
>>
>> To test the performance impact, follow these steps:
>> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
>> allocate a 30GB CMA area specifically for huge page allocations. (note:
>> on my machine, which has 3 nodes, each node is initialized with 10G of
>> CMA)
>> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
>> count=30 to fully utilize the CMA area by writing zeroes to a file in
>> /dev/shm.
>> 3. Open three terminals and execute the following commands simultaneously:
>> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
>> pages] of CMA memory.)
>> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
>> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
>> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>>
>> We attempt to allocate pages through the CMA debug interface and use the
>> time command to measure the duration of each allocation.
>> Performance comparison:
>> Without this patch With this patch
>> Terminal1 ~7s ~7s
>> Terminal2 ~14s ~8s
>> Terminal3 ~21s ~7s
>>
>> To solve the problem above, we could use per-CMA locks to improve concurrent
>> allocation performance. This would allow each CMA to be managed
>> independently, reducing the need for a global lock and thus improving
>> scalability and performance.
>
> This patch was in and out of mm-unstable for a while, as Frank's series
> "hugetlb/CMA improvements for large systems" was being added and
> dropped.
>
> Consequently it hasn't received any testing for a while.
>
> Below is the version which I've now re-added to mm-unstable. Can
> you please check this and retest it?
I applied the patch on top of the latest mm-unstable code and retested it;
everything works normally. Thanks.
>
> Thanks.
>
> From: Ge Yang <yangge1116@126.com>
> Subject: mm/cma: using per-CMA locks to improve concurrent allocation performance
> Date: Mon, 10 Feb 2025 09:56:06 +0800
>
> For different CMAs, concurrent allocation of CMA memory ideally should not
> require synchronization using locks. Currently, a global cma_mutex lock
> is employed to synchronize all CMA allocations, which can impact the
> performance of concurrent allocations across different CMAs.
>
> To test the performance impact, follow these steps:
> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
> allocate a 30GB CMA area specifically for huge page allocations. (note:
> on my machine, which has 3 nodes, each node is initialized with 10G of
> CMA)
> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
> count=30 to fully utilize the CMA area by writing zeroes to a file in
> /dev/shm.
> 3. Open three terminals and execute the following commands simultaneously:
> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
> pages] of CMA memory.)
> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>
> We attempt to allocate pages through the CMA debug interface and use the
> time command to measure the duration of each allocation.
> Performance comparison:
> Without this patch With this patch
> Terminal1 ~7s ~7s
> Terminal2 ~14s ~8s
> Terminal3 ~21s ~7s
>
> To solve problem above, we could use per-CMA locks to improve concurrent
> allocation performance. This would allow each CMA to be managed
> independently, reducing the need for a global lock and thus improving
> scalability and performance.
>
> Link: https://lkml.kernel.org/r/1739152566-744-1-git-send-email-yangge1116@126.com
> Signed-off-by: Ge Yang <yangge1116@126.com>
> Reviewed-by: Barry Song <baohua@kernel.org>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Cc: Aisheng Dong <aisheng.dong@nxp.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
> mm/cma.c | 7 ++++---
> mm/cma.h | 1 +
> 2 files changed, 5 insertions(+), 3 deletions(-)
>
> --- a/mm/cma.c~mm-cma-using-per-cma-locks-to-improve-concurrent-allocation-performance
> +++ a/mm/cma.c
> @@ -34,7 +34,6 @@
>
> struct cma cma_areas[MAX_CMA_AREAS];
> unsigned int cma_area_count;
> -static DEFINE_MUTEX(cma_mutex);
>
> static int __init __cma_declare_contiguous_nid(phys_addr_t base,
> phys_addr_t size, phys_addr_t limit,
> @@ -175,6 +174,8 @@ static void __init cma_activate_area(str
>
> spin_lock_init(&cma->lock);
>
> + mutex_init(&cma->alloc_mutex);
> +
> #ifdef CONFIG_CMA_DEBUGFS
> INIT_HLIST_HEAD(&cma->mem_head);
> spin_lock_init(&cma->mem_head_lock);
> @@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *c
> spin_unlock_irq(&cma->lock);
>
> pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
> - mutex_lock(&cma_mutex);
> + mutex_lock(&cma->alloc_mutex);
> ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
> - mutex_unlock(&cma_mutex);
> + mutex_unlock(&cma->alloc_mutex);
> if (ret == 0) {
> page = pfn_to_page(pfn);
> break;
> --- a/mm/cma.h~mm-cma-using-per-cma-locks-to-improve-concurrent-allocation-performance
> +++ a/mm/cma.h
> @@ -39,6 +39,7 @@ struct cma {
> unsigned long available_count;
> unsigned int order_per_bit; /* Order of pages represented by one bit */
> spinlock_t lock;
> + struct mutex alloc_mutex;
> #ifdef CONFIG_CMA_DEBUGFS
> struct hlist_head mem_head;
> spinlock_t mem_head_lock;
> _
* Re: [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
2025-03-18 3:43 ` Andrew Morton
2025-03-18 7:21 ` Ge Yang
@ 2025-03-18 13:02 ` David Hildenbrand
1 sibling, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-03-18 13:02 UTC (permalink / raw)
To: Andrew Morton, yangge1116
Cc: linux-mm, linux-kernel, 21cnbao, baolin.wang, aisheng.dong, liuzixing
On 18.03.25 04:43, Andrew Morton wrote:
> On Mon, 10 Feb 2025 09:56:06 +0800 yangge1116@126.com wrote:
>
>> From: yangge <yangge1116@126.com>
>>
>> For different CMAs, concurrent allocation of CMA memory ideally should not
>> require synchronization using locks. Currently, a global cma_mutex lock is
>> employed to synchronize all CMA allocations, which can impact the
>> performance of concurrent allocations across different CMAs.
>>
>> To test the performance impact, follow these steps:
>> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
>> allocate a 30GB CMA area specifically for huge page allocations. (note:
>> on my machine, which has 3 nodes, each node is initialized with 10G of
>> CMA)
>> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
>> count=30 to fully utilize the CMA area by writing zeroes to a file in
>> /dev/shm.
>> 3. Open three terminals and execute the following commands simultaneously:
>> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
>> pages] of CMA memory.)
>> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
>> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
>> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>>
>> We attempt to allocate pages through the CMA debug interface and use the
>> time command to measure the duration of each allocation.
>> Performance comparison:
>> Without this patch With this patch
>> Terminal1 ~7s ~7s
>> Terminal2 ~14s ~8s
>> Terminal3 ~21s ~7s
>>
>> To solve the problem above, we could use per-CMA locks to improve concurrent
>> allocation performance. This would allow each CMA to be managed
>> independently, reducing the need for a global lock and thus improving
>> scalability and performance.
>
> This patch was in and out of mm-unstable for a while, as Frank's series
> "hugetlb/CMA improvements for large systems" was being added and
> dropped.
>
> Consequently it hasn't received any testing for a while.
>
> Below is the version which I've now re-added to mm-unstable. Can
> you please check this and retest it?
>
> Thanks.
>
> From: Ge Yang <yangge1116@126.com>
> Subject: mm/cma: using per-CMA locks to improve concurrent allocation performance
> Date: Mon, 10 Feb 2025 09:56:06 +0800
>
> For different CMAs, concurrent allocation of CMA memory ideally should not
> require synchronization using locks. Currently, a global cma_mutex lock
> is employed to synchronize all CMA allocations, which can impact the
> performance of concurrent allocations across different CMAs.
>
> To test the performance impact, follow these steps:
> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
> allocate a 30GB CMA area specifically for huge page allocations. (note:
> on my machine, which has 3 nodes, each node is initialized with 10G of
> CMA)
> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
> count=30 to fully utilize the CMA area by writing zeroes to a file in
> /dev/shm.
> 3. Open three terminals and execute the following commands simultaneously:
> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
> pages] of CMA memory.)
> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>
> We attempt to allocate pages through the CMA debug interface and use the
> time command to measure the duration of each allocation.
> Performance comparison:
> Without this patch With this patch
> Terminal1 ~7s ~7s
> Terminal2 ~14s ~8s
> Terminal3 ~21s ~7s
>
> To solve problem above, we could use per-CMA locks to improve concurrent
> allocation performance. This would allow each CMA to be managed
> independently, reducing the need for a global lock and thus improving
> scalability and performance.
>
> Link: https://lkml.kernel.org/r/1739152566-744-1-git-send-email-yangge1116@126.com
> Signed-off-by: Ge Yang <yangge1116@126.com>
> Reviewed-by: Barry Song <baohua@kernel.org>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Cc: Aisheng Dong <aisheng.dong@nxp.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb