From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <09057869-eb32-45dd-a7a1-9b7e1850eb11@126.com>
Date: Tue, 18 Mar 2025 15:21:42 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH V2] mm/cma: using per-CMA locks to improve concurrent allocation performance
To: Andrew Morton
Cc: linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, 21cnbao@gmail.com, david@redhat.com,
 baolin.wang@linux.alibaba.com, aisheng.dong@nxp.com, liuzixing@hygon.cn
References: <1739152566-744-1-git-send-email-yangge1116@126.com>
 <20250317204325.99b45373023ad2f901c1152e@linux-foundation.org>
From: Ge Yang
In-Reply-To: <20250317204325.99b45373023ad2f901c1152e@linux-foundation.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 2025/3/18 11:43, Andrew Morton wrote:
> On Mon, 10 Feb 2025 09:56:06 +0800 yangge1116@126.com wrote:
> 
>> From: yangge
>>
>> For different CMAs, concurrent allocation of CMA memory ideally should not
>> require synchronization using locks. Currently, a global cma_mutex lock is
>> employed to synchronize all CMA allocations, which can impact the
>> performance of concurrent allocations across different CMAs.
>>
>> To test the performance impact, follow these steps:
>> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
>> allocate a 30GB CMA area specifically for huge page allocations. (note:
>> on my machine, which has 3 nodes, each node is initialized with 10G of
>> CMA)
>> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
>> count=30 to fully utilize the CMA area by writing zeroes to a file in
>> /dev/shm.
>> 3. Open three terminals and execute the following commands simultaneously:
>> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
>> pages] of CMA memory.)
>> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
>> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
>> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
>>
>> We attempt to allocate pages through the CMA debug interface and use the
>> time command to measure the duration of each allocation.
>> Performance comparison:
>>              Without this patch   With this patch
>> Terminal1    ~7s                  ~7s
>> Terminal2    ~14s                 ~8s
>> Terminal3    ~21s                 ~7s
>>
>> To solve the problem above, we could use per-CMA locks to improve concurrent
>> allocation performance. This would allow each CMA to be managed
>> independently, reducing the need for a global lock and thus improving
>> scalability and performance.
> 
> This patch was in and out of mm-unstable for a while, as Frank's series
> "hugetlb/CMA improvements for large systems" was being added and
> dropped.
> 
> Consequently it hasn't received any testing for a while.
> 
> Below is the version which I've now re-added to mm-unstable. Can
> you please check this and retest it?

Based on the latest mm-unstable code, I applied the patch and retested;
it works normally. Thanks.

> 
> Thanks.
> 
> From: Ge Yang
> Subject: mm/cma: using per-CMA locks to improve concurrent allocation performance
> Date: Mon, 10 Feb 2025 09:56:06 +0800
> 
> For different CMAs, concurrent allocation of CMA memory ideally should not
> require synchronization using locks. Currently, a global cma_mutex lock
> is employed to synchronize all CMA allocations, which can impact the
> performance of concurrent allocations across different CMAs.
> 
> To test the performance impact, follow these steps:
> 1. Boot the kernel with the command line argument hugetlb_cma=30G to
> allocate a 30GB CMA area specifically for huge page allocations. (note:
> on my machine, which has 3 nodes, each node is initialized with 10G of
> CMA)
> 2. Use the dd command with parameters if=/dev/zero of=/dev/shm/file bs=1G
> count=30 to fully utilize the CMA area by writing zeroes to a file in
> /dev/shm.
> 3. Open three terminals and execute the following commands simultaneously:
> (Note: Each of these commands attempts to allocate 10GB [2621440 * 4KB
> pages] of CMA memory.)
> On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
> On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
> On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc
> 
> We attempt to allocate pages through the CMA debug interface and use the
> time command to measure the duration of each allocation.
> Performance comparison:
>              Without this patch   With this patch
> Terminal1    ~7s                  ~7s
> Terminal2    ~14s                 ~8s
> Terminal3    ~21s                 ~7s
> 
> To solve the problem above, we could use per-CMA locks to improve concurrent
> allocation performance. This would allow each CMA to be managed
> independently, reducing the need for a global lock and thus improving
> scalability and performance.
> 
> Link: https://lkml.kernel.org/r/1739152566-744-1-git-send-email-yangge1116@126.com
> Signed-off-by: Ge Yang
> Reviewed-by: Barry Song
> Acked-by: David Hildenbrand
> Reviewed-by: Oscar Salvador
> Cc: Aisheng Dong
> Cc: Baolin Wang
> Signed-off-by: Andrew Morton
> ---
> 
>  mm/cma.c | 7 ++++---
>  mm/cma.h | 1 +
>  2 files changed, 5 insertions(+), 3 deletions(-)
> 
> --- a/mm/cma.c~mm-cma-using-per-cma-locks-to-improve-concurrent-allocation-performance
> +++ a/mm/cma.c
> @@ -34,7 +34,6 @@
>  
>  struct cma cma_areas[MAX_CMA_AREAS];
>  unsigned int cma_area_count;
> -static DEFINE_MUTEX(cma_mutex);
>  
>  static int __init __cma_declare_contiguous_nid(phys_addr_t base,
>  			phys_addr_t size, phys_addr_t limit,
> @@ -175,6 +174,8 @@ static void __init cma_activate_area(str
>  
>  	spin_lock_init(&cma->lock);
>  
> +	mutex_init(&cma->alloc_mutex);
> +
>  #ifdef CONFIG_CMA_DEBUGFS
>  	INIT_HLIST_HEAD(&cma->mem_head);
>  	spin_lock_init(&cma->mem_head_lock);
> @@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *c
>  		spin_unlock_irq(&cma->lock);
>  
>  		pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
> -		mutex_lock(&cma_mutex);
> +		mutex_lock(&cma->alloc_mutex);
>  		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
> -		mutex_unlock(&cma_mutex);
> +		mutex_unlock(&cma->alloc_mutex);
>  		if (ret == 0) {
>  			page = pfn_to_page(pfn);
>  			break;
> --- a/mm/cma.h~mm-cma-using-per-cma-locks-to-improve-concurrent-allocation-performance
> +++ a/mm/cma.h
> @@ -39,6 +39,7 @@ struct cma {
>  	unsigned long available_count;
>  	unsigned int order_per_bit;	/* Order of pages represented by one bit */
>  	spinlock_t lock;
> +	struct mutex alloc_mutex;
>  #ifdef CONFIG_CMA_DEBUGFS
>  	struct hlist_head mem_head;
>  	spinlock_t mem_head_lock;
> _
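
For anyone skimming the diff, here is a minimal user-space sketch of the
locking pattern the patch adopts (hypothetical names, pthreads instead of
kernel mutexes; an illustration only, not the kernel code itself): each
"area" carries its own mutex, so allocations in different areas no longer
serialize on one shared lock, while allocations within the same area still
do.

/* per_area_lock_demo.c - illustrative only; build with: cc -pthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_AREAS 3

struct area {
	pthread_mutex_t alloc_mutex;	/* analogous to cma->alloc_mutex */
	long allocated;			/* "pages" handed out from this area */
};

static struct area areas[NUM_AREAS];

/* Simulate one expensive allocation from a given area. */
static void area_alloc(struct area *a, long pages)
{
	/* Contends only with allocations from the same area. */
	pthread_mutex_lock(&a->alloc_mutex);
	a->allocated += pages;
	usleep(1000);	/* stand-in for the long alloc_contig_range() call */
	pthread_mutex_unlock(&a->alloc_mutex);
}

static void *worker(void *arg)
{
	struct area *a = arg;

	for (int i = 0; i < 100; i++)
		area_alloc(a, 1);
	return NULL;
}

int main(void)
{
	pthread_t threads[NUM_AREAS];

	for (int i = 0; i < NUM_AREAS; i++) {
		pthread_mutex_init(&areas[i].alloc_mutex, NULL);
		pthread_create(&threads[i], NULL, worker, &areas[i]);
	}
	for (int i = 0; i < NUM_AREAS; i++) {
		pthread_join(threads[i], NULL);
		printf("area %d: %ld pages\n", i, areas[i].allocated);
	}
	return 0;
}

With a single shared mutex instead, the three workers would run back to
back, which is roughly the ~7s/~14s/~21s pattern measured before the patch.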