Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn
References: <1717492460-19457-1-git-send-email-yangge1116@126.com> <2e3a3a3f-737c-ed01-f820-87efee0adc93@126.com> <9b227c9d-f59b-a8b0-b353-7876a56c0bde@126.com> <4482bf69-eb07-0ec9-f777-28ce40f96589@126.com>
From: yangge1116 <yangge1116@126.com>
Message-ID: <69414410-4e2d-c04c-6fc3-9779f9377cf2@126.com>
Date: Tue, 18 Jun 2024 14:55:48 +0800

On 2024/6/18 12:10 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 wrote:
>>
>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 wrote:
>>>>
>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>
>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>> On Tue, Jun 4, 2024 at 9:15 PM wrote:
>>>>>>>
>>>>>>> From: yangge
>>>>>>>
>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>> of pages on the THP-sized PCP list, it is possible to get a CMA page
>>>>>>> from that list. In
>>>>>>> some cases this is not acceptable; for example, allocating a non-CMA
>>>>>>> page with the PF_MEMALLOC_PIN flag may return a CMA page.
>>>>>>>
>>>>>>> This patch forbids allocating a non-CMA THP-sized page from the
>>>>>>> THP-sized PCP list to avoid the issue above.
>>>>>>
>>>>>> Could you please describe the impact on users in the commit log?
>>>>>
>>>>> If a large amount of CMA memory is configured in the system (for
>>>>> example, CMA memory accounts for 50% of system memory), starting a
>>>>> virtual machine with device passthrough will get stuck.
>>>>>
>>>>> During VM startup, pin_user_pages_remote(..., FOLL_LONGTERM, ...) is
>>>>> called to pin memory. If a page is in a CMA area,
>>>>> pin_user_pages_remote() will migrate the page from the CMA area to a
>>>>> non-CMA area because of the FOLL_LONGTERM flag. If such a non-movable
>>>>> allocation request returns CMA memory, pin_user_pages_remote() will
>>>>> enter an endless loop.
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked // endless loop in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages // the check always fails, so
>>>>> it keeps migrating
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>
>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>> allocation requests?
>>>>>
>>>>> Yes.
>>>>>
>>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>>>> to fail?
>>>>>
>>>>> No, it will cause endless loops in __gup_longterm_locked(). If a
>>>>> non-movable allocation request returns CMA memory,
>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>>>> CMA page, which is useless and causes an endless loop in
>>>>> __gup_longterm_locked().
>>>
>>> This is only one perspective. We also need to consider the impact on
>>> CMA itself. For example, when CMA is borrowed by THP and we need to
>>> reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
>>> those pages out to ensure CMA's users can retrieve that contiguous
>>> memory.
>>>
>>> Currently, CMA's memory can be occupied by non-movable pages, meaning
>>> we can't relocate them. As a result, cma_alloc() is more likely to
>>> fail.
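To make the endless loop in the quoted backtrace concrete, below is a
minimal userspace model of the retry logic in __gup_longterm_locked().
The helper names are simplified stand-ins for the kernel functions in
the backtrace, not the actual kernel code:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the pinned page's location: true while it sits in CMA. */
static bool page_in_cma = true;

/*
 * Stand-in for alloc_migration_target(): a non-movable allocation that,
 * because of the shared THP-sized PCP list, can still return a CMA page.
 * Flip the return value to false to model the fixed kernel.
 */
static bool alloc_migration_target_returns_cma(void)
{
	return true;	/* the bug: the migration target is CMA again */
}

/*
 * Stand-in for check_and_migrate_movable_pages(): returns true when the
 * caller must retry, i.e. the page was found in CMA and was migrated.
 */
static bool check_and_migrate(void)
{
	if (!page_in_cma)
		return false;			/* pinnable, done */
	page_in_cma = alloc_migration_target_returns_cma();
	return true;				/* re-check after migration */
}

int main(void)
{
	int tries = 0;

	/* Model of the retry loop in __gup_longterm_locked(). */
	while (check_and_migrate()) {
		if (++tries > 10) {	/* the real loop has no such cap */
			puts("stuck: every migration target is a CMA page");
			return 1;
		}
	}
	puts("converged: page migrated out of CMA");
	return 0;
}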
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked // endless loop in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages // the check always fails, so
>>>>> it keeps migrating
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>
>>>>>>>
>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations")
>>>>>>> Signed-off-by: yangge
>>>>>>> ---
>>>>>>>  mm/page_alloc.c | 10 ++++++++++
>>>>>>>  1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>  	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>
>>>>>>>  	if (likely(pcp_allowed_order(order))) {
>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>> +				order != HPAGE_PMD_ORDER) {
>>>>>>> +			page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>> +					migratetype, alloc_flags);
>>>>>>> +			if (likely(page))
>>>>>>> +				goto out;
>>>>>>> +		}
>>>>>>
>>>>>> This seems not ideal, because non-CMA THP gets no chance to use the
>>>>>> PCP. But it still seems better than causing CMA allocation failures.
>>>>>>
>>>>>> Is there a possible approach to avoid adding CMA THP into the pcp in
>>>>>> the first place? Otherwise, we might need a separate PCP list for CMA.
>>>>
>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, so avoiding
>>>> adding CMA THP into the pcp may incur a slight performance penalty.
>>>
>>> But the majority of movable pages aren't CMA, right?
>>>
>>> Do we have an estimate for adding back a CMA THP PCP list? Will
>>> per_cpu_pages introduce a new cacheline, which the original intention
>>> for THP was to avoid by having only one PCP list [1]?
>>>
>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>
>> The size of struct per_cpu_pages is 256 bytes in the current code, which
>> contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations"):
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>     spinlock_t lock;
>>     int count;
>>     int high;
>>     int high_min;
>>     int high_max;
>>     int batch;
>>     u8 flags;
>>     u8 alloc_factor;
>>     u8 expire;
>>     short free_count;
>>     struct list_head lists[13];
>> }
>> SIZE: 256
>>
>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>> list for THP-sized allocations"), the size of struct per_cpu_pages is
>> 272 bytes:
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>     spinlock_t lock;
>>     int count;
>>     int high;
>>     int high_min;
>>     int high_max;
>>     int batch;
>>     u8 flags;
>>     u8 alloc_factor;
>>     u8 expire;
>>     short free_count;
>>     struct list_head lists[15];
>> }
>> SIZE: 272
>>
>> It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations") saved one cacheline.
>
> The proposal is not to revert the patch but to add one CMA pcp list, so
> it would be "struct list_head lists[14]"; in that case, is the size
> still 256?

Yes, the size is still 256. If we add one PCP list, we will have 2 PCP
lists for THP: one used by MIGRATE_UNMOVABLE, and the other used by
MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?
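As a quick cross-check of the numbers above, here is a userspace model
of struct per_cpu_pages. It assumes a 4-byte spinlock_t (no lock
debugging), a 16-byte two-pointer list_head, and 64-byte cacheline
alignment; under those assumptions both 13 and 14 lists round up to
exactly 256 bytes, i.e. four cachelines:

#include <stdio.h>

/*
 * Userspace model of struct per_cpu_pages from the crash output above.
 * Layout: 32 bytes of scalar fields (after padding) plus 16 bytes per
 * list, with the whole struct rounded up to a 64-byte cacheline.
 */
struct list_head { void *next, *prev; };

#define DEFINE_PCP(name, nlists)				\
	struct name {						\
		int lock; /* spinlock_t, no lock debugging */	\
		int count, high, high_min, high_max, batch;	\
		unsigned char flags, alloc_factor, expire;	\
		short free_count;				\
		struct list_head lists[nlists];			\
	} __attribute__((aligned(64)))

DEFINE_PCP(pcp_13, 13); /* one shared THP list (current code)    */
DEFINE_PCP(pcp_14, 14); /* plus one CMA THP list (the proposal)  */

int main(void)
{
	printf("13 lists: %zu bytes\n", sizeof(struct pcp_13)); /* 256 */
	printf("14 lists: %zu bytes\n", sizeof(struct pcp_14)); /* 256 */
	return 0;
}

So, under these assumptions, the extra CMA THP list would only consume
existing padding and should not add a new cacheline.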
>>>> Commit 1d91df85f399 takes a similar approach to filter, and I mainly
>>>> referred to it.
>>>>
>>>>>>> +#else
>>>>>>>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>  				       migratetype, alloc_flags);
>>>>>>>  		if (likely(page))
>>>>>>>  			goto out;
>>>>>>> +#endif
>>>>>>>  	}
>>>>>>>
>>>>>>>  	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>
>>>>>> Thanks
>>>>>> Barry
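For reference, the condition the patch spreads across the two #ifdef
hunks above can be read as a single predicate. This is only a sketch
with a hypothetical helper name; the real patch keeps the open-coded
#ifdef because HPAGE_PMD_ORDER is not usable when
CONFIG_TRANSPARENT_HUGEPAGE is disabled:

/*
 * Sketch of the condition the patch adds to rmqueue(), pulled into one
 * hypothetical helper for readability. The PCP fast path is skipped
 * only for a THP-sized request that is not allowed to take CMA pages,
 * because the single THP-sized PCP list may hold CMA pages.
 */
static inline bool pcp_usable_for_request(unsigned int order,
					  unsigned int alloc_flags)
{
	if (!IS_ENABLED(CONFIG_CMA))
		return true;	/* no CMA pages can be on the list */
	if (alloc_flags & ALLOC_CMA)
		return true;	/* caller is allowed to use CMA pages */
	return order != HPAGE_PMD_ORDER; /* only the THP list mixes types */
}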