Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn
References: <1717492460-19457-1-git-send-email-yangge1116@126.com> <2e3a3a3f-737c-ed01-f820-87efee0adc93@126.com> <9b227c9d-f59b-a8b0-b353-7876a56c0bde@126.com> <4482bf69-eb07-0ec9-f777-28ce40f96589@126.com>
From: yangge1116 <yangge1116@126.com>
Message-ID: <5177ce6d-6c75-482e-4fa8-749972f08ffe@126.com>
Date: Tue, 18 Jun 2024 13:49:57 +0800

On 2024/6/18 12:10 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 wrote:
>>
>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 wrote:
>>>>
>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>
>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>> On Tue, Jun 4, 2024 at 9:15 PM wrote:
>>>>>>>
>>>>>>> From: yangge
>>>>>>>
>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>>>>>> for THP-sized allocations"), the THP-sized PCP list no longer
>>>>>>> differentiates the migration type of its pages, so it's possible
>>>>>>> to get a CMA page from the list. In some
>>>>>>> cases this is not acceptable; for example, allocating a non-CMA
>>>>>>> page with the PF_MEMALLOC_PIN flag set may return a CMA page.
>>>>>>>
>>>>>>> This patch forbids allocating a non-CMA THP-sized page from the
>>>>>>> THP-sized PCP list to avoid the issue above.
>>>>>>
>>>>>> Could you please describe the impact on users in the commit log?
>>>>>
>>>>> If a large amount of CMA memory is configured in the system (for
>>>>> example, CMA memory accounts for 50% of system memory), starting a
>>>>> virtual machine with device passthrough will get stuck.
>>>>>
>>>>> While starting the virtual machine, pin_user_pages_remote(...,
>>>>> FOLL_LONGTERM, ...) is called to pin memory. If a page is in a CMA
>>>>> area, pin_user_pages_remote() will migrate it out of the CMA area
>>>>> because of the FOLL_LONGTERM flag. If non-movable allocation
>>>>> requests return CMA memory, pin_user_pages_remote() will enter an
>>>>> endless loop.
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked // endless loop in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages // the check always fails,
>>>>> so it keeps migrating
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>
>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>> allocation requests?
>>>>>
>>>>> Yes.
>>>>>
>>>>>> If so, will CMA somehow become unable to migrate, causing
>>>>>> cma_alloc() to fail?
>>>>>
>>>>> No, it will cause an endless loop in __gup_longterm_locked(). If a
>>>>> non-movable allocation request returns CMA memory,
>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to
>>>>> another CMA page, which is useless and causes the endless loop in
>>>>> __gup_longterm_locked().
>>>
>>> This is only one perspective. We also need to consider the impact on
>>> CMA itself. For example, when CMA is borrowed by THP, and we need to
>>> reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
>>> those pages out to ensure CMA's users can retrieve that contiguous
>>> memory. Currently, CMA's memory is occupied by non-movable pages,
>>> meaning we can't relocate them. As a result, cma_alloc() is more
>>> likely to fail.
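To make the failure mode concrete, below is a simplified sketch of the
retry loop in __gup_longterm_locked() (mm/gup.c). It is an
approximation, not the literal kernel source; the helpers' exact
signatures vary slightly across kernel versions.

/*
 * Simplified sketch of the longterm-pin retry loop; error handling and
 * unpinning details are trimmed.
 */
static long gup_longterm_locked_sketch(struct mm_struct *mm,
		unsigned long start, unsigned long nr_pages,
		struct page **pages, unsigned int gup_flags)
{
	unsigned int flags;
	int locked = 1;
	long rc;

	/* Sets PF_MEMALLOC_PIN: allocations made on our behalf must
	 * avoid CMA pageblocks. */
	flags = memalloc_pin_save();
	do {
		rc = __get_user_pages_locked(mm, start, nr_pages, pages,
					     &locked, gup_flags);
		if (rc <= 0)
			break;
		/*
		 * Unpins every page that is not longterm-pinnable (e.g. a
		 * page in a CMA pageblock), migrates it to a freshly
		 * allocated target page, and returns -EAGAIN so the pages
		 * are pinned and checked again.
		 *
		 * The bug: the migration target is allocated without
		 * ALLOC_CMA, yet the THP-sized PCP list can still hand
		 * back a page from a CMA pageblock. That page fails the
		 * pinnable check on the next iteration, so the loop never
		 * terminates.
		 */
		rc = check_and_migrate_movable_pages(rc, pages);
	} while (rc == -EAGAIN);
	memalloc_pin_restore(flags);

	return rc;
}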
>>>
>>>>>
>>>>> backtrace:
>>>>> pin_user_pages_remote
>>>>> ----__gup_longterm_locked // endless loop in this function
>>>>> --------__get_user_pages_locked
>>>>> --------check_and_migrate_movable_pages // the check always fails,
>>>>> so it keeps migrating
>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>
>>>>>>>
>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>> THP-sized allocations")
>>>>>>> Signed-off-by: yangge
>>>>>>> ---
>>>>>>>  mm/page_alloc.c | 10 ++++++++++
>>>>>>>  1 file changed, 10 insertions(+)
>>>>>>>
>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>> --- a/mm/page_alloc.c
>>>>>>> +++ b/mm/page_alloc.c
>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>  	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>
>>>>>>>  	if (likely(pcp_allowed_order(order))) {
>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>> +				order != HPAGE_PMD_ORDER) {
>>>>>>> +			page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>> +					migratetype, alloc_flags);
>>>>>>> +			if (likely(page))
>>>>>>> +				goto out;
>>>>>>> +		}
>>>>>>
>>>>>> This seems not ideal, because non-CMA THP allocations get no
>>>>>> chance to use the PCP. But it still seems better than causing CMA
>>>>>> allocations to fail.
>>>>>>
>>>>>> Is there a possible approach to avoid adding CMA THPs to the PCP
>>>>>> in the first place? Otherwise, we might need a separate PCP list
>>>>>> for CMA.
>>>>
>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, so not
>>>> adding CMA THPs to the PCP may incur a slight performance penalty.
>>>
>>> But the majority of movable pages aren't CMA, right? Do we have an
>>> estimate for adding back a CMA THP PCP list? Will per_cpu_pages grow
>>> into a new cacheline, which the original intention for THP was to
>>> avoid by having only one PCP list [1]?
>>>
>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>
>> The size of struct per_cpu_pages is 256 bytes in the current code,
>> which contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>> list for THP-sized allocations"):
>>
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>     spinlock_t lock;
>>     int count;
>>     int high;
>>     int high_min;
>>     int high_max;
>>     int batch;
>>     u8 flags;
>>     u8 alloc_factor;
>>     u8 expire;
>>     short free_count;
>>     struct list_head lists[13];
>> }
>> SIZE: 256
>>
>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>> list for THP-sized allocations"), the size of struct per_cpu_pages is
>> 272 bytes:
>>
>> crash> struct per_cpu_pages
>> struct per_cpu_pages {
>>     spinlock_t lock;
>>     int count;
>>     int high;
>>     int high_min;
>>     int high_max;
>>     int batch;
>>     u8 flags;
>>     u8 alloc_factor;
>>     u8 expire;
>>     short free_count;
>>     struct list_head lists[15];
>> }
>> SIZE: 272
>>
>> So commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>> THP-sized allocations") seems to save one cacheline.
>
> The proposal is not to revert the patch but to add one CMA PCP list,
> i.e. "struct list_head lists[14]"; in that case, is the size still
> 256?

Yes, if we only add one CMA PCP list, the size is still 256. Adding one
CMA PCP list seems more reasonable; see the sketch after the quoted
patch below.

>
>>
>>>
>>>> Commit 1d91df85f399 takes a similar approach to filter, and I
>>>> mainly refer to it.
>>>>
>>>>>>> +#else
>>>>>>>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>  				migratetype, alloc_flags);
>>>>>>>  		if (likely(page))
>>>>>>>  			goto out;
>>>>>>> +#endif
>>>>>>>  	}
>>>>>>>
>>>>>>>  	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>> --
>>>>>>> 2.7.4
>>>>>>
>>>>>> Thanks
>>>>>> Barry
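As promised above, here is a rough sketch of the "one extra CMA PCP
list" idea, modelled on order_to_pindex() in mm/page_alloc.c. The
NR_PCP_THP == 2 split and the is_migrate_cma() test are assumptions for
illustration, not necessarily the final upstream change; with lists[]
growing from 13 to 14 entries, struct per_cpu_pages stays at 256 bytes
per the sizes quoted above.

/* Assumed list layout: one extra PCP list so CMA and non-CMA THPs
 * never share a list. */
#define NR_PCP_THP		2	/* was 1; now non-CMA + CMA */
#define NR_LOWORDER_PCP_LISTS	(MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
#define NR_PCP_LISTS		(NR_LOWORDER_PCP_LISTS + NR_PCP_THP)

static inline unsigned int order_to_pindex(int migratetype, int order)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order > PAGE_ALLOC_COSTLY_ORDER) {
		VM_BUG_ON(order != HPAGE_PMD_ORDER);
		/*
		 * THP-sized pages get two dedicated lists: CMA pages go
		 * to the last list, everything else to the one before
		 * it, so a non-CMA request never dequeues a page from a
		 * CMA pageblock.
		 */
		return NR_LOWORDER_PCP_LISTS +
		       (is_migrate_cma(migratetype) ? 1 : 0);
	}
#else
	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
#endif
	return (MIGRATE_PCPTYPES * order) + migratetype;
}

With a split along these lines, rmqueue() could take the PCP path
unconditionally again, since a request made without ALLOC_CMA would map
to the non-CMA THP list.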