Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn
References: <1717492460-19457-1-git-send-email-yangge1116@126.com>
 <2e3a3a3f-737c-ed01-f820-87efee0adc93@126.com>
 <9b227c9d-f59b-a8b0-b353-7876a56c0bde@126.com>
From: yangge1116 <yangge1116@126.com>
Message-ID: <4482bf69-eb07-0ec9-f777-28ce40f96589@126.com>
Date: Tue, 18 Jun 2024 11:31:00 +0800

On 2024/6/18 09:55, Barry Song wrote:
> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 wrote:
>>
>> On 2024/6/17 20:47, yangge1116 wrote:
>>>
>>> On 2024/6/17 18:26, Barry Song wrote:
>>>> On Tue, Jun 4, 2024 at 9:15 PM wrote:
>>>>>
>>>>> From: yangge
>>>>>
>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>> of pages in the THP-sized PCP list, it's possible to get a CMA page
>>>>> from the list. In some cases this is not acceptable; for example,
>>>>> allocating a non-CMA page with the PF_MEMALLOC_PIN flag may return
>>>>> a CMA page.
>>>>>
>>>>> The patch forbids allocating a non-CMA THP-sized page from the
>>>>> THP-sized PCP list to avoid the issue above.
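A note to make the problem statement concrete: under PF_MEMALLOC_PIN the
allocator strips __GFP_MOVABLE in current_gfp_context(), so the request is
no longer MIGRATE_MOVABLE and never gets ALLOC_CMA. Roughly, as a
simplified sketch rather than the exact mainline helpers
(sketch_pin_alloc_flags() is an illustrative name):

	/*
	 * Simplified sketch of how a pinned allocation loses ALLOC_CMA.
	 * The real logic is split between current_gfp_context() and the
	 * ALLOC_CMA handling in mm/page_alloc.c.
	 */
	static unsigned int sketch_pin_alloc_flags(gfp_t gfp, unsigned int alloc_flags)
	{
		if (current->flags & PF_MEMALLOC_PIN)
			gfp &= ~__GFP_MOVABLE;	/* pinned pages must stay out of CMA */
		if (IS_ENABLED(CONFIG_CMA) &&
		    gfp_migratetype(gfp) == MIGRATE_MOVABLE)
			alloc_flags |= ALLOC_CMA;
		return alloc_flags;
	}

Such a request then reaches rmqueue() without ALLOC_CMA, yet the shared
THP-sized PCP list can still hand it a CMA page.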
>>>>
>>>> Could you please describe the impact on users in the commit log?
>>>
>>> If a large amount of CMA memory is configured in the system (for
>>> example, CMA memory accounts for 50% of system memory), starting a
>>> virtual machine with device passthrough will get stuck.
>>>
>>> While starting the virtual machine, pin_user_pages_remote(...,
>>> FOLL_LONGTERM, ...) is called to pin memory. If a page is in a CMA
>>> area, pin_user_pages_remote() will migrate it out of the CMA area
>>> because of the FOLL_LONGTERM flag. If non-movable allocation requests
>>> keep returning CMA memory, pin_user_pages_remote() enters an endless
>>> loop.
>>>
>>> backtrace:
>>> pin_user_pages_remote
>>> ----__gup_longterm_locked // endless loop in this function
>>> --------__get_user_pages_locked
>>> --------check_and_migrate_movable_pages // the check always fails and
>>> migration is retried
>>> ------------migrate_longterm_unpinnable_pages
>>> ----------------alloc_migration_target // non-movable allocation
>>>
>>>> Is it possible that some CMA memory might be used by non-movable
>>>> allocation requests?
>>>
>>> Yes.
>>>
>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>> to fail?
>>>
>>> No, it will cause endless loops in __gup_longterm_locked(). If a
>>> non-movable allocation request returns CMA memory,
>>> migrate_longterm_unpinnable_pages() migrates one CMA page to another
>>> CMA page, which is useless and causes an endless loop in
>>> __gup_longterm_locked().
>
> This is only one perspective. We also need to consider the impact on
> CMA itself. For example, when CMA is borrowed by THP, and we need to
> reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
> those pages out to ensure CMA's users can retrieve that contiguous
> memory. Currently, CMA's memory is occupied by non-movable pages,
> meaning we can't relocate them. As a result, cma_alloc() is more
> likely to fail.
>
>>>
>>> backtrace:
>>> pin_user_pages_remote
>>> ----__gup_longterm_locked // endless loop in this function
>>> --------__get_user_pages_locked
>>> --------check_and_migrate_movable_pages // the check always fails and
>>> migration is retried
>>> ------------migrate_longterm_unpinnable_pages
>>>
>>>>>
>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations")
>>>>> Signed-off-by: yangge
>>>>> ---
>>>>>  mm/page_alloc.c | 10 ++++++++++
>>>>>  1 file changed, 10 insertions(+)
>>>>>
>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>> index 2e22ce5..0bdf471 100644
>>>>> --- a/mm/page_alloc.c
>>>>> +++ b/mm/page_alloc.c
>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>
>>>>>         if (likely(pcp_allowed_order(order))) {
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>> +                               order != HPAGE_PMD_ORDER) {
>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>> +                                       migratetype, alloc_flags);
>>>>> +                       if (likely(page))
>>>>> +                               goto out;
>>>>> +               }
>>>>
>>>> This seems not ideal, because non-CMA THP gets no chance to use the
>>>> PCP. But it still seems better than causing CMA allocations to fail.
>>>>
>>>> Is there a possible approach to avoid adding CMA THP to the PCP from
>>>> the very beginning? Otherwise, we might need a separate PCP for CMA.
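Stepping back from the diff formatting, the added condition boils down
to the following predicate (a sketch only; use_pcp_fast_path() is an
illustrative name, not a helper the patch adds):

	/*
	 * Sketch of the guard placed in front of the PCP fast path in
	 * rmqueue(): skip the PCP only for THP-sized requests that must
	 * not receive CMA pages (ALLOC_CMA absent).
	 */
	static inline bool use_pcp_fast_path(unsigned int order,
					     unsigned int alloc_flags)
	{
		if (!IS_ENABLED(CONFIG_CMA))
			return true;	/* no CMA pages can be on the list */
		if (alloc_flags & ALLOC_CMA)
			return true;	/* request may take CMA pages anyway */
		return order != HPAGE_PMD_ORDER; /* only the THP list mixes migratetypes */
	}

The lower-order PCP lists are still looked up by the request's
migratetype, so an unmovable request never draws from the movable list
where freed CMA pages sit; only the single THP-sized list is shared.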
>>
>> The vast majority of THP-sized allocations are GFP_MOVABLE; avoiding
>> adding CMA THP to the PCP may incur a slight performance penalty.
>
> But the majority of movable pages aren't CMA, right? Do we have an
> estimate for adding back a CMA THP PCP? Will per_cpu_pages grow by a
> new cacheline, which having only one THP PCP list was originally
> intended to avoid [1]?
>
> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>

The size of struct per_cpu_pages is 256 bytes in the current code
containing commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
for THP-sized allocations").

crash> struct per_cpu_pages
struct per_cpu_pages {
    spinlock_t lock;
    int count;
    int high;
    int high_min;
    int high_max;
    int batch;
    u8 flags;
    u8 alloc_factor;
    u8 expire;
    short free_count;
    struct list_head lists[13];
}
SIZE: 256

After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
list for THP-sized allocations"), the size of struct per_cpu_pages is
272 bytes.

crash> struct per_cpu_pages
struct per_cpu_pages {
    spinlock_t lock;
    int count;
    int high;
    int high_min;
    int high_max;
    int batch;
    u8 flags;
    u8 alloc_factor;
    u8 expire;
    short free_count;
    struct list_head lists[15];
}
SIZE: 272

It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
THP-sized allocations") saves one cacheline.

>
>> Commit 1d91df85f399 takes a similar filtering approach, and I mainly
>> referred to it.
>>
>>>>> +#else
>>>>>                 page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>                                 migratetype, alloc_flags);
>>>>>                 if (likely(page))
>>>>>                         goto out;
>>>>> +#endif
>>>>>         }
>>>>>
>>>>>         page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>> --
>>>>> 2.7.4
>>>>
>>>> Thanks
>>>> Barry
>>>>
>>
>>
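P.S. A quick sanity check of the cacheline observation above, assuming
64-byte cachelines (the exact value is SMP_CACHE_BYTES for a given
build):

	256 / 64 = 4 cachelines              (lists[13], single THP list)
	272 / 64 = 4.25, i.e. 5 cachelines   (lists[15], after the revert)

So restoring per-migratetype THP lists adds two struct list_head
entries and pushes struct per_cpu_pages out of four cachelines into a
fifth.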