Date: Wed, 19 Jun 2024 13:34:15 +0800
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
From: Ge Yang <yangge1116@126.com>
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn
References: <1717492460-19457-1-git-send-email-yangge1116@126.com> <2e3a3a3f-737c-ed01-f820-87efee0adc93@126.com> <9b227c9d-f59b-a8b0-b353-7876a56c0bde@126.com> <4482bf69-eb07-0ec9-f777-28ce40f96589@126.com> <69414410-4e2d-c04c-6fc3-9779f9377cf2@126.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 2024/6/18 15:51, yangge1116 wrote:
>
>
> On 2024/6/18 14:58, Barry Song wrote:
>> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 wrote:
>>>
>>>
>>>
>>> On 2024/6/18 12:10, Barry Song wrote:
>>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2024/6/18 09:55, Barry Song wrote:
>>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2024/6/17 20:47, yangge1116 wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 2024/6/17 18:26, Barry Song wrote:
>>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM wrote:
>>>>>>>>>>
>>>>>>>>>> From: yangge
>>>>>>>>>>
>>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>>>>> of pages in the THP-sized PCP list, it's possible to get a CMA page
>>>>>>>>>> from the list. In some cases this is not acceptable; for example,
>>>>>>>>>> allocating a non-CMA page with the PF_MEMALLOC_PIN flag can return
>>>>>>>>>> a CMA page.
>>>>>>>>>>
>>>>>>>>>> This patch forbids allocating a non-CMA THP-sized page from the
>>>>>>>>>> THP-sized PCP list to avoid the issue above.
>>>>>>>>>
>>>>>>>>> Could you please describe the impact on users in the commit log?
>>>>>>>>
>>>>>>>> If a large amount of CMA memory is configured in the system (for
>>>>>>>> example, CMA memory accounts for 50% of system memory), starting a
>>>>>>>> virtual machine with device passthrough will get stuck.
>>>>>>>>
>>>>>>>> While starting the virtual machine, it calls
>>>>>>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory. If a
>>>>>>>> page is in a CMA area, pin_user_pages_remote() will migrate the page
>>>>>>>> from the CMA area to a non-CMA area because of the FOLL_LONGTERM
>>>>>>>> flag. If non-movable allocation requests return CMA memory,
>>>>>>>> pin_user_pages_remote() will enter an endless loop.
>>>>>>>>
>>>>>>>> backtrace:
>>>>>>>> pin_user_pages_remote
>>>>>>>> ----__gup_longterm_locked // causes endless loops in this function
>>>>>>>> --------__get_user_pages_locked
>>>>>>>> --------check_and_migrate_movable_pages // the check always fails and
>>>>>>>>                                         // it continues to migrate
>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>> ----------------alloc_migration_target // non-movable allocation
>>>>>>>>
>>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>>> allocation requests?
>>>>>>>>
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>> If so, will CMA somehow become unable to migrate, causing
>>>>>>>>> cma_alloc() to fail?
>>>>>>>>
>>>>>>>> No, it will cause endless loops in __gup_longterm_locked(). If
>>>>>>>> non-movable allocation requests return CMA memory,
>>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to
>>>>>>>> another CMA page, which is useless and causes endless loops in
>>>>>>>> __gup_longterm_locked().
>>>>>>
>>>>>> This is only one perspective. We also need to consider the impact on
>>>>>> CMA itself.
>>>>>> For example, when CMA is borrowed by THP and we need to reclaim it
>>>>>> through cma_alloc() or dma_alloc_coherent(), we must move those pages
>>>>>> out to ensure CMA's users can retrieve that contiguous memory.
>>>>>>
>>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
>>>>>> can't relocate them. As a result, cma_alloc() is more likely to fail.
>>>>>>
>>>>>>>> backtrace:
>>>>>>>> pin_user_pages_remote
>>>>>>>> ----__gup_longterm_locked // causes endless loops in this function
>>>>>>>> --------__get_user_pages_locked
>>>>>>>> --------check_and_migrate_movable_pages // the check always fails and
>>>>>>>>                                         // it continues to migrate
>>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>>> THP-sized allocations")
>>>>>>>>>> Signed-off-by: yangge
>>>>>>>>>> ---
>>>>>>>>>>  mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>>  1 file changed, 10 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>>>>         WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>>>>
>>>>>>>>>>         if (likely(pcp_allowed_order(order))) {
>>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>>> +               if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>>>>> +                                               order != HPAGE_PMD_ORDER) {
>>>>>>>>>> +                       page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>> +                                               migratetype, alloc_flags);
>>>>>>>>>> +                       if (likely(page))
>>>>>>>>>> +                               goto out;
>>>>>>>>>> +               }
>>>>>>>>>
>>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use
>>>>>>>>> PCP. But it still seems better than causing the failure of CMA
>>>>>>>>> allocation.
>>>>>>>>>
>>>>>>>>> Is there a possible approach to avoiding adding CMA THP into pcp
>>>>>>>>> from the very beginning? Otherwise, we might need a separate PCP
>>>>>>>>> for CMA.
>>>>>>>>>
>>>>>>>
>>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE, so
>>>>>>> avoiding adding CMA THP into pcp may incur a slight performance
>>>>>>> penalty.
>>>>>>>
>>>>>>
>>>>>> But the majority of movable pages aren't CMA, right?
>>>>>
>>>>>> Do we have an estimate for adding back a CMA THP PCP? Will
>>>>>> per_cpu_pages introduce a new cacheline, which the original THP
>>>>>> change intentionally avoided by having only one PCP [1]?
>>>>>>
>>>>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>>
>>>>>
>>>>> The size of struct per_cpu_pages is 256 bytes in the current code
>>>>> containing commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>>>> for THP-sized allocations"):
>>>>> crash> struct per_cpu_pages
>>>>> struct per_cpu_pages {
>>>>>     spinlock_t lock;
>>>>>     int count;
>>>>>     int high;
>>>>>     int high_min;
>>>>>     int high_max;
>>>>>     int batch;
>>>>>     u8 flags;
>>>>>     u8 alloc_factor;
>>>>>     u8 expire;
>>>>>     short free_count;
>>>>>     struct list_head lists[13];
>>>>> }
>>>>> SIZE: 256
>>>>>
>>>>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>>>>> list for THP-sized allocations"), the size of struct per_cpu_pages is
>>>>> 272 bytes.
>>>>> crash> struct per_cpu_pages
>>>>> struct per_cpu_pages {
>>>>>     spinlock_t lock;
>>>>>     int count;
>>>>>     int high;
>>>>>     int high_min;
>>>>>     int high_max;
>>>>>     int batch;
>>>>>     u8 flags;
>>>>>     u8 alloc_factor;
>>>>>     u8 expire;
>>>>>     short free_count;
>>>>>     struct list_head lists[15];
>>>>> }
>>>>> SIZE: 272
>>>>>
>>>>> So commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>> THP-sized allocations") saved one cacheline.
>>>>
>>>> The proposal is not reverting the patch but adding one CMA pcp,
>>>> so it is "struct list_head lists[14];". In this case, is the size
>>>> still 256?
>>>>
>>>
>>> Yes, the size is still 256. If we add one PCP list, we will have 2 PCP
>>> lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the
>>> other PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is
>>> that right?
>>
>> I am not quite sure about MIGRATE_RECLAIMABLE, as we want CMA to be
>> used only by movable. So it might be:
>> MOVABLE and NON-MOVABLE.
>
> One PCP list is used by UNMOVABLE pages, and the other PCP list is used
> by MOVABLE pages; that seems feasible. The UNMOVABLE PCP list contains
> MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pages, and the MOVABLE PCP
> list contains MIGRATE_MOVABLE pages.
> Is the following modification feasible?
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define NR_PCP_THP 1
+#define NR_PCP_THP 2
+#define PCP_THP_MOVABLE 0
+#define PCP_THP_UNMOVABLE 1
 #else
 #define NR_PCP_THP 0
 #endif

 static inline unsigned int order_to_pindex(int migratetype, int order)
 {
+	int pcp_type = migratetype;
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
 		VM_BUG_ON(order != HPAGE_PMD_ORDER);
-		return NR_LOWORDER_PCP_LISTS;
+
+		if (migratetype != MIGRATE_MOVABLE)
+			pcp_type = PCP_THP_UNMOVABLE;
+		else
+			pcp_type = PCP_THP_MOVABLE;
+
+		return NR_LOWORDER_PCP_LISTS + pcp_type;
 	}
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
 #endif
-	return (MIGRATE_PCPTYPES * order) + migratetype;
+	return (MIGRATE_PCPTYPES * order) + pcp_type;
 }

@@ -521,7 +529,7 @@ static inline int pindex_to_order(unsigned int pindex)
 	int order = pindex / MIGRATE_PCPTYPES;

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (pindex == NR_LOWORDER_PCP_LISTS)
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		order = HPAGE_PMD_ORDER;
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);

>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>>> Commit 1d91df85f399 takes a similar filtering approach, and I
>>>>>>> mainly referred to it.
>>>>>>>
>>>>>>>>>> +#else
>>>>>>>>>>                 page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>>                                        migratetype, alloc_flags);
>>>>>>>>>>                 if (likely(page))
>>>>>>>>>>                         goto out;
>>>>>>>>>> +#endif
>>>>>>>>>>         }
>>>>>>>>>>
>>>>>>>>>>         page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>>>>> --
>>>>>>>>>> 2.7.4
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Barry
>>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>