From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn
References: <1717492460-19457-1-git-send-email-yangge1116@126.com> <2e3a3a3f-737c-ed01-f820-87efee0adc93@126.com> <9b227c9d-f59b-a8b0-b353-7876a56c0bde@126.com> <4482bf69-eb07-0ec9-f777-28ce40f96589@126.com> <69414410-4e2d-c04c-6fc3-9779f9377cf2@126.com>
From: yangge1116
Date: Tue, 18 Jun 2024 15:51:59 +0800
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
On 2024/6/18 2:58 PM, Barry Song wrote:
> On Tue, Jun 18, 2024 at 6:56 PM yangge1116 wrote:
>>
>> On 2024/6/18 12:10 PM, Barry Song wrote:
>>> On Tue, Jun 18, 2024 at 3:32 PM yangge1116 wrote:
>>>>
>>>> On 2024/6/18 9:55 AM, Barry Song wrote:
>>>>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 wrote:
>>>>>>
>>>>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
>>>>>>>
>>>>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
>>>>>>>> On Tue, Jun 4, 2024 at 9:15 PM wrote:
>>>>>>>>>
>>>>>>>>> From: yangge
>>>>>>>>>
>>>>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>> THP-sized allocations") no longer differentiates the migration type
>>>>>>>>> of pages in the THP-sized PCP list, it is possible to get a CMA page
>>>>>>>>> from that list. In some cases this is not acceptable; for example,
>>>>>>>>> allocating a non-CMA page with the PF_MEMALLOC_PIN flag may return a
>>>>>>>>> CMA page.
>>>>>>>>>
>>>>>>>>> This patch forbids allocating a non-CMA THP-sized page from the
>>>>>>>>> THP-sized PCP list to avoid the issue above.
>>>>>>>>
>>>>>>>> Could you please describe the impact on users in the commit log?
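For illustration, the failure mode can be modeled as a toy retry loop (a userspace sketch, not kernel code; the encoding of "CMA page" as a negative value and all function names here are hypothetical):

```c
#include <stdbool.h>

/* Hypothetical encoding: negative "page" values stand for CMA pages. */
static bool page_is_cma(int page) { return page < 0; }

/* Toy migration-target allocator: if "broken", it hands back a CMA
 * page again, mimicking a non-movable allocation served from CMA. */
static int alloc_migration_target_toy(bool broken) { return broken ? -1 : 1; }

/* Model of the FOLL_LONGTERM retry: pinned pages must not sit in CMA,
 * so CMA pages are migrated until a non-CMA page comes back. If the
 * allocator can return CMA pages, the loop never terminates. */
static int longterm_pin_attempts(bool allocator_may_return_cma, int max_tries)
{
	int page = -42; /* start with a CMA page that must be migrated */
	int tries = 0;

	while (page_is_cma(page) && tries < max_tries) {
		page = alloc_migration_target_toy(allocator_may_return_cma);
		tries++;
	}
	return tries; /* == max_tries means we gave up (endless loop) */
}
```

With a correct allocator one migration suffices; with the broken one the loop only stops at the artificial retry cap.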
>>>>>>>
>>>>>>> If a large amount of CMA memory is configured in the system (for
>>>>>>> example, CMA memory accounts for 50% of system memory), starting a
>>>>>>> virtual machine with device passthrough will get stuck.
>>>>>>>
>>>>>>> While starting the virtual machine, it calls pin_user_pages_remote(...,
>>>>>>> FOLL_LONGTERM, ...) to pin memory. If a page is in a CMA area,
>>>>>>> pin_user_pages_remote() will migrate the page from the CMA area to a
>>>>>>> non-CMA area because of the FOLL_LONGTERM flag. If non-movable
>>>>>>> allocation requests return CMA memory, pin_user_pages_remote() will
>>>>>>> loop endlessly.
>>>>>>>
>>>>>>> backtrace:
>>>>>>> pin_user_pages_remote
>>>>>>> ----__gup_longterm_locked               // loops endlessly here
>>>>>>> --------__get_user_pages_locked
>>>>>>> --------check_and_migrate_movable_pages // the check always fails, so
>>>>>>>                                         // it keeps trying to migrate
>>>>>>> ------------migrate_longterm_unpinnable_pages
>>>>>>> ----------------alloc_migration_target  // non-movable allocation
>>>>>>>
>>>>>>>> Is it possible that some CMA memory might be used by non-movable
>>>>>>>> allocation requests?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>> If so, will CMA somehow become unable to migrate, causing cma_alloc()
>>>>>>>> to fail?
>>>>>>>
>>>>>>> No, it will cause an endless loop in __gup_longterm_locked(). If
>>>>>>> non-movable allocation requests return CMA memory,
>>>>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to another
>>>>>>> CMA page, which is useless and causes an endless loop in
>>>>>>> __gup_longterm_locked().
>>>>>
>>>>> This is only one perspective. We also need to consider the impact on
>>>>> CMA itself. For example, when CMA is borrowed by THP and we need to
>>>>> reclaim it through cma_alloc() or dma_alloc_coherent(), we must move
>>>>> those pages out to ensure CMA's users can retrieve that contiguous
>>>>> memory.
>>>>>
>>>>> Currently, CMA's memory is occupied by non-movable pages, meaning we
>>>>> can't relocate them. As a result, cma_alloc() is more likely to fail.
>>>>>
>>>>>>>>>
>>>>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for
>>>>>>>>> THP-sized allocations")
>>>>>>>>> Signed-off-by: yangge
>>>>>>>>> ---
>>>>>>>>>  mm/page_alloc.c | 10 ++++++++++
>>>>>>>>>  1 file changed, 10 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>>>>>>> index 2e22ce5..0bdf471 100644
>>>>>>>>> --- a/mm/page_alloc.c
>>>>>>>>> +++ b/mm/page_alloc.c
>>>>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone *preferred_zone,
>>>>>>>>>  	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>>>>>>>>>
>>>>>>>>>  	if (likely(pcp_allowed_order(order))) {
>>>>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>>>>>> +		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
>>>>>>>>> +		    order != HPAGE_PMD_ORDER) {
>>>>>>>>> +			page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>> +					       migratetype, alloc_flags);
>>>>>>>>> +			if (likely(page))
>>>>>>>>> +				goto out;
>>>>>>>>> +		}
>>>>>>>>
>>>>>>>> This seems not ideal, because non-CMA THP gets no chance to use the
>>>>>>>> PCP. But it still seems better than causing CMA allocation failures.
>>>>>>>>
>>>>>>>> Is there a possible approach to avoid adding CMA THP into the PCP in
>>>>>>>> the first place? Otherwise, we might need a separate PCP for CMA.
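For reference, the gate the patch adds in rmqueue() can be sketched as a standalone predicate (a minimal userspace sketch, not kernel code; the ALLOC_CMA bit value and HPAGE_PMD_ORDER == 9 — x86-64 with 4K pages — are assumptions for illustration):

```c
#include <stdbool.h>

/* Illustrative values, not the kernel's definitions. */
#define ALLOC_CMA       0x80u
#define HPAGE_PMD_ORDER 9

/* Return true if the PCP fast path may serve this request. */
static bool thp_pcp_allowed(bool config_cma, unsigned int alloc_flags,
			    unsigned int order)
{
	/*
	 * Skip the PCP only when CMA is compiled in, the caller may NOT
	 * use CMA pageblocks, and the request is exactly THP-sized --
	 * the one case where a stray CMA page on the shared THP list
	 * could be handed to a non-CMA (e.g. PF_MEMALLOC_PIN) request.
	 */
	return !config_cma || (alloc_flags & ALLOC_CMA) ||
	       order != HPAGE_PMD_ORDER;
}
```

Only the (CMA enabled, non-CMA caller, THP-sized) combination falls through to the buddy allocator; everything else keeps the PCP fast path.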
>>>>>>>>
>>>>>>
>>>>>> The vast majority of THP-sized allocations are GFP_MOVABLE; avoiding
>>>>>> adding CMA THP into the PCP may incur a slight performance penalty.
>>>>>>
>>>>>
>>>>> But the majority of movable pages aren't CMA, right?
>>>>>
>>>>> Do we have an estimate for adding back a CMA THP PCP? Will
>>>>> per_cpu_pages introduce a new cacheline, which the original intention
>>>>> for THP was to avoid by having only one PCP [1]?
>>>>>
>>>>> [1] https://patchwork.kernel.org/project/linux-mm/patch/20220624125423.6126-3-mgorman@techsingularity.net/
>>>>>
>>>>
>>>> The size of struct per_cpu_pages is 256 bytes in the current code
>>>> containing commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>>> for THP-sized allocations").
>>>> crash> struct per_cpu_pages
>>>> struct per_cpu_pages {
>>>>     spinlock_t lock;
>>>>     int count;
>>>>     int high;
>>>>     int high_min;
>>>>     int high_max;
>>>>     int batch;
>>>>     u8 flags;
>>>>     u8 alloc_factor;
>>>>     u8 expire;
>>>>     short free_count;
>>>>     struct list_head lists[13];
>>>> }
>>>> SIZE: 256
>>>>
>>>> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
>>>> list for THP-sized allocations"), the size of struct per_cpu_pages is
>>>> 272 bytes.
>>>> crash> struct per_cpu_pages
>>>> struct per_cpu_pages {
>>>>     spinlock_t lock;
>>>>     int count;
>>>>     int high;
>>>>     int high_min;
>>>>     int high_max;
>>>>     int batch;
>>>>     u8 flags;
>>>>     u8 alloc_factor;
>>>>     u8 expire;
>>>>     short free_count;
>>>>     struct list_head lists[15];
>>>> }
>>>> SIZE: 272
>>>>
>>>> It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
>>>> for THP-sized allocations") saves one cacheline.
>>>
>>> The proposal is not to revert the patch but to add one CMA PCP list,
>>> so it would be "struct list_head lists[14]"; in this case, is the size
>>> still 256?
>>>
>>
>> Yes, the size is still 256. If we add one PCP list, we will have 2 PCP
>> lists for THP.
>> One PCP list is used by MIGRATE_UNMOVABLE, and the other PCP list is
>> used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is that right?
>
> I am not quite sure about MIGRATE_RECLAIMABLE, as we want CMA to be
> used only by movable allocations. So it might be: MOVABLE and
> NON-MOVABLE.

One PCP list used by UNMOVABLE pages and the other PCP list used by
MOVABLE pages seems feasible. The UNMOVABLE PCP list would contain
MIGRATE_UNMOVABLE and MIGRATE_RECLAIMABLE pages, and the MOVABLE PCP
list would contain MIGRATE_MOVABLE pages.

>
>>
>>>
>>>>
>>>>>
>>>>>> Commit 1d91df85f399 takes a similar filtering approach, and I
>>>>>> mainly referred to it.
>>>>>>
>>>>>>>>> +#else
>>>>>>>>>  		page = rmqueue_pcplist(preferred_zone, zone, order,
>>>>>>>>>  				migratetype, alloc_flags);
>>>>>>>>>  		if (likely(page))
>>>>>>>>>  			goto out;
>>>>>>>>> +#endif
>>>>>>>>>  	}
>>>>>>>>>
>>>>>>>>>  	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
>>>>>>>>> --
>>>>>>>>> 2.7.4
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Barry
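The sizing argument and the proposed list split above can be checked with a rough userspace model (a sketch, not the kernel struct: spinlock_t is approximated by an int, the index function is hypothetical, and the exact 256/272-byte figures from crash depend on kernel config and a 64-bit layout):

```c
/* Userspace model of struct list_head (two pointers, 16 bytes on LP64). */
struct list_head { void *next, *prev; };

/* Scalar fields of per_cpu_pages as shown in the crash dump above. */
struct pcp_head {
	int lock; /* spinlock_t approximated by an int */
	int count, high, high_min, high_max, batch;
	unsigned char flags, alloc_factor, expire;
	short free_count;
};

struct pcp13 { struct pcp_head h; struct list_head lists[13]; }; /* current */
struct pcp14 { struct pcp_head h; struct list_head lists[14]; }; /* +1 THP list */
struct pcp15 { struct pcp_head h; struct list_head lists[15]; }; /* full revert */

/*
 * Proposed split of the single THP PCP list, mirroring the discussion:
 * MOVABLE THPs get their own list; UNMOVABLE and RECLAIMABLE share the
 * other. Migratetype values follow the kernel's enum order; the index
 * function itself is a hypothetical sketch.
 */
enum { MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE };

static int thp_pcp_list_index(int migratetype)
{
	return migratetype == MIGRATE_MOVABLE ? 0 : 1;
}
```

In this model, each extra list costs one struct list_head (16 bytes on 64-bit), so going from 13 to 14 lists stays within 256 bytes, while the full revert to 15 lists crosses into a fifth 64-byte cacheline, matching the 256- vs 272-byte crash figures quoted above.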