From: Barry Song <21cnbao@gmail.com>
Date: Tue, 18 Jun 2024 18:58:47 +1200
Subject: Re: [PATCH] mm/page_alloc: skip THP-sized PCP list when allocating non-CMA THP-sized page
To: yangge1116
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, baolin.wang@linux.alibaba.com, liuzixing@hygon.cn
In-Reply-To: <69414410-4e2d-c04c-6fc3-9779f9377cf2@126.com>
References: <1717492460-19457-1-git-send-email-yangge1116@126.com> <2e3a3a3f-737c-ed01-f820-87efee0adc93@126.com> <9b227c9d-f59b-a8b0-b353-7876a56c0bde@126.com> <4482bf69-eb07-0ec9-f777-28ce40f96589@126.com>

On Tue, Jun 18, 2024 at 6:56 PM yangge1116 wrote:
>
> On 2024/6/18 12:10 PM, Barry Song wrote:
> > On Tue, Jun 18, 2024 at 3:32 PM yangge1116 wrote:
> >>
> >> On 2024/6/18 9:55 AM, Barry Song wrote:
> >>> On Tue, Jun 18, 2024 at 9:36 AM yangge1116 wrote:
> >>>>
> >>>> On 2024/6/17 8:47 PM, yangge1116 wrote:
> >>>>>
> >>>>> On 2024/6/17 6:26 PM, Barry Song wrote:
> >>>>>> On Tue, Jun 4, 2024 at 9:15 PM wrote:
> >>>>>>>
> >>>>>>> From: yangge
> >>>>>>>
> >>>>>>> Since commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
> >>>>>>> for THP-sized allocations") no longer differentiates the
> >>>>>>> migration type of pages on the THP-sized PCP list, it is possible
> >>>>>>> to get a CMA page from that list. In some cases this is not
> >>>>>>> acceptable; for example, an allocation of a non-CMA page with the
> >>>>>>> PF_MEMALLOC_PIN flag may return a CMA page.
> >>>>>>>
> >>>>>>> This patch forbids allocating a non-CMA THP-sized page from the
> >>>>>>> THP-sized PCP list to avoid the issue above.
> >>>>>>
> >>>>>> Could you please describe the impact on users in the commit log?
> >>>>>
> >>>>> If a large amount of CMA memory is configured in the system (for
> >>>>> example, CMA memory accounts for 50% of system memory), starting a
> >>>>> virtual machine with device passthrough will get stuck.
> >>>>>
> >>>>> While starting the virtual machine, it calls
> >>>>> pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory. If a
> >>>>> page is in a CMA area, pin_user_pages_remote() will migrate it out
> >>>>> of the CMA area because of the FOLL_LONGTERM flag. If non-movable
> >>>>> allocation requests return CMA memory, pin_user_pages_remote()
> >>>>> enters an endless loop.
> >>>>>
> >>>>> backtrace:
> >>>>> pin_user_pages_remote
> >>>>> ----__gup_longterm_locked // endless loop in this function
> >>>>> --------__get_user_pages_locked
> >>>>> --------check_and_migrate_movable_pages // the check always fails,
> >>>>>                                         // so it keeps migrating
> >>>>> ------------migrate_longterm_unpinnable_pages
> >>>>> ----------------alloc_migration_target // non-movable allocation
> >>>>>
> >>>>>> Is it possible that some CMA memory might be used by non-movable
> >>>>>> allocation requests?
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>> If so, will CMA somehow become unable to migrate, causing
> >>>>>> cma_alloc() to fail?
> >>>>>
> >>>>> No, it will cause an endless loop in __gup_longterm_locked(). If
> >>>>> non-movable allocation requests return CMA memory,
> >>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to
> >>>>> another CMA page, which is useless and causes the endless loop in
> >>>>> __gup_longterm_locked().
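
To make the failure mode concrete, here is a simplified sketch of the
retry logic. The function names come from the backtrace above, but the
loop structure is paraphrased rather than the exact kernel code:

        /* inside __gup_longterm_locked(), roughly: */
        do {
                /* pin the requested pages */
                nr_pinned = __get_user_pages_locked(...);

                /*
                 * Returns -EAGAIN while any pinned page still sits in
                 * CMA/movable memory: it unpins such pages and migrates
                 * them via migrate_longterm_unpinnable_pages(), whose
                 * destination pages come from alloc_migration_target(),
                 * a non-movable allocation. If that allocation is itself
                 * satisfied with a CMA page from the THP-sized PCP list,
                 * the next iteration's check fails again, so the loop
                 * never terminates.
                 */
                rc = check_and_migrate_movable_pages(nr_pinned, pages);
        } while (rc == -EAGAIN);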
If > >>>>> non-movable allocation requests return CMA memory, > >>>>> migrate_longterm_unpinnable_pages() will migrate a CMA page to anot= her > >>>>> CMA page, which is useless and cause endless loops in > >>>>> __gup_longterm_locked(). > >>> > >>> This is only one perspective. We also need to consider the impact on > >>> CMA itself. For example, > >>> when CMA is borrowed by THP, and we need to reclaim it through > >>> cma_alloc() or dma_alloc_coherent(), > >>> we must move those pages out to ensure CMA's users can retrieve that > >>> contiguous memory. > >>> > >>> Currently, CMA's memory is occupied by non-movable pages, meaning we > >>> can't relocate them. > >>> As a result, cma_alloc() is more likely to fail. > >>> > >>>>> > >>>>> backtrace: > >>>>> pin_user_pages_remote > >>>>> ----__gup_longterm_locked //cause endless loops in this function > >>>>> --------__get_user_pages_locked > >>>>> --------check_and_migrate_movable_pages //always check fail and con= tinue > >>>>> to migrate > >>>>> ------------migrate_longterm_unpinnable_pages > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>> > >>>>>>> Fixes: 5d0a661d808f ("mm/page_alloc: use only one PCP list for > >>>>>>> THP-sized allocations") > >>>>>>> Signed-off-by: yangge > >>>>>>> --- > >>>>>>> mm/page_alloc.c | 10 ++++++++++ > >>>>>>> 1 file changed, 10 insertions(+) > >>>>>>> > >>>>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c > >>>>>>> index 2e22ce5..0bdf471 100644 > >>>>>>> --- a/mm/page_alloc.c > >>>>>>> +++ b/mm/page_alloc.c > >>>>>>> @@ -2987,10 +2987,20 @@ struct page *rmqueue(struct zone > >>>>>>> *preferred_zone, > >>>>>>> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1)= ); > >>>>>>> > >>>>>>> if (likely(pcp_allowed_order(order))) { > >>>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > >>>>>>> + if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & > >>>>>>> ALLOC_CMA || > >>>>>>> + order !=3D > >>>>>>> HPAGE_PMD_ORDER) { > >>>>>>> + page =3D rmqueue_pcplist(preferred_zone, = zone, > >>>>>>> order, > >>>>>>> + migratetype, > >>>>>>> alloc_flags); > >>>>>>> + if (likely(page)) > >>>>>>> + goto out; > >>>>>>> + } > >>>>>> > >>>>>> This seems not ideal, because non-CMA THP gets no chance to use PC= P. > >>>>>> But it > >>>>>> still seems better than causing the failure of CMA allocation. > >>>>>> > >>>>>> Is there a possible approach to avoiding adding CMA THP into pcp f= rom > >>>>>> the first > >>>>>> beginning? Otherwise, we might need a separate PCP for CMA. > >>>>>> > >>>> > >>>> The vast majority of THP-sized allocations are GFP_MOVABLE, avoiding > >>>> adding CMA THP into pcp may incur a slight performance penalty. > >>>> > >>> > >>> But the majority of movable pages aren't CMA, right? > >> > >>> Do we have an estimate for > >>> adding back a CMA THP PCP? Will per_cpu_pages introduce a new cacheli= ne, which > >>> the original intention for THP was to avoid by having only one PCP[1]= ? > >>> > >>> [1] https://patchwork.kernel.org/project/linux-mm/patch/2022062412542= 3.6126-3-mgorman@techsingularity.net/ > >>> > >> > >> The size of struct per_cpu_pages is 256 bytes in current code containi= ng > >> commit 5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-siz= ed > >> allocations"). 
> >> The size of struct per_cpu_pages is 256 bytes in the current code,
> >> which contains commit 5d0a661d808f ("mm/page_alloc: use only one PCP
> >> list for THP-sized allocations"):
> >>
> >> crash> struct per_cpu_pages
> >> struct per_cpu_pages {
> >>     spinlock_t lock;
> >>     int count;
> >>     int high;
> >>     int high_min;
> >>     int high_max;
> >>     int batch;
> >>     u8 flags;
> >>     u8 alloc_factor;
> >>     u8 expire;
> >>     short free_count;
> >>     struct list_head lists[13];
> >> }
> >> SIZE: 256
> >>
> >> After reverting commit 5d0a661d808f ("mm/page_alloc: use only one PCP
> >> list for THP-sized allocations"), the size of struct per_cpu_pages is
> >> 272 bytes:
> >>
> >> crash> struct per_cpu_pages
> >> struct per_cpu_pages {
> >>     spinlock_t lock;
> >>     int count;
> >>     int high;
> >>     int high_min;
> >>     int high_max;
> >>     int batch;
> >>     u8 flags;
> >>     u8 alloc_factor;
> >>     u8 expire;
> >>     short free_count;
> >>     struct list_head lists[15];
> >> }
> >> SIZE: 272
> >>
> >> It seems commit 5d0a661d808f ("mm/page_alloc: use only one PCP list
> >> for THP-sized allocations") saves one cacheline.
> >
> > The proposal is not to revert the patch but to add one CMA PCP list,
> > so it would be "struct list_head lists[14];". In that case, is the
> > size still 256?
>
> Yes, the size is still 256. If we add one PCP list, we will have two
> PCP lists for THP. One PCP list is used by MIGRATE_UNMOVABLE, and the
> other PCP list is used by MIGRATE_MOVABLE and MIGRATE_RECLAIMABLE. Is
> that right?

I am not quite sure about MIGRATE_RECLAIMABLE, as CMA can only be used
by movable allocations. So it might be: MOVABLE and NON-MOVABLE.

> >>>> Commit 1d91df85f399 takes a similar filtering approach, and I
> >>>> mainly referred to it.
> >>>>
> >>>>>>> +#else
> >>>>>>>                 page = rmqueue_pcplist(preferred_zone, zone, order,
> >>>>>>>                                 migratetype, alloc_flags);
> >>>>>>>                 if (likely(page))
> >>>>>>>                         goto out;
> >>>>>>> +#endif
> >>>>>>>         }
> >>>>>>>
> >>>>>>>         page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
> >>>>>>> --
> >>>>>>> 2.7.4
> >>>>>>
> >>>>>> Thanks
> >>>>>> Barry
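
For illustration, here is a rough sketch of how the two THP lists
suggested above (movable and non-movable) might be indexed, extending
the kernel's existing order_to_pindex() helper. The low-order indexing
is the current scheme; the THP branch is hypothetical and not part of
this patch:

        static inline unsigned int order_to_pindex(int migratetype, int order)
        {
        #ifdef CONFIG_TRANSPARENT_HUGEPAGE
                if (order > PAGE_ALLOC_COSTLY_ORDER) {
                        VM_BUG_ON(order != HPAGE_PMD_ORDER);
                        /* lists[12] for non-movable THP, lists[13] for movable */
                        return NR_LOWORDER_PCP_LISTS +
                               (migratetype == MIGRATE_MOVABLE);
                }
        #endif
                /* current scheme: MIGRATE_PCPTYPES lists per order 0..3 */
                return (MIGRATE_PCPTYPES * order) + migratetype;
        }

This keeps lists[] at 14 entries, so per_cpu_pages would stay at 256
bytes while still letting non-CMA THP allocations use the PCP fast
path.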