From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yang Shi <shy828301@gmail.com>
Date: Fri, 12 Apr 2024 15:29:29 -0700
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list
To: Zi Yan
Cc: David Hildenbrand, linux-mm@kvack.org, Andrew Morton, Matthew Wilcox,
 Ryan Roberts, Barry Song <21cnbao@gmail.com>, linux-kernel@vger.kernel.org
References: <20240411153232.169560-1-zi.yan@sent.com>
 <2C698A64-268C-4E43-9EDE-6238B656A391@nvidia.com>

On Fri, Apr 12, 2024 at 2:06 PM Zi Yan wrote:
>
> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>
> > On 12.04.24 16:35, Zi Yan wrote:
> >> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >>
> >>> On 11.04.24 17:32, Zi Yan wrote:
> >>>> From: Zi Yan
> >>>>
> >>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>> if any page in a folio loses its final mapping. It is possible that
> >>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>> adding a folio to deferred split list.
> >>>>
> >>>> Signed-off-by: Zi Yan
> >>>> ---
> >>>>  mm/rmap.c | 9 ++++++---
> >>>>  1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 2608c40dffad..d599a772e282 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>                  enum rmap_level level)
> >>>>  {
> >>>>          atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>> -        int last, nr = 0, nr_pmdmapped = 0;
> >>>> +        int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>          enum node_stat_item idx;
> >>>>
> >>>>          __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>                          break;
> >>>>                  }
> >>>>
> >>>> -                atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>> +                mapcount = atomic_sub_return(nr_pages,
> >>>> +                                             &folio->_large_mapcount) + 1;
> >>>
> >>> That becomes a new memory barrier on some archs. Rather just re-read it
> >>> below. Re-reading should be fine here.
> >>
> >> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> >> below, but to save an atomic op, I chose to read mapcount here.
> >
> > Some points:
> >
> > (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> > RMW that return a value -- and how they interact with memory barriers.
> > Further, how relaxed variants are only optimized on some architectures.
> >
> > atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> > access that should not be refetched. Usually cheaper than most other stuff
> > that involves atomics.
>
> I should have checked the actual implementation instead of being fooled
> by the name. Will read about it. Thanks.
>
> >
> > (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> > to figure out if the folio is now completely unmapped.
> >
> > (3) There is one fundamental issue: if we are not batch-unmapping the whole
> > thing, we will still add the folios to the deferred split queue. Migration
> > would still do that, or if there are multiple VMAs covering a folio.
> >
> > (4) We should really avoid making common operations slower only to make
> > some unreliable stats less unreliable.
> >
> >
> > We should likely do something like the following, which might even be a bit
> > faster in some cases because we avoid a function call in case we unmap
> > individual PTEs by checking _deferred_list ahead of time:
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 2608c40dffad..356598b3dc3c 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >                   * page of the folio is unmapped and at least one page
> >                   * is still mapped.
> >                   */
> > -                if (folio_test_large(folio) && folio_test_anon(folio))
> > -                        if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> > -                                deferred_split_folio(folio);
> > +                if (folio_test_large(folio) && folio_test_anon(folio) &&
> > +                    (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> > +                    atomic_read(mapped) &&
> > +                    data_race(list_empty(&folio->_deferred_list)))
>
> data_race() might not be needed, as Ryan pointed out[1].
>
> > +                        deferred_split_folio(folio);
> >          }
> >
> > I also thought about handling the scenario where we unmap the whole
> > thing in smaller chunks. We could detect "!atomic_read(mapped)" and
> > detect that it is on the deferred split list, and simply remove it
> > from that list, incrementing a THP_UNDO_DEFERRED_SPLIT_PAGE event.
> >
> > But it would be racy with concurrent remapping of the folio (might happen
> > with anon folios in corner cases, I guess).
> >
> > What we can do is the following, though:
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index dc30139590e6..f05cba1807f2 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> >          ds_queue = get_deferred_split_queue(folio);
> >          spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >          if (!list_empty(&folio->_deferred_list)) {
> > +                if (folio_test_pmd_mappable(folio))
> > +                        count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> >                  ds_queue->split_queue_len--;
> >                  list_del_init(&folio->_deferred_list);
> >          }
> >
> > Adding the right event, of course.
> >
> >
> > Then it's easy to filter out these "temporarily added to the list, but never
> > split before the folio was freed" cases.
>
> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE? That should work.

It is definitely possible that a THP on the deferred split queue is
freed instead of split: for example, 1M of a 2M THP is unmapped, then
the remaining 1M is unmapped later, or the process exits before memory
pressure happens. So how can we tell it was only "temporarily added to
the list"? In that case THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
really just counts how many pages are still on the deferred split queue.
That may be useful. However, the counter is typically used to estimate
how many THPs were partially unmapped during a period of time, so we
just need to know the initial value and the value when we read it again.

>
> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
> splits, why not just move the counter to deferred_split_scan(), where the
> actual split happens? Or does the counter have a different meaning?

deferred_split_scan() / deferred_split_count() can only return the
number of pages on a specific queue (a specific node with a specific
memcg), whereas THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss
something? Or do you mean traversing all memcgs and all nodes? That
sounds like overkill.

>
>
> [1] https://lore.kernel.org/linux-mm/e3e14098-eade-483e-a459-e43200b87941@arm.com/

> --
> Best Regards,
> Yan, Zi
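
For readers following the thread outside the kernel tree, here is a minimal,
self-contained userspace sketch of the condition discussed above: queue a
folio for deferred split only when it is still partially mapped and not
already on the list. The names (toy_folio, maybe_queue_deferred_split,
unmap_pages) are hypothetical stand-ins, not the kernel implementation, and
the anon/large-folio and mapping-level checks are omitted.

/*
 * Toy model of the proposed check: a folio is queued for deferred split
 * only when some pages remain mapped and it is not already queued.
 * This is an illustration only; it is not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_folio {
        int nr_pages_mapped;    /* models folio->_nr_pages_mapped */
        bool on_deferred_list;  /* models !list_empty(&folio->_deferred_list) */
};

static int thp_deferred_split_page;  /* models the THP_DEFERRED_SPLIT_PAGE event */

static void maybe_queue_deferred_split(struct toy_folio *folio)
{
        /* Queue only when partially mapped and not queued yet. */
        if (folio->nr_pages_mapped > 0 && !folio->on_deferred_list) {
                folio->on_deferred_list = true;
                thp_deferred_split_page++;
        }
}

static void unmap_pages(struct toy_folio *folio, int nr_pages)
{
        folio->nr_pages_mapped -= nr_pages;
        maybe_queue_deferred_split(folio);
}

int main(void)
{
        struct toy_folio partially_unmapped = { .nr_pages_mapped = 512 };
        struct toy_folio fully_unmapped = { .nr_pages_mapped = 512 };

        unmap_pages(&partially_unmapped, 256);  /* still mapped: queued */
        unmap_pages(&fully_unmapped, 512);      /* fully unmapped: not queued */

        printf("THP_DEFERRED_SPLIT_PAGE=%d partial_on_list=%d full_on_list=%d\n",
               thp_deferred_split_page,
               partially_unmapped.on_deferred_list,
               fully_unmapped.on_deferred_list);
        return 0;
}

With this toy model only the partially unmapped folio ends up on the list,
which is the behavior the patch and David's variant aim for.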