From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yang Shi <shy828301@gmail.com>
Date: Fri, 12 Apr 2024 15:29:29 -0700
Subject: Re: [PATCH] mm/rmap: do not add fully unmapped large folio to deferred split list
To: Zi Yan
Cc: David Hildenbrand, linux-mm@kvack.org, Andrew Morton, Matthew Wilcox,
 Ryan Roberts, Barry Song <21cnbao@gmail.com>, linux-kernel@vger.kernel.org
References: <20240411153232.169560-1-zi.yan@sent.com>
 <2C698A64-268C-4E43-9EDE-6238B656A391@nvidia.com>

On Fri, Apr 12, 2024 at 2:06 PM Zi Yan wrote:
>
> On 12 Apr 2024, at 15:32, David Hildenbrand wrote:
>
> > On 12.04.24 16:35, Zi Yan wrote:
> >> On 11 Apr 2024, at 11:46, David Hildenbrand wrote:
> >>
> >>> On 11.04.24 17:32, Zi Yan wrote:
> >>>> From: Zi Yan
> >>>>
> >>>> In __folio_remove_rmap(), a large folio is added to deferred split list
> >>>> if any page in a folio loses its final mapping. It is possible that
> >>>> the folio is unmapped fully, but it is unnecessary to add the folio
> >>>> to deferred split list at all. Fix it by checking folio mapcount before
> >>>> adding a folio to deferred split list.
> >>>>
> >>>> Signed-off-by: Zi Yan
> >>>> ---
> >>>>  mm/rmap.c | 9 ++++++---
> >>>>  1 file changed, 6 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/mm/rmap.c b/mm/rmap.c
> >>>> index 2608c40dffad..d599a772e282 100644
> >>>> --- a/mm/rmap.c
> >>>> +++ b/mm/rmap.c
> >>>> @@ -1494,7 +1494,7 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>                  enum rmap_level level)
> >>>>  {
> >>>>          atomic_t *mapped = &folio->_nr_pages_mapped;
> >>>> -        int last, nr = 0, nr_pmdmapped = 0;
> >>>> +        int last, nr = 0, nr_pmdmapped = 0, mapcount = 0;
> >>>>          enum node_stat_item idx;
> >>>>
> >>>>          __folio_rmap_sanity_checks(folio, page, nr_pages, level);
> >>>> @@ -1506,7 +1506,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >>>>                          break;
> >>>>                  }
> >>>>
> >>>> -                atomic_sub(nr_pages, &folio->_large_mapcount);
> >>>> +                mapcount = atomic_sub_return(nr_pages,
> >>>> +                                             &folio->_large_mapcount) + 1;
> >>>
> >>> That becomes a new memory barrier on some archs. Rather just re-read it
> >>> below. Re-reading should be fine here.
> >>
> >> Would atomic_sub_return_relaxed() work? Originally I was using atomic_read(mapped)
> >> below, but to save an atomic op, I chose to read mapcount here.
> >
> > Some points:
> >
> > (1) I suggest reading about atomic get/set vs. atomic RMW vs. atomic
> > RMW that return a value -- and how they interact with memory barriers.
> > Further, how relaxed variants are only optimized on some architectures.
> >
> > atomic_read() is usually READ_ONCE(), which is just an "ordinary" memory
> > access that should not be refetched. Usually cheaper than most other stuff
> > that involves atomics.
>
> I should have checked the actual implementation instead of being fooled
> by the name. Will read about it. Thanks.
>
> >
> > (2) We can either use folio_large_mapcount() == 0 or !atomic_read(mapped)
> > to figure out if the folio is now completely unmapped.
> >
> > (3) There is one fundamental issue: if we are not batch-unmapping the whole
> > thing, we will still add the folios to the deferred split queue. Migration
> > would still do that, or if there are multiple VMAs covering a folio.
> >
> > (4) We should really avoid making common operations slower only to make
> > some unreliable stats less unreliable.
> >
> >
> > We should likely do something like the following, which might even be a bit
> > faster in some cases because we avoid a function call in case we unmap
> > individual PTEs by checking _deferred_list ahead of time:
> >
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 2608c40dffad..356598b3dc3c 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1553,9 +1553,11 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
> >                   * page of the folio is unmapped and at least one page
> >                   * is still mapped.
> >                   */
> > -                if (folio_test_large(folio) && folio_test_anon(folio))
> > -                        if (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped)
> > -                                deferred_split_folio(folio);
> > +                if (folio_test_large(folio) && folio_test_anon(folio) &&
> > +                    (level == RMAP_LEVEL_PTE || nr < nr_pmdmapped) &&
> > +                    atomic_read(mapped) &&
> > +                    data_race(list_empty(&folio->_deferred_list)))
>
> data_race() might not be needed, as Ryan pointed out[1].
>
> > +                        deferred_split_folio(folio);
> >          }
> >
> > I also thought about handling the scenario where we unmap the whole
> > thing in smaller chunks. We could detect "!atomic_read(mapped)" and
> > detect that it is on the deferred split list, and simply remove it
> > from that list, incrementing a THP_UNDO_DEFERRED_SPLIT_PAGE event.
> >
> > But it would be racy with concurrent remapping of the folio (might happen
> > with anon folios in corner cases, I guess).
> >
> > What we can do is the following, though:
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index dc30139590e6..f05cba1807f2 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3133,6 +3133,8 @@ void folio_undo_large_rmappable(struct folio *folio)
> >          ds_queue = get_deferred_split_queue(folio);
> >          spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> >          if (!list_empty(&folio->_deferred_list)) {
> > +                if (folio_test_pmd_mappable(folio))
> > +                        count_vm_event(THP_UNDO_DEFERRED_SPLIT_PAGE);
> >                  ds_queue->split_queue_len--;
> >                  list_del_init(&folio->_deferred_list);
> >          }
> >
> > Adding the right event, of course.
> >
> >
> > Then it's easy to filter out these "temporarily added to the list, but never
> > split before the folio was freed" cases.
>
> So instead of making THP_DEFERRED_SPLIT_PAGE precise, use
> THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE? That should work.

It is definitely possible that a THP on the deferred split queue is
freed instead of split: for example, 1M of a 2M THP is unmapped, then
the remaining 1M is unmapped later, or the process exits before memory
pressure happens. So how can we tell it was only "temporarily added to
the list"? In that case THP_DEFERRED_SPLIT_PAGE - THP_UNDO_DEFERRED_SPLIT_PAGE
really just counts how many pages are still on the deferred split queue.
That may be useful. However, the counter is typically used to estimate
how many THPs were partially unmapped during a period of time, so we
just need to know the initial value and the value when we read it again.

>
> I wonder what THP_DEFERRED_SPLIT_PAGE counts. If it counts THP deferred
> splits, why not just move the counter to deferred_split_scan(), where the
> actual split happens? Or does the counter have a different meaning?

deferred_split_scan() / deferred_split_count() can only return the
number of pages on a specific queue (a specific node with a specific
memcg), whereas THP_DEFERRED_SPLIT_PAGE is a global counter. Did I miss
something? Or do you mean traversing all memcgs and all nodes? That
sounds like overkill.

>
>
> [1] https://lore.kernel.org/linux-mm/e3e14098-eade-483e-a459-e43200b87941@arm.com/

> --
> Best Regards,
> Yan, Zi
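
For readers following the thread outside the kernel tree, here is a minimal,
self-contained userspace sketch of the condition discussed above: queue a
folio for deferred split only when it is still partially mapped and not
already on the list. The names (toy_folio, maybe_queue_deferred_split,
unmap_pages) are hypothetical stand-ins, not the kernel implementation, and
the anon/large-folio and mapping-level checks are omitted.

/*
 * Toy model of the proposed check: a folio is queued for deferred split
 * only when some pages remain mapped and it is not already queued.
 * This is an illustration only; it is not kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_folio {
        int nr_pages_mapped;    /* models folio->_nr_pages_mapped */
        bool on_deferred_list;  /* models !list_empty(&folio->_deferred_list) */
};

static int thp_deferred_split_page;  /* models the THP_DEFERRED_SPLIT_PAGE event */

static void maybe_queue_deferred_split(struct toy_folio *folio)
{
        /* Queue only when partially mapped and not queued yet. */
        if (folio->nr_pages_mapped > 0 && !folio->on_deferred_list) {
                folio->on_deferred_list = true;
                thp_deferred_split_page++;
        }
}

static void unmap_pages(struct toy_folio *folio, int nr_pages)
{
        folio->nr_pages_mapped -= nr_pages;
        maybe_queue_deferred_split(folio);
}

int main(void)
{
        struct toy_folio partially_unmapped = { .nr_pages_mapped = 512 };
        struct toy_folio fully_unmapped = { .nr_pages_mapped = 512 };

        unmap_pages(&partially_unmapped, 256);  /* still mapped: queued */
        unmap_pages(&fully_unmapped, 512);      /* fully unmapped: not queued */

        printf("THP_DEFERRED_SPLIT_PAGE=%d partial_on_list=%d full_on_list=%d\n",
               thp_deferred_split_page,
               partially_unmapped.on_deferred_list,
               fully_unmapped.on_deferred_list);
        return 0;
}

With this toy model only the partially unmapped folio ends up on the list,
which is the behavior the patch and David's variant aim for.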