From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 57AA7C54E49
	for <linux-mm@archiver.kernel.org>; Thu,  7 Mar 2024 11:27:04 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id C24E96B0162; Thu,  7 Mar 2024 06:27:03 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id BAD796B0163; Thu,  7 Mar 2024 06:27:03 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id A74C06B0164; Thu,  7 Mar 2024 06:27:03 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id 934D16B0162
	for <linux-mm@kvack.org>; Thu,  7 Mar 2024 06:27:03 -0500 (EST)
Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay05.hostedemail.com (Postfix) with ESMTP id 53BD041169
	for <linux-mm@kvack.org>; Thu,  7 Mar 2024 11:27:03 +0000 (UTC)
X-FDA: 81870016326.18.C907426
Received: from mail-vk1-f176.google.com (mail-vk1-f176.google.com [209.85.221.176])
	by imf30.hostedemail.com (Postfix) with ESMTP id 9A4D48000D
	for <linux-mm@kvack.org>; Thu,  7 Mar 2024 11:27:01 +0000 (UTC)
Authentication-Results: imf30.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=Yu94lV5X;
	spf=pass (imf30.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.221.176 as permitted sender) smtp.mailfrom=21cnbao@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1709810821;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=U5/UYxQ+V8iWd+wFm7wUJaiQxt6gTceVRV1QKSz0IOs=;
	b=0Eq57zl6wNfewUg1itHtxPkNcphd5dpjZS/7oVBNupkp2NelHBs2ixJrUpVSbxTrb1m6MA
	FNnJ1ql8J8QjJR8JoUuEFSF67aYNo3Y386hXrVNCp4ar3vfapzDbeFp044IMiBLB+vVSSs
	wg+wqj2cNKpzRFDzOUOrTCDneJAo8Xo=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709810821; a=rsa-sha256;
	cv=none;
	b=YqZ/OZefEYIW1NFNzo2JWv6G47eUGBJct/Xn8PDu3d651Z5XPwhbAhRobw1s5x1St/CYjD
	3zLtKKkWiiIIfYqNWK3dH3HknYK82KhaWTLHhM8xPWiBQCVOoE3TJZ0j0spgmi/0yCmK1Q
	490wwvPAcivvSVh48mjngTTuKX4HBzE=
ARC-Authentication-Results: i=1;
	imf30.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=Yu94lV5X;
	spf=pass (imf30.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.221.176 as permitted sender) smtp.mailfrom=21cnbao@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
Received: by mail-vk1-f176.google.com with SMTP id 71dfb90a1353d-4d33e0c1b3eso89770e0c.2
        for <linux-mm@kvack.org>; Thu, 07 Mar 2024 03:27:01 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1709810820; x=1710415620; darn=kvack.org;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:from:to:cc:subject:date
         :message-id:reply-to;
        bh=U5/UYxQ+V8iWd+wFm7wUJaiQxt6gTceVRV1QKSz0IOs=;
        b=Yu94lV5XpdzT53zkM6GzP74wlXX79DTGFW+GDciQClDn+7+bpusR409b9xSpTpR9zc
         UIC3vsjr45RWJvLAd6u3FxuFEBbEwWCj8z1j3MEQJS5MksMAr75UMee+JIDHmAKXp6Uw
         9DoRTeCB0tHRqkcq7rN+4FTClQLxwpDqwyqg9x5cwYO7vi6XetTzs6PDj3HoMSRTpvTR
         CmLUCB8F7dXHj4II5CI+37aZ42eMEzSSK+Di1Gdl/c2EWu9Nd6y1iaGyM/TKME0hfhEk
         Z5I8PVHlq6xTBvtutmjonE8qVzV8oTt+CpuKxOA27OmtTwf3J1LZDtKJhdd4UPVUL7FO
         lpGg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1709810820; x=1710415620;
        h=content-transfer-encoding:cc:to:subject:message-id:date:from
         :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=U5/UYxQ+V8iWd+wFm7wUJaiQxt6gTceVRV1QKSz0IOs=;
        b=SFa0t/caD1NBGcAALYIOuDnwITEmlYIx9zXcoBVYqJ0B4n5118x/QSJl91G4igvTCI
         FGLUXThdZGlkTd0EGRXe72V+GQ1xyHRqOpHpy4nrkV5VkBzhFdmkwQHgbYiel3VMBDFc
         bwb6vgL8LkjHrqVgr0ZbwkXql+BZl9+S3YL+GrwndAp1AW2HNQ5m9FXClLy3LLATrkiR
         x1dZflCZhg3HnEgUhL2IvmtD34fkR9SwM/j/7OwR9DU9YWvo50FZy1FL2RQm4JDHI+k5
         oLxZDPziujZvY3lmnUQjxshiFADySPtKPuMaqyQJXOHpAHWeriR3cp24sZz2GHxPilvS
         5Wmw==
X-Forwarded-Encrypted: i=1; AJvYcCXkl9OWAR9Z7LIsU3hgNG7TbFlDSphEGZNBbYuj62AOOJfOAykgL8xgXivMs1o7N/ZCq0JnEVwEjfm+RS9Qa0ypWrA=
X-Gm-Message-State: AOJu0Yxz3yFkOL3L4je7dJ0JXVldkr9vziybnrF493yUvTV5iSRRfHLe
	0XyalmAvXz4myyRbKfk7lJ+tXKafOCXOhrgOilkb/oL9kh+4JhMZuLGvm3D/tir/6i5Gr72h56o
	6u1glYNNlM8JIyP6vdqpNYgMR744=
X-Google-Smtp-Source: AGHT+IE6IVoDMMSp9p1MaTPE0G9ltm+lL5xNYXrUIzXUdFX9G8XAi+8LYpoVHhbZ9LrjwLylG9cnl6Mx+I+iygk9dkQ=
X-Received: by 2002:a05:6122:2908:b0:4d3:36b9:2c26 with SMTP id
 fm8-20020a056122290800b004d336b92c26mr7853739vkb.14.1709810820624; Thu, 07
 Mar 2024 03:27:00 -0800 (PST)
MIME-Version: 1.0
References: <20240307061425.21013-1-ioworker0@gmail.com> <CAGsJ_4xcRvZGdpPh1qcFTnTnDUbwz6WreQ=L_UO+oU2iFm9EPg@mail.gmail.com>
 <CAK1f24k2G_DSEjuqqqPyY0f7+btpYbjfoyMH7btLfP8nkasCTQ@mail.gmail.com>
 <CAGsJ_4xREM-P1mFqeM-s3-cJ9czb6PXwizb-3hOhwaF6+QM5QA@mail.gmail.com>
 <03458c20-5544-411b-9b8d-b4600a9b802f@arm.com> <CAGsJ_4zp1MXTjG=4gBO+J3owg7sHDgDJ8Ut51i1RBSnKnK0BfQ@mail.gmail.com>
 <501c9f77-1459-467a-8619-78e86b46d300@arm.com> <8f84c7d6-982a-4933-a7a7-3f640df64991@redhat.com>
 <e6bc142e-113d-4034-b92c-746b951a27ed@redhat.com> <d24f8553-33f2-4ae7-a06d-badaf9462d84@arm.com>
In-Reply-To: <d24f8553-33f2-4ae7-a06d-badaf9462d84@arm.com>
From: Barry Song <21cnbao@gmail.com>
Date: Thu, 7 Mar 2024 19:26:49 +0800
Message-ID: <CAGsJ_4za-2xpg21phWi2WWLF1iPXhoc1xM__FDTwYYBBKsTPgw@mail.gmail.com>
Subject: Re: [PATCH v2 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>, Lance Yang <ioworker0@gmail.com>, 
	Vishal Moola <vishal.moola@gmail.com>, akpm@linux-foundation.org, zokeefe@google.com, 
	shy828301@gmail.com, mhocko@suse.com, fengwei.yin@intel.com, 
	xiehuan09@gmail.com, wangkefeng.wang@huawei.com, songmuchun@bytedance.com, 
	peterx@redhat.com, minchan@kernel.org, linux-mm@kvack.org, 
	linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Stat-Signature: w9ir4h995nha9zd7hszbf6od1n4cj8u8
X-Rspamd-Server: rspam10
X-Rspamd-Queue-Id: 9A4D48000D
X-Rspam-User: 
X-HE-Tag: 1709810821-473943
X-HE-Meta: U2FsdGVkX1/y4gxtD3QrjfS477eOIbvLDZCxxJyIRmRZQjEnkHwRczldSgk/dU6cHHJMfdsfBviWsX0P+zIVVxoSTi0fkLL4odUQ20HEyCH6ZfRrtoqgCW406xouEgCGrA+Xo5xk0dE0TZaY/cWwbDeka/X+LuelY1wYnT1ykm86uyx9c9joDVq0nUsKR/SM7zdPHDE0HJX6VOsm1rau2lJgxQayAUMTVVtvS+99FyIdgxwzNjcPEhkgkr/sPUKmwLHAeg8YN7AQlKeGzHqOkcxPCgxXvlvY7cvCjqS8iRU4IjoKhqllfS8XkHIhL6XMS6lgpRNDJk4U8PD3ylbYDmIa2TDj8WgcyOmQ0nmRt/cTVlRc2wps/c+OC//nK8wMH5bXp+/GeCpJlptOsSeRRFk7D6PUQtSMlS4C9ub2fzytiK38tEfhxLrsEBVYu97uaQlWRVCtHs98YPAsQKzoOwpVqXNyg5ZgnicmSLYlVHPNGpsQGY5DhvI3lvjf07+mu/YeMqWpUUWF9NlYQi57o8IG4Vg4WIsAjWJUIcjMG0x9pj1DdeIRMH/e82HKtUORJZid+L8tenZ5urfUdaGbc7cUgjtKBhv+TWrHVxSQFxu3K1+piB66hcNyG01YBmVtzfX68Nw57xFjPJrsRI5rBnCfSr/RFyxh/kR6Bpsxq2n/VL3jXXYU0njBFkbKKc/SwjX1jazESyfi7PwNjuURV0a4fsLD73jffq6st5KUJUc7K05wBkqfXswJv8I7vsUhyvcyZxzmR9Z6YJhmDUiBGs2V1KUe7uY98ZWbNlbbdCvgcITm2LUKm9B1B54+aUbvNnfdckq3RsNh6AzDwBI1CtL3zv2yzSz7pL0Q9E5ApKF6UeHjYH64oTTkOK4xO3ux7bQV3tsP03TWuupm3kFOsNSerLb1r+XDZNqROcpZzDGXbiyHT2oOWVrsZsbzL6uVk4tAYB8aMJ8xSvTOyHO
 RIfOCms3
 omCiIiQrEZFqFSjNVAUkN/A6lySCy00rNYvz9jaKYVEoquPvJr9xnOKmzgiQ0nmXRgKSRzqgoHVnBv/oGZLKXqQ3GaRe/MSxbE+a20pZy//Gduspi8fh1kmo4vk3JNLrBtCtT4Lnzq9V1/Zhs23nFNB2iuIurQTe8xrHT
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Thu, Mar 7, 2024 at 7:13=E2=80=AFPM Ryan Roberts <ryan.roberts@arm.com> =
wrote:
>
> On 07/03/2024 10:54, David Hildenbrand wrote:
> > On 07.03.24 11:54, David Hildenbrand wrote:
> >> On 07.03.24 11:50, Ryan Roberts wrote:
> >>> On 07/03/2024 09:33, Barry Song wrote:
> >>>> On Thu, Mar 7, 2024 at 10:07=E2=80=AFPM Ryan Roberts <ryan.roberts@a=
rm.com> wrote:
> >>>>>
> >>>>> On 07/03/2024 08:10, Barry Song wrote:
> >>>>>> On Thu, Mar 7, 2024 at 9:00=E2=80=AFPM Lance Yang <ioworker0@gmail=
.com> wrote:
> >>>>>>>
> >>>>>>> Hey Barry,
> >>>>>>>
> >>>>>>> Thanks for taking time to review!
> >>>>>>>
> >>>>>>> On Thu, Mar 7, 2024 at 3:00=E2=80=AFPM Barry Song <21cnbao@gmail.=
com> wrote:
> >>>>>>>>
> >>>>>>>> On Thu, Mar 7, 2024 at 7:15=E2=80=AFPM Lance Yang <ioworker0@gma=
il.com> wrote:
> >>>>>>>>>
> >>>>>>> [...]
> >>>>>>>>> +static inline bool can_mark_large_folio_lazyfree(unsigned long=
 addr,
> >>>>>>>>> +                                                struct folio *=
folio,
> >>>>>>>>> pte_t *start_pte)
> >>>>>>>>> +{
> >>>>>>>>> +       int nr_pages =3D folio_nr_pages(folio);
> >>>>>>>>> +       fpb_t flags =3D FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRT=
Y;
> >>>>>>>>> +
> >>>>>>>>> +       for (int i =3D 0; i < nr_pages; i++)
> >>>>>>>>> +               if (page_mapcount(folio_page(folio, i)) !=3D 1)
> >>>>>>>>> +                       return false;
> >>>>>>>>
> >>>>>>>> we have moved to folio_estimated_sharers though it is not precis=
e, so
> >>>>>>>> we don't do
> >>>>>>>> this check with lots of loops and depending on the subpage's map=
count.
> >>>>>>>
> >>>>>>> If we don't check the subpage=E2=80=99s mapcount, and there is a =
cow folio
> >>>>>>> associated
> >>>>>>> with this folio and the cow folio has smaller size than this foli=
o,
> >>>>>>> should we still
> >>>>>>> mark this folio as lazyfree?
> >>>>>>
> >>>>>> I agree, this is true. However, we've somehow accepted the fact th=
at
> >>>>>> folio_likely_mapped_shared
> >>>>>> can result in false negatives or false positives to balance the
> >>>>>> overhead.  So I really don't know :-)
> >>>>>>
> >>>>>> Maybe David and Vishal can give some comments here.
> >>>>>>
> >>>>>>>
> >>>>>>>> BTW, do we need to rebase our work against David's changes[1]?
> >>>>>>>> [1]
> >>>>>>>> https://lore.kernel.org/linux-mm/20240227201548.857831-1-david@r=
edhat.com/
> >>>>>>>
> >>>>>>> Yes, we should rebase our work against David=E2=80=99s changes.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> +
> >>>>>>>>> +       return nr_pages =3D=3D folio_pte_batch(folio, addr, sta=
rt_pte,
> >>>>>>>>> +                                        ptep_get(start_pte), n=
r_pages,
> >>>>>>>>> flags, NULL);
> >>>>>>>>> +}
> >>>>>>>>> +
> >>>>>>>>>    static int madvise_free_pte_range(pmd_t *pmd, unsigned long =
addr,
> >>>>>>>>>                                   unsigned long end, struct mm_=
walk *walk)
> >>>>>>>>>
> >>>>>>>>> @@ -676,11 +690,45 @@ static int madvise_free_pte_range(pmd_t *=
pmd,
> >>>>>>>>> unsigned long addr,
> >>>>>>>>>                    */
> >>>>>>>>>                   if (folio_test_large(folio)) {
> >>>>>>>>>                           int err;
> >>>>>>>>> +                       unsigned long next_addr, align;
> >>>>>>>>>
> >>>>>>>>> -                       if (folio_estimated_sharers(folio) !=3D=
 1)
> >>>>>>>>> -                               break;
> >>>>>>>>> -                       if (!folio_trylock(folio))
> >>>>>>>>> -                               break;
> >>>>>>>>> +                       if (folio_estimated_sharers(folio) !=3D=
 1 ||
> >>>>>>>>> +                           !folio_trylock(folio))
> >>>>>>>>> +                               goto skip_large_folio;
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't think we can skip all the PTEs for nr_pages, as some of =
them
> >>>>>>>> might be
> >>>>>>>> pointing to other folios.
> >>>>>>>>
> >>>>>>>> for example, for a large folio with 16PTEs, you do MADV_DONTNEED=
(15-16),
> >>>>>>>> and write the memory of PTE15 and PTE16, you get page faults, th=
us PTE15
> >>>>>>>> and PTE16 will point to two different small folios. We can only =
skip
> >>>>>>>> when we
> >>>>>>>> are sure nr_pages =3D=3D folio_pte_batch() is sure.
> >>>>>>>
> >>>>>>> Agreed. Thanks for pointing that out.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> +
> >>>>>>>>> +                       align =3D folio_nr_pages(folio) * PAGE_=
SIZE;
> >>>>>>>>> +                       next_addr =3D ALIGN_DOWN(addr + align, =
align);
> >>>>>>>>> +
> >>>>>>>>> +                       /*
> >>>>>>>>> +                        * If we mark only the subpages as lazy=
free, or
> >>>>>>>>> +                        * cannot mark the entire large folio a=
s lazyfree,
> >>>>>>>>> +                        * then just split it.
> >>>>>>>>> +                        */
> >>>>>>>>> +                       if (next_addr > end || next_addr - addr=
 !=3D
> >>>>>>>>> align ||
> >>>>>>>>> +                           !can_mark_large_folio_lazyfree(addr=
, folio,
> >>>>>>>>> pte))
> >>>>>>>>> +                               goto split_large_folio;
> >>>>>>>>> +
> >>>>>>>>> +                       /*
> >>>>>>>>> +                        * Avoid unnecessary folio splitting if=
 the large
> >>>>>>>>> +                        * folio is entirely within the given r=
ange.
> >>>>>>>>> +                        */
> >>>>>>>>> +                       folio_clear_dirty(folio);
> >>>>>>>>> +                       folio_unlock(folio);
> >>>>>>>>> +                       for (; addr !=3D next_addr; pte++, addr=
 +=3D
> >>>>>>>>> PAGE_SIZE) {
> >>>>>>>>> +                               ptent =3D ptep_get(pte);
> >>>>>>>>> +                               if (pte_young(ptent) ||
> >>>>>>>>> pte_dirty(ptent)) {
> >>>>>>>>> +                                       ptent =3D ptep_get_and_=
clear_full(
> >>>>>>>>> +                                               mm, addr, pte,
> >>>>>>>>> tlb->fullmm);
> >>>>>>>>> +                                       ptent =3D pte_mkold(pte=
nt);
> >>>>>>>>> +                                       ptent =3D pte_mkclean(p=
tent);
> >>>>>>>>> +                                       set_pte_at(mm, addr, pt=
e, ptent);
> >>>>>>>>> +                                       tlb_remove_tlb_entry(tl=
b, pte,
> >>>>>>>>> addr);
> >>>>>>>>> +                               }
> >>>>>>>>
> >>>>>>>> Can we do this in batches? for a CONT-PTE mapped large folio, yo=
u are
> >>>>>>>> unfolding
> >>>>>>>> and folding again. It seems quite expensive.
> >>>>>
> >>>>> I'm not convinced we should be doing this in batches. We want the i=
nitial
> >>>>> folio_pte_batch() to be as loose as possible regarding permissions =
so that we
> >>>>> reduce our chances of splitting folios to the min. (e.g. ignore SW =
bits like
> >>>>> soft dirty, etc). I think it might be possible that some PTEs are R=
O and other
> >>>>> RW too (e.g. due to cow - although with the current cow impl, proba=
bly not.
> >>>>> But
> >>>>> its fragile to assume that). Anyway, if we do an initial batch that=
 ignores
> >>>>> all
> >>>>
> >>>> You are correct. I believe this scenario could indeed occur. For ins=
tance,
> >>>> if process A forks process B and then unmaps itself, leaving B as th=
e
> >>>> sole process owning the large folio.  The current wp_page_reuse() fu=
nction
> >>>> will reuse PTE one by one while the specific subpage is written.
> >>>
> >>> Hmm - I thought it would only reuse if the total mapcount for the fol=
io was 1.
> >>> And since it is a large folio with each page mapped once in proc B, I=
 thought
> >>> every subpage write would cause a copy except the last one? I haven't=
 looked at
> >>> the code for a while. But I had it in my head that this is an area we=
 need to
> >>> improve for mTHP.

So sad I am wrong again =F0=9F=98=A2

> >>
> >> wp_page_reuse() will currently reuse a PTE part of a large folio only =
if
> >> a single PTE remains mapped (refcount =3D=3D 0).
> >
> > ^ =3D=3D 1

seems this needs improvement. it is a waste the last subpage can
reuse the whole large folio. i was doing it in a quite different way,
if the large folio had only one subpage left, i would do copy and
released the large folio[1]. and if i could reuse the whole large folio
with CONT-PTE, i would reuse the whole large folio[2]. in mainline,
we don't have this cont-pte luxury exposed to mm, so i guess we can
not do [2] easily, but [1] seems to be an optimization.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplu=
s/sm8650_u_14.0.0_oneplus12/mm/memory.c#L3977
[2] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/blob/oneplu=
s/sm8650_u_14.0.0_oneplus12/mm/memory.c#L3812

>
> Ahh yes. That's what I meant. I got the behacviour vagulely right though.
>
> Anyway, regardless, I'm not sure we want to batch here. Or if we do, we w=
ant to
> batch function that will only clear access and dirty.
>

Thanks
Barry