From: Barry Song <21cnbao@gmail.com>
Date: Wed, 28 Feb 2024 11:39:46 +1300
Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
To: Ryan Roberts
Cc: akpm@linux-foundation.org, david@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com, willy@infradead.org, xiang@kernel.org, ying.huang@intel.com, yuzhao@google.com, surenb@google.com, steven.price@arm.com, Chuanhua Han, Barry Song
References: <20231025144546.577640-1-ryan.roberts@arm.com> <20240118111036.72641-1-21cnbao@gmail.com> <20240118111036.72641-7-21cnbao@gmail.com>
On Wed, Feb 28, 2024 at 1:22 AM Ryan Roberts wrote:
>
> Hi Barry,
>
> I've scanned through this patch as part of trying to understand the races you
> have reported (It's going to take me a while to fully understand it all :) ). In
> the meantime I have a few comments on this patch...
>
> On 18/01/2024 11:10, Barry Song wrote:
> > From: Chuanhua Han
> >
> > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset has
> > supported swapping large folios out as a whole for the vmscan case. This patch
> > extends the feature to madvise.
> >
> > If the madvised range covers the whole large folio, we don't split it. Otherwise,
> > we still need to split it.
> >
> > This patch doesn't depend on ARM64's CONT-PTE; alternatively, it defines one
> > helper named pte_range_cont_mapped() to check if all PTEs are contiguously
> > mapped to a large folio.
> >
> > Signed-off-by: Chuanhua Han
> > Co-developed-by: Barry Song
> > Signed-off-by: Barry Song
> > ---
> >  include/asm-generic/tlb.h | 10 +++++++
> >  include/linux/pgtable.h   | 60 +++++++++++++++++++++++++++++++++++++++
> >  mm/madvise.c              | 48 +++++++++++++++++++++++++++++++
> >  3 files changed, 118 insertions(+)
> >
> > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> > index 129a3a759976..f894e22da5d6 100644
> > --- a/include/asm-generic/tlb.h
> > +++ b/include/asm-generic/tlb.h
> > @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
> >  		__tlb_remove_tlb_entry(tlb, ptep, address);	\
> >  	} while (0)
> >
> > +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr)		\
> > +	do {							\
> > +		int i;						\
> > +		tlb_flush_pte_range(tlb, address,		\
> > +				PAGE_SIZE * nr);		\
> > +		for (i = 0; i < nr; i++)			\
> > +			__tlb_remove_tlb_entry(tlb, ptep + i,	\
> > +					address + i * PAGE_SIZE);\
> > +	} while (0)
>
> David has recently added tlb_remove_tlb_entries() which does the same thing.

Cool. While sending the patchset, we were not depending on other work.
Nice to know David's work can help this case.
>
> > +
> >  #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address)	\
> >  	do {							\
> >  		unsigned long _sz = huge_page_size(h);		\
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 37fe83b0c358..da0c1cf447e3 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> >  }
> >  #endif
> >
> > +#ifndef pte_range_cont_mapped
> > +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> > +					 pte_t *start_pte,
> > +					 unsigned long start_addr,
> > +					 int nr)
> > +{
> > +	int i;
> > +	pte_t pte_val;
> > +
> > +	for (i = 0; i < nr; i++) {
> > +		pte_val = ptep_get(start_pte + i);
> > +
> > +		if (pte_none(pte_val))
> > +			return false;
> > +
> > +		if (pte_pfn(pte_val) != (start_pfn + i))
> > +			return false;
> > +	}
> > +
> > +	return true;
> > +}
> > +#endif
>
> David has recently added folio_pte_batch() which does a similar thing (as
> discussed in another context).

Yes.

> > +
> > +#ifndef pte_range_young
> > +static inline bool pte_range_young(pte_t *start_pte, int nr)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < nr; i++)
> > +		if (pte_young(ptep_get(start_pte + i)))
> > +			return true;
> > +
> > +	return false;
> > +}
> > +#endif
>
> I wonder if this should come from folio_pte_batch()?

Not quite sure folio_pte_batch() can return young, but I guess you already
have a batched function to check if a large folio is young?
>
> > +
> >  #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> >  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> >  					    unsigned long address,
> > @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> >  }
> >  #endif
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL
> > +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm,
> > +						  unsigned long start_addr,
> > +						  pte_t *start_pte,
> > +						  int nr, int full)
> > +{
> > +	int i;
> > +	pte_t pte;
> > +
> > +	pte = ptep_get_and_clear_full(mm, start_addr, start_pte, full);
> > +
> > +	for (i = 1; i < nr; i++)
> > +		ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE,
> > +					start_pte + i, full);
> > +
> > +	return pte;
> > +}
>
> David has recently added get_and_clear_full_ptes(). Your version isn't gathering
> access/dirty, which may be ok for your case, but not ok in general.

OK, glad to know we can use get_and_clear_full_ptes().

>
> >
> >  /*
> >   * If two threads concurrently fault at the same page, the thread that
> > @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t entry, struct folio *folio)
> >  })
> >  #endif
> >
> > +#ifndef pte_nr_addr_end
> > +#define pte_nr_addr_end(addr, size, end)				\
> > +({	unsigned long __boundary = ((addr) + size) & (~(size - 1));	\
> > +	(__boundary - 1 < (end) - 1)? __boundary: (end);		\
> > +})
> > +#endif
> > +
> >  /*
> >   * When walking page tables, we usually want to skip any p?d_none entries;
> >   * and any p?d_bad entries - reporting the error before resetting to none.
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 912155a94ed5..262460ac4b2e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  		if (folio_test_large(folio)) {
> >  			int err;
> >
> > +			if (!folio_test_pmd_mappable(folio)) {
> > +				int nr_pages = folio_nr_pages(folio);
> > +				unsigned long folio_size = PAGE_SIZE * nr_pages;
> > +				unsigned long start_addr = ALIGN_DOWN(addr, nr_pages * PAGE_SIZE);;
>
> I doubt it is correct to align down here. Couldn't you be going outside the
> bounds that the user supplied?

Yes, it can. This is ugly and suspicious but does not cause problems if
the large folio's virtual address is aligned, but it is wrong if the
virtual address is not aligned, as explained below.

>
> nit: you've defined folio_size, why not use it here?
> nit: double semi-colon.
>
> > +				unsigned long start_pfn = page_to_pfn(folio_page(folio, 0));
> > +				pte_t *start_pte = pte - (addr - start_addr) / PAGE_SIZE;
>
> I think start_pte could be off the start of the pgtable and into random memory
> in some corner cases (and outside the protection of the PTL)? You're assuming
> that the folio is fully and contiguously mapped and correctly aligned. mremap
> (and other things) could break that assumption.

Actually, we don't run under the assumption that the folio is fully and
contiguously mapped, but the code does assume a large folio's virtual
address is aligned to nr_pages * PAGE_SIZE.

OTOH, we have "if (next - addr != folio_size)" to split folios if users
just want to partially reclaim a large folio, but I do agree we should
move "if (next - addr != folio_size)" before pte_range_cont_mapped().

As long as the virtual address is aligned, pte_range_cont_mapped() won't
cause a problem for the code even before "if (next - addr != folio_size)"
(though it is ugly and suspicious), since it is still under the protection
of the PTL: we don't cross a PMD for a pte-mapped large folio.
But you are really right: we have cases like mremap which can remap an
aligned large folio to an unaligned address. I actually placed a tracepoint
in the kernel and ran it on lots of phones, but didn't find this case
happening, so I feel mremap to an unaligned address is really rare. Is it
possible to split large folios instead, and avoid the complexity, if we are
remapping to an unaligned address?

And the code is really completely wrong if the large folio is unaligned;
we have to remove that assumption if this really happens, so we shouldn't
do ALIGN_DOWN.

>
> > +				unsigned long next = pte_nr_addr_end(addr, folio_size, end);
> > +
> > +				if (!pte_range_cont_mapped(start_pfn, start_pte, start_addr, nr_pages))
> > +					goto split;
> > +
> > +				if (next - addr != folio_size) {
> > +					goto split;
> > +				} else {
> > +					/* Do not interfere with other mappings of this page */
> > +					if (folio_estimated_sharers(folio) != 1)
> > +						goto skip;
> > +
> > +					VM_BUG_ON(addr != start_addr || pte != start_pte);
> > +
> > +					if (pte_range_young(start_pte, nr_pages)) {
> > +						ptent = ptep_get_and_clear_range_full(mm, start_addr, start_pte,
> > +										      nr_pages, tlb->fullmm);
> > +						ptent = pte_mkold(ptent);
> > +
> > +						set_ptes(mm, start_addr, start_pte, ptent, nr_pages);
> > +						tlb_remove_nr_tlb_entry(tlb, start_pte, start_addr, nr_pages);
> > +					}
> > +
> > +					folio_clear_referenced(folio);
> > +					folio_test_clear_young(folio);
> > +					if (pageout) {
> > +						if (folio_isolate_lru(folio)) {
> > +							if (folio_test_unevictable(folio))
> > +								folio_putback_lru(folio);
> > +							else
> > +								list_add(&folio->lru, &folio_list);
> > +						}
> > +					} else
> > +						folio_deactivate(folio);
> > +				}
> > +skip:
> > +				pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> > +				addr = next - PAGE_SIZE;
> > +				continue;
> > +
> > +			}
> > +split:
> >  			if (folio_estimated_sharers(folio) != 1)
> >  				break;
> >  			if (pageout_anon_only_filter && !folio_test_anon(folio))

Thanks
Barry