From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 94A12C48BF6 for ; Mon, 26 Feb 2024 06:40:07 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 015116B0162; Mon, 26 Feb 2024 01:40:07 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id E933D6B0169; Mon, 26 Feb 2024 01:40:06 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D0D416B016A; Mon, 26 Feb 2024 01:40:06 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id B62916B0168 for ; Mon, 26 Feb 2024 01:40:06 -0500 (EST) Received: from smtpin26.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 576DF405B4 for ; Mon, 26 Feb 2024 06:40:06 +0000 (UTC) X-FDA: 81833005212.26.96D076D Received: from mail-ua1-f50.google.com (mail-ua1-f50.google.com [209.85.222.50]) by imf29.hostedemail.com (Postfix) with ESMTP id B890F120009 for ; Mon, 26 Feb 2024 06:40:03 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=LQP0c+xR; spf=pass (imf29.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.50 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1708929603; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=REJoyVQFOAk4vMMCZgTE8cBGw5DeBzilZygz2Qptoq8=; b=Rrb6IND15+kmyR7s7WnB+w5/ZbIPZAn5VySlOhX8yi/8XDCU9nKg9wFqDXTB82hZ26Rcu1 MZ5JXeIKNMLp16Yhs2qxhAyD8ACs1BkHeJal7l0H/aTCf9iKWbt5QTE6j8AXpTVeQMwqEg lBZKC+g/62SAqKiTGqWM24xHS4BugBI= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=LQP0c+xR; spf=pass (imf29.hostedemail.com: domain of 21cnbao@gmail.com designates 209.85.222.50 as permitted sender) smtp.mailfrom=21cnbao@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1708929603; a=rsa-sha256; cv=none; b=6+zz1PsS09TysDTKm1laL5TXrPp66ZEzI6owmNvzwHMS6JwzucNmOLuO76YsCP/m9jlr4J y/fYhb+1Kq2ZrsloNtvqCfiTDA4frS1s6nEJGSrOQ3DQm9meiB5Wz3/eEtkhZqPvlTOf/H 47Qg6+vb/4YjseACihlR6cfdYjY4/qw= Received: by mail-ua1-f50.google.com with SMTP id a1e0cc1a2514c-7d5c257452dso724263241.0 for ; Sun, 25 Feb 2024 22:40:03 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1708929603; x=1709534403; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=REJoyVQFOAk4vMMCZgTE8cBGw5DeBzilZygz2Qptoq8=; b=LQP0c+xR7c3oqwLZRCxl9DLqzYeJO2tFb+G4CNXboribBO6l8UlbiPC4T3BM7OX1/J +JxfagGGo96v4E6HymqMAkzU1yQphK+/9m3CGrS9HQpU3MTOWl/vURYfq+JF7o/5xvl6 qwSs1nqyH8oXYe6cLITIgHezwCG7TkMogqU7T1Es3BM1tEYkKvUF1SxA8mNH/ldFvwEc Xx6I0HjTKoZJgrUOO0r5nu9e06vzueuIdX26TlZsZQq/vVXWffdyYbIkqU63bGDpRROZ svdGsIp96YfY3vjOccevlLb4FZafWxlNStdNE9zJICbhrN1hvXUDR75zKxfhliqc/pTX VDpQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1708929603; x=1709534403; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=REJoyVQFOAk4vMMCZgTE8cBGw5DeBzilZygz2Qptoq8=; b=H+AzjjfH4WqxktE5xkUPRdZ6xP4iR1NSS2V/WrA7gNLgehzSrY6JwNxIXqN3xMYYNR 1ZsnkCl93aXVMqTfjgRNY9XY0NKRxxvbEwHKQEODSpiMg/EwKMYB8K/FzBlTq2/iGS0I xM3+/nsqJrgtV8Vj8BDsHtqiNPJRAyQE+SpBWUWCv0bMO9YHHWkekIkHL2v//iRNyt7t zMMXNiP+xOzy0ftJifn5NjprzNT/KfXe/ZBerK2sN3vPRpUOTFU6UqIVQKXVZUyJhB+I UOUnryW29G6wKVgn60HDuNuaK9PonafA0+tmrNljg4PAIMuFcRDt+CohRxO4D28kACz1 sb9Q== X-Forwarded-Encrypted: i=1; AJvYcCV5Be9kth/lHXgUS9uQD1S8+ktZuMxKuCE/SQztw4fBZapbFa1uSi5a1LLqLhH8cuUMgTfPpivwwZKKwOyCpJ4WPyc= X-Gm-Message-State: AOJu0YyY4RKapQx22midlI3A3hchJ2xgk1fegAYXGUjyGC+dDfdLErMI ms3T0CBH4oJquUa7RqhWtmAYwDisoA+jmOrLP8yPHWJh1nV3QIvIZ7FjvBCX3xtFkgkU+O4YwYX 0L/Dd5Vbw0R0GNMRI0OGbLIk5ymc= X-Google-Smtp-Source: AGHT+IHfxC1NeGX3vv+EHcFyZF/5L/kv3Pc3b6Cd85946cbSRvxmRbJpfUc8s+Pd/MyLQDjd+l6aQl+akvXTpJlDg5Q= X-Received: by 2002:a67:c791:0:b0:470:79da:d606 with SMTP id t17-20020a67c791000000b0047079dad606mr2433322vsk.21.1708929602705; Sun, 25 Feb 2024 22:40:02 -0800 (PST) MIME-Version: 1.0 References: <20231025144546.577640-1-ryan.roberts@arm.com> <20240118111036.72641-1-21cnbao@gmail.com> <20240118111036.72641-7-21cnbao@gmail.com> In-Reply-To: From: Barry Song <21cnbao@gmail.com> Date: Mon, 26 Feb 2024 19:39:51 +1300 Message-ID: Subject: Re: [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT To: Chris Li Cc: ryan.roberts@arm.com, akpm@linux-foundation.org, david@redhat.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com, willy@infradead.org, xiang@kernel.org, ying.huang@intel.com, yuzhao@google.com, surenb@google.com, steven.price@arm.com, Chuanhua Han , Barry Song Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: B890F120009 X-Rspam-User: X-Stat-Signature: ceu8qt5tabryctxzikms8u3kic4gz1d7 X-Rspamd-Server: rspam01 X-HE-Tag: 1708929603-627622 X-HE-Meta: U2FsdGVkX1+XdCJcWTwrXdapARbpiBTOtdrNU/h6jFDASD6VIlvKFNzgx48YYzIaoawOTbO9iy9n9vnv1jNzCS2Suku/RSWcYLJFPR8YfqtW7oj9PXFKemPVpkO5kyeeii+ielb3Q3gFgb+kWycxDVKAZlTmo89AR3K6hG0gOdwg4AHUrJQb0Sc2KGMTSrNwqSesXHcN3oSXoIo3xbklI8og+PYPO5Y9HjpOCKB8i+xJmdcFO5wDcl5eeGFqnExjpyqK+imZNRQDKkeRcAXa0UUqeoSmmrKwGBru6i/a9Q5u/A0ut1hoGIuNWDZK2LW0BwjdXrWvGncGZG/SRNY5I51S59k/Y2YJTHBoYIInTix0HfMp1NA0rZxT/vEMM8x8WShOOzgE4rviuvfx7yhHDVZwiuLRBSqyq+kFIcstVBzvbbO20hDSJb6fy04zfBYCA1aYrZv0V7ZvhS5txiVjZdqurhMfWsY2wV4x72JGnfnIZ6YdOxb4kfKlTZVapAAXKtR7RuWw1UcfTcOX1Lxe/dn8Lr/GwBRu0Ct1d7YjSQlZPlQVvgv7jeq4pHbZMZPbviw5yJztzalNCZMMr0SFGUNz3DbbzGH86K3HTvkUzOvDQGGymaMZCNE/9lHmL2D6F+McRxhiqUxovdPc1hjASKV12ENH7IZ9b0W8E4/LO22vlAqk7L31AIzcO/Pf/xj4v1zDA4a8Xvgq0lTqdPj4nT1DeZKHIseEww85AqSp/2UGyme0KBSK32uBxCKnyCNya1eI/PF/SzkK7qv0opVvRDMtEs4I4a2O5Yg6xrnk2OMmAQ+6lGgTGK4PFhe5jfrvNJhd5yEHcyCgXkMwjOCNoctXaocSXwBWKIpqqeoJ/zblbKZxt/MU/bVkIBP7J8x642HPl2UKPBAlEDx6OO2Zmb1At8qBvMF56ERbMrp219sV1bxGAaE2mEXoh1WXgF6luf+oq4UFMLjvn2PozVd +G6qc+CC HmOvYmM4PgoESW+xvAGvQ+MSGeegfxHyCstfE8YVOAhOTSt5ag+cRjw2yvqPc1H46aL0rWNdLnfSD+0StTHZYOHbakw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, Jan 29, 2024 at 3:15=E2=80=AFPM Chris Li wrote: > > On Thu, Jan 18, 2024 at 3:12=E2=80=AFAM Barry Song <21cnbao@gmail.com> wr= ote: > > > > From: Chuanhua Han > > > > MADV_PAGEOUT and MADV_FREE are common cases in Android. Ryan's patchset= has > > supported swapping large folios out as a whole for vmscan case. This pa= tch > > extends the feature to madvise. > > > > If madvised range covers the whole large folio, we don't split it. Othe= rwise, > > we still need to split it. > > > > This patch doesn't depend on ARM64's CONT-PTE, alternatively, it define= s one > > helper named pte_range_cont_mapped() to check if all PTEs are contiguou= sly > > mapped to a large folio. > > > > Signed-off-by: Chuanhua Han > > Co-developed-by: Barry Song > > Signed-off-by: Barry Song > > --- > > include/asm-generic/tlb.h | 10 +++++++ > > include/linux/pgtable.h | 60 +++++++++++++++++++++++++++++++++++++++ > > mm/madvise.c | 48 +++++++++++++++++++++++++++++++ > > 3 files changed, 118 insertions(+) > > > > diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > > index 129a3a759976..f894e22da5d6 100644 > > --- a/include/asm-generic/tlb.h > > +++ b/include/asm-generic/tlb.h > > @@ -608,6 +608,16 @@ static inline void tlb_flush_p4d_range(struct mmu_= gather *tlb, > > __tlb_remove_tlb_entry(tlb, ptep, address); \ > > } while (0) > > > > +#define tlb_remove_nr_tlb_entry(tlb, ptep, address, nr) = \ > > + do { = \ > > + int i; = \ > > + tlb_flush_pte_range(tlb, address, = \ > > + PAGE_SIZE * nr); = \ > > + for (i =3D 0; i < nr; i++) = \ > > + __tlb_remove_tlb_entry(tlb, ptep + i, = \ > > + address + i * PAGE_SIZE); = \ > > + } while (0) > > + > > #define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \ > > do { \ > > unsigned long _sz =3D huge_page_size(h); \ > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h > > index 37fe83b0c358..da0c1cf447e3 100644 > > --- a/include/linux/pgtable.h > > +++ b/include/linux/pgtable.h > > @@ -320,6 +320,42 @@ static inline pgd_t pgdp_get(pgd_t *pgdp) > > } > > #endif > > > > +#ifndef pte_range_cont_mapped > > +static inline bool pte_range_cont_mapped(unsigned long start_pfn, > > + pte_t *start_pte, > > + unsigned long start_addr, > > + int nr) > > +{ > > + int i; > > + pte_t pte_val; > > + > > + for (i =3D 0; i < nr; i++) { > > + pte_val =3D ptep_get(start_pte + i); > > + > > + if (pte_none(pte_val)) > > + return false; > > Hmm, the following check pte_pfn =3D=3D start_pfn + i should have covered > the pte none case? > > I think the pte_none means it can't have a valid pfn. So this check > can be skipped? yes. check pte_pfn =3D=3D start_pfn + i should have covered the pte none case. but leaving pte_none there seems to make the code more readable. i guess we need to check pte_present() too, a small chance is swp_offset can equal pte_pfn after some shifting? in case, a PTE within the large folio range has been a swap entry? I am still thinking about if we have some cheaper way to check if a folio is still entirely mapped. maybe sth like if (list_empty(&folio->_deferred_list))? > > > + > > + if (pte_pfn(pte_val) !=3D (start_pfn + i)) > > + return false; > > + } > > + > > + return true; > > +} > > +#endif > > + > > +#ifndef pte_range_young > > +static inline bool pte_range_young(pte_t *start_pte, int nr) > > +{ > > + int i; > > + > > + for (i =3D 0; i < nr; i++) > > + if (pte_young(ptep_get(start_pte + i))) > > + return true; > > + > > + return false; > > +} > > +#endif > > + > > #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG > > static inline int ptep_test_and_clear_young(struct vm_area_struct *vma= , > > unsigned long address, > > @@ -580,6 +616,23 @@ static inline pte_t ptep_get_and_clear_full(struct= mm_struct *mm, > > } > > #endif > > > > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_RANGE_FULL > > +static inline pte_t ptep_get_and_clear_range_full(struct mm_struct *mm= , > > + unsigned long start_a= ddr, > > + pte_t *start_pte, > > + int nr, int full) > > +{ > > + int i; > > + pte_t pte; > > + > > + pte =3D ptep_get_and_clear_full(mm, start_addr, start_pte, full= ); > > + > > + for (i =3D 1; i < nr; i++) > > + ptep_get_and_clear_full(mm, start_addr + i * PAGE_SIZE, > > + start_pte + i, full); > > + > > + return pte; > > +} > > > > /* > > * If two threads concurrently fault at the same page, the thread that > > @@ -995,6 +1048,13 @@ static inline void arch_swap_restore(swp_entry_t = entry, struct folio *folio) > > }) > > #endif > > > > +#ifndef pte_nr_addr_end > > +#define pte_nr_addr_end(addr, size, end) = \ > > +({ unsigned long __boundary =3D ((addr) + size) & (~(size - 1)); = \ > > + (__boundary - 1 < (end) - 1)? __boundary: (end); = \ > > +}) > > +#endif > > + > > /* > > * When walking page tables, we usually want to skip any p?d_none entr= ies; > > * and any p?d_bad entries - reporting the error before resetting to n= one. > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 912155a94ed5..262460ac4b2e 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -452,6 +452,54 @@ static int madvise_cold_or_pageout_pte_range(pmd_t= *pmd, > > if (folio_test_large(folio)) { > > int err; > > > > + if (!folio_test_pmd_mappable(folio)) { > > This session of code indent into the right too much. > You can do: > > if (folio_test_pmd_mappable(folio)) > goto split; > > to make the code flatter. I guess we don't need "if (!folio_test_pmd_mappable(folio))" at all as the pmd case has been handled at the first beginning of madvise_cold_or_pageout_pte_range(). > > > + int nr_pages =3D folio_nr_pages(folio); > > + unsigned long folio_size =3D PAGE_SIZE = * nr_pages; > > + unsigned long start_addr =3D ALIGN_DOWN= (addr, nr_pages * PAGE_SIZE);; > > + unsigned long start_pfn =3D page_to_pfn= (folio_page(folio, 0)); > > + pte_t *start_pte =3D pte - (addr - star= t_addr) / PAGE_SIZE; > > + unsigned long next =3D pte_nr_addr_end(= addr, folio_size, end); > > + > > + if (!pte_range_cont_mapped(start_pfn, s= tart_pte, start_addr, nr_pages)) > > + goto split; > > + > > + if (next - addr !=3D folio_size) { > > Nitpick: One line statement does not need { > > > + goto split; > > + } else { > > When the previous if statement already "goto split", there is no need > for the else. You can save one level of indentation. right! > > > > > + /* Do not interfere with other = mappings of this page */ > > + if (folio_estimated_sharers(fol= io) !=3D 1) > > + goto skip; > > + > > + VM_BUG_ON(addr !=3D start_addr = || pte !=3D start_pte); > > + > > + if (pte_range_young(start_pte, = nr_pages)) { > > + ptent =3D ptep_get_and_= clear_range_full(mm, start_addr, start_pte, > > + = nr_pages, tlb->fullmm); > > + ptent =3D pte_mkold(pte= nt); > > + > > + set_ptes(mm, start_addr= , start_pte, ptent, nr_pages); > > + tlb_remove_nr_tlb_entry= (tlb, start_pte, start_addr, nr_pages); > > + } > > + > > + folio_clear_referenced(folio); > > + folio_test_clear_young(folio); > > + if (pageout) { > > + if (folio_isolate_lru(f= olio)) { > > + if (folio_test_= unevictable(folio)) > > + folio_p= utback_lru(folio); > > + else > > + list_ad= d(&folio->lru, &folio_list); > > + } > > + } else > > + folio_deactivate(folio)= ; > > I notice this section is very similar to the earlier statements inside > the same function. > "if (pmd_trans_huge(*pmd)) {" > > Wondering if there is some way to unify the two a bit somehow. we have duplicated the code three times - pmd, pte-mapped large, normal fol= io. I am quite sure if we can extract a common function. > > Also notice if you test the else condition first, > > If (!pageout) { > folio_deactivate(folio); > goto skip; > } > > You can save one level of indentation. > Not your fault, I notice the section inside (pmd_trans_huge(*pmd)) > does exactly the same thing. > can address this issue once we have a common func. > Chris > > > > + } > > +skip: > > + pte +=3D (next - PAGE_SIZE - (addr & PA= GE_MASK))/PAGE_SIZE; > > + addr =3D next - PAGE_SIZE; > > + continue; > > + > > + } > > +split: > > if (folio_estimated_sharers(folio) !=3D 1) > > break; > > if (pageout_anon_only_filter && !folio_test_ano= n(folio)) > > -- > > 2.34.1 > > > > Thanks Barry