From: Lance Yang <ioworker0@gmail.com>
Date: Mon, 1 Apr 2024 20:25:13 +0800
Subject: Re: [PATCH v5 6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD
To: Ryan Roberts
Cc: Andrew Morton, David Hildenbrand, Matthew Wilcox, Huang Ying, Gao Xiang, Yu Zhao, Yang Shi, Michal Hocko, Kefeng Wang, Barry Song <21cnbao@gmail.com>, Chris Li, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song
In-Reply-To: <20240327144537.4165578-7-ryan.roberts@arm.com>
References: <20240327144537.4165578-1-ryan.roberts@arm.com> <20240327144537.4165578-7-ryan.roberts@arm.com>
On Wed, Mar 27, 2024 at 10:46 PM Ryan Roberts wrote:
>
> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> folio that is fully and contiguously mapped in the pageout/cold vm
> range.
> This change means that large folios will be maintained all the
> way to swap storage. This both improves performance during swap-out, by
> eliding the cost of splitting the folio, and sets us up nicely for
> maintaining the large folio when it is swapped back in (to be covered in
> a separate series).
>
> Folios that are not fully mapped in the target range are still split,
> but note that behavior is changed so that if the split fails for any
> reason (folio locked, shared, etc.) we now leave it as is and move to the
> next pte in the range, continuing work on the subsequent folios.
> Previously, any failure of this sort would cause the entire operation to
> give up, and no folios mapped at higher addresses were paged out or made
> cold. Given that large folios are becoming more common, this old behavior
> would likely have led to wasted opportunities.
>
> While we are at it, change the code that clears young from the ptes to
> use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper
> function. This is more efficient than get_and_clear/modify/set,
> especially for contpte mappings on arm64, where the old approach would
> require unfolding/refolding and the new approach can be done in place.
>
> Reviewed-by: Barry Song
> Signed-off-by: Ryan Roberts
> ---
>  include/linux/pgtable.h | 30 ++++++++++++++
>  mm/internal.h           | 12 +++++-
>  mm/madvise.c            | 88 ++++++++++++++++++++++++-----------------
>  mm/memory.c             |  4 +-
>  4 files changed, 93 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 8185939df1e8..391f56a1b188 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -361,6 +361,36 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>  }
>  #endif
>
> +#ifndef mkold_ptes
> +/**
> + * mkold_ptes - Mark PTEs that map consecutive pages of the same folio as old.
> + * @vma: VMA the pages are mapped into.
> + * @addr: Address the first page is mapped at.
> + * @ptep: Page table pointer for the first entry.
> + * @nr: Number of entries to mark old.
> + *
> + * May be overridden by the architecture; otherwise, implemented as a simple
> + * loop over ptep_test_and_clear_young().
> + *
> + * Note that PTE bits in the PTE range besides the PFN can differ. For example,
> + * some PTEs might be write-protected.
> + *
> + * Context: The caller holds the page table lock. The PTEs map consecutive
> + * pages that belong to the same folio. The PTEs are all in the same PMD.
> + */
> +static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, unsigned int nr)
> +{
> +	for (;;) {
> +		ptep_test_and_clear_young(vma, addr, ptep);

IIUC, if the first PTE is a CONT-PTE, then calling ptep_test_and_clear_young()
will clear the young bit for the entire contig range to avoid having to
unfold. So the other PTEs within the range don't need to be cleared again.

Maybe we should consider overriding mkold_ptes for arm64?

Thanks,
Lance

> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +	}
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
>  #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG)
>  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
> diff --git a/mm/internal.h b/mm/internal.h
> index eadb79c3a357..efee8e4cd2af 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -130,6 +130,8 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * @flags: Flags to modify the PTE batch semantics.
>   * @any_writable: Optional pointer to indicate whether any entry except the
>   *		 first one is writable.
> + * @any_young: Optional pointer to indicate whether any entry except the
> + *		 first one is young.
>   *
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same large folio.
> @@ -145,16 +147,18 @@ static inline pte_t __pte_batch_clear_ignored(pte_t= pte, fpb_t flags) > */ > static inline int folio_pte_batch(struct folio *folio, unsigned long add= r, > pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags, > - bool *any_writable) > + bool *any_writable, bool *any_young) > { > unsigned long folio_end_pfn =3D folio_pfn(folio) + folio_nr_pages= (folio); > const pte_t *end_ptep =3D start_ptep + max_nr; > pte_t expected_pte, *ptep; > - bool writable; > + bool writable, young; > int nr; > > if (any_writable) > *any_writable =3D false; > + if (any_young) > + *any_young =3D false; > > VM_WARN_ON_FOLIO(!pte_present(pte), folio); > VM_WARN_ON_FOLIO(!folio_test_large(folio) || max_nr < 1, folio); > @@ -168,6 +172,8 @@ static inline int folio_pte_batch(struct folio *folio= , unsigned long addr, > pte =3D ptep_get(ptep); > if (any_writable) > writable =3D !!pte_write(pte); > + if (any_young) > + young =3D !!pte_young(pte); > pte =3D __pte_batch_clear_ignored(pte, flags); > > if (!pte_same(pte, expected_pte)) > @@ -183,6 +189,8 @@ static inline int folio_pte_batch(struct folio *folio= , unsigned long addr, > > if (any_writable) > *any_writable |=3D writable; > + if (any_young) > + *any_young |=3D young; > > nr =3D pte_batch_hint(ptep, pte); > expected_pte =3D pte_advance_pfn(expected_pte, nr); > diff --git a/mm/madvise.c b/mm/madvise.c > index 070bedb4996e..bd00b83e7c50 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *p= md, > LIST_HEAD(folio_list); > bool pageout_anon_only_filter; > unsigned int batch_count =3D 0; > + int nr; > > if (fatal_signal_pending(current)) > return -EINTR; > @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *p= md, > return 0; > flush_tlb_batched_pending(mm); > arch_enter_lazy_mmu_mode(); > - for (; addr < end; pte++, addr +=3D PAGE_SIZE) { > + for (; addr < end; pte +=3D nr, addr +=3D nr * PAGE_SIZE) { > + nr =3D 1; > ptent =3D 
ptep_get(pte); > > if (++batch_count =3D=3D SWAP_CLUSTER_MAX) { > @@ -447,55 +449,67 @@ static int madvise_cold_or_pageout_pte_range(pmd_t = *pmd, > continue; > > /* > - * Creating a THP page is expensive so split it only if w= e > - * are sure it's worth. Split it if we are only owner. > + * If we encounter a large folio, only split it if it is = not > + * fully mapped within the range we are operating on. Oth= erwise > + * leave it as is so that it can be swapped out whole. If= we > + * fail to split a folio, leave it in place and advance t= o the > + * next pte in the range. > */ > if (folio_test_large(folio)) { > - int err; > - > - if (folio_likely_mapped_shared(folio)) > - break; > - if (pageout_anon_only_filter && !folio_test_anon(= folio)) > - break; > - if (!folio_trylock(folio)) > - break; > - folio_get(folio); > - arch_leave_lazy_mmu_mode(); > - pte_unmap_unlock(start_pte, ptl); > - start_pte =3D NULL; > - err =3D split_folio(folio); > - folio_unlock(folio); > - folio_put(folio); > - if (err) > - break; > - start_pte =3D pte =3D > - pte_offset_map_lock(mm, pmd, addr, &ptl); > - if (!start_pte) > - break; > - arch_enter_lazy_mmu_mode(); > - pte--; > - addr -=3D PAGE_SIZE; > - continue; > + const fpb_t fpb_flags =3D FPB_IGNORE_DIRTY | > + FPB_IGNORE_SOFT_DIRTY; > + int max_nr =3D (end - addr) / PAGE_SIZE; > + bool any_young; > + > + nr =3D folio_pte_batch(folio, addr, pte, ptent, m= ax_nr, > + fpb_flags, NULL, &any_young)= ; > + if (any_young) > + ptent =3D pte_mkyoung(ptent); > + > + if (nr < folio_nr_pages(folio)) { > + int err; > + > + if (folio_likely_mapped_shared(folio)) > + continue; > + if (pageout_anon_only_filter && !folio_te= st_anon(folio)) > + continue; > + if (!folio_trylock(folio)) > + continue; > + folio_get(folio); > + arch_leave_lazy_mmu_mode(); > + pte_unmap_unlock(start_pte, ptl); > + start_pte =3D NULL; > + err =3D split_folio(folio); > + folio_unlock(folio); > + folio_put(folio); > + if (err) > + continue; > + start_pte =3D pte =3D > + 
pte_offset_map_lock(mm, pmd, addr= , &ptl); > + if (!start_pte) > + break; > + arch_enter_lazy_mmu_mode(); > + nr =3D 0; > + continue; > + } > } > > /* > * Do not interfere with other mappings of this folio and > - * non-LRU folio. > + * non-LRU folio. If we have a large folio at this point,= we > + * know it is fully mapped so if its mapcount is the same= as its > + * number of pages, it must be exclusive. > */ > - if (!folio_test_lru(folio) || folio_mapcount(folio) !=3D = 1) > + if (!folio_test_lru(folio) || > + folio_mapcount(folio) !=3D folio_nr_pages(folio)) > continue; > > if (pageout_anon_only_filter && !folio_test_anon(folio)) > continue; > > - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); > - > if (!pageout && pte_young(ptent)) { > - ptent =3D ptep_get_and_clear_full(mm, addr, pte, > - tlb->fullmm); > - ptent =3D pte_mkold(ptent); > - set_pte_at(mm, addr, pte, ptent); > - tlb_remove_tlb_entry(tlb, pte, addr); > + mkold_ptes(vma, addr, pte, nr); > + tlb_remove_tlb_entries(tlb, pte, nr, addr); > } > > /* > diff --git a/mm/memory.c b/mm/memory.c > index 9d844582ba38..b5b48f4cf2af 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -989,7 +989,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, str= uct vm_area_struct *src_vma > flags |=3D FPB_IGNORE_SOFT_DIRTY; > > nr =3D folio_pte_batch(folio, addr, src_pte, pte, max_nr,= flags, > - &any_writable); > + &any_writable, NULL); > folio_ref_add(folio, nr); > if (folio_test_anon(folio)) { > if (unlikely(folio_try_dup_anon_rmap_ptes(folio, = page, > @@ -1553,7 +1553,7 @@ static inline int zap_present_ptes(struct mmu_gathe= r *tlb, > */ > if (unlikely(folio_test_large(folio) && max_nr !=3D 1)) { > nr =3D folio_pte_batch(folio, addr, pte, ptent, max_nr, f= pb_flags, > - NULL); > + NULL, NULL); > > zap_present_folio_ptes(tlb, vma, folio, page, pte, ptent,= nr, > addr, details, rss, force_flush, > -- > 2.25.1 >