From: Yang Shi <shy828301@gmail.com>
Date: Thu, 29 Feb 2024 14:54:01 -0800
Subject: Re: [Chapter Three] THP HVO: bring the hugeTLB feature to THP
To: Yu Zhao
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Jonathan Corbet
In-Reply-To: <20240229183436.4110845-4-yuzhao@google.com>
References: <20240229183436.4110845-1-yuzhao@google.com> <20240229183436.4110845-4-yuzhao@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao wrote:
>
> HVO can be one of the perks for heavy THP users like it is for hugeTLB
> users. For example, if such a user uses 60% of physical memory for 2MB
> THPs, THP HVO can reduce the struct page overhead by half (60% * 7/8
> ~= 50%).
>
> ZONE_NOMERGE considerably simplifies the implementation of HVO for
> THPs, since THPs from it cannot be split or merged and thus do not
> require any correctness-related operations on tail pages beyond the
> second one.
>
> If a THP is mapped by PTEs, two optimization-related operations on its
> tail pages, i.e., _mapcount and PG_anon_exclusive, can be binned to
> track a group of pages, e.g., eight pages per group for 2MB THPs. The
> estimation, as the copying cost incurred during shattering, is also by
> design, since mapping by PTEs is another discouraged behavior.

I'm confused by this. Can you please elaborate a bit on how mapcount
and PG_anon_exclusive are binned?

For mapcount, IIUC, when inc'ing a subpage's mapcount, you actually
inc the (i % 64) page's mapcount (assuming the THP size is 2MB and the
base page size is 4KB, so 8 strides with 64 pages in each stride),
right? But then how can you tell whether each of the 8 pages sharing a
counter has a mapcount of 1, or one page is mapped 8 times? Or does
this actually not matter, i.e., we don't even care to distinguish the
two cases?

For PG_anon_exclusive, if one page has it set, does that mean the
other 7 pages in the other strides have it set too?

>
> Signed-off-by: Yu Zhao
> ---
>  include/linux/mm.h     | 140 ++++++++++++++++++++++++++++++++++++++
>  include/linux/mmzone.h |   1 +
>  include/linux/rmap.h   |   4 ++
>  init/main.c            |   1 +
>  mm/gup.c               |   3 +-
>  mm/huge_memory.c       |   2 +
>  mm/hugetlb_vmemmap.c   |   2 +-
>  mm/internal.h          |   9 ---
>  mm/memory.c            |  11 +--
>  mm/page_alloc.c        | 151 ++++++++++++++++++++++++++++++++++++++++-
>  mm/rmap.c              |  17 ++++-
>  mm/vmstat.c            |   2 +
>  12 files changed, 323 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f5a97dec5169..d7014fc35cca 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1196,6 +1196,138 @@ static inline void page_mapcount_reset(struct page *page)
>         atomic_set(&(page)->_mapcount, -1);
>  }
>
> +#define HVO_MOD (PAGE_SIZE / sizeof(struct page))
> +
> +static inline int hvo_order_size(int order)
> +{
> +       if (PAGE_SIZE % sizeof(struct page) || !is_power_of_2(HVO_MOD))
> +               return 0;
> +
> +       return (1 << order) * sizeof(struct page);
> +}
> +
> +static inline bool page_hvo_suitable(struct page *head, int order)
> +{
> +       VM_WARN_ON_ONCE_PAGE(!test_bit(PG_head, &head->flags), head);
> +
> +       if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> +               return false;
> +
> +       return page_zonenum(head) == ZONE_NOMERGE &&
> +              IS_ALIGNED((unsigned long)head, PAGE_SIZE) &&
> +              hvo_order_size(order) > PAGE_SIZE;
> +}
> +
> +static inline bool folio_hvo_suitable(struct folio *folio)
> +{
> +       return folio_test_large(folio) && page_hvo_suitable(&folio->page, folio_order(folio));
> +}
> +
> +static inline bool page_is_hvo(struct page *head, int order)
> +{
> +       return page_hvo_suitable(head, order) && test_bit(PG_head, &head[HVO_MOD].flags);
> +}
> +
> +static inline bool folio_is_hvo(struct folio *folio)
> +{
> +       return folio_test_large(folio) && page_is_hvo(&folio->page, folio_order(folio));
> +}
> +
> +/*
> + * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
> + * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
> + * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
> + * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
> + */
> +#define ENTIRELY_MAPPED         0x800000
> +#define FOLIO_PAGES_MAPPED      (ENTIRELY_MAPPED - 1)
> +
> +static inline int hvo_range_mapcount(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = folio_entire_mapcount(folio);
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               *ret += atomic_read(&page->_mapcount) + 1;
> +       }
> +
> +       return true;
> +}
> +
> +static inline bool hvo_map_range(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = 0;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               if (atomic_add_return(next - i, &page->_mapcount) == next - i - 1)
> +                       *ret += stride;
> +       }
> +
> +       if (atomic_add_return(*ret, &folio->_nr_pages_mapped) >= ENTIRELY_MAPPED)
> +               *ret = 0;
> +
> +       return true;
> +}
> +
> +static inline bool hvo_unmap_range(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = 0;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               if (atomic_sub_return(next - i, &page->_mapcount) == -1)
> +                       *ret += stride;
> +       }
> +
> +       if (atomic_sub_return(*ret, &folio->_nr_pages_mapped) >= ENTIRELY_MAPPED)
> +               *ret = 0;
> +
> +       return true;
> +}
> +
> +static inline bool hvo_dup_range(struct folio *folio, struct page *page, int nr_pages)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               atomic_add(next - i, &page->_mapcount);
> +       }
> +
> +       return true;
> +}
> +
>  /**
>   * page_mapcount() - Number of times this precise page is mapped.
>   * @page: The page.
> @@ -1212,6 +1344,9 @@ static inline int page_mapcount(struct page *page)
>  {
>         int mapcount = atomic_read(&page->_mapcount) + 1;
>
> +       if (hvo_range_mapcount(page_folio(page), page, 1, &mapcount))
> +               return mapcount;
> +
>         if (unlikely(PageCompound(page)))
>                 mapcount += folio_entire_mapcount(page_folio(page));
>
> @@ -3094,6 +3229,11 @@ static inline void pagetable_pud_dtor(struct ptdesc *ptdesc)
>
>  extern void __init pagecache_init(void);
>  extern void free_initmem(void);
> +extern void free_vmemmap(void);
> +extern int vmemmap_remap_free(unsigned long start, unsigned long end,
> +                             unsigned long reuse,
> +                             struct list_head *vmemmap_pages,
> +                             unsigned long flags);
>
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 532218167bba..00e4bb6c8533 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -916,6 +916,7 @@ struct zone {
>  #ifdef CONFIG_CMA
>         unsigned long cma_pages;
>  #endif
> +       atomic_long_t hvo_freed;
>
>         const char *name;
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b7944a833668..d058c4cb3c96 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -322,6 +322,8 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_dup_range(folio, page, nr_pages))
> +                       break;
>                 do {
>                         atomic_inc(&page->_mapcount);
>                 } while (page++, --nr_pages > 0);
> @@ -401,6 +403,8 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
>                         if (PageAnonExclusive(page + i))
>                                 return -EBUSY;
>                 }
> +               if (hvo_dup_range(folio, page, nr_pages))
> +                       break;
>                 do {
>                         if (PageAnonExclusive(page))
>                                 ClearPageAnonExclusive(page);
> diff --git a/init/main.c b/init/main.c
> index e24b0780fdff..74003495db32 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1448,6 +1448,7 @@ static int __ref kernel_init(void *unused)
>         kgdb_free_init_mem();
>         exit_boot_config();
>         free_initmem();
> +       free_vmemmap();
>         mark_readonly();
>
>         /*
> diff --git a/mm/gup.c b/mm/gup.c
> index df83182ec72d..f3df0078505b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -57,7 +57,7 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>                         continue;
>                 if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>                         VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> -               else
> +               else if (!folio_is_hvo(folio) || !folio_nr_pages_mapped(folio))
>                         /* Either a PTE-mapped or a PMD-mapped THP. */
>                         VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) &&
>                                        !PageAnonExclusive(page), page);
> @@ -645,6 +645,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>         }
>
>         VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> +                      !folio_is_hvo(page_folio(page)) &&
>                        !PageAnonExclusive(page), page);
>
>         /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 62d2254bc51c..9e7e5d587a5c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2535,6 +2535,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                  *
>                  * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
>                  */
> +               if (folio_is_hvo(folio))
> +                       ClearPageAnonExclusive(page);
>                 anon_exclusive = PageAnonExclusive(page);
>                 if (freeze && anon_exclusive &&
>                     folio_try_share_anon_rmap_pmd(folio, page))
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index da177e49d956..9f43d900e83c 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -310,7 +310,7 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
>   *
>   * Return: %0 on success, negative error code otherwise.
>   */
> -static int vmemmap_remap_free(unsigned long start, unsigned long end,
> +int vmemmap_remap_free(unsigned long start, unsigned long end,
>                               unsigned long reuse,
>                               struct list_head *vmemmap_pages,
>                               unsigned long flags)
> diff --git a/mm/internal.h b/mm/internal.h
> index ac1d27468899..871c6eeb78b8 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -52,15 +52,6 @@ struct folio_batch;
>
>  void page_writeback_init(void);
>
> -/*
> - * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
> - * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
> - * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
> - * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
> - */
> -#define ENTIRELY_MAPPED         0x800000
> -#define FOLIO_PAGES_MAPPED      (ENTIRELY_MAPPED - 1)
> -
>  /*
>   * Flags passed to __show_mem() and show_free_areas() to suppress output in
>   * various contexts.
> diff --git a/mm/memory.c b/mm/memory.c
> index 0bfc8b007c01..db389f1d776d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3047,8 +3047,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
>         VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
>
>         if (folio) {
> -               VM_BUG_ON(folio_test_anon(folio) &&
> -                         !PageAnonExclusive(vmf->page));
> +               VM_BUG_ON_PAGE(folio_test_anon(folio) && !folio_is_hvo(folio) &&
> +                              !PageAnonExclusive(vmf->page), vmf->page);
>                 /*
>                  * Clear the folio's cpupid information as the existing
>                  * information potentially belongs to a now completely
> @@ -3502,7 +3502,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>          */
>         if (folio && folio_test_anon(folio) &&
>             (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vma))) {
> -               if (!PageAnonExclusive(vmf->page))
> +               if (!folio_is_hvo(folio) && !PageAnonExclusive(vmf->page))
>                         SetPageAnonExclusive(vmf->page);
>                 if (unlikely(unshare)) {
>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4100,8 +4100,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                         rmap_flags);
>         }
>
> -       VM_BUG_ON(!folio_test_anon(folio) ||
> -                 (pte_write(pte) && !PageAnonExclusive(page)));
> +       VM_BUG_ON_PAGE(!folio_test_anon(folio) ||
> +                      (pte_write(pte) && !folio_is_hvo(folio) && !PageAnonExclusive(page)),
> +                      page);
>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>         arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd843fb04f78..5f8c6583a191 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -53,6 +53,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include "internal.h"
>  #include "shuffle.h"
> @@ -585,6 +586,10 @@ void prep_compound_page(struct page *page, unsigned int order)
>         int nr_pages = 1 << order;
>
>         __SetPageHead(page);
> +
> +       if (page_is_hvo(page, order))
> +               nr_pages = HVO_MOD;
> +
>         for (i = 1; i < nr_pages; i++)
>                 prep_compound_tail(page, i);
>
> @@ -1124,10 +1129,15 @@ static __always_inline bool free_pages_prepare(struct page *page,
>          */
>         if (unlikely(order)) {
>                 int i;
> +               int nr_pages = 1 << order;
>
> -               if (compound)
> +               if (compound) {
> +                       if (page_is_hvo(page, order))
> +                               nr_pages = HVO_MOD;
>                         page[1].flags &= ~PAGE_FLAGS_SECOND;
> -               for (i = 1; i < (1 << order); i++) {
> +               }
> +
> +               for (i = 1; i < nr_pages; i++) {
>                         if (compound)
>                                 bad += free_tail_page_prepare(page, page + i);
>                         if (is_check_pages_enabled()) {
> @@ -1547,6 +1557,141 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>         page_table_check_alloc(page, order);
>  }
>
> +static void prep_hvo_page(struct page *head, int order)
> +{
> +       LIST_HEAD(list);
> +       struct page *page, *next;
> +       int freed = 0;
> +       unsigned long start = (unsigned long)head;
> +       unsigned long end = start + hvo_order_size(order);
> +
> +       if (page_zonenum(head) != ZONE_NOMERGE)
> +               return;
> +
> +       if (WARN_ON_ONCE(order != page_zone(head)->order)) {
> +               bad_page(head, "invalid page order");
> +               return;
> +       }
> +
> +       if (!page_hvo_suitable(head, order) || page_is_hvo(head, order))
> +               return;
> +
> +       vmemmap_remap_free(start + PAGE_SIZE, end, start, &list, 0);
> +
> +       list_for_each_entry_safe(page, next, &list, lru) {
> +               if (PageReserved(page))
> +                       free_bootmem_page(page);
> +               else
> +                       __free_page(page);
> +               freed++;
> +       }
> +
> +       atomic_long_add(freed, &page_zone(head)->hvo_freed);
> +}
> +
> +static void prep_nomerge_zone(struct zone *zone, enum migratetype type)
> +{
> +       int order;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +
> +       for (order = MAX_PAGE_ORDER; order > zone->order; order--) {
> +               struct page *page;
> +               int split = 0;
> +               struct free_area *area = zone->free_area + order;
> +
> +               while ((page = get_page_from_free_area(area, type))) {
> +                       del_page_from_free_list(page, zone, order);
> +                       expand(zone, page, zone->order, order, type);
> +                       set_buddy_order(page, zone->order);
> +                       add_to_free_list(page, zone, zone->order, type);
> +                       split++;
> +               }
> +
> +               pr_info("  HVO: order %d split %d\n", order, split);
> +       }
> +
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static void hvo_nomerge_zone(struct zone *zone, enum migratetype type)
> +{
> +       LIST_HEAD(old);
> +       LIST_HEAD(new);
> +       int nomem, freed;
> +       unsigned long flags;
> +       struct list_head list;
> +       struct page *page, *next;
> +       struct free_area *area = zone->free_area + zone->order;
> +again:
> +       nomem = freed = 0;
> +       INIT_LIST_HEAD(&list);
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +       list_splice_init(area->free_list + type, &old);
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +
> +       list_for_each_entry_safe(page, next, &old, buddy_list) {
> +               unsigned long start = (unsigned long)page;
> +               unsigned long end = start + hvo_order_size(zone->order);
> +
> +               if (WARN_ON_ONCE(!IS_ALIGNED(start, PAGE_SIZE)))
> +                       continue;
> +
> +               if (vmemmap_remap_free(start + PAGE_SIZE, end, start, &list, 0))
> +                       nomem++;
> +       }
> +
> +       list_for_each_entry_safe(page, next, &list, lru) {
> +               if (PageReserved(page))
> +                       free_bootmem_page(page);
> +               else
> +                       __free_page(page);
> +               freed++;
> +       }
> +
> +       list_splice_init(&old, &new);
> +       atomic_long_add(freed, &zone->hvo_freed);
> +
> +       pr_info("  HVO: nomem %d freed %d\n", nomem, freed);
> +
> +       if (!list_empty(area->free_list + type))
> +               goto again;
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +       list_splice(&new, area->free_list + type);
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static bool zone_hvo_suitable(struct zone *zone)
> +{
> +       if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> +               return false;
> +
> +       return zone_idx(zone) == ZONE_NOMERGE && hvo_order_size(zone->order) > PAGE_SIZE;
> +}
> +
> +void free_vmemmap(void)
> +{
> +       struct zone *zone;
> +
> +       static_branch_inc(&hugetlb_optimize_vmemmap_key);
> +
> +       for_each_populated_zone(zone) {
> +               if (!zone_hvo_suitable(zone))
> +                       continue;
> +
> +               pr_info("Freeing vmemmap of node %d zone %s\n",
> +                       zone_to_nid(zone), zone->name);
> +
> +               prep_nomerge_zone(zone, MIGRATE_MOVABLE);
> +               hvo_nomerge_zone(zone, MIGRATE_MOVABLE);
> +
> +               cond_resched();
> +       }
> +}
> +
>  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>                                                         unsigned int alloc_flags)
>  {
> @@ -1565,6 +1710,8 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
>                 set_page_pfmemalloc(page);
>         else
>                 clear_page_pfmemalloc(page);
> +
> +       prep_hvo_page(page, order);
>  }
>
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0ddb28c52961..d339bf489230 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1143,6 +1143,10 @@ int folio_total_mapcount(struct folio *folio)
>         /* In the common case, avoid the loop when no pages mapped by PTE */
>         if (folio_nr_pages_mapped(folio) == 0)
>                 return mapcount;
> +
> +       if (hvo_range_mapcount(folio, &folio->page, folio_nr_pages(folio), &mapcount))
> +               return mapcount;
> +
>         /*
>          * Add all the PTE mappings of those pages mapped by PTE.
>          * Limit the loop to folio_nr_pages_mapped()?
> @@ -1168,6 +1172,8 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_map_range(folio, page, nr_pages, &nr))
> +                       break;
>                 do {
>                         first = atomic_inc_and_test(&page->_mapcount);
>                         if (first && folio_test_large(folio)) {
> @@ -1314,6 +1320,8 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>         if (flags & RMAP_EXCLUSIVE) {
>                 switch (level) {
>                 case RMAP_LEVEL_PTE:
> +                       if (folio_is_hvo(folio))
> +                               break;
>                         for (i = 0; i < nr_pages; i++)
>                                 SetPageAnonExclusive(page + i);
>                         break;
> @@ -1421,6 +1429,9 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>         } else if (!folio_test_pmd_mappable(folio)) {
>                 int i;
>
> +               if (hvo_map_range(folio, &folio->page, nr, &nr))
> +                       goto done;
> +
>                 for (i = 0; i < nr; i++) {
>                         struct page *page = folio_page(folio, i);
>
> @@ -1437,7 +1448,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>                 SetPageAnonExclusive(&folio->page);
>                 __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>         }
> -
> +done:
>         __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>  }
>
> @@ -1510,6 +1521,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_unmap_range(folio, page, nr_pages, &nr))
> +                       break;
>                 do {
>                         last = atomic_add_negative(-1, &page->_mapcount);
>                         if (last && folio_test_large(folio)) {
> @@ -2212,7 +2225,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>                                 break;
>                         }
>                         VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
> -                                      !anon_exclusive, subpage);
> +                                      !folio_is_hvo(folio) && !anon_exclusive, subpage);
>
>                         /* See folio_try_share_anon_rmap_pte(): clear PTE first. */
>                         if (folio_test_hugetlb(folio)) {
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ff2114452334..f51f3b872270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1704,6 +1704,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        present  %lu"
>                    "\n        managed  %lu"
>                    "\n        cma      %lu"
> +                  "\n        hvo freed %lu"
>                    "\n        order    %u",
>                    zone_page_state(zone, NR_FREE_PAGES),
>                    zone->watermark_boost,
> @@ -1714,6 +1715,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    zone->present_pages,
>                    zone_managed_pages(zone),
>                    zone_cma_pages(zone),
> +                  atomic_long_read(&zone->hvo_freed),
>                    zone->order);
>
>         seq_printf(m,
> --
> 2.44.0.rc1.240.g4c46232300-goog
>
>
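
To make my question concrete, below is a rough userspace model of how I
read the binning in hvo_map_range()/hvo_range_mapcount(). It is purely
illustrative, not kernel code, and it assumes a 2MB THP (order 9), 4KB
base pages and a 64-byte struct page, i.e. 64 binned counters living in
the one vmemmap page that is kept. Please correct me if this is not what
the patch intends:

#include <stdio.h>

#define NR_SUBPAGES     512                             /* 2MB THP / 4KB base pages */
#define NR_COUNTERS     64                              /* struct pages in the kept vmemmap page */
#define STRIDE          (NR_SUBPAGES / NR_COUNTERS)     /* 8 subpages share one counter */

/* stands in for folio_page(folio, i / stride)->_mapcount, as I read the patch */
static int mapcount[NR_COUNTERS];

/* PTE-map subpage i once: only the shared bin counter is incremented */
static void map_subpage_by_pte(int i)
{
        mapcount[i / STRIDE]++;
}

int main(void)
{
        int i;

        /* case A: subpages 0..7 are each PTE-mapped once */
        for (i = 0; i < STRIDE; i++)
                map_subpage_by_pte(i);

        /* case B: subpage 8 alone is PTE-mapped 8 times */
        for (i = 0; i < STRIDE; i++)
                map_subpage_by_pte(8);

        /* both bins read back 8, so the two cases look identical */
        printf("bin 0 = %d, bin 1 = %d\n", mapcount[0], mapcount[1]);
        return 0;
}

If that matches your intent, then page_mapcount() on any subpage in a bin
can only report the bin's total, which is what prompted my question above
about distinguishing the two cases.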