From: Yang Shi <shy828301@gmail.com>
Date: Thu, 29 Feb 2024 14:54:01 -0800
Subject: Re: [Chapter Three] THP HVO: bring the hugeTLB feature to THP
To: Yu Zhao
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, Jonathan Corbet
In-Reply-To: <20240229183436.4110845-4-yuzhao@google.com>
References: <20240229183436.4110845-1-yuzhao@google.com> <20240229183436.4110845-4-yuzhao@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Thu, Feb 29, 2024 at 10:34 AM Yu Zhao wrote:
>
> HVO can be one of the perks for heavy THP users like it is for hugeTLB
> users. For example, if such a user uses 60% of physical memory for 2MB
> THPs, THP HVO can reduce the struct page overhead by half (60% * 7/8
> ~= 50%).
>
> ZONE_NOMERGE considerably simplifies the implementation of HVO for
> THPs, since THPs from it cannot be split or merged and thus do not
> require any correctness-related operations on tail pages beyond the
> second one.
>
> If a THP is mapped by PTEs, two optimization-related operations on its
> tail pages, i.e., _mapcount and PG_anon_exclusive, can be binned to
> track a group of pages, e.g., eight pages per group for 2MB THPs. The
> estimation, as the copying cost incurred during shattering, is also by
> design, since mapping by PTEs is another discouraged behavior.

I'm confused by this. Can you please elaborate a bit on how mapcount
and PG_anon_exclusive are binned?

For mapcount, IIUC, when inc'ing a subpage's mapcount, you actually
inc the (i % 64) page's mapcount (assuming the THP size is 2MB and the
base page size is 4KB, so 8 strides with 64 pages in each stride),
right? But then how can you tell whether each of the 8 pages sharing a
counter has a mapcount of 1, or one page is mapped 8 times? Or does
this actually not matter, i.e., we don't even care to distinguish the
two cases?

For PG_anon_exclusive, if one page has it set, does that mean the
other 7 pages in the other strides have it set too?

>
> Signed-off-by: Yu Zhao
> ---
>  include/linux/mm.h     | 140 ++++++++++++++++++++++++++++++++++++++
>  include/linux/mmzone.h |   1 +
>  include/linux/rmap.h   |   4 ++
>  init/main.c            |   1 +
>  mm/gup.c               |   3 +-
>  mm/huge_memory.c       |   2 +
>  mm/hugetlb_vmemmap.c   |   2 +-
>  mm/internal.h          |   9 ---
>  mm/memory.c            |  11 +--
>  mm/page_alloc.c        | 151 ++++++++++++++++++++++++++++++++++++++++-
>  mm/rmap.c              |  17 ++++-
>  mm/vmstat.c            |   2 +
>  12 files changed, 323 insertions(+), 20 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f5a97dec5169..d7014fc35cca 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1196,6 +1196,138 @@ static inline void page_mapcount_reset(struct page *page)
>         atomic_set(&(page)->_mapcount, -1);
>  }
>
> +#define HVO_MOD (PAGE_SIZE / sizeof(struct page))
> +
> +static inline int hvo_order_size(int order)
> +{
> +       if (PAGE_SIZE % sizeof(struct page) || !is_power_of_2(HVO_MOD))
> +               return 0;
> +
> +       return (1 << order) * sizeof(struct page);
> +}
> +
> +static inline bool page_hvo_suitable(struct page *head, int order)
> +{
> +       VM_WARN_ON_ONCE_PAGE(!test_bit(PG_head, &head->flags), head);
> +
> +       if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> +               return false;
> +
> +       return page_zonenum(head) == ZONE_NOMERGE &&
> +              IS_ALIGNED((unsigned long)head, PAGE_SIZE) &&
> +              hvo_order_size(order) > PAGE_SIZE;
> +}
> +
> +static inline bool folio_hvo_suitable(struct folio *folio)
> +{
> +       return folio_test_large(folio) && page_hvo_suitable(&folio->page, folio_order(folio));
> +}
> +
> +static inline bool page_is_hvo(struct page *head, int order)
> +{
> +       return page_hvo_suitable(head, order) && test_bit(PG_head, &head[HVO_MOD].flags);
> +}
> +
> +static inline bool folio_is_hvo(struct folio *folio)
> +{
> +       return folio_test_large(folio) && page_is_hvo(&folio->page, folio_order(folio));
> +}
> +
> +/*
> + * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
> + * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
> + * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
> + * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
> + */
> +#define ENTIRELY_MAPPED         0x800000
> +#define FOLIO_PAGES_MAPPED      (ENTIRELY_MAPPED - 1)
> +
> +static inline int hvo_range_mapcount(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = folio_entire_mapcount(folio);
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               *ret += atomic_read(&page->_mapcount) + 1;
> +       }
> +
> +       return true;
> +}
> +
> +static inline bool hvo_map_range(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = 0;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               if (atomic_add_return(next - i, &page->_mapcount) == next - i - 1)
> +                       *ret += stride;
> +       }
> +
> +       if (atomic_add_return(*ret, &folio->_nr_pages_mapped) >= ENTIRELY_MAPPED)
> +               *ret = 0;
> +
> +       return true;
> +}
> +
> +static inline bool hvo_unmap_range(struct folio *folio, struct page *page, int nr_pages, int *ret)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       *ret = 0;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               if (atomic_sub_return(next - i, &page->_mapcount) == -1)
> +                       *ret += stride;
> +       }
> +
> +       if (atomic_sub_return(*ret, &folio->_nr_pages_mapped) >= ENTIRELY_MAPPED)
> +               *ret = 0;
> +
> +       return true;
> +}
> +
> +static inline bool hvo_dup_range(struct folio *folio, struct page *page, int nr_pages)
> +{
> +       int i, next, end;
> +       int stride = hvo_order_size(folio_order(folio)) / PAGE_SIZE;
> +
> +       if (!folio_is_hvo(folio))
> +               return false;
> +
> +       for (i = folio_page_idx(folio, page), end = i + nr_pages; i != end; i = next) {
> +               next = min(end, round_down(i + stride, stride));
> +
> +               page = folio_page(folio, i / stride);
> +               atomic_add(next - i, &page->_mapcount);
> +       }
> +
> +       return true;
> +}
> +
>  /**
>   * page_mapcount() - Number of times this precise page is mapped.
>   * @page: The page.
> @@ -1212,6 +1344,9 @@ static inline int page_mapcount(struct page *page)
>  {
>         int mapcount = atomic_read(&page->_mapcount) + 1;
>
> +       if (hvo_range_mapcount(page_folio(page), page, 1, &mapcount))
> +               return mapcount;
> +
>         if (unlikely(PageCompound(page)))
>                 mapcount += folio_entire_mapcount(page_folio(page));
>
> @@ -3094,6 +3229,11 @@ static inline void pagetable_pud_dtor(struct ptdesc *ptdesc)
>
>  extern void __init pagecache_init(void);
>  extern void free_initmem(void);
> +extern void free_vmemmap(void);
> +extern int vmemmap_remap_free(unsigned long start, unsigned long end,
> +                             unsigned long reuse,
> +                             struct list_head *vmemmap_pages,
> +                             unsigned long flags);
>
>  /*
>   * Free reserved pages within range [PAGE_ALIGN(start), end & PAGE_MASK)
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 532218167bba..00e4bb6c8533 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -916,6 +916,7 @@ struct zone {
>  #ifdef CONFIG_CMA
>         unsigned long cma_pages;
>  #endif
> +       atomic_long_t hvo_freed;
>
>         const char *name;
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index b7944a833668..d058c4cb3c96 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -322,6 +322,8 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_dup_range(folio, page, nr_pages))
> +                       break;
>                 do {
>                         atomic_inc(&page->_mapcount);
>                 } while (page++, --nr_pages > 0);
> @@ -401,6 +403,8 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
>                         if (PageAnonExclusive(page + i))
>                                 return -EBUSY;
>                 }
> +               if (hvo_dup_range(folio, page, nr_pages))
> +                       break;
>                 do {
>                         if (PageAnonExclusive(page))
>                                 ClearPageAnonExclusive(page);
> diff --git a/init/main.c b/init/main.c
> index e24b0780fdff..74003495db32 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1448,6 +1448,7 @@ static int __ref kernel_init(void *unused)
>         kgdb_free_init_mem();
>         exit_boot_config();
>         free_initmem();
> +       free_vmemmap();
>         mark_readonly();
>
>         /*
> diff --git a/mm/gup.c b/mm/gup.c
> index df83182ec72d..f3df0078505b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -57,7 +57,7 @@ static inline void sanity_check_pinned_pages(struct page **pages,
>                         continue;
>                 if (!folio_test_large(folio) || folio_test_hugetlb(folio))
>                         VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page), page);
> -               else
> +               else if (!folio_is_hvo(folio) || !folio_nr_pages_mapped(folio))
>                         /* Either a PTE-mapped or a PMD-mapped THP. */
>                         VM_BUG_ON_PAGE(!PageAnonExclusive(&folio->page) &&
>                                        !PageAnonExclusive(page), page);
> @@ -645,6 +645,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>         }
>
>         VM_BUG_ON_PAGE((flags & FOLL_PIN) && PageAnon(page) &&
> +                      !folio_is_hvo(page_folio(page)) &&
>                        !PageAnonExclusive(page), page);
>
>         /* try_grab_page() does nothing unless FOLL_GET or FOLL_PIN is set. */
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 62d2254bc51c..9e7e5d587a5c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2535,6 +2535,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>                  *
>                  * See folio_try_share_anon_rmap_pmd(): invalidate PMD first.
>                  */
> +               if (folio_is_hvo(folio))
> +                       ClearPageAnonExclusive(page);
>                 anon_exclusive = PageAnonExclusive(page);
>                 if (freeze && anon_exclusive &&
>                     folio_try_share_anon_rmap_pmd(folio, page))
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index da177e49d956..9f43d900e83c 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -310,7 +310,7 @@ static int vmemmap_remap_split(unsigned long start, unsigned long end,
>   *
>   * Return: %0 on success, negative error code otherwise.
>   */
> -static int vmemmap_remap_free(unsigned long start, unsigned long end,
> +int vmemmap_remap_free(unsigned long start, unsigned long end,
>                               unsigned long reuse,
>                               struct list_head *vmemmap_pages,
>                               unsigned long flags)
> diff --git a/mm/internal.h b/mm/internal.h
> index ac1d27468899..871c6eeb78b8 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -52,15 +52,6 @@ struct folio_batch;
>
>  void page_writeback_init(void);
>
> -/*
> - * If a 16GB hugetlb folio were mapped by PTEs of all of its 4kB pages,
> - * its nr_pages_mapped would be 0x400000: choose the ENTIRELY_MAPPED bit
> - * above that range, instead of 2*(PMD_SIZE/PAGE_SIZE).  Hugetlb currently
> - * leaves nr_pages_mapped at 0, but avoid surprise if it participates later.
> - */
> -#define ENTIRELY_MAPPED         0x800000
> -#define FOLIO_PAGES_MAPPED      (ENTIRELY_MAPPED - 1)
> -
>  /*
>   * Flags passed to __show_mem() and show_free_areas() to suppress output in
>   * various contexts.
> diff --git a/mm/memory.c b/mm/memory.c
> index 0bfc8b007c01..db389f1d776d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3047,8 +3047,8 @@ static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
>         VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
>
>         if (folio) {
> -               VM_BUG_ON(folio_test_anon(folio) &&
> -                         !PageAnonExclusive(vmf->page));
> +               VM_BUG_ON_PAGE(folio_test_anon(folio) && !folio_is_hvo(folio) &&
> +                              !PageAnonExclusive(vmf->page), vmf->page);
>                 /*
>                  * Clear the folio's cpupid information as the existing
>                  * information potentially belongs to a now completely
> @@ -3502,7 +3502,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
>          */
>         if (folio && folio_test_anon(folio) &&
>             (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vma))) {
> -               if (!PageAnonExclusive(vmf->page))
> +               if (!folio_is_hvo(folio) && !PageAnonExclusive(vmf->page))
>                         SetPageAnonExclusive(vmf->page);
>                 if (unlikely(unshare)) {
>                         pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4100,8 +4100,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>                                         rmap_flags);
>         }
>
> -       VM_BUG_ON(!folio_test_anon(folio) ||
> -                 (pte_write(pte) && !PageAnonExclusive(page)));
> +       VM_BUG_ON_PAGE(!folio_test_anon(folio) ||
> +                      (pte_write(pte) && !folio_is_hvo(folio) && !PageAnonExclusive(page)),
> +                      page);
>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>         arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dd843fb04f78..5f8c6583a191 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -53,6 +53,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include "internal.h"
>  #include "shuffle.h"
> @@ -585,6 +586,10 @@ void prep_compound_page(struct page *page, unsigned int order)
>         int nr_pages = 1 << order;
>
>         __SetPageHead(page);
> +
> +       if (page_is_hvo(page, order))
> +               nr_pages = HVO_MOD;
> +
>         for (i = 1; i < nr_pages; i++)
>                 prep_compound_tail(page, i);
>
> @@ -1124,10 +1129,15 @@ static __always_inline bool free_pages_prepare(struct page *page,
>          */
>         if (unlikely(order)) {
>                 int i;
> +               int nr_pages = 1 << order;
>
> -               if (compound)
> +               if (compound) {
> +                       if (page_is_hvo(page, order))
> +                               nr_pages = HVO_MOD;
>                         page[1].flags &= ~PAGE_FLAGS_SECOND;
> -               for (i = 1; i < (1 << order); i++) {
> +               }
> +
> +               for (i = 1; i < nr_pages; i++) {
>                         if (compound)
>                                 bad += free_tail_page_prepare(page, page + i);
>                         if (is_check_pages_enabled()) {
> @@ -1547,6 +1557,141 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>         page_table_check_alloc(page, order);
>  }
>
> +static void prep_hvo_page(struct page *head, int order)
> +{
> +       LIST_HEAD(list);
> +       struct page *page, *next;
> +       int freed = 0;
> +       unsigned long start = (unsigned long)head;
> +       unsigned long end = start + hvo_order_size(order);
> +
> +       if (page_zonenum(head) != ZONE_NOMERGE)
> +               return;
> +
> +       if (WARN_ON_ONCE(order != page_zone(head)->order)) {
> +               bad_page(head, "invalid page order");
> +               return;
> +       }
> +
> +       if (!page_hvo_suitable(head, order) || page_is_hvo(head, order))
> +               return;
> +
> +       vmemmap_remap_free(start + PAGE_SIZE, end, start, &list, 0);
> +
> +       list_for_each_entry_safe(page, next, &list, lru) {
> +               if (PageReserved(page))
> +                       free_bootmem_page(page);
> +               else
> +                       __free_page(page);
> +               freed++;
> +       }
> +
> +       atomic_long_add(freed, &page_zone(head)->hvo_freed);
> +}
> +
> +static void prep_nomerge_zone(struct zone *zone, enum migratetype type)
> +{
> +       int order;
> +       unsigned long flags;
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +
> +       for (order = MAX_PAGE_ORDER; order > zone->order; order--) {
> +               struct page *page;
> +               int split = 0;
> +               struct free_area *area = zone->free_area + order;
> +
> +               while ((page = get_page_from_free_area(area, type))) {
> +                       del_page_from_free_list(page, zone, order);
> +                       expand(zone, page, zone->order, order, type);
> +                       set_buddy_order(page, zone->order);
> +                       add_to_free_list(page, zone, zone->order, type);
> +                       split++;
> +               }
> +
> +               pr_info("  HVO: order %d split %d\n", order, split);
> +       }
> +
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static void hvo_nomerge_zone(struct zone *zone, enum migratetype type)
> +{
> +       LIST_HEAD(old);
> +       LIST_HEAD(new);
> +       int nomem, freed;
> +       unsigned long flags;
> +       struct list_head list;
> +       struct page *page, *next;
> +       struct free_area *area = zone->free_area + zone->order;
> +again:
> +       nomem = freed = 0;
> +       INIT_LIST_HEAD(&list);
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +       list_splice_init(area->free_list + type, &old);
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +
> +       list_for_each_entry_safe(page, next, &old, buddy_list) {
> +               unsigned long start = (unsigned long)page;
> +               unsigned long end = start + hvo_order_size(zone->order);
> +
> +               if (WARN_ON_ONCE(!IS_ALIGNED(start, PAGE_SIZE)))
> +                       continue;
> +
> +               if (vmemmap_remap_free(start + PAGE_SIZE, end, start, &list, 0))
> +                       nomem++;
> +       }
> +
> +       list_for_each_entry_safe(page, next, &list, lru) {
> +               if (PageReserved(page))
> +                       free_bootmem_page(page);
> +               else
> +                       __free_page(page);
> +               freed++;
> +       }
> +
> +       list_splice_init(&old, &new);
> +       atomic_long_add(freed, &zone->hvo_freed);
> +
> +       pr_info("  HVO: nomem %d freed %d\n", nomem, freed);
> +
> +       if (!list_empty(area->free_list + type))
> +               goto again;
> +
> +       spin_lock_irqsave(&zone->lock, flags);
> +       list_splice(&new, area->free_list + type);
> +       spin_unlock_irqrestore(&zone->lock, flags);
> +}
> +
> +static bool zone_hvo_suitable(struct zone *zone)
> +{
> +       if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
> +               return false;
> +
> +       return zone_idx(zone) == ZONE_NOMERGE && hvo_order_size(zone->order) > PAGE_SIZE;
> +}
> +
> +void free_vmemmap(void)
> +{
> +       struct zone *zone;
> +
> +       static_branch_inc(&hugetlb_optimize_vmemmap_key);
> +
> +       for_each_populated_zone(zone) {
> +               if (!zone_hvo_suitable(zone))
> +                       continue;
> +
> +               pr_info("Freeing vmemmap of node %d zone %s\n",
> +                       zone_to_nid(zone), zone->name);
> +
> +               prep_nomerge_zone(zone, MIGRATE_MOVABLE);
> +               hvo_nomerge_zone(zone, MIGRATE_MOVABLE);
> +
> +               cond_resched();
> +       }
> +}
> +
>  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>                                                         unsigned int alloc_flags)
>  {
> @@ -1565,6 +1710,8 @@ static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags
>                 set_page_pfmemalloc(page);
>         else
>                 clear_page_pfmemalloc(page);
> +
> +       prep_hvo_page(page, order);
>  }
>
>  /*
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0ddb28c52961..d339bf489230 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1143,6 +1143,10 @@ int folio_total_mapcount(struct folio *folio)
>         /* In the common case, avoid the loop when no pages mapped by PTE */
>         if (folio_nr_pages_mapped(folio) == 0)
>                 return mapcount;
> +
> +       if (hvo_range_mapcount(folio, &folio->page, folio_nr_pages(folio), &mapcount))
> +               return mapcount;
> +
>         /*
>          * Add all the PTE mappings of those pages mapped by PTE.
>          * Limit the loop to folio_nr_pages_mapped()?
> @@ -1168,6 +1172,8 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_map_range(folio, page, nr_pages, &nr))
> +                       break;
>                 do {
>                         first = atomic_inc_and_test(&page->_mapcount);
>                         if (first && folio_test_large(folio)) {
> @@ -1314,6 +1320,8 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>         if (flags & RMAP_EXCLUSIVE) {
>                 switch (level) {
>                 case RMAP_LEVEL_PTE:
> +                       if (folio_is_hvo(folio))
> +                               break;
>                         for (i = 0; i < nr_pages; i++)
>                                 SetPageAnonExclusive(page + i);
>                         break;
> @@ -1421,6 +1429,9 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>         } else if (!folio_test_pmd_mappable(folio)) {
>                 int i;
>
> +               if (hvo_map_range(folio, &folio->page, nr, &nr))
> +                       goto done;
> +
>                 for (i = 0; i < nr; i++) {
>                         struct page *page = folio_page(folio, i);
>
> @@ -1437,7 +1448,7 @@ void folio_add_new_anon_rmap(struct folio *folio, struct vm_area_struct *vma,
>                 SetPageAnonExclusive(&folio->page);
>                 __lruvec_stat_mod_folio(folio, NR_ANON_THPS, nr);
>         }
> -
> +done:
>         __lruvec_stat_mod_folio(folio, NR_ANON_MAPPED, nr);
>  }
>
> @@ -1510,6 +1521,8 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
>
>         switch (level) {
>         case RMAP_LEVEL_PTE:
> +               if (hvo_unmap_range(folio, page, nr_pages, &nr))
> +                       break;
>                 do {
>                         last = atomic_add_negative(-1, &page->_mapcount);
>                         if (last && folio_test_large(folio)) {
> @@ -2212,7 +2225,7 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>                                 break;
>                         }
>                         VM_BUG_ON_PAGE(pte_write(pteval) && folio_test_anon(folio) &&
> -                                      !anon_exclusive, subpage);
> +                                      !folio_is_hvo(folio) && !anon_exclusive, subpage);
>
>                         /* See folio_try_share_anon_rmap_pte(): clear PTE first. */
>                         if (folio_test_hugetlb(folio)) {
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ff2114452334..f51f3b872270 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1704,6 +1704,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    "\n        present  %lu"
>                    "\n        managed  %lu"
>                    "\n        cma      %lu"
> +                  "\n        hvo freed %lu"
>                    "\n        order    %u",
>                    zone_page_state(zone, NR_FREE_PAGES),
>                    zone->watermark_boost,
> @@ -1714,6 +1715,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
>                    zone->present_pages,
>                    zone_managed_pages(zone),
>                    zone_cma_pages(zone),
> +                  atomic_long_read(&zone->hvo_freed),
>                    zone->order);
>
>         seq_printf(m,
> --
> 2.44.0.rc1.240.g4c46232300-goog
>
>
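
To make my question concrete, below is a rough userspace model of how I
read the binning in hvo_map_range()/hvo_range_mapcount(). It is purely
illustrative, not kernel code, and it assumes a 2MB THP (order 9), 4KB
base pages and a 64-byte struct page, i.e. 64 binned counters living in
the one vmemmap page that is kept. Please correct me if this is not what
the patch intends:

#include <stdio.h>

#define NR_SUBPAGES     512                             /* 2MB THP / 4KB base pages */
#define NR_COUNTERS     64                              /* struct pages in the kept vmemmap page */
#define STRIDE          (NR_SUBPAGES / NR_COUNTERS)     /* 8 subpages share one counter */

/* stands in for folio_page(folio, i / stride)->_mapcount, as I read the patch */
static int mapcount[NR_COUNTERS];

/* PTE-map subpage i once: only the shared bin counter is incremented */
static void map_subpage_by_pte(int i)
{
        mapcount[i / STRIDE]++;
}

int main(void)
{
        int i;

        /* case A: subpages 0..7 are each PTE-mapped once */
        for (i = 0; i < STRIDE; i++)
                map_subpage_by_pte(i);

        /* case B: subpage 8 alone is PTE-mapped 8 times */
        for (i = 0; i < STRIDE; i++)
                map_subpage_by_pte(8);

        /* both bins read back 8, so the two cases look identical */
        printf("bin 0 = %d, bin 1 = %d\n", mapcount[0], mapcount[1]);
        return 0;
}

If that matches your intent, then page_mapcount() on any subpage in a bin
can only report the bin's total, which is what prompted my question above
about distinguishing the two cases.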