From: "David Hildenbrand (Red Hat)" <david@kernel.org>
To: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Li Zhe <lizhe.67@bytedance.com>,
akpm@linux-foundation.org, ankur.a.arora@oracle.com,
fvdl@google.com, joao.m.martins@oracle.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
mhocko@suse.com, mjguzik@gmail.com, muchun.song@linux.dev,
osalvador@suse.de, raghavendra.kt@amd.com,
linux-cxl@vger.kernel.org, Davidlohr Bueso <dave@stgolabs.net>,
Gregory Price <gourry@gourry.net>,
Dan Williams <dan.j.williams@intel.com>,
zhanjie9@hisilicon.com, wangzhou1@hisilicon.com
Subject: Re: [PATCH v2 0/8] Introduce a huge-page pre-zeroing mechanism
Date: Thu, 15 Jan 2026 18:08:25 +0100
Message-ID: <23513e86-0769-4f3f-b90b-22273343a03c@kernel.org>
In-Reply-To: <20260115115739.00007cf6@huawei.com>

On 1/15/26 12:57, Jonathan Cameron wrote:
> On Thu, 15 Jan 2026 12:08:03 +0100
> "David Hildenbrand (Red Hat)" <david@kernel.org> wrote:
>
>> On 1/15/26 10:36, Li Zhe wrote:
>>> On Wed, 14 Jan 2026 18:21:08 +0100, david@kernel.org wrote:
>>>
>>>>>> But again, I think the main motivation here is "increase application
>>>>>> startup", not optimizing when the zeroing happens during system
>>>>>> operation (e.g., when idle etc.).
>>>>>>
>>>>>
>>>>> Framing this as "increase application startup" and merely shifting the
>>>>> overhead to shutdown seems like gaming the problem statement to me.
>>>>> The real problem is total real time spent on it while pages are
>>>>> needed.
>>>>>
>>>>> Support for background zeroing can give you more usable pages provided
>>>>> it has the cpu + ram to do it. If it does not, you are in the worst
>>>>> case in the same spot as with zeroing on free.
>>>>>
>>>>> Let's take a look at some examples.
>>>>>
>>>>> Say there are no free huge pages and you kill a vm + start a new one.
>>>>> On top of that all CPUs are pegged as is. In this case total time is
>>>>> the same for "zero on free" as it is for background zeroing.
>>>>
>>>> Right. If the pages get freed only to get allocated again immediately, it
>>>> doesn't really matter who does the zeroing. There might be some details,
>>>> of course.
>>>>
>>>>>
>>>>> Say the system is freshly booted and you start up a vm. There are no
>>>>> pre-zeroed pages available so it suffers at start time no matter what.
>>>>> However, with some support for background zeroing, the machinery could
>>>>> respond to demand and do it in parallel in some capacity, shortening
>>>>> the real time needed.
>>>>
>>>> Just like for init_on_free, I would start with zeroing these pages
>>>> during boot.
>>>>
>>>> init_on_free ensures that all pages in the buddy were zeroed out, which
>>>> greatly simplifies the implementation, because there is no need to track
>>>> what was initialized and what was not.
>>>>
>>>> It's a good question whether that initialization should be done in
>>>> parallel, possibly asynchronously, during boot. Reminds me a bit of
>>>> deferred page initialization during boot. But that is rather an
>>>> extension that could be added somewhat transparently on top later.
>>>>
>>>> If ever required we could dynamically enable this setting for a running
>>>> system. Whoever would enable it (flips the magic toggle) would zero out
>>>> all hugetlb pages that are already in the hugetlb allocator as free, but
>>>> not initialized yet.
>>>>
>>>> But again, these are extensions on top of the basic design of having all
>>>> free hugetlb folios be zeroed.
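
To make the "flip the magic toggle" idea above a bit more concrete, here
is a rough sketch. The function name is made up, and real code would have
to track which free folios were already zeroed and drop the lock around
the actual clearing; this is only meant to show the shape of it:

static void zero_existing_free_hugetlb_folios(struct hstate *h)
{
        struct folio *folio;
        int nid;

        spin_lock_irq(&hugetlb_lock);
        for_each_online_node(nid) {
                /* Free hugetlb folios are linked via folio->lru. */
                list_for_each_entry(folio, &h->hugepage_freelists[nid], lru)
                        folio_zero_range(folio, 0, folio_size(folio));
        }
        spin_unlock_irq(&hugetlb_lock);
}

Obviously we would not want to clear gigabytes of memory while holding
hugetlb_lock; a real version would work folio by folio outside the lock.
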
>>>>
>>>>>
>>>>> Say a little bit of real time passes and you start another vm. With
>>>>> merely zeroing on free there are still no pre-zeroed pages available
>>>>> so it again suffers the overhead. With background zeroing some of
>>>>> that memory would already be sorted out, speeding up said startup.
>>>>
>>>> The moment they end up in the hugetlb allocator as free folios they
>>>> would have to get initialized.
>>>>
>>>> Now, I am sure there are downsides to this approach (how to speed up
>>>> process exit by parallelizing zeroing, if ever required?). But it
>>>> sounds a bit ... simpler, with no user-space changes required. In
>>>> theory :)
>>>
>>> I strongly agree that the init_on_free strategy effectively eliminates the
>>> latency incurred during VM creation. However, it appears to introduce
>>> two new issues.
>>>
>>> First, the process that later allocates a page may not be the one that
>>> freed it, raising the question of which process should bear the cost
>>> of zeroing.
>>
>> Right now the cost is paid by the process that allocates a page. If you
>> shift that to the freeing path, it's still the same process, just at a
>> different point in time.
>>
>> Of course, there are exceptions to that: if you have a hugetlb file that
>> is shared by multiple processes (-> process that essentially truncates
>> the file). Or if someone (GUP-pin) holds a reference to a file even after
>> it was truncated (not common but possible).
>>
>> With CoW it would be the process that last unmaps the folio. CoW with
>> hugetlb is fortunately something that is rare (and rather shaky :) ).
>>
>>>
>>> Second, put_page() may be invoked from atomic context, making it
>>> inappropriate to call clear_page() there; off-loading the zeroing to a
>>> workqueue merely reopens the same accounting problem.
>>
>> I thought about this as well. For init_on_free we always invoke it for
>> up to 4 MiB folios during put_page() on x86-64.
>>
>> See __folio_put()->__free_frozen_pages()->free_pages_prepare()
>>
>> Where we call kernel_init_pages(page, 1 << order);
>>
>> So surely, for 2 MiB folios (hugetlb) this is not a problem.
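
(For reference, kernel_init_pages() is, roughly and at the time of
writing, just a linear loop over the base pages, so the cost scales with
the folio size no matter who triggers it:

static void kernel_init_pages(struct page *page, int numpages)
{
        int i;

        /* s390's use of memset() could override KASAN redzones. */
        kasan_disable_current();
        for (i = 0; i < numpages; i++)
                clear_highpage_kasan_tagged(page + i);
        kasan_enable_current();
}
)
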
>>
>> ... but then, on arm64 with 64k base pages we have 512 MiB folios
>> (managed by the buddy!) where this is apparently not a problem? Or is
>> it and should be fixed?
>>
>> So I would expect once we go up to 1 GiB, we might only reveal more
>> areas where we should have optimized in the first place by dropping
>> the reference outside the spin lock ... and these optimizations would
>> obviously (unless in hugetlb specific code ...) benefit init_on_free
>> setups as well (and page poisoning).
>
> FWIW I'd be interested in seeing if we can do the zeroing async and allow
> for hardware offloading. If it happens to be in CXL (and someone
> built the fancy bits) we can ask the device to zero ranges of memory
> for us. If they built the HDM-DB stuff it's coherent too (came up
> in Davidlohr's LPC Device-mem talk on HDM-DB + back invalidate
> support)
> +CC linux-cxl and Davidlohr + a few others.
>
> More locally this sounds like fun for DMA engines, though they are going
> to rapidly eat up bandwidth, and so we'll need QoS stuff in place
> to stop them perturbing other workloads.
>
> Give me a list of 1Gig pages and this stuff becomes much more efficient
> than anything the CPU can do.

Right, and ideally we'd implement any such mechanisms in a way that more
parts of the kernel can benefit, and not just an unloved in-memory
file-system that most people want to get rid of as soon as they can :)
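
Just as a strawman for what the core of DMA-engine-offloaded zeroing
could look like with the existing dmaengine memset capability -- channel
selection and error handling are hand-waved, the synchronous wait is only
for illustration (a real version would complete asynchronously), and
whether a DMA_MEMSET-capable channel exists at all is platform dependent:

#include <linux/dmaengine.h>
#include <linux/dma-mapping.h>

/* Strawman: zero one folio through a memset-capable DMA channel. */
static int dma_zero_folio(struct folio *folio)
{
        struct dma_async_tx_descriptor *tx;
        struct dma_chan *chan;
        struct device *dev;
        dma_cap_mask_t mask;
        dma_cookie_t cookie;
        size_t len = folio_size(folio);
        dma_addr_t dst;
        int ret = 0;

        dma_cap_zero(mask);
        dma_cap_set(DMA_MEMSET, mask);
        chan = dma_request_channel(mask, NULL, NULL);
        if (!chan)
                return -ENODEV;
        dev = dmaengine_get_dma_device(chan);

        dst = dma_map_page(dev, &folio->page, 0, len, DMA_FROM_DEVICE);
        if (dma_mapping_error(dev, dst)) {
                ret = -ENOMEM;
                goto out_release;
        }

        tx = dmaengine_prep_dma_memset(chan, dst, 0, len, DMA_PREP_INTERRUPT);
        if (!tx) {
                ret = -EIO;
                goto out_unmap;
        }

        cookie = dmaengine_submit(tx);
        dma_async_issue_pending(chan);
        if (dma_sync_wait(chan, cookie) != DMA_COMPLETE)
                ret = -EIO;

out_unmap:
        dma_unmap_page(dev, dst, len, DMA_FROM_DEVICE);
out_release:
        dma_release_channel(chan);
        return ret;
}

The QoS point stands: anything like this would need throttling so the
engine doesn't starve other memory traffic.
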
--
Cheers
David