linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mateusz Guzik <mjguzik@gmail.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>,
	Frank van der Linden <fvdl@google.com>,
	linux-mm@kvack.org,  akpm@linux-foundation.org,
	Muchun Song <muchun.song@linux.dev>,
	 Miaohe Lin <linmiaohe@huawei.com>,
	Oscar Salvador <osalvador@suse.de>,
	 David Hildenbrand <david@redhat.com>,
	Peter Xu <peterx@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm/hugetlb: optionally pre-zero hugetlb pages
Date: Tue, 3 Dec 2024 16:57:11 +0100	[thread overview]
Message-ID: <CAGudoHGY_NEJe6Pp6rv91v8p--phSX32C5Pm55c6jpUAJFLKmA@mail.gmail.com> (raw)
In-Reply-To: <d1f224fb-c0fb-47f9-bea8-3c33137be161@oracle.com>

On Tue, Dec 3, 2024 at 3:26 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 03/12/2024 12:06, Michal Hocko wrote:
> > If the startup latency is a real problem is there a way to workaround
> > that in the userspace by preallocating hugetlb pages ahead of time
> > before those VMs are launched and hand over already pre-allocated pages?
>
> It should be relatively simple to actually do this. Me and Mike had experimented
> ourselves a couple years back but we never had the chance to send it over. IIRC
> if we:
>
> - add the PageZeroed tracking bit when a page is zeroed
> - clear it in the write (fixup/non-fixup) fault-path
>
> [somewhat similar to this series I suspect]
>
> Then what's left is to change the lookup of free hugetlb pages
> (dequeue_hugetlb_folio_node_exact() I think) to search first for non-zeroed
> pages. Provided we don't track its 'cleared' state, there's no UAPI change in
> behaviour. A daemon can just allocate/mmap+touch/etc them with read-only and
> free them back 'as zeroed' to implement a userspace scrubber. And in principle
> existing apps should see no difference. The amount of changes is consequently
> significantly smaller (or it looked as such in a quick PoC years back).
>
> Something extra on the top would perhaps be the ability so select a lookup
> heuristic such that we can pick the search method of
> non-zero-first/only-nonzero/zeroed pages behind ioctl() (or a better generic
> UAPI) to allow a scrubber to easily coexist with hugepage user (e.g. a VMM, etc)
> without too much of a dance.
>

Ye after the qemu prefaulting got pointed out I started thinking about
a userlevel daemon which would do the work proposed here.

Except I got stuck at a good way to do it. The mmap + load from the
area + munmap triple does work but also entails more overhead than
necessary, but I only have some handwaving how to not do it. :)

Suppose a daemon of the sort exists and there is a machine with 4 or
more NUMA domains to deal with. Further suppose it spawns at least one
thread per such domain and tasksets them accordingly.

Then perhaps an ioctl somewhere on hugetlbfs(?) could take a parameter
indicating how many pages to zero out (or even just accept one page).
This would avoid crap on munmap.

This would still need majority of the patch, but all the zeroing
policy would be taken out. Key point being that whatever specific
behavior one sees fit, they can implement it in userspace, preventing
future kernel patches to add more tweaks.
-- 
Mateusz Guzik <mjguzik gmail.com>


  reply	other threads:[~2024-12-03 15:57 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-12-02 20:20 Frank van der Linden
2024-12-02 21:58 ` Mateusz Guzik
2024-12-02 22:50   ` Frank van der Linden
2024-12-03 12:06     ` Michal Hocko
2024-12-03 12:17       ` David Hildenbrand
2024-12-03 14:26       ` Joao Martins
2024-12-03 15:57         ` Mateusz Guzik [this message]
2024-12-03 16:17           ` Mateusz Guzik
2024-12-03 16:21           ` Joao Martins
2024-12-03 18:43         ` Frank van der Linden
2024-12-03 20:15           ` Frank van der Linden
2024-12-03 14:02   ` Joao Martins
2024-12-04  0:06     ` Ankur Arora
2024-12-04  0:05   ` Ankur Arora
2024-12-04 17:01     ` Frank van der Linden
2024-12-04 19:57       ` Ankur Arora
2024-12-02 22:07 ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAGudoHGY_NEJe6Pp6rv91v8p--phSX32C5Pm55c6jpUAJFLKmA@mail.gmail.com \
    --to=mjguzik@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=fvdl@google.com \
    --cc=joao.m.martins@oracle.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=osalvador@suse.de \
    --cc=peterx@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox