From: Peter Xu <peterx@redhat.com>
To: Matthew Wilcox <willy@infradead.org>
Cc: James Houghton <jthoughton@google.com>,
Khalid Aziz <khalid.aziz@oracle.com>,
Vishal Moola <vishal.moola@gmail.com>,
Jane Chu <jane.chu@oracle.com>,
Muchun Song <muchun.song@linux.dev>,
linux-mm@kvack.org, Michal Hocko <mhocko@kernel.org>,
Yu Zhao <yuzhao@google.com>
Subject: Re: Unifying page table walkers
Date: Thu, 6 Jun 2024 17:33:31 -0400
Message-ID: <ZmIrK4CB1JjaHIyP@x1n>
In-Reply-To: <ZmIWZWOeN2fLaJ3T@casper.infradead.org>
On Thu, Jun 06, 2024 at 09:04:53PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 06, 2024 at 12:30:44PM -0700, James Houghton wrote:
> > Today the VM_HUGETLB flag tells the fault handler to call into
> > hugetlb_fault() (there are many other special cases, but this one is
> > probably the most important). How should faults on VMAs without
> > VM_HUGETLB that map HugeTLB folios be handled? If you handle faults
> > with the main mm fault handler without getting rid of hugetlb_fault(),
> > I think you're basically implementing a second, more tmpfs-like
> > hugetlbfs... right?
> >
> > I don't really have anything against this approach, but I think the
> > decision was to reduce the number of special cases as much as we can
> > first before attempting to rewrite hugetlbfs.
> >
> > Or maybe I've got something wrong and what you're asking doesn't
> > logically end up at a hugetlbfs v2.
>
> Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> Once there, we'll do:
>
> vmf.pud = pud_alloc(mm, p4d, address);
> if (pud_none(*vmf.pud) &&
> thp_vma_allowable_order(vma, vm_flags,
> TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
> ret = create_huge_pud(&vmf);
>
> which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
>
> So all we need to do is implement huge_fault in hugetlb_vm_ops. I
> don't think that's the same as creating a hugetlbfs2 because it's just
> another entry point. You can mmap() the same file both ways and it's
> all cache coherent.
Matthew, could you elaborate on how hugetlb_vm_ops.huge_fault() could
start to inject hugetlb pages without a hugetlb VMA? At least currently,
what I read is that we always define a hugetlb VMA as:
vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
vma->vm_ops = &hugetlb_vm_ops;
So any VMA that uses hugetlb_vm_ops is guaranteed to have VM_HUGETLB set.
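
If I read your suggestion right, it would be roughly the below
(hugetlbfs_huge_fault() is a made-up name, hugetlb_vm_ops has no
huge_fault entry today, and delegating to the existing hugetlb_fault()
is just for illustration):

static vm_fault_t hugetlbfs_huge_fault(struct vm_fault *vmf,
                                       unsigned int order)
{
        struct hstate *h = hstate_vma(vmf->vma);

        /* Only serve the order that the backing hstate provides. */
        if (order != huge_page_order(h))
                return VM_FAULT_FALLBACK;

        /* The real body would allocate from the hugetlb pool and
         * install the PUD/PMD mapping, much like hugetlb_no_page()
         * does today. */
        return hugetlb_fault(vmf->vma->vm_mm, vmf->vma, vmf->address,
                             vmf->flags);
}

const struct vm_operations_struct hugetlb_vm_ops = {
        .fault          = hugetlb_vm_op_fault,
        .huge_fault     = hugetlbfs_huge_fault,  /* new entry point */
        .open           = hugetlb_vm_op_open,
        .close          = hugetlb_vm_op_close,
        .pagesize       = hugetlb_vm_op_pagesize,
};
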
If you're talking about some other kind of VMA, it sounds to me like this
huge_fault() should belong to that VMA's own vm_ops? Then it would be a
way for a non-hugetlb VMA to reuse the hugetlb allocator/pool for its
huge pages. I'm not sure I understand it right, though.
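
IOW, I'd imagine that case looking something like the below, where all
the example_* names are hypothetical and how the accounting against the
hugetlb pool would work is exactly the open question:

/* A VMA that is not a hugetlb VMA (no VM_HUGETLB, not hugetlb_vm_ops)
 * but still wants its huge faults served from the hugetlb pool. */
static vm_fault_t example_fault(struct vm_fault *vmf)
{
        return VM_FAULT_SIGBUS;         /* stub: 4k path not shown */
}

static vm_fault_t example_huge_fault(struct vm_fault *vmf,
                                     unsigned int order)
{
        if (order != PUD_ORDER && order != PMD_ORDER)
                return VM_FAULT_FALLBACK;

        /* Take a folio from the hugetlb reserve pool (e.g. via
         * alloc_hugetlb_folio()), map it, return VM_FAULT_NOPAGE. */
        return VM_FAULT_FALLBACK;
}

static const struct vm_operations_struct example_vm_ops = {
        .fault          = example_fault,
        .huge_fault     = example_huge_fault,
};
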
Regarding the 4k mapping plan on hugetlb: I talked to Michal, Yu, and
some other people during LSF/MM, and so far it seems to me the best way
forward is to allow shmem to provide 1G pages.
Again, IMHO I'd be totally fine if we finished the cleanup and simply
added HGM on top of hugetlb v1, but it looks like I'm the only person who
thinks that way.. If we can introduce 1G pages to shmem, and those pages
can be as stable as a hugetlb v1 1G page, then that's good enough for the
VM use case (which I care about, and I believe most cloud providers who
care about postcopy do too). Adding 4k mappings on top of that would also
avoid all the hugetlb concerns people have (even though I think most of
the logic that HGM wants will still be needed).
That also roughly matches TAO's design, where we may have a better chance
of getting THPs allocated dynamically even at 1G. The catch in the VM
context is that we want reliable 1G pages, not splittable ones; or
rather, we may want them splittable only at the page table level, never
at the folio level (sketch below).
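
Roughly what I mean, as a conceptual sketch (map_folio_subpages() is a
made-up helper, and it assumes the folio maps the VMA linearly from
vm_start):

/* Remap part of a large (e.g. 1G) folio with 4k PTEs while the folio
 * itself stays intact: split the pgtable, never the folio. */
static void map_folio_subpages(struct vm_area_struct *vma,
                               unsigned long addr, struct folio *folio,
                               pte_t *ptep, unsigned int nr)
{
        unsigned long idx = (addr - vma->vm_start) >> PAGE_SHIFT;
        pte_t pte = mk_pte(folio_page(folio, idx), vma->vm_page_prot);

        /* All 'nr' PTEs reference subpages of the same unsplit folio. */
        set_ptes(vma->vm_mm, addr, ptep, pte, nr);
}
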
However, I don't think any of these are solid ideas yet. It would be
interesting to know how your thoughts relate to this too, since I think
you mentioned the 4k mapping somewhere. I'm also taking the liberty of
copying potentially relevant people, in case this discussion is relevant
to them.
[PS: I'll be off work tomorrow, so please expect a delay on follow-up
emails.]
Thanks,
--
Peter Xu