linux-mm.kvack.org archive mirror
* Unifying page table walkers
@ 2024-06-06 18:29 Matthew Wilcox
  2024-06-06 19:30 ` James Houghton
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Matthew Wilcox @ 2024-06-06 18:29 UTC (permalink / raw)
  To: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song; +Cc: linux-mm

One of the things we discussed at LSFMM was unifying the hugetlb and
THP page table walkers.  I've been looking into it some more recently;
I've found a problem and I think a solution.

The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
is that it has a different locking context.  It is called with the
hugetlb_vma_lock held for read (nb: this is not the same as the vma
lock; see walk_hugetlb_range()).  Why do we need this?  Because of page
table sharing.

In a completely separate discussion, I was talking with Khalid about
mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs
pages to be mapped by a VMA which does not have the VM_HUGETLB flag set.
If we do that, the page tables would not be permitted to be shared with
other users of that hugetlbfs file.  But we want to eliminate support
for that anyway, so that's more of a feature than a bug.

Once we don't use the VM_HUGETLB flag on these VMAs, that opens the
door to the other features we want, like mapping individual pages from
a hugetlb folio.  And we can use the regular page table walkers for
these VMAs.

Is this a reasonable path forward, or have I overlooked something?


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Unifying page table walkers
  2024-06-06 18:29 Unifying page table walkers Matthew Wilcox
@ 2024-06-06 19:30 ` James Houghton
  2024-06-06 20:04   ` Matthew Wilcox
  2024-06-06 21:49 ` Peter Xu
  2024-06-07  6:59 ` David Hildenbrand
  2 siblings, 1 reply; 13+ messages in thread
From: James Houghton @ 2024-06-06 19:30 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm

On Thu, Jun 6, 2024 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> One of the things we discussed at LSFMM was unifying the hugetlb and
> THP page table walkers.  I've been looking into it some more recently;
> I've found a problem and I think a solution.
>
> The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
> is that it has a different locking context.  It is called with the
> hugetlb_vma_lock held for read (nb: this is not the same as the vma
> lock; see walk_hugetlb_range()).  Why do we need this?  Because of page
> table sharing.
>
> In a completely separate discussion, I was talking with Khalid about
> mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs
> pages to be mapped by a VMA which does not have the VM_HUGETLB flag set.
> If we do that, the page tables would not be permitted to be shared with
> other users of that hugetlbfs file.  But we want to eliminate support
> for that anyway, so that's more of a feature than a bug.
>
> Once we don't use the VM_HUGETLB flag on these VMAs, that opens the
> door to the other features we want, like mapping individual pages from
> a hugetlb folio.  And we can use the regular page table walkers for
> these VMAs.
>
> Is this a reasonable path forward, or have I overlooked something?

Hi Matthew,

Today the VM_HUGETLB flag tells the fault handler to call into
hugetlb_fault() (there are many other special cases, but this one is
probably the most important). How should faults on VMAs without
VM_HUGETLB that map HugeTLB folios be handled? If you handle faults
with the main mm fault handler without getting rid of hugetlb_fault(),
I think you're basically implementing a second, more tmpfs-like
hugetlbfs... right?

I don't really have anything against this approach, but I think the
decision was to reduce the number of special cases as much as we can
first before attempting to rewrite hugetlbfs.

Or maybe I've got something wrong and what you're asking doesn't
logically end up at a hugetlbfs v2.



* Re: Unifying page table walkers
  2024-06-06 19:30 ` James Houghton
@ 2024-06-06 20:04   ` Matthew Wilcox
  2024-06-06 20:23     ` James Houghton
  2024-06-06 21:33     ` Peter Xu
  0 siblings, 2 replies; 13+ messages in thread
From: Matthew Wilcox @ 2024-06-06 20:04 UTC (permalink / raw)
  To: James Houghton
  Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm

On Thu, Jun 06, 2024 at 12:30:44PM -0700, James Houghton wrote:
> Today the VM_HUGETLB flag tells the fault handler to call into
> hugetlb_fault() (there are many other special cases, but this one is
> probably the most important). How should faults on VMAs without
> VM_HUGETLB that map HugeTLB folios be handled? If you handle faults
> with the main mm fault handler without getting rid of hugetlb_fault(),
> I think you're basically implementing a second, more tmpfs-like
> hugetlbfs... right?
> 
> I don't really have anything against this approach, but I think the
> decision was to reduce the number of special cases as much as we can
> first before attempting to rewrite hugetlbfs.
> 
> Or maybe I've got something wrong and what you're asking doesn't
> logically end up at a hugetlbfs v2.

Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
Once there, we'll do:

        vmf.pud = pud_alloc(mm, p4d, address);
        if (pud_none(*vmf.pud) &&
            thp_vma_allowable_order(vma, vm_flags,
                                TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
                ret = create_huge_pud(&vmf);

which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);

So all we need to do is implement huge_fault in hugetlb_vm_ops.  I
don't think that's the same as creating a hugetlbfs2 because it's just
another entry point.  You can mmap() the same file both ways and it's
all cache coherent.




* Re: Unifying page table walkers
  2024-06-06 20:04   ` Matthew Wilcox
@ 2024-06-06 20:23     ` James Houghton
  2024-06-06 21:21       ` Matthew Wilcox
  2024-06-06 21:33     ` Peter Xu
  1 sibling, 1 reply; 13+ messages in thread
From: James Houghton @ 2024-06-06 20:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm

On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> Once there, we'll do:
>
>         vmf.pud = pud_alloc(mm, p4d, address);
>         if (pud_none(*vmf.pud) &&
>             thp_vma_allowable_order(vma, vm_flags,
>                                 TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
>                 ret = create_huge_pud(&vmf);
>
> which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
>
> So all we need to do is implement huge_fault in hugetlb_vm_ops.  I
> don't think that's the same as creating a hugetlbfs2 because it's just
> another entry point.  You can mmap() the same file both ways and it's
> all cache coherent.

That makes a lot of sense. FWIW, this sounds good to me (though I'm
curious what Peter thinks :)).

But I think you'll need to be careful to ensure that, for now anyway,
huge_fault() is always called with the exact same ptep/pmdp/pudp that
hugetlb_walk() would have returned (ignoring sharing). If you allow
PMD mapping of what would otherwise be PUD-mapped hugetlb pages right
now, you'll break the vmemmap optimization (and probably other
things).

Also I'm not sure how this will interact with arm64's hugetlb pages
implemented with contiguous PTEs/PMDs. You might have to round
`address` down to make sure you've picked the first PTE/PMD in the
group.



* Re: Unifying page table walkers
  2024-06-06 20:23     ` James Houghton
@ 2024-06-06 21:21       ` Matthew Wilcox
  2024-06-06 23:07         ` James Houghton
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2024-06-06 21:21 UTC (permalink / raw)
  To: James Houghton
  Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm

On Thu, Jun 06, 2024 at 01:23:08PM -0700, James Houghton wrote:
> On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> > Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> > Once there, we'll do:
> >
> >         vmf.pud = pud_alloc(mm, p4d, address);
> >         if (pud_none(*vmf.pud) &&
> >             thp_vma_allowable_order(vma, vm_flags,
> >                                 TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
> >                 ret = create_huge_pud(&vmf);
> >
> > which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
> >
> > So all we need to do is implement huge_fault in hugetlb_vm_ops.  I
> > don't think that's the same as creating a hugetlbfs2 because it's just
> > another entry point.  You can mmap() the same file both ways and it's
> > all cache coherent.
> 
> That makes a lot of sense. FWIW, this sounds good to me (though I'm
> curious what Peter thinks :)).
> 
> But I think you'll need to be careful to ensure that, for now anyway,
> huge_fault() is always called with the exact same ptep/pmdp/pudp that
> hugetlb_walk() would have returned (ignoring sharing). If you allow
> PMD mapping of what would otherwise be PUD-mapped hugetlb pages right
> now, you'll break the vmemmap optimization (and probably other
> things).

Why is that?  This sounds like you know something I don't ;-)
Is it the mapcount issue?

> Also I'm not sure how this will interact with arm64's hugetlb pages
> implemented with contiguous PTEs/PMDs. You might have to round
> `address` down to make sure you've picked the first PTE/PMD in the
> group.

I hadn't thought about the sub-PMD size hugetlb issue either.  We can
certainly limit the support to require alignment to the appropriate
size.



* Re: Unifying page table walkers
  2024-06-06 20:04   ` Matthew Wilcox
  2024-06-06 20:23     ` James Houghton
@ 2024-06-06 21:33     ` Peter Xu
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Xu @ 2024-06-06 21:33 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: James Houghton, Khalid Aziz, Vishal Moola, Jane Chu, Muchun Song,
	linux-mm, Michal Hocko, Yu Zhao

On Thu, Jun 06, 2024 at 09:04:53PM +0100, Matthew Wilcox wrote:
> On Thu, Jun 06, 2024 at 12:30:44PM -0700, James Houghton wrote:
> > Today the VM_HUGETLB flag tells the fault handler to call into
> > hugetlb_fault() (there are many other special cases, but this one is
> > probably the most important). How should faults on VMAs without
> > VM_HUGETLB that map HugeTLB folios be handled? If you handle faults
> > with the main mm fault handler without getting rid of hugetlb_fault(),
> > I think you're basically implementing a second, more tmpfs-like
> > hugetlbfs... right?
> > 
> > I don't really have anything against this approach, but I think the
> > decision was to reduce the number of special cases as much as we can
> > first before attempting to rewrite hugetlbfs.
> > 
> > Or maybe I've got something wrong and what you're asking doesn't
> > logically end up at a hugetlbfs v2.
> 
> Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> Once there, we'll do:
> 
>         vmf.pud = pud_alloc(mm, p4d, address);
>         if (pud_none(*vmf.pud) &&
>             thp_vma_allowable_order(vma, vm_flags,
>                                 TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
>                 ret = create_huge_pud(&vmf);
> 
> which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
> 
> So all we need to do is implement huge_fault in hugetlb_vm_ops.  I
> don't think that's the same as creating a hugetlbfs2 because it's just
> another entry point.  You can mmap() the same file both ways and it's
> all cache coherent.

Matthew, could you elaborate more on how hugetlb_vm_ops.huge_fault() could
start injecting hugetlb pages without a hugetlb VMA?  I mean, at least
currently what I read is that we always define a hugetlb VMA as:

	vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND);
	vma->vm_ops = &hugetlb_vm_ops;

So any vma that uses hugetlb_vm_ops will have VM_HUGETLB set for sure.

If you're talking about some other VMAs, it sounds to me like this
huge_fault() should belong to that new VMA's vm_ops.  Then it sounds like
a way for a non-hugetlb VMA to reuse the hugetlb allocator/pool of huge
pages.  I'm not sure I understand it right, though.

Regarding the 4k mapping plan for hugetlb: I talked to Michal, Yu and
some other people during LSF/MM, and so far it seems to me the best way
forward is to allow shmem to provide 1G pages.

Again, IMHO I'd be totally fine if we finish the cleanup but just add HGM
on top of hugetlb v1, but it looks like I'm the only person who thinks
that way.  If we can introduce 1G pages to shmem, then as long as those
1G pages can be as stable as a hugetlb v1 1G page, that's good enough for
the VM use case (which I care about, and I believe so do most cloud
providers who care about postcopy), and adding 4k mappings on top of that
can avoid all the hugetlb concerns people have too (even though I think
most of the logic that HGM wants will still be needed).

That also roughly matches TAO's design, where we may have a better
chance of allocating THPs dynamically on top of 1G regions; however, for
the VM context we want reliable 1G pages, not splittable ones.  We may
want them splittable only at the page table level, not the folios.

However, I don't yet think any of these are solid ideas.  It might be
interesting to know how your thoughts relate to this too, since I think
you mentioned the 4k mapping somewhere.  I'm also taking the liberty of
copying the relevant people, just in case this discussion is relevant to
them.

[PS: I will be off work tomorrow, so please expect a delay on follow-up
 emails.]

Thanks,

-- 
Peter Xu




* Re: Unifying page table walkers
  2024-06-06 18:29 Unifying page table walkers Matthew Wilcox
  2024-06-06 19:30 ` James Houghton
@ 2024-06-06 21:49 ` Peter Xu
  2024-06-07  5:07   ` Oscar Salvador
  2024-06-07  6:59 ` David Hildenbrand
  2 siblings, 1 reply; 13+ messages in thread
From: Peter Xu @ 2024-06-06 21:49 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Khalid Aziz, Vishal Moola, Jane Chu, Muchun Song, linux-mm

On Thu, Jun 06, 2024 at 07:29:22PM +0100, Matthew Wilcox wrote:
> The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
> is that it has a different locking context.  It is called with the
> hugetlb_vma_lock held for read (nb: this is not the same as the vma
> lock; see walk_hugetlb_range()).  Why do we need this?  Because of page
> table sharing.

Just to quickly comment on this one: I think it's more than the per-vma
lock.  Oscar is actually working on this together with me (we have had
plenty of discussions, but so far all offlist...), and the locking
context for the hugetlb_entry() path is as simple as this after the
refactor:

https://github.com/leberus/linux/commit/88e56c1ecaf8c64ba9165aeba74335bdc15d1b56

hugetlb_entry() also existed because it is the only sane way to link
into the hugetlb API (it used to be huge_pte_offset(), I believe; now
hugetlb_walk()), which always walks to a specific level of the hugetlb
page table without even telling the caller which level (hence the pte_t *
force-cast trick).  The pxd_entry() callbacks won't apply if we don't
know that information.  So it's probably not only about the locking.

Meanwhile, I have a vague memory that the per-vma lock is also used for
something else, perhaps to keep fallocate() from racing against faults.
But maybe I misremember; I haven't read that part of the code for quite
some time, as our hugetlb refactoring work doesn't need that knowledge:
we simply keep all the existing behaviors.  Maybe Muchun remembers.

Thanks,

-- 
Peter Xu




* Re: Unifying page table walkers
  2024-06-06 21:21       ` Matthew Wilcox
@ 2024-06-06 23:07         ` James Houghton
  2024-06-07  7:15           ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: James Houghton @ 2024-06-06 23:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song,
	linux-mm, David Hildenbrand

On Thu, Jun 6, 2024 at 2:21 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Jun 06, 2024 at 01:23:08PM -0700, James Houghton wrote:
> > On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
> > > Once there, we'll do:
> > >
> > >         vmf.pud = pud_alloc(mm, p4d, address);
> > >         if (pud_none(*vmf.pud) &&
> > >             thp_vma_allowable_order(vma, vm_flags,
> > >                                 TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
> > >                 ret = create_huge_pud(&vmf);
> > >
> > > which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
> > >
> > > So all we need to do is implement huge_fault in hugetlb_vm_ops.  I
> > > don't think that's the same as creating a hugetlbfs2 because it's just
> > > another entry point.  You can mmap() the same file both ways and it's
> > > all cache coherent.
> >
> > That makes a lot of sense. FWIW, this sounds good to me (though I'm
> > curious what Peter thinks :)).
> >
> > But I think you'll need to be careful to ensure that, for now anyway,
> > huge_fault() is always called with the exact same ptep/pmdp/pudp that
> > hugetlb_walk() would have returned (ignoring sharing). If you allow
> > PMD mapping of what would otherwise be PUD-mapped hugetlb pages right
> > now, you'll break the vmemmap optimization (and probably other
> > things).
>
> Why is that?  This sounds like you know something I don't ;-)
> Is it the mapcount issue?

Yeah, that's what I was thinking about. But I guess whether or not you
are compatible with the vmemmap optimization depends on what
mapcounting scheme you use. If you just use the THP one, you're going
to end up incrementing _mapcount on the subpages, and that won't work.

I'm not immediately thinking of other things that would break... need
to think some more.



* Re: Unifying page table walkers
  2024-06-06 21:49 ` Peter Xu
@ 2024-06-07  5:07   ` Oscar Salvador
  0 siblings, 0 replies; 13+ messages in thread
From: Oscar Salvador @ 2024-06-07  5:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Matthew Wilcox, Khalid Aziz, Vishal Moola, Jane Chu, Muchun Song,
	linux-mm

On Thu, Jun 06, 2024 at 05:49:30PM -0400, Peter Xu wrote:
> On Thu, Jun 06, 2024 at 07:29:22PM +0100, Matthew Wilcox wrote:
> > The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
> > is that it has a different locking context.  It is called with the
> > hugetlb_vma_lock held for read (nb: this is not the same as the vma
> > lock; see walk_hugetlb_range()).  Why do we need this?  Because of page
> > table sharing.
> 
> Just to quickly comment on this one: I think it's more than the per-vma
> lock.  Oscar is actually working together with me (we had plenty of
> discussions but so far all offlist...), and the lock context is as simple
> as this after refactor for hugetlb_entry() path:
> 
> https://github.com/leberus/linux/commit/88e56c1ecaf8c64ba9165aeba74335bdc15d1b56

Yes, I reached out to Peter after LSF/MM because I was highly interested
in helping out here.
We agreed that I would take the pagewalk part, and I already have some
patches in the works [1][2], based on a patchset I have been reviewing
that removes hugepd on powerpc [3].

Ideally we should remove the exclusive use of 'pte' from hugetlb (unless
it is a contiguous PTE) and have it use pud/pmd where needed.
E.g. if we look at the s390 version of huge_ptep_get(), which I would
say is the most special one:

huge_ptep_get()->__rste_to_pte() (or the other way around, __pte_to_rste())

what it does is convert a pud/pmd entry into a pte, or vice versa, since
hugetlb "can" only work with ptes, and so you have all these casts back
and forth spread all over.

I first started merging all the .hugetlb_entry functions into the
.pmd_entry callbacks (not done yet; about half-way through) and creating
.pud_entry callbacks, because we will need them since hugetlb can be
PUD-mapped, unlike THP (well, yes, devmap, but most walkers do not care
about it, so they did not create a .pud_entry).

Then I will run some tests on x86_64/arm64/ppc64le and s390 (not sure
whether I will be able to grab one, but let's see), and then I will post
the patchset as an RFC to gather some feedback.

[1] https://github.com/leberus/linux/tree/hugetlb-pagewalk-v2
[2] Do not look too closely, as they are very much a WIP; ignore the
    last 4 commits, which are half-done.
[3] https://patchwork.kernel.org/project/linux-mm/cover/cover.1716815901.git.christophe.leroy@csgroup.eu/


-- 
Oscar Salvador
SUSE Labs



* Re: Unifying page table walkers
  2024-06-06 18:29 Unifying page table walkers Matthew Wilcox
  2024-06-06 19:30 ` James Houghton
  2024-06-06 21:49 ` Peter Xu
@ 2024-06-07  6:59 ` David Hildenbrand
  2024-06-09 20:08   ` Matthew Wilcox
  2 siblings, 1 reply; 13+ messages in thread
From: David Hildenbrand @ 2024-06-07  6:59 UTC (permalink / raw)
  To: Matthew Wilcox, Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu,
	Muchun Song
  Cc: linux-mm

On 06.06.24 20:29, Matthew Wilcox wrote:
> One of the things we discussed at LSFMM was unifying the hugetlb and
> THP page table walkers.  I've been looking into it some more recently;
> I've found a problem and I think a solution.
> 
> The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
> is that it has a different locking context.  It is called with the
> hugetlb_vma_lock held for read (nb: this is not the same as the vma
> lock; see walk_hugetlb_range()).  Why do we need this?  Because of page
> table sharing.
> 
> In a completely separate discussion, I was talking with Khalid about
> mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs
> pages to be mapped by a VMA which does not have the VM_HUGETLB flag set.
> If we do that, the page tables would not be permitted to be shared with
> other users of that hugetlbfs file.  But we want to eliminate support
> for that anyway, so that's more of a feature than a bug.

I am not sure why hugetlb support in mshare would require that (we don't 
need partial mappings and all of that to support mshare+hugetlb).

The possible mshare directions I discussed with Khalid at LSF/MM would 
likely not need that. But I have no idea which mshare design you and 
Khalid are discussing right now. Maybe it would be a good idea for the 
three of us to meet and discuss that, if my feedback/opinion could be 
helpful.

> 
> Once we don't use the VM_HUGETLB flag on these VMAs, that opens the
> door to the other features we want, like mapping individual pages from
> a hugetlb folio.  And we can use the regular page table walkers for
> these VMAs.

Right, but to me that's a different, long-term project that mshare would 
maybe not have to rely on.

-- 
Cheers,

David / dhildenb




* Re: Unifying page table walkers
  2024-06-06 23:07         ` James Houghton
@ 2024-06-07  7:15           ` David Hildenbrand
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2024-06-07  7:15 UTC (permalink / raw)
  To: James Houghton, Matthew Wilcox
  Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm

On 07.06.24 01:07, James Houghton wrote:
> On Thu, Jun 6, 2024 at 2:21 PM Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Thu, Jun 06, 2024 at 01:23:08PM -0700, James Houghton wrote:
>>> On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote:
>>>> Right, so we ignore hugetlb_fault() and call into __handle_mm_fault().
>>>> Once there, we'll do:
>>>>
>>>>          vmf.pud = pud_alloc(mm, p4d, address);
>>>>          if (pud_none(*vmf.pud) &&
>>>>              thp_vma_allowable_order(vma, vm_flags,
>>>>                                  TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) {
>>>>                  ret = create_huge_pud(&vmf);
>>>>
>>>> which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER);
>>>>
>>>> So all we need to do is implement huge_fault in hugetlb_vm_ops.  I
>>>> don't think that's the same as creating a hugetlbfs2 because it's just
>>>> another entry point.  You can mmap() the same file both ways and it's
>>>> all cache coherent.
>>>
>>> That makes a lot of sense. FWIW, this sounds good to me (though I'm
>>> curious what Peter thinks :)).
>>>
>>> But I think you'll need to be careful to ensure that, for now anyway,
>>> huge_fault() is always called with the exact same ptep/pmdp/pudp that
>>> hugetlb_walk() would have returned (ignoring sharing). If you allow
>>> PMD mapping of what would otherwise be PUD-mapped hugetlb pages right
>>> now, you'll break the vmemmap optimization (and probably other
>>> things).
>>
>> Why is that?  This sounds like you know something I don't ;-)
>> Is it the mapcount issue?
> 
> Yeah, that's what I was thinking about. But I guess whether or not you
> are compatible with the vmemmap optimization depends on what
> mapcounting scheme you use. If you just use the THP one, you're going
> to end up incrementing _mapcount on the subpages, and that won't work.
> 
> I'm not immediately thinking of other things that would break... need
> to think some more.

If we're only concerned with unifying most page table walkers, there is 
little reason to mess with hugetlb mapcounts at this point. Mapcounts 
for hugetlb only come into play once we support partial mappings, and I 
consider that yet another step (mapping/unmapping code) beyond unifying 
most page table walkers, which Oscar+Peter have on their plate.

-- 
Cheers,

David / dhildenb




* Re: Unifying page table walkers
  2024-06-07  6:59 ` David Hildenbrand
@ 2024-06-09 20:08   ` Matthew Wilcox
  2024-06-09 20:28     ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2024-06-09 20:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm

On Fri, Jun 07, 2024 at 08:59:17AM +0200, David Hildenbrand wrote:
> On 06.06.24 20:29, Matthew Wilcox wrote:
> > One of the things we discussed at LSFMM was unifying the hugetlb and
> > THP page table walkers.  I've been looking into it some more recently;
> > I've found a problem and I think a solution.
> > 
> > The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
> > is that it has a different locking context.  It is called with the
> > hugetlb_vma_lock held for read (nb: this is not the same as the vma
> > lock; see walk_hugetlb_range()).  Why do we need this?  Because of page
> > table sharing.
> > 
> > In a completely separate discussion, I was talking with Khalid about
> > mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs
> > pages to be mapped by a VMA which does not have the VM_HUGETLB flag set.
> > If we do that, the page tables would not be permitted to be shared with
> > other users of that hugetlbfs file.  But we want to eliminate support
> > for that anyway, so that's more of a feature than a bug.
> 
> I am not sure why hugetlb support in mshare would require that (we don't
> need partial mappings and all of that to support mshare+hugetlb).

You're absolutely right.  My motivation is the other way around.  A large
part of "hugetlbfs is special" is tied to the sharing of page tables.
That's why we have the hugetlb_vma_lock.  If we're already sharing
page tables with mshare, I assert that it is not necessary to also
share page tables with other hugetlb users.  So as part of including
hugetlb support in mshare, we should drop that support, and handle
hugetlb-mapped-with-mshare similarly to THP.  Possibly not the mapcount
parts so that we preserve the HVO.




* Re: Unifying page table walkers
  2024-06-09 20:08   ` Matthew Wilcox
@ 2024-06-09 20:28     ` David Hildenbrand
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2024-06-09 20:28 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm

On 09.06.24 22:08, Matthew Wilcox wrote:
> On Fri, Jun 07, 2024 at 08:59:17AM +0200, David Hildenbrand wrote:
>> On 06.06.24 20:29, Matthew Wilcox wrote:
>>> One of the things we discussed at LSFMM was unifying the hugetlb and
>>> THP page table walkers.  I've been looking into it some more recently;
>>> I've found a problem and I think a solution.
>>>
>>> The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
>>> is that it has a different locking context.  It is called with the
>>> hugetlb_vma_lock held for read (nb: this is not the same as the vma
>>> lock; see walk_hugetlb_range()).  Why do we need this?  Because of page
>>> table sharing.
>>>
>>> In a completely separate discussion, I was talking with Khalid about
>>> mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs
>>> pages to be mapped by a VMA which does not have the VM_HUGETLB flag set.
>>> If we do that, the page tables would not be permitted to be shared with
>>> other users of that hugetlbfs file.  But we want to eliminate support
>>> for that anyway, so that's more of a feature than a bug.
>>
>> I am not sure why hugetlb support in mshare would require that (we don't
>> need partial mappings and all of that to support mshare+hugetlb).
> 
> You're absolutely right.  My motivation is the other way around.  A large
> part of "hugetlbfs is special" is tied to the sharing of page tables.
> That's why we have the hugetlb_vma_lock.  If we're already sharing
> page tables with mshare, I assert that it is not necessary to also
> share page tables with other hugetlb users.  So as part of including
> hugetlb support in mshare, we should drop that support, and handle
> hugetlb-mapped-with-mshare similarly to THP. 

Yes, we should absolutely *not* use hugetlb-page table sharing there!

Regarding THP, I'm not sure how far we should go -- we should make our 
lives easier by not allowing partial mappings initially.

> Possibly not the mapcount
> parts so that we preserve the HVO.

The new mapcount scheme I'll be reviving soon will not work easily with 
the existing hugetlb-page table sharing, simply because unrelated MMs 
can map/unmap pages in there, and we effectively transfer ownership of 
page tables between processes.

As soon as we have one "mm" that owns one set of shared page tables 
(i.e., one mshare-mm that owns a set of shared page tables, and to which 
we effectively account the mappings in there), it should all just work.

-- 
Cheers,

David / dhildenb



