* Unifying page table walkers
@ 2024-06-06 18:29 Matthew Wilcox
From: Matthew Wilcox @ 2024-06-06 18:29 UTC (permalink / raw)
To: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song; +Cc: linux-mm
One of the things we discussed at LSFMM was unifying the hugetlb and
THP page table walkers. I've been looking into it some more recently;
I've found a problem and I think a solution.
The reason we have a separate hugetlb_entry from pmd_entry and pud_entry
is that it has a different locking context. It is called with the
hugetlb_vma_lock held for read (nb: this is not the same as the vma
lock; see walk_hugetlb_range()). Why do we need this? Because of page
table sharing.
In a completely separate discussion, I was talking with Khalid about
mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs
pages to be mapped by a VMA which does not have the VM_HUGETLB flag set.
If we do that, the page tables would not be permitted to be shared with
other users of that hugetlbfs file. But we want to eliminate support
for that anyway, so that's more of a feature than a bug.
Once we don't use the VM_HUGETLB flag on these VMAs, that opens the
door to the other features we want, like mapping individual pages from
a hugetlb folio. And we can use the regular page table walkers for
these VMAs.
Is this a reasonable path forward, or have I overlooked something?
^ permalink raw reply [flat|nested] 13+ messages in thread* Re: Unifying page table walkers 2024-06-06 18:29 Unifying page table walkers Matthew Wilcox @ 2024-06-06 19:30 ` James Houghton 2024-06-06 20:04 ` Matthew Wilcox 2024-06-06 21:49 ` Peter Xu 2024-06-07 6:59 ` David Hildenbrand 2 siblings, 1 reply; 13+ messages in thread From: James Houghton @ 2024-06-06 19:30 UTC (permalink / raw) To: Matthew Wilcox Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm On Thu, Jun 6, 2024 at 11:29 AM Matthew Wilcox <willy@infradead.org> wrote: > > One of the things we discussed at LSFMM was unifying the hugetlb and > THP page table walkers. I've been looking into it some more recently; > I've found a problem and I think a solution. > > The reason we have a separate hugetlb_entry from pmd_entry and pud_entry > is that it has a different locking context. It is called with the > hugetlb_vma_lock held for read (nb: this is not the same as the vma > lock; see walk_hugetlb_range()). Why do we need this? Because of page > table sharing. > > In a completely separate discussion, I was talking with Khalid about > mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs > pages to be mapped by a VMA which does not have the VM_HUGETLB flag set. > If we do that, the page tables would not be permitted to be shared with > other users of that hugetlbfs file. But we want to eliminate support > for that anyway, so that's more of a feature than a bug. > > Once we don't use the VM_HUGETLB flag on these VMAs, that opens the > door to the other features we want, like mapping individual pages from > a hugetlb folio. And we can use the regular page table walkers for > these VMAs. > > Is this a reasonable path forward, or have I overlooked something? Hi Matthew, Today the VM_HUGETLB flag tells the fault handler to call into hugetlb_fault() (there are many other special cases, but this one is probably the most important). 
How should faults on VMAs without VM_HUGETLB that map HugeTLB folios be handled? If you handle faults with the main mm fault handler without getting rid of hugetlb_fault(), I think you're basically implementing a second, more tmpfs-like hugetlbfs... right?

I don't really have anything against this approach, but I think the decision was to reduce the number of special cases as much as we can first before attempting to rewrite hugetlbfs.

Or maybe I've got something wrong and what you're asking doesn't logically end up at a hugetlbfs v2.
* Re: Unifying page table walkers 2024-06-06 19:30 ` James Houghton @ 2024-06-06 20:04 ` Matthew Wilcox 2024-06-06 20:23 ` James Houghton 2024-06-06 21:33 ` Peter Xu 0 siblings, 2 replies; 13+ messages in thread From: Matthew Wilcox @ 2024-06-06 20:04 UTC (permalink / raw) To: James Houghton Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm On Thu, Jun 06, 2024 at 12:30:44PM -0700, James Houghton wrote: > Today the VM_HUGETLB flag tells the fault handler to call into > hugetlb_fault() (there are many other special cases, but this one is > probably the most important). How should faults on VMAs without > VM_HUGETLB that map HugeTLB folios be handled? If you handle faults > with the main mm fault handler without getting rid of hugetlb_fault(), > I think you're basically implementing a second, more tmpfs-like > hugetlbfs... right? > > I don't really have anything against this approach, but I think the > decision was to reduce the number of special cases as much as we can > first before attempting to rewrite hugetlbfs. > > Or maybe I've got something wrong and what you're asking doesn't > logically end up at a hugetlbfs v2. Right, so we ignore hugetlb_fault() and call into __handle_mm_fault(). Once there, we'll do: vmf.pud = pud_alloc(mm, p4d, address); if (pud_none(*vmf.pud) && thp_vma_allowable_order(vma, vm_flags, TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) { ret = create_huge_pud(&vmf); which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER); So all we need to do is implement huge_fault in hugetlb_vm_ops. I don't think that's the same as creating a hugetlbfs2 because it's just another entry point. You can mmap() the same file both ways and it's all cache coherent. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Unifying page table walkers 2024-06-06 20:04 ` Matthew Wilcox @ 2024-06-06 20:23 ` James Houghton 2024-06-06 21:21 ` Matthew Wilcox 2024-06-06 21:33 ` Peter Xu 1 sibling, 1 reply; 13+ messages in thread From: James Houghton @ 2024-06-06 20:23 UTC (permalink / raw) To: Matthew Wilcox Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote: > Right, so we ignore hugetlb_fault() and call into __handle_mm_fault(). > Once there, we'll do: > > vmf.pud = pud_alloc(mm, p4d, address); > if (pud_none(*vmf.pud) && > thp_vma_allowable_order(vma, vm_flags, > TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) { > ret = create_huge_pud(&vmf); > > which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER); > > So all we need to do is implement huge_fault in hugetlb_vm_ops. I > don't think that's the same as creating a hugetlbfs2 because it's just > another entry point. You can mmap() the same file both ways and it's > all cache coherent. That makes a lot of sense. FWIW, this sounds good to me (though I'm curious what Peter thinks :)). But I think you'll need to be careful to ensure that, for now anyway, huge_fault() is always called with the exact same ptep/pmdp/pudp that hugetlb_walk() would have returned (ignoring sharing). If you allow PMD mapping of what would otherwise be PUD-mapped hugetlb pages right now, you'll break the vmemmap optimization (and probably other things). Also I'm not sure how this will interact with arm64's hugetlb pages implemented with contiguous PTEs/PMDs. You might have to round `address` down to make sure you've picked the first PTE/PMD in the group. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Unifying page table walkers 2024-06-06 20:23 ` James Houghton @ 2024-06-06 21:21 ` Matthew Wilcox 2024-06-06 23:07 ` James Houghton 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2024-06-06 21:21 UTC (permalink / raw) To: James Houghton Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm On Thu, Jun 06, 2024 at 01:23:08PM -0700, James Houghton wrote: > On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote: > > Right, so we ignore hugetlb_fault() and call into __handle_mm_fault(). > > Once there, we'll do: > > > > vmf.pud = pud_alloc(mm, p4d, address); > > if (pud_none(*vmf.pud) && > > thp_vma_allowable_order(vma, vm_flags, > > TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) { > > ret = create_huge_pud(&vmf); > > > > which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER); > > > > So all we need to do is implement huge_fault in hugetlb_vm_ops. I > > don't think that's the same as creating a hugetlbfs2 because it's just > > another entry point. You can mmap() the same file both ways and it's > > all cache coherent. > > That makes a lot of sense. FWIW, this sounds good to me (though I'm > curious what Peter thinks :)). > > But I think you'll need to be careful to ensure that, for now anyway, > huge_fault() is always called with the exact same ptep/pmdp/pudp that > hugetlb_walk() would have returned (ignoring sharing). If you allow > PMD mapping of what would otherwise be PUD-mapped hugetlb pages right > now, you'll break the vmemmap optimization (and probably other > things). Why is that? This sounds like you know something I don't ;-) Is it the mapcount issue? > Also I'm not sure how this will interact with arm64's hugetlb pages > implemented with contiguous PTEs/PMDs. You might have to round > `address` down to make sure you've picked the first PTE/PMD in the > group. I hadn't thought about the sub-PMD size hugetlb issue either. 
We can certainly limit the support to require alignment to the appropriate size.
* Re: Unifying page table walkers 2024-06-06 21:21 ` Matthew Wilcox @ 2024-06-06 23:07 ` James Houghton 2024-06-07 7:15 ` David Hildenbrand 0 siblings, 1 reply; 13+ messages in thread From: James Houghton @ 2024-06-06 23:07 UTC (permalink / raw) To: Matthew Wilcox Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm, David Hildenbrand On Thu, Jun 6, 2024 at 2:21 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Jun 06, 2024 at 01:23:08PM -0700, James Houghton wrote: > > On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote: > > > Right, so we ignore hugetlb_fault() and call into __handle_mm_fault(). > > > Once there, we'll do: > > > > > > vmf.pud = pud_alloc(mm, p4d, address); > > > if (pud_none(*vmf.pud) && > > > thp_vma_allowable_order(vma, vm_flags, > > > TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) { > > > ret = create_huge_pud(&vmf); > > > > > > which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER); > > > > > > So all we need to do is implement huge_fault in hugetlb_vm_ops. I > > > don't think that's the same as creating a hugetlbfs2 because it's just > > > another entry point. You can mmap() the same file both ways and it's > > > all cache coherent. > > > > That makes a lot of sense. FWIW, this sounds good to me (though I'm > > curious what Peter thinks :)). > > > > But I think you'll need to be careful to ensure that, for now anyway, > > huge_fault() is always called with the exact same ptep/pmdp/pudp that > > hugetlb_walk() would have returned (ignoring sharing). If you allow > > PMD mapping of what would otherwise be PUD-mapped hugetlb pages right > > now, you'll break the vmemmap optimization (and probably other > > things). > > Why is that? This sounds like you know something I don't ;-) > Is it the mapcount issue? Yeah, that's what I was thinking about. But I guess whether or not you are compatible with the vmemmap optimization depends on what mapcounting scheme you use. 
If you just use the THP one, you're going to end up incrementing _mapcount on the subpages, and that won't work. I'm not immediately thinking of other things that would break... need to think some more.
* Re: Unifying page table walkers 2024-06-06 23:07 ` James Houghton @ 2024-06-07 7:15 ` David Hildenbrand 0 siblings, 0 replies; 13+ messages in thread From: David Hildenbrand @ 2024-06-07 7:15 UTC (permalink / raw) To: James Houghton, Matthew Wilcox Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm On 07.06.24 01:07, James Houghton wrote: > On Thu, Jun 6, 2024 at 2:21 PM Matthew Wilcox <willy@infradead.org> wrote: >> >> On Thu, Jun 06, 2024 at 01:23:08PM -0700, James Houghton wrote: >>> On Thu, Jun 6, 2024 at 1:04 PM Matthew Wilcox <willy@infradead.org> wrote: >>>> Right, so we ignore hugetlb_fault() and call into __handle_mm_fault(). >>>> Once there, we'll do: >>>> >>>> vmf.pud = pud_alloc(mm, p4d, address); >>>> if (pud_none(*vmf.pud) && >>>> thp_vma_allowable_order(vma, vm_flags, >>>> TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) { >>>> ret = create_huge_pud(&vmf); >>>> >>>> which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER); >>>> >>>> So all we need to do is implement huge_fault in hugetlb_vm_ops. I >>>> don't think that's the same as creating a hugetlbfs2 because it's just >>>> another entry point. You can mmap() the same file both ways and it's >>>> all cache coherent. >>> >>> That makes a lot of sense. FWIW, this sounds good to me (though I'm >>> curious what Peter thinks :)). >>> >>> But I think you'll need to be careful to ensure that, for now anyway, >>> huge_fault() is always called with the exact same ptep/pmdp/pudp that >>> hugetlb_walk() would have returned (ignoring sharing). If you allow >>> PMD mapping of what would otherwise be PUD-mapped hugetlb pages right >>> now, you'll break the vmemmap optimization (and probably other >>> things). >> >> Why is that? This sounds like you know something I don't ;-) >> Is it the mapcount issue? > > Yeah, that's what I was thinking about. But I guess whether or not you > are compatible with the vmemmap optimization depends on what > mapcounting scheme you use. 
If you just use the THP one, you're going > to end up incrementing _mapcount on the subpages, and that won't work. > > I'm not immediately thinking of other things that would break... need > to think some more. If we're only concerned about unifying most page table walkers there is little reason to mess with mapcounts of hugetlb at this point. mapcounts with hugetlb only come into play once we support partial mappings, and I consider that yet another step (mapping/unmapping code) than unifying most page table walkers like Oscar+Peter have on their plate. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Unifying page table walkers 2024-06-06 20:04 ` Matthew Wilcox 2024-06-06 20:23 ` James Houghton @ 2024-06-06 21:33 ` Peter Xu 1 sibling, 0 replies; 13+ messages in thread From: Peter Xu @ 2024-06-06 21:33 UTC (permalink / raw) To: Matthew Wilcox Cc: James Houghton, Khalid Aziz, Vishal Moola, Jane Chu, Muchun Song, linux-mm, Michal Hocko, Yu Zhao On Thu, Jun 06, 2024 at 09:04:53PM +0100, Matthew Wilcox wrote: > On Thu, Jun 06, 2024 at 12:30:44PM -0700, James Houghton wrote: > > Today the VM_HUGETLB flag tells the fault handler to call into > > hugetlb_fault() (there are many other special cases, but this one is > > probably the most important). How should faults on VMAs without > > VM_HUGETLB that map HugeTLB folios be handled? If you handle faults > > with the main mm fault handler without getting rid of hugetlb_fault(), > > I think you're basically implementing a second, more tmpfs-like > > hugetlbfs... right? > > > > I don't really have anything against this approach, but I think the > > decision was to reduce the number of special cases as much as we can > > first before attempting to rewrite hugetlbfs. > > > > Or maybe I've got something wrong and what you're asking doesn't > > logically end up at a hugetlbfs v2. > > Right, so we ignore hugetlb_fault() and call into __handle_mm_fault(). > Once there, we'll do: > > vmf.pud = pud_alloc(mm, p4d, address); > if (pud_none(*vmf.pud) && > thp_vma_allowable_order(vma, vm_flags, > TVA_IN_PF | TVA_ENFORCE_SYSFS, PUD_ORDER)) { > ret = create_huge_pud(&vmf); > > which will call vma->vm_ops->huge_fault(vmf, PUD_ORDER); > > So all we need to do is implement huge_fault in hugetlb_vm_ops. I > don't think that's the same as creating a hugetlbfs2 because it's just > another entry point. You can mmap() the same file both ways and it's > all cache coherent. Matthew, could you elaborate more on how hugetlb_vm_ops.huge_fault() could start to inject hugetlb pages without a hugetlb VMA? 
I mean, at least currently what I read is this, where we always define a hugetlb VMA as: vm_flags_set(vma, VM_HUGETLB | VM_DONTEXPAND); vma->vm_ops = &hugetlb_vm_ops; So any vma that uses hugetlb_vm_ops will have VM_HUGETLB set for sure. If you're talking about some other VMAs, it sounds to me like this huge_fault() should belong to that new VMA's vm_ops? Then it sounds like a way for some non-hugetlb VMA to reuse the hugetlb allocator/pool of huge pages. I'm not sure I understand it right, though. Regarding the 4k mapping plan on hugetlb: I talked to Michal, Yu and some other people during LSFMM, and so far it seems to me the best way to go is to allow shmem to provide 1G pages. Again, IMHO I'd be totally fine if we finish the cleanup and just add HGM on top of hugetlb v1, but it looks like I'm the only person who thinks like that. If we can introduce 1G pages to shmem, then as long as those 1G pages can be as stable as a hugetlb v1 1G page, that's good enough for the VM use case (which I care about, and I believe so do most cloud providers who care about postcopy), and then adding 4k mappings on top of that can avoid all the hugetlb concerns people have too (even though I think most of the logic that HGM wants will still be there). That also kind of matches TAO's design, where we may have more chance of having THPs allocated dynamically on 1G; however, the catch for the VM context is that we'll want reliable 1G pages, not split-able ones. We may want them split-able only in the page tables, not the folios. However, I don't yet think any of these are solid ideas. It might be interesting to know how your thoughts correlate to this too, since I think you mentioned the 4k mapping somewhere. I'm also taking the liberty of copying relevant people, just in case this discussion is relevant to them. [PS: I will need to be off work tomorrow, so please expect a delay on follow-up emails.] Thanks, -- Peter Xu
* Re: Unifying page table walkers 2024-06-06 18:29 Unifying page table walkers Matthew Wilcox 2024-06-06 19:30 ` James Houghton @ 2024-06-06 21:49 ` Peter Xu 2024-06-07 5:07 ` Oscar Salvador 2024-06-07 6:59 ` David Hildenbrand 2 siblings, 1 reply; 13+ messages in thread From: Peter Xu @ 2024-06-06 21:49 UTC (permalink / raw) To: Matthew Wilcox; +Cc: Khalid Aziz, Vishal Moola, Jane Chu, Muchun Song, linux-mm On Thu, Jun 06, 2024 at 07:29:22PM +0100, Matthew Wilcox wrote: > The reason we have a separate hugetlb_entry from pmd_entry and pud_entry > is that it has a different locking context. It is called with the > hugetlb_vma_lock held for read (nb: this is not the same as the vma > lock; see walk_hugetlb_range()). Why do we need this? Because of page > table sharing. Just to quickly comment on this one: I think it's more than the per-vma lock. Oscar is actually working together with me (we had plenty of discussions but so far all offlist...), and the lock context is as simple as this after refactor for hugetlb_entry() path: https://github.com/leberus/linux/commit/88e56c1ecaf8c64ba9165aeba74335bdc15d1b56 hugetlb_entry() existed also because that's the only sane way to link to the hugetlb API (used to be huge_pte_offset() I believe, now hugetlb_walk()), which always walk to a specific level of hugetlb pgtable but without even telling the caller (hence the pte_t* force-cast trick). Then pxd_entry() won't apply if we don't know that info. So it's probably not only about the locking. Meanwhile, I had a very vague memory that the per-vma lock is also used for something else, perhaps fallocate() race against faults or something. But maybe I misremembered; I didn't read that part of code for quite some time, as our hugetlb refactoring work doesn't need that knowledge involved: we simply keep all the behaviors. Maybe Muchun could remember. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Unifying page table walkers 2024-06-06 21:49 ` Peter Xu @ 2024-06-07 5:07 ` Oscar Salvador From: Oscar Salvador @ 2024-06-07 5:07 UTC (permalink / raw) To: Peter Xu Cc: Matthew Wilcox, Khalid Aziz, Vishal Moola, Jane Chu, Muchun Song, linux-mm On Thu, Jun 06, 2024 at 05:49:30PM -0400, Peter Xu wrote: > On Thu, Jun 06, 2024 at 07:29:22PM +0100, Matthew Wilcox wrote: > > The reason we have a separate hugetlb_entry from pmd_entry and pud_entry > > is that it has a different locking context. It is called with the > > hugetlb_vma_lock held for read (nb: this is not the same as the vma > > lock; see walk_hugetlb_range()). Why do we need this? Because of page > > table sharing. > > Just to quickly comment on this one: I think it's more than the per-vma > lock. Oscar is actually working together with me (we had plenty of > discussions but so far all offlist...), and the lock context is as simple > as this after refactor for hugetlb_entry() path: > > https://github.com/leberus/linux/commit/88e56c1ecaf8c64ba9165aeba74335bdc15d1b56 Yes, I reached out to Peter after LSFMM because I was highly interested in helping out here. We agreed that I would take the pagewalk part, and I already have some patches in the works [1][2]; they are based on a patchset I have been reviewing that removes hugepd on powerpc [3]. Ideally we should remove the exclusive use of 'pte' from hugetlb (unless it is CONTPTE) and have it use pud/pmd where needed. E.g. if we look at the huge_ptep_get version from s390, which is the most special one I would say: huge_ptep_get()->__rste_to_pte, or the other way around (__pte_to_rste). What it does is try to convert a pud/pmd entry into a pte, or vice versa, since hugetlb "can" only work with ptes, and so you have all these castings back and forth spread all over.
I started by merging all the .hugetlb_entry functions into the .pmd_entry callbacks (not done yet, half-way through) and creating .pud_entry callbacks, which we will need since hugetlb can be PUD-mapped, unlike THP (well, yes, devmap, but most walkers do not care about it, so they did not create a .pud_entry). Then I will run some tests on x86_64/arm64/ppc64le and s390 (not sure whether I will be able to grab one, but let us see), and then post the patchset as an RFC to gather some feedback. [1] https://github.com/leberus/linux/tree/hugetlb-pagewalk-v2 [2] Do not stare too closely, as the patches are very much a WIP; ignore the last 4 commits, which are half-done. [3] https://patchwork.kernel.org/project/linux-mm/cover/cover.1716815901.git.christophe.leroy@csgroup.eu/ -- Oscar Salvador SUSE Labs
* Re: Unifying page table walkers 2024-06-06 18:29 Unifying page table walkers Matthew Wilcox 2024-06-06 19:30 ` James Houghton 2024-06-06 21:49 ` Peter Xu @ 2024-06-07 6:59 ` David Hildenbrand 2024-06-09 20:08 ` Matthew Wilcox 2 siblings, 1 reply; 13+ messages in thread From: David Hildenbrand @ 2024-06-07 6:59 UTC (permalink / raw) To: Matthew Wilcox, Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song Cc: linux-mm On 06.06.24 20:29, Matthew Wilcox wrote: > One of the things we discussed at LSFMM was unifying the hugetlb and > THP page table walkers. I've been looking into it some more recently; > I've found a problem and I think a solution. > > The reason we have a separate hugetlb_entry from pmd_entry and pud_entry > is that it has a different locking context. It is called with the > hugetlb_vma_lock held for read (nb: this is not the same as the vma > lock; see walk_hugetlb_range()). Why do we need this? Because of page > table sharing. > > In a completely separate discussion, I was talking with Khalid about > mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs > pages to be mapped by a VMA which does not have the VM_HUGETLB flag set. > If we do that, the page tables would not be permitted to be shared with > other users of that hugetlbfs file. But we want to eliminate support > for that anyway, so that's more of a feature than a bug. I am not sure why hugetlb support in mshare would require that (we don't need partial mappings and all of that to support mshare+hugetlb). The possible mshare directions I discussed with Khalid at LSF/MM would likely not need that. But I have no idea which mshare design you and Khalid are discussing right now. Maybe it would be a a good idea that the three of us meet to discuss that, if my feedback/opinion could be helpful. > > Once we don't use the VM_HUGETLB flag on these VMAs, that opens the > door to the other features we want, like mapping individual pages from > a hugetlb folio. 
And we can use the regular page table walkers for > these VMAs. Right, but to me that's a different, long-term project that mshare would maybe not have to rely on. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: Unifying page table walkers 2024-06-07 6:59 ` David Hildenbrand @ 2024-06-09 20:08 ` Matthew Wilcox 2024-06-09 20:28 ` David Hildenbrand 0 siblings, 1 reply; 13+ messages in thread From: Matthew Wilcox @ 2024-06-09 20:08 UTC (permalink / raw) To: David Hildenbrand Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm On Fri, Jun 07, 2024 at 08:59:17AM +0200, David Hildenbrand wrote: > On 06.06.24 20:29, Matthew Wilcox wrote: > > One of the things we discussed at LSFMM was unifying the hugetlb and > > THP page table walkers. I've been looking into it some more recently; > > I've found a problem and I think a solution. > > > > The reason we have a separate hugetlb_entry from pmd_entry and pud_entry > > is that it has a different locking context. It is called with the > > hugetlb_vma_lock held for read (nb: this is not the same as the vma > > lock; see walk_hugetlb_range()). Why do we need this? Because of page > > table sharing. > > > > In a completely separate discussion, I was talking with Khalid about > > mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs > > pages to be mapped by a VMA which does not have the VM_HUGETLB flag set. > > If we do that, the page tables would not be permitted to be shared with > > other users of that hugetlbfs file. But we want to eliminate support > > for that anyway, so that's more of a feature than a bug. > > I am not sure why hugetlb support in mshare would require that (we don't > need partial mappings and all of that to support mshare+hugetlb). You're absolutely right. My motivation is the other way around. A large part of "hugetlbfs is special" is tied to the sharing of page tables. That's why we have the hugetlb_vma_lock. If we're already sharing page tables with mshare, I assert that it is not necessary to also share page tables with other hugetlb users. 
So as part of including hugetlb support in mshare, we should drop that support, and handle hugetlb-mapped-with-mshare similarly to THP. Possibly not the mapcount parts, so that we preserve the HVO (hugetlb vmemmap optimization).
* Re: Unifying page table walkers 2024-06-09 20:08 ` Matthew Wilcox @ 2024-06-09 20:28 ` David Hildenbrand 0 siblings, 0 replies; 13+ messages in thread From: David Hildenbrand @ 2024-06-09 20:28 UTC (permalink / raw) To: Matthew Wilcox Cc: Khalid Aziz, Peter Xu, Vishal Moola, Jane Chu, Muchun Song, linux-mm On 09.06.24 22:08, Matthew Wilcox wrote: > On Fri, Jun 07, 2024 at 08:59:17AM +0200, David Hildenbrand wrote: >> On 06.06.24 20:29, Matthew Wilcox wrote: >>> One of the things we discussed at LSFMM was unifying the hugetlb and >>> THP page table walkers. I've been looking into it some more recently; >>> I've found a problem and I think a solution. >>> >>> The reason we have a separate hugetlb_entry from pmd_entry and pud_entry >>> is that it has a different locking context. It is called with the >>> hugetlb_vma_lock held for read (nb: this is not the same as the vma >>> lock; see walk_hugetlb_range()). Why do we need this? Because of page >>> table sharing. >>> >>> In a completely separate discussion, I was talking with Khalid about >>> mshare() support for hugetlbfs, and I suggested that we permit hugetlbfs >>> pages to be mapped by a VMA which does not have the VM_HUGETLB flag set. >>> If we do that, the page tables would not be permitted to be shared with >>> other users of that hugetlbfs file. But we want to eliminate support >>> for that anyway, so that's more of a feature than a bug. >> >> I am not sure why hugetlb support in mshare would require that (we don't >> need partial mappings and all of that to support mshare+hugetlb). > > You're absolutely right. My motivation is the other way around. A large > part of "hugetlbfs is special" is tied to the sharing of page tables. > That's why we have the hugetlb_vma_lock. If we're already sharing > page tables with mshare, I assert that it is not necessary to also > share page tables with other hugetlb users. 
So as part of including > hugetlb support in mshare, we should drop that support, and handle > hugetlb-mapped-with-mshare similarly to THP. Yes, we should absolutely *not* use hugetlb-page table sharing there! Regarding THP, I'm not sure how far we should go -- we should make our lives easier by not allowing partial mappings initially. > Possibly not the mapcount > parts so that we preserve the HVO. The new mapcount scheme I'll be reviving soon will not work easily with the existing hugetlb-page table sharing, simply because unrelated MMs can map/unmap pages in there, and we effectively transfer ownership of page tables between processes. As soon as we have one "mm" that owns one set of shared page tables (i.e., one mshare-mm that owns a set of shared page tables, and to which we effectively account the mappings in there), it should all just work. -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 13+ messages in thread