* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
From: Sean Christopherson @ 2024-08-08 15:33 UTC
To: Peter Xu
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, Vlastimil Babka, Dave Hansen, Christophe Leroy,
Huang Ying, kvm, Paolo Bonzini, David Rientjes
On Wed, Aug 07, 2024, Peter Xu wrote:
> mprotect() does mmu notifiers at PMD level. It has been like that since
> 2014, commit a5338093bfb4 ("mm: move mmu notifier call from
> change_protection to change_pmd_range").
>
> At that time, the issue was that NUMA balancing can be applied to a huge
> range of VM memory, even if nothing was populated. The notification can
> be avoided in this case if no valid pmd is detected, where "valid" means
> either a THP or a PTE pgtable page.
>
> Now, to pave the way for PUD handling, this isn't enough. We need to
> generate mmu notifications properly on PUD entries too. mprotect() is
> currently broken on PUDs (e.g., one can already easily trigger a kernel
> error with dax 1G mappings); this is the start of fixing it.
>
> To fix that, this patch proposes to push such notifications to the PUD
> layers.
>
> There is a risk of regressing the problem Rik wanted to resolve before,
> but I think it shouldn't really happen, and I still chose this solution
> for a few reasons:
>
> 1) Consider a large VM that should definitely contain more than GBs of
> memory, it's highly likely that PUDs are also none. In this case there
I don't follow this. Did you mean to say it's highly likely that PUDs are *NOT*
none?
> will have no regression.
>
> 2) KVM has evolved a lot over the years to get rid of rmap walks, which
> might have been the major cause of the previous soft-lockups. At least
> the TDP MMU already got rid of rmaps as long as it's not nested (which
> should be the major use case, IIUC), so the TDP MMU pgtable walker will
> simply see an empty VM pgtable (e.g. EPT on x86); the invalidation of a
> fully empty region should be pretty fast now in most cases, compared
> to 2014.
The TDP MMU will indeed be a-ok. It only zaps leaf SPTEs in response to
mmu_notifier invalidations, and checks NEED_RESCHED after processing each SPTE,
i.e. KVM won't zap an entire PUD and get stuck processing all its children.
I doubt the shadow MMU will fare much better than it did years ago though, AFAICT
the relevant code hasn't changed. E.g. when zapping a large range in response to
an mmu_notifier invalidation, KVM never yields even if blocking is allowed. That
said, it is stupidly easy to fix the soft lockup problem in the shadow MMU. KVM
already has an rmap walk path that plays nice with NEED_RESCHED *and* zaps rmaps,
but because of how things grew organically over the years, KVM never adopted the
cond_resched() logic for the mmu_notifier path.
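Roughly, the shape of the two rmap paths (a heavily simplified sketch; the
names approximate the current code, with most details elided):

	/*
	 * mmu_notifier invalidation path: walk every gfn's rmaps in the
	 * range, without ever yielding, even when blocking is allowed.
	 */
	kvm_handle_gfn_range():
		for_each_slot_rmap_range(...)
			ret |= handler(kvm, rmap, ...);	/* no cond_resched() */

	/*
	 * memslot zap path: the same walk, but it checks for a pending
	 * reschedule between rmap chains.
	 */
	slot_handle_level_range():
		for_each_slot_rmap_range(...) {
			flush |= fn(kvm, rmap, ...);
			if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
				cond_resched_rwlock_write(&kvm->mmu_lock);
		}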
As a bonus, now that .change_pte() is gone, the only other usage of x86's
kvm_handle_gfn_range() is for the aging mmu_notifiers, and I want to move those
to their own flow too[*], i.e. kvm_handle_gfn_range() in its current form can
be removed entirely.
I'll post a separate series, I don't think it needs to block this work, and I'm
fairly certain I can get this done for 6.12 (shouldn't be a large or scary series,
though I may tack on my lockless aging idea as an RFC).
https://lore.kernel.org/all/Zo137P7BFSxAutL2@google.com
* Re: [PATCH v4 5/7] mm/x86: arch_check_zapped_pud()
From: Peter Xu @ 2024-08-08 15:49 UTC
To: Thomas Gleixner
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Vlastimil Babka, Dave Hansen, Christophe Leroy, Huang Ying
On Thu, Aug 08, 2024 at 12:28:47AM +0200, Thomas Gleixner wrote:
> On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
>
> > Subject: mm/x86: arch_check_zapped_pud()
>
> Is not a proper subject line. It clearly lacks a verb.
>
> Subject: mm/x86: Implement arch_check_zapped_pud()
>
>
> > Introduce arch_check_zapped_pud() to sanity check shadow stack on PUD zaps.
> > It has the same logic of the PMD helper.
>
> s/of/as/
Will fix the above two.
>
> > +
> > +void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
> > +{
> > + /* See note in arch_check_zapped_pte() */
> > + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
> > + pud_shstk(pud));
>
> Please get rid of the line break. You have 100 characters.
Coding-style.rst still tells me 80 here:
The preferred limit on the length of a single line is 80 columns.
Statements longer than 80 columns should be broken into sensible chunks,
unless exceeding 80 columns significantly increases readability and does
not hide information.
Maybe this changed very recently, so it's not even in mm-unstable yet?
I'll fix the two line wraps in this patch anyway, as these two cases
didn't even hit 80.. probably because I used fill-column=75 locally..
But I'll still need to figure this out for the other spots. Please help
me understand which limit applies.
>
> > +}
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index 2a6a3cccfc36..2289e9f7aa1b 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -447,6 +447,13 @@ static inline void arch_check_zapped_pmd(struct vm_area_struct *vma,
> > }
> > #endif
> >
> > +#ifndef arch_check_zapped_pud
> > +static inline void arch_check_zapped_pud(struct vm_area_struct *vma,
> > + pud_t pud)
> > +{
>
> Ditto..
>
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 0024266dea0a..81c5da0708ed 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
>
> Why is a mm change burried in a patch which is named mm/x86?
>
> It's clearly documented that core changes with the generic fallback come
> in one patch and the architecture override in a separate one afterwards.
>
> Do we write documentation just for the sake of writing it?
Hmm, if that's the rule, in practice I bet many merged patches (and
whatever patches I used to read) violate it.. but I see now; I'll break
this patch into two.
Personally I'd still prefer it to be one patch, especially since it's
only implemented on x86: then I copy the x86+mm maintainers all here, and
it explains why the change is there. So please let me know if you think
it may still make sense to keep this patch as-is; otherwise I'll split it
by default.
Thanks,
--
Peter Xu
* Re: [PATCH v4 6/7] mm/x86: Add missing pud helpers
From: Peter Xu @ 2024-08-08 20:25 UTC
To: Thomas Gleixner
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Vlastimil Babka, Dave Hansen, Christophe Leroy, Huang Ying
On Thu, Aug 08, 2024 at 12:37:21AM +0200, Thomas Gleixner wrote:
> On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
> > These new helpers will be needed for pud entry updates soon. Introduce
> > these helpers by referencing the pmd ones. Namely:
> >
> > - pudp_invalidate()
> > - pud_modify()
>
> Zero content about what these helpers do and why they are needed. That's
> not how it works, really.
I had hoped that adding "by referencing the pmd ones", as Dave suggested,
would be helpful, because they are exactly referencing those.
But sure.. I rewrote the commit message as follows:
mm/x86: Add missing pud helpers
Some new helpers will be needed for pud entry updates soon. Introduce
these helpers by referencing the pmd ones. Namely:
- pudp_invalidate(): this helper invalidates a huge pud before a split
happens, so that the invalidated pud entry makes sure no race can
happen (either with software, like a concurrent zap, or with hardware,
like losing the A/D bits).
- pud_modify(): this helper applies a new pgprot to an existing huge pud
mapping.
For more information on why we need these two helpers, please refer to the
corresponding pmd helpers in the mprotect() code path.
While at it, simplify the pud_modify()/pmd_modify() comments on shadow
stack pgtable entries to reference pte_modify(), to avoid duplicating the
whole paragraph three times.
Please let me know if this works for you.
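For extra context, the call shape these two helpers are meant for is
roughly the following (a sketch mirroring the existing pmd path in
change_huge_pmd(); not the exact new code):

	pud_t oldpud, entry;

	/*
	 * Invalidate first, so that neither a concurrent zap nor a
	 * hardware A/D bit update can race with the modification.
	 */
	oldpud = pudp_invalidate(vma, addr, pudp);
	entry = pud_modify(oldpud, newprot);
	set_pud_at(vma->vm_mm, addr, pudp, entry);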
>
>
> > +static inline pud_t pud_mkinvalid(pud_t pud)
> > +{
> > + return pfn_pud(pud_pfn(pud),
> > + __pgprot(pud_flags(pud) & ~(_PAGE_PRESENT|_PAGE_PROTNONE)));
>
> 100 characters...
I'm still waiting for an answer on that one, so I'll ignore the "100
char" comments for now and address them once I get a clue about what is
expected..
>
> > +}
> > +
> > static inline u64 flip_protnone_guard(u64 oldval, u64 val, u64 mask);
> >
> > static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> > @@ -834,14 +840,8 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> > pmd_result = __pmd(val);
> >
> > /*
> > - * To avoid creating Write=0,Dirty=1 PMDs, pte_modify() needs to avoid:
> > - * 1. Marking Write=0 PMDs Dirty=1
> > - * 2. Marking Dirty=1 PMDs Write=0
> > - *
> > - * The first case cannot happen because the _PAGE_CHG_MASK will filter
> > - * out any Dirty bit passed in newprot. Handle the second case by
> > - * going through the mksaveddirty exercise. Only do this if the old
> > - * value was Write=1 to avoid doing this on Shadow Stack PTEs.
> > + * Avoid creating shadow stack PMD by accident. See comment in
> > + * pte_modify().
>
> The changelog is utterly silent about this comment update.
Updated my commit message altogether above; I hope it works.
>
> > */
> > if (oldval & _PAGE_RW)
> > pmd_result = pmd_mksaveddirty(pmd_result);
> > @@ -851,6 +851,29 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> > return pmd_result;
> > }
> >
> > +static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
> > +{
> > + pudval_t val = pud_val(pud), oldval = val;
> > + pud_t pud_result;
> > +
> > + val &= _HPAGE_CHG_MASK;
> > + val |= check_pgprot(newprot) & ~_HPAGE_CHG_MASK;
> > + val = flip_protnone_guard(oldval, val, PHYSICAL_PUD_PAGE_MASK);
> > +
> > + pud_result = __pud(val);
> > +
> > + /*
> > + * Avoid creating shadow stack PUD by accident. See comment in
> > + * pte_modify().
> > + */
> > + if (oldval & _PAGE_RW)
> > + pud_result = pud_mksaveddirty(pud_result);
> > + else
> > + pud_result = pud_clear_saveddirty(pud_result);
> > +
> > + return pud_result;
> > +}
> > +
> > /*
> > * mprotect needs to preserve PAT and encryption bits when updating
> > * vm_page_prot
> > @@ -1389,10 +1412,26 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> > }
> > #endif
> >
> > +static inline pud_t pudp_establish(struct vm_area_struct *vma,
> > + unsigned long address, pud_t *pudp, pud_t pud)
>
> Random line break alignment.... See documentation.
This is exactly what we do with pmdp_establish() right above.
Would you be fine if I keep this, just to make pmd/pud look the same?
>
> > +{
> > + page_table_check_pud_set(vma->vm_mm, pudp, pud);
> > + if (IS_ENABLED(CONFIG_SMP)) {
> > + return xchg(pudp, pud);
> > + } else {
> > + pud_t old = *pudp;
> > + WRITE_ONCE(*pudp, pud);
>
> Lacks a newline between variable declaration and code.
>
> But seriously, why optimizing for !SMP? That's a pointless exercise and
> a guarantee for bitrot.
So far it still looks reasonable to me as long as it is there in the pmd
version. If I remove it, people can complain again: "hey, we have this
/optimization/ in pmd, why did you remove it in pud?". There's no end to
it..
So again.. it's the same as the pmd one. Either we change nothing, or we
change both. I don't want to expand this series beyond what it wants to
do.. would you mind if I keep it until someone justifies modifying both?
>
> > + return old;
> > + }
> > +}
> > +
> > #define __HAVE_ARCH_PMDP_INVALIDATE_AD
> > extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
> > unsigned long address, pmd_t *pmdp);
> >
> > +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
> > + pud_t *pudp);
>
> While 'extern' is not required, please keep the file style consistent
> and use the 100 characters...
>
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -641,6 +641,18 @@ pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> > }
> > #endif
> >
> > +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> > + defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> > +pud_t pudp_invalidate(struct vm_area_struct *vma, unsigned long address,
> > + pud_t *pudp)
> > +{
> > + VM_WARN_ON_ONCE(!pud_present(*pudp));
> > + pud_t old = pudp_establish(vma, address, pudp, pud_mkinvalid(*pudp));
> > + flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
> > + return old;
>
> Your keyboard clearly lacks a newline key ...
This is the same again: pmdp_invalidate() is coded exactly like that.
In general, I prefer keeping the pmd/pud helpers looking the same all
over the places, if you won't disagree as of now.
I know that it might even be better to clean up everything, get the pmd
changes reviewed first, then clone those to pud. That might indeed be
cleaner. But it adds much fuss all over the places, and even with this
series I got stuck for months.. and so far I haven't yet posted what I
really wanted to post (huge pfnmaps). I sincerely hope we can move
forward with this and clean things up later (none of these are major
issues; all trivial details so far, IMHO, so nothing I see has gone
severely wrong yet), and then each cleanup will need to be justified one
by one, on each spot.
Thanks,
--
Peter Xu
* Re: [PATCH v4 5/7] mm/x86: arch_check_zapped_pud()
From: David Hildenbrand @ 2024-08-08 20:45 UTC
To: Peter Xu, Thomas Gleixner
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, Vlastimil Babka,
Dave Hansen, Christophe Leroy, Huang Ying
>>> +void arch_check_zapped_pud(struct vm_area_struct *vma, pud_t pud)
>>> +{
>>> + /* See note in arch_check_zapped_pte() */
>>> + VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) &&
>>> + pud_shstk(pud));
>>
>> Please get rid of the line break. You have 100 characters.
>
> Coding-style.rst still tells me 80 here:
>
> The preferred limit on the length of a single line is 80 columns.
>
> Statements longer than 80 columns should be broken into sensible chunks,
> unless exceeding 80 columns significantly increases readability and does
> not hide information.
>
> Maybe this changed very recently, so it's not even in mm-unstable yet?
>
> I'll fix the two line wraps in this patch anyway, as these two cases
> didn't even hit 80.. probably because I used fill-column=75 locally..
>
> But I'll still need to figure this out for the other spots. Please help
> me understand which limit applies.
My interpretation (the doc is not completely clear to me either, but
checkpatch.pl hardcodes max_line_length=100) is that we can happily use
up to 100 chars.
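(For reference -- at least in recent trees -- scripts/checkpatch.pl
carries "my $max_line_length = 100;" near the top of the file.)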
I also tend to stay within 80 chars unless exceeding them is really
reasonable. Years of Linux kernel hacking really taught my inner self
not to do it.
Here I would agree that having the VM_WARN_ON_ONCE in a single line would
aid readability.
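I.e., taking the hunk quoted above, which would then even fit in 80
columns:

	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_SHADOW_STACK) && pud_shstk(pud));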
An example where 100 chars are likely a bad idea would be code nested so
deeply that most lines start exceeding 80 chars. We should rather fix the
code then -- like the coding style spells out :)
--
Cheers,
David / dhildenb
* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
From: Peter Xu @ 2024-08-08 21:21 UTC
To: Sean Christopherson
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, Vlastimil Babka, Dave Hansen, Christophe Leroy,
Huang Ying, kvm, Paolo Bonzini, David Rientjes
Hi, Sean,
On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> On Wed, Aug 07, 2024, Peter Xu wrote:
> > [...]
> >
> > 1) Consider a large VM that should definitely contain more than GBs of
> > memory, it's highly likely that PUDs are also none. In this case there
>
> I don't follow this. Did you mean to say it's highly likely that PUDs are *NOT*
> none?
I did mean the original wording.
Note that the previous case Rik worked on was about a mostly empty VM
getting NUMA hints applied. So I did mean "PUDs are also none" here, with
the hope that when the NUMA hint applies to any part of the unpopulated
guest memory, it'll find nothing in the PUDs. Here it's mostly not about
a huge PUD mapping, as long as the guest memory is not backed by DAX
(only DAX supports 1G huge puds so far, and hugetlb has its own path in
mprotect, so it must be things like anon or shmem), but about a PUD entry
that contains pmd pgtables. For that part, I was trying to justify "no
pmd pgtable installed" with the fact that a large VM should definitely
contain more than GBs of memory: the PUD ranges should hopefully never
have been accessed, so even the pmd pgtable entries should be missing.
With that, we should hopefully keep avoiding mmu notifications after this
patch, just like it used to when this was done at the pmd level.
>
> > will have no regression.
> >
> > 2) KVM has evolved a lot over the years to get rid of rmap walks, which
> > might have been the major cause of the previous soft-lockups. At least
> > the TDP MMU already got rid of rmaps as long as it's not nested (which
> > should be the major use case, IIUC), so the TDP MMU pgtable walker will
> > simply see an empty VM pgtable (e.g. EPT on x86); the invalidation of a
> > fully empty region should be pretty fast now in most cases, compared
> > to 2014.
>
> The TDP MMU will indeed be a-ok. It only zaps leaf SPTEs in response to
> mmu_notifier invalidations, and checks NEED_RESCHED after processing each SPTE,
> i.e. KVM won't zap an entire PUD and get stuck processing all its children.
>
> I doubt the shadow MMU will fare much better than it did years ago though, AFAICT
> the relevant code hasn't changed. E.g. when zapping a large range in response to
> an mmu_notifier invalidation, KVM never yields even if blocking is allowed. That
> said, it is stupidly easy to fix the soft lockup problem in the shadow MMU. KVM
> already has an rmap walk path that plays nice with NEED_RESCHED *and* zaps rmaps,
> but because of how things grew organically over the years, KVM never adopted the
> cond_resched() logic for the mmu_notifier path.
>
> As a bonus, now that .change_pte() is gone, the only other usage of x86's
> kvm_handle_gfn_range() is for the aging mmu_notifiers, and I want to move those
> to their own flow too[*], i.e. kvm_handle_gfn_range() in its current form can
> be removed entirely.
>
> I'll post a separate series, I don't think it needs to block this work, and I'm
> fairly certain I can get this done for 6.12 (shouldn't be a large or scary series,
> though I may tack on my lockless aging idea as an RFC).
Great, and thanks for all this information! Glad to know.
It makes me feel more confident that this patch shouldn't have any major
side effects, at least on the KVM side.
Thanks,
>
> https://lore.kernel.org/all/Zo137P7BFSxAutL2@google.com
>
--
Peter Xu
* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
From: Sean Christopherson @ 2024-08-08 21:31 UTC
To: Peter Xu
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, Vlastimil Babka, Dave Hansen, Christophe Leroy,
Huang Ying, kvm, Paolo Bonzini, David Rientjes
On Thu, Aug 08, 2024, Peter Xu wrote:
> Hi, Sean,
>
> On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > [...]
> > >
> > > 1) Consider a large VM that should definitely contain more than GBs of
> > > memory, it's highly likely that PUDs are also none. In this case there
> >
> > I don't follow this. Did you mean to say it's highly likely that PUDs are *NOT*
> > none?
>
> I did mean the original wording.
>
> Note that the previous case Rik worked on was about a mostly empty VM
> getting NUMA hints applied. So I did mean "PUDs are also none" here, with
> the hope that when the NUMA hint applies to any part of the unpopulated
> guest memory, it'll find nothing in the PUDs. Here it's mostly not about
> a huge PUD mapping, as long as the guest memory is not backed by DAX
> (only DAX supports 1G huge puds so far, and hugetlb has its own path in
> mprotect, so it must be things like anon or shmem), but about a PUD entry
> that contains pmd pgtables. For that part, I was trying to justify "no
> pmd pgtable installed" with the fact that a large VM should definitely
> contain more than GBs of memory: the PUD ranges should hopefully never
> have been accessed, so even the pmd pgtable entries should be missing.
Ah, now I get what you were saying.
Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn. And
that walk is done without ever yielding, which I suspect is the source of the
soft lockups of yore.
And there's no way around that conundrum (walking rmaps), at least not without a
major rewrite in KVM. In a nested TDP scenario, KVM's stage-2 page tables (for
L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
through the rmaps.
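To illustrate (pseudo-C; the helper name is made up, not the actual code):

	/*
	 * An empty stage-1 PUD doesn't help: for nested TDP, the rmaps of
	 * every gfn in the invalidated range must be checked individually,
	 * because the stage-2 tables for L2 are keyed by L2 gfns.
	 */
	for (gfn = range->start; gfn < range->end; gfn++)
		zap_shadow_rmaps(kvm, gfn);	/* never yields today */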
* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
From: Peter Xu @ 2024-08-08 21:47 UTC
To: Sean Christopherson
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, Vlastimil Babka, Dave Hansen, Christophe Leroy,
Huang Ying, kvm, Paolo Bonzini, David Rientjes
On Thu, Aug 08, 2024 at 02:31:19PM -0700, Sean Christopherson wrote:
> On Thu, Aug 08, 2024, Peter Xu wrote:
> > Hi, Sean,
> >
> > On Thu, Aug 08, 2024 at 08:33:59AM -0700, Sean Christopherson wrote:
> > > On Wed, Aug 07, 2024, Peter Xu wrote:
> > > > [...]
> > > >
> > > > 1) Consider a large VM that should definitely contain more than GBs of
> > > > memory, it's highly likely that PUDs are also none. In this case there
> > >
> > > I don't follow this. Did you mean to say it's highly likely that PUDs are *NOT*
> > > none?
> >
> > [...]
>
> Ah, now I get what you were saying.
>
> Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
> empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
> the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn. And
> that walk is done without ever yielding, which I suspect is the source of the
> soft lockups of yore.
>
> And there's no way around that conundrum (walking rmaps), at least not without a
> major rewrite in KVM. In a nested TDP scenario, KVM's stage-2 page tables (for
> L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
> through the rmaps.
I think the hope here is that when the PUDs being hinted are all empty,
with no pgtable installed, there will be no mmu notifier kicked off at
all.
To be explicit, I mean that after this patch is applied, the pud loop for
NUMA hints looks like this:
FOR_EACH_PUD() {
...
if (pud_none(pud))
continue;
if (!range.start) {
mmu_notifier_range_init(&range,
MMU_NOTIFY_PROTECTION_VMA, 0,
vma->vm_mm, addr, end);
mmu_notifier_invalidate_range_start(&range);
}
...
}
So the hope is that pud_none() is always true for the hinted area (just
like pmd_none() could hopefully always be true before), and then we skip
the mmu notifier as a whole (including KVM's)!
Thanks,
--
Peter Xu
* Re: [PATCH v4 2/7] mm/mprotect: Push mmu notifier to PUDs
From: Sean Christopherson @ 2024-08-08 22:45 UTC
To: Peter Xu
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Thomas Gleixner, Vlastimil Babka, Dave Hansen, Christophe Leroy,
Huang Ying, kvm, Paolo Bonzini, David Rientjes
On Thu, Aug 08, 2024, Peter Xu wrote:
> On Thu, Aug 08, 2024 at 02:31:19PM -0700, Sean Christopherson wrote:
> > [...]
> >
> > Ah, now I get what you were saying.
> >
> > Problem is, walking the rmaps for the shadow MMU doesn't benefit (much) from
> > empty PUDs, because KVM needs to blindly walk the rmaps for every gfn covered by
> > the PUD to see if there are any SPTEs in any shadow MMUs mapping that gfn. And
> > that walk is done without ever yielding, which I suspect is the source of the
> > soft lockups of yore.
> >
> > And there's no way around that conundrum (walking rmaps), at least not without a
> > major rewrite in KVM. In a nested TDP scenario, KVM's stage-2 page tables (for
> > L2) key off of L2 gfns, not L1 gfns, and so the only way to find mappings is
> > through the rmaps.
>
> I think the hope here is that when the PUDs being hinted are all empty,
> with no pgtable installed, there will be no mmu notifier kicked off at
> all.
>
> To be explicit, I mean that after this patch is applied, the pud loop for
> NUMA hints looks like this:
>
> FOR_EACH_PUD() {
> ...
> if (pud_none(pud))
> continue;
>
> if (!range.start) {
> mmu_notifier_range_init(&range,
> MMU_NOTIFY_PROTECTION_VMA, 0,
> vma->vm_mm, addr, end);
> mmu_notifier_invalidate_range_start(&range);
> }
> ...
> }
>
> So the hope is that pud_none() is always true for the hinted area (just
> like pmd_none() could hopefully always be true before), and then we skip
> the mmu notifier as a whole (including KVM's)!
Gotcha, that makes sense. Too many page tables flying around :-)
* Re: [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit
From: Thomas Gleixner @ 2024-08-09 12:08 UTC
To: Peter Xu
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Vlastimil Babka, Dave Hansen, Christophe Leroy, Huang Ying
On Thu, Aug 08 2024 at 10:54, Peter Xu wrote:
> On Thu, Aug 08, 2024 at 12:22:38AM +0200, Thomas Gleixner wrote:
>> On Wed, Aug 07 2024 at 15:48, Peter Xu wrote:
>> > An entry should be reported as PUD leaf even if it's PROT_NONE, in which
>> > case PRESENT bit isn't there. I hit bad pud without this when testing dax
>> > 1G on zapping a PROT_NONE PUD.
>>
>> That does not qualify as a change log. What you hit is irrelevant unless
>> you explain the actual underlying problem. See Documentation/process/
>> including the TIP documentation.
>
> Firstly, thanks a lot for the reviews.
>
> I thought the commit message explained exactly what the underlying
> problem is, no?
>
> The problem is that even if an entry is PROT_NONE, as long as the PSE
> bit is set on the PUD, it should be treated as a PUD leaf.
Sure. But why should it be treated so.
> Currently, the code will return pud_leaf()==false for those PROT_NONE
> PUD entries, and IMHO that is wrong.
Your humble opinion is fine, but hardly a technical argument.
> This patch wants to make it right. I still think that's mostly what I put
> there in the commit message..
>
> Would you please suggest something so I can try to make it better,
> otherwise? It would also be helpful if you could point out which part of
> the two documents I should reference.
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
A good structure is to explain the context, the problem and the
solution in separate paragraphs, in this order.
>> > diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> > index e39311a89bf4..a2a3bd4c1bda 100644
>> > --- a/arch/x86/include/asm/pgtable.h
>> > +++ b/arch/x86/include/asm/pgtable.h
>> > @@ -1078,8 +1078,7 @@ static inline pmd_t *pud_pgtable(pud_t pud)
>> > #define pud_leaf pud_leaf
>> > static inline bool pud_leaf(pud_t pud)
>> > {
>> > - return (pud_val(pud) & (_PAGE_PSE | _PAGE_PRESENT)) ==
>> > - (_PAGE_PSE | _PAGE_PRESENT);
>> > + return pud_val(pud) & _PAGE_PSE;
>> > }
>>
>> And the changelog does not explain why this change is not affecting any
>> existing user of pud_leaf().
>
> That's what I want to do: I want to affect them..
Fine. But the change log does not tell me what the actual problem is ("I
hit something" does not qualify), and "should be reported" is not helpful
either, as it does not explain anything.
> And IMHO it was mostly fine before because mprotect() was broken with 1G
> anyway; I guess nobody managed to populate any pud entry with PROT_NONE
> on dax 1G before, and that's what this whole series is trying to fix.
Again your humble opinion matters, but technical facts and analysis
matter way more.
Thanks,
tglx
* Re: [PATCH v4 4/7] mm/x86: Make pud_leaf() only care about PSE bit
From: Peter Xu @ 2024-08-09 13:53 UTC
To: Thomas Gleixner
Cc: linux-kernel, linux-mm, Aneesh Kumar K . V, Michael Ellerman,
Oscar Salvador, Dan Williams, James Houghton, Matthew Wilcox,
Nicholas Piggin, Rik van Riel, Dave Jiang, Andrew Morton, x86,
Ingo Molnar, Rick P Edgecombe, Kirill A . Shutemov, linuxppc-dev,
Mel Gorman, Hugh Dickins, Borislav Petkov, David Hildenbrand,
Vlastimil Babka, Dave Hansen, Christophe Leroy, Huang Ying
On Fri, Aug 09, 2024 at 02:08:28PM +0200, Thomas Gleixner wrote:
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#changelog
>
> A good structure is to explain the context, the problem and the
> solution in separate paragraphs, in this order.
I'll try to follow, thanks.
[...]
> > And IMHO it was mostly fine before because mprotect() was broken with 1G
> > anyway; I guess nobody managed to populate any pud entry with PROT_NONE
> > on dax 1G before, and that's what this whole series is trying to fix.
>
> Again your humble opinion matters, but technical facts and analysis
> matter way more.
All the remaining comments in the reply were about "why is it a PUD
leaf", so let me reply in one shot.
Referring to the pXd_leaf() documentation in linux/pgtable.h:
/*
* pXd_leaf() is the API to check whether a pgtable entry is a huge page
* mapping. It should work globally across all archs, without any
* dependency on CONFIG_* options. For architectures that do not support
* huge mappings on specific levels, below fallbacks will be used.
*
* A leaf pgtable entry should always imply the following:
*
* - It is a "present" entry. IOW, before using this API, please check it
* with pXd_present() first. NOTE: it may not always mean the "present
* bit" is set. For example, PROT_NONE entries are always "present".
*
* - It should _never_ be a swap entry of any type. Above "present" check
* should have guarded this, but let's be crystal clear on this.
*
* - It should contain a huge PFN, which points to a huge page larger than
* PAGE_SIZE of the platform. The PFN format isn't important here.
*
* - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(),
* pXd_devmap(), or hugetlb mappings).
*/
It explicitly states that PROT_NONE entries should be treated as present,
and also as leaves. The document is for pXd_leaf(), so it should cover
puds too. In this specific case of the zapping path, the entry can only
be a DAX 1G THP. But pud_leaf() should work for hugetlb too, for example
when PROT_NONE is applied on top of a 1G hugetlb page with PSE set.
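As a concrete example, take a PROT_NONE huge pud on x86, i.e. an entry
with PRESENT cleared but PROTNONE and PSE set (illustrative only):

	/* pud_val(pud): _PAGE_PSE|_PAGE_PROTNONE set, _PAGE_PRESENT clear */

	/*
	 * Old: (val & (_PAGE_PSE | _PAGE_PRESENT)) == (_PAGE_PSE | _PAGE_PRESENT)
	 *      -> false, so the zap path trips over a "bad pud".
	 * New: val & _PAGE_PSE
	 *      -> true, reported as a leaf, matching the document above.
	 */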
Unfortunately, I wrote that document myself in 64078b3d57.. so it's also
just another way of saying "my humble opinion".. it's just that nobody
has disagreed so far; please shoot if you see any issue with it.
IOW, I don't think we must define pXd_leaf() like this - we used to
define pXd_leaf() to cover migration entries on x86, for example. But per
my own past mm experience, the current way is the right thing to do, and
it makes everything much easier and less error-prone. Sorry, I can't get
rid of the "IMHO" here.
Another example of "we can define pXd_leaf() in other ways": I believe
the PPC 8xx series could make special use of pmd_leaf() by allowing it to
return true even for two contiguous pte pgtables covering 8MB of memory.
But that would be an extremely special use of pmd_leaf() even if it
comes, maybe worth an update to the document above when it happens, and
it would only be used by powerpc, no other arch. It won't happen if we
drop 8MB support, though.
So in short, I don't think there's a 100% correct "technical" answer to
"how to define pXd_leaf()"; things just keep evolving, and "humble
opinions" keep coming, with good reasons. I hope that answers the
question to some extent.
Taking all of this into account, I wonder whether the enriched commit
message below would get me closer to your ACK, trying to follow the rule
you referenced on how context/problem/solution should be ordered and
addressed:
When working on mprotect() for 1G dax entries, I hit a "bad pud" error
when zapping a huge pud that has PROT_NONE permission.
The problem is that x86's pud_leaf() requires both the PRESENT and PSE
bits to be set to report a pud entry as a leaf. That doesn't look
right, as it doesn't follow the pXd_leaf() definition we have stuck
with so far, where PROT_NONE entries should be reported as leaves.
To fix it, change x86's pud_leaf() implementation to only check the PSE
bit when reporting a leaf, irrespective of whether the PRESENT bit is
set.
Thanks,
--
Peter Xu