linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Boone, Max" <mboone@akamai.com>
To: "David Hildenbrand (Arm)" <david@kernel.org>,
	Alex Williamson <alex@shazbot.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R . Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	"Suren Baghdasaryan" <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	"Alex Williamson" <alex@shazbot.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"Tottenham, Max" <mtottenh@akamai.com>,
	"Hunt, Joshua" <johunt@akamai.com>,
	"Pelland, Matt" <mpelland@akamai.com>
Subject: Re: [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps
Date: Wed, 11 Mar 2026 09:42:56 +0000	[thread overview]
Message-ID: <6CBB0401-26DD-4EF9-949C-54BEAD3ABA33@akamai.com> (raw)
In-Reply-To: <0a652e7e-339e-4f98-b591-7fe5680e2006@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 5111 bytes --]


> On Mar 10, 2026, at 4:19 PM, David Hildenbrand (Arm) <david@kernel.org> wrote:
> 
> !-------------------------------------------------------------------|
>  This Message Is From an External Sender
>  This message came from outside your organization.
> |-------------------------------------------------------------------!
>> 
>> I was previously testing on 6.12 and didn’t see any changes to vfio-pci or
>> pagewalk.c which prompted me to check whether I could reproduce the
>> bug in a more recent kernel.  
>> 
>> However, when I tried to reproduce the bug on 7.0-rc2 (after adding some
>> tracing to get a clearer picture of the sequence of events) it doesn’t happen.
>> The VFIO DMA set operation is much faster on 7.0, so possibly the race 
>> window is too small for it to occur in reasonable time.
> 
> Interesting. You could try adding a delay to a test kernel to see if you
> can still provoke it.
> 
> There is the slight possibility that something else fixed the race for
> your reproducer by "accident".
> 
> [...]

Could be that the race window was made a lot smaller by [1], I see that the occurrence
of the race also drops significantly already when I’m just adding some extra trace
logs in the VFIO DMA set functions.

[…]

> 
> Just to double-check: is that expected?
> 
> I wonder why "-EINVAL" would be returned here. Do you know?
> 

I don’t think it’s expected, but at least it’s an error that can be caught in userspace.
I’ll do a bit more digging into why and where that originates, but I think that’ll get a
patch in vfio.

The -EINVAL originates from:

vfio_dma_do_map -> vfio_pin_map_dma -> vfio_pin_pages_remote
-> vaddr_get_pfns -> pin_user_pages_remote (mm/gup.c)

Possibly that’s also the origin of the concurrent PUD modification that requires
the retry in the walker in this patch.

>> 
>> For my own understanding, why is this patch preferred over:
>> - if (pud_none(*pud))
>> + if (pud_none(*pud) || pud_leaf(*pud))
>> in the walk_pud_range function?
> 
> It might currently work for PUDs, but as soon as we have non-present PUD
> entries (like migration entries) the code could become shaky: pud_leaf()
> is only guaranteed to yield the right result if pud_present() is true.
> 
> So I decided to instead make walk_pud_range() look more similar to
> walk_pmd_range(), which is quite helpful for spotting actual differences
> in the logic.
> 
>> 
>> I do think moving the check to walk_pmd_range is a more clear on the code’s intent and
>> personally prefer the code there, but I don’t see why this check is removing the possibility
>> of a race after the (!pud_present(pudval) || pud_leaf(pudval)) check, as to me it looks
>> like the PMD entry was possible to disappear between the splitting and this check?
> 
> I distilled that in the comment: PMD page tables cannot/are not
> reclaimed. So once you see a PMD page table, it's not going anywhere
> while you hold relevant locks (mmap_lock or VMA lock).
> 
> Only PMD leaf entries can get zapped any time and PMD none entries can
> get populated any time. But not PMD page tables.

Gotcha, thanks!

> 
>> 
>> Anyways, regardless, this patch resolves the bug and looks good to me - what’s the 
>> course of action as we probably want to backport this to earlier kernels as well. Shall
>> I send in a new PATCH without cover letter and take it from there?
> 
> Right, I think you should:
> 
> (1) rework the patch description to incorporate the essential stuff from
>    the cover letter
> (2) Identify and add Fixes: tag and Cc: stable
> (3) Document that we are reworking the code to mimic what we do in
>    walk_pmd_range(), to have less inconsistency on the core logic
> (4) Document why you think the reproducer fails on newer kernels. (or
>    best try to get it reproduced by adding some delays in the code)
> (5) Clarify that only PUD handling are prone to the race and that PMDs
>    are fine (and point out why)
> (6) Use a patch subject like "mm/pagewalk: fix race between unmapping
>    and refaulting in walk_pud_range()"
> 
> Once you resend, best to add
> 
> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> 
> Above your SOB.
> 
> To get something like:
> 
> Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Max Boone <mboone@akamai.com>
> 
> Note that the existing
> 
> Signed-off-by: Max Tottenham <mtottenh@akamai.com>
> 
> Is weird, as Max Tottenham did not send out this patch. If he was
> involved in the development, you should either make him
> 
> Suggested-by:
> 
> Or
> Debugged-by:
> 
> Or
> Co-developed-by: + Signed-off-by:
> 
> See Documentation/process/submitting-patches.rst
> 
> 
> Let me know if you have any questions :)
> 

Will do, thanks a lot!

> -- 
> Cheers,
> 
> David

[1] https://lore.kernel.org/all/20250814064714.56485-1-lizhe.67@bytedance.com/


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 3061 bytes --]

  reply	other threads:[~2026-03-11  9:43 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-09 17:49 [RFC 0/1] Avoid pagewalk hugepage-split race with VFIO DMA set Max Boone
2026-03-09 17:49 ` [RFC 1/1] mm/pagewalk: don't split device-backed huge pfnmaps Max Boone
2026-03-09 20:19   ` David Hildenbrand (Arm)
2026-03-09 22:47     ` Boone, Max
2026-03-09 23:02     ` Boone, Max
2026-03-10  9:11       ` David Hildenbrand (Arm)
2026-03-10 11:38         ` Boone, Max
2026-03-10 15:19           ` David Hildenbrand (Arm)
2026-03-11  9:42             ` Boone, Max [this message]
2026-03-11  9:59               ` David Hildenbrand (Arm)
2026-03-11 10:34                 ` Boone, Max
2026-03-11 10:45                   ` David Hildenbrand (Arm)
2026-03-11 11:14                     ` Boone, Max
2026-03-11 11:59                       ` David Hildenbrand (Arm)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=6CBB0401-26DD-4EF9-949C-54BEAD3ABA33@akamai.com \
    --to=mboone@akamai.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=alex@shazbot.org \
    --cc=david@kernel.org \
    --cc=johunt@akamai.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=mpelland@akamai.com \
    --cc=mtottenh@akamai.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox