[PATCH v2 0/2] userfaultfd: fix races around pmd_trans

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/2] userfaultfd: fix races around pmd_trans_huge() check
@ 2024-08-13 20:25 Jann Horn
  2024-08-13 20:25 ` [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs Jann Horn
  2024-08-13 20:25 ` [PATCH v2 2/2] userfaultfd: Don't BUG_ON() if khugepaged yanks our page table Jann Horn
  0 siblings, 2 replies; 6+ messages in thread
From: Jann Horn @ 2024-08-13 20:25 UTC (permalink / raw)
  To: Andrew Morton, Pavel Emelianov, Andrea Arcangeli, Hugh Dickins
  Cc: linux-mm, linux-kernel, David Hildenbrand, Qi Zheng, Jann Horn, stable

The pmd_trans_huge() code in mfill_atomic() is wrong in three different
ways depending on kernel version:

1. The pmd_trans_huge() check is racy and can lead to a BUG_ON() (if you hit
   the right two race windows) - I've tested this in a kernel build with
   some extra mdelay() calls. See the commit message for a description
   of the race scenario.
   On older kernels (before 6.5), I think the same bug can even
   theoretically lead to accessing transhuge page contents as a page table
   if you hit the right 5 narrow race windows (I haven't tested this case).
2. As pointed out by Qi Zheng, pmd_trans_huge() is not sufficient for
   detecting PMDs that don't point to page tables.
   On older kernels (before 6.5), you'd just have to win a single fairly
   wide race to hit this.
   I've tested this on 6.1 stable by racing migration (with a mdelay()
   patched into try_to_migrate()) against UFFDIO_ZEROPAGE - on my x86
   VM, that causes a kernel oops in ptlock_ptr().
3. On newer kernels (>=6.5), for shmem mappings, khugepaged is allowed
   to yank page tables out from under us (though I haven't tested that),
   so I think the BUG_ON() checks in mfill_atomic() are just wrong.

I decided to write two separate fixes for these (one fix for bugs 1+2,
one fix for bug 3), so that the first fix can be backported to kernels
affected by bugs 1+2.

Signed-off-by: Jann Horn <jannh@google.com>
---
Changes in v2:
- in patch 1/2:
  - change title
  - get rid of redundant early pmd_trans_huge() check
  - also check for swap PMDs and devmap PMDs (Qi Zheng)
- Link to v1: https://lore.kernel.org/r/20240812-uffd-thp-flip-fix-v1-0-4fc1db7ccdd0@google.com

---
Jann Horn (2):
      userfaultfd: Fix checks for huge PMDs
      userfaultfd: Don't BUG_ON() if khugepaged yanks our page table

 mm/userfaultfd.c | 29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)
---
base-commit: d4560686726f7a357922f300fc81f5964be8df04
change-id: 20240812-uffd-thp-flip-fix-20f91f1151b9
-- 
Jann Horn <jannh@google.com>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs
  2024-08-13 20:25 [PATCH v2 0/2] userfaultfd: fix races around pmd_trans_huge() check Jann Horn
@ 2024-08-13 20:25 ` Jann Horn
  2024-08-13 20:37   ` David Hildenbrand
  2024-08-14  2:25   ` Qi Zheng
  2024-08-13 20:25 ` [PATCH v2 2/2] userfaultfd: Don't BUG_ON() if khugepaged yanks our page table Jann Horn
  1 sibling, 2 replies; 6+ messages in thread
From: Jann Horn @ 2024-08-13 20:25 UTC (permalink / raw)
  To: Andrew Morton, Pavel Emelianov, Andrea Arcangeli, Hugh Dickins
  Cc: linux-mm, linux-kernel, David Hildenbrand, Qi Zheng, Jann Horn, stable

This fixes two issues.

I discovered that the following race can occur:

  mfill_atomic                other thread
  ============                ============
                              <zap PMD>
  pmdp_get_lockless() [reads none pmd]
  <bail if trans_huge>
  <if none:>
                              <pagefault creates transhuge zeropage>
    __pte_alloc [no-op]
                              <zap PMD>
  <bail if pmd_trans_huge(*dst_pmd)>
  BUG_ON(pmd_none(*dst_pmd))

I have experimentally verified this in a kernel with extra mdelay() calls;
the BUG_ON(pmd_none(*dst_pmd)) triggers.

On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
a BUG_ON(), since the page table access helpers are actually designed to
deal with page tables concurrently disappearing; but on older kernels
(<=6.4), I think we could probably theoretically race past the two BUG_ON()
checks and end up treating a hugepage as a page table.

The second issue is that, as Qi Zheng pointed out, there are other types of
huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
(in particular, migration PMDs).
On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
PMD that contains a migration entry (which just requires winning a single,
fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
assumes that the PMD points to a page table.
Breakage follows: First, the kernel tries to take the PTE lock (which will
crash or maybe worse if there is no "struct page" for the address bits in
the migration entry PMD - I think at least on X86 there usually is no
corresponding "struct page" thanks to the PTE inversion mitigation, amd64
looks different).
If that didn't crash, the kernel would next try to write a PTE into what it
wrongly thinks is a page table.

As part of fixing these issues, get rid of the check for pmd_trans_huge()
before __pte_alloc() - that's redundant, we're going to have to check for
that after the __pte_alloc() anyway.

Backport note: pmdp_get_lockless() is pmd_read_atomic() in older
kernels.

Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Closes: https://lore.kernel.org/r/59bf3c2e-d58b-41af-ab10-3e631d802229@bytedance.com
Cc: stable@vger.kernel.org
Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
Signed-off-by: Jann Horn <jannh@google.com>
---
 mm/userfaultfd.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e54e5c8907fa..290b2a0d84ac 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		}

 		dst_pmdval = pmdp_get_lockless(dst_pmd);
-		/*
-		 * If the dst_pmd is mapped as THP don't
-		 * override it and just be strict.
-		 */
-		if (unlikely(pmd_trans_huge(dst_pmdval))) {
-			err = -EEXIST;
-			break;
-		}
 		if (unlikely(pmd_none(dst_pmdval)) &&
 		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
 			err = -ENOMEM;
 			break;
 		}
-		/* If an huge pmd materialized from under us fail */
-		if (unlikely(pmd_trans_huge(*dst_pmd))) {
+		dst_pmdval = pmdp_get_lockless(dst_pmd);
+		/*
+		 * If the dst_pmd is THP don't override it and just be strict.
+		 * (This includes the case where the PMD used to be THP and
+		 * changed back to none after __pte_alloc().)
+		 */
+		if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
+			     pmd_devmap(dst_pmdval))) {
+			err = -EEXIST;
+			break;
+		}
+		if (unlikely(pmd_bad(dst_pmdval))) {
 			err = -EFAULT;
 			break;
 		}

-- 
2.46.0.76.ge559c4bf1a-goog

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v2 2/2] userfaultfd: Don't BUG_ON() if khugepaged yanks our page table
  2024-08-13 20:25 [PATCH v2 0/2] userfaultfd: fix races around pmd_trans_huge() check Jann Horn
  2024-08-13 20:25 ` [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs Jann Horn
@ 2024-08-13 20:25 ` Jann Horn
  1 sibling, 0 replies; 6+ messages in thread
From: Jann Horn @ 2024-08-13 20:25 UTC (permalink / raw)
  To: Andrew Morton, Pavel Emelianov, Andrea Arcangeli, Hugh Dickins
  Cc: linux-mm, linux-kernel, David Hildenbrand, Qi Zheng, Jann Horn, stable

Since khugepaged was changed to allow retracting page tables in file
mappings without holding the mmap lock, these BUG_ON()s are wrong - get rid
of them.

We could also remove the preceding "if (unlikely(...))" block, but then
we could reach pte_offset_map_lock() with transhuge pages not just for file
mappings but also for anonymous mappings - which would probably be fine but
I think is not necessarily expected.

Cc: stable@vger.kernel.org
Fixes: 1d65b771bc08 ("mm/khugepaged: retract_page_tables() without mmap or vma lock")
Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Jann Horn <jannh@google.com>
---
 mm/userfaultfd.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 290b2a0d84ac..acc56c75ba99 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -807,9 +807,10 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 			err = -EFAULT;
 			break;
 		}
-
-		BUG_ON(pmd_none(*dst_pmd));
-		BUG_ON(pmd_trans_huge(*dst_pmd));
+		/*
+		 * For shmem mappings, khugepaged is allowed to remove page
+		 * tables under us; pte_offset_map_lock() will deal with that.
+		 */
 
 		err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr,
 				       src_addr, flags, &folio);

-- 
2.46.0.76.ge559c4bf1a-goog



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs
  2024-08-13 20:25 ` [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs Jann Horn
@ 2024-08-13 20:37   ` David Hildenbrand
  2024-08-13 20:44     ` Jann Horn
  2024-08-14  2:25   ` Qi Zheng
  1 sibling, 1 reply; 6+ messages in thread
From: David Hildenbrand @ 2024-08-13 20:37 UTC (permalink / raw)
  To: Jann Horn, Andrew Morton, Pavel Emelianov, Andrea Arcangeli,
	Hugh Dickins
  Cc: linux-mm, linux-kernel, Qi Zheng, stable

On 13.08.24 22:25, Jann Horn wrote:
> This fixes two issues.
> 
> I discovered that the following race can occur:
> 
>    mfill_atomic                other thread
>    ============                ============
>                                <zap PMD>
>    pmdp_get_lockless() [reads none pmd]
>    <bail if trans_huge>
>    <if none:>
>                                <pagefault creates transhuge zeropage>
>      __pte_alloc [no-op]
>                                <zap PMD>
>    <bail if pmd_trans_huge(*dst_pmd)>
>    BUG_ON(pmd_none(*dst_pmd))
> 
> I have experimentally verified this in a kernel with extra mdelay() calls;
> the BUG_ON(pmd_none(*dst_pmd)) triggers.
> 
> On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
> pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
> a BUG_ON(), since the page table access helpers are actually designed to
> deal with page tables concurrently disappearing; but on older kernels
> (<=6.4), I think we could probably theoretically race past the two BUG_ON()
> checks and end up treating a hugepage as a page table.
> 
> The second issue is that, as Qi Zheng pointed out, there are other types of
> huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
> (in particular, migration PMDs).
> On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
> PMD that contains a migration entry (which just requires winning a single,
> fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
> assumes that the PMD points to a page table.
> Breakage follows: First, the kernel tries to take the PTE lock (which will
> crash or maybe worse if there is no "struct page" for the address bits in
> the migration entry PMD - I think at least on X86 there usually is no
> corresponding "struct page" thanks to the PTE inversion mitigation, amd64
> looks different).
> If that didn't crash, the kernel would next try to write a PTE into what it
> wrongly thinks is a page table.
> 
> As part of fixing these issues, get rid of the check for pmd_trans_huge()
> before __pte_alloc() - that's redundant, we're going to have to check for
> that after the __pte_alloc() anyway.
> 
> Backport note: pmdp_get_lockless() is pmd_read_atomic() in older
> kernels.
> 
> Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Closes: https://lore.kernel.org/r/59bf3c2e-d58b-41af-ab10-3e631d802229@bytedance.com
> Cc: stable@vger.kernel.org
> Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>   mm/userfaultfd.c | 22 ++++++++++++----------
>   1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e54e5c8907fa..290b2a0d84ac 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>   		}
>   
>   		dst_pmdval = pmdp_get_lockless(dst_pmd);
> -		/*
> -		 * If the dst_pmd is mapped as THP don't
> -		 * override it and just be strict.
> -		 */
> -		if (unlikely(pmd_trans_huge(dst_pmdval))) {
> -			err = -EEXIST;
> -			break;
> -		}
>   		if (unlikely(pmd_none(dst_pmdval)) &&
>   		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
>   			err = -ENOMEM;
>   			break;
>   		}
> -		/* If an huge pmd materialized from under us fail */
> -		if (unlikely(pmd_trans_huge(*dst_pmd))) {
> +		dst_pmdval = pmdp_get_lockless(dst_pmd);
> +		/*
> +		 * If the dst_pmd is THP don't override it and just be strict.
> +		 * (This includes the case where the PMD used to be THP and
> +		 * changed back to none after __pte_alloc().)
> +		 */
> +		if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
> +			     pmd_devmap(dst_pmdval))) {

Likely in the future we should turn the latter part into a "pmd_leaf()" 
check.

> +			err = -EEXIST;
> +			break;
> +		}
> +		if (unlikely(pmd_bad(dst_pmdval))) {
>   			err = -EFAULT;
>   			break;
>   		}
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs
  2024-08-13 20:37   ` David Hildenbrand
@ 2024-08-13 20:44     ` Jann Horn
  0 siblings, 0 replies; 6+ messages in thread
From: Jann Horn @ 2024-08-13 20:44 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Pavel Emelianov, Andrea Arcangeli, Hugh Dickins,
	linux-mm, linux-kernel, Qi Zheng, stable

On Tue, Aug 13, 2024 at 10:37 PM David Hildenbrand <david@redhat.com> wrote:
> On 13.08.24 22:25, Jann Horn wrote:
> > +             if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
> > +                          pmd_devmap(dst_pmdval))) {
>
> Likely in the future we should turn the latter part into a "pmd_leaf()"
> check.

Yeah, it'd be a good idea to refactor that as a followup.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs
  2024-08-13 20:25 ` [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs Jann Horn
  2024-08-13 20:37   ` David Hildenbrand
@ 2024-08-14  2:25   ` Qi Zheng
  1 sibling, 0 replies; 6+ messages in thread
From: Qi Zheng @ 2024-08-14  2:25 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andrew Morton, Pavel Emelianov, Andrea Arcangeli, Hugh Dickins,
	linux-mm, linux-kernel, David Hildenbrand, stable



On 2024/8/14 04:25, Jann Horn wrote:
> This fixes two issues.
> 
> I discovered that the following race can occur:
> 
>    mfill_atomic                other thread
>    ============                ============
>                                <zap PMD>
>    pmdp_get_lockless() [reads none pmd]
>    <bail if trans_huge>
>    <if none:>
>                                <pagefault creates transhuge zeropage>
>      __pte_alloc [no-op]
>                                <zap PMD>
>    <bail if pmd_trans_huge(*dst_pmd)>
>    BUG_ON(pmd_none(*dst_pmd))
> 
> I have experimentally verified this in a kernel with extra mdelay() calls;
> the BUG_ON(pmd_none(*dst_pmd)) triggers.
> 
> On kernels newer than commit 0d940a9b270b ("mm/pgtable: allow
> pte_offset_map[_lock]() to fail"), this can't lead to anything worse than
> a BUG_ON(), since the page table access helpers are actually designed to
> deal with page tables concurrently disappearing; but on older kernels
> (<=6.4), I think we could probably theoretically race past the two BUG_ON()
> checks and end up treating a hugepage as a page table.
> 
> The second issue is that, as Qi Zheng pointed out, there are other types of
> huge PMDs that pmd_trans_huge() can't catch: devmap PMDs and swap PMDs
> (in particular, migration PMDs).
> On <=6.4, this is worse than the first issue: If mfill_atomic() runs on a
> PMD that contains a migration entry (which just requires winning a single,
> fairly wide race), it will pass the PMD to pte_offset_map_lock(), which
> assumes that the PMD points to a page table.
> Breakage follows: First, the kernel tries to take the PTE lock (which will
> crash or maybe worse if there is no "struct page" for the address bits in
> the migration entry PMD - I think at least on X86 there usually is no
> corresponding "struct page" thanks to the PTE inversion mitigation, amd64
> looks different).
> If that didn't crash, the kernel would next try to write a PTE into what it
> wrongly thinks is a page table.
> 
> As part of fixing these issues, get rid of the check for pmd_trans_huge()
> before __pte_alloc() - that's redundant, we're going to have to check for
> that after the __pte_alloc() anyway.
> 
> Backport note: pmdp_get_lockless() is pmd_read_atomic() in older
> kernels.
> 
> Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Closes: https://lore.kernel.org/r/59bf3c2e-d58b-41af-ab10-3e631d802229@bytedance.com

Ah, the issue fixed by this patch was not reported by me, so
I think that this Reported-by does not need to be added. ;)

> Cc: stable@vger.kernel.org
> Fixes: c1a4de99fada ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation")
> Signed-off-by: Jann Horn <jannh@google.com>
> ---
>   mm/userfaultfd.c | 22 ++++++++++++----------
>   1 file changed, 12 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e54e5c8907fa..290b2a0d84ac 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -787,21 +787,23 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
>   		}
>   
>   		dst_pmdval = pmdp_get_lockless(dst_pmd);
> -		/*
> -		 * If the dst_pmd is mapped as THP don't
> -		 * override it and just be strict.
> -		 */
> -		if (unlikely(pmd_trans_huge(dst_pmdval))) {
> -			err = -EEXIST;
> -			break;
> -		}
>   		if (unlikely(pmd_none(dst_pmdval)) &&
>   		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
>   			err = -ENOMEM;
>   			break;
>   		}
> -		/* If an huge pmd materialized from under us fail */
> -		if (unlikely(pmd_trans_huge(*dst_pmd))) {
> +		dst_pmdval = pmdp_get_lockless(dst_pmd);
> +		/*
> +		 * If the dst_pmd is THP don't override it and just be strict.
> +		 * (This includes the case where the PMD used to be THP and
> +		 * changed back to none after __pte_alloc().)
> +		 */
> +		if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
> +			     pmd_devmap(dst_pmdval))) {
> +			err = -EEXIST;
> +			break;
> +		}
> +		if (unlikely(pmd_bad(dst_pmdval))) {
>   			err = -EFAULT;
>   			break;
>   		}

Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>

> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2024-08-14  2:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-13 20:25 [PATCH v2 0/2] userfaultfd: fix races around pmd_trans_huge() check Jann Horn
2024-08-13 20:25 ` [PATCH v2 1/2] userfaultfd: Fix checks for huge PMDs Jann Horn
2024-08-13 20:37   ` David Hildenbrand
2024-08-13 20:44     ` Jann Horn
2024-08-14  2:25   ` Qi Zheng
2024-08-13 20:25 ` [PATCH v2 2/2] userfaultfd: Don't BUG_ON() if khugepaged yanks our page table Jann Horn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox