linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
@ 2025-02-19 11:25 Barry Song
  2025-02-19 18:26 ` Suren Baghdasaryan
                   ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Barry Song @ 2025-02-19 11:25 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: linux-kernel, zhengtangquan, Barry Song, Andrea Arcangeli,
	Suren Baghdasaryan, Al Viro, Axel Rasmussen, Brian Geffon,
	Christian Brauner, David Hildenbrand, Hugh Dickins, Jann Horn,
	Kalesh Singh, Liam R . Howlett, Lokesh Gidra, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng

From: Barry Song <v-songbaohua@oppo.com>

userfaultfd_move() checks whether the PTE entry is present or a
swap entry.

- If the PTE entry is present, move_present_pte() handles folio
  migration by setting:

  src_folio->index = linear_page_index(dst_vma, dst_addr);

- If the PTE entry is a swap entry, move_swap_pte() simply copies
  the PTE to the new dst_addr.

This approach is incorrect because even if the PTE is a swap
entry, it can still reference a folio that remains in the swap
cache.

If do_swap_page() is triggered, it may locate the folio in the
swap cache. However, during add_rmap operations, a kernel panic
can occur due to:
 page_pgoff(folio, page) != linear_page_index(vma, address)

$./a.out > /dev/null
[   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
[   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
[   13.337716] memcg:ffff00000405f000
[   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
[   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
[   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
[   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
[   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
[   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
[   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
[   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
[   13.340190] ------------[ cut here ]------------
[   13.340316] kernel BUG at mm/rmap.c:1380!
[   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[   13.340969] Modules linked in:
[   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
[   13.341470] Hardware name: linux,dummy-virt (DT)
[   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
[   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
[   13.342018] sp : ffff80008752bb20
[   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
[   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
[   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
[   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
[   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
[   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
[   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
[   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
[   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
[   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
[   13.343876] Call trace:
[   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
[   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
[   13.344333]  do_swap_page+0x1060/0x1400
[   13.344417]  __handle_mm_fault+0x61c/0xbc8
[   13.344504]  handle_mm_fault+0xd8/0x2e8
[   13.344586]  do_page_fault+0x20c/0x770
[   13.344673]  do_translation_fault+0xb4/0xf0
[   13.344759]  do_mem_abort+0x48/0xa0
[   13.344842]  el0_da+0x58/0x130
[   13.344914]  el0t_64_sync_handler+0xc4/0x138
[   13.345002]  el0t_64_sync+0x1ac/0x1b0
[   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
[   13.345504] ---[ end trace 0000000000000000 ]---
[   13.345715] note: a.out[107] exited with irqs disabled
[   13.345954] note: a.out[107] exited with preempt_count 2

Fully fixing it would be quite complex, requiring similar handling
of folios as done in move_present_pte. For now, a quick solution
is to return -EBUSY.
I'd like to see others' opinions on whether a full fix is worth
pursuing.

For anyone interested in reproducing it, the a.out test program is
as below,

 #define _GNU_SOURCE
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include <sys/mman.h>
 #include <sys/ioctl.h>
 #include <sys/syscall.h>
 #include <linux/userfaultfd.h>
 #include <fcntl.h>
 #include <pthread.h>
 #include <unistd.h>
 #include <poll.h>
 #include <errno.h>

 #define PAGE_SIZE 4096
 #define REGION_SIZE (512 * 1024)

 #ifndef UFFDIO_MOVE
 struct uffdio_move {
     __u64 dst;
     __u64 src;
     __u64 len;
     #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
     #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
     __u64 mode;
     __s64 move;
 };
 #define _UFFDIO_MOVE  (0x05)
 #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
 #endif

 void *src, *dst;
 int uffd;

 void *madvise_thread(void *arg) {
     if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
         perror("madvise MADV_PAGEOUT");
     }
     return NULL;
 }

 void *fault_handler_thread(void *arg) {
     struct uffd_msg msg;
     struct uffdio_move move;
     struct pollfd pollfd = { .fd = uffd, .events = POLLIN };

     pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
     pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);

     while (1) {
         if (poll(&pollfd, 1, -1) == -1) {
             perror("poll");
             exit(EXIT_FAILURE);
         }

         if (read(uffd, &msg, sizeof(msg)) <= 0) {
             perror("read");
             exit(EXIT_FAILURE);
         }

         if (msg.event != UFFD_EVENT_PAGEFAULT) {
             fprintf(stderr, "Unexpected event\n");
             exit(EXIT_FAILURE);
         }

         move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
         move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
         move.len = PAGE_SIZE;
         move.mode = 0;

         if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
             perror("UFFDIO_MOVE");
             exit(EXIT_FAILURE);
         }
     }
     return NULL;
 }

 int main() {
 again:
     pthread_t thr, madv_thr;
     struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
     struct uffdio_register uffdio_register;

     src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (src == MAP_FAILED) {
         perror("mmap src");
         exit(EXIT_FAILURE);
     }
     memset(src, 1, REGION_SIZE);

     dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
     if (dst == MAP_FAILED) {
         perror("mmap dst");
         exit(EXIT_FAILURE);
     }

     uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
     if (uffd == -1) {
         perror("userfaultfd");
         exit(EXIT_FAILURE);
     }

     if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
         perror("UFFDIO_API");
         exit(EXIT_FAILURE);
     }

     uffdio_register.range.start = (unsigned long)dst;
     uffdio_register.range.len = REGION_SIZE;
     uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;

     if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
         perror("UFFDIO_REGISTER");
         exit(EXIT_FAILURE);
     }

     if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
         perror("pthread_create madvise_thread");
         exit(EXIT_FAILURE);
     }

     if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
         perror("pthread_create fault_handler_thread");
         exit(EXIT_FAILURE);
     }

     for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
         char val = ((char *)dst)[i];
         printf("Accessing dst at offset %zu, value: %d\n", i, val);
     }

     pthread_join(madv_thr, NULL);
     pthread_cancel(thr);
     pthread_join(thr, NULL);

     munmap(src, REGION_SIZE);
     munmap(dst, REGION_SIZE);
     close(uffd);
     goto again;
     return 0;
 }

As long as you enable mTHP (which likely increases the residency
time of swapcache), you can reproduce the issue within a few
seconds. But I guess the same race condition also exists with
small folios.

Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/userfaultfd.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 867898c4e30b..34cf1c8c725d 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
+#include "swap.h"
 
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
@@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
 			 pmd_t *dst_pmd, pmd_t dst_pmdval,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl)
 {
+	struct folio *folio;
+	swp_entry_t entry;
+
 	if (!pte_swp_exclusive(orig_src_pte))
 		return -EBUSY;
 
+	entry = pte_to_swp_entry(orig_src_pte);
+	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
+	if (!IS_ERR(folio)) {
+		folio_put(folio);
+		return -EBUSY;
+	}
+
 	double_pt_lock(dst_ptl, src_ptl);
 
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
-- 
2.39.3 (Apple Git-146)



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 11:25 [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache Barry Song
@ 2025-02-19 18:26 ` Suren Baghdasaryan
  2025-02-19 18:30   ` David Hildenbrand
  2025-02-19 20:37   ` Barry Song
  2025-02-19 18:40 ` Lokesh Gidra
  2025-02-19 22:31 ` Peter Xu
  2 siblings, 2 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-02-19 18:26 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Al Viro, Axel Rasmussen, Brian Geffon,
	Christian Brauner, David Hildenbrand, Hugh Dickins, Jann Horn,
	Kalesh Singh, Liam R . Howlett, Lokesh Gidra, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng

On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> userfaultfd_move() checks whether the PTE entry is present or a
> swap entry.
>
> - If the PTE entry is present, move_present_pte() handles folio
>   migration by setting:
>
>   src_folio->index = linear_page_index(dst_vma, dst_addr);
>
> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>   the PTE to the new dst_addr.
>
> This approach is incorrect because even if the PTE is a swap
> entry, it can still reference a folio that remains in the swap
> cache.
>
> If do_swap_page() is triggered, it may locate the folio in the
> swap cache. However, during add_rmap operations, a kernel panic
> can occur due to:
>  page_pgoff(folio, page) != linear_page_index(vma, address)

Thanks for the report and reproducer!

>
> $./a.out > /dev/null
> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> [   13.337716] memcg:ffff00000405f000
> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> [   13.340190] ------------[ cut here ]------------
> [   13.340316] kernel BUG at mm/rmap.c:1380!
> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [   13.340969] Modules linked in:
> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> [   13.341470] Hardware name: linux,dummy-virt (DT)
> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> [   13.342018] sp : ffff80008752bb20
> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> [   13.343876] Call trace:
> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> [   13.344333]  do_swap_page+0x1060/0x1400
> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> [   13.344586]  do_page_fault+0x20c/0x770
> [   13.344673]  do_translation_fault+0xb4/0xf0
> [   13.344759]  do_mem_abort+0x48/0xa0
> [   13.344842]  el0_da+0x58/0x130
> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> [   13.345504] ---[ end trace 0000000000000000 ]---
> [   13.345715] note: a.out[107] exited with irqs disabled
> [   13.345954] note: a.out[107] exited with preempt_count 2
>
> Fully fixing it would be quite complex, requiring similar handling
> of folios as done in move_present_pte.

How complex would that be? Is it a matter of adding
folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
folio->index = linear_page_index like in move_present_pte() or
something more?

> For now, a quick solution
> is to return -EBUSY.
> I'd like to see others' opinions on whether a full fix is worth
> pursuing.
>
> For anyone interested in reproducing it, the a.out test program is
> as below,
>
>  #define _GNU_SOURCE
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/mman.h>
>  #include <sys/ioctl.h>
>  #include <sys/syscall.h>
>  #include <linux/userfaultfd.h>
>  #include <fcntl.h>
>  #include <pthread.h>
>  #include <unistd.h>
>  #include <poll.h>
>  #include <errno.h>
>
>  #define PAGE_SIZE 4096
>  #define REGION_SIZE (512 * 1024)
>
>  #ifndef UFFDIO_MOVE
>  struct uffdio_move {
>      __u64 dst;
>      __u64 src;
>      __u64 len;
>      #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
>      #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
>      __u64 mode;
>      __s64 move;
>  };
>  #define _UFFDIO_MOVE  (0x05)
>  #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
>  #endif
>
>  void *src, *dst;
>  int uffd;
>
>  void *madvise_thread(void *arg) {
>      if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
>          perror("madvise MADV_PAGEOUT");
>      }
>      return NULL;
>  }
>
>  void *fault_handler_thread(void *arg) {
>      struct uffd_msg msg;
>      struct uffdio_move move;
>      struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
>
>      pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
>      pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
>
>      while (1) {
>          if (poll(&pollfd, 1, -1) == -1) {
>              perror("poll");
>              exit(EXIT_FAILURE);
>          }
>
>          if (read(uffd, &msg, sizeof(msg)) <= 0) {
>              perror("read");
>              exit(EXIT_FAILURE);
>          }
>
>          if (msg.event != UFFD_EVENT_PAGEFAULT) {
>              fprintf(stderr, "Unexpected event\n");
>              exit(EXIT_FAILURE);
>          }
>
>          move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
>          move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
>          move.len = PAGE_SIZE;
>          move.mode = 0;
>
>          if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
>              perror("UFFDIO_MOVE");
>              exit(EXIT_FAILURE);
>          }
>      }
>      return NULL;
>  }
>
>  int main() {
>  again:
>      pthread_t thr, madv_thr;
>      struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
>      struct uffdio_register uffdio_register;
>
>      src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>      if (src == MAP_FAILED) {
>          perror("mmap src");
>          exit(EXIT_FAILURE);
>      }
>      memset(src, 1, REGION_SIZE);
>
>      dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>      if (dst == MAP_FAILED) {
>          perror("mmap dst");
>          exit(EXIT_FAILURE);
>      }
>
>      uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>      if (uffd == -1) {
>          perror("userfaultfd");
>          exit(EXIT_FAILURE);
>      }
>
>      if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
>          perror("UFFDIO_API");
>          exit(EXIT_FAILURE);
>      }
>
>      uffdio_register.range.start = (unsigned long)dst;
>      uffdio_register.range.len = REGION_SIZE;
>      uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
>
>      if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
>          perror("UFFDIO_REGISTER");
>          exit(EXIT_FAILURE);
>      }
>
>      if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
>          perror("pthread_create madvise_thread");
>          exit(EXIT_FAILURE);
>      }
>
>      if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
>          perror("pthread_create fault_handler_thread");
>          exit(EXIT_FAILURE);
>      }
>
>      for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
>          char val = ((char *)dst)[i];
>          printf("Accessing dst at offset %zu, value: %d\n", i, val);
>      }
>
>      pthread_join(madv_thr, NULL);
>      pthread_cancel(thr);
>      pthread_join(thr, NULL);
>
>      munmap(src, REGION_SIZE);
>      munmap(dst, REGION_SIZE);
>      close(uffd);
>      goto again;
>      return 0;
>  }
>
> As long as you enable mTHP (which likely increases the residency
> time of swapcache), you can reproduce the issue within a few
> seconds. But I guess the same race condition also exists with
> small folios.
>
> Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Brian Geffon <bgeffon@google.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> Cc: Nicolas Geoffray <ngeoffray@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: ZhangPeng <zhangpeng362@huawei.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/userfaultfd.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 867898c4e30b..34cf1c8c725d 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> +#include "swap.h"
>
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
>                          pmd_t *dst_pmd, pmd_t dst_pmdval,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
>  {
> +       struct folio *folio;
> +       swp_entry_t entry;
> +
>         if (!pte_swp_exclusive(orig_src_pte))
>                 return -EBUSY;
>

Would be helpful to add a comment explaining that this is the case
when the folio is in the swap cache.

> +       entry = pte_to_swp_entry(orig_src_pte);
> +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> +       if (!IS_ERR(folio)) {
> +               folio_put(folio);
> +               return -EBUSY;
> +       }
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> --
> 2.39.3 (Apple Git-146)
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 18:26 ` Suren Baghdasaryan
@ 2025-02-19 18:30   ` David Hildenbrand
  2025-02-19 18:58     ` Suren Baghdasaryan
  2025-02-19 20:37   ` Barry Song
  1 sibling, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2025-02-19 18:30 UTC (permalink / raw)
  To: Suren Baghdasaryan, Barry Song
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Al Viro, Axel Rasmussen, Brian Geffon,
	Christian Brauner, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Lokesh Gidra, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Nicolas Geoffray, Peter Xu, Ryan Roberts,
	Shuah Khan, ZhangPeng

On 19.02.25 19:26, Suren Baghdasaryan wrote:
> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> From: Barry Song <v-songbaohua@oppo.com>
>>
>> userfaultfd_move() checks whether the PTE entry is present or a
>> swap entry.
>>
>> - If the PTE entry is present, move_present_pte() handles folio
>>    migration by setting:
>>
>>    src_folio->index = linear_page_index(dst_vma, dst_addr);
>>
>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>>    the PTE to the new dst_addr.
>>
>> This approach is incorrect because even if the PTE is a swap
>> entry, it can still reference a folio that remains in the swap
>> cache.
>>
>> If do_swap_page() is triggered, it may locate the folio in the
>> swap cache. However, during add_rmap operations, a kernel panic
>> can occur due to:
>>   page_pgoff(folio, page) != linear_page_index(vma, address)
> 
> Thanks for the report and reproducer!
> 
>>
>> $./a.out > /dev/null
>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
>> [   13.337716] memcg:ffff00000405f000
>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
>> [   13.340190] ------------[ cut here ]------------
>> [   13.340316] kernel BUG at mm/rmap.c:1380!
>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>> [   13.340969] Modules linked in:
>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
>> [   13.341470] Hardware name: linux,dummy-virt (DT)
>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
>> [   13.342018] sp : ffff80008752bb20
>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
>> [   13.343876] Call trace:
>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
>> [   13.344333]  do_swap_page+0x1060/0x1400
>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
>> [   13.344586]  do_page_fault+0x20c/0x770
>> [   13.344673]  do_translation_fault+0xb4/0xf0
>> [   13.344759]  do_mem_abort+0x48/0xa0
>> [   13.344842]  el0_da+0x58/0x130
>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
>> [   13.345504] ---[ end trace 0000000000000000 ]---
>> [   13.345715] note: a.out[107] exited with irqs disabled
>> [   13.345954] note: a.out[107] exited with preempt_count 2
>>
>> Fully fixing it would be quite complex, requiring similar handling
>> of folios as done in move_present_pte.
> 
> How complex would that be? Is it a matter of adding
> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> folio->index = linear_page_index like in move_present_pte() or
> something more?

If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot 
be pinned and we may be able to move it I think.

So all that's required is to check pte_swp_exclusive() and the folio size.

... in theory :) Not sure about the swap details.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 11:25 [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache Barry Song
  2025-02-19 18:26 ` Suren Baghdasaryan
@ 2025-02-19 18:40 ` Lokesh Gidra
  2025-02-19 20:45   ` Barry Song
  2025-02-19 22:31 ` Peter Xu
  2 siblings, 1 reply; 47+ messages in thread
From: Lokesh Gidra @ 2025-02-19 18:40 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Suren Baghdasaryan, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng

On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> userfaultfd_move() checks whether the PTE entry is present or a
> swap entry.
>
> - If the PTE entry is present, move_present_pte() handles folio
>   migration by setting:
>
>   src_folio->index = linear_page_index(dst_vma, dst_addr);
>
> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>   the PTE to the new dst_addr.
>
> This approach is incorrect because even if the PTE is a swap
> entry, it can still reference a folio that remains in the swap
> cache.
>
> If do_swap_page() is triggered, it may locate the folio in the
> swap cache. However, during add_rmap operations, a kernel panic
> can occur due to:
>  page_pgoff(folio, page) != linear_page_index(vma, address)
>
> $./a.out > /dev/null
> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> [   13.337716] memcg:ffff00000405f000
> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> [   13.340190] ------------[ cut here ]------------
> [   13.340316] kernel BUG at mm/rmap.c:1380!
> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> [   13.340969] Modules linked in:
> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> [   13.341470] Hardware name: linux,dummy-virt (DT)
> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> [   13.342018] sp : ffff80008752bb20
> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> [   13.343876] Call trace:
> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> [   13.344333]  do_swap_page+0x1060/0x1400
> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> [   13.344586]  do_page_fault+0x20c/0x770
> [   13.344673]  do_translation_fault+0xb4/0xf0
> [   13.344759]  do_mem_abort+0x48/0xa0
> [   13.344842]  el0_da+0x58/0x130
> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> [   13.345504] ---[ end trace 0000000000000000 ]---
> [   13.345715] note: a.out[107] exited with irqs disabled
> [   13.345954] note: a.out[107] exited with preempt_count 2
>
> Fully fixing it would be quite complex, requiring similar handling
> of folios as done in move_present_pte. For now, a quick solution
> is to return -EBUSY.
> I'd like to see others' opinions on whether a full fix is worth
> pursuing.
>

Thanks a lot for finding this.

As a user of MOVE ioctl (in Android GC) I strongly urge you to fix
this properly. Because this is not going to be a rare occurrence in
the case of Android. And when -EBUSY is returned, all that userspace
can do is touch the page, which also does not guarantee that a
subsequent retry of the ioctl will succeed.

> For anyone interested in reproducing it, the a.out test program is
> as below,
>
>  #define _GNU_SOURCE
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <string.h>
>  #include <sys/mman.h>
>  #include <sys/ioctl.h>
>  #include <sys/syscall.h>
>  #include <linux/userfaultfd.h>
>  #include <fcntl.h>
>  #include <pthread.h>
>  #include <unistd.h>
>  #include <poll.h>
>  #include <errno.h>
>
>  #define PAGE_SIZE 4096
>  #define REGION_SIZE (512 * 1024)
>
>  #ifndef UFFDIO_MOVE
>  struct uffdio_move {
>      __u64 dst;
>      __u64 src;
>      __u64 len;
>      #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
>      #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
>      __u64 mode;
>      __s64 move;
>  };
>  #define _UFFDIO_MOVE  (0x05)
>  #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
>  #endif
>
>  void *src, *dst;
>  int uffd;
>
>  void *madvise_thread(void *arg) {
>      if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
>          perror("madvise MADV_PAGEOUT");
>      }
>      return NULL;
>  }
>
>  void *fault_handler_thread(void *arg) {
>      struct uffd_msg msg;
>      struct uffdio_move move;
>      struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
>
>      pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
>      pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
>
>      while (1) {
>          if (poll(&pollfd, 1, -1) == -1) {
>              perror("poll");
>              exit(EXIT_FAILURE);
>          }
>
>          if (read(uffd, &msg, sizeof(msg)) <= 0) {
>              perror("read");
>              exit(EXIT_FAILURE);
>          }
>
>          if (msg.event != UFFD_EVENT_PAGEFAULT) {
>              fprintf(stderr, "Unexpected event\n");
>              exit(EXIT_FAILURE);
>          }
>
>          move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
>          move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
>          move.len = PAGE_SIZE;
>          move.mode = 0;
>
>          if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
>              perror("UFFDIO_MOVE");
>              exit(EXIT_FAILURE);
>          }
>      }
>      return NULL;
>  }
>
>  int main() {
>  again:
>      pthread_t thr, madv_thr;
>      struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
>      struct uffdio_register uffdio_register;
>
>      src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>      if (src == MAP_FAILED) {
>          perror("mmap src");
>          exit(EXIT_FAILURE);
>      }
>      memset(src, 1, REGION_SIZE);
>
>      dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>      if (dst == MAP_FAILED) {
>          perror("mmap dst");
>          exit(EXIT_FAILURE);
>      }
>
>      uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
>      if (uffd == -1) {
>          perror("userfaultfd");
>          exit(EXIT_FAILURE);
>      }
>
>      if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
>          perror("UFFDIO_API");
>          exit(EXIT_FAILURE);
>      }
>
>      uffdio_register.range.start = (unsigned long)dst;
>      uffdio_register.range.len = REGION_SIZE;
>      uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
>
>      if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
>          perror("UFFDIO_REGISTER");
>          exit(EXIT_FAILURE);
>      }
>
>      if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
>          perror("pthread_create madvise_thread");
>          exit(EXIT_FAILURE);
>      }
>
>      if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
>          perror("pthread_create fault_handler_thread");
>          exit(EXIT_FAILURE);
>      }
>
>      for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
>          char val = ((char *)dst)[i];
>          printf("Accessing dst at offset %zu, value: %d\n", i, val);
>      }
>
>      pthread_join(madv_thr, NULL);
>      pthread_cancel(thr);
>      pthread_join(thr, NULL);
>
>      munmap(src, REGION_SIZE);
>      munmap(dst, REGION_SIZE);
>      close(uffd);
>      goto again;
>      return 0;
>  }
>
> As long as you enable mTHP (which likely increases the residency
> time of swapcache), you can reproduce the issue within a few
> seconds. But I guess the same race condition also exists with
> small folios.
>
> Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Brian Geffon <bgeffon@google.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Lokesh Gidra <lokeshgidra@google.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> Cc: Nicolas Geoffray <ngeoffray@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: ZhangPeng <zhangpeng362@huawei.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/userfaultfd.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 867898c4e30b..34cf1c8c725d 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> +#include "swap.h"
>
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
>                          pmd_t *dst_pmd, pmd_t dst_pmdval,
>                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
>  {
> +       struct folio *folio;
> +       swp_entry_t entry;
> +
>         if (!pte_swp_exclusive(orig_src_pte))
>                 return -EBUSY;
>
> +       entry = pte_to_swp_entry(orig_src_pte);
> +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> +       if (!IS_ERR(folio)) {
> +               folio_put(folio);
> +               return -EBUSY;
> +       }
> +
>         double_pt_lock(dst_ptl, src_ptl);
>
>         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> --
> 2.39.3 (Apple Git-146)
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 18:30   ` David Hildenbrand
@ 2025-02-19 18:58     ` Suren Baghdasaryan
  2025-02-20  8:40       ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-02-19 18:58 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Barry Song, linux-mm, akpm, linux-kernel, zhengtangquan,
	Barry Song, Andrea Arcangeli, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, Hugh Dickins, Jann Horn,
	Kalesh Singh, Liam R . Howlett, Lokesh Gidra, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng

On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.02.25 19:26, Suren Baghdasaryan wrote:
> > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> From: Barry Song <v-songbaohua@oppo.com>
> >>
> >> userfaultfd_move() checks whether the PTE entry is present or a
> >> swap entry.
> >>
> >> - If the PTE entry is present, move_present_pte() handles folio
> >>    migration by setting:
> >>
> >>    src_folio->index = linear_page_index(dst_vma, dst_addr);
> >>
> >> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >>    the PTE to the new dst_addr.
> >>
> >> This approach is incorrect because even if the PTE is a swap
> >> entry, it can still reference a folio that remains in the swap
> >> cache.
> >>
> >> If do_swap_page() is triggered, it may locate the folio in the
> >> swap cache. However, during add_rmap operations, a kernel panic
> >> can occur due to:
> >>   page_pgoff(folio, page) != linear_page_index(vma, address)
> >
> > Thanks for the report and reproducer!
> >
> >>
> >> $./a.out > /dev/null
> >> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> >> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> >> [   13.337716] memcg:ffff00000405f000
> >> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> >> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> >> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> >> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> >> [   13.340190] ------------[ cut here ]------------
> >> [   13.340316] kernel BUG at mm/rmap.c:1380!
> >> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> >> [   13.340969] Modules linked in:
> >> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> >> [   13.341470] Hardware name: linux,dummy-virt (DT)
> >> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> >> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> >> [   13.342018] sp : ffff80008752bb20
> >> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> >> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> >> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> >> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> >> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> >> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> >> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> >> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> >> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> >> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> >> [   13.343876] Call trace:
> >> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> >> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> >> [   13.344333]  do_swap_page+0x1060/0x1400
> >> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> >> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> >> [   13.344586]  do_page_fault+0x20c/0x770
> >> [   13.344673]  do_translation_fault+0xb4/0xf0
> >> [   13.344759]  do_mem_abort+0x48/0xa0
> >> [   13.344842]  el0_da+0x58/0x130
> >> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> >> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> >> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> >> [   13.345504] ---[ end trace 0000000000000000 ]---
> >> [   13.345715] note: a.out[107] exited with irqs disabled
> >> [   13.345954] note: a.out[107] exited with preempt_count 2
> >>
> >> Fully fixing it would be quite complex, requiring similar handling
> >> of folios as done in move_present_pte.
> >
> > How complex would that be? Is it a matter of adding
> > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > folio->index = linear_page_index like in move_present_pte() or
> > something more?
>
> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
> be pinned and we may be able to move it I think.
>
> So all that's required is to check pte_swp_exclusive() and the folio size.
>
> ... in theory :) Not sure about the swap details.

Looking some more into it, I think we would have to perform all the
folio and anon_vma locking and pinning that we do for present pages in
move_pages_pte(). If that's correct then maybe treating swapcache
pages like a present page inside move_pages_pte() would be simpler?

>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 18:26 ` Suren Baghdasaryan
  2025-02-19 18:30   ` David Hildenbrand
@ 2025-02-19 20:37   ` Barry Song
  2025-02-19 20:57     ` Matthew Wilcox
                       ` (3 more replies)
  1 sibling, 4 replies; 47+ messages in thread
From: Barry Song @ 2025-02-19 20:37 UTC (permalink / raw)
  To: Suren Baghdasaryan, Lokesh Gidra
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Al Viro, Axel Rasmussen, Brian Geffon,
	Christian Brauner, David Hildenbrand, Hugh Dickins, Jann Horn,
	Kalesh Singh, Liam R . Howlett, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Nicolas Geoffray, Peter Xu, Ryan Roberts,
	Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > userfaultfd_move() checks whether the PTE entry is present or a
> > swap entry.
> >
> > - If the PTE entry is present, move_present_pte() handles folio
> >   migration by setting:
> >
> >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> >
> > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >   the PTE to the new dst_addr.
> >
> > This approach is incorrect because even if the PTE is a swap
> > entry, it can still reference a folio that remains in the swap
> > cache.
> >
> > If do_swap_page() is triggered, it may locate the folio in the
> > swap cache. However, during add_rmap operations, a kernel panic
> > can occur due to:
> >  page_pgoff(folio, page) != linear_page_index(vma, address)
>
> Thanks for the report and reproducer!
>
> >
> > $./a.out > /dev/null
> > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > [   13.337716] memcg:ffff00000405f000
> > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > [   13.340190] ------------[ cut here ]------------
> > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > [   13.340969] Modules linked in:
> > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > [   13.342018] sp : ffff80008752bb20
> > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > [   13.343876] Call trace:
> > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > [   13.344333]  do_swap_page+0x1060/0x1400
> > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > [   13.344586]  do_page_fault+0x20c/0x770
> > [   13.344673]  do_translation_fault+0xb4/0xf0
> > [   13.344759]  do_mem_abort+0x48/0xa0
> > [   13.344842]  el0_da+0x58/0x130
> > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > [   13.345504] ---[ end trace 0000000000000000 ]---
> > [   13.345715] note: a.out[107] exited with irqs disabled
> > [   13.345954] note: a.out[107] exited with preempt_count 2
> >
> > Fully fixing it would be quite complex, requiring similar handling
> > of folios as done in move_present_pte.
>
> How complex would that be? Is it a matter of adding
> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> folio->index = linear_page_index like in move_present_pte() or
> something more?

My main concern is still with large folios that require a split_folio()
during move_pages(), as the entire folio shares the same index and
anon_vma. However, userfaultfd_move() moves pages individually,
making a split necessary.

However, in split_huge_page_to_list_to_order(), there is a:

        if (folio_test_writeback(folio))
                return -EBUSY;

This is likely true for swapcache, right? However, even for move_present_pte(),
it simply returns -EBUSY:

move_pages_pte()
{
                /* at this point we have src_folio locked */
                if (folio_test_large(src_folio)) {
                        /* split_folio() can block */
                        pte_unmap(&orig_src_pte);
                        pte_unmap(&orig_dst_pte);
                        src_pte = dst_pte = NULL;
                        err = split_folio(src_folio);
                        if (err)
                                goto out;

                        /* have to reacquire the folio after it got split */
                        folio_unlock(src_folio);
                        folio_put(src_folio);
                        src_folio = NULL;
                        goto retry;
                }
}

Do we need a folio_wait_writeback() before calling split_folio()?

By the way, I have also reported that userfaultfd_move() has a fundamental
conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
kernel. In this scenario, folios in the virtual zone won’t be split in
split_folio(). Instead, the large folio migrates into nr_pages small folios.

Thus, the best-case scenario would be:

mTHP -> migrate to small folios in split_folio() -> move small folios to
dst_addr

While this works, it negates the performance benefits of
userfaultfd_move(), as it introduces two PTE operations (migration in
split_folio() and move in userfaultfd_move() while retry), nr_pages memory
allocations, and still requires one memcpy(). This could end up
performing even worse than userfaultfd_copy(), I guess.

The worst-case scenario would be failing to allocate small folios in
split_folio(), then userfaultfd_move() might return -ENOMEM?

Given these issues, I strongly recommend that ART hold off on upgrading
to userfaultfd_move() until these problems are fully understood and
resolved. Otherwise, we’re in for a rough ride!

>
> > For now, a quick solution
> > is to return -EBUSY.
> > I'd like to see others' opinions on whether a full fix is worth
> > pursuing.
> >
> > For anyone interested in reproducing it, the a.out test program is
> > as below,
> >
> >  #define _GNU_SOURCE
> >  #include <stdio.h>
> >  #include <stdlib.h>
> >  #include <string.h>
> >  #include <sys/mman.h>
> >  #include <sys/ioctl.h>
> >  #include <sys/syscall.h>
> >  #include <linux/userfaultfd.h>
> >  #include <fcntl.h>
> >  #include <pthread.h>
> >  #include <unistd.h>
> >  #include <poll.h>
> >  #include <errno.h>
> >
> >  #define PAGE_SIZE 4096
> >  #define REGION_SIZE (512 * 1024)
> >
> >  #ifndef UFFDIO_MOVE
> >  struct uffdio_move {
> >      __u64 dst;
> >      __u64 src;
> >      __u64 len;
> >      #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
> >      #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> >      __u64 mode;
> >      __s64 move;
> >  };
> >  #define _UFFDIO_MOVE  (0x05)
> >  #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> >  #endif
> >
> >  void *src, *dst;
> >  int uffd;
> >
> >  void *madvise_thread(void *arg) {
> >      if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
> >          perror("madvise MADV_PAGEOUT");
> >      }
> >      return NULL;
> >  }
> >
> >  void *fault_handler_thread(void *arg) {
> >      struct uffd_msg msg;
> >      struct uffdio_move move;
> >      struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> >
> >      pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> >      pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> >
> >      while (1) {
> >          if (poll(&pollfd, 1, -1) == -1) {
> >              perror("poll");
> >              exit(EXIT_FAILURE);
> >          }
> >
> >          if (read(uffd, &msg, sizeof(msg)) <= 0) {
> >              perror("read");
> >              exit(EXIT_FAILURE);
> >          }
> >
> >          if (msg.event != UFFD_EVENT_PAGEFAULT) {
> >              fprintf(stderr, "Unexpected event\n");
> >              exit(EXIT_FAILURE);
> >          }
> >
> >          move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
> >          move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> >          move.len = PAGE_SIZE;
> >          move.mode = 0;
> >
> >          if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> >              perror("UFFDIO_MOVE");
> >              exit(EXIT_FAILURE);
> >          }
> >      }
> >      return NULL;
> >  }
> >
> >  int main() {
> >  again:
> >      pthread_t thr, madv_thr;
> >      struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> >      struct uffdio_register uffdio_register;
> >
> >      src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >      if (src == MAP_FAILED) {
> >          perror("mmap src");
> >          exit(EXIT_FAILURE);
> >      }
> >      memset(src, 1, REGION_SIZE);
> >
> >      dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >      if (dst == MAP_FAILED) {
> >          perror("mmap dst");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> >      if (uffd == -1) {
> >          perror("userfaultfd");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> >          perror("UFFDIO_API");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      uffdio_register.range.start = (unsigned long)dst;
> >      uffdio_register.range.len = REGION_SIZE;
> >      uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> >
> >      if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> >          perror("UFFDIO_REGISTER");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> >          perror("pthread_create madvise_thread");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> >          perror("pthread_create fault_handler_thread");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> >          char val = ((char *)dst)[i];
> >          printf("Accessing dst at offset %zu, value: %d\n", i, val);
> >      }
> >
> >      pthread_join(madv_thr, NULL);
> >      pthread_cancel(thr);
> >      pthread_join(thr, NULL);
> >
> >      munmap(src, REGION_SIZE);
> >      munmap(dst, REGION_SIZE);
> >      close(uffd);
> >      goto again;
> >      return 0;
> >  }
> >
> > As long as you enable mTHP (which likely increases the residency
> > time of swapcache), you can reproduce the issue within a few
> > seconds. But I guess the same race condition also exists with
> > small folios.
> >
> > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > Cc: Brian Geffon <bgeffon@google.com>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Kalesh Singh <kaleshsingh@google.com>
> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > Cc: Nicolas Geoffray <ngeoffray@google.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: ZhangPeng <zhangpeng362@huawei.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/userfaultfd.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 867898c4e30b..34cf1c8c725d 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -18,6 +18,7 @@
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > +#include "swap.h"
> >
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> >  {
> > +       struct folio *folio;
> > +       swp_entry_t entry;
> > +
> >         if (!pte_swp_exclusive(orig_src_pte))
> >                 return -EBUSY;
> >
>
> Would be helpful to add a comment explaining that this is the case
> when the folio is in the swap cache.
>
> > +       entry = pte_to_swp_entry(orig_src_pte);
> > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > +       if (!IS_ERR(folio)) {
> > +               folio_put(folio);
> > +               return -EBUSY;
> > +       }
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > --
> > 2.39.3 (Apple Git-146)
> >

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 18:40 ` Lokesh Gidra
@ 2025-02-19 20:45   ` Barry Song
  2025-02-19 20:53     ` Lokesh Gidra
  0 siblings, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-19 20:45 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Suren Baghdasaryan, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng

On Thu, Feb 20, 2025 at 7:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > userfaultfd_move() checks whether the PTE entry is present or a
> > swap entry.
> >
> > - If the PTE entry is present, move_present_pte() handles folio
> >   migration by setting:
> >
> >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> >
> > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >   the PTE to the new dst_addr.
> >
> > This approach is incorrect because even if the PTE is a swap
> > entry, it can still reference a folio that remains in the swap
> > cache.
> >
> > If do_swap_page() is triggered, it may locate the folio in the
> > swap cache. However, during add_rmap operations, a kernel panic
> > can occur due to:
> >  page_pgoff(folio, page) != linear_page_index(vma, address)
> >
> > $./a.out > /dev/null
> > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > [   13.337716] memcg:ffff00000405f000
> > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > [   13.340190] ------------[ cut here ]------------
> > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > [   13.340969] Modules linked in:
> > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > [   13.342018] sp : ffff80008752bb20
> > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > [   13.343876] Call trace:
> > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > [   13.344333]  do_swap_page+0x1060/0x1400
> > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > [   13.344586]  do_page_fault+0x20c/0x770
> > [   13.344673]  do_translation_fault+0xb4/0xf0
> > [   13.344759]  do_mem_abort+0x48/0xa0
> > [   13.344842]  el0_da+0x58/0x130
> > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > [   13.345504] ---[ end trace 0000000000000000 ]---
> > [   13.345715] note: a.out[107] exited with irqs disabled
> > [   13.345954] note: a.out[107] exited with preempt_count 2
> >
> > Fully fixing it would be quite complex, requiring similar handling
> > of folios as done in move_present_pte. For now, a quick solution
> > is to return -EBUSY.
> > I'd like to see others' opinions on whether a full fix is worth
> > pursuing.
> >
>
> Thanks a lot for finding this.
>
> As a user of MOVE ioctl (in Android GC) I strongly urge you to fix
> this properly. Because this is not going to be a rare occurrence in
> the case of Android. And when -EBUSY is returned, all that userspace
> can do is touch the page, which also does not guarantee that a
> subsequent retry of the ioctl will succeed.

Not trying to push this idea, but I’m curious if it's feasible:

If UFFDIO_MOVE fails, could userspace fall back to UFFDIO_COPY?

I’m still trying to wrap my head around a few things, particularly
what exactly UFFDIO_MOVE is doing with mTHP, as I mentioned in my
reply to Suren and you in another email:

https://lore.kernel.org/linux-mm/CAGsJ_4yx1=jaQmDG_9rMqHFFkoXqMJw941eYvtby28OqDq+S7g@mail.gmail.com/


>
> > For anyone interested in reproducing it, the a.out test program is
> > as below,
> >
> >  #define _GNU_SOURCE
> >  #include <stdio.h>
> >  #include <stdlib.h>
> >  #include <string.h>
> >  #include <sys/mman.h>
> >  #include <sys/ioctl.h>
> >  #include <sys/syscall.h>
> >  #include <linux/userfaultfd.h>
> >  #include <fcntl.h>
> >  #include <pthread.h>
> >  #include <unistd.h>
> >  #include <poll.h>
> >  #include <errno.h>
> >
> >  #define PAGE_SIZE 4096
> >  #define REGION_SIZE (512 * 1024)
> >
> >  #ifndef UFFDIO_MOVE
> >  struct uffdio_move {
> >      __u64 dst;
> >      __u64 src;
> >      __u64 len;
> >      #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
> >      #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> >      __u64 mode;
> >      __s64 move;
> >  };
> >  #define _UFFDIO_MOVE  (0x05)
> >  #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> >  #endif
> >
> >  void *src, *dst;
> >  int uffd;
> >
> >  void *madvise_thread(void *arg) {
> >      if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
> >          perror("madvise MADV_PAGEOUT");
> >      }
> >      return NULL;
> >  }
> >
> >  void *fault_handler_thread(void *arg) {
> >      struct uffd_msg msg;
> >      struct uffdio_move move;
> >      struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> >
> >      pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> >      pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> >
> >      while (1) {
> >          if (poll(&pollfd, 1, -1) == -1) {
> >              perror("poll");
> >              exit(EXIT_FAILURE);
> >          }
> >
> >          if (read(uffd, &msg, sizeof(msg)) <= 0) {
> >              perror("read");
> >              exit(EXIT_FAILURE);
> >          }
> >
> >          if (msg.event != UFFD_EVENT_PAGEFAULT) {
> >              fprintf(stderr, "Unexpected event\n");
> >              exit(EXIT_FAILURE);
> >          }
> >
> >          move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
> >          move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> >          move.len = PAGE_SIZE;
> >          move.mode = 0;
> >
> >          if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> >              perror("UFFDIO_MOVE");
> >              exit(EXIT_FAILURE);
> >          }
> >      }
> >      return NULL;
> >  }
> >
> >  int main() {
> >  again:
> >      pthread_t thr, madv_thr;
> >      struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> >      struct uffdio_register uffdio_register;
> >
> >      src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >      if (src == MAP_FAILED) {
> >          perror("mmap src");
> >          exit(EXIT_FAILURE);
> >      }
> >      memset(src, 1, REGION_SIZE);
> >
> >      dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >      if (dst == MAP_FAILED) {
> >          perror("mmap dst");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> >      if (uffd == -1) {
> >          perror("userfaultfd");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> >          perror("UFFDIO_API");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      uffdio_register.range.start = (unsigned long)dst;
> >      uffdio_register.range.len = REGION_SIZE;
> >      uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> >
> >      if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> >          perror("UFFDIO_REGISTER");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> >          perror("pthread_create madvise_thread");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> >          perror("pthread_create fault_handler_thread");
> >          exit(EXIT_FAILURE);
> >      }
> >
> >      for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> >          char val = ((char *)dst)[i];
> >          printf("Accessing dst at offset %zu, value: %d\n", i, val);
> >      }
> >
> >      pthread_join(madv_thr, NULL);
> >      pthread_cancel(thr);
> >      pthread_join(thr, NULL);
> >
> >      munmap(src, REGION_SIZE);
> >      munmap(dst, REGION_SIZE);
> >      close(uffd);
> >      goto again;
> >      return 0;
> >  }
> >
> > As long as you enable mTHP (which likely increases the residency
> > time of swapcache), you can reproduce the issue within a few
> > seconds. But I guess the same race condition also exists with
> > small folios.
> >
> > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > Cc: Suren Baghdasaryan <surenb@google.com>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > Cc: Brian Geffon <bgeffon@google.com>
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: Kalesh Singh <kaleshsingh@google.com>
> > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Cc: Michal Hocko <mhocko@suse.com>
> > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > Cc: Nicolas Geoffray <ngeoffray@google.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: Shuah Khan <shuah@kernel.org>
> > Cc: ZhangPeng <zhangpeng362@huawei.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/userfaultfd.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 867898c4e30b..34cf1c8c725d 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -18,6 +18,7 @@
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > +#include "swap.h"
> >
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> >  {
> > +       struct folio *folio;
> > +       swp_entry_t entry;
> > +
> >         if (!pte_swp_exclusive(orig_src_pte))
> >                 return -EBUSY;
> >
> > +       entry = pte_to_swp_entry(orig_src_pte);
> > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > +       if (!IS_ERR(folio)) {
> > +               folio_put(folio);
> > +               return -EBUSY;
> > +       }
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > --
> > 2.39.3 (Apple Git-146)
> >
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 20:45   ` Barry Song
@ 2025-02-19 20:53     ` Lokesh Gidra
  0 siblings, 0 replies; 47+ messages in thread
From: Lokesh Gidra @ 2025-02-19 20:53 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Suren Baghdasaryan, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng

On Wed, Feb 19, 2025 at 12:45 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 7:40 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > userfaultfd_move() checks whether the PTE entry is present or a
> > > swap entry.
> > >
> > > - If the PTE entry is present, move_present_pte() handles folio
> > >   migration by setting:
> > >
> > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > >
> > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > >   the PTE to the new dst_addr.
> > >
> > > This approach is incorrect because even if the PTE is a swap
> > > entry, it can still reference a folio that remains in the swap
> > > cache.
> > >
> > > If do_swap_page() is triggered, it may locate the folio in the
> > > swap cache. However, during add_rmap operations, a kernel panic
> > > can occur due to:
> > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > >
> > > $./a.out > /dev/null
> > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > [   13.337716] memcg:ffff00000405f000
> > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > [   13.340190] ------------[ cut here ]------------
> > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > [   13.340969] Modules linked in:
> > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > [   13.342018] sp : ffff80008752bb20
> > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > [   13.343876] Call trace:
> > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > [   13.344586]  do_page_fault+0x20c/0x770
> > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > [   13.344842]  el0_da+0x58/0x130
> > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > >
> > > Fully fixing it would be quite complex, requiring similar handling
> > > of folios as done in move_present_pte. For now, a quick solution
> > > is to return -EBUSY.
> > > I'd like to see others' opinions on whether a full fix is worth
> > > pursuing.
> > >
> >
> > Thanks a lot for finding this.
> >
> > As a user of MOVE ioctl (in Android GC) I strongly urge you to fix
> > this properly. Because this is not going to be a rare occurrence in
> > the case of Android. And when -EBUSY is returned, all that userspace
> > can do is touch the page, which also does not guarantee that a
> > subsequent retry of the ioctl will succeed.
>
> Not trying to push this idea, but I’m curious if it's feasible:
>
> If UFFDIO_MOVE fails, could userspace fall back to UFFDIO_COPY?

It's possible! But it wouldn't be rare to find such pages and falling
back to COPY so many times would mellow down the benefits of using
MOVE quite a bit.
>
> I’m still trying to wrap my head around a few things, particularly
> what exactly UFFDIO_MOVE is doing with mTHP, as I mentioned in my
> reply to Suren and you in another email:
>
> https://lore.kernel.org/linux-mm/CAGsJ_4yx1=jaQmDG_9rMqHFFkoXqMJw941eYvtby28OqDq+S7g@mail.gmail.com/
>
>
> >
> > > For anyone interested in reproducing it, the a.out test program is
> > > as below,
> > >
> > >  #define _GNU_SOURCE
> > >  #include <stdio.h>
> > >  #include <stdlib.h>
> > >  #include <string.h>
> > >  #include <sys/mman.h>
> > >  #include <sys/ioctl.h>
> > >  #include <sys/syscall.h>
> > >  #include <linux/userfaultfd.h>
> > >  #include <fcntl.h>
> > >  #include <pthread.h>
> > >  #include <unistd.h>
> > >  #include <poll.h>
> > >  #include <errno.h>
> > >
> > >  #define PAGE_SIZE 4096
> > >  #define REGION_SIZE (512 * 1024)
> > >
> > >  #ifndef UFFDIO_MOVE
> > >  struct uffdio_move {
> > >      __u64 dst;
> > >      __u64 src;
> > >      __u64 len;
> > >      #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
> > >      #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> > >      __u64 mode;
> > >      __s64 move;
> > >  };
> > >  #define _UFFDIO_MOVE  (0x05)
> > >  #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> > >  #endif
> > >
> > >  void *src, *dst;
> > >  int uffd;
> > >
> > >  void *madvise_thread(void *arg) {
> > >      if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
> > >          perror("madvise MADV_PAGEOUT");
> > >      }
> > >      return NULL;
> > >  }
> > >
> > >  void *fault_handler_thread(void *arg) {
> > >      struct uffd_msg msg;
> > >      struct uffdio_move move;
> > >      struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> > >
> > >      pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> > >      pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> > >
> > >      while (1) {
> > >          if (poll(&pollfd, 1, -1) == -1) {
> > >              perror("poll");
> > >              exit(EXIT_FAILURE);
> > >          }
> > >
> > >          if (read(uffd, &msg, sizeof(msg)) <= 0) {
> > >              perror("read");
> > >              exit(EXIT_FAILURE);
> > >          }
> > >
> > >          if (msg.event != UFFD_EVENT_PAGEFAULT) {
> > >              fprintf(stderr, "Unexpected event\n");
> > >              exit(EXIT_FAILURE);
> > >          }
> > >
> > >          move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
> > >          move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> > >          move.len = PAGE_SIZE;
> > >          move.mode = 0;
> > >
> > >          if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> > >              perror("UFFDIO_MOVE");
> > >              exit(EXIT_FAILURE);
> > >          }
> > >      }
> > >      return NULL;
> > >  }
> > >
> > >  int main() {
> > >  again:
> > >      pthread_t thr, madv_thr;
> > >      struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> > >      struct uffdio_register uffdio_register;
> > >
> > >      src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >      if (src == MAP_FAILED) {
> > >          perror("mmap src");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >      memset(src, 1, REGION_SIZE);
> > >
> > >      dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >      if (dst == MAP_FAILED) {
> > >          perror("mmap dst");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> > >      if (uffd == -1) {
> > >          perror("userfaultfd");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> > >          perror("UFFDIO_API");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      uffdio_register.range.start = (unsigned long)dst;
> > >      uffdio_register.range.len = REGION_SIZE;
> > >      uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> > >
> > >      if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> > >          perror("UFFDIO_REGISTER");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> > >          perror("pthread_create madvise_thread");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> > >          perror("pthread_create fault_handler_thread");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> > >          char val = ((char *)dst)[i];
> > >          printf("Accessing dst at offset %zu, value: %d\n", i, val);
> > >      }
> > >
> > >      pthread_join(madv_thr, NULL);
> > >      pthread_cancel(thr);
> > >      pthread_join(thr, NULL);
> > >
> > >      munmap(src, REGION_SIZE);
> > >      munmap(dst, REGION_SIZE);
> > >      close(uffd);
> > >      goto again;
> > >      return 0;
> > >  }
> > >
> > > As long as you enable mTHP (which likely increases the residency
> > > time of swapcache), you can reproduce the issue within a few
> > > seconds. But I guess the same race condition also exists with
> > > small folios.
> > >
> > > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > Cc: Brian Geffon <bgeffon@google.com>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Jann Horn <jannh@google.com>
> > > Cc: Kalesh Singh <kaleshsingh@google.com>
> > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > > Cc: Nicolas Geoffray <ngeoffray@google.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Shuah Khan <shuah@kernel.org>
> > > Cc: ZhangPeng <zhangpeng362@huawei.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  mm/userfaultfd.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index 867898c4e30b..34cf1c8c725d 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -18,6 +18,7 @@
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/tlb.h>
> > >  #include "internal.h"
> > > +#include "swap.h"
> > >
> > >  static __always_inline
> > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> > >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> > >  {
> > > +       struct folio *folio;
> > > +       swp_entry_t entry;
> > > +
> > >         if (!pte_swp_exclusive(orig_src_pte))
> > >                 return -EBUSY;
> > >
> > > +       entry = pte_to_swp_entry(orig_src_pte);
> > > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > +       if (!IS_ERR(folio)) {
> > > +               folio_put(folio);
> > > +               return -EBUSY;
> > > +       }
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > --
> > > 2.39.3 (Apple Git-146)
> > >
> >
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 20:37   ` Barry Song
@ 2025-02-19 20:57     ` Matthew Wilcox
  2025-02-19 21:05       ` Barry Song
  2025-02-19 21:02     ` Lokesh Gidra
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 47+ messages in thread
From: Matthew Wilcox @ 2025-02-19 20:57 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Michal Hocko, Mike Rapoport, Nicolas Geoffray,
	Peter Xu, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > How complex would that be? Is it a matter of adding
> > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > folio->index = linear_page_index like in move_present_pte() or
> > something more?
> 
> My main concern is still with large folios that require a split_folio()
> during move_pages(), as the entire folio shares the same index and
> anon_vma. However, userfaultfd_move() moves pages individually,
> making a split necessary.
> 
> However, in split_huge_page_to_list_to_order(), there is a:
> 
>         if (folio_test_writeback(folio))
>                 return -EBUSY;
> 
> This is likely true for swapcache, right?

I don't see why?  When they get moved to the swap cache, yes, they're
immediately written back, but after being swapped back in, they stay in
the swap cache, so they don't have to be moved back to the swap cache.
Right?



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 20:37   ` Barry Song
  2025-02-19 20:57     ` Matthew Wilcox
@ 2025-02-19 21:02     ` Lokesh Gidra
  2025-02-19 21:26       ` Barry Song
  2025-02-19 22:14     ` Peter Xu
  2025-02-20  8:51     ` David Hildenbrand
  3 siblings, 1 reply; 47+ messages in thread
From: Lokesh Gidra @ 2025-02-19 21:02 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, linux-mm, akpm, linux-kernel, zhengtangquan,
	Barry Song, Andrea Arcangeli, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Wed, Feb 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > userfaultfd_move() checks whether the PTE entry is present or a
> > > swap entry.
> > >
> > > - If the PTE entry is present, move_present_pte() handles folio
> > >   migration by setting:
> > >
> > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > >
> > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > >   the PTE to the new dst_addr.
> > >
> > > This approach is incorrect because even if the PTE is a swap
> > > entry, it can still reference a folio that remains in the swap
> > > cache.
> > >
> > > If do_swap_page() is triggered, it may locate the folio in the
> > > swap cache. However, during add_rmap operations, a kernel panic
> > > can occur due to:
> > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> >
> > Thanks for the report and reproducer!
> >
> > >
> > > $./a.out > /dev/null
> > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > [   13.337716] memcg:ffff00000405f000
> > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > [   13.340190] ------------[ cut here ]------------
> > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > [   13.340969] Modules linked in:
> > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > [   13.342018] sp : ffff80008752bb20
> > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > [   13.343876] Call trace:
> > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > [   13.344586]  do_page_fault+0x20c/0x770
> > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > [   13.344842]  el0_da+0x58/0x130
> > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > >
> > > Fully fixing it would be quite complex, requiring similar handling
> > > of folios as done in move_present_pte.
> >
> > How complex would that be? Is it a matter of adding
> > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > folio->index = linear_page_index like in move_present_pte() or
> > something more?
>
> My main concern is still with large folios that require a split_folio()
> during move_pages(), as the entire folio shares the same index and
> anon_vma. However, userfaultfd_move() moves pages individually,
> making a split necessary.
>
> However, in split_huge_page_to_list_to_order(), there is a:
>
>         if (folio_test_writeback(folio))
>                 return -EBUSY;
>
> This is likely true for swapcache, right? However, even for move_present_pte(),
> it simply returns -EBUSY:
>
> move_pages_pte()
> {
>                 /* at this point we have src_folio locked */
>                 if (folio_test_large(src_folio)) {
>                         /* split_folio() can block */
>                         pte_unmap(&orig_src_pte);
>                         pte_unmap(&orig_dst_pte);
>                         src_pte = dst_pte = NULL;
>                         err = split_folio(src_folio);
>                         if (err)
>                                 goto out;
>
>                         /* have to reacquire the folio after it got split */
>                         folio_unlock(src_folio);
>                         folio_put(src_folio);
>                         src_folio = NULL;
>                         goto retry;
>                 }
> }
>
> Do we need a folio_wait_writeback() before calling split_folio()?
>
> By the way, I have also reported that userfaultfd_move() has a fundamental
> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> kernel. In this scenario, folios in the virtual zone won’t be split in
> split_folio(). Instead, the large folio migrates into nr_pages small folios.
>
> Thus, the best-case scenario would be:
>
> mTHP -> migrate to small folios in split_folio() -> move small folios to
> dst_addr
>
> While this works, it negates the performance benefits of
> userfaultfd_move(), as it introduces two PTE operations (migration in
> split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> allocations, and still requires one memcpy(). This could end up
> performing even worse than userfaultfd_copy(), I guess.
>
> The worst-case scenario would be failing to allocate small folios in
> split_folio(), then userfaultfd_move() might return -ENOMEM?
>
> Given these issues, I strongly recommend that ART hold off on upgrading
> to userfaultfd_move() until these problems are fully understood and
> resolved. Otherwise, we’re in for a rough ride!

At the moment, ART GC doesn't work taking mTHP into consideration. We
don't try to be careful in userspace to be large-page aligned or
anything. Also, the MOVE ioctl implementation works either on
huge-pages or on normal pages. IIUC, it can't handle mTHP large pages
as a whole. But that's true for other userfaultfd ioctls as well. If
we were to continue using COPY, it's not that it's in any way more
friendly to mTHP than MOVE. In fact, that's one of the reasons I'm
considering making the ART heap NO_HUGEPAGE to avoid the need for
folio-split entirely.

Furthermore, there are few cases in which COPY ioctl's overhead just
doesn't make sense for ART GC. So starting to use MOVE ioctl is the
right thing to do.

What we need eventually to gain mTHP benefits is both MOVE ioctl to
support large-page migration as well as GC code in userspace to work
with mTHP in mind.
>
> >
> > > For now, a quick solution
> > > is to return -EBUSY.
> > > I'd like to see others' opinions on whether a full fix is worth
> > > pursuing.
> > >
> > > For anyone interested in reproducing it, the a.out test program is
> > > as below,
> > >
> > >  #define _GNU_SOURCE
> > >  #include <stdio.h>
> > >  #include <stdlib.h>
> > >  #include <string.h>
> > >  #include <sys/mman.h>
> > >  #include <sys/ioctl.h>
> > >  #include <sys/syscall.h>
> > >  #include <linux/userfaultfd.h>
> > >  #include <fcntl.h>
> > >  #include <pthread.h>
> > >  #include <unistd.h>
> > >  #include <poll.h>
> > >  #include <errno.h>
> > >
> > >  #define PAGE_SIZE 4096
> > >  #define REGION_SIZE (512 * 1024)
> > >
> > >  #ifndef UFFDIO_MOVE
> > >  struct uffdio_move {
> > >      __u64 dst;
> > >      __u64 src;
> > >      __u64 len;
> > >      #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
> > >      #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> > >      __u64 mode;
> > >      __s64 move;
> > >  };
> > >  #define _UFFDIO_MOVE  (0x05)
> > >  #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> > >  #endif
> > >
> > >  void *src, *dst;
> > >  int uffd;
> > >
> > >  void *madvise_thread(void *arg) {
> > >      if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
> > >          perror("madvise MADV_PAGEOUT");
> > >      }
> > >      return NULL;
> > >  }
> > >
> > >  void *fault_handler_thread(void *arg) {
> > >      struct uffd_msg msg;
> > >      struct uffdio_move move;
> > >      struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> > >
> > >      pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> > >      pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> > >
> > >      while (1) {
> > >          if (poll(&pollfd, 1, -1) == -1) {
> > >              perror("poll");
> > >              exit(EXIT_FAILURE);
> > >          }
> > >
> > >          if (read(uffd, &msg, sizeof(msg)) <= 0) {
> > >              perror("read");
> > >              exit(EXIT_FAILURE);
> > >          }
> > >
> > >          if (msg.event != UFFD_EVENT_PAGEFAULT) {
> > >              fprintf(stderr, "Unexpected event\n");
> > >              exit(EXIT_FAILURE);
> > >          }
> > >
> > >          move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
> > >          move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> > >          move.len = PAGE_SIZE;
> > >          move.mode = 0;
> > >
> > >          if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> > >              perror("UFFDIO_MOVE");
> > >              exit(EXIT_FAILURE);
> > >          }
> > >      }
> > >      return NULL;
> > >  }
> > >
> > >  int main() {
> > >  again:
> > >      pthread_t thr, madv_thr;
> > >      struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> > >      struct uffdio_register uffdio_register;
> > >
> > >      src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >      if (src == MAP_FAILED) {
> > >          perror("mmap src");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >      memset(src, 1, REGION_SIZE);
> > >
> > >      dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > >      if (dst == MAP_FAILED) {
> > >          perror("mmap dst");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> > >      if (uffd == -1) {
> > >          perror("userfaultfd");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> > >          perror("UFFDIO_API");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      uffdio_register.range.start = (unsigned long)dst;
> > >      uffdio_register.range.len = REGION_SIZE;
> > >      uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> > >
> > >      if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> > >          perror("UFFDIO_REGISTER");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> > >          perror("pthread_create madvise_thread");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> > >          perror("pthread_create fault_handler_thread");
> > >          exit(EXIT_FAILURE);
> > >      }
> > >
> > >      for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> > >          char val = ((char *)dst)[i];
> > >          printf("Accessing dst at offset %zu, value: %d\n", i, val);
> > >      }
> > >
> > >      pthread_join(madv_thr, NULL);
> > >      pthread_cancel(thr);
> > >      pthread_join(thr, NULL);
> > >
> > >      munmap(src, REGION_SIZE);
> > >      munmap(dst, REGION_SIZE);
> > >      close(uffd);
> > >      goto again;
> > >      return 0;
> > >  }
> > >
> > > As long as you enable mTHP (which likely increases the residency
> > > time of swapcache), you can reproduce the issue within a few
> > > seconds. But I guess the same race condition also exists with
> > > small folios.
> > >
> > > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > Cc: Brian Geffon <bgeffon@google.com>
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Cc: Hugh Dickins <hughd@google.com>
> > > Cc: Jann Horn <jannh@google.com>
> > > Cc: Kalesh Singh <kaleshsingh@google.com>
> > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > Cc: Michal Hocko <mhocko@suse.com>
> > > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > > Cc: Nicolas Geoffray <ngeoffray@google.com>
> > > Cc: Peter Xu <peterx@redhat.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: Shuah Khan <shuah@kernel.org>
> > > Cc: ZhangPeng <zhangpeng362@huawei.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  mm/userfaultfd.c | 11 +++++++++++
> > >  1 file changed, 11 insertions(+)
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index 867898c4e30b..34cf1c8c725d 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -18,6 +18,7 @@
> > >  #include <asm/tlbflush.h>
> > >  #include <asm/tlb.h>
> > >  #include "internal.h"
> > > +#include "swap.h"
> > >
> > >  static __always_inline
> > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> > >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> > >  {
> > > +       struct folio *folio;
> > > +       swp_entry_t entry;
> > > +
> > >         if (!pte_swp_exclusive(orig_src_pte))
> > >                 return -EBUSY;
> > >
> >
> > Would be helpful to add a comment explaining that this is the case
> > when the folio is in the swap cache.
> >
> > > +       entry = pte_to_swp_entry(orig_src_pte);
> > > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > +       if (!IS_ERR(folio)) {
> > > +               folio_put(folio);
> > > +               return -EBUSY;
> > > +       }
> > > +
> > >         double_pt_lock(dst_ptl, src_ptl);
> > >
> > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > --
> > > 2.39.3 (Apple Git-146)
> > >
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 20:57     ` Matthew Wilcox
@ 2025-02-19 21:05       ` Barry Song
  0 siblings, 0 replies; 47+ messages in thread
From: Barry Song @ 2025-02-19 21:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Michal Hocko, Mike Rapoport, Nicolas Geoffray,
	Peter Xu, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 9:57 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > How complex would that be? Is it a matter of adding
> > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > folio->index = linear_page_index like in move_present_pte() or
> > > something more?
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > However, in split_huge_page_to_list_to_order(), there is a:
> >
> >         if (folio_test_writeback(folio))
> >                 return -EBUSY;
> >
> > This is likely true for swapcache, right?
>
> I don't see why?  When they get moved to the swap cache, yes, they're
> immediately written back, but after being swapped back in, they stay in
> the swap cache, so they don't have to be moved back to the swap cache.
> Right?

I don’t quite understand your question. The issue we’re discussing is
that the folio is in swapcache. Right now, we’re encountering a kernel
crash because we haven’t fixed the folio’s index. If we want to address
that, we need to perform a split_folio() for mTHP. Since we’re already
dealing with swapcache, we’re likely in a situation where we’re doing
writeback (pageout), considering Android uses sync zram. So, if
swapcache is true, writeback is probably true as well.

The race occurs after we call add_to_swap(), try_to_unmap() and
before we complete the writeback - page. (Swapcache will be cleared
for the sync device once the writeback is finished.)

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 21:02     ` Lokesh Gidra
@ 2025-02-19 21:26       ` Barry Song
  2025-02-19 21:32         ` Lokesh Gidra
  0 siblings, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-19 21:26 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Suren Baghdasaryan, linux-mm, akpm, linux-kernel, zhengtangquan,
	Barry Song, Andrea Arcangeli, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 10:03 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Wed, Feb 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > swap entry.
> > > >
> > > > - If the PTE entry is present, move_present_pte() handles folio
> > > >   migration by setting:
> > > >
> > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > >
> > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > >   the PTE to the new dst_addr.
> > > >
> > > > This approach is incorrect because even if the PTE is a swap
> > > > entry, it can still reference a folio that remains in the swap
> > > > cache.
> > > >
> > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > can occur due to:
> > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > >
> > > Thanks for the report and reproducer!
> > >
> > > >
> > > > $./a.out > /dev/null
> > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > [   13.337716] memcg:ffff00000405f000
> > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > [   13.340190] ------------[ cut here ]------------
> > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > [   13.340969] Modules linked in:
> > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > [   13.342018] sp : ffff80008752bb20
> > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > [   13.343876] Call trace:
> > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > [   13.344842]  el0_da+0x58/0x130
> > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > >
> > > > Fully fixing it would be quite complex, requiring similar handling
> > > > of folios as done in move_present_pte.
> > >
> > > How complex would that be? Is it a matter of adding
> > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > folio->index = linear_page_index like in move_present_pte() or
> > > something more?
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > However, in split_huge_page_to_list_to_order(), there is a:
> >
> >         if (folio_test_writeback(folio))
> >                 return -EBUSY;
> >
> > This is likely true for swapcache, right? However, even for move_present_pte(),
> > it simply returns -EBUSY:
> >
> > move_pages_pte()
> > {
> >                 /* at this point we have src_folio locked */
> >                 if (folio_test_large(src_folio)) {
> >                         /* split_folio() can block */
> >                         pte_unmap(&orig_src_pte);
> >                         pte_unmap(&orig_dst_pte);
> >                         src_pte = dst_pte = NULL;
> >                         err = split_folio(src_folio);
> >                         if (err)
> >                                 goto out;
> >
> >                         /* have to reacquire the folio after it got split */
> >                         folio_unlock(src_folio);
> >                         folio_put(src_folio);
> >                         src_folio = NULL;
> >                         goto retry;
> >                 }
> > }
> >
> > Do we need a folio_wait_writeback() before calling split_folio()?
> >
> > By the way, I have also reported that userfaultfd_move() has a fundamental
> > conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > kernel. In this scenario, folios in the virtual zone won’t be split in
> > split_folio(). Instead, the large folio migrates into nr_pages small folios.
> >
> > Thus, the best-case scenario would be:
> >
> > mTHP -> migrate to small folios in split_folio() -> move small folios to
> > dst_addr
> >
> > While this works, it negates the performance benefits of
> > userfaultfd_move(), as it introduces two PTE operations (migration in
> > split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> > allocations, and still requires one memcpy(). This could end up
> > performing even worse than userfaultfd_copy(), I guess.
> >
> > The worst-case scenario would be failing to allocate small folios in
> > split_folio(), then userfaultfd_move() might return -ENOMEM?
> >
> > Given these issues, I strongly recommend that ART hold off on upgrading
> > to userfaultfd_move() until these problems are fully understood and
> > resolved. Otherwise, we’re in for a rough ride!
>
> At the moment, ART GC doesn't work taking mTHP into consideration. We
> don't try to be careful in userspace to be large-page aligned or
> anything. Also, the MOVE ioctl implementation works either on
> huge-pages or on normal pages. IIUC, it can't handle mTHP large pages
> as a whole. But that's true for other userfaultfd ioctls as well. If
> we were to continue using COPY, it's not that it's in any way more
> friendly to mTHP than MOVE. In fact, that's one of the reasons I'm
> considering making the ART heap NO_HUGEPAGE to avoid the need for
> folio-split entirely.

Disabling mTHP is one way to avoid potential bugs. However, as long as
UFFDIO_MOVE is available, we can’t prevent others, aside from ART GC,
from using it, right? So, we still need to address these issues with mTHP.

If a trend-following Android app discovers the UFFDIO_MOVE API, it might
use it, and it may not necessarily know to disable hugepages. Doesn’t that
pose a risk?

>
> Furthermore, there are few cases in which COPY ioctl's overhead just
> doesn't make sense for ART GC. So starting to use MOVE ioctl is the
> right thing to do.
>
> What we need eventually to gain mTHP benefits is both MOVE ioctl to
> support large-page migration as well as GC code in userspace to work
> with mTHP in mind.
> >
> > >
> > > > For now, a quick solution
> > > > is to return -EBUSY.
> > > > I'd like to see others' opinions on whether a full fix is worth
> > > > pursuing.
> > > >
> > > > For anyone interested in reproducing it, the a.out test program is
> > > > as below,
> > > >
> > > >  #define _GNU_SOURCE
> > > >  #include <stdio.h>
> > > >  #include <stdlib.h>
> > > >  #include <string.h>
> > > >  #include <sys/mman.h>
> > > >  #include <sys/ioctl.h>
> > > >  #include <sys/syscall.h>
> > > >  #include <linux/userfaultfd.h>
> > > >  #include <fcntl.h>
> > > >  #include <pthread.h>
> > > >  #include <unistd.h>
> > > >  #include <poll.h>
> > > >  #include <errno.h>
> > > >
> > > >  #define PAGE_SIZE 4096
> > > >  #define REGION_SIZE (512 * 1024)
> > > >
> > > >  #ifndef UFFDIO_MOVE
> > > >  struct uffdio_move {
> > > >      __u64 dst;
> > > >      __u64 src;
> > > >      __u64 len;
> > > >      #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
> > > >      #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> > > >      __u64 mode;
> > > >      __s64 move;
> > > >  };
> > > >  #define _UFFDIO_MOVE  (0x05)
> > > >  #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> > > >  #endif
> > > >
> > > >  void *src, *dst;
> > > >  int uffd;
> > > >
> > > >  void *madvise_thread(void *arg) {
> > > >      if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
> > > >          perror("madvise MADV_PAGEOUT");
> > > >      }
> > > >      return NULL;
> > > >  }
> > > >
> > > >  void *fault_handler_thread(void *arg) {
> > > >      struct uffd_msg msg;
> > > >      struct uffdio_move move;
> > > >      struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> > > >
> > > >      pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> > > >      pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> > > >
> > > >      while (1) {
> > > >          if (poll(&pollfd, 1, -1) == -1) {
> > > >              perror("poll");
> > > >              exit(EXIT_FAILURE);
> > > >          }
> > > >
> > > >          if (read(uffd, &msg, sizeof(msg)) <= 0) {
> > > >              perror("read");
> > > >              exit(EXIT_FAILURE);
> > > >          }
> > > >
> > > >          if (msg.event != UFFD_EVENT_PAGEFAULT) {
> > > >              fprintf(stderr, "Unexpected event\n");
> > > >              exit(EXIT_FAILURE);
> > > >          }
> > > >
> > > >          move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
> > > >          move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> > > >          move.len = PAGE_SIZE;
> > > >          move.mode = 0;
> > > >
> > > >          if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> > > >              perror("UFFDIO_MOVE");
> > > >              exit(EXIT_FAILURE);
> > > >          }
> > > >      }
> > > >      return NULL;
> > > >  }
> > > >
> > > >  int main() {
> > > >  again:
> > > >      pthread_t thr, madv_thr;
> > > >      struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> > > >      struct uffdio_register uffdio_register;
> > > >
> > > >      src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > >      if (src == MAP_FAILED) {
> > > >          perror("mmap src");
> > > >          exit(EXIT_FAILURE);
> > > >      }
> > > >      memset(src, 1, REGION_SIZE);
> > > >
> > > >      dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > >      if (dst == MAP_FAILED) {
> > > >          perror("mmap dst");
> > > >          exit(EXIT_FAILURE);
> > > >      }
> > > >
> > > >      uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> > > >      if (uffd == -1) {
> > > >          perror("userfaultfd");
> > > >          exit(EXIT_FAILURE);
> > > >      }
> > > >
> > > >      if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> > > >          perror("UFFDIO_API");
> > > >          exit(EXIT_FAILURE);
> > > >      }
> > > >
> > > >      uffdio_register.range.start = (unsigned long)dst;
> > > >      uffdio_register.range.len = REGION_SIZE;
> > > >      uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> > > >
> > > >      if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> > > >          perror("UFFDIO_REGISTER");
> > > >          exit(EXIT_FAILURE);
> > > >      }
> > > >
> > > >      if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> > > >          perror("pthread_create madvise_thread");
> > > >          exit(EXIT_FAILURE);
> > > >      }
> > > >
> > > >      if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> > > >          perror("pthread_create fault_handler_thread");
> > > >          exit(EXIT_FAILURE);
> > > >      }
> > > >
> > > >      for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> > > >          char val = ((char *)dst)[i];
> > > >          printf("Accessing dst at offset %zu, value: %d\n", i, val);
> > > >      }
> > > >
> > > >      pthread_join(madv_thr, NULL);
> > > >      pthread_cancel(thr);
> > > >      pthread_join(thr, NULL);
> > > >
> > > >      munmap(src, REGION_SIZE);
> > > >      munmap(dst, REGION_SIZE);
> > > >      close(uffd);
> > > >      goto again;
> > > >      return 0;
> > > >  }
> > > >
> > > > As long as you enable mTHP (which likely increases the residency
> > > > time of swapcache), you can reproduce the issue within a few
> > > > seconds. But I guess the same race condition also exists with
> > > > small folios.
> > > >
> > > > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> > > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > > Cc: Brian Geffon <bgeffon@google.com>
> > > > Cc: Christian Brauner <brauner@kernel.org>
> > > > Cc: David Hildenbrand <david@redhat.com>
> > > > Cc: Hugh Dickins <hughd@google.com>
> > > > Cc: Jann Horn <jannh@google.com>
> > > > Cc: Kalesh Singh <kaleshsingh@google.com>
> > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > > > Cc: Nicolas Geoffray <ngeoffray@google.com>
> > > > Cc: Peter Xu <peterx@redhat.com>
> > > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > > Cc: Shuah Khan <shuah@kernel.org>
> > > > Cc: ZhangPeng <zhangpeng362@huawei.com>
> > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > ---
> > > >  mm/userfaultfd.c | 11 +++++++++++
> > > >  1 file changed, 11 insertions(+)
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index 867898c4e30b..34cf1c8c725d 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -18,6 +18,7 @@
> > > >  #include <asm/tlbflush.h>
> > > >  #include <asm/tlb.h>
> > > >  #include "internal.h"
> > > > +#include "swap.h"
> > > >
> > > >  static __always_inline
> > > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> > > >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> > > >  {
> > > > +       struct folio *folio;
> > > > +       swp_entry_t entry;
> > > > +
> > > >         if (!pte_swp_exclusive(orig_src_pte))
> > > >                 return -EBUSY;
> > > >
> > >
> > > Would be helpful to add a comment explaining that this is the case
> > > when the folio is in the swap cache.
> > >
> > > > +       entry = pte_to_swp_entry(orig_src_pte);
> > > > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > > +       if (!IS_ERR(folio)) {
> > > > +               folio_put(folio);
> > > > +               return -EBUSY;
> > > > +       }
> > > > +
> > > >         double_pt_lock(dst_ptl, src_ptl);
> > > >
> > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > --
> > > > 2.39.3 (Apple Git-146)
> > > >
> >

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 21:26       ` Barry Song
@ 2025-02-19 21:32         ` Lokesh Gidra
  0 siblings, 0 replies; 47+ messages in thread
From: Lokesh Gidra @ 2025-02-19 21:32 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, linux-mm, akpm, linux-kernel, zhengtangquan,
	Barry Song, Andrea Arcangeli, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Wed, Feb 19, 2025 at 1:26 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 10:03 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > swap entry.
> > > > >
> > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > >   migration by setting:
> > > > >
> > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > >
> > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > >   the PTE to the new dst_addr.
> > > > >
> > > > > This approach is incorrect because even if the PTE is a swap
> > > > > entry, it can still reference a folio that remains in the swap
> > > > > cache.
> > > > >
> > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > can occur due to:
> > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > >
> > > > Thanks for the report and reproducer!
> > > >
> > > > >
> > > > > $./a.out > /dev/null
> > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > [   13.337716] memcg:ffff00000405f000
> > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > [   13.340190] ------------[ cut here ]------------
> > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > [   13.340969] Modules linked in:
> > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > [   13.342018] sp : ffff80008752bb20
> > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > [   13.343876] Call trace:
> > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > >
> > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > of folios as done in move_present_pte.
> > > >
> > > > How complex would that be? Is it a matter of adding
> > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > folio->index = linear_page_index like in move_present_pte() or
> > > > something more?
> > >
> > > My main concern is still with large folios that require a split_folio()
> > > during move_pages(), as the entire folio shares the same index and
> > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > making a split necessary.
> > >
> > > However, in split_huge_page_to_list_to_order(), there is a:
> > >
> > >         if (folio_test_writeback(folio))
> > >                 return -EBUSY;
> > >
> > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > it simply returns -EBUSY:
> > >
> > > move_pages_pte()
> > > {
> > >                 /* at this point we have src_folio locked */
> > >                 if (folio_test_large(src_folio)) {
> > >                         /* split_folio() can block */
> > >                         pte_unmap(&orig_src_pte);
> > >                         pte_unmap(&orig_dst_pte);
> > >                         src_pte = dst_pte = NULL;
> > >                         err = split_folio(src_folio);
> > >                         if (err)
> > >                                 goto out;
> > >
> > >                         /* have to reacquire the folio after it got split */
> > >                         folio_unlock(src_folio);
> > >                         folio_put(src_folio);
> > >                         src_folio = NULL;
> > >                         goto retry;
> > >                 }
> > > }
> > >
> > > Do we need a folio_wait_writeback() before calling split_folio()?
> > >
> > > By the way, I have also reported that userfaultfd_move() has a fundamental
> > > conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > > kernel. In this scenario, folios in the virtual zone won’t be split in
> > > split_folio(). Instead, the large folio migrates into nr_pages small folios.
> > >
> > > Thus, the best-case scenario would be:
> > >
> > > mTHP -> migrate to small folios in split_folio() -> move small folios to
> > > dst_addr
> > >
> > > While this works, it negates the performance benefits of
> > > userfaultfd_move(), as it introduces two PTE operations (migration in
> > > split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> > > allocations, and still requires one memcpy(). This could end up
> > > performing even worse than userfaultfd_copy(), I guess.
> > >
> > > The worst-case scenario would be failing to allocate small folios in
> > > split_folio(), then userfaultfd_move() might return -ENOMEM?
> > >
> > > Given these issues, I strongly recommend that ART hold off on upgrading
> > > to userfaultfd_move() until these problems are fully understood and
> > > resolved. Otherwise, we’re in for a rough ride!
> >
> > At the moment, ART GC doesn't work taking mTHP into consideration. We
> > don't try to be careful in userspace to be large-page aligned or
> > anything. Also, the MOVE ioctl implementation works either on
> > huge-pages or on normal pages. IIUC, it can't handle mTHP large pages
> > as a whole. But that's true for other userfaultfd ioctls as well. If
> > we were to continue using COPY, it's not that it's in any way more
> > friendly to mTHP than MOVE. In fact, that's one of the reasons I'm
> > considering making the ART heap NO_HUGEPAGE to avoid the need for
> > folio-split entirely.
>
> Disabling mTHP is one way to avoid potential bugs. However, as long as
> UFFDIO_MOVE is available, we can’t prevent others, aside from ART GC,
> from using it, right? So, we still need to address these issues with mTHP.
>
> If a trend-following Android app discovers the UFFDIO_MOVE API, it might
> use it, and it may not necessarily know to disable hugepages. Doesn’t that
> pose a risk?
>
I absolutely agree that these issues need to be addressed.
Particularly the correctness bugs must be resolved at the earliest
possible.

I was just trying to answer your question as to why we want to use it,
now that it is available, instead of continuing with COPY ioctl. As
and when MOVE ioctl will start handling mTHP efficiently, I will make
the required changes in the userspace to leverage mTHP benefits.

> >
> > Furthermore, there are few cases in which COPY ioctl's overhead just
> > doesn't make sense for ART GC. So starting to use MOVE ioctl is the
> > right thing to do.
> >
> > What we need eventually to gain mTHP benefits is both MOVE ioctl to
> > support large-page migration as well as GC code in userspace to work
> > with mTHP in mind.
> > >
> > > >
> > > > > For now, a quick solution
> > > > > is to return -EBUSY.
> > > > > I'd like to see others' opinions on whether a full fix is worth
> > > > > pursuing.
> > > > >
> > > > > For anyone interested in reproducing it, the a.out test program is
> > > > > as below,
> > > > >
> > > > >  #define _GNU_SOURCE
> > > > >  #include <stdio.h>
> > > > >  #include <stdlib.h>
> > > > >  #include <string.h>
> > > > >  #include <sys/mman.h>
> > > > >  #include <sys/ioctl.h>
> > > > >  #include <sys/syscall.h>
> > > > >  #include <linux/userfaultfd.h>
> > > > >  #include <fcntl.h>
> > > > >  #include <pthread.h>
> > > > >  #include <unistd.h>
> > > > >  #include <poll.h>
> > > > >  #include <errno.h>
> > > > >
> > > > >  #define PAGE_SIZE 4096
> > > > >  #define REGION_SIZE (512 * 1024)
> > > > >
> > > > >  #ifndef UFFDIO_MOVE
> > > > >  struct uffdio_move {
> > > > >      __u64 dst;
> > > > >      __u64 src;
> > > > >      __u64 len;
> > > > >      #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
> > > > >      #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> > > > >      __u64 mode;
> > > > >      __s64 move;
> > > > >  };
> > > > >  #define _UFFDIO_MOVE  (0x05)
> > > > >  #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> > > > >  #endif
> > > > >
> > > > >  void *src, *dst;
> > > > >  int uffd;
> > > > >
> > > > >  void *madvise_thread(void *arg) {
> > > > >      if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
> > > > >          perror("madvise MADV_PAGEOUT");
> > > > >      }
> > > > >      return NULL;
> > > > >  }
> > > > >
> > > > >  void *fault_handler_thread(void *arg) {
> > > > >      struct uffd_msg msg;
> > > > >      struct uffdio_move move;
> > > > >      struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> > > > >
> > > > >      pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> > > > >      pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> > > > >
> > > > >      while (1) {
> > > > >          if (poll(&pollfd, 1, -1) == -1) {
> > > > >              perror("poll");
> > > > >              exit(EXIT_FAILURE);
> > > > >          }
> > > > >
> > > > >          if (read(uffd, &msg, sizeof(msg)) <= 0) {
> > > > >              perror("read");
> > > > >              exit(EXIT_FAILURE);
> > > > >          }
> > > > >
> > > > >          if (msg.event != UFFD_EVENT_PAGEFAULT) {
> > > > >              fprintf(stderr, "Unexpected event\n");
> > > > >              exit(EXIT_FAILURE);
> > > > >          }
> > > > >
> > > > >          move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
> > > > >          move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> > > > >          move.len = PAGE_SIZE;
> > > > >          move.mode = 0;
> > > > >
> > > > >          if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> > > > >              perror("UFFDIO_MOVE");
> > > > >              exit(EXIT_FAILURE);
> > > > >          }
> > > > >      }
> > > > >      return NULL;
> > > > >  }
> > > > >
> > > > >  int main() {
> > > > >  again:
> > > > >      pthread_t thr, madv_thr;
> > > > >      struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> > > > >      struct uffdio_register uffdio_register;
> > > > >
> > > > >      src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > >      if (src == MAP_FAILED) {
> > > > >          perror("mmap src");
> > > > >          exit(EXIT_FAILURE);
> > > > >      }
> > > > >      memset(src, 1, REGION_SIZE);
> > > > >
> > > > >      dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > >      if (dst == MAP_FAILED) {
> > > > >          perror("mmap dst");
> > > > >          exit(EXIT_FAILURE);
> > > > >      }
> > > > >
> > > > >      uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> > > > >      if (uffd == -1) {
> > > > >          perror("userfaultfd");
> > > > >          exit(EXIT_FAILURE);
> > > > >      }
> > > > >
> > > > >      if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> > > > >          perror("UFFDIO_API");
> > > > >          exit(EXIT_FAILURE);
> > > > >      }
> > > > >
> > > > >      uffdio_register.range.start = (unsigned long)dst;
> > > > >      uffdio_register.range.len = REGION_SIZE;
> > > > >      uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> > > > >
> > > > >      if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> > > > >          perror("UFFDIO_REGISTER");
> > > > >          exit(EXIT_FAILURE);
> > > > >      }
> > > > >
> > > > >      if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> > > > >          perror("pthread_create madvise_thread");
> > > > >          exit(EXIT_FAILURE);
> > > > >      }
> > > > >
> > > > >      if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> > > > >          perror("pthread_create fault_handler_thread");
> > > > >          exit(EXIT_FAILURE);
> > > > >      }
> > > > >
> > > > >      for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> > > > >          char val = ((char *)dst)[i];
> > > > >          printf("Accessing dst at offset %zu, value: %d\n", i, val);
> > > > >      }
> > > > >
> > > > >      pthread_join(madv_thr, NULL);
> > > > >      pthread_cancel(thr);
> > > > >      pthread_join(thr, NULL);
> > > > >
> > > > >      munmap(src, REGION_SIZE);
> > > > >      munmap(dst, REGION_SIZE);
> > > > >      close(uffd);
> > > > >      goto again;
> > > > >      return 0;
> > > > >  }
> > > > >
> > > > > As long as you enable mTHP (which likely increases the residency
> > > > > time of swapcache), you can reproduce the issue within a few
> > > > > seconds. But I guess the same race condition also exists with
> > > > > small folios.
> > > > >
> > > > > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > Cc: Andrea Arcangeli <aarcange@redhat.com>
> > > > > Cc: Suren Baghdasaryan <surenb@google.com>
> > > > > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > > > > Cc: Axel Rasmussen <axelrasmussen@google.com>
> > > > > Cc: Brian Geffon <bgeffon@google.com>
> > > > > Cc: Christian Brauner <brauner@kernel.org>
> > > > > Cc: David Hildenbrand <david@redhat.com>
> > > > > Cc: Hugh Dickins <hughd@google.com>
> > > > > Cc: Jann Horn <jannh@google.com>
> > > > > Cc: Kalesh Singh <kaleshsingh@google.com>
> > > > > Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> > > > > Cc: Lokesh Gidra <lokeshgidra@google.com>
> > > > > Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > > Cc: Michal Hocko <mhocko@suse.com>
> > > > > Cc: Mike Rapoport (IBM) <rppt@kernel.org>
> > > > > Cc: Nicolas Geoffray <ngeoffray@google.com>
> > > > > Cc: Peter Xu <peterx@redhat.com>
> > > > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > > > Cc: Shuah Khan <shuah@kernel.org>
> > > > > Cc: ZhangPeng <zhangpeng362@huawei.com>
> > > > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > > > ---
> > > > >  mm/userfaultfd.c | 11 +++++++++++
> > > > >  1 file changed, 11 insertions(+)
> > > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index 867898c4e30b..34cf1c8c725d 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -18,6 +18,7 @@
> > > > >  #include <asm/tlbflush.h>
> > > > >  #include <asm/tlb.h>
> > > > >  #include "internal.h"
> > > > > +#include "swap.h"
> > > > >
> > > > >  static __always_inline
> > > > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > > > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> > > > >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> > > > >  {
> > > > > +       struct folio *folio;
> > > > > +       swp_entry_t entry;
> > > > > +
> > > > >         if (!pte_swp_exclusive(orig_src_pte))
> > > > >                 return -EBUSY;
> > > > >
> > > >
> > > > Would be helpful to add a comment explaining that this is the case
> > > > when the folio is in the swap cache.
> > > >
> > > > > +       entry = pte_to_swp_entry(orig_src_pte);
> > > > > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > > > +       if (!IS_ERR(folio)) {
> > > > > +               folio_put(folio);
> > > > > +               return -EBUSY;
> > > > > +       }
> > > > > +
> > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > >
> > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > --
> > > > > 2.39.3 (Apple Git-146)
> > > > >
> > >
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 20:37   ` Barry Song
  2025-02-19 20:57     ` Matthew Wilcox
  2025-02-19 21:02     ` Lokesh Gidra
@ 2025-02-19 22:14     ` Peter Xu
  2025-02-19 23:04       ` Barry Song
  2025-02-20  8:51     ` David Hildenbrand
  3 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-02-19 22:14 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > >
> > > userfaultfd_move() checks whether the PTE entry is present or a
> > > swap entry.
> > >
> > > - If the PTE entry is present, move_present_pte() handles folio
> > >   migration by setting:
> > >
> > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > >
> > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > >   the PTE to the new dst_addr.
> > >
> > > This approach is incorrect because even if the PTE is a swap
> > > entry, it can still reference a folio that remains in the swap
> > > cache.
> > >
> > > If do_swap_page() is triggered, it may locate the folio in the
> > > swap cache. However, during add_rmap operations, a kernel panic
> > > can occur due to:
> > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> >
> > Thanks for the report and reproducer!
> >
> > >
> > > $./a.out > /dev/null
> > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > [   13.337716] memcg:ffff00000405f000
> > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > [   13.340190] ------------[ cut here ]------------
> > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > [   13.340969] Modules linked in:
> > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > [   13.342018] sp : ffff80008752bb20
> > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > [   13.343876] Call trace:
> > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > [   13.344586]  do_page_fault+0x20c/0x770
> > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > [   13.344842]  el0_da+0x58/0x130
> > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > >
> > > Fully fixing it would be quite complex, requiring similar handling
> > > of folios as done in move_present_pte.
> >
> > How complex would that be? Is it a matter of adding
> > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > folio->index = linear_page_index like in move_present_pte() or
> > something more?
> 
> My main concern is still with large folios that require a split_folio()
> during move_pages(), as the entire folio shares the same index and
> anon_vma. However, userfaultfd_move() moves pages individually,
> making a split necessary.
> 
> However, in split_huge_page_to_list_to_order(), there is a:
> 
>         if (folio_test_writeback(folio))
>                 return -EBUSY;
> 
> This is likely true for swapcache, right? However, even for move_present_pte(),
> it simply returns -EBUSY:
> 
> move_pages_pte()
> {
>                 /* at this point we have src_folio locked */
>                 if (folio_test_large(src_folio)) {
>                         /* split_folio() can block */
>                         pte_unmap(&orig_src_pte);
>                         pte_unmap(&orig_dst_pte);
>                         src_pte = dst_pte = NULL;
>                         err = split_folio(src_folio);
>                         if (err)
>                                 goto out;
> 
>                         /* have to reacquire the folio after it got split */
>                         folio_unlock(src_folio);
>                         folio_put(src_folio);
>                         src_folio = NULL;
>                         goto retry;
>                 }
> }
> 
> Do we need a folio_wait_writeback() before calling split_folio()?

Maybe no need in the first version to fix the immediate bug?

It's also not always the case to hit writeback here. IIUC, writeback only
happens for a short window when the folio was just added into swapcache.
MOVE can happen much later after that anytime before a swapin.  My
understanding is that's also what Matthew wanted to point out.  It may be
better justified of that in a separate change with some performance
measurements.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 11:25 [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache Barry Song
  2025-02-19 18:26 ` Suren Baghdasaryan
  2025-02-19 18:40 ` Lokesh Gidra
@ 2025-02-19 22:31 ` Peter Xu
  2025-02-20  0:50   ` Barry Song
  2 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-02-19 22:31 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Suren Baghdasaryan, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Lokesh Gidra,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Nicolas Geoffray,
	Ryan Roberts, Shuah Khan, ZhangPeng

On Thu, Feb 20, 2025 at 12:25:19AM +1300, Barry Song wrote:
> @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
>  			 pmd_t *dst_pmd, pmd_t dst_pmdval,
>  			 spinlock_t *dst_ptl, spinlock_t *src_ptl)
>  {
> +	struct folio *folio;
> +	swp_entry_t entry;
> +
>  	if (!pte_swp_exclusive(orig_src_pte))
>  		return -EBUSY;
>  
> +	entry = pte_to_swp_entry(orig_src_pte);
> +	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));

[Besides what's being discussed elsewhere..]

swap_cache_get_folio() says:

 * Caller must lock the swap device or hold a reference to keep it valid.

Do we need get_swap_device() too here to avoid swapoff race?

> +	if (!IS_ERR(folio)) {
> +		folio_put(folio);
> +		return -EBUSY;
> +	}
> +
>  	double_pt_lock(dst_ptl, src_ptl);
>  
>  	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> -- 
> 2.39.3 (Apple Git-146)
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 22:14     ` Peter Xu
@ 2025-02-19 23:04       ` Barry Song
  2025-02-19 23:19         ` Lokesh Gidra
  2025-02-20 22:59         ` Peter Xu
  0 siblings, 2 replies; 47+ messages in thread
From: Barry Song @ 2025-02-19 23:04 UTC (permalink / raw)
  To: Peter Xu
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > From: Barry Song <v-songbaohua@oppo.com>
> > > >
> > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > swap entry.
> > > >
> > > > - If the PTE entry is present, move_present_pte() handles folio
> > > >   migration by setting:
> > > >
> > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > >
> > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > >   the PTE to the new dst_addr.
> > > >
> > > > This approach is incorrect because even if the PTE is a swap
> > > > entry, it can still reference a folio that remains in the swap
> > > > cache.
> > > >
> > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > can occur due to:
> > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > >
> > > Thanks for the report and reproducer!
> > >
> > > >
> > > > $./a.out > /dev/null
> > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > [   13.337716] memcg:ffff00000405f000
> > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > [   13.340190] ------------[ cut here ]------------
> > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > [   13.340969] Modules linked in:
> > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > [   13.342018] sp : ffff80008752bb20
> > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > [   13.343876] Call trace:
> > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > [   13.344842]  el0_da+0x58/0x130
> > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > >
> > > > Fully fixing it would be quite complex, requiring similar handling
> > > > of folios as done in move_present_pte.
> > >
> > > How complex would that be? Is it a matter of adding
> > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > folio->index = linear_page_index like in move_present_pte() or
> > > something more?
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > However, in split_huge_page_to_list_to_order(), there is a:
> >
> >         if (folio_test_writeback(folio))
> >                 return -EBUSY;
> >
> > This is likely true for swapcache, right? However, even for move_present_pte(),
> > it simply returns -EBUSY:
> >
> > move_pages_pte()
> > {
> >                 /* at this point we have src_folio locked */
> >                 if (folio_test_large(src_folio)) {
> >                         /* split_folio() can block */
> >                         pte_unmap(&orig_src_pte);
> >                         pte_unmap(&orig_dst_pte);
> >                         src_pte = dst_pte = NULL;
> >                         err = split_folio(src_folio);
> >                         if (err)
> >                                 goto out;
> >
> >                         /* have to reacquire the folio after it got split */
> >                         folio_unlock(src_folio);
> >                         folio_put(src_folio);
> >                         src_folio = NULL;
> >                         goto retry;
> >                 }
> > }
> >
> > Do we need a folio_wait_writeback() before calling split_folio()?
>
> Maybe no need in the first version to fix the immediate bug?
>
> It's also not always the case to hit writeback here. IIUC, writeback only
> happens for a short window when the folio was just added into swapcache.
> MOVE can happen much later after that anytime before a swapin.  My
> understanding is that's also what Matthew wanted to point out.  It may be
> better justified of that in a separate change with some performance
> measurements.

The bug we’re discussing occurs precisely within the short window you
mentioned.

1. add_to_swap: The folio is added to swapcache.
2. try_to_unmap: PTEs are converted to swap entries.
3. pageout
4. Swapcache is cleared.

The issue happens between steps 2 and 4, where the PTE is not present, but
the folio is still in swapcache - the current code does move_swap_pte() but does
not fixup folio->index within swapcache.

My point is that if we want a proper fix for mTHP, we'd better handle writeback.
Otherwise, this isn’t much different from directly returning -EBUSY as proposed
in this RFC.

For small folios, there’s no split_folio issue, making it relatively
simpler. Lokesh
mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
the first priority.

>
> Thanks,
>
> --
> Peter Xu

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 23:04       ` Barry Song
@ 2025-02-19 23:19         ` Lokesh Gidra
  2025-02-20  0:49           ` Barry Song
  2025-02-20 22:59         ` Peter Xu
  1 sibling, 1 reply; 47+ messages in thread
From: Lokesh Gidra @ 2025-02-19 23:19 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Xu, Suren Baghdasaryan, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Wed, Feb 19, 2025 at 3:04 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > swap entry.
> > > > >
> > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > >   migration by setting:
> > > > >
> > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > >
> > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > >   the PTE to the new dst_addr.
> > > > >
> > > > > This approach is incorrect because even if the PTE is a swap
> > > > > entry, it can still reference a folio that remains in the swap
> > > > > cache.
> > > > >
> > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > can occur due to:
> > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > >
> > > > Thanks for the report and reproducer!
> > > >
> > > > >
> > > > > $./a.out > /dev/null
> > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > [   13.337716] memcg:ffff00000405f000
> > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > [   13.340190] ------------[ cut here ]------------
> > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > [   13.340969] Modules linked in:
> > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > [   13.342018] sp : ffff80008752bb20
> > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > [   13.343876] Call trace:
> > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > >
> > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > of folios as done in move_present_pte.
> > > >
> > > > How complex would that be? Is it a matter of adding
> > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > folio->index = linear_page_index like in move_present_pte() or
> > > > something more?
> > >
> > > My main concern is still with large folios that require a split_folio()
> > > during move_pages(), as the entire folio shares the same index and
> > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > making a split necessary.
> > >
> > > However, in split_huge_page_to_list_to_order(), there is a:
> > >
> > >         if (folio_test_writeback(folio))
> > >                 return -EBUSY;
> > >
> > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > it simply returns -EBUSY:
> > >
> > > move_pages_pte()
> > > {
> > >                 /* at this point we have src_folio locked */
> > >                 if (folio_test_large(src_folio)) {
> > >                         /* split_folio() can block */
> > >                         pte_unmap(&orig_src_pte);
> > >                         pte_unmap(&orig_dst_pte);
> > >                         src_pte = dst_pte = NULL;
> > >                         err = split_folio(src_folio);
> > >                         if (err)
> > >                                 goto out;
> > >
> > >                         /* have to reacquire the folio after it got split */
> > >                         folio_unlock(src_folio);
> > >                         folio_put(src_folio);
> > >                         src_folio = NULL;
> > >                         goto retry;
> > >                 }
> > > }
> > >
> > > Do we need a folio_wait_writeback() before calling split_folio()?
> >
> > Maybe no need in the first version to fix the immediate bug?
> >
> > It's also not always the case to hit writeback here. IIUC, writeback only
> > happens for a short window when the folio was just added into swapcache.
> > MOVE can happen much later after that anytime before a swapin.  My
> > understanding is that's also what Matthew wanted to point out.  It may be
> > better justified of that in a separate change with some performance
> > measurements.
>
> The bug we’re discussing occurs precisely within the short window you
> mentioned.
>
> 1. add_to_swap: The folio is added to swapcache.
> 2. try_to_unmap: PTEs are converted to swap entries.
> 3. pageout
> 4. Swapcache is cleared.
>
> The issue happens between steps 2 and 4, where the PTE is not present, but
> the folio is still in swapcache - the current code does move_swap_pte() but does
> not fixup folio->index within swapcache.
>
> My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> in this RFC.
>
> For small folios, there’s no split_folio issue, making it relatively
> simpler. Lokesh
> mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> the first priority.

Fixing for the non-mTHP case first sounds good to me. For a large
folio in swap cache maybe you can return EBUSY for now?

But for that, I believe the cleanest and simplest would be to
restructure move_pages_pte() such that folio is retrieved from
vm_normal_foliio() if src_pte is present, otherwise, from
filemap_get_folio() and then incorporate the check you have in this
FRC in there. This way the entire locking dance logic in there can be
reused for swap-cache case as well.
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 23:19         ` Lokesh Gidra
@ 2025-02-20  0:49           ` Barry Song
  0 siblings, 0 replies; 47+ messages in thread
From: Barry Song @ 2025-02-20  0:49 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: Peter Xu, Suren Baghdasaryan, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 12:19 PM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Wed, Feb 19, 2025 at 3:04 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > swap entry.
> > > > > >
> > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > >   migration by setting:
> > > > > >
> > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > >
> > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > >   the PTE to the new dst_addr.
> > > > > >
> > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > cache.
> > > > > >
> > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > can occur due to:
> > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > >
> > > > > Thanks for the report and reproducer!
> > > > >
> > > > > >
> > > > > > $./a.out > /dev/null
> > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > [   13.340969] Modules linked in:
> > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > [   13.343876] Call trace:
> > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > >
> > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > of folios as done in move_present_pte.
> > > > >
> > > > > How complex would that be? Is it a matter of adding
> > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > something more?
> > > >
> > > > My main concern is still with large folios that require a split_folio()
> > > > during move_pages(), as the entire folio shares the same index and
> > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > making a split necessary.
> > > >
> > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > >
> > > >         if (folio_test_writeback(folio))
> > > >                 return -EBUSY;
> > > >
> > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > it simply returns -EBUSY:
> > > >
> > > > move_pages_pte()
> > > > {
> > > >                 /* at this point we have src_folio locked */
> > > >                 if (folio_test_large(src_folio)) {
> > > >                         /* split_folio() can block */
> > > >                         pte_unmap(&orig_src_pte);
> > > >                         pte_unmap(&orig_dst_pte);
> > > >                         src_pte = dst_pte = NULL;
> > > >                         err = split_folio(src_folio);
> > > >                         if (err)
> > > >                                 goto out;
> > > >
> > > >                         /* have to reacquire the folio after it got split */
> > > >                         folio_unlock(src_folio);
> > > >                         folio_put(src_folio);
> > > >                         src_folio = NULL;
> > > >                         goto retry;
> > > >                 }
> > > > }
> > > >
> > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > >
> > > Maybe no need in the first version to fix the immediate bug?
> > >
> > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > happens for a short window when the folio was just added into swapcache.
> > > MOVE can happen much later after that anytime before a swapin.  My
> > > understanding is that's also what Matthew wanted to point out.  It may be
> > > better justified of that in a separate change with some performance
> > > measurements.
> >
> > The bug we’re discussing occurs precisely within the short window you
> > mentioned.
> >
> > 1. add_to_swap: The folio is added to swapcache.
> > 2. try_to_unmap: PTEs are converted to swap entries.
> > 3. pageout
> > 4. Swapcache is cleared.
> >
> > The issue happens between steps 2 and 4, where the PTE is not present, but
> > the folio is still in swapcache - the current code does move_swap_pte() but does
> > not fixup folio->index within swapcache.
> >
> > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > in this RFC.
> >
> > For small folios, there’s no split_folio issue, making it relatively
> > simpler. Lokesh
> > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > the first priority.
>
> Fixing for the non-mTHP case first sounds good to me. For a large
> folio in swap cache maybe you can return EBUSY for now?
>
> But for that, I believe the cleanest and simplest would be to
> restructure move_pages_pte() such that folio is retrieved from
> vm_normal_foliio() if src_pte is present, otherwise, from
> filemap_get_folio() and then incorporate the check you have in this
> FRC in there. This way the entire locking dance logic in there can be
> reused for swap-cache case as well.

Yep, let me give it a try in v2.

> >
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> >
> > Thanks
> > Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 22:31 ` Peter Xu
@ 2025-02-20  0:50   ` Barry Song
  0 siblings, 0 replies; 47+ messages in thread
From: Barry Song @ 2025-02-20  0:50 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Suren Baghdasaryan, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Lokesh Gidra,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Nicolas Geoffray,
	Ryan Roberts, Shuah Khan, ZhangPeng

On Thu, Feb 20, 2025 at 11:31 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 20, 2025 at 12:25:19AM +1300, Barry Song wrote:
> > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> >                        pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                        spinlock_t *dst_ptl, spinlock_t *src_ptl)
> >  {
> > +     struct folio *folio;
> > +     swp_entry_t entry;
> > +
> >       if (!pte_swp_exclusive(orig_src_pte))
> >               return -EBUSY;
> >
> > +     entry = pte_to_swp_entry(orig_src_pte);
> > +     folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
>
> [Besides what's being discussed elsewhere..]
>
> swap_cache_get_folio() says:
>
>  * Caller must lock the swap device or hold a reference to keep it valid.
>
> Do we need get_swap_device() too here to avoid swapoff race?
>

Yep, thanks! Let me fix it in v2.

> > +     if (!IS_ERR(folio)) {
> > +             folio_put(folio);
> > +             return -EBUSY;
> > +     }
> > +
> >       double_pt_lock(dst_ptl, src_ptl);
> >
> >       if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > --
> > 2.39.3 (Apple Git-146)
> >
>
> --
> Peter Xu
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 18:58     ` Suren Baghdasaryan
@ 2025-02-20  8:40       ` David Hildenbrand
  2025-02-20  9:21         ` Barry Song
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2025-02-20  8:40 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Barry Song, linux-mm, akpm, linux-kernel, zhengtangquan,
	Barry Song, Andrea Arcangeli, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, Hugh Dickins, Jann Horn,
	Kalesh Singh, Liam R . Howlett, Lokesh Gidra, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng

On 19.02.25 19:58, Suren Baghdasaryan wrote:
> On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.02.25 19:26, Suren Baghdasaryan wrote:
>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> userfaultfd_move() checks whether the PTE entry is present or a
>>>> swap entry.
>>>>
>>>> - If the PTE entry is present, move_present_pte() handles folio
>>>>     migration by setting:
>>>>
>>>>     src_folio->index = linear_page_index(dst_vma, dst_addr);
>>>>
>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>>>>     the PTE to the new dst_addr.
>>>>
>>>> This approach is incorrect because even if the PTE is a swap
>>>> entry, it can still reference a folio that remains in the swap
>>>> cache.
>>>>
>>>> If do_swap_page() is triggered, it may locate the folio in the
>>>> swap cache. However, during add_rmap operations, a kernel panic
>>>> can occur due to:
>>>>    page_pgoff(folio, page) != linear_page_index(vma, address)
>>>
>>> Thanks for the report and reproducer!
>>>
>>>>
>>>> $./a.out > /dev/null
>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
>>>> [   13.337716] memcg:ffff00000405f000
>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
>>>> [   13.340190] ------------[ cut here ]------------
>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>> [   13.340969] Modules linked in:
>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
>>>> [   13.342018] sp : ffff80008752bb20
>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
>>>> [   13.343876] Call trace:
>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
>>>> [   13.344333]  do_swap_page+0x1060/0x1400
>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
>>>> [   13.344586]  do_page_fault+0x20c/0x770
>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
>>>> [   13.344759]  do_mem_abort+0x48/0xa0
>>>> [   13.344842]  el0_da+0x58/0x130
>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
>>>> [   13.345715] note: a.out[107] exited with irqs disabled
>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
>>>>
>>>> Fully fixing it would be quite complex, requiring similar handling
>>>> of folios as done in move_present_pte.
>>>
>>> How complex would that be? Is it a matter of adding
>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>>> folio->index = linear_page_index like in move_present_pte() or
>>> something more?
>>
>> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
>> be pinned and we may be able to move it I think.
>>
>> So all that's required is to check pte_swp_exclusive() and the folio size.
>>
>> ... in theory :) Not sure about the swap details.
> 
> Looking some more into it, I think we would have to perform all the
> folio and anon_vma locking and pinning that we do for present pages in
> move_pages_pte(). If that's correct then maybe treating swapcache
> pages like a present page inside move_pages_pte() would be simpler?

I'd be more in favor of not doing that. Maybe there are parts we can 
move out into helper functions instead, so we can reuse them?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 20:37   ` Barry Song
                       ` (2 preceding siblings ...)
  2025-02-19 22:14     ` Peter Xu
@ 2025-02-20  8:51     ` David Hildenbrand
  2025-02-20  9:31       ` Barry Song
  3 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2025-02-20  8:51 UTC (permalink / raw)
  To: Barry Song, Suren Baghdasaryan, Lokesh Gidra
  Cc: linux-mm, akpm, linux-kernel, zhengtangquan, Barry Song,
	Andrea Arcangeli, Al Viro, Axel Rasmussen, Brian Geffon,
	Christian Brauner, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Peter Xu, Ryan Roberts, Shuah Khan, ZhangPeng,
	Yu Zhao

On 19.02.25 21:37, Barry Song wrote:
> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>
>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> userfaultfd_move() checks whether the PTE entry is present or a
>>> swap entry.
>>>
>>> - If the PTE entry is present, move_present_pte() handles folio
>>>    migration by setting:
>>>
>>>    src_folio->index = linear_page_index(dst_vma, dst_addr);
>>>
>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>>>    the PTE to the new dst_addr.
>>>
>>> This approach is incorrect because even if the PTE is a swap
>>> entry, it can still reference a folio that remains in the swap
>>> cache.
>>>
>>> If do_swap_page() is triggered, it may locate the folio in the
>>> swap cache. However, during add_rmap operations, a kernel panic
>>> can occur due to:
>>>   page_pgoff(folio, page) != linear_page_index(vma, address)
>>
>> Thanks for the report and reproducer!
>>
>>>
>>> $./a.out > /dev/null
>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
>>> [   13.337716] memcg:ffff00000405f000
>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
>>> [   13.340190] ------------[ cut here ]------------
>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>> [   13.340969] Modules linked in:
>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
>>> [   13.342018] sp : ffff80008752bb20
>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
>>> [   13.343876] Call trace:
>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
>>> [   13.344333]  do_swap_page+0x1060/0x1400
>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
>>> [   13.344586]  do_page_fault+0x20c/0x770
>>> [   13.344673]  do_translation_fault+0xb4/0xf0
>>> [   13.344759]  do_mem_abort+0x48/0xa0
>>> [   13.344842]  el0_da+0x58/0x130
>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
>>> [   13.345504] ---[ end trace 0000000000000000 ]---
>>> [   13.345715] note: a.out[107] exited with irqs disabled
>>> [   13.345954] note: a.out[107] exited with preempt_count 2
>>>
>>> Fully fixing it would be quite complex, requiring similar handling
>>> of folios as done in move_present_pte.
>>
>> How complex would that be? Is it a matter of adding
>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>> folio->index = linear_page_index like in move_present_pte() or
>> something more?
> 
> My main concern is still with large folios that require a split_folio()
> during move_pages(), as the entire folio shares the same index and
> anon_vma. However, userfaultfd_move() moves pages individually,
> making a split necessary.
> 
> However, in split_huge_page_to_list_to_order(), there is a:
> 
>          if (folio_test_writeback(folio))
>                  return -EBUSY;
> 
> This is likely true for swapcache, right? However, even for move_present_pte(),
> it simply returns -EBUSY:
> 
> move_pages_pte()
> {
>                  /* at this point we have src_folio locked */
>                  if (folio_test_large(src_folio)) {
>                          /* split_folio() can block */
>                          pte_unmap(&orig_src_pte);
>                          pte_unmap(&orig_dst_pte);
>                          src_pte = dst_pte = NULL;
>                          err = split_folio(src_folio);
>                          if (err)
>                                  goto out;
> 
>                          /* have to reacquire the folio after it got split */
>                          folio_unlock(src_folio);
>                          folio_put(src_folio);
>                          src_folio = NULL;
>                          goto retry;
>                  }
> }
> 
> Do we need a folio_wait_writeback() before calling split_folio()?
> 
> By the way, I have also reported that userfaultfd_move() has a fundamental
> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> kernel. In this scenario, folios in the virtual zone won’t be split in
> split_folio(). Instead, the large folio migrates into nr_pages small folios.
 > > Thus, the best-case scenario would be:
> 
> mTHP -> migrate to small folios in split_folio() -> move small folios to
> dst_addr
> 
> While this works, it negates the performance benefits of
> userfaultfd_move(), as it introduces two PTE operations (migration in
> split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> allocations, and still requires one memcpy(). This could end up
> performing even worse than userfaultfd_copy(), I guess.
 > > The worst-case scenario would be failing to allocate small folios in
> split_folio(), then userfaultfd_move() might return -ENOMEM?

Although that's an Android problem and not an upstream problem, I'll 
note that there are other reasons why the split / move might fail, and 
user space either must retry or fallback to a COPY.

Regarding mTHP, we could move the whole folio if the user space-provided 
range allows for batching over multiple PTEs (nr_ptes), they are in a 
single VMA, and folio_mapcount() == nr_ptes.

There are corner cases to handle, such as moving mTHPs such that they 
suddenly cross two page tables I assume, that are harder to handle when 
not moving individual PTEs where that cannot happen.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20  8:40       ` David Hildenbrand
@ 2025-02-20  9:21         ` Barry Song
  2025-02-20 10:24           ` David Hildenbrand
  2025-02-20 23:32           ` Peter Xu
  0 siblings, 2 replies; 47+ messages in thread
From: Barry Song @ 2025-02-20  9:21 UTC (permalink / raw)
  To: david
  Cc: 21cnbao, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, hughd, jannh, kaleshsingh, linux-kernel, linux-mm,
	lokeshgidra, mhocko, ngeoffray, peterx, rppt, ryan.roberts,
	shuah, surenb, v-songbaohua, viro, willy, zhangpeng362,
	zhengtangquan, yuzhao, stable

On Thu, Feb 20, 2025 at 9:40 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.02.25 19:58, Suren Baghdasaryan wrote:
> > On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.02.25 19:26, Suren Baghdasaryan wrote:
> >>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> userfaultfd_move() checks whether the PTE entry is present or a
> >>>> swap entry.
> >>>>
> >>>> - If the PTE entry is present, move_present_pte() handles folio
> >>>>     migration by setting:
> >>>>
> >>>>     src_folio->index = linear_page_index(dst_vma, dst_addr);
> >>>>
> >>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >>>>     the PTE to the new dst_addr.
> >>>>
> >>>> This approach is incorrect because even if the PTE is a swap
> >>>> entry, it can still reference a folio that remains in the swap
> >>>> cache.
> >>>>
> >>>> If do_swap_page() is triggered, it may locate the folio in the
> >>>> swap cache. However, during add_rmap operations, a kernel panic
> >>>> can occur due to:
> >>>>    page_pgoff(folio, page) != linear_page_index(vma, address)
> >>>
> >>> Thanks for the report and reproducer!
> >>>
> >>>>
> >>>> $./a.out > /dev/null
> >>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> >>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> >>>> [   13.337716] memcg:ffff00000405f000
> >>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> >>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> >>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> >>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> >>>> [   13.340190] ------------[ cut here ]------------
> >>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
> >>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> >>>> [   13.340969] Modules linked in:
> >>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> >>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
> >>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> >>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> >>>> [   13.342018] sp : ffff80008752bb20
> >>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> >>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> >>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> >>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> >>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> >>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> >>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> >>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> >>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> >>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> >>>> [   13.343876] Call trace:
> >>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> >>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> >>>> [   13.344333]  do_swap_page+0x1060/0x1400
> >>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> >>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> >>>> [   13.344586]  do_page_fault+0x20c/0x770
> >>>> [   13.344673]  do_translation_fault+0xb4/0xf0
> >>>> [   13.344759]  do_mem_abort+0x48/0xa0
> >>>> [   13.344842]  el0_da+0x58/0x130
> >>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> >>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> >>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> >>>> [   13.345504] ---[ end trace 0000000000000000 ]---
> >>>> [   13.345715] note: a.out[107] exited with irqs disabled
> >>>> [   13.345954] note: a.out[107] exited with preempt_count 2
> >>>>
> >>>> Fully fixing it would be quite complex, requiring similar handling
> >>>> of folios as done in move_present_pte.
> >>>
> >>> How complex would that be? Is it a matter of adding
> >>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> >>> folio->index = linear_page_index like in move_present_pte() or
> >>> something more?
> >>
> >> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
> >> be pinned and we may be able to move it I think.
> >>
> >> So all that's required is to check pte_swp_exclusive() and the folio size.
> >>
> >> ... in theory :) Not sure about the swap details.
> >
> > Looking some more into it, I think we would have to perform all the
> > folio and anon_vma locking and pinning that we do for present pages in
> > move_pages_pte(). If that's correct then maybe treating swapcache
> > pages like a present page inside move_pages_pte() would be simpler?
>
> I'd be more in favor of not doing that. Maybe there are parts we can
> move out into helper functions instead, so we can reuse them?

I actually have a v2 ready. Maybe we can discuss if some of the code can be
extracted as a helper based on the below before I send it formally?

I’d say there are many parts that can be shared with present PTE, but there
are two major differences:

1. Page exclusivity – swapcache doesn’t require it (try_to_unmap_one has remove
Exclusive flag;)
2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)


Subject: [PATCH v2 Discussing with David] mm: Fix kernel crash when userfaultfd_move encounters
 swapcache

userfaultfd_move() checks whether the PTE entry is present or a
swap entry.

- If the PTE entry is present, move_present_pte() handles folio
  migration by setting:

  src_folio->index = linear_page_index(dst_vma, dst_addr);

- If the PTE entry is a swap entry, move_swap_pte() simply copies
  the PTE to the new dst_addr.

This approach is incorrect because, even if the PTE is a swap entry,
it can still reference a folio that remains in the swap cache.

This exposes a race condition between steps 2 and 4:
 1. add_to_swap: The folio is added to the swapcache.
 2. try_to_unmap: PTEs are converted to swap entries.
 3. pageout: The folio is written back.
 4. Swapcache is cleared.
If userfaultfd_move() happens in the window between step 2 and step 4,
after the swap PTE is moved to the destination, accessing the destination
triggers do_swap_page(), which may locate the folio in the swap cache.
However, during add_rmap operations, a kernel panic can occur due to:

page_pgoff(folio, page) != linear_page_index(vma, address)

This happens because move_swap_pte() has never updated the index to
match dst_vma and dst_addr.

$./a.out > /dev/null
[   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
[   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
[   13.337716] memcg:ffff00000405f000
[   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
[   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
[   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
[   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
[   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
[   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
[   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
[   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
[   13.340190] ------------[ cut here ]------------
[   13.340316] kernel BUG at mm/rmap.c:1380!
[   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[   13.340969] Modules linked in:
[   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
[   13.341470] Hardware name: linux,dummy-virt (DT)
[   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
[   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
[   13.342018] sp : ffff80008752bb20
[   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
[   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
[   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
[   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
[   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
[   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
[   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
[   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
[   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
[   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
[   13.343876] Call trace:
[   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
[   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
[   13.344333]  do_swap_page+0x1060/0x1400
[   13.344417]  __handle_mm_fault+0x61c/0xbc8
[   13.344504]  handle_mm_fault+0xd8/0x2e8
[   13.344586]  do_page_fault+0x20c/0x770
[   13.344673]  do_translation_fault+0xb4/0xf0
[   13.344759]  do_mem_abort+0x48/0xa0
[   13.344842]  el0_da+0x58/0x130
[   13.344914]  el0t_64_sync_handler+0xc4/0x138
[   13.345002]  el0t_64_sync+0x1ac/0x1b0
[   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
[   13.345504] ---[ end trace 0000000000000000 ]---
[   13.345715] note: a.out[107] exited with irqs disabled
[   13.345954] note: a.out[107] exited with preempt_count 2

This patch also checks the swapcache when handling swap entries. If a
match is found in the swapcache, it processes it similarly to a present
PTE.
However, there are some differences. For example, the folio is no longer
exclusive because folio_try_share_anon_rmap_pte() is performed during
unmapping.
Furthermore, in the case of swapcache, the folio has already been
unmapped, eliminating the risk of concurrent rmap walks and removing the
need to acquire src_folio's anon_vma or lock.

Note that for large folios, in the swapcache handling path, we still
frequently encounter -EBUSY returns because split_folio() returns
-EBUSY when the folio is under writeback.
That is not an urgent fix, so a following patch will address it.

Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: ZhangPeng <zhangpeng362@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/userfaultfd.c | 228 +++++++++++++++++++++++++++--------------------
 1 file changed, 133 insertions(+), 95 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 867898c4e30b..e5718835a964 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
+#include "swap.h"
 
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
@@ -1025,7 +1026,7 @@ static inline bool is_pte_pages_stable(pte_t *dst_pte, pte_t *src_pte,
 	       pmd_same(dst_pmdval, pmdp_get_lockless(dst_pmd));
 }
 
-static int move_present_pte(struct mm_struct *mm,
+static int move_pte_and_folio(struct mm_struct *mm,
 			    struct vm_area_struct *dst_vma,
 			    struct vm_area_struct *src_vma,
 			    unsigned long dst_addr, unsigned long src_addr,
@@ -1046,7 +1047,7 @@ static int move_present_pte(struct mm_struct *mm,
 	}
 	if (folio_test_large(src_folio) ||
 	    folio_maybe_dma_pinned(src_folio) ||
-	    !PageAnonExclusive(&src_folio->page)) {
+	    (pte_present(orig_src_pte) && !PageAnonExclusive(&src_folio->page))) {
 		err = -EBUSY;
 		goto out;
 	}
@@ -1062,10 +1063,13 @@ static int move_present_pte(struct mm_struct *mm,
 	folio_move_anon_rmap(src_folio, dst_vma);
 	src_folio->index = linear_page_index(dst_vma, dst_addr);
 
-	orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
-	/* Follow mremap() behavior and treat the entry dirty after the move */
-	orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
-
+	if (pte_present(orig_src_pte)) {
+		orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
+		/* Follow mremap() behavior and treat the entry dirty after the move */
+		orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
+	} else { /* swap entry */
+		orig_dst_pte = orig_src_pte;
+	}
 	set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
 out:
 	double_pt_unlock(dst_ptl, src_ptl);
@@ -1079,9 +1083,6 @@ static int move_swap_pte(struct mm_struct *mm,
 			 pmd_t *dst_pmd, pmd_t dst_pmdval,
 			 spinlock_t *dst_ptl, spinlock_t *src_ptl)
 {
-	if (!pte_swp_exclusive(orig_src_pte))
-		return -EBUSY;
-
 	double_pt_lock(dst_ptl, src_ptl);
 
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
@@ -1137,6 +1138,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			  __u64 mode)
 {
 	swp_entry_t entry;
+	struct swap_info_struct *si = NULL;
 	pte_t orig_src_pte, orig_dst_pte;
 	pte_t src_folio_pte;
 	spinlock_t *src_ptl, *dst_ptl;
@@ -1220,122 +1222,156 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		goto out;
 	}
 
-	if (pte_present(orig_src_pte)) {
-		if (is_zero_pfn(pte_pfn(orig_src_pte))) {
-			err = move_zeropage_pte(mm, dst_vma, src_vma,
-					       dst_addr, src_addr, dst_pte, src_pte,
-					       orig_dst_pte, orig_src_pte,
-					       dst_pmd, dst_pmdval, dst_ptl, src_ptl);
+	if (!pte_present(orig_src_pte)) {
+		entry = pte_to_swp_entry(orig_src_pte);
+		if (is_migration_entry(entry)) {
+			pte_unmap(&orig_src_pte);
+			pte_unmap(&orig_dst_pte);
+			src_pte = dst_pte = NULL;
+			migration_entry_wait(mm, src_pmd, src_addr);
+			err = -EAGAIN;
+			goto out;
+		}
+
+		if (non_swap_entry(entry)) {
+			err = -EFAULT;
+			goto out;
+		}
+
+		if (!pte_swp_exclusive(orig_src_pte)) {
+			err = -EBUSY;
+			goto out;
+		}
+		/* Prevent swapoff from happening to us. */
+		if (!si)
+			si = get_swap_device(entry);
+		if (unlikely(!si)) {
+			err = -EAGAIN;
 			goto out;
 		}
+	}
+
+	if (pte_present(orig_src_pte) && is_zero_pfn(pte_pfn(orig_src_pte))) {
+		err = move_zeropage_pte(mm, dst_vma, src_vma,
+				dst_addr, src_addr, dst_pte, src_pte,
+				orig_dst_pte, orig_src_pte,
+				dst_pmd, dst_pmdval, dst_ptl, src_ptl);
+		goto out;
+	}
+
+	/*
+	 * Pin and lock both source folio and anon_vma. Since we are in
+	 * RCU read section, we can't block, so on contention have to
+	 * unmap the ptes, obtain the lock and retry.
+	 */
+	if (!src_folio) {
+		struct folio *folio;
 
 		/*
-		 * Pin and lock both source folio and anon_vma. Since we are in
-		 * RCU read section, we can't block, so on contention have to
-		 * unmap the ptes, obtain the lock and retry.
+		 * Pin the page while holding the lock to be sure the
+		 * page isn't freed under us
 		 */
-		if (!src_folio) {
-			struct folio *folio;
+		spin_lock(src_ptl);
+		if (!pte_same(orig_src_pte, ptep_get(src_pte))) {
+			spin_unlock(src_ptl);
+			err = -EAGAIN;
+			goto out;
+		}
 
-			/*
-			 * Pin the page while holding the lock to be sure the
-			 * page isn't freed under us
-			 */
-			spin_lock(src_ptl);
-			if (!pte_same(orig_src_pte, ptep_get(src_pte))) {
+		if (pte_present(orig_src_pte)) {
+			folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
+			if (!folio) {
 				spin_unlock(src_ptl);
-				err = -EAGAIN;
+				err = -EBUSY;
 				goto out;
 			}
-
-			folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
-			if (!folio || !PageAnonExclusive(&folio->page)) {
+			if (!PageAnonExclusive(&folio->page)) {
 				spin_unlock(src_ptl);
 				err = -EBUSY;
 				goto out;
 			}
-
 			folio_get(folio);
-			src_folio = folio;
-			src_folio_pte = orig_src_pte;
-			spin_unlock(src_ptl);
-
-			if (!folio_trylock(src_folio)) {
-				pte_unmap(&orig_src_pte);
-				pte_unmap(&orig_dst_pte);
-				src_pte = dst_pte = NULL;
-				/* now we can block and wait */
-				folio_lock(src_folio);
-				goto retry;
-			}
-
-			if (WARN_ON_ONCE(!folio_test_anon(src_folio))) {
-				err = -EBUSY;
+		} else {
+			/*
+			 * Check if swapcache exists.
+			 * If it does, we need to move the folio
+			 * even if the PTE is a swap entry.
+			 */
+			folio = filemap_get_folio(swap_address_space(entry),
+					swap_cache_index(entry));
+			if (IS_ERR(folio)) {
+				spin_unlock(src_ptl);
+				err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte,
+						orig_dst_pte, orig_src_pte, dst_pmd,
+						dst_pmdval, dst_ptl, src_ptl);
 				goto out;
 			}
 		}
 
-		/* at this point we have src_folio locked */
-		if (folio_test_large(src_folio)) {
-			/* split_folio() can block */
+		src_folio = folio;
+		src_folio_pte = orig_src_pte;
+		spin_unlock(src_ptl);
+
+		if (!folio_trylock(src_folio)) {
 			pte_unmap(&orig_src_pte);
 			pte_unmap(&orig_dst_pte);
 			src_pte = dst_pte = NULL;
-			err = split_folio(src_folio);
-			if (err)
-				goto out;
-			/* have to reacquire the folio after it got split */
-			folio_unlock(src_folio);
-			folio_put(src_folio);
-			src_folio = NULL;
+			/* now we can block and wait */
+			folio_lock(src_folio);
 			goto retry;
 		}
 
-		if (!src_anon_vma) {
-			/*
-			 * folio_referenced walks the anon_vma chain
-			 * without the folio lock. Serialize against it with
-			 * the anon_vma lock, the folio lock is not enough.
-			 */
-			src_anon_vma = folio_get_anon_vma(src_folio);
-			if (!src_anon_vma) {
-				/* page was unmapped from under us */
-				err = -EAGAIN;
-				goto out;
-			}
-			if (!anon_vma_trylock_write(src_anon_vma)) {
-				pte_unmap(&orig_src_pte);
-				pte_unmap(&orig_dst_pte);
-				src_pte = dst_pte = NULL;
-				/* now we can block and wait */
-				anon_vma_lock_write(src_anon_vma);
-				goto retry;
-			}
+		if (WARN_ON_ONCE(!folio_test_anon(src_folio))) {
+			err = -EBUSY;
+			goto out;
 		}
+	}
 
-		err = move_present_pte(mm,  dst_vma, src_vma,
-				       dst_addr, src_addr, dst_pte, src_pte,
-				       orig_dst_pte, orig_src_pte, dst_pmd,
-				       dst_pmdval, dst_ptl, src_ptl, src_folio);
-	} else {
-		entry = pte_to_swp_entry(orig_src_pte);
-		if (non_swap_entry(entry)) {
-			if (is_migration_entry(entry)) {
-				pte_unmap(&orig_src_pte);
-				pte_unmap(&orig_dst_pte);
-				src_pte = dst_pte = NULL;
-				migration_entry_wait(mm, src_pmd, src_addr);
-				err = -EAGAIN;
-			} else
-				err = -EFAULT;
+	/* at this point we have src_folio locked */
+	if (folio_test_large(src_folio)) {
+		/* split_folio() can block */
+		pte_unmap(&orig_src_pte);
+		pte_unmap(&orig_dst_pte);
+		src_pte = dst_pte = NULL;
+		err = split_folio(src_folio);
+		if (err)
 			goto out;
-		}
+		/* have to reacquire the folio after it got split */
+		folio_unlock(src_folio);
+		folio_put(src_folio);
+		src_folio = NULL;
+		goto retry;
+	}
 
-		err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte,
-				    orig_dst_pte, orig_src_pte, dst_pmd,
-				    dst_pmdval, dst_ptl, src_ptl);
+	if (!src_anon_vma && pte_present(orig_src_pte)) {
+		/*
+		 * folio_referenced walks the anon_vma chain
+		 * without the folio lock. Serialize against it with
+		 * the anon_vma lock, the folio lock is not enough.
+		 * In the swapcache case, the folio has been unmapped,
+		 * so there is no concurrent rmap walk.
+		 */
+		src_anon_vma = folio_get_anon_vma(src_folio);
+		if (!src_anon_vma) {
+			/* page was unmapped from under us */
+			err = -EAGAIN;
+			goto out;
+		}
+		if (!anon_vma_trylock_write(src_anon_vma)) {
+			pte_unmap(&orig_src_pte);
+			pte_unmap(&orig_dst_pte);
+			src_pte = dst_pte = NULL;
+			/* now we can block and wait */
+			anon_vma_lock_write(src_anon_vma);
+			goto retry;
+		}
 	}
 
+	err = move_pte_and_folio(mm,  dst_vma, src_vma,
+			dst_addr, src_addr, dst_pte, src_pte,
+			orig_dst_pte, orig_src_pte, dst_pmd,
+			dst_pmdval, dst_ptl, src_ptl, src_folio);
+
 out:
 	if (src_anon_vma) {
 		anon_vma_unlock_write(src_anon_vma);
@@ -1351,6 +1387,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		pte_unmap(src_pte);
 	mmu_notifier_invalidate_range_end(&range);
 
+	if (si)
+		put_swap_device(si);
 	return err;
 }
 
-- 
2.39.3 (Apple Git-146)


>
> --
> Cheers,
>
> David / dhildenb
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20  8:51     ` David Hildenbrand
@ 2025-02-20  9:31       ` Barry Song
  2025-02-20  9:36         ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-20  9:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.02.25 21:37, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>
> >> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> userfaultfd_move() checks whether the PTE entry is present or a
> >>> swap entry.
> >>>
> >>> - If the PTE entry is present, move_present_pte() handles folio
> >>>    migration by setting:
> >>>
> >>>    src_folio->index = linear_page_index(dst_vma, dst_addr);
> >>>
> >>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >>>    the PTE to the new dst_addr.
> >>>
> >>> This approach is incorrect because even if the PTE is a swap
> >>> entry, it can still reference a folio that remains in the swap
> >>> cache.
> >>>
> >>> If do_swap_page() is triggered, it may locate the folio in the
> >>> swap cache. However, during add_rmap operations, a kernel panic
> >>> can occur due to:
> >>>   page_pgoff(folio, page) != linear_page_index(vma, address)
> >>
> >> Thanks for the report and reproducer!
> >>
> >>>
> >>> $./a.out > /dev/null
> >>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> >>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> >>> [   13.337716] memcg:ffff00000405f000
> >>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> >>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> >>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> >>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> >>> [   13.340190] ------------[ cut here ]------------
> >>> [   13.340316] kernel BUG at mm/rmap.c:1380!
> >>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> >>> [   13.340969] Modules linked in:
> >>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> >>> [   13.341470] Hardware name: linux,dummy-virt (DT)
> >>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> >>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> >>> [   13.342018] sp : ffff80008752bb20
> >>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> >>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> >>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> >>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> >>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> >>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> >>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> >>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> >>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> >>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> >>> [   13.343876] Call trace:
> >>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> >>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> >>> [   13.344333]  do_swap_page+0x1060/0x1400
> >>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> >>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> >>> [   13.344586]  do_page_fault+0x20c/0x770
> >>> [   13.344673]  do_translation_fault+0xb4/0xf0
> >>> [   13.344759]  do_mem_abort+0x48/0xa0
> >>> [   13.344842]  el0_da+0x58/0x130
> >>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> >>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> >>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> >>> [   13.345504] ---[ end trace 0000000000000000 ]---
> >>> [   13.345715] note: a.out[107] exited with irqs disabled
> >>> [   13.345954] note: a.out[107] exited with preempt_count 2
> >>>
> >>> Fully fixing it would be quite complex, requiring similar handling
> >>> of folios as done in move_present_pte.
> >>
> >> How complex would that be? Is it a matter of adding
> >> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> >> folio->index = linear_page_index like in move_present_pte() or
> >> something more?
> >
> > My main concern is still with large folios that require a split_folio()
> > during move_pages(), as the entire folio shares the same index and
> > anon_vma. However, userfaultfd_move() moves pages individually,
> > making a split necessary.
> >
> > However, in split_huge_page_to_list_to_order(), there is a:
> >
> >          if (folio_test_writeback(folio))
> >                  return -EBUSY;
> >
> > This is likely true for swapcache, right? However, even for move_present_pte(),
> > it simply returns -EBUSY:
> >
> > move_pages_pte()
> > {
> >                  /* at this point we have src_folio locked */
> >                  if (folio_test_large(src_folio)) {
> >                          /* split_folio() can block */
> >                          pte_unmap(&orig_src_pte);
> >                          pte_unmap(&orig_dst_pte);
> >                          src_pte = dst_pte = NULL;
> >                          err = split_folio(src_folio);
> >                          if (err)
> >                                  goto out;
> >
> >                          /* have to reacquire the folio after it got split */
> >                          folio_unlock(src_folio);
> >                          folio_put(src_folio);
> >                          src_folio = NULL;
> >                          goto retry;
> >                  }
> > }
> >
> > Do we need a folio_wait_writeback() before calling split_folio()?
> >
> > By the way, I have also reported that userfaultfd_move() has a fundamental
> > conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > kernel. In this scenario, folios in the virtual zone won’t be split in
> > split_folio(). Instead, the large folio migrates into nr_pages small folios.
>  > > Thus, the best-case scenario would be:
> >
> > mTHP -> migrate to small folios in split_folio() -> move small folios to
> > dst_addr
> >
> > While this works, it negates the performance benefits of
> > userfaultfd_move(), as it introduces two PTE operations (migration in
> > split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> > allocations, and still requires one memcpy(). This could end up
> > performing even worse than userfaultfd_copy(), I guess.
>  > > The worst-case scenario would be failing to allocate small folios in
> > split_folio(), then userfaultfd_move() might return -ENOMEM?
>
> Although that's an Android problem and not an upstream problem, I'll
> note that there are other reasons why the split / move might fail, and
> user space either must retry or fallback to a COPY.
>
> Regarding mTHP, we could move the whole folio if the user space-provided
> range allows for batching over multiple PTEs (nr_ptes), they are in a
> single VMA, and folio_mapcount() == nr_ptes.
>
> There are corner cases to handle, such as moving mTHPs such that they
> suddenly cross two page tables I assume, that are harder to handle when
> not moving individual PTEs where that cannot happen.

This is a useful suggestion. I’ve heard that Lokesh is also interested in
modifying ART to perform moves at the mTHP granularity, which would require
kernel modifications as well. It’s likely the direction we’ll take after
fixing the current urgent bugs. The current split_folio() really isn’t ideal.

The corner cases you mentioned are definitely worth considering. However,
once we can perform batch UFFDIO_MOVE, I believe that in most cases,
the conflict between userfaultfd_move() and TAO will be resolved ?

For those corner cases, ART will still need to be fully aware that falling
back to copy or retrying is necessary.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20  9:31       ` Barry Song
@ 2025-02-20  9:36         ` David Hildenbrand
  2025-02-20 21:45           ` Barry Song
  0 siblings, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2025-02-20  9:36 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On 20.02.25 10:31, Barry Song wrote:
> On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.02.25 21:37, Barry Song wrote:
>>> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>
>>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> userfaultfd_move() checks whether the PTE entry is present or a
>>>>> swap entry.
>>>>>
>>>>> - If the PTE entry is present, move_present_pte() handles folio
>>>>>     migration by setting:
>>>>>
>>>>>     src_folio->index = linear_page_index(dst_vma, dst_addr);
>>>>>
>>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>>>>>     the PTE to the new dst_addr.
>>>>>
>>>>> This approach is incorrect because even if the PTE is a swap
>>>>> entry, it can still reference a folio that remains in the swap
>>>>> cache.
>>>>>
>>>>> If do_swap_page() is triggered, it may locate the folio in the
>>>>> swap cache. However, during add_rmap operations, a kernel panic
>>>>> can occur due to:
>>>>>    page_pgoff(folio, page) != linear_page_index(vma, address)
>>>>
>>>> Thanks for the report and reproducer!
>>>>
>>>>>
>>>>> $./a.out > /dev/null
>>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
>>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
>>>>> [   13.337716] memcg:ffff00000405f000
>>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
>>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
>>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
>>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
>>>>> [   13.340190] ------------[ cut here ]------------
>>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
>>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>> [   13.340969] Modules linked in:
>>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
>>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
>>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
>>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
>>>>> [   13.342018] sp : ffff80008752bb20
>>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
>>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
>>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
>>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
>>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
>>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
>>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
>>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
>>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
>>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
>>>>> [   13.343876] Call trace:
>>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
>>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
>>>>> [   13.344333]  do_swap_page+0x1060/0x1400
>>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
>>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
>>>>> [   13.344586]  do_page_fault+0x20c/0x770
>>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
>>>>> [   13.344759]  do_mem_abort+0x48/0xa0
>>>>> [   13.344842]  el0_da+0x58/0x130
>>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
>>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
>>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
>>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
>>>>> [   13.345715] note: a.out[107] exited with irqs disabled
>>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
>>>>>
>>>>> Fully fixing it would be quite complex, requiring similar handling
>>>>> of folios as done in move_present_pte.
>>>>
>>>> How complex would that be? Is it a matter of adding
>>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>>>> folio->index = linear_page_index like in move_present_pte() or
>>>> something more?
>>>
>>> My main concern is still with large folios that require a split_folio()
>>> during move_pages(), as the entire folio shares the same index and
>>> anon_vma. However, userfaultfd_move() moves pages individually,
>>> making a split necessary.
>>>
>>> However, in split_huge_page_to_list_to_order(), there is a:
>>>
>>>           if (folio_test_writeback(folio))
>>>                   return -EBUSY;
>>>
>>> This is likely true for swapcache, right? However, even for move_present_pte(),
>>> it simply returns -EBUSY:
>>>
>>> move_pages_pte()
>>> {
>>>                   /* at this point we have src_folio locked */
>>>                   if (folio_test_large(src_folio)) {
>>>                           /* split_folio() can block */
>>>                           pte_unmap(&orig_src_pte);
>>>                           pte_unmap(&orig_dst_pte);
>>>                           src_pte = dst_pte = NULL;
>>>                           err = split_folio(src_folio);
>>>                           if (err)
>>>                                   goto out;
>>>
>>>                           /* have to reacquire the folio after it got split */
>>>                           folio_unlock(src_folio);
>>>                           folio_put(src_folio);
>>>                           src_folio = NULL;
>>>                           goto retry;
>>>                   }
>>> }
>>>
>>> Do we need a folio_wait_writeback() before calling split_folio()?
>>>
>>> By the way, I have also reported that userfaultfd_move() has a fundamental
>>> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
>>> kernel. In this scenario, folios in the virtual zone won’t be split in
>>> split_folio(). Instead, the large folio migrates into nr_pages small folios.
>>   > > Thus, the best-case scenario would be:
>>>
>>> mTHP -> migrate to small folios in split_folio() -> move small folios to
>>> dst_addr
>>>
>>> While this works, it negates the performance benefits of
>>> userfaultfd_move(), as it introduces two PTE operations (migration in
>>> split_folio() and move in userfaultfd_move() while retry), nr_pages memory
>>> allocations, and still requires one memcpy(). This could end up
>>> performing even worse than userfaultfd_copy(), I guess.
>>   > > The worst-case scenario would be failing to allocate small folios in
>>> split_folio(), then userfaultfd_move() might return -ENOMEM?
>>
>> Although that's an Android problem and not an upstream problem, I'll
>> note that there are other reasons why the split / move might fail, and
>> user space either must retry or fallback to a COPY.
>>
>> Regarding mTHP, we could move the whole folio if the user space-provided
>> range allows for batching over multiple PTEs (nr_ptes), they are in a
>> single VMA, and folio_mapcount() == nr_ptes.
>>
>> There are corner cases to handle, such as moving mTHPs such that they
>> suddenly cross two page tables I assume, that are harder to handle when
>> not moving individual PTEs where that cannot happen.
> 
> This is a useful suggestion. I’ve heard that Lokesh is also interested in
> modifying ART to perform moves at the mTHP granularity, which would require
> kernel modifications as well. It’s likely the direction we’ll take after
> fixing the current urgent bugs. The current split_folio() really isn’t ideal.
> 
> The corner cases you mentioned are definitely worth considering. However,
> once we can perform batch UFFDIO_MOVE, I believe that in most cases,
> the conflict between userfaultfd_move() and TAO will be resolved ?

Well, as soon as you would have varying mTHP sizes, you'd still run into 
the split with TAO. Maybe that doesn't apply with Android today, but I 
can just guess that performing sub-mTHP moving would still be required 
for GC at some point.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20  9:21         ` Barry Song
@ 2025-02-20 10:24           ` David Hildenbrand
  2025-02-26  5:37             ` Barry Song
  2025-02-20 23:32           ` Peter Xu
  1 sibling, 1 reply; 47+ messages in thread
From: David Hildenbrand @ 2025-02-20 10:24 UTC (permalink / raw)
  To: Barry Song
  Cc: Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon, brauner,
	hughd, jannh, kaleshsingh, linux-kernel, linux-mm, lokeshgidra,
	mhocko, ngeoffray, peterx, rppt, ryan.roberts, shuah, surenb,
	v-songbaohua, viro, willy, zhangpeng362, zhengtangquan, yuzhao,
	stable

On 20.02.25 10:21, Barry Song wrote:
> On Thu, Feb 20, 2025 at 9:40 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 19.02.25 19:58, Suren Baghdasaryan wrote:
>>> On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 19.02.25 19:26, Suren Baghdasaryan wrote:
>>>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> userfaultfd_move() checks whether the PTE entry is present or a
>>>>>> swap entry.
>>>>>>
>>>>>> - If the PTE entry is present, move_present_pte() handles folio
>>>>>>      migration by setting:
>>>>>>
>>>>>>      src_folio->index = linear_page_index(dst_vma, dst_addr);
>>>>>>
>>>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>>>>>>      the PTE to the new dst_addr.
>>>>>>
>>>>>> This approach is incorrect because even if the PTE is a swap
>>>>>> entry, it can still reference a folio that remains in the swap
>>>>>> cache.
>>>>>>
>>>>>> If do_swap_page() is triggered, it may locate the folio in the
>>>>>> swap cache. However, during add_rmap operations, a kernel panic
>>>>>> can occur due to:
>>>>>>     page_pgoff(folio, page) != linear_page_index(vma, address)
>>>>>
>>>>> Thanks for the report and reproducer!
>>>>>
>>>>>>
>>>>>> $./a.out > /dev/null
>>>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
>>>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
>>>>>> [   13.337716] memcg:ffff00000405f000
>>>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
>>>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
>>>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
>>>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
>>>>>> [   13.340190] ------------[ cut here ]------------
>>>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
>>>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>>> [   13.340969] Modules linked in:
>>>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
>>>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
>>>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
>>>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
>>>>>> [   13.342018] sp : ffff80008752bb20
>>>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
>>>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
>>>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
>>>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
>>>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
>>>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
>>>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
>>>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
>>>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
>>>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
>>>>>> [   13.343876] Call trace:
>>>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
>>>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
>>>>>> [   13.344333]  do_swap_page+0x1060/0x1400
>>>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
>>>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
>>>>>> [   13.344586]  do_page_fault+0x20c/0x770
>>>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
>>>>>> [   13.344759]  do_mem_abort+0x48/0xa0
>>>>>> [   13.344842]  el0_da+0x58/0x130
>>>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
>>>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
>>>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
>>>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
>>>>>> [   13.345715] note: a.out[107] exited with irqs disabled
>>>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
>>>>>>
>>>>>> Fully fixing it would be quite complex, requiring similar handling
>>>>>> of folios as done in move_present_pte.
>>>>>
>>>>> How complex would that be? Is it a matter of adding
>>>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>>>>> folio->index = linear_page_index like in move_present_pte() or
>>>>> something more?
>>>>
>>>> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
>>>> be pinned and we may be able to move it I think.
>>>>
>>>> So all that's required is to check pte_swp_exclusive() and the folio size.
>>>>
>>>> ... in theory :) Not sure about the swap details.
>>>
>>> Looking some more into it, I think we would have to perform all the
>>> folio and anon_vma locking and pinning that we do for present pages in
>>> move_pages_pte(). If that's correct then maybe treating swapcache
>>> pages like a present page inside move_pages_pte() would be simpler?
>>
>> I'd be more in favor of not doing that. Maybe there are parts we can
>> move out into helper functions instead, so we can reuse them?
> 
> I actually have a v2 ready. Maybe we can discuss if some of the code can be
> extracted as a helper based on the below before I send it formally?
> 
> I’d say there are many parts that can be shared with present PTE, but there
> are two major differences:
> 
> 1. Page exclusivity – swapcache doesn’t require it (try_to_unmap_one has remove
> Exclusive flag;)
> 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)
> 

That's a lot of complicated code you have there (not your fault, it's 
complicated stuff ... ) :)

Some of it might be compressed/simplified by the use of "else if".

I'll try to take a closer look later (will have to apply it to see the 
context better). Just one independent comment because I stumbled over 
this recently:

[...]

> @@ -1062,10 +1063,13 @@ static int move_present_pte(struct mm_struct *mm,
>   	folio_move_anon_rmap(src_folio, dst_vma);
>   	src_folio->index = linear_page_index(dst_vma, dst_addr);
>   
> -	orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
> -	/* Follow mremap() behavior and treat the entry dirty after the move */
> -	orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
> -
> +	if (pte_present(orig_src_pte)) {
> +		orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
> +		/* Follow mremap() behavior and treat the entry dirty after the move */
> +		orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);

I'll note that the comment and mkdirty is misleading/wrong. It's 
softdirty that we care about only. But that is something independent of 
this change.

For swp PTEs, we maybe also would want to set softdirty.

See move_soft_dirty_pte() on what is actually done on the mremap path.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20  9:36         ` David Hildenbrand
@ 2025-02-20 21:45           ` Barry Song
  2025-02-20 22:19             ` Lokesh Gidra
  0 siblings, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-20 21:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner, Hugh Dickins,
	Jann Horn, Kalesh Singh, Liam R . Howlett, Matthew Wilcox,
	Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu,
	Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.02.25 10:31, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.02.25 21:37, Barry Song wrote:
> >>> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >>>>
> >>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>
> >>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> userfaultfd_move() checks whether the PTE entry is present or a
> >>>>> swap entry.
> >>>>>
> >>>>> - If the PTE entry is present, move_present_pte() handles folio
> >>>>>     migration by setting:
> >>>>>
> >>>>>     src_folio->index = linear_page_index(dst_vma, dst_addr);
> >>>>>
> >>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >>>>>     the PTE to the new dst_addr.
> >>>>>
> >>>>> This approach is incorrect because even if the PTE is a swap
> >>>>> entry, it can still reference a folio that remains in the swap
> >>>>> cache.
> >>>>>
> >>>>> If do_swap_page() is triggered, it may locate the folio in the
> >>>>> swap cache. However, during add_rmap operations, a kernel panic
> >>>>> can occur due to:
> >>>>>    page_pgoff(folio, page) != linear_page_index(vma, address)
> >>>>
> >>>> Thanks for the report and reproducer!
> >>>>
> >>>>>
> >>>>> $./a.out > /dev/null
> >>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> >>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> >>>>> [   13.337716] memcg:ffff00000405f000
> >>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> >>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> >>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> >>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> >>>>> [   13.340190] ------------[ cut here ]------------
> >>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
> >>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> >>>>> [   13.340969] Modules linked in:
> >>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> >>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
> >>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> >>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> >>>>> [   13.342018] sp : ffff80008752bb20
> >>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> >>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> >>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> >>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> >>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> >>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> >>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> >>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> >>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> >>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> >>>>> [   13.343876] Call trace:
> >>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> >>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> >>>>> [   13.344333]  do_swap_page+0x1060/0x1400
> >>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> >>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> >>>>> [   13.344586]  do_page_fault+0x20c/0x770
> >>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
> >>>>> [   13.344759]  do_mem_abort+0x48/0xa0
> >>>>> [   13.344842]  el0_da+0x58/0x130
> >>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> >>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> >>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> >>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
> >>>>> [   13.345715] note: a.out[107] exited with irqs disabled
> >>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
> >>>>>
> >>>>> Fully fixing it would be quite complex, requiring similar handling
> >>>>> of folios as done in move_present_pte.
> >>>>
> >>>> How complex would that be? Is it a matter of adding
> >>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> >>>> folio->index = linear_page_index like in move_present_pte() or
> >>>> something more?
> >>>
> >>> My main concern is still with large folios that require a split_folio()
> >>> during move_pages(), as the entire folio shares the same index and
> >>> anon_vma. However, userfaultfd_move() moves pages individually,
> >>> making a split necessary.
> >>>
> >>> However, in split_huge_page_to_list_to_order(), there is a:
> >>>
> >>>           if (folio_test_writeback(folio))
> >>>                   return -EBUSY;
> >>>
> >>> This is likely true for swapcache, right? However, even for move_present_pte(),
> >>> it simply returns -EBUSY:
> >>>
> >>> move_pages_pte()
> >>> {
> >>>                   /* at this point we have src_folio locked */
> >>>                   if (folio_test_large(src_folio)) {
> >>>                           /* split_folio() can block */
> >>>                           pte_unmap(&orig_src_pte);
> >>>                           pte_unmap(&orig_dst_pte);
> >>>                           src_pte = dst_pte = NULL;
> >>>                           err = split_folio(src_folio);
> >>>                           if (err)
> >>>                                   goto out;
> >>>
> >>>                           /* have to reacquire the folio after it got split */
> >>>                           folio_unlock(src_folio);
> >>>                           folio_put(src_folio);
> >>>                           src_folio = NULL;
> >>>                           goto retry;
> >>>                   }
> >>> }
> >>>
> >>> Do we need a folio_wait_writeback() before calling split_folio()?
> >>>
> >>> By the way, I have also reported that userfaultfd_move() has a fundamental
> >>> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> >>> kernel. In this scenario, folios in the virtual zone won’t be split in
> >>> split_folio(). Instead, the large folio migrates into nr_pages small folios.
> >>   > > Thus, the best-case scenario would be:
> >>>
> >>> mTHP -> migrate to small folios in split_folio() -> move small folios to
> >>> dst_addr
> >>>
> >>> While this works, it negates the performance benefits of
> >>> userfaultfd_move(), as it introduces two PTE operations (migration in
> >>> split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> >>> allocations, and still requires one memcpy(). This could end up
> >>> performing even worse than userfaultfd_copy(), I guess.
> >>   > > The worst-case scenario would be failing to allocate small folios in
> >>> split_folio(), then userfaultfd_move() might return -ENOMEM?
> >>
> >> Although that's an Android problem and not an upstream problem, I'll
> >> note that there are other reasons why the split / move might fail, and
> >> user space either must retry or fallback to a COPY.
> >>
> >> Regarding mTHP, we could move the whole folio if the user space-provided
> >> range allows for batching over multiple PTEs (nr_ptes), they are in a
> >> single VMA, and folio_mapcount() == nr_ptes.
> >>
> >> There are corner cases to handle, such as moving mTHPs such that they
> >> suddenly cross two page tables I assume, that are harder to handle when
> >> not moving individual PTEs where that cannot happen.
> >
> > This is a useful suggestion. I’ve heard that Lokesh is also interested in
> > modifying ART to perform moves at the mTHP granularity, which would require
> > kernel modifications as well. It’s likely the direction we’ll take after
> > fixing the current urgent bugs. The current split_folio() really isn’t ideal.
> >
> > The corner cases you mentioned are definitely worth considering. However,
> > once we can perform batch UFFDIO_MOVE, I believe that in most cases,
> > the conflict between userfaultfd_move() and TAO will be resolved ?
>
> Well, as soon as you would have varying mTHP sizes, you'd still run into
> the split with TAO. Maybe that doesn't apply with Android today, but I
> can just guess that performing sub-mTHP moving would still be required
> for GC at some point.

With patch v2[1], as discussed in my previous email, I have observed that
small folios consistently succeed without crashing. Similarly, mTHP no
longer crashes; however, it still returns -EBUSY during the raced time
window, even after adding folio_wait_writeback. While I previously
mentioned that folio_writeback prevents mTHP from splitting, this is not
the only factor. The split_folio() function still returns -EBUSY because
folio_get_anon_vma(folio) returns NULL when the folio is not mapped.

int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
                                     unsigned int new_order)
{
                anon_vma = folio_get_anon_vma(folio);
                if (!anon_vma) {
                        ret = -EBUSY;
                        goto out;
                }

                end = -1;
                mapping = NULL;
                anon_vma_lock_write(anon_vma);
}

Even if mTHP is not from TAO's virtual zone, userfaultfd_move() will still
fail when performing sub-mTHP moving in the swap cache case due to:

struct anon_vma *folio_get_anon_vma(const struct folio *folio)
{
        ...
        if (!folio_mapped(folio))
                goto out;
         ...
}

We likely need to modify split_folio() to support splitting unmapped anon
folios within the swap cache or introduce a new function like
split_unmapped_anon_folio()? Otherwise, userspace will have to fall back
to UFFDIO_COPY or retry.

As it stands, I see no way for sub-mTHP to survive moving with the current
code and within the existing raced window. For mTHP, there is essentially
no difference between returning -EBUSY immediately upon detecting that it
is within the swap cache, as proposed in v1.

[1] https://lore.kernel.org/linux-mm/20250220092101.71966-1-21cnbao@gmail.com/

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 21:45           ` Barry Song
@ 2025-02-20 22:19             ` Lokesh Gidra
  2025-02-20 22:26               ` Barry Song
  0 siblings, 1 reply; 47+ messages in thread
From: Lokesh Gidra @ 2025-02-20 22:19 UTC (permalink / raw)
  To: Barry Song
  Cc: David Hildenbrand, Suren Baghdasaryan, linux-mm, akpm,
	linux-kernel, zhengtangquan, Barry Song, Andrea Arcangeli,
	Al Viro, Axel Rasmussen, Brian Geffon, Christian Brauner,
	Hugh Dickins, Jann Horn, Kalesh Singh, Liam R . Howlett,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Nicolas Geoffray,
	Peter Xu, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 1:45 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 20.02.25 10:31, Barry Song wrote:
> > > On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> > >>
> > >> On 19.02.25 21:37, Barry Song wrote:
> > >>> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > >>>>
> > >>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > >>>>>
> > >>>>> From: Barry Song <v-songbaohua@oppo.com>
> > >>>>>
> > >>>>> userfaultfd_move() checks whether the PTE entry is present or a
> > >>>>> swap entry.
> > >>>>>
> > >>>>> - If the PTE entry is present, move_present_pte() handles folio
> > >>>>>     migration by setting:
> > >>>>>
> > >>>>>     src_folio->index = linear_page_index(dst_vma, dst_addr);
> > >>>>>
> > >>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > >>>>>     the PTE to the new dst_addr.
> > >>>>>
> > >>>>> This approach is incorrect because even if the PTE is a swap
> > >>>>> entry, it can still reference a folio that remains in the swap
> > >>>>> cache.
> > >>>>>
> > >>>>> If do_swap_page() is triggered, it may locate the folio in the
> > >>>>> swap cache. However, during add_rmap operations, a kernel panic
> > >>>>> can occur due to:
> > >>>>>    page_pgoff(folio, page) != linear_page_index(vma, address)
> > >>>>
> > >>>> Thanks for the report and reproducer!
> > >>>>
> > >>>>>
> > >>>>> $./a.out > /dev/null
> > >>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > >>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > >>>>> [   13.337716] memcg:ffff00000405f000
> > >>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > >>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > >>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > >>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > >>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > >>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > >>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > >>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > >>>>> [   13.340190] ------------[ cut here ]------------
> > >>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
> > >>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > >>>>> [   13.340969] Modules linked in:
> > >>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > >>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
> > >>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > >>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > >>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > >>>>> [   13.342018] sp : ffff80008752bb20
> > >>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > >>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > >>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > >>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > >>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > >>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > >>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > >>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > >>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > >>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > >>>>> [   13.343876] Call trace:
> > >>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > >>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > >>>>> [   13.344333]  do_swap_page+0x1060/0x1400
> > >>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > >>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > >>>>> [   13.344586]  do_page_fault+0x20c/0x770
> > >>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
> > >>>>> [   13.344759]  do_mem_abort+0x48/0xa0
> > >>>>> [   13.344842]  el0_da+0x58/0x130
> > >>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > >>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > >>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > >>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
> > >>>>> [   13.345715] note: a.out[107] exited with irqs disabled
> > >>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
> > >>>>>
> > >>>>> Fully fixing it would be quite complex, requiring similar handling
> > >>>>> of folios as done in move_present_pte.
> > >>>>
> > >>>> How complex would that be? Is it a matter of adding
> > >>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > >>>> folio->index = linear_page_index like in move_present_pte() or
> > >>>> something more?
> > >>>
> > >>> My main concern is still with large folios that require a split_folio()
> > >>> during move_pages(), as the entire folio shares the same index and
> > >>> anon_vma. However, userfaultfd_move() moves pages individually,
> > >>> making a split necessary.
> > >>>
> > >>> However, in split_huge_page_to_list_to_order(), there is a:
> > >>>
> > >>>           if (folio_test_writeback(folio))
> > >>>                   return -EBUSY;
> > >>>
> > >>> This is likely true for swapcache, right? However, even for move_present_pte(),
> > >>> it simply returns -EBUSY:
> > >>>
> > >>> move_pages_pte()
> > >>> {
> > >>>                   /* at this point we have src_folio locked */
> > >>>                   if (folio_test_large(src_folio)) {
> > >>>                           /* split_folio() can block */
> > >>>                           pte_unmap(&orig_src_pte);
> > >>>                           pte_unmap(&orig_dst_pte);
> > >>>                           src_pte = dst_pte = NULL;
> > >>>                           err = split_folio(src_folio);
> > >>>                           if (err)
> > >>>                                   goto out;
> > >>>
> > >>>                           /* have to reacquire the folio after it got split */
> > >>>                           folio_unlock(src_folio);
> > >>>                           folio_put(src_folio);
> > >>>                           src_folio = NULL;
> > >>>                           goto retry;
> > >>>                   }
> > >>> }
> > >>>
> > >>> Do we need a folio_wait_writeback() before calling split_folio()?
> > >>>
> > >>> By the way, I have also reported that userfaultfd_move() has a fundamental
> > >>> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > >>> kernel. In this scenario, folios in the virtual zone won’t be split in
> > >>> split_folio(). Instead, the large folio migrates into nr_pages small folios.
> > >>   > > Thus, the best-case scenario would be:
> > >>>
> > >>> mTHP -> migrate to small folios in split_folio() -> move small folios to
> > >>> dst_addr
> > >>>
> > >>> While this works, it negates the performance benefits of
> > >>> userfaultfd_move(), as it introduces two PTE operations (migration in
> > >>> split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> > >>> allocations, and still requires one memcpy(). This could end up
> > >>> performing even worse than userfaultfd_copy(), I guess.
> > >>   > > The worst-case scenario would be failing to allocate small folios in
> > >>> split_folio(), then userfaultfd_move() might return -ENOMEM?
> > >>
> > >> Although that's an Android problem and not an upstream problem, I'll
> > >> note that there are other reasons why the split / move might fail, and
> > >> user space either must retry or fallback to a COPY.
> > >>
> > >> Regarding mTHP, we could move the whole folio if the user space-provided
> > >> range allows for batching over multiple PTEs (nr_ptes), they are in a
> > >> single VMA, and folio_mapcount() == nr_ptes.
> > >>
> > >> There are corner cases to handle, such as moving mTHPs such that they
> > >> suddenly cross two page tables I assume, that are harder to handle when
> > >> not moving individual PTEs where that cannot happen.
> > >
> > > This is a useful suggestion. I’ve heard that Lokesh is also interested in
> > > modifying ART to perform moves at the mTHP granularity, which would require
> > > kernel modifications as well. It’s likely the direction we’ll take after
> > > fixing the current urgent bugs. The current split_folio() really isn’t ideal.
> > >
> > > The corner cases you mentioned are definitely worth considering. However,
> > > once we can perform batch UFFDIO_MOVE, I believe that in most cases,
> > > the conflict between userfaultfd_move() and TAO will be resolved ?
> >
> > Well, as soon as you would have varying mTHP sizes, you'd still run into
> > the split with TAO. Maybe that doesn't apply with Android today, but I
> > can just guess that performing sub-mTHP moving would still be required
> > for GC at some point.
>
> With patch v2[1], as discussed in my previous email, I have observed that
> small folios consistently succeed without crashing. Similarly, mTHP no
> longer crashes; however, it still returns -EBUSY during the raced time
> window, even after adding folio_wait_writeback. While I previously
> mentioned that folio_writeback prevents mTHP from splitting, this is not
> the only factor. The split_folio() function still returns -EBUSY because
> folio_get_anon_vma(folio) returns NULL when the folio is not mapped.
>
> int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>                                      unsigned int new_order)
> {
>                 anon_vma = folio_get_anon_vma(folio);
>                 if (!anon_vma) {
>                         ret = -EBUSY;
>                         goto out;
>                 }
>
>                 end = -1;
>                 mapping = NULL;
>                 anon_vma_lock_write(anon_vma);
> }
>
> Even if mTHP is not from TAO's virtual zone, userfaultfd_move() will still
> fail when performing sub-mTHP moving in the swap cache case due to:

Just to clarify my doubt. What do you mean by sub-mTHP? Also when you
say 'small folio' above,  do you mean single-page folios?

Am I understanding correctly that your patch correctly handles moving
single swap-cache page case?
>
> struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> {
>         ...
>         if (!folio_mapped(folio))
>                 goto out;
>          ...
> }
>
> We likely need to modify split_folio() to support splitting unmapped anon
> folios within the swap cache or introduce a new function like
> split_unmapped_anon_folio()? Otherwise, userspace will have to fall back
> to UFFDIO_COPY or retry.
>
> As it stands, I see no way for sub-mTHP to survive moving with the current
> code and within the existing raced window. For mTHP, there is essentially
> no difference between returning -EBUSY immediately upon detecting that it
> is within the swap cache, as proposed in v1.
>
> [1] https://lore.kernel.org/linux-mm/20250220092101.71966-1-21cnbao@gmail.com/
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
>
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 22:19             ` Lokesh Gidra
@ 2025-02-20 22:26               ` Barry Song
  2025-02-20 22:31                 ` David Hildenbrand
  2025-02-20 22:33                 ` Lokesh Gidra
  0 siblings, 2 replies; 47+ messages in thread
From: Barry Song @ 2025-02-20 22:26 UTC (permalink / raw)
  To: Lokesh Gidra
  Cc: David Hildenbrand, Suren Baghdasaryan, linux-mm, akpm,
	linux-kernel, zhengtangquan, Barry Song, Andrea Arcangeli,
	Al Viro, Axel Rasmussen, Brian Geffon, Christian Brauner,
	Hugh Dickins, Jann Horn, Kalesh Singh, Liam R . Howlett,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Nicolas Geoffray,
	Peter Xu, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Fri, Feb 21, 2025 at 11:20 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> On Thu, Feb 20, 2025 at 1:45 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 20.02.25 10:31, Barry Song wrote:
> > > > On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> > > >>
> > > >> On 19.02.25 21:37, Barry Song wrote:
> > > >>> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >>>>
> > > >>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >>>>>
> > > >>>>> From: Barry Song <v-songbaohua@oppo.com>
> > > >>>>>
> > > >>>>> userfaultfd_move() checks whether the PTE entry is present or a
> > > >>>>> swap entry.
> > > >>>>>
> > > >>>>> - If the PTE entry is present, move_present_pte() handles folio
> > > >>>>>     migration by setting:
> > > >>>>>
> > > >>>>>     src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > >>>>>
> > > >>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > >>>>>     the PTE to the new dst_addr.
> > > >>>>>
> > > >>>>> This approach is incorrect because even if the PTE is a swap
> > > >>>>> entry, it can still reference a folio that remains in the swap
> > > >>>>> cache.
> > > >>>>>
> > > >>>>> If do_swap_page() is triggered, it may locate the folio in the
> > > >>>>> swap cache. However, during add_rmap operations, a kernel panic
> > > >>>>> can occur due to:
> > > >>>>>    page_pgoff(folio, page) != linear_page_index(vma, address)
> > > >>>>
> > > >>>> Thanks for the report and reproducer!
> > > >>>>
> > > >>>>>
> > > >>>>> $./a.out > /dev/null
> > > >>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > >>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > >>>>> [   13.337716] memcg:ffff00000405f000
> > > >>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > >>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > >>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > >>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > >>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > >>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > >>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > >>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > >>>>> [   13.340190] ------------[ cut here ]------------
> > > >>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > >>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > >>>>> [   13.340969] Modules linked in:
> > > >>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > >>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > >>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > >>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > >>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > >>>>> [   13.342018] sp : ffff80008752bb20
> > > >>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > >>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > >>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > >>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > >>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > >>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > >>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > >>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > >>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > >>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > >>>>> [   13.343876] Call trace:
> > > >>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > >>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > >>>>> [   13.344333]  do_swap_page+0x1060/0x1400
> > > >>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > >>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > >>>>> [   13.344586]  do_page_fault+0x20c/0x770
> > > >>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
> > > >>>>> [   13.344759]  do_mem_abort+0x48/0xa0
> > > >>>>> [   13.344842]  el0_da+0x58/0x130
> > > >>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > >>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > >>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > >>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
> > > >>>>> [   13.345715] note: a.out[107] exited with irqs disabled
> > > >>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
> > > >>>>>
> > > >>>>> Fully fixing it would be quite complex, requiring similar handling
> > > >>>>> of folios as done in move_present_pte.
> > > >>>>
> > > >>>> How complex would that be? Is it a matter of adding
> > > >>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > >>>> folio->index = linear_page_index like in move_present_pte() or
> > > >>>> something more?
> > > >>>
> > > >>> My main concern is still with large folios that require a split_folio()
> > > >>> during move_pages(), as the entire folio shares the same index and
> > > >>> anon_vma. However, userfaultfd_move() moves pages individually,
> > > >>> making a split necessary.
> > > >>>
> > > >>> However, in split_huge_page_to_list_to_order(), there is a:
> > > >>>
> > > >>>           if (folio_test_writeback(folio))
> > > >>>                   return -EBUSY;
> > > >>>
> > > >>> This is likely true for swapcache, right? However, even for move_present_pte(),
> > > >>> it simply returns -EBUSY:
> > > >>>
> > > >>> move_pages_pte()
> > > >>> {
> > > >>>                   /* at this point we have src_folio locked */
> > > >>>                   if (folio_test_large(src_folio)) {
> > > >>>                           /* split_folio() can block */
> > > >>>                           pte_unmap(&orig_src_pte);
> > > >>>                           pte_unmap(&orig_dst_pte);
> > > >>>                           src_pte = dst_pte = NULL;
> > > >>>                           err = split_folio(src_folio);
> > > >>>                           if (err)
> > > >>>                                   goto out;
> > > >>>
> > > >>>                           /* have to reacquire the folio after it got split */
> > > >>>                           folio_unlock(src_folio);
> > > >>>                           folio_put(src_folio);
> > > >>>                           src_folio = NULL;
> > > >>>                           goto retry;
> > > >>>                   }
> > > >>> }
> > > >>>
> > > >>> Do we need a folio_wait_writeback() before calling split_folio()?
> > > >>>
> > > >>> By the way, I have also reported that userfaultfd_move() has a fundamental
> > > >>> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > > >>> kernel. In this scenario, folios in the virtual zone won’t be split in
> > > >>> split_folio(). Instead, the large folio migrates into nr_pages small folios.
> > > >>   > > Thus, the best-case scenario would be:
> > > >>>
> > > >>> mTHP -> migrate to small folios in split_folio() -> move small folios to
> > > >>> dst_addr
> > > >>>
> > > >>> While this works, it negates the performance benefits of
> > > >>> userfaultfd_move(), as it introduces two PTE operations (migration in
> > > >>> split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> > > >>> allocations, and still requires one memcpy(). This could end up
> > > >>> performing even worse than userfaultfd_copy(), I guess.
> > > >>   > > The worst-case scenario would be failing to allocate small folios in
> > > >>> split_folio(), then userfaultfd_move() might return -ENOMEM?
> > > >>
> > > >> Although that's an Android problem and not an upstream problem, I'll
> > > >> note that there are other reasons why the split / move might fail, and
> > > >> user space either must retry or fallback to a COPY.
> > > >>
> > > >> Regarding mTHP, we could move the whole folio if the user space-provided
> > > >> range allows for batching over multiple PTEs (nr_ptes), they are in a
> > > >> single VMA, and folio_mapcount() == nr_ptes.
> > > >>
> > > >> There are corner cases to handle, such as moving mTHPs such that they
> > > >> suddenly cross two page tables I assume, that are harder to handle when
> > > >> not moving individual PTEs where that cannot happen.
> > > >
> > > > This is a useful suggestion. I’ve heard that Lokesh is also interested in
> > > > modifying ART to perform moves at the mTHP granularity, which would require
> > > > kernel modifications as well. It’s likely the direction we’ll take after
> > > > fixing the current urgent bugs. The current split_folio() really isn’t ideal.
> > > >
> > > > The corner cases you mentioned are definitely worth considering. However,
> > > > once we can perform batch UFFDIO_MOVE, I believe that in most cases,
> > > > the conflict between userfaultfd_move() and TAO will be resolved ?
> > >
> > > Well, as soon as you would have varying mTHP sizes, you'd still run into
> > > the split with TAO. Maybe that doesn't apply with Android today, but I
> > > can just guess that performing sub-mTHP moving would still be required
> > > for GC at some point.
> >
> > With patch v2[1], as discussed in my previous email, I have observed that
> > small folios consistently succeed without crashing. Similarly, mTHP no
> > longer crashes; however, it still returns -EBUSY during the raced time
> > window, even after adding folio_wait_writeback. While I previously
> > mentioned that folio_writeback prevents mTHP from splitting, this is not
> > the only factor. The split_folio() function still returns -EBUSY because
> > folio_get_anon_vma(folio) returns NULL when the folio is not mapped.
> >
> > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> >                                      unsigned int new_order)
> > {
> >                 anon_vma = folio_get_anon_vma(folio);
> >                 if (!anon_vma) {
> >                         ret = -EBUSY;
> >                         goto out;
> >                 }
> >
> >                 end = -1;
> >                 mapping = NULL;
> >                 anon_vma_lock_write(anon_vma);
> > }
> >
> > Even if mTHP is not from TAO's virtual zone, userfaultfd_move() will still
> > fail when performing sub-mTHP moving in the swap cache case due to:
>
> Just to clarify my doubt. What do you mean by sub-mTHP? Also when you
> say 'small folio' above,  do you mean single-page folios?

This means any moving size smaller than the size of mTHP, or moving
a partial mTHP.

>
> Am I understanding correctly that your patch correctly handles moving
> single swap-cache page case?

Yes, the crash is fixed for both small and large folios, and for small
folios, moving is consistently successful(even for the swapcache case).
The only issue is that sub-mTHP moving constantly fails for the swapcache
case because split_folio() fails, even after waiting for writeback as
split_folio()
can only split mapped folios - which is false for swapcache since
try_to_unmap_one() has been done.

So I'd say for mTHP, returning -EBUSY as early as possible is the
better choice to avoid wasting much time and eventually returning
-EBUSY anyway unless we want to modify split_folio() things.

> >
> > struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> > {
> >         ...
> >         if (!folio_mapped(folio))
> >                 goto out;
> >          ...
> > }
> >
> > We likely need to modify split_folio() to support splitting unmapped anon
> > folios within the swap cache or introduce a new function like
> > split_unmapped_anon_folio()? Otherwise, userspace will have to fall back
> > to UFFDIO_COPY or retry.
> >
> > As it stands, I see no way for sub-mTHP to survive moving with the current
> > code and within the existing raced window. For mTHP, there is essentially
> > no difference between returning -EBUSY immediately upon detecting that it
> > is within the swap cache, as proposed in v1.
> >
> > [1] https://lore.kernel.org/linux-mm/20250220092101.71966-1-21cnbao@gmail.com/
> >
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 22:26               ` Barry Song
@ 2025-02-20 22:31                 ` David Hildenbrand
  2025-02-20 22:33                 ` Lokesh Gidra
  1 sibling, 0 replies; 47+ messages in thread
From: David Hildenbrand @ 2025-02-20 22:31 UTC (permalink / raw)
  To: Barry Song, Lokesh Gidra
  Cc: Suren Baghdasaryan, linux-mm, akpm, linux-kernel, zhengtangquan,
	Barry Song, Andrea Arcangeli, Al Viro, Axel Rasmussen,
	Brian Geffon, Christian Brauner, Hugh Dickins, Jann Horn,
	Kalesh Singh, Liam R . Howlett, Matthew Wilcox, Michal Hocko,
	Mike Rapoport, Nicolas Geoffray, Peter Xu, Ryan Roberts,
	Shuah Khan, ZhangPeng, Yu Zhao

On 20.02.25 23:26, Barry Song wrote:
> On Fri, Feb 21, 2025 at 11:20 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>>
>> On Thu, Feb 20, 2025 at 1:45 PM Barry Song <21cnbao@gmail.com> wrote:
>>>
>>> On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 20.02.25 10:31, Barry Song wrote:
>>>>> On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 19.02.25 21:37, Barry Song wrote:
>>>>>>> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
>>>>>>>>
>>>>>>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>>>
>>>>>>>>> userfaultfd_move() checks whether the PTE entry is present or a
>>>>>>>>> swap entry.
>>>>>>>>>
>>>>>>>>> - If the PTE entry is present, move_present_pte() handles folio
>>>>>>>>>      migration by setting:
>>>>>>>>>
>>>>>>>>>      src_folio->index = linear_page_index(dst_vma, dst_addr);
>>>>>>>>>
>>>>>>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>>>>>>>>>      the PTE to the new dst_addr.
>>>>>>>>>
>>>>>>>>> This approach is incorrect because even if the PTE is a swap
>>>>>>>>> entry, it can still reference a folio that remains in the swap
>>>>>>>>> cache.
>>>>>>>>>
>>>>>>>>> If do_swap_page() is triggered, it may locate the folio in the
>>>>>>>>> swap cache. However, during add_rmap operations, a kernel panic
>>>>>>>>> can occur due to:
>>>>>>>>>     page_pgoff(folio, page) != linear_page_index(vma, address)
>>>>>>>>
>>>>>>>> Thanks for the report and reproducer!
>>>>>>>>
>>>>>>>>>
>>>>>>>>> $./a.out > /dev/null
>>>>>>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
>>>>>>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
>>>>>>>>> [   13.337716] memcg:ffff00000405f000
>>>>>>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
>>>>>>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>>>>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>>>>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>>>>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>>>>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
>>>>>>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
>>>>>>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
>>>>>>>>> [   13.340190] ------------[ cut here ]------------
>>>>>>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
>>>>>>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>>>>>> [   13.340969] Modules linked in:
>>>>>>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
>>>>>>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
>>>>>>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
>>>>>>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
>>>>>>>>> [   13.342018] sp : ffff80008752bb20
>>>>>>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
>>>>>>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
>>>>>>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
>>>>>>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
>>>>>>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
>>>>>>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
>>>>>>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
>>>>>>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
>>>>>>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
>>>>>>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
>>>>>>>>> [   13.343876] Call trace:
>>>>>>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
>>>>>>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
>>>>>>>>> [   13.344333]  do_swap_page+0x1060/0x1400
>>>>>>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
>>>>>>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
>>>>>>>>> [   13.344586]  do_page_fault+0x20c/0x770
>>>>>>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
>>>>>>>>> [   13.344759]  do_mem_abort+0x48/0xa0
>>>>>>>>> [   13.344842]  el0_da+0x58/0x130
>>>>>>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
>>>>>>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
>>>>>>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
>>>>>>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
>>>>>>>>> [   13.345715] note: a.out[107] exited with irqs disabled
>>>>>>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
>>>>>>>>>
>>>>>>>>> Fully fixing it would be quite complex, requiring similar handling
>>>>>>>>> of folios as done in move_present_pte.
>>>>>>>>
>>>>>>>> How complex would that be? Is it a matter of adding
>>>>>>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>>>>>>>> folio->index = linear_page_index like in move_present_pte() or
>>>>>>>> something more?
>>>>>>>
>>>>>>> My main concern is still with large folios that require a split_folio()
>>>>>>> during move_pages(), as the entire folio shares the same index and
>>>>>>> anon_vma. However, userfaultfd_move() moves pages individually,
>>>>>>> making a split necessary.
>>>>>>>
>>>>>>> However, in split_huge_page_to_list_to_order(), there is a:
>>>>>>>
>>>>>>>            if (folio_test_writeback(folio))
>>>>>>>                    return -EBUSY;
>>>>>>>
>>>>>>> This is likely true for swapcache, right? However, even for move_present_pte(),
>>>>>>> it simply returns -EBUSY:
>>>>>>>
>>>>>>> move_pages_pte()
>>>>>>> {
>>>>>>>                    /* at this point we have src_folio locked */
>>>>>>>                    if (folio_test_large(src_folio)) {
>>>>>>>                            /* split_folio() can block */
>>>>>>>                            pte_unmap(&orig_src_pte);
>>>>>>>                            pte_unmap(&orig_dst_pte);
>>>>>>>                            src_pte = dst_pte = NULL;
>>>>>>>                            err = split_folio(src_folio);
>>>>>>>                            if (err)
>>>>>>>                                    goto out;
>>>>>>>
>>>>>>>                            /* have to reacquire the folio after it got split */
>>>>>>>                            folio_unlock(src_folio);
>>>>>>>                            folio_put(src_folio);
>>>>>>>                            src_folio = NULL;
>>>>>>>                            goto retry;
>>>>>>>                    }
>>>>>>> }
>>>>>>>
>>>>>>> Do we need a folio_wait_writeback() before calling split_folio()?
>>>>>>>
>>>>>>> By the way, I have also reported that userfaultfd_move() has a fundamental
>>>>>>> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
>>>>>>> kernel. In this scenario, folios in the virtual zone won’t be split in
>>>>>>> split_folio(). Instead, the large folio migrates into nr_pages small folios.
>>>>>>    > > Thus, the best-case scenario would be:
>>>>>>>
>>>>>>> mTHP -> migrate to small folios in split_folio() -> move small folios to
>>>>>>> dst_addr
>>>>>>>
>>>>>>> While this works, it negates the performance benefits of
>>>>>>> userfaultfd_move(), as it introduces two PTE operations (migration in
>>>>>>> split_folio() and move in userfaultfd_move() while retry), nr_pages memory
>>>>>>> allocations, and still requires one memcpy(). This could end up
>>>>>>> performing even worse than userfaultfd_copy(), I guess.
>>>>>>    > > The worst-case scenario would be failing to allocate small folios in
>>>>>>> split_folio(), then userfaultfd_move() might return -ENOMEM?
>>>>>>
>>>>>> Although that's an Android problem and not an upstream problem, I'll
>>>>>> note that there are other reasons why the split / move might fail, and
>>>>>> user space either must retry or fallback to a COPY.
>>>>>>
>>>>>> Regarding mTHP, we could move the whole folio if the user space-provided
>>>>>> range allows for batching over multiple PTEs (nr_ptes), they are in a
>>>>>> single VMA, and folio_mapcount() == nr_ptes.
>>>>>>
>>>>>> There are corner cases to handle, such as moving mTHPs such that they
>>>>>> suddenly cross two page tables I assume, that are harder to handle when
>>>>>> not moving individual PTEs where that cannot happen.
>>>>>
>>>>> This is a useful suggestion. I’ve heard that Lokesh is also interested in
>>>>> modifying ART to perform moves at the mTHP granularity, which would require
>>>>> kernel modifications as well. It’s likely the direction we’ll take after
>>>>> fixing the current urgent bugs. The current split_folio() really isn’t ideal.
>>>>>
>>>>> The corner cases you mentioned are definitely worth considering. However,
>>>>> once we can perform batch UFFDIO_MOVE, I believe that in most cases,
>>>>> the conflict between userfaultfd_move() and TAO will be resolved ?
>>>>
>>>> Well, as soon as you would have varying mTHP sizes, you'd still run into
>>>> the split with TAO. Maybe that doesn't apply with Android today, but I
>>>> can just guess that performing sub-mTHP moving would still be required
>>>> for GC at some point.
>>>
>>> With patch v2[1], as discussed in my previous email, I have observed that
>>> small folios consistently succeed without crashing. Similarly, mTHP no
>>> longer crashes; however, it still returns -EBUSY during the raced time
>>> window, even after adding folio_wait_writeback. While I previously
>>> mentioned that folio_writeback prevents mTHP from splitting, this is not
>>> the only factor. The split_folio() function still returns -EBUSY because
>>> folio_get_anon_vma(folio) returns NULL when the folio is not mapped.
>>>
>>> int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>                                       unsigned int new_order)
>>> {
>>>                  anon_vma = folio_get_anon_vma(folio);
>>>                  if (!anon_vma) {
>>>                          ret = -EBUSY;
>>>                          goto out;
>>>                  }
>>>
>>>                  end = -1;
>>>                  mapping = NULL;
>>>                  anon_vma_lock_write(anon_vma);
>>> }
>>>
>>> Even if mTHP is not from TAO's virtual zone, userfaultfd_move() will still
>>> fail when performing sub-mTHP moving in the swap cache case due to:
>>
>> Just to clarify my doubt. What do you mean by sub-mTHP? Also when you
>> say 'small folio' above,  do you mean single-page folios?
> 
> This means any moving size smaller than the size of mTHP, or moving
> a partial mTHP.
> 
>>
>> Am I understanding correctly that your patch correctly handles moving
>> single swap-cache page case?
> 
> Yes, the crash is fixed for both small and large folios, and for small
> folios, moving is consistently successful(even for the swapcache case).
> The only issue is that sub-mTHP moving constantly fails for the swapcache
> case because split_folio() fails, even after waiting for writeback as
> split_folio()
> can only split mapped folios - which is false for swapcache since
> try_to_unmap_one() has been done.

I mean, we (as the caller of split_folio()) have the VMA + anon_vma in 
our hands. Do we only have to bypass that mapping check, or is there 
something else that would block us?


-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 22:26               ` Barry Song
  2025-02-20 22:31                 ` David Hildenbrand
@ 2025-02-20 22:33                 ` Lokesh Gidra
  1 sibling, 0 replies; 47+ messages in thread
From: Lokesh Gidra @ 2025-02-20 22:33 UTC (permalink / raw)
  To: Barry Song
  Cc: David Hildenbrand, Suren Baghdasaryan, linux-mm, akpm,
	linux-kernel, zhengtangquan, Barry Song, Andrea Arcangeli,
	Al Viro, Axel Rasmussen, Brian Geffon, Christian Brauner,
	Hugh Dickins, Jann Horn, Kalesh Singh, Liam R . Howlett,
	Matthew Wilcox, Michal Hocko, Mike Rapoport, Nicolas Geoffray,
	Peter Xu, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 2:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Feb 21, 2025 at 11:20 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 1:45 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 10:36 PM David Hildenbrand <david@redhat.com> wrote:
> > > >
> > > > On 20.02.25 10:31, Barry Song wrote:
> > > > > On Thu, Feb 20, 2025 at 9:51 PM David Hildenbrand <david@redhat.com> wrote:
> > > > >>
> > > > >> On 19.02.25 21:37, Barry Song wrote:
> > > > >>> On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >>>>
> > > > >>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>> From: Barry Song <v-songbaohua@oppo.com>
> > > > >>>>>
> > > > >>>>> userfaultfd_move() checks whether the PTE entry is present or a
> > > > >>>>> swap entry.
> > > > >>>>>
> > > > >>>>> - If the PTE entry is present, move_present_pte() handles folio
> > > > >>>>>     migration by setting:
> > > > >>>>>
> > > > >>>>>     src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > >>>>>
> > > > >>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > >>>>>     the PTE to the new dst_addr.
> > > > >>>>>
> > > > >>>>> This approach is incorrect because even if the PTE is a swap
> > > > >>>>> entry, it can still reference a folio that remains in the swap
> > > > >>>>> cache.
> > > > >>>>>
> > > > >>>>> If do_swap_page() is triggered, it may locate the folio in the
> > > > >>>>> swap cache. However, during add_rmap operations, a kernel panic
> > > > >>>>> can occur due to:
> > > > >>>>>    page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > >>>>
> > > > >>>> Thanks for the report and reproducer!
> > > > >>>>
> > > > >>>>>
> > > > >>>>> $./a.out > /dev/null
> > > > >>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > >>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > >>>>> [   13.337716] memcg:ffff00000405f000
> > > > >>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > >>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > >>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > >>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > >>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > >>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > >>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > >>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > >>>>> [   13.340190] ------------[ cut here ]------------
> > > > >>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > >>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > >>>>> [   13.340969] Modules linked in:
> > > > >>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > >>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > >>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > >>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > >>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > >>>>> [   13.342018] sp : ffff80008752bb20
> > > > >>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > >>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > >>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > >>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > >>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > >>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > >>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > >>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > >>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > >>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > >>>>> [   13.343876] Call trace:
> > > > >>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > >>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > >>>>> [   13.344333]  do_swap_page+0x1060/0x1400
> > > > >>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > >>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > >>>>> [   13.344586]  do_page_fault+0x20c/0x770
> > > > >>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > >>>>> [   13.344759]  do_mem_abort+0x48/0xa0
> > > > >>>>> [   13.344842]  el0_da+0x58/0x130
> > > > >>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > >>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > >>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > >>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > >>>>> [   13.345715] note: a.out[107] exited with irqs disabled
> > > > >>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > >>>>>
> > > > >>>>> Fully fixing it would be quite complex, requiring similar handling
> > > > >>>>> of folios as done in move_present_pte.
> > > > >>>>
> > > > >>>> How complex would that be? Is it a matter of adding
> > > > >>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > >>>> folio->index = linear_page_index like in move_present_pte() or
> > > > >>>> something more?
> > > > >>>
> > > > >>> My main concern is still with large folios that require a split_folio()
> > > > >>> during move_pages(), as the entire folio shares the same index and
> > > > >>> anon_vma. However, userfaultfd_move() moves pages individually,
> > > > >>> making a split necessary.
> > > > >>>
> > > > >>> However, in split_huge_page_to_list_to_order(), there is a:
> > > > >>>
> > > > >>>           if (folio_test_writeback(folio))
> > > > >>>                   return -EBUSY;
> > > > >>>
> > > > >>> This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > >>> it simply returns -EBUSY:
> > > > >>>
> > > > >>> move_pages_pte()
> > > > >>> {
> > > > >>>                   /* at this point we have src_folio locked */
> > > > >>>                   if (folio_test_large(src_folio)) {
> > > > >>>                           /* split_folio() can block */
> > > > >>>                           pte_unmap(&orig_src_pte);
> > > > >>>                           pte_unmap(&orig_dst_pte);
> > > > >>>                           src_pte = dst_pte = NULL;
> > > > >>>                           err = split_folio(src_folio);
> > > > >>>                           if (err)
> > > > >>>                                   goto out;
> > > > >>>
> > > > >>>                           /* have to reacquire the folio after it got split */
> > > > >>>                           folio_unlock(src_folio);
> > > > >>>                           folio_put(src_folio);
> > > > >>>                           src_folio = NULL;
> > > > >>>                           goto retry;
> > > > >>>                   }
> > > > >>> }
> > > > >>>
> > > > >>> Do we need a folio_wait_writeback() before calling split_folio()?
> > > > >>>
> > > > >>> By the way, I have also reported that userfaultfd_move() has a fundamental
> > > > >>> conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > > > >>> kernel. In this scenario, folios in the virtual zone won’t be split in
> > > > >>> split_folio(). Instead, the large folio migrates into nr_pages small folios.
> > > > >>   > > Thus, the best-case scenario would be:
> > > > >>>
> > > > >>> mTHP -> migrate to small folios in split_folio() -> move small folios to
> > > > >>> dst_addr
> > > > >>>
> > > > >>> While this works, it negates the performance benefits of
> > > > >>> userfaultfd_move(), as it introduces two PTE operations (migration in
> > > > >>> split_folio() and move in userfaultfd_move() while retry), nr_pages memory
> > > > >>> allocations, and still requires one memcpy(). This could end up
> > > > >>> performing even worse than userfaultfd_copy(), I guess.
> > > > >>   > > The worst-case scenario would be failing to allocate small folios in
> > > > >>> split_folio(), then userfaultfd_move() might return -ENOMEM?
> > > > >>
> > > > >> Although that's an Android problem and not an upstream problem, I'll
> > > > >> note that there are other reasons why the split / move might fail, and
> > > > >> user space either must retry or fallback to a COPY.
> > > > >>
> > > > >> Regarding mTHP, we could move the whole folio if the user space-provided
> > > > >> range allows for batching over multiple PTEs (nr_ptes), they are in a
> > > > >> single VMA, and folio_mapcount() == nr_ptes.
> > > > >>
> > > > >> There are corner cases to handle, such as moving mTHPs such that they
> > > > >> suddenly cross two page tables I assume, that are harder to handle when
> > > > >> not moving individual PTEs where that cannot happen.
> > > > >
> > > > > This is a useful suggestion. I’ve heard that Lokesh is also interested in
> > > > > modifying ART to perform moves at the mTHP granularity, which would require
> > > > > kernel modifications as well. It’s likely the direction we’ll take after
> > > > > fixing the current urgent bugs. The current split_folio() really isn’t ideal.
> > > > >
> > > > > The corner cases you mentioned are definitely worth considering. However,
> > > > > once we can perform batch UFFDIO_MOVE, I believe that in most cases,
> > > > > the conflict between userfaultfd_move() and TAO will be resolved ?
> > > >
> > > > Well, as soon as you would have varying mTHP sizes, you'd still run into
> > > > the split with TAO. Maybe that doesn't apply with Android today, but I
> > > > can just guess that performing sub-mTHP moving would still be required
> > > > for GC at some point.
> > >
> > > With patch v2[1], as discussed in my previous email, I have observed that
> > > small folios consistently succeed without crashing. Similarly, mTHP no
> > > longer crashes; however, it still returns -EBUSY during the raced time
> > > window, even after adding folio_wait_writeback. While I previously
> > > mentioned that folio_writeback prevents mTHP from splitting, this is not
> > > the only factor. The split_folio() function still returns -EBUSY because
> > > folio_get_anon_vma(folio) returns NULL when the folio is not mapped.
> > >
> > > int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> > >                                      unsigned int new_order)
> > > {
> > >                 anon_vma = folio_get_anon_vma(folio);
> > >                 if (!anon_vma) {
> > >                         ret = -EBUSY;
> > >                         goto out;
> > >                 }
> > >
> > >                 end = -1;
> > >                 mapping = NULL;
> > >                 anon_vma_lock_write(anon_vma);
> > > }
> > >
> > > Even if mTHP is not from TAO's virtual zone, userfaultfd_move() will still
> > > fail when performing sub-mTHP moving in the swap cache case due to:
> >
> > Just to clarify my doubt. What do you mean by sub-mTHP? Also when you
> > say 'small folio' above,  do you mean single-page folios?
>
> This means any moving size smaller than the size of mTHP, or moving
> a partial mTHP.
>
> >
> > Am I understanding correctly that your patch correctly handles moving
> > single swap-cache page case?
>
> Yes, the crash is fixed for both small and large folios, and for small
> folios, moving is consistently successful(even for the swapcache case).
> The only issue is that sub-mTHP moving constantly fails for the swapcache
> case because split_folio() fails, even after waiting for writeback as
> split_folio()
> can only split mapped folios - which is false for swapcache since
> try_to_unmap_one() has been done.
>
> So I'd say for mTHP, returning -EBUSY as early as possible is the
> better choice to avoid wasting much time and eventually returning
> -EBUSY anyway unless we want to modify split_folio() things.
>
Great! In this case, can we please fix the kernel panic bug as soon as
possible. Until that is fixed, the ioctl is practically unusable.
> > >
> > > struct anon_vma *folio_get_anon_vma(const struct folio *folio)
> > > {
> > >         ...
> > >         if (!folio_mapped(folio))
> > >                 goto out;
> > >          ...
> > > }
> > >
> > > We likely need to modify split_folio() to support splitting unmapped anon
> > > folios within the swap cache or introduce a new function like
> > > split_unmapped_anon_folio()? Otherwise, userspace will have to fall back
> > > to UFFDIO_COPY or retry.
> > >
> > > As it stands, I see no way for sub-mTHP to survive moving with the current
> > > code and within the existing raced window. For mTHP, there is essentially
> > > no difference between returning -EBUSY immediately upon detecting that it
> > > is within the swap cache, as proposed in v1.
> > >
> > > [1] https://lore.kernel.org/linux-mm/20250220092101.71966-1-21cnbao@gmail.com/
> > >
> > > >
> > > > --
> > > > Cheers,
> > > >
> > > > David / dhildenb
> > > >
> > >
> Thanks
> Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-19 23:04       ` Barry Song
  2025-02-19 23:19         ` Lokesh Gidra
@ 2025-02-20 22:59         ` Peter Xu
  2025-02-20 23:47           ` Suren Baghdasaryan
  2025-02-21  1:36           ` Barry Song
  1 sibling, 2 replies; 47+ messages in thread
From: Peter Xu @ 2025-02-20 22:59 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > >
> > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > swap entry.
> > > > >
> > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > >   migration by setting:
> > > > >
> > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > >
> > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > >   the PTE to the new dst_addr.
> > > > >
> > > > > This approach is incorrect because even if the PTE is a swap
> > > > > entry, it can still reference a folio that remains in the swap
> > > > > cache.
> > > > >
> > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > can occur due to:
> > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > >
> > > > Thanks for the report and reproducer!
> > > >
> > > > >
> > > > > $./a.out > /dev/null
> > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > [   13.337716] memcg:ffff00000405f000
> > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > [   13.340190] ------------[ cut here ]------------
> > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > [   13.340969] Modules linked in:
> > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > [   13.342018] sp : ffff80008752bb20
> > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > [   13.343876] Call trace:
> > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > >
> > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > of folios as done in move_present_pte.
> > > >
> > > > How complex would that be? Is it a matter of adding
> > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > folio->index = linear_page_index like in move_present_pte() or
> > > > something more?
> > >
> > > My main concern is still with large folios that require a split_folio()
> > > during move_pages(), as the entire folio shares the same index and
> > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > making a split necessary.
> > >
> > > However, in split_huge_page_to_list_to_order(), there is a:
> > >
> > >         if (folio_test_writeback(folio))
> > >                 return -EBUSY;
> > >
> > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > it simply returns -EBUSY:
> > >
> > > move_pages_pte()
> > > {
> > >                 /* at this point we have src_folio locked */
> > >                 if (folio_test_large(src_folio)) {
> > >                         /* split_folio() can block */
> > >                         pte_unmap(&orig_src_pte);
> > >                         pte_unmap(&orig_dst_pte);
> > >                         src_pte = dst_pte = NULL;
> > >                         err = split_folio(src_folio);
> > >                         if (err)
> > >                                 goto out;
> > >
> > >                         /* have to reacquire the folio after it got split */
> > >                         folio_unlock(src_folio);
> > >                         folio_put(src_folio);
> > >                         src_folio = NULL;
> > >                         goto retry;
> > >                 }
> > > }
> > >
> > > Do we need a folio_wait_writeback() before calling split_folio()?
> >
> > Maybe no need in the first version to fix the immediate bug?
> >
> > It's also not always the case to hit writeback here. IIUC, writeback only
> > happens for a short window when the folio was just added into swapcache.
> > MOVE can happen much later after that anytime before a swapin.  My
> > understanding is that's also what Matthew wanted to point out.  It may be
> > better justified of that in a separate change with some performance
> > measurements.
> 
> The bug we’re discussing occurs precisely within the short window you
> mentioned.
> 
> 1. add_to_swap: The folio is added to swapcache.
> 2. try_to_unmap: PTEs are converted to swap entries.
> 3. pageout
> 4. Swapcache is cleared.

Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
least that should be step 3.5, as IIUC "writeback" needs to be cleared
before "swapcache" bit being cleared.

> 
> The issue happens between steps 2 and 4, where the PTE is not present, but
> the folio is still in swapcache - the current code does move_swap_pte() but does
> not fixup folio->index within swapcache.

One thing I'm still not clear here is why it's a race condition, rather
than more severe than that.  I mean, folio->index is definitely wrong, then
as long as the page still in swapcache, we should be able to move the swp
entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
the page in swapcache with the wrong folio->index already and trigger.

I wrote a quick test like that, it actually won't trigger..

I had a closer look in the code, I think it's because do_swap_page() has
the logic to detect folio->index matching first, and allocate a new folio
if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
ksm.. but it looks like it's functioning too here.

ksm_might_need_to_copy:
	if (folio_test_ksm(folio)) {
		if (folio_stable_node(folio) &&
		    !(ksm_run & KSM_RUN_UNMERGE))
			return folio;	/* no need to copy it */
	} else if (!anon_vma) {
		return folio;		/* no need to copy it */
	} else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
			anon_vma->root == vma->anon_vma->root) {
		return folio;		/* still no need to copy it */
	}
        ...
        
	new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
        ...

So I believe what I hit is at [1] it sees index doesn't match, then it
decided to allocate a new folio.  In this case, it won't hit your BUG
because it'll be "folio != swapcache" later, so it'll setup the
folio->index for the new one, rather than the sanity check.

Do you know how your case got triggered, being able to bypass the above [1]
which should check folio->index already?

> 
> My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> in this RFC.
> 
> For small folios, there’s no split_folio issue, making it relatively
> simpler. Lokesh
> mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> the first priority.

Agreed.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20  9:21         ` Barry Song
  2025-02-20 10:24           ` David Hildenbrand
@ 2025-02-20 23:32           ` Peter Xu
  2025-02-21  0:07             ` Barry Song
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-02-20 23:32 UTC (permalink / raw)
  To: Barry Song
  Cc: david, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, hughd, jannh, kaleshsingh, linux-kernel, linux-mm,
	lokeshgidra, mhocko, ngeoffray, rppt, ryan.roberts, shuah,
	surenb, v-songbaohua, viro, willy, zhangpeng362, zhengtangquan,
	yuzhao, stable

On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)

Could you help explain what guarantees the rmap walk not happen on a
swapcache page?

I'm not familiar with this path, though at least I see damon can start a
rmap walk on PageAnon almost with no locking..  some explanations would be
appreciated.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 22:59         ` Peter Xu
@ 2025-02-20 23:47           ` Suren Baghdasaryan
  2025-02-20 23:52             ` Suren Baghdasaryan
  2025-02-21  1:36           ` Barry Song
  1 sibling, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-02-20 23:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: Barry Song, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > swap entry.
> > > > > >
> > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > >   migration by setting:
> > > > > >
> > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > >
> > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > >   the PTE to the new dst_addr.
> > > > > >
> > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > cache.
> > > > > >
> > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > can occur due to:
> > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > >
> > > > > Thanks for the report and reproducer!
> > > > >
> > > > > >
> > > > > > $./a.out > /dev/null
> > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > [   13.340969] Modules linked in:
> > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > [   13.343876] Call trace:
> > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > >
> > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > of folios as done in move_present_pte.
> > > > >
> > > > > How complex would that be? Is it a matter of adding
> > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > something more?
> > > >
> > > > My main concern is still with large folios that require a split_folio()
> > > > during move_pages(), as the entire folio shares the same index and
> > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > making a split necessary.
> > > >
> > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > >
> > > >         if (folio_test_writeback(folio))
> > > >                 return -EBUSY;
> > > >
> > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > it simply returns -EBUSY:
> > > >
> > > > move_pages_pte()
> > > > {
> > > >                 /* at this point we have src_folio locked */
> > > >                 if (folio_test_large(src_folio)) {
> > > >                         /* split_folio() can block */
> > > >                         pte_unmap(&orig_src_pte);
> > > >                         pte_unmap(&orig_dst_pte);
> > > >                         src_pte = dst_pte = NULL;
> > > >                         err = split_folio(src_folio);
> > > >                         if (err)
> > > >                                 goto out;
> > > >
> > > >                         /* have to reacquire the folio after it got split */
> > > >                         folio_unlock(src_folio);
> > > >                         folio_put(src_folio);
> > > >                         src_folio = NULL;
> > > >                         goto retry;
> > > >                 }
> > > > }
> > > >
> > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > >
> > > Maybe no need in the first version to fix the immediate bug?
> > >
> > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > happens for a short window when the folio was just added into swapcache.
> > > MOVE can happen much later after that anytime before a swapin.  My
> > > understanding is that's also what Matthew wanted to point out.  It may be
> > > better justified of that in a separate change with some performance
> > > measurements.
> >
> > The bug we’re discussing occurs precisely within the short window you
> > mentioned.
> >
> > 1. add_to_swap: The folio is added to swapcache.
> > 2. try_to_unmap: PTEs are converted to swap entries.
> > 3. pageout
> > 4. Swapcache is cleared.
>
> Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> least that should be step 3.5, as IIUC "writeback" needs to be cleared
> before "swapcache" bit being cleared.
>
> >
> > The issue happens between steps 2 and 4, where the PTE is not present, but
> > the folio is still in swapcache - the current code does move_swap_pte() but does
> > not fixup folio->index within swapcache.
>
> One thing I'm still not clear here is why it's a race condition, rather
> than more severe than that.  I mean, folio->index is definitely wrong, then
> as long as the page still in swapcache, we should be able to move the swp
> entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
> the page in swapcache with the wrong folio->index already and trigger.
>
> I wrote a quick test like that, it actually won't trigger..
>
> I had a closer look in the code, I think it's because do_swap_page() has
> the logic to detect folio->index matching first, and allocate a new folio
> if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
> ksm.. but it looks like it's functioning too here.
>
> ksm_might_need_to_copy:
>         if (folio_test_ksm(folio)) {
>                 if (folio_stable_node(folio) &&
>                     !(ksm_run & KSM_RUN_UNMERGE))
>                         return folio;   /* no need to copy it */
>         } else if (!anon_vma) {
>                 return folio;           /* no need to copy it */
>         } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
>                         anon_vma->root == vma->anon_vma->root) {
>                 return folio;           /* still no need to copy it */
>         }
>         ...
>
>         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
>         ...
>
> So I believe what I hit is at [1] it sees index doesn't match, then it
> decided to allocate a new folio.  In this case, it won't hit your BUG
> because it'll be "folio != swapcache" later, so it'll setup the
> folio->index for the new one, rather than the sanity check.
>
> Do you know how your case got triggered, being able to bypass the above [1]
> which should check folio->index already?

To understand the change I tried applying the proposed patch to both
mm-unstable and Linus' ToT and got conflicts for both trees. Barry,
which baseline are you using?

>
> >
> > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > in this RFC.
> >
> > For small folios, there’s no split_folio issue, making it relatively
> > simpler. Lokesh
> > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > the first priority.
>
> Agreed.
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 23:47           ` Suren Baghdasaryan
@ 2025-02-20 23:52             ` Suren Baghdasaryan
  2025-02-21  0:36               ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-02-20 23:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: Barry Song, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 3:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> > > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > >
> > > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > > swap entry.
> > > > > > >
> > > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > > >   migration by setting:
> > > > > > >
> > > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > >
> > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > > >   the PTE to the new dst_addr.
> > > > > > >
> > > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > > cache.
> > > > > > >
> > > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > > can occur due to:
> > > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > > >
> > > > > > Thanks for the report and reproducer!
> > > > > >
> > > > > > >
> > > > > > > $./a.out > /dev/null
> > > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > > [   13.340969] Modules linked in:
> > > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > > [   13.343876] Call trace:
> > > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > > >
> > > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > > of folios as done in move_present_pte.
> > > > > >
> > > > > > How complex would that be? Is it a matter of adding
> > > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > > something more?
> > > > >
> > > > > My main concern is still with large folios that require a split_folio()
> > > > > during move_pages(), as the entire folio shares the same index and
> > > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > > making a split necessary.
> > > > >
> > > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > > >
> > > > >         if (folio_test_writeback(folio))
> > > > >                 return -EBUSY;
> > > > >
> > > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > > it simply returns -EBUSY:
> > > > >
> > > > > move_pages_pte()
> > > > > {
> > > > >                 /* at this point we have src_folio locked */
> > > > >                 if (folio_test_large(src_folio)) {
> > > > >                         /* split_folio() can block */
> > > > >                         pte_unmap(&orig_src_pte);
> > > > >                         pte_unmap(&orig_dst_pte);
> > > > >                         src_pte = dst_pte = NULL;
> > > > >                         err = split_folio(src_folio);
> > > > >                         if (err)
> > > > >                                 goto out;
> > > > >
> > > > >                         /* have to reacquire the folio after it got split */
> > > > >                         folio_unlock(src_folio);
> > > > >                         folio_put(src_folio);
> > > > >                         src_folio = NULL;
> > > > >                         goto retry;
> > > > >                 }
> > > > > }
> > > > >
> > > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > > >
> > > > Maybe no need in the first version to fix the immediate bug?
> > > >
> > > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > > happens for a short window when the folio was just added into swapcache.
> > > > MOVE can happen much later after that anytime before a swapin.  My
> > > > understanding is that's also what Matthew wanted to point out.  It may be
> > > > better justified of that in a separate change with some performance
> > > > measurements.
> > >
> > > The bug we’re discussing occurs precisely within the short window you
> > > mentioned.
> > >
> > > 1. add_to_swap: The folio is added to swapcache.
> > > 2. try_to_unmap: PTEs are converted to swap entries.
> > > 3. pageout
> > > 4. Swapcache is cleared.
> >
> > Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> > least that should be step 3.5, as IIUC "writeback" needs to be cleared
> > before "swapcache" bit being cleared.
> >
> > >
> > > The issue happens between steps 2 and 4, where the PTE is not present, but
> > > the folio is still in swapcache - the current code does move_swap_pte() but does
> > > not fixup folio->index within swapcache.
> >
> > One thing I'm still not clear here is why it's a race condition, rather
> > than more severe than that.  I mean, folio->index is definitely wrong, then
> > as long as the page still in swapcache, we should be able to move the swp
> > entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
> > the page in swapcache with the wrong folio->index already and trigger.
> >
> > I wrote a quick test like that, it actually won't trigger..
> >
> > I had a closer look in the code, I think it's because do_swap_page() has
> > the logic to detect folio->index matching first, and allocate a new folio
> > if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
> > ksm.. but it looks like it's functioning too here.
> >
> > ksm_might_need_to_copy:
> >         if (folio_test_ksm(folio)) {
> >                 if (folio_stable_node(folio) &&
> >                     !(ksm_run & KSM_RUN_UNMERGE))
> >                         return folio;   /* no need to copy it */
> >         } else if (!anon_vma) {
> >                 return folio;           /* no need to copy it */
> >         } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
> >                         anon_vma->root == vma->anon_vma->root) {
> >                 return folio;           /* still no need to copy it */
> >         }
> >         ...
> >
> >         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
> >         ...
> >
> > So I believe what I hit is at [1] it sees index doesn't match, then it
> > decided to allocate a new folio.  In this case, it won't hit your BUG
> > because it'll be "folio != swapcache" later, so it'll setup the
> > folio->index for the new one, rather than the sanity check.
> >
> > Do you know how your case got triggered, being able to bypass the above [1]
> > which should check folio->index already?
>
> To understand the change I tried applying the proposed patch to both
> mm-unstable and Linus' ToT and got conflicts for both trees. Barry,
> which baseline are you using?

Oops, never mind. My mistake. Copying from the email messed up tabs...
It applies cleanly.

>
> >
> > >
> > > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > > in this RFC.
> > >
> > > For small folios, there’s no split_folio issue, making it relatively
> > > simpler. Lokesh
> > > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > > the first priority.
> >
> > Agreed.
> >
> > --
> > Peter Xu
> >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 23:32           ` Peter Xu
@ 2025-02-21  0:07             ` Barry Song
  2025-02-21  1:49               ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-21  0:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: david, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, hughd, jannh, kaleshsingh, linux-kernel, linux-mm,
	lokeshgidra, mhocko, ngeoffray, rppt, ryan.roberts, shuah,
	surenb, v-songbaohua, viro, willy, zhangpeng362, zhengtangquan,
	yuzhao, stable

On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)
>
> Could you help explain what guarantees the rmap walk not happen on a
> swapcache page?
>
> I'm not familiar with this path, though at least I see damon can start a
> rmap walk on PageAnon almost with no locking..  some explanations would be
> appreciated.

I am observing the following in folio_referenced(), which the anon_vma lock
was originally intended to protect.

        if (!pra.mapcount)
                return 0;

I assume all other rmap walks should do the same?

int folio_referenced(struct folio *folio, int is_locked,
                     struct mem_cgroup *memcg, unsigned long *vm_flags)
{

        bool we_locked = false;
        struct folio_referenced_arg pra = {
                .mapcount = folio_mapcount(folio),
                .memcg = memcg,
        };

        struct rmap_walk_control rwc = {
                .rmap_one = folio_referenced_one,
                .arg = (void *)&pra,
                .anon_lock = folio_lock_anon_vma_read,
                .try_lock = true,
                .invalid_vma = invalid_folio_referenced_vma,
        };

        *vm_flags = 0;
        if (!pra.mapcount)
                return 0;
        ...
}

By the way, since the folio has been under reclamation in this case and
isn't in the lru, this should also prevent the rmap walk, right?

>
> --
> Peter Xu
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 23:52             ` Suren Baghdasaryan
@ 2025-02-21  0:36               ` Suren Baghdasaryan
  2025-02-25 11:05                 ` Barry Song
  0 siblings, 1 reply; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-02-21  0:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: Barry Song, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Thu, Feb 20, 2025 at 3:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Feb 20, 2025 at 3:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> > > > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > >
> > > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > > >
> > > > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > > > swap entry.
> > > > > > > >
> > > > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > > > >   migration by setting:
> > > > > > > >
> > > > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > > >
> > > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > > > >   the PTE to the new dst_addr.
> > > > > > > >
> > > > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > > > cache.
> > > > > > > >
> > > > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > > > can occur due to:
> > > > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > > > >
> > > > > > > Thanks for the report and reproducer!
> > > > > > >
> > > > > > > >
> > > > > > > > $./a.out > /dev/null
> > > > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > > > [   13.340969] Modules linked in:
> > > > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > > > [   13.343876] Call trace:
> > > > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > > > >
> > > > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > > > of folios as done in move_present_pte.
> > > > > > >
> > > > > > > How complex would that be? Is it a matter of adding
> > > > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > > > something more?
> > > > > >
> > > > > > My main concern is still with large folios that require a split_folio()
> > > > > > during move_pages(), as the entire folio shares the same index and
> > > > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > > > making a split necessary.
> > > > > >
> > > > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > > > >
> > > > > >         if (folio_test_writeback(folio))
> > > > > >                 return -EBUSY;
> > > > > >
> > > > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > > > it simply returns -EBUSY:
> > > > > >
> > > > > > move_pages_pte()
> > > > > > {
> > > > > >                 /* at this point we have src_folio locked */
> > > > > >                 if (folio_test_large(src_folio)) {
> > > > > >                         /* split_folio() can block */
> > > > > >                         pte_unmap(&orig_src_pte);
> > > > > >                         pte_unmap(&orig_dst_pte);
> > > > > >                         src_pte = dst_pte = NULL;
> > > > > >                         err = split_folio(src_folio);
> > > > > >                         if (err)
> > > > > >                                 goto out;
> > > > > >
> > > > > >                         /* have to reacquire the folio after it got split */
> > > > > >                         folio_unlock(src_folio);
> > > > > >                         folio_put(src_folio);
> > > > > >                         src_folio = NULL;
> > > > > >                         goto retry;
> > > > > >                 }
> > > > > > }
> > > > > >
> > > > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > > > >
> > > > > Maybe no need in the first version to fix the immediate bug?
> > > > >
> > > > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > > > happens for a short window when the folio was just added into swapcache.
> > > > > MOVE can happen much later after that anytime before a swapin.  My
> > > > > understanding is that's also what Matthew wanted to point out.  It may be
> > > > > better justified of that in a separate change with some performance
> > > > > measurements.
> > > >
> > > > The bug we’re discussing occurs precisely within the short window you
> > > > mentioned.
> > > >
> > > > 1. add_to_swap: The folio is added to swapcache.
> > > > 2. try_to_unmap: PTEs are converted to swap entries.
> > > > 3. pageout
> > > > 4. Swapcache is cleared.
> > >
> > > Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> > > least that should be step 3.5, as IIUC "writeback" needs to be cleared
> > > before "swapcache" bit being cleared.
> > >
> > > >
> > > > The issue happens between steps 2 and 4, where the PTE is not present, but
> > > > the folio is still in swapcache - the current code does move_swap_pte() but does
> > > > not fixup folio->index within swapcache.
> > >
> > > One thing I'm still not clear here is why it's a race condition, rather
> > > than more severe than that.  I mean, folio->index is definitely wrong, then
> > > as long as the page still in swapcache, we should be able to move the swp
> > > entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
> > > the page in swapcache with the wrong folio->index already and trigger.
> > >
> > > I wrote a quick test like that, it actually won't trigger..
> > >
> > > I had a closer look in the code, I think it's because do_swap_page() has
> > > the logic to detect folio->index matching first, and allocate a new folio
> > > if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
> > > ksm.. but it looks like it's functioning too here.
> > >
> > > ksm_might_need_to_copy:
> > >         if (folio_test_ksm(folio)) {
> > >                 if (folio_stable_node(folio) &&
> > >                     !(ksm_run & KSM_RUN_UNMERGE))
> > >                         return folio;   /* no need to copy it */
> > >         } else if (!anon_vma) {
> > >                 return folio;           /* no need to copy it */
> > >         } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
> > >                         anon_vma->root == vma->anon_vma->root) {
> > >                 return folio;           /* still no need to copy it */
> > >         }
> > >         ...
> > >
> > >         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
> > >         ...
> > >
> > > So I believe what I hit is at [1] it sees index doesn't match, then it
> > > decided to allocate a new folio.  In this case, it won't hit your BUG
> > > because it'll be "folio != swapcache" later, so it'll setup the
> > > folio->index for the new one, rather than the sanity check.
> > >
> > > Do you know how your case got triggered, being able to bypass the above [1]
> > > which should check folio->index already?
> >
> > To understand the change I tried applying the proposed patch to both
> > mm-unstable and Linus' ToT and got conflicts for both trees. Barry,
> > which baseline are you using?
>
> Oops, never mind. My mistake. Copying from the email messed up tabs...
> It applies cleanly.

Overall the code seems correct to me, however the new code has quite
complex logical structure IMO. Original simplified code structure is
like this:

if (pte_present(orig_src_pte)) {
        if (is_zero_pfn) {
                move_zeropage_pte()
                return
        }
        // pin and lock src_folio
        spin_lock(src_ptl)
        folio_get(folio)
        folio_trylock(folio)
        if (folio_test_large(src_folio))
                split_folio(src_folio)
        anon_vma_trylock_write(src_anon_vma)
        move_present_pte()
} else {
        if (non_swap_entry(entry))
                if (is_migration_entry(entry))
                        handle migration entry
        else
                move_swap_pte()
}

The new structure looks like this:

if (!pte_present(orig_src_pte)) {
        if (is_migration_entry(entry)) {
                handle migration entry
                return
       }
        if (!non_swap_entry() ||  !pte_swp_exclusive())
                return
        si = get_swap_device(entry);
}
if (pte_present(orig_src_pte) && is_zero_pfn(pte_pfn(orig_src_pte)))
        move_zeropage_pte()
        return
}
pin and lock src_folio
        spin_lock(src_ptl)
        if (pte_present(orig_src_pte))
                folio_get(folio)
        else {
                folio = filemap_get_folio(swap_entry)
                if (IS_ERR(folio))
                        move_swap_pte()
                        return
                }
        }
        folio_trylock(folio)
if (folio_test_large(src_folio))
        split_folio(src_folio)
if (pte_present(orig_src_pte))
        anon_vma_trylock_write(src_anon_vma)
move_pte_and_folio()

This looks more complex and harder to follow. Might be the reason
David was not in favour of treating swapcache and present pages in the
same path. And now I would agree that refactoring some common parts
and not breaking the original structure might be cleaner.

>
> >
> > >
> > > >
> > > > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > > > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > > > in this RFC.
> > > >
> > > > For small folios, there’s no split_folio issue, making it relatively
> > > > simpler. Lokesh
> > > > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > > > the first priority.
> > >
> > > Agreed.
> > >
> > > --
> > > Peter Xu
> > >


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 22:59         ` Peter Xu
  2025-02-20 23:47           ` Suren Baghdasaryan
@ 2025-02-21  1:36           ` Barry Song
  2025-02-21  1:54             ` Peter Xu
  1 sibling, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-21  1:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Fri, Feb 21, 2025 at 11:59 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > >
> > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > swap entry.
> > > > > >
> > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > >   migration by setting:
> > > > > >
> > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > >
> > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > >   the PTE to the new dst_addr.
> > > > > >
> > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > cache.
> > > > > >
> > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > can occur due to:
> > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > >
> > > > > Thanks for the report and reproducer!
> > > > >
> > > > > >
> > > > > > $./a.out > /dev/null
> > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > [   13.340969] Modules linked in:
> > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > [   13.343876] Call trace:
> > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > >
> > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > of folios as done in move_present_pte.
> > > > >
> > > > > How complex would that be? Is it a matter of adding
> > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > something more?
> > > >
> > > > My main concern is still with large folios that require a split_folio()
> > > > during move_pages(), as the entire folio shares the same index and
> > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > making a split necessary.
> > > >
> > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > >
> > > >         if (folio_test_writeback(folio))
> > > >                 return -EBUSY;
> > > >
> > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > it simply returns -EBUSY:
> > > >
> > > > move_pages_pte()
> > > > {
> > > >                 /* at this point we have src_folio locked */
> > > >                 if (folio_test_large(src_folio)) {
> > > >                         /* split_folio() can block */
> > > >                         pte_unmap(&orig_src_pte);
> > > >                         pte_unmap(&orig_dst_pte);
> > > >                         src_pte = dst_pte = NULL;
> > > >                         err = split_folio(src_folio);
> > > >                         if (err)
> > > >                                 goto out;
> > > >
> > > >                         /* have to reacquire the folio after it got split */
> > > >                         folio_unlock(src_folio);
> > > >                         folio_put(src_folio);
> > > >                         src_folio = NULL;
> > > >                         goto retry;
> > > >                 }
> > > > }
> > > >
> > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > >
> > > Maybe no need in the first version to fix the immediate bug?
> > >
> > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > happens for a short window when the folio was just added into swapcache.
> > > MOVE can happen much later after that anytime before a swapin.  My
> > > understanding is that's also what Matthew wanted to point out.  It may be
> > > better justified of that in a separate change with some performance
> > > measurements.
> >
> > The bug we’re discussing occurs precisely within the short window you
> > mentioned.
> >
> > 1. add_to_swap: The folio is added to swapcache.
> > 2. try_to_unmap: PTEs are converted to swap entries.
> > 3. pageout
> > 4. Swapcache is cleared.
>
> Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> least that should be step 3.5, as IIUC "writeback" needs to be cleared
> before "swapcache" bit being cleared.
>
> >
> > The issue happens between steps 2 and 4, where the PTE is not present, but
> > the folio is still in swapcache - the current code does move_swap_pte() but does
> > not fixup folio->index within swapcache.
>
> One thing I'm still not clear here is why it's a race condition, rather
> than more severe than that.  I mean, folio->index is definitely wrong, then
> as long as the page still in swapcache, we should be able to move the swp
> entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
> the page in swapcache with the wrong folio->index already and trigger.
>
> I wrote a quick test like that, it actually won't trigger..
>
> I had a closer look in the code, I think it's because do_swap_page() has
> the logic to detect folio->index matching first, and allocate a new folio
> if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
> ksm.. but it looks like it's functioning too here.
>
> ksm_might_need_to_copy:
>         if (folio_test_ksm(folio)) {
>                 if (folio_stable_node(folio) &&
>                     !(ksm_run & KSM_RUN_UNMERGE))
>                         return folio;   /* no need to copy it */
>         } else if (!anon_vma) {
>                 return folio;           /* no need to copy it */
>         } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
>                         anon_vma->root == vma->anon_vma->root) {
>                 return folio;           /* still no need to copy it */
>         }
>         ...
>
>         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
>         ...
>
> So I believe what I hit is at [1] it sees index doesn't match, then it
> decided to allocate a new folio.  In this case, it won't hit your BUG
> because it'll be "folio != swapcache" later, so it'll setup the
> folio->index for the new one, rather than the sanity check.

You're absolutely right.  The problem goes beyond just crashes; we're
also dealing with CoW when KSM is enabled. As long as we disable
KSM(which is true for Android), or when we are dealing with a large folio,
ksm_might_need_to_copy() will not allocate a new copy:

struct folio *ksm_might_need_to_copy(struct folio *folio,
                        struct vm_area_struct *vma, unsigned long addr)
{

        struct page *page = folio_page(folio, 0);
        struct anon_vma *anon_vma = folio_anon_vma(folio);
        struct folio *new_folio;

        if (folio_test_large(folio))
                return folio;
         ....
}

Thanks for your great findings! For the KSM-enabled and small folio case,
it's pretty funny how UFFDIO_MOVE finally turns into a new allocation and
copy— somehow automatically falling back to "UFFDIO_COPY" :-)
It's amusing, but debugging it is fun.

I'll add your findings to the changelog when I formally send v2, after gathering
all the code refinement suggestions and implementing the improvements.

>
> Do you know how your case got triggered, being able to bypass the above [1]
> which should check folio->index already?
>
> >
> > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > in this RFC.
> >
> > For small folios, there’s no split_folio issue, making it relatively
> > simpler. Lokesh
> > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > the first priority.
>
> Agreed.
>
> --
> Peter Xu
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-21  0:07             ` Barry Song
@ 2025-02-21  1:49               ` Peter Xu
  2025-02-22 21:31                 ` Barry Song
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-02-21  1:49 UTC (permalink / raw)
  To: Barry Song
  Cc: david, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, hughd, jannh, kaleshsingh, linux-kernel, linux-mm,
	lokeshgidra, mhocko, ngeoffray, rppt, ryan.roberts, shuah,
	surenb, v-songbaohua, viro, willy, zhangpeng362, zhengtangquan,
	yuzhao, stable

On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote:
> On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > > 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)
> >
> > Could you help explain what guarantees the rmap walk not happen on a
> > swapcache page?
> >
> > I'm not familiar with this path, though at least I see damon can start a
> > rmap walk on PageAnon almost with no locking..  some explanations would be
> > appreciated.
> 
> I am observing the following in folio_referenced(), which the anon_vma lock
> was originally intended to protect.
> 
>         if (!pra.mapcount)
>                 return 0;
> 
> I assume all other rmap walks should do the same?

Yes normally there'll be a folio_mapcount() check, however..

> 
> int folio_referenced(struct folio *folio, int is_locked,
>                      struct mem_cgroup *memcg, unsigned long *vm_flags)
> {
> 
>         bool we_locked = false;
>         struct folio_referenced_arg pra = {
>                 .mapcount = folio_mapcount(folio),
>                 .memcg = memcg,
>         };
> 
>         struct rmap_walk_control rwc = {
>                 .rmap_one = folio_referenced_one,
>                 .arg = (void *)&pra,
>                 .anon_lock = folio_lock_anon_vma_read,
>                 .try_lock = true,
>                 .invalid_vma = invalid_folio_referenced_vma,
>         };
> 
>         *vm_flags = 0;
>         if (!pra.mapcount)
>                 return 0;
>         ...
> }
> 
> By the way, since the folio has been under reclamation in this case and
> isn't in the lru, this should also prevent the rmap walk, right?

.. I'm not sure whether it's always working.

The thing is anon doesn't even require folio lock held during (1) checking
mapcount and (2) doing the rmap walk, in all similar cases as above.  I see
nothing blocks it from a concurrent thread zapping that last mapcount:

               thread 1                         thread 2
               --------                         --------
        [whatever scanner] 
           check folio_mapcount(), non-zero
                                                zap the last map.. then mapcount==0
           rmap_walk()

Not sure if I missed something.

The other thing is IIUC swapcache page can also have chance to be faulted
in but only if a read not write.  I actually had a feeling that your
reproducer triggered that exact path, causing a read swap in, reusing the
swapcache page, and hit the sanity check there somehow (even as mentioned
in the other reply, I don't yet know why the 1st check didn't seem to
work.. as we do check folio->index twice..).

Said that, I'm not sure if above concern will happen in this specific case,
as UIFFDIO_MOVE is pretty special, that we check exclusive bit first in swp
entry so we know it's definitely not mapped elsewhere, meanwhile if we hold
pgtable lock so maybe it can't get mapped back.. it is just still tricky,
at least we do some dances all over releasing and retaking locks.

We could either justify that's safe, or maybe still ok and simpler if we
could take anon_vma write lock, making sure nobody will be able to read the
folio->index when it's prone to an update.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-21  1:36           ` Barry Song
@ 2025-02-21  1:54             ` Peter Xu
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Xu @ 2025-02-21  1:54 UTC (permalink / raw)
  To: Barry Song
  Cc: Suren Baghdasaryan, Lokesh Gidra, linux-mm, akpm, linux-kernel,
	zhengtangquan, Barry Song, Andrea Arcangeli, Al Viro,
	Axel Rasmussen, Brian Geffon, Christian Brauner,
	David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh,
	Liam R . Howlett, Matthew Wilcox, Michal Hocko, Mike Rapoport,
	Nicolas Geoffray, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao

On Fri, Feb 21, 2025 at 02:36:27PM +1300, Barry Song wrote:
> On Fri, Feb 21, 2025 at 11:59 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> > > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > >
> > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > >
> > > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > > swap entry.
> > > > > > >
> > > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > > >   migration by setting:
> > > > > > >
> > > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > >
> > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > > >   the PTE to the new dst_addr.
> > > > > > >
> > > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > > cache.
> > > > > > >
> > > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > > can occur due to:
> > > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > > >
> > > > > > Thanks for the report and reproducer!
> > > > > >
> > > > > > >
> > > > > > > $./a.out > /dev/null
> > > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > > [   13.340969] Modules linked in:
> > > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > > [   13.343876] Call trace:
> > > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > > >
> > > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > > of folios as done in move_present_pte.
> > > > > >
> > > > > > How complex would that be? Is it a matter of adding
> > > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > > something more?
> > > > >
> > > > > My main concern is still with large folios that require a split_folio()
> > > > > during move_pages(), as the entire folio shares the same index and
> > > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > > making a split necessary.
> > > > >
> > > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > > >
> > > > >         if (folio_test_writeback(folio))
> > > > >                 return -EBUSY;
> > > > >
> > > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > > it simply returns -EBUSY:
> > > > >
> > > > > move_pages_pte()
> > > > > {
> > > > >                 /* at this point we have src_folio locked */
> > > > >                 if (folio_test_large(src_folio)) {
> > > > >                         /* split_folio() can block */
> > > > >                         pte_unmap(&orig_src_pte);
> > > > >                         pte_unmap(&orig_dst_pte);
> > > > >                         src_pte = dst_pte = NULL;
> > > > >                         err = split_folio(src_folio);
> > > > >                         if (err)
> > > > >                                 goto out;
> > > > >
> > > > >                         /* have to reacquire the folio after it got split */
> > > > >                         folio_unlock(src_folio);
> > > > >                         folio_put(src_folio);
> > > > >                         src_folio = NULL;
> > > > >                         goto retry;
> > > > >                 }
> > > > > }
> > > > >
> > > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > > >
> > > > Maybe no need in the first version to fix the immediate bug?
> > > >
> > > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > > happens for a short window when the folio was just added into swapcache.
> > > > MOVE can happen much later after that anytime before a swapin.  My
> > > > understanding is that's also what Matthew wanted to point out.  It may be
> > > > better justified of that in a separate change with some performance
> > > > measurements.
> > >
> > > The bug we’re discussing occurs precisely within the short window you
> > > mentioned.
> > >
> > > 1. add_to_swap: The folio is added to swapcache.
> > > 2. try_to_unmap: PTEs are converted to swap entries.
> > > 3. pageout
> > > 4. Swapcache is cleared.
> >
> > Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> > least that should be step 3.5, as IIUC "writeback" needs to be cleared
> > before "swapcache" bit being cleared.
> >
> > >
> > > The issue happens between steps 2 and 4, where the PTE is not present, but
> > > the folio is still in swapcache - the current code does move_swap_pte() but does
> > > not fixup folio->index within swapcache.
> >
> > One thing I'm still not clear here is why it's a race condition, rather
> > than more severe than that.  I mean, folio->index is definitely wrong, then
> > as long as the page still in swapcache, we should be able to move the swp
> > entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
> > the page in swapcache with the wrong folio->index already and trigger.
> >
> > I wrote a quick test like that, it actually won't trigger..
> >
> > I had a closer look in the code, I think it's because do_swap_page() has
> > the logic to detect folio->index matching first, and allocate a new folio
> > if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
> > ksm.. but it looks like it's functioning too here.
> >
> > ksm_might_need_to_copy:
> >         if (folio_test_ksm(folio)) {
> >                 if (folio_stable_node(folio) &&
> >                     !(ksm_run & KSM_RUN_UNMERGE))
> >                         return folio;   /* no need to copy it */
> >         } else if (!anon_vma) {
> >                 return folio;           /* no need to copy it */
> >         } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
> >                         anon_vma->root == vma->anon_vma->root) {
> >                 return folio;           /* still no need to copy it */
> >         }
> >         ...
> >
> >         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
> >         ...
> >
> > So I believe what I hit is at [1] it sees index doesn't match, then it
> > decided to allocate a new folio.  In this case, it won't hit your BUG
> > because it'll be "folio != swapcache" later, so it'll setup the
> > folio->index for the new one, rather than the sanity check.
> 
> You're absolutely right.  The problem goes beyond just crashes; we're
> also dealing with CoW when KSM is enabled. As long as we disable
> KSM(which is true for Android), or when we are dealing with a large folio,
> ksm_might_need_to_copy() will not allocate a new copy:

Ah!  That explains it..

> 
> struct folio *ksm_might_need_to_copy(struct folio *folio,
>                         struct vm_area_struct *vma, unsigned long addr)
> {
> 
>         struct page *page = folio_page(folio, 0);
>         struct anon_vma *anon_vma = folio_anon_vma(folio);
>         struct folio *new_folio;
> 
>         if (folio_test_large(folio))
>                 return folio;
>          ....
> }
> 
> Thanks for your great findings! For the KSM-enabled and small folio case,
> it's pretty funny how UFFDIO_MOVE finally turns into a new allocation and
> copy— somehow automatically falling back to "UFFDIO_COPY" :-)
> It's amusing, but debugging it is fun.
> 
> I'll add your findings to the changelog when I formally send v2, after gathering
> all the code refinement suggestions and implementing the improvements.

Thanks, that'll be helpful.

I wanted to try with !KSM build but it's pretty late today (and it's
company-wise PTO tomorrow..).  Just in case useful, this is the reproducer
I mentioned that didn't yet trigger when with KSM:

https://github.com/xzpeter/clibs/blob/master/uffd-test/uffd-move-bug.c

I'm not sure whether it'll also reproduce there, but there's chance it is a
simpler reproducer.

> 
> >
> > Do you know how your case got triggered, being able to bypass the above [1]
> > which should check folio->index already?
> >
> > >
> > > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > > in this RFC.
> > >
> > > For small folios, there’s no split_folio issue, making it relatively
> > > simpler. Lokesh
> > > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > > the first priority.
> >
> > Agreed.
> >
> > --
> > Peter Xu
> >
> 
> Thanks
> Barry
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-21  1:49               ` Peter Xu
@ 2025-02-22 21:31                 ` Barry Song
  2025-02-24 17:50                   ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-22 21:31 UTC (permalink / raw)
  To: Peter Xu
  Cc: david, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, hughd, jannh, kaleshsingh, linux-kernel, linux-mm,
	lokeshgidra, mhocko, ngeoffray, rppt, ryan.roberts, shuah,
	surenb, v-songbaohua, viro, willy, zhangpeng362, zhengtangquan,
	yuzhao, stable

On Fri, Feb 21, 2025 at 2:49 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote:
> > On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > > > 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)
> > >
> > > Could you help explain what guarantees the rmap walk not happen on a
> > > swapcache page?
> > >
> > > I'm not familiar with this path, though at least I see damon can start a
> > > rmap walk on PageAnon almost with no locking..  some explanations would be
> > > appreciated.
> >
> > I am observing the following in folio_referenced(), which the anon_vma lock
> > was originally intended to protect.
> >
> >         if (!pra.mapcount)
> >                 return 0;
> >
> > I assume all other rmap walks should do the same?
>
> Yes normally there'll be a folio_mapcount() check, however..
>
> >
> > int folio_referenced(struct folio *folio, int is_locked,
> >                      struct mem_cgroup *memcg, unsigned long *vm_flags)
> > {
> >
> >         bool we_locked = false;
> >         struct folio_referenced_arg pra = {
> >                 .mapcount = folio_mapcount(folio),
> >                 .memcg = memcg,
> >         };
> >
> >         struct rmap_walk_control rwc = {
> >                 .rmap_one = folio_referenced_one,
> >                 .arg = (void *)&pra,
> >                 .anon_lock = folio_lock_anon_vma_read,
> >                 .try_lock = true,
> >                 .invalid_vma = invalid_folio_referenced_vma,
> >         };
> >
> >         *vm_flags = 0;
> >         if (!pra.mapcount)
> >                 return 0;
> >         ...
> > }
> >
> > By the way, since the folio has been under reclamation in this case and
> > isn't in the lru, this should also prevent the rmap walk, right?
>
> .. I'm not sure whether it's always working.
>
> The thing is anon doesn't even require folio lock held during (1) checking
> mapcount and (2) doing the rmap walk, in all similar cases as above.  I see
> nothing blocks it from a concurrent thread zapping that last mapcount:
>
>                thread 1                         thread 2
>                --------                         --------
>         [whatever scanner]
>            check folio_mapcount(), non-zero
>                                                 zap the last map.. then mapcount==0
>            rmap_walk()
>
> Not sure if I missed something.
>
> The other thing is IIUC swapcache page can also have chance to be faulted
> in but only if a read not write.  I actually had a feeling that your
> reproducer triggered that exact path, causing a read swap in, reusing the
> swapcache page, and hit the sanity check there somehow (even as mentioned
> in the other reply, I don't yet know why the 1st check didn't seem to
> work.. as we do check folio->index twice..).
>
> Said that, I'm not sure if above concern will happen in this specific case,
> as UIFFDIO_MOVE is pretty special, that we check exclusive bit first in swp
> entry so we know it's definitely not mapped elsewhere, meanwhile if we hold
> pgtable lock so maybe it can't get mapped back.. it is just still tricky,
> at least we do some dances all over releasing and retaking locks.
>
> We could either justify that's safe, or maybe still ok and simpler if we
> could take anon_vma write lock, making sure nobody will be able to read the
> folio->index when it's prone to an update.

What prompted me to do the former is that folio_get_anon_vma() returns
NULL for an unmapped folio. As for the latter, we need to carefully evaluate
whether the change below is safe.

--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -505,7 +505,7 @@ struct anon_vma *folio_get_anon_vma(const struct
folio *folio)
        anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
        if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
                goto out;

-       if (!folio_mapped(folio))
+       if (!folio_mapped(folio) && !folio_test_swapcache(folio))
                goto out;

        anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
@@ -521,7 +521,7 @@ struct anon_vma *folio_get_anon_vma(const struct
folio *folio)
         * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
         * above cannot corrupt).
         */

-       if (!folio_mapped(folio)) {
+       if (!folio_mapped(folio) && !folio_test_swapcache(folio)) {
                rcu_read_unlock();
                put_anon_vma(anon_vma);
                return NULL;


The above change, combined with the change below, has also resolved the mTHP
-EBUSY issue.

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e5718835a964..1ef991b5c225 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1333,6 +1333,7 @@ static int move_pages_pte(struct mm_struct *mm,
pmd_t *dst_pmd, pmd_t *src_pmd,
                pte_unmap(&orig_src_pte);
                pte_unmap(&orig_dst_pte);
                src_pte = dst_pte = NULL;
+               folio_wait_writeback(src_folio);
                err = split_folio(src_folio);

                if (err)
                        goto out;
@@ -1343,7 +1344,7 @@ static int move_pages_pte(struct mm_struct *mm,
pmd_t *dst_pmd, pmd_t *src_pmd,
                goto retry;
        }

-       if (!src_anon_vma && pte_present(orig_src_pte)) {
+       if (!src_anon_vma) {
                /*
                 * folio_referenced walks the anon_vma chain
                 * without the folio lock. Serialize against it with


split_folio() returns -EBUSY if the folio is under writeback or if
folio_get_anon_vma() returns NULL.

I have no issues with the latter, provided the change in folio_get_anon_vma()
is safe, as it also resolves the mTHP -EBUSY issue.

We need to carefully consider the five places where folio_get_anon_vma() is
called, as this patch will also be backported to stable.

  1   2618  mm/huge_memory.c <<move_pages_huge_pmd>>
             src_anon_vma = folio_get_anon_vma(src_folio);

   2   3765  mm/huge_memory.c <<__folio_split>>
             anon_vma = folio_get_anon_vma(folio);

   3   1280  mm/migrate.c <<migrate_folio_unmap>>
             anon_vma = folio_get_anon_vma(src);

   4   1485  mm/migrate.c <<unmap_and_move_huge_page>>
             anon_vma = folio_get_anon_vma(src);

   5   1354  mm/userfaultfd.c <<move_pages_pte>>
             src_anon_vma = folio_get_anon_vma(src_folio);

>
> Thanks,
>
> --
> Peter Xu
>

Thanks
barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-22 21:31                 ` Barry Song
@ 2025-02-24 17:50                   ` Peter Xu
  2025-02-24 18:03                     ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-02-24 17:50 UTC (permalink / raw)
  To: Barry Song
  Cc: david, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, hughd, jannh, kaleshsingh, linux-kernel, linux-mm,
	lokeshgidra, mhocko, ngeoffray, rppt, ryan.roberts, shuah,
	surenb, v-songbaohua, viro, willy, zhangpeng362, zhengtangquan,
	yuzhao, stable

On Sun, Feb 23, 2025 at 10:31:37AM +1300, Barry Song wrote:
> On Fri, Feb 21, 2025 at 2:49 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote:
> > > On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
> > > > > 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)
> > > >
> > > > Could you help explain what guarantees the rmap walk not happen on a
> > > > swapcache page?
> > > >
> > > > I'm not familiar with this path, though at least I see damon can start a
> > > > rmap walk on PageAnon almost with no locking..  some explanations would be
> > > > appreciated.
> > >
> > > I am observing the following in folio_referenced(), which the anon_vma lock
> > > was originally intended to protect.
> > >
> > >         if (!pra.mapcount)
> > >                 return 0;
> > >
> > > I assume all other rmap walks should do the same?
> >
> > Yes normally there'll be a folio_mapcount() check, however..
> >
> > >
> > > int folio_referenced(struct folio *folio, int is_locked,
> > >                      struct mem_cgroup *memcg, unsigned long *vm_flags)
> > > {
> > >
> > >         bool we_locked = false;
> > >         struct folio_referenced_arg pra = {
> > >                 .mapcount = folio_mapcount(folio),
> > >                 .memcg = memcg,
> > >         };
> > >
> > >         struct rmap_walk_control rwc = {
> > >                 .rmap_one = folio_referenced_one,
> > >                 .arg = (void *)&pra,
> > >                 .anon_lock = folio_lock_anon_vma_read,
> > >                 .try_lock = true,
> > >                 .invalid_vma = invalid_folio_referenced_vma,
> > >         };
> > >
> > >         *vm_flags = 0;
> > >         if (!pra.mapcount)
> > >                 return 0;
> > >         ...
> > > }
> > >
> > > By the way, since the folio has been under reclamation in this case and
> > > isn't in the lru, this should also prevent the rmap walk, right?
> >
> > .. I'm not sure whether it's always working.
> >
> > The thing is anon doesn't even require folio lock held during (1) checking
> > mapcount and (2) doing the rmap walk, in all similar cases as above.  I see
> > nothing blocks it from a concurrent thread zapping that last mapcount:
> >
> >                thread 1                         thread 2
> >                --------                         --------
> >         [whatever scanner]
> >            check folio_mapcount(), non-zero
> >                                                 zap the last map.. then mapcount==0
> >            rmap_walk()
> >
> > Not sure if I missed something.
> >
> > The other thing is IIUC swapcache page can also have chance to be faulted
> > in but only if a read not write.  I actually had a feeling that your
> > reproducer triggered that exact path, causing a read swap in, reusing the
> > swapcache page, and hit the sanity check there somehow (even as mentioned
> > in the other reply, I don't yet know why the 1st check didn't seem to
> > work.. as we do check folio->index twice..).
> >
> > Said that, I'm not sure if above concern will happen in this specific case,
> > as UIFFDIO_MOVE is pretty special, that we check exclusive bit first in swp
> > entry so we know it's definitely not mapped elsewhere, meanwhile if we hold
> > pgtable lock so maybe it can't get mapped back.. it is just still tricky,
> > at least we do some dances all over releasing and retaking locks.
> >
> > We could either justify that's safe, or maybe still ok and simpler if we
> > could take anon_vma write lock, making sure nobody will be able to read the
> > folio->index when it's prone to an update.
> 
> What prompted me to do the former is that folio_get_anon_vma() returns
> NULL for an unmapped folio. As for the latter, we need to carefully evaluate
> whether the change below is safe.
> 
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -505,7 +505,7 @@ struct anon_vma *folio_get_anon_vma(const struct
> folio *folio)
>         anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>         if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 goto out;
> 
> -       if (!folio_mapped(folio))
> +       if (!folio_mapped(folio) && !folio_test_swapcache(folio))
>                 goto out;
> 
>         anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
> @@ -521,7 +521,7 @@ struct anon_vma *folio_get_anon_vma(const struct
> folio *folio)
>          * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
>          * above cannot corrupt).
>          */

[1]

> 
> -       if (!folio_mapped(folio)) {
> +       if (!folio_mapped(folio) && !folio_test_swapcache(folio)) {
>                 rcu_read_unlock();
>                 put_anon_vma(anon_vma);
>                 return NULL;

Hmm, this let me go back read again on how we manage anon_vma lifespan,
then I just noticed this may not work.

See the comment right above [1], here's a full version:

	/*
	 * If this folio is still mapped, then its anon_vma cannot have been
	 * freed.  But if it has been unmapped, we have no security against the
	 * anon_vma structure being freed and reused (for another anon_vma:
	 * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
	 * above cannot corrupt).
	 */

So afaiu that means we pretty much very rely upon folio_mapped() check to
make sure anon_vma being valid at all that we fetched from folio->mapping,
not to mention the rmap walk later afterwards.

Then above diff in folio_get_anon_vma() should be problematic, as when
"folio_mapped()==false && folio_test_swapcache()==true", above change will
start to return anon_vma pointer even if the anon_vma could have been freed
and reused by other VMAs.

> 
> 
> The above change, combined with the change below, has also resolved the mTHP
> -EBUSY issue.
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e5718835a964..1ef991b5c225 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1333,6 +1333,7 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>                 pte_unmap(&orig_src_pte);
>                 pte_unmap(&orig_dst_pte);
>                 src_pte = dst_pte = NULL;
> +               folio_wait_writeback(src_folio);
>                 err = split_folio(src_folio);
> 
>                 if (err)
>                         goto out;
> @@ -1343,7 +1344,7 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
>                 goto retry;
>         }
> 
> -       if (!src_anon_vma && pte_present(orig_src_pte)) {
> +       if (!src_anon_vma) {
>                 /*
>                  * folio_referenced walks the anon_vma chain
>                  * without the folio lock. Serialize against it with
> 
> 
> split_folio() returns -EBUSY if the folio is under writeback or if
> folio_get_anon_vma() returns NULL.
> 
> I have no issues with the latter, provided the change in folio_get_anon_vma()
> is safe, as it also resolves the mTHP -EBUSY issue.
> 
> We need to carefully consider the five places where folio_get_anon_vma() is
> called, as this patch will also be backported to stable.
> 
>   1   2618  mm/huge_memory.c <<move_pages_huge_pmd>>
>              src_anon_vma = folio_get_anon_vma(src_folio);
> 
>    2   3765  mm/huge_memory.c <<__folio_split>>
>              anon_vma = folio_get_anon_vma(folio);
> 
>    3   1280  mm/migrate.c <<migrate_folio_unmap>>
>              anon_vma = folio_get_anon_vma(src);
> 
>    4   1485  mm/migrate.c <<unmap_and_move_huge_page>>
>              anon_vma = folio_get_anon_vma(src);
> 
>    5   1354  mm/userfaultfd.c <<move_pages_pte>>
>              src_anon_vma = folio_get_anon_vma(src_folio);

If my above understanding is correct, we may indeed need alternative plans.
Not sure whether others have thoughts, but two things I am thinking out
loud..

  - Justify rmap on the swap cache folio not possible: I think if
    folio_mapped() is required for any anon rmap walk (which I didn't
    notice previously..), and we know this is exclusive swap entry and
    keeps true (so nobody will be able to race and make the swapcache
    mapped during the whole process), maybe we are able to justify any rmap
    won't happen at all, because they should fail at folio_mapped() check.
    Then we update the folio->mapping & folio->index directly without
    anon_vma locking.  A rich comment would be helpful in this case..

  - I wonder if it's possible we free the swap cache if it's already
    writeback to backend storage and clean.  Then moving the swp entry
    alone looks safe when without the swapcache present.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-24 17:50                   ` Peter Xu
@ 2025-02-24 18:03                     ` David Hildenbrand
  0 siblings, 0 replies; 47+ messages in thread
From: David Hildenbrand @ 2025-02-24 18:03 UTC (permalink / raw)
  To: Peter Xu, Barry Song
  Cc: Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon, brauner,
	hughd, jannh, kaleshsingh, linux-kernel, linux-mm, lokeshgidra,
	mhocko, ngeoffray, rppt, ryan.roberts, shuah, surenb,
	v-songbaohua, viro, willy, zhangpeng362, zhengtangquan, yuzhao,
	stable

On 24.02.25 18:50, Peter Xu wrote:
> On Sun, Feb 23, 2025 at 10:31:37AM +1300, Barry Song wrote:
>> On Fri, Feb 21, 2025 at 2:49 PM Peter Xu <peterx@redhat.com> wrote:
>>>
>>> On Fri, Feb 21, 2025 at 01:07:24PM +1300, Barry Song wrote:
>>>> On Fri, Feb 21, 2025 at 12:32 PM Peter Xu <peterx@redhat.com> wrote:
>>>>>
>>>>> On Thu, Feb 20, 2025 at 10:21:01PM +1300, Barry Song wrote:
>>>>>> 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)
>>>>>
>>>>> Could you help explain what guarantees the rmap walk not happen on a
>>>>> swapcache page?
>>>>>
>>>>> I'm not familiar with this path, though at least I see damon can start a
>>>>> rmap walk on PageAnon almost with no locking..  some explanations would be
>>>>> appreciated.
>>>>
>>>> I am observing the following in folio_referenced(), which the anon_vma lock
>>>> was originally intended to protect.
>>>>
>>>>          if (!pra.mapcount)
>>>>                  return 0;
>>>>
>>>> I assume all other rmap walks should do the same?
>>>
>>> Yes normally there'll be a folio_mapcount() check, however..
>>>
>>>>
>>>> int folio_referenced(struct folio *folio, int is_locked,
>>>>                       struct mem_cgroup *memcg, unsigned long *vm_flags)
>>>> {
>>>>
>>>>          bool we_locked = false;
>>>>          struct folio_referenced_arg pra = {
>>>>                  .mapcount = folio_mapcount(folio),
>>>>                  .memcg = memcg,
>>>>          };
>>>>
>>>>          struct rmap_walk_control rwc = {
>>>>                  .rmap_one = folio_referenced_one,
>>>>                  .arg = (void *)&pra,
>>>>                  .anon_lock = folio_lock_anon_vma_read,
>>>>                  .try_lock = true,
>>>>                  .invalid_vma = invalid_folio_referenced_vma,
>>>>          };
>>>>
>>>>          *vm_flags = 0;
>>>>          if (!pra.mapcount)
>>>>                  return 0;
>>>>          ...
>>>> }
>>>>
>>>> By the way, since the folio has been under reclamation in this case and
>>>> isn't in the lru, this should also prevent the rmap walk, right?
>>>
>>> .. I'm not sure whether it's always working.
>>>
>>> The thing is anon doesn't even require folio lock held during (1) checking
>>> mapcount and (2) doing the rmap walk, in all similar cases as above.  I see
>>> nothing blocks it from a concurrent thread zapping that last mapcount:
>>>
>>>                 thread 1                         thread 2
>>>                 --------                         --------
>>>          [whatever scanner]
>>>             check folio_mapcount(), non-zero
>>>                                                  zap the last map.. then mapcount==0
>>>             rmap_walk()
>>>
>>> Not sure if I missed something.
>>>
>>> The other thing is IIUC swapcache page can also have chance to be faulted
>>> in but only if a read not write.  I actually had a feeling that your
>>> reproducer triggered that exact path, causing a read swap in, reusing the
>>> swapcache page, and hit the sanity check there somehow (even as mentioned
>>> in the other reply, I don't yet know why the 1st check didn't seem to
>>> work.. as we do check folio->index twice..).
>>>
>>> Said that, I'm not sure if above concern will happen in this specific case,
>>> as UIFFDIO_MOVE is pretty special, that we check exclusive bit first in swp
>>> entry so we know it's definitely not mapped elsewhere, meanwhile if we hold
>>> pgtable lock so maybe it can't get mapped back.. it is just still tricky,
>>> at least we do some dances all over releasing and retaking locks.
>>>
>>> We could either justify that's safe, or maybe still ok and simpler if we
>>> could take anon_vma write lock, making sure nobody will be able to read the
>>> folio->index when it's prone to an update.
>>
>> What prompted me to do the former is that folio_get_anon_vma() returns
>> NULL for an unmapped folio. As for the latter, we need to carefully evaluate
>> whether the change below is safe.
>>
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -505,7 +505,7 @@ struct anon_vma *folio_get_anon_vma(const struct
>> folio *folio)
>>          anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
>>          if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>>                  goto out;
>>
>> -       if (!folio_mapped(folio))
>> +       if (!folio_mapped(folio) && !folio_test_swapcache(folio))
>>                  goto out;
>>
>>          anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>> @@ -521,7 +521,7 @@ struct anon_vma *folio_get_anon_vma(const struct
>> folio *folio)
>>           * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
>>           * above cannot corrupt).
>>           */
> 
> [1]
> 
>>
>> -       if (!folio_mapped(folio)) {
>> +       if (!folio_mapped(folio) && !folio_test_swapcache(folio)) {
>>                  rcu_read_unlock();
>>                  put_anon_vma(anon_vma);
>>                  return NULL;
> 
> Hmm, this let me go back read again on how we manage anon_vma lifespan,
> then I just noticed this may not work.
> 
> See the comment right above [1], here's a full version:
> 
> 	/*
> 	 * If this folio is still mapped, then its anon_vma cannot have been
> 	 * freed.  But if it has been unmapped, we have no security against the
> 	 * anon_vma structure being freed and reused (for another anon_vma:
> 	 * SLAB_TYPESAFE_BY_RCU guarantees that - so the atomic_inc_not_zero()
> 	 * above cannot corrupt).
> 	 */
> 
> So afaiu that means we pretty much very rely upon folio_mapped() check to
> make sure anon_vma being valid at all that we fetched from folio->mapping,
> not to mention the rmap walk later afterwards.
> 
> Then above diff in folio_get_anon_vma() should be problematic, as when
> "folio_mapped()==false && folio_test_swapcache()==true", above change will
> start to return anon_vma pointer even if the anon_vma could have been freed
> and reused by other VMAs.

When splitting a folio, we use folio_get_anon_vma(). That seems to work 
as long as we have the folio locked.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-21  0:36               ` Suren Baghdasaryan
@ 2025-02-25 11:05                 ` Barry Song
  2025-02-25 15:34                   ` Peter Xu
  0 siblings, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-25 11:05 UTC (permalink / raw)
  To: surenb
  Cc: 21cnbao, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, david, hughd, jannh, kaleshsingh, linux-kernel,
	linux-mm, lokeshgidra, mhocko, ngeoffray, peterx, rppt,
	ryan.roberts, shuah, v-songbaohua, viro, willy, yuzhao,
	zhangpeng362, zhengtangquan

On Fri, Feb 21, 2025 at 1:36 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Feb 20, 2025 at 3:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 3:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote:
> > > >
> > > > On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> > > > > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > > > >
> > > > > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > > > > swap entry.
> > > > > > > > >
> > > > > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > > > > >   migration by setting:
> > > > > > > > >
> > > > > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > > > >
> > > > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > > > > >   the PTE to the new dst_addr.
> > > > > > > > >
> > > > > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > > > > cache.
> > > > > > > > >
> > > > > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > > > > can occur due to:
> > > > > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > > > > >
> > > > > > > > Thanks for the report and reproducer!
> > > > > > > >
> > > > > > > > >
> > > > > > > > > $./a.out > /dev/null
> > > > > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > > > > [   13.340969] Modules linked in:
> > > > > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > > > > [   13.343876] Call trace:
> > > > > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > > > > >
> > > > > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > > > > of folios as done in move_present_pte.
> > > > > > > >
> > > > > > > > How complex would that be? Is it a matter of adding
> > > > > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > > > > something more?
> > > > > > >
> > > > > > > My main concern is still with large folios that require a split_folio()
> > > > > > > during move_pages(), as the entire folio shares the same index and
> > > > > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > > > > making a split necessary.
> > > > > > >
> > > > > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > > > > >
> > > > > > >         if (folio_test_writeback(folio))
> > > > > > >                 return -EBUSY;
> > > > > > >
> > > > > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > > > > it simply returns -EBUSY:
> > > > > > >
> > > > > > > move_pages_pte()
> > > > > > > {
> > > > > > >                 /* at this point we have src_folio locked */
> > > > > > >                 if (folio_test_large(src_folio)) {
> > > > > > >                         /* split_folio() can block */
> > > > > > >                         pte_unmap(&orig_src_pte);
> > > > > > >                         pte_unmap(&orig_dst_pte);
> > > > > > >                         src_pte = dst_pte = NULL;
> > > > > > >                         err = split_folio(src_folio);
> > > > > > >                         if (err)
> > > > > > >                                 goto out;
> > > > > > >
> > > > > > >                         /* have to reacquire the folio after it got split */
> > > > > > >                         folio_unlock(src_folio);
> > > > > > >                         folio_put(src_folio);
> > > > > > >                         src_folio = NULL;
> > > > > > >                         goto retry;
> > > > > > >                 }
> > > > > > > }
> > > > > > >
> > > > > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > > > > >
> > > > > > Maybe no need in the first version to fix the immediate bug?
> > > > > >
> > > > > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > > > > happens for a short window when the folio was just added into swapcache.
> > > > > > MOVE can happen much later after that anytime before a swapin.  My
> > > > > > understanding is that's also what Matthew wanted to point out.  It may be
> > > > > > better justified of that in a separate change with some performance
> > > > > > measurements.
> > > > >
> > > > > The bug we’re discussing occurs precisely within the short window you
> > > > > mentioned.
> > > > >
> > > > > 1. add_to_swap: The folio is added to swapcache.
> > > > > 2. try_to_unmap: PTEs are converted to swap entries.
> > > > > 3. pageout
> > > > > 4. Swapcache is cleared.
> > > >
> > > > Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> > > > least that should be step 3.5, as IIUC "writeback" needs to be cleared
> > > > before "swapcache" bit being cleared.
> > > >
> > > > >
> > > > > The issue happens between steps 2 and 4, where the PTE is not present, but
> > > > > the folio is still in swapcache - the current code does move_swap_pte() but does
> > > > > not fixup folio->index within swapcache.
> > > >
> > > > One thing I'm still not clear here is why it's a race condition, rather
> > > > than more severe than that.  I mean, folio->index is definitely wrong, then
> > > > as long as the page still in swapcache, we should be able to move the swp
> > > > entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
> > > > the page in swapcache with the wrong folio->index already and trigger.
> > > >
> > > > I wrote a quick test like that, it actually won't trigger..
> > > >
> > > > I had a closer look in the code, I think it's because do_swap_page() has
> > > > the logic to detect folio->index matching first, and allocate a new folio
> > > > if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
> > > > ksm.. but it looks like it's functioning too here.
> > > >
> > > > ksm_might_need_to_copy:
> > > >         if (folio_test_ksm(folio)) {
> > > >                 if (folio_stable_node(folio) &&
> > > >                     !(ksm_run & KSM_RUN_UNMERGE))
> > > >                         return folio;   /* no need to copy it */
> > > >         } else if (!anon_vma) {
> > > >                 return folio;           /* no need to copy it */
> > > >         } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
> > > >                         anon_vma->root == vma->anon_vma->root) {
> > > >                 return folio;           /* still no need to copy it */
> > > >         }
> > > >         ...
> > > >
> > > >         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
> > > >         ...
> > > >
> > > > So I believe what I hit is at [1] it sees index doesn't match, then it
> > > > decided to allocate a new folio.  In this case, it won't hit your BUG
> > > > because it'll be "folio != swapcache" later, so it'll setup the
> > > > folio->index for the new one, rather than the sanity check.
> > > >
> > > > Do you know how your case got triggered, being able to bypass the above [1]
> > > > which should check folio->index already?
> > >
> > > To understand the change I tried applying the proposed patch to both
> > > mm-unstable and Linus' ToT and got conflicts for both trees. Barry,
> > > which baseline are you using?
> >
> > Oops, never mind. My mistake. Copying from the email messed up tabs...
> > It applies cleanly.
>
> Overall the code seems correct to me, however the new code has quite
> complex logical structure IMO. Original simplified code structure is
> like this:
>
> if (pte_present(orig_src_pte)) {
>         if (is_zero_pfn) {
>                 move_zeropage_pte()
>                 return
>         }
>         // pin and lock src_folio
>         spin_lock(src_ptl)
>         folio_get(folio)
>         folio_trylock(folio)
>         if (folio_test_large(src_folio))
>                 split_folio(src_folio)
>         anon_vma_trylock_write(src_anon_vma)
>         move_present_pte()
> } else {
>         if (non_swap_entry(entry))
>                 if (is_migration_entry(entry))
>                         handle migration entry
>         else
>                 move_swap_pte()
> }
>
> The new structure looks like this:
>
> if (!pte_present(orig_src_pte)) {
>         if (is_migration_entry(entry)) {
>                 handle migration entry
>                 return
>        }
>         if (!non_swap_entry() ||  !pte_swp_exclusive())
>                 return
>         si = get_swap_device(entry);
> }
> if (pte_present(orig_src_pte) && is_zero_pfn(pte_pfn(orig_src_pte)))
>         move_zeropage_pte()
>         return
> }
> pin and lock src_folio
>         spin_lock(src_ptl)
>         if (pte_present(orig_src_pte))
>                 folio_get(folio)
>         else {
>                 folio = filemap_get_folio(swap_entry)
>                 if (IS_ERR(folio))
>                         move_swap_pte()
>                         return
>                 }
>         }
>         folio_trylock(folio)
> if (folio_test_large(src_folio))
>         split_folio(src_folio)
> if (pte_present(orig_src_pte))
>         anon_vma_trylock_write(src_anon_vma)
> move_pte_and_folio()
>
> This looks more complex and harder to follow. Might be the reason
> David was not in favour of treating swapcache and present pages in the
> same path. And now I would agree that refactoring some common parts
> and not breaking the original structure might be cleaner.

Exactly, that’s the cost we’re facing in trying to share the code path
for swap and present PTEs.

I tried to extract some common functions for present PTE and swap entries,
but I found too many detailed differences and variants. This made the common
function overly complex, turning it into a real "monster." As a result, I
don't think this approach would make the code any more readable or cleaner.

After trying a couple of times, I feel the following is somehow more
readable:
(Lokesh is eager for the small folios fixes to be merged without further
delay. So, I'd prefer to return -EBUSY for large folios in the hotfixes
and handle the mTHP -EBUSY issue in a separate patch later.)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 867898c4e30b..eed9286ec1f3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -18,6 +18,7 @@
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include "internal.h"
+#include "swap.h"
 
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
@@ -1072,15 +1073,15 @@ static int move_present_pte(struct mm_struct *mm,
 	return err;
 }
 
-static int move_swap_pte(struct mm_struct *mm,
+static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 			 unsigned long dst_addr, unsigned long src_addr,
 			 pte_t *dst_pte, pte_t *src_pte,
 			 pte_t orig_dst_pte, pte_t orig_src_pte,
 			 pmd_t *dst_pmd, pmd_t dst_pmdval,
-			 spinlock_t *dst_ptl, spinlock_t *src_ptl)
+			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
+			 struct folio *src_folio)
 {
-	if (!pte_swp_exclusive(orig_src_pte))
-		return -EBUSY;
+	int err = 0;
 
 	double_pt_lock(dst_ptl, src_ptl);
 
@@ -1090,11 +1091,22 @@ static int move_swap_pte(struct mm_struct *mm,
 		return -EAGAIN;
 	}
 
+	if (src_folio) {
+		/* Folio got pinned from under us. Put it back and fail the move. */
+		if (folio_maybe_dma_pinned(src_folio)) {
+			err = -EBUSY;
+			goto out;
+		}
+		folio_move_anon_rmap(src_folio, dst_vma);
+		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	}
+
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
 	set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
-	double_pt_unlock(dst_ptl, src_ptl);
 
-	return 0;
+out:
+	double_pt_unlock(dst_ptl, src_ptl);
+	return err;
 }
 
 static int move_zeropage_pte(struct mm_struct *mm,
@@ -1137,6 +1149,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			  __u64 mode)
 {
 	swp_entry_t entry;
+	struct swap_info_struct *si = NULL;
 	pte_t orig_src_pte, orig_dst_pte;
 	pte_t src_folio_pte;
 	spinlock_t *src_ptl, *dst_ptl;
@@ -1318,6 +1331,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 				       orig_dst_pte, orig_src_pte, dst_pmd,
 				       dst_pmdval, dst_ptl, src_ptl, src_folio);
 	} else {
+		struct folio *folio = NULL;
+
 		entry = pte_to_swp_entry(orig_src_pte);
 		if (non_swap_entry(entry)) {
 			if (is_migration_entry(entry)) {
@@ -1331,9 +1346,47 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			goto out;
 		}
 
-		err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte,
-				    orig_dst_pte, orig_src_pte, dst_pmd,
-				    dst_pmdval, dst_ptl, src_ptl);
+		if (!pte_swp_exclusive(orig_src_pte)) {
+			err = -EBUSY;
+			goto out;
+		}
+
+		si = get_swap_device(entry);
+		if (unlikely(!si)) {
+			err = -EAGAIN;
+			goto out;
+		}
+                /*
+                 * Check if swapcache exists. If it does, the folio must be
+                 * moved even if the PTE is a swap entry. For large folios,
+                 * we directly return -EBUSY, as split_folio() currently
+                 * also returns -EBUSY when attempting to split unmapped
+                 * large folios in the swapcache. This needs to be fixed
+                 * to allow proper handling.
+                 */
+		if (!src_folio)
+			folio = filemap_get_folio(swap_address_space(entry),
+					swap_cache_index(entry));
+		if (!IS_ERR_OR_NULL(folio)) {
+			if (folio_test_large(folio)) {
+				err = -EBUSY;
+				folio_put(folio);
+				goto out;
+			}
+			src_folio = folio;
+			if (!folio_trylock(src_folio)) {
+				pte_unmap(&orig_src_pte);
+				pte_unmap(&orig_dst_pte);
+				src_pte = dst_pte = NULL;
+				/* now we can block and wait */
+				folio_lock(src_folio);
+				si = NULL;
+				goto retry;
+			}
+		}
+		err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
+				orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
+				dst_ptl, src_ptl, src_folio);
 	}
 
 out:
@@ -1350,6 +1403,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	if (src_pte)
 		pte_unmap(src_pte);
 	mmu_notifier_invalidate_range_end(&range);
+	if (si)
+		put_swap_device(si);
 
 	return err;
 }

If there are no objections, I'll send v2 tomorrow with the above code.
12:04 AM, Time to get some sleep now! :-)


>
> >
> > >
> > > >
> > > > >
> > > > > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > > > > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > > > > in this RFC.
> > > > >
> > > > > For small folios, there’s no split_folio issue, making it relatively
> > > > > simpler. Lokesh
> > > > > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > > > > the first priority.
> > > >
> > > > Agreed.
> > > >
> > > > --
> > > > Peter Xu
> > > >

Thanks
Barry




^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-25 11:05                 ` Barry Song
@ 2025-02-25 15:34                   ` Peter Xu
  2025-02-25 17:02                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Xu @ 2025-02-25 15:34 UTC (permalink / raw)
  To: Barry Song
  Cc: surenb, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, david, hughd, jannh, kaleshsingh, linux-kernel,
	linux-mm, lokeshgidra, mhocko, ngeoffray, rppt, ryan.roberts,
	shuah, v-songbaohua, viro, willy, yuzhao, zhangpeng362,
	zhengtangquan

On Wed, Feb 26, 2025 at 12:05:25AM +1300, Barry Song wrote:
> On Fri, Feb 21, 2025 at 1:36 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Feb 20, 2025 at 3:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 3:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote:
> > > > >
> > > > > On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> > > > > > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > >
> > > > > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > > > > >
> > > > > > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > > > > > swap entry.
> > > > > > > > > >
> > > > > > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > > > > > >   migration by setting:
> > > > > > > > > >
> > > > > > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > > > > >
> > > > > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > > > > > >   the PTE to the new dst_addr.
> > > > > > > > > >
> > > > > > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > > > > > cache.
> > > > > > > > > >
> > > > > > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > > > > > can occur due to:
> > > > > > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > > > > > >
> > > > > > > > > Thanks for the report and reproducer!
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > $./a.out > /dev/null
> > > > > > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > > > > > [   13.340969] Modules linked in:
> > > > > > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > > > > > [   13.343876] Call trace:
> > > > > > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > > > > > >
> > > > > > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > > > > > of folios as done in move_present_pte.
> > > > > > > > >
> > > > > > > > > How complex would that be? Is it a matter of adding
> > > > > > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > > > > > something more?
> > > > > > > >
> > > > > > > > My main concern is still with large folios that require a split_folio()
> > > > > > > > during move_pages(), as the entire folio shares the same index and
> > > > > > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > > > > > making a split necessary.
> > > > > > > >
> > > > > > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > > > > > >
> > > > > > > >         if (folio_test_writeback(folio))
> > > > > > > >                 return -EBUSY;
> > > > > > > >
> > > > > > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > > > > > it simply returns -EBUSY:
> > > > > > > >
> > > > > > > > move_pages_pte()
> > > > > > > > {
> > > > > > > >                 /* at this point we have src_folio locked */
> > > > > > > >                 if (folio_test_large(src_folio)) {
> > > > > > > >                         /* split_folio() can block */
> > > > > > > >                         pte_unmap(&orig_src_pte);
> > > > > > > >                         pte_unmap(&orig_dst_pte);
> > > > > > > >                         src_pte = dst_pte = NULL;
> > > > > > > >                         err = split_folio(src_folio);
> > > > > > > >                         if (err)
> > > > > > > >                                 goto out;
> > > > > > > >
> > > > > > > >                         /* have to reacquire the folio after it got split */
> > > > > > > >                         folio_unlock(src_folio);
> > > > > > > >                         folio_put(src_folio);
> > > > > > > >                         src_folio = NULL;
> > > > > > > >                         goto retry;
> > > > > > > >                 }
> > > > > > > > }
> > > > > > > >
> > > > > > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > > > > > >
> > > > > > > Maybe no need in the first version to fix the immediate bug?
> > > > > > >
> > > > > > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > > > > > happens for a short window when the folio was just added into swapcache.
> > > > > > > MOVE can happen much later after that anytime before a swapin.  My
> > > > > > > understanding is that's also what Matthew wanted to point out.  It may be
> > > > > > > better justified of that in a separate change with some performance
> > > > > > > measurements.
> > > > > >
> > > > > > The bug we’re discussing occurs precisely within the short window you
> > > > > > mentioned.
> > > > > >
> > > > > > 1. add_to_swap: The folio is added to swapcache.
> > > > > > 2. try_to_unmap: PTEs are converted to swap entries.
> > > > > > 3. pageout
> > > > > > 4. Swapcache is cleared.
> > > > >
> > > > > Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> > > > > least that should be step 3.5, as IIUC "writeback" needs to be cleared
> > > > > before "swapcache" bit being cleared.
> > > > >
> > > > > >
> > > > > > The issue happens between steps 2 and 4, where the PTE is not present, but
> > > > > > the folio is still in swapcache - the current code does move_swap_pte() but does
> > > > > > not fixup folio->index within swapcache.
> > > > >
> > > > > One thing I'm still not clear here is why it's a race condition, rather
> > > > > than more severe than that.  I mean, folio->index is definitely wrong, then
> > > > > as long as the page still in swapcache, we should be able to move the swp
> > > > > entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
> > > > > the page in swapcache with the wrong folio->index already and trigger.
> > > > >
> > > > > I wrote a quick test like that, it actually won't trigger..
> > > > >
> > > > > I had a closer look in the code, I think it's because do_swap_page() has
> > > > > the logic to detect folio->index matching first, and allocate a new folio
> > > > > if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
> > > > > ksm.. but it looks like it's functioning too here.
> > > > >
> > > > > ksm_might_need_to_copy:
> > > > >         if (folio_test_ksm(folio)) {
> > > > >                 if (folio_stable_node(folio) &&
> > > > >                     !(ksm_run & KSM_RUN_UNMERGE))
> > > > >                         return folio;   /* no need to copy it */
> > > > >         } else if (!anon_vma) {
> > > > >                 return folio;           /* no need to copy it */
> > > > >         } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
> > > > >                         anon_vma->root == vma->anon_vma->root) {
> > > > >                 return folio;           /* still no need to copy it */
> > > > >         }
> > > > >         ...
> > > > >
> > > > >         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
> > > > >         ...
> > > > >
> > > > > So I believe what I hit is at [1] it sees index doesn't match, then it
> > > > > decided to allocate a new folio.  In this case, it won't hit your BUG
> > > > > because it'll be "folio != swapcache" later, so it'll setup the
> > > > > folio->index for the new one, rather than the sanity check.
> > > > >
> > > > > Do you know how your case got triggered, being able to bypass the above [1]
> > > > > which should check folio->index already?
> > > >
> > > > To understand the change I tried applying the proposed patch to both
> > > > mm-unstable and Linus' ToT and got conflicts for both trees. Barry,
> > > > which baseline are you using?
> > >
> > > Oops, never mind. My mistake. Copying from the email messed up tabs...
> > > It applies cleanly.
> >
> > Overall the code seems correct to me, however the new code has quite
> > complex logical structure IMO. Original simplified code structure is
> > like this:
> >
> > if (pte_present(orig_src_pte)) {
> >         if (is_zero_pfn) {
> >                 move_zeropage_pte()
> >                 return
> >         }
> >         // pin and lock src_folio
> >         spin_lock(src_ptl)
> >         folio_get(folio)
> >         folio_trylock(folio)
> >         if (folio_test_large(src_folio))
> >                 split_folio(src_folio)
> >         anon_vma_trylock_write(src_anon_vma)
> >         move_present_pte()
> > } else {
> >         if (non_swap_entry(entry))
> >                 if (is_migration_entry(entry))
> >                         handle migration entry
> >         else
> >                 move_swap_pte()
> > }
> >
> > The new structure looks like this:
> >
> > if (!pte_present(orig_src_pte)) {
> >         if (is_migration_entry(entry)) {
> >                 handle migration entry
> >                 return
> >        }
> >         if (!non_swap_entry() ||  !pte_swp_exclusive())
> >                 return
> >         si = get_swap_device(entry);
> > }
> > if (pte_present(orig_src_pte) && is_zero_pfn(pte_pfn(orig_src_pte)))
> >         move_zeropage_pte()
> >         return
> > }
> > pin and lock src_folio
> >         spin_lock(src_ptl)
> >         if (pte_present(orig_src_pte))
> >                 folio_get(folio)
> >         else {
> >                 folio = filemap_get_folio(swap_entry)
> >                 if (IS_ERR(folio))
> >                         move_swap_pte()
> >                         return
> >                 }
> >         }
> >         folio_trylock(folio)
> > if (folio_test_large(src_folio))
> >         split_folio(src_folio)
> > if (pte_present(orig_src_pte))
> >         anon_vma_trylock_write(src_anon_vma)
> > move_pte_and_folio()
> >
> > This looks more complex and harder to follow. Might be the reason
> > David was not in favour of treating swapcache and present pages in the
> > same path. And now I would agree that refactoring some common parts
> > and not breaking the original structure might be cleaner.
> 
> Exactly, that’s the cost we’re facing in trying to share the code path
> for swap and present PTEs.
> 
> I tried to extract some common functions for present PTE and swap entries,
> but I found too many detailed differences and variants. This made the common
> function overly complex, turning it into a real "monster." As a result, I
> don't think this approach would make the code any more readable or cleaner.
> 
> After trying a couple of times, I feel the following is somehow more
> readable:
> (Lokesh is eager for the small folios fixes to be merged without further
> delay. So, I'd prefer to return -EBUSY for large folios in the hotfixes
> and handle the mTHP -EBUSY issue in a separate patch later.)
> 
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 867898c4e30b..eed9286ec1f3 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -18,6 +18,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
>  #include "internal.h"
> +#include "swap.h"
>  
>  static __always_inline
>  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> @@ -1072,15 +1073,15 @@ static int move_present_pte(struct mm_struct *mm,
>  	return err;
>  }
>  
> -static int move_swap_pte(struct mm_struct *mm,
> +static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
>  			 unsigned long dst_addr, unsigned long src_addr,
>  			 pte_t *dst_pte, pte_t *src_pte,
>  			 pte_t orig_dst_pte, pte_t orig_src_pte,
>  			 pmd_t *dst_pmd, pmd_t dst_pmdval,
> -			 spinlock_t *dst_ptl, spinlock_t *src_ptl)
> +			 spinlock_t *dst_ptl, spinlock_t *src_ptl,
> +			 struct folio *src_folio)
>  {
> -	if (!pte_swp_exclusive(orig_src_pte))
> -		return -EBUSY;
> +	int err = 0;
>  
>  	double_pt_lock(dst_ptl, src_ptl);
>  
> @@ -1090,11 +1091,22 @@ static int move_swap_pte(struct mm_struct *mm,
>  		return -EAGAIN;
>  	}
>  
> +	if (src_folio) {

We'd better add a comment here explaining src_folio in this case is a swap
cache folio, and we're updating index to make sure if the folio can be
reused later in a swapin, the rmap info will match, or something like that.

> +		/* Folio got pinned from under us. Put it back and fail the move. */
> +		if (folio_maybe_dma_pinned(src_folio)) {
> +			err = -EBUSY;
> +			goto out;
> +		}

If the swap entry is guaranteed exclusive (which I think will hold true), I
think we can drop this and the "out" label, as it can't happen.

> +		folio_move_anon_rmap(src_folio, dst_vma);
> +		src_folio->index = linear_page_index(dst_vma, dst_addr);
> +	}
> +
>  	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
>  	set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> -	double_pt_unlock(dst_ptl, src_ptl);
>  
> -	return 0;
> +out:
> +	double_pt_unlock(dst_ptl, src_ptl);
> +	return err;
>  }
>  
>  static int move_zeropage_pte(struct mm_struct *mm,
> @@ -1137,6 +1149,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  			  __u64 mode)
>  {
>  	swp_entry_t entry;
> +	struct swap_info_struct *si = NULL;
>  	pte_t orig_src_pte, orig_dst_pte;
>  	pte_t src_folio_pte;
>  	spinlock_t *src_ptl, *dst_ptl;
> @@ -1318,6 +1331,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  				       orig_dst_pte, orig_src_pte, dst_pmd,
>  				       dst_pmdval, dst_ptl, src_ptl, src_folio);
>  	} else {
> +		struct folio *folio = NULL;
> +
>  		entry = pte_to_swp_entry(orig_src_pte);
>  		if (non_swap_entry(entry)) {
>  			if (is_migration_entry(entry)) {
> @@ -1331,9 +1346,47 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  			goto out;
>  		}
>  
> -		err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte,
> -				    orig_dst_pte, orig_src_pte, dst_pmd,
> -				    dst_pmdval, dst_ptl, src_ptl);
> +		if (!pte_swp_exclusive(orig_src_pte)) {
> +			err = -EBUSY;
> +			goto out;
> +		}
> +
> +		si = get_swap_device(entry);
> +		if (unlikely(!si)) {
> +			err = -EAGAIN;
> +			goto out;
> +		}
> +                /*
> +                 * Check if swapcache exists. If it does, the folio must be
> +                 * moved even if the PTE is a swap entry. For large folios,
> +                 * we directly return -EBUSY, as split_folio() currently
> +                 * also returns -EBUSY when attempting to split unmapped
> +                 * large folios in the swapcache. This needs to be fixed
> +                 * to allow proper handling.
> +                 */

Some alignment issue on comments...

We could also add something on the decision on why not taking anon_vma.
IIUC mention that no possible rmap walker when exclusive and unmapped
should be ok as of now..

> +		if (!src_folio)
> +			folio = filemap_get_folio(swap_address_space(entry),
> +					swap_cache_index(entry));
> +		if (!IS_ERR_OR_NULL(folio)) {
> +			if (folio_test_large(folio)) {
> +				err = -EBUSY;
> +				folio_put(folio);
> +				goto out;
> +			}
> +			src_folio = folio;

We need to update src_folio_pte here, or later it might access
uninitialized stack var when a retry needed:

	if (src_folio && unlikely(!pte_same(src_folio_pte, orig_src_pte))) {
		err = -EAGAIN;
		goto out;
	}

> +			if (!folio_trylock(src_folio)) {
> +				pte_unmap(&orig_src_pte);
> +				pte_unmap(&orig_dst_pte);
> +				src_pte = dst_pte = NULL;
> +				/* now we can block and wait */
> +				folio_lock(src_folio);
> +				si = NULL;

Swap device ref leak?

Thanks,

> +				goto retry;
> +			}
> +		}
> +		err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> +				orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> +				dst_ptl, src_ptl, src_folio);
>  	}
>  
>  out:
> @@ -1350,6 +1403,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  	if (src_pte)
>  		pte_unmap(src_pte);
>  	mmu_notifier_invalidate_range_end(&range);
> +	if (si)
> +		put_swap_device(si);
>  
>  	return err;
>  }
> 
> If there are no objections, I'll send v2 tomorrow with the above code.
> 12:04 AM, Time to get some sleep now! :-)
> 
> 
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > > > > > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > > > > > in this RFC.
> > > > > >
> > > > > > For small folios, there’s no split_folio issue, making it relatively
> > > > > > simpler. Lokesh
> > > > > > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > > > > > the first priority.
> > > > >
> > > > > Agreed.
> > > > >
> > > > > --
> > > > > Peter Xu
> > > > >
> 
> Thanks
> Barry
> 
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-25 15:34                   ` Peter Xu
@ 2025-02-25 17:02                     ` Suren Baghdasaryan
  0 siblings, 0 replies; 47+ messages in thread
From: Suren Baghdasaryan @ 2025-02-25 17:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Barry Song, Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon,
	brauner, david, hughd, jannh, kaleshsingh, linux-kernel,
	linux-mm, lokeshgidra, mhocko, ngeoffray, rppt, ryan.roberts,
	shuah, v-songbaohua, viro, willy, yuzhao, zhangpeng362,
	zhengtangquan

On Tue, Feb 25, 2025 at 7:34 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Feb 26, 2025 at 12:05:25AM +1300, Barry Song wrote:
> > On Fri, Feb 21, 2025 at 1:36 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 3:52 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Thu, Feb 20, 2025 at 3:47 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > >
> > > > > On Thu, Feb 20, 2025 at 2:59 PM Peter Xu <peterx@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Feb 20, 2025 at 12:04:40PM +1300, Barry Song wrote:
> > > > > > > On Thu, Feb 20, 2025 at 11:15 AM Peter Xu <peterx@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Feb 20, 2025 at 09:37:50AM +1300, Barry Song wrote:
> > > > > > > > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > From: Barry Song <v-songbaohua@oppo.com>
> > > > > > > > > > >
> > > > > > > > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > > > > > > > swap entry.
> > > > > > > > > > >
> > > > > > > > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > > > > > > > >   migration by setting:
> > > > > > > > > > >
> > > > > > > > > > >   src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > > > > > > > >
> > > > > > > > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > > > > > > > >   the PTE to the new dst_addr.
> > > > > > > > > > >
> > > > > > > > > > > This approach is incorrect because even if the PTE is a swap
> > > > > > > > > > > entry, it can still reference a folio that remains in the swap
> > > > > > > > > > > cache.
> > > > > > > > > > >
> > > > > > > > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > > > > > > > swap cache. However, during add_rmap operations, a kernel panic
> > > > > > > > > > > can occur due to:
> > > > > > > > > > >  page_pgoff(folio, page) != linear_page_index(vma, address)
> > > > > > > > > >
> > > > > > > > > > Thanks for the report and reproducer!
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > $./a.out > /dev/null
> > > > > > > > > > > [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > > > > > > > [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > > > > > > > [   13.337716] memcg:ffff00000405f000
> > > > > > > > > > > [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > > > > > > > [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > > > > > [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > > > > > [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > > > > > > > [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > > > > > > > [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > > > > > > > [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > > > > > > > [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > > > > > > > [   13.340190] ------------[ cut here ]------------
> > > > > > > > > > > [   13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > > > > > > > [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > > > > > > > [   13.340969] Modules linked in:
> > > > > > > > > > > [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > > > > > > > [   13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > > > > > > > [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > > > > > > > [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > > > > > [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > > > > > > > [   13.342018] sp : ffff80008752bb20
> > > > > > > > > > > [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > > > > > > > [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > > > > > > > [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > > > > > > > [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > > > > > > > [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > > > > > > > [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > > > > > > > [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > > > > > > > [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > > > > > > > [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > > > > > > > [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > > > > > > > [   13.343876] Call trace:
> > > > > > > > > > > [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > > > > > > > [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > > > > > > > [   13.344333]  do_swap_page+0x1060/0x1400
> > > > > > > > > > > [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > > > > > > > [   13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > > > > > > > [   13.344586]  do_page_fault+0x20c/0x770
> > > > > > > > > > > [   13.344673]  do_translation_fault+0xb4/0xf0
> > > > > > > > > > > [   13.344759]  do_mem_abort+0x48/0xa0
> > > > > > > > > > > [   13.344842]  el0_da+0x58/0x130
> > > > > > > > > > > [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > > > > > > > [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > > > > > > > [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > > > > > > > [   13.345504] ---[ end trace 0000000000000000 ]---
> > > > > > > > > > > [   13.345715] note: a.out[107] exited with irqs disabled
> > > > > > > > > > > [   13.345954] note: a.out[107] exited with preempt_count 2
> > > > > > > > > > >
> > > > > > > > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > > > > > > > of folios as done in move_present_pte.
> > > > > > > > > >
> > > > > > > > > > How complex would that be? Is it a matter of adding
> > > > > > > > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > > > > > > > folio->index = linear_page_index like in move_present_pte() or
> > > > > > > > > > something more?
> > > > > > > > >
> > > > > > > > > My main concern is still with large folios that require a split_folio()
> > > > > > > > > during move_pages(), as the entire folio shares the same index and
> > > > > > > > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > > > > > > > making a split necessary.
> > > > > > > > >
> > > > > > > > > However, in split_huge_page_to_list_to_order(), there is a:
> > > > > > > > >
> > > > > > > > >         if (folio_test_writeback(folio))
> > > > > > > > >                 return -EBUSY;
> > > > > > > > >
> > > > > > > > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > > > > > > > it simply returns -EBUSY:
> > > > > > > > >
> > > > > > > > > move_pages_pte()
> > > > > > > > > {
> > > > > > > > >                 /* at this point we have src_folio locked */
> > > > > > > > >                 if (folio_test_large(src_folio)) {
> > > > > > > > >                         /* split_folio() can block */
> > > > > > > > >                         pte_unmap(&orig_src_pte);
> > > > > > > > >                         pte_unmap(&orig_dst_pte);
> > > > > > > > >                         src_pte = dst_pte = NULL;
> > > > > > > > >                         err = split_folio(src_folio);
> > > > > > > > >                         if (err)
> > > > > > > > >                                 goto out;
> > > > > > > > >
> > > > > > > > >                         /* have to reacquire the folio after it got split */
> > > > > > > > >                         folio_unlock(src_folio);
> > > > > > > > >                         folio_put(src_folio);
> > > > > > > > >                         src_folio = NULL;
> > > > > > > > >                         goto retry;
> > > > > > > > >                 }
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > Do we need a folio_wait_writeback() before calling split_folio()?
> > > > > > > >
> > > > > > > > Maybe no need in the first version to fix the immediate bug?
> > > > > > > >
> > > > > > > > It's also not always the case to hit writeback here. IIUC, writeback only
> > > > > > > > happens for a short window when the folio was just added into swapcache.
> > > > > > > > MOVE can happen much later after that anytime before a swapin.  My
> > > > > > > > understanding is that's also what Matthew wanted to point out.  It may be
> > > > > > > > better justified of that in a separate change with some performance
> > > > > > > > measurements.
> > > > > > >
> > > > > > > The bug we’re discussing occurs precisely within the short window you
> > > > > > > mentioned.
> > > > > > >
> > > > > > > 1. add_to_swap: The folio is added to swapcache.
> > > > > > > 2. try_to_unmap: PTEs are converted to swap entries.
> > > > > > > 3. pageout
> > > > > > > 4. Swapcache is cleared.
> > > > > >
> > > > > > Hmm, I see. I was expecting step 4 to be "writeback is cleared".. or at
> > > > > > least that should be step 3.5, as IIUC "writeback" needs to be cleared
> > > > > > before "swapcache" bit being cleared.
> > > > > >
> > > > > > >
> > > > > > > The issue happens between steps 2 and 4, where the PTE is not present, but
> > > > > > > the folio is still in swapcache - the current code does move_swap_pte() but does
> > > > > > > not fixup folio->index within swapcache.
> > > > > >
> > > > > > One thing I'm still not clear here is why it's a race condition, rather
> > > > > > than more severe than that.  I mean, folio->index is definitely wrong, then
> > > > > > as long as the page still in swapcache, we should be able to move the swp
> > > > > > entry over to dest addr of UFFDIO_MOVE, read on dest addr, then it'll see
> > > > > > the page in swapcache with the wrong folio->index already and trigger.
> > > > > >
> > > > > > I wrote a quick test like that, it actually won't trigger..
> > > > > >
> > > > > > I had a closer look in the code, I think it's because do_swap_page() has
> > > > > > the logic to detect folio->index matching first, and allocate a new folio
> > > > > > if it doesn't match in ksm_might_need_to_copy().  IIUC that was for
> > > > > > ksm.. but it looks like it's functioning too here.
> > > > > >
> > > > > > ksm_might_need_to_copy:
> > > > > >         if (folio_test_ksm(folio)) {
> > > > > >                 if (folio_stable_node(folio) &&
> > > > > >                     !(ksm_run & KSM_RUN_UNMERGE))
> > > > > >                         return folio;   /* no need to copy it */
> > > > > >         } else if (!anon_vma) {
> > > > > >                 return folio;           /* no need to copy it */
> > > > > >         } else if (folio->index == linear_page_index(vma, addr) && <---------- [1]
> > > > > >                         anon_vma->root == vma->anon_vma->root) {
> > > > > >                 return folio;           /* still no need to copy it */
> > > > > >         }
> > > > > >         ...
> > > > > >
> > > > > >         new_folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, addr); <---- [2]
> > > > > >         ...
> > > > > >
> > > > > > So I believe what I hit is at [1] it sees index doesn't match, then it
> > > > > > decided to allocate a new folio.  In this case, it won't hit your BUG
> > > > > > because it'll be "folio != swapcache" later, so it'll setup the
> > > > > > folio->index for the new one, rather than the sanity check.
> > > > > >
> > > > > > Do you know how your case got triggered, being able to bypass the above [1]
> > > > > > which should check folio->index already?
> > > > >
> > > > > To understand the change I tried applying the proposed patch to both
> > > > > mm-unstable and Linus' ToT and got conflicts for both trees. Barry,
> > > > > which baseline are you using?
> > > >
> > > > Oops, never mind. My mistake. Copying from the email messed up tabs...
> > > > It applies cleanly.
> > >
> > > Overall the code seems correct to me, however the new code has quite
> > > complex logical structure IMO. Original simplified code structure is
> > > like this:
> > >
> > > if (pte_present(orig_src_pte)) {
> > >         if (is_zero_pfn) {
> > >                 move_zeropage_pte()
> > >                 return
> > >         }
> > >         // pin and lock src_folio
> > >         spin_lock(src_ptl)
> > >         folio_get(folio)
> > >         folio_trylock(folio)
> > >         if (folio_test_large(src_folio))
> > >                 split_folio(src_folio)
> > >         anon_vma_trylock_write(src_anon_vma)
> > >         move_present_pte()
> > > } else {
> > >         if (non_swap_entry(entry))
> > >                 if (is_migration_entry(entry))
> > >                         handle migration entry
> > >         else
> > >                 move_swap_pte()
> > > }
> > >
> > > The new structure looks like this:
> > >
> > > if (!pte_present(orig_src_pte)) {
> > >         if (is_migration_entry(entry)) {
> > >                 handle migration entry
> > >                 return
> > >        }
> > >         if (!non_swap_entry() ||  !pte_swp_exclusive())
> > >                 return
> > >         si = get_swap_device(entry);
> > > }
> > > if (pte_present(orig_src_pte) && is_zero_pfn(pte_pfn(orig_src_pte)))
> > >         move_zeropage_pte()
> > >         return
> > > }
> > > pin and lock src_folio
> > >         spin_lock(src_ptl)
> > >         if (pte_present(orig_src_pte))
> > >                 folio_get(folio)
> > >         else {
> > >                 folio = filemap_get_folio(swap_entry)
> > >                 if (IS_ERR(folio))
> > >                         move_swap_pte()
> > >                         return
> > >                 }
> > >         }
> > >         folio_trylock(folio)
> > > if (folio_test_large(src_folio))
> > >         split_folio(src_folio)
> > > if (pte_present(orig_src_pte))
> > >         anon_vma_trylock_write(src_anon_vma)
> > > move_pte_and_folio()
> > >
> > > This looks more complex and harder to follow. Might be the reason
> > > David was not in favour of treating swapcache and present pages in the
> > > same path. And now I would agree that refactoring some common parts
> > > and not breaking the original structure might be cleaner.
> >
> > Exactly, that’s the cost we’re facing in trying to share the code path
> > for swap and present PTEs.
> >
> > I tried to extract some common functions for present PTE and swap entries,
> > but I found too many detailed differences and variants. This made the common
> > function overly complex, turning it into a real "monster." As a result, I
> > don't think this approach would make the code any more readable or cleaner.
> >
> > After trying a couple of times, I feel the following is somehow more
> > readable:
> > (Lokesh is eager for the small folios fixes to be merged without further
> > delay. So, I'd prefer to return -EBUSY for large folios in the hotfixes
> > and handle the mTHP -EBUSY issue in a separate patch later.)

Hi Barry,
This is much more readable, thank you! With Peter's comments addressed
I think this is ready to be posted as an official fix.
Thanks,
Suren.

> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 867898c4e30b..eed9286ec1f3 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -18,6 +18,7 @@
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> >  #include "internal.h"
> > +#include "swap.h"
> >
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1072,15 +1073,15 @@ static int move_present_pte(struct mm_struct *mm,
> >       return err;
> >  }
> >
> > -static int move_swap_pte(struct mm_struct *mm,
> > +static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >                        unsigned long dst_addr, unsigned long src_addr,
> >                        pte_t *dst_pte, pte_t *src_pte,
> >                        pte_t orig_dst_pte, pte_t orig_src_pte,
> >                        pmd_t *dst_pmd, pmd_t dst_pmdval,
> > -                      spinlock_t *dst_ptl, spinlock_t *src_ptl)
> > +                      spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > +                      struct folio *src_folio)
> >  {
> > -     if (!pte_swp_exclusive(orig_src_pte))
> > -             return -EBUSY;
> > +     int err = 0;
> >
> >       double_pt_lock(dst_ptl, src_ptl);
> >
> > @@ -1090,11 +1091,22 @@ static int move_swap_pte(struct mm_struct *mm,
> >               return -EAGAIN;
> >       }
> >
> > +     if (src_folio) {
>
> We'd better add a comment here explaining src_folio in this case is a swap
> cache folio, and we're updating index to make sure if the folio can be
> reused later in a swapin, the rmap info will match, or something like that.
>
> > +             /* Folio got pinned from under us. Put it back and fail the move. */
> > +             if (folio_maybe_dma_pinned(src_folio)) {
> > +                     err = -EBUSY;
> > +                     goto out;
> > +             }
>
> If the swap entry is guaranteed exclusive (which I think will hold true), I
> think we can drop this and the "out" label, as it can't happen.
>
> > +             folio_move_anon_rmap(src_folio, dst_vma);
> > +             src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +     }
> > +
> >       orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> >       set_pte_at(mm, dst_addr, dst_pte, orig_src_pte);
> > -     double_pt_unlock(dst_ptl, src_ptl);
> >
> > -     return 0;
> > +out:
> > +     double_pt_unlock(dst_ptl, src_ptl);
> > +     return err;
> >  }
> >
> >  static int move_zeropage_pte(struct mm_struct *mm,
> > @@ -1137,6 +1149,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                         __u64 mode)
> >  {
> >       swp_entry_t entry;
> > +     struct swap_info_struct *si = NULL;
> >       pte_t orig_src_pte, orig_dst_pte;
> >       pte_t src_folio_pte;
> >       spinlock_t *src_ptl, *dst_ptl;
> > @@ -1318,6 +1331,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                                      orig_dst_pte, orig_src_pte, dst_pmd,
> >                                      dst_pmdval, dst_ptl, src_ptl, src_folio);
> >       } else {
> > +             struct folio *folio = NULL;
> > +
> >               entry = pte_to_swp_entry(orig_src_pte);
> >               if (non_swap_entry(entry)) {
> >                       if (is_migration_entry(entry)) {
> > @@ -1331,9 +1346,47 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >                       goto out;
> >               }
> >
> > -             err = move_swap_pte(mm, dst_addr, src_addr, dst_pte, src_pte,
> > -                                 orig_dst_pte, orig_src_pte, dst_pmd,
> > -                                 dst_pmdval, dst_ptl, src_ptl);
> > +             if (!pte_swp_exclusive(orig_src_pte)) {
> > +                     err = -EBUSY;
> > +                     goto out;
> > +             }
> > +
> > +             si = get_swap_device(entry);
> > +             if (unlikely(!si)) {
> > +                     err = -EAGAIN;
> > +                     goto out;
> > +             }
> > +                /*
> > +                 * Check if swapcache exists. If it does, the folio must be
> > +                 * moved even if the PTE is a swap entry. For large folios,
> > +                 * we directly return -EBUSY, as split_folio() currently
> > +                 * also returns -EBUSY when attempting to split unmapped
> > +                 * large folios in the swapcache. This needs to be fixed
> > +                 * to allow proper handling.
> > +                 */
>
> Some alignment issue on comments...
>
> We could also add something on the decision on why not taking anon_vma.
> IIUC mention that no possible rmap walker when exclusive and unmapped
> should be ok as of now..
>
> > +             if (!src_folio)
> > +                     folio = filemap_get_folio(swap_address_space(entry),
> > +                                     swap_cache_index(entry));
> > +             if (!IS_ERR_OR_NULL(folio)) {
> > +                     if (folio_test_large(folio)) {
> > +                             err = -EBUSY;
> > +                             folio_put(folio);
> > +                             goto out;
> > +                     }
> > +                     src_folio = folio;
>
> We need to update src_folio_pte here, or later it might access
> uninitialized stack var when a retry needed:
>
>         if (src_folio && unlikely(!pte_same(src_folio_pte, orig_src_pte))) {
>                 err = -EAGAIN;
>                 goto out;
>         }
>
> > +                     if (!folio_trylock(src_folio)) {
> > +                             pte_unmap(&orig_src_pte);
> > +                             pte_unmap(&orig_dst_pte);
> > +                             src_pte = dst_pte = NULL;
> > +                             /* now we can block and wait */
> > +                             folio_lock(src_folio);
> > +                             si = NULL;
>
> Swap device ref leak?
>
> Thanks,
>
> > +                             goto retry;
> > +                     }
> > +             }
> > +             err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > +                             orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > +                             dst_ptl, src_ptl, src_folio);
> >       }
> >
> >  out:
> > @@ -1350,6 +1403,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> >       if (src_pte)
> >               pte_unmap(src_pte);
> >       mmu_notifier_invalidate_range_end(&range);
> > +     if (si)
> > +             put_swap_device(si);
> >
> >       return err;
> >  }
> >
> > If there are no objections, I'll send v2 tomorrow with the above code.
> > 12:04 AM, Time to get some sleep now! :-)
> >
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > My point is that if we want a proper fix for mTHP, we'd better handle writeback.
> > > > > > > Otherwise, this isn’t much different from directly returning -EBUSY as proposed
> > > > > > > in this RFC.
> > > > > > >
> > > > > > > For small folios, there’s no split_folio issue, making it relatively
> > > > > > > simpler. Lokesh
> > > > > > > mentioned plans to madvise NOHUGEPAGE in ART, so fixing small folios is likely
> > > > > > > the first priority.
> > > > > >
> > > > > > Agreed.
> > > > > >
> > > > > > --
> > > > > > Peter Xu
> > > > > >
> >
> > Thanks
> > Barry
> >
> >
>
> --
> Peter Xu
>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-20 10:24           ` David Hildenbrand
@ 2025-02-26  5:37             ` Barry Song
  2025-02-26  8:03               ` David Hildenbrand
  0 siblings, 1 reply; 47+ messages in thread
From: Barry Song @ 2025-02-26  5:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon, brauner,
	hughd, jannh, kaleshsingh, linux-kernel, linux-mm, lokeshgidra,
	mhocko, ngeoffray, peterx, rppt, ryan.roberts, shuah, surenb,
	v-songbaohua, viro, willy, zhangpeng362, zhengtangquan, yuzhao,
	stable

On Thu, Feb 20, 2025 at 11:24 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.02.25 10:21, Barry Song wrote:
> > On Thu, Feb 20, 2025 at 9:40 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 19.02.25 19:58, Suren Baghdasaryan wrote:
> >>> On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 19.02.25 19:26, Suren Baghdasaryan wrote:
> >>>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >>>>>>
> >>>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>>
> >>>>>> userfaultfd_move() checks whether the PTE entry is present or a
> >>>>>> swap entry.
> >>>>>>
> >>>>>> - If the PTE entry is present, move_present_pte() handles folio
> >>>>>>      migration by setting:
> >>>>>>
> >>>>>>      src_folio->index = linear_page_index(dst_vma, dst_addr);
> >>>>>>
> >>>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >>>>>>      the PTE to the new dst_addr.
> >>>>>>
> >>>>>> This approach is incorrect because even if the PTE is a swap
> >>>>>> entry, it can still reference a folio that remains in the swap
> >>>>>> cache.
> >>>>>>
> >>>>>> If do_swap_page() is triggered, it may locate the folio in the
> >>>>>> swap cache. However, during add_rmap operations, a kernel panic
> >>>>>> can occur due to:
> >>>>>>     page_pgoff(folio, page) != linear_page_index(vma, address)
> >>>>>
> >>>>> Thanks for the report and reproducer!
> >>>>>
> >>>>>>
> >>>>>> $./a.out > /dev/null
> >>>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> >>>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> >>>>>> [   13.337716] memcg:ffff00000405f000
> >>>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> >>>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> >>>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> >>>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> >>>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> >>>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> >>>>>> [   13.340190] ------------[ cut here ]------------
> >>>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
> >>>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> >>>>>> [   13.340969] Modules linked in:
> >>>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> >>>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
> >>>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> >>>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> >>>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> >>>>>> [   13.342018] sp : ffff80008752bb20
> >>>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> >>>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> >>>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> >>>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> >>>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> >>>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> >>>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> >>>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> >>>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> >>>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> >>>>>> [   13.343876] Call trace:
> >>>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> >>>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> >>>>>> [   13.344333]  do_swap_page+0x1060/0x1400
> >>>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
> >>>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
> >>>>>> [   13.344586]  do_page_fault+0x20c/0x770
> >>>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
> >>>>>> [   13.344759]  do_mem_abort+0x48/0xa0
> >>>>>> [   13.344842]  el0_da+0x58/0x130
> >>>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
> >>>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
> >>>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> >>>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
> >>>>>> [   13.345715] note: a.out[107] exited with irqs disabled
> >>>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
> >>>>>>
> >>>>>> Fully fixing it would be quite complex, requiring similar handling
> >>>>>> of folios as done in move_present_pte.
> >>>>>
> >>>>> How complex would that be? Is it a matter of adding
> >>>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> >>>>> folio->index = linear_page_index like in move_present_pte() or
> >>>>> something more?
> >>>>
> >>>> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
> >>>> be pinned and we may be able to move it I think.
> >>>>
> >>>> So all that's required is to check pte_swp_exclusive() and the folio size.
> >>>>
> >>>> ... in theory :) Not sure about the swap details.
> >>>
> >>> Looking some more into it, I think we would have to perform all the
> >>> folio and anon_vma locking and pinning that we do for present pages in
> >>> move_pages_pte(). If that's correct then maybe treating swapcache
> >>> pages like a present page inside move_pages_pte() would be simpler?
> >>
> >> I'd be more in favor of not doing that. Maybe there are parts we can
> >> move out into helper functions instead, so we can reuse them?
> >
> > I actually have a v2 ready. Maybe we can discuss if some of the code can be
> > extracted as a helper based on the below before I send it formally?
> >
> > I’d say there are many parts that can be shared with present PTE, but there
> > are two major differences:
> >
> > 1. Page exclusivity – swapcache doesn’t require it (try_to_unmap_one has remove
> > Exclusive flag;)
> > 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)
> >
>
> That's a lot of complicated code you have there (not your fault, it's
> complicated stuff ... ) :)
>
> Some of it might be compressed/simplified by the use of "else if".
>
> I'll try to take a closer look later (will have to apply it to see the
> context better). Just one independent comment because I stumbled over
> this recently:
>
> [...]
>
> > @@ -1062,10 +1063,13 @@ static int move_present_pte(struct mm_struct *mm,
> >       folio_move_anon_rmap(src_folio, dst_vma);
> >       src_folio->index = linear_page_index(dst_vma, dst_addr);
> >
> > -     orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
> > -     /* Follow mremap() behavior and treat the entry dirty after the move */
> > -     orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
> > -
> > +     if (pte_present(orig_src_pte)) {
> > +             orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
> > +             /* Follow mremap() behavior and treat the entry dirty after the move */
> > +             orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
>
> I'll note that the comment and mkdirty is misleading/wrong. It's
> softdirty that we care about only. But that is something independent of
> this change.
>
> For swp PTEs, we maybe also would want to set softdirty.
>
> See move_soft_dirty_pte() on what is actually done on the mremap path.

I actually don't quite understand the changelog in  commit 0f8975ec4db2
(" mm: soft-dirty bits for user memory changes tracking").

"    Another thing to note, is that when mremap moves PTEs they are marked
    with soft-dirty as well, since from the user perspective mremap modifies
    the virtual memory at mremap's new address."

Why is the hardware-dirty bit not relevant? From the user's perspective,
the memory at the destination virtual address of mremap/userfaultfd_move
has changed.

For systems where CONFIG_HAVE_ARCH_SOFT_DIRTY is false, how can the dirty status
be determined?

Or is the answer that we only care about soft-dirty changes?

For the hardware-dirty bit, do we only care about actual modifications to the
physical page content rather than changes at the virtual address level?

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
  2025-02-26  5:37             ` Barry Song
@ 2025-02-26  8:03               ` David Hildenbrand
  0 siblings, 0 replies; 47+ messages in thread
From: David Hildenbrand @ 2025-02-26  8:03 UTC (permalink / raw)
  To: Barry Song
  Cc: Liam.Howlett, aarcange, akpm, axelrasmussen, bgeffon, brauner,
	hughd, jannh, kaleshsingh, linux-kernel, linux-mm, lokeshgidra,
	mhocko, ngeoffray, peterx, rppt, ryan.roberts, shuah, surenb,
	v-songbaohua, viro, willy, zhangpeng362, zhengtangquan, yuzhao,
	stable

On 26.02.25 06:37, Barry Song wrote:
> On Thu, Feb 20, 2025 at 11:24 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 20.02.25 10:21, Barry Song wrote:
>>> On Thu, Feb 20, 2025 at 9:40 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 19.02.25 19:58, Suren Baghdasaryan wrote:
>>>>> On Wed, Feb 19, 2025 at 10:30 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 19.02.25 19:26, Suren Baghdasaryan wrote:
>>>>>>> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
>>>>>>>>
>>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>>
>>>>>>>> userfaultfd_move() checks whether the PTE entry is present or a
>>>>>>>> swap entry.
>>>>>>>>
>>>>>>>> - If the PTE entry is present, move_present_pte() handles folio
>>>>>>>>       migration by setting:
>>>>>>>>
>>>>>>>>       src_folio->index = linear_page_index(dst_vma, dst_addr);
>>>>>>>>
>>>>>>>> - If the PTE entry is a swap entry, move_swap_pte() simply copies
>>>>>>>>       the PTE to the new dst_addr.
>>>>>>>>
>>>>>>>> This approach is incorrect because even if the PTE is a swap
>>>>>>>> entry, it can still reference a folio that remains in the swap
>>>>>>>> cache.
>>>>>>>>
>>>>>>>> If do_swap_page() is triggered, it may locate the folio in the
>>>>>>>> swap cache. However, during add_rmap operations, a kernel panic
>>>>>>>> can occur due to:
>>>>>>>>      page_pgoff(folio, page) != linear_page_index(vma, address)
>>>>>>>
>>>>>>> Thanks for the report and reproducer!
>>>>>>>
>>>>>>>>
>>>>>>>> $./a.out > /dev/null
>>>>>>>> [   13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
>>>>>>>> [   13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
>>>>>>>> [   13.337716] memcg:ffff00000405f000
>>>>>>>> [   13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
>>>>>>>> [   13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>>>>>> [   13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>>>>>> [   13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
>>>>>>>> [   13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
>>>>>>>> [   13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
>>>>>>>> [   13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
>>>>>>>> [   13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
>>>>>>>> [   13.340190] ------------[ cut here ]------------
>>>>>>>> [   13.340316] kernel BUG at mm/rmap.c:1380!
>>>>>>>> [   13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
>>>>>>>> [   13.340969] Modules linked in:
>>>>>>>> [   13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
>>>>>>>> [   13.341470] Hardware name: linux,dummy-virt (DT)
>>>>>>>> [   13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>>>>>>>> [   13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
>>>>>>>> [   13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
>>>>>>>> [   13.342018] sp : ffff80008752bb20
>>>>>>>> [   13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
>>>>>>>> [   13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
>>>>>>>> [   13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
>>>>>>>> [   13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
>>>>>>>> [   13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
>>>>>>>> [   13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
>>>>>>>> [   13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
>>>>>>>> [   13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
>>>>>>>> [   13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
>>>>>>>> [   13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
>>>>>>>> [   13.343876] Call trace:
>>>>>>>> [   13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
>>>>>>>> [   13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
>>>>>>>> [   13.344333]  do_swap_page+0x1060/0x1400
>>>>>>>> [   13.344417]  __handle_mm_fault+0x61c/0xbc8
>>>>>>>> [   13.344504]  handle_mm_fault+0xd8/0x2e8
>>>>>>>> [   13.344586]  do_page_fault+0x20c/0x770
>>>>>>>> [   13.344673]  do_translation_fault+0xb4/0xf0
>>>>>>>> [   13.344759]  do_mem_abort+0x48/0xa0
>>>>>>>> [   13.344842]  el0_da+0x58/0x130
>>>>>>>> [   13.344914]  el0t_64_sync_handler+0xc4/0x138
>>>>>>>> [   13.345002]  el0t_64_sync+0x1ac/0x1b0
>>>>>>>> [   13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
>>>>>>>> [   13.345504] ---[ end trace 0000000000000000 ]---
>>>>>>>> [   13.345715] note: a.out[107] exited with irqs disabled
>>>>>>>> [   13.345954] note: a.out[107] exited with preempt_count 2
>>>>>>>>
>>>>>>>> Fully fixing it would be quite complex, requiring similar handling
>>>>>>>> of folios as done in move_present_pte.
>>>>>>>
>>>>>>> How complex would that be? Is it a matter of adding
>>>>>>> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
>>>>>>> folio->index = linear_page_index like in move_present_pte() or
>>>>>>> something more?
>>>>>>
>>>>>> If the entry is pte_swp_exclusive(), and the folio is order-0, it cannot
>>>>>> be pinned and we may be able to move it I think.
>>>>>>
>>>>>> So all that's required is to check pte_swp_exclusive() and the folio size.
>>>>>>
>>>>>> ... in theory :) Not sure about the swap details.
>>>>>
>>>>> Looking some more into it, I think we would have to perform all the
>>>>> folio and anon_vma locking and pinning that we do for present pages in
>>>>> move_pages_pte(). If that's correct then maybe treating swapcache
>>>>> pages like a present page inside move_pages_pte() would be simpler?
>>>>
>>>> I'd be more in favor of not doing that. Maybe there are parts we can
>>>> move out into helper functions instead, so we can reuse them?
>>>
>>> I actually have a v2 ready. Maybe we can discuss if some of the code can be
>>> extracted as a helper based on the below before I send it formally?
>>>
>>> I’d say there are many parts that can be shared with present PTE, but there
>>> are two major differences:
>>>
>>> 1. Page exclusivity – swapcache doesn’t require it (try_to_unmap_one has remove
>>> Exclusive flag;)
>>> 2. src_anon_vma and its lock – swapcache doesn’t require it(folio is not mapped)
>>>
>>
>> That's a lot of complicated code you have there (not your fault, it's
>> complicated stuff ... ) :)
>>
>> Some of it might be compressed/simplified by the use of "else if".
>>
>> I'll try to take a closer look later (will have to apply it to see the
>> context better). Just one independent comment because I stumbled over
>> this recently:
>>
>> [...]
>>
>>> @@ -1062,10 +1063,13 @@ static int move_present_pte(struct mm_struct *mm,
>>>        folio_move_anon_rmap(src_folio, dst_vma);
>>>        src_folio->index = linear_page_index(dst_vma, dst_addr);
>>>
>>> -     orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
>>> -     /* Follow mremap() behavior and treat the entry dirty after the move */
>>> -     orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
>>> -
>>> +     if (pte_present(orig_src_pte)) {
>>> +             orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
>>> +             /* Follow mremap() behavior and treat the entry dirty after the move */
>>> +             orig_dst_pte = pte_mkwrite(pte_mkdirty(orig_dst_pte), dst_vma);
>>
>> I'll note that the comment and mkdirty is misleading/wrong. It's
>> softdirty that we care about only. But that is something independent of
>> this change.
>>
>> For swp PTEs, we maybe also would want to set softdirty.
>>
>> See move_soft_dirty_pte() on what is actually done on the mremap path.
> 
> I actually don't quite understand the changelog in  commit 0f8975ec4db2
> (" mm: soft-dirty bits for user memory changes tracking").
> 
> "    Another thing to note, is that when mremap moves PTEs they are marked
>      with soft-dirty as well, since from the user perspective mremap modifies
>      the virtual memory at mremap's new address."
> 
> Why is the hardware-dirty bit not relevant? From the user's perspective,
> the memory at the destination virtual address of mremap/userfaultfd_move
> has changed.

Yes, but it did not change from the system POV. For example, if the page 
was R/O clean and we moved it, why should it suddenly be R/O dirty and 
e.g., require writeback again.

Nobody modified *page content*, but from a user perspective the memory 
at that *virtual memory location* (dst) changed, for example, for 
logical zero (no page mapped) to non-zero (page mapped). That's what 
soft-dirty is about.


> 
> For systems where CONFIG_HAVE_ARCH_SOFT_DIRTY is false, how can the dirty status
> be determined?

No soft-dirty tracking, so nothing to maintain.

> 
> Or is the answer that we only care about soft-dirty changes?
 > > For the hardware-dirty bit, do we only care about actual 
modifications to the
> physical page content rather than changes at the virtual address level?

Exactly!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2025-02-26  8:04 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-19 11:25 [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache Barry Song
2025-02-19 18:26 ` Suren Baghdasaryan
2025-02-19 18:30   ` David Hildenbrand
2025-02-19 18:58     ` Suren Baghdasaryan
2025-02-20  8:40       ` David Hildenbrand
2025-02-20  9:21         ` Barry Song
2025-02-20 10:24           ` David Hildenbrand
2025-02-26  5:37             ` Barry Song
2025-02-26  8:03               ` David Hildenbrand
2025-02-20 23:32           ` Peter Xu
2025-02-21  0:07             ` Barry Song
2025-02-21  1:49               ` Peter Xu
2025-02-22 21:31                 ` Barry Song
2025-02-24 17:50                   ` Peter Xu
2025-02-24 18:03                     ` David Hildenbrand
2025-02-19 20:37   ` Barry Song
2025-02-19 20:57     ` Matthew Wilcox
2025-02-19 21:05       ` Barry Song
2025-02-19 21:02     ` Lokesh Gidra
2025-02-19 21:26       ` Barry Song
2025-02-19 21:32         ` Lokesh Gidra
2025-02-19 22:14     ` Peter Xu
2025-02-19 23:04       ` Barry Song
2025-02-19 23:19         ` Lokesh Gidra
2025-02-20  0:49           ` Barry Song
2025-02-20 22:59         ` Peter Xu
2025-02-20 23:47           ` Suren Baghdasaryan
2025-02-20 23:52             ` Suren Baghdasaryan
2025-02-21  0:36               ` Suren Baghdasaryan
2025-02-25 11:05                 ` Barry Song
2025-02-25 15:34                   ` Peter Xu
2025-02-25 17:02                     ` Suren Baghdasaryan
2025-02-21  1:36           ` Barry Song
2025-02-21  1:54             ` Peter Xu
2025-02-20  8:51     ` David Hildenbrand
2025-02-20  9:31       ` Barry Song
2025-02-20  9:36         ` David Hildenbrand
2025-02-20 21:45           ` Barry Song
2025-02-20 22:19             ` Lokesh Gidra
2025-02-20 22:26               ` Barry Song
2025-02-20 22:31                 ` David Hildenbrand
2025-02-20 22:33                 ` Lokesh Gidra
2025-02-19 18:40 ` Lokesh Gidra
2025-02-19 20:45   ` Barry Song
2025-02-19 20:53     ` Lokesh Gidra
2025-02-19 22:31 ` Peter Xu
2025-02-20  0:50   ` Barry Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox