* [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
@ 2025-05-22 23:23 Barry Song
2025-05-22 23:43 ` Lokesh Gidra
2025-05-26 12:39 ` David Hildenbrand
0 siblings, 2 replies; 10+ messages in thread
From: Barry Song @ 2025-05-22 23:23 UTC (permalink / raw)
To: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Lokesh Gidra,
Andrea Arcangeli
Cc: Andrew Morton, Linux-MM, Kairui Song, LKML
Hi All,
I'm encountering another bug that can be easily reproduced using the small
program below[1], which performs swap-out and swap-in in parallel.
The issue occurs when a folio is being swapped out while it is accessed
concurrently. In this case, do_swap_page() handles the access. However,
because the folio is under writeback, do_swap_page() completely removes
its exclusive attribute.
do_swap_page:
} else if (exclusive && folio_test_writeback(folio) &&
data_race(si->flags & SWP_STABLE_WRITES)) {
...
exclusive = false;
As a result, userfaultfd_move() will return -EBUSY, even though the
folio is not shared and is in fact exclusively owned.
		folio = vm_normal_folio(src_vma, src_addr,
					orig_src_pte);
		if (!folio || !PageAnonExclusive(&folio->page)) {
			spin_unlock(src_ptl);
+			pr_err("%s %d folio:%lx exclusive:%d swapcache:%d\n",
+			       __func__, __LINE__, folio,
+			       PageAnonExclusive(&folio->page),
+			       folio_test_swapcache(folio));
			err = -EBUSY;
			goto out;
		}
I understand that shared folios should not be moved. However, in this
case, the folio is not shared, yet its exclusive flag is not set.
Therefore, I believe PageAnonExclusive is not a reliable indicator of
whether a folio is truly exclusive to a process.
The kernel log output is shown below:
[   23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0 swapcache:1
I'm still struggling to find a real fix; it seems quite challenging.
Please let me know if you have any ideas. In any case, it seems
userspace should fall back to userfaultfd_copy.
[1] The small program:
//Just in a couple of seconds, we are running into
//"UFFDIO_MOVE: Device or resource busy"
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <poll.h>
#include <errno.h>
#define PAGE_SIZE 4096
#define REGION_SIZE (512 * 1024)
#ifndef UFFDIO_MOVE
struct uffdio_move {
__u64 dst;
__u64 src;
__u64 len;
#define UFFDIO_MOVE_MODE_DONTWAKE ((__u64)1<<0)
#define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
__u64 mode;
__s64 move;
};
#define _UFFDIO_MOVE (0x05)
#define UFFDIO_MOVE _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
#endif
void *src, *dst;
int uffd;
void *madvise_thread(void *arg) {
for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
madvise(src + i, PAGE_SIZE, MADV_PAGEOUT);
usleep(100);
}
return NULL;
}
void *swapin_thread(void *arg) {
volatile char dummy;
for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
dummy = ((char *)src)[i];
usleep(100);
}
return NULL;
}
void *fault_handler_thread(void *arg) {
struct uffd_msg msg;
struct uffdio_move move;
struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
while (1) {
if (poll(&pollfd, 1, -1) == -1) {
perror("poll");
exit(EXIT_FAILURE);
}
if (read(uffd, &msg, sizeof(msg)) <= 0) {
perror("read");
exit(EXIT_FAILURE);
}
if (msg.event != UFFD_EVENT_PAGEFAULT) {
fprintf(stderr, "Unexpected event\n");
exit(EXIT_FAILURE);
}
move.src = (unsigned long)src + (msg.arg.pagefault.address -
(unsigned long)dst);
move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
move.len = PAGE_SIZE;
move.mode = 0;
if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
perror("UFFDIO_MOVE");
exit(EXIT_FAILURE);
}
}
return NULL;
}
int main() {
again:	;	/* null statement: a label must precede a statement, not a declaration */
pthread_t thr, madv_thr, swapin_thr;
struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
struct uffdio_register uffdio_register;
src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE
| MAP_ANONYMOUS, -1, 0);
if (src == MAP_FAILED) {
perror("mmap src");
exit(EXIT_FAILURE);
}
memset(src, 1, REGION_SIZE);
dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE
| MAP_ANONYMOUS, -1, 0);
if (dst == MAP_FAILED) {
perror("mmap dst");
exit(EXIT_FAILURE);
}
uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
if (uffd == -1) {
perror("userfaultfd");
exit(EXIT_FAILURE);
}
if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
perror("UFFDIO_API");
exit(EXIT_FAILURE);
}
uffdio_register.range.start = (unsigned long)dst;
uffdio_register.range.len = REGION_SIZE;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
perror("UFFDIO_REGISTER");
exit(EXIT_FAILURE);
}
if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
perror("pthread_create madvise_thread");
exit(EXIT_FAILURE);
}
if (pthread_create(&swapin_thr, NULL, swapin_thread, NULL) != 0) {
perror("pthread_create swapin_thread");
exit(EXIT_FAILURE);
}
if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
perror("pthread_create fault_handler_thread");
exit(EXIT_FAILURE);
}
for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
char val = ((char *)dst)[i];
printf("Accessing dst at offset %zu, value: %d\n", i, val);
}
pthread_join(madv_thr, NULL);
pthread_join(swapin_thr, NULL);
pthread_cancel(thr);
pthread_join(thr, NULL);
munmap(src, REGION_SIZE);
munmap(dst, REGION_SIZE);
close(uffd);
goto again;
return 0;
}
Thanks
Barry
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
  2025-05-22 23:23 [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out Barry Song
@ 2025-05-22 23:43 ` Lokesh Gidra
  2025-05-22 23:53   ` Barry Song
  2025-05-26 12:39 ` David Hildenbrand
  1 sibling, 1 reply; 10+ messages in thread
From: Lokesh Gidra @ 2025-05-22 23:43 UTC (permalink / raw)
To: Barry Song
Cc: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Andrea Arcangeli,
    Andrew Morton, Linux-MM, Kairui Song, LKML

Thanks Barry for stress testing the MOVE ioctl. It's really helpful :)

On Thu, May 22, 2025 at 4:23 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi All,
>
> I'm encountering another bug that can be easily reproduced using the small
> program below[1], which performs swap-out and swap-in in parallel.
>
[...]
>
> I understand that shared folios should not be moved. However, in this
> case, the folio is not shared, yet its exclusive flag is not set.
>
> Therefore, I believe PageAnonExclusive is not a reliable indicator of
> whether a folio is truly exclusive to a process.
>
> I'm still struggling to find a real fix; it seems quite challenging.
> Please let me know if you have any ideas. In any case, it seems
> userspace should fall back to userfaultfd_copy.
>
I'm not sure this is really a bug. A page under write-back is in a way
'busy', isn't it? I am not an expert on anon-exclusive, but it seems to
me that an exclusively mapped anonymous page would have it true. So,
isn't it expected that a page under write-back will not have it set, as
the page isn't mapped?

I have observed this in my testing as well, and there are a couple of
ways to deal with it in userspace. As you suggested, falling back to
userfaultfd_copy on receiving -EBUSY is one option. In my case, making
a fake store on the src page and then retrying has been working fine.
> [1] The small program:
[...]
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
  2025-05-22 23:43 ` Lokesh Gidra
@ 2025-05-22 23:53   ` Barry Song
  2025-05-23  0:03     ` Lokesh Gidra
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2025-05-22 23:53 UTC (permalink / raw)
To: Lokesh Gidra
Cc: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Andrea Arcangeli,
    Andrew Morton, Linux-MM, Kairui Song, LKML

On Fri, May 23, 2025 at 11:44 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> Thanks Barry for stress testing the MOVE ioctl. It's really helpful :)
>
[...]
> I'm not sure this is really a bug. A page under write-back is in a way
> 'busy', isn't it? I am not an expert on anon-exclusive, but it seems to
> me that an exclusively mapped anonymous page would have it true. So,
> isn't it expected that a page under write-back will not have it set, as
> the page isn't mapped?

We have two return codes: -EAGAIN and -EBUSY. In many cases, we return
-EAGAIN, which is transparent to userspace because the syscall is retried.

Therefore, I would expect -EAGAIN or a similar code here to avoid
userspace noise, since we already handle other cases, in which folios
are undergoing transitions and will become stable again, by returning
-EAGAIN.

> I have observed this in my testing as well, and there are a couple of
> ways to deal with it in userspace. As you suggested, falling back to
> userfaultfd_copy on receiving -EBUSY is one option. In my case, making
> a fake store on the src page and then retrying has been working fine.

Good to know you have some fallbacks implemented in userspace.
That makes the issue less serious now.

[...]

> Thanks
> Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
  2025-05-22 23:53   ` Barry Song
@ 2025-05-23  0:03     ` Lokesh Gidra
  0 siblings, 0 replies; 10+ messages in thread
From: Lokesh Gidra @ 2025-05-23 0:03 UTC (permalink / raw)
To: Barry Song
Cc: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Andrea Arcangeli,
    Andrew Morton, Linux-MM, Kairui Song, LKML

On Thu, May 22, 2025 at 4:53 PM Barry Song <21cnbao@gmail.com> wrote:
>
[...]
> We have two return codes: -EAGAIN and -EBUSY. In many cases, we return
> -EAGAIN, which is transparent to userspace because the syscall is retried.
>
> Therefore, I expect -EAGAIN or a similar code here to avoid userspace noise,
> since we handle other cases where folios are undergoing transitions to
> become stable again by -EAGAIN.

I agree, this case probably should be -EAGAIN. OTOH, with -EBUSY, we
know that a userfaultfd_copy fallback or some other operation needs to
be performed before retrying. In most of the -EAGAIN cases, we simply
have to retry without taking any additional steps.

> Good to know you have some fallbacks implemented in userspace.
> That makes the issue less serious now.

I have been assuming so far that the reason I'm getting -EBUSY is that
the page is somehow COW-shared with the zygote. But what you pointed
out makes a lot more sense.

[...]

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
  2025-05-22 23:23 [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out Barry Song
  2025-05-22 23:43 ` Lokesh Gidra
@ 2025-05-26 12:39 ` David Hildenbrand
  2025-05-27  4:17   ` Barry Song
  1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-05-26 12:39 UTC (permalink / raw)
To: Barry Song, Peter Xu, Suren Baghdasaryan, Lokesh Gidra,
    Andrea Arcangeli
Cc: Andrew Morton, Linux-MM, Kairui Song, LKML

On 23.05.25 01:23, Barry Song wrote:
> Hi All,

Hi!

[...]
> I understand that shared folios should not be moved. However, in this
> case, the folio is not shared, yet its exclusive flag is not set.
>
> Therefore, I believe PageAnonExclusive is not a reliable indicator of
> whether a folio is truly exclusive to a process.

It is. The flag *not* being set is not a reliable indicator of whether
it is really shared. ;)

The reason why we have this PAE workaround (dropping the flag) in place
is that the page must not be written to (SWP_STABLE_WRITES). CoW reuse
is not possible.

uffd moving that page -- and in that same process setting it writable,
see move_present_pte()->pte_mkwrite() -- would be very bad.

> The kernel log output is shown below:
> [   23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0 swapcache:1
>
> I'm still struggling to find a real fix; it seems quite challenging.

PAE tells you that you can immediately write to that page without going
through CoW. However, here, CoW is required.

> Please let me know if you have any ideas. In any case, it seems
> userspace should fall back to userfaultfd_copy.

We could try detecting whether the page is now exclusive, to reset PAE.
That will only be possible after writeback has completed, so it adds
complexity without being able to move the page in all cases (during
writeback).

Letting userspace deal with that in these rare scenarios is
significantly easier.

--
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-26 12:39 ` David Hildenbrand @ 2025-05-27 4:17 ` Barry Song 2025-05-27 8:37 ` Barry Song 0 siblings, 1 reply; 10+ messages in thread From: Barry Song @ 2025-05-27 4:17 UTC (permalink / raw) To: David Hildenbrand Cc: Peter Xu, Suren Baghdasaryan, Lokesh Gidra, Andrea Arcangeli, Andrew Morton, Linux-MM, Kairui Song, LKML On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@redhat.com> wrote: > > On 23.05.25 01:23, Barry Song wrote: > > Hi All, > > Hi! > > > > > I'm encountering another bug that can be easily reproduced using the small > > program below[1], which performs swap-out and swap-in in parallel. > > > > The issue occurs when a folio is being swapped out while it is accessed > > concurrently. In this case, do_swap_page() handles the access. However, > > because the folio is under writeback, do_swap_page() completely removes > > its exclusive attribute. > > > > do_swap_page: > > } else if (exclusive && folio_test_writeback(folio) && > > data_race(si->flags & SWP_STABLE_WRITES)) { > > ... > > exclusive = false; > > > > As a result, userfaultfd_move() will return -EBUSY, even though the > > folio is not shared and is in fact exclusively owned. > > > > folio = vm_normal_folio(src_vma, src_addr, > > orig_src_pte); > > if (!folio || !PageAnonExclusive(&folio->page)) { > > spin_unlock(src_ptl); > > + pr_err("%s %d folio:%lx exclusive:%d > > swapcache:%d\n", > > + __func__, __LINE__, folio, > > PageAnonExclusive(&folio->page), > > + folio_test_swapcache(folio)); > > err = -EBUSY; > > goto out; > > } > > > > I understand that shared folios should not be moved. However, in this > > case, the folio is not shared, yet its exclusive flag is not set. > > > > Therefore, I believe PageAnonExclusive is not a reliable indicator of > > whether a folio is truly exclusive to a process. > > It is. 
The flag *not* being set is not a reliable indicator whether it > is really shared. ;) > > The reason why we have this PAE workaround (dropping the flag) in place > is because the page must not be written to (SWP_STABLE_WRITES). CoW > reuse is not possible. > > uffd moving that page -- and in that same process setting it writable, > see move_present_pte()->pte_mkwrite() -- would be very bad. An alternative approach is to make the folio writable only when we are reasonably certain it is exclusive; otherwise, it remains read-only. If the destination is later written to and the folio has become exclusive, it can be reused directly. If not, a copy-on-write will occur on the destination address, transparently to userspace. This avoids Lokesh’s userspace-based strategy, which requires forcing a write to the source address. > > > > > The kernel log output is shown below: > > [ 23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0 > > swapcache:1 > > > > I'm still struggling to find a real fix; it seems quite challenging. > > PAE tells you that you can immediately write to that page without going > through CoW. However, here, CoW is required. > > > Please let me know if you have any ideas. In any case It seems > > userspace should fall back to userfaultfd_copy. > > We could try detecting whether the page is now exclusive, to reset PAE. > That will only be possible after writeback completed, so it adds > complexity without being able to move the page in all cases (during > writeback). > > Letting userspace deal with that in these rate scenarios is > significantly easier. Right, this appears to introduce the least change—essentially none—to the kernel, while shifting more noise to userspace :-) > > -- > Cheers, > > David / dhildenb > Thanks Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
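Lokesh's userspace-based strategy mentioned above relies on forcing a write to the source address before retrying the move. As a hedged illustration (the helper name is invented, and real code would follow this with a UFFDIO_MOVE retry), the forced write can be done by storing a byte back with its own value:

```c
/* Illustrative sketch of "force a write to the source address": storing
 * a byte back unchanged still takes a write fault when the PTE is
 * mapped read-only, so the kernel resolves the CoW/unshare before the
 * move is retried. volatile keeps the compiler from eliding the access;
 * the page contents are not modified. */
static void force_write_fault(volatile unsigned char *p)
{
	unsigned char c = *p;	/* read current contents */

	*p = c;			/* write the same value back */
}
```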
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-27 4:17 ` Barry Song @ 2025-05-27 8:37 ` Barry Song 2025-05-27 9:00 ` David Hildenbrand 0 siblings, 1 reply; 10+ messages in thread From: Barry Song @ 2025-05-27 8:37 UTC (permalink / raw) To: david Cc: 21cnbao, aarcange, akpm, linux-kernel, linux-mm, lokeshgidra, peterx, ryncsn, surenb On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@redhat.com> wrote: > > > > On 23.05.25 01:23, Barry Song wrote: > > > Hi All, > > > > Hi! > > > > > > > > I'm encountering another bug that can be easily reproduced using the small > > > program below[1], which performs swap-out and swap-in in parallel. > > > > > > The issue occurs when a folio is being swapped out while it is accessed > > > concurrently. In this case, do_swap_page() handles the access. However, > > > because the folio is under writeback, do_swap_page() completely removes > > > its exclusive attribute. > > > > > > do_swap_page: > > > } else if (exclusive && folio_test_writeback(folio) && > > > data_race(si->flags & SWP_STABLE_WRITES)) { > > > ... > > > exclusive = false; > > > > > > As a result, userfaultfd_move() will return -EBUSY, even though the > > > folio is not shared and is in fact exclusively owned. > > > > > > folio = vm_normal_folio(src_vma, src_addr, > > > orig_src_pte); > > > if (!folio || !PageAnonExclusive(&folio->page)) { > > > spin_unlock(src_ptl); > > > + pr_err("%s %d folio:%lx exclusive:%d > > > swapcache:%d\n", > > > + __func__, __LINE__, folio, > > > PageAnonExclusive(&folio->page), > > > + folio_test_swapcache(folio)); > > > err = -EBUSY; > > > goto out; > > > } > > > > > > I understand that shared folios should not be moved. However, in this > > > case, the folio is not shared, yet its exclusive flag is not set. 
> > > > > > Therefore, I believe PageAnonExclusive is not a reliable indicator of > > > whether a folio is truly exclusive to a process. > > > > It is. The flag *not* being set is not a reliable indicator whether it > > is really shared. ;) > > > > The reason why we have this PAE workaround (dropping the flag) in place > > is because the page must not be written to (SWP_STABLE_WRITES). CoW > > reuse is not possible. > > > > uffd moving that page -- and in that same process setting it writable, > > see move_present_pte()->pte_mkwrite() -- would be very bad. > > An alternative approach is to make the folio writable only when we are > reasonably certain it is exclusive; otherwise, it remains read-only. If the > destination is later written to and the folio has become exclusive, it can > be reused directly. If not, a copy-on-write will occur on the destination > address, transparently to userspace. This avoids Lokesh’s userspace-based > strategy, which requires forcing a write to the source address. 
Conceptually, I mean something like this:

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..70eaabf4f1a3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm,
 	}
 	if (folio_test_large(src_folio) ||
 	    folio_maybe_dma_pinned(src_folio) ||
-	    !PageAnonExclusive(&src_folio->page)) {
+	    (!PageAnonExclusive(&src_folio->page) &&
+	     folio_mapcount(src_folio) != 1)) {
 		err = -EBUSY;
 		goto out;
 	}
@@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm,
 #endif
 	if (pte_dirty(orig_src_pte))
 		orig_dst_pte = pte_mkdirty(orig_dst_pte);
-	orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
+	if (PageAnonExclusive(&src_folio->page))
+		orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);

 	set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
 out:
@@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	}

 	folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
-	if (!folio || !PageAnonExclusive(&folio->page)) {
+	if (!folio || (!PageAnonExclusive(&folio->page) &&
+		       folio_mapcount(folio) != 1)) {
 		spin_unlock(src_ptl);
 		err = -EBUSY;
 		goto out;

I'm not trying to push this approach—unless Lokesh clearly sees that it
could reduce userspace noise. I'm mainly just curious how we might make
the fixup transparent to userspace. :-)

> >
> > >
> > > The kernel log output is shown below:
> > > [   23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0
> > > swapcache:1
> > >
> > > I'm still struggling to find a real fix; it seems quite challenging.
> >
> > PAE tells you that you can immediately write to that page without going
> > through CoW. However, here, CoW is required.
> >
> > > Please let me know if you have any ideas. In any case It seems
> > > userspace should fall back to userfaultfd_copy.
> >
> > We could try detecting whether the page is now exclusive, to reset PAE.
> > That will only be possible after writeback completed, so it adds > > complexity without being able to move the page in all cases (during > > writeback). > > > > Letting userspace deal with that in these rate scenarios is > > significantly easier. > > Right, this appears to introduce the least change—essentially none—to the > kernel, while shifting more noise to userspace :-) > > > > > -- > > Cheers, > > > > David / dhildenb > > > Thanks Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-27 8:37 ` Barry Song @ 2025-05-27 9:00 ` David Hildenbrand 2025-05-27 9:31 ` Barry Song 0 siblings, 1 reply; 10+ messages in thread From: David Hildenbrand @ 2025-05-27 9:00 UTC (permalink / raw) To: Barry Song Cc: aarcange, akpm, linux-kernel, linux-mm, lokeshgidra, peterx, ryncsn, surenb On 27.05.25 10:37, Barry Song wrote: > On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@gmail.com> wrote: >> >> On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> On 23.05.25 01:23, Barry Song wrote: >>>> Hi All, >>> >>> Hi! >>> >>>> >>>> I'm encountering another bug that can be easily reproduced using the small >>>> program below[1], which performs swap-out and swap-in in parallel. >>>> >>>> The issue occurs when a folio is being swapped out while it is accessed >>>> concurrently. In this case, do_swap_page() handles the access. However, >>>> because the folio is under writeback, do_swap_page() completely removes >>>> its exclusive attribute. >>>> >>>> do_swap_page: >>>> } else if (exclusive && folio_test_writeback(folio) && >>>> data_race(si->flags & SWP_STABLE_WRITES)) { >>>> ... >>>> exclusive = false; >>>> >>>> As a result, userfaultfd_move() will return -EBUSY, even though the >>>> folio is not shared and is in fact exclusively owned. >>>> >>>> folio = vm_normal_folio(src_vma, src_addr, >>>> orig_src_pte); >>>> if (!folio || !PageAnonExclusive(&folio->page)) { >>>> spin_unlock(src_ptl); >>>> + pr_err("%s %d folio:%lx exclusive:%d >>>> swapcache:%d\n", >>>> + __func__, __LINE__, folio, >>>> PageAnonExclusive(&folio->page), >>>> + folio_test_swapcache(folio)); >>>> err = -EBUSY; >>>> goto out; >>>> } >>>> >>>> I understand that shared folios should not be moved. However, in this >>>> case, the folio is not shared, yet its exclusive flag is not set. 
>>>> >>>> Therefore, I believe PageAnonExclusive is not a reliable indicator of >>>> whether a folio is truly exclusive to a process. >>> >>> It is. The flag *not* being set is not a reliable indicator whether it >>> is really shared. ;) >>> >>> The reason why we have this PAE workaround (dropping the flag) in place >>> is because the page must not be written to (SWP_STABLE_WRITES). CoW >>> reuse is not possible. >>> >>> uffd moving that page -- and in that same process setting it writable, >>> see move_present_pte()->pte_mkwrite() -- would be very bad. >> >> An alternative approach is to make the folio writable only when we are >> reasonably certain it is exclusive; otherwise, it remains read-only. If the >> destination is later written to and the folio has become exclusive, it can >> be reused directly. If not, a copy-on-write will occur on the destination >> address, transparently to userspace. This avoids Lokesh’s userspace-based >> strategy, which requires forcing a write to the source address. 
>
> Conceptually, I mean something like this:
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..70eaabf4f1a3 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm,
>  	}
>  	if (folio_test_large(src_folio) ||
>  	    folio_maybe_dma_pinned(src_folio) ||
> -	    !PageAnonExclusive(&src_folio->page)) {
> +	    (!PageAnonExclusive(&src_folio->page) &&
> +	     folio_mapcount(src_folio) != 1)) {
>  		err = -EBUSY;
>  		goto out;
>  	}
> @@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm,
>  #endif
>  	if (pte_dirty(orig_src_pte))
>  		orig_dst_pte = pte_mkdirty(orig_dst_pte);
> -	orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> +	if (PageAnonExclusive(&src_folio->page))
> +		orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
>
>  	set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
>  out:
> @@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  	}
>
>  	folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> -	if (!folio || !PageAnonExclusive(&folio->page)) {
> +	if (!folio || (!PageAnonExclusive(&folio->page) &&
> +		       folio_mapcount(folio) != 1)) {
>  		spin_unlock(src_ptl);
>  		err = -EBUSY;
>  		goto out;
>
> I'm not trying to push this approach—unless Lokesh clearly sees that it
> could reduce userspace noise. I'm mainly just curious how we might make
> the fixup transparent to userspace. :-)

And that reveals the exact problem: it's all *very* complicated. :)

... and dangerous when we use the mapcount without having a complete
understanding how it all works.

What we would have to do for a small folio is

1) Take the folio lock

2) Make sure there is only this present page table mapping:
   folio_mapcount(folio) == 1, or better
   !folio_maybe_mapped_shared(folio);

3) Make sure that there are no swap references

   If in the swapcache, check the actual swapcount

4) Make sure it is not a KSM folio

THPs are way, way, way more complicated to get right that way.
Likely, the scenario described above cannot happen with a PMD-mapped THP
for now at least (we don't have PMD swap entries).

Of course, we'd then also have to handle the case when we have a swap
pte where the marker is not set (e.g., because of swapout after the
described swapin where we dropped the marker).

What could be easier is triggering a FAULT_FLAG_UNSHARE fault. It's
arguably less optimal in case the core will decide to swapin / CoW, but
it leaves the magic to get all this right to the core -- and mimics the
approach Lokesh uses.

But then, maybe letting userspace just do a uffdio_copy would be even
faster (only a single TLB shootdown?).

I am also skeptical of calling this a BUG here. It's described to behave
exactly like that [1]:

       EBUSY
              The pages in the source virtual memory range are either
              pinned or not exclusive to the process. The kernel might
              only perform lightweight checks for detecting whether the
              pages are exclusive. To make the operation more likely to
              succeed, KSM should be disabled, fork() should be avoided
              or MADV_DONTFORK should be configured for the source
              virtual memory area before fork().

Note the "lightweight" and "more likely to succeed".

[1] https://lore.kernel.org/lkml/20231206103702.3873743-3-surenb@google.com/

--
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread
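The small-folio checklist above can be condensed into a toy predicate. This is purely illustrative userspace C, not kernel code: the struct and its fields merely model the state that the real checks (folio lock, folio_maybe_mapped_shared(), swap count, KSM flag) would consult.

```c
/* Toy model of the small-folio exclusivity checks sketched above.
 * Not kernel code: the struct only models the state the real checks
 * would consult before it would be safe to reset PageAnonExclusive. */
#include <stdbool.h>

struct folio_model {
	bool locked;		/* 1) folio lock held */
	int  mapcount;		/* 2) number of page table mappings */
	int  swapcount;		/* 3) remaining swap references */
	bool is_ksm;		/* 4) KSM folio */
};

static bool can_reset_anon_exclusive(const struct folio_model *f)
{
	if (!f->locked)
		return false;	/* must hold the folio lock */
	if (f->mapcount != 1)
		return false;	/* only this present mapping allowed */
	if (f->swapcount != 0)
		return false;	/* no swap entries may still reference it */
	if (f->is_ksm)
		return false;	/* KSM folios are never exclusive */
	return true;
}
```

Every one of these conditions must hold at once, under the folio lock, which is why the thread concludes the fallback is better left to userspace.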
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-27 9:00 ` David Hildenbrand @ 2025-05-27 9:31 ` Barry Song 2025-05-27 11:06 ` David Hildenbrand 0 siblings, 1 reply; 10+ messages in thread From: Barry Song @ 2025-05-27 9:31 UTC (permalink / raw) To: David Hildenbrand Cc: aarcange, akpm, linux-kernel, linux-mm, lokeshgidra, peterx, ryncsn, surenb On Tue, May 27, 2025 at 5:00 PM David Hildenbrand <david@redhat.com> wrote: > > On 27.05.25 10:37, Barry Song wrote: > > On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@gmail.com> wrote: > >> > >> On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 23.05.25 01:23, Barry Song wrote: > >>>> Hi All, > >>> > >>> Hi! > >>> > >>>> > >>>> I'm encountering another bug that can be easily reproduced using the small > >>>> program below[1], which performs swap-out and swap-in in parallel. > >>>> > >>>> The issue occurs when a folio is being swapped out while it is accessed > >>>> concurrently. In this case, do_swap_page() handles the access. However, > >>>> because the folio is under writeback, do_swap_page() completely removes > >>>> its exclusive attribute. > >>>> > >>>> do_swap_page: > >>>> } else if (exclusive && folio_test_writeback(folio) && > >>>> data_race(si->flags & SWP_STABLE_WRITES)) { > >>>> ... > >>>> exclusive = false; > >>>> > >>>> As a result, userfaultfd_move() will return -EBUSY, even though the > >>>> folio is not shared and is in fact exclusively owned. > >>>> > >>>> folio = vm_normal_folio(src_vma, src_addr, > >>>> orig_src_pte); > >>>> if (!folio || !PageAnonExclusive(&folio->page)) { > >>>> spin_unlock(src_ptl); > >>>> + pr_err("%s %d folio:%lx exclusive:%d > >>>> swapcache:%d\n", > >>>> + __func__, __LINE__, folio, > >>>> PageAnonExclusive(&folio->page), > >>>> + folio_test_swapcache(folio)); > >>>> err = -EBUSY; > >>>> goto out; > >>>> } > >>>> > >>>> I understand that shared folios should not be moved. 
However, in this > >>>> case, the folio is not shared, yet its exclusive flag is not set. > >>>> > >>>> Therefore, I believe PageAnonExclusive is not a reliable indicator of > >>>> whether a folio is truly exclusive to a process. > >>> > >>> It is. The flag *not* being set is not a reliable indicator whether it > >>> is really shared. ;) > >>> > >>> The reason why we have this PAE workaround (dropping the flag) in place > >>> is because the page must not be written to (SWP_STABLE_WRITES). CoW > >>> reuse is not possible. > >>> > >>> uffd moving that page -- and in that same process setting it writable, > >>> see move_present_pte()->pte_mkwrite() -- would be very bad. > >> > >> An alternative approach is to make the folio writable only when we are > >> reasonably certain it is exclusive; otherwise, it remains read-only. If the > >> destination is later written to and the folio has become exclusive, it can > >> be reused directly. If not, a copy-on-write will occur on the destination > >> address, transparently to userspace. This avoids Lokesh’s userspace-based > >> strategy, which requires forcing a write to the source address. 
> > > > Conceptually, I mean something like this: > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > index bc473ad21202..70eaabf4f1a3 100644 > > --- a/mm/userfaultfd.c > > +++ b/mm/userfaultfd.c > > @@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm, > > } > > if (folio_test_large(src_folio) || > > folio_maybe_dma_pinned(src_folio) || > > - !PageAnonExclusive(&src_folio->page)) { > > + (!PageAnonExclusive(&src_folio->page) && > > + folio_mapcount(src_folio) != 1)) { > > err = -EBUSY; > > goto out; > > } > > @@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm, > > #endif > > if (pte_dirty(orig_src_pte)) > > orig_dst_pte = pte_mkdirty(orig_dst_pte); > > - orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma); > > + if (PageAnonExclusive(&src_folio->page)) > > + orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma); > > > > set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte); > > out: > > @@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, > > } > > > > folio = vm_normal_folio(src_vma, src_addr, orig_src_pte); > > - if (!folio || !PageAnonExclusive(&folio->page)) { > > + if (!folio || (!PageAnonExclusive(&folio->page) && > > + folio_mapcount(folio) != 1)) { > > spin_unlock(src_ptl); > > err = -EBUSY; > > goto out; > > > > I'm not trying to push this approach—unless Lokesh clearly sees that it > > could reduce userspace noise. I'm mainly just curious how we might make > > the fixup transparent to userspace. :-) > > And that reveals the exact problem: it's all *very* complicated. :) > > ... and dangerous when we use the mapcount without having a complete > understanding how it all works. 
> > > What we would have to do for a small folio is > > 1) Take the folio lock > > 2) Make sure there is only this present page table mapping: > folio_mapcount(folio) != 1 > > of better > > !folio_maybe_mapped_shared(folio); > > 3) Make sure that there are no swap references > > If in the swapcache, check the actual swapcount > > 3) Make sure it is not a KSM folio > > > THPs are way, way, way more complicated to get right that way. Likely, > the scenario described above cannot happen with a PMD-mapped THP for now > at least (we don't have PMD swap entries). Yeah, this can get really complicated. > > > Of course, we'd then also have to handle the case when we have a swap > pte where the marker is not set (e.g., because of swapout after the > described swapin where we dropped the marker). > > > What could be easier is triggering a FAULT_FLAG_UNSHARE fault. It's > arguably less optimal in case the core will decide to swapin / CoW, but > it leaves the magic to get all this right to the core -- and mimics the > approach Lokesh uses. > > But then, maybe letting userspace just do a uffdio_copy would be even > faster (only a single TLB shootdown?). > > > I am also skeptical of calling this a BUG here. It's described to behave > exactly like that [1]: > > EBUSY > The pages in the source virtual memory range are either > pinned or not exclusive to the process. The kernel might > only perform lightweight checks for detecting whether the > pages are exclusive. To make the operation more likely to > succeed, KSM should be disabled, fork() should be avoided > or MADV_DONTFORK should be configured for the source > virtual memory area before fork(). > > Note the "lightweight" and "more likely to succeed". > Initially, my point was that an exclusive folio (single-process case) should be movable. Now I understand this isn’t a bug, but rather a compromise made due to implementation constraints. 
Perhaps the remaining value of this report is that it helped better understand scenarios beyond fork where a move might also fail. I truly appreciate your time and your clear analysis. > > [1] https://lore.kernel.org/lkml/20231206103702.3873743-3-surenb@google.com/ > > -- > Cheers, > > David / dhildenb > Thanks Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-27 9:31 ` Barry Song @ 2025-05-27 11:06 ` David Hildenbrand 0 siblings, 0 replies; 10+ messages in thread From: David Hildenbrand @ 2025-05-27 11:06 UTC (permalink / raw) To: Barry Song Cc: aarcange, akpm, linux-kernel, linux-mm, lokeshgidra, peterx, ryncsn, surenb >> >> EBUSY >> The pages in the source virtual memory range are either >> pinned or not exclusive to the process. The kernel might >> only perform lightweight checks for detecting whether the >> pages are exclusive. To make the operation more likely to >> succeed, KSM should be disabled, fork() should be avoided >> or MADV_DONTFORK should be configured for the source >> virtual memory area before fork(). >> >> Note the "lightweight" and "more likely to succeed". >> > > Initially, my point was that an exclusive folio (single-process case) > should be movable. Yeah, I would wish that we wouldn't need that PAE hack in the swapin code. I was asking myself if we could just ... wait for writeback to end in that case? I mean, if we would have to swap in the folio we would also have to wait for disk I/O ... so here we would also have to wait for disk I/O. We could either wait for writeback before mapping the folio, or set the PAE bit and map the page R/O, to then wait for writeback during write faults. The latter has the downside that we have to handle it with more complexity during write faults (check if page is under writeback, then check if we require this sync I/O during write faults even though PAE is set ...). > Now I understand this isn’t a bug, but rather a compromise made due > to implementation constraints. That is a good summary! > Perhaps the remaining value of this report is that it helped better > understand scenarios beyond fork where a move might also fail. > > I truly appreciate your time and your clear analysis. 
YW :) -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-05-27 11:06 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-22 23:23 [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out Barry Song
2025-05-22 23:43 ` Lokesh Gidra
2025-05-22 23:53   ` Barry Song
2025-05-23  0:03     ` Lokesh Gidra
2025-05-26 12:39 ` David Hildenbrand
2025-05-27  4:17   ` Barry Song
2025-05-27  8:37     ` Barry Song
2025-05-27  9:00       ` David Hildenbrand
2025-05-27  9:31         ` Barry Song
2025-05-27 11:06           ` David Hildenbrand