* [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
@ 2025-05-22 23:23 Barry Song
2025-05-22 23:43 ` Lokesh Gidra
2025-05-26 12:39 ` David Hildenbrand
0 siblings, 2 replies; 10+ messages in thread
From: Barry Song @ 2025-05-22 23:23 UTC (permalink / raw)
To: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Lokesh Gidra,
Andrea Arcangeli
Cc: Andrew Morton, Linux-MM, Kairui Song, LKML
Hi All,
I'm encountering another bug that can be easily reproduced using the small
program below[1], which performs swap-out and swap-in in parallel.
The issue occurs when a folio is being swapped out while it is accessed
concurrently. In this case, do_swap_page() handles the access. However,
because the folio is under writeback, do_swap_page() completely removes
its exclusive attribute.
do_swap_page:
} else if (exclusive && folio_test_writeback(folio) &&
data_race(si->flags & SWP_STABLE_WRITES)) {
...
exclusive = false;
As a result, userfaultfd_move() will return -EBUSY, even though the
folio is not shared and is in fact exclusively owned.
		folio = vm_normal_folio(src_vma, src_addr,
					orig_src_pte);
		if (!folio || !PageAnonExclusive(&folio->page)) {
			spin_unlock(src_ptl);
+			pr_err("%s %d folio:%lx exclusive:%d swapcache:%d\n",
+			       __func__, __LINE__, folio,
+			       PageAnonExclusive(&folio->page),
+			       folio_test_swapcache(folio));
			err = -EBUSY;
			goto out;
		}
I understand that shared folios should not be moved. However, in this
case, the folio is not shared, yet its exclusive flag is not set.
Therefore, I believe PageAnonExclusive is not a reliable indicator of
whether a folio is truly exclusive to a process.
The kernel log output is shown below:
[   23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0 swapcache:1
I'm still struggling to find a real fix; it seems quite challenging.
Please let me know if you have any ideas. In any case, it seems
userspace should fall back to userfaultfd_copy.
[1] The small program:
//Just in a couple of seconds, we are running into
//"UFFDIO_MOVE: Device or resource busy"
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <poll.h>
#include <errno.h>
#define PAGE_SIZE 4096
#define REGION_SIZE (512 * 1024)
#ifndef UFFDIO_MOVE
struct uffdio_move {
__u64 dst;
__u64 src;
__u64 len;
#define UFFDIO_MOVE_MODE_DONTWAKE ((__u64)1<<0)
#define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
__u64 mode;
__s64 move;
};
#define _UFFDIO_MOVE (0x05)
#define UFFDIO_MOVE _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
#endif
void *src, *dst;
int uffd;
void *madvise_thread(void *arg) {
for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
madvise(src + i, PAGE_SIZE, MADV_PAGEOUT);
usleep(100);
}
return NULL;
}
void *swapin_thread(void *arg) {
volatile char dummy;
for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
dummy = ((char *)src)[i];
usleep(100);
}
return NULL;
}
void *fault_handler_thread(void *arg) {
struct uffd_msg msg;
struct uffdio_move move;
struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
while (1) {
if (poll(&pollfd, 1, -1) == -1) {
perror("poll");
exit(EXIT_FAILURE);
}
if (read(uffd, &msg, sizeof(msg)) <= 0) {
perror("read");
exit(EXIT_FAILURE);
}
if (msg.event != UFFD_EVENT_PAGEFAULT) {
fprintf(stderr, "Unexpected event\n");
exit(EXIT_FAILURE);
}
move.src = (unsigned long)src + (msg.arg.pagefault.address -
(unsigned long)dst);
move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
move.len = PAGE_SIZE;
move.mode = 0;
if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
perror("UFFDIO_MOVE");
exit(EXIT_FAILURE);
}
}
return NULL;
}
int main() {
again:	;	/* null statement: a label must precede a statement, not a declaration */
pthread_t thr, madv_thr, swapin_thr;
struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
struct uffdio_register uffdio_register;
src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE
| MAP_ANONYMOUS, -1, 0);
if (src == MAP_FAILED) {
perror("mmap src");
exit(EXIT_FAILURE);
}
memset(src, 1, REGION_SIZE);
dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE
| MAP_ANONYMOUS, -1, 0);
if (dst == MAP_FAILED) {
perror("mmap dst");
exit(EXIT_FAILURE);
}
uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
if (uffd == -1) {
perror("userfaultfd");
exit(EXIT_FAILURE);
}
if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
perror("UFFDIO_API");
exit(EXIT_FAILURE);
}
uffdio_register.range.start = (unsigned long)dst;
uffdio_register.range.len = REGION_SIZE;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
perror("UFFDIO_REGISTER");
exit(EXIT_FAILURE);
}
if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
perror("pthread_create madvise_thread");
exit(EXIT_FAILURE);
}
if (pthread_create(&swapin_thr, NULL, swapin_thread, NULL) != 0) {
perror("pthread_create swapin_thread");
exit(EXIT_FAILURE);
}
if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
perror("pthread_create fault_handler_thread");
exit(EXIT_FAILURE);
}
for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
char val = ((char *)dst)[i];
printf("Accessing dst at offset %zu, value: %d\n", i, val);
}
pthread_join(madv_thr, NULL);
pthread_join(swapin_thr, NULL);
pthread_cancel(thr);
pthread_join(thr, NULL);
munmap(src, REGION_SIZE);
munmap(dst, REGION_SIZE);
close(uffd);
goto again;
return 0;
}
Thanks
Barry
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
  2025-05-22 23:23 [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out Barry Song
@ 2025-05-22 23:43 ` Lokesh Gidra
  2025-05-22 23:53   ` Barry Song
  2025-05-26 12:39 ` David Hildenbrand
  1 sibling, 1 reply; 10+ messages in thread
From: Lokesh Gidra @ 2025-05-22 23:43 UTC (permalink / raw)
To: Barry Song
Cc: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Andrea Arcangeli,
    Andrew Morton, Linux-MM, Kairui Song, LKML

Thanks Barry for stress testing the MOVE ioctl. It's really helpful :)

On Thu, May 22, 2025 at 4:23 PM Barry Song <21cnbao@gmail.com> wrote:
>
> Hi All,
>
> I'm encountering another bug that can be easily reproduced using the small
> program below[1], which performs swap-out and swap-in in parallel.
>
[...]
>
> I understand that shared folios should not be moved. However, in this
> case, the folio is not shared, yet its exclusive flag is not set.
>
> Therefore, I believe PageAnonExclusive is not a reliable indicator of
> whether a folio is truly exclusive to a process.
>
> I'm still struggling to find a real fix; it seems quite challenging.
> Please let me know if you have any ideas. In any case, it seems
> userspace should fall back to userfaultfd_copy.
>
I'm not sure this is really a bug. A page under write-back is in a way
'busy', isn't it? I am not an expert on anon-exclusive, but it seems to
me that an exclusively mapped anonymous page would have it true. So,
isn't it expected that a page under write-back will not have it set, as
the page isn't mapped?

I have observed this in my testing as well, and there are a couple of
ways to deal with it in userspace. As you suggested, falling back to
userfaultfd_copy on receiving -EBUSY is one option. In my case, making
a fake store on the src page and then retrying has been working fine.
> [1] The small program:
[...]
> Thanks
> Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
  2025-05-22 23:43 ` Lokesh Gidra
@ 2025-05-22 23:53   ` Barry Song
  2025-05-23  0:03     ` Lokesh Gidra
  0 siblings, 1 reply; 10+ messages in thread
From: Barry Song @ 2025-05-22 23:53 UTC (permalink / raw)
To: Lokesh Gidra
Cc: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Andrea Arcangeli,
    Andrew Morton, Linux-MM, Kairui Song, LKML

On Fri, May 23, 2025 at 11:44 AM Lokesh Gidra <lokeshgidra@google.com> wrote:
>
> Thanks Barry for stress testing the MOVE ioctl. It's really helpful :)
>
[...]
> I'm not sure this is really a bug. A page under write-back is in a way
> 'busy', isn't it? I am not an expert on anon-exclusive, but it seems to
> me that an exclusively mapped anonymous page would have it true. So,
> isn't it expected that a page under write-back will not have it set, as
> the page isn't mapped?

We have two return codes: -EAGAIN and -EBUSY. In many cases, we return
-EAGAIN, which is transparent to userspace because the syscall is retried.

Therefore, I would expect -EAGAIN or a similar code here to avoid
userspace noise, since we already handle other cases, in which folios
are undergoing transitions and will become stable again, by returning
-EAGAIN.

> I have observed this in my testing as well, and there are a couple of
> ways to deal with it in userspace. As you suggested, falling back to
> userfaultfd_copy on receiving -EBUSY is one option. In my case, making
> a fake store on the src page and then retrying has been working fine.

Good to know you have some fallbacks implemented in userspace.
That makes the issue less serious now.

[...]

> Thanks
> Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
  2025-05-22 23:53   ` Barry Song
@ 2025-05-23  0:03     ` Lokesh Gidra
  0 siblings, 0 replies; 10+ messages in thread
From: Lokesh Gidra @ 2025-05-23 0:03 UTC (permalink / raw)
To: Barry Song
Cc: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Andrea Arcangeli,
    Andrew Morton, Linux-MM, Kairui Song, LKML

On Thu, May 22, 2025 at 4:53 PM Barry Song <21cnbao@gmail.com> wrote:
>
[...]
> We have two return codes: -EAGAIN and -EBUSY. In many cases, we return
> -EAGAIN, which is transparent to userspace because the syscall is retried.
>
> Therefore, I expect -EAGAIN or a similar code here to avoid userspace noise,
> since we handle other cases where folios are undergoing transitions to
> become stable again by -EAGAIN.

I agree, this case probably should be -EAGAIN. OTOH, with -EBUSY, we
know that a userfaultfd_copy fallback or some other operation needs to
be performed before retrying. In most of the -EAGAIN cases, we simply
have to retry without taking any additional steps.

> Good to know you have some fallbacks implemented in userspace.
> That makes the issue less serious now.

I have been assuming so far that the reason I'm getting -EBUSY is that
the page is somehow COW-shared with the zygote. But what you pointed
out makes a lot more sense.

[...]

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
  2025-05-22 23:23 [BUG] userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out Barry Song
  2025-05-22 23:43 ` Lokesh Gidra
@ 2025-05-26 12:39 ` David Hildenbrand
  2025-05-27  4:17   ` Barry Song
  1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2025-05-26 12:39 UTC (permalink / raw)
To: Barry Song, Peter Xu, Suren Baghdasaryan, Lokesh Gidra,
    Andrea Arcangeli
Cc: Andrew Morton, Linux-MM, Kairui Song, LKML

On 23.05.25 01:23, Barry Song wrote:
> Hi All,

Hi!

[...]
> I understand that shared folios should not be moved. However, in this
> case, the folio is not shared, yet its exclusive flag is not set.
>
> Therefore, I believe PageAnonExclusive is not a reliable indicator of
> whether a folio is truly exclusive to a process.

It is. The flag *not* being set is not a reliable indicator of whether
it is really shared. ;)

The reason why we have this PAE workaround (dropping the flag) in place
is that the page must not be written to (SWP_STABLE_WRITES). CoW reuse
is not possible.

uffd moving that page -- and in that same process setting it writable,
see move_present_pte()->pte_mkwrite() -- would be very bad.

> The kernel log output is shown below:
> [   23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0 swapcache:1
>
> I'm still struggling to find a real fix; it seems quite challenging.

PAE tells you that you can immediately write to that page without going
through CoW. However, here, CoW is required.

> Please let me know if you have any ideas. In any case, it seems
> userspace should fall back to userfaultfd_copy.

We could try detecting whether the page is now exclusive, to reset PAE.
That will only be possible after writeback has completed, so it adds
complexity without being able to move the page in all cases (during
writeback).

Letting userspace deal with that in these rare scenarios is
significantly easier.

--
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-26 12:39 ` David Hildenbrand @ 2025-05-27 4:17 ` Barry Song 2025-05-27 8:37 ` Barry Song 0 siblings, 1 reply; 10+ messages in thread From: Barry Song @ 2025-05-27 4:17 UTC (permalink / raw) To: David Hildenbrand Cc: Peter Xu, Suren Baghdasaryan, Lokesh Gidra, Andrea Arcangeli, Andrew Morton, Linux-MM, Kairui Song, LKML On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@redhat.com> wrote: > > On 23.05.25 01:23, Barry Song wrote: > > Hi All, > > Hi! > > > > > I'm encountering another bug that can be easily reproduced using the small > > program below[1], which performs swap-out and swap-in in parallel. > > > > The issue occurs when a folio is being swapped out while it is accessed > > concurrently. In this case, do_swap_page() handles the access. However, > > because the folio is under writeback, do_swap_page() completely removes > > its exclusive attribute. > > > > do_swap_page: > > } else if (exclusive && folio_test_writeback(folio) && > > data_race(si->flags & SWP_STABLE_WRITES)) { > > ... > > exclusive = false; > > > > As a result, userfaultfd_move() will return -EBUSY, even though the > > folio is not shared and is in fact exclusively owned. > > > > folio = vm_normal_folio(src_vma, src_addr, > > orig_src_pte); > > if (!folio || !PageAnonExclusive(&folio->page)) { > > spin_unlock(src_ptl); > > + pr_err("%s %d folio:%lx exclusive:%d > > swapcache:%d\n", > > + __func__, __LINE__, folio, > > PageAnonExclusive(&folio->page), > > + folio_test_swapcache(folio)); > > err = -EBUSY; > > goto out; > > } > > > > I understand that shared folios should not be moved. However, in this > > case, the folio is not shared, yet its exclusive flag is not set. > > > > Therefore, I believe PageAnonExclusive is not a reliable indicator of > > whether a folio is truly exclusive to a process. > > It is. 
The flag *not* being set is not a reliable indicator whether it > is really shared. ;) > > The reason why we have this PAE workaround (dropping the flag) in place > is because the page must not be written to (SWP_STABLE_WRITES). CoW > reuse is not possible. > > uffd moving that page -- and in that same process setting it writable, > see move_present_pte()->pte_mkwrite() -- would be very bad. An alternative approach is to make the folio writable only when we are reasonably certain it is exclusive; otherwise, it remains read-only. If the destination is later written to and the folio has become exclusive, it can be reused directly. If not, a copy-on-write will occur on the destination address, transparently to userspace. This avoids Lokesh’s userspace-based strategy, which requires forcing a write to the source address. > > > > > The kernel log output is shown below: > > [ 23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0 > > swapcache:1 > > > > I'm still struggling to find a real fix; it seems quite challenging. > > PAE tells you that you can immediately write to that page without going > through CoW. However, here, CoW is required. > > > Please let me know if you have any ideas. In any case It seems > > userspace should fall back to userfaultfd_copy. > > We could try detecting whether the page is now exclusive, to reset PAE. > That will only be possible after writeback completed, so it adds > complexity without being able to move the page in all cases (during > writeback). > > Letting userspace deal with that in these rate scenarios is > significantly easier. Right, this appears to introduce the least change—essentially none—to the kernel, while shifting more noise to userspace :-) > > -- > Cheers, > > David / dhildenb > Thanks Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
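Lokesh's userspace-based strategy mentioned above relies on forcing a write to the source address before retrying the move. As a hedged illustration (the helper name is invented, and real code would follow this with a UFFDIO_MOVE retry), the forced write can be done by storing a byte back with its own value:

```c
/* Illustrative sketch of "force a write to the source address": storing
 * a byte back unchanged still takes a write fault when the PTE is
 * mapped read-only, so the kernel resolves the CoW/unshare before the
 * move is retried. volatile keeps the compiler from eliding the access;
 * the page contents are not modified. */
static void force_write_fault(volatile unsigned char *p)
{
	unsigned char c = *p;	/* read current contents */

	*p = c;			/* write the same value back */
}
```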
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-27 4:17 ` Barry Song @ 2025-05-27 8:37 ` Barry Song 2025-05-27 9:00 ` David Hildenbrand 0 siblings, 1 reply; 10+ messages in thread From: Barry Song @ 2025-05-27 8:37 UTC (permalink / raw) To: david Cc: 21cnbao, aarcange, akpm, linux-kernel, linux-mm, lokeshgidra, peterx, ryncsn, surenb On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@gmail.com> wrote: > > On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@redhat.com> wrote: > > > > On 23.05.25 01:23, Barry Song wrote: > > > Hi All, > > > > Hi! > > > > > > > > I'm encountering another bug that can be easily reproduced using the small > > > program below[1], which performs swap-out and swap-in in parallel. > > > > > > The issue occurs when a folio is being swapped out while it is accessed > > > concurrently. In this case, do_swap_page() handles the access. However, > > > because the folio is under writeback, do_swap_page() completely removes > > > its exclusive attribute. > > > > > > do_swap_page: > > > } else if (exclusive && folio_test_writeback(folio) && > > > data_race(si->flags & SWP_STABLE_WRITES)) { > > > ... > > > exclusive = false; > > > > > > As a result, userfaultfd_move() will return -EBUSY, even though the > > > folio is not shared and is in fact exclusively owned. > > > > > > folio = vm_normal_folio(src_vma, src_addr, > > > orig_src_pte); > > > if (!folio || !PageAnonExclusive(&folio->page)) { > > > spin_unlock(src_ptl); > > > + pr_err("%s %d folio:%lx exclusive:%d > > > swapcache:%d\n", > > > + __func__, __LINE__, folio, > > > PageAnonExclusive(&folio->page), > > > + folio_test_swapcache(folio)); > > > err = -EBUSY; > > > goto out; > > > } > > > > > > I understand that shared folios should not be moved. However, in this > > > case, the folio is not shared, yet its exclusive flag is not set. 
> > > > > > Therefore, I believe PageAnonExclusive is not a reliable indicator of > > > whether a folio is truly exclusive to a process. > > > > It is. The flag *not* being set is not a reliable indicator whether it > > is really shared. ;) > > > > The reason why we have this PAE workaround (dropping the flag) in place > > is because the page must not be written to (SWP_STABLE_WRITES). CoW > > reuse is not possible. > > > > uffd moving that page -- and in that same process setting it writable, > > see move_present_pte()->pte_mkwrite() -- would be very bad. > > An alternative approach is to make the folio writable only when we are > reasonably certain it is exclusive; otherwise, it remains read-only. If the > destination is later written to and the folio has become exclusive, it can > be reused directly. If not, a copy-on-write will occur on the destination > address, transparently to userspace. This avoids Lokesh’s userspace-based > strategy, which requires forcing a write to the source address. 
Conceptually, I mean something like this:

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..70eaabf4f1a3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm,
 	}
 	if (folio_test_large(src_folio) ||
 	    folio_maybe_dma_pinned(src_folio) ||
-	    !PageAnonExclusive(&src_folio->page)) {
+	    (!PageAnonExclusive(&src_folio->page) &&
+	     folio_mapcount(src_folio) != 1)) {
 		err = -EBUSY;
 		goto out;
 	}
@@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm,
 #endif
 	if (pte_dirty(orig_src_pte))
 		orig_dst_pte = pte_mkdirty(orig_dst_pte);
-	orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
+	if (PageAnonExclusive(&src_folio->page))
+		orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);

 	set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
 out:
@@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	}

 	folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
-	if (!folio || !PageAnonExclusive(&folio->page)) {
+	if (!folio || (!PageAnonExclusive(&folio->page) &&
+		       folio_mapcount(folio) != 1)) {
 		spin_unlock(src_ptl);
 		err = -EBUSY;
 		goto out;

I'm not trying to push this approach—unless Lokesh clearly sees that it
could reduce userspace noise. I'm mainly just curious how we might make
the fixup transparent to userspace. :-)

> >
> > >
> > > The kernel log output is shown below:
> > > [   23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0
> > > swapcache:1
> > >
> > > I'm still struggling to find a real fix; it seems quite challenging.
> >
> > PAE tells you that you can immediately write to that page without going
> > through CoW. However, here, CoW is required.
> >
> > > Please let me know if you have any ideas. In any case It seems
> > > userspace should fall back to userfaultfd_copy.
> >
> > We could try detecting whether the page is now exclusive, to reset PAE.
> > That will only be possible after writeback completed, so it adds > > complexity without being able to move the page in all cases (during > > writeback). > > > > Letting userspace deal with that in these rate scenarios is > > significantly easier. > > Right, this appears to introduce the least change—essentially none—to the > kernel, while shifting more noise to userspace :-) > > > > > -- > > Cheers, > > > > David / dhildenb > > > Thanks Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-27 8:37 ` Barry Song @ 2025-05-27 9:00 ` David Hildenbrand 2025-05-27 9:31 ` Barry Song 0 siblings, 1 reply; 10+ messages in thread From: David Hildenbrand @ 2025-05-27 9:00 UTC (permalink / raw) To: Barry Song Cc: aarcange, akpm, linux-kernel, linux-mm, lokeshgidra, peterx, ryncsn, surenb On 27.05.25 10:37, Barry Song wrote: > On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@gmail.com> wrote: >> >> On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@redhat.com> wrote: >>> >>> On 23.05.25 01:23, Barry Song wrote: >>>> Hi All, >>> >>> Hi! >>> >>>> >>>> I'm encountering another bug that can be easily reproduced using the small >>>> program below[1], which performs swap-out and swap-in in parallel. >>>> >>>> The issue occurs when a folio is being swapped out while it is accessed >>>> concurrently. In this case, do_swap_page() handles the access. However, >>>> because the folio is under writeback, do_swap_page() completely removes >>>> its exclusive attribute. >>>> >>>> do_swap_page: >>>> } else if (exclusive && folio_test_writeback(folio) && >>>> data_race(si->flags & SWP_STABLE_WRITES)) { >>>> ... >>>> exclusive = false; >>>> >>>> As a result, userfaultfd_move() will return -EBUSY, even though the >>>> folio is not shared and is in fact exclusively owned. >>>> >>>> folio = vm_normal_folio(src_vma, src_addr, >>>> orig_src_pte); >>>> if (!folio || !PageAnonExclusive(&folio->page)) { >>>> spin_unlock(src_ptl); >>>> + pr_err("%s %d folio:%lx exclusive:%d >>>> swapcache:%d\n", >>>> + __func__, __LINE__, folio, >>>> PageAnonExclusive(&folio->page), >>>> + folio_test_swapcache(folio)); >>>> err = -EBUSY; >>>> goto out; >>>> } >>>> >>>> I understand that shared folios should not be moved. However, in this >>>> case, the folio is not shared, yet its exclusive flag is not set. 
>>>> >>>> Therefore, I believe PageAnonExclusive is not a reliable indicator of >>>> whether a folio is truly exclusive to a process. >>> >>> It is. The flag *not* being set is not a reliable indicator whether it >>> is really shared. ;) >>> >>> The reason why we have this PAE workaround (dropping the flag) in place >>> is because the page must not be written to (SWP_STABLE_WRITES). CoW >>> reuse is not possible. >>> >>> uffd moving that page -- and in that same process setting it writable, >>> see move_present_pte()->pte_mkwrite() -- would be very bad. >> >> An alternative approach is to make the folio writable only when we are >> reasonably certain it is exclusive; otherwise, it remains read-only. If the >> destination is later written to and the folio has become exclusive, it can >> be reused directly. If not, a copy-on-write will occur on the destination >> address, transparently to userspace. This avoids Lokesh’s userspace-based >> strategy, which requires forcing a write to the source address. 
>
> Conceptually, I mean something like this:
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..70eaabf4f1a3 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm,
>  	}
>  	if (folio_test_large(src_folio) ||
>  	    folio_maybe_dma_pinned(src_folio) ||
> -	    !PageAnonExclusive(&src_folio->page)) {
> +	    (!PageAnonExclusive(&src_folio->page) &&
> +	     folio_mapcount(src_folio) != 1)) {
>  		err = -EBUSY;
>  		goto out;
>  	}
> @@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm,
>  #endif
>  	if (pte_dirty(orig_src_pte))
>  		orig_dst_pte = pte_mkdirty(orig_dst_pte);
> -	orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
> +	if (PageAnonExclusive(&src_folio->page))
> +		orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma);
>
>  	set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte);
>  out:
> @@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>  	}
>
>  	folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> -	if (!folio || !PageAnonExclusive(&folio->page)) {
> +	if (!folio || (!PageAnonExclusive(&folio->page) &&
> +		       folio_mapcount(folio) != 1)) {
>  		spin_unlock(src_ptl);
>  		err = -EBUSY;
>  		goto out;
>
> I'm not trying to push this approach—unless Lokesh clearly sees that it
> could reduce userspace noise. I'm mainly just curious how we might make
> the fixup transparent to userspace. :-)

And that reveals the exact problem: it's all *very* complicated. :)

... and dangerous when we use the mapcount without having a complete
understanding how it all works.

What we would have to do for a small folio is

1) Take the folio lock

2) Make sure there is only this present page table mapping:
   folio_mapcount(folio) == 1, or better
   !folio_maybe_mapped_shared(folio);

3) Make sure that there are no swap references

   If in the swapcache, check the actual swapcount

4) Make sure it is not a KSM folio

THPs are way, way, way more complicated to get right that way.
Likely, the scenario described above cannot happen with a PMD-mapped THP
for now at least (we don't have PMD swap entries).

Of course, we'd then also have to handle the case when we have a swap
pte where the marker is not set (e.g., because of swapout after the
described swapin where we dropped the marker).

What could be easier is triggering a FAULT_FLAG_UNSHARE fault. It's
arguably less optimal in case the core will decide to swapin / CoW, but
it leaves the magic to get all this right to the core -- and mimics the
approach Lokesh uses.

But then, maybe letting userspace just do a uffdio_copy would be even
faster (only a single TLB shootdown?).

I am also skeptical of calling this a BUG here. It's described to behave
exactly like that [1]:

       EBUSY
              The pages in the source virtual memory range are either
              pinned or not exclusive to the process. The kernel might
              only perform lightweight checks for detecting whether the
              pages are exclusive. To make the operation more likely to
              succeed, KSM should be disabled, fork() should be avoided
              or MADV_DONTFORK should be configured for the source
              virtual memory area before fork().

Note the "lightweight" and "more likely to succeed".

[1] https://lore.kernel.org/lkml/20231206103702.3873743-3-surenb@google.com/

--
Cheers,

David / dhildenb

^ permalink raw reply	[flat|nested] 10+ messages in thread
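The small-folio checklist above can be condensed into a toy predicate. This is purely illustrative userspace C, not kernel code: the struct and its fields merely model the state that the real checks (folio lock, folio_maybe_mapped_shared(), swap count, KSM flag) would consult.

```c
/* Toy model of the small-folio exclusivity checks sketched above.
 * Not kernel code: the struct only models the state the real checks
 * would consult before it would be safe to reset PageAnonExclusive. */
#include <stdbool.h>

struct folio_model {
	bool locked;		/* 1) folio lock held */
	int  mapcount;		/* 2) number of page table mappings */
	int  swapcount;		/* 3) remaining swap references */
	bool is_ksm;		/* 4) KSM folio */
};

static bool can_reset_anon_exclusive(const struct folio_model *f)
{
	if (!f->locked)
		return false;	/* must hold the folio lock */
	if (f->mapcount != 1)
		return false;	/* only this present mapping allowed */
	if (f->swapcount != 0)
		return false;	/* no swap entries may still reference it */
	if (f->is_ksm)
		return false;	/* KSM folios are never exclusive */
	return true;
}
```

Every one of these conditions must hold at once, under the folio lock, which is why the thread concludes the fallback is better left to userspace.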
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-27 9:00 ` David Hildenbrand @ 2025-05-27 9:31 ` Barry Song 2025-05-27 11:06 ` David Hildenbrand 0 siblings, 1 reply; 10+ messages in thread From: Barry Song @ 2025-05-27 9:31 UTC (permalink / raw) To: David Hildenbrand Cc: aarcange, akpm, linux-kernel, linux-mm, lokeshgidra, peterx, ryncsn, surenb On Tue, May 27, 2025 at 5:00 PM David Hildenbrand <david@redhat.com> wrote: > > On 27.05.25 10:37, Barry Song wrote: > > On Tue, May 27, 2025 at 4:17 PM Barry Song <21cnbao@gmail.com> wrote: > >> > >> On Tue, May 27, 2025 at 12:39 AM David Hildenbrand <david@redhat.com> wrote: > >>> > >>> On 23.05.25 01:23, Barry Song wrote: > >>>> Hi All, > >>> > >>> Hi! > >>> > >>>> > >>>> I'm encountering another bug that can be easily reproduced using the small > >>>> program below[1], which performs swap-out and swap-in in parallel. > >>>> > >>>> The issue occurs when a folio is being swapped out while it is accessed > >>>> concurrently. In this case, do_swap_page() handles the access. However, > >>>> because the folio is under writeback, do_swap_page() completely removes > >>>> its exclusive attribute. > >>>> > >>>> do_swap_page: > >>>> } else if (exclusive && folio_test_writeback(folio) && > >>>> data_race(si->flags & SWP_STABLE_WRITES)) { > >>>> ... > >>>> exclusive = false; > >>>> > >>>> As a result, userfaultfd_move() will return -EBUSY, even though the > >>>> folio is not shared and is in fact exclusively owned. > >>>> > >>>> folio = vm_normal_folio(src_vma, src_addr, > >>>> orig_src_pte); > >>>> if (!folio || !PageAnonExclusive(&folio->page)) { > >>>> spin_unlock(src_ptl); > >>>> + pr_err("%s %d folio:%lx exclusive:%d > >>>> swapcache:%d\n", > >>>> + __func__, __LINE__, folio, > >>>> PageAnonExclusive(&folio->page), > >>>> + folio_test_swapcache(folio)); > >>>> err = -EBUSY; > >>>> goto out; > >>>> } > >>>> > >>>> I understand that shared folios should not be moved. 
However, in this > >>>> case, the folio is not shared, yet its exclusive flag is not set. > >>>> > >>>> Therefore, I believe PageAnonExclusive is not a reliable indicator of > >>>> whether a folio is truly exclusive to a process. > >>> > >>> It is. The flag *not* being set is not a reliable indicator whether it > >>> is really shared. ;) > >>> > >>> The reason why we have this PAE workaround (dropping the flag) in place > >>> is because the page must not be written to (SWP_STABLE_WRITES). CoW > >>> reuse is not possible. > >>> > >>> uffd moving that page -- and in that same process setting it writable, > >>> see move_present_pte()->pte_mkwrite() -- would be very bad. > >> > >> An alternative approach is to make the folio writable only when we are > >> reasonably certain it is exclusive; otherwise, it remains read-only. If the > >> destination is later written to and the folio has become exclusive, it can > >> be reused directly. If not, a copy-on-write will occur on the destination > >> address, transparently to userspace. This avoids Lokesh’s userspace-based > >> strategy, which requires forcing a write to the source address. 
> > > > Conceptually, I mean something like this: > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > index bc473ad21202..70eaabf4f1a3 100644 > > --- a/mm/userfaultfd.c > > +++ b/mm/userfaultfd.c > > @@ -1047,7 +1047,8 @@ static int move_present_pte(struct mm_struct *mm, > > } > > if (folio_test_large(src_folio) || > > folio_maybe_dma_pinned(src_folio) || > > - !PageAnonExclusive(&src_folio->page)) { > > + (!PageAnonExclusive(&src_folio->page) && > > + folio_mapcount(src_folio) != 1)) { > > err = -EBUSY; > > goto out; > > } > > @@ -1070,7 +1071,8 @@ static int move_present_pte(struct mm_struct *mm, > > #endif > > if (pte_dirty(orig_src_pte)) > > orig_dst_pte = pte_mkdirty(orig_dst_pte); > > - orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma); > > + if (PageAnonExclusive(&src_folio->page)) > > + orig_dst_pte = pte_mkwrite(orig_dst_pte, dst_vma); > > > > set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte); > > out: > > @@ -1268,7 +1270,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, > > } > > > > folio = vm_normal_folio(src_vma, src_addr, orig_src_pte); > > - if (!folio || !PageAnonExclusive(&folio->page)) { > > + if (!folio || (!PageAnonExclusive(&folio->page) && > > + folio_mapcount(folio) != 1)) { > > spin_unlock(src_ptl); > > err = -EBUSY; > > goto out; > > > > I'm not trying to push this approach—unless Lokesh clearly sees that it > > could reduce userspace noise. I'm mainly just curious how we might make > > the fixup transparent to userspace. :-) > > And that reveals the exact problem: it's all *very* complicated. :) > > ... and dangerous when we use the mapcount without having a complete > understanding how it all works. 
> > > What we would have to do for a small folio is > > 1) Take the folio lock > > 2) Make sure there is only this present page table mapping: > folio_mapcount(folio) != 1 > > of better > > !folio_maybe_mapped_shared(folio); > > 3) Make sure that there are no swap references > > If in the swapcache, check the actual swapcount > > 3) Make sure it is not a KSM folio > > > THPs are way, way, way more complicated to get right that way. Likely, > the scenario described above cannot happen with a PMD-mapped THP for now > at least (we don't have PMD swap entries). Yeah, this can get really complicated. > > > Of course, we'd then also have to handle the case when we have a swap > pte where the marker is not set (e.g., because of swapout after the > described swapin where we dropped the marker). > > > What could be easier is triggering a FAULT_FLAG_UNSHARE fault. It's > arguably less optimal in case the core will decide to swapin / CoW, but > it leaves the magic to get all this right to the core -- and mimics the > approach Lokesh uses. > > But then, maybe letting userspace just do a uffdio_copy would be even > faster (only a single TLB shootdown?). > > > I am also skeptical of calling this a BUG here. It's described to behave > exactly like that [1]: > > EBUSY > The pages in the source virtual memory range are either > pinned or not exclusive to the process. The kernel might > only perform lightweight checks for detecting whether the > pages are exclusive. To make the operation more likely to > succeed, KSM should be disabled, fork() should be avoided > or MADV_DONTFORK should be configured for the source > virtual memory area before fork(). > > Note the "lightweight" and "more likely to succeed". > Initially, my point was that an exclusive folio (single-process case) should be movable. Now I understand this isn’t a bug, but rather a compromise made due to implementation constraints. 
Perhaps the remaining value of this report is that it helped better understand scenarios beyond fork where a move might also fail. I truly appreciate your time and your clear analysis. > > [1] https://lore.kernel.org/lkml/20231206103702.3873743-3-surenb@google.com/ > > -- > Cheers, > > David / dhildenb > Thanks Barry ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out 2025-05-27 9:31 ` Barry Song @ 2025-05-27 11:06 ` David Hildenbrand 0 siblings, 0 replies; 10+ messages in thread From: David Hildenbrand @ 2025-05-27 11:06 UTC (permalink / raw) To: Barry Song Cc: aarcange, akpm, linux-kernel, linux-mm, lokeshgidra, peterx, ryncsn, surenb >> >> EBUSY >> The pages in the source virtual memory range are either >> pinned or not exclusive to the process. The kernel might >> only perform lightweight checks for detecting whether the >> pages are exclusive. To make the operation more likely to >> succeed, KSM should be disabled, fork() should be avoided >> or MADV_DONTFORK should be configured for the source >> virtual memory area before fork(). >> >> Note the "lightweight" and "more likely to succeed". >> > > Initially, my point was that an exclusive folio (single-process case) > should be movable. Yeah, I would wish that we wouldn't need that PAE hack in the swapin code. I was asking myself if we could just ... wait for writeback to end in that case? I mean, if we would have to swap in the folio we would also have to wait for disk I/O ... so here we would also have to wait for disk I/O. We could either wait for writeback before mapping the folio, or set the PAE bit and map the page R/O, to then wait for writeback during write faults. The latter has the downside that we have to handle it with more complexity during write faults (check if page is under writeback, then check if we require this sync I/O during write faults even though PAE is set ...). > Now I understand this isn’t a bug, but rather a compromise made due > to implementation constraints. That is a good summary! > Perhaps the remaining value of this report is that it helped better > understand scenarios beyond fork where a move might also fail. > > I truly appreciate your time and your clear analysis. 
YW :) -- Cheers, David / dhildenb ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2025-05-27 11:06 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-22 23:23 [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out Barry Song
2025-05-22 23:43 ` Lokesh Gidra
2025-05-22 23:53   ` Barry Song
2025-05-23  0:03     ` Lokesh Gidra
2025-05-26 12:39 ` David Hildenbrand
2025-05-27  4:17   ` Barry Song
2025-05-27  8:37     ` Barry Song
2025-05-27  9:00       ` David Hildenbrand
2025-05-27  9:31         ` Barry Song
2025-05-27 11:06           ` David Hildenbrand