From: Barry Song <21cnbao@gmail.com>
Date: Fri, 23 May 2025 11:53:42 +1200
Subject: Re: [BUG]userfaultfd_move fails to move a folio when swap-in occurs concurrently with swap-out
To: Lokesh Gidra
Cc: Peter Xu, David Hildenbrand, Suren Baghdasaryan, Andrea Arcangeli, Andrew Morton, Linux-MM, Kairui Song, LKML

On Fri, May 23, 2025 at 11:44 AM Lokesh Gidra wrote:
>
> Thanks Barry for stress testing MOVE ioctl. It's really helpful :)
>
> On Thu, May 22, 2025 at 4:23 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > Hi All,
> >
> > I'm encountering another bug that can be easily reproduced using the small
> > program below[1], which performs swap-out and swap-in in parallel.
> >
> > The issue occurs when a folio is being swapped out while it is accessed
> > concurrently. In this case, do_swap_page() handles the access. However,
> > because the folio is under writeback, do_swap_page() completely removes
> > its exclusive attribute.
> >
> > do_swap_page:
> >                } else if (exclusive && folio_test_writeback(folio) &&
> >                           data_race(si->flags & SWP_STABLE_WRITES)) {
> >                        ...
> >                        exclusive = false;
> >
> > As a result, userfaultfd_move() will return -EBUSY, even though the
> > folio is not shared and is in fact exclusively owned.
> >
> >                         folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> >                         if (!folio || !PageAnonExclusive(&folio->page)) {
> >                                 spin_unlock(src_ptl);
> > +                               pr_err("%s %d folio:%lx exclusive:%d swapcache:%d\n",
> > +                                       __func__, __LINE__, folio, PageAnonExclusive(&folio->page),
> > +                                       folio_test_swapcache(folio));
> >                                 err = -EBUSY;
> >                                 goto out;
> >                         }
> >
> > I understand that shared folios should not be moved. However, in this
> > case, the folio is not shared, yet its exclusive flag is not set.
> >
> > Therefore, I believe PageAnonExclusive is not a reliable indicator of
> > whether a folio is truly exclusive to a process.
> >
> > The kernel log output is shown below:
> > [   23.009516] move_pages_pte 1285 folio:fffffdffc01bba40 exclusive:0 swapcache:1
> >
> > I'm still struggling to find a real fix; it seems quite challenging.
> > Please let me know if you have any ideas. In any case It seems
> > userspace should fall back to userfaultfd_copy.
> >
> I'm not sure this is really a bug. A page under write-back is in a way
> 'busy' isn't it? I am not an expert of anon-exclusive, but it seems to
> me that an exclusively mapped anonymous page would have it true. So,
> isn't it expected that a page under write-back will not have it set as
> the page isn't mapped?

We have two return codes: -EAGAIN and -EBUSY. In many cases, we return
-EAGAIN, which is transparent to userspace because the syscall is retried.
Therefore, I would expect -EAGAIN or a similar code here to avoid userspace
noise, since we already return -EAGAIN in other cases where a folio is in a
transient state and will become stable again.

>
> I have observed this in my testing as well, and there are a couple of
> ways to deal with it in userspace. As you suggested, falling back to
> userfaultfd_copy on receiving -EBUSY is one option. In my case, making
> a fake store on the src page and then retrying has been working fine.

Good to know you have some fallbacks implemented in userspace. That makes
the issue less serious now.

> >
> >
> > [1] The small program:
> >
> > //Just in a couple of seconds, we are running into
> > //"UFFDIO_MOVE: Device or resource busy"
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <sys/mman.h>
> > #include <sys/ioctl.h>
> > #include <sys/syscall.h>
> > #include <linux/userfaultfd.h>
> > #include <fcntl.h>
> > #include <pthread.h>
> > #include <unistd.h>
> > #include <poll.h>
> > #include <errno.h>
> >
> > #define PAGE_SIZE 4096
> > #define REGION_SIZE (512 * 1024)
> >
> > #ifndef UFFDIO_MOVE
> > struct uffdio_move {
> >     __u64 dst;
> >     __u64 src;
> >     __u64 len;
> >     #define UFFDIO_MOVE_MODE_DONTWAKE        ((__u64)1<<0)
> >     #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> >     __u64 mode;
> >     __s64 move;
> > };
> >
> > #define _UFFDIO_MOVE  (0x05)
> > #define UFFDIO_MOVE   _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> > #endif
> >
> >
> > void *src, *dst;
> > int uffd;
> >
> > void *madvise_thread(void *arg) {
> >     for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> >         madvise(src + i, PAGE_SIZE, MADV_PAGEOUT);
> >         usleep(100);
> >     }
> >     return NULL;
> > }
> >
> > void *swapin_thread(void *arg) {
> >     volatile char dummy;
> >     for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> >         dummy = ((char *)src)[i];
> >         usleep(100);
> >     }
> >     return NULL;
> > }
> >
> >
> > void *fault_handler_thread(void *arg) {
> >
> >     struct uffd_msg msg;
> >     struct uffdio_move move;
> >     struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> >     pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> >     pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> >
> >     while (1) {
> >         if (poll(&pollfd, 1, -1) == -1) {
> >             perror("poll");
> >             exit(EXIT_FAILURE);
> >         }
> >
> >         if (read(uffd, &msg, sizeof(msg)) <= 0) {
> >             perror("read");
> >             exit(EXIT_FAILURE);
> >         }
> >
> >
> >         if (msg.event != UFFD_EVENT_PAGEFAULT) {
> >             fprintf(stderr, "Unexpected event\n");
> >             exit(EXIT_FAILURE);
> >         }
> >
> >         move.src = (unsigned long)src +
> >                    (msg.arg.pagefault.address - (unsigned long)dst);
> >         move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> >         move.len = PAGE_SIZE;
> >         move.mode = 0;
> >
> >         if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> >             perror("UFFDIO_MOVE");
> >             exit(EXIT_FAILURE);
> >         }
> >     }
> >     return NULL;
> > }
> >
> > int main() {
> > again:
> >     pthread_t thr, madv_thr, swapin_thr;
> >     struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> >     struct uffdio_register uffdio_register;
> >
> >     src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
> >                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >
> >     if (src == MAP_FAILED) {
> >         perror("mmap src");
> >         exit(EXIT_FAILURE);
> >     }
> >
> >     memset(src, 1, REGION_SIZE);
> >
> >     dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
> >                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >
> >     if (dst == MAP_FAILED) {
> >         perror("mmap dst");
> >         exit(EXIT_FAILURE);
> >     }
> >
> >
> >     uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> >     if (uffd == -1) {
> >         perror("userfaultfd");
> >         exit(EXIT_FAILURE);
> >     }
> >
> >
> >     if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> >         perror("UFFDIO_API");
> >         exit(EXIT_FAILURE);
> >     }
> >
> >     uffdio_register.range.start = (unsigned long)dst;
> >     uffdio_register.range.len = REGION_SIZE;
> >     uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> >
> >     if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> >         perror("UFFDIO_REGISTER");
> >         exit(EXIT_FAILURE);
> >     }
> >
> >     if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> >         perror("pthread_create madvise_thread");
> >         exit(EXIT_FAILURE);
> >     }
> >
> >     if (pthread_create(&swapin_thr, NULL, swapin_thread, NULL) != 0) {
> >         perror("pthread_create swapin_thread");
> >         exit(EXIT_FAILURE);
> >     }
> >
> >     if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> >         perror("pthread_create fault_handler_thread");
> >         exit(EXIT_FAILURE);
> >     }
> >
> >     for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> >         char val = ((char *)dst)[i];
> >         printf("Accessing dst at offset %zu, value: %d\n", i, val);
> >     }
> >
> >     pthread_join(madv_thr, NULL);
> >     pthread_join(swapin_thr, NULL);
> >     pthread_cancel(thr);
> >     pthread_join(thr, NULL);
> >     munmap(src, REGION_SIZE);
> >     munmap(dst, REGION_SIZE);
> >     close(uffd);
> >     goto again;
> >
> >     return 0;
> > }
> >

Thanks
Barry
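
P.S. For reference, the "fall back to copy on -EBUSY" idea discussed above
could be structured roughly as below. This is only a sketch under stated
assumptions: move_or_copy(), resolve_fn, struct resolve_stats and the fake_*
helpers are hypothetical names made up for illustration, and the two
callbacks are assumed to wrap ioctl(uffd, UFFDIO_MOVE, ...) and
ioctl(uffd, UFFDIO_COPY, ...) respectively, following the ioctl(2)
convention of returning -1 with errno set.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type: one attempt to resolve a fault for
 * [dst, dst + len) from src. Returns 0 on success, or -1 with errno
 * set, mirroring ioctl(2). */
typedef int (*resolve_fn)(void *ctx, unsigned long dst,
                          unsigned long src, size_t len);

/* Counters so the caller can see which path was taken. */
struct resolve_stats {
    int moves;      /* faults resolved by the move path */
    int fallbacks;  /* times the copy path was tried after EBUSY */
};

/* Try the move path first; if it reports EBUSY, fall back to copy.
 * Any other error is passed through unchanged. */
static int move_or_copy(resolve_fn move, resolve_fn copy, void *ctx,
                        struct resolve_stats *stats,
                        unsigned long dst, unsigned long src, size_t len)
{
    assert(move && copy && stats);

    if (move(ctx, dst, src, len) == 0) {
        stats->moves++;
        return 0;
    }
    if (errno != EBUSY)
        return -1;

    stats->fallbacks++;
    return copy(ctx, dst, src, len);
}

/* Stand-ins for the real ioctl wrappers, for illustration only. */

/* Always fails; ctx points to the errno value to report. */
static int fake_fail(void *ctx, unsigned long dst,
                     unsigned long src, size_t len)
{
    (void)dst; (void)src; (void)len;
    errno = *(int *)ctx;
    return -1;
}

/* Always succeeds. */
static int fake_ok(void *ctx, unsigned long dst,
                   unsigned long src, size_t len)
{
    (void)ctx; (void)dst; (void)src; (void)len;
    return 0;
}
```

In a real fault handler the fake_* helpers would be replaced by thin wrappers
around the two ioctls. As far as I understand, UFFDIO_COPY leaves the source
page in place, so the caller would presumably still need to release it
afterwards.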