From: Barry Song <21cnbao@gmail.com>
Date: Thu, 20 Feb 2025 09:37:50 +1300
Subject: Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
To: Suren Baghdasaryan, Lokesh Gidra
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org, zhengtangquan@oppo.com, Barry Song, Andrea Arcangeli, Al Viro, Axel Rasmussen, Brian Geffon, Christian Brauner, David Hildenbrand, Hugh Dickins, Jann Horn, Kalesh Singh, "Liam R. Howlett", Matthew Wilcox, Michal Hocko, Mike Rapoport, Nicolas Geoffray, Peter Xu, Ryan Roberts, Shuah Khan, ZhangPeng, Yu Zhao
References: <20250219112519.92853-1-21cnbao@gmail.com>

On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan wrote:
>
> On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song
> >
> > userfaultfd_move() checks whether the PTE entry is present or a
> > swap entry.
> >
> > - If the PTE entry is present, move_present_pte() handles folio
> >   migration by setting:
> >
> >       src_folio->index = linear_page_index(dst_vma, dst_addr);
> >
> > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> >   the PTE to the new dst_addr.
> >
> > This approach is incorrect because even if the PTE is a swap
> > entry, it can still reference a folio that remains in the swap
> > cache.
> >
> > If do_swap_page() is triggered, it may locate the folio in the
> > swap cache. However, during add_rmap operations, a kernel panic
> > can occur due to:
> >
> >       page_pgoff(folio, page) != linear_page_index(vma, address)
>
> Thanks for the report and reproducer!
> >
> > $./a.out > /dev/null
> > [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > [ 13.337716] memcg:ffff00000405f000
> > [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > [ 13.340190] ------------[ cut here ]------------
> > [ 13.340316] kernel BUG at mm/rmap.c:1380!
> > [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > [ 13.340969] Modules linked in:
> > [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > [ 13.341470] Hardware name: linux,dummy-virt (DT)
> > [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > [ 13.342018] sp : ffff80008752bb20
> > [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > [ 13.343876] Call trace:
> > [ 13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > [ 13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > [ 13.344333]  do_swap_page+0x1060/0x1400
> > [ 13.344417]  __handle_mm_fault+0x61c/0xbc8
> > [ 13.344504]  handle_mm_fault+0xd8/0x2e8
> > [ 13.344586]  do_page_fault+0x20c/0x770
> > [ 13.344673]  do_translation_fault+0xb4/0xf0
> > [ 13.344759]  do_mem_abort+0x48/0xa0
> > [ 13.344842]  el0_da+0x58/0x130
> > [ 13.344914]  el0t_64_sync_handler+0xc4/0x138
> > [ 13.345002]  el0t_64_sync+0x1ac/0x1b0
> > [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > [ 13.345504] ---[ end trace 0000000000000000 ]---
> > [ 13.345715] note: a.out[107] exited with irqs disabled
> > [ 13.345954] note: a.out[107] exited with preempt_count 2
> >
> > Fully fixing it would be quite complex, requiring similar handling
> > of folios as done in move_present_pte.
>
> How complex would that be? Is it a matter of adding
> folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> folio->index = linear_page_index like in move_present_pte() or
> something more?

My main concern is still with large folios that require a split_folio()
during move_pages(): the entire folio shares the same index and
anon_vma, yet userfaultfd_move() moves pages individually, making a
split necessary. However, in split_huge_page_to_list_to_order() there
is a:

	if (folio_test_writeback(folio))
		return -EBUSY;

This is likely true for swapcache, right? Note that even for
move_present_pte(), it simply returns -EBUSY:

	move_pages_pte()
	{
		/* at this point we have src_folio locked */
		if (folio_test_large(src_folio)) {
			/* split_folio() can block */
			pte_unmap(&orig_src_pte);
			pte_unmap(&orig_dst_pte);
			src_pte = dst_pte = NULL;
			err = split_folio(src_folio);
			if (err)
				goto out;
			/* have to reacquire the folio after it got split */
			folio_unlock(src_folio);
			folio_put(src_folio);
			src_folio = NULL;
			goto retry;
		}
	}

Do we need a folio_wait_writeback() before calling split_folio()?

By the way, I have also reported that userfaultfd_move() has a
fundamental conflict with TAO (Cc'ed Yu Zhao), which has been part of
the Android common kernel. In this scenario, folios in the virtual zone
won't be split in split_folio(). Instead, the large folio migrates into
nr_pages small folios.
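Coming back to the folio_wait_writeback() question above: if waiting is
indeed the right call, the change might be as small as the sketch below
(untested pseudocode; the placement inside the existing large-folio
branch of move_pages_pte() is my assumption, and src_folio is already
locked at this point as the snippet notes):

```
	/* pseudocode sketch, not a tested patch */
	if (folio_test_large(src_folio)) {
		pte_unmap(&orig_src_pte);
		pte_unmap(&orig_dst_pte);
		src_pte = dst_pte = NULL;
		/* drain in-flight writeback first, so that
		 * split_huge_page_to_list_to_order() does not
		 * bail out with -EBUSY on a swapcache folio */
		folio_wait_writeback(src_folio);
		err = split_folio(src_folio);
		...
	}
```

Whether blocking here is acceptable, or whether callers should keep
seeing -EBUSY and retry, is exactly the open question.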
Thus, the best-case scenario would be:

	mTHP -> migrate to small folios in split_folio() -> move small
	folios to dst_addr

While this works, it negates the performance benefits of
userfaultfd_move(), as it introduces two PTE operations (migration in
split_folio() and move in userfaultfd_move() while retrying), nr_pages
memory allocations, and still requires one memcpy(). This could end up
performing even worse than userfaultfd_copy(), I guess.

The worst-case scenario would be failing to allocate small folios in
split_folio(), then userfaultfd_move() might return -ENOMEM?

Given these issues, I strongly recommend that ART hold off on upgrading
to userfaultfd_move() until these problems are fully understood and
resolved. Otherwise, we're in for a rough ride!

> > For now, a quick solution is to return -EBUSY.
> > I'd like to see others' opinions on whether a full fix is worth
> > pursuing.
> >
> > For anyone interested in reproducing it, the a.out test program is
> > as below:
> >
> > #define _GNU_SOURCE
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> > #include <fcntl.h>
> > #include <errno.h>
> > #include <poll.h>
> > #include <pthread.h>
> > #include <sys/mman.h>
> > #include <sys/ioctl.h>
> > #include <sys/syscall.h>
> > #include <linux/userfaultfd.h>
> >
> > #define PAGE_SIZE 4096
> > #define REGION_SIZE (512 * 1024)
> >
> > #ifndef UFFDIO_MOVE
> > struct uffdio_move {
> >         __u64 dst;
> >         __u64 src;
> >         __u64 len;
> > #define UFFDIO_MOVE_MODE_DONTWAKE ((__u64)1<<0)
> > #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> >         __u64 mode;
> >         __s64 move;
> > };
> > #define _UFFDIO_MOVE (0x05)
> > #define UFFDIO_MOVE _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> > #endif
> >
> > void *src, *dst;
> > int uffd;
> >
> > void *madvise_thread(void *arg) {
> >         if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
> >                 perror("madvise MADV_PAGEOUT");
> >         }
> >         return NULL;
> > }
> >
> > void *fault_handler_thread(void *arg) {
> >         struct uffd_msg msg;
> >         struct uffdio_move move;
> >         struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> >
> >         pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> >         pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> >
> >         while (1) {
> >                 if (poll(&pollfd, 1, -1) == -1) {
> >                         perror("poll");
> >                         exit(EXIT_FAILURE);
> >                 }
> >
> >                 if (read(uffd, &msg, sizeof(msg)) <= 0) {
> >                         perror("read");
> >                         exit(EXIT_FAILURE);
> >                 }
> >
> >                 if (msg.event != UFFD_EVENT_PAGEFAULT) {
> >                         fprintf(stderr, "Unexpected event\n");
> >                         exit(EXIT_FAILURE);
> >                 }
> >
> >                 move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
> >                 move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> >                 move.len = PAGE_SIZE;
> >                 move.mode = 0;
> >
> >                 if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> >                         perror("UFFDIO_MOVE");
> >                         exit(EXIT_FAILURE);
> >                 }
> >         }
> >         return NULL;
> > }
> >
> > int main() {
> > again:
> >         pthread_t thr, madv_thr;
> >         struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> >         struct uffdio_register uffdio_register;
> >
> >         src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >         if (src == MAP_FAILED) {
> >                 perror("mmap src");
> >                 exit(EXIT_FAILURE);
> >         }
> >         memset(src, 1, REGION_SIZE);
> >
> >         dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> >         if (dst == MAP_FAILED) {
> >                 perror("mmap dst");
> >                 exit(EXIT_FAILURE);
> >         }
> >
> >         uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> >         if (uffd == -1) {
> >                 perror("userfaultfd");
> >                 exit(EXIT_FAILURE);
> >         }
> >
> >         if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> >                 perror("UFFDIO_API");
> >                 exit(EXIT_FAILURE);
> >         }
> >
> >         uffdio_register.range.start = (unsigned long)dst;
> >         uffdio_register.range.len = REGION_SIZE;
> >         uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> >
> >         if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> >                 perror("UFFDIO_REGISTER");
> >                 exit(EXIT_FAILURE);
> >         }
> >
> >         if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> >                 perror("pthread_create madvise_thread");
> >                 exit(EXIT_FAILURE);
> >         }
> >
> >         if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> >                 perror("pthread_create fault_handler_thread");
> >                 exit(EXIT_FAILURE);
> >         }
> >
> >         for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> >                 char val = ((char *)dst)[i];
> >                 printf("Accessing dst at offset %zu, value: %d\n", i, val);
> >         }
> >
> >         pthread_join(madv_thr, NULL);
> >         pthread_cancel(thr);
> >         pthread_join(thr, NULL);
> >
> >         munmap(src, REGION_SIZE);
> >         munmap(dst, REGION_SIZE);
> >         close(uffd);
> >         goto again;
> >         return 0;
> > }
> >
> > As long as you enable mTHP (which likely increases the residency
> > time of swapcache), you can reproduce the issue within a few
> > seconds. But I guess the same race condition also exists with
> > small folios.
> >
> > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> > Cc: Andrea Arcangeli
> > Cc: Suren Baghdasaryan
> > Cc: Al Viro
> > Cc: Axel Rasmussen
> > Cc: Brian Geffon
> > Cc: Christian Brauner
> > Cc: David Hildenbrand
> > Cc: Hugh Dickins
> > Cc: Jann Horn
> > Cc: Kalesh Singh
> > Cc: Liam R. Howlett
> > Cc: Lokesh Gidra
> > Cc: Matthew Wilcox (Oracle)
> > Cc: Michal Hocko
> > Cc: Mike Rapoport (IBM)
> > Cc: Nicolas Geoffray
> > Cc: Peter Xu
> > Cc: Ryan Roberts
> > Cc: Shuah Khan
> > Cc: ZhangPeng
> > Signed-off-by: Barry Song
> > ---
> >  mm/userfaultfd.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 867898c4e30b..34cf1c8c725d 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -18,6 +18,7 @@
> >  #include
> >  #include
> >  #include "internal.h"
> > +#include "swap.h"
> >
> >  static __always_inline
> >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> >  {
> > +       struct folio *folio;
> > +       swp_entry_t entry;
> > +
> >         if (!pte_swp_exclusive(orig_src_pte))
> >                 return -EBUSY;
> >
>
> Would be helpful to add a comment explaining that this is the case
> when the folio is in the swap cache.
>
> > +       entry = pte_to_swp_entry(orig_src_pte);
> > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > +       if (!IS_ERR(folio)) {
> > +               folio_put(folio);
> > +               return -EBUSY;
> > +       }
> > +
> >         double_pt_lock(dst_ptl, src_ptl);
> >
> >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > --
> > 2.39.3 (Apple Git-146)

Thanks
Barry