From: Lokesh Gidra <lokeshgidra@google.com>
Date: Wed, 19 Feb 2025 13:32:40 -0800
Subject: Re: [PATCH RFC] mm: Fix kernel BUG when userfaultfd_move encounters swapcache
To: Barry Song <21cnbao@gmail.com>
Cc: Suren Baghdasaryan, linux-mm@kvack.org, akpm@linux-foundation.org,
 linux-kernel@vger.kernel.org, zhengtangquan@oppo.com, Barry Song,
 Andrea Arcangeli, Al Viro, Axel Rasmussen, Brian Geffon,
 Christian Brauner, David Hildenbrand, Hugh Dickins, Jann Horn,
 Kalesh Singh, "Liam R. Howlett", Matthew Wilcox, Michal Hocko,
 Mike Rapoport, Nicolas Geoffray, Peter Xu, Ryan Roberts, Shuah Khan,
 ZhangPeng, Yu Zhao
References: <20250219112519.92853-1-21cnbao@gmail.com>
On Wed, Feb 19, 2025 at 1:26 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Feb 20, 2025 at 10:03 AM Lokesh Gidra wrote:
> >
> > On Wed, Feb 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Thu, Feb 20, 2025 at 7:27 AM Suren Baghdasaryan wrote:
> > > >
> > > > On Wed, Feb 19, 2025 at 3:25 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > From: Barry Song
> > > > >
> > > > > userfaultfd_move() checks whether the PTE entry is present or a
> > > > > swap entry.
> > > > >
> > > > > - If the PTE entry is present, move_present_pte() handles folio
> > > > >   migration by setting:
> > > > >
> > > > >       src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > >
> > > > > - If the PTE entry is a swap entry, move_swap_pte() simply copies
> > > > >   the PTE to the new dst_addr.
> > > > >
> > > > > This approach is incorrect because even if the PTE is a swap
> > > > > entry, it can still reference a folio that remains in the swap
> > > > > cache.
> > > > >
> > > > > If do_swap_page() is triggered, it may locate the folio in the
> > > > > swap cache.
> > > > > However, during add_rmap operations, a kernel panic
> > > > > can occur due to:
> > > > >
> > > > >     page_pgoff(folio, page) != linear_page_index(vma, address)
> > > >
> > > > Thanks for the report and reproducer!
> > > >
> > > > >
> > > > > $./a.out > /dev/null
> > > > > [ 13.336953] page: refcount:6 mapcount:1 mapping:00000000f43db19c index:0xffffaf150 pfn:0x4667c
> > > > > [ 13.337520] head: order:2 mapcount:1 entire_mapcount:0 nr_pages_mapped:1 pincount:0
> > > > > [ 13.337716] memcg:ffff00000405f000
> > > > > [ 13.337849] anon flags: 0x3fffc0000020459(locked|uptodate|dirty|owner_priv_1|head|swapbacked|node=0|zone=0|lastcpupid=0xffff)
> > > > > [ 13.338630] raw: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [ 13.338831] raw: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [ 13.339031] head: 03fffc0000020459 ffff80008507b538 ffff80008507b538 ffff000006260361
> > > > > [ 13.339204] head: 0000000ffffaf150 0000000000004000 0000000600000000 ffff00000405f000
> > > > > [ 13.339375] head: 03fffc0000000202 fffffdffc0199f01 ffffffff00000000 0000000000000001
> > > > > [ 13.339546] head: 0000000000000004 0000000000000000 00000000ffffffff 0000000000000000
> > > > > [ 13.339736] page dumped because: VM_BUG_ON_PAGE(page_pgoff(folio, page) != linear_page_index(vma, address))
> > > > > [ 13.340190] ------------[ cut here ]------------
> > > > > [ 13.340316] kernel BUG at mm/rmap.c:1380!
> > > > > [ 13.340683] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
> > > > > [ 13.340969] Modules linked in:
> > > > > [ 13.341257] CPU: 1 UID: 0 PID: 107 Comm: a.out Not tainted 6.14.0-rc3-gcf42737e247a-dirty #299
> > > > > [ 13.341470] Hardware name: linux,dummy-virt (DT)
> > > > > [ 13.341671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > > > [ 13.341815] pc : __page_check_anon_rmap+0xa0/0xb0
> > > > > [ 13.341920] lr : __page_check_anon_rmap+0xa0/0xb0
> > > > > [ 13.342018] sp : ffff80008752bb20
> > > > > [ 13.342093] x29: ffff80008752bb20 x28: fffffdffc0199f00 x27: 0000000000000001
> > > > > [ 13.342404] x26: 0000000000000000 x25: 0000000000000001 x24: 0000000000000001
> > > > > [ 13.342575] x23: 0000ffffaf0d0000 x22: 0000ffffaf0d0000 x21: fffffdffc0199f00
> > > > > [ 13.342731] x20: fffffdffc0199f00 x19: ffff000006210700 x18: 00000000ffffffff
> > > > > [ 13.342881] x17: 6c203d2120296567 x16: 6170202c6f696c6f x15: 662866666f67705f
> > > > > [ 13.343033] x14: 6567617028454741 x13: 2929737365726464 x12: ffff800083728ab0
> > > > > [ 13.343183] x11: ffff800082996bf8 x10: 0000000000000fd7 x9 : ffff80008011bc40
> > > > > [ 13.343351] x8 : 0000000000017fe8 x7 : 00000000fffff000 x6 : ffff8000829eebf8
> > > > > [ 13.343498] x5 : c0000000fffff000 x4 : 0000000000000000 x3 : 0000000000000000
> > > > > [ 13.343645] x2 : 0000000000000000 x1 : ffff0000062db980 x0 : 000000000000005f
> > > > > [ 13.343876] Call trace:
> > > > > [ 13.344045]  __page_check_anon_rmap+0xa0/0xb0 (P)
> > > > > [ 13.344234]  folio_add_anon_rmap_ptes+0x22c/0x320
> > > > > [ 13.344333]  do_swap_page+0x1060/0x1400
> > > > > [ 13.344417]  __handle_mm_fault+0x61c/0xbc8
> > > > > [ 13.344504]  handle_mm_fault+0xd8/0x2e8
> > > > > [ 13.344586]  do_page_fault+0x20c/0x770
> > > > > [ 13.344673]  do_translation_fault+0xb4/0xf0
> > > > > [ 13.344759]  do_mem_abort+0x48/0xa0
> > > > > [ 13.344842]  el0_da+0x58/0x130
> > > > > [ 13.344914]  el0t_64_sync_handler+0xc4/0x138
> > > > > [ 13.345002]  el0t_64_sync+0x1ac/0x1b0
> > > > > [ 13.345208] Code: aa1503e0 f000f801 910f6021 97ff5779 (d4210000)
> > > > > [ 13.345504] ---[ end trace 0000000000000000 ]---
> > > > > [ 13.345715] note: a.out[107] exited with irqs disabled
> > > > > [ 13.345954] note: a.out[107] exited with preempt_count 2
> > > > >
> > > > > Fully fixing it would be quite complex, requiring similar handling
> > > > > of folios as done in move_present_pte.
> > > >
> > > > How complex would that be? Is it a matter of adding
> > > > folio_maybe_dma_pinned() checks, doing folio_move_anon_rmap() and
> > > > folio->index = linear_page_index like in move_present_pte() or
> > > > something more?
> > >
> > > My main concern is still with large folios that require a split_folio()
> > > during move_pages(), as the entire folio shares the same index and
> > > anon_vma. However, userfaultfd_move() moves pages individually,
> > > making a split necessary.
> > >
> > > However, in split_huge_page_to_list_to_order(), there is a:
> > >
> > >         if (folio_test_writeback(folio))
> > >                 return -EBUSY;
> > >
> > > This is likely true for swapcache, right? However, even for move_present_pte(),
> > > it simply returns -EBUSY:
> > >
> > > move_pages_pte()
> > > {
> > >         /* at this point we have src_folio locked */
> > >         if (folio_test_large(src_folio)) {
> > >                 /* split_folio() can block */
> > >                 pte_unmap(&orig_src_pte);
> > >                 pte_unmap(&orig_dst_pte);
> > >                 src_pte = dst_pte = NULL;
> > >                 err = split_folio(src_folio);
> > >                 if (err)
> > >                         goto out;
> > >
> > >                 /* have to reacquire the folio after it got split */
> > >                 folio_unlock(src_folio);
> > >                 folio_put(src_folio);
> > >                 src_folio = NULL;
> > >                 goto retry;
> > >         }
> > > }
> > >
> > > Do we need a folio_wait_writeback() before calling split_folio()?
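A rough sketch of that idea, for concreteness (untested and not a real patch, based only on the move_pages_pte() snippet quoted above; folio_wait_writeback() requires the folio lock, which is already held at this point):

```
/* Untested sketch: drain writeback while the folio is still locked,
 * so split_huge_page_to_list_to_order() does not bail out with
 * -EBUSY on a swapcache folio that is under writeback.
 */
if (folio_test_large(src_folio)) {
        /* split_folio() can block */
        pte_unmap(&orig_src_pte);
        pte_unmap(&orig_dst_pte);
        src_pte = dst_pte = NULL;
        folio_wait_writeback(src_folio);   /* new: wait before splitting */
        err = split_folio(src_folio);
        /* ... unchanged: unlock, put, retry as in the quoted code ... */
}
```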
> > >
> > > By the way, I have also reported that userfaultfd_move() has a fundamental
> > > conflict with TAO (Cc'ed Yu Zhao), which has been part of the Android common
> > > kernel. In this scenario, folios in the virtual zone won't be split in
> > > split_folio(). Instead, the large folio migrates into nr_pages small folios.
> > >
> > > Thus, the best-case scenario would be:
> > >
> > > mTHP -> migrate to small folios in split_folio() -> move small folios to
> > > dst_addr
> > >
> > > While this works, it negates the performance benefits of
> > > userfaultfd_move(), as it introduces two PTE operations (the migration in
> > > split_folio() and the move in userfaultfd_move() on retry), nr_pages
> > > memory allocations, and still requires one memcpy(). This could end up
> > > performing even worse than userfaultfd_copy(), I guess.
> > >
> > > The worst-case scenario would be failing to allocate small folios in
> > > split_folio(); then userfaultfd_move() might return -ENOMEM?
> > >
> > > Given these issues, I strongly recommend that ART hold off on upgrading
> > > to userfaultfd_move() until these problems are fully understood and
> > > resolved. Otherwise, we're in for a rough ride!
> >
> > At the moment, ART GC doesn't take mTHP into consideration. We don't
> > try to be careful in userspace to be large-page aligned or anything.
> > Also, the MOVE ioctl implementation works either on huge pages or on
> > normal pages. IIUC, it can't handle mTHP large pages as a whole. But
> > that's true for other userfaultfd ioctls as well. If we were to
> > continue using COPY, it's not that it's in any way more friendly to
> > mTHP than MOVE. In fact, that's one of the reasons I'm considering
> > making the ART heap NO_HUGEPAGE to avoid the need for folio-split
> > entirely.
>
> Disabling mTHP is one way to avoid potential bugs.
> However, as long as
> UFFDIO_MOVE is available, we can't prevent others, aside from ART GC,
> from using it, right? So, we still need to address these issues with mTHP.
>
> If a trend-following Android app discovers the UFFDIO_MOVE API, it might
> use it, and it may not necessarily know to disable hugepages. Doesn't that
> pose a risk?
>

I absolutely agree that these issues need to be addressed. In particular,
the correctness bugs must be resolved as early as possible. I was just
trying to answer your question as to why we want to use it, now that it
is available, instead of continuing with the COPY ioctl. As and when the
MOVE ioctl starts handling mTHP efficiently, I will make the required
changes in userspace to leverage the mTHP benefits.

> >
> > Furthermore, there are a few cases in which the COPY ioctl's overhead
> > just doesn't make sense for ART GC. So starting to use the MOVE ioctl
> > is the right thing to do.
> >
> > What we need eventually to gain the mTHP benefits is both the MOVE
> > ioctl supporting large-page migration and the GC code in userspace
> > working with mTHP in mind.
> >
> > > > >
> > > > > For now, a quick solution is to return -EBUSY. I'd like to see
> > > > > others' opinions on whether a full fix is worth pursuing.
> > > > >
> > > > > For anyone interested in reproducing it, the a.out test program is
> > > > > as below,
> > > > >
> > > > > #define _GNU_SOURCE
> > > > > #include <stdio.h>
> > > > > #include <stdlib.h>
> > > > > #include <string.h>
> > > > > #include <unistd.h>
> > > > > #include <fcntl.h>
> > > > > #include <poll.h>
> > > > > #include <pthread.h>
> > > > > #include <sys/mman.h>
> > > > > #include <sys/ioctl.h>
> > > > > #include <sys/syscall.h>
> > > > > #include <linux/types.h>
> > > > > #include <linux/userfaultfd.h>
> > > > >
> > > > > #define PAGE_SIZE 4096
> > > > > #define REGION_SIZE (512 * 1024)
> > > > >
> > > > > #ifndef UFFDIO_MOVE
> > > > > struct uffdio_move {
> > > > >         __u64 dst;
> > > > >         __u64 src;
> > > > >         __u64 len;
> > > > > #define UFFDIO_MOVE_MODE_DONTWAKE ((__u64)1<<0)
> > > > > #define UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES ((__u64)1<<1)
> > > > >         __u64 mode;
> > > > >         __s64 move;
> > > > > };
> > > > > #define _UFFDIO_MOVE (0x05)
> > > > > #define UFFDIO_MOVE _IOWR(UFFDIO, _UFFDIO_MOVE, struct uffdio_move)
> > > > > #endif
> > > > >
> > > > > void *src, *dst;
> > > > > int uffd;
> > > > >
> > > > > void *madvise_thread(void *arg) {
> > > > >         if (madvise(src, REGION_SIZE, MADV_PAGEOUT) == -1) {
> > > > >                 perror("madvise MADV_PAGEOUT");
> > > > >         }
> > > > >         return NULL;
> > > > > }
> > > > >
> > > > > void *fault_handler_thread(void *arg) {
> > > > >         struct uffd_msg msg;
> > > > >         struct uffdio_move move;
> > > > >         struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> > > > >
> > > > >         pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> > > > >         pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
> > > > >
> > > > >         while (1) {
> > > > >                 if (poll(&pollfd, 1, -1) == -1) {
> > > > >                         perror("poll");
> > > > >                         exit(EXIT_FAILURE);
> > > > >                 }
> > > > >
> > > > >                 if (read(uffd, &msg, sizeof(msg)) <= 0) {
> > > > >                         perror("read");
> > > > >                         exit(EXIT_FAILURE);
> > > > >                 }
> > > > >
> > > > >                 if (msg.event != UFFD_EVENT_PAGEFAULT) {
> > > > >                         fprintf(stderr, "Unexpected event\n");
> > > > >                         exit(EXIT_FAILURE);
> > > > >                 }
> > > > >
> > > > >                 move.src = (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst);
> > > > >                 move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> > > > >                 move.len = PAGE_SIZE;
> > > > >                 move.mode = 0;
> > > > >
> > > > >                 if (ioctl(uffd, UFFDIO_MOVE, &move) == -1) {
> > > > >                         perror("UFFDIO_MOVE");
> > > > >                         exit(EXIT_FAILURE);
> > > > >                 }
> > > > >         }
> > > > >         return NULL;
> > > > > }
> > > > >
> > > > > int main() {
> > > > > again:
> > > > >         pthread_t thr, madv_thr;
> > > > >         struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> > > > >         struct uffdio_register uffdio_register;
> > > > >
> > > > >         src = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > >         if (src == MAP_FAILED) {
> > > > >                 perror("mmap src");
> > > > >                 exit(EXIT_FAILURE);
> > > > >         }
> > > > >         memset(src, 1, REGION_SIZE);
> > > > >
> > > > >         dst = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > > > >         if (dst == MAP_FAILED) {
> > > > >                 perror("mmap dst");
> > > > >                 exit(EXIT_FAILURE);
> > > > >         }
> > > > >
> > > > >         uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> > > > >         if (uffd == -1) {
> > > > >                 perror("userfaultfd");
> > > > >                 exit(EXIT_FAILURE);
> > > > >         }
> > > > >
> > > > >         if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
> > > > >                 perror("UFFDIO_API");
> > > > >                 exit(EXIT_FAILURE);
> > > > >         }
> > > > >
> > > > >         uffdio_register.range.start = (unsigned long)dst;
> > > > >         uffdio_register.range.len = REGION_SIZE;
> > > > >         uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> > > > >
> > > > >         if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
> > > > >                 perror("UFFDIO_REGISTER");
> > > > >                 exit(EXIT_FAILURE);
> > > > >         }
> > > > >
> > > > >         if (pthread_create(&madv_thr, NULL, madvise_thread, NULL) != 0) {
> > > > >                 perror("pthread_create madvise_thread");
> > > > >                 exit(EXIT_FAILURE);
> > > > >         }
> > > > >
> > > > >         if (pthread_create(&thr, NULL, fault_handler_thread, NULL) != 0) {
> > > > >                 perror("pthread_create fault_handler_thread");
> > > > >                 exit(EXIT_FAILURE);
> > > > >         }
> > > > >
> > > > >         for (size_t i = 0; i < REGION_SIZE; i += PAGE_SIZE) {
> > > > >                 char val = ((char *)dst)[i];
> > > > >                 printf("Accessing dst at offset %zu, value: %d\n", i, val);
> > > > >         }
> > > > >
> > > > >         pthread_join(madv_thr, NULL);
> > > > >         pthread_cancel(thr);
> > > > >         pthread_join(thr, NULL);
> > > > >
> > > > >         munmap(src, REGION_SIZE);
> > > > >         munmap(dst, REGION_SIZE);
> > > > >         close(uffd);
> > > > >         goto again;
> > > > >         return 0;
> > > > > }
> > > > >
> > > > > As long as you enable mTHP (which likely increases the residency
> > > > > time of swapcache), you can reproduce the issue within a few
> > > > > seconds. But I guess the same race condition also exists with
> > > > > small folios.
> > > > >
> > > > > Fixes: adef440691bab ("userfaultfd: UFFDIO_MOVE uABI")
> > > > > Cc: Andrea Arcangeli
> > > > > Cc: Suren Baghdasaryan
> > > > > Cc: Al Viro
> > > > > Cc: Axel Rasmussen
> > > > > Cc: Brian Geffon
> > > > > Cc: Christian Brauner
> > > > > Cc: David Hildenbrand
> > > > > Cc: Hugh Dickins
> > > > > Cc: Jann Horn
> > > > > Cc: Kalesh Singh
> > > > > Cc: Liam R. Howlett
> > > > > Cc: Lokesh Gidra
> > > > > Cc: Matthew Wilcox (Oracle)
> > > > > Cc: Michal Hocko
> > > > > Cc: Mike Rapoport (IBM)
> > > > > Cc: Nicolas Geoffray
> > > > > Cc: Peter Xu
> > > > > Cc: Ryan Roberts
> > > > > Cc: Shuah Khan
> > > > > Cc: ZhangPeng
> > > > > Signed-off-by: Barry Song
> > > > > ---
> > > > >  mm/userfaultfd.c | 11 +++++++++++
> > > > >  1 file changed, 11 insertions(+)
> > > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index 867898c4e30b..34cf1c8c725d 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -18,6 +18,7 @@
> > > > >  #include <asm/tlbflush.h>
> > > > >  #include <asm/tlb.h>
> > > > >  #include "internal.h"
> > > > > +#include "swap.h"
> > > > >
> > > > >  static __always_inline
> > > > >  bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > > > @@ -1079,9 +1080,19 @@ static int move_swap_pte(struct mm_struct *mm,
> > > > >                          pmd_t *dst_pmd, pmd_t dst_pmdval,
> > > > >                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> > > > >  {
> > > > > +       struct folio *folio;
> > > > > +       swp_entry_t entry;
> > > > > +
> > > > >         if (!pte_swp_exclusive(orig_src_pte))
> > > > >                 return -EBUSY;
> > > >
> > > > Would be helpful to add a comment explaining that this is the case
> > > > when the folio is in the swap cache.
> > > >
> > > > > +       entry = pte_to_swp_entry(orig_src_pte);
> > > > > +       folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > > > +       if (!IS_ERR(folio)) {
> > > > > +               folio_put(folio);
> > > > > +               return -EBUSY;
> > > > > +       }
> > > > > +
> > > > >         double_pt_lock(dst_ptl, src_ptl);
> > > > >
> > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > --
> > > > > 2.39.3 (Apple Git-146)
> > > > >
> Thanks
> Barry