From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 06399C5B549 for ; Fri, 30 May 2025 08:49:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9889C6B008C; Fri, 30 May 2025 04:49:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 938036B0092; Fri, 30 May 2025 04:49:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 7FFDB6B0093; Fri, 30 May 2025 04:49:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 586336B008C for ; Fri, 30 May 2025 04:49:26 -0400 (EDT) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id CA557E0C13 for ; Fri, 30 May 2025 08:49:25 +0000 (UTC) X-FDA: 83498950290.12.911255F Received: from mail-lj1-f174.google.com (mail-lj1-f174.google.com [209.85.208.174]) by imf03.hostedemail.com (Postfix) with ESMTP id 90D6120003 for ; Fri, 30 May 2025 08:49:23 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=V75fzfoo; spf=pass (imf03.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.174 as permitted sender) smtp.mailfrom=ryncsn@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748594963; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=0bMp2GwF28M9nCY+1eTbEjES5I4QV1m653ObPpjGH6k=; b=CbRkX/OweB+UgqjiYHViEOiSH+b2uoRa12ppkdLzvXM1KOgfJpvAurwRLAFJfwF0pLpvNo 1P20klB9ruQ5gwANS8eTD4EPwNyiTnYEDES+D/lq4MbX41B7ySdokuRjcqDWFUrzKUFN1k hiBlN1/ewrQEeDD29vDSgOUY2enDL7k= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=V75fzfoo; spf=pass (imf03.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.208.174 as permitted sender) smtp.mailfrom=ryncsn@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748594963; a=rsa-sha256; cv=none; b=ntqPcaEI19n0vnjzuM6xcAlFN/ASZERYeoAqsaN4dRAjH/QpiXRO8D8WuPGGjNIzTGXf0l ajvTXsKgdxyMgwuB19sTQWnoaa3hBkO9eMv9Fo1WALey0xH8XJZ3wKX/4hkzcxEOqPh9dP zGvlShjh/tzA6RcvvKpfWWESN0CcBBs= Received: by mail-lj1-f174.google.com with SMTP id 38308e7fff4ca-30db1bd3bddso15977871fa.3 for ; Fri, 30 May 2025 01:49:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1748594962; x=1749199762; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=0bMp2GwF28M9nCY+1eTbEjES5I4QV1m653ObPpjGH6k=; b=V75fzfoopHMSWSnYihP4VQrza77+wsBqtzPMj4LXIXsTelrDEcRe9UdwyZNVoiSBZF 8gMfPUDQiAH5e/JIM0UPazmKwLpyVkZhdTz3e1tYKPgurE7cIRWbh8bYmC4jyUA8og1w p2UWTIqkHJtt1vFBa6Mvg7rX3zRcIYLe9+IiOHBxszkmPlWtjQTlGtiVhVGfMzBBNA21 s+w0MRbB8EGe8+ppUQaC1q1tIQ88dg8Bk1E5SfA8FIXVYXv436A17aRbKMmpZVRTiFm1 KOK3ZKfzG8rPcSkl8Ri+Qe6a7GG+aad6YELNeuOp04xawEydkQjJBzZcXRNcopnGHAzH abvQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748594962; x=1749199762; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=0bMp2GwF28M9nCY+1eTbEjES5I4QV1m653ObPpjGH6k=; b=gf/c32jQ2etWJAzU8nqvoLPf9t1ZGAPdVBI1DNjMHyizQwNkTR1KAMuoW/vy7FfHP8 ktVf7OtmRw0BD4pjCWpv3aUEmNydL1421V57NkjzTrQj9y3AoQg1C4anXm7JqT+aDq8e R5NrJTneDwbrx8WaVZy/Pr7l5mWoZMz9aeCXZs1Q7PIy5IgdxLO6YPuunerynD9yC+L+ QuG6QqQNtdzLkC3k/XgxW+ZH02QPc7sRvfNRnxhGZfCCGO0kv12mlEeRabt4+T4P8HYI NFR6Ocri/Dnx3I6d4MblISOFcbZlcqu4T2SMH6bPWmevZ8XDKaF0rjSxsq3IeaXY3bEV jbvQ== X-Forwarded-Encrypted: i=1; AJvYcCVCfq8xBqZaU674RNXwVyVrHO0gXcVWWHc90ils64JsT4qdQR+huDr8iytNenpYTIaMKNc0VwsQOA==@kvack.org X-Gm-Message-State: AOJu0YwXbL3RrgercqJetW2/N0215L9FdaxMQGTHqyv20Lq/lDYJekuM bRjwKCqXZYEY7E5ZztQjJOtKNpjkLH9Ckq9DvbFdLJ0V7ViSaak7WpbuaWGojWbvEgII4Nd728V xGXbvOs3+pk6iQb5C4jf/DnONbOsY9II= X-Gm-Gg: ASbGnctQ3b93oI/1fZgPTVGKQeFDiTRE4U+hsyCEEnWhbOrg0CpJMpQSpNXZpzxbB0/ DMmc3+JCevcZCGep2lBOagOyvxcKuSyx8rTx4PEWdJ96pZb4eZ3DNJUKgNaccshbmNaMzU/q9L9 Bv8nnL76S8lIL2XbLHjEqnj5vQFAz4slQz X-Google-Smtp-Source: AGHT+IHMUXxJ5vi0k1dwPqrpUyBi/SiUwVU9oHPtwN20lM6itfJnJo4i1TQfWd7HOIDyiJ+JF7GIMuHU7woIp38mUlo= X-Received: by 2002:a05:651c:b09:b0:32a:62a2:f72b with SMTP id 38308e7fff4ca-32a906cf1f8mr3745591fa.14.1748594960833; Fri, 30 May 2025 01:49:20 -0700 (PDT) MIME-Version: 1.0 References: <20250514201729.48420-6-ryncsn@gmail.com> <20250519043847.1806-1-21cnbao@gmail.com> In-Reply-To: From: Kairui Song Date: Fri, 30 May 2025 16:49:02 +0800 X-Gm-Features: AX0GCFundRkueUSS1m8MxSja_2c1kn5vAhBJf6EhcMcXxkYjJcXnH5ar5KUJuQQ Message-ID: Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention To: Barry Song <21cnbao@gmail.com> Cc: akpm@linux-foundation.org, Baolin Wang , Baoquan He , Chris Li , David Hildenbrand , Johannes Weiner , Hugh Dickins , Kalesh Singh , LKML , linux-mm , Nhat Pham , Ryan Roberts , Kemeng Shi , Tim Chen , Matthew Wilcox , "Huang, Ying" , Yosry Ahmed Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 90D6120003 X-Stat-Signature: akpipsa58ep8mbfpa768ugkctitapiim X-Rspam-User: X-Rspamd-Server: rspam04 X-HE-Tag: 1748594963-273011 X-HE-Meta: U2FsdGVkX19oZqSjcC5TNyT83cEAdHaBnn/oxH76rvNJWVfGBh8frczyLgOaTnY4FLF2j04YttljjmYsbdX0i5xfkka7GunWKMFbrwRJ6UleuHDnj3j77jHuuCc/oxl4xGDbSEMrWMrQoupH4whddk4w2e0C79B54YNUw2VPlW4S6xQBqzxpiKoTfb9uYj86r1zrSbUSTumiZzGKUQ1b2R5VyasZahfc4KQVvzPrK6sJi4r4a7Qm1exBwpI11Uzd3ENe8OoGB66YE+BEYk5iVWvMwO8lVPfUveyy9xVoJBmGOnUJ91nrGodanNk5UKQ7C1CRFET5ypFucfh9JEok390wQY9uC4Y2vxkxA4iusndWHrkDJF4VjFun5ebJhQluL7OAMckpt0K3xGQ95k1zB6ih2zgZAtHsCep9tJ2DT1ndZeA+dHP7ekH2AHbLZkCHMzR+cA/Q8vm1aY4/X7Tkd7EgerJSnbcKAhL5qYNe9CsX0PQxUZPAVc1zJUBWQK7PfUxLbuTnY1tQS+qykqTd2gSDmAlsGfqKnDkWqesbct/GQxtRXWKj7hJEEkryAP5jPBZiFjKDzsHRTcLDCUzWug9JtUPfjQQWARRHlTBSmr9IVAt3fvNUojz7XwFi4vwcI2kfMgiygNxrhHsFDPppE0M8gwr+myKmyC3ku9I9c9jth9d4iIWqP+lar4aSDTaTb/HdNe/mPYYoqMf2RCQu6oD0x4jO3ze2BU4NZI3moOhtCx6Rio1KWv2VS1cGs80qBRsD7P2j46HJV7y0KgQqYPYnOvFHHjVbj2MSCuDG/aO05HB8FEgTmLyTXptYoQfAbjgUnLLSuO4I7xqf8F7dZOqi2wx6yJl7Y7ZS9WqCQFXproxXahmJV2+gq9Fo6n0mdQB1Zqa+n9fX3MP8p83fEfs2GRt/jrEZtqWdXFyRgdbqVCF1XguAlNS17YkshQyWy/rSVggMAQ/zCFojkiN rkWLdSrg FmwQsqqR2ax7C82zguGzNsYE7SkeJA6J2GQbwPB16THzy9n0nz24ZdtBY4ozw4rfrtTizeJGcjT3xXhQgJB1ss8ybUUDMD+8tT0kRAIqVjnH/HB6XXG5LIjBHCKWhgt6HTM73h3Er7Mak1mVZ3w51H1fP64b987VTStIQe45kvxKg9/TfUJ5zhAGxB2Zcc1xx97yPHJqffHiBkjlBEvpDsvKW46OdYF88qb/Ie3s/++ONkwDgRuOKSXz5qbAYOgGbuS+vAl5qzcfSmT1lWRbF6zFUdJWWOEndXocFT5MX4cyi/ulNXm6y4DGYgnsNvII1+HkPn10ErSxCNFaSLo+6XKCazy7fRq38le6BMd1dlrdOt4ae6GslgEKoivH7G9RSCE2QIdB8ZT9TRK86WoV55bddoQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, May 27, 2025 at 11:11=E2=80=AFPM Kairui Song wro= te: > > On Tue, May 27, 2025 at 3:59=E2=80=AFPM Barry Song <21cnbao@gmail.com> wr= ote: > > > > On Sat, May 24, 2025 at 8:01=E2=80=AFAM Kairui Song = wrote: > > > > > > On Fri, May 23, 2025 at 10:30=E2=80=AFAM Barry Song <21cnbao@gmail.co= m> wrote: > > > > > > > > On Wed, May 21, 2025 at 2:45=E2=80=AFPM Kairui Song wrote: > > > > > > > > > > Barry Song <21cnbao@gmail.com> =E4=BA=8E 2025=E5=B9=B45=E6=9C=882= 1=E6=97=A5=E5=91=A8=E4=B8=89 06:33=E5=86=99=E9=81=93=EF=BC=9A > > > > > > Let me run test case [1] to check whether this ever happens. I = guess I need to > > > > > > hack kernel a bit to always add folio to swapcache even for SYN= C IO. > > > > > > > > > > That will cause quite a performance regression I think. Good thin= g is, > > > > > that's exactly the problem this series is solving by dropping the= SYNC > > > > > IO swapin path and never bypassing the swap cache, while improvin= g the > > > > > performance, eliminating things like this. One more reason to jus= tify > > > > > the approach :) > > > > > > Hi Barry, > > > > > > > > > > > I attempted to reproduce the scenario where a folio is added to the= swapcache > > > > after filemap_get_folio() returns NULL but before move_swap_pte() > > > > moves the swap PTE > > > > using non-synchronized I/O. Technically, this seems possible; howev= er, > > > > I was unable > > > > to reproduce it, likely because the time window between swapin_read= ahead and > > > > taking the page table lock within do_swap_page() is too short. > > > > > > Thank you so much for trying this! > > > > > > I have been trying to reproduce it too, and so far I didn't observe > > > any crash or warn. I added following debug code: > > > > > > static __always_inline > > > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long = dst_end) > > > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm, > > > pmd_t *dst_pmd, pmd_t *src_pmd, > > > pmd_t dummy_pmdval; > > > pmd_t dst_pmdval; > > > struct folio *src_folio =3D NULL; > > > + struct folio *tmp_folio =3D NULL; > > > struct anon_vma *src_anon_vma =3D NULL; > > > struct mmu_notifier_range range; > > > int err =3D 0; > > > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm= , > > > pmd_t *dst_pmd, pmd_t *src_pmd, > > > if (!src_folio) > > > folio =3D filemap_get_folio(swap_address_spac= e(entry), > > > swap_cache_index(entry)); > > > + udelay(get_random_u32_below(1000)); > > > + tmp_folio =3D filemap_get_folio(swap_address_space(en= try), > > > + swap_cache_index(entry)); > > > + if (!IS_ERR_OR_NULL(tmp_folio)) { > > > + if (!IS_ERR_OR_NULL(folio) && tmp_folio !=3D = folio) { > > > + pr_err("UFFDIO_MOVE: UNSTABLE folio > > > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio, > > > tmp_folio->swap.val); > > > + } > > > + folio_put(tmp_folio); > > > + } > > > if (!IS_ERR_OR_NULL(folio)) { > > > if (folio_test_large(folio)) { > > > err =3D -EBUSY; > > > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm, > > > pmd_t *dst_pmd, pmd_t *src_pmd, > > > err =3D move_swap_pte(mm, dst_vma, dst_addr, src_addr= , > > > dst_pte, src_pte, > > > orig_dst_pte, orig_src_pte, dst_pmd, = dst_pmdval, > > > dst_ptl, src_ptl, src_folio); > > > + if (tmp_folio !=3D folio && !err) > > > + pr_err("UFFDIO_MOVE: UNSTABLE folio passed > > > check: %lx -> %lx\n", folio, tmp_folio); > > > } > > > > > > And I saw these two prints are getting triggered like this (not a rea= l > > > issue though, just help to understand the problem) > > > ... > > > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) -> > > > fffffdffc7ccac80 (51) > > > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) -> > > > fffffdffc3435e00 (3b) > > > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) -> > > > fffffdffc34ab8c0 (76) > > > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) -> > > > fffffdffc34ab8c0 (6d) > > > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) -> > > > fffffdffc34eda80 (32) > > > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) -> > > > fffffdffc38cedc0 (b) > > > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdf= fc34db140 > > > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) -> > > > fffffdffc3465cc0 (21) > > > ... > > > > > > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) -> > > > fffffdffc3853540 (53)" worries me at first. On first look it seems th= e > > > folio is indeed freed completely from the swap cache after the first > > > lookup, so another swapout can reuse the entry. But as you mentioned > > > __remove_mapping won't release a folio if the refcount check fails, s= o > > > they must be freed by folio_free_swap or __try_to_reclaim_swap, there > > > are many places that can happen. But these two helpers won't free a > > > folio from swap cache if its swap count is not zero. And the folio > > > will either be swapped out (swap count non zero), or mapped (freeing > > > it is fine, PTE is non_swap, and another swapout will still use the > > > same folio). > > > > > > So after more investigation and dumping the pages, it's actually the > > > second lookup (tmp_folio) seeing the entry being reused by another > > > page table entry, after the first folio is swapped back and released. > > > So the page table check below will always fail just fine. > > > > > > But this also proves the first look up can see a completely irrelevan= t > > > folio too: If the src folio is swapped out, but got swapped back and > > > freed, then another folio B shortly got added to swap cache reuse the > > > src folio's old swap entry, then the folio B got seen by the look up > > > here and get freed from swap cache, then src folio got swapped out > > > again also reusing the same entry, then we have a problem as PTE seem= s > > > untouched indeed but we grabbed a wrong folio. Seems possible if I'm > > > not wrong: > > > > > > Something like this: > > > CPU1 CPU2 > > > move_pages_pte > > > entry =3D pte_to_swp_entry(orig_src_pte); > > > | Got Swap Entry S1 from src_pte > > > ... > > > > > > > I=E2=80=99m assuming you mean ``, since = I=E2=80=99m not > > sure where folio B comes from in the statement ` > swap out folio B>`. > > > > If that assumption is correct, and folio A is still in the swapcache, > > how could someone swap in folio B without hitting folio A? That would > > suggest folio A must have been removed from the swapcache earlier=E2=80= =94right? > > > > > > > > > > Sorry my bad, I think I made people think folio B is related to > src_pte at this point. What I actually mean is that: Another random > folio B, unrelated to src_pte, could got swapped out, and using the > swap entry S1. > > > > > > > ... > > > folio =3D swap_cache_get_folio(S1) > > > | Got folio B here !!! > > > move_swap_pte > > > > > > | Holding a reference doesn't pin = the cache > > > | as we have demonstrated > > > > > > double_pt_lock > > > is_pte_pages_stable > > > | Passed because of S1 is reused > > > folio_move_anon_rmap(...) > > > | Moved invalid folio B here !!! > > > > > > But this is extremely hard to reproduce though, even if doing it > > > deliberately... > > > > > > So I think a "folio_swap_contains" or equivalent check here is a good > > > thing to have, to make it more robust and easier to understand. The > > > checking after locking a folio has very tiny overhead and can > > > definitely ensure the folio's swap entry is valid and stable. > > > > > > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00" > > > here might seem problematic, but it's still not a real problem. That'= s > > > the case where the swapin in src region happens after the lookup, and > > > before the PTE lock. It will pass the PTE check without moving the > > > folio. But the folio is guaranteed to be a completely new folio here > > > because the folio can't be added back to the page table without > > > holding the PTE lock, and if that happens the following PTE check her= e > > > will fail. > > > > > > So I think we should patch the current kernel only adding a > > > "folio_swap_contains" equivalent check here, and maybe more comments, > > > how do you think? > > > > The description appears to have some inconsistencies. > > Would you mind rephrasing it? > > Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> > fffffdffc385fb00" part first, as both you and me have come into a > conclusion that "filemap_get_folio() returns NULL before > move_swap_pte, but a folio was added to swap cache" is OK, and this > output only proves that happens. > > So the problematic race is: > > Here move_pages_pte is moving src_pte to dst_pte, and it begins with > src_pte =3D=3D swap entry S1, and S1 isn't cached. > > CPU1 CPU2 > move_pages_pte() > entry =3D pte_to_swp_entry(orig_src_pte); > | src_pte is absent, and got entry =3D=3D S1 > ... < Somehow interrupted> ... > > | folio A is just a new allocated foli= o > | for resolving the swap in fault. > > | swap in fault is resolved, src_pte > | now points to folio A, so folio A > | can get freed just fine. > | And now S1 is free to be used > | by anyone. > > | Folio B is a completely unrelated > | folio swapped out by random process. > | (has nothing to do with src_pte) > | But S1 is freed so it may use S1 > | as its swap entry. > > ... > folio =3D filemap_get_folio(S1) > | The lookup is using S1, so it > | got folio B here !!! > ... < Somehow interrupted> ... > > | Folio B could fail to be swapped out > | or got swapped in again, so it can > | be freed by folio_free_swap or > | swap cache reclaim. > | CPU1 is holding a reference but it > | doesn't pin the swap cache folio > | as I have demonstrated with the > | test C program previously. > | New S1 is free to be used again. > > | No thing blocks this from happening > | The swapout is still using folio A, > | and src_pte =3D=3D S1. > folio_trylock(folio) > move_swap_pte > double_pt_lock > is_pte_pages_stable > | Passed because of S1 is reused so src_pte =3D=3D S1. > folio_move_anon_rmap(...) > | Moved invalid folio B here !!! > > It's a long and complex one, I don't think it's practically possible > to happen in reality but in theory doable, once in a million maybe... > Still we have to fix it, or did I got anything wrong here? Hi Barry, I managed to reproduce this issue, by hacking the kernel a bit (Only adding only delay to increase the race window, and adding a WARN to indicate the problem) 1. Applying following patch for kernel: =3D=3D=3D diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index bc473ad21202..1d710adf9839 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include "internal.h" @@ -1100,6 +1101,10 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma, * occur and hit the swapcache after moving the PTE. */ if (src_folio) { + if (WARN_ON(src_folio->swap.val !=3D pte_to_swp_entry(orig_src_pte).val)) + pr_err("Moving folio %lx (folio->swap =3D %lx), orig_src_pte =3D %lx\n", + (unsigned long)src_folio, src_folio->swap.va= l, + pte_to_swp_entry(orig_src_pte).val); folio_move_anon_rmap(src_folio, dst_vma); src_folio->index =3D linear_page_index(dst_vma, dst_addr); } @@ -1388,9 +1393,13 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, * folios in the swapcache. This issue needs to be resolved * separately to allow proper handling. */ + pr_err("DEBUG: Will do the lookup using entry %lx, wait 3s...\n", entry.val); + mdelay(1000 * 3); if (!src_folio) folio =3D filemap_get_folio(swap_address_space(entr= y), swap_cache_index(entry)); + pr_err("DEBUG: Got folio value %lx, wait 3s...\n", (unsigned long)folio); + mdelay(1000 * 3); if (!IS_ERR_OR_NULL(folio)) { if (folio_test_large(folio)) { err =3D -EBUSY; 2. Save following program in userspace (didn't bother with error check for simpler code): =3D=3D=3D #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #define PAGE_SIZE 4096 /* Need to consume all slots so define the swap device size here */ #define SWAP_DEVICE_SIZE (PAGE_SIZE * 1024 - 1) char *src, *race, *dst, *place_holder; int uffd; void read_in(char *p) { /* This test program initials memory with 0xAA to bypass zeromap */ while (*((volatile char*)p) !=3D 0xAA); } void *reader_thread(void *arg) { /* Test requires kernel to wait upon uffd move */ read_in(dst); return NULL; } void *fault_handler_thread(void *arg) { int ret; struct uffd_msg msg; struct uffdio_move move; struct pollfd pollfd =3D { .fd =3D uffd, .events =3D POLLIN }; pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL); pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL); while (1) { poll(&pollfd, 1, -1); read(uffd, &msg, sizeof(msg)); move.src =3D (unsigned long)src + (msg.arg.pagefault.address - (unsigned long)dst); move.dst =3D msg.arg.pagefault.address & ~(PAGE_SIZE - 1); move.len =3D PAGE_SIZE; move.mode =3D 0; ioctl(uffd, UFFDIO_MOVE, &move); } return NULL; } int main() { pthread_t fault_handler_thr, reader_thr; struct uffdio_api uffdio_api =3D { .api =3D UFFD_API, .features =3D 0 }= ; struct uffdio_register uffdio_register; src =3D mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); dst =3D mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); memset(src, 0xAA, PAGE_SIZE); madvise(src, PAGE_SIZE, MADV_PAGEOUT); /* Consume all slots on swap device left only one entry (S1) */ place_holder =3D mmap(NULL, SWAP_DEVICE_SIZE - 1, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); memset(place_holder, 0xAA, SWAP_DEVICE_SIZE - 1); madvise(place_holder, SWAP_DEVICE_SIZE - 1, MADV_PAGEOUT); /* Setup uffd handler and dst reader */ uffd =3D syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK); ioctl(uffd, UFFDIO_API, &uffdio_api); uffdio_register.range.start =3D (unsigned long)dst; uffdio_register.range.len =3D PAGE_SIZE; uffdio_register.mode =3D UFFDIO_REGISTER_MODE_MISSING; ioctl(uffd, UFFDIO_REGISTER, &uffdio_register); pthread_create(&fault_handler_thr, NULL, fault_handler_thread, NULL); pthread_create(&reader_thr, NULL, reader_thread, NULL); /* Wait for UFFDIO to start */ sleep(1); /* Release src folio (A) from swap, freeing the entry S1 */ read_in(src); /* Swapout another race folio (B) using S1 */ race =3D mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); memset(race, 0xAA, PAGE_SIZE); madvise(race, PAGE_SIZE, MADV_PAGEOUT); /* Wait for UFFDIO swap lookup to see the race folio (B) */ sleep(3); /* Free the race folio (B) from swap */ read_in(race); /* And swap out src folio (A) again, using S1 */ madvise(src, PAGE_SIZE, MADV_PAGEOUT); /* Kernel should have moved a wrong folio by now */ pthread_join(reader_thr, NULL); pthread_cancel(fault_handler_thr); pthread_join(fault_handler_thr, NULL); munmap(race, PAGE_SIZE); munmap(src, PAGE_SIZE); munmap(dst, PAGE_SIZE); close(uffd); return 0; } 3. Run the test with (ensure no other swap device is mounted and current dir is on a block device): =3D=3D=3D dd if=3D/dev/zero of=3Dswap.img bs=3D1M count=3D1; mkswap swap.img; swapon swap.img; gcc test-uffd.c && ./a.out Then we get the WARN: [ 348.200587] ------------[ cut here ]------------ [ 348.200599] WARNING: CPU: 7 PID: 1856 at mm/userfaultfd.c:1104 move_pages_pte+0xdb8/0x11a0 [ 348.207544] Modules linked in: loop [ 348.209401] CPU: 7 UID: 0 PID: 1856 Comm: a.out Kdump: loaded Not tainted 6.15.0-rc6ptch-00381-g99f00d7c6c6f-dirty #304 PREEMPT(voluntary) [ 348.214579] Hardware name: QEMU QEMU Virtual Machine, BIOS edk2-stable202408-prebuilt.qemu.org 08/13/2024 [ 348.218656] pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE= =3D--) [ 348.222013] pc : move_pages_pte+0xdb8/0x11a0 [ 348.224062] lr : move_pages_pte+0x928/0x11a0 [ 348.225881] sp : ffff800088b2b8f0 [ 348.227360] x29: ffff800088b2b970 x28: 0000000000000000 x27: 0000ffffbc9= 20000 [ 348.230228] x26: fffffdffc335e4a8 x25: 0000000000000001 x24: fffffdffc3e= 4dd40 [ 348.233159] x23: 080000010d792403 x22: ffff0000cd792900 x21: ffff0000c5a= 6d2c0 [ 348.236339] x20: fffffdffc335e4a8 x19: 0000000000001004 x18: 00000000000= 00006 [ 348.239269] x17: 0000ffffbc920000 x16: 0000ffffbc922fff x15: 00000000000= 00003 [ 348.242703] x14: ffff8000812c3b68 x13: 0000000000000003 x12: 00000000000= 00003 [ 348.245947] x11: 0000000000000000 x10: ffff800081e4feb8 x9 : 00000000000= 00001 [ 348.249284] x8 : 0000000000000000 x7 : 6f696c6f6620746f x6 : 47203a47554= 24544 [ 348.252071] x5 : ffff8000815789e3 x4 : ffff8000815789e5 x3 : 00000000000= 00000 [ 348.255358] x2 : ffff0001fed2aef0 x1 : 0000000000000000 x0 : fffffdffc33= 5e4a8 [ 348.258134] Call trace: [ 348.259468] move_pages_pte+0xdb8/0x11a0 (P) [ 348.261348] move_pages+0x3c0/0x738 [ 348.262987] userfaultfd_ioctl+0x3d8/0x1f98 [ 348.264916] __arm64_sys_ioctl+0x88/0xd0 [ 348.266779] invoke_syscall+0x64/0xec [ 348.268347] el0_svc_common+0xa8/0xd8 [ 348.269967] do_el0_svc+0x1c/0x28 [ 348.271711] el0_svc+0x40/0xe0 [ 348.273345] el0t_64_sync_handler+0x78/0x108 [ 348.274821] el0t_64_sync+0x19c/0x1a0 [ 348.276117] ---[ end trace 0000000000000000 ]--- [ 348.278638] Moving folio fffffdffc3e4dd40 (folio->swap =3D 0), orig_src_= pte =3D 1 That's the new added WARN, but the test program also hung with D forever, and with errors with other tests like: [ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_ANONPAGES val:-1 [ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0 type:MM_SHMEMPAGES val:1 Because the kernel just moved the wrong folio, so unmap takes forever looking for the missing folio, and counting went wrong too. So this race is real. It's extremely unlikely to happen because it requires multiple collisions of multiple tiny race windows, however it's not impossible. I'll post a fix very soon.