From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B3D74C54F30 for ; Tue, 27 May 2025 15:11:50 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 4F8206B00B1; Tue, 27 May 2025 11:11:50 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 4D0156B00B2; Tue, 27 May 2025 11:11:50 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 3E5EE6B00B3; Tue, 27 May 2025 11:11:50 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 18CA36B00B1 for ; Tue, 27 May 2025 11:11:50 -0400 (EDT) Received: from smtpin28.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id AF84BBE6B6 for ; Tue, 27 May 2025 15:11:49 +0000 (UTC) X-FDA: 83489027538.28.B7A995C Received: from mail-lf1-f53.google.com (mail-lf1-f53.google.com [209.85.167.53]) by imf07.hostedemail.com (Postfix) with ESMTP id 9DCC84000B for ; Tue, 27 May 2025 15:11:47 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b="K4PQY8/F"; spf=pass (imf07.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.167.53 as permitted sender) smtp.mailfrom=ryncsn@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1748358707; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=h56XKNtyXyta3lZZ70PXL6/fM382B+6dMGx8vne5QvA=; b=uitxg5+ArRB8TVa9+fXAE9GQEHXmOnkm/HSrNCCvs7Xbc+ocBJ2up7K6TMbya8BSYYSb2n PwCysXuZSkcMd80tH+4ETlxoGKbiR4baoQLOiqsh9k+OOaNg5MfsujTDNVgrzGPimIRTFn GazSsjUzRQqbZxtSvZyhgPLiXA89UT4= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b="K4PQY8/F"; spf=pass (imf07.hostedemail.com: domain of ryncsn@gmail.com designates 209.85.167.53 as permitted sender) smtp.mailfrom=ryncsn@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1748358707; a=rsa-sha256; cv=none; b=czi4FqIcP38a8kV9PHECIpv4daTBbCR1y82IS+PwZnRxv1dZxq142v8Jk6kPh0qFifFhVu H0JFcqdUsh2SkU+1tNwyerRL8vBg1phn4Q6//yE0+KDC8K4THOyBBBkx7ofzVYg13AdIUa hcVe4F/c4VM/fzszHqm9yoCmoK5BaOc= Received: by mail-lf1-f53.google.com with SMTP id 2adb3069b0e04-551fc6d4a76so4020488e87.0 for ; Tue, 27 May 2025 08:11:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1748358706; x=1748963506; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=h56XKNtyXyta3lZZ70PXL6/fM382B+6dMGx8vne5QvA=; b=K4PQY8/F8IDMqgjfqgFjezZYSj0eemHbOvhDxTUEdqI2wnFvcMy8A+uJahcno4g0Zz LDETayBG7XtysAQj5ynLdvpMcZGpIKA86D0sl0XIr/eIBaqKnPZRz4LIA1Vvqxz+RGoC BlJNMahjP10dDbNMFsynRVlYa8iY37D5tgr0XypyYn27lsHcUHoYYc16Lh51YU5oyZEB kHjQRDRAZg0sJHVgXRZir2DiiE2G62WtAViF7QeJ+gPd3oY5Lpvi5wRm+D/VL0+m83Fh TqqrJkVRLqP5Fa3TNIMthpDMPb1qh3mp67OVE7EYP7fowSCvu6z7sPMNMRluOFExCnXv YrFA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1748358706; x=1748963506; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=h56XKNtyXyta3lZZ70PXL6/fM382B+6dMGx8vne5QvA=; b=oXhWbfEvxClo0+fwepaW+8cqUORL5uq5Zc6G9FQLE4kb5MgQrtvG3RzVSpfe0GkxgQ jLK30q6Pr6R8Dz4C0Mn98usGY71+y90Ui2tzdOeslsoGegZ6tDY5pyrLTkKbwGNI4rEL mW7azaOxCK4qi6UUmU/mXS0Qhpk5RVFyhNDLbzeGDwYn6q37kd7DDLa4z0xvteLsxcMD sKIGeufxLcmkl8D4DaBHjeS2ogTlCd+xr9wPuB7eQxJ73ENIL59SLUHSZMxjm4nh74E8 n3fxvxHgRgUF7MjDecROF4DNmpbxI8+b/RALKmZYGOW+5wKu3478jHrSwkroPgLvv3o8 p6ZQ== X-Forwarded-Encrypted: i=1; AJvYcCW8wg0pK8/RHxobt2kPb2tejVFoBeQBZYbQfMDDPRMqE9ZRA0e3yinNWn7p0SaVLNaJfy+xcpZz6w==@kvack.org X-Gm-Message-State: AOJu0YyLH0umU7/N8Mix54AgCGWFp98EJUMQQXsWf1KGnvjw3dUc1Y2g OAK3dbRe3rwcTAoImPo3ddqF6KkQcah6+LWFIAZt/Dxk0+J6/MXZD0TbnfwuVQLZE+4eovP2KmT +0gL/p34CVcVzm4ki01dzn3lRzaymOHw= X-Gm-Gg: ASbGncsW0UqwMHSMTEVBHlnYC9u+y2Ma2A6yYWWzBJeVLAVdbKc7f4D9IlZsX7zJSm1 8nVqoJTf5LbYbP766jzZykSHqsGQC0uYEarTdbjgM+t/gghuIkym6r67riBsfuieviNIUjpiMUd 567sAMS4vI5A3w9CBvNtm0PLGcSIsTCcqa X-Google-Smtp-Source: AGHT+IFj/rnJCQCIWxhAi6iORfvdK9JLWNWyS0gII8ApwLnDzqbhGsh597o2ylDtcd0y3t4VEbA/04I+0x+ATzvuuRw= X-Received: by 2002:a05:6512:318e:b0:553:202e:a405 with SMTP id 2adb3069b0e04-553202ea4a3mr2714050e87.0.1748358705329; Tue, 27 May 2025 08:11:45 -0700 (PDT) MIME-Version: 1.0 References: <20250514201729.48420-6-ryncsn@gmail.com> <20250519043847.1806-1-21cnbao@gmail.com> In-Reply-To: From: Kairui Song Date: Tue, 27 May 2025 23:11:28 +0800 X-Gm-Features: AX0GCFsywKHdweAQpsq3ezsHZo1lW50RhNthF4TuGvSdOh8qQZEIpIMulnqV3qg Message-ID: Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention To: Barry Song <21cnbao@gmail.com> Cc: akpm@linux-foundation.org, Baolin Wang , Baoquan He , Chris Li , David Hildenbrand , Johannes Weiner , Hugh Dickins , Kalesh Singh , LKML , linux-mm , Nhat Pham , Ryan Roberts , Kemeng Shi , Tim Chen , Matthew Wilcox , "Huang, Ying" , Yosry Ahmed Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam01 X-Rspamd-Queue-Id: 9DCC84000B X-Stat-Signature: jbhh8bo5dceib6kntwqi843ag6msegs6 X-Rspam-User: X-HE-Tag: 1748358707-128915 X-HE-Meta: U2FsdGVkX1/aArlkn6PCZoW7Mpz1Wa+oGwZeTlvrpWkg2Ag36h6uORhnNurlViC7Q8KabBFLaP6tcvqKAbXrhfYXWc2nndy199hwCJBA+ugcNTqX3i3LyB05edE88+QIhqW9ePEaPQRZaSkaBpccsPXIRl4cAX69QmyGa3Tpnzh3IAd2+l/n719qmg5Prpn7t3xQdtdcafI0JaN0pvsGai0KWWVv1qgK2DaLxqjD0m2lMvM89W2lgcrgNQdLUY3PD76q79VTjlhAjFt4W8vwOY2d0B19o1igtbBO3tdY0bt5iXbv+sFL/D9KqxIZxd2hppNBLIVghfOMvlmpIfUYiwCQO6DbpOyqrmMzzE8zKujEeKxRE8hOoMN39fNLZw988XAh5691Zc0Smkz+HnZecl/lLQQMMijCKo3+eS1q1uNbBUp4Hpn15i9eWkC8HEWpPYszj1+lIX1qaGAPc3i87IImp0Wng6UoVT1x488QCvcRRXoIF1F9bmu6UBnPgM6GruTdEp3TtZdz+I2uH6iA8BRIhLV3D2vzwWzs2DloIA/0TctrPK4kSYajs1rxZtPjulL0uIWm8dyt0IotprY3zSzijsgcsTtxCc+nmGAA6IYGCSrke07EW9j7vo2SIiN20lsIqt6zhfYIrQOK3TxFhS3VmlwbYrCLdF4h1j8y8IaF/WYuRrT4rGuenRroP/4vq/CG8NGCSukO4tPStksOyTTZZczwOOWNMYqeWwpqx9SVJ2sp4/xrRw5aFh7IdWo8hFdW4FN6awgznmsiRq5uXivariMJ6qa2FSOD1BpN+4Sv16Qj8bMBYO70QZti/Yz7Db8N+bZEtxlmfx57Xl0713D5h8w56uA+M43ybxR9dXTlB5fuBIKQBwXk1ZGznLT9gMM4bzrRU8yW4do3IjDAUdTriXJpJOsEhmfbG9ignsXaKkv9dObrbJN3JGVoZvQNtXXKkiLHny0mWzWpXuk eB/LGcEk uEhYQhBpXsehDelKPHiawjgUyBmWutvTHm7eSehMG5RuUQG4ExH8hk2AEISqKRQtJr8f3zzE9q2+ELErUmj7+zFIIMUTmhUpBfKLWQD7+QgMJAF01SKwHKJelvbAV+dC+oKFlPETo99RDvYIphJXJBR5K+u1KoMtvMhiW3H9WJQobPI1r06HnCq6NB3bwvl29qPpzO/v3Vo8kQPb9hRbsWkyKzdVbsZ1Bo45F3/FV6+pnLYaGCiccT6fIFt8yKfTW8nQn3tOPjVI+gcRIc05LBKk/cvTAwSL5y4quau2eW+IyhJG8oGTfDLVV5eZP8wY8CCH5bwVA+9Go+MmM5FVNJjuGYg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, May 27, 2025 at 3:59=E2=80=AFPM Barry Song <21cnbao@gmail.com> wrot= e: > > On Sat, May 24, 2025 at 8:01=E2=80=AFAM Kairui Song wr= ote: > > > > On Fri, May 23, 2025 at 10:30=E2=80=AFAM Barry Song <21cnbao@gmail.com>= wrote: > > > > > > On Wed, May 21, 2025 at 2:45=E2=80=AFPM Kairui Song wrote: > > > > > > > > Barry Song <21cnbao@gmail.com> =E4=BA=8E 2025=E5=B9=B45=E6=9C=8821= =E6=97=A5=E5=91=A8=E4=B8=89 06:33=E5=86=99=E9=81=93=EF=BC=9A > > > > > Let me run test case [1] to check whether this ever happens. I gu= ess I need to > > > > > hack kernel a bit to always add folio to swapcache even for SYNC = IO. > > > > > > > > That will cause quite a performance regression I think. Good thing = is, > > > > that's exactly the problem this series is solving by dropping the S= YNC > > > > IO swapin path and never bypassing the swap cache, while improving = the > > > > performance, eliminating things like this. One more reason to justi= fy > > > > the approach :) > > > > Hi Barry, > > > > > > > > I attempted to reproduce the scenario where a folio is added to the s= wapcache > > > after filemap_get_folio() returns NULL but before move_swap_pte() > > > moves the swap PTE > > > using non-synchronized I/O. Technically, this seems possible; however= , > > > I was unable > > > to reproduce it, likely because the time window between swapin_readah= ead and > > > taking the page table lock within do_swap_page() is too short. > > > > Thank you so much for trying this! > > > > I have been trying to reproduce it too, and so far I didn't observe > > any crash or warn. I added following debug code: > > > > static __always_inline > > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long ds= t_end) > > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm, > > pmd_t *dst_pmd, pmd_t *src_pmd, > > pmd_t dummy_pmdval; > > pmd_t dst_pmdval; > > struct folio *src_folio =3D NULL; > > + struct folio *tmp_folio =3D NULL; > > struct anon_vma *src_anon_vma =3D NULL; > > struct mmu_notifier_range range; > > int err =3D 0; > > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm, > > pmd_t *dst_pmd, pmd_t *src_pmd, > > if (!src_folio) > > folio =3D filemap_get_folio(swap_address_space(= entry), > > swap_cache_index(entry)); > > + udelay(get_random_u32_below(1000)); > > + tmp_folio =3D filemap_get_folio(swap_address_space(entr= y), > > + swap_cache_index(entry)); > > + if (!IS_ERR_OR_NULL(tmp_folio)) { > > + if (!IS_ERR_OR_NULL(folio) && tmp_folio !=3D fo= lio) { > > + pr_err("UFFDIO_MOVE: UNSTABLE folio > > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio, > > tmp_folio->swap.val); > > + } > > + folio_put(tmp_folio); > > + } > > if (!IS_ERR_OR_NULL(folio)) { > > if (folio_test_large(folio)) { > > err =3D -EBUSY; > > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm, > > pmd_t *dst_pmd, pmd_t *src_pmd, > > err =3D move_swap_pte(mm, dst_vma, dst_addr, src_addr, > > dst_pte, src_pte, > > orig_dst_pte, orig_src_pte, dst_pmd, ds= t_pmdval, > > dst_ptl, src_ptl, src_folio); > > + if (tmp_folio !=3D folio && !err) > > + pr_err("UFFDIO_MOVE: UNSTABLE folio passed > > check: %lx -> %lx\n", folio, tmp_folio); > > } > > > > And I saw these two prints are getting triggered like this (not a real > > issue though, just help to understand the problem) > > ... > > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) -> > > fffffdffc7ccac80 (51) > > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) -> > > fffffdffc3435e00 (3b) > > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) -> > > fffffdffc34ab8c0 (76) > > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) -> > > fffffdffc34ab8c0 (6d) > > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) -> > > fffffdffc34eda80 (32) > > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) -> > > fffffdffc38cedc0 (b) > > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc= 34db140 > > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) -> > > fffffdffc3465cc0 (21) > > ... > > > > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) -> > > fffffdffc3853540 (53)" worries me at first. On first look it seems the > > folio is indeed freed completely from the swap cache after the first > > lookup, so another swapout can reuse the entry. But as you mentioned > > __remove_mapping won't release a folio if the refcount check fails, so > > they must be freed by folio_free_swap or __try_to_reclaim_swap, there > > are many places that can happen. But these two helpers won't free a > > folio from swap cache if its swap count is not zero. And the folio > > will either be swapped out (swap count non zero), or mapped (freeing > > it is fine, PTE is non_swap, and another swapout will still use the > > same folio). > > > > So after more investigation and dumping the pages, it's actually the > > second lookup (tmp_folio) seeing the entry being reused by another > > page table entry, after the first folio is swapped back and released. > > So the page table check below will always fail just fine. > > > > But this also proves the first look up can see a completely irrelevant > > folio too: If the src folio is swapped out, but got swapped back and > > freed, then another folio B shortly got added to swap cache reuse the > > src folio's old swap entry, then the folio B got seen by the look up > > here and get freed from swap cache, then src folio got swapped out > > again also reusing the same entry, then we have a problem as PTE seems > > untouched indeed but we grabbed a wrong folio. Seems possible if I'm > > not wrong: > > > > Something like this: > > CPU1 CPU2 > > move_pages_pte > > entry =3D pte_to_swp_entry(orig_src_pte); > > | Got Swap Entry S1 from src_pte > > ... > > > > I=E2=80=99m assuming you mean ``, since I= =E2=80=99m not > sure where folio B comes from in the statement ` swap out folio B>`. > > If that assumption is correct, and folio A is still in the swapcache, > how could someone swap in folio B without hitting folio A? That would > suggest folio A must have been removed from the swapcache earlier=E2=80= =94right? > > > > > Sorry my bad, I think I made people think folio B is related to src_pte at this point. What I actually mean is that: Another random folio B, unrelated to src_pte, could got swapped out, and using the swap entry S1. > > > > ... > > folio =3D swap_cache_get_folio(S1) > > | Got folio B here !!! > > move_swap_pte > > > > | Holding a reference doesn't pin th= e cache > > | as we have demonstrated > > > > double_pt_lock > > is_pte_pages_stable > > | Passed because of S1 is reused > > folio_move_anon_rmap(...) > > | Moved invalid folio B here !!! > > > > But this is extremely hard to reproduce though, even if doing it > > deliberately... > > > > So I think a "folio_swap_contains" or equivalent check here is a good > > thing to have, to make it more robust and easier to understand. The > > checking after locking a folio has very tiny overhead and can > > definitely ensure the folio's swap entry is valid and stable. > > > > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00" > > here might seem problematic, but it's still not a real problem. That's > > the case where the swapin in src region happens after the lookup, and > > before the PTE lock. It will pass the PTE check without moving the > > folio. But the folio is guaranteed to be a completely new folio here > > because the folio can't be added back to the page table without > > holding the PTE lock, and if that happens the following PTE check here > > will fail. > > > > So I think we should patch the current kernel only adding a > > "folio_swap_contains" equivalent check here, and maybe more comments, > > how do you think? > > The description appears to have some inconsistencies. > Would you mind rephrasing it? Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00" part first, as both you and me have come into a conclusion that "filemap_get_folio() returns NULL before move_swap_pte, but a folio was added to swap cache" is OK, and this output only proves that happens. So the problematic race is: Here move_pages_pte is moving src_pte to dst_pte, and it begins with src_pte =3D=3D swap entry S1, and S1 isn't cached. CPU1 CPU2 move_pages_pte() entry =3D pte_to_swp_entry(orig_src_pte); | src_pte is absent, and got entry =3D=3D S1 ... < Somehow interrupted> ... | folio A is just a new allocated folio | for resolving the swap in fault. | swap in fault is resolved, src_pte | now points to folio A, so folio A | can get freed just fine. | And now S1 is free to be used | by anyone. | Folio B is a completely unrelated | folio swapped out by random process. | (has nothing to do with src_pte) | But S1 is freed so it may use S1 | as its swap entry. ... folio =3D filemap_get_folio(S1) | The lookup is using S1, so it | got folio B here !!! ... < Somehow interrupted> ... | Folio B could fail to be swapped out | or got swapped in again, so it can | be freed by folio_free_swap or | swap cache reclaim. | CPU1 is holding a reference but it | doesn't pin the swap cache folio | as I have demonstrated with the | test C program previously. | New S1 is free to be used again. | No thing blocks this from happening | The swapout is still using folio A, | and src_pte =3D=3D S1. folio_trylock(folio) move_swap_pte double_pt_lock is_pte_pages_stable | Passed because of S1 is reused so src_pte =3D=3D S1. folio_move_anon_rmap(...) | Moved invalid folio B here !!! It's a long and complex one, I don't think it's practically possible to happen in reality but in theory doable, once in a million maybe... Still we have to fix it, or did I got anything wrong here? > > Thanks > barry