From: Barry Song <21cnbao@gmail.com>
Date: Wed, 21 May 2025 15:24:58 +1200
Subject: Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
To: Kairui Song
Cc: akpm@linux-foundation.org, Baolin Wang, Baoquan He, Chris Li,
 David Hildenbrand, Johannes Weiner, Hugh Dickins, Kalesh Singh,
 LKML, linux-mm, Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen,
 Matthew Wilcox, "Huang, Ying", Yosry Ahmed
References: <20250514201729.48420-6-ryncsn@gmail.com> <20250519043847.1806-1-21cnbao@gmail.com>
On Wed, May 21, 2025 at 2:45 PM Kairui Song wrote:
>
> Barry Song <21cnbao@gmail.com> wrote on Wed, May 21, 2025 at 06:33:
> >
> > On Wed, May 21, 2025 at 7:10 AM Kairui Song wrote:
> > >
> > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song wrote:
> > > > >
> > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > > From: Kairui Song
> > > > > > >
> > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > --- a/mm/userfaultfd.c
> > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > >                                 goto retry;
> > > > > > >                         }
> > > > > > >                 }
> > > > > > > +               if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > +                       err = -EBUSY;
> > > > > > > +                       goto out;
> > > > > > > +               }
> > > > > >
> > > > > > It seems we don't need this. In move_swap_pte(), we have been
> > > > > > checking that the PTE pages are stable:
> > > > > >
> > > > > >         if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > >                                  dst_pmd, dst_pmdval)) {
> > > > > >                 double_pt_unlock(dst_ptl, src_ptl);
> > > > > >                 return -EAGAIN;
> > > > > >         }
> > > > >
> > > > > The tricky part is that when swap_cache_get_folio() returns the folio,
> > > > > both the folio and the PTEs are unlocked. So is it possible that
> > > > > someone else swapped in the entries, then swapped them out again
> > > > > using the same entries?
> > > > >
> > > > > The folio will be different here, but the PTEs still hold the same
> > > > > value, so they will pass the is_pte_pages_stable check. We previously
> > > > > saw similar races with anon fault or shmem. I think stricter checking
> > > > > won't hurt here.
> > > >
> > > > This doesn't seem to be the same case as the one you fixed in
> > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > case, no one was hitting the swap cache, and you used
> > > > swapcache_prepare() to set up the cache to fix the issue.
> > > >
> > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > not guard against that case either.
> > >
> > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > above. Thanks for catching this!
> > >
> > > > But I suspect we won't have a problem, since we're not swapping in --
> > > > we didn't read any stale data, right? Swap-in will only occur after we
> > > > move the PTEs.
> > >
> > > My concern is that a parallel swapin / swapout could result in the
> > > folio being a completely irrelevant or invalid folio.
> > >
> > > It's not about the dst, but on the move src side, something like:
> > >
> > > CPU1                               CPU2
> > > move_pages_pte
> > >   folio = swap_cache_get_folio(...)
> > >   | Got folio A here
> > > move_swap_pte
> > >                                    <parallel swapin frees the entry>
> > >                                    <parallel swapout re-allocates it>
> > >   | Now folio A is no longer valid.
> > >   | It's very unlikely, but here SWAP
> > >   | could reuse the same entry as above.
> >
> > swap_cache_get_folio() does increment the folio's refcount, but it seems
> > this doesn't prevent do_swap_page() from freeing the swap entry after
> > swapping in src_pte with folio A, if it's a read fault.
> > For a write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > will be false:
> >
> > static inline bool should_try_to_free_swap(struct folio *folio,
> >                                            struct vm_area_struct *vma,
> >                                            unsigned int fault_flags)
> > {
> >         ...
> >
> >         /*
> >          * If we want to map a page that's in the swapcache writable, we
> >          * have to detect via the refcount if we're really the exclusive
> >          * user. Try freeing the swapcache to get rid of the swapcache
> >          * reference only in case it's likely that we'll be the exclusive user.
> >          */
> >         return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> >                 folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > }
> >
> > And for swapout, __remove_mapping() does check the refcount as well:
> >
> > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> >                             bool reclaimed, struct mem_cgroup *target_memcg)
> > {
> >         refcount = 1 + folio_nr_pages(folio);
> >         if (!folio_ref_freeze(folio, refcount))
> >                 goto cannot_free;
> >         ...
> > }
> >
> > However, since __remove_mapping() occurs after pageout(), it seems
> > this also doesn't prevent swapout from allocating a new swap entry to
> > fill src_pte.
> >
> > It seems your concern is valid -- unless I'm missing something.
> > Do you have a reproducer? If so, this will likely need a separate fix
> > patch rather than being hidden in this patchset.
>
> Thanks for the analysis. I don't have a reproducer yet. I did some
> local experiments and it seems possible, but the race window is so
> tiny that it's very difficult to make swap entry reuse collide with
> it. I'll keep trying, but in theory this seems possible, or at least
> it looks very fragile.

I think we don't necessarily need to prove reuse of the exact same swap
entry. As long as we can tell that the original and current PTEs are both
swap entries with the same offset, we can treat that as the entry-reuse
case. This increases the likelihood of catching it and helps validate
the conceptual model:

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..f072d4a5bcd4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1090,6 +1090,10 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,

        if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
                                 dst_pmd, dst_pmdval)) {
+               /* Probe: the PTE changed, yet old and new are both swap PTEs */
+               if (src_folio && is_swap_pte(orig_src_pte) &&
+                   is_swap_pte(ptep_get(src_pte)))
+                       pr_warn("move_swap_pte: src swap PTE changed under us\n");
                double_pt_unlock(dst_ptl, src_ptl);
                return -EAGAIN;
        }

> And yeah, let's patch the kernel first if that's a real issue.
>
> > > double_pt_lock
> > > is_pte_pages_stable
> > >   | Passed because of entry reuse.
> > > folio_move_anon_rmap(...)
> > >   | Moved invalid folio A.
> > >
> > > And could it be possible that swap_cache_get_folio() returns NULL
> > > here, but later, right before the double_pt_lock, a folio is added to
> > > the swap cache?
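
(For concreteness, a rough sketch of that second window -- assuming, as the
swap-cache swapin path does, that do_swap_page() inserts the folio into the
swap cache before it makes the PTE present:)

CPU1 (move_pages_pte)                CPU2 (do_swap_page)
folio = swap_cache_get_folio(...)
| Returns NULL, entry not cached yet
                                     adds folio B to the swap cache
                                     | src_pte still holds the swap entry
double_pt_lock
is_pte_pages_stable
| Passes, since src_pte never changed
moves the swap PTE with src_folio == NULL
| folio B is left behind in the swap
| cache, keyed by the moved entry.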
> > > Maybe we'd better check the swap cache after clearing and releasing the
> > > dst lock, but before releasing the src lock?
> >
> > It seems you're suggesting that a parallel swap-in allocates and adds
> > a folio to the swap cache, but the PTE has not yet been updated from
> > a swap entry to a present mapping?
> >
> > As long as do_swap_page() adds the folio to the swap cache
> > before updating the PTE to present, this scenario seems possible.
>
> Yes, there are two kinds of problems here. I suspected there could be an
> ABA problem while working on the series, but wasn't certain. And I just
> realised there could be another missed cache read here, thanks to your
> review and discussion :)
>
> >
> > It seems we need to double-check:
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..976053bd2bf1 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1102,8 +1102,15 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
> >         if (src_folio) {
> >                 folio_move_anon_rmap(src_folio, dst_vma);
> >                 src_folio->index = linear_page_index(dst_vma, dst_addr);
> > +       } else {
> > +               struct folio *folio = filemap_get_folio(swap_address_space(entry),
> > +                                       swap_cache_index(entry));
> > +               if (!IS_ERR_OR_NULL(folio)) {
> > +                       folio_put(folio); /* drop the lookup reference */
> > +                       double_pt_unlock(dst_ptl, src_ptl);
> > +                       return -EAGAIN;
> > +               }
> >         }
> > -
> >         orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> >  #ifdef CONFIG_MEM_SOFT_DIRTY
> >         orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
>
> Maybe it has to get even dirtier here: call swapcache_prepare() too, to
> cover the SYNC_IO case?
>
> >
> > Let me run test case [1] to check whether this ever happens. I guess I
> > need to hack the kernel a bit to always add the folio to the swap cache,
> > even for SYNC IO.
>
> That will cause quite a performance regression, I think. The good thing
> is, that's exactly the problem this series solves: by dropping the SYNC
> IO swapin path and never bypassing the swap cache, while improving
> performance, it eliminates things like this. One more reason to justify
> the approach :)
>
> > [1] https://lore.kernel.org/linux-mm/20250219112519.92853-1-21cnbao@gmail.com/
>
> I'll try this too.
>
> > > > > > Also, -EBUSY seems like an incorrect error code.
> > > > >
> > > > > Yes, thanks, I'll use EAGAIN here, just like move_swap_pte().
> > > > >
> > > > > > >                 err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > > > > > >                                     orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > > > > >                                     dst_ptl, src_ptl, src_folio);

Thanks
Barry
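
(Pulling the double-check above into a standalone sketch for clarity -- the
helper name here is made up, and whether swapcache_prepare() must also be
taken for the SYNC_IO case remains an open question in this thread:)

/*
 * Hypothetical helper sketching the recheck discussed above: with both
 * PTE locks held, look the entry up in the swap cache once more.  This
 * closes the window where swap_cache_get_folio() returned NULL before
 * the locks were taken, but a parallel swapin inserted a folio for the
 * same entry afterwards.  Within mm/, swap_cache_index() comes from
 * mm/swap.h.
 */
static bool swap_cache_still_empty(swp_entry_t entry)
{
	struct folio *folio = filemap_get_folio(swap_address_space(entry),
						swap_cache_index(entry));

	if (IS_ERR_OR_NULL(folio))
		return true;
	/* Drop the reference filemap_get_folio() took on the folio. */
	folio_put(folio);
	return false;
}

move_swap_pte() would call this only when src_folio is NULL, after
is_pte_pages_stable() passes, and return -EAGAIN rather than -EBUSY on
failure so that the caller retries.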