From: Kairui Song <ryncsn@gmail.com>
Date: Mon, 23 Jun 2025 11:37:47 +0800
Subject: Re: [PATCH v2 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hung
To: Baolin Wang
Cc: linux-mm@kvack.org, Andrew Morton, Hugh Dickins, Matthew Wilcox,
    Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Barry Song,
    linux-kernel@vger.kernel.org, stable@vger.kernel.org
References: <20250619175538.15799-1-ryncsn@gmail.com> <20250619175538.15799-2-ryncsn@gmail.com>

On Mon, Jun 23, 2025 at 11:35 AM Kairui Song wrote:
>
> On Mon, Jun 23, 2025 at 11:26 AM Baolin Wang wrote:
> >
> > Hi Kairui,
> >
> > On 2025/6/20 01:55, Kairui Song wrote:
> > > From: Kairui Song
> > >
> > > The current swap-in code assumes that, when a swap entry in the shmem
> > > mapping is order 0, its cached folios (if present) must be order 0
> > > too, which turns out to not always be correct.
> > >
> > > The problem is that shmem_split_large_entry is called before verifying
> > > that the folio will eventually be swapped in. One possible race is:
> > >
> > > CPU1                                       CPU2
> > > shmem_swapin_folio
> > > /* swap in of order > 0 swap entry S1 */
> > >  folio = swap_cache_get_folio
> > >  /* folio = NULL */
> > >  order = xa_get_order
> > >  /* order > 0 */
> > >  folio = shmem_swap_alloc_folio
> > >  /* mTHP alloc failure, folio = NULL */
> > >  <... Interrupted ...>
> > >                                            shmem_swapin_folio
> > >                                            /* S1 is swapped in */
> > >                                            shmem_writeout
> > >                                            /* S1 is swapped out,
> > >                                               folio cached */
> > > shmem_split_large_entry(..., S1)
> > > /* S1 is split, but the folio covering it has order > 0 now */
> > >
> > > Now any following swap-in of S1 will hang: xa_get_order returns 0,
> > > while folio lookup returns a folio with order > 0, so the
> > > `xa_get_order(&mapping->i_pages, index) != folio_order(folio)`
> > > comparison is always true, and swap-in keeps returning -EEXIST.
> > >
> > > This is fragile. So fix it up by allowing a larger folio to be seen
> > > in the swap cache, and by checking that the whole shmem mapping range
> > > covered by the swap-in has the right swap value upon inserting the
> > > folio. Also drop the redundant tree walks before the insertion.
> > >
> > > This actually improves performance, as it avoids two redundant Xarray
> > > tree walks in the hot path. The only side effect is that, in the
> > > failure path, shmem may redundantly reallocate a few folios, causing
> > > slight temporary memory pressure.
> > >
> > > Worth noting, it may seem that the order and value check before
> > > inserting would help reduce lock contention, but that is not true.
> > > The swap cache layer ensures that a raced swap-in will either see a
> > > swap cache folio or fail the swap-in (the SWAP_HAS_CACHE bit is set
> > > even when the swap cache is bypassed), so holding the folio lock and
> > > checking the folio flag is already enough to avoid the lock
> > > contention. The chance that a folio passes the swap entry value check
> > > while the shmem mapping slot has changed should be very low.
> >
> > Thanks for fixing the issue. Sadly, I haven't reproduced this issue
> > from my previous test cases :(
> >
> > And I have a question below.
> >
> > > Cc: stable@vger.kernel.org
> > > Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> > > Signed-off-by: Kairui Song
> > > Reviewed-by: Kemeng Shi
> > > ---
> > >  mm/shmem.c | 30 +++++++++++++++++++++---------
> > >  1 file changed, 21 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index eda35be2a8d9..4e7ef343a29b 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
> > >                                     pgoff_t index, void *expected, gfp_t gfp)
> > >  {
> > >         XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
> > > -       long nr = folio_nr_pages(folio);
> > > +       unsigned long nr = folio_nr_pages(folio);
> > > +       swp_entry_t iter, swap;
> > > +       void *entry;
> > >
> > >         VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
> > >         VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> > > @@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
> > >
> > >         gfp &= GFP_RECLAIM_MASK;
> > >         folio_throttle_swaprate(folio, gfp);
> > > +       swap = iter = radix_to_swp_entry(expected);
> > >
> > >         do {
> > >                 xas_lock_irq(&xas);
> > > -               if (expected != xas_find_conflict(&xas)) {
> > > -                       xas_set_err(&xas, -EEXIST);
> > > -                       goto unlock;
> > > +               xas_for_each_conflict(&xas, entry) {
> > > +                       /*
> > > +                        * The range must either be empty, or filled with
> > > +                        * expected swap entries. Shmem swap entries are never
> > > +                        * partially freed without split of both entry and
> > > +                        * folio, so there shouldn't be any holes.
> > > +                        */
> > > +                       if (!expected || entry != swp_to_radix_entry(iter)) {
> > > +                               xas_set_err(&xas, -EEXIST);
> > > +                               goto unlock;
> > > +                       }
> > > +                       iter.val += 1 << xas_get_order(&xas);
> > >                 }
> > > -               if (expected && xas_find_conflict(&xas)) {
> > > +               if (expected && iter.val - nr != swap.val) {
> > >                         xas_set_err(&xas, -EEXIST);
> > >                         goto unlock;
> > >                 }
> > > @@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > >                         error = -ENOMEM;
> > >                         goto failed;
> > >                 }
> > > -       } else if (order != folio_order(folio)) {
> > > +       } else if (order > folio_order(folio)) {
> > >                 /*
> > >                  * Swap readahead may swap in order 0 folios into swapcache
> > >                  * asynchronously, while the shmem mapping can still stores
> > > @@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > >
> > >                         swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > >                 }
> > > +       } else if (order < folio_order(folio)) {
> > > +               swap.val = round_down(swp_type(swap), folio_order(folio));
> >
> > Why round down the swap type? Do you mean rounding down the swap offset?

Ouch, right, it should be the value:

    swap.val = round_down(swap.val, folio_order(folio));

I messed up the code here during a rebase; let me send a V3 then.

A later commit reworked this part, so I didn't notice it. It would be a
problem if this commit got cherry-picked on its own.

Thanks for the review!
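For context on that correction, a minimal user-space sketch of the
alignment in question. This is an illustration, not the merged kernel
code: round_down below is a local stand-in with the kernel macro's
power-of-two semantics, and swap_val/order are made-up values. Assuming
the intent is to land on the first swap entry covered by the large
folio, the alignment wants a multiple of the folio's page count
(1 << order), not of the order itself:

    #include <stdio.h>

    /* Local stand-in for the kernel's round_down(): align x down to a
     * multiple of y, where y must be a power of two. */
    #define round_down(x, y) ((x) & ~((y) - 1))

    int main(void)
    {
            unsigned long swap_val = 0x1235; /* hypothetical swap entry value */
            unsigned long order = 4;         /* an order-4 folio covers 16 pages */

            /* Aligning by the order only reaches a multiple of 4: 0x1234 */
            printf("by order:      %#lx\n", round_down(swap_val, order));
            /* Aligning by the page count reaches the folio start: 0x1230 */
            printf("by page count: %#lx\n", round_down(swap_val, 1UL << order));
            return 0;
    }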
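Similarly, a tiny model of the reworked insertion check in
shmem_add_to_page_cache() may help when reading the diff above. The
Xarray conflict walk is replaced here by a plain array of (value, order)
pairs, and all names are invented for the sketch; it only models the
case where an expected swap value is set:

    #include <stdbool.h>
    #include <stdio.h>

    struct slot {
            unsigned long val;   /* swap entry value stored in the slot */
            unsigned int order;  /* entry order; covers 1 << order pages */
    };

    /* Mirrors the patched logic: every existing entry in the range must
     * be the next expected swap value, and together the entries must
     * cover exactly nr pages -- otherwise the caller returns -EEXIST. */
    static bool range_matches(const struct slot *slots, int n,
                              unsigned long expected, unsigned long nr)
    {
            unsigned long iter = expected;

            for (int i = 0; i < n; i++) {
                    if (slots[i].val != iter)
                            return false;          /* unexpected entry */
                    iter += 1UL << slots[i].order; /* step past this entry */
            }
            return iter - expected == nr;          /* no holes allowed */
    }

    int main(void)
    {
            struct slot ok[]   = { { 100, 2 } };             /* one order-2 entry */
            struct slot hole[] = { { 100, 0 }, { 101, 0 } }; /* only 2 of 4 pages */

            printf("%d\n", range_matches(ok, 1, 100, 4));   /* 1: range covered */
            printf("%d\n", range_matches(hole, 2, 100, 4)); /* 0: hole detected */
            return 0;
    }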