From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Hugh Dickins, Baolin Wang, Matthew Wilcox, Kemeng Shi,
	Chris Li, Nhat Pham, Baoquan He, Barry Song, linux-kernel@vger.kernel.org,
	Kairui Song, stable@vger.kernel.org
Subject: [PATCH 1/4] mm/shmem, swap: improve cached mTHP handling and fix potential hang
Date: Wed, 18 Jun 2025 02:35:00 +0800
Message-ID: <20250617183503.10527-2-ryncsn@gmail.com>
X-Mailer: git-send-email 2.50.0
In-Reply-To: <20250617183503.10527-1-ryncsn@gmail.com>
References: <20250617183503.10527-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: Kairui Song

The current swap-in code assumes that, when a swap entry in the shmem
mapping is order 0, its cached folios (if present) must be order 0 too,
which turns out not to always be correct.

The problem is that shmem_split_large_entry is called before verifying
that the folio will eventually be swapped in. One possible race is:

CPU1                               CPU2
shmem_swapin_folio
/* swap in of order > 0 swap entry S1 */
folio = swap_cache_get_folio
/* folio = NULL */
order = xa_get_order
/* order > 0 */
folio = shmem_swap_alloc_folio
/* mTHP alloc failure, folio = NULL */
<... Interrupted ...>
                                   shmem_swapin_folio
                                   /* S1 is swapped in */
                                   shmem_writeout
                                   /* S1 is swapped out, folio cached */
shmem_split_large_entry(..., S1)
/* S1 is split, but the folio covering it has order > 0 now */

Now any following swap-in of S1 will hang: `xa_get_order` returns 0,
while the folio lookup returns a folio with order > 0, so the check
`xa_get_order(&mapping->i_pages, index) != folio_order(folio)` is always
true and swap-in keeps returning -EEXIST and retrying. This is also
fragile in general.

Fix this by allowing a larger folio than the split entry to be found in
the swap cache, and by checking, when inserting the folio, that the
whole shmem mapping range covered by the swap-in holds the expected swap
values. Also drop the now-redundant tree walks before the insertion.

This actually improves performance, as it avoids two redundant XArray
tree walks in the hot path. The only side effect is that, in the failure
path, shmem may redundantly reallocate a few folios, causing temporary
slight memory pressure.

Worth noting: it may seem that checking the order and value before
inserting would help reduce lock contention, but that is not true. The
swap cache layer ensures that a racing swap-in either sees a swap cache
folio or fails the swap-in (the SWAP_HAS_CACHE bit is set even when the
swap cache is bypassed), so holding the folio lock and checking the
folio flag is already enough to avoid lock contention. The chance that a
folio passes the swap entry value check while the shmem mapping slot has
changed should be very low.
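For illustration, below is a minimal userspace sketch of the
insertion-time rule described above. It is not kernel code: slot[],
SWAP_VAL_EMPTY and range_matches_swap() are made-up names, and the
XArray multi-order details are flattened into plain order-0 slots, so it
only models the semantics of the check.

/*
 * Illustrative userspace sketch, NOT kernel code: every mapping slot
 * covered by the folio being inserted must hold the next expected,
 * consecutive swap value (no holes, no foreign entries).
 */
#include <stdbool.h>
#include <stdio.h>

#define SWAP_VAL_EMPTY 0UL	/* hypothetical marker for an empty slot */

/* Flat, order-0 model of the check done under the XArray lock. */
static bool range_matches_swap(const unsigned long *slot, unsigned long index,
			       unsigned long nr, unsigned long expected)
{
	unsigned long i;

	for (i = 0; i < nr; i++) {
		unsigned long val = slot[index + i];

		if (!expected) {
			/* No swap entry expected: the range must be empty. */
			if (val != SWAP_VAL_EMPTY)
				return false;
		} else {
			/* Slots must hold consecutive expected swap values. */
			if (val != expected + i)
				return false;
		}
	}
	return true;
}

int main(void)
{
	/* 8 mapping slots; slots 4..7 hold swap values 100..103. */
	unsigned long slot[8] = { 0, 0, 0, 0, 100, 101, 102, 103 };

	/* Inserting an order-2 folio for swap value 100 at index 4: OK. */
	printf("aligned insert: %d\n", range_matches_swap(slot, 4, 4, 100));

	/* A stale or foreign value anywhere in the range is rejected. */
	slot[6] = 999;
	printf("stale slot:     %d\n", range_matches_swap(slot, 4, 4, 100));
	return 0;
}

In the patch itself the same rule is expressed with
xas_for_each_conflict(), advancing iter.val by 1 << xas_get_order(&xas)
for each entry and rejecting holes with the final
iter.val - nr != swap.val test.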
Cc: stable@vger.kernel.org
Fixes: 058313515d5a ("mm: shmem: fix potential data corruption during shmem swapin")
Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Signed-off-by: Kairui Song
---
 mm/shmem.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index eda35be2a8d9..4e7ef343a29b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -884,7 +884,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
 				   pgoff_t index, void *expected, gfp_t gfp)
 {
 	XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
-	long nr = folio_nr_pages(folio);
+	unsigned long nr = folio_nr_pages(folio);
+	swp_entry_t iter, swap;
+	void *entry;
 
 	VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -896,14 +898,24 @@ static int shmem_add_to_page_cache(struct folio *folio,
 	gfp &= GFP_RECLAIM_MASK;
 	folio_throttle_swaprate(folio, gfp);
+	swap = iter = radix_to_swp_entry(expected);
 
 	do {
 		xas_lock_irq(&xas);
-		if (expected != xas_find_conflict(&xas)) {
-			xas_set_err(&xas, -EEXIST);
-			goto unlock;
+		xas_for_each_conflict(&xas, entry) {
+			/*
+			 * The range must either be empty, or filled with
+			 * expected swap entries. Shmem swap entries are never
+			 * partially freed without split of both entry and
+			 * folio, so there shouldn't be any holes.
+			 */
+			if (!expected || entry != swp_to_radix_entry(iter)) {
+				xas_set_err(&xas, -EEXIST);
+				goto unlock;
+			}
+			iter.val += 1 << xas_get_order(&xas);
 		}
-		if (expected && xas_find_conflict(&xas)) {
+		if (expected && iter.val - nr != swap.val) {
 			xas_set_err(&xas, -EEXIST);
 			goto unlock;
 		}
@@ -2323,7 +2335,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			error = -ENOMEM;
 			goto failed;
 		}
-	} else if (order != folio_order(folio)) {
+	} else if (order > folio_order(folio)) {
 		/*
 		 * Swap readahead may swap in order 0 folios into swapcache
 		 * asynchronously, while the shmem mapping can still stores
@@ -2348,15 +2360,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 			swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
 		}
+	} else if (order < folio_order(folio)) {
+		swap.val = round_down(swap.val, 1 << folio_order(folio));
 	}
 
 alloced:
 	/* We have to do this with folio locked to prevent races */
 	folio_lock(folio);
 	if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
-	    folio->swap.val != swap.val ||
-	    !shmem_confirm_swap(mapping, index, swap) ||
-	    xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+	    folio->swap.val != swap.val) {
 		error = -EEXIST;
 		goto unlock;
 	}
-- 
2.50.0