From: Barry Song <21cnbao@gmail.com>
Date: Mon, 9 Jun 2025 16:29:49 +1200
Subject: Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Hugh Dickins, Baolin Wang,
 Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Usama Arif,
 linux-kernel@vger.kernel.org
References: <20250608192713.95875-1-ryncsn@gmail.com>
On Mon, Jun 9, 2025 at 2:32 PM Kairui Song wrote:
>
> On Mon, Jun 9, 2025 at 7:58 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Mon, Jun 9, 2025 at 7:27 AM Kairui Song wrote:
> > >
> > > From: Kairui Song
> > >
> > > The following softlockup can be easily reproduced on my test
> > > machine with:
> > >
> > > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > > swapon /dev/zram0 # zram0 is a 48G swap device
> > > mkdir -p /sys/fs/cgroup/memory/test
> > > echo 1G > /sys/fs/cgroup/test/memory.max
> > > echo $BASHPID > /sys/fs/cgroup/test/cgroup.procs
> > > while true; do
> > >   dd if=/dev/zero of=/tmp/test.img bs=1M count=5120
> > >   cat /tmp/test.img > /dev/null
> > >   rm /tmp/test.img
> > > done
> > >
> > > Then after a while:
> > > watchdog: BUG: soft lockup - CPU#0 stuck for 763s! [cat:5787]
> > > Modules linked in: zram virtiofs
> > > CPU: 0 UID: 0 PID: 5787 Comm: cat Kdump: loaded Tainted: G L 6.15.0.orig-gf3021d9246bc-dirty #118 PREEMPT(voluntary)
> > > Tainted: [L]=SOFTLOCKUP
> > > Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> > > RIP: 0010:mpol_shared_policy_lookup+0xd/0x70
> > > Code: e9 b8 b4 ff ff 31 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 <48> 8b 1f 48 85 db 74 41 4c 8d 67 08 48 89 fb 48 89 f5 4c 89 e7 e8
> > > RSP: 0018:ffffc90002b1fc28 EFLAGS: 00000202
> > > RAX: 00000000001c20ca RBX: 0000000000724e1e RCX: 0000000000000001
> > > RDX: ffff888118e214c8 RSI: 0000000000057d42 RDI: ffff888118e21518
> > > RBP: 000000000002bec8 R08: 0000000000000001 R09: 0000000000000000
> > > R10: 0000000000000bf4 R11: 0000000000000000 R12: 0000000000000001
> > > R13: 00000000001c20ca R14: 00000000001c20ca R15: 0000000000000000
> > > FS: 00007f03f995c740(0000) GS:ffff88a07ad9a000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: 00007f03f98f1000 CR3: 0000000144626004 CR4: 0000000000770eb0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > PKRU: 55555554
> > > Call Trace:
> > >
> > >  shmem_alloc_folio+0x31/0xc0
> > >  shmem_swapin_folio+0x309/0xcf0
> > >  ? filemap_get_entry+0x117/0x1e0
> > >  ? xas_load+0xd/0xb0
> > >  ? filemap_get_entry+0x101/0x1e0
> > >  shmem_get_folio_gfp+0x2ed/0x5b0
> > >  shmem_file_read_iter+0x7f/0x2e0
> > >  vfs_read+0x252/0x330
> > >  ksys_read+0x68/0xf0
> > >  do_syscall_64+0x4c/0x1c0
> > >  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > RIP: 0033:0x7f03f9a46991
> > > Code: 00 48 8b 15 81 14 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 20 ad 01 00 f3 0f 1e fa 80 3d 35 97 10 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
> > > RSP: 002b:00007fff3c52bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> > > RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f03f9a46991
> > > RDX: 0000000000040000 RSI: 00007f03f98ba000 RDI: 0000000000000003
> > > RBP: 00007fff3c52bd50 R08: 0000000000000000 R09: 00007f03f9b9a380
> > > R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000
> > > R13: 00007f03f98ba000 R14: 0000000000000003 R15: 0000000000000000
> > >
> > > The reason is simple: readahead brought some order 0 folios into
> > > the swap cache, and the mTHP folio being allocated for swapin
> > > conflicts with them, so swapcache_prepare fails and causes
> > > shmem_swap_alloc_folio to return -EEXIST, and shmem simply retries
> > > again and again, causing this loop.
> > >
> > > Fix it by applying a similar fix as for anon mTHP swapin.
> > >
> > > The performance change is very slight; time to swap in 10G of zero
> > > folios (tested 12 times):
> > > Before: 2.49s
> > > After:  2.52s
> > >
> > > Fixes: 1dd44c0af4fa1 ("mm: shmem: skip swapcache for swapin of synchronous swap device")
> > > Signed-off-by: Kairui Song
> > >
> > > ---
> > >
> > > I found this issue while doing a performance comparison of mm-new
> > > with the swap table series [1] on top of mm-new.
> > > This issue no longer exists if the swap table series is applied,
> > > because it eliminates both SWAP_HAS_CACHE and SWP_SYNCHRONOUS_IO
> > > swapin completely while improving the performance and simplifying
> > > the code, and the swapin race is solved differently by then.
> > >
> > > (The zero map fix might still need to stay for a while, but it
> > > could be optimized too later with the swap table.)
> > >
> > > It would be good if the swap table series could get reviewed and
> > > merged to avoid more fixes like this. SWAP_HAS_CACHE and
> > > SWP_SYNCHRONOUS_IO have a history of causing many issues. I'll
> > > rebase the swap table series on top of this fix, if this one is
> > > accepted.
> > >
> > > And for a comparison, swapping 10G into shmem:
> > >
> > > Before this patch: 2.49s
> > > After this patch:  2.52s
> > > After swap table:  2.37s (Removing SWAP_HAS_CACHE and SWP_SYNCHRONOUS_IO,
> > >                           still not in the best shape but looking good)
> > >
> > > Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [1]
> > >
> > >  mm/memory.c | 20 --------------------
> > >  mm/shmem.c  | 12 +++++++++++-
> > >  mm/swap.h   | 19 +++++++++++++++++++
> > >  3 files changed, 30 insertions(+), 21 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 9ead7ab07e8e..3845ed068d74 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4313,26 +4313,6 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> > >  }
> > >
> > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > -static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> > > -{
> > > -	struct swap_info_struct *si = swp_swap_info(entry);
> > > -	pgoff_t offset = swp_offset(entry);
> > > -	int i;
> > > -
> > > -	/*
> > > -	 * While allocating a large folio and doing swap_read_folio, which is
> > > -	 * the case the being faulted pte doesn't have swapcache. We need to
> > > -	 * ensure all PTEs have no cache as well, otherwise, we might go to
> > > -	 * swap devices while the content is in swapcache.
> > > -	 */
> > > -	for (i = 0; i < max_nr; i++) {
> > > -		if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> > > -			return i;
> > > -	}
> > > -
> > > -	return i;
> > > -}
> > > -
> > >  /*
> > >   * Check if the PTEs within a range are contiguous swap entries
> > >   * and have consistent swapcache, zeromap.
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 73182e904f9c..484cd3043a78 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -1995,6 +1995,14 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> > >  	 */
> > >  	if (swapcache_prepare(entry, nr_pages)) {
> > >  		folio_put(new);
> > > +
> > > +		/*
> > > +		 * A smaller folio is in the swap cache, mTHP swapin will
> > > +		 * always fail until it's gone. Return -EINVAL to fall
> > > +		 * back to order 0.
> > > +		 */
> > > +		if (non_swapcache_batch(entry, nr_pages) != nr_pages)
> > > +			return ERR_PTR(-EINVAL);
> > > +
>
> Hi Barry,
>
> > We're doing this before swapcache_prepare() for mTHP swapin. Why does
> > it happen after swapcache_prepare() in the shmem case?
>
> `non_swapcache_batch(entry, nr_pages) != nr_pages` is unlikely, which
> is why no one noticed this issue so far, so moving it after
> swapcache_prepare can help avoid the overhead it causes in the common
> case. swapcache_prepare already implies this check, but
> swapcache_prepare can fail for multiple reasons, and shmem should fall
> back to order 0 swapin only when the failure is caused by an existing
> cache. (Currently shmem unconditionally retries.)

Maybe it's because people are running it on systems with plenty of
memory? Once we run it on a system with limited memory, we might see
more failures allocating large folios and fall back to order-0 more
often? For example, what if there's a 50% chance of failing to allocate
large folios?
> And non_swapcache_batch might not be the best solution here; it also
> might have false positives. We could add a full filemap lookup here,
> but that might be overkill for a corner case like this. I still think
> merging the swap cache with swap_map using the swap table is the
> long-term solution.
>
> Maybe I'm prematurely optimizing it. I can use the easier-to-review
> implementation (the same way as anon mTHP) and do a quick benchmark;
> if there is no obvious performance change I'll use that style in V2.

Right, the current approach is a bit hard to follow, since we
ultimately change the return value from -EEXIST to -EINVAL. It does
feel like there's some back-and-forth. But anyway, let's look at the
data: if the current approach yields better results, we can refine the
code comments to make it easier to understand.

Thanks
Barry