From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Mon, 9 Jun 2025 17:28:17 +0800
Subject: Re: [PATCH] mm/shmem, swap: fix softlockup with mTHP swapin
To: Barry Song <21cnbao@gmail.com>
Cc: Baolin Wang, linux-mm@kvack.org, Andrew Morton, Hugh Dickins, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He, Usama Arif, linux-kernel@vger.kernel.org
References: <20250608192713.95875-1-ryncsn@gmail.com> <36f52466-071a-4efb-adc2-8514b11f120c@linux.alibaba.com> <1452d0c6-50ab-4680-9aa9-13290d51177d@linux.alibaba.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Mon, Jun 9, 2025 at 4:55 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Jun 9, 2025 at 8:49 PM Baolin Wang wrote:
> >
> >
> >
> > On 2025/6/9 16:36, Kairui Song wrote:
> > > On Mon, Jun 9, 2025 at 4:27 PM Baolin Wang wrote:
> > >> On 2025/6/9 03:27, Kairui Song wrote:
> > >>> From: Kairui Song
> > >>>
> > >>> Following softlockup can be easily reproduced on my test machine with:
> > >>>
> > >>> echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> > >>> swapon /dev/zram0 # zram0 is a 48G swap device
> > >>> mkdir -p /sys/fs/cgroup/memory/test
> > >>> echo 1G > /sys/fs/cgroup/test/memory.max
> > >>> echo $BASHPID > /sys/fs/cgroup/test/cgroup.procs
> > >>> while true; do
> > >>>     dd if=/dev/zero of=/tmp/test.img bs=1M count=5120
> > >>>     cat /tmp/test.img > /dev/null
> > >>>     rm /tmp/test.img
> > >>> done
> > >>>
> > >>> Then after a while:
> > >>> watchdog: BUG: soft lockup - CPU#0 stuck for 763s! [cat:5787]
> > >>> Modules linked in: zram virtiofs
> > >>> CPU: 0 UID: 0 PID: 5787 Comm: cat Kdump: loaded Tainted: G    L    6.15.0.orig-gf3021d9246bc-dirty #118 PREEMPT(voluntary)
> > >>> Tainted: [L]=SOFTLOCKUP
> > >>> Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
> > >>> RIP: 0010:mpol_shared_policy_lookup+0xd/0x70
> > >>> Code: e9 b8 b4 ff ff 31 c0 c3 cc cc cc cc 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 <48> 8b 1f 48 85 db 74 41 4c 8d 67 08 48 89 fb 48 89 f5 4c 89 e7 e8
> > >>> RSP: 0018:ffffc90002b1fc28 EFLAGS: 00000202
> > >>> RAX: 00000000001c20ca RBX: 0000000000724e1e RCX: 0000000000000001
> > >>> RDX: ffff888118e214c8 RSI: 0000000000057d42 RDI: ffff888118e21518
> > >>> RBP: 000000000002bec8 R08: 0000000000000001 R09: 0000000000000000
> > >>> R10: 0000000000000bf4 R11: 0000000000000000 R12: 0000000000000001
> > >>> R13: 00000000001c20ca R14: 00000000001c20ca R15: 0000000000000000
> > >>> FS:  00007f03f995c740(0000) GS:ffff88a07ad9a000(0000) knlGS:0000000000000000
> > >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >>> CR2: 00007f03f98f1000 CR3: 0000000144626004 CR4: 0000000000770eb0
> > >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > >>> PKRU: 55555554
> > >>> Call Trace:
> > >>>
> > >>>  shmem_alloc_folio+0x31/0xc0
> > >>>  shmem_swapin_folio+0x309/0xcf0
> > >>>  ? filemap_get_entry+0x117/0x1e0
> > >>>  ? xas_load+0xd/0xb0
> > >>>  ? filemap_get_entry+0x101/0x1e0
> > >>>  shmem_get_folio_gfp+0x2ed/0x5b0
> > >>>  shmem_file_read_iter+0x7f/0x2e0
> > >>>  vfs_read+0x252/0x330
> > >>>  ksys_read+0x68/0xf0
> > >>>  do_syscall_64+0x4c/0x1c0
> > >>>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >>> RIP: 0033:0x7f03f9a46991
> > >>> Code: 00 48 8b 15 81 14 10 00 f7 d8 64 89 02 b8 ff ff ff ff eb bd e8 20 ad 01 00 f3 0f 1e fa 80 3d 35 97 10 00 00 74 13 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 4f c3 66 0f 1f 44 00 00 55 48 89 e5 48 83 ec
> > >>> RSP: 002b:00007fff3c52bd28 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> > >>> RAX: ffffffffffffffda RBX: 0000000000040000 RCX: 00007f03f9a46991
> > >>> RDX: 0000000000040000 RSI: 00007f03f98ba000 RDI: 0000000000000003
> > >>> RBP: 00007fff3c52bd50 R08: 0000000000000000 R09: 00007f03f9b9a380
> > >>> R10: 0000000000000022 R11: 0000000000000246 R12: 0000000000040000
> > >>> R13: 00007f03f98ba000 R14: 0000000000000003 R15: 0000000000000000
> > >>>
> > >>> The reason is simple: readahead brought some order 0 folios into the swap
> > >>> cache, and the mTHP folio being allocated for swapin conflicts with them,
> > >>> so swapcache_prepare fails and causes shmem_swap_alloc_folio to return
> > >>> -EEXIST, and shmem simply retries again and again, causing this loop.
> > >>
> > >> If swapcache_prepare() fails and retries, the folio's order (order 0)
> > >> obtained from the swapcache will be different from the order stored in the
> > >> shmem mapping, so we will split the large swap entry with the following
> > >> logic in shmem_swapin_folio(). So I am not sure why this causes a softlockup?
> > >>
> > >>                 } else if (order != folio_order(folio)) {
> > >>                         /*
> > >>                          * Swap readahead may swap in order 0 folios into swapcache
> > >>                          * asynchronously, while the shmem mapping can still stores
> > >>                          * large swap entries. In such cases, we should split the
> > >>                          * large swap entry to prevent possible data corruption.
> > >>                          */
> > >>                         split_order = shmem_split_large_entry(inode, index, swap, gfp);
> > >>                         if (split_order < 0) {
> > >>                                 error = split_order;
> > >>                                 goto failed;
> > >>                         }
> > >>
> > >>                         /*
> > >>                          * If the large swap entry has already been split, it is
> > >>                          * necessary to recalculate the new swap entry based on
> > >>                          * the old order alignment.
> > >>                          */
> > >>                         if (split_order > 0) {
> > >>                                 pgoff_t offset = index - round_down(index, 1 << split_order);
> > >>
> > >>                                 swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
> > >>                         }
> > >>                 }
> > >
> > > For example, if the swap entry is 0x0 in shmem with order 4 (so it
> > > corresponds to swap entries 0x0 - 0x10), and an order 0 folio is
> > > currently cached with swap entry 0xa, then shmem swapin will try to
> > > use a folio with order 4, which will always fail swapcache_prepare,
> > > but a filemap/swapcache lookup using entry 0x0 will return NULL, causing a
> > > loop.
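
For illustration only, here is a toy userspace model of that retry loop. This
is not kernel code: the 16-slot array, lookup() and prepare_range() helpers,
and the 0xa conflict are made up to mirror the example above.

/*
 * Toy model of the livelock, for illustration only -- not kernel code.
 * A 16-slot array stands in for the swap cache, slot 0xa holds a folio
 * brought in by readahead, and shmem holds an order-4 entry covering
 * slots 0x0-0xf.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_SLOTS 16

static bool swap_cache[NR_SLOTS];       /* true: slot already has a cached folio */

/* Models the swap cache lookup done with the (aligned) large swap entry. */
static bool lookup(unsigned int offset)
{
        return swap_cache[offset];
}

/* Models swapcache_prepare() over a large folio: any cached slot conflicts. */
static bool prepare_range(unsigned int base, unsigned int nr)
{
        for (unsigned int i = base; i < base + nr; i++)
                if (swap_cache[i])
                        return false;   /* -EEXIST in the real code */
        for (unsigned int i = base; i < base + nr; i++)
                swap_cache[i] = true;
        return true;
}

int main(void)
{
        swap_cache[0xa] = true;         /* order-0 folio cached by readahead */

        for (int attempt = 1; attempt <= 5; attempt++) {
                /* Lookup with the aligned base of the order-4 entry: misses. */
                if (lookup(0x0)) {
                        printf("attempt %d: found a cached folio, done\n", attempt);
                        return 0;
                }
                /* Try to claim all 16 slots for an order-4 folio: conflicts. */
                if (prepare_range(0x0, 16)) {
                        printf("attempt %d: large swapin succeeded\n", attempt);
                        return 0;
                }
                printf("attempt %d: -EEXIST, retrying\n", attempt);
        }
        printf("no progress made; the real code would keep spinning here\n");
        return 1;
}
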
> >
> > OK. Thanks for the explanation.
> >
> > >>> Fix it by applying a fix similar to the one for anon mTHP swapin.
> > >>>
> > >>> The performance change is very slight. Time of swapping in 10G of zero folios
> > >>> (tested 12 times):
> > >>> Before: 2.49s
> > >>> After: 2.52s
> > >>>
> > >>> Fixes: 1dd44c0af4fa1 ("mm: shmem: skip swapcache for swapin of synchronous swap device")
> > >>> Signed-off-by: Kairui Song
> > >>>
> > >>> ---
> > >>>
> > >>> I found this issue while doing a performance comparison of mm-new with
> > >>> the swap table series [1] on top of mm-new. This issue no longer exists
> > >>> once the swap table series is applied, because it eliminates both
> > >>> SWAP_HAS_CACHE and SWP_SYNCHRONOUS_IO swapin completely while improving
> > >>> performance and simplifying the code, and the racy swapin is solved
> > >>> differently there.
> > >>>
> > >>> (The zeromap fix might still need to stay for a while, but it could also
> > >>> be optimized later with the swap table.)
> > >>
> > >> I don't understand why the zeromap changes are added; this should be
> > >> explained explicitly.
> > >
> > > To stay consistent with anon mTHP swapin: swap_zeromap_batch has its own
> > > comment saying that a hybrid folio with zero and non-zero pages can't be
> > > brought back as a whole. I can mention that in the commit message.
>
> For mTHP swapin, we need the zeromap check because we have no way to record
> whether there was a prior mTHP swap-out. So we rely on checking the
> continuity of swap offsets.
>
> It's entirely possible that, in the past, several small folios were
> swapped out to consecutive locations, and one of them happened to be a
> zero folio, while the others were not.
>
> But for shmem, we have a place to record that information - we swapped out
> an mTHP, right?
>
> Regarding zeromap: for an mTHP swap-out, we currently can't mark subpages
> individually as zeromap - it's either all-zero for every subpage or none are.

Thanks for the clarification! Yes, that's correct. I wasn't sure whether
zeromap would mark subpages, so I just left the check there. Will remove
the check in V2.

> So maybe we don't need swap_zeromap_batch() for shmem?

Right, it's not needed here, the fix will be simpler.
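
For completeness, a toy sketch of the fallback direction discussed above.
This is not the actual patch; it reuses the same made-up 16-slot model and
only shows the idea: if claiming the whole large entry conflicts with a
cached order-0 folio, fall back to an order-0 swapin of the faulting page
instead of retrying the large order forever.

/*
 * Toy sketch of the fallback idea, for illustration only -- not the
 * actual patch. Same made-up model: 16 slots, slot 0xa already cached.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_SLOTS 16

static bool swap_cache[NR_SLOTS];

/* Models swapcache_prepare() over a range: any cached slot conflicts. */
static bool prepare_range(unsigned int base, unsigned int nr)
{
        for (unsigned int i = base; i < base + nr; i++)
                if (swap_cache[i])
                        return false;
        for (unsigned int i = base; i < base + nr; i++)
                swap_cache[i] = true;
        return true;
}

int main(void)
{
        swap_cache[0xa] = true;         /* conflicting readahead folio, as before */
        unsigned int index = 0x3;       /* the page actually being read */
        unsigned int order = 4;

        /* Try the large order first; on conflict, fall back to order 0. */
        if (prepare_range(0x0, 1u << order))
                printf("order-%u swapin succeeded\n", order);
        else if (prepare_range(index, 1))
                printf("fell back to order-0 swapin at slot 0x%x\n", index);
        else
                printf("slot 0x%x is already cached, reuse the cached folio\n", index);
        return 0;
}
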