From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Li <chrisl@kernel.org>
Date: Mon, 17 Jun 2024 16:47:46 -0700
Subject: Re: [PATCH v2 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: Hugh Dickins
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, baohua@kernel.org, kaleshsingh@google.com, kasong@tencent.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, ryan.roberts@arm.com, ying.huang@intel.com
In-Reply-To: <8db63194-77fd-e0b8-8601-2bbf04889a5b@google.com>
References: <20240614195921.a20f1766a78b27339a2a3128@linux-foundation.org> <20240615084714.37499-1-21cnbao@gmail.com> <8db63194-77fd-e0b8-8601-2bbf04889a5b@google.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Mon, Jun 17, 2024 at 4:00 PM Hugh Dickins wrote:
>
> On Mon, 17 Jun 2024, Chris Li wrote:
> > On Sat, Jun 15, 2024 at 1:47 AM Barry Song <21cnbao@gmail.com> wrote:
> > > On Sat, Jun 15, 2024 at 2:59 PM Andrew Morton wrote:
> > > > On Fri, 14 Jun 2024 19:51:11 -0700 Chris Li wrote:
> > > >
> > > > > > I'm having trouble understanding the overall impact of
this on users.
> > > > > > We fail the mTHP swap allocation and fall back, but things continue to
> > > > > > operate OK?
> > > > >
> > > > > Continue to operate OK in the sense that the mTHP will have to split
> > > > > into 4K pages before the swap out, aka the fall back. The swap out and
> > > > > swap in can continue to work as 4K pages, not as the mTHP. Due to the
> > > > > fallback, the mTHP based zsmalloc compression with 64K buffer will not
> > > > > happen. That is the effect of the fallback. But mTHP swap out and swap
> > > > > in is relatively new, it is not really a regression.
> > > >
> > > > Sure, but it's pretty bad to merge a new feature only to have it
> > > > ineffective after a few hours use.
> ....
> > > >
> > > $ /home/barry/develop/linux/mthp_swpout_test -s
> > >
> > > [ 1013.535798] ------------[ cut here ]------------
> > > [ 1013.538886] expecting order 4 got 0
> > This warning means there is a bug in this series somewhere I need to hunt down.
> > The V1 has the same warning but I haven't heard it get triggered in
> > V1, it is something new in V2.
> >
> > Andrew, please consider removing the series from mm-unstable until I
> > resolve this warning assert.
>
> Agreed: I was glad to see it go into mm-unstable last week, that made
> it easier to include in testing (or harder to avoid!), but my conclusion
> is that it's not ready yet (and certainly not suitable for 6.10 hotfix).
>
> I too saw this "expecting order 4 got 0" once-warning every boot (from
> ordinary page reclaim rather than from madvise_pageout shown below),
> shortly after starting my tmpfs swapping load. But I never saw any bad
> effect immediately after it: actual crashes came a few minutes later.
>
> (And I'm not seeing the warning at all now, with the change I made: that
> doesn't tell us much, since what I have leaves out 2/2 entirely; but it
> does suggest that it's more important to follow up the crashes, and
> maybe when they are satisfactorily fixed, the warning will be fixed too.)
>
> Most crashes have been on that VM_BUG_ON(ci - si->cluster_info != idx)
> in alloc_cluster(). And when I poked around, it was usually (always?)
> the case that si->free_clusters was empty, so list_first_entry() not
> good at all. A few other crashes were GPFs, but I didn't pay much
> attention to them, thinking the alloc_cluster() one best to pursue.
>
> I reverted both patches from mm-everything, and had no problem.
> I added back 1/2 expecting it to be harmless ("no real function
> change in this patch"), but was surprised to get those same
> "expecting order 4 got 0" warnings and VM_BUG_ONs and GPFs:
> so have spent most time trying to get 1/2 working.
>
> This patch on top of 1/2, restoring when cluster_is_free(ci) can
> be seen to change, appears to have eliminated all those problems:
>
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -418,6 +418,7 @@ static struct swap_cluster_info *alloc_c
>
>         VM_BUG_ON(ci - si->cluster_info != idx);
>         list_del(&ci->list);
> +       ci->state = CLUSTER_STATE_PER_CPU;
>         ci->count = 0;
>         return ci;
> }
> @@ -543,10 +544,6 @@ new_cluster:
>         if (tmp == SWAP_NEXT_INVALID) {
>                 if (!list_empty(&si->free_clusters)) {
>                         ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> -                       list_del(&ci->list);
> -                       spin_lock(&ci->lock);
> -                       ci->state = CLUSTER_STATE_PER_CPU;
> -                       spin_unlock(&ci->lock);
>                         tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>                 } else if (!list_empty(&si->discard_clusters)) {
>                         /*
>
Thanks for the nice bug report. That is my bad.
Both you and Ying have pointed out the critical bug here: the cluster is
removed from the free list inside try_ssd(), and in the case of a
conflict() failure followed by alloc_cluster(), which allocates from that
cluster, the same cluster can be removed from the list again. That is the
path I had not considered well.

All these allocation attempts in try_ssd() that can hit a possible
conflict, plus the dance performed in alloc_cluster(), make things very
complicated. In try_ssd(), when we already hold the cluster lock, can we
just perform the actual allocation with the lock held? There should be no
conflict under the cluster lock protection, right?

Chris

> Delighted to have made progress after many attempts, I went to apply 2/2
> on top, but found that it builds upon those scan_swap_map_try_ssd_cluster()
> changes I've undone. I gave up at that point and hand back to you, Chris,
> hoping that you will understand scan_swap_map_ssd_cluster_conflict() etc
> much better than I ever shall!
>
> Clarifications on my load: all swapping to SSD, but discard not enabled;
> /sys/kernel/mm/transparent_hugepage/ enabled always, shmem_enabled force,
> hugepages-64kB/enabled always, hugepages-64kB/shmem_enabled always;
> swapoff between iterations, did not appear relevant to problems; x86_64.
> > Hugh > > > > > > [ 1013.540622] WARNING: CPU: 3 PID: 104 at mm/swapfile.c:600 scan_swa= p_map_try_ssd_cluster+0x340/0x370 > > > [ 1013.544460] Modules linked in: > > > [ 1013.545411] CPU: 3 PID: 104 Comm: mthp_swpout_tes Not tainted 6.10= .0-rc3-ga12328d9fb85-dirty #285 > > > [ 1013.545990] Hardware name: linux,dummy-virt (DT) > > > [ 1013.546585] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS = BTYPE=3D--) > > > [ 1013.547136] pc : scan_swap_map_try_ssd_cluster+0x340/0x370 > > > [ 1013.547768] lr : scan_swap_map_try_ssd_cluster+0x340/0x370 > > > [ 1013.548263] sp : ffff8000863e32e0 > > > [ 1013.548723] x29: ffff8000863e32e0 x28: 0000000000000670 x27: 00000= 00000000660 > > > [ 1013.549626] x26: 0000000000000010 x25: ffff0000c1692108 x24: ffff0= 000c27c4800 > > > [ 1013.550470] x23: 2e8ba2e8ba2e8ba3 x22: fffffdffbf7df2c0 x21: ffff0= 000c27c48b0 > > > [ 1013.551285] x20: ffff800083a946d0 x19: 0000000000000004 x18: fffff= fffffffffff > > > [ 1013.552263] x17: 0000000000000000 x16: 0000000000000000 x15: ffff8= 00084b13568 > > > [ 1013.553292] x14: ffffffffffffffff x13: ffff800084b13566 x12: 6e697= 46365707865 > > > [ 1013.554423] x11: fffffffffffe0000 x10: ffff800083b18b68 x9 : ffff8= 0008014c874 > > > [ 1013.555231] x8 : 00000000ffffefff x7 : ffff800083b16318 x6 : 00000= 00000002850 > > > [ 1013.555965] x5 : 40000000fffff1ae x4 : 0000000000000fff x3 : 00000= 00000000000 > > > [ 1013.556779] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff0= 000c24a1bc0 > > > [ 1013.557627] Call trace: > > > [ 1013.557960] scan_swap_map_try_ssd_cluster+0x340/0x370 > > > [ 1013.558498] get_swap_pages+0x23c/0xc20 > > > [ 1013.558899] folio_alloc_swap+0x5c/0x248 > > > [ 1013.559544] add_to_swap+0x40/0xf0 > > > [ 1013.559904] shrink_folio_list+0x6dc/0xf20 > > > [ 1013.560289] reclaim_folio_list+0x8c/0x168 > > > [ 1013.560710] reclaim_pages+0xfc/0x178 > > > [ 1013.561079] madvise_cold_or_pageout_pte_range+0x8d8/0xf28 > > > [ 1013.561524] walk_pgd_range+0x390/0x808 > > > [ 
1013.561920] __walk_page_range+0x1e0/0x1f0 > > > [ 1013.562370] walk_page_range+0x1f0/0x2c8 > > > [ 1013.562888] madvise_pageout+0xf8/0x280 > > > [ 1013.563388] madvise_vma_behavior+0x314/0xa20 > > > [ 1013.563982] madvise_walk_vmas+0xc0/0x128 > > > [ 1013.564386] do_madvise.part.0+0x110/0x558 > > > [ 1013.564792] __arm64_sys_madvise+0x68/0x88 > > > [ 1013.565333] invoke_syscall+0x50/0x128 > > > [ 1013.565737] el0_svc_common.constprop.0+0x48/0xf8 > > > [ 1013.566285] do_el0_svc+0x28/0x40 > > > [ 1013.566667] el0_svc+0x50/0x150 > > > [ 1013.567094] el0t_64_sync_handler+0x13c/0x158 > > > [ 1013.567501] el0t_64_sync+0x1a4/0x1a8 > > > [ 1013.568058] irq event stamp: 0 > > > [ 1013.568661] hardirqs last enabled at (0): [<0000000000000000>] 0x= 0 > > > [ 1013.569560] hardirqs last disabled at (0): [] co= py_process+0x654/0x19a8 > > > [ 1013.570167] softirqs last enabled at (0): [] co= py_process+0x654/0x19a8 > > > [ 1013.570846] softirqs last disabled at (0): [<0000000000000000>] 0x= 0 > > > [ 1013.571330] ---[ end trace 0000000000000000 ]---