From: Chris Li <chrisl@kernel.org>
Date: Wed, 17 Jul 2024 08:41:07 -0700
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
To: Ryan Roberts
Cc: Andrew Morton, Kairui Song, Hugh Dickins, "Huang, Ying", Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>

On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts wrote:
>
> On 16/07/2024 23:46, Chris Li wrote:
> > On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts wrote:
> >>
> >> On 11/07/2024 08:29, Chris Li wrote:
> >>> Track the nonfull cluster as well as the empty cluster
> >>> on lists. Each order has one nonfull cluster list.
> >>>
> >>> The cluster will remember which order it was used during
> >>> new cluster allocation.
> >>>
> >>> When the cluster has a free entry, add it to the nonfull[order]
> >>> list. When the free cluster list is empty, also allocate
> >>> from the nonfull list of that order.
> >>>
> >>> This improves the mTHP swap allocation success rate.
> >>>
> >>> There are limitations if the distribution of numbers of
> >>> different orders of mTHP changes a lot, e.g. a lot of
> >>> nonfull clusters are assigned to order A, while later there
> >>> are a lot of order B allocations and very few order A
> >>> allocations. Currently a cluster used by order A will not
> >>> be reused by order B unless the cluster is 100% empty.
> >>>
> >>> Signed-off-by: Chris Li
> >>> ---
> >>>  include/linux/swap.h |  4 ++++
> >>>  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
> >>>  2 files changed, 35 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index e9be95468fc7..db8d6000c116 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
> >>>  	 */
> >>>  	u16 count;
> >>>  	u8 flags;
> >>> +	u8 order;
> >>>  	struct list_head list;
> >>>  };
> >>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >>>
> >>>
> >>>  /*
> >>> @@ -296,6 +298,8 @@ struct swap_info_struct {
> >>>  	unsigned long *zeromap;	/* vmalloc'ed bitmap to track zero pages */
> >>>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >>>  	struct list_head free_clusters; /* free clusters list */
> >>> +	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> >>> +				/* list of cluster that contains at least one free slot */
> >>>  	unsigned int lowest_bit;	/* index of first free in swap_map */
> >>>  	unsigned int highest_bit;	/* index of last free in swap_map */
> >>>  	unsigned int pages;	/* total of usable pages of swap */
> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>> index f70d25005d2c..e13a33664cfa 100644
> >>> --- a/mm/swapfile.c
> >>> +++ b/mm/swapfile.c
> >>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >>>  	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >>>  			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >>>
> >>> -	list_add_tail(&ci->list, &si->discard_clusters);
> >>> +	if (ci->flags)
> >>
> >> I'm not sure this is future proof; what happens if a flag is added in future
> >> that does not indicate that the cluster is on a list. Perhaps explicitly check
> >> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
> >
> > Currently flags are only used to track which list it is on.
>
> Yes, I get that it works correctly at the moment. I just don't think it's wise
> for the code to assume that any flag being set means it's on a list; that feels
> fragile for the future.

ACK.

>
> > BTW, this line has changed to check explicitly which list it is on in
> > patch 3, the big rewrite. I can move that line change to patch 2 if you want.
>
> That would get my vote; let's make every patch as good as it can be.

Done.

>
> >>
> >>> +		list_move_tail(&ci->list, &si->discard_clusters);
> >>> +	else
> >>> +		list_add_tail(&ci->list, &si->discard_clusters);
> >>> +	ci->flags = 0;
> >>
> >> Bug: (I think?) the cluster ends up on the discard_clusters list and
> >> swap_do_scheduled_discard() calls __free_cluster() which will then call
> >
> > swap_do_scheduled_discard() deletes the entry from the discard list.
>
> Ahh yes, my bad!
>
> > The flag does not track the discard list state.
> >
> >> list_add_tail() to put it on the free_clusters list. But since it is on the
> >> discard_list at that point, shouldn't it call list_move_tail()?
> >
> > See above. Calling list_move_tail() would be a mistake.
> >
> >>
> >>>  	schedule_work(&si->discard_work);
> >>>  }
> >>>
> >>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >>>  {
> >>> +	if (ci->flags & CLUSTER_FLAG_NONFULL)
> >>> +		list_move_tail(&ci->list, &si->free_clusters);
> >>> +	else
> >>> +		list_add_tail(&ci->list, &si->free_clusters);
> >>>  	ci->flags = CLUSTER_FLAG_FREE;
> >>> -	list_add_tail(&ci->list, &si->free_clusters);
> >>>  }
> >>>
> >>>  /*
> >>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> >>>  	ci->count--;
> >>>
> >>>  	if (!ci->count)
> >>> -		free_cluster(p, ci);
> >>> +		return free_cluster(p, ci);
> >>
> >> nit: I'm not sure what the kernel style guide says about this, but I'm not a
> >> huge fan of returning void. I'd find it clearer if you just turn the below `if`
> >> into an `else if`.
> >
> > I try to avoid 'else if' if possible. Changed to:
> >
> > 	if (!ci->count) {
> > 		free_cluster(p, ci);
> > 		return;
> > 	}
>
> ok
>
> >>
> >>> +
> >>> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >>> +		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >>
> >> I find the transitions when you add and remove a cluster from the
> >> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >> to the list whenever there is at least one free swap entry, if not already on the
> >> list. But you take it off the list when assigning it as the current cluster for
> >> a cpu in scan_swap_map_try_ssd_cluster().
> >>
> >> So you could have this situation:
> >>
> >> - cpuA allocs cluster from free list (exclusive to that cpu)
> >> - cpuA allocs 1 swap entry from current cluster
> >> - swap entry is freed; cluster added to nonfull_clusters
> >> - cpuB "allocs" cluster from nonfull_clusters
> >>
> >> At this point both cpuA and cpuB share the same cluster as their current
> >> cluster. So why not just put the cluster on the nonfull_clusters list at
> >> allocation time (when removed from free_list) and only remove it from the
> >
> > The big rewrite in patch 3 does that, taking it off the free list and
> > moving it into nonfull.
>
> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> separate the SSD and HDD code paths. Personally I'd prefer to see the
> refactoring separated from behavioural changes.

It is not a refactoring. It is a big rewrite of the swap allocator based on
clusters. A behavior change is expected. The goal is to completely remove the
brute-force scanning of the swap_map[] array for cluster swap allocation.

> Since the patch was titled RFC and I thought it was just refactoring, I was
> deferring review. But sounds like it is actually required to realize the test
> results quoted on the cover letter?

Yes, it is required because it handles the fallout case where try_ssd()
previously failed. The big rewrite has gone through a lot of testing and bug
fixing and is pretty stable now. The only reason I keep it as RFC is that it
is not feature complete yet: it does not do swap cache reclaim. The next
version will add swap cache reclaim and drop the RFC.

> > I am only making the minimal change in this step so the big rewrite can land.
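To make the "minimal change" concrete: the allocation-side fallback this patch
adds boils down to roughly the following (a simplified sketch of the try_ssd
path as I read the diff; locking and the discard fallback are left out, so
this is not the literal kernel code):

	struct swap_cluster_info *ci;

	if (!list_empty(&si->free_clusters)) {
		/* Prefer an empty cluster and tag it with the requested order. */
		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
		list_del(&ci->list);
		ci->order = order;
		ci->flags = 0;
	} else if (!list_empty(&si->nonfull_clusters[order])) {
		/* Otherwise reuse a partially filled cluster of the same order. */
		ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
		list_del(&ci->list);
		ci->flags = 0;
	} else {
		/* Fall back to the discard list / existing scanning, as before. */
		ci = NULL;
	}

Either way the cluster is taken off whatever list it was on, so after this
point it is only reachable through the per-CPU next pointer.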
> >
> >> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >> have room for an `order` allocation)? Then you allow "stealing" always instead
> >> of just sometimes. You would likely want to move the cluster to the end of the
> >> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >> chances of multiple CPUs using the same cluster.
> >
> > For nonfull clusters it is less important to avoid multiple CPUs
> > sharing the cluster, because the cluster already has swap entries
> > allocated from the previous CPU.
>
> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
> could be slightly ahead of cpuB so that cpuA allocates all the free pages and

That can already happen with the per-CPU next pointer. When one CPU advances
its next cluster pointer, it can cross the other CPU's next cluster pointer.

> cpuB just ends up scanning and finding nothing to allocate. I think you do want to
> share the cluster when you really need to, but try to avoid it if there are
> other options, and I think moving the cluster to the end of the list might be a
> way to help that?

Simply moving the cluster to the end of the list can create a possible dead
loop when all clusters have been scanned and no available swap range has been
found. We have tried many different approaches, including moving to the end of
the list. It can also cause more fragmentation, because each CPU then fills
its swap slot cache (64 entries) from a different cluster.

> > Those behaviors will be fine
> > tuned after the patch 3 big rewrite. Try to make this patch simple.

Again, I want to keep it simple here so patch 3 can land.

> >> Another potential optimization (which was in my hacked version IIRC) is to only
> >> add/remove from nonfull list when `total - count` crosses the (1 << order)
> >> boundary rather than when becoming completely full. You definitely won't be able
> >> to allocate order-2 if there are only 3 pages available, for example.
> >
> > That is in patch 3 as well. This patch is just doing the bare minimum
> > to introduce the nonfull list.
> >
> >>
> >>> +		ci->flags |= CLUSTER_FLAG_NONFULL;
> >>> +	}
> >>>  }
> >>>
> >>>  /*
> >>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >>>  	if (tmp == SWAP_NEXT_INVALID) {
> >>>  		if (!list_empty(&si->free_clusters)) {
> >>>  			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >>> +			list_del(&ci->list);
> >>> +			spin_lock(&ci->lock);
> >>> +			ci->order = order;
> >>> +			ci->flags = 0;
> >>> +			spin_unlock(&ci->lock);
> >>> +			tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> >>> +		} else if (!list_empty(&si->nonfull_clusters[order])) {
> >>> +			ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
> >>> +			list_del(&ci->list);
> >>> +			spin_lock(&ci->lock);
> >>> +			ci->flags = 0;
> >>> +			spin_unlock(&ci->lock);
> >>>  			tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> >>>  		} else if (!list_empty(&si->discard_clusters)) {
> >>>  			/*
> >>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >>>  	ci = lock_cluster(si, offset);
> >>>  	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >>>  	ci->count = 0;
> >>> +	ci->order = 0;
> >>>  	ci->flags = 0;
> >>
> >> Wonder if it would be better to put this in __free_cluster()?
> >
> > Both flags and order were moved to __free_cluster() in patch 3 of this
> > series. The order is best assigned together with flags when the
> > cluster changes the list.
> >
> > Thanks for the review. The patch 3 big rewrite is the heavy lifting.
>
> OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
> the series? I'll try to take a look at it today.

Yes, it is the cluster swap allocator big rewrite. Thank you for taking a look.

Chris
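P.S. To illustrate the "crosses the (1 << order) boundary" idea discussed
above: a cluster is only worth keeping on nonfull_clusters[order] while it can
still fit at least one order-sized allocation. A minimal sketch of that check
(the helper name is made up for illustration; having enough free entries is a
necessary but not sufficient condition, since the entries must also be
contiguous):

	/*
	 * A cluster can only host an order-N allocation while it still has
	 * at least (1 << N) free entries; "count" is the allocated count.
	 */
	static bool cluster_has_room(unsigned int count, unsigned int order)
	{
		return SWAPFILE_CLUSTER - count >= (1U << order);
	}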