From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: 
Date: Wed, 17 Jul 2024 11:14:51 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
Content-Language: en-GB
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, "Huang, Ying",
 Kalesh Singh, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 Barry Song
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
From: Ryan Roberts
In-Reply-To: 
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

On 16/07/2024 23:46, Chris Li wrote:
> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts wrote:
>>
>> On 11/07/2024 08:29, Chris Li wrote:
>>> Track the nonfull cluster as well as the empty cluster
>>> on lists. Each order has one nonfull cluster list.
>>>
>>> The cluster will remember which order it was used during
>>> new cluster allocation.
>>>
>>> When the cluster has free entry, add to the nonfull[order]
>>> list. When the free cluster list is empty, also allocate
>>> from the nonempty list of that order.
>>>
>>> This improves the mTHP swap allocation success rate.
>>>
>>> There are limitations if the distribution of numbers of
>>> different orders of mTHP changes a lot. e.g. there are a lot
>>> of nonfull cluster assign to order A while later time there
>>> are a lot of order B allocation while very little allocation
>>> in order A. Currently the cluster used by order A will not
>>> reused by order B unless the cluster is 100% empty.
>>>
>>> Signed-off-by: Chris Li
>>> ---
>>>  include/linux/swap.h |  4 ++++
>>>  mm/swapfile.c        | 34 +++++++++++++++++++++++++++++++---
>>>  2 files changed, 35 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index e9be95468fc7..db8d6000c116 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>>>  	 */
>>>  	u16 count;
>>>  	u8 flags;
>>> +	u8 order;
>>>  	struct list_head list;
>>>  };
>>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>>
>>>
>>>  /*
>>> @@ -296,6 +298,8 @@ struct swap_info_struct {
>>>  	unsigned long *zeromap;	/* vmalloc'ed bitmap to track zero pages */
>>>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>>  	struct list_head free_clusters;	/* free clusters list */
>>> +	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>>> +	/* list of cluster that contains at least one free slot */
>>>  	unsigned int lowest_bit;	/* index of first free in swap_map */
>>>  	unsigned int highest_bit;	/* index of last free in swap_map */
>>>  	unsigned int pages;		/* total of usable pages of swap */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index f70d25005d2c..e13a33664cfa 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>>  	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>>  			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>>>
>>> -	list_add_tail(&ci->list, &si->discard_clusters);
>>> +	if (ci->flags)
>>
>> I'm not sure this is future proof; what happens if a flag is added in future
>> that does not indicate that the cluster is on a list. Perhaps explicitly check
>> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
>
> Currently flags are only used to track which list it is on.

Yes, I get that it works correctly at the moment. I just don't think it's wise
for the code to assume that any flag being set means it's on a list; that feels
fragile for the future.

> BTW, this
> line has changed to check for explicite which list in patch 3 the big
> rewrite. I can move that line change to patch 2 if you want.

That would get my vote; let's make every patch as good as it can be.

>
>>
>>> +		list_move_tail(&ci->list, &si->discard_clusters);
>>> +	else
>>> +		list_add_tail(&ci->list, &si->discard_clusters);
>>> +	ci->flags = 0;
>>
>> Bug: (I think?) the cluster ends up on the discard_clusters list and
>> swap_do_scheduled_discard() calls __free_cluster() which will then call
>
> swap_do_scheduled_discard() delete the entry from discard list.

Ahh yes, my bad!

> The flag does not track the discard list state.
>
>> list_add_tail() to put it on the free_clusters list. But since it is on the
>> discard_list at that point, shouldn't it call list_move_tail()?
>
> See above. Call list_move_tail() would be a mistake.
>
>>
>>>  	schedule_work(&si->discard_work);
>>>  }
>>>
>>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>>>  {
>>> +	if (ci->flags & CLUSTER_FLAG_NONFULL)
>>> +		list_move_tail(&ci->list, &si->free_clusters);
>>> +	else
>>> +		list_add_tail(&ci->list, &si->free_clusters);
>>>  	ci->flags = CLUSTER_FLAG_FREE;
>>> -	list_add_tail(&ci->list, &si->free_clusters);
>>>  }
>>>
>>>  /*
>>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>>>  	ci->count--;
>>>
>>>  	if (!ci->count)
>>> -		free_cluster(p, ci);
>>> +		return free_cluster(p, ci);
>>
>> nit: I'm not sure what the kernel style guide says about this, but I'm not a
>> huge fan of returning void. I'd find it clearer if you just turn the below `if`
>> into an `else if`.
>
> I try to avoid 'else if' if possible.
> Changed to
> if (!ci->count) {
>     free_cluster(p, ci);
>     return;
> }

ok

>
>>
>>> +
>>> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>> +		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>
>> I find the transitions when you add and remove a cluster from the
>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> to the list whenever there is at least one free swap entry if not already on the
>> list. But you take it off the list when assigning it as the current cluster for
>> a cpu in scan_swap_map_try_ssd_cluster().
>>
>> So you could have this situation:
>>
>> - cpuA allocs cluster from free list (exclusive to that cpu)
>> - cpuA allocs 1 swap entry from current cluster
>> - swap entry is freed; cluster added to nonfull_clusters
>> - cpuB "allocs" cluster from nonfull_clusters
>>
>> At this point both cpuA and cpuB share the same cluster as their current
>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> allocation time (when removed from free_list) and only remove it from the
>
> The big rewrite on patch 3 does that, taking it off the free list and
> moving it into nonfull.

Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
scan_swap_map_slots()" I assumed that was just a refactoring of the code to
separate the SSD and HDD code paths. Personally I'd prefer to see the
refactoring separated from behavioural changes. Since the patch was titled RFC
and I thought it was just refactoring, I was deferring review. But it sounds
like it is actually required to realize the test results quoted on the cover
letter?

> I am only making the minimal change in this step so the big rewrite can land.
>
>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> of just sometimes. You would likely want to move the cluster to the end of the
>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> chances of multiple CPUs using the same cluster.
>
> For nonfull clusters it is less important to avoid multiple CPU
> sharing the cluster. Because the cluster already has previous swap
> entries allocated from the previous CPU.

But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
could be slightly ahead of cpuB, so that cpuA allocates all the free pages and
cpuB just ends up scanning and finding nothing to allocate? I think we do want
to share the cluster when you really need to, but try to avoid it if there are
other options, and I think moving the cluster to the end of the list might be a
way to help that?
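Just to make that concrete, the nonfull path in scan_swap_map_try_ssd_cluster()
could look something like the below. Completely untested sketch, only using the
fields this patch introduces, and with the ci->lock handling elided; the point
is just "rotate to tail instead of removing":

	} else if (!list_empty(&si->nonfull_clusters[order])) {
		ci = list_first_entry(&si->nonfull_clusters[order],
				      struct swap_cluster_info, list);
		/*
		 * Keep the cluster on nonfull_clusters[order] (so
		 * CLUSTER_FLAG_NONFULL stays set), but rotate it to the
		 * tail so the next CPU looking here picks a different
		 * cluster if one is available; we only end up sharing a
		 * cluster when there is no other choice.
		 */
		list_move_tail(&ci->list, &si->nonfull_clusters[order]);
		tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
	}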
> Those behaviors will be fine
> tuned after the patch 3 big rewrite. Try to make this patch simple.
>
>> Another potential optimization (which was in my hacked version IIRC) is to only
>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>> boundary rather than when becoming completely full. You definitely won't be able
>> to allocate order-2 if there are only 3 pages available, for example.
>
> That is in patch 3 as well. This patch is just doing the bare minimum
> to introduce the nonfull list.
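FWIW, the check I was imagining is roughly the below, in the
dec_cluster_info_page() context of this patch. Untested sketch; `free` is just
a local I've made up for illustration:

	unsigned int free = SWAPFILE_CLUSTER - ci->count;	/* free slots left */

	/*
	 * (1 << order) free slots is only a necessary condition (they may
	 * not form an aligned run), but with fewer than that the cluster
	 * definitely can't satisfy an order-sized allocation, so it isn't
	 * worth keeping on nonfull_clusters[order].
	 */
	if (free >= (1 << ci->order) && !(ci->flags & CLUSTER_FLAG_NONFULL)) {
		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}

The matching removal when the count crosses back over the boundary would be the
mirror of this on the allocation side; happy for all of that to live in patch 3.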
>
>>
>>> +		ci->flags |= CLUSTER_FLAG_NONFULL;
>>> +	}
>>>  }
>>>
>>>  /*
>>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>  	if (tmp == SWAP_NEXT_INVALID) {
>>>  		if (!list_empty(&si->free_clusters)) {
>>>  			ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>>> +			list_del(&ci->list);
>>> +			spin_lock(&ci->lock);
>>> +			ci->order = order;
>>> +			ci->flags = 0;
>>> +			spin_unlock(&ci->lock);
>>> +			tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>> +		} else if (!list_empty(&si->nonfull_clusters[order])) {
>>> +			ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
>>> +			list_del(&ci->list);
>>> +			spin_lock(&ci->lock);
>>> +			ci->flags = 0;
>>> +			spin_unlock(&ci->lock);
>>>  			tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>>  		} else if (!list_empty(&si->discard_clusters)) {
>>>  			/*
>>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>  	ci = lock_cluster(si, offset);
>>>  	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>>>  	ci->count = 0;
>>> +	ci->order = 0;
>>>  	ci->flags = 0;
>>
>> Wonder if it would be better to put this in __free_cluster()?
>
> Both flags and order were moved to __free_cluster() in patch 3 of this
> series. The order is best assigned together with flags when the
> cluster changes the list.
>
> Thanks for the review. The patch 3 big rewrite is the heavy lifting.

OK, but it sounds like patch 3 isn't really RFC after all, but a crucial part
of the series? I'll try to take a look at it today.

>
> Chris