From: Ryan Roberts <ryan.roberts@arm.com>
To: Chris Li, Andrew Morton
Cc: Kairui Song, "Huang, Ying", linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Barry Song
Subject: Re: [PATCH 2/2] mm: swap: mTHP allocate swap entries from nonfull list
Date: Fri, 7 Jun 2024 11:57:52 +0100
Message-ID: <7553070e-630e-4e86-b64e-66cfce1ee125@arm.com>
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
 <20240524-swap-allocator-v1-2-47861b423b26@kernel.org>
Content-Type: text/plain; charset=UTF-8
On 07/06/2024 11:35, Ryan Roberts wrote:
> On 24/05/2024 18:17, Chris Li wrote:
>> Track the nonfull clusters as well as the empty clusters
>> on lists. Each order has one nonfull cluster list.
>>
>> The cluster will remember which order it was used for
>> during new cluster allocation.
>>
>> When a cluster has a free entry, add it to the nonfull[order]
>> list. When the free cluster list is empty, also allocate
>> from the nonfull list of that order.
>>
>> This improves the mTHP swap allocation success rate.
>
> If I've understood correctly, the aim here is to link all the current per-cpu
> clusters for a given order together so that if a cpu can't allocate a new
> cluster for a given order, then it can steal another cpu's current cluster for
> that order?
>
> If that's the intent, couldn't that be done just by iterating over the per-cpu,
> per-order cluster pointers? Then you don't need all the linked list churn
> (although I like the linked list changes as a nice cleanup, I'm not sure the
> churn is necessary for this change?). There would likely need to be some
> locking considerations, but it would also allow you to get access to the next
> entry within the cluster for allocation.
>
> However, fundamentally, I don't think this change solves the problem; it just
> takes a bit longer before the allocation fails. The real problem is
> fragmentation due to freeing individual pages from swap entries at different
> times.
>
> Wouldn't it be better to just extend scanning to support high order
> allocations? Then we can steal a high order block from any cluster, even
> clusters that were previously full, just like we currently do for order-0.
> Given we are already falling back to this path for order-0, I don't think it
> would be any more expensive; in fact it's less expensive because we only scan
> once for the high order block, rather than scanning for every split order-0
> page.
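
To illustrate what I mean by extending scanning to high order allocations,
something along these lines, perhaps (completely untested, purely a sketch;
the helper name is invented, the caller would need to hold the cluster lock,
and details like SWAP_HAS_CACHE handling are omitted):

  /*
   * Scan one cluster's swap_map for a free, naturally aligned run of
   * nr_pages slots. Stepping in nr_pages strides keeps any block we
   * find aligned to the requested order.
   */
  static bool cluster_scan_for_order(struct swap_info_struct *si,
                                     unsigned long cluster_idx,
                                     unsigned int nr_pages,
                                     unsigned long *found)
  {
          unsigned long base = cluster_idx * SWAPFILE_CLUSTER;
          unsigned long off, i;

          for (off = 0; off + nr_pages <= SWAPFILE_CLUSTER; off += nr_pages) {
                  for (i = 0; i < nr_pages; i++) {
                          if (si->swap_map[base + off + i])
                                  break;  /* slot in use; try next block */
                  }
                  if (i == nr_pages) {    /* whole block is free */
                          *found = base + off;
                          return true;
                  }
          }
          return false;
  }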
>
> Of course that still doesn't solve the problem entirely; if swap is so
> fragmented that there is no contiguous block of the required order then you
> still have to fall back to splitting. As an extra optimization, you could
> store the largest contiguous free space available in each cluster to avoid
> scanning when it's too small?

(A rough sketch of this idea is at the bottom of this mail.)

>
>>
>> There are limitations if the distribution of the numbers of
>> different mTHP orders changes a lot; e.g. a lot of nonfull
>> clusters get assigned to order A, while later there are a lot
>> of order B allocations and very little allocation in order A.
>> Currently a cluster used by order A will not be reused by
>> order B unless the cluster is 100% empty.
>>
>> This situation is best addressed by the longer term "swap
>> buddy allocator", in future patches.
>> ---
>>  include/linux/swap.h |  4 ++++
>>  mm/swapfile.c        | 25 +++++++++++++++++++++++--
>>  2 files changed, 27 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 0d3906eff3c9..1b7f0794b9bf 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -255,10 +255,12 @@ struct swap_cluster_info {
>>  	 * cluster
>>  	 */
>>  	unsigned int count:16;
>> +	unsigned int order:8;
>>  	unsigned int flags:8;
>>  	struct list_head next;
>>  };
>>  #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>
>>
>>  /*
>> @@ -297,6 +299,8 @@ struct swap_info_struct {
>>  	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
>>  	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>  	struct list_head free_clusters; /* free clusters list */
>> +	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>> +	/* list of cluster that contains at least one free slot */
>>  	unsigned int lowest_bit;	/* index of first free in swap_map */
>>  	unsigned int highest_bit;	/* index of last free in swap_map */
>>  	unsigned int pages;		/* total of usable pages of swap */
>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> index 205a60c5f9cb..51923aba500e 100644
>> --- a/mm/swapfile.c
>> +++ b/mm/swapfile.c
>> @@ -363,8 +363,11 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>
>>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>>  {
>> +	if (ci->flags & CLUSTER_FLAG_NONFULL)
>> +		list_move_tail(&ci->next, &si->free_clusters);
>> +	else
>> +		list_add_tail(&ci->next, &si->free_clusters);
>>  	ci->flags = CLUSTER_FLAG_FREE;
>> -	list_add_tail(&ci->next, &si->free_clusters);
>>  }
>>
>>  /*
>> @@ -486,7 +489,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>>  	ci->count--;
>>
>>  	if (!ci->count)
>> -		free_cluster(p, ci);
>> +		return free_cluster(p, ci);
>> +
>> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> +		list_add_tail(&ci->next, &p->nonfull_clusters[ci->order]);
>> +		ci->flags |= CLUSTER_FLAG_NONFULL;
>> +	}
>>  }
>>
>>  /*
>> @@ -547,6 +555,14 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, next);
>>  		list_del(&ci->next);
>>  		spin_lock(&ci->lock);
>> +		ci->order = order;
>> +		ci->flags = 0;
>> +		spin_unlock(&ci->lock);
>> +		tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>> +	} else if (!list_empty(&si->nonfull_clusters[order])) {
>> +		ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, next);
>> +		list_del(&ci->next);
>> +		spin_lock(&ci->lock);
>>  		ci->flags = 0;
>> 		spin_unlock(&ci->lock);
>>  		tmp = (ci - si->cluster_info) * SWAPFILE_CLUSTER;
>
> This looks wrong to me; if the cluster is on the nonfull list then it will
> have had some entries already allocated (by another cpu). So pointing tmp to
> the first block in the cluster will never yield a free block. The cpu from
> which you are stealing the cluster stores the next free block location in its
> per-cpu structure. So perhaps iterating over the other cpus'
> `struct percpu_cluster`s is a better approach than the nonfull list?

Ahh; of course the cluster scan below will move this along to a free block.

>
> Additionally, this cluster will be stored back to this cpu's current cluster
> at the bottom of the function. That may or may not be what you intended.
>
>> @@ -578,6 +594,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>  				break;
>>  			tmp += nr_pages;
>>  		}
>> +		WARN_ONCE(ci->order != order, "expecting order %d got %d", order, ci->order);
>>  		unlock_cluster(ci);
>>  	}
>>  	if (tmp >= max) {
>> @@ -956,6 +973,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>  	ci = lock_cluster(si, offset);
>>  	memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>>  	ci->count = 0;
>> +	ci->order = 0;
>>  	ci->flags = 0;
>>  	free_cluster(si, ci);
>>  	unlock_cluster(ci);
>> @@ -2882,6 +2900,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>>  	INIT_LIST_HEAD(&p->free_clusters);
>>  	INIT_LIST_HEAD(&p->discard_clusters);
>>
>> +	for (i = 0; i < SWAP_NR_ORDERS; i++)
>> +		INIT_LIST_HEAD(&p->nonfull_clusters[i]);
>> +
>>  	for (i = 0; i < swap_header->info.nr_badpages; i++) {
>>  		unsigned int page_nr = swap_header->info.badpages[i];
>>  		if (page_nr == 0 || page_nr > swap_header->info.last_page)
>>
>
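
As an aside, to make the "store the largest contiguous free space per cluster"
suggestion above concrete, here is a rough, untested sketch; the max_free_run
field and the helper names are all invented for illustration:

  /*
   * Hypothetical extra bitfield in struct swap_cluster_info:
   *
   *	unsigned int max_free_run:10;	(0..SWAPFILE_CLUSTER)
   *
   * It holds an upper bound on the longest free run in the cluster, so
   * a failed pre-check is always a safe skip, while a passing pre-check
   * may still scan the cluster and find nothing.
   */
  static bool cluster_may_fit(struct swap_cluster_info *ci,
                              unsigned int nr_pages)
  {
          return ci->max_free_run >= nr_pages;
  }

  /* A full scan learns the exact longest run; cache it for next time. */
  static void cluster_note_scan_result(struct swap_cluster_info *ci,
                                       unsigned int longest_seen)
  {
          ci->max_free_run = longest_seen;
  }

  /* Freeing can merge runs; cheaply restore a safe upper bound. */
  static void cluster_note_free(struct swap_cluster_info *ci)
  {
          /* the number of free slots always bounds the longest free run */
          ci->max_free_run = SWAPFILE_CLUSTER - ci->count;
  }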