From: Ryan Roberts <ryan.roberts@arm.com>
To: Andrew Morton, Chris Li, Kairui Song, "Huang, Ying", Kalesh Singh,
	Barry Song, Hugh Dickins, David Hildenbrand
Cc: Ryan Roberts, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [RFC PATCH v1 4/5] mm: swap: Scan for free swap entries in allocated clusters
Date: Wed, 19 Jun 2024 00:26:44 +0100
Message-ID: <20240618232648.4090299-5-ryan.roberts@arm.com>
X-Mailer: git-send-email 2.43.0
In-Reply-To: <20240618232648.4090299-1-ryan.roberts@arm.com>
References: <20240618232648.4090299-1-ryan.roberts@arm.com>
Previously mTHP would only be swapped out if a CPU could allocate itself a
free cluster from which to allocate mTHP-sized blocks of contiguous swap
entries. But on a system making heavy use of swap, fragmentation eventually
ensures there are no free clusters available, so the swap entry allocation
fails and the mTHP is forced to be split to base pages, which then get swap
entries allocated individually by scanning the swap file for free slots.

When swap entries are freed, however, they leave holes in the clusters, and
it would often be possible to allocate new mTHP swap entries in those holes.
So if we fail to allocate a free cluster, scan through the clusters until we
find one that is in use and contains swap entries of the order we require,
then scan within it until we find a suitably sized and aligned hole. We keep
a per-order "next cluster to scan" pointer so that future scans can pick up
where the last one left off. If we scan through all clusters without finding
a suitable hole, we give up, to prevent livelock.

Running the test case provided by Barry Song at the link below, I can see
the swpout fallback rate, which was previously 100% after a few iterations,
fall to 0% and stay there for all 100 iterations. This is also the case when
sprinkling in some non-mTHP allocations ("-s") too.
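Before the diff, a minimal, self-contained C sketch of the circular per-order
scan described above may help; note that NR_CLUSTERS, NR_ORDERS, struct
cluster and next_scan[] are simplified stand-ins invented for illustration,
not the kernel's real swap data structures. The authoritative version is
next_cluster_for_scan() in the patch below.

/*
 * Illustrative sketch only: a circular walk over a cluster array,
 * resuming from a per-order pointer and giving up after one full
 * loop so the caller cannot livelock. All names here are
 * hypothetical simplifications, not kernel structures.
 */
#include <stdbool.h>

#define NR_CLUSTERS	64
#define NR_ORDERS	4
#define SCAN_INVALID	(~0u)

struct cluster {
	bool free;	/* whole cluster is currently unallocated */
	int order;	/* order of entries this cluster was carved into */
};

static struct cluster clusters[NR_CLUSTERS];
static unsigned int next_scan[NR_ORDERS];	/* per-order resume point */

/* Return the next in-use cluster of the requested order, or give up. */
static unsigned int next_cluster_for_order(int order)
{
	unsigned int start = next_scan[order];
	unsigned int i;

	for (i = 0; i < NR_CLUSTERS; i++) {
		unsigned int c = (start + i) % NR_CLUSTERS;

		if (!clusters[c].free && clusters[c].order == order) {
			/* Resume just past this cluster next time. */
			next_scan[order] = (c + 1) % NR_CLUSTERS;
			return c;
		}
	}
	return SCAN_INVALID;	/* one full loop found nothing: give up */
}

The caller would then scan inside the returned cluster for a suitably sized
and aligned run of free entries, which is what the real helper's caller does
in the patch below.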
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Link: https://lore.kernel.org/linux-mm/20240615084714.37499-1-21cnbao@gmail.com/
---
 include/linux/swap.h |  2 +
 mm/swapfile.c        | 90 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 92 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2a40fe02d281..34ec4668a5c9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,6 +310,8 @@ struct swap_info_struct {
 	unsigned int cluster_nr;	/* countdown to next cluster search */
 	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+	struct swap_cluster_info *next_order_scan[SWAP_NR_ORDERS];
+					/* Start cluster for next order-based scan */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7b13f02a7ac2..24db03db8830 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -644,6 +644,84 @@ static inline bool swap_range_empty(char *swap_map, unsigned int start,
 	return true;
 }
 
+static inline
+struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
+					    unsigned int offset)
+{
+	VM_WARN_ON(!si->cluster_info);
+	return si->cluster_info + (offset / SWAPFILE_CLUSTER);
+}
+
+static inline
+unsigned int cluster_to_offset(struct swap_info_struct *si,
+			       struct swap_cluster_info *ci)
+{
+	VM_WARN_ON(!si->cluster_info);
+	return (ci - si->cluster_info) * SWAPFILE_CLUSTER;
+}
+
+static inline
+struct swap_cluster_info *next_cluster_circular(struct swap_info_struct *si,
+						struct swap_cluster_info *ci)
+{
+	struct swap_cluster_info *last;
+
+	/*
+	 * Wrap after the last whole cluster; never return the final partial
+	 * cluster because users assume an entire cluster is accessible.
+	 */
+	last = offset_to_cluster(si, si->max) - 1;
+	return ci == last ? si->cluster_info : ++ci;
+}
+
+static inline
+struct swap_cluster_info *prev_cluster_circular(struct swap_info_struct *si,
+						struct swap_cluster_info *ci)
+{
+	struct swap_cluster_info *last;
+
+	/*
+	 * Wrap to the last whole cluster; never return the final partial
+	 * cluster because users assume an entire cluster is accessible.
+	 */
+	last = offset_to_cluster(si, si->max) - 1;
+	return ci == si->cluster_info ? last : --ci;
+}
+
+/*
+ * Returns the offset of the next cluster, allocated to contain swap entries of
+ * `order`, that is eligible to scan for free space. On first call, *stop should
+ * be set to SWAP_NEXT_INVALID to indicate the clusters should be scanned all
+ * the way back around to the returned cluster. The function updates *stop upon
+ * first call and consumes it in subsequent calls. Returns SWAP_NEXT_INVALID if
+ * no such clusters are available. Must be called with si lock held.
+ */
+static unsigned int next_cluster_for_scan(struct swap_info_struct *si,
+					  int order, unsigned int *stop)
+{
+	struct swap_cluster_info *ci;
+	struct swap_cluster_info *end;
+
+	ci = si->next_order_scan[order];
+	if (*stop == SWAP_NEXT_INVALID)
+		*stop = cluster_to_offset(si, prev_cluster_circular(si, ci));
+	end = offset_to_cluster(si, *stop);
+
+	while (ci != end) {
+		if ((ci->flags & CLUSTER_FLAG_FREE) == 0 && ci->order == order)
+			break;
+		ci = next_cluster_circular(si, ci);
+	}
+
+	if (ci == end) {
+		si->next_order_scan[order] = ci;
+		return SWAP_NEXT_INVALID;
+	}
+
+	si->next_order_scan[order] = next_cluster_circular(si, ci);
+	return cluster_to_offset(si, ci);
+}
+
 /*
  * Try to get swap entries with specified order from current cpu's swap entry
  * pool (a cluster). This might involve allocating a new cluster for current CPU
@@ -656,6 +734,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
 	unsigned int tmp, max;
+	unsigned int stop = SWAP_NEXT_INVALID;
 
 new_cluster:
 	cluster = this_cpu_ptr(si->percpu_cluster);
@@ -674,6 +753,15 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
 			*scan_base = this_cpu_read(*si->cluster_next_cpu);
 			*offset = *scan_base;
 			goto new_cluster;
+		} else if (nr_pages < SWAPFILE_CLUSTER) {
+			/*
+			 * There is no point in scanning for free areas the same
+			 * size as the cluster, since the cluster would have
+			 * already been freed in that case.
+			 */
+			tmp = next_cluster_for_scan(si, order, &stop);
+			if (tmp == SWAP_NEXT_INVALID)
+				return false;
 		} else
 			return false;
 	}
@@ -2392,6 +2480,8 @@ static void setup_swap_info(struct swap_info_struct *p, int prio,
 	}
 	p->swap_map = swap_map;
 	p->cluster_info = cluster_info;
+	for (i = 0; i < SWAP_NR_ORDERS; i++)
+		p->next_order_scan[i] = cluster_info;
 }
 
 static void _enable_swap_info(struct swap_info_struct *p)
--
2.43.0