From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Kalesh Singh, linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v3 12/13] mm, swap: use a global swap cluster for non-rotation devices
Date: Tue, 31 Dec 2024 01:46:20 +0800
Message-ID: <20241230174621.61185-13-ryncsn@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20241230174621.61185-1-ryncsn@gmail.com>
References: <20241230174621.61185-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: Kairui Song

Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so the
goal of the SWAP allocator is to avoid contention for clusters. It uses
a per-CPU cluster design, and each CPU will use a different cluster as
much as possible. However, HDDs are very sensitive to fragmentation,
and contention is trivial by comparison, so we use one global cluster
for them instead. This ensures that each order will be written to the
same cluster as much as possible, which helps make the I/O more
continuous and keeps the performance of the cluster allocator on par
with that of the old allocator.

Tests after this commit compared to those before this series, using
'make -j32' with tinyconfig, a 1G memcg limit, and HDD swap:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxresident)k
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxresident)k
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Suggested-by: Chris Li
Signed-off-by: Kairui Song
---
 include/linux/swap.h |  2 ++
 mm/swapfile.c        | 53 ++++++++++++++++++++++++++++++++------------
 2 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4c1d2e69689f..b13b72645db3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,6 +318,8 @@ struct swap_info_struct {
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+	struct percpu_cluster *global_cluster; /* Use one global cluster for rotating device */
+	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a3d1239d944b..e57e5453a25b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -814,7 +814,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	__this_cpu_write(si->percpu_cluster->next[order], next);
+	if (si->flags & SWP_SOLIDSTATE)
+		__this_cpu_write(si->percpu_cluster->next[order], next);
+	else
+		si->global_cluster->next[order] = next;
 	return found;
 }
 
@@ -875,9 +878,16 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order)
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-	/* Fast path using per CPU cluster */
-	local_lock(&si->percpu_cluster->lock);
-	offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	if (si->flags & SWP_SOLIDSTATE) {
+		/* Fast path using per CPU cluster */
+		local_lock(&si->percpu_cluster->lock);
+		offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	} else {
+		/* Serialize HDD SWAP allocation for each device. */
+		spin_lock(&si->global_cluster_lock);
+		offset = si->global_cluster->next[order];
+	}
+
 	if (offset) {
 		ci = lock_cluster(si, offset);
 		/* Cluster could have been used by another order */
@@ -972,8 +982,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order)
 		}
 	}
 done:
-	local_unlock(&si->percpu_cluster->lock);
-
+	if (si->flags & SWP_SOLIDSTATE)
+		local_unlock(&si->percpu_cluster->lock);
+	else
+		spin_unlock(&si->global_cluster_lock);
 	return found;
 }
 
@@ -2778,6 +2790,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
+	kfree(p->global_cluster);
+	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
 	kvfree(cluster_info);
@@ -3183,17 +3197,26 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-	if (!si->percpu_cluster)
-		goto err_free;
+	if (si->flags & SWP_SOLIDSTATE) {
+		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		if (!si->percpu_cluster)
+			goto err_free;
 
-	for_each_possible_cpu(cpu) {
-		struct percpu_cluster *cluster;
+		for_each_possible_cpu(cpu) {
+			struct percpu_cluster *cluster;
 
-		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_ENTRY_INVALID;
+			local_lock_init(&cluster->lock);
+		}
+	} else {
+		si->global_cluster = kmalloc(sizeof(*si->global_cluster), GFP_KERNEL);
+		if (!si->global_cluster)
+			goto err_free;
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cluster->next[i] = SWAP_ENTRY_INVALID;
-		local_lock_init(&cluster->lock);
+			si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
+		spin_lock_init(&si->global_cluster_lock);
 	}
 
 	/*
@@ -3467,6 +3488,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap:
 	free_percpu(si->percpu_cluster);
 	si->percpu_cluster = NULL;
+	kfree(si->global_cluster);
+	si->global_cluster = NULL;
 	inode = NULL;
 	destroy_swap_extents(si);
 	swap_cgroup_swapoff(si->type);
-- 
2.47.1