From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Li
Date: Thu, 25 Sep 2025 11:25:56 -0700
Subject: Re: [PATCH] mm/swapfile.c: select the swap device with default priority round robin
To: Baoquan He
Cc: linux-mm@kvack.org, akpm@linux-foundation.org, kasong@tencent.com, baohua@kernel.org, shikemeng@huaweicloud.com, nphamcs@gmail.com, YoungJun Park
MIME-Version: 1.0
References: <20250924091746.146461-1-bhe@redhat.com>
Content-Type: text/plain; charset="UTF-8"

On Wed, Sep 24, 2025 at 6:55 PM Baoquan He wrote:
>
> On 09/24/25 at 08:54am, Chris Li wrote:
> > Hi Baoquan,
> >
> > Very exciting numbers. I have always suspected that the per-node priority
> > is not contributing much in the new swap allocator world. I did not
> > expect it to have a negative contribution.
>
> Yes.
>
> Compared with the very beginning, though, there has been some progress.
> At the very beginning, there was one plist, and swap devices got their
> priority from -1 downwards by default. All CPUs would exhaust the swap
> device of priority '-1', then select the swap device of priority '-2'
> to exhaust, then '-3', and so on. I think the node-based adjustment
> distributes the pressure of lock contention on one swap device a little
> bit. However, within a node the node's CPUs still try to exhaust one
> swap device at a time, and nodes with no swap device attached still try
> to exhaust the swap devices one by one in priority order.
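(Aside, purely for illustration: below is a minimal user-space sketch -- not
the kernel code -- of the two selection policies being discussed: strict
exhaustion in priority order versus requeue-to-tail round robin among
devices sharing one default priority. Device names and page counts are
made up.)

/*
 * Toy model: pick the preferred device, allocate one "page" from it,
 * and rotate it behind its equal-priority siblings -- a crude stand-in
 * for the plist ordering plus requeue-to-tail behaviour.
 */
#include <stdio.h>

#define NDEV 4

struct dev {
	const char *name;
	int prio;		/* plist key: lower value is preferred */
	int free_pages;
};

static struct dev *pick(struct dev *order[])
{
	for (int i = 0; i < NDEV; i++) {
		if (order[i]->free_pages > 0) {
			struct dev *d = order[i];
			int j = i;

			/* requeue behind the siblings sharing the same key */
			while (j + 1 < NDEV && order[j + 1]->prio == d->prio) {
				order[j] = order[j + 1];
				j++;
			}
			order[j] = d;
			return d;
		}
	}
	return NULL;
}

static void run(const char *label, const int prios[NDEV])
{
	struct dev devs[NDEV] = {
		{ "zram0", 0, 4 }, { "zram1", 0, 4 },
		{ "zram2", 0, 4 }, { "zram3", 0, 4 },
	};
	struct dev *order[NDEV], *d;

	/* order[] starts sorted by key; both inputs below already are */
	for (int i = 0; i < NDEV; i++) {
		devs[i].prio = prios[i];
		order[i] = &devs[i];
	}
	printf("%s:", label);
	while ((d = pick(order)) != NULL) {
		d->free_pages--;
		printf(" %s", d->name);
	}
	printf("\n");
}

int main(void)
{
	const int old_prios[NDEV] = { 2, 3, 4, 5 };	/* negated -2..-5 */
	const int new_prios[NDEV] = { 1, 1, 1, 1 };	/* negated default -1 */

	run("old (distinct default priorities)", old_prios);
	run("new (one shared default priority)", new_prios);
	return 0;
}

With distinct priorities it prints zram0 four times before ever touching
zram1; with one shared priority it cycles zram0 -> zram1 -> zram2 -> zram3,
which is the round robin behaviour being discussed here.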
> >
> > On Wed, Sep 24, 2025 at 2:18 AM Baoquan He wrote:
> > >
> > > Currently, on a system with multiple swap devices, swap allocation
> > > selects one swap device according to priority. The swap device with
> > > the highest priority is chosen to allocate from first.
> > >
> > > People can specify a priority from 0 to 32767 when swapping on a swap
> > > device, or the system will assign one from -2 downwards by default.
> > > Meanwhile, on a NUMA system, the swap device owning a node_id will be
> > > considered first on the NUMA node with that node_id.
> >
> > That behavior was introduced by commit a2468cc9bfdf ("swap: choose swap
> > device according to numa node").
> > You are effectively reverting that patch and the follow-up fix
> > patches on top of it.
> > The commit message, or maybe the title, should reflect the reversion nature.
> >
> > If you did more than the simple revert plus fixups, please document
> > what additional changes you make in this patch.
>
> Sure, I will mention commit a2468cc9bfdf and that my patch reverts it; on
> top of that, the default priority of swap devices will be set to '-1' so
> that all swap devices with default priority are chosen round robin. Like
> this, the si->lock contention can be greatly reduced.

Just curious: does setting it to "-1" match the kernel behavior before
a2468cc9bfdf? If not, what was the behavior before a2468cc9bfdf?

> > > In the current code, an array of plists, swap_avail_heads[nid], is used
> > > to organize swap devices on each NUMA node. For each NUMA node, there
> > > is a plist organizing all swap devices. The 'prio' value in the plist
> > > is the negated value of the device's priority because plists are sorted
> > > from low to high. The swap device owning a node_id is promoted to the
> > > front position on that NUMA node, and the other swap devices are put in
> > > order of their default priority.
> > >
> >
> > The original patch that introduced it used SSDs as a benchmark.
> > Here you are using a patched zram as a benchmark. You want to explain in
> > a little more detail why you chose a different test method, e.g. you
> > don't have a machine with SSD devices as raw partitions to run the
> > original test. Compressed-RAM-based swap devices, zswap or zram, are
> > used a lot more in data center server and Android workload
> > environments, maybe even in some Linux workstation distros as well.
>
> Yeah, just as you said. I haven't got a machine to create several SSD
> partitions as swap devices to test this, so I hacked the zram kernel code
> to assign each zram device's device id as its node_id. I also tried to
> use several swap files on one SSD disk, but the speed is too slow and the
> comparison results are not very obvious; there's only about a 1%
> improvement. And as you said, in reality zram or zswap is being adopted
> more and more and is becoming the default setup on systems. So I think it
> makes sense to use zram devices to test this patch.
Yes, I understand; you want to mention that in your cover letter for
the reader to know.

> >
> >
> > You can also invite others who do have spare SSD drives to test
> > SSDs as swap devices, maybe with some setup instructions for how to
> > set up and repeat your test on their machine with multiple SSD drives,
> > and how to compare the results with and without your reversion patch.
>
> Sure, I will write these detailed steps in the v2 cover letter, and attach
> my two draft scripts to ease doing the statistics and outputting the final
> result.

Ack

> >
> > > E.g. I got a system with 8 NUMA nodes, and I set up 4 zram partitions as
> > > swap devices.
> >
> > You want to make it clear up front that you are using a patched zram
> > to simulate the per-node swap device behavior. Native zram does not
> > have that.
>
> Sure, will add it in the v2 cover letter.

Ack.

> >
> > >
> > > Current behaviour:
> > > their priorities will be (note that -1 is skipped):
> > > NAME       TYPE       SIZE USED PRIO
> > > /dev/zram0 partition   16G   0B   -2
> > > /dev/zram1 partition   16G   0B   -3
> > > /dev/zram2 partition   16G   0B   -4
> > > /dev/zram3 partition   16G   0B   -5
> > >
> > > And their positions in the 8 swap_avail_lists[nid] will be:
> > > swap_avail_lists[0]:  /* node 0's available swap device list */
> > >   zram0  -> zram1  -> zram2  -> zram3
> > >   prio:1    prio:3    prio:4    prio:5
> > > swap_avail_lists[1]:  /* node 1's available swap device list */
> > >   zram1  -> zram0  -> zram2  -> zram3
> > >   prio:1    prio:2    prio:4    prio:5
> > > swap_avail_lists[2]:  /* node 2's available swap device list */
> > >   zram2  -> zram0  -> zram1  -> zram3
> > >   prio:1    prio:2    prio:3    prio:5
> > > swap_avail_lists[3]:  /* node 3's available swap device list */
> > >   zram3  -> zram0  -> zram1  -> zram2
> > >   prio:1    prio:2    prio:3    prio:4
> > > swap_avail_lists[4-7]:  /* node 4,5,6,7's available swap device list */
> > >   zram0  -> zram1  -> zram2  -> zram3
> > >   prio:2    prio:3    prio:4    prio:5
> > >
> > > The adjustment for the swap device with a node_id was intended to decrease
> > > the pressure of lock contention on one swap device by taking a different
> > > swap device on different nodes. However, the adjustment is very
> > > coarse-grained. On that node, the swap device sharing the node's id will
> > > always be selected first by the node's CPUs until exhausted, then the next
> > > one. And on other nodes where no swap device shares the node id, the swap
> > > device with priority '-2' will be selected first until exhausted, then the
> > > next with priority '-3'.
> > >
> > > This is the swapon output while the high-pressure vm-scalability test is
> > > running. It clearly shows zram0 is heavily exploited until exhausted.
> >
> > Any tips for how others can repeat your high-pressure vm-scalability test,
> > especially for someone who has multiple SSD drives as test swap
> > devices? Some test script setup would be nice. You can post the
> > instructions in the same email thread as a separate email; they do not
> > have to be in the commit message.
>
> Sure, will do.

Ack

> >
> > > ===================================
> > > [root@hp-dl385g10-03 ~]# swapon
> > > NAME       TYPE       SIZE  USED PRIO
> > > /dev/zram0 partition   16G 15.7G   -2
> > > /dev/zram1 partition   16G  3.4G   -3
> > > /dev/zram2 partition   16G  3.4G   -4
> > > /dev/zram3 partition   16G  2.6G   -5
> > >
> > > This is unreasonable because swap devices are assumed to have similar
> > > access speed if no priority is specified at swapon time. It's unfair, and
> > > it doesn't make sense, that just because one swap device is swapped on
> > > first, its priority is higher than the one swapped on later.
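(Another aside, for illustration only: the swap_avail_lists[nid] table
quoted above follows mechanically from the promotion rule, and this
hypothetical user-space snippet just recomputes it. Note that nodes 4-7,
which have no local zram device, all start on zram0, which is why zram0
fills up first in the swapon output above.)

#include <stdio.h>

#define NDEV   4
#define NNODES 8

int main(void)
{
	/* default priorities -2..-5, stored negated as the plist expects */
	const int negated_prio[NDEV] = { 2, 3, 4, 5 };

	for (int node = 0; node < NNODES; node++) {
		int key[NDEV];

		printf("swap_avail_lists[%d]:", node);
		/* the device whose node_id matches is promoted to key 1 */
		for (int dev = 0; dev < NDEV; dev++)
			key[dev] = (dev == node) ? 1 : negated_prio[dev];
		/* print devices in ascending key order, as the plist holds them */
		for (int k = 1; k <= NDEV + 1; k++)
			for (int dev = 0; dev < NDEV; dev++)
				if (key[dev] == k)
					printf("  zram%d(prio:%d)", dev, k);
		printf("\n");
	}
	return 0;
}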
> > >
> > > So here a change is made to select swap devices round robin if they have
> > > the default priority. In code, the plist array swap_avail_heads[nid] is
> > > replaced with a single plist, swap_avail_head. Any device without a
> > > specified priority gets the same default priority '-1'. Of course, swap
> > > devices with a specified priority are still always put foremost; that is
> > > not impacted. If you care about their different access speeds, then use
> > > 'swapon -p xx' to assign priorities to your swap devices.
> > >
> > > New behaviour:
> > >
> > > swap_avail_list:  /* one global available swap device list */
> > >   zram0  -> zram1  -> zram2  -> zram3
> > >   prio:1    prio:1    prio:1    prio:1
> > >
> > > This is the swapon output while the high-pressure vm-scalability test is
> > > running; all devices are selected round robin:
> > > =======================================
> > > [root@hp-dl385g10-03 linux]# swapon
> > > NAME       TYPE       SIZE  USED PRIO
> > > /dev/zram0 partition   16G 12.6G   -1
> > > /dev/zram1 partition   16G 12.6G   -1
> > > /dev/zram2 partition   16G 12.6G   -1
> > > /dev/zram3 partition   16G 12.6G   -1
> > >
> > > With the change, we can see about an 18% efficiency improvement as below:
> > >
> > > vm-scalability test:
> > > ====================
> > > Test with:
> > > usemem --init-time -O -y -x -n 31 2G (4G memcg, zram as swap)
> > >                               Before:          After:
> > > System time:                  637.92 s         526.74 s
> > You can clarify here that lower is better.
> > > Sum Throughput:               3546.56 MB/s     4207.56 MB/s
> >
> > Higher is better. Also a percentage number can be useful here, e.g.
> > that is a +18.6% improvement since reverting to round robin. A
> > huge difference!
> >
> > > Single process Throughput:    114.40 MB/s      135.72 MB/s
> > Higher is better.
> > > free latency:                 10138455.99 us   6810119.01 us
>
> OK, will improve this.

Ack

> > >
> > > Suggested-by: Chris Li
> > > Signed-off-by: Baoquan He
> > > ---
> > >  include/linux/swap.h | 11 +-----
> > >  mm/swapfile.c        | 94 +++++++-------------------------------------
> >
> > Very nice patch stats! Less code and it runs faster. What more can we ask for?
> >
> >
> > >  2 files changed, 16 insertions(+), 89 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 3473e4247ca3..f72c8e5e0635 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -337,16 +337,7 @@ struct swap_info_struct {
> > >         struct work_struct discard_work; /* discard worker */
> > >         struct work_struct reclaim_work; /* reclaim worker */
> > >         struct list_head discard_clusters; /* discard clusters list */
> > > -       struct plist_node avail_lists[]; /*
> > > -                                         * entries in swap_avail_heads, one
> > > -                                         * entry per node.
> > > -                                         * Must be last as the number of the
> > > -                                         * array is nr_node_ids, which is not
> > > -                                         * a fixed value so have to allocate
> > > -                                         * dynamically.
> > > -                                         * And it has to be an array so that
> > > -                                         * plist_for_each_* can work.
> > > -                                         */
> > > +       struct plist_node avail_list;  /* entry in swap_avail_head */
> > >  };
> > >
> > >  static inline swp_entry_t page_swap_entry(struct page *page)
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index b4f3cc712580..d8a54e5af16d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -73,7 +73,7 @@ atomic_long_t nr_swap_pages;
> > >  EXPORT_SYMBOL_GPL(nr_swap_pages);
> > >  /* protected with swap_lock. reading in vm_swap_full() doesn't need lock */
> > >  long total_swap_pages;
> > > -static int least_priority = -1;
> > > +#define DEF_SWAP_PRIO -1
> > >  unsigned long swapfile_maximum_size;
> > >  #ifdef CONFIG_MIGRATION
> > >  bool swap_migration_ad_supported;
> > > @@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
> > >   * is held and the locking order requires swap_lock to be taken
> > >   * before any swap_info_struct->lock.
> > >   */
> > > -static struct plist_head *swap_avail_heads;
> > > +static PLIST_HEAD(swap_avail_head);
> > >  static DEFINE_SPINLOCK(swap_avail_lock);
> > >
> > >  static struct swap_info_struct *swap_info[MAX_SWAPFILES];
> > > @@ -995,7 +995,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > >  /* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
> > >  static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> > >  {
> > > -       int nid;
> > >         unsigned long pages;
> > >
> > >         spin_lock(&swap_avail_lock);
> > > @@ -1007,7 +1006,7 @@ static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> > >          * swap_avail_lock, to ensure the result can be seen by
> > >          * add_to_avail_list.
> > >          */
> > > -       lockdep_assert_held(&si->lock);
> > > +       //lockdep_assert_held(&si->lock);
> >
> > That seems like some debug stuff left over. If you need to remove it, remove it.
>
> I tried to adjust the lock usage at one point; later I found I need to keep
> it as is, but forgot to change this line back. Will change it in v2.

Ack.

> >
> > The rest of the patch looks fine to me. Thanks for working on it. That
> > is a very nice cleanup.
> >
> > Agree with YoungJun Park on removing the NUMA swap documentation as well.
>
> Exactly, will do.

Ack.

> >
> > Looking forward to your refreshed version. I should be able to add my
> > Acked-by on your next version.
>
> Thank you very much for your great suggestions from a big-picture view,
> and for your help during the process. Really appreciate it.

You are welcome. Thanks for the great work. I am glad we don't need to
carry that complexity any more.

Chris