From: "ying.huang@intel.com" <ying.huang@intel.com>
To: Aaron Lu <aaron.lu@intel.com>, Yang Shi <shy828301@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>,
Andrew Morton <akpm@linux-foundation.org>,
Linux MM <linux-mm@kvack.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid
Date: Thu, 21 Apr 2022 15:49:21 +0800
Message-ID: <f27ec36beb3cf1dbbfc3b8835e586d5d6fe7f561.camel@intel.com>
In-Reply-To: <Yl/FS9enAD4V8jG3@ziqianlu-nuc9qn>
On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <aaron.lu@intel.com> wrote:
> > >
> > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > The swap devices are linked to per-node priority lists, and the swap
> > > > device closer to a node has higher priority on that node's list.
> > > > This is supposed to improve I/O latency, particularly for some fast
> > > > devices. But the current code gets the nid by calling numa_node_id(),
> > > > which actually returns the nid of the node the reclaimer is running
> > > > on instead of the nid of the node the page belongs to.
> > > >
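For anyone not staring at mm/swapfile.c, the lookup in question is
roughly the sketch below (simplified and renamed, not the actual kernel
code or the actual patch; exact signatures vary by kernel version):

	/*
	 * Sketch only: each NUMA node has a plist of available swap
	 * devices, and the device attached to that node sits highest
	 * on that node's list (see swap_avail_heads[] in mm/swapfile.c).
	 */
	static int sketch_pick_swap_device(struct page *page)
	{
		struct swap_info_struct *si, *next;
		/*
		 * The current code keys the lookup on the node the
		 * reclaimer happens to be running on:
		 *	int node = numa_node_id();
		 * while the patch keys it on the node the page lives on:
		 */
		int node = page_to_nid(page);

		plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
					  avail_lists[node]) {
			/* try to allocate swap slots from si ... */
		}
		return 0;
	}
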
> > >
> > > Right.
> > >
> > > > Pass the page's nid down to get_swap_pages() in order to pick the
> > > > right swap device. This doesn't work for the swap slots cache, which
> > > > is per-CPU. We could skip the swap slots cache if the current node is
> > > > not the page's node, but that may be overkill, so keep using the
> > > > current node's swap slots cache. The issue was found by visual code
> > > > inspection, so it is not clear how much improvement can be achieved,
> > > > due to the lack of a suitable testing device. But the current code
> > > > does violate the design either way.
> > > >
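As for the per-CPU swap slots cache: the "skip the cache" alternative
mentioned above (and judged overkill) would look roughly like the sketch
below, where get_entry_from_cache()/get_entry_direct() are purely
hypothetical helper names used for illustration, not real kernel
functions:

	/* Illustration of the rejected alternative only. */
	static swp_entry_t sketch_alloc_swap(struct page *page)
	{
		int page_node = page_to_nid(page);

		if (page_node == numa_node_id())
			/* fast path: the batched per-CPU slot cache */
			return get_entry_from_cache();

		/* bypass the cache; allocate keyed on the page's node */
		return get_entry_direct(page_node);
	}
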
> > >
> > > I intentionally used the reclaimer's nid because I think when swapping
> > > out to a device, it is faster when the device is on the same node as
> > > the cpu.
> >
> > OK, the offline discussion with Huang Ying showed the design was to
> > use the page's nid in order to achieve better I/O performance (more
> > noticeable on faster devices), since the reclaimer may be running on a
> > different node from the reclaimed page.
> >
> > >
> > > Anyway, I think I can make a test case where the workload allocates all
> > > its memory on the remote node and its workingset memory is larger than
> > > the available memory so swap is triggered, then we can see which way
> > > achieves better performance. Sounds reasonable to you?
> >
> > Yeah, definitely, thank you so much. I don't have a fast enough device
> > at hand to show the difference right now. If you could get some data,
> > that would be perfect.
> >
>
> I failed to find a test box that has two NVMe disks attached to different
> nodes, and since Shanghai is locked down right now, we couldn't install
> another NVMe on the box. So I figured it might be OK to test on a box
> that has a single NVMe attached to node 0, like this:
>
> 1) restrict the test processes to run on node 0 and allocate on node 1;
> 2) restrict the test processes to run on node 1 and allocate on node 0.
>
> In case 1), the reclaimer's node id is the same as the swap device's, so
> it matches the current behaviour; in case 2), the page's node id is the
> same as the swap device's, so it matches what your patch proposes.
>
> The test I used is vm-scalability/case-swap-w-rand:
> https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> which spawns $nr_task processes, each of which mmaps $size and then
> randomly writes to that area. I set nr_task=32 and $size=4G, so a total
> of 128G of memory is needed, and I used memory.limit_in_bytes to
> restrict the available memory to 64G to make sure swap is triggered.
>
> The reason a cgroup is used is to avoid waking up the per-node kswapd,
> which can trigger swapping with the reclaimer, page and swap device all
> having the same node id.
>
> And I don't see a measurable difference in the results:
> case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> case2(using page's node id) vm-scalability.throughput: 10567 KB/s
>
> My interpretation of the result is that, when reclaiming a remote page,
> it doesn't matter much which swap device is used if the swap device is
> an IO device.
>
> Later Ying reminded me that we have a test box with Optane installed on
> different nodes, so I also tested there: an Icelake 2-socket server with
> 2 Optane devices installed, one on each node. I did the test there like
> this:
> 1) restrict the test processes to run on node 0 and allocate on node 1,
> and only swapon pmem0, which is the Optane-backed swap device on node 0;
> 2) restrict the test processes to run on node 0 and allocate on node 1,
> and only swapon pmem1, which is the Optane-backed swap device on node 1.
>
> So case 1) is current behaviour and case 2) is what your patch proposed.
>
> With the same test and the same nr_task/size, the result is:
> case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> case2(using page's node id) vm-scalability.throughput: 58753 KB/s
>
The per-node swap device support is more about swap-in latency than
swap-out throughput. I suspect the test case is more about swap-out
throughput. perf profiling can show this.

For swap-in latency, we can use pmbench, which can output latency
information.

Best Regards,
Huang, Ying
[snip]