From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20220407020953.475626-1-shy828301@gmail.com> <Yk6cutNf5sOuYbDl@ziqianlu-nuc9qn>
 <CAHbLzkq+eKcKCsxXDTiOcBxk8FjMdWBqOxwi4N_NG7PZWbAAkA@mail.gmail.com>
 <Yl/FS9enAD4V8jG3@ziqianlu-nuc9qn> <f27ec36beb3cf1dbbfc3b8835e586d5d6fe7f561.camel@intel.com>
 <YmFmL42W+OrORElV@ziqianlu-nuc9qn>
In-Reply-To: <YmFmL42W+OrORElV@ziqianlu-nuc9qn>
From: Yang Shi <shy828301@gmail.com>
Date: Thu, 21 Apr 2022 10:19:37 -0700
Message-ID: <CAHbLzkqU7T_ABvQEmLe382JtwcB=m2-Y+smo8ZBDdQF4C=b_yQ@mail.gmail.com>
Subject: Re: [PATCH] mm: swap: determine swap device by using page nid
To: Aaron Lu <aaron.lu@intel.com>
Cc: "ying.huang@intel.com" <ying.huang@intel.com>, Michal Hocko <mhocko@suse.com>, 
	Andrew Morton <akpm@linux-foundation.org>, Linux MM <linux-mm@kvack.org>, 
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Content-Type: text/plain; charset="UTF-8"
List-ID: <linux-mm.kvack.org>

On Thu, Apr 21, 2022 at 7:12 AM Aaron Lu <aaron.lu@intel.com> wrote:
>
> On Thu, Apr 21, 2022 at 03:49:21PM +0800, ying.huang@intel.com wrote:
> > On Wed, 2022-04-20 at 16:33 +0800, Aaron Lu wrote:
> > > On Thu, Apr 07, 2022 at 10:36:54AM -0700, Yang Shi wrote:
> > > > On Thu, Apr 7, 2022 at 1:12 AM Aaron Lu <aaron.lu@intel.com> wrote:
> > > > >
> > > > > On Wed, Apr 06, 2022 at 07:09:53PM -0700, Yang Shi wrote:
> > > > > > > The swap devices are linked to per node priority lists; the swap device
> > > > > > > closer to a node has higher priority on that node's priority list.
> > > > > > This is supposed to improve I/O latency, particularly for some fast
> > > > > > devices.  But the current code gets nid by calling numa_node_id() which
> > > > > > actually returns the nid that the reclaimer is running on instead of the
> > > > > > nid that the page belongs to.
> > > > > >
> > > > >
> > > > > Right.
> > > > >
> > > > > > > Pass the page's nid down to get_swap_pages() in order to pick the
> > > > > > > right swap device.  But this doesn't work for the swap slots cache, which
> > > > > > > is per cpu.  We could skip the swap slots cache if the current node is not
> > > > > > > the page's node, but that may be overkill, so keep using the current
> > > > > > > node's swap slots cache.  The issue was found by code inspection, so
> > > > > > > it is unclear how much improvement can be achieved, for lack of a
> > > > > > > suitable test device.  But the current code does violate the
> > > > > > > design.
> > > > > >
> > > > >
> > > > > I intentionally used the reclaimer's nid because I think when swapping
> > > > > out to a device, it is faster when the device is on the same node as
> > > > > the cpu.
> > > >
> > > > OK, the offline discussion with Huang Ying showed the design was to
> > > > have page's nid in order to achieve better I/O performance (more
> > > > noticeable on faster devices) since the reclaimer may be running on a
> > > > different node from the reclaimed page.
> > > >
> > > > >
> > > > > Anyway, I think I can make a test case where the workload allocates all
> > > > > > its memory on the remote node and its workingset memory is larger than
> > > > > the available memory so swap is triggered, then we can see which way
> > > > > achieves better performance. Sounds reasonable to you?
> > > >
> > > > > Yeah, definitely, thank you so much. I don't have a fast enough device
> > > > > at hand to show the difference right now. If you could get some data
> > > > > it would be perfect.
> > > >
> > >
> > > Failed to find a test box that has two NVMe disks attached to different
> > > nodes and since Shanghai is locked down right now, we couldn't install
> > > another NVMe on the box so I figured it might be OK to test on a box that
> > > has a single NVMe attached to node 0 like this:
> > >
> > > 1) restrict the test processes to run on node 0 and allocate on node 1;
> > > 2) restrict the test processes to run on node 1 and allocate on node 0.
> > >
> > > In case 1), the reclaimer's node id is the same as the swap device's so
> > > it's the same as current behaviour and in case 2), the page's node id is
> > > the same as the swap device's so it's what your patch proposed.
> > >
> > > The test I used is vm-scalability/case-swap-w-rand:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/wfg/vm-scalability.git/tree/case-swap-w-seq
> > > which spawns $nr_task processes and each will mmap $size and then
> > > randomly write to that area. I set nr_task=32 and $size=4G, so a total
> > > of 128G memory will be needed and I used memory.limit_in_bytes to
> > > restrict the available memory to 64G, to make sure swap is triggered.
> > >
> > > The reason why cgroup is used is to avoid waking up the per-node kswapd
> > > which can trigger swapping with reclaimer/page/swap device all having the
> > > same node id.
> > >
> > > > And I don't see a measurable difference in the results:
> > > case1(using reclaimer's node id) vm-scalability.throughput: 10574 KB/s
> > > case2(using page's node id)      vm-scalability.throughput: 10567 KB/s
> > >
> > > > My interpretation of the result is, when reclaiming a remote page, it
> > > > doesn't matter much which swap device to use if the swap device is an I/O
> > > > device.
> > >
> > > > Later Ying reminded me we have a test box that has Optane installed on
> > > > different nodes so I also tested there: Icelake 2 sockets server with 2
> > > > Optane installed on each node. I did the test there like this:
> > > 1) restrict the test processes to run on node 0 and allocate on node 1
> > >    and only swapon pmem0, which is the optane backed swap device on node 0;
> > > 2) restrict the test processes to run on node 0 and allocate on node 1
> > >    and only swapon pmem1, which is the optane backed swap device on node 1.
> > >
> > > So case 1) is current behaviour and case 2) is what your patch proposed.
> > >
> > > With the same test and the same nr_task/size, the result is:
> > > case1(using reclaimer's node id) vm-scalability.throughput: 71033 KB/s
> > > case2(using page's node id)      vm-scalability.throughput: 58753 KB/s
> > >
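
To put the two pairs of numbers quoted above side by side, here is a
minimal sketch that computes the relative change of case 2 against case 1.
The helper name relative_change is made up for illustration; only the four
throughput figures come from the thread.

```python
def relative_change(baseline_kbps: float, candidate_kbps: float) -> float:
    """Return the change of candidate relative to baseline, as a fraction."""
    return (candidate_kbps - baseline_kbps) / baseline_kbps


# vm-scalability.throughput figures (KB/s) quoted in the thread:
# case 1 uses the reclaimer's node id, case 2 uses the page's node id.
nvme = relative_change(10574, 10567)     # single NVMe attached to node 0
optane = relative_change(71033, 58753)   # Optane-backed swap on each node

print(f"NVMe:   {nvme:+.2%}")
print(f"Optane: {optane:+.2%}")
```

On the NVMe box the difference is well under 1%, while on the Optane box
case 2 is roughly 17% lower than case 1. This only restates the throughput
results; it says nothing about swap-in latency.
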
> >
> > The per-node swap device support is more about swap-in latency than
> > swap-out throughput.  I suspect the test case is more about swap-out
> > throughput.  perf profiling can show this.
> >
>
> On second thought, swap out can very well affect swap in latency:
> since swap is involved, the available memory is in short supply and swap
> in may very likely need to reclaim a page and that reclaim can involve a
> swap out, so swap out performance can also affect swap in latency.

If you count page allocation latency, yes. But I think we could just
measure the I/O latency itself, for example the time spent in
swap_readpage(). I suppose the per-node swap device design is aimed at
minimizing I/O latency.
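
For anyone following the thread, the difference between the two ways of
picking the nid can be shown with a toy model. This is only a sketch under
simplifying assumptions, not the kernel code: SwapDevice, build_node_lists
and pick_device are hypothetical names, and each per-node list simply puts
devices attached to that node first. The pmem0/pmem1 layout mirrors the
Optane setup Aaron described (reclaimer on node 0, pages on node 1).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwapDevice:
    name: str
    node: int  # NUMA node the device is attached to


def build_node_lists(devices, nr_nodes):
    """Build one list per node, with devices local to that node first.

    Simplified stand-in for the per-node priority lists: sorted() is
    stable and False sorts before True, so local devices lead the list.
    """
    return {
        nid: sorted(devices, key=lambda dev: dev.node != nid)
        for nid in range(nr_nodes)
    }


def pick_device(node_lists, nid):
    """Pick the highest-priority device on the list of the given node."""
    return node_lists[nid][0]


devices = [SwapDevice("pmem0", 0), SwapDevice("pmem1", 1)]
node_lists = build_node_lists(devices, nr_nodes=2)

reclaimer_nid = 0  # node the reclaiming task runs on
page_nid = 1       # node the reclaimed page belongs to

by_reclaimer = pick_device(node_lists, reclaimer_nid)  # current behaviour
by_page = pick_device(node_lists, page_nid)            # proposed behaviour

print("reclaimer nid ->", by_reclaimer.name)
print("page nid      ->", by_page.name)
```

With the reclaimer on node 0 and the page on node 1, the model picks pmem0
when keyed by the reclaimer's nid and pmem1 when keyed by the page's nid,
which is exactly the pair of cases being compared.
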

>
> > For swap-in latency, we can use pmbench, which can output latency
> > information.
> >
> > Best Regards,
> > Huang, Ying
> >
> >
> > [snip]
> >
>