From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20241127025728.3689245-1-yuanchu@google.com> <20241127072604.GA2501036@cmpxchg.org>
In-Reply-To: <20241127072604.GA2501036@cmpxchg.org>
From: Yuanchu Xie <yuanchu@google.com>
Date: Fri, 6 Dec 2024 11:57:55 -0800
Message-ID: <CAJj2-QFdP6DKVQJ4Tw6rdV+XtgDihe=UOnvm4cm-q61K0hq6CQ@mail.gmail.com>
Subject: Re: [PATCH v4 0/9] mm: workingset reporting
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrew Morton <akpm@linux-foundation.org>, David Hildenbrand <david@redhat.com>, 
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, Khalid Aziz <khalid.aziz@oracle.com>, 
	Henry Huang <henry.hj@antgroup.com>, Yu Zhao <yuzhao@google.com>, 
	Dan Williams <dan.j.williams@intel.com>, Gregory Price <gregory.price@memverge.com>, 
	Huang Ying <ying.huang@intel.com>, Lance Yang <ioworker0@gmail.com>, 
	Randy Dunlap <rdunlap@infradead.org>, Muhammad Usama Anjum <usama.anjum@collabora.com>, 
	Tejun Heo <tj@kernel.org>, =?UTF-8?Q?Michal_Koutn=C3=BD?= <mkoutny@suse.com>, 
	Jonathan Corbet <corbet@lwn.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, 
	"Rafael J. Wysocki" <rafael@kernel.org>, "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, 
	Xuan Zhuo <xuanzhuo@linux.alibaba.com>, =?UTF-8?Q?Eugenio_P=C3=A9rez?= <eperezma@redhat.com>, 
	Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, 
	Shakeel Butt <shakeel.butt@linux.dev>, Muchun Song <muchun.song@linux.dev>, 
	Mike Rapoport <rppt@kernel.org>, Shuah Khan <shuah@kernel.org>, 
	Christian Brauner <brauner@kernel.org>, Daniel Watson <ozzloy@each.do>, cgroups@vger.kernel.org, 
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, 
	virtualization@lists.linux.dev, linux-mm@kvack.org, 
	linux-kselftest@vger.kernel.org, SeongJae Park <sj@kernel.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Thanks for the response, Johannes. Some replies inline.

On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
>
> Doesn't DAMON already provide this information?
>
> CCing SJ.

Thanks for the CC. DAMON was really good at visualizing memory access
frequencies the last time I tried it out! For server use cases, DAMON
would benefit from integration with cgroups. The key then would be a
standard interface for exporting a cgroup's working set to the user.
It would be good to have something that works across different backing
implementations: DAMON, MGLRU, or the active/inactive LRU.

>
> > Use cases
> > ==========
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and
> > reliability by allowing the job scheduler to have better stats on the
> > exact memory requirements of each job. This can manifest in efficiency by
> > landing more jobs on the same host or NUMA node. On the other hand, the
> > job scheduler can also ensure each node has a sufficient amount of memory
> > and does not enter direct reclaim or the kernel OOM path. With workingset
> > information and job priority, the userspace OOM killing or proactive
> > reclaim policy can kick in before the system is under memory pressure.
> > If the job shape is very different from the machine shape, knowing the
> > workingset per-node can also help inform page allocation policies.
> >
> > Proactive reclaim
> > Workingset information allows a container manager to proactively
> > reclaim memory while not impacting a job's performance. While PSI may
> > provide a reactive measure of when a proactive reclaim has reclaimed too
> > much, workingset reporting allows the policy to be more accurate and
> > flexible.
>
> I'm not sure about more accurate.
>
> Access frequency is only half the picture. Whether you need to keep
> memory with a given frequency resident depends on the speed of the
> backing device.
>
> There is memory compression; there is swap on flash; swap on crappy
> flash; swapfiles that share IOPS with co-located filesystems. There is
> zswap+writeback, where avg refault speed can vary dramatically.
>
> You can of course offload much more to a fast zswap backend than to a
> swapfile on a struggling flashdrive, with comparable app performance.
>
> So I think you'd be hard pressed to achieve a high level of accuracy
> in the usecases you list without taking the (often highly dynamic)
> cost of paging / memory transfer into account.
>
> There is a more detailed discussion of this in a paper we wrote on
> proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
>
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
>
Yes, PSI takes the paging cost into account. I'm not claiming that
workingset reporting provides a superset of that information, but rather
that it can complement PSI. Sorry for the bad wording here.
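
One way to picture "complement" is a policy that sizes its reclaim
request from the workingset report and uses PSI as a brake. A rough
sketch only: the function names and the 1.0 bound on "some avg10" are
hypothetical and not part of this series. The only thing taken as given
is the PSI text format found in files such as a cgroup's
memory.pressure.

```python
def parse_psi(text: str) -> dict:
    """Parse PSI text, e.g. the contents of a cgroup's memory.pressure."""
    out = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        pairs = (f.split("=") for f in fields)
        out[kind] = {key: float(val) for key, val in pairs}
    return out


def reclaim_target(cold_bytes: int, psi_text: str,
                   max_some_avg10: float = 1.0) -> int:
    """Reclaim the cold bytes unless recent memory pressure is too high.

    cold_bytes is assumed to come from a workingset report;
    max_some_avg10 is an illustrative bound, not a recommended value.
    """
    some_avg10 = parse_psi(psi_text)["some"]["avg10"]
    return 0 if some_avg10 > max_some_avg10 else cold_bytes


sample = (
    "some avg10=0.25 avg60=0.10 avg300=0.02 total=12345\n"
    "full avg10=0.00 avg60=0.00 avg300=0.00 total=678\n"
)
print(reclaim_target(1 << 20, sample))
```

Here the report answers "how much is cold" and PSI answers "is the
workload already hurting", so neither signal has to stand in for the
other.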

> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report
> > the guest workingset.
> > Balloon policies benefit from workingset to more precisely determine the
> > size of the memory balloon. On end-user devices where memory is scarce and
> > overcommitted, the balloon sizing in multiple VMs running on the same
> > device can be orchestrated with workingset reports from each one.
> > On the server side, workingset reporting allows the balloon controller to
> > inflate the balloon without causing too much file cache to be reclaimed in
> > the guest.

The ballooning use case is an important one. Having working set
information would allow us to inflate a balloon of the right size in
the guest.

> >
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, take a promotion hot page threshold defined as a reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node pass the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
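
A rough sketch of that threshold calculation, assuming a hypothetical
report shape of (upper bound of idle age in seconds, bytes) buckets for
the fast node, sorted by ascending age. This is not the format the
series exports, and the sample numbers are made up for illustration.

```python
from typing import List, Tuple

# Hypothetical bucket: (upper bound of idle age in seconds, bytes).
Bucket = Tuple[float, int]


def promotion_threshold(buckets: List[Bucket],
                        target: float = 0.8) -> float:
    """Smallest bucket bound N such that at least `target` of the
    bytes on the node were accessed within the last N seconds."""
    total = sum(nbytes for _, nbytes in buckets)
    if total == 0:
        raise ValueError("empty workingset report")
    seen = 0
    for bound, nbytes in buckets:
        seen += nbytes
        if seen >= target * total:
            return bound
    return buckets[-1][0]


# Made-up fast-node report: most bytes were touched recently.
fast_node = [(1, 500), (10, 350), (60, 100), (float("inf"), 50)]
print(promotion_threshold(fast_node))
```

The result is only as fine-grained as the bucket boundaries, so the
age intervals of the report bound how precisely N can be chosen.
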
> >...
> >
> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > compile and redis benchmarks from openbenchmarking.org. The policy and
> > runner is referred to as WMO (Workload Memory Optimization).
> > The results were based on v3 of the series, but v4 doesn't change the core
> > of the working set reporting and just adds the ballooning counterpart.
> >
> > The timed Linux kernel compilation benchmark shows improvements in peak
> > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > every 40 seconds". A swapfile is configured on SSD.
> > --------------------------------------------
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > --------------------------------------------
> > Benchmark                                           | Experimental     | Control        | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
> > --------------------------------------------
> > Seconds, fewer is better
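
For readers unfamiliar with the policy, "swap out all bytes colder than
10 seconds every 40 seconds" can be sketched roughly as below. This is
not the WMO implementation: the bucket format and helper names are
hypothetical, and the sketch assumes the cgroup v2 memory.reclaim file,
which takes a byte count and reclaims from the cgroup without targeting
specific cold pages.

```python
from pathlib import Path
from typing import Iterable, Tuple

# Hypothetical bucket: (minimum idle age in seconds, bytes that have
# been idle at least that long but less than the next bucket's bound).
Bucket = Tuple[float, int]


def bytes_colder_than(buckets: Iterable[Bucket],
                      threshold_s: float) -> int:
    """Sum the bytes in buckets whose minimum idle age >= threshold."""
    return sum(n for min_age, n in buckets if min_age >= threshold_s)


def request_reclaim(cgroup_dir: str, nbytes: int) -> None:
    """Ask the kernel to reclaim nbytes from a cgroup (v2).

    Needs write access to the cgroup; not called in this example.
    """
    if nbytes > 0:
        Path(cgroup_dir, "memory.reclaim").write_text(str(nbytes))


# Made-up report: 4 GiB hot, 1 GiB idle >= 10 s, 2 GiB idle >= 60 s.
report = [(0, 4 << 30), (10, 1 << 30), (60, 2 << 30)]
print(bytes_colder_than(report, threshold_s=10))  # 3221225472 (3 GiB)
```

A runner would call request_reclaim() with that byte count once per
40-second period, re-reading the report each time.
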
>
> You can do this with a recent (>2018) upstream kernel and ~100 lines
> of python [1]. It also works on both LRU implementations.
>
> [1] https://github.com/facebookincubator/senpai
>
> We use this approach in virtually the entire Meta fleet, to offload
> unneeded memory, estimate available capacity for job scheduling, plan
> future capacity needs, and provide accurate memory usage feedback to
> application developers.
>
> It works over a wide variety of CPU and storage configurations with no
> specific tuning.
>
> The paper I referenced above provides a detailed breakdown of how it
> all works together.
>
> I would be curious to see a more in-depth comparison to the prior art
> in this space. At first glance, your proposal seems more complex and
> less robust/versatile, at least for offloading and capacity gauging.

We have implemented the TMO PSI-based proactive reclaim and compared it
to a kstaled-based reclaimer (reclaiming based on a 2-minute working
set and refaults). The PSI-based reclaimer was able to save more
memory, but it also caused spikes of refaults and a much higher rate of
decompressions per second. Overall, the test workloads performed better
with the kstaled-based reclaimer. The conclusion was that it is a
trade-off. Since we have some app classes on which we don't want to
induce pressure but from which we still want to proactively reclaim,
there's a missing piece. I do agree that there isn't a good in-depth
comparison with prior art, though.