From mboxrd@z Thu Jan  1 00:00:00 1970
MIME-Version: 1.0
References: <20241127025728.3689245-1-yuanchu@google.com> <20241127072604.GA2501036@cmpxchg.org>
In-Reply-To: <20241127072604.GA2501036@cmpxchg.org>
From: Yu Zhao <yuzhao@google.com>
Date: Wed, 27 Nov 2024 16:33:06 -0700
Message-ID: <CAOUHufZ04fUgPUba89edv0UDLSiz7w+VJp-nbKPiVD8B-MMdfQ@mail.gmail.com>
Subject: Re: [PATCH v4 0/9] mm: workingset reporting
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Yuanchu Xie <yuanchu@google.com>, Andrew Morton <akpm@linux-foundation.org>, 
	David Hildenbrand <david@redhat.com>, "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, 
	Khalid Aziz <khalid.aziz@oracle.com>, Henry Huang <henry.hj@antgroup.com>, 
	Dan Williams <dan.j.williams@intel.com>, Gregory Price <gregory.price@memverge.com>, 
	Huang Ying <ying.huang@intel.com>, Lance Yang <ioworker0@gmail.com>, 
	Randy Dunlap <rdunlap@infradead.org>, Muhammad Usama Anjum <usama.anjum@collabora.com>, 
	Tejun Heo <tj@kernel.org>, Michal Koutný <mkoutny@suse.com>, 
	Jonathan Corbet <corbet@lwn.net>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, 
	"Rafael J. Wysocki" <rafael@kernel.org>, "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>, 
	Xuan Zhuo <xuanzhuo@linux.alibaba.com>, Eugenio Pérez <eperezma@redhat.com>, 
	Michal Hocko <mhocko@kernel.org>, Roman Gushchin <roman.gushchin@linux.dev>, 
	Shakeel Butt <shakeel.butt@linux.dev>, Muchun Song <muchun.song@linux.dev>, 
	Mike Rapoport <rppt@kernel.org>, Shuah Khan <shuah@kernel.org>, 
	Christian Brauner <brauner@kernel.org>, Daniel Watson <ozzloy@each.do>, cgroups@vger.kernel.org, 
	linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, 
	virtualization@lists.linux.dev, linux-mm@kvack.org, 
	linux-kselftest@vger.kernel.org, SeongJae Park <sj@kernel.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Wed, Nov 27, 2024 at 12:26 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
>
> Doesn't DAMON already provide this information?

Yuanchu might be able to answer this question a lot better than I can,
since he studied DAMON and tried to leverage it in our fleet.

My impression is that there are some fundamental differences in access
detection and accounting mechanisms between the two, i.e., sampling vs
scanning-based detection and non-lruvec vs lruvec-based accounting.

> CCing SJ.
>
> > Use cases
> > ==========
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and
> > reliability by allowing the job scheduler to have better stats on the
> > exact memory requirements of each job. This can manifest in efficiency by
> > landing more jobs on the same host or NUMA node. On the other hand, the
> > job scheduler can also ensure each node has a sufficient amount of memory
> > and does not enter direct reclaim or the kernel OOM path. With workingset
> > information and job priority, the userspace OOM killing or proactive
> > reclaim policy can kick in before the system is under memory pressure.
> > If the job shape is very different from the machine shape, knowing the
> > workingset per-node can also help inform page allocation policies.
> >
> > Proactive reclaim
> > Workingset information allows a container manager to proactively
> > reclaim memory while not impacting a job's performance. While PSI may
> > provide a reactive measure of when a proactive reclaim has reclaimed too
> > much, workingset reporting allows the policy to be more accurate and
> > flexible.
>
> I'm not sure about more accurate.

Agreed. This is a (very) poor argument, unless there are facts to back
this up.

> Access frequency is only half the picture. Whether you need to keep
> memory with a given frequency resident depends on the speed of the
> backing device.

Along a similar line, we also need to consider use cases that don't
involve backing storage, e.g., far memory (remote node). More details below.

> There is memory compression; there is swap on flash; swap on crappy
> flash; swapfiles that share IOPS with co-located filesystems. There is
> zswap+writeback, where avg refault speed can vary dramatically.
>
> You can of course offload much more to a fast zswap backend than to a
> swapfile on a struggling flashdrive, with comparable app performance.
>
> So I think you'd be hard pressed to achieve a high level of accuracy
> in the usecases you list without taking the (often highly dynamic)
> cost of paging / memory transfer into account.
>
> There is a more detailed discussion of this in a paper we wrote on
> proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
>
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
>
> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report
> > the guest workingset.
> > Balloon policies benefit from workingset to more precisely determine the
> > size of the memory balloon. On end-user devices where memory is scarce and
> > overcommitted, the balloon sizing in multiple VMs running on the same
> > device can be orchestrated with workingset reports from each one.
> > On the server side, workingset reporting allows the balloon controller to
> > inflate the balloon without causing too much file cache to be reclaimed in
> > the guest.
> >
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, suppose the promotion hot page threshold is defined as a
> > reaccess distance of N seconds (promote pages accessed more often than
> > every N seconds). The threshold N should then be set so that ~80% (e.g.)
> > of the pages on the fast memory node pass the threshold. This calculation
> > can be done with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
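
The threshold calculation in the promotion example above could be
sketched roughly as follows, in Python. This assumes a per-node
histogram of reaccess distance is available as (upper bound in seconds,
pages) pairs, which the series does not report today since hotness
reporting is still listed as future work. pick_threshold() and the
sample numbers are hypothetical.

```python
from itertools import accumulate
from typing import List, Tuple


def pick_threshold(hist: List[Tuple[int, int]], target: float = 0.8) -> int:
    """Return the smallest bucket bound N such that at least `target` of
    the pages have a reaccess distance below N."""
    total = sum(pages for _, pages in hist)
    if total == 0:
        raise ValueError("empty histogram")
    running = accumulate(pages for _, pages in hist)
    for (bound, _), cumulative in zip(hist, running):
        if cumulative >= target * total:
            return bound
    # Only reachable through floating-point rounding; fall back to the
    # largest bound.
    return hist[-1][0]


# Hypothetical fast-node histogram:
# (reaccess distance upper bound in seconds, pages in that bucket)
FAST_NODE = [(1, 500), (5, 250), (10, 150), (60, 100)]

threshold = pick_threshold(FAST_NODE)
print(f"promote pages reaccessed within {threshold} seconds")
```

With these made-up numbers, 90% of the pages fall below the 10 second
bound and only 75% below the 5 second bound, so 10 is the smallest
bound that satisfies an 80% target.
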
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea here is we break down the workingset per-node per-memcg into time
> > intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
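
A minimal Python sketch of how userspace might parse a report in the
format shown above. It assumes each line has the form
"<interval upper bound in ms> anon=<amount> file=<amount>", as in the
sample. The names Bucket and parse_report are hypothetical and not part
of the series, and the unit of the amounts is not stated in this cover
letter, so the values are treated as opaque integers.

```python
from typing import List, NamedTuple


class Bucket(NamedTuple):
    upper_ms: int  # upper bound of the interval, in milliseconds
    anon: int      # amount of anonymous memory reported for the interval
    file: int      # amount of file-backed memory reported for the interval


def parse_report(text: str) -> List[Bucket]:
    """Parse lines of the form '<upper_ms> anon=<n> file=<n>'."""
    buckets = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        values = dict(field.split("=", 1) for field in fields[1:])
        buckets.append(
            Bucket(int(fields[0]), int(values["anon"]), int(values["file"]))
        )
    return buckets


# Sample report copied from the cover letter.
SAMPLE = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""

report = parse_report(SAMPLE)
for bucket in report:
    print(bucket)
```

The bound on the last line, 9223372036854775807, equals 2^63 - 1 and
appears to act as a catch-all for everything older than 40000 ms.
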
> > Implementation
> > ==========
> > The reporting of user pages is based off of MGLRU, and therefore requires
> > CONFIG_LRU_GEN=y. We would benefit from more MGLRU generations for a more
> > fine-grained workingset report, but we can already gather a lot of data
> > with just four generations. The workingset reporting mechanism is gated
> > behind CONFIG_WORKINGSET_REPORT, and the aging thread is behind
> > CONFIG_WORKINGSET_REPORT_AGING.
> >
> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > compile and redis benchmarks from openbenchmarking.org. The policy and
> > runner are referred to as WMO (Workload Memory Optimization).
> > The results were based on v3 of the series, but v4 doesn't change the core
> > of the working set reporting and just adds the ballooning counterpart.
> >
> > The timed Linux kernel compilation benchmark shows improvements in peak
> > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > every 40 seconds". A swapfile is configured on SSD.
> > --------------------------------------------
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > --------------------------------------------
> > Benchmark                                           | Experimental     | Control        | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6%                 | 0.1%
> > --------------------------------------------
> > Seconds, fewer is better
>
> You can do this with a recent (>2018) upstream kernel and ~100 lines
> of python [1]. It also works on both LRU implementations.
>
> [1] https://github.com/facebookincubator/senpai
>
> We use this approach in virtually the entire Meta fleet, to offload
> unneeded memory, estimate available capacity for job scheduling, plan
> future capacity needs, and provide accurate memory usage feedback to
> application developers.
>
> It works over a wide variety of CPU and storage configurations with no
> specific tuning.

How would Senpai work for use cases that don't have local storage,
i.e., all memory is mapped by either the fast or the slow tier? (>95%
memory usage in our fleet is mapped and local storage for non-storage
servers is only scratch space.)

My current understanding is that its approach would not be able to
form a feedback loop because there are currently no refaults from the
slow tier (because it's also mapped), and that's where I think this
proposal or something similar can help.

Also this proposal reports histograms, not scalars. So in theory,
userspace can see the projections of its potential actions, rather
than solely rely on trial and error. Of course, this needs to be
backed with data. So yes, some comparisons from real-world use cases
would be very helpful to demonstrate the value of this proposal.
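
To illustrate the histogram-versus-scalar point, here is a rough Python
sketch of such a projection, using the sample report from the cover
letter. It assumes each bucket holds memory whose idle age falls
between the previous bound and its own bound. That reading, and the
helper name colder_than(), are assumptions rather than something taken
from the patches.

```python
from itertools import accumulate
from typing import Dict, List, Tuple

# (idle-age upper bound in ms, anon, file), copied from the sample report.
BUCKETS = [
    (1000, 137368, 24530),
    (20000, 34342, 0),
    (30000, 353232, 333608),
    (40000, 407198, 206052),
    (9223372036854775807, 4925624, 892892),
]


def colder_than(buckets: List[Tuple[int, int, int]]) -> Dict[int, int]:
    """Map each finite bound to the anon+file total idle at least that long."""
    totals = [anon + file for _, anon, file in buckets]
    # suffix[i] is the sum of totals[i:], i.e. bucket i and everything older.
    suffix = list(accumulate(reversed(totals)))[::-1]
    # Memory older than the bound of bucket i-1 lives in buckets i and later.
    return {buckets[i - 1][0]: suffix[i] for i in range(1, len(buckets))}


projection = colder_than(BUCKETS)
for bound_ms, amount in projection.items():
    print(f"idle >= {bound_ms} ms: {amount}")
```

Under that assumption, a policy could see up front that reclaiming
everything idle for at least 30000 ms would target 6431766 units, and
compare that against other thresholds before taking any action.
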