From: Yuanchu Xie
Date: Fri, 6 Dec 2024 11:57:55 -0800
Subject: Re: [PATCH v4 0/9] mm: workingset reporting
To: Johannes Weiner
Cc: Andrew Morton, David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz,
 Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying,
 Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Tejun Heo,
 Michal Koutný, Jonathan Corbet, Greg Kroah-Hartman,
 "Rafael J. Wysocki", "Michael S. Tsirkin", Jason Wang, Xuan Zhuo,
 Eugenio Pérez, Michal Hocko, Roman Gushchin, Shakeel Butt,
 Muchun Song, Mike Rapoport, Shuah Khan, Christian Brauner,
 Daniel Watson, cgroups@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, virtualization@lists.linux.dev,
 linux-mm@kvack.org, linux-kselftest@vger.kernel.org, SeongJae Park
References: <20241127025728.3689245-1-yuanchu@google.com>
 <20241127072604.GA2501036@cmpxchg.org>
In-Reply-To: <20241127072604.GA2501036@cmpxchg.org>

Thanks for the response Johannes. Some replies inline.

On Tue, Nov 26, 2024 at 11:26 PM Johannes Weiner wrote:
>
> On Tue, Nov 26, 2024 at 06:57:19PM -0800, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace.
> > Another interesting idea might be hugepage workingset, so that we can
> > measure the proportion of hugepages backing cold memory. However, with
> > architectures like arm, there may be too many hugepage sizes leading to
> > a combinatorial explosion when exporting stats to the userspace.
> > Nonetheless, the kernel should provide a set of workingset interfaces
> > that is generic enough to accommodate the various use cases, and extensible
> > to potential future use cases.
>
> Doesn't DAMON already provide this information?
>
> CCing SJ.

Thanks for the CC. DAMON was really good at visualizing the memory
access frequencies last time I tried it out!
For server use cases, DAMON would benefit from integrations with
cgroups. The key then would be a standard interface for exporting a
cgroup's working set to the user. It would be good to have something
that will work for different backing implementations, DAMON, MGLRU, or
active/inactive LRU.
>
> > Use cases
> > ==========
> > Job scheduling
> > On overcommitted hosts, workingset information improves efficiency and
> > reliability by allowing the job scheduler to have better stats on the
> > exact memory requirements of each job. This can manifest in efficiency by
> > landing more jobs on the same host or NUMA node. On the other hand, the
> > job scheduler can also ensure each node has a sufficient amount of memory
> > and does not enter direct reclaim or the kernel OOM path. With workingset
> > information and job priority, the userspace OOM killing or proactive
> > reclaim policy can kick in before the system is under memory pressure.
> > If the job shape is very different from the machine shape, knowing the
> > workingset per-node can also help inform page allocation policies.
> >
> > Proactive reclaim
> > Workingset information allows a container manager to proactively
> > reclaim memory while not impacting a job's performance. While PSI may
> > provide a reactive measure of when a proactive reclaim has reclaimed too
> > much, workingset reporting allows the policy to be more accurate and
> > flexible.
>
> I'm not sure about more accurate.
>
> Access frequency is only half the picture. Whether you need to keep
> memory with a given frequency resident depends on the speed of the
> backing device.
>
> There is memory compression; there is swap on flash; swap on crappy
> flash; swapfiles that share IOPS with co-located filesystems. There is
> zswap+writeback, where avg refault speed can vary dramatically.
>
> You can of course offload much more to a fast zswap backend than to a
> swapfile on a struggling flashdrive, with comparable app performance.
>
> So I think you'd be hard pressed to achieve a high level of accuracy
> in the usecases you list without taking the (often highly dynamic)
> cost of paging / memory transfer into account.
>
> There is a more detailed discussion of this in a paper we wrote on
> proactive reclaim/offloading - in 2.5 Hardware Heterogeneity:
>
> https://www.cs.cmu.edu/~dskarlat/publications/tmo_asplos22.pdf
>

Yes, PSI takes the paging cost into account. I'm not claiming that
workingset reporting provides a superset of that information, but
rather that it can complement PSI. Sorry for the bad wording here.

> > Ballooning (similar to proactive reclaim)
> > The last patch of the series extends the virtio-balloon device to report
> > the guest workingset.
> > Balloon policies benefit from workingset to more precisely determine the
> > size of the memory balloon. On end-user devices where memory is scarce and
> > overcommitted, the balloon sizing in multiple VMs running on the same
> > device can be orchestrated with workingset reports from each one.
> > On the server side, workingset reporting allows the balloon controller to
> > inflate the balloon without causing too much file cache to be reclaimed in
> > the guest.

The ballooning use case is an important one. Having working set
information would allow us to inflate a balloon of the right size in
the guest.

> >
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, given a promotion hot page threshold defined in reaccess
> > distance of N seconds (promote pages accessed more often than every N
> > seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> > the fast memory node pass the threshold. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >...
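
As a rough illustration of the threshold calculation described in the
promotion/demotion example above, here is a minimal Python sketch. The
histogram layout (pairs of an idle-age upper bound in seconds and a
page count) and the name promotion_threshold are assumptions made for
this example only; they are not the report format from the series.

```python
from typing import List, Optional, Tuple

# Each bin is (upper_bound_seconds, nr_pages): pages whose idle age is below
# upper_bound_seconds but not below the previous bin's bound.
# Hypothetical layout for illustration, not the series' report format.
AgeHistogram = List[Tuple[float, int]]


def promotion_threshold(bins: AgeHistogram, target: float = 0.8) -> Optional[float]:
    """Return the smallest bin bound N such that at least `target` of all
    pages are younger than N seconds, or None for an empty histogram."""
    total = sum(nr for _, nr in bins)
    if total == 0:
        return None
    seen = 0
    for upper, nr in sorted(bins):
        seen += nr
        if seen >= target * total:
            return upper
    return None  # only reached if target > 1.0


# Made-up numbers for a fast memory node: 1000 pages in total.
fast_node = [(1, 500), (5, 250), (30, 150), (120, 60), (float("inf"), 40)]
print(promotion_threshold(fast_node))  # 30: 900 of 1000 pages are younger than 30s
```

The result can only be as fine-grained as the bin boundaries, so the
choice of bins bounds how closely N can track the ~80% target.
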
> >
> > Benchmarks
> > ==========
> > Ghait Ouled Amar Ben Cheikh has implemented a simple policy and ran Linux
> > compile and redis benchmarks from openbenchmarking.org. The policy and
> > runner is referred to as WMO (Workload Memory Optimization).
> > The results were based on v3 of the series, but v4 doesn't change the core
> > of the working set reporting and just adds the ballooning counterpart.
> >
> > The timed Linux kernel compilation benchmark shows improvements in peak
> > memory usage with a policy of "swap out all bytes colder than 10 seconds
> > every 40 seconds". A swapfile is configured on SSD.
> > --------------------------------------------
> > peak memory usage (with WMO): 4982.61328 MiB
> > peak memory usage (control): 9569.1367 MiB
> > peak memory reduction: 47.9%
> > --------------------------------------------
> > Benchmark | Experimental | Control | Experimental_Std_Dev | Control_Std_Dev
> > Timed Linux Kernel Compilation - allmodconfig (sec) | 708.486 (95.91%) | 679.499 (100%) | 0.6% | 0.1%
> > --------------------------------------------
> > Seconds, fewer is better
>
> You can do this with a recent (>2018) upstream kernel and ~100 lines
> of python [1]. It also works on both LRU implementations.
>
> [1] https://github.com/facebookincubator/senpai
>
> We use this approach in virtually the entire Meta fleet, to offload
> unneeded memory, estimate available capacity for job scheduling, plan
> future capacity needs, and provide accurate memory usage feedback to
> application developers.
>
> It works over a wide variety of CPU and storage configurations with no
> specific tuning.
>
> The paper I referenced above provides a detailed breakdown of how it
> all works together.
>
> I would be curious to see a more in-depth comparison to the prior art
> in this space. At first glance, your proposal seems more complex and
> less robust/versatile, at least for offloading and capacity gauging.

We have implemented TMO-style PSI-based proactive reclaim and compared
it to a kstaled-based reclaimer (reclaiming based on a 2-minute working
set and refaults). The PSI-based reclaimer was able to save more
memory, but it also caused spikes of refaults and a much higher rate of
decompressions per second. Overall, the test workloads performed better
with the kstaled-based reclaimer. The conclusion was that it is a
trade-off. Since we have some app classes that we don't want to induce
pressure on but still want to proactively reclaim from, there's a
missing piece.
I do agree there's not a good in-depth comparison with prior art
though.
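
To make the age-threshold style of policy concrete (for example the
"swap out all bytes colder than 10 seconds" policy quoted above, or a
reclaimer working from a 2-minute working set), here is a minimal
Python sketch of the sizing step. The bin layout and the names
cold_bytes and reclaim_request are hypothetical and chosen for this
example; this is neither the WMO implementation nor the report format
from the series.

```python
from typing import List, Tuple

# Hypothetical report layout: (min_idle_seconds, nr_bytes) per bin, where a
# bin holds memory idle for at least min_idle_seconds and less than the next
# bin's bound. This is not the format exported by the patch series.
AgeBins = List[Tuple[float, int]]

MiB = 1 << 20


def cold_bytes(bins: AgeBins, min_age_s: float) -> int:
    """Bytes in bins that lie entirely at or beyond min_age_s.

    A bin that straddles the threshold is skipped, so the estimate is
    conservative: it never counts memory that may be younger than asked.
    """
    return sum(nr for lower, nr in bins if lower >= min_age_s)


def reclaim_request(bins: AgeBins, min_age_s: float, max_step: int) -> int:
    """Bytes to ask for in one policy interval, capped at max_step."""
    return min(cold_bytes(bins, min_age_s), max_step)


# Made-up report for one cgroup.
report = [(0, 4096 * MiB), (10, 1024 * MiB), (60, 2048 * MiB), (120, 3072 * MiB)]
print(cold_bytes(report, 10) // MiB)                   # 6144
print(cold_bytes(report, 30) // MiB)                   # 5120, the 10s..60s bin is skipped
print(reclaim_request(report, 120, 512 * MiB) // MiB)  # 512
```

The resulting byte count would then be handed to whatever reclaim
mechanism the policy uses, for instance cgroup v2's memory.reclaim;
that step is left out here. Note that this sizing looks only at age,
so it does not account for the cost of the backing device raised
earlier in the thread.
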