Date: Tue, 20 Aug 2024 09:00:37 -0400
From: Gregory Price
To: Yuanchu Xie
Cc: David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz, Henry Huang,
    Yu Zhao, Dan Williams, Gregory Price, Huang Ying, Andrew Morton,
    Lance Yang, Randy Dunlap, Muhammad Usama Anjum, Kalesh Singh, Wei Xu,
    David Rientjes, Greg Kroah-Hartman, "Rafael J. Wysocki",
    Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
    Shuah Khan, Yosry Ahmed, Matthew Wilcox, Sudarshan Rajagopalan,
    Kairui Song, "Michael S. Tsirkin", Vasily Averin, Nhat Pham,
    Miaohe Lin, Qi Zheng, Abel Wu, "Vishal Moola (Oracle)", Kefeng Wang,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v3 0/7] mm: workingset reporting
References: <20240813165619.748102-1-yuanchu@google.com>
In-Reply-To: <20240813165619.748102-1-yuanchu@google.com>

On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> This patch series provides workingset reporting of user pages in
> lruvecs, of which coldness can be tracked by accessed bits and fd
> references. However, the concept of workingset applies generically to
> all types of memory, which could be kernel slab caches, discardable
> userspace caches (databases), or CXL.mem. Therefore, data sources might
> come from slab shrinkers, device drivers, or the userspace. IMO, the
> kernel should provide a set of workingset interfaces that should be
> generic enough to accommodate the various use cases, and be extensible
> to potential future use cases. The current proposed interfaces are not
> sufficient in that regard, but I would like to start somewhere, solicit
> feedback, and iterate.
>
... snip ...
> Use cases
> ==========
> Promotion/Demotion
> If different mechanisms are used for promotion and demotion, workingset
> information can help connect the two and avoid pages being migrated back
> and forth.
> For example, given a promotion hot page threshold defined in reaccess
> distance of N seconds (promote pages accessed more often than every N
> seconds). The threshold N should be set so that ~80% (e.g.) of pages on
> the fast memory node passes the threshold. This calculation can be done
> with workingset reports.
> To be directly useful for promotion policies, the workingset report
> interfaces need to be extended to report hotness and gather hotness
> information from the devices[1].
>
> [1]
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
>
> Sysfs and Cgroup Interfaces
> ==========
> The interfaces are detailed in the patches that introduce them. The main
> idea here is we break down the workingset per-node per-memcg into time
> intervals (ms), e.g.
>
> 1000 anon=137368 file=24530
> 20000 anon=34342 file=0
> 30000 anon=353232 file=333608
> 40000 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
>
> I realize this does not generalize well to hotness information, but I
> lack the intuition for an abstraction that presents hotness in a useful
> way. Based on a recent proposal for move_phys_pages[2], it seems like
> userspace tiering software would like to move specific physical pages,
> instead of informing the kernel "move x number of hot pages to y
> device". Please advise.
>
> [2]
> https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
>

Just as a note on this work, this is really a testing interface. The
end-goal is not to merge such an interface that is user-facing like
move_phys_pages, but instead to have something like a triggered kernel
task that has a directive of "Promote X pages from Device A".

This work is more of an open collaboration for prototyping, so that we
can assess the usefulness of the hardware hotness collection mechanism
without having to plumb it through the kernel from the start.

---

More generally on promotion, I have been considering recently a problem
with promoting unmapped pagecache pages - since they are not subject to
NUMA hint faults.
I started looking at PG_accessed and PG_workingset as a potential
mechanism to trigger promotion - but I'm starting to see a pattern of
competing priorities between reclaim (LRU/MGLRU) logic and promotion
logic.

Reclaim is triggered largely under memory pressure - which means
co-opting reclaim logic for promotion is at best logically confusing,
and at worst likely to introduce regressions. The LRU/MGLRU logic is
written largely for reclaim, not promotion. This makes hacking promotion
in after the fact rather dubious - the design choices don't match.

One example: if a page moves from inactive->active (or old->young), we
could treat this as a page "becoming hot" and mark it for promotion, but
this potentially punishes pages already on the "active/younger" lists,
which are themselves hotter.

I'm starting to think separate demotion/reclaim and promotion components
are warranted. This could take the form of a separate kernel worker that
occasionally gets scheduled to manage a promotion list, or even the
addition of a PG_promote flag to decouple reclaim and promotion logic
completely. Separating the structures entirely would be good to allow
both demotion/reclaim and promotion to occur concurrently (although this
seems problematic under memory pressure).

Would like to know your thoughts here. If we can decide to segregate
promotion and demotion logic, it might go a long way to simplify the
existing interfaces and formalize transactions between the two.

(also if you're going to LPC, might be worth a chat in person)

~Gregory
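
For reference, a rough userspace sketch of the threshold calculation the
quoted cover letter describes ("set N so that ~80% of pages pass the
threshold"), run against the sample report quoted above. This is not code
from the series. It assumes each report line is keyed by the upper bound
of its interval in ms, that the bins are non-cumulative, and that anon and
file counts can simply be summed; the quoted text does not state the units
or confirm any of this. The function names parse_report and
threshold_for_fraction are made up for illustration.

```python
from typing import List, Tuple

# Sample report copied from the quoted cover letter.
REPORT = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""

Bin = Tuple[int, int, int]  # (upper_bound_ms, anon, file)


def parse_report(text: str) -> List[Bin]:
    """Parse 'BOUND anon=X file=Y' lines into (bound, anon, file) tuples."""
    bins = []
    for line in text.splitlines():
        if not line.strip():
            continue
        bound, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        bins.append((int(bound), int(values["anon"]), int(values["file"])))
    return bins


def threshold_for_fraction(bins: List[Bin], fraction: float) -> int:
    """Smallest reported bound N (ms) such that at least `fraction` of the
    reported memory falls in bins at or below N.

    Assumption: bins are non-cumulative, so a running sum is needed.
    """
    total = sum(anon + file_ for _, anon, file_ in bins)
    running = 0
    for bound, anon, file_ in sorted(bins):
        running += anon + file_
        if running >= fraction * total:
            return bound
    raise ValueError("empty report")


if __name__ == "__main__":
    bins = parse_report(REPORT)
    for fraction in (0.1, 0.2, 0.8):
        print(fraction, threshold_for_fraction(bins, fraction))
```

Under those assumptions, about 20% of the sample falls at or below the
40000 ms bound, and the 80% mark is only reached in the final open-ended
bin, so this particular report cannot yield a finite N for an 80% target.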