From: Shakeel Butt
To: Davidlohr Bueso
Cc: akpm@linux-foundation.org, mhocko@kernel.org, hannes@cmpxchg.org,
	roman.gushchin@linux.dev, yosryahmed@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
Date: Wed, 25 Jun 2025 16:10:16 -0700
References: <20250623185851.830632-1-dave@stgolabs.net>
	<20250623185851.830632-5-dave@stgolabs.net>
In-Reply-To: <20250623185851.830632-5-dave@stgolabs.net>

On Mon, Jun 23, 2025 at 11:58:51AM -0700, Davidlohr Bueso wrote:
> This adds support for allowing proactive reclaim in general on a
> NUMA system. A per-node interface extends support for beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
>
> This patch allows userspace to do:
>
>     echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
>
> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
> The memcg interface is not NUMA aware and there are usecases that
> are focusing on NUMA balancing rather than workload memory footprint.
>
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
>
> 3. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
>
> 4. Users that *really* want to free up memory can use proactive reclaim
> on nodes knowingly to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
>
> Signed-off-by: Davidlohr Bueso

Overall looks good but I will try to dig deeper in the next couple of days
(or weeks).

One orthogonal thought: I wonder if we want a unified aging (hotness or
generation or active/inactive) view of jobs/memcgs/system. At the moment,
due to the way LRUs are implemented, i.e. per-memcg per-node, we can have
different views of these LRUs even for the same memcg. For example, the
hottest pages in a low tier node might be colder than the coldest pages in
the top tier. Not sure how to implement it in a scalable way.
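
The example above can be made concrete with a toy model. This is only a
sketch under stated assumptions, not kernel code: the Page type, the node
names "top" and "low", the last_access timestamps and the "hotter half is
active" rule are all made up for illustration. It shows each node labelling
pages relative to its own LRU, so "active" on the low tier can still be
colder than "inactive" on the top tier, while a single global ranking puts
all pages on one scale. It says nothing about how to do this scalably.

```python
# Toy model only: not kernel code. Page, the node names and the
# timestamps are hypothetical and exist just to illustrate the point.
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    name: str
    node: str
    last_access: int  # larger means more recently used (hotter)


def per_node_view(pages):
    """Label each node's hotter half 'active', relative to that node only."""
    view = {}
    for node in sorted({p.node for p in pages}):
        lru = sorted((p for p in pages if p.node == node),
                     key=lambda p: p.last_access, reverse=True)
        half = len(lru) // 2
        view[node] = {"active": lru[:half], "inactive": lru[half:]}
    return view


def unified_view(pages):
    """Rank every page on one global hot-to-cold scale."""
    return sorted(pages, key=lambda p: p.last_access, reverse=True)


pages = [
    Page("a", "top", 100), Page("b", "top", 90),
    Page("c", "top", 80), Page("d", "top", 70),
    Page("e", "low", 40), Page("f", "low", 30),
    Page("g", "low", 20), Page("h", "low", 10),
]

view = per_node_view(pages)
hottest_low = view["low"]["active"][0]     # hottest page on the low tier
coldest_top = view["top"]["inactive"][-1]  # coldest page on the top tier

# The low tier's "active" page is colder than the top tier's "inactive" one.
print(hottest_low.name, hottest_low.last_access)
print(coldest_top.name, coldest_top.last_access)
print([p.name for p in unified_view(pages)])
```

In this model the per-node labels are not comparable across nodes; only the
unified ranking orders pages from both tiers consistently.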