Date: Wed, 4 Sep 2024 13:18:11 -0700
From: Andrew Morton
To: Davidlohr Bueso
Cc: linux-mm@kvack.org, mhocko@kernel.org, rientjes@google.com, yosryahmed@google.com, hannes@cmpxchg.org, almasrymina@google.com, roman.gushchin@linux.dev, gthelen@google.com, dseo3@uci.edu, a.manzanares@samsung.com, linux-kernel@vger.kernel.org
Subject: Re: [PATCH -next] mm: introduce per-node proactive reclaim interface
Message-Id: <20240904131811.234e005307f249ef07670c20@linux-foundation.org>
In-Reply-To: <20240904162740.1043168-1-dave@stgolabs.net>
References: <20240904162740.1043168-1-dave@stgolabs.net>

On Wed, 4 Sep 2024 09:27:40 -0700 Davidlohr Bueso wrote:

> This adds support for allowing
> proactive reclaim in general on a
> NUMA system. A per-node interface extends support beyond a
> memcg-specific interface, respecting the current semantics of
> memory.reclaim: respecting aging LRU and not supporting
> artificially triggering eviction on nodes belonging to non-bottom
> tiers.
>
> This patch allows userspace to do:
>
> echo 512M swappiness=10 > /sys/devices/system/node/nodeX/reclaim

One value per sysfs file is a rule.

> One of the premises for this is to semantically align as best as
> possible with memory.reclaim. During a brief time memcg did
> support nodemask until 55ab834a86a9 (Revert "mm: add nodes=
> arg to memory.reclaim"), for which semantics around reclaim
> (eviction) vs demotion were not clear, rendering charging
> expectations to be broken.
>
> With this approach:
>
> 1. Users who do not use memcg can benefit from proactive reclaim.
>
> 2. Proactive reclaim on top tiers will trigger demotion, for which
> memory is still byte-addressable. Reclaiming on the bottom nodes
> will trigger evicting to swap (the traditional sense of reclaim).
> This follows the semantics of what is today part of the aging process
> on tiered memory, mirroring what every other form of reclaim does
> (reactive and memcg proactive reclaim). Furthermore per-node proactive
> reclaim is not as susceptible to the memcg charging problem mentioned
> above.
>
> 3. Unlike memcg, there should be no surprises of callers expecting
> reclaim but instead got a demotion. Essentially relying on behavior
> of shrink_folio_list() after 6b426d071419 (mm: disable top-tier
> fallback to reclaim on proactive reclaim), without the expectations
> of try_to_free_mem_cgroup_pages().
>
> 4. Unlike the nodes= arg, this interface avoids confusing semantics,
> such as what exactly the user wants when mixing top-tier and low-tier
> nodes in the nodemask. Further per-node interface is less exposed to
> "free up memory in my container" usecases, where eviction is intended.
>
> 5. Users that *really* want to free up memory can use proactive reclaim
> on nodes known to be on the bottom tiers to force eviction in a
> natural way - higher access latencies are still better than swap.
> If compelled, while no guarantees and perhaps not worth the effort,
> users could also potentially follow a ladder-like approach to
> eventually free up the memory. Alternatively, perhaps an 'evict' option
> could be added to the parameters for both memory.reclaim and per-node
> interfaces to force this action unconditionally.
>
> ...
>
> --- a/Documentation/ABI/stable/sysfs-devices-node
> +++ b/Documentation/ABI/stable/sysfs-devices-node
> @@ -221,3 +221,14 @@ Contact: Jiaqi Yan
>  Description:
>  		Of the raw poisoned pages on a NUMA node, how many pages are
>  		recovered by memory error recovery attempt.
> +
> +What:		/sys/devices/system/node/nodeX/reclaim
> +Date:		September 2024
> +Contact:	Linux Memory Management list
> +Description:
> +		This is write-only nested-keyed file which accepts the number of

"is a write-only".  What does "nested keyed" mean?

> +		bytes to reclaim as well as the swappiness for this particular
> +		operation. Write the amount of bytes to induce memory reclaim in
> +		this node. When it completes successfully, the specified amount
> +		or more memory will have been reclaimed, and -EAGAIN if less
> +		bytes are reclaimed than the specified amount.

Could be that this feature would benefit from a more expansive
treatment under Documentation/somewhere.

>
> ...
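The following is a minimal userspace sketch of how a caller might drive the proposed file. It assumes the interface lands with the semantics in the ABI text above: a plain byte count is accepted, and -EAGAIN is returned when less than the requested amount is reclaimed. It writes only the size, in keeping with the one-value-per-file rule. The helper names node_reclaim_path() and write_reclaim_request() are hypothetical, and the nodeX/reclaim file itself is still only a proposal here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: build the per-node path proposed by the patch,
 * "/sys/devices/system/node/node<nid>/reclaim", into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int node_reclaim_path(char *buf, size_t len, int nid)
{
	int n;

	assert(buf != NULL && nid >= 0);
	n = snprintf(buf, len, "/sys/devices/system/node/node%d/reclaim", nid);
	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write one value (e.g. "512M") to an existing file
 * with a single write(), as sysfs store handlers expect.
 * Returns 0 if the whole string was accepted, 1 if write() failed with
 * EAGAIN (per the proposed ABI text: less than the requested amount was
 * reclaimed), and -1 on any other error or a short write.
 */
static int write_reclaim_request(const char *path, const char *request)
{
	size_t len = strlen(request);
	ssize_t ret;
	int fd, err;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	ret = write(fd, request, len);
	err = errno;		/* save before close() can clobber it */
	close(fd);
	if (ret == (ssize_t)len)
		return 0;
	return (ret < 0 && err == EAGAIN) ? 1 : -1;
}
```

A caller could, for instance, build the path for node 1 and write "512M". It would then treat a return of 1 as a partial reclaim to retry or accept, rather than as a hard failure.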
>
> +#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
> +
> +enum {
> +	MEMORY_RECLAIM_SWAPPINESS = 0,
> +	MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +	{ MEMORY_RECLAIM_NULL, NULL },
> +};
> +
> +static ssize_t reclaim_store(struct device *dev,
> +			     struct device_attribute *attr,
> +			     const char *buf, size_t count)
> +{
> +	int nid = dev->id;
> +	gfp_t gfp_mask = GFP_KERNEL;
> +	struct pglist_data *pgdat = NODE_DATA(nid);
> +	unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> +	int swappiness = -1;
> +	char *old_buf, *start;
> +	substring_t args[MAX_OPT_ARGS];
> +	struct scan_control sc = {
> +		.gfp_mask = current_gfp_context(gfp_mask),
> +		.reclaim_idx = gfp_zone(gfp_mask),
> +		.priority = DEF_PRIORITY,
> +		.may_writepage = !laptop_mode,
> +		.may_unmap = 1,
> +		.may_swap = 1,
> +		.proactive = 1,
> +	};
> +
> +	buf = strstrip((char *)buf);
> +
> +	old_buf = (char *)buf;
> +	nr_to_reclaim = memparse(buf, (char **)&buf) / PAGE_SIZE;
> +	if (buf == old_buf)
> +		return -EINVAL;
> +
> +	buf = strstrip((char *)buf);
> +
> +	while ((start = strsep((char **)&buf, " ")) != NULL) {
> +		if (!strlen(start))
> +			continue;
> +		switch (match_token(start, tokens, args)) {
> +		case MEMORY_RECLAIM_SWAPPINESS:
> +			if (match_int(&args[0], &swappiness))
> +				return -EINVAL;
> +			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +				return -EINVAL;

Code forgot to use local `swappiness' for any purpose?

> +			break;
> +		default:
> +			return -EINVAL;
> +		}
> +	}
> +
>
> ...
>
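Andrew's question about the unused `swappiness' local can be made concrete with a userspace model of the parsing in reclaim_store(). This sketch keeps the parsed value and hands it back to the caller instead of dropping it. In the kernel, the analogous step would be to pass it down to the reclaim code, for example through scan_control, in the way the memcg memory.reclaim path handles its own swappiness= key.

The sketch rests on several assumptions:
- pages are 4 KiB;
- MIN_SWAPPINESS/MAX_SWAPPINESS are 0 and 200;
- only the K/M/G/T subset of memparse()'s suffixes is handled;
- parse_reclaim_request() and struct reclaim_req are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE	4096ULL	/* assumption: 4 KiB pages */
#define SKETCH_MIN_SWAPPINESS	0	/* assumed value of MIN_SWAPPINESS */
#define SKETCH_MAX_SWAPPINESS	200	/* assumed value of MAX_SWAPPINESS */

/* Hypothetical result type: what reclaim_store() parses out of its input. */
struct reclaim_req {
	unsigned long nr_pages;	/* requested bytes / page size */
	int swappiness;		/* -1 if no swappiness= key was given */
};

/*
 * Parse "<size>[K|M|G|T] [swappiness=N]", optionally newline-terminated
 * as "echo" produces. Returns 0 on success or -EINVAL.
 */
static int parse_reclaim_request(const char *input, struct reclaim_req *req)
{
	static const char key[] = "swappiness=";
	char buf[128];
	char *end, *tok;
	unsigned long long bytes;

	assert(input != NULL && req != NULL);
	if (strlen(input) >= sizeof(buf))
		return -EINVAL;
	strcpy(buf, input);	/* strtok() modifies its argument */

	/* Like the patch's "buf == old_buf" check: no number means -EINVAL. */
	errno = 0;
	bytes = strtoull(buf, &end, 0);
	if (end == buf || errno == ERANGE)
		return -EINVAL;

	/* A subset of memparse()'s suffixes; each step multiplies by 1024. */
	switch (toupper((unsigned char)*end)) {
	case 'T':
		bytes <<= 10;
		/* fall through */
	case 'G':
		bytes <<= 10;
		/* fall through */
	case 'M':
		bytes <<= 10;
		/* fall through */
	case 'K':
		bytes <<= 10;
		end++;
		break;
	default:
		break;
	}

	req->nr_pages = (unsigned long)(bytes / SKETCH_PAGE_SIZE);
	req->swappiness = -1;

	/* Every remaining token must be "swappiness=N" with N in range. */
	for (tok = strtok(end, " \t\n"); tok != NULL; tok = strtok(NULL, " \t\n")) {
		char *vend;
		long val;

		if (strncmp(tok, key, sizeof(key) - 1) != 0)
			return -EINVAL;
		errno = 0;
		val = strtol(tok + sizeof(key) - 1, &vend, 10);
		if (vend == tok + sizeof(key) - 1 || *vend != '\0' || errno == ERANGE)
			return -EINVAL;
		if (val < SKETCH_MIN_SWAPPINESS || val > SKETCH_MAX_SWAPPINESS)
			return -EINVAL;
		req->swappiness = (int)val;	/* kept, so the caller can act on it */
	}
	return 0;
}
```

For "512M swappiness=10" this yields 131072 pages and a swappiness of 10. With no swappiness= key, the field stays -1, mirroring the patch's own sentinel value.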