From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Gregory Price
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, nehagholkar@meta.com,
 abhishekd@meta.com, kernel-team@meta.com, david@redhat.com,
 nphamcs@gmail.com, akpm@linux-foundation.org, hannes@cmpxchg.org,
 kbusch@meta.com, Feng Tang
Subject: Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
In-Reply-To: <20241210213744.2968-1-gourry@gourry.net> (Gregory Price's
 message of "Tue, 10 Dec 2024 16:37:39 -0500")
References: <20241210213744.2968-1-gourry@gourry.net>
Date: Sat, 21 Dec 2024 13:18:04 +0800
Message-ID: <87o715r4vn.fsf@DESKTOP-5N7EMDA>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Hi, Gregory,

Thanks for working on this!

Gregory Price writes:

> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
> 1) The page is fully swapped out and re-faulted
> 2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
> Patches 1-3
>    allow NULL as valid input to migration prep interfaces
>    for vmf/vma - which is not present in unmapped folios.
> Patch 4
>    adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
>    adds the promotion mechanism, along with a sysfs
>    extension which defaults the behavior to off.
>    /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> Functional test showed that we are able to reclaim some performance
> in canned scenarios (a file gets demoted and becomes hot with
> relatively little contention). See test/overhead section below.
>
> v2
> - cleanup first commit to be accurate and take Ying's feedback
> - cleanup NUMA_HINT_ define usage
> - add NUMA_HINT_ type selection macro to keep code clean
> - mild comment updates
>
> Open Questions:
> ======
> 1) Should we also add a limit to how much can be forced onto
>    a single task's promotion list at any one time? This might
>    piggy-back on the existing TPP promotion limit (256MB?) and
>    would simply add something like task->promo_count.
>
>    Technically we are limited by the batch read-rate before a
>    TASK_RESUME occurs.
>
> 2) Should we exempt certain forms of folios, or add additional
>    knobs/levers in to deal with things like large folios?
>
> 3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
>    so we could validate the behavior works as intended.
>    Should we just call this a NUMA_HINT_FAULT and not add a new hint?
>
> 4) Benchmark suggestions that can pressure 1TB memory. This is
>    not my typical wheelhouse, so if folks know of a useful
>    benchmark that can pressure my 1TB (768 DRAM / 256 CXL) setup,
>    I'd like to add additional measurements here.
>
> Development Notes
> =================
>
> During development, we explored the following proposals:
>
> 1) directly promoting within folio_mark_accessed (FMA)
>    Originally suggested by Johannes Weiner
>    https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
>
>    This caused deadlocks due to the fact that the PTL was held
>    in a variety of cases - but in particular during task exit.
>    It also is incredibly inflexible and causes promotion-on-fault.
>    It was discussed that a deferral mechanism was preferred.
>
> 2) promoting in filemap.c locations (calls of FMA)
>    Originally proposed by Feng Tang and Ying Huang
>    https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>
>    First, we saw this as less problematic than directly hooking FMA,
>    but we realized this has the potential to miss data in a variety of
>    locations: swap.c, memory.c, gup.c, ksm.c, paddr.c - etc.
>
>    Second, we discovered that the lock state of pages is very subtle,
>    and that these locations in filemap.c can be called in an atomic
>    context. Prototypes led to a variety of stalls and lockups.
>
> 3) a new LRU - originally proposed by Keith Busch
>    https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
>
>    There are two issues with this approach: PG_promotable and reclaim.
>
>    First - PG_promotable has generally been discouraged.
>
>    Second - Attaching this mechanism to an LRU is both backwards and
>    counter-intuitive.
>    A promotable list is better served by a MOST recently used list,
>    and since LRUs are generally only shrunk when exposed to pressure,
>    it would require implementing a new promotion list shrinker that
>    runs separately from the existing reclaim logic.
>
> 4) Adding a separate kthread - suggested by many
>
>    This is - to an extent - a more general version of the LRU proposal.
>    We still have to track the folios - which likely requires the
>    addition of a page flag. Additionally, this method would actually
>    contend pretty heavily with LRU behavior - i.e. we'd want to
>    throttle addition to the promotion candidate list in some scenarios.
>
> 5) Doing it in task work
>
>    This seemed to be the most realistic after considering the above.
>
>    We observe the following:
>    - FMA is an ideal hook for this, and isolation is safe here
>    - the new promotion_candidate function is an ideal hook for new
>      filter logic (throttling, fairness, etc.)
>    - isolated folios are either promoted or putback on task resume;
>      there are no additional concurrency mechanics to worry about
>    - the mechanic can be made optional via a sysfs hook to avoid
>      overhead in degenerate scenarios (thrashing)
>
>    We also piggy-backed on the numa_hint_fault_latency timestamp to
>    further throttle promotions, to help avoid promotions on one- or
>    two-time accesses to a particular page.
>
> Test:
> ======
>
> Environment:
>    1.5-3.7GHz CPU, ~4000 BogoMIPS,
>    1TB Machine with 768GB DRAM and 256GB CXL
>    A 64GB file being linearly read by 6-7 Python processes
>
> Goal:
>    Generate promotions. Demonstrate stability and measure overhead.
>
> System Settings:
>    echo 1 > /sys/kernel/mm/numa/demotion_enabled
>    echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
>    echo 2 > /proc/sys/kernel/numa_balancing
>
> Each process took up ~128GB, with anonymous memory growing and
> shrinking as Python filled and released buffers with the 64GB data.
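For reference, each reader in the workload above amounts to a timed linear pass over the file. A minimal Python sketch of such a reader (hypothetical - the chunk size and function name are illustrative, not the actual test harness):

```python
import time

CHUNK = 1 << 20  # read in 1 MiB chunks


def timed_pass(path):
    """Linearly read the whole file once; return elapsed seconds.

    Reading through the page cache is what triggers
    folio_mark_accessed() on file folios - and thus generates
    promotion candidates for pages previously demoted to CXL.
    """
    start = time.monotonic()
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass
    return time.monotonic() - start
```

Calling `timed_pass()` repeatedly on the 64GB file yields the per-loop runtimes quoted in the results below.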
> This causes DRAM pressure to generate demotions, and file pages to
> "become hot" - and therefore be selected for promotion.
>
> First we ran with promotion disabled to show consistent overhead as
> a result of forcing a file out to CXL memory. We first ran a single
> reader to see uncontended performance, launched many readers to force
> demotions, then dropped back to a single reader to observe.
>
> Single-reader DRAM:                  ~16.0-16.4s
> Single-reader CXL (after demotion):  ~16.8-17s

The difference is trivial. This makes me wonder why we need this
patchset at all.

> Next we turned promotion on with only a single reader running.
>
> Before promotions:
>    Node 0 MemFree:    636478112 kB
>    Node 0 FilePages:   59009156 kB
>    Node 1 MemFree:    250336004 kB
>    Node 1 FilePages:   14979628 kB

Why are there so many file pages on node 1 even though there are a lot
of free pages on node 0? Did you move some file pages from node 0 to
node 1?

> After promotions:
>    Node 0 MemFree:    632267268 kB
>    Node 0 FilePages:   72204968 kB
>    Node 1 MemFree:    262567056 kB
>    Node 1 FilePages:    2918768 kB
>
> Single-reader (after_promotion): ~16.5s
>
> Turning the promotion mechanism on when nothing had been demoted
> produced no appreciable overhead (memory allocation noise overpowers it).
>
> Read time did not change after turning promotion off once promotion
> had occurred, which implies that the additional overhead is not coming
> from the promotion system itself - but likely from other pages still
> trapped on the low tier. Either way, this at least demonstrates the
> mechanism is not particularly harmful when there are no pages to
> promote - and the mechanism is valuable when a file actually is quite
> hot.
>
> Notably, it takes some time for the average read loop to come back
> down, and there still remain unpromoted file pages trapped in pagecache.
> This isn't entirely unexpected; there are many files which may have
> been demoted, and they may not be very hot.
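A side note on reproducing snapshots like the ones above: the "Node N MemFree/FilePages" counters have the same layout as /sys/devices/system/node/node<N>/meminfo, so they can be captured before and after toggling promotion with a small parser. A hedged Python sketch (the function name and field selection are mine):

```python
import re


def parse_node_meminfo(text, fields=("MemFree", "FilePages")):
    """Extract selected per-node counters (in kB) from
    /sys/devices/system/node/node<N>/meminfo-style output.

    Returns {node: {field: kB}}.
    """
    stats = {}
    # Lines look like: "Node 0 MemFree:       636478112 kB"
    pat = re.compile(r"Node\s+(\d+)\s+(\w+):\s+(\d+)\s+kB")
    for m in pat.finditer(text):
        node, field, kb = int(m.group(1)), m.group(2), int(m.group(3))
        if field in fields:
            stats.setdefault(node, {})[field] = kb
    return stats
```

Diffing two such snapshots makes it easy to see FilePages draining from node 1 as promotions proceed.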
>
> Overhead
> ======
>
> When promotion was turned on we saw a loop-runtime increase temporarily.
>
> before: 16.8s
> during:
>    17.606216192245483
>    17.375206470489502
>    17.722095489501953
>    18.230552434921265
>    18.20712447166443
>    18.008254528045654
>    17.008427381515503
>    16.851454257965088
>    16.715774059295654
> stable: ~16.5s
>
> We measured overhead with a separate patch that simply measured the
> rdtsc value before/after calls in promotion_candidate and task work.
>
> e.g.:
> +       start = rdtsc();
>         list_for_each_entry_safe(folio, tmp, promo_list, lru) {
>                 list_del_init(&folio->lru);
>                 migrate_misplaced_folio(folio, NULL, nid);
> +               count++;
>         }
> +       atomic_long_add(rdtsc() - start, &promo_time);
> +       atomic_long_add(count, &promo_count);
>
> numa_migrate_prep:                93 - time(3969867917) count(42576860)
> migrate_misplaced_folio_prepare: 491 - time(3433174319) count(6985523)
> migrate_misplaced_folio:        1635 - time(11426529980) count(6985523)
>
> Thoughts on a good throttling heuristic would be appreciated here.

We do have a throttle mechanism already. For example, you can use

  $ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps

to limit the promotion throughput to under 100 MB/s for each DRAM node.

> Suggested-by: Huang Ying
> Suggested-by: Johannes Weiner
> Suggested-by: Keith Busch
> Suggested-by: Feng Tang
> Signed-off-by: Gregory Price
>
> Gregory Price (5):
>   migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
>   memory: move conditionally defined enums use inside ifdef tags
>   memory: allow non-fault migration in numa_migrate_check path
>   vmstat: add page-cache numa hints
>   migrate,sysfs: add pagecache promotion
>
>  .../ABI/testing/sysfs-kernel-mm-numa | 20 ++++++
>  include/linux/memory-tiers.h         |  2 +
>  include/linux/migrate.h              |  2 +
>  include/linux/sched.h                |  3 +
>  include/linux/sched/numa_balancing.h |  5 ++
>  include/linux/vm_event_item.h        |  8 +++
>  init/init_task.c                     |  1 +
>  kernel/sched/fair.c                  | 26 +++++++-
>  mm/memory-tiers.c                    | 27 ++++++++
>  mm/memory.c                          | 32 +++++----
>  mm/mempolicy.c                       | 25 +++++---
>  mm/migrate.c                         | 61 ++++++++++++++++++-
>  mm/swap.c                            |  3 +
>  mm/vmstat.c                          |  2 +
>  14 files changed, 193 insertions(+), 24 deletions(-)

---
Best Regards,
Huang, Ying