From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
nehagholkar@meta.com, abhishekd@meta.com, kernel-team@meta.com,
david@redhat.com, nphamcs@gmail.com, akpm@linux-foundation.org,
hannes@cmpxchg.org, kbusch@meta.com,
Feng Tang <feng.tang@intel.com>
Subject: Re: [RFC v2 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
Date: Sat, 21 Dec 2024 13:18:04 +0800
Message-ID: <87o715r4vn.fsf@DESKTOP-5N7EMDA>
In-Reply-To: <20241210213744.2968-1-gourry@gourry.net> (Gregory Price's message of "Tue, 10 Dec 2024 16:37:39 -0500")
Hi, Gregory,

Thanks for working on this!

Gregory Price <gourry@gourry.net> writes:
> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
> 1) The page is fully swapped out and re-faulted
> 2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
> Patches 1-3
>   allow NULL as valid input to migration prep interfaces
>   for vmf/vma - which are not present for unmapped folios.
> Patch 4
> adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
> adds the promotion mechanism, along with a sysfs
> extension which defaults the behavior to off.
> /sys/kernel/mm/numa/pagecache_promotion_enabled
>
> Functional testing showed that we can reclaim some performance in
> canned scenarios (a file gets demoted and then becomes hot with
> relatively little contention). See the test/overhead sections below.
>
> v2
> - cleanup first commit to be accurate and address Ying's feedback
> - cleanup NUMA_HINT_ define usage
> - add NUMA_HINT_ type selection macro to keep code clean
> - mild comment updates
>
> Open Questions:
> ======
> 1) Should we also add a limit to how much can be forced onto
>    a single task's promotion list at any one time? This might
>    piggy-back on the existing TPP promotion limit (256MB?) and
>    would simply add something like task->promo_count (a rough
>    sketch follows this list).
>
> Technically we are limited by the batch read-rate before a
> TASK_RESUME occurs.
>
> 2) Should we exempt certain forms of folios, or add additional
> knobs/levers in to deal with things like large folios?
>
> 3) We added NUMA_HINT_PAGE_CACHE to differentiate hint faults
> so we could validate the behavior works as intended. Should
> we just call this a NUMA_HINT_FAULT and not add a new hint?
>
> 4) Benchmark suggestions that can pressure 1TB of memory. This is
>    not my typical wheelhouse, so if folks know of a useful benchmark
>    that can pressure my 1TB (768GB DRAM / 256GB CXL) setup, I'd like
>    to add additional measurements here.
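>
> For open question 1, a rough sketch of such a cap (hypothetical -
> neither promo_count nor PROMO_LIMIT_PAGES exists in the series):
>
>	/* In promotion_candidate(): stop queueing once a task has
>	 * ~256MB of candidates pending, mirroring the TPP limit. */
>	if (current->promo_count + folio_nr_pages(folio) > PROMO_LIMIT_PAGES)
>		return;
>	current->promo_count += folio_nr_pages(folio);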
>
> Development Notes
> =================
>
> During development, we explored the following proposals:
>
> 1) directly promoting within folio_mark_accessed (FMA)
> Originally suggested by Johannes Weiner
> https://lore.kernel.org/all/20240803094715.23900-1-gourry@gourry.net/
>
> This caused deadlocks because the PTL is held in a variety of
> cases - in particular during task exit. It is also incredibly
> inflexible and forces promotion-on-fault. It was discussed that
> a deferral mechanism was preferred.
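>
> As an illustration of the hazard (an assumed call chain, not taken
> from the patch): folio_mark_accessed() can be reached with the page
> table lock already held, e.g. while unmapping at task exit:
>
>	exit_mmap()
>	  unmap_vmas()
>	    zap_pte_range()              <- PTL held
>	      folio_mark_accessed(folio)
>	        -> migrating here could sleep and re-take locks: deadlock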
>
>
> 2) promoting in filemap.c locations (calls of FMA)
> Originally proposed by Feng Tang and Ying Huang
> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/patch/?id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>
> First, we saw this as less problematic than directly hooking FMA,
> but we realized it has the potential to miss data in a variety of
> locations: swap.c, memory.c, gup.c, ksm.c, paddr.c, etc.
>
> Second, we discovered that the lock state of pages is very subtle,
> and that these locations in filemap.c can be called in an atomic
> context. Prototypes led to a variety of stalls and lockups.
>
>
> 3) a new LRU - originally proposed by Keith Busch
> https://git.kernel.org/pub/scm/linux/kernel/git/kbusch/linux.git/patch/?id=6616afe9a722f6ebedbb27ade3848cf07b9a3af7
>
> There are two issues with this approach: PG_promotable and reclaim.
>
> First - PG_promotable has generally been discouraged.
>
> Second - Attaching this mechanism to an LRU is both backwards and
> counter-intuitive. A promotable list is better served by a MOST
> recently used list, and since LRUs are generally only shrunk when
> exposed to pressure, it would require implementing a new promotion
> list shrinker that runs separately from the existing reclaim logic.
>
>
> 4) Adding a separate kthread - suggested by many
>
> This is - to an extent - a more general version of the LRU proposal.
> We still have to track the folios - which likely requires the
> addition of a page flag. Additionally, this method would actually
> contend pretty heavily with LRU behavior - i.e. we'd want to
> throttle addition to the promotion candidate list in some scenarios.
>
>
> 5) Doing it in task work
>
> This seemed to be the most realistic after considering the above.
>
> We observe the following:
> - FMA is an ideal hook for this and isolation is safe here
> - the new promotion_candidate function is an ideal hook for new
> filter logic (throttling, fairness, etc).
> - isolated folios are either promoted or putback on task resume,
> there are no additional concurrency mechanics to worry about
> - The mechanism can be made optional via a sysfs hook to avoid
>   overhead in degenerate scenarios (thrashing).
>
> We also piggy-backed on the numa_hint_fault_latency timestamp to
> further throttle promotions, which helps avoid promoting pages
> that see only one or two accesses.
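>
> For concreteness, a minimal sketch of the shape of this mechanism
> (illustrative only: promo_list and promo_work are assumed fields on
> task_struct, and all filtering/throttling is elided; only
> migrate_misplaced_folio(folio, NULL, nid) is verbatim from the
> series):
>
>	/* Called from folio_mark_accessed(): defer, don't migrate here. */
>	static void promotion_candidate(struct folio *folio)
>	{
>		/* Isolation is safe in this context; migration is not. */
>		if (!folio_isolate_lru(folio))
>			return;
>		list_add(&folio->lru, &current->promo_list);
>		/* A real version must avoid double-queueing promo_work. */
>		task_work_add(current, &current->promo_work, TWA_RESUME);
>	}
>
>	/* Runs on return to userspace: promote or putback, no races. */
>	static void promote_pages(struct callback_head *work)
>	{
>		int nid = numa_node_id();
>		struct folio *folio, *tmp;
>
>		list_for_each_entry_safe(folio, tmp, &current->promo_list, lru) {
>			list_del_init(&folio->lru);
>			migrate_misplaced_folio(folio, NULL, nid);
>		}
>	}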
>
>
> Test:
> ======
>
> Environment:
> 1.5-3.7GHz CPU, ~4000 BogoMIPS,
> 1TB Machine with 768GB DRAM and 256GB CXL
> A 64GB file being linearly read by 6-7 Python processes
>
> Goal:
> Generate promotions. Demonstrate stability and measure overhead.
>
> System Settings:
> echo 1 > /sys/kernel/mm/numa/demotion_enabled
> echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
> echo 2 > /proc/sys/kernel/numa_balancing  # NUMA_BALANCING_MEMORY_TIERING
>
> Each process took up ~128GB, with anonymous memory growing and
> shrinking as Python filled and released buffers with the 64GB of
> data. This causes DRAM pressure to generate demotions, and file
> pages to "become hot" - and therefore be selected for promotion.
>
> First we ran with promotion disabled to show the consistent overhead
> of forcing a file out to CXL memory: a single reader to measure
> uncontended performance, then many readers to force demotions, then
> back to a single reader to observe the difference.
>
> Single-reader DRAM: ~16.0-16.4s
> Single-reader CXL (after demotion): ~16.8-17s

The difference is trivial. This makes me wonder why we need this
patchset at all.

> Next we turned promotion on with only a single reader running.
>
> Before promotions:
> Node 0 MemFree: 636478112 kB
> Node 0 FilePages: 59009156 kB
> Node 1 MemFree: 250336004 kB
> Node 1 FilePages: 14979628 kB

Why are there so many file pages on node 1 even though there are
plenty of free pages on node 0? Did you move some file pages from
node 0 to node 1?

> After promotions:
> Node 0 MemFree: 632267268 kB
> Node 0 FilePages: 72204968 kB
> Node 1 MemFree: 262567056 kB
> Node 1 FilePages: 2918768 kB
>
> Single-reader (after_promotion): ~16.5s
>
> Turning the promotion mechanism on when nothing had been demoted
> produced no appreciable overhead (memory allocation noise
> overpowers it).
>
> Read time did not change when promotion was turned off again after
> promotions had occurred, which implies the remaining overhead comes
> not from the promotion system itself but likely from other pages
> still trapped in the low tier. Either way, this at least demonstrates
> that the mechanism is not particularly harmful when there are no
> pages to promote - and that it is valuable when a file actually is
> quite hot.
>
> Notably, it takes some time for the average read loop time to come
> back down, and unpromoted file pages still remain trapped in the
> pagecache. This isn't entirely unexpected: many files may have been
> demoted, and they may not be very hot.
>
>
> Overhead
> ======
> When promotion was turned on, we saw a temporary loop-runtime
> increase:
>
> before: 16.8s
> during:
> 17.606216192245483
> 17.375206470489502
> 17.722095489501953
> 18.230552434921265
> 18.20712447166443
> 18.008254528045654
> 17.008427381515503
> 16.851454257965088
> 16.715774059295654
> stable: ~16.5s
>
> We measured overhead with a separate patch that simply measured the
> rdtsc value before/after calls in promotion_candidate and task work.
>
> e.g. (diff excerpt; start, count, and the promo_* accumulators are
> declared elsewhere in the measurement patch):
>
> +	start = rdtsc();
>  	list_for_each_entry_safe(folio, tmp, promo_list, lru) {
>  		list_del_init(&folio->lru);
>  		migrate_misplaced_folio(folio, NULL, nid);
> +		count++;
>  	}
> +	atomic_long_add(rdtsc() - start, &promo_time);
> +	atomic_long_add(count, &promo_count);
>
>                                   cycles/call    total cycles      calls
>   numa_migrate_prep                        93      3969867917   42576860
>   migrate_misplaced_folio_prepare         491      3433174319    6985523
>   migrate_misplaced_folio                1635     11426529980    6985523
>
> Thoughts on a good throttling heuristic would be appreciated here.

We already have a throttling mechanism. For example, you can use

$ echo 100 > /proc/sys/kernel/numa_balancing_promote_rate_limit_MBps

to limit the promotion throughput to under 100 MB/s for each DRAM
node.

> Suggested-by: Huang Ying <ying.huang@linux.alibaba.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Suggested-by: Keith Busch <kbusch@meta.com>
> Suggested-by: Feng Tang <feng.tang@intel.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
>
> Gregory Price (5):
> migrate: Allow migrate_misplaced_folio_prepare() to accept a NULL VMA.
> memory: move conditionally defined enums use inside ifdef tags
> memory: allow non-fault migration in numa_migrate_check path
> vmstat: add page-cache numa hints
> migrate,sysfs: add pagecache promotion
>
> .../ABI/testing/sysfs-kernel-mm-numa | 20 ++++++
> include/linux/memory-tiers.h | 2 +
> include/linux/migrate.h | 2 +
> include/linux/sched.h | 3 +
> include/linux/sched/numa_balancing.h | 5 ++
> include/linux/vm_event_item.h | 8 +++
> init/init_task.c | 1 +
> kernel/sched/fair.c | 26 +++++++-
> mm/memory-tiers.c | 27 ++++++++
> mm/memory.c | 32 +++++-----
> mm/mempolicy.c | 25 +++++---
> mm/migrate.c | 61 ++++++++++++++++++-
> mm/swap.c | 3 +
> mm/vmstat.c | 2 +
> 14 files changed, 193 insertions(+), 24 deletions(-)
---
Best Regards,
Huang, Ying