linux-mm.kvack.org archive mirror
From: "Huang, Ying" <ying.huang@linux.alibaba.com>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org,  linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org,  kernel-team@meta.com,
	nehagholkar@meta.com,  abhishekd@meta.com,  david@redhat.com,
	nphamcs@gmail.com,  akpm@linux-foundation.org,
	 hannes@cmpxchg.org, kbusch@meta.com,  feng.tang@intel.com,
	 donettom@linux.ibm.com
Subject: Re: [RFC v3 PATCH 0/5] Promotion of Unmapped Page Cache Folios.
Date: Wed, 22 Jan 2025 19:16:03 +0800
Message-ID: <87v7u7gkuk.fsf@DESKTOP-5N7EMDA>
In-Reply-To: <20250107000346.1338481-1-gourry@gourry.net> (Gregory Price's message of "Mon, 6 Jan 2025 19:03:40 -0500")

Hi, Gregory,

Thanks for the patchset and sorry about the late reply.

Gregory Price <gourry@gourry.net> writes:

> Unmapped page cache pages can be demoted to low-tier memory, but
> they can presently only be promoted in two conditions:
>     1) The page is fully swapped out and re-faulted
>     2) The page becomes mapped (and exposed to NUMA hint faults)
>
> This RFC proposes promoting unmapped page cache pages by using
> folio_mark_accessed as a hotness hint for unmapped pages.
>
> We show in a microbenchmark that this mechanism can increase
> performance up to 23.5% compared to leaving page cache on the
> low tier - when that page cache becomes excessively hot.
>
> When disabled (NUMA tiering off), overhead in folio_mark_accessed
> was limited to <1% in a worst case scenario (all work is file_read()).
>
> There is an open question as to how to integrate this into MGLRU,
> as the current design only applies to the traditional LRU.
>
> Patches 1-3
> 	allow NULL as valid input to migration prep interfaces
> 	for vmf/vma - which is not present in unmapped folios.
> Patch 4
> 	adds NUMA_HINT_PAGE_CACHE to vmstat
> Patch 5
> 	Implement migrate_misplaced_folio_batch
> Patch 6
> 	add the promotion mechanism, along with a sysfs
> 	extension which defaults the behavior to off.
> 	/sys/kernel/mm/numa/pagecache_promotion_enabled
>
> v3 Notes
> ===
> - added batch migration interface (migrate_misplaced_folio_batch)
>
> - dropped timestamp check in promotion_candidate (tests showed
>   it did not make a difference and the work is duplicated during
>   the migration process).
>
> - Bug fix from Donet Tom regarding vmstat
>
> - pulled folio_isolated and sysfs switch checks out into
>   folio_mark_accessed because microbenchmark tests showed the
>   function call overhead of promotion_candidate warranted a bit
>   of manual optimization for the scenario where the majority of
>   work is file_read().  This brought the standing overhead from
>   ~7% down to <1% when everything is disabled.
>
> - Limited promotion work list to a number of folios that match
>   the existing promotion rate limit, as microbenchmark demonstrated
>   excessive overhead on a single system-call when significant amounts
>   of memory are read.
>   Before: 128GB read went from 7 seconds to 40 seconds over ~2 rounds.
>   Now:    128GB read went from 7 seconds to ~11 seconds over ~10 rounds.
>
> - switched from list_add to list_add_tail in promotion_candidate, as
>   it was discovered promoting in non-linear order caused fairly
>   significant overheads (as high as running out of CXL) - likely due
>   to poor TLB and prefetch behavior.  Simply switching to list_add_tail
>   all but confirmed this as the additional ~20% overhead vanished.
>
>   This is likely to only occur on systems with a large amount of
>   contiguous physical memory available on the hot tier, since the
>   allocators are more likely to provide better spatial locality.
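
To make the ordering point concrete, here is a minimal userspace
sketch (not kernel code).  It only models where new entries land:
list_add() inserts at the head of the list, list_add_tail() at the
tail.  The function name collect() and the integer "folios" are
hypothetical, used purely for illustration.

```python
from collections import deque


def collect(folios, at_tail):
    """Build a work list the way a collector would, one folio at a time."""
    work = deque()
    for folio in folios:
        if at_tail:
            work.append(folio)      # models list_add_tail(): entry goes to the end
        else:
            work.appendleft(folio)  # models list_add(): entry goes to the front
    return list(work)


if __name__ == "__main__":
    linear_read = range(5)  # folios touched in file order
    print("list_add order:     ", collect(linear_read, at_tail=False))
    print("list_add_tail order:", collect(linear_read, at_tail=True))
```

With head insertion the work list is walked in the reverse of the
order in which the folios were touched, while tail insertion keeps
the original (linear) order.
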
>
>
> Test:
> ======
>
> Environment:
>     1.5-3.7GHz CPU, ~4000 BogoMIPS, 
>     1TB Machine with 768GB DRAM and 256GB CXL
>     A 128GB file being linearly read by a single process
>
> Goal:
>    Generate promotions and demonstrate upper-bound on performance
>    overhead and gain/loss. 
>
> System Settings:
>    echo 1 > /sys/kernel/mm/numa/pagecache_promotion_enabled
>    echo 2 > /proc/sys/kernel/numa_balancing
>    
> Test process:
>    In each test, we do a linear read of a 128GB file into a buffer
>    in a loop.

IMHO, a linear read isn't a very good test case for promotion because
it cannot exercise the hot-page selection algorithm: every page is
accessed equally often.  I think it's better to use something like a
normally distributed access pattern.  IIRC, that is available in the
fio test suite.
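
As a rough illustration of the kind of pattern I mean (a sketch under
my own assumptions, not a replacement for fio, whose
random_distribution option can IIRC be set to normal), the following
generates page-aligned read offsets that cluster around the middle of
the file and issues buffered reads for them.  The function names, the
4KiB page size, and the stddev_frac parameter are all hypothetical.

```python
import os
import random

PAGE_SIZE = 4096  # assumed page size, for illustration only


def normal_page_offsets(file_size, count, stddev_frac=0.1, seed=0):
    """Return `count` page-aligned offsets, normally distributed
    around the midpoint of a file of `file_size` bytes."""
    rng = random.Random(seed)
    n_pages = file_size // PAGE_SIZE
    if n_pages <= 0:
        raise ValueError("file must be at least one page")
    mean = (n_pages - 1) / 2
    stddev = max(n_pages * stddev_frac, 1.0)
    offsets = []
    while len(offsets) < count:
        page = round(rng.gauss(mean, stddev))
        if 0 <= page < n_pages:  # reject samples that fall outside the file
            offsets.append(page * PAGE_SIZE)
    return offsets


def read_with_pattern(path, offsets, block=PAGE_SIZE):
    """Issue one buffered pread() per offset; return total bytes read."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return sum(len(os.pread(fd, block, off)) for off in offsets)
    finally:
        os.close(fd)
```

With a pattern like this, a small set of pages near the middle of the
file is much hotter than the rest, so the test can show whether the
hot pages are the ones that get promoted.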

> To allocate the pagecache into CXL, we use mbind prior
>    to the CXL test runs and read the file.  We omit the overhead of
>    allocating the buffer and initializing the memory into CXL from the
>    test runs.
>
>    1) file allocated in DRAM with mechanisms off
>    2) file allocated in DRAM with balancing on but promotion off
>    3) file allocated in DRAM with balancing and promotion on
>       (promotion check is negative because all pages are top tier)
>    4) file allocated in CXL with mechanisms off
>    5) file allocated in CXL with mechanisms on
>
> Each test was run with 50 read cycles and averaged (where relevant)
> to account for system noise.  This number of cycles gives the promotion
> mechanism time to promote the vast majority of memory (usually <1MB
> remaining in worst case).
>
> Tests 2 and 3 test the upper bound on overhead of the new checks when
> there are no pages to migrate but work is dominated by file_read().
>
> |     1     |    2     |     3       |    4     |      5         |
> | DRAM Base | Promo On | TopTier Chk | CXL Base | Post-Promotion |
> |  7.5804   |  7.7586  |   7.9726    |   9.75   |    7.8941      |

For test 3, we can check whether the folio is in the top tier as the
first step.  Will that introduce measurable overhead?

> Baseline DRAM vs Baseline CXL shows a ~28% overhead just allowing the
> file to remain on CXL, while after promotion, we see the performance
> trend back towards the overhead of the TopTier check time - a total
> overhead reduction of ~84% (or ~5% overhead down from ~23.5%).
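
To keep the reference points straight, the percentages can be
recomputed from the table above.  This is only a sketch of the
arithmetic; pct_over() is a hypothetical helper, and which baseline
each quoted figure uses is my reading of the numbers, not something
stated in the cover letter.

```python
# Mean seconds per 128GB read cycle, copied from the results table.
dram_base = 7.5804       # test 1: file in DRAM, mechanisms off
cxl_base = 9.75          # test 4: file in CXL, mechanisms off
post_promotion = 7.8941  # test 5: file in CXL, after promotion


def pct_over(value, reference):
    """Relative slowdown of `value` versus `reference`, in percent."""
    return (value - reference) / reference * 100.0


cxl_overhead = pct_over(cxl_base, dram_base)         # CXL vs DRAM baseline
post_overhead = pct_over(post_promotion, dram_base)  # post-promotion vs DRAM baseline
cxl_vs_post = pct_over(cxl_base, post_promotion)     # CXL vs post-promotion time

print(f"CXL base vs DRAM base:       {cxl_overhead:.1f}%")
print(f"Post-promotion vs DRAM base: {post_overhead:.1f}%")
print(f"CXL base vs post-promotion:  {cxl_vs_post:.1f}%")
```

If I read it correctly, the ~28% figure is relative to the DRAM
baseline, while the ~23.5% figure matches the CXL time relative to
the post-promotion time.
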
>
> During promotion, we do see overhead which eventually tapers off over
> time.  Here is a sample of the first 10 cycles during which promotion
> is the most aggressive, which shows overhead drops off dramatically
> as the majority of memory is migrated to the top tier.
>
> 12.79, 12.52, 12.33, 12.03, 11.81, 11.58, 11.36, 11.1, 8, 7.96
>
> This could be further limited by limiting the promotion rate via the
> existing knob, or by implementing a new knob detached from the existing
> promotion rate.  There are merits to both approaches.

Have you tested with the existing knob?  Does it help?

> After promotion, turning the mechanism off via sysfs increased the
> overall performance back to the DRAM baseline. The slight (~1%)
> increase between post-migration performance and the baseline mechanism
> overhead check appears to be general variance as similar times were
> observed during the baseline checks on subsequent runs.
>
> The mechanism itself represents a ~2-5% overhead in a worst case
> scenario (all work is file_read() and pages are in DRAM).
>
[snip]

---
Best Regards,
Huang, Ying


