From: "Huang, Ying" <ying.huang@intel.com>
To: Gregory Price <gourry@gourry.net>
Cc: linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org,  david@redhat.com,  nphamcs@gmail.com,
	nehagholkar@meta.com,  abhishekd@meta.com,
	 Johannes Weiner <hannes@cmpxchg.org>,
	 Feng Tang <feng.tang@intel.com>
Subject: Re: [PATCH 0/3] mm,TPP: Enable promotion of unmapped pagecache
Date: Tue, 05 Nov 2024 10:00:59 +0800
Message-ID: <87jzdi782s.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <ZykOqYJpgL4lw7mw@PC2K9PVX.TheFacebook.com> (Gregory Price's message of "Mon, 4 Nov 2024 13:12:57 -0500")

Hi, Gregory,

Gregory Price <gourry@gourry.net> writes:

> On Mon, Sep 02, 2024 at 02:53:26PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>> 
>> > On Mon, Aug 19, 2024 at 03:46:00PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry@gourry.net> writes:
>> >> 
>> >> > Unmapped pagecache pages can be demoted to low-tier memory, but
>> >> > they can only be promoted if a process maps the pages into its
>> >> > address space (so that NUMA hint faults can be caught).  This can
>> >> > cause significant performance degradation as the pagecache ages
>> >> > and unmapped, cached files are accessed.
>> >> >
>> >> > This patch series enables the pagecache to request a promotion of
>> >> > a folio when it is accessed via the pagecache.
>> >> >
>> >> > We add a new `numa_hint_page_cache` counter in vmstat to capture
>> >> > information on when these migrations occur.
>> >> 
>> >> It appears that you will promote a page cache page on the second access.
>> >> Do you have a better way to distinguish hot pages from the not-so-hot
>> >> pages?  How do you balance between unmapped and mapped pages?  We already
>> >> have hot page selection for mapped pages.
>> >> 
>> >> [snip]
>> >> 
>> >
>> > I've since explored moving this down under a (referenced && active) check.
>> >
>> > This would be more like promotion on third access within an LRU shrink
>> > round (the LRU should, in theory, hack off the active bits on some decent
>> > time interval when the system is pressured).
>> >
>> > Barring adding new counters to folios to track hits, I don't see a clear
>> > and obvious way to track hotness.  The primary observation here is that
>> > pagecache is unmapped, and so cannot use NUMA-fault hints.
>> >
>> > This is more complicated with MGLRU, but I'm saving that for after I
>> > figure out the plan for plain old LRU.
>> 
>> Several years ago, we tried to use the access time tracking
>> mechanism of NUMA balancing to track the access latency of unmapped
>> file cache folios.  The original implementation is as follows,
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=5f2e64ce75c0322602c2ec8c70b64bb69b1f1329
>> 
>> What do you think about this?
>> 
>
> Coming back around to explore this topic a bit more, I dug into this old
> patch and the LRU patch by Keith - I'm struggling to find a good option
> that doesn't over-complicate things or propose something contentious.
>
>
> I browsed through lore and did not see any discussion on this patch
> or on Keith's LRU patch, so I presume discussion on this happened largely
> off-list.  If you have any context as to why this wasn't RFC'd officially,
> I would like more information.

Thanks for doing this.  There was not much discussion offline.  We just
didn't have enough time to work on a solution.

> My observations between these 3 proposals:
>
> - The page-lock state is complex when trying to interpose in mark_folio_accessed,
>   meaning inline promotion inside that interface is a non-starter.
>
>   We found one deadlock during task exit due to the PTL being held. 
>
>   This worries me more generally, but we did find some success changing certain
>   calls to mark_folio_accessed to mark_folio_accessed_and_promote - rather than
>   modifying mark_folio_accessed itself. This ends up changing code in similar
>   places to your hook - but catches more of the conditions that mark a page
>   accessed.
>
> - For Keith's proposal, promotion via the LRU requires memory pressure on the
>   lower tier to cause a shrink and therefore promotions. I'm not well versed
>   in LRU semantics, but it seems we could try proactive reclaim here.
>   
>   Doing promote-reclaim and demote/swap/evict reclaim on the same triggers
>   seems counter-intuitive.

IIUC, in the TPP paper (https://arxiv.org/abs/2206.02878), a similar method
is proposed for page promotion.  I guess that it works together with
proactive reclaim.
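
As a rough illustration only (this is not code from Keith's patch or from
the TPP implementation), the shrink-time check on a lower-tier node could
look something like the sketch below; the actual migration would happen
elsewhere, and should_promote_on_shrink() is a made-up name:

  #include <linux/mm.h>
  #include <linux/memory-tiers.h>

  static bool should_promote_on_shrink(struct folio *folio)
  {
          /* Only folios already sitting on a lower tier are candidates. */
          if (node_is_toptier(folio_nid(folio)))
                  return false;

          /*
           * Roughly the "third access" idea from earlier in the thread:
           * the folio has been activated and then referenced again.
           */
          return folio_test_active(folio) && folio_test_referenced(folio);
  }

With proactive reclaim on the lower tier, such a check would be exercised
regularly even without real memory pressure there.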

> - Doing promotions inline with access creates overhead.  I've seen some research
>   suggesting 60us+ per migration - so aggressiveness could harm performance.
>
>   Doing it async would alleviate the inline access overhead - but it could also
>   make promotion pointless if the time-to-promote is too far from the liveness
>   of the pages.

Async promotion needs to deal with resource (CPU/memory) charging too.
If you do some work on behalf of a task, you need to charge the consumed
resources to that task.
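
One way to keep the charging simple (just an idea, nothing like this is in
the series) would be to defer the migration to task_work, so the promotion
runs in, and is accounted to, the context of the task that touched the
folio.  A sketch, where promote_folio_to_node() is a placeholder for
whatever migration call ends up being used:

  #include <linux/mm.h>
  #include <linux/slab.h>
  #include <linux/task_work.h>

  struct promote_work {
          struct callback_head cb;
          struct folio *folio;
          int target_nid;
  };

  static void promote_work_fn(struct callback_head *cb)
  {
          struct promote_work *pw = container_of(cb, struct promote_work, cb);

          promote_folio_to_node(pw->folio, pw->target_nid);  /* placeholder */
          folio_put(pw->folio);
          kfree(pw);
  }

  static void queue_async_promotion(struct folio *folio, int nid)
  {
          struct promote_work *pw = kzalloc(sizeof(*pw), GFP_ATOMIC);

          if (!pw)
                  return;
          folio_get(folio);               /* hold the folio until promoted */
          pw->folio = folio;
          pw->target_nid = nid;
          init_task_work(&pw->cb, promote_work_fn);
          /* Runs on return to user space, so the CPU time hits current. */
          if (task_work_add(current, &pw->cb, TWA_RESUME)) {
                  folio_put(folio);
                  kfree(pw);
          }
  }

This only addresses the CPU side; charging the memory of the migrated
folio would still need separate handling.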

> - Doing async promotion may also require something like PG_PROMOTABLE (as proposed
>   in Keith's patch), which will obviously be a very contentious topic.

Some additional data structure could be used to record the pages instead.
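
For example (purely illustrative, none of these names exist), a small
per-node queue of candidates would avoid consuming a page flag, at the
cost of one allocation per candidate:

  #include <linux/list.h>
  #include <linux/mm.h>
  #include <linux/slab.h>
  #include <linux/spinlock.h>

  struct promotion_candidate {
          struct list_head list;
          struct folio *folio;
  };

  struct promotion_queue {
          spinlock_t lock;
          struct list_head head;
  };

  static void promotion_queue_add(struct promotion_queue *pq,
                                  struct folio *folio)
  {
          struct promotion_candidate *pc = kmalloc(sizeof(*pc), GFP_ATOMIC);

          if (!pc)
                  return;
          folio_get(folio);       /* hold a reference until promoted */
          pc->folio = folio;
          spin_lock(&pq->lock);
          list_add_tail(&pc->list, &pq->head);
          spin_unlock(&pq->lock);
  }

A kthread or workqueue could then drain the queue, re-check whether each
folio is still worth promoting, and drop the reference either way.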

> tl;dr: I'm leaning towards a solution like you have here, but we may need to
> add a sysfs switch similar to demotion_enabled in case of poor performance due
> to heuristically degenerate access patterns, and we may need to expose some
> form of adjustable aggressiveness value to make it tunable.

Yes.  We may need that, because the performance benefit may be lower
than the overhead introduced.
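
The switch itself could follow the existing /sys/kernel/mm/numa/demotion_enabled
pattern almost verbatim; a sketch, with pagecache_promotion_enabled as a
placeholder name:

  #include <linux/cache.h>
  #include <linux/kernel.h>
  #include <linux/kobject.h>
  #include <linux/sysfs.h>

  static bool pagecache_promotion_enabled __read_mostly;

  static ssize_t pagecache_promotion_enabled_show(struct kobject *kobj,
                  struct kobj_attribute *attr, char *buf)
  {
          return sysfs_emit(buf, "%s\n",
                            pagecache_promotion_enabled ? "true" : "false");
  }

  static ssize_t pagecache_promotion_enabled_store(struct kobject *kobj,
                  struct kobj_attribute *attr, const char *buf, size_t count)
  {
          int ret = kstrtobool(buf, &pagecache_promotion_enabled);

          return ret ? ret : count;
  }

  static struct kobj_attribute pagecache_promotion_enabled_attr =
          __ATTR_RW(pagecache_promotion_enabled);

An aggressiveness knob could sit next to it, similar to how NUMA balancing
already rate-limits promotion with numa_balancing_promote_rate_limit_MBps.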

> Reading more into the code surrounding this and other migration logic, I also
> think we should explore an optimization to mempolicy that tries to aggressively
> keep certain classes of memory on the local node (RX memory and stack
> for example).
>
> Other areas of reclaim try to actively prevent demoting this type of memory, so we
> should try not to allocate it there in the first place.

We already use a DRAM-first allocation policy.  So, we need to
measure its effect first.
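
For reference, the kind of VMA classification being suggested might look
roughly like the sketch below (illustrative only; no such helper exists,
and deciding which mappings qualify is the actual policy question):

  #include <linux/mm.h>

  /* Should allocations for this mapping stay on the local top tier? */
  static bool vma_prefers_local_alloc(struct vm_area_struct *vma)
  {
          /* Executable mappings: program text and shared libraries. */
          if (vma->vm_flags & VM_EXEC)
                  return true;

          /* Stack VMAs (VM_GROWSUP is 0 where stacks only grow down). */
          if (vma->vm_flags & (VM_GROWSDOWN | VM_GROWSUP))
                  return true;

          return false;
  }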

--
Best Regards,
Huang, Ying

