linux-mm.kvack.org archive mirror
From: Yuanchu Xie <yuanchu@google.com>
To: gourry@gourry.net
Cc: David Hildenbrand <david@redhat.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
	 Khalid Aziz <khalid.aziz@oracle.com>,
	Henry Huang <henry.hj@antgroup.com>,  Yu Zhao <yuzhao@google.com>,
	Dan Williams <dan.j.williams@intel.com>,
	 Gregory Price <gregory.price@memverge.com>,
	Huang Ying <ying.huang@intel.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Lance Yang <ioworker0@gmail.com>,
	 Randy Dunlap <rdunlap@infradead.org>,
	Muhammad Usama Anjum <usama.anjum@collabora.com>,
	 Kalesh Singh <kaleshsingh@google.com>,
	Wei Xu <weixugc@google.com>,
	 David Rientjes <rientjes@google.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	 "Rafael J. Wysocki" <rafael@kernel.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	 Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	 Shuah Khan <shuah@kernel.org>,
	Yosry Ahmed <yosryahmed@google.com>,
	 Matthew Wilcox <willy@infradead.org>,
	Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>,
	 Kairui Song <kasong@tencent.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	 Vasily Averin <vasily.averin@linux.dev>,
	Nhat Pham <nphamcs@gmail.com>,  Miaohe Lin <linmiaohe@huawei.com>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	 Abel Wu <wuyun.abel@bytedance.com>,
	"Vishal Moola (Oracle)" <vishal.moola@gmail.com>,
	 Kefeng Wang <wangkefeng.wang@huawei.com>,
	linux-kernel@vger.kernel.org,  linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org
Subject: Re: [PATCH v3 0/7] mm: workingset reporting
Date: Mon, 26 Aug 2024 16:43:01 -0700
Message-ID: <CAJj2-QGtvvrhjH_h1wL3FCg4HgZU27rqxSCDZzPws81yPK_DvQ@mail.gmail.com>
In-Reply-To: <ZsSTdY5hsv05jcj-@PC2K9PVX.TheFacebook.com>

On Tue, Aug 20, 2024 at 6:00 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, of which coldness can be tracked by accessed bits and fd
> > references. However, the concept of workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or the userspace. IMO, the
> > kernel should provide a set of workingset interfaces that should be
> > generic enough to accommodate the various use cases, and be extensible
> > to potential future use cases. The current proposed interfaces are not
> > sufficient in that regard, but I would like to start somewhere, solicit
> > feedback, and iterate.
> >
> ... snip ...
> > Use cases
> > ==========
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, suppose the promotion hot page threshold is defined as a
> > reaccess distance of N seconds (promote pages accessed more often than
> > every N seconds). The threshold N should be set so that ~80% (e.g.) of
> > pages on the fast memory node pass the threshold. This calculation can
> > be done with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea here is we break down the workingset per-node per-memcg into time
> > intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
> > I realize this does not generalize well to hotness information, but I
> > lack the intuition for an abstraction that presents hotness in a useful
> > way. Based on a recent proposal for move_phys_pages[2], it seems like
> > userspace tiering software would like to move specific physical pages,
> > instead of informing the kernel "move x number of hot pages to y
> > device". Please advise.
> >
> > [2]
> > https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> >
>
> Just as a note on this work, this is really a testing interface.  The
> end-goal is not to merge such an interface that is user-facing like
> move_phys_pages, but instead to have something like a triggered kernel
> task that has a directive of "Promote X pages from Device A".
>
> This work is more of an open collaboration for prototyping such that we
> don't have to plumb it through the kernel from the start and assess the
> usefulness of the hardware hotness collection mechanism.

Understood. I think we previously had this exchange and I forgot to
remove the mentions from the cover letter.
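
As a rough illustration of the threshold calculation in the quoted cover
letter, here is a minimal userspace sketch in Python. It assumes each report
line has the form "<interval upper bound in ms> anon=<amount> file=<amount>",
as in the quoted example, and that each amount covers only the pages whose age
falls in that interval. The names Bin, parse_report and threshold_ms are
hypothetical and not part of the patch series; the amounts are treated as
opaque counts.

```python
from typing import List, NamedTuple, Optional


class Bin(NamedTuple):
    upper_ms: int  # upper bound of the age interval, in milliseconds
    anon: int      # anon amount reported for this interval
    file: int      # file amount reported for this interval


# Sample report copied from the quoted cover letter.
SAMPLE = """\
1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
"""


def parse_report(text: str) -> List[Bin]:
    """Parse lines of the form '<ms> anon=<n> file=<n>' into bins."""
    bins = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        counts = dict(field.split("=", 1) for field in fields[1:])
        bins.append(Bin(int(fields[0]),
                        int(counts.get("anon", 0)),
                        int(counts.get("file", 0))))
    return bins


def threshold_ms(bins: List[Bin], fraction: float) -> Optional[int]:
    """Return the smallest interval bound covering `fraction` of the total.

    Returns None if the report is empty or sums to zero.
    """
    total = sum(b.anon + b.file for b in bins)
    if total == 0:
        return None
    running = 0
    for b in sorted(bins, key=lambda b: b.upper_ms):
        running += b.anon + b.file
        if running >= fraction * total:
            return b.upper_ms
    return None
```

With the sample above, the first four intervals hold only about 20% of the
total, so threshold_ms(parse_report(SAMPLE), 0.8) lands in the final catch-all
interval. In that situation the intervals would need to be configured more
finely before a useful N could be read off.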

>
> ---
>
> More generally on promotion, I have been considering recently a problem
> with promoting unmapped pagecache pages - since they are not subject to
> NUMA hint faults.  I started looking at PG_accessed and PG_workingset as
> a potential mechanism to trigger promotion - but I'm starting to see a
> pattern of competing priorities between reclaim (LRU/MGLRU) logic and
> promotion logic.

In this case, IMO hardware support would be good, as it could tell
the kernel exactly which pages are hot, regardless of whether a page
is mapped. I recall there being a CXL proposal on this, but I'm not
sure whether it has settled into a standard yet.

>
> Reclaim is triggered largely under memory pressure - which means co-opting
> reclaim logic for promotion is at best logically confusing, and at worst
> likely to introduce regressions.  The LRU/MGLRU logic is written largely
> for reclaim, not promotion.  This makes hacking promotion in after the
> fact rather dubious - the design choices don't match.
>
> One example: if a page moves from inactive->active (or old->young), we
> could treat this as a page "becoming hot" and mark it for promotion, but
> this potentially punishes pages on the "active/younger" lists which are
> themselves hotter.

To avoid punishing pages on the "young" list, one could insert the
page into a "less young" generation, but it would be difficult to have
a fixed policy for this in the kernel, so it may be best for this to
be configurable via BPF. One could insert the page in the middle of
the active/inactive list, but that would in effect create multiple
generations.
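
To make the "less young" idea concrete, here is a toy userspace model in
Python. It is not MGLRU code: the Generations class and the policy functions
are hypothetical, and the policy callback only stands in for whatever a
BPF-configurable hook might decide. Generation 0 is the youngest.

```python
from collections import deque
from typing import Callable, Deque, List

# A policy maps the number of generations to a target index (0 = youngest).
Policy = Callable[[int], int]


def youngest(num_gens: int) -> int:
    """Place the promoted page alongside the hottest pages."""
    return 0


def less_young(num_gens: int) -> int:
    """Place the promoted page one generation behind the youngest."""
    return min(1, num_gens - 1)


def middle(num_gens: int) -> int:
    """Place the promoted page in the middle generation."""
    return num_gens // 2


class Generations:
    """Toy stand-in for a set of LRU generations, youngest first."""

    def __init__(self, num_gens: int) -> None:
        if num_gens < 1:
            raise ValueError("need at least one generation")
        self.gens: List[Deque[str]] = [deque() for _ in range(num_gens)]

    def insert(self, page: str, policy: Policy) -> int:
        """Insert `page` where `policy` says and return the index used."""
        idx = policy(len(self.gens))
        idx = max(0, min(idx, len(self.gens) - 1))  # clamp bad policy output
        self.gens[idx].appendleft(page)
        return idx
```

In this model, inserting a promoted page with less_young leaves generation 0
untouched, so the pages already there are not displaced. Which policy is right
depends on the workload, which is the argument for making it configurable.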

>
> I'm starting to think separate demotion/reclaim and promotion components
> are warranted. This could take the form of a separate kernel worker that
> occasionally gets scheduled to manage a promotion list, or even the
> addition of a PG_promote flag to decouple reclaim and promotion logic
> completely.  Separating the structures entirely would be good to allow
> both demotion/reclaim and promotion to occur concurrently (although this
> seems problematic under memory pressure).
>
> Would like to know your thoughts here.  If we can decide to segregate
> promotion and demotion logic, it might go a long way to simplify the
> existing interfaces and formalize transactions between the two.

The two systems still have to interact, so separating the two would
essentially create a new policy that decides whether the
demotion/reclaim or the promotion policy is in effect. If promotion
could figure out where to insert the page in terms of generations,
wouldn't that be simpler?

>
> (also if you're going to LPC, might be worth a chat in person)

I cannot make it to LPC. :( Sadness

Yuanchu


Thread overview: 14+ messages
2024-08-13 16:56 Yuanchu Xie
2024-08-13 16:56 ` [PATCH v3 1/7] mm: aggregate working set information into histograms Yuanchu Xie
2024-08-13 16:56 ` [PATCH v3 2/7] mm: use refresh interval to rate-limit workingset report aggregation Yuanchu Xie
2024-08-13 16:56 ` [PATCH v3 3/7] mm: report workingset during memory pressure driven scanning Yuanchu Xie
2024-08-13 16:56 ` [PATCH v3 4/7] mm: extend working set reporting to memcgs Yuanchu Xie
2024-08-13 16:56 ` [PATCH v3 5/7] mm: add kernel aging thread for workingset reporting Yuanchu Xie
2024-08-13 16:56 ` [PATCH v3 6/7] selftest: test system-wide " Yuanchu Xie
2024-08-13 16:56 ` [PATCH v3 7/7] Docs/admin-guide/mm/workingset_report: document sysfs and memcg interfaces Yuanchu Xie
2024-08-13 18:23   ` Waiman Long
2024-08-13 23:45   ` Randy Dunlap
2024-08-13 18:33 ` [PATCH v3 0/7] mm: workingset reporting Andrew Morton
2024-08-16  3:14   ` David Rientjes
2024-08-20 13:00 ` Gregory Price
2024-08-26 23:43   ` Yuanchu Xie [this message]
