linux-mm.kvack.org archive mirror
From: Michal Hocko <mhocko@kernel.org>
To: Yang Shi <yang.shi@linux.alibaba.com>
Cc: mgorman@techsingularity.net, riel@surriel.com,
	hannes@cmpxchg.org, akpm@linux-foundation.org,
	dave.hansen@intel.com, keith.busch@intel.com,
	dan.j.williams@intel.com, fengguang.wu@intel.com,
	fan.du@intel.com, ying.huang@intel.com, ziy@nvidia.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node
Date: Tue, 16 Apr 2019 09:47:14 +0200	[thread overview]
Message-ID: <20190416074714.GD11561@dhcp22.suse.cz> (raw)
In-Reply-To: <a68137bb-dcd8-4e4a-b3a9-69a66f9dccaf@linux.alibaba.com>

On Mon 15-04-19 17:09:07, Yang Shi wrote:
> 
> 
> On 4/12/19 1:47 AM, Michal Hocko wrote:
> > On Thu 11-04-19 11:56:50, Yang Shi wrote:
> > [...]
> > > Design
> > > ======
> > > Basically, the approach aims to spread data from DRAM (closest to the
> > > local CPU) down to PMEM and disk (each lower tier typically assumed to
> > > be slower, larger and cheaper than the one above) according to hotness.
> > > The patchset tries to achieve this by doing memory promotion/demotion
> > > via NUMA balancing and memory reclaim, as the diagram below shows:
> > > 
> > >      DRAM <--> PMEM <--> Disk
> > >        ^                   ^
> > >        |-------------------|
> > >                 swap
> > > 
> > > When DRAM is under memory pressure, pages are demoted to PMEM via the
> > > page reclaim path.  NUMA balancing then promotes a page back to DRAM
> > > once it is referenced again.  Memory pressure on the PMEM node pushes
> > > its inactive pages out to disk via swap.
> > > 
> > > Promotion/demotion happens only between "primary" nodes (nodes that
> > > have both CPU and memory) and PMEM nodes.  There is no promotion or
> > > demotion between PMEM nodes, no promotion from DRAM to PMEM, and no
> > > demotion from PMEM to DRAM.
> > > 
> > > The HMAT is effectively going to enforce "cpu-less" nodes for any memory range
> > > that has differentiated performance from the conventional memory pool, or
> > > differentiated performance for a specific initiator, per Dan Williams.  So,
> > > assuming PMEM nodes are cpuless nodes sounds reasonable.
> > > 
> > > However, cpuless nodes might not be PMEM nodes.  But memory
> > > promotion/demotion doesn't actually care what kind of memory the
> > > target nodes are; they could be DRAM, PMEM or something else, as long
> > > as they are second-tier memory (slower, larger and cheaper than
> > > regular DRAM), otherwise such demotion sounds pointless.
> > > 
> > > An "N_CPU_MEM" nodemask is defined for the nodes that have both CPU
> > > and memory, to distinguish them from cpuless nodes (memory only, e.g.
> > > PMEM nodes) and memoryless nodes (some architectures, e.g. Power, may
> > > have memoryless nodes).  Memory allocation typically happens on
> > > N_CPU_MEM nodes by default unless cpuless nodes are specified
> > > explicitly; cpuless nodes are just fallback nodes, so N_CPU_MEM nodes
> > > are also known as "primary" nodes in this patchset.  With a two-tier
> > > memory system (i.e. DRAM + PMEM), this is good enough to demonstrate
> > > the promotion/demotion approach for now, and it is more
> > > architecture-independent.  But it may be better to construct such a
> > > node mask by reading hardware information (e.g. HMAT), particularly
> > > for more complex memory hierarchies.
> > I still believe you are overcomplicating this without a strong reason.
> > Why cannot we start simple and build from there? In other words I do not
> > think we really need anything like N_CPU_MEM at all.
> 
> In this patchset N_CPU_MEM is used to tell us which nodes are cpuless
> nodes.  Those would be the preferred demotion targets.  Of course, we
> could rely on firmware to just demote to the next best node, but that
> may be a "preferred" node; if so, I don't see much benefit from
> demotion.  Am I missing anything?

Why can't we simply demote in proximity order? Why do you make cpuless
nodes so special? If other close nodes are vacant, then just use them.
 
> > I would expect that the very first attempt wouldn't do much more than
> > migrate to-be-reclaimed pages (without an explicit binding) with a
> 
> Do you mean respecting mempolicy or cpuset when doing demotion?  I was
> wondering about this, but I didn't do it in the current implementation
> since it may need to walk the rmap to retrieve the mempolicy in the
> reclaim path.  Is there an easier way to do so?

You definitely have to follow the policy. You cannot demote to a node
which is outside of the cpuset/mempolicy, because you would be breaking
the contract expected by userspace. That implies doing an rmap walk.

> > I would also not touch the numa balancing logic at this stage and rather
> > see how the current implementation behaves.
> 
> I agree we would prefer start from something simpler and see how it works.
> 
> The "twice access" optimization is aimed at reducing the PMEM bandwidth
> burden, since PMEM bandwidth is a scarce resource.  I did compare "twice
> access" to "no twice access", and it saves a lot of bandwidth for some
> once-off access patterns.  For example, when running a stress test with
> mmtests' usemem-stress-numa-compact, the kernel promoted ~600,000 pages
> with "twice access" in 4 hours, but ~80,000,000 pages without it.

I presume this is the result of a synthetic workload, right? Or do you
have any numbers for a real-life use case?
-- 
Michal Hocko
SUSE Labs

