From: Mel Gorman <mgorman@suse.de>
To: Michal Hocko <mhocko@kernel.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linux Memory Management List <linux-mm@kvack.org>,
	kvm@vger.kernel.org, LKML <linux-kernel@vger.kernel.org>,
	Fan Du <fan.du@intel.com>, Yao Yuan <yuan.yao@intel.com>,
	Peng Dong <dongx.peng@intel.com>,
	Huang Ying <ying.huang@intel.com>,
	Liu Jingqi <jingqi.liu@intel.com>,
	Dong Eddie <eddie.dong@intel.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Zhang Yi <yi.z.zhang@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>
Subject: Re: [RFC][PATCH v2 00/21] PMEM NUMA node and hotness accounting/migration
Date: Thu, 3 Jan 2019 10:57:17 +0000	[thread overview]
Message-ID: <20190103105717.GI28934@suse.de> (raw)
In-Reply-To: <20181228195224.GY16738@dhcp22.suse.cz>

On Fri, Dec 28, 2018 at 08:52:24PM +0100, Michal Hocko wrote:
> [Ccing Mel and Andrea]
> 
> On Fri 28-12-18 21:31:11, Wu Fengguang wrote:
> > > > > I haven't looked at the implementation yet but if you are proposing a
> > > > > special cased zone lists then this is something CDM (Coherent Device
> > > > > Memory) was trying to do two years ago and there was quite some
> > > > > skepticism in the approach.
> > > > 
> > > > It looks we are pretty different than CDM. :)
> > > > We creating new NUMA nodes rather than CDM's new ZONE.
> > > > The zonelists modification is just to make PMEM nodes more separated.
> > > 
> > > Yes, this is exactly what CDM was after. Have a zone which is not
> > > reachable without explicit request AFAIR. So no, I do not think you are
> > > too different, you just use a different terminology ;)
> > 
> > Got it. OK.. The fall back zonelists patch does need more thoughts.
> > 
> > In long term POV, Linux should be prepared for multi-level memory.
> > Then there will arise the need to "allocate from this level memory".
> > So it looks good to have separated zonelists for each level of memory.
> 
> Well, I do not have a good answer for you here. We do not have good
> experiences with those systems, I am afraid. NUMA is with us for more
> than a decade yet our APIs are coarse to say the least and broken at so
> many times as well. Starting a new API just based on PMEM sounds like a
> ticket to another disaster to me.
> 
> I would like to see solid arguments why the current model of numa nodes
> with fallback in distances order cannot be used for those new
> technologies in the beginning and develop something better based on our
> experiences that we gain on the way.
> 
> I would be especially interested about a possibility of the memory
> migration idea during a memory pressure and relying on numa balancing to
> resort the locality on demand rather than hiding certain NUMA nodes or
> zones from the allocator and expose them only to the userspace.
> 

I didn't read the thread as I'm backlogged, as I imagine a lot of people
are. However, I would agree that zonelists are not a good fit for
something like PMEM-backed memory being made available via a zonelist
with a fake distance, combined with NUMA balancing moving pages in and
out of DRAM and PMEM. The same applies, to a much lesser extent, to
something like a special higher-speed memory that is faster than RAM.

The fundamental problem encountered will be a hot-page-inversion issue.
In the PMEM case, DRAM fills, then PMEM starts filling, except now we
know that the most recently allocated page, which is potentially the
most important in terms of hotness, is allocated on slower "remote"
memory. Reclaim then kicks in for the DRAM node, hotness becomes
interleaved between DRAM and PMEM, and NUMA balancing gets involved
with non-deterministic performance.
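
To make the inversion visible from userspace, a quick illustration (not
part of the series; it assumes libnuma is installed and that the slower
PMEM node shows up as a separate node ID) is to touch a large buffer
and ask move_pages(2) where each page actually landed once the DRAM
node is full:

/* Illustrative only: query where recently allocated pages ended up.
 * Assumes the slower PMEM-backed memory appears as its own NUMA node.
 * Build with: gcc -o where where.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	long page_sz = sysconf(_SC_PAGESIZE);
	size_t npages = 1024;
	char *buf = aligned_alloc(page_sz, npages * page_sz);
	void **pages = calloc(npages, sizeof(void *));
	int *status = calloc(npages, sizeof(int));

	if (!buf || !pages || !status)
		return 1;

	/* Touch every page so it is actually allocated somewhere. */
	memset(buf, 1, npages * page_sz);
	for (size_t i = 0; i < npages; i++)
		pages[i] = buf + i * page_sz;

	/* nodes == NULL asks the kernel which node each page lives on. */
	if (move_pages(0, npages, pages, NULL, status, 0) == 0) {
		for (size_t i = 0; i < npages; i++)
			printf("page %zu -> node %d\n", i, status[i]);
	} else {
		perror("move_pages");
	}
	return 0;
}

If DRAM were already full, the newest (hottest) pages would report the
PMEM node's ID, which is exactly the inversion described above.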

I recognise that the same problem happens for remote NUMA nodes, and it
also has an inversion issue once reclaim gets involved, but there is a
clearly defined API for dealing with that problem if applications
encounter it. How to cope with it is also relatively well known, given
the age of the problem. It's less clear whether applications would be
able to cope if it's a more distant PMEM instead of a remote DRAM, and
how that should be advertised.
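
For reference, the clearly defined API in the remote DRAM case is the
mempolicy interface (mbind/set_mempolicy, or numactl on top of them).
A minimal sketch, assuming node 0 is the fast local DRAM node the
application wants to stay on:

/* Minimal sketch of the existing mempolicy API an application can use
 * today to keep an allocation on a known-fast node (here assumed to be
 * node 0).  Build with: gcc -o bind bind.c -lnuma
 */
#define _GNU_SOURCE
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	size_t len = 64UL << 20;	/* 64MB region */
	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long nodemask = 1UL << 0;	/* node 0 only */

	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Hard-bind the region to node 0: allocations reclaim or fail
	 * rather than silently spilling to a slower node. */
	if (mbind(addr, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0)) {
		perror("mbind");
		return 1;
	}

	memset(addr, 0, len);	/* pages now faulted in on node 0 */
	return 0;
}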

This has been brought up repeatedly over the last few years since
high-speed memory was first mentioned, but I think long-term what we
should be thinking of is "age-based migration", where cold pages from
DRAM get migrated to PMEM when DRAM fills, and NUMA balancing is used
to promote hot pages from PMEM to DRAM. It should also be workable for
remote DRAM, although that *might* violate the principle of least
surprise given that applications exist that are remote-NUMA-aware. It
might be safer overall if such age-based migration is specific to
local-but-different-speed memory, with only the main DRAM being in the
zonelists. NUMA balancing could still optionally promote from
DRAM->faster memory while aging moves pages from fast->slow as memory
pressure dictates.
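
To make the intended demote/promote flow concrete, here is a toy
userspace model of the policy. It is purely illustrative: the tier
sizes and access pattern are made up, and no real NUMA or migration
interfaces are involved.

/* Toy model of the age-based policy sketched above: new and hot pages
 * live in a small "fast" tier, and when it fills the coldest page
 * (largest age since last access) is demoted to the "slow" tier; an
 * access to a slow page promotes it back, mimicking what NUMA
 * balancing hint faults would do.
 */
#include <stdio.h>

#define NPAGES		8
#define FAST_CAPACITY	4

enum tier { FAST, SLOW };

static enum tier tier[NPAGES];
static int age[NPAGES];		/* ticks since last access */
static int fast_used;

/* Demote the coldest fast page to make room in the fast tier. */
static void demote_coldest(void)
{
	int victim = -1;

	for (int i = 0; i < NPAGES; i++)
		if (tier[i] == FAST && (victim < 0 || age[i] > age[victim]))
			victim = i;
	tier[victim] = SLOW;
	fast_used--;
	printf("  demote page %d (age %d) -> slow\n", victim, age[victim]);
}

/* An access resets the age and, if needed, promotes the page. */
static void access_page(int i)
{
	age[i] = 0;
	if (tier[i] == SLOW) {
		if (fast_used == FAST_CAPACITY)
			demote_coldest();
		tier[i] = FAST;
		fast_used++;
		printf("  promote page %d -> fast\n", i);
	}
}

int main(void)
{
	/* Start with everything in the slow tier, then run a skewed
	 * access pattern: pages 0-3 are hot, the rest are touched once. */
	for (int i = 0; i < NPAGES; i++)
		tier[i] = SLOW;

	int pattern[] = { 0, 1, 2, 3, 4, 0, 1, 5, 2, 3, 6, 0, 7, 1 };
	for (unsigned t = 0; t < sizeof(pattern) / sizeof(pattern[0]); t++) {
		printf("tick %u: access page %d\n", t, pattern[t]);
		for (int i = 0; i < NPAGES; i++)
			age[i]++;	/* everyone ages each tick */
		access_page(pattern[t]);
	}

	printf("final: fast tier holds:");
	for (int i = 0; i < NPAGES; i++)
		if (tier[i] == FAST)
			printf(" %d", i);
	printf("\n");
	return 0;
}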

There would still need to be thought on exactly how this is advertised
to userspace because, while "distance" is reasonably well understood,
it's not as clear to me whether distance is appropriate for describing
"local-but-different-speed" memory, given that accessing a remote
NUMA node can saturate a single link whereas the same may not
be true of local-but-different-speed memory, which probably has
dedicated channels. In an ideal world, application developers
interested in higher-speed-memory-reserved-for-important-use and
cheaper-lower-speed-memory could describe what sort of application
modifications they'd be willing to do, but that might be unlikely.
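
For context, what the kernel advertises today is the SLIT-derived
distance vector per node under sysfs, which a program can read
directly (a minimal sketch, assuming node0 exists):

/* Print the distance vector the kernel currently advertises for node 0.
 * Whether a single scalar like this can also describe
 * local-but-different-speed memory is the open question above.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/devices/system/node/node0/distance", "r");
	char line[256];

	if (!f) {
		perror("node0/distance");
		return 1;
	}
	if (fgets(line, sizeof(line), f))
		printf("node0 distances: %s", line);
	fclose(f);
	return 0;
}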

-- 
Mel Gorman
SUSE Labs
