linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Arjan Van De Ven <arjan@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Vlastimil Babka <vbabka@suse.cz>,
	Johannes Weiner <jweiner@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [RFC 0/6] mm: improve page allocator scalability via splitting zones
Date: Wed, 17 May 2023 10:09:31 +0200	[thread overview]
Message-ID: <eae68813-4240-4de1-6177-0a44e00bd04d@redhat.com> (raw)
In-Reply-To: <87bkij7ncn.fsf@yhuang6-desk2.ccr.corp.intel.com>

>> If we could avoid instantiating more zones and rather improve existing
>> mechanisms (PCP), that would be much more preferred IMHO. I'm sure
>> it's not easy, but that shouldn't stop us from trying ;)
> 
> I do think improving PCP or adding another level of cache will help
> performance and scalability.
> 
> And, I think that it has value too to improve the performance of zone
> itself.  Because there will be always some cases that the zone lock
> itself is contended.
> 
> That is, PCP and zone works at different level, and both deserve to be
> improved.  Do you agree?

Spoiler: my humble opinion


Well, the zone is kind-of your "global" memory provider, and PCPs cache 
a fraction of that precisely to avoid having to mess with that global 
data structure and its lock contention.

One benefit I can see of such a "global" memory provider with caches on 
top is that it is nicely integrated: for example, the concept of 
memory pressure exists for the zone as a whole. All memory is of the 
same kind and managed in a single entity, but free memory is cached for 
performance.
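That split ("global" locked pool, small lock-free per-CPU caches that batch refills) can be sketched as a toy model. All names and constants here are invented for illustration; this is not the kernel's actual zone/PCP code:

```c
#include <assert.h>
#include <pthread.h>

/* Toy model: one "global" zone protected by a lock, with per-CPU page
 * caches (PCPs) in front of it.  Illustrative only, not kernel code. */

#define PCP_BATCH 8
#define NCPUS 4

struct toy_zone {
	pthread_mutex_t lock;	/* the contended "zone lock" */
	long nr_free;		/* single global view of free memory */
};

struct toy_pcp {
	long count;		/* pages cached locally, used without the zone lock */
};

static struct toy_zone zone = { PTHREAD_MUTEX_INITIALIZER, 1024 };
static struct toy_pcp pcp[NCPUS];

/* Allocate one page on behalf of @cpu: hit the per-CPU cache first and
 * only take the zone lock to refill a whole batch at once. */
static int toy_alloc_page(int cpu)
{
	struct toy_pcp *p = &pcp[cpu];

	if (p->count == 0) {
		pthread_mutex_lock(&zone.lock);
		long batch = zone.nr_free < PCP_BATCH ? zone.nr_free : PCP_BATCH;
		zone.nr_free -= batch;
		pthread_mutex_unlock(&zone.lock);
		p->count = batch;
	}
	if (p->count == 0)
		return -1;	/* global pool exhausted: pressure is visible globally */
	p->count--;
	return 0;
}
```

The point of the model: `zone.nr_free` is the one place where "memory pressure" is visible, while most allocations never touch the lock at all.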

As soon as you manage memory of the same kind in multiple zones, you 
lose that "global" view: it is all memory of the same kind, but 
managed in different buckets. You might end up with a lot of memory 
pressure in one such zone, but still have plenty in another zone.

As one example, hot(un)plug of memory is easy today: there is only a 
single zone. No need to make smart decisions or deal with memory we're 
hotunplugging being stranded across multiple zones.

> 
>> I did not look into the details of this proposal, but seeing the
>> change in include/linux/page-flags-layout.h scares me.
> 
> It's possible for us to use 1 more bit in page->flags.  Do you think
> that will cause severe issue?  Or you think some other stuff isn't
> acceptable?

The issue is, everybody wants to consume more bits in page->flags, so if 
we can get away without it that would be much better :)

The more bits you want to consume, the more people will ask for making 
this a compile-time option and eventually compile it out on distro 
kernels (e.g., with many NUMA nodes). So we end up with more code and 
complexity and eventually not get the benefits where we really want them.
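Why one more bit matters can be made concrete with a back-of-the-envelope sketch of the `page->flags` word. The widths below are illustrative placeholders, not the real values from `include/linux/page-flags-layout.h` (which depend on the memory model and config):

```c
#include <assert.h>

/* Illustrative bit budget for a 64-bit page->flags word.  Every field
 * below competes for the same word; the widths are invented examples,
 * not the kernel's actual configuration-dependent values. */

#define FLAGS_WIDTH	64
#define SECTION_WIDTH	 0	/* e.g. 0 with sparsemem-vmemmap */
#define NODE_WIDTH	10	/* e.g. up to 1024 NUMA nodes */
#define ZONE_WIDTH	 3	/* zone types */
#define LRU_GEN_WIDTH	 3	/* e.g. multi-gen LRU generations */
#define PAGE_FLAG_BITS	27	/* PG_locked, PG_dirty, ... */

/* Bits left over for every future feature that wants one: */
enum {
	SPARE_BITS = FLAGS_WIDTH - SECTION_WIDTH - NODE_WIDTH
		   - ZONE_WIDTH - LRU_GEN_WIDTH - PAGE_FLAG_BITS
};
```

Under these example widths the spare budget is small, and it shrinks further on configs with more NUMA nodes or sections encoded in the word, which is why each extra consumer tends to end up behind a compile-time option.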

> 
>> Further, I'm not so sure how that change really interacts with
>> hot(un)plug of memory ... on a quick glimpse I feel like this series
>> hacks the code such that the split works based on the boot
>> memory size ...
> 
> Em..., the zone stuff is kind of static now.  It's hard to add a zone at
> run-time.  So, in this series, we determine the number of zones per zone
> type based on boot memory size.  This may be improved in the future via
> pre-allocate some empty zone instances during boot and hot-add some
> memory to these zones.

Just to give you some idea: with virtio-mem, hyper-v, daxctl, and 
upcoming cxl dynamic memory pooling (some day I'm sure ;) ) you might 
see quite a small boot memory (e.g., 4 GiB) but a significant amount of 
memory getting hotplugged incrementally (e.g., up to 1 TiB) -- well, and 
hotunplugged. With multiple zone instances you really have to be careful 
and might have to re-balance between them to preserve the scalability 
and avoid creating imbalances between the zones ...

Something like PCP auto-tuning would be able to handle that mostly 
automatically, as there is only a single memory pool.
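One hypothetical shape such auto-tuning could take: grow the per-CPU batch when refills are frequent (i.e. the zone lock is being taken often), shrink it when the cache sits idle. The policy, names, and constants below are invented for illustration; this is not an existing kernel mechanism:

```c
#include <assert.h>

/* Hypothetical PCP batch auto-tuning sketch.  Adapt the number of pages
 * moved per zone-lock acquisition to the observed demand on this CPU.
 * Invented policy, not the kernel's actual PCP code. */

#define BATCH_MIN   1
#define BATCH_MAX 512

struct toy_pcp_tune {
	int batch;	/* pages moved per zone-lock acquisition */
	int refills;	/* refills observed in the current interval */
};

/* Called periodically (e.g. from a timer): re-size the batch. */
static void toy_pcp_autotune(struct toy_pcp_tune *p)
{
	if (p->refills > 8 && p->batch < BATCH_MAX)
		p->batch *= 2;	/* hot CPU: amortize the zone lock further */
	else if (p->refills == 0 && p->batch > BATCH_MIN)
		p->batch /= 2;	/* idle CPU: give memory back to the global view */
	p->refills = 0;
}
```

Because there is only one global pool, re-sizing a batch never strands memory: pages not cached per-CPU are simply visible to everyone again.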

> 
>> I agree with Michal that looking into auto-tuning PCP would be
>> preferred. If that can't be done, adding another layer might end up
>> cleaner and eventually cover more use cases.
> 
> I do agree that it's valuable to make PCP etc. cover more use cases.  I
> just think that this should not prevent us from optimizing zone itself
> to cover remaining use cases.

I really don't like the concept of replicating zones of the same kind 
for the same NUMA node. But that's just my personal opinion, as someone 
maintaining some memory hot(un)plug code :)

Having said that, some kind of a sub-zone concept (additional layer) as 
outlined by Michal IIUC, for example, indexed by core id/hash/whatsoever, 
could eventually be worth exploring. Yes, such a design raises various 
questions ... :)

-- 
Thanks,

David / dhildenb



Thread overview: 21+ messages
2023-05-11  6:56 Huang Ying
2023-05-11  6:56 ` [RFC 1/6] mm: distinguish zone type and zone instance explicitly Huang Ying
2023-05-11  6:56 ` [RFC 2/6] mm: add struct zone_type_struct to describe zone type Huang Ying
2023-05-11  6:56 ` [RFC 3/6] mm: support multiple zone instances per zone type in memory online Huang Ying
2023-05-11  6:56 ` [RFC 4/6] mm: avoid show invalid zone in /proc/zoneinfo Huang Ying
2023-05-11  6:56 ` [RFC 5/6] mm: create multiple zone instances for one zone type based on memory size Huang Ying
2023-05-11  6:56 ` [RFC 6/6] mm: prefer different zone list on different logical CPU Huang Ying
2023-05-11 10:30 ` [RFC 0/6] mm: improve page allocator scalability via splitting zones Jonathan Cameron
2023-05-11 13:07   ` Arjan van de Ven
2023-05-11 14:23 ` Dave Hansen
2023-05-12  3:08   ` Huang, Ying
2023-05-11 15:05 ` Michal Hocko
2023-05-12  2:55   ` Huang, Ying
2023-05-15 11:14     ` Michal Hocko
2023-05-16  9:38       ` Huang, Ying
2023-05-16 10:30         ` David Hildenbrand
2023-05-17  1:34           ` Huang, Ying
2023-05-17  8:09             ` David Hildenbrand [this message]
2023-05-18  8:06               ` Huang, Ying
2023-05-24 12:30           ` Michal Hocko
2023-05-29  1:13             ` Huang, Ying
