From: "Huang, Ying" <ying.huang@intel.com>
To: David Hildenbrand
Cc: Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Arjan Van De Ven, Andrew Morton, Mel Gorman, Vlastimil Babka,
    Johannes Weiner, Dave Hansen, Pavel Tatashin, Matthew Wilcox
Subject: Re: [RFC 0/6] mm: improve page allocator scalability via splitting zones
References: <20230511065607.37407-1-ying.huang@intel.com>
    <87r0rm8die.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87jzx87h1d.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <3d77ca46-6256-7996-b0f5-67c414d2a8dc@redhat.com>
    <87bkij7ncn.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 18 May 2023 16:06:43 +0800
In-Reply-To: (David Hildenbrand's message of "Wed, 17 May 2023 10:09:31 +0200")
Message-ID: <875y8q83n0.fsf@yhuang6-desk2.ccr.corp.intel.com>

David Hildenbrand writes:

>>> If we could avoid instantiating more zones and rather improve existing
>>> mechanisms (PCP), that would be much more preferred IMHO. I'm sure
>>> it's not easy, but that shouldn't stop us from trying ;)
>>
>> I do think improving PCP or adding another level of cache will help
>> performance and scalability.
>>
>> And I think it has value, too, to improve the performance of the zone
>> itself, because there will always be some cases in which the zone lock
>> itself is contended.
>>
>> That is, PCP and the zone work at different levels, and both deserve
>> to be improved. Do you agree?
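(For readers joining the thread: the two levels above are, roughly, a
global per-zone free list protected by a spinlock, with small per-CPU
page (PCP) lists caching free pages in front of it. Below is a
deliberately simplified, kernel-style sketch of that relationship; all
names are invented for illustration, and this is not the real
mm/page_alloc.c code.)

/*
 * Simplified sketch of the two allocation levels under discussion.
 * Invented names; not the actual kernel implementation.
 */
struct page_sketch { struct page_sketch *next; };

struct zone_sketch {
	spinlock_t lock;		/* the contended "global" lock */
	struct page_sketch *free_list;	/* global free pages */
};

struct pcp_sketch {
	int count;			/* pages currently cached */
	int batch;			/* refill granularity */
	struct page_sketch *list;	/* per-CPU cache; no zone lock needed */
};

/* Fast path hits the per-CPU cache; slow path takes the zone lock. */
static struct page_sketch *alloc_page_sketch(struct zone_sketch *zone,
					     struct pcp_sketch *pcp)
{
	struct page_sketch *page;
	int i;

	if (!pcp->count) {
		/* Refill a whole batch under one lock acquisition. */
		spin_lock(&zone->lock);
		for (i = 0; i < pcp->batch && zone->free_list; i++) {
			page = zone->free_list;
			zone->free_list = page->next;
			page->next = pcp->list;
			pcp->list = page;
			pcp->count++;
		}
		spin_unlock(&zone->lock);
	}

	page = pcp->list;
	if (page) {
		pcp->list = page->next;
		pcp->count--;
	}
	return page;
}

The batching is the point: one zone-lock acquisition amortizes over up
to pcp->batch allocations, which is exactly where PCP tuning and
zone-lock contention interact.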
> Spoiler: my humble opinion.
>
> Well, the zone is kind-of your "global" memory provider, and PCPs
> cache a fraction of that to avoid having to mess with that global
> data structure and its lock contention.
>
> One benefit I can see of such a "global" memory provider with caches
> on top is that it is nicely integrated: for example, the concept of
> memory pressure exists for the zone as a whole. All memory is of the
> same kind and managed in a single entity, but free memory is cached
> for performance.
>
> As soon as you manage the memory in multiple zones of the same kind,
> you lose that "global" view of your memory that is of the same kind
> but managed in different buckets. You might end up with a lot of
> memory pressure in a single such zone, but still have plenty in
> another zone.
>
> As one example, hot(un)plug of memory is easy: there is only a single
> zone. No need to make smart decisions or deal with memory we're
> hot-unplugging being stranded in multiple zones.

I understand that there are some unresolved issues with splitting
zones. I will think more about them and about possible solutions.

>>> I did not look into the details of this proposal, but seeing the
>>> change in include/linux/page-flags-layout.h scares me.
>>
>> It's possible for us to use 1 more bit in page->flags. Do you think
>> that will cause a severe issue? Or do you think something else isn't
>> acceptable?
>
> The issue is, everybody wants to consume more bits in page->flags, so
> if we can get away without it that would be much better :)

Yes.

> The more bits you want to consume, the more people will ask for making
> this a compile-time option and eventually compile it out on distro
> kernels (e.g., with many NUMA nodes). So we end up with more code and
> complexity and eventually not get the benefits where we really want
> them.

That's possible, although I think we will still use more page flags
when necessary.

>>> Further, I'm not so sure how that change really interacts with
>>> hot(un)plug of memory ... on a quick glimpse I feel like this series
>>> hacks the code such that the split works based on the boot memory
>>> size ...
>>
>> Em..., the zone stuff is kind of static now. It's hard to add a zone
>> at run time. So, in this series, we determine the number of zones per
>> zone type based on the boot memory size. This may be improved in the
>> future by pre-allocating some empty zone instances during boot and
>> hot-adding memory to those zones.
>
> Just to give you some idea: with virtio-mem, Hyper-V, daxctl, and
> upcoming CXL dynamic memory pooling (some day, I'm sure ;) ) you might
> see quite a small boot memory (e.g., 4 GiB) but a significant amount
> of memory getting hotplugged incrementally (e.g., up to 1 TiB) --
> well, and hotunplugged. With multiple zone instances you really have
> to be careful and might have to re-balance between the multiple zones
> to keep the scalability and not create imbalances between the
> zones ...

Thanks for your information!

> Something like PCP auto-tuning would be able to handle that mostly
> automatically, as there is only a single memory pool.

I agree that optimizing PCP will help performance regardless of whether
we split zones or not.
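Concretely, I could imagine the auto-tuning having roughly the
following shape. This is purely illustrative; the constants, fields,
and function names below are all invented for this sketch and are not
existing kernel symbols:

/*
 * Illustrative auto-tuning sketch: grow the PCP high watermark while a
 * CPU keeps falling back to the zone lock, shrink it again when the
 * zone is under memory pressure. All names are invented.
 */
#define PCP_HIGH_MIN	32
#define PCP_HIGH_MAX	4096

struct pcp_tune {
	int high;		/* current per-CPU high watermark */
	unsigned long misses;	/* recent refills from the zone */
};

static void pcp_autotune(struct pcp_tune *t, bool zone_pressure)
{
	if (zone_pressure)
		/* Give memory back quickly: halve the cache. */
		t->high = max(t->high / 2, PCP_HIGH_MIN);
	else if (t->misses > 8)
		/* Frequent zone-lock trips: cache more per CPU. */
		t->high = min(t->high * 2, PCP_HIGH_MAX);

	t->misses = 0;
}

Something in this direction would keep the single global pool (and the
single notion of memory pressure) while adapting how much of it each
CPU may hold.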
>>> I agree with Michal that looking into auto-tuning PCP would be
>>> preferred. If that can't be done, adding another layer might end up
>>> cleaner and eventually cover more use cases.
>>
>> I do agree that it's valuable to make PCP etc. cover more use cases.
>> I just think that this should not prevent us from optimizing the zone
>> itself to cover the remaining use cases.
>
> I really don't like the concept of replicating zones of the same kind
> for the same NUMA node. But that's just my personal opinion,
> maintaining some memory hot(un)plug code :)
>
> Having said that, some kind of sub-zone concept (an additional layer)
> as outlined by Michal, IIUC -- for example, indexed by core
> id/hash/whatsoever -- could eventually be worth exploring. Yes, such a
> design raises various questions ... :)

Yes. That's another possible solution for the page allocation
scalability problem.

Best Regards,
Huang, Ying
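P.S. To make the sub-zone direction a bit more concrete, here is the
kind of structure I would imagine. Again, this is a sketch only, with
invented names: the zone keeps several independently locked free-list
shards and each CPU prefers "its" shard, while zone-wide state such as
watermarks stays global.

/* Sketch of free-list sharding inside one zone; invented names. */
#define NR_SHARDS	8	/* e.g., scaled with the core count */

struct free_shard {
	spinlock_t lock;		/* per-shard, not per-zone */
	struct list_head free_list;
	unsigned long nr_free;
};

struct zone_sharded {
	struct free_shard shards[NR_SHARDS];
	/* Watermarks, pressure, etc. would remain zone-wide. */
};

/* Index by CPU id so different cores contend on different locks. */
static struct free_shard *pick_shard(struct zone_sharded *zone)
{
	return &zone->shards[raw_smp_processor_id() % NR_SHARDS];
}

An allocation would try its own shard first and fall back to stealing
from the other shards when it is empty, which is where the re-balancing
questions you mention come back in.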