From: "Huang, Ying" <ying.huang@intel.com>
To: David Hildenbrand
Cc: Michal Hocko, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    Arjan Van De Ven, Andrew Morton, Mel Gorman, Vlastimil Babka,
    Johannes Weiner, Dave Hansen, Pavel Tatashin, Matthew Wilcox
Subject: Re: [RFC 0/6] mm: improve page allocator scalability via splitting zones
References: <20230511065607.37407-1-ying.huang@intel.com>
    <87r0rm8die.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <87jzx87h1d.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <3d77ca46-6256-7996-b0f5-67c414d2a8dc@redhat.com>
    <87bkij7ncn.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 18 May 2023 16:06:43 +0800
In-Reply-To: (David Hildenbrand's message of "Wed, 17 May 2023 10:09:31 +0200")
Message-ID: <875y8q83n0.fsf@yhuang6-desk2.ccr.corp.intel.com>

David Hildenbrand writes:

>>> If we could avoid instantiating more zones and rather improve existing
>>> mechanisms (PCP), that would be much more preferred IMHO. I'm sure
>>> it's not easy, but that shouldn't stop us from trying ;)
>>
>> I do think improving PCP or adding another level of cache will help
>> performance and scalability.
>>
>> And I think it has value, too, to improve the performance of the zone
>> itself, because there will always be some cases in which the zone lock
>> itself is contended.
>>
>> That is, PCP and the zone work at different levels, and both deserve
>> to be improved. Do you agree?
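(For readers joining the thread: the two levels above are, roughly, a
global per-zone free list protected by a spinlock, with small per-CPU
page (PCP) lists caching free pages in front of it. Below is a
deliberately simplified, kernel-style sketch of that relationship; all
names are invented for illustration, and this is not the real
mm/page_alloc.c code.)

/*
 * Simplified sketch of the two allocation levels under discussion.
 * Invented names; not the actual kernel implementation.
 */
struct page_sketch { struct page_sketch *next; };

struct zone_sketch {
	spinlock_t lock;		/* the contended "global" lock */
	struct page_sketch *free_list;	/* global free pages */
};

struct pcp_sketch {
	int count;			/* pages currently cached */
	int batch;			/* refill granularity */
	struct page_sketch *list;	/* per-CPU cache; no zone lock needed */
};

/* Fast path hits the per-CPU cache; slow path takes the zone lock. */
static struct page_sketch *alloc_page_sketch(struct zone_sketch *zone,
					     struct pcp_sketch *pcp)
{
	struct page_sketch *page;
	int i;

	if (!pcp->count) {
		/* Refill a whole batch under one lock acquisition. */
		spin_lock(&zone->lock);
		for (i = 0; i < pcp->batch && zone->free_list; i++) {
			page = zone->free_list;
			zone->free_list = page->next;
			page->next = pcp->list;
			pcp->list = page;
			pcp->count++;
		}
		spin_unlock(&zone->lock);
	}

	page = pcp->list;
	if (page) {
		pcp->list = page->next;
		pcp->count--;
	}
	return page;
}

The batching is the point: one zone-lock acquisition amortizes over up
to pcp->batch allocations, which is exactly where PCP tuning and
zone-lock contention interact.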
> Spoiler: my humble opinion.
>
> Well, the zone is kind-of your "global" memory provider, and PCPs
> cache a fraction of that to avoid having to mess with that global
> data structure and its lock contention.
>
> One benefit I can see of such a "global" memory provider with caches
> on top is that it is nicely integrated: for example, the concept of
> memory pressure exists for the zone as a whole. All memory is of the
> same kind and managed in a single entity, but free memory is cached
> for performance.
>
> As soon as you manage the memory in multiple zones of the same kind,
> you lose that "global" view of your memory that is of the same kind
> but managed in different buckets. You might end up with a lot of
> memory pressure in a single such zone, but still have plenty in
> another zone.
>
> As one example, hot(un)plug of memory is easy: there is only a single
> zone. No need to make smart decisions or deal with memory we're
> hot-unplugging being stranded in multiple zones.

I understand that there are some unresolved issues with splitting
zones. I will think more about them and about possible solutions.

>>> I did not look into the details of this proposal, but seeing the
>>> change in include/linux/page-flags-layout.h scares me.
>>
>> It's possible for us to use 1 more bit in page->flags. Do you think
>> that will cause a severe issue? Or do you think something else isn't
>> acceptable?
>
> The issue is, everybody wants to consume more bits in page->flags, so
> if we can get away without it that would be much better :)

Yes.

> The more bits you want to consume, the more people will ask for making
> this a compile-time option and eventually compile it out on distro
> kernels (e.g., with many NUMA nodes). So we end up with more code and
> complexity and eventually not get the benefits where we really want
> them.

That's possible, although I think we will still use more page flags
when necessary.

>>> Further, I'm not so sure how that change really interacts with
>>> hot(un)plug of memory ... on a quick glimpse I feel like this series
>>> hacks the code such that the split works based on the boot memory
>>> size ...
>>
>> Em..., the zone stuff is kind of static now. It's hard to add a zone
>> at run time. So, in this series, we determine the number of zones per
>> zone type based on the boot memory size. This may be improved in the
>> future by pre-allocating some empty zone instances during boot and
>> hot-adding memory to those zones.
>
> Just to give you some idea: with virtio-mem, Hyper-V, daxctl, and
> upcoming CXL dynamic memory pooling (some day, I'm sure ;) ) you might
> see quite a small boot memory (e.g., 4 GiB) but a significant amount
> of memory getting hotplugged incrementally (e.g., up to 1 TiB) --
> well, and hotunplugged. With multiple zone instances you really have
> to be careful and might have to re-balance between the multiple zones
> to keep the scalability and not create imbalances between the
> zones ...

Thanks for your information!

> Something like PCP auto-tuning would be able to handle that mostly
> automatically, as there is only a single memory pool.

I agree that optimizing PCP will help performance regardless of whether
we split zones or not.
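Concretely, I could imagine the auto-tuning having roughly the
following shape. This is purely illustrative; the constants, fields,
and function names below are all invented for this sketch and are not
existing kernel symbols:

/*
 * Illustrative auto-tuning sketch: grow the PCP high watermark while a
 * CPU keeps falling back to the zone lock, shrink it again when the
 * zone is under memory pressure. All names are invented.
 */
#define PCP_HIGH_MIN	32
#define PCP_HIGH_MAX	4096

struct pcp_tune {
	int high;		/* current per-CPU high watermark */
	unsigned long misses;	/* recent refills from the zone */
};

static void pcp_autotune(struct pcp_tune *t, bool zone_pressure)
{
	if (zone_pressure)
		/* Give memory back quickly: halve the cache. */
		t->high = max(t->high / 2, PCP_HIGH_MIN);
	else if (t->misses > 8)
		/* Frequent zone-lock trips: cache more per CPU. */
		t->high = min(t->high * 2, PCP_HIGH_MAX);

	t->misses = 0;
}

Something in this direction would keep the single global pool (and the
single notion of memory pressure) while adapting how much of it each
CPU may hold.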
>>> I agree with Michal that looking into auto-tuning PCP would be
>>> preferred. If that can't be done, adding another layer might end up
>>> cleaner and eventually cover more use cases.
>>
>> I do agree that it's valuable to make PCP etc. cover more use cases.
>> I just think that this should not prevent us from optimizing the zone
>> itself to cover the remaining use cases.
>
> I really don't like the concept of replicating zones of the same kind
> for the same NUMA node. But that's just my personal opinion,
> maintaining some memory hot(un)plug code :)
>
> Having said that, some kind of sub-zone concept (an additional layer)
> as outlined by Michal, IIUC -- for example, indexed by core
> id/hash/whatsoever -- could eventually be worth exploring. Yes, such a
> design raises various questions ... :)

Yes. That's another possible solution for the page allocation
scalability problem.

Best Regards,
Huang, Ying
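P.S. To make the sub-zone direction a bit more concrete, here is the
kind of structure I would imagine. Again, this is a sketch only, with
invented names: the zone keeps several independently locked free-list
shards and each CPU prefers "its" shard, while zone-wide state such as
watermarks stays global.

/* Sketch of free-list sharding inside one zone; invented names. */
#define NR_SHARDS	8	/* e.g., scaled with the core count */

struct free_shard {
	spinlock_t lock;		/* per-shard, not per-zone */
	struct list_head free_list;
	unsigned long nr_free;
};

struct zone_sharded {
	struct free_shard shards[NR_SHARDS];
	/* Watermarks, pressure, etc. would remain zone-wide. */
};

/* Index by CPU id so different cores contend on different locks. */
static struct free_shard *pick_shard(struct zone_sharded *zone)
{
	return &zone->shards[raw_smp_processor_id() % NR_SHARDS];
}

An allocation would try its own shard first and fall back to stealing
from the other shards when it is empty, which is where the re-balancing
questions you mention come back in.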