From: "chenjun (AM)" <chenjun102@huawei.com>
To: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
 penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
 akpm@linux-foundation.org, vbabka@suse.cz, "xuqiang (M)"
Subject: Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
Date: Thu, 9 Mar 2023 02:15:51 +0000
Message-ID: <74880f3c7c1e4d9fa6691ece991c931f@huawei.com>
References: <20230307082811.120774-1-chenjun102@huawei.com>
 <4ad448c565134d76bea0ac8afffe4f37@huawei.com>
On 2023/3/8 21:37, Hyeonggon Yoo wrote:
> On Wed, Mar 08, 2023 at 07:16:49AM +0000, chenjun (AM) wrote:
>> Hi,
>>
>> Thanks for the reply.
>>
>> On 2023/3/7 22:20, Hyeonggon Yoo wrote:
>>> On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
>>>> If kmalloc_node is called without __GFP_THISNODE for node[A] with no
>>>> memory, SLUB will alloc a slab page which does not belong to A, and
>>>> put the page on kmem_cache_node[page_to_nid(page)]. The page can not
>>>> be reused at the next call, because NULL will be returned by
>>>> get_partial(). That makes kmalloc_node consume more memory.
>>>
>>> Hello,
>>>
>>> elaborating a little bit:
>>>
>>> "When kmalloc_node() is called without __GFP_THISNODE and the target node
>>> lacks sufficient memory, SLUB allocates a folio from a different node other
>>> than the requested node, instead of taking a partial slab from it.
>>>
>>> However, since the allocated folio does not belong to the requested node,
>>> it is deactivated and added to the partial slab list of the node it
>>> belongs to.
>>>
>>> This behavior can result in excessive memory usage when the requested
>>> node has insufficient memory, as SLUB will repeatedly allocate folios from
>>> other nodes without reusing the previously allocated ones.
>>>
>>> To prevent memory wastage, take a partial slab from a different node when
>>> the requested node has no partial slab and __GFP_THISNODE is not explicitly
>>> specified."
>>>
>>
>> Thanks, this is clearer than what I described.
>>
>>>> On qemu with 4 NUMA nodes, each with 1G of memory, a test ko calling
>>>> kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times gives,
>>>>
>>>> cat /proc/slabinfo:
>>>> kmalloc-256 4302317 15151808 256 32 2 : tunables..
>>>>
>>>> The total object count is much larger than the active object count.
>>>>
>>>> After this patch, cat /proc/slabinfo shows:
>>>> kmalloc-256 5244950 5245088 256 32 2 : tunables..
>>>>
>>>> Signed-off-by: Chen Jun <chenjun102@huawei.com>
>>>> ---
>>>>  mm/slub.c | 17 ++++++++++++++---
>>>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>> index 39327e98fce3..c0090a5de54e 100644
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
>>>>  		searchnode = numa_mem_id();
>>>>
>>>>  	object = get_partial_node(s, get_node(s, searchnode), pc);
>>>> -	if (object || node != NUMA_NO_NODE)
>>>> +	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
>>>>  		return object;
>>>
>>> I think the problem here is to avoid taking a partial slab from a
>>> different node than the requested node even if __GFP_THISNODE is not set
>>> (and then allocating a new slab instead).
>>>
>>> Thus this hunk makes sense to me,
>>> even if SLUB currently does not implement __GFP_THISNODE semantics.
>>>
>>>>  	return get_any_partial(s, pc);
>>>> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>  	struct slab *slab;
>>>>  	unsigned long flags;
>>>>  	struct partial_context pc;
>>>> +	int try_thisndoe = 0;
>>>>
>>>>  	stat(s, ALLOC_SLOWPATH);
>>>>
>>>> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>  	}
>>>>
>>>>  new_objects:
>>>> -
>>>>  	pc.flags = gfpflags;
>>>> +
>>>> +	/* Try to get page from specific node even if __GFP_THISNODE is not set */
>>>> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
>>>> +		pc.flags |= __GFP_THISNODE;
>>>> +

Any suggestions to make the change more elegant?

>>>>  	pc.slab = &slab;
>>>>  	pc.orig_size = orig_size;
>>>>  	freelist = get_partial(s, node, &pc);
>>>> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>  		goto check_new_slab;
>>>>
>>>>  	slub_put_cpu_ptr(s->cpu_slab);
>>>> -	slab = new_slab(s, gfpflags, node);
>>>> +	slab = new_slab(s, pc.flags, node);
>>>>  	c = slub_get_cpu_ptr(s->cpu_slab);
>>>>
>>>>  	if (unlikely(!slab)) {
>>>> +		/* Try to get page from any other node */
>>>> +		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
>>>> +			try_thisnode = 0;
>>>> +			goto new_objects;
>>>> +		}
>>>> +
>>>>  		slab_out_of_memory(s, gfpflags, node);
>>>>  		return NULL;
>>>
>>> But these hunks do not make sense to me.
>>> Why force __GFP_THISNODE even when the caller did not specify it?
>>>
>>> (Apart from the fact that try_thisnode is defined as try_thisndoe,
>>> and try_thisnode is never set to a nonzero value.)
>>
>> My mistake, it should be:
>> int try_thisnode = 0;
>
> I think it should be try_thisnode = 1?
> Otherwise it won't be executed at all.
> Also bool type will be more readable than int.
>
>>
>>>
>>> IMHO the first hunk is enough to solve the problem.
>>
>> I think we should try to alloc a page on the target node before getting
>> one from other nodes' partial lists.
>
> You are right.
>
> Hmm, then the new behavior when
> (node != NUMA_NO_NODE) && !(gfpflags & __GFP_THISNODE) is:
>
> 1) try to get a partial slab from the target node with __GFP_THISNODE
> 2) if 1) failed, try to allocate a new slab from the target node with __GFP_THISNODE
> 3) if 2) failed, retry 1) and 2) without the __GFP_THISNODE constraint
>
> When node == NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
> remains unchanged.
>
> It does not look that crazy to me, although it complicates the code
> a little bit.
> (Vlastimil may have some opinions?)
>
> Now that I understand your intention, I think this behavior change also
> needs to be added to the commit log.
>

I will add it.

> Thanks,
> Hyeonggon
>
>> If the caller does not specify __GFP_THISNODE, we add __GFP_THISNODE to
>> try to get the slab only on the target node. If that fails, use the
>> original GFP flags to allow fallback.
>

If there are no other questions, I will send an official patch.