From: "chenjun (AM)" <chenjun102@huawei.com>
To: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, cl@linux.com,
 penberg@kernel.org, rientjes@google.com, iamjoonsoo.kim@lge.com,
 akpm@linux-foundation.org, vbabka@suse.cz, "xuqiang (M)"
Subject: Re: [RFC] mm/slub: Reduce memory consumption in extreme scenarios
Date: Thu, 9 Mar 2023 02:15:51 +0000
Message-ID: <74880f3c7c1e4d9fa6691ece991c931f@huawei.com>
References: <20230307082811.120774-1-chenjun102@huawei.com>
 <4ad448c565134d76bea0ac8afffe4f37@huawei.com>
On 2023/3/8 21:37, Hyeonggon Yoo wrote:
> On Wed, Mar 08, 2023 at 07:16:49AM +0000, chenjun (AM) wrote:
>> Hi,
>>
>> Thanks for the reply.
>>
>> On 2023/3/7 22:20, Hyeonggon Yoo wrote:
>>> On Tue, Mar 07, 2023 at 08:28:11AM +0000, Chen Jun wrote:
>>>> If kmalloc_node is called without __GFP_THISNODE for node[A] with no
>>>> memory, SLUB will alloc a slab page which does not belong to A, and
>>>> put the page on kmem_cache_node[page_to_nid(page)]. The page can not
>>>> be reused at the next call, because NULL will be returned by
>>>> get_partial(). That makes kmalloc_node consume more memory.
>>>
>>> Hello,
>>>
>>> elaborating a little bit:
>>>
>>> "When kmalloc_node() is called without __GFP_THISNODE and the target node
>>> lacks sufficient memory, SLUB allocates a folio from a different node other
>>> than the requested node, instead of taking a partial slab from it.
>>>
>>> However, since the allocated folio does not belong to the requested node,
>>> it is deactivated and added to the partial slab list of the node it
>>> belongs to.
>>>
>>> This behavior can result in excessive memory usage when the requested
>>> node has insufficient memory, as SLUB will repeatedly allocate folios from
>>> other nodes without reusing the previously allocated ones.
>>>
>>> To prevent memory wastage, take a partial slab from a different node when
>>> the requested node has no partial slab and __GFP_THISNODE is not explicitly
>>> specified."
>>>
>>
>> Thanks, this is clearer than what I described.
>>
>>>> On qemu with 4 NUMA nodes, each with 1G of memory, a test ko calling
>>>> kmalloc_node(196, 0xd20c0, 3) for 5 * 1024 * 1024 times gives,
>>>>
>>>> cat /proc/slabinfo:
>>>> kmalloc-256 4302317 15151808 256 32 2 : tunables..
>>>>
>>>> The total object count is much larger than the active object count.
>>>>
>>>> After this patch, cat /proc/slabinfo shows:
>>>> kmalloc-256 5244950 5245088 256 32 2 : tunables..
>>>>
>>>> Signed-off-by: Chen Jun <chenjun102@huawei.com>
>>>> ---
>>>>  mm/slub.c | 17 ++++++++++++++---
>>>>  1 file changed, 14 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>> index 39327e98fce3..c0090a5de54e 100644
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -2384,7 +2384,7 @@ static void *get_partial(struct kmem_cache *s, int node, struct partial_context
>>>>  		searchnode = numa_mem_id();
>>>>
>>>>  	object = get_partial_node(s, get_node(s, searchnode), pc);
>>>> -	if (object || node != NUMA_NO_NODE)
>>>> +	if (object || (node != NUMA_NO_NODE && (pc->flags & __GFP_THISNODE)))
>>>>  		return object;
>>>
>>> I think the problem here is to avoid taking a partial slab from a
>>> different node than the requested node even if __GFP_THISNODE is not set
>>> (and then allocating a new slab instead).
>>>
>>> Thus this hunk makes sense to me,
>>> even if SLUB currently does not implement __GFP_THISNODE semantics.
>>>
>>>>  	return get_any_partial(s, pc);
>>>> @@ -3069,6 +3069,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>  	struct slab *slab;
>>>>  	unsigned long flags;
>>>>  	struct partial_context pc;
>>>> +	int try_thisndoe = 0;
>>>>
>>>>  	stat(s, ALLOC_SLOWPATH);
>>>>
>>>> @@ -3181,8 +3182,12 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>  	}
>>>>
>>>>  new_objects:
>>>> -
>>>>  	pc.flags = gfpflags;
>>>> +
>>>> +	/* Try to get page from specific node even if __GFP_THISNODE is not set */
>>>> +	if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode)
>>>> +		pc.flags |= __GFP_THISNODE;
>>>> +

Any suggestions to make the change more elegant?

>>>>  	pc.slab = &slab;
>>>>  	pc.orig_size = orig_size;
>>>>  	freelist = get_partial(s, node, &pc);
>>>> @@ -3190,10 +3195,16 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>>>>  		goto check_new_slab;
>>>>
>>>>  	slub_put_cpu_ptr(s->cpu_slab);
>>>> -	slab = new_slab(s, gfpflags, node);
>>>> +	slab = new_slab(s, pc.flags, node);
>>>>  	c = slub_get_cpu_ptr(s->cpu_slab);
>>>>
>>>>  	if (unlikely(!slab)) {
>>>> +		/* Try to get page from any other node */
>>>> +		if (node != NUMA_NO_NODE && !(gfpflags & __GFP_THISNODE) && try_thisnode) {
>>>> +			try_thisnode = 0;
>>>> +			goto new_objects;
>>>> +		}
>>>> +
>>>>  		slab_out_of_memory(s, gfpflags, node);
>>>>  		return NULL;
>>>
>>> But these hunks do not make sense to me.
>>> Why force __GFP_THISNODE even when the caller did not specify it?
>>>
>>> (Apart from the fact that try_thisnode is defined as try_thisndoe,
>>> and try_thisnode is never set to a nonzero value.)
>>
>> My mistake, it should be:
>> int try_thisnode = 0;
>
> I think it should be try_thisnode = 1?
> Otherwise it won't be executed at all.
> Also bool type will be more readable than int.
>
>>
>>>
>>> IMHO the first hunk is enough to solve the problem.
>>
>> I think we should try to alloc a page on the target node before getting
>> one from other nodes' partial lists.
>
> You are right.
>
> Hmm, then the new behavior when
> (node != NUMA_NO_NODE) && !(gfpflags & __GFP_THISNODE) is:
>
> 1) try to get a partial slab from the target node with __GFP_THISNODE
> 2) if 1) failed, try to allocate a new slab from the target node with __GFP_THISNODE
> 3) if 2) failed, retry 1) and 2) without the __GFP_THISNODE constraint
>
> When node == NUMA_NO_NODE || (gfpflags & __GFP_THISNODE), the behavior
> remains unchanged.
>
> It does not look that crazy to me, although it complicates the code
> a little bit.
> (Vlastimil may have some opinions?)
>
> Now that I understand your intention, I think this behavior change also
> needs to be added to the commit log.
>

I will add it.

> Thanks,
> Hyeonggon
>
>> If the caller does not specify __GFP_THISNODE, we add __GFP_THISNODE to
>> try to get the slab only on the target node. If that fails, use the
>> original GFP flags to allow fallback.
>

If there are no other questions, I will send an official patch.