From: Wen Yang <wenyang@linux.alibaba.com>
Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects
To: Roman Gushchin
Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
 Joonsoo Kim, Xunlei Pang, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20200201031502.92218-1-wenyang@linux.alibaba.com>
 <20200212145247.bf89431272038de53dd9d975@linux-foundation.org>
 <20200218205312.GA3156@carbon>
 <20200220154036.GA191388@carbon.dhcp.thefacebook.com>
In-Reply-To: <20200220154036.GA191388@carbon.dhcp.thefacebook.com>
Message-ID: <98c84f23-0636-a877-a96d-d6e58d540aa4@linux.alibaba.com>
Date: Sat, 22 Feb 2020 14:55:39 +0800

On 2020/2/20 11:40 PM, Roman Gushchin wrote:
> On Thu, Feb 20, 2020 at 09:53:26PM +0800, Wen Yang wrote:
>>
>>
>> On 2020/2/19 4:53 AM, Roman Gushchin wrote:
>>> On Sun, Feb 16, 2020 at 12:15:54PM +0800, Wen Yang wrote:
>>>>
>>>>
>>>> On 2020/2/13 6:52 AM, Andrew Morton wrote:
>>>>> On Sat, 1 Feb 2020 11:15:02 +0800 Wen Yang wrote:
>>>>>
>>>>>> The lock, protecting the node partial list, is taken when counting the
>>>>>> free objects resident in that list. It introduces locking contention
>>>>>> when the page(s) is moved between CPU and node partial lists in the
>>>>>> allocation path on another CPU. So reading "/proc/slabinfo" can possibly
>>>>>> block the slab allocation on another CPU for a while, 200ms in extreme
>>>>>> cases. If the slab object is to carry a network packet, targeting the
>>>>>> far-end disk array, it causes a block IO jitter issue.
>>>>>>
>>>>>> This fixes the block IO jitter issue by caching the total inuse objects
>>>>>> in the node in advance. The value is retrieved without taking the node
>>>>>> partial list lock on reading "/proc/slabinfo".
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>>>>>>  static void discard_slab(struct kmem_cache *s, struct page *page)
>>>>>>  {
>>>>>> -	dec_slabs_node(s, page_to_nid(page), page->objects);
>>>>>> +	int inuse = page->objects;
>>>>>> +
>>>>>> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
>>>>>
>>>>> Is this right?  dec_slabs_node(..., page->objects, page->objects)?
>>>>>
>>>>> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
>>>>> and save a function argument.
>>>>>
>>>>> If yes then why?
>>>>>
>>>>
>>>> Thanks for your comments.
>>>> We are happy to improve this patch based on your suggestions.
>>>>
>>>>
>>>> When the user reads /proc/slabinfo, in order to obtain the active_objs
>>>> information, the kernel traverses all slabs and executes the following
>>>> code snippet:
>>>>
>>>> static unsigned long count_partial(struct kmem_cache_node *n,
>>>>                                    int (*get_count)(struct page *))
>>>> {
>>>>         unsigned long flags;
>>>>         unsigned long x = 0;
>>>>         struct page *page;
>>>>
>>>>         spin_lock_irqsave(&n->list_lock, flags);
>>>>         list_for_each_entry(page, &n->partial, slab_list)
>>>>                 x += get_count(page);
>>>>         spin_unlock_irqrestore(&n->list_lock, flags);
>>>>         return x;
>>>> }
>>>>
>>>> It may cause performance issues.
>>>>
>>>> Christoph suggested "you could cache the value in the userspace
>>>> application? Why is this value read continually?", but reading
>>>> /proc/slabinfo is initiated by the user program. As a cloud provider,
>>>> we cannot control user behavior. If a user program inadvertently
>>>> executes cat /proc/slabinfo, it may affect other user programs.
>>>>
>>>> As Christoph said: "The count is not needed for any operations. Just for
>>>> the slabinfo output. The value has no operational value for the allocator
>>>> itself. So why use extra logic to track it in potentially performance
>>>> critical paths?"
>>>>
>>>> In this way, could we show the approximate value of active_objs in
>>>> /proc/slabinfo?
>>>>
>>>> Based on the following information:
>>>> In the discard_slab() function, page->inuse is equal to page->total_objects;
>>>> In the allocate_slab() function, page->inuse is also equal to
>>>> page->total_objects (with one exception: for kmem_cache_node, page->inuse
>>>> equals 1);
>>>> page->inuse will only change continuously when the obj is constantly
>>>> allocated or released.
>>>> (This should be the performance critical path emphasized by Christoph.)
>>>>
>>>> When users query the global slabinfo information, we may use
>>>> total_objects to approximate active_objs.
>>>
>>> Well, from one point of view, it makes no sense, because the ratio between
>>> these two numbers is very meaningful: it's the slab utilization rate.
>>>
>>> On the other side, with enabled per-cpu partial lists active_objs has
>>> nothing to do with the reality anyway, so I agree with you, calling
>>> count_partial() is almost useless.
>>>
>>> That said, I wonder if the right thing to do is something like the patch below?
>>>
>>> Thanks!
>>>
>>> Roman
>>>
>>> --
>>>
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index 1d644143f93e..ba0505e75ecc 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -2411,14 +2411,16 @@ static inline unsigned long node_nr_objs(struct kmem_cache_node *n)
>>>  static unsigned long count_partial(struct kmem_cache_node *n,
>>>                                     int (*get_count)(struct page *))
>>>  {
>>> -	unsigned long flags;
>>>  	unsigned long x = 0;
>>> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>>> +	unsigned long flags;
>>>  	struct page *page;
>>>
>>>  	spin_lock_irqsave(&n->list_lock, flags);
>>>  	list_for_each_entry(page, &n->partial, slab_list)
>>>  		x += get_count(page);
>>>  	spin_unlock_irqrestore(&n->list_lock, flags);
>>> +#endif
>>>  	return x;
>>>  }
>>>  #endif	/* CONFIG_SLUB_DEBUG || CONFIG_SYSFS */
>>>
>>
>> Hi Roman,
>>
>> Thanks for your comments.
>>
>> In the server scenario, SLUB_CPU_PARTIAL is turned on by default, and can
>> improve the performance of the cloud server, as follows:
>
> Hello, Wen!
>
> That's exactly my point: if CONFIG_SLUB_CPU_PARTIAL is on, count_partial()
> is useless anyway because the returned number is far from the reality. So
> if we define active_objects == total_objects, as you basically suggest, we
> do not introduce any regression. Actually I think it's even preferable to
> show the unrealistic uniform 100% slab utilization rather than some very
> high but incorrect value.
>
> And on real-time systems uncontrolled readings of /proc/slabinfo is less
> of a concern, I hope.
>
> Thank you!
>

Great!
We only need to correct a typo to achieve this goal, as follows:
Change
    #ifdef CONFIG_SLUB_CPU_PARTIAL
to
    #ifndef CONFIG_SLUB_CPU_PARTIAL

We will continue testing and send the modified patch soon.
Thank you very much.
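
For reference, a rough sketch of how count_partial() would read with that
one-word change applied (this is simply Roman's diff above with the
conditional inverted, not the final tested patch; the submitted version may
differ):

static unsigned long count_partial(struct kmem_cache_node *n,
                                   int (*get_count)(struct page *))
{
        unsigned long x = 0;
#ifndef CONFIG_SLUB_CPU_PARTIAL
        unsigned long flags;
        struct page *page;

        /*
         * Walk the node partial list under n->list_lock only when per-cpu
         * partial lists are disabled; with CONFIG_SLUB_CPU_PARTIAL enabled
         * the count is far from reality anyway, so skip the walk (and the
         * lock contention) entirely and report no extra in-use objects.
         */
        spin_lock_irqsave(&n->list_lock, flags);
        list_for_each_entry(page, &n->partial, slab_list)
                x += get_count(page);
        spin_unlock_irqrestore(&n->list_lock, flags);
#endif
        return x;
}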