Subject: Re: [PATCH] mm/slub: Detach node lock from counting free objects
From: Wen Yang
To: Andrew Morton
Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
 Xunlei Pang, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Date: Sun, 16 Feb 2020 12:15:54 +0800
In-Reply-To: <20200212145247.bf89431272038de53dd9d975@linux-foundation.org>
References: <20200201031502.92218-1-wenyang@linux.alibaba.com>
 <20200212145247.bf89431272038de53dd9d975@linux-foundation.org>

On 2020/2/13 6:52 AM, Andrew Morton wrote:
> On Sat, 1 Feb 2020 11:15:02 +0800 Wen Yang wrote:
>
>> The lock protecting the node partial list is taken when counting the
>> free objects resident in that list.
>> It introduces locking contention when the
>> page(s) are moved between the CPU and node partial lists in the allocation
>> path on another CPU. So reading "/proc/slabinfo" can possibly block the
>> slab allocation on another CPU for a while, 200ms in extreme cases. If the
>> slab object is used to carry a network packet targeting a far-end disk
>> array, this causes a block IO jitter issue.
>>
>> This fixes the block IO jitter issue by caching the total inuse objects in
>> the node in advance. The value is retrieved without taking the node partial
>> list lock on reading "/proc/slabinfo".
>>
>> ...
>>
>> @@ -1768,7 +1774,9 @@ static void free_slab(struct kmem_cache *s, struct page *page)
>>  
>>  static void discard_slab(struct kmem_cache *s, struct page *page)
>>  {
>> -	dec_slabs_node(s, page_to_nid(page), page->objects);
>> +	int inuse = page->objects;
>> +
>> +	dec_slabs_node(s, page_to_nid(page), page->objects, inuse);
>
> Is this right?  dec_slabs_node(..., page->objects, page->objects)?
>
> If no, we could simply pass the page* to inc_slabs_node/dec_slabs_node
> and save a function argument.
>
> If yes then why?
>

Thanks for your comments. We are happy to improve this patch based on
your suggestions.

When the user reads /proc/slabinfo, in order to obtain the active_objs
information, the kernel traverses all slab caches and executes the
following code snippet:

static unsigned long count_partial(struct kmem_cache_node *n,
					int (*get_count)(struct page *))
{
	unsigned long flags;
	unsigned long x = 0;
	struct page *page;

	spin_lock_irqsave(&n->list_lock, flags);
	list_for_each_entry(page, &n->partial, slab_list)
		x += get_count(page);
	spin_unlock_irqrestore(&n->list_lock, flags);
	return x;
}

This may cause performance issues.

Christoph suggested: "you could cache the value in the userspace
application? Why is this value read continually?" But reading
/proc/slabinfo is initiated by the user program. As a cloud provider,
we cannot control user behavior. If a user program inadvertently
executes cat /proc/slabinfo, it may affect other user programs.

As Christoph said: "The count is not needed for any operations. Just for
the slabinfo output. The value has no operational value for the
allocator itself. So why use extra logic to track it in potentially
performance critical paths?"

Given that, could we show an approximate value of active_objs in
/proc/slabinfo?

This is based on the following observations:

- In the discard_slab() function, page->inuse is equal to
  page->total_objects;
- In the allocate_slab() function, page->inuse is also equal to
  page->total_objects (with one exception: for kmem_cache_node,
  page->inuse equals 1);
- page->inuse changes only while objects are being allocated or freed
  (this should be the performance critical path emphasized by Christoph).

When users query the global slabinfo information, we may use
total_objects to approximate active_objs.
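To get a rough feel for how far this approximation would deviate on a
given system, the two columns that /proc/slabinfo already exports can be
compared directly from userspace. The following is only an illustrative
sketch (it assumes the usual slabinfo 2.x column order of name,
active_objs, num_objs, and that /proc/slabinfo is readable, which
normally requires root); it is not part of the proposed patch:

#include <stdio.h>
#include <string.h>

/*
 * For every cache, print how many objects the current active_objs
 * accounting subtracts as "free", i.e. the error that reporting
 * num_objs as active_objs would introduce.
 */
int main(void)
{
	char line[512];
	FILE *fp = fopen("/proc/slabinfo", "r");

	if (!fp) {
		perror("fopen /proc/slabinfo");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		char name[64];
		unsigned long active, total;

		/* skip the "slabinfo - version" and "# name ..." header lines */
		if (line[0] == '#' || strncmp(line, "slabinfo", 8) == 0)
			continue;

		if (sscanf(line, "%63s %lu %lu", name, &active, &total) == 3 &&
		    total > active)
			printf("%-24s free=%lu of %lu (%.1f%%)\n",
			       name, total - active, total,
			       100.0 * (total - active) / total);
	}

	fclose(fp);
	return 0;
}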
Following this approach, the modified patch is as follows:

diff --git a/mm/slub.c b/mm/slub.c
index a0b335d..ef0e6ac 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5900,17 +5900,15 @@ void get_slabinfo(struct kmem_cache *s, struct slabinfo *sinfo)
 {
 	unsigned long nr_slabs = 0;
 	unsigned long nr_objs = 0;
-	unsigned long nr_free = 0;
 	int node;
 	struct kmem_cache_node *n;
 
 	for_each_kmem_cache_node(s, node, n) {
 		nr_slabs += node_nr_slabs(n);
 		nr_objs += node_nr_objs(n);
-		nr_free += count_partial(n, count_free);
 	}
 
-	sinfo->active_objs = nr_objs - nr_free;
+	sinfo->active_objs = nr_objs;
 	sinfo->num_objs = nr_objs;
 	sinfo->active_slabs = nr_slabs;
 	sinfo->num_slabs = nr_slabs;

In addition, when the user really needs the precise active_objs value of
a particular slab cache, it can be queried for that single cache through
an interface similar to the following, which avoids traversing all the
slab caches:

# cat /sys/kernel/slab/kmalloc-512/total_objects
1472 N0=1472
# cat /sys/kernel/slab/kmalloc-512/objects
1311 N0=1311

or

# cat /sys/kernel/slab/kmalloc-8k/total_objects
60 N0=60
# cat /sys/kernel/slab/kmalloc-8k/objects
60 N0=60

A minimal C sketch of this kind of per-cache query is appended after the
quoted context at the end of this mail.

Best wishes,
Wen

>>  	free_slab(s, page);
>>  }
>>  
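As mentioned above, the precise per-cache counts can also be read
programmatically from sysfs rather than with cat. The following is only
an illustrative sketch: it relies on the /sys/kernel/slab/<cache>/objects
and total_objects attributes shown above, and the helper and its names
are hypothetical, not part of the patch:

#include <stdio.h>

/* Read the leading number from /sys/kernel/slab/<cache>/<attr>. */
static long read_slab_attr(const char *cache, const char *attr)
{
	char path[256];
	long val = -1;
	FILE *fp;

	snprintf(path, sizeof(path), "/sys/kernel/slab/%s/%s", cache, attr);
	fp = fopen(path, "r");
	if (!fp)
		return -1;
	/* the attribute looks like "1311 N0=1311"; only the total is needed */
	if (fscanf(fp, "%ld", &val) != 1)
		val = -1;
	fclose(fp);
	return val;
}

int main(int argc, char **argv)
{
	const char *cache = argc > 1 ? argv[1] : "kmalloc-512";
	long objects = read_slab_attr(cache, "objects");
	long total_objects = read_slab_attr(cache, "total_objects");

	if (objects < 0 || total_objects < 0) {
		fprintf(stderr, "cannot read sysfs attributes for %s\n", cache);
		return 1;
	}

	/* precise counts for one cache, no walk over every cache in the system */
	printf("%s: objects=%ld total_objects=%ld\n",
	       cache, objects, total_objects);
	return 0;
}

The query is confined to the single cache the user cares about, so the
cost of the precise count is paid only where it is actually needed.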