From: Qi Zheng <zhengqi.arch@bytedance.com>
To: Mike Rapoport, Andrew Morton
Cc: tkhai@ya.ru, hannes@cmpxchg.org, shakeelb@google.com, mhocko@kernel.org,
    roman.gushchin@linux.dev, muchun.song@linux.dev, david@redhat.com,
    shy828301@gmail.com, sultan@kerneltoast.com, dave@stgolabs.net,
    penguin-kernel@i-love.sakura.ne.jp, paulmck@kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 0/8] make slab shrink lockless
Date: Tue, 28 Feb 2023 18:04:53 +0800
Message-ID: <63a16f0e-d6e9-29a1-069e-dc76bfd82319@bytedance.com>
References: <20230226144655.79778-1-zhengqi.arch@bytedance.com>
 <20230226115100.7e12bda7931dd65dbabcebe3@linux-foundation.org>

On 2023/2/27 23:08, Mike Rapoport wrote:
> Hi,
>
> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>
>>
>> On 2023/2/27 03:51, Andrew Morton wrote:
>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng wrote:
>>>
>>>> Hi all,
>>>>
>>>> This patch series aims to make slab shrink lockless.
>>>
>>> What an awesome changelog.
>>>
>>>> 2. Survey
>>>> =========
>>>
>>> Especially this part.
>>>
>>> Looking through all the prior efforts and at this patchset I am not
>>> immediately seeing any statements about the overall effect upon
>>> real-world workloads. For a good example, does this patchset
>>> measurably improve throughput or energy consumption on your servers?
>>
>> Hi Andrew,
>>
>> I re-tested with the following physical machines:
>>
>> Architecture:        x86_64
>> CPU(s):              96
>> On-line CPU(s) list: 0-95
>> Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>
>> I found that the reason for the hotspot I described in the cover letter
>> is wrong. The reason for the down_read_trylock() hotspot is not the
>> failure to trylock, but simply the atomic operation (cmpxchg). And this
>> leads to a significant reduction in IPC (insn per cycle).
>
> ...
>
>> Then we can use the following perf command to view hotspots:
>>
>> perf top -U -F 999
>>
>> 1) Before applying this patchset:
>>
>>   32.31%  [kernel]          [k] down_read_trylock
>>   19.40%  [kernel]          [k] pv_native_safe_halt
>>   16.24%  [kernel]          [k] up_read
>>   15.70%  [kernel]          [k] shrink_slab
>>    4.69%  [kernel]          [k] _find_next_bit
>>    2.62%  [kernel]          [k] shrink_node
>>    1.78%  [kernel]          [k] shrink_lruvec
>>    0.76%  [kernel]          [k] do_shrink_slab
>>
>> 2) After applying this patchset:
>>
>>   27.83%  [kernel]          [k] _find_next_bit
>>   16.97%  [kernel]          [k] shrink_slab
>>   15.82%  [kernel]          [k] pv_native_safe_halt
>>    9.58%  [kernel]          [k] shrink_node
>>    8.31%  [kernel]          [k] shrink_lruvec
>>    5.64%  [kernel]          [k] do_shrink_slab
>>    3.88%  [kernel]          [k] mem_cgroup_iter
>>
>> 2. At the same time, we use the following perf command to capture IPC
>> information:
>>
>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>
>> 1) Before applying this patchset:
>>
>>  Performance counter stats for 'system wide' (5 runs):
>>
>>      454187219766      cycles        test                           ( +- 1.84% )
>>       78896433101      instructions  test  # 0.17 insn per cycle    ( +- 0.44% )
>>
>>        10.0020430 +- 0.0000366 seconds time elapsed  ( +- 0.00% )
>>
>> 2) After applying this patchset:
>>
>>  Performance counter stats for 'system wide' (5 runs):
>>
>>      841954709443      cycles        test                           ( +- 15.80% )  (98.69%)
>>      527258677936      instructions  test  # 0.63 insn per cycle    ( +- 15.11% )  (98.68%)
>>
>>          10.01064 +- 0.00831 seconds time elapsed  ( +- 0.08% )
>>
>> We can see that IPC drops very seriously when calling
>> down_read_trylock() at high frequency. After using SRCU,
>> the IPC is at a normal level.
>
> The results you present do show improvement in IPC for an artificial test
> script.
> But more interesting would be to see how a real world workloads
> benefit from your changes.

Hi Mike and Andrew,

I did encounter this problem under the real workload of our online
servers. At the end of this email, I have posted another call stack and
hotspot that I found before.

I scanned the hotspots of all our online servers yesterday and today,
but unfortunately did not catch a live occurrence. Some of our servers
run a large number of containers, and each container mounts some file
systems. This is likely to trigger the down_read_trylock() hotspot when
memory pressure is high, either on the whole machine or within a memcg.

So yesterday I picked a physical server with a configuration similar to
the online servers and ran a simulation test. The call stack and the
hotspot in the simulation test are almost exactly the same, so in
theory, when such a hotspot appears on an online server, we can also
expect the same IPC improvement. This improves server performance in
memory exhaustion scenarios (at the memcg or global level).

The above scenario is only one aspect; the other is the lock competition
scenario mentioned by Kirill. After applying this patch set, slab shrink
and register_shrinker() can run completely in parallel, which fixes that
problem (a simplified sketch of this locking change is appended below).

These are the two main benefits for real workloads that I see.

Thanks,
Qi

call stack
----------
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    pagecache_get_page+255
    filemap_fault+1361
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 1161690
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    balance_pgdat+690
    kswapd+389
    kthread+246
    ret_from_fork+31
]: 8424884
@[
    down_read_trylock+1
    shrink_slab+128
    shrink_node+371
    do_try_to_free_pages+232
    try_to_free_pages+243
    __alloc_pages_slowpath+771
    __alloc_pages_nodemask+702
    __do_page_cache_readahead+244
    filemap_fault+1674
    ext4_filemap_fault+44
    __do_fault+76
    handle_mm_fault+3543
    do_user_addr_fault+442
    do_page_fault+48
    page_fault+62
]: 20917631

hotspot
-------
  52.22%  [kernel]          [k] down_read_trylock
  19.60%  [kernel]          [k] up_read
   8.86%  [kernel]          [k] shrink_slab
   2.44%  [kernel]          [k] idr_find
   1.25%  [kernel]          [k] count_shadow_nodes
   1.18%  [kernel]          [k] shrink_lruvec
   0.71%  [kernel]          [k] mem_cgroup_iter
   0.71%  [kernel]          [k] shrink_node
   0.55%  [kernel]          [k] find_next_bit

>
>> Thanks,
>> Qi
>

-- 
Thanks,
Qi
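
Appendix: a minimal sketch of the rwsem-to-SRCU direction discussed above.
This is illustrative only, not the actual patch: the names
shrink_slab_rwsem(), shrink_slab_srcu(), shrinker_srcu and
unregister_shrinker_sketch() are simplified stand-ins, and the memcg-aware
path and error handling are omitted.

/*
 * Before: the read side of shrink_slab() serializes against
 * register_shrinker()/unregister_shrinker() with shrinker_rwsem.
 * Even when uncontended, down_read_trylock()/up_read() are atomic
 * RMW operations on a cacheline shared by every CPU doing reclaim.
 */
static unsigned long shrink_slab_rwsem(gfp_t gfp_mask, int nid, int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;

	if (!down_read_trylock(&shrinker_rwsem))	/* cmpxchg on every call */
		return 0;

	list_for_each_entry(shrinker, &shrinker_list, list) {
		struct shrink_control sc = { .gfp_mask = gfp_mask, .nid = nid };

		freed += do_shrink_slab(&sc, shrinker, priority);
	}

	up_read(&shrinker_rwsem);			/* another atomic op */
	return freed;
}

/*
 * After: readers walk the shrinker list inside an SRCU read-side
 * critical section, which does not bounce a shared cacheline, so slab
 * shrink no longer competes with shrinker registration.
 */
DEFINE_SRCU(shrinker_srcu);

static unsigned long shrink_slab_srcu(gfp_t gfp_mask, int nid, int priority)
{
	struct shrinker *shrinker;
	unsigned long freed = 0;
	int idx;

	idx = srcu_read_lock(&shrinker_srcu);

	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
				 srcu_read_lock_held(&shrinker_srcu)) {
		struct shrink_control sc = { .gfp_mask = gfp_mask, .nid = nid };

		freed += do_shrink_slab(&sc, shrinker, priority);
	}

	srcu_read_unlock(&shrinker_srcu, idx);
	return freed;
}

/*
 * Writers still serialize against each other, but waiting for readers
 * becomes a synchronize_srcu() instead of taking the lock for write.
 */
static void unregister_shrinker_sketch(struct shrinker *shrinker)
{
	down_write(&shrinker_rwsem);
	list_del_rcu(&shrinker->list);
	up_write(&shrinker_rwsem);

	synchronize_srcu(&shrinker_srcu);	/* wait for in-flight readers */
}

Whether writers keep the rwsem or switch to a mutex is a detail settled by
the patches themselves; the point of the sketch is that the read side no
longer performs atomic RMW on a shared cacheline.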