Message-ID: <36c737e1-7e1c-7098-8bd5-1767869489d9@bytedance.com>
Date: Tue, 28 Feb 2023 18:53:17 +0800
Subject: Re: [PATCH v3 0/8] make slab shrink lockless
From: Qi Zheng
To: Mike Rapoport, Andrew Morton
Cc: tkhai@ya.ru, hannes@cmpxchg.org, shakeelb@google.com, mhocko@kernel.org,
    roman.gushchin@linux.dev, muchun.song@linux.dev, david@redhat.com,
    shy828301@gmail.com, sultan@kerneltoast.com, dave@stgolabs.net,
    penguin-kernel@i-love.sakura.ne.jp, paulmck@kernel.org,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20230226144655.79778-1-zhengqi.arch@bytedance.com>
 <20230226115100.7e12bda7931dd65dbabcebe3@linux-foundation.org>
 <63a16f0e-d6e9-29a1-069e-dc76bfd82319@bytedance.com>
In-Reply-To: <63a16f0e-d6e9-29a1-069e-dc76bfd82319@bytedance.com>
On 2023/2/28 18:04, Qi Zheng wrote:
>
>
> On 2023/2/27 23:08, Mike Rapoport wrote:
>> Hi,
>>
>> On Mon, Feb 27, 2023 at 09:31:51PM +0800, Qi Zheng wrote:
>>>
>>>
>>> On 2023/2/27 03:51, Andrew Morton wrote:
>>>> On Sun, 26 Feb 2023 22:46:47 +0800 Qi Zheng wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> This patch series aims to make slab shrink lockless.
>>>>
>>>> What an awesome changelog.
>>>>
>>>>> 2. Survey
>>>>> =========
>>>>
>>>> Especially this part.
>>>>
>>>> Looking through all the prior efforts and at this patchset I am not
>>>> immediately seeing any statements about the overall effect upon
>>>> real-world workloads.  For a good example, does this patchset
>>>> measurably improve throughput or energy consumption on your servers?
>>>
>>> Hi Andrew,
>>>
>>> I re-tested with the following physical machine:
>>>
>>> Architecture:        x86_64
>>> CPU(s):              96
>>> On-line CPU(s) list: 0-95
>>> Model name:          Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
>>>
>>> I found that the explanation for the hotspot I gave in the cover
>>> letter is wrong. The down_read_trylock() hotspot is not caused by
>>> trylock failures, but simply by the atomic operation itself
>>> (cmpxchg). This leads to a significant reduction in IPC
>>> (instructions per cycle).
>>
>> ...
>>> Then we can use the following perf command to view hotspots:
>>>
>>> perf top -U -F 999
>>>
>>> 1) Before applying this patchset:
>>>
>>>    32.31%  [kernel]           [k] down_read_trylock
>>>    19.40%  [kernel]           [k] pv_native_safe_halt
>>>    16.24%  [kernel]           [k] up_read
>>>    15.70%  [kernel]           [k] shrink_slab
>>>     4.69%  [kernel]           [k] _find_next_bit
>>>     2.62%  [kernel]           [k] shrink_node
>>>     1.78%  [kernel]           [k] shrink_lruvec
>>>     0.76%  [kernel]           [k] do_shrink_slab
>>>
>>> 2) After applying this patchset:
>>>
>>>    27.83%  [kernel]           [k] _find_next_bit
>>>    16.97%  [kernel]           [k] shrink_slab
>>>    15.82%  [kernel]           [k] pv_native_safe_halt
>>>     9.58%  [kernel]           [k] shrink_node
>>>     8.31%  [kernel]           [k] shrink_lruvec
>>>     5.64%  [kernel]           [k] do_shrink_slab
>>>     3.88%  [kernel]           [k] mem_cgroup_iter
>>>
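To spell out why down_read_trylock() and up_read() drop out of the
profile above, here is a rough sketch of the read-side change. It is
simplified and is not the exact diff from this series (some names and
details differ); the point is only that the per-call atomic cmpxchg on
the global shrinker_rwsem is replaced by an SRCU read-side section,
which just bumps a per-CPU counter:

/* Before: every shrink_slab() call takes the global rwsem. */
unsigned long shrink_slab(gfp_t gfp_mask, int nid,
			  struct mem_cgroup *memcg, int priority)
{
	unsigned long freed = 0;
	struct shrinker *shrinker;

	if (!down_read_trylock(&shrinker_rwsem))	/* atomic RMW */
		return 0;

	list_for_each_entry(shrinker, &shrinker_list, list) {
		/* set up shrink_control, call do_shrink_slab(), add to freed */
	}

	up_read(&shrinker_rwsem);			/* atomic RMW again */
	return freed;
}

/* After: SRCU-protected (lockless) walk of the shrinker list. */
DEFINE_SRCU(shrinker_srcu);

unsigned long shrink_slab(gfp_t gfp_mask, int nid,
			  struct mem_cgroup *memcg, int priority)
{
	unsigned long freed = 0;
	struct shrinker *shrinker;
	int idx;

	idx = srcu_read_lock(&shrinker_srcu);		/* per-CPU counter */

	list_for_each_entry_srcu(shrinker, &shrinker_list, list,
				 srcu_read_lock_held(&shrinker_srcu)) {
		/* set up shrink_control, call do_shrink_slab(), add to freed */
	}

	srcu_read_unlock(&shrinker_srcu, idx);
	return freed;
}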
>>> 2. At the same time, we use the following perf command to capture IPC
>>> information:
>>>
>>> perf stat -e cycles,instructions -G test -a --repeat 5 -- sleep 10
>>>
>>> 1) Before applying this patchset:
>>>
>>>  Performance counter stats for 'system wide' (5 runs):
>>>
>>>      454187219766      cycles        test                            ( +-  1.84% )
>>>       78896433101      instructions  test    #  0.17 insn per cycle  ( +-  0.44% )
>>>
>>>        10.0020430 +- 0.0000366 seconds time elapsed  ( +-  0.00% )
>>>
>>> 2) After applying this patchset:
>>>
>>>  Performance counter stats for 'system wide' (5 runs):
>>>
>>>      841954709443      cycles        test                            ( +- 15.80% )  (98.69%)
>>>      527258677936      instructions  test    #  0.63 insn per cycle  ( +- 15.11% )  (98.68%)
>>>
>>>          10.01064 +- 0.00831 seconds time elapsed  ( +-  0.08% )
>>>
>>> We can see that IPC drops sharply when down_read_trylock() is called
>>> at high frequency. After switching to SRCU, IPC returns to a normal
>>> level.
>>
>> The results you present do show improvement in IPC for an artificial
>> test script. But more interesting would be to see how real-world
>> workloads benefit from your changes.
>
> Hi Mike and Andrew,
>
> I did encounter this problem under the real workload of our online
> servers. At the end of this email, I posted another call stack and
> hotspot that I found before.
>
> I scanned the hotspots of all our online servers yesterday and today,
> but unfortunately did not find a live occurrence.
>
> Some of our servers run a large number of containers, and each
> container mounts some file systems. This is likely to trigger
> down_read_trylock() hotspots when memory pressure is high, either
> machine-wide or within a memcg.

And the servers where this hotspot has occurred (we have hotspot alarm
records) basically have 96 cores, 128 cores, or even more.

> So yesterday I found a physical server with a configuration similar to
> the online servers and ran a simulation test. The call stack and
> hotspot in the simulation test are almost exactly the same, so in
> theory, when such a hotspot appears on an online server, we will also
> get the IPC improvement. This will improve server performance in
> memory-exhaustion scenarios (at the memcg or global level).
>
> The above scenario is only one aspect; the other is the lock
> contention scenario mentioned by Kirill. After applying this patch
> set, slab shrink and register_shrinker() can run completely in
> parallel, which fixes that problem.
>
> These are the two main benefits for real workloads that I consider.
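To make the second point concrete, the update side under the SRCU
scheme looks roughly like the sketch below. This is again simplified,
with illustrative names (shrinker_mutex and the *_sketch helpers are
placeholders, not necessarily the names used in the series), and
shrinker_srcu is the srcu_struct from the read-side sketch above.
Writers only serialize against each other, so registering or
unregistering a shrinker no longer contends with a running
shrink_slab(), and vice versa:

static LIST_HEAD(shrinker_list);
static DEFINE_MUTEX(shrinker_mutex);	/* serializes writers only */

void register_shrinker_sketch(struct shrinker *shrinker)
{
	mutex_lock(&shrinker_mutex);
	list_add_tail_rcu(&shrinker->list, &shrinker_list);
	mutex_unlock(&shrinker_mutex);
}

void unregister_shrinker_sketch(struct shrinker *shrinker)
{
	mutex_lock(&shrinker_mutex);
	list_del_rcu(&shrinker->list);
	mutex_unlock(&shrinker_mutex);

	/*
	 * Wait for any concurrent shrink_slab() (an SRCU reader that
	 * might still see this shrinker) before the caller frees it.
	 */
	synchronize_srcu(&shrinker_srcu);
}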
>
> Thanks,
> Qi
>
> call stack
> ----------
>
> @[
>     down_read_trylock+1
>     shrink_slab+128
>     shrink_node+371
>     do_try_to_free_pages+232
>     try_to_free_pages+243
>     __alloc_pages_slowpath+771
>     __alloc_pages_nodemask+702
>     pagecache_get_page+255
>     filemap_fault+1361
>     ext4_filemap_fault+44
>     __do_fault+76
>     handle_mm_fault+3543
>     do_user_addr_fault+442
>     do_page_fault+48
>     page_fault+62
> ]: 1161690
> @[
>     down_read_trylock+1
>     shrink_slab+128
>     shrink_node+371
>     balance_pgdat+690
>     kswapd+389
>     kthread+246
>     ret_from_fork+31
> ]: 8424884
> @[
>     down_read_trylock+1
>     shrink_slab+128
>     shrink_node+371
>     do_try_to_free_pages+232
>     try_to_free_pages+243
>     __alloc_pages_slowpath+771
>     __alloc_pages_nodemask+702
>     __do_page_cache_readahead+244
>     filemap_fault+1674
>     ext4_filemap_fault+44
>     __do_fault+76
>     handle_mm_fault+3543
>     do_user_addr_fault+442
>     do_page_fault+48
>     page_fault+62
> ]: 20917631
>
> hotspot
> -------
>
> 52.22% [kernel]        [k] down_read_trylock
> 19.60% [kernel]        [k] up_read
>  8.86% [kernel]        [k] shrink_slab
>  2.44% [kernel]        [k] idr_find
>  1.25% [kernel]        [k] count_shadow_nodes
>  1.18% [kernel]        [k] shrink_lruvec
>  0.71% [kernel]        [k] mem_cgroup_iter
>  0.71% [kernel]        [k] shrink_node
>  0.55% [kernel]        [k] find_next_bit
>
>>> Thanks,
>>> Qi
>>
>

--
Thanks,
Qi