Subject: Re: [RFC PATCH 0/3] mm: convert mm's rss stats into lazy_percpu_counter
Date: Mon, 15 Apr 2024 20:33:44 +0800
To: Jan Kara
From: "zhangpeng (AS)" <zhangpeng362@huawei.com>
In-Reply-To: <20240412135333.btd6e7wfprg4cmx2@quack3>
References: <20240412092441.3112481-1-zhangpeng362@huawei.com>
 <20240412135333.btd6e7wfprg4cmx2@quack3>

On 2024/4/12 21:53, Jan Kara wrote:
> On Fri 12-04-24 17:24:38, Peng Zhang wrote:
>> From: ZhangPeng
>>
>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>> percpu_counter"), the rss_stats have been converted into percpu_counter,
>> which changes the error margin from (nr_threads * 64) to approximately
>> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
>> performance regression on fork/exec/shell. Even after commit
>> 14ef95be6f55 ("kernel/fork: group allocation/free of per-cpu counters
>> for mm struct"), the performance of fork/exec/shell is still poor
>> compared to previous kernel versions.
>>
>> To mitigate the performance regression, we use lazy_percpu_counter [1]
>> to delay the allocation of percpu memory for rss_stats. With lmbench we
>> get a 3% ~ 6% performance improvement for fork_proc/exec_proc/shell_proc
>> after the conversion.
>>
>> The test results are as follows:
>>
>>               base        base+revert       base+lazy_percpu_counter
>>
>> fork_proc    427.4ms     394.1ms (7.8%)     413.9ms (3.2%)
>> exec_proc   2205.1ms    2042.2ms (7.4%)    2072.0ms (6.0%)
>> shell_proc  3180.9ms    2963.7ms (6.8%)    3010.7ms (5.4%)
>>
>> This solution has not been fully evaluated and tested. The main idea of
>> this RFC patch series is to get the community's opinion on this approach.
> Thanks! I like the idea and in fact I wanted to do something similar (just
> never got to it). Thread [2] has a couple of good observations regarding
> this problem. A couple of thoughts regarding your approach:
>
> 1) I think switching to a pcpu counter when the update rate exceeds 256
> updates/s is not a great fit for RSS because the updates are going to be
> frequent in some cases but usually they will all happen from one thread.
> So I think it would make more sense to move the decision of switching to
> pcpu mode from the counter itself into the callers and just switch on
> clone() when the second thread gets created.
>
> 2) I thought that for RSS lazy percpu counters, we could directly use
> struct percpu_counter and just make it that if 'counters' is NULL, the
> counter is in atomic mode (count is used as atomic_long_t), and if
> counters != NULL, we are in pcpu mode.

Thanks for your reply! I agree with your thoughts and will implement this
in the next version.
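To make sure I read 1) and 2) correctly, below is a rough, untested sketch of
what I plan to try. mm_counter_add() and mm_counter_switch_to_pcpu() are
placeholder names, not a final API, and it assumes CONFIG_SMP and a 64-bit
build so that the s64 'count' can be treated as an atomic_long_t:

/* Untested sketch only -- placeholder names, not the final API. */
static inline void mm_counter_add(struct percpu_counter *fbc, s64 amount)
{
        s32 __percpu *counters = READ_ONCE(fbc->counters);

        if (!counters) {
                /* Atomic mode: no percpu memory allocated yet. */
                atomic_long_add(amount, (atomic_long_t *)&fbc->count);
                return;
        }
        /* Pcpu mode: use the normal percpu_counter path. */
        percpu_counter_add(fbc, amount);
}

/* Called from the clone() path when the second thread is created. */
static int mm_counter_switch_to_pcpu(struct percpu_counter *fbc)
{
        s32 __percpu *counters = alloc_percpu_gfp(s32, GFP_KERNEL);

        if (!counters)
                return -ENOMEM; /* Keep working in atomic mode. */

        /*
         * The value accumulated in atomic mode stays in fbc->count and
         * simply becomes the base count of the percpu counter.
         */
        smp_store_release(&fbc->counters, counters);
        return 0;
}

Readers could keep using percpu_counter_read()/percpu_counter_sum(), since
in atomic mode the whole value lives in fbc->count anyway. fbc->lock would
still be initialized up front in mm_init(); only the percpu allocation is
deferred.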
> 3) In [2] Mateusz had a good observation that the old RSS counters
> actually used atomic operations only in rare cases so even lazy pcpu
> counters are going to have worse performance for singlethreaded processes
> than the old code. We could *almost* get away with non-atomic updates to
> counter->count if it was not for occasional RSS updates from unrelated
> tasks. So it might be worth it to further optimize the counters as:
>
> struct rss_counter_single {
>         void *state;                  /* To detect switching to pcpu mode */
>         atomic_long_t counter_atomic; /* Used for foreign updates */
>         long counter;                 /* Used by local updates */
> }
>
> struct rss_counter {
>         union {
>                 struct rss_counter_single single;
>                 /* struct percpu_counter needs to be modified to have
>                  * 'counters' first to avoid issues for different
>                  * architectures or with CONFIG_HOTPLUG_CPU enabled */
>                 struct percpu_counter pcpu;
>         }
> }
>
> But I'm not sure this complexity is worth it so I'd do it as a separate
> patch with separate benchmarking if at all.
>
>                                                                 Honza

Agreed. Single-threaded processes don't need atomic operations, and this
scenario needs to be thoroughly tested. I'll try to implement it in another
patch series after I finish the basic approach.

> [2] https://lore.kernel.org/all/ZOPSEJTzrow8YFix@snowbird/
>
>> [1] https://lore.kernel.org/linux-iommu/20230501165450.15352-8-surenb@google.com/
>>
>> Kent Overstreet (1):
>>   Lazy percpu counters
>>
>> ZhangPeng (2):
>>   lazy_percpu_counter: include struct percpu_counter in struct
>>     lazy_percpu_counter
>>   mm: convert mm's rss stats into lazy_percpu_counter
>>
>>  include/linux/lazy-percpu-counter.h |  88 +++++++++++++++++++
>>  include/linux/mm.h                  |   8 +-
>>  include/linux/mm_types.h            |   4 +-
>>  include/trace/events/kmem.h         |   4 +-
>>  kernel/fork.c                       |  12 +--
>>  lib/Makefile                        |   2 +-
>>  lib/lazy-percpu-counter.c           | 131 ++++++++++++++++++++++++++++
>>  7 files changed, 232 insertions(+), 17 deletions(-)
>>  create mode 100644 include/linux/lazy-percpu-counter.h
>>  create mode 100644 lib/lazy-percpu-counter.c
>>
>> --
>> 2.25.1
>>

-- 
Best Regards,
Peng
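P.S. Regarding 3), a first cut of the local/foreign split could look roughly
like the untested sketch below. rss_counter_add() is a placeholder name, and
the current->mm check is only a stand-in for however we end up deciding that
an update is "local":

static inline void rss_counter_add(struct mm_struct *mm,
                                   struct rss_counter *rc, long amount)
{
        /* 'single.state' aliases 'pcpu.counters' inside the union. */
        if (READ_ONCE(rc->single.state)) {
                percpu_counter_add(&rc->pcpu, amount);
        } else if (current->mm == mm) {
                /* Local update by the single-threaded owner: non-atomic. */
                rc->single.counter += amount;
        } else {
                /* Occasional foreign update from an unrelated task. */
                atomic_long_add(amount, &rc->single.counter_atomic);
        }
}

Readers would sum 'counter' and 'counter_atomic' (plus the percpu part once
switched), and the switch to pcpu mode would have to fold both fields into
the percpu counter's base count. As you say, whether that is worth the extra
complexity needs its own benchmarking, so I'd keep it out of the basic series.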