Subject: Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
Date: Sat, 20 Apr 2024 16:44:40 +0800
To: Rongwei Wang
From: "zhangpeng (AS)" <zhangpeng362@huawei.com>
In-Reply-To: <6a3b8095-8f49-47e0-a347-9e4a51806bf8@gmail.com>
References: <20240418142008.2775308-1-zhangpeng362@huawei.com> <20240418142008.2775308-3-zhangpeng362@huawei.com> <6a3b8095-8f49-47e0-a347-9e4a51806bf8@gmail.com>

On 2024/4/20 11:13, Rongwei Wang wrote:
> On 2024/4/19 11:32, zhangpeng (AS) wrote:
>> On 2024/4/19 10:30, Rongwei Wang wrote:
>>
>>> On 2024/4/18 22:20, Peng Zhang wrote:
>>>> From: ZhangPeng
>>>>
>>>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>>>> percpu_counter"), the rss_stats have been converted into percpu_counter,
>>>> which changes the error margin from (nr_threads * 64) to approximately
>>>> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
>>>> performance regression on fork/exec/shell. Even after commit 14ef95be6f55
>>>> ("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
>>>> the performance of fork/exec/shell is still poor compared to previous
>>>> kernel versions.
>>>>
>>>> To mitigate the performance regression, we delay the allocation of percpu
>>>> memory for rss_stats. Therefore, we convert mm's rss stats to use
>>>> percpu_counter's atomic mode. For single-threaded processes, rss_stat
>>>> stays in atomic mode, which reduces the memory consumption and the
>>>> performance regression caused by using percpu memory. For multi-threaded
>>>> processes, rss_stat is switched to percpu mode to reduce the error margin.
>>>> We convert rss_stats from atomic mode to percpu mode only when the
>>>> second thread is created.
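
For readers who don't have patch 1/2 of this series in front of them (it is
not quoted in this mail), "atomic mode" roughly means the counter keeps its
value in the shared fields of struct percpu_counter until it is explicitly
promoted to per-CPU storage. The sketch below is only an illustration of that
idea: the two helper names match the ones used by the patch further down, but
their bodies here are guesses rather than the code from patch 1/2, and the
single-counter switch helper is purely hypothetical.

/*
 * Illustrative sketch only -- not the implementation from patch 1/2.
 * While fbc->counters is NULL (i.e. !percpu_counter_initialized(fbc)),
 * the value is assumed to live in fbc->count.
 */
static inline s64 percpu_counter_atomic_read(struct percpu_counter *fbc)
{
        return fbc->count;
}

static inline void percpu_counter_atomic_add(struct percpu_counter *fbc,
                                             s64 amount)
{
        unsigned long flags;

        raw_spin_lock_irqsave(&fbc->lock, flags);
        fbc->count += amount;
        raw_spin_unlock_irqrestore(&fbc->lock, flags);
}

/*
 * Hypothetical promotion to percpu mode: allocate the per-CPU storage and
 * keep the accumulated atomic-mode value in fbc->count. Locking/ordering is
 * omitted, and the real helper in patch 1/2 handles an array of counters.
 */
static inline int percpu_counter_switch_to_pcpu(struct percpu_counter *fbc)
{
        s32 __percpu *counters = alloc_percpu_gfp(s32, GFP_KERNEL_ACCOUNT);

        if (!counters)
                return -ENOMEM;

        fbc->counters = counters;
        return 0;
}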
>>> Hi, Zhang Peng
>>>
>>> We also found this regression in lmbench recently. I have not tested
>>> your patch yet, but it looks like it will help a lot.
>>> I also see that this patch does not fix the regression for
>>> multi-threaded processes; is that because rss_stat is switched to
>>> percpu mode? (If I'm wrong, please correct me.) It also seems that
>>> percpu_counter has a bad effect in exit_mmap().
>>>
>>> If so, I'm wondering if we can further improve the exit_mmap() path
>>> in the multi-threaded scenario, e.g. by determining which CPUs the
>>> process has run on (mm_cpumask()? I'm not sure).
>>>
>> Hi, Rongwei,
>>
>> Yes, this patch only fixes the regression for single-threaded processes.
>> How much of a bad effect does percpu_counter have in exit_mmap()? IMHO,
>> the addition
> Actually, I'm not sure; I just found a small free-percpu hotspot in the
> exit_mmap() path when comparing 4 cores vs 32 cores.
>
> I can test more next.

Thanks, it would be better if there were test data.

>> of mm counters is already done in batch mode, maybe I missed something?
>>
>>>>
>>>> In lmbench tests, we get a 2% ~ 4% performance improvement for
>>>> lmbench fork_proc/exec_proc/shell_proc and a 6.7% performance
>>>> improvement for lmbench page_fault (before batch mode[1]).
>>>>
>>>> The test results are as follows:
>>>>
>>>>              base           base+revert        base+this patch
>>>>
>>>> fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms (4.2%)
>>>> exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
>>>> shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
>>>> page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
>>> I think the regression will become more obvious with more cores. How
>>> about your test machine?
>>>
>> Multi-core may not be a factor in the performance of the lmbench test
>> here. Both of my test machines have 96 cores.
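
By the way, on the exit_mmap() point above: if mm_cpumask() turned out to be
usable for this, the idea would look roughly like the sketch below. This is
purely hypothetical and not part of this series; whether mm_cpumask() is
guaranteed to cover every CPU that has updated the counters is exactly the
open question.

/*
 * Hypothetical sketch of the mm_cpumask() idea -- not part of this series,
 * not tested. Sum a per-mm counter over only the CPUs the mm has run on,
 * instead of every online CPU, to cheapen the final sums in exit_mmap().
 * Locking is omitted (percpu_counter_sum() normally takes fbc->lock).
 */
static s64 mm_counter_sum_cpumask(struct mm_struct *mm, int member)
{
        struct percpu_counter *fbc = &mm->rss_stat[member];
        s64 ret = percpu_counter_read(fbc);   /* shared part of the count */
        int cpu;

        if (!percpu_counter_initialized(fbc)) /* still in atomic mode */
                return ret;

        for_each_cpu(cpu, mm_cpumask(mm))     /* only CPUs the mm ran on */
                ret += *per_cpu_ptr(fbc->counters, cpu);

        return ret;
}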
>>
>>> Thanks,
>>> -wrw
>>>>
>>>> [1]
>>>> https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>>>>
>>>> Suggested-by: Jan Kara
>>>> Signed-off-by: ZhangPeng
>>>> Signed-off-by: Kefeng Wang
>>>> ---
>>>>   include/linux/mm.h          | 50 +++++++++++++++++++++++++++++++------
>>>>   include/trace/events/kmem.h |  4 +--
>>>>   kernel/fork.c               | 18 +++++++------
>>>>   3 files changed, 56 insertions(+), 16 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index d261e45bb29b..8f1bfbd54697 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -2631,30 +2631,66 @@ static inline bool get_user_page_fast_only(unsigned long addr,
>>>>    */
>>>>   static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>>>>   {
>>>> -    return percpu_counter_read_positive(&mm->rss_stat[member]);
>>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>>> +
>>>> +    if (percpu_counter_initialized(fbc))
>>>> +        return percpu_counter_read_positive(fbc);
>>>> +
>>>> +    return percpu_counter_atomic_read(fbc);
>>>>   }
>>>>
>>>>   void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>>>
>>>>   static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>>>>   {
>>>> -    percpu_counter_add(&mm->rss_stat[member], value);
>>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>>> +
>>>> +    if (percpu_counter_initialized(fbc))
>>>> +        percpu_counter_add(fbc, value);
>>>> +    else
>>>> +        percpu_counter_atomic_add(fbc, value);
>>>>
>>>>       mm_trace_rss_stat(mm, member);
>>>>   }
>>>>
>>>>   static inline void inc_mm_counter(struct mm_struct *mm, int member)
>>>>   {
>>>> -    percpu_counter_inc(&mm->rss_stat[member]);
>>>> -
>>>> -    mm_trace_rss_stat(mm, member);
>>>> +    add_mm_counter(mm, member, 1);
>>>>   }
>>>>
>>>>   static inline void dec_mm_counter(struct mm_struct *mm, int member)
>>>>   {
>>>> -    percpu_counter_dec(&mm->rss_stat[member]);
>>>> +    add_mm_counter(mm, member, -1);
>>>> +}
>>>>
>>>> -    mm_trace_rss_stat(mm, member);
>>>> +static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
>>>> +{
>>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>>> +
>>>> +    if (percpu_counter_initialized(fbc))
>>>> +        return percpu_counter_sum(fbc);
>>>> +
>>>> +    return percpu_counter_atomic_read(fbc);
>>>> +}
>>>> +
>>>> +static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int member)
>>>> +{
>>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>>> +
>>>> +    if (percpu_counter_initialized(fbc))
>>>> +        return percpu_counter_sum_positive(fbc);
>>>> +
>>>> +    return percpu_counter_atomic_read(fbc);
>>>> +}
>>>> +
>>>> +static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
>>>> +{
>>>> +    return percpu_counter_switch_to_pcpu_many(mm->rss_stat, NR_MM_COUNTERS);
>>>> +}
>>>> +
>>>> +static inline void mm_counter_destroy_many(struct mm_struct *mm)
>>>> +{
>>>> +    percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
>>>>   }
>>>>
>>>>   /* Optimized variant when folio is already known not to be anon */
>>>> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
>>>> index 6e62cc64cd92..a4e40ae6a8c8 100644
>>>> --- a/include/trace/events/kmem.h
>>>> +++ b/include/trace/events/kmem.h
>>>> @@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
>>>>           __entry->mm_id = mm_ptr_to_hash(mm);
>>>>           __entry->curr = !!(current->mm == mm);
>>>>           __entry->member = member;
>>>> -        __entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
>>>> -                                << PAGE_SHIFT);
>>>> +        __entry->size = (mm_counter_sum_positive(mm, member)
>>>> +                            << PAGE_SHIFT);
>>>>       ),
>>>>
>>>>       TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
>>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>>> index 99076dbe27d8..0214273798c5 100644
>>>> --- a/kernel/fork.c
>>>> +++ b/kernel/fork.c
>>>> @@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
>>>>                "Please make sure 'struct resident_page_types[]' is updated as well");
>>>>
>>>>       for (i = 0; i < NR_MM_COUNTERS; i++) {
>>>> -        long x = percpu_counter_sum(&mm->rss_stat[i]);
>>>> +        long x = mm_counter_sum(mm, i);
>>>>
>>>>           if (unlikely(x))
>>>>               pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
>>>> @@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>>>       if (mm_alloc_cid(mm))
>>>>           goto fail_cid;
>>>>
>>>> -    if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
>>>> -                     NR_MM_COUNTERS))
>>>> -        goto fail_pcpu;
>>>> -
>>>>       mm->user_ns = get_user_ns(user_ns);
>>>>       lru_gen_init_mm(mm);
>>>>       return mm;
>>>>
>>>> -fail_pcpu:
>>>> -    mm_destroy_cid(mm);
>>>>   fail_cid:
>>>>       destroy_context(mm);
>>>>   fail_nocontext:
>>>> @@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>>>>       if (!oldmm)
>>>>           return 0;
>>>>
>>>> +    /*
>>>> +     * For single-thread processes, rss_stat is in atomic mode, which
>>>> +     * reduces the memory consumption and performance regression caused by
>>>> +     * using percpu. For multiple-thread processes, rss_stat is switched to
>>>> +     * the percpu mode to reduce the error margin.
>>>> +     */
>>>> +    if (clone_flags & CLONE_THREAD)
>>>> +        if (mm_counter_switch_to_pcpu_many(oldmm))
>>>> +            return -ENOMEM;
>>>> +
>>>>       if (clone_flags & CLONE_VM) {
>>>>           mmget(oldmm);
>>>>           mm = oldmm;
>>>
>>>
>
>

--
Best Regards,
Peng