Date: Fri, 19 Apr 2024 11:32:41 +0800
Subject: Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
To: Rongwei Wang
From: "zhangpeng (AS)"
References: <20240418142008.2775308-1-zhangpeng362@huawei.com> <20240418142008.2775308-3-zhangpeng362@huawei.com>

On 2024/4/19 10:30, Rongwei Wang wrote:
> On 2024/4/18 22:20, Peng Zhang wrote:
>> From: ZhangPeng
>>
>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>> percpu_counter"), the rss_stats have been converted into
>> percpu_counter, which changes the error margin from (nr_threads * 64)
>> to approximately (nr_cpus ^ 2). However, the new percpu allocation in
>> mm_init() causes a performance regression on fork/exec/shell. Even
>> after commit 14ef95be6f55 ("kernel/fork: group allocation/free of
>> per-cpu counters for mm struct"), the performance of fork/exec/shell
>> is still poor compared to previous kernel versions.
>>
>> To mitigate the performance regression, we delay the allocation of
>> percpu memory for rss_stats. Therefore, we convert mm's rss stats to
>> use percpu_counter atomic mode. For single-thread processes, rss_stat
>> is in atomic mode, which reduces the memory consumption and
>> performance regression caused by using percpu. For multiple-thread
>> processes, rss_stat is switched to the percpu mode to reduce the
>> error margin. We convert rss_stats from atomic mode to percpu mode
>> only when the second thread is created.
> Hi, Zhang Peng
>
> We also found this regression in lmbench these days. I have not tested
> your patch yet, but it looks like it will help a lot.
> I also see that this patch does not fix the regression for
> multi-threaded processes; is that because rss_stat is switched to
> percpu mode? (If I'm wrong, please correct me.) It also seems that
> percpu_counter has a bad effect in exit_mmap().
>
> If so, I'm wondering whether we can further improve the exit_mmap()
> path in the multi-threaded scenario, e.g. by determining which CPUs
> the process has run on (mm_cpumask()? I'm not sure).
>
Hi, Rongwei,

Yes, this patch only fixes the regression for single-thread processes.

How much of a negative effect does percpu_counter have in exit_mmap()?
IMHO, the mm counter additions on that path are already done in batch
mode, so maybe I am missing something?
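To illustrate what I mean by batch mode: if I remember correctly, the
pte zap loop only does plain increments on a local
"int rss[NR_MM_COUNTERS]" array, and the rss_stat counters are only
touched once per batch (add_mm_rss_vec() in mm/memory.c does roughly
this). A minimal sketch, with a made-up helper name for illustration
rather than the exact kernel code:

static void example_flush_rss(struct mm_struct *mm, int rss[NR_MM_COUNTERS])
{
	int i;

	for (i = 0; i < NR_MM_COUNTERS; i++) {
		if (rss[i]) {
			/* one counter update per type, not per unmapped page */
			add_mm_counter(mm, i, rss[i]);
			rss[i] = 0;
		}
	}
}

So the per-page cost stays on plain local ints either way; only the
per-batch flush goes through add_mm_counter().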
>>
>> After the lmbench tests, we can get a 2% ~ 4% performance improvement
>> for lmbench fork_proc/exec_proc/shell_proc and a 6.7% performance
>> improvement for lmbench page_fault (before batch mode [1]).
>>
>> The test results are as follows:
>>
>>              base           base+revert        base+this patch
>>
>> fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
>> exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
>> shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
>> page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
> I think the regression will become more obvious with more cores. What
> about your test machine?
>
Maybe the number of cores is not a factor in the lmbench results here.
Both of my test machines have 96 cores.

> Thanks,
> -wrw
>>
>> [1] https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>>
>> Suggested-by: Jan Kara
>> Signed-off-by: ZhangPeng
>> Signed-off-by: Kefeng Wang
>> ---
>>  include/linux/mm.h          | 50 +++++++++++++++++++++++++++++++------
>>  include/trace/events/kmem.h |  4 +--
>>  kernel/fork.c               | 18 +++++++------
>>  3 files changed, 56 insertions(+), 16 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index d261e45bb29b..8f1bfbd54697 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2631,30 +2631,66 @@ static inline bool get_user_page_fast_only(unsigned long addr,
>>   */
>>  static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>>  {
>> -    return percpu_counter_read_positive(&mm->rss_stat[member]);
>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>> +
>> +    if (percpu_counter_initialized(fbc))
>> +        return percpu_counter_read_positive(fbc);
>> +
>> +    return percpu_counter_atomic_read(fbc);
>>  }
>>
>>  void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>
>>  static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>>  {
>> -    percpu_counter_add(&mm->rss_stat[member], value);
>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>> +
>> +    if (percpu_counter_initialized(fbc))
>> +        percpu_counter_add(fbc, value);
>> +    else
>> +        percpu_counter_atomic_add(fbc, value);
>>
>>      mm_trace_rss_stat(mm, member);
>>  }
>>
>>  static inline void inc_mm_counter(struct mm_struct *mm, int member)
>>  {
>> -    percpu_counter_inc(&mm->rss_stat[member]);
>> -
>> -    mm_trace_rss_stat(mm, member);
>> +    add_mm_counter(mm, member, 1);
>>  }
>>
>>  static inline void dec_mm_counter(struct mm_struct *mm, int member)
>>  {
>> -    percpu_counter_dec(&mm->rss_stat[member]);
>> +    add_mm_counter(mm, member, -1);
>> +}
>>
>> -    mm_trace_rss_stat(mm, member);
>> +static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
>> +{
>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>> +
>> +    if (percpu_counter_initialized(fbc))
>> +        return percpu_counter_sum(fbc);
>> +
>> +    return percpu_counter_atomic_read(fbc);
>> +}
>> +
>> +static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int member)
>> +{
>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>> +
>> +    if (percpu_counter_initialized(fbc))
>> +        return percpu_counter_sum_positive(fbc);
>> +
>> +    return percpu_counter_atomic_read(fbc);
>> +}
>> +
>> +static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
>> +{
>> +    return percpu_counter_switch_to_pcpu_many(mm->rss_stat, NR_MM_COUNTERS);
>> +}
>> +
>> +static inline void mm_counter_destroy_many(struct mm_struct *mm)
>> +{
>> +    percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
>>  }
>>
>>  /* Optimized variant when folio is already known not to be anon */
>> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
>> index 6e62cc64cd92..a4e40ae6a8c8 100644
>> --- a/include/trace/events/kmem.h
>> +++ b/include/trace/events/kmem.h
>> @@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
>>          __entry->mm_id = mm_ptr_to_hash(mm);
>>          __entry->curr = !!(current->mm == mm);
>>          __entry->member = member;
>> -        __entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
>> -                                << PAGE_SHIFT);
>> +        __entry->size = (mm_counter_sum_positive(mm, member)
>> +                            << PAGE_SHIFT);
>>      ),
>>
>>      TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 99076dbe27d8..0214273798c5 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
>>               "Please make sure 'struct resident_page_types[]' is updated as well");
>>
>>      for (i = 0; i < NR_MM_COUNTERS; i++) {
>> -        long x = percpu_counter_sum(&mm->rss_stat[i]);
>> +        long x = mm_counter_sum(mm, i);
>>
>>          if (unlikely(x))
>>              pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
>> @@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>      if (mm_alloc_cid(mm))
>>          goto fail_cid;
>>
>> -    if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
>> -                     NR_MM_COUNTERS))
>> -        goto fail_pcpu;
>> -
>>      mm->user_ns = get_user_ns(user_ns);
>>      lru_gen_init_mm(mm);
>>      return mm;
>>
>> -fail_pcpu:
>> -    mm_destroy_cid(mm);
>>  fail_cid:
>>      destroy_context(mm);
>>  fail_nocontext:
>> @@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>>      if (!oldmm)
>>          return 0;
>>
>> +    /*
>> +     * For single-thread processes, rss_stat is in atomic mode, which
>> +     * reduces the memory consumption and performance regression caused by
>> +     * using percpu. For multiple-thread processes, rss_stat is switched to
>> +     * the percpu mode to reduce the error margin.
>> +     */
>> +    if (clone_flags & CLONE_THREAD)
>> +        if (mm_counter_switch_to_pcpu_many(oldmm))
>> +            return -ENOMEM;
>> +
>>      if (clone_flags & CLONE_VM) {
>>          mmget(oldmm);
>>          mm = oldmm;
>

-- 
Best Regards,
Peng
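P.S. For context on the error-margin trade-off behind the CLONE_THREAD
switch above, here is a back-of-the-envelope estimate, assuming
lib/percpu_counter.c still sizes the batch as
percpu_counter_batch = max(32, 2 * num_online_cpus()) (please correct
me if that has changed):

    worst-case drift per counter ~= nr_cpus * percpu_counter_batch
                                 ~= nr_cpus * (2 * nr_cpus) pages
    e.g. 96 CPUs: 96 * 192 = 18432 pages ~= 72 MiB with 4 KiB pages

versus roughly nr_threads * 64 pages with the old per-thread caching
quoted in the changelog. That is why atomic mode is acceptable while
there is only one thread, and percpu mode is worth the allocation once
a second thread exists.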