From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <6a3b8095-8f49-47e0-a347-9e4a51806bf8@gmail.com>
Date: Sat, 20 Apr 2024 11:13:25 +0800
From: Rongwei Wang <rongwei.wrw@gmail.com>
Subject: Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode
To: "zhangpeng (AS)", linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, dennisszhou@gmail.com, shakeelb@google.com, jack@suse.cz, surenb@google.com, kent.overstreet@linux.dev, mhocko@suse.cz, vbabka@suse.cz, yuzhao@google.com, yu.ma@intel.com, wangkefeng.wang@huawei.com, sunnanyong@huawei.com
References: <20240418142008.2775308-1-zhangpeng362@huawei.com> <20240418142008.2775308-3-zhangpeng362@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
On 2024/4/19 11:32, zhangpeng (AS) wrote:
> On 2024/4/19 10:30, Rongwei Wang wrote:
>> On 2024/4/18 22:20, Peng Zhang wrote:
>>> From: ZhangPeng
>>>
>>> Since commit f1a7941243c1 ("mm: convert mm's rss stats into
>>> percpu_counter"), the rss_stats have been converted into percpu_counter,
>>> which changed the error margin from (nr_threads * 64) to approximately
>>> (nr_cpus ^ 2). However, the new percpu allocation in mm_init() causes a
>>> performance regression on fork/exec/shell. Even after commit 14ef95be6f55
>>> ("kernel/fork: group allocation/free of per-cpu counters for mm struct"),
>>> the performance of fork/exec/shell is still poor compared to previous
>>> kernel versions.
>>>
>>> To mitigate the performance regression, we delay the allocation of percpu
>>> memory for rss_stats. Therefore, we convert mm's rss stats to use
>>> percpu_counter atomic mode. For single-thread processes, rss_stat is in
>>> atomic mode, which reduces the memory consumption and performance
>>> regression caused by using percpu. For multiple-thread processes,
>>> rss_stat is switched to the percpu mode to reduce the error margin.
>>> We convert rss_stats from atomic mode to percpu mode only when the
>>> second thread is created.
>>
>> Hi, Zhang Peng
>>
>> We also found this regression in lmbench these days. I have not tested
>> your patch yet, but it seems it will solve a lot of it.
>> I see this patch does not fix the regression for multi-threaded
>> processes; is that because rss_stat is switched to percpu mode there?
>> (If I'm wrong, please correct me.) It also seems percpu_counter has a
>> bad effect in exit_mmap().
>>
>> If so, I'm wondering if we can further improve it on the exit_mmap()
>> path in the multi-threaded scenario, e.g. by determining which CPUs the
>> process has run on (mm_cpumask()? I'm not sure).
>>
> Hi, Rongwei,
>
> Yes, this patch only fixes the regression in single-thread processes. How
> much bad effect does percpu_counter have in exit_mmap()?
> IMHO, the addition of mm counters is already done in batch mode; maybe I
> missed something?

Actually, I'm not sure; I just found a small free-percpu hotspot in the
exit_mmap() path when comparing 4 cores vs. 32 cores. I can test more next.

>>>
>>> After the lmbench test, we can get a 2% ~ 4% performance improvement
>>> for lmbench fork_proc/exec_proc/shell_proc and a 6.7% performance
>>> improvement for lmbench page_fault (before batch mode[1]).
>>>
>>> The test results are as follows:
>>>
>>>               base           base+revert        base+this patch
>>>
>>> fork_proc    416.3ms        400.0ms  (3.9%)    398.6ms  (4.2%)
>>> exec_proc    2095.9ms       2061.1ms (1.7%)    2047.7ms (2.3%)
>>> shell_proc   3028.2ms       2954.7ms (2.4%)    2961.2ms (2.2%)
>>> page_fault   0.3603ms       0.3358ms (6.8%)    0.3361ms (6.7%)
>>
>> I think the regression will become more obvious with more cores. What
>> about your test machine?
>>
> Maybe multi-core is not a factor in the performance of the lmbench
> test here.
> Both of my test machines have 96 cores.
>
>> Thanks,
>> -wrw
>>>
>>> [1]
>>> https://lore.kernel.org/all/20240412064751.119015-1-wangkefeng.wang@huawei.com/
>>>
>>> Suggested-by: Jan Kara
>>> Signed-off-by: ZhangPeng
>>> Signed-off-by: Kefeng Wang
>>> ---
>>>   include/linux/mm.h          | 50 +++++++++++++++++++++++++++++++------
>>>   include/trace/events/kmem.h |  4 +--
>>>   kernel/fork.c               | 18 +++++++------
>>>   3 files changed, 56 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index d261e45bb29b..8f1bfbd54697 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -2631,30 +2631,66 @@ static inline bool get_user_page_fast_only(unsigned long addr,
>>>    */
>>>   static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>>>   {
>>> -    return percpu_counter_read_positive(&mm->rss_stat[member]);
>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>> +
>>> +    if (percpu_counter_initialized(fbc))
>>> +        return percpu_counter_read_positive(fbc);
>>> +
>>> +    return percpu_counter_atomic_read(fbc);
>>>   }
>>>
>>>   void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>>
>>>   static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>>>   {
>>> -    percpu_counter_add(&mm->rss_stat[member], value);
>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>> +
>>> +    if (percpu_counter_initialized(fbc))
>>> +        percpu_counter_add(fbc, value);
>>> +    else
>>> +        percpu_counter_atomic_add(fbc, value);
>>>
>>>       mm_trace_rss_stat(mm, member);
>>>   }
>>>
>>>   static inline void inc_mm_counter(struct mm_struct *mm, int member)
>>>   {
>>> -    percpu_counter_inc(&mm->rss_stat[member]);
>>> -
>>> -    mm_trace_rss_stat(mm, member);
>>> +    add_mm_counter(mm, member, 1);
>>>   }
>>>
>>>   static inline void dec_mm_counter(struct mm_struct *mm, int member)
>>>   {
>>> -    percpu_counter_dec(&mm->rss_stat[member]);
>>> +    add_mm_counter(mm, member, -1);
>>> +}
>>>
>>> -    mm_trace_rss_stat(mm, member);
>>> +static inline s64 mm_counter_sum(struct mm_struct *mm, int member)
>>> +{
>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>> +
>>> +    if (percpu_counter_initialized(fbc))
>>> +        return percpu_counter_sum(fbc);
>>> +
>>> +    return percpu_counter_atomic_read(fbc);
>>> +}
>>> +
>>> +static inline s64 mm_counter_sum_positive(struct mm_struct *mm, int member)
>>> +{
>>> +    struct percpu_counter *fbc = &mm->rss_stat[member];
>>> +
>>> +    if (percpu_counter_initialized(fbc))
>>> +        return percpu_counter_sum_positive(fbc);
>>> +
>>> +    return percpu_counter_atomic_read(fbc);
>>> +}
>>> +
>>> +static inline int mm_counter_switch_to_pcpu_many(struct mm_struct *mm)
>>> +{
>>> +    return percpu_counter_switch_to_pcpu_many(mm->rss_stat, NR_MM_COUNTERS);
>>> +}
>>> +
>>> +static inline void mm_counter_destroy_many(struct mm_struct *mm)
>>> +{
>>> +    percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS);
>>>   }
>>>
>>>   /* Optimized variant when folio is already known not to be anon */
>>> diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
>>> index 6e62cc64cd92..a4e40ae6a8c8 100644
>>> --- a/include/trace/events/kmem.h
>>> +++ b/include/trace/events/kmem.h
>>> @@ -399,8 +399,8 @@ TRACE_EVENT(rss_stat,
>>>           __entry->mm_id = mm_ptr_to_hash(mm);
>>>           __entry->curr = !!(current->mm == mm);
>>>           __entry->member = member;
>>> -        __entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member])
>>> -                                << PAGE_SHIFT);
>>> +        __entry->size = (mm_counter_sum_positive(mm, member)
>>> +                            << PAGE_SHIFT);
>>>       ),
>>>
>>>       TP_printk("mm_id=%u curr=%d type=%s size=%ldB",
>>> diff --git a/kernel/fork.c b/kernel/fork.c
>>> index 99076dbe27d8..0214273798c5 100644
>>> --- a/kernel/fork.c
>>> +++ b/kernel/fork.c
>>> @@ -823,7 +823,7 @@ static void check_mm(struct mm_struct *mm)
>>>                "Please make sure 'struct resident_page_types[]' is updated as well");
>>>
>>>       for (i = 0; i < NR_MM_COUNTERS; i++) {
>>> -        long x = percpu_counter_sum(&mm->rss_stat[i]);
>>> +        long x = mm_counter_sum(mm, i);
>>>
>>>           if (unlikely(x))
>>>               pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
>>> @@ -1301,16 +1301,10 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
>>>       if (mm_alloc_cid(mm))
>>>           goto fail_cid;
>>>
>>> -    if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT,
>>> -                     NR_MM_COUNTERS))
>>> -        goto fail_pcpu;
>>> -
>>>       mm->user_ns = get_user_ns(user_ns);
>>>       lru_gen_init_mm(mm);
>>>       return mm;
>>>
>>> -fail_pcpu:
>>> -    mm_destroy_cid(mm);
>>>   fail_cid:
>>>       destroy_context(mm);
>>>   fail_nocontext:
>>> @@ -1730,6 +1724,16 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
>>>       if (!oldmm)
>>>           return 0;
>>>
>>> +    /*
>>> +     * For single-thread processes, rss_stat is in atomic mode, which
>>> +     * reduces the memory consumption and performance regression caused by
>>> +     * using percpu. For multiple-thread processes, rss_stat is switched to
>>> +     * the percpu mode to reduce the error margin.
>>> +     */
>>> +    if (clone_flags & CLONE_THREAD)
>>> +        if (mm_counter_switch_to_pcpu_many(oldmm))
>>> +            return -ENOMEM;
>>> +
>>>       if (clone_flags & CLONE_VM) {
>>>           mmget(oldmm);
>>>           mm = oldmm;
>>
>>