From: Baolin Wang
To: akpm@linux-foundation.org, david@redhat.com, shakeel.butt@linux.dev
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz,
	rppt@kernel.org, surenb@google.com, mhocko@suse.com,
	donettom@linux.ibm.com, aboorvad@linux.ibm.com, sj@kernel.org,
	baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [PATCH v2] mm: fix the inaccurate memory statistics issue for users
Date: Thu, 5 Jun 2025 20:58:29 +0800
X-Mailer: git-send-email 2.43.5

On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the top command always displays 0 in the 'RES' field
for some processes, which causes a lot of confusion for users.

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
      1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd

The main reason is that the batch size of the percpu counter is quite large
on these machines, so a significant part of the value stays cached in the
percpu counters. This has been the case since mm's rss stats were converted
into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss stats into
percpu_counter"). Intuitively, the batch number should be optimized, but on
some paths, performance may take precedence over statistical accuracy.
Therefore, introduce a new interface that sums the percpu counts and shows
the result to users, which removes the confusion. In addition, this change
is not expected to be on a performance-critical path, so the modification
should be acceptable.

In addition, 'mm->rss_stat' is updated by add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch(). In percpu_counter_add_batch(), there is percpu
batch caching to avoid 'fbc->lock' contention. This patch changes task_mem()
and task_statm() to get the accurate mm counters under the 'fbc->lock', but
this should not exacerbate kernel 'mm->rss_stat' lock contention, because
the percpu batch caching of the mm counters is unchanged.

The following test also confirms the theoretical analysis. I ran stress-ng
stressing anon page faults in 32 threads on my 32-core machine, while
simultaneously running a script that starts 32 threads to busy-loop pread
each stress-ng thread's /proc/pid/status interface. From the following
data, I did not observe any obvious impact of this patch on the stress-ng
tests.

w/o patch:
stress-ng: info:  [6848]  4,399,219,085,152 CPU Cycles          67.327 B/sec
stress-ng: info:  [6848]  1,616,524,844,832 Instructions        24.740 B/sec (0.367 instr. per cycle)
stress-ng: info:  [6848]         39,529,792 Page Faults Total    0.605 M/sec
stress-ng: info:  [6848]         39,529,792 Page Faults Minor    0.605 M/sec

w/patch:
stress-ng: info:  [2485]  4,462,440,381,856 CPU Cycles          68.382 B/sec
stress-ng: info:  [2485]  1,615,101,503,296 Instructions        24.750 B/sec (0.362 instr. per cycle)
stress-ng: info:  [2485]         39,439,232 Page Faults Total    0.604 M/sec
stress-ng: info:  [2485]         39,439,232 Page Faults Minor    0.604 M/sec

Tested-by: Donet Tom
Reviewed-by: Aboorva Devarajan
Tested-by: Aboorva Devarajan
Acked-by: Shakeel Butt
Acked-by: SeongJae Park
Acked-by: Michal Hocko
Signed-off-by: Baolin Wang
---
Changes from v1:
 - Update the commit message to add some measurements.
 - Add acked tag from Michal. Thanks.
 - Drop the Fixes tag.

Changes from RFC:
 - Collect reviewed and tested tags. Thanks.
 - Add Fixes tag.
---
 fs/proc/task_mmu.c | 14 +++++++-------
 include/linux/mm.h |  5 +++++
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b9e4fbbdf6e6..f629e6526935 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	unsigned long text, lib, swap, anon, file, shmem;
 	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
 
-	anon = get_mm_counter(mm, MM_ANONPAGES);
-	file = get_mm_counter(mm, MM_FILEPAGES);
-	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
+	anon = get_mm_counter_sum(mm, MM_ANONPAGES);
+	file = get_mm_counter_sum(mm, MM_FILEPAGES);
+	shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
 
 	/*
 	 * Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	text = min(text, mm->exec_vm << PAGE_SHIFT);
 	lib = (mm->exec_vm << PAGE_SHIFT) - text;
 
-	swap = get_mm_counter(mm, MM_SWAPENTS);
+	swap = get_mm_counter_sum(mm, MM_SWAPENTS);
 	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
 	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
 	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
@@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
 			 unsigned long *data, unsigned long *resident)
 {
-	*shared = get_mm_counter(mm, MM_FILEPAGES) +
-			get_mm_counter(mm, MM_SHMEMPAGES);
+	*shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
+			get_mm_counter_sum(mm, MM_SHMEMPAGES);
 	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
 								>> PAGE_SHIFT;
 	*data = mm->data_vm + mm->stack_vm;
-	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+	*resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
 	return mm->total_vm;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 185424858f23..15ec5cfe9515 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
 	return percpu_counter_read_positive(&mm->rss_stat[member]);
 }
 
+static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
+{
+	return percpu_counter_sum_positive(&mm->rss_stat[member]);
+}
+
 void mm_trace_rss_stat(struct mm_struct *mm, int member);
 
 static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
-- 
2.43.5
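
For readers less familiar with percpu_counter batching, the effect described
in the commit message can be modeled in userspace. What follows is only a
minimal sketch and not kernel code: NR_CPUS, BATCH, struct model_counter and
the model_*() and demo_small_process() names are hypothetical and exist only
for this illustration, locking is left out, and the fixed batch of 32 is an
assumption (the kernel derives the real batch from the number of online
CPUs). It shows why a read of the shared count alone, as in
percpu_counter_read_positive(), can report 0 while a full sum, as in
percpu_counter_sum_positive(), reports the real value.

```c
#include <assert.h>

#define NR_CPUS 4	/* hypothetical CPU count for this model */
#define BATCH   32	/* hypothetical batch size, assumed fixed here */

/* Userspace stand-in for struct percpu_counter (illustration only). */
struct model_counter {
	long count;		/* shared value; the kernel guards it with fbc->lock */
	long percpu[NR_CPUS];	/* per-CPU deltas not yet folded into count */
};

/*
 * Models percpu_counter_add_batch(): the local delta is folded into the
 * shared count only once its magnitude reaches the batch size.
 */
static void model_add(struct model_counter *c, int cpu, long amount)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	long delta = c->percpu[cpu] + amount;

	if (delta >= BATCH || delta <= -BATCH) {
		c->count += delta;
		c->percpu[cpu] = 0;
	} else {
		c->percpu[cpu] = delta;
	}
}

/* Models percpu_counter_read_positive(): looks at the shared count only. */
static long model_read_positive(const struct model_counter *c)
{
	return c->count > 0 ? c->count : 0;
}

/* Models percpu_counter_sum_positive(): also adds every per-CPU delta. */
static long model_sum_positive(const struct model_counter *c)
{
	long sum = c->count;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->percpu[cpu];
	return sum > 0 ? sum : 0;
}

/* A small process that faults in 10 pages on each CPU, below the batch. */
static struct model_counter demo_small_process(void)
{
	struct model_counter rss = { 0 };

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		model_add(&rss, cpu, 10);
	return rss;
}
```

With the counter returned by demo_small_process(), model_read_positive()
returns 0 and model_sum_positive() returns 40, which mirrors the RES value
of 0 that top displays for small processes. The sum walks every CPU, so it
costs more than the plain read; that is acceptable for /proc readers but is
the reason the fast paths keep using the batched read.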