From: Baolin Wang
Date: Sat, 24 May 2025 09:24:39 +0800
Subject: Re: [RFC PATCH] mm: fix the inaccurate memory statistics issue for users
To: Aboorva Devarajan, akpm@linux-foundation.org, david@redhat.com, shakeel.butt@linux.dev
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, mhocko@suse.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Message-ID: <6d6dcad5-169f-4bfc-91be-c620fef811e4@linux.alibaba.com>
References: <3dd21f662925c108cfe706c8954e8c201a327550.1747969935.git.baolin.wang@linux.alibaba.com>

On 2025/5/23 18:14, Aboorva Devarajan wrote:
> On Fri, 2025-05-23 at 11:16 +0800, Baolin Wang wrote:
>> On some large machines with a high number of CPUs running a 64K page size
>> kernel, we found that the 'RES' field displayed by the top command is
>> always 0 for some processes, which causes a lot of confusion for users.
>>
>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>  875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>>       1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>>
>> The main reason is that, since commit f1a7941243c1 ("mm: convert mm's rss
>> stats into percpu_counter") converted mm's rss stats into percpu_counter,
>> the batch size of the percpu counter is quite large on these machines, so
>> a significant part of the count stays cached in the percpu values.
>> Intuitively, the batch number should be optimized, but on some paths
>> performance may take precedence over statistical accuracy. Therefore,
>> introduce a new interface that adds up the percpu statistical counts and
>> display the result to users, which removes the confusion. In addition,
>> this change is not expected to be on a performance-critical path, so the
>> modification should be acceptable.
>>
>> Signed-off-by: Baolin Wang
>> ---
>>  fs/proc/task_mmu.c | 14 +++++++-------
>>  include/linux/mm.h |  5 +++++
>>  2 files changed, 12 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>> index b9e4fbbdf6e6..f629e6526935 100644
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -36,9 +36,9 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>>  	unsigned long text, lib, swap, anon, file, shmem;
>>  	unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
>>
>> -	anon = get_mm_counter(mm, MM_ANONPAGES);
>> -	file = get_mm_counter(mm, MM_FILEPAGES);
>> -	shmem = get_mm_counter(mm, MM_SHMEMPAGES);
>> +	anon = get_mm_counter_sum(mm, MM_ANONPAGES);
>> +	file = get_mm_counter_sum(mm, MM_FILEPAGES);
>> +	shmem = get_mm_counter_sum(mm, MM_SHMEMPAGES);
>>
>>  	/*
>>  	 * Note: to minimize their overhead, mm maintains hiwater_vm and
>> @@ -59,7 +59,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
>>  	text = min(text, mm->exec_vm << PAGE_SHIFT);
>>  	lib = (mm->exec_vm << PAGE_SHIFT) - text;
>>
>> -	swap = get_mm_counter(mm, MM_SWAPENTS);
>> +	swap = get_mm_counter_sum(mm, MM_SWAPENTS);
>>  	SEQ_PUT_DEC("VmPeak:\t", hiwater_vm);
>>  	SEQ_PUT_DEC(" kB\nVmSize:\t", total_vm);
>>  	SEQ_PUT_DEC(" kB\nVmLck:\t", mm->locked_vm);
>> @@ -92,12 +92,12 @@ unsigned long task_statm(struct mm_struct *mm,
>>  			 unsigned long *shared, unsigned long *text,
>>  			 unsigned long *data, unsigned long *resident)
>>  {
>> -	*shared = get_mm_counter(mm, MM_FILEPAGES) +
>> -			get_mm_counter(mm, MM_SHMEMPAGES);
>> +	*shared = get_mm_counter_sum(mm, MM_FILEPAGES) +
>> +			get_mm_counter_sum(mm, MM_SHMEMPAGES);
>>  	*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>>  								>> PAGE_SHIFT;
>>  	*data = mm->data_vm + mm->stack_vm;
>> -	*resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
>> +	*resident = *shared + get_mm_counter_sum(mm, MM_ANONPAGES);
>>  	return mm->total_vm;
>>  }
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 185424858f23..15ec5cfe9515 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2568,6 +2568,11 @@ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>>  	return percpu_counter_read_positive(&mm->rss_stat[member]);
>>  }
>>
>> +static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member)
>> +{
>> +	return percpu_counter_sum_positive(&mm->rss_stat[member]);
>> +}
>> +
>>  void mm_trace_rss_stat(struct mm_struct *mm, int member);
>>
>>  static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
>
> Hi Baolin,
>
> This patch looks good to me. We observed a similar issue where the
> generic mm selftest split_huge_page_test failed due to outdated RssAnon
> values reported in /proc/[pid]/status.
>
> ...
>
> Without Patch:
>
> # ./split_huge_page_test
> TAP version 13
> 1..34
> Bail out! No RssAnon is allocated before split
> # Planned tests != run tests (34 != 0)
> # Totals: pass:0 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> ...
>
> With Patch:
>
> # ./split_huge_page_test
> # ./split_huge_page_test
> TAP version 13
> 1..34
> ...
> # Totals: pass:11 fail:0 xfail:0 xpass:0 skip:23 error:0
>
> ...
>
> While this change may introduce some lock contention, it only affects
> the task_mem function which is invoked only when reading
> /proc/[pid]/status. Since this is not on a performance critical path,
> it will be good to have this change in order to get accurate memory
> stats.

Agree.

> This fix resolves the issue we've seen with split_huge_page_test.
>
> Thanks!
>
>
> Reviewed-by: Aboorva Devarajan
> Tested-by: Aboorva Devarajan

Thanks for reviewing and testing.
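
For readers following the thread, the effect described in the commit message can be illustrated with a small userspace sketch. This is not the kernel's percpu_counter implementation: the struct, the sim_* helper names, NR_CPUS and BATCH are all hypothetical, and locking and preemption handling are left out. It only models the idea that each CPU folds its local delta into the shared count once the delta reaches the batch size, so a cheap read of the shared count can report 0 while an exact sum does not.

```c
#include <assert.h>

#define NR_CPUS 4	/* hypothetical CPU count for this simulation */
#define BATCH   32	/* hypothetical batch size, not the kernel's value */

/* Simplified model of a batched per-CPU counter (no locking). */
struct sim_counter {
	long count;		/* shared, approximate value */
	long pcpu[NR_CPUS];	/* per-CPU deltas not yet folded in */
};

/* Add on one CPU; fold into the shared count only once the delta reaches BATCH. */
static void sim_add(struct sim_counter *c, int cpu, long amount)
{
	long v;

	assert(cpu >= 0 && cpu < NR_CPUS);
	v = c->pcpu[cpu] + amount;
	if (v >= BATCH || v <= -BATCH) {
		c->count += v;
		c->pcpu[cpu] = 0;
	} else {
		c->pcpu[cpu] = v;
	}
}

/* Cheap read: shared value only, clamped at 0. */
static long sim_read_positive(const struct sim_counter *c)
{
	return c->count > 0 ? c->count : 0;
}

/* Exact read: shared value plus every per-CPU delta, clamped at 0. */
static long sim_sum_positive(const struct sim_counter *c)
{
	long sum = c->count;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	return sum > 0 ? sum : 0;
}

/* Each CPU accounts 20 pages, which stays below BATCH on every CPU. */
static void sim_demo(long *approx, long *exact)
{
	struct sim_counter rss = { 0 };
	int cpu, i;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		for (i = 0; i < 20; i++)
			sim_add(&rss, cpu, 1);

	*approx = sim_read_positive(&rss);
	*exact = sim_sum_positive(&rss);
}
```

With the constants above, sim_demo() yields an approximate read of 0 and an exact sum of 80, which is the same shape as the RES = 0 symptom in the top output. The exact sum has to visit every CPU's slot, which is why it suits a path such as reading /proc/[pid]/status rather than a hot path.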