From: Baolin Wang <baolin.wang@linux.alibaba.com>
Date: Tue, 3 Jun 2025 22:22:46 +0800
Subject: Re: [PATCH] mm: fix the inaccurate memory statistics issue for users
To: Michal Hocko
Cc: Andrew Morton, david@redhat.com, shakeel.butt@linux.dev, lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz, rppt@kernel.org, surenb@google.com, donettom@linux.ibm.com, aboorvad@linux.ibm.com, sj@kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Message-ID: <7307bb7a-7c45-43f7-b073-acd9e1389000@linux.alibaba.com>
References: <4f0fd51eb4f48c1a34226456b7a8b4ebff11bf72.1748051851.git.baolin.wang@linux.alibaba.com> <20250529205313.a1285b431bbec2c54d80266d@linux-foundation.org> <72f0dc8c-def3-447c-b54e-c390705f8c26@linux.alibaba.com>

On 2025/6/3 18:28, Michal Hocko wrote:
> On Tue 03-06-25 16:32:35, Baolin Wang wrote:
>>
>>
>> On 2025/6/3 16:15, Michal Hocko wrote:
>>> On Tue 03-06-25 16:08:21, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/5/30 21:39, Michal Hocko wrote:
>>>>> On Thu 29-05-25 20:53:13, Andrew Morton wrote:
>>>>>> On Sat, 24 May 2025 09:59:53 +0800 Baolin Wang wrote:
>>>>>>
>>>>>>> On some large machines with a high number of CPUs running a 64K pagesize
>>>>>>> kernel, we found that the 'RES' field is always 0 displayed by the top
>>>>>>> command for some processes, which will cause a lot of confusion for users.
>>>>>>>
>>>>>>>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
>>>>>>>  875525 root      20   0   12480      0      0 R   0.3   0.0   0:00.08 top
>>>>>>>       1 root      20   0  172800      0      0 S   0.0   0.0   0:04.52 systemd
>>>>>>>
>>>>>>> The main reason is that the batch size of the percpu counter is quite large
>>>>>>> on these machines, caching a significant percpu value, since converting mm's
>>>>>>> rss stats into percpu_counter by commit f1a7941243c1 ("mm: convert mm's rss
>>>>>>> stats into percpu_counter"). Intuitively, the batch number should be optimized,
>>>>>>> but on some paths, performance may take precedence over statistical accuracy.
>>>>>>> Therefore, introducing a new interface to add the percpu statistical count
>>>>>>> and display it to users, which can remove the confusion. In addition, this
>>>>>>> change is not expected to be on a performance-critical path, so the modification
>>>>>>> should be acceptable.
>>>>>>>
>>>>>>> Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter")
>>>>>>
>>>>>> Three years ago.
>>>>>>
>>>>>>> Tested-by Donet Tom
>>>>>>> Reviewed-by: Aboorva Devarajan
>>>>>>> Tested-by: Aboorva Devarajan
>>>>>>> Acked-by: Shakeel Butt
>>>>>>> Acked-by: SeongJae Park
>>>>>>> Signed-off-by: Baolin Wang
>>>>>>
>>>>>> Thanks, I added cc:stable to this.
>>>>>
>>>>> I have only noticed this new posting now. I do not think this is a
>>>>> stable material. I am also not convinced that the impact of the pcp lock
>>>>> exposure to the userspace has been properly analyzed and documented in
>>>>> the changelog. I am not nacking the patch (yet) but I would like to see
>>>>> a serious analyses that this has been properly thought through.
>>>>
>>>> Good point. I did a quick measurement on my 32 cores Arm machine. I ran two
>>>> workloads, one is the 'top' command: top -d 1 (updating every second).
>>>> Another workload is kernel building (time make -j32).
>>>>
>>>> From the following data, I did not see any significant impact of the patch
>>>> changes on the execution of the kernel building workload.
>>>
>>> I do not think this is really representative of an adverse workload. I
>>> believe you need to have a look which potentially sensitive kernel code
>>> paths run with the lock held how would a busy loop over affected proc
>>> files influence those in the worst case. Maybe there are none of such
>>> kernel code paths to really worry about. This should be a part of the
>>> changelog though.
>>
>> IMO, kernel code paths usually have batch caching to avoid lock contention,
>> so I think the impact on kernel code paths is not that obvious.
>
> This is a very generic statement. Does this refer to the existing pcp
> locking usage in the kernel? Have you evaluated existing users?

Let me try to clarify further.

'mm->rss_stat' is updated via add_mm_counter() and dec/inc_mm_counter(),
which are all wrappers around percpu_counter_add_batch().
percpu_counter_add_batch() caches updates per CPU in batches to avoid
'fbc->lock' contention. This patch changes task_mem() and task_statm() to
read the accurate mm counters under 'fbc->lock', but this will not
exacerbate 'mm->rss_stat' lock contention in the kernel, because updates
to the mm counters are still batched per CPU.
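
To make the batching argument concrete, here is a minimal user-space
sketch of a batched per-CPU counter. It is only a model of the behaviour
described above, not the kernel implementation: every name in it
(model_counter, model_add(), model_read(), model_sum(), MODEL_NR_CPUS,
MODEL_BATCH) is hypothetical, the constants are assumed values, and a
simple tally of lock acquisitions stands in for 'fbc->lock'.

```c
/* Simplified user-space model of a batched per-CPU counter.
 * NOT kernel code: all names are hypothetical, and a plain tally of
 * lock acquisitions stands in for the real spinlock. */
#include <assert.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 8   /* assumed CPU count */
#define MODEL_BATCH   32  /* assumed batch size */

struct model_counter {
	long count;                      /* shared value, guarded by the lock */
	long pcpu[MODEL_NR_CPUS];        /* per-CPU deltas not yet folded in */
	unsigned long lock_acquisitions; /* how often the lock was needed */
};

/* Update path: touch the shared count only once the local delta
 * reaches the batch size; otherwise stay lock-free. */
void model_add(struct model_counter *c, int cpu, long amount)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);

	long delta = c->pcpu[cpu] + amount;

	if (labs(delta) >= MODEL_BATCH) {
		c->lock_acquisitions++;  /* lock would be taken here */
		c->count += delta;
		c->pcpu[cpu] = 0;        /* lock would be released here */
	} else {
		c->pcpu[cpu] = delta;
	}
}

/* Cheap read: ignores whatever is still cached per CPU. */
long model_read(const struct model_counter *c)
{
	return c->count;
}

/* Accurate read: takes the lock once and adds every per-CPU delta. */
long model_sum(struct model_counter *c)
{
	long sum;

	c->lock_acquisitions++;          /* lock would be taken here */
	sum = c->count;
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		sum += c->pcpu[cpu];
	return sum;                      /* lock released before returning */
}
```

In this model, if each of the 8 CPUs adds 31 pages, model_read() still
returns 0 while model_sum() returns 248, which mirrors how 'RES' can be
displayed as 0. The update path takes the lock only when a per-CPU delta
reaches MODEL_BATCH, so an accurate reader competes with updaters only on
those infrequent folds.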

You might argue that my test cases cannot demonstrate actual lock
contention, but they have already shown that there is no significant
'fbc->lock' contention when the kernel updates 'mm->rss_stat'.

>> Therefore, I
>> also think it's hard to find an adverse workload.
>>
>> How about adding the following comments in the commit log?
>> "
>> I did a quick measurement on my 32 cores Arm machine. I ran two workloads,
>> one is the 'top' command: top -d 1 (updating every second). Another workload
>> is kernel building (time make -j32).
>
> This test doesn't really do much to trigger an actual lock contention as
> already mentioned.
>