Date: Tue, 10 Jun 2025 23:41:32 -0400
From: Kent Overstreet
To: Andrew Morton
Cc: Casey Chen, surenb@google.com, corbet@lwn.net, dennis@kernel.org, tj@kernel.org, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com, rientjes@google.com, roman.gushchin@linux.dev, harry.yoo@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, yzhong@purestorage.com
Subject: Re: [PATCH] alloc_tag: add per-NUMA node stats
References: <20250610233053.973796-1-cachen@purestorage.com> <20250610182155.36090c78124e1f60f2959d8e@linux-foundation.org>
In-Reply-To: <20250610182155.36090c78124e1f60f2959d8e@linux-foundation.org>
On Tue, Jun 10, 2025 at 06:21:55PM -0700, Andrew Morton wrote:
> On Tue, 10 Jun 2025 17:30:53 -0600 Casey Chen wrote:
>
> > Add support for tracking per-NUMA node statistics in /proc/allocinfo.
> > Previously, each alloc_tag had a single set of counters (bytes and
> > calls), aggregated across all CPUs. With this change, each CPU can
> > maintain separate counters for each NUMA node, allowing finer-grained
> > memory allocation profiling.
> >
> > This feature is controlled by the new
> > CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS option:
> >
> > * When enabled (=y), the output includes per-node statistics following
> >   the total bytes/calls:
> >
> >   ...
> >   315456     9858 mm/dmapool.c:338 func:pool_alloc_page
> >        nid0      94912     2966
> >        nid1     220544     6892
> >   7680         60 mm/dmapool.c:254 func:dma_pool_create
> >        nid0       4224       33
> >        nid1       3456       27
> >
> > * When disabled (=n), the output remains unchanged:
> >
> >   ...
> >   315456     9858 mm/dmapool.c:338 func:pool_alloc_page
> >   7680         60 mm/dmapool.c:254 func:dma_pool_create
> >
> > To minimize memory overhead, per-NUMA stats counters are dynamically
> > allocated using the percpu allocator. PERCPU_DYNAMIC_RESERVE has been
> > increased to ensure sufficient space for in-kernel alloc_tag counters.
> >
> > For in-kernel alloc_tag instances, pcpu_alloc_noprof() is used to
> > allocate counters. These allocations are excluded from the profiling
> > statistics themselves.
>
> What is glaringly missing here is "why".
>
> What is the use case? Why does Linux want this? What benefit does
> this bring to our users? This is the most important part of the
> changelog because it tells Andrew why he is even looking at this patch.

In the last thread, I was pushing to get some input from other people
working on userspace profiling - it seems like that's where we're going
with this, so we need to make sure people working in the same area are
talking to each other, and if we're building tools to export data we
want to make sure it's relevant and useful.

Steven Rostedt sent out some pings, but I don't think we've heard back
from the relevant people, and I do want that to happen before (if) this
goes in.

And more basic than this, now that we've got the ability to map kernel
address -> code that owns it, we should be seeing if that is relevant
and interesting to the tooling people before we get fancier.

Casey also mentioned that this was being used for some internal stuff
at Pure, and I'd still like to hear more about that. If they already
know interesting and cool stuff we can use this for - great, let's
please hear what it is.

But we definitely need some wider context before merging this.

> Probably related to the above omission: why per-nid? It would be more
> flexible to present the per-cpu counts and let userspace aggregate that
> into per-node info if that is desirable.

Per-nid makes more sense to me if this is for profiling, and since this
is a text interface meant primarily for human consumption we don't want
to bloat it excessively.

We can always add multiple interfaces for viewing the same data, and
there's nothing preventing us from adding that later if it turns out
something else does want per-cpu counts.

Per-cpu counts would, I believe, bloat the data structures quite a bit
since the counters are internally percpu - would we then end up with
nr_cpus^2 counters?
I hope not :)

> > +static inline struct alloc_tag_counters alloc_tag_read_nid(struct alloc_tag *tag, int nid)
> > +{
> > +	struct alloc_tag_counters v = { 0, 0 };
> > +	struct alloc_tag_counters *counters;
> > +	int cpu;
> > +
> > +	for_each_possible_cpu(cpu) {
>
> for_each_possible_cpu() is lame - potentially much more expensive than
> for_each_online_cpu. Is it impractical to use for_each_online_cpu()?
>
> Probably doesn't matter for a userspace displaying function but
> userspace can do weird and unexpected things.

for_each_online_cpu() means having to think hard about cpu hotplug and
possibly adding callbacks, no? I think simple and dumb is fine here.

> > +	for (i = 0; i < kernel_tags.count; i++) {
> > +		/* Each CPU has one alloc_tag_counters per numa node */
> > +		kernel_tags.first_tag[i].counters =
> > +			pcpu_alloc_noprof(pcpu_counters_size,
> > +					  sizeof(struct alloc_tag_counters),
> > +					  false, GFP_KERNEL | __GFP_ZERO);
> > +		if (!kernel_tags.first_tag[i].counters) {
> > +			while (--i >= 0)
> > +				free_percpu(kernel_tags.first_tag[i].counters);
> > +			pr_info("Failed to allocate per-cpu alloc_tag counters\n");
>
> pr_err(), methinks.
>
> > +			return;
>
> And now what happens. Will the kernel even work?
>
> This code path is untestable unless the developer jumps through hoops
> and it will never be tested again.
>
> We assume that __init-time allocations always succeed, so a hearty
> panic() here would be OK.

percpu allocations are more limited than normal allocations, and we did
hit that when developing memory allocation profiling, so - maybe WARN()?

> > +	int nid;
>
> C99 definition.
>
> > +	for (nid = 0; nid < pcpu_counters_num; nid++) {
>
> If we're going to use C99 (is OK now) then it's better to go all the
> way and give `i' loop scope. "for (int i..".

Yes please; variable reuse can lead to some nasties.

Going full C99 has its own problems (you can't do it after labels, and
only one of gcc or clang complains and I forget which - perpetual
headache).
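To make the scoping point concrete, a minimal userspace illustration (hypothetical code, not from the patch): each `for (int i ...)` gets its own `i`, confined to that loop, so there's no stale counter to accidentally reuse between loops.

```c
#include <assert.h>

/* Sum 0..n-1 twice, each loop with its own independently scoped counter. */
static int sum_twice(int n)
{
	int total = 0;

	for (int i = 0; i < n; i++)	/* this `i` exists only in this loop */
		total += i;
	for (int i = 0; i < n; i++)	/* a fresh `i`, not a reuse of the first */
		total += i;
	return total;
}
```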
But C99 for loops are a total no-brainer.
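For completeness, a userspace sketch of the read-side aggregation being reviewed above - summing one node's counters across every CPU. The names and sizes here are hypothetical stand-ins (plain arrays instead of percpu allocations, `read_nid()` instead of alloc_tag_read_nid()); it only models the shape of the loop, not the kernel's actual data structures.

```c
#include <assert.h>

#define NR_CPUS		4	/* stand-in for the possible-CPU count */
#define NR_NODES	2	/* stand-in for the NUMA node count */

struct counters {
	unsigned long bytes;
	unsigned long calls;
};

/* counters[cpu][nid]: models a percpu array with one entry per node */
static struct counters counters[NR_CPUS][NR_NODES];

/* Sum one node's stats across all CPUs, as the report loop would. */
static struct counters read_nid(int nid)
{
	struct counters v = { 0, 0 };

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		v.bytes += counters[cpu][nid].bytes;
		v.calls += counters[cpu][nid].calls;
	}
	return v;
}
```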