From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 293F1C677C4 for ; Wed, 11 Jun 2025 01:34:14 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id A33816B008C; Tue, 10 Jun 2025 21:34:13 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 9E5756B0093; Tue, 10 Jun 2025 21:34:13 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8D36D6B0095; Tue, 10 Jun 2025 21:34:13 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 6D87E6B008C for ; Tue, 10 Jun 2025 21:34:13 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id DE76616161B for ; Wed, 11 Jun 2025 01:34:12 +0000 (UTC) X-FDA: 83541399144.17.689E998 Received: from mail-pj1-f51.google.com (mail-pj1-f51.google.com [209.85.216.51]) by imf19.hostedemail.com (Postfix) with ESMTP id CC2421A000D for ; Wed, 11 Jun 2025 01:34:10 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=purestorage.com header.s=google2022 header.b="Q/Bl4CcV"; dmarc=pass (policy=reject) header.from=purestorage.com; spf=pass (imf19.hostedemail.com: domain of cachen@purestorage.com designates 209.85.216.51 as permitted sender) smtp.mailfrom=cachen@purestorage.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1749605651; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=O+RwS1PnfLMbD0PcfPF/1nwrqQ/N+l3fAFRI/JL96GI=; b=ypoXBzybW3yXC03LNpT1tepWZ/zX26jVVwpt5sqQejdAmIczPRrgwitBVoSn2kOampXjfx XvY32fEH2oN0D5kO3MBNCSzEW181pHiPgzMBdYNBXdKSH96PdejM4of6e4vk1W4QIr9SDY rzSD3Wx3gDYmycvjAy1QI0Iu81UD5O0= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=purestorage.com header.s=google2022 header.b="Q/Bl4CcV"; dmarc=pass (policy=reject) header.from=purestorage.com; spf=pass (imf19.hostedemail.com: domain of cachen@purestorage.com designates 209.85.216.51 as permitted sender) smtp.mailfrom=cachen@purestorage.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1749605651; a=rsa-sha256; cv=none; b=TZQgCGajEZ7KInSzsCFZEzqRmVDP+je4FMryiKdyRnQaNMJQ12K0ug0JLAh246x2qoXct5 pVnLDQUYs/hqDuVqDKV99b5k/VWRtEvPBkELp+sVQZ3ffk2viKSkRsG0/PjW6VFMTKI3j9 acmsyze5H9Va43XfY/qxFFxewccMNQ8= Received: by mail-pj1-f51.google.com with SMTP id 98e67ed59e1d1-313862d48e7so10452a91.1 for ; Tue, 10 Jun 2025 18:34:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=purestorage.com; s=google2022; t=1749605649; x=1750210449; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=O+RwS1PnfLMbD0PcfPF/1nwrqQ/N+l3fAFRI/JL96GI=; b=Q/Bl4CcVWoYdnRTr0KrT8YUF3FrIntbxIGpWudnhcKDgYiZ1AOgRMAsP00Twfij9uH vyUiWZLmek1z5rvqoj/IcWn/juqeZP3MCSzaxwCbMeUza6uxdp20dHpx3YHndLBqjIvj cSyFC5FZr3lVmQRRJ2OqYm8Kr9yfOdtYnr9vFw9PE5yZ2AzcCnDP/G+75r9G22ILeOzR p/Qx/b+o8DZBy/dkpdvhw8kq/i44+ckus2Xdwo8W7h+p2cLz0P9PGRbPclOKz+29LVwE 2ukgwx/r9GRtLP7nW+SlV4Q5EuTDaDO6tu5XYHDvwjeuEt4hWkruy0yuQXjozVltbt91 ePUw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1749605649; x=1750210449; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=O+RwS1PnfLMbD0PcfPF/1nwrqQ/N+l3fAFRI/JL96GI=; b=Ql69oFqgI+A9hGvYbI2vc4/hV7B3EszGHqY91ACatp7lPm0EbhHWJ4Pu2F33YkE72b mPkLxfFOODMcBoVbEyQwkmMo1D28ejDGw6+W3WzixDAtYn31G2eF0cg7NjuIA0KsQzky eS1VNZoRvsxrCERSTceYDJAKyEvaYpWkygLx+GDO+qsOPdSxofnNti5VufN1nP99vk8N ovGPEJjpJ1IdfkOQeeEw6GUx1URYpqvvSam/2MVnlcqUZw6P/lIPTRnSQjpqnzyO6irZ SsQiiT7sVyEKqBRsH/NA0bKI4dtjmkvuGPBLmIE3XSOCCKSzdecJYVb0sZUdnRngWQI9 KBHw== X-Forwarded-Encrypted: i=1; AJvYcCWWJpMU3+qGqeV3hYscgGkwpSxbqdyx6cr58NWhDeKOQrW2jv101ZAytUjMRL+j1QCL32WlN+n3zw==@kvack.org X-Gm-Message-State: AOJu0Yzdd02EAG/w0yM6XS6tWGjh0x9ENXgZbdyReg2R6y3vIKZ5z8d9 NbHNU5CIZNROx7eNF6/Tg5p7zYx47H+nsUPR9g4+3j3wqemu9UncTiQKJ2a+77hCCmUSKvJSUux EI8cPEPLseANOdxXbgs66+qsSnYbzaxG/iLNxRzLHGQ== X-Gm-Gg: ASbGncvDcb6Y5FWEQXA14vrD+kdm5rjY8TmnfsUIYMZfcfrQAm7cxYCCJWKfgDgu76S Znl3nA+ZJj6aoFC2O/qSMul3/6vQNuA4Y9UXub/mszbwqGLUwZA18NHRM6ytcLA8ju9AzjNiWed DH3CUSTgU8tLZyY3cWJpJnlBa6L449We8jxrVJxUKZIUps X-Google-Smtp-Source: AGHT+IEzl3sJB2wN19xR6W5vHIJ9eBZ803hJ4/GH3M5wIItPR+l3lII0iZcLuPI7+4eKztFXy7/pTtc1b3K9AFS6fSI= X-Received: by 2002:a17:90b:3ec5:b0:311:c939:c855 with SMTP id 98e67ed59e1d1-313af1a5f6dmr762729a91.3.1749605649344; Tue, 10 Jun 2025 18:34:09 -0700 (PDT) MIME-Version: 1.0 References: <20250610233053.973796-1-cachen@purestorage.com> <20250610182155.36090c78124e1f60f2959d8e@linux-foundation.org> In-Reply-To: <20250610182155.36090c78124e1f60f2959d8e@linux-foundation.org> From: Casey Chen Date: Tue, 10 Jun 2025 18:33:58 -0700 X-Gm-Features: AX0GCFsyzJdMCG0em6cVNKgcAjnFwkzRC995PaClcyP6g5VbMYXECJd6EZjBTAI Message-ID: Subject: Re: [PATCH] alloc_tag: add per-NUMA node stats To: Andrew Morton Cc: surenb@google.com, kent.overstreet@linux.dev, corbet@lwn.net, dennis@kernel.org, tj@kernel.org, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com, rientjes@google.com, roman.gushchin@linux.dev, harry.yoo@oracle.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, yzhong@purestorage.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Server: rspam08 X-Rspamd-Queue-Id: CC2421A000D X-Stat-Signature: hi88r5q3xkjwxzye9ycdrzsrbryrgz44 X-Rspam-User: X-HE-Tag: 1749605650-853906 X-HE-Meta: U2FsdGVkX19YNNCUfE3gp2I9Ejdx20nQGbuE8qNYcPt8F2y9Pcart7/yLU7YMo9rPiSNEPVn6klO6hkEYR/tC/4elMgRr1UryNgV8GZzUsf0/iSf24oc4VK5bC0TtFhpXT+x9UW2JVfKVo0UZg+VQA8sDklhuOcKnYi1Hr8Lmi6JR9xiFqzogmBNtKOuFx+G81b5iqdN7x3hywC51J0KZkzJoAbIi7+/I8shWRs6UOIFumntx1PSY1pi/VzWxQQGMTnyV9XXUWRSJHOPy3VCseGcocHxrdcX16NGTGxIiTtwhxacd6M/AK9lv69GP0fyuQbE5MOacBUZhSEqGAuV4f3iFClGENijLv8h2Ruppn3ldTx1lALIYJydDAK/Zpb2bs5sRGiCqNoz8tt/D6kgwbhazcKXOSnc1dCLd1ZKhjfQkEuEt1zlN8nfLsZh9lairduzPaLAADAJyNmZLLeUTJkpuUzDP0N3ty5swvBIaSG2Z1uM9f2swqaX/OI5YMNu1VESwcIdpgkjupyuu4VjX56u2LOhiOVMBKBzekzxIm8Vh0K5AyLnUHE1Fjf7OG+TgRx8kBQibjc8p/+e0WajNFWpztDXrxuhAEOPUPgswNgXfBu0wbHFL7cf+k9aWQ0379+osVhgh2BPboqpBCBYCo8mTML5rL/y+uwMh2lIx0Zfm5PDznwAE9zzve3ZKxBFjiQef4Xc0IU9muZWnndPrdmiJh5L1WFEqztWd0I8UoKkljbGTcOYS8gF+hqRy8cP+sMaXyt6D7ajS5fmg8BRq6RISJHz4Wfdi2MS3jhvDGfAJ3gtcD40KttnOmRgpALcl7NSJdhhC9FqD/bOOreN+neL27kNkKLawJ6SdJ5eoejRGgZWbAQll+UWFlU4znZOY3rfiHGzDP4HWGhgXk3KsV6pMGvoQGCUpYVNPAeHx8AZ7W+2qTJfyXBB92WEton99YP3vF6xV2COMFEzSqy A03xCQeN Lu+A/698i8ZebzkgavkyAYa+0GtcdfFgHxbIIv6izYTNhf2eniqGtxsXbSE4dNtqkn21c1olSX/hgCvRzcEdEMfjoJRmvX0iA/viGt7VhZFGcz63gSN3hsTJE3iVJzZRT4S4zOweZ02LetdzkpPFCZxqwRdHGTmZRLdmfzvqLcs/mxE+cplpF+w4ynDNG2C8mey12LJCxTTECTcGhiutkX++MmRkfuibnJ7SGFAHZQSZwQt0R6R+c221owHOD1BtAAdssVQ3vo8k9tN88/q1QOnOwrXDqSpqxHTyN4ZnLqYA3swhnsQVmskCq5h2SC5X5dxVJebRasqMTYGhHFszOjmXjQa+Ce/iBFqZIL0UpeJNRv4lCM6eGHTcRAOqNhQo9+ItF1H7fnIwee+Or+egCLeoTamqhBD+vw2buEmRxsFOBWE0= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Tue, Jun 10, 2025 at 6:21=E2=80=AFPM Andrew Morton wrote: > > On Tue, 10 Jun 2025 17:30:53 -0600 Casey Chen wr= ote: > > > Add support for tracking per-NUMA node statistics in /proc/allocinfo. > > Previously, each alloc_tag had a single set of counters (bytes and > > calls), aggregated across all CPUs. With this change, each CPU can > > maintain separate counters for each NUMA node, allowing finer-grained > > memory allocation profiling. > > > > This feature is controlled by the new > > CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS option: > > > > * When enabled (=3Dy), the output includes per-node statistics followin= g > > the total bytes/calls: > > > > > > ... > > 315456 9858 mm/dmapool.c:338 func:pool_alloc_page > > nid0 94912 2966 > > nid1 220544 6892 > > 7680 60 mm/dmapool.c:254 func:dma_pool_create > > nid0 4224 33 > > nid1 3456 27 > > > > * When disabled (=3Dn), the output remains unchanged: > > > > ... > > 315456 9858 mm/dmapool.c:338 func:pool_alloc_page > > 7680 60 mm/dmapool.c:254 func:dma_pool_create > > > > To minimize memory overhead, per-NUMA stats counters are dynamically > > allocated using the percpu allocator. PERCPU_DYNAMIC_RESERVE has been > > increased to ensure sufficient space for in-kernel alloc_tag counters. > > > > For in-kernel alloc_tag instances, pcpu_alloc_noprof() is used to > > allocate counters. These allocations are excluded from the profiling > > statistics themselves. > > What is glaringly missing here is "why". > > What is the use case? Why does Linux want this? What benefit does > this bring to our users? This is the most important part of the > changelog because it tells Andrew why he is even looking at this patch. > > > Probably related to the above omission: why per-nid? It would be more > flexible to present the per-cpu counts and let userspace aggregate that > into per-node info if that is desirable. > Hi Andrew, Thanks for taking time reviewing my patch. Sorry I didn't include you in the previous conversion. See https://lore.kernel.org/all/CAJuCfpHhSUhxer-6MP3503w6520YLfgBTGp7Q9Qm9kgN4T= Nsfw@mail.gmail.com/T/#u It includes some motivations and people's opinions. You can take a look while I am fixing your comments ASAP. Basically, we want to know if there is any NUMA imbalance on memory allocation. Further we could optimize our system based on the NUMA nodes stats. > > > > ... > > > > --- a/include/linux/alloc_tag.h > > +++ b/include/linux/alloc_tag.h > > @@ -15,6 +15,8 @@ > > #include > > #include > > > > +extern int pcpu_counters_num; > > This globally-visible variable's identifier is too generic - the name > should communicate which subsystem the variable belongs to. Perhaps > alloc_tag_num_pcpu_counters? It's long, but only used in a few places. > > In fact, it's a count-of-nodes so a better name would be alloc_tag_num_no= des. > > Also, as it's written to a single time, __read_mostly is appropriate. > > > struct alloc_tag_counters { > > u64 bytes; > > u64 calls; > > @@ -134,16 +136,34 @@ static inline bool mem_alloc_profiling_enabled(vo= id) > > &mem_alloc_profiling_key); > > } > > > > +static inline struct alloc_tag_counters alloc_tag_read_nid(struct allo= c_tag *tag, int nid) > > +{ > > + struct alloc_tag_counters v =3D { 0, 0 }; > > + struct alloc_tag_counters *counters; > > + int cpu; > > + > > + for_each_possible_cpu(cpu) { > > for_each_possible_cpu() is lame - potentially much more expensive than > for_each_online_cpu. Is it impractical to use for_each_online_cpu()? > > Probably doesn't matter for a userspace displaying function but > userspace can do weird and unexpected things. > > > + counters =3D per_cpu_ptr(tag->counters, cpu); > > + v.bytes +=3D counters[nid].bytes; > > + v.calls +=3D counters[nid].calls; > > + } > > + > > + return v; > > +} > > + > > > > ... > > > > static int allocinfo_show(struct seq_file *m, void *arg) > > { > > struct allocinfo_private *priv =3D (struct allocinfo_private *)ar= g; > > @@ -116,6 +136,9 @@ static int allocinfo_show(struct seq_file *m, void = *arg) > > priv->print_header =3D false; > > } > > alloc_tag_to_text(&buf, priv->iter.ct); > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS > > + alloc_tag_to_text_all_nids(&buf, priv->iter.ct); > > +#endif > > We can eliminate the ifdef by adding > > #else > static inline void alloc_tag_to_text_all_nids(struct seq_buf *out, struct= codetag *ct) > { > } > #endif > > above. > > > +static void alloc_tag_to_text_all_nids(struct seq_buf *out, struct cod= etag *ct) > > > seq_commit(m, seq_buf_used(&buf)); > > return 0; > > } > > > > ... > > > > @@ -247,19 +270,41 @@ static void shutdown_mem_profiling(bool remove_fi= le) > > void __init alloc_tag_sec_init(void) > > { > > struct alloc_tag *last_codetag; > > + int i; > > > > if (!mem_profiling_support) > > return; > > > > - if (!static_key_enabled(&mem_profiling_compressed)) > > - return; > > - > > kernel_tags.first_tag =3D (struct alloc_tag *)kallsyms_lookup_nam= e( > > SECTION_START(ALLOC_TAG_SECTION_N= AME)); > > last_codetag =3D (struct alloc_tag *)kallsyms_lookup_name( > > SECTION_STOP(ALLOC_TAG_SECTION_NA= ME)); > > kernel_tags.count =3D last_codetag - kernel_tags.first_tag; > > > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS > > + pcpu_counters_num =3D num_possible_nodes(); > > +#else > > + pcpu_counters_num =3D 1; > > +#endif > > In the CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS=3Dn case, let's make > pcpu_counters_num a constant "1", visible to all compilation units. > > That way the compiler can optimize away all the > > for (nid =3D 0; nid < pcpu_counters_num; nid++) > > looping. > > > + pcpu_counters_size =3D pcpu_counters_num * sizeof(struct alloc_ta= g_counters); > > > > + for (i =3D 0; i < kernel_tags.count; i++) { > > + /* Each CPU has one alloc_tag_counters per numa node */ > > + kernel_tags.first_tag[i].counters =3D > > + pcpu_alloc_noprof(pcpu_counters_size, > > + sizeof(struct alloc_tag_counter= s), > > + false, GFP_KERNEL | __GFP_ZERO)= ; > > + if (!kernel_tags.first_tag[i].counters) { > > + while (--i >=3D 0) > > + free_percpu(kernel_tags.first_tag[i].coun= ters); > > + pr_info("Failed to allocate per-cpu alloc_tag cou= nters\n"); > > pr_err(), methinks. > > > + return; > > And now what happens. Will the kernel even work? > > This code path is untestable unless the developer jumps through hoops > and it will never be tested again. > > We assume that __init-time allocations always succeed, so a hearty > panic() here would be OK. > > > + } > > + } > > + > > + if (!static_key_enabled(&mem_profiling_compressed)) > > + return; > > + > > /* Check if kernel tags fit into page flags */ > > if (kernel_tags.count > (1UL << NR_UNUSED_PAGEFLAG_BITS)) { > > shutdown_mem_profiling(false); /* allocinfo file does not= exist yet */ > > @@ -622,7 +667,9 @@ static int load_module(struct module *mod, struct c= odetag *start, struct codetag > > stop_tag =3D ct_to_alloc_tag(stop); > > for (tag =3D start_tag; tag < stop_tag; tag++) { > > WARN_ON(tag->counters); > > - tag->counters =3D alloc_percpu(struct alloc_tag_counters)= ; > > + tag->counters =3D __alloc_percpu_gfp(pcpu_counters_size, > > + sizeof(struct alloc_ta= g_counters), > > + GFP_KERNEL | __GFP_ZER= O); > > if (!tag->counters) { > > while (--tag >=3D start_tag) { > > free_percpu(tag->counters); > > Ditto here, actually. > > Not that it matters much. It's init.text and gets thrown away, shrug. > > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > > > ... > > > > @@ -428,6 +429,7 @@ void __show_mem(unsigned int filter, nodemask_t *no= demask, int max_zone_idx) > > nr =3D alloc_tag_top_users(tags, ARRAY_SIZE(tags), false)= ; > > if (nr) { > > pr_notice("Memory allocations:\n"); > > + pr_notice(" \n"); > > for (i =3D 0; i < nr; i++) { > > struct codetag *ct =3D tags[i].ct; > > struct alloc_tag *tag =3D ct_to_alloc_tag= (ct); > > @@ -435,16 +437,27 @@ void __show_mem(unsigned int filter, nodemask_t *= nodemask, int max_zone_idx) > > char bytes[10]; > > > > string_get_size(counter.bytes, 1, STRING_= UNITS_2, bytes, sizeof(bytes)); > > - > > /* Same as alloc_tag_to_text() but w/o in= termediate buffer */ > > if (ct->modname) > > - pr_notice("%12s %8llu %s:%u [%s] = func:%s\n", > > - bytes, counter.calls, c= t->filename, > > - ct->lineno, ct->modname= , ct->function); > > + pr_notice("%-12s %-8llu %s:%u [%s= ] func:%s\n", > > + bytes, counter.calls, ct-= >filename, > > + ct->lineno, ct->modname, = ct->function); > > else > > - pr_notice("%12s %8llu %s:%u func:= %s\n", > > - bytes, counter.calls, c= t->filename, > > - ct->lineno, ct->functio= n); > > + pr_notice("%-12s %-8llu %s:%u fun= c:%s\n", > > + bytes, counter.calls, > > + ct->filename, ct->lineno,= ct->function); > > + > > +#ifdef CONFIG_MEM_ALLOC_PROFILING_PER_NUMA_STATS > > + int nid; > > C99 definition. > > > + for (nid =3D 0; nid < pcpu_counters_num; = nid++) { > > If we're going to use C99 (is OK now) then it's better to go all the > way and give `i' loop scope. "for (int i..". > > > + counter =3D alloc_tag_read_nid(ta= g, nid); > > + string_get_size(counter.bytes, 1,= STRING_UNITS_2, > > + bytes, sizeof(byt= es)); > > + pr_notice(" nid%-5u %-12ll= d %-8lld\n", > > + nid, counter.bytes, cou= nter.calls); > > + } > > +#endif > > } > > } > > } > > > > ... > >