* [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
@ 2025-12-16 6:43 David Wang
2026-01-05 21:12 ` Suren Baghdasaryan
0 siblings, 1 reply; 11+ messages in thread
From: David Wang @ 2025-12-16 6:43 UTC (permalink / raw)
To: surenb, kent.overstreet
Cc: akpm, hannes, pasha.tatashin, souravpanda, vbabka, linux-mm,
linux-kernel, David Wang
When tracking memory allocations for a specific function,
picking the first codetag is often preferable, because there
is no need to track down every allocation site in the call
graph and change it to the _noprof version, which is quite
inflexible when the call graph is complex.

For example, consider a simple graph:

A ---> B ---> C ===> D
E ---> C

===> means a call with a codetag
---> means a call without a codetag

To profile memory allocations for A, the call graph needs
to be changed to
A ===> B ---> C ---> D
E ===> C
Three call sites need to be changed.

But if the first codetag is picked, only one change is needed.
A ===> B ---> C ===> D
E ---> C

The drawback is that some of C's accounting is split off
to A, making the numbers inaccurate for C. (But the overall
accounting is still the same.)

This is useful when debugging memory problems; it is not
meant for production use.
Signed-off-by: David Wang <00107082@163.com>
---
include/linux/sched.h | 6 ++++++
lib/Kconfig.debug | 12 ++++++++++++
2 files changed, 18 insertions(+)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..4a4f7000737e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2288,14 +2288,20 @@ extern void sched_set_stop_task(int cpu, struct task_struct *stop);
#ifdef CONFIG_MEM_ALLOC_PROFILING
static __always_inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
{
+#ifdef CONFIG_MEM_ALLOC_PROFILING_PICK_FIRST_CODETAG
+ if (current->alloc_tag)
+ return current->alloc_tag;
+#endif
swap(current->alloc_tag, tag);
return tag;
}
static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
{
+#ifndef CONFIG_MEM_ALLOC_PROFILING_PICK_FIRST_CODETAG
#ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
+#endif
#endif
current->alloc_tag = old;
}
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ba36939fda79..6e6f3a12033a 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1038,6 +1038,18 @@ config MEM_ALLOC_PROFILING_DEBUG
Adds warnings with helpful error messages for memory allocation
profiling.
+config MEM_ALLOC_PROFILING_PICK_FIRST_CODETAG
+ bool "Use the first tag along the call chain"
+ default n
+ depends on MEM_ALLOC_PROFILING
+ select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
+ help
+ Make memory allocation profiling store counters in the first
+ codetag along the call chain. This helps profile memory
+ allocations for a specific function by simply adding a codetag
+ to it, without clearing all the codetags further down the
+ call chain. This is intended for debugging only.
+
source "lib/Kconfig.kasan"
source "lib/Kconfig.kfence"
source "lib/Kconfig.kmsan"
--
2.47.3
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2025-12-16 6:43 [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain David Wang
@ 2026-01-05 21:12 ` Suren Baghdasaryan
2026-01-06 3:50 ` David Wang
From: Suren Baghdasaryan @ 2026-01-05 21:12 UTC (permalink / raw)
To: David Wang
Cc: kent.overstreet, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
On Mon, Dec 15, 2025 at 10:44 PM David Wang <00107082@163.com> wrote:
>
> When tracking memory allocations for a specific function,
> picking the first codetag is often preferable, because there
> is no need to track down every allocation site in the call
> graph and change it to the _noprof version, which is quite
> inflexible when the call graph is complex.
>
> For example, consider a simple graph:
>
> A ---> B ---> C ===> D
> E ---> C
>
> ===> means a call with a codetag
> ---> means a call without a codetag
>
> To profile memory allocations for A, the call graph needs
> to be changed to
> A ===> B ---> C ---> D
> E ===> C
> Three call sites need to be changed.
>
> But if the first codetag is picked, only one change is needed.
> A ===> B ---> C ===> D
> E ---> C
>
> The drawback is that some of C's accounting is split off
> to A, making the numbers inaccurate for C. (But the overall
> accounting is still the same.)
>
> This is useful when debugging memory problems; it is not
> meant for production use.
Hi David,
Sorry for the delay. Do you have specific examples where an allocation
needs to be accounted at the highest level?
Our usual approach is that we account the allocation at the lowest
"allocator" level, and if that allocator uses lower-level allocators it
should use the _noprof versions so that accounting is still done at the
right level. I would like to keep that simple approach, but if there
are cases where that's not enough, I would like to know more about them
before trying to address them.
Thanks,
Suren.
>
> Signed-off-by: David Wang <00107082@163.com>
> ---
> include/linux/sched.h | 6 ++++++
> lib/Kconfig.debug | 12 ++++++++++++
> 2 files changed, 18 insertions(+)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d395f2810fac..4a4f7000737e 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2288,14 +2288,20 @@ extern void sched_set_stop_task(int cpu, struct task_struct *stop);
> #ifdef CONFIG_MEM_ALLOC_PROFILING
> static __always_inline struct alloc_tag *alloc_tag_save(struct alloc_tag *tag)
> {
> +#ifdef CONFIG_MEM_ALLOC_PROFILING_PICK_FIRST_CODETAG
> + if (current->alloc_tag)
> + return current->alloc_tag;
> +#endif
> swap(current->alloc_tag, tag);
> return tag;
> }
>
> static __always_inline void alloc_tag_restore(struct alloc_tag *tag, struct alloc_tag *old)
> {
> +#ifndef CONFIG_MEM_ALLOC_PROFILING_PICK_FIRST_CODETAG
> #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> WARN(current->alloc_tag != tag, "current->alloc_tag was changed:\n");
> +#endif
> #endif
> current->alloc_tag = old;
> }
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index ba36939fda79..6e6f3a12033a 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1038,6 +1038,18 @@ config MEM_ALLOC_PROFILING_DEBUG
> Adds warnings with helpful error messages for memory allocation
> profiling.
>
> +config MEM_ALLOC_PROFILING_PICK_FIRST_CODETAG
> + bool "Use the first tag along the call chain"
> + default n
> + depends on MEM_ALLOC_PROFILING
> + select MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT
> + help
> + Make memory allocation profiling store counters in the first
> + codetag along the call chain. This helps profile memory
> + allocations for a specific function by simply adding a codetag
> + to it, without clearing all the codetags further down the
> + call chain. This is intended for debugging only.
> +
> source "lib/Kconfig.kasan"
> source "lib/Kconfig.kfence"
> source "lib/Kconfig.kmsan"
> --
> 2.47.3
>
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-05 21:12 ` Suren Baghdasaryan
@ 2026-01-06 3:50 ` David Wang
2026-01-06 10:54 ` Kent Overstreet
From: David Wang @ 2026-01-06 3:50 UTC (permalink / raw)
To: Suren Baghdasaryan
Cc: kent.overstreet, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
At 2026-01-06 05:12:48, "Suren Baghdasaryan" <surenb@google.com> wrote:
>On Mon, Dec 15, 2025 at 10:44 PM David Wang <00107082@163.com> wrote:
>>
>> When tracking memory allocations for a specific function,
>> picking the first codetag is often preferable, because there
>> is no need to track down every allocation site in the call
>> graph and change it to the _noprof version, which is quite
>> inflexible when the call graph is complex.
>>
>> For example, consider a simple graph:
>>
>> A ---> B ---> C ===> D
>> E ---> C
>>
>> ===> means a call with a codetag
>> ---> means a call without a codetag
>>
>> To profile memory allocations for A, the call graph needs
>> to be changed to
>> A ===> B ---> C ---> D
>> E ===> C
>> Three call sites need to be changed.
>>
>> But if the first codetag is picked, only one change is needed.
>> A ===> B ---> C ===> D
>> E ---> C
>>
>> The drawback is that some of C's accounting is split off
>> to A, making the numbers inaccurate for C. (But the overall
>> accounting is still the same.)
>>
>> This is useful when debugging memory problems; it is not
>> meant for production use.
>
>Hi David,
>Sorry for the delay. Do you have specific examples where an allocation
>needs to be accounted at the highest level?
Hi,
I do not have a very convincing practical example yet. :(
I started to think about this in this thread[1], while debugging a possible memory leak in cephfs.
If a module wants to account for its memory usage, it can plant codetags in its code paths
without worrying about codetags deeper in the call chain.

And I noticed that some callsites' memory usage is incomplete, because their accounting
is split across codetags deeper in the call chain.
For example, on my system, I have
512 1 drivers/usb/core/hub.c:6080 [usbcore] func:usb_hub_init
but if the first codetag is picked, I would have
20992 10 drivers/usb/core/hub.c:6080 [usbcore] func:usb_hub_init

One call chain,
usb_hub_init ===> alloc_workqueue ---> __alloc_workqueue ---> alloc_node_nr_active ===> kzalloc_node,
has two codetags, and its memory is not all accounted to the usb driver.

If one is interested in a module's memory usage, picking the first codetag would be preferable, I guess.
Thanks
David
[1] https://lore.kernel.org/all/2a9ba88e.3aa6.19b0b73dd4e.Coremail.00107082@163.com/
>Our usual approach is that we account the allocation at the lowest
>"allocator" level, and if that allocator uses lower-level allocators it
>should use the _noprof versions so that accounting is still done at the
>right level. I would like to keep that simple approach, but if there
>are cases where that's not enough, I would like to know more about them
>before trying to address them.
>Thanks,
>Suren.
>
>>
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-06 3:50 ` David Wang
@ 2026-01-06 10:54 ` Kent Overstreet
2026-01-06 14:07 ` David Wang
From: Kent Overstreet @ 2026-01-06 10:54 UTC (permalink / raw)
To: David Wang
Cc: Suren Baghdasaryan, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
On Tue, Jan 06, 2026 at 11:50:36AM +0800, David Wang wrote:
> At 2026-01-06 05:12:48, "Suren Baghdasaryan" <surenb@google.com> wrote:
> >On Mon, Dec 15, 2025 at 10:44 PM David Wang <00107082@163.com> wrote:
> >>
> >> When tracking memory allocations for a specific function,
> >> picking the first codetag is often preferable, because there
> >> is no need to track down every allocation site in the call
> >> graph and change it to the _noprof version, which is quite
> >> inflexible when the call graph is complex.
> >>
> >> For example, consider a simple graph:
> >>
> >> A ---> B ---> C ===> D
> >> E ---> C
> >>
> >> ===> means a call with a codetag
> >> ---> means a call without a codetag
> >>
> >> To profile memory allocations for A, the call graph needs
> >> to be changed to
> >> A ===> B ---> C ---> D
> >> E ===> C
> >> Three call sites need to be changed.
> >>
> >> But if the first codetag is picked, only one change is needed.
> >> A ===> B ---> C ===> D
> >> E ---> C
> >>
> >> The drawback is that some of C's accounting is split off
> >> to A, making the numbers inaccurate for C. (But the overall
> >> accounting is still the same.)
> >>
> >> This is useful when debugging memory problems; it is not
> >> meant for production use.
> >
> >Hi David,
> >Sorry for the delay. Do you have specific examples where an allocation
> >needs to be accounted at the highest level?
> Hi,
>
> I do not have a very convincing practical example yet. :(
> I started to think about this in this thread[1], while debugging a possible memory leak in cephfs.
> If a module wants to account for its memory usage, it can plant codetags in its code paths
> without worrying about codetags deeper in the call chain.
>
> And I noticed that some callsites' memory usage is incomplete, because their accounting
> is split across codetags deeper in the call chain.
> For example, on my system, I have
> 512 1 drivers/usb/core/hub.c:6080 [usbcore] func:usb_hub_init
> but if the first codetag is picked, I would have
> 20992 10 drivers/usb/core/hub.c:6080 [usbcore] func:usb_hub_init
>
> One call chain,
> usb_hub_init ===> alloc_workqueue ---> __alloc_workqueue ---> alloc_node_nr_active ===> kzalloc_node,
> has two codetags, and its memory is not all accounted to the usb driver.
>
> If one is interested in a module's memory usage, picking the first codetag would be preferable, I guess.
Is an end user going to be able to do anything with such an option?
Your option just flattens the accounting - this results in incorrect
accounting, not just insufficiently fine grained - and incorrect in a
way that's harder to notice and find and fix.
How many times have you gone down the wrong rabbit hole because your
tools were subtly lying to you? This is something we really want to
avoid.
The fact that you have to be explicit about where the accounting happens
via _noprof is a feature, not a bug :)
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-06 10:54 ` Kent Overstreet
@ 2026-01-06 14:07 ` David Wang
2026-01-06 23:26 ` Kent Overstreet
From: David Wang @ 2026-01-06 14:07 UTC (permalink / raw)
To: Kent Overstreet
Cc: Suren Baghdasaryan, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
At 2026-01-06 18:54:36, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Tue, Jan 06, 2026 at 11:50:36AM +0800, David Wang wrote:
>> At 2026-01-06 05:12:48, "Suren Baghdasaryan" <surenb@google.com> wrote:
>> >On Mon, Dec 15, 2025 at 10:44 PM David Wang <00107082@163.com> wrote:
>> >>
>> >> When tracking memory allocations for a specific function,
>> >> picking the first codetag is often preferable, because there
>> >> is no need to track down every allocation site in the call
>> >> graph and change it to the _noprof version, which is quite
>> >> inflexible when the call graph is complex.
>> >>
>> >> For example, consider a simple graph:
>> >>
>> >> A ---> B ---> C ===> D
>> >> E ---> C
>> >>
>> >> ===> means a call with a codetag
>> >> ---> means a call without a codetag
>> >>
>> >> To profile memory allocations for A, the call graph needs
>> >> to be changed to
>> >> A ===> B ---> C ---> D
>> >> E ===> C
>> >> Three call sites need to be changed.
>> >>
>> >> But if the first codetag is picked, only one change is needed.
>> >> A ===> B ---> C ===> D
>> >> E ---> C
>> >>
>> >> The drawback is that some of C's accounting is split off
>> >> to A, making the numbers inaccurate for C. (But the overall
>> >> accounting is still the same.)
>> >>
>> >> This is useful when debugging memory problems; it is not
>> >> meant for production use.
>> >
>> >Hi David,
>> >Sorry for the delay. Do you have specific examples where an allocation
>> >needs to be accounted at the highest level?
>> Hi,
>>
>> I do not have a very convincing practical example yet. :(
>> I started to think about this in this thread[1], while debugging a possible memory leak in cephfs.
>> If a module wants to account for its memory usage, it can plant codetags in its code paths
>> without worrying about codetags deeper in the call chain.
>>
>> And I noticed that some callsites' memory usage is incomplete, because their accounting
>> is split across codetags deeper in the call chain.
>> For example, on my system, I have
>> 512 1 drivers/usb/core/hub.c:6080 [usbcore] func:usb_hub_init
>> but if the first codetag is picked, I would have
>> 20992 10 drivers/usb/core/hub.c:6080 [usbcore] func:usb_hub_init
>>
>> One call chain,
>> usb_hub_init ===> alloc_workqueue ---> __alloc_workqueue ---> alloc_node_nr_active ===> kzalloc_node,
>> has two codetags, and its memory is not all accounted to the usb driver.
>>
>> If one is interested in a module's memory usage, picking the first codetag would be preferable, I guess.
>
>Is an end user going to be able to do anything with such an option?
I think the option could be used when testing drivers/modules: with it enabled, the memory
usage attributed to a module is more accurate.
>
>Your option just flattens the accounting - this results in incorrect
>accounting, not just insufficiently fine grained - and incorrect in a
>way that's harder to notice and find and fix.
I agree, the accounting would be incorrect for alloc sites down the callchain, and would confuse things.
When the call chain has more than one codetag, correct accounting for one codetag would always mean incorrect
accounting for other codetags, right? But I don't think picking the first tag would make the accounting totally incorrect.
In a perfect world, maybe we could make sure there is only one codetag in any call chain. But the
current code base has several situations where a _noprof version indirectly calls an allocation without _noprof,
as in the usb_hub_init example I mentioned above: alloc_workqueue calls kzalloc_node indirectly.
>
>How many times have you gone down the wrong rabbit hole because your
>tools were subtly lying to you? This is something we really want to
>avoid.
Totally agree.
I used to sum by filepath prefix to aggregate memory usage for drivers.
Take the usb subsystem for example: on my system, the data says my usb drivers use up 200K of memory,
but if the first codetag is picked, the data says ~350K. Which one is lying, or are both lying? I am confused.
I think this also raises the question of what is the *correct* way to make use of /proc/allocinfo...
>
>The fact that you have to be explicit about where the accounting happens
>via _noprof is a feature, not a bug :)
But it is tedious... :(
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-06 14:07 ` David Wang
@ 2026-01-06 23:26 ` Kent Overstreet
2026-01-07 3:38 ` David Wang
From: Kent Overstreet @ 2026-01-06 23:26 UTC (permalink / raw)
To: David Wang
Cc: Suren Baghdasaryan, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
On Tue, Jan 06, 2026 at 10:07:36PM +0800, David Wang wrote:
> I agree, the accounting would be incorrect for alloc sites down the callchain, and would confuse things.
> When the call chain has more than one codetag, correct accounting for one codetag would always mean incorrect
> accounting for other codetags, right? But I don't think picking the first tag would make the accounting totally incorrect.
The trouble is you end up in situations where you have an alloc tag on
the stack, but then you're doing an internal allocation that definitely
should not be accounted to the outer alloc tag.
E.g. there are a lot of internal mm allocations like this; object
extension vectors were, I think, the first place where it came up, and
vmalloc() also has its own internal data structures that require
allocations.
Just using the outermost tag means these inner allocations will get
accounted to other unrelated alloc tags _effectively at random_; meaning
if we're burning more memory than we should be in a place like that it
will never show up in a way that we'll notice and be able to track it
down.
> Totally agree.
> I used to sum by filepath prefix to aggregate memory usage for drivers.
> Take usb subsystem for example, on my system, the data say my usb drivers use up 200k memory,
> and if pick first codetag, the data say ~350K. Which one is lying, or are those two both lying. I am confused.
>
> I think this also raises the question of what is the *correct* way to make use of /proc/allocinfo...
So yes, summing by filepath prefix is the way we want things to work.
But getting there - with a fully reliable end result - is a process.
What you want to do is - preferably on a reasonably idle machine, aside
from the code you're looking at - just look at everything in
/proc/allocinfo and sort by size. Look at the biggest ones that might be
relevant to your subsystem, and look for any that are suspicious and
perhaps should be accounted to your code. Yes, that may entail reading
code :)
This is why accounting to the innermost tag is important - by doing it
this way, if an allocation is being accounted at the wrong callsite
they'll all be lumped together at the specific callsite that needs to be
fixed, which then shows up higher than normal in /proc/allocinfo, so
that it gets looked at.
> >The fact that you have to be explicit about where the accounting happens
> >via _noprof is a feature, not a bug :)
>
> But it is tedious... :(
That's another way of saying it's easy :)
Spot an allocation with insufficiently fine-grained accounting and it's
generally a 3-5 line patch to fix it; I've been doing those here and
there - e.g. mempools, workqueues, rhashtables.
One trick I did with rhashtables that may be relevant to other
subsystems: rhashtable does background processing for your hash table,
which will do new allocations for your hash table out of a workqueue.
So rhashtable_init() gets wrapped in alloc_hooks(), and then it stashes
the pointer to that alloc tag in the rhashtable, and uses it later for
all those asynchronous allocations.
This means that instead of seeing a ton of memory accounted to the
rhashtable code, with no idea of which rhashtable is burning memory -
all the rhashtable allocations are accounted to the callsite of the
initialization, meaning it's trivial to see which one is burning memory.
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-06 23:26 ` Kent Overstreet
@ 2026-01-07 3:38 ` David Wang
2026-01-07 4:07 ` Kent Overstreet
From: David Wang @ 2026-01-07 3:38 UTC (permalink / raw)
To: Kent Overstreet
Cc: Suren Baghdasaryan, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
At 2026-01-07 07:26:18, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Tue, Jan 06, 2026 at 10:07:36PM +0800, David Wang wrote:
>> I agree, the accounting would be incorrect for alloc sites down the callchain, and would confuse things.
>> When the call chain has more than one codetag, correct accounting for one codetag would always mean incorrect
>> accounting for other codetags, right? But I don't think picking the first tag would make the accounting totally incorrect.
>
>The trouble is you end up in situations where you have an alloc tag on
>the stack, but then you're doing an internal allocation that definitely
>should not be accounted to the outer alloc tag.
>
>E.g. there's a lot of internal mm allocations like this; object
>extension vectors was I think the first place where it came up,
>vmalloc() also has its own internal data structures that require
>allocations.
>
>Just using the outermost tag means these inner allocations will get
>accounted to other unrelated alloc tags _effectively at random_; meaning
>if we're burning more memory than we should be in a place like that it
>will never show up in a way that we'll notice and be able to track it
>down.
I kind of feel that the same thing could be said for drivers: the driver could use more memory
than the data says... and this is actually true.
Different developers may have a different focus concerning the allocation site.
>
>> Totally agree.
>> I used to sum by filepath prefix to aggregate memory usage for drivers.
>> Take the usb subsystem for example: on my system, the data says my usb drivers use up 200K of memory,
>> but if the first codetag is picked, the data says ~350K. Which one is lying, or are both lying? I am confused.
>>
>> I think this also raises the question of what is the *correct* way to make use of /proc/allocinfo...
>
>So yes, summing by filepath prefix is the way we want things to work.
>
>But getting there - with a fully reliable end result - is a process.
>
>What you want to do is - preferably on a reasonably idle machine, aside
>from the code you're looking at - just look at everything in
>/proc/allocinfo and sort by size. Look at the biggest ones that might be
>relevant to your subsystem, and look for any that are suspicious and
>perhaps should be accounted to your code. Yes, that may entail reading
>code :)
>
>This is why accounting to the innermost tag is important - by doing it
>this way, if an allocation is being accounted at the wrong callsite
>they'll all be lumped together at the specific callsite that needs to be
>fixed, which then shows up higher than normal in /proc/allocinfo, so
>that it gets looked at.
>
>> >The fact that you have to be explicit about where the accounting happens
>> >via _noprof is a feature, not a bug :)
>>
>> But it is tedious... :(
>
>That's another way of saying it's easy :)
>
>Spot an allocation with insufficiently fine grained accounting and it's
>generally a 3-5 line patch to fix it, I've been doing those here and
>there - e.g. mempools, workqueues, rhashtables.
>
>One trick I did with rhashtables that may be relevant to other
>subsystems: rhashtable does background processing for your hash table,
>which will do new allocations for your hash table out of a workqueue.
>
>So rhashtable_init() gets wrapped in alloc_hooks(), and then it stashes
>the pointer to that alloc tag in the rhashtable, and uses it later for
>all those asynchronous allocations.
>
>This means that instead of seeing a ton of memory accounted to the
>rhashtable code, with no idea of which rhashtable is burning memory -
>all the rhashtable allocations are accounted to the callsite of the
>initialization, meaning it's trivial to see which one is burning memory.
Not that easy... code keeps being refactored, and the _noprof annotations need to change along with it.
I was trying to split the accounting for __filemap_get_folio among its callers in 6.18;
it was easy, only ~10 lines of code changes. But 6.19 starts with code refactors to
__filemap_get_folio, adding another level of indirection; the allocation call chain becomes
longer, and more _noprof needs to be added... quite unpleasant...

Sometimes I feel that too many _noprof annotations could be an obstacle to future code refactors...

PS: There are several allocation sites that have *huge* memory accounting, and __filemap_get_folio is
one of those. Splitting that accounting among its callers would be more informative.
Thanks
David
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-07 3:38 ` David Wang
@ 2026-01-07 4:07 ` Kent Overstreet
2026-01-07 6:16 ` David Wang
From: Kent Overstreet @ 2026-01-07 4:07 UTC (permalink / raw)
To: David Wang
Cc: Suren Baghdasaryan, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
On Wed, Jan 07, 2026 at 11:38:06AM +0800, David Wang wrote:
>
> At 2026-01-07 07:26:18, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >On Tue, Jan 06, 2026 at 10:07:36PM +0800, David Wang wrote:
> >> I agree, the accounting would be incorrect for alloc sites down the callchain, and would confuse things.
> >> When the call chain has more than one codetag, correct accounting for one codetag would always mean incorrect
> >> accounting for other codetags, right? But I don't think picking the first tag would make the accounting totally incorrect.
> >
> >The trouble is you end up in situations where you have an alloc tag on
> >the stack, but then you're doing an internal allocation that definitely
> >should not be accounted to the outer alloc tag.
> >
> >E.g. there's a lot of internal mm allocations like this; object
> >extension vectors was I think the first place where it came up,
> >vmalloc() also has its own internal data structures that require
> >allocations.
> >
> >Just using the outermost tag means these inner allocations will get
> >accounted to other unrelated alloc tags _effectively at random_; meaning
> >if we're burning more memory than we should be in a place like that it
> >will never show up in a way that we'll notice and be able to track it
> >down.
>
> I kind of feel that the same thing could be said for drivers: the driver could use more memory
> than the data says... and this is actually true.
> Different developers may have a different focus concerning the allocation site.
>
> >
> >> Totally agree.
> >> I used to sum by filepath prefix to aggregate memory usage for drivers.
> >> Take the usb subsystem for example: on my system, the data says my usb drivers use up 200K of memory,
> >> but if the first codetag is picked, the data says ~350K. Which one is lying, or are both lying? I am confused.
> >>
> >> I think this also raises the question of what is the *correct* way to make use of /proc/allocinfo...
> >
> >So yes, summing by filepath prefix is the way we want things to work.
> >
> >But getting there - with a fully reliable end result - is a process.
> >
> >What you want to do is - preferably on a reasonably idle machine, aside
> >from the code you're looking at - just look at everything in
> >/proc/allocinfo and sort by size. Look at the biggest ones that might be
> >relevant to your subsystem, and look for any that are suspicious and
> >perhaps should be accounted to your code. Yes, that may entail reading
> >code :)
> >
> >This is why accounting to the innermost tag is important - by doing it
> >this way, if an allocation is being accounted at the wrong callsite
> >they'll all be lumped together at the specific callsite that needs to be
> >fixed, which then shows up higher than normal in /proc/allocinfo, so
> >that it gets looked at.
> >
> >> >The fact that you have to be explicit about where the accounting happens
> >> >via _noprof is a feature, not a bug :)
> >>
> >> But it is tedious... :(
> >
> >That's another way of saying it's easy :)
> >
> >Spot an allocation with insufficiently fine grained accounting and it's
> >generally a 3-5 line patch to fix it, I've been doing those here and
> >there - e.g. mempools, workqueues, rhashtables.
> >
> >One trick I did with rhashtables that may be relevant to other
> >subsystems: rhashtable does background processing for your hash table,
> >which will do new allocations for your hash table out of a workqueue.
> >
> >So rhashtable_init() gets wrapped in alloc_hooks(), and then it stashes
> >the pointer to that alloc tag in the rhashtable, and uses it later for
> >all those asynchronous allocations.
> >
> >This means that instead of seeing a ton of memory accounted to the
> >rhashtable code, with no idea of which rhashtable is burning memory -
> >all the rhashtable allocations are accounted to the callsite of the
> >initialization, meaning it's trivial to see which one is burning memory.
>
> Not that easy... code keeps being refactored, and the _noprof annotations need to change along with it.
> I was trying to split the accounting for __filemap_get_folio among its callers in 6.18;
> it was easy, only ~10 lines of code changes. But 6.19 starts with code refactors to
> __filemap_get_folio, adding another level of indirection; the allocation call chain becomes
> longer, and more _noprof needs to be added... quite unpleasant...
>
> Sometimes I feel that too many _noprof annotations could be an obstacle to future code refactors...
>
> PS: There are several allocation sites that have *huge* memory accounting, and __filemap_get_folio is
> one of those. Splitting that accounting among its callers would be more informative.
I'm curious why you need to change __filemap_get_folio()? In filesystem
land we just lump that under "pagecache", but I guess you're doing more
interesting things with it in driver land?
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-07 4:07 ` Kent Overstreet
@ 2026-01-07 6:16 ` David Wang
2026-01-07 16:13 ` Kent Overstreet
0 siblings, 1 reply; 11+ messages in thread
From: David Wang @ 2026-01-07 6:16 UTC (permalink / raw)
To: Kent Overstreet
Cc: Suren Baghdasaryan, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
At 2026-01-07 12:07:34, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Wed, Jan 07, 2026 at 11:38:06AM +0800, David Wang wrote:
>>
>> At 2026-01-07 07:26:18, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>> >On Tue, Jan 06, 2026 at 10:07:36PM +0800, David Wang wrote:
>> >> I agree, the accounting would be incorrect for alloc sites down the callchain, and would confuse things.
>> >> When the call chain has more than one codetag, correct accounting for one codetag would always mean incorrect
>> >> accounting for other codetags, right? But I don't think picking the first tag would make the accounting totally incorrect.
>> >
>> >The trouble is you end up in situations where you have an alloc tag on
>> >the stack, but then you're doing an internal allocation that definitely
>> >should not be accounted to the outer alloc tag.
>> >
>> >E.g. there's a lot of internal mm allocations like this; object
>> >extension vectors was I think the first place where it came up,
>> >vmalloc() also has its own internal data structures that require
>> >allocations.
>> >
>> >Just using the outermost tag means these inner allocations will get
>> >accounted to other unrelated alloc tags _effectively at random_; meaning
>> >if we're burning more memory than we should be in a place like that it
>> >will never show up in a way that we'll notice and be able to track it
>> >down.
>>
>> Kind of feel that the same thing could be said for drivers: the driver could use more memory
>> than the data says... this is actually true...
>> Different developers may have a different focus concerning the allocation site.
>>
>> >
>> >> Totally agree.
>> >> I used to sum by filepath prefix to aggregate memory usage for drivers.
>> >> Take the usb subsystem for example: on my system, the data says my usb drivers use ~200K of memory,
>> >> and if the first codetag is picked, the data says ~350K. Which one is lying, or are both? I am confused.
>> >>
>> >> I think this also raises the question of what is the *correct* way to make use of /proc/allocinfo...
>> >
>> >So yes, summing by filepath prefix is the way we want things to work.
>> >
>> >But getting there - with a fully reliable end result - is a process.
>> >
>> >What you want to do is - preferably on a reasonably idle machine, aside
>> >from the code you're looking at - just look at everything in
>> >/proc/allocinfo and sort by size. Look at the biggest ones that might be
>> >relevant to your subsystem, and look for any that are suspicious and
>> >perhaps should be accounted to your code. Yes, that may entail reading
>> >code :)
>> >
>> >This is why accounting to the innermost tag is important - by doing it
>> >this way, if an allocation is being accounted at the wrong callsite
>> >they'll all be lumped together at the specific callsite that needs to be
>> >fixed, which then shows up higher than normal in /proc/allocinfo, so
>> >that it gets looked at.
>> >
>> >> >The fact that you have to be explicit about where the accounting happens
>> >> >via _noprof is a feature, not a bug :)
>> >>
>> >> But it is tedious... :(
>> >
>> >That's another way of saying it's easy :)
>> >
>> >Spot an allocation with insufficiently fine grained accounting and it's
>> >generally a 3-5 line patch to fix it, I've been doing those here and
>> >there - e.g. mempools, workqueues, rhashtables.
>> >
>> >One trick I did with rhashtables that may be relevant to other
>> >subsystems: rhashtable does background processing for your hash table,
>> >which will do new allocations for your hash table out of a workqueue.
>> >
>> >So rhashtable_init() gets wrapped in alloc_hooks(), and then it stashes
>> >the pointer to that alloc tag in the rhashtable, and uses it later for
>> >all those asynchronous allocations.
>> >
>> >This means that instead of seeing a ton of memory accounted to the
>> >rhashtable code, with no idea of which rhashtable is burning memory -
>> >all the rhashtable allocations are accounted to the callsite of the
>> >initialization, meaning it's trivial to see which one is burning memory.
>>
>> Not that easy... code keeps being refactored, and the _noprof annotations need to change along with it.
>> I was trying to split the accounting for __filemap_get_folio among its callers in 6.18;
>> it was easy, only ~10 lines of code changes. But 6.19 started with code refactors to
>> __filemap_get_folio, adding another level of indirection: the allocation callchain becomes
>> longer, and more _noprof annotations need to be added... quite unpleasant...
>>
>> Sometimes I feel too many _noprof annotations could become an obstacle to future code refactoring...
>>
>> PS: There are several allocation sites with *huge* memory accounting; __filemap_get_folio is
>> one of them. Splitting that accounting among its callers would be more informative.
>
>I'm curious why you need to change __filemap_get_folio()? In filesystem
>land we just lump that under "pagecache", but I guess you're doing more
>interesting things with it in driver land?
Oh, in [1] there is a report about a possible memory leak in cephfs (the issue is still open, tracked in [2]):
large chunks of memory could not be released even after dropping caches.
Memory allocation profiling shows that memory belongs to __filemap_get_folio,
something like:
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> > 12M 2987 mm/execmem.c:41 func:execmem_vmalloc
>> > 12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
>> > 13M 751 mm/slub.c:3061 func:alloc_slab_page
>> > 16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
>> > 18M 4355 mm/memory.c:1190 func:folio_prealloc
>> > 24M 6119 mm/memory.c:1192 func:folio_prealloc
>> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
>> > 61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
>> > 79M 6726 mm/slub.c:3059 func:alloc_slab_page
>> > 11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
After adding codetag to __filemap_get_folio, it shows
># sort -g /proc/allocinfo|tail|numfmt --to=iec
> 10M 2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
> 12M 3001 mm/execmem.c:41 func:execmem_vmalloc
> 12M 3605 kernel/fork.c:311 func:alloc_thread_stack_node
> 16M 992 mm/slub.c:3061 func:alloc_slab_page
> 20M 35544 lib/xarray.c:378 func:xas_alloc
> 31M 7704 mm/memory.c:1192 func:folio_prealloc
> 69M 17562 mm/memory.c:1190 func:folio_prealloc
> 104M 8212 mm/slub.c:3059 func:alloc_slab_page
> 124M 30075 mm/readahead.c:189 func:ractl_alloc_folio
> 2.6G 661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
>
Helpful or not, I am not sure; so far no bug has been spotted in the cephfs write path.
But at least it provides more information and narrows down the scope of suspicion.
https://lore.kernel.org/lkml/2a9ba88e.3aa6.19b0b73dd4e.Coremail.00107082@163.com/ [1]
https://tracker.ceph.com/issues/74156 [2]
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-07 6:16 ` David Wang
@ 2026-01-07 16:13 ` Kent Overstreet
2026-01-07 17:50 ` David Wang
0 siblings, 1 reply; 11+ messages in thread
From: Kent Overstreet @ 2026-01-07 16:13 UTC (permalink / raw)
To: David Wang
Cc: Suren Baghdasaryan, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
On Wed, Jan 07, 2026 at 02:16:24PM +0800, David Wang wrote:
>
> At 2026-01-07 12:07:34, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
> >I'm curious why you need to change __filemap_get_folio()? In filesystem
> >land we just lump that under "pagecache", but I guess you're doing more
> >interesting things with it in driver land?
>
> Oh, in [1] there is a report about a possible memory leak in cephfs (the issue is still open, tracked in [2]):
> large chunks of memory could not be released even after dropping caches.
> Memory allocation profiling shows that memory belongs to __filemap_get_folio,
> something like:
> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> >> > 12M 2987 mm/execmem.c:41 func:execmem_vmalloc
> >> > 12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
> >> > 13M 751 mm/slub.c:3061 func:alloc_slab_page
> >> > 16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
> >> > 18M 4355 mm/memory.c:1190 func:folio_prealloc
> >> > 24M 6119 mm/memory.c:1192 func:folio_prealloc
> >> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
> >> > 61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
> >> > 79M 6726 mm/slub.c:3059 func:alloc_slab_page
> >> > 11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
>
> After adding codetag to __filemap_get_folio, it shows
>
> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
> > 10M 2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
> > 12M 3001 mm/execmem.c:41 func:execmem_vmalloc
> > 12M 3605 kernel/fork.c:311 func:alloc_thread_stack_node
> > 16M 992 mm/slub.c:3061 func:alloc_slab_page
> > 20M 35544 lib/xarray.c:378 func:xas_alloc
> > 31M 7704 mm/memory.c:1192 func:folio_prealloc
> > 69M 17562 mm/memory.c:1190 func:folio_prealloc
> > 104M 8212 mm/slub.c:3059 func:alloc_slab_page
> > 124M 30075 mm/readahead.c:189 func:ractl_alloc_folio
> > 2.6G 661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
> >
>
> Helpful or not, I am not sure; so far no bug has been spotted in the cephfs write path.
> But at least it provides more information and narrows down the scope of suspicion.
>
>
> https://lore.kernel.org/lkml/2a9ba88e.3aa6.19b0b73dd4e.Coremail.00107082@163.com/ [1]
> https://tracker.ceph.com/issues/74156 [2]
Well, my first thought when looking at that is that memory allocation
profiling is unlikely to be any more help there. Once you're dealing
with the page cache, if you're looking at a genuine leak it would pretty
much have to be a folio refcount leak, and the code that leaked the ref
could be anything that touched that folio - you're looking at a pretty
wide scope.
Unfortunately, we're not great at visibility and introspection in mm/,
and refcount bugs tend to be hard in general.
Better mm introspection would be helpful to say definitively that you're
looking at a refcount leak, but then once that's determined it's still
going to be pretty painful to track down.
The approach I took in bcachefs for refcount bugs was to write a small
library that in debug mode splits a refcount into sub-refcounts, and
then enumerate every single codepath that takes refs and gives them
distinct sub-refs - this means in debug mode we can instantly pinpoint
the function that's buggy (and even better, with the new CLASS() and
guard() stuff these sorts of bugs have been going away).
But grafting that onto folio refcounts would be a hell of a chore.
OTOH, converting code to CLASS() and guards is much more
straightforward - just a matter of writing little helpers if you need
them and then a bunch of mechanical conversions, and it's well worth it.
But, I'm reading through the Ceph code, and it has /less/ code involving
folio refcounts than I would expect.
Has anyone checked if the bug reproduces without zswap? I've definitely
seen a lot of bug reports involving that code.
* Re: [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain
2026-01-07 16:13 ` Kent Overstreet
@ 2026-01-07 17:50 ` David Wang
0 siblings, 0 replies; 11+ messages in thread
From: David Wang @ 2026-01-07 17:50 UTC (permalink / raw)
To: Kent Overstreet, malcolm
Cc: Suren Baghdasaryan, akpm, hannes, pasha.tatashin, souravpanda,
vbabka, linux-mm, linux-kernel
At 2026-01-08 00:13:25, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>On Wed, Jan 07, 2026 at 02:16:24PM +0800, David Wang wrote:
>>
>> At 2026-01-07 12:07:34, "Kent Overstreet" <kent.overstreet@linux.dev> wrote:
>> >I'm curious why you need to change __filemap_get_folio()? In filesystem
>> >land we just lump that under "pagecache", but I guess you're doing more
>> >interesting things with it in driver land?
>>
>> Oh, in [1] there is a report about a possible memory leak in cephfs (the issue is still open, tracked in [2]):
>> large chunks of memory could not be released even after dropping caches.
>> Memory allocation profiling shows that memory belongs to __filemap_get_folio,
>> something like:
>> >> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> >> > 12M 2987 mm/execmem.c:41 func:execmem_vmalloc
>> >> > 12M 3 kernel/dma/pool.c:96 func:atomic_pool_expand
>> >> > 13M 751 mm/slub.c:3061 func:alloc_slab_page
>> >> > 16M 8 mm/khugepaged.c:1069 func:alloc_charge_folio
>> >> > 18M 4355 mm/memory.c:1190 func:folio_prealloc
>> >> > 24M 6119 mm/memory.c:1192 func:folio_prealloc
>> >> > 58M 14784 mm/page_ext.c:271 func:alloc_page_ext
>> >> > 61M 15448 mm/readahead.c:189 func:ractl_alloc_folio
>> >> > 79M 6726 mm/slub.c:3059 func:alloc_slab_page
>> >> > 11G 2674488 mm/filemap.c:2012 func:__filemap_get_folio
>>
>> After adding codetag to __filemap_get_folio, it shows
>>
>> ># sort -g /proc/allocinfo|tail|numfmt --to=iec
>> > 10M 2541 drivers/block/zram/zram_drv.c:1597 [zram] func:zram_meta_alloc
>> > 12M 3001 mm/execmem.c:41 func:execmem_vmalloc
>> > 12M 3605 kernel/fork.c:311 func:alloc_thread_stack_node
>> > 16M 992 mm/slub.c:3061 func:alloc_slab_page
>> > 20M 35544 lib/xarray.c:378 func:xas_alloc
>> > 31M 7704 mm/memory.c:1192 func:folio_prealloc
>> > 69M 17562 mm/memory.c:1190 func:folio_prealloc
>> > 104M 8212 mm/slub.c:3059 func:alloc_slab_page
>> > 124M 30075 mm/readahead.c:189 func:ractl_alloc_folio
>> > 2.6G 661392 fs/netfs/buffered_read.c:635 [netfs] func:netfs_write_begin
>> >
>>
>> Helpful or not, I am not sure; so far no bug has been spotted in the cephfs write path.
>> But at least it provides more information and narrows down the scope of suspicion.
>>
>>
>> https://lore.kernel.org/lkml/2a9ba88e.3aa6.19b0b73dd4e.Coremail.00107082@163.com/ [1]
>> https://tracker.ceph.com/issues/74156 [2]
>
>Well, my first thought when looking at that is that memory allocation
>profiling is unlikely to be any more help there. Once you're dealing
>with the page cache, if you're looking at a genuine leak it would pretty
>much have to be a folio refcount leak, and the code that leaked the ref
>could be anything that touched that folio - you're looking at a pretty
>wide scope.
>
>Unfortunately, we're not great at visibility and introspection in mm/,
>and refcount bugs tend to be hard in general.
>
>Better mm introspection would be helpful to say definitively that you're
>looking at a refcount leak, but then once that's determined it's still
>going to be pretty painful to track down.
>
>The approach I took in bcachefs for refcount bugs was to write a small
>library that in debug mode splits a refcount into sub-refcounts, and
>then enumerate every single codepath that takes refs and gives them
>distinct sub-refs - this means in debug mode we can instantly pinpoint
>the function that's buggy (and even better, with the new CLASS() and
>guard() stuff these sorts of bugs have been going away).
>
>But grafting that onto folio refcounts would be a hell of a chore.
>
>OTOH, converting code to CLASS() and guards is much more
>straightforward - just a matter of writing little helpers if you need
>them and then a bunch of mechanical conversions, and it's well worth it.
>
>But, I'm reading through the Ceph code, and it has /less/ code involving
>folio refcounts than I would expect.
>
>Has anyone checked if the bug reproduces without zswap? I've definitely
>seen a lot of bug reports involving that code.
Thanks for the information, and your time~!
Adding malcolm@haak.id.au.
Actually I don't even have access to a cephfs setup to confirm the bug;
I was just interested in the "memory leak" issue, and trying to "sell" memory allocation profiling there. :)
Thanks
David
end of thread, other threads:[~2026-01-07 17:51 UTC | newest]
Thread overview: 11+ messages
-- links below jump to the message on this page --
2025-12-16 6:43 [PATCH RFC] alloc_tag: add option to pick the first codetag along callchain David Wang
2026-01-05 21:12 ` Suren Baghdasaryan
2026-01-06 3:50 ` David Wang
2026-01-06 10:54 ` Kent Overstreet
2026-01-06 14:07 ` David Wang
2026-01-06 23:26 ` Kent Overstreet
2026-01-07 3:38 ` David Wang
2026-01-07 4:07 ` Kent Overstreet
2026-01-07 6:16 ` David Wang
2026-01-07 16:13 ` Kent Overstreet
2026-01-07 17:50 ` David Wang