* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
[not found] <20251211193106.755485-2-echanude@redhat.com>
@ 2025-12-11 23:25 ` T.J. Mercier
2025-12-15 10:51 ` Maxime Ripard
0 siblings, 1 reply; 12+ messages in thread
From: T.J. Mercier @ 2025-12-11 23:25 UTC (permalink / raw)
To: Eric Chanudet
Cc: Sumit Semwal, Benjamin Gaignard, Brian Starkey, John Stultz,
Christian Koenig, Maxime Ripard, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
>
> The system dma-buf heap lets userspace allocate buffers from the page
> allocator. However, these allocations are not accounted for in memcg,
> allowing processes to escape limits that may be configured.
>
> Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
We had a discussion just last night in the MM track at LPC about how
shared memory accounted in memcg is pretty broken. Without a way to
identify (and possibly transfer) ownership of a shared buffer, this
makes the accounting of shared memory, and zombie memcg problems
worse. :\
>
> Userspace components using the system heap can be constrained with, e.g:
> systemd-run --user --scope -p MemoryMax=10M ...
>
> Signed-off-by: Eric Chanudet <echanude@redhat.com>
> ---
> drivers/dma-buf/heaps/system_heap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c
> index 4c782fe33fd4..c91fcdff4b77 100644
> --- a/drivers/dma-buf/heaps/system_heap.c
> +++ b/drivers/dma-buf/heaps/system_heap.c
> @@ -38,10 +38,10 @@ struct dma_heap_attachment {
> bool mapped;
> };
>
> -#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO)
> +#define LOW_ORDER_GFP (GFP_HIGHUSER | __GFP_ZERO | __GFP_ACCOUNT)
> #define HIGH_ORDER_GFP (((GFP_HIGHUSER | __GFP_ZERO | __GFP_NOWARN \
> | __GFP_NORETRY) & ~__GFP_RECLAIM) \
> - | __GFP_COMP)
> + | __GFP_COMP | __GFP_ACCOUNT)
> static gfp_t order_flags[] = {HIGH_ORDER_GFP, HIGH_ORDER_GFP, LOW_ORDER_GFP};
> /*
> * The selection of the orders used for allocation (1MB, 64K, 4K) is designed
> --
> 2.52.0
>
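To make the shape of the two masks in the diff above easier to follow, here is a minimal userspace sketch. The bit positions are made up for illustration and do not match the kernel's real GFP values; the model only assumes that GFP_HIGHUSER carries the reclaim bits, as GFP_USER does. The `M_*` names and both helper functions are hypothetical.

```c
#include <stdint.h>

/*
 * Userspace model of the two masks touched by the patch. The bit
 * positions below are invented for illustration; the real values are
 * defined in the kernel headers and differ from these.
 */
typedef uint32_t gfp_model_t;

#define M_DIRECT_RECLAIM (1u << 0)
#define M_KSWAPD_RECLAIM (1u << 1)
#define M_IO             (1u << 2)
#define M_FS             (1u << 3)
#define M_HARDWALL       (1u << 4)
#define M_HIGHMEM        (1u << 5)
#define M_ZERO           (1u << 6)
#define M_NOWARN         (1u << 7)
#define M_NORETRY        (1u << 8)
#define M_COMP           (1u << 9)
#define M_ACCOUNT        (1u << 10)

#define M_RECLAIM  (M_DIRECT_RECLAIM | M_KSWAPD_RECLAIM)
/* Assumption: models GFP_HIGHUSER as GFP_USER plus the highmem bit. */
#define M_HIGHUSER (M_RECLAIM | M_IO | M_FS | M_HARDWALL | M_HIGHMEM)

/* Same shape as LOW_ORDER_GFP after the patch. */
gfp_model_t model_low_order_gfp(void)
{
	return M_HIGHUSER | M_ZERO | M_ACCOUNT;
}

/*
 * Same shape as HIGH_ORDER_GFP after the patch: the reclaim bits are
 * masked out first, then the compound and accounting bits are ORed in,
 * so the mask cannot clear the accounting bit.
 */
gfp_model_t model_high_order_gfp(void)
{
	return ((M_HIGHUSER | M_ZERO | M_NOWARN | M_NORETRY) & ~M_RECLAIM)
		| M_COMP | M_ACCOUNT;
}
```

In this model both masks end up with the accounting bit set, while only the low-order mask keeps the reclaim bits.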
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-11 23:25 ` [PATCH] dma-buf: system_heap: account for system heap allocation in memcg T.J. Mercier
@ 2025-12-15 10:51 ` Maxime Ripard
2025-12-15 13:30 ` Christian König
2025-12-16 2:06 ` T.J. Mercier
0 siblings, 2 replies; 12+ messages in thread
From: Maxime Ripard @ 2025-12-15 10:51 UTC (permalink / raw)
To: T.J. Mercier
Cc: Eric Chanudet, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, Christian Koenig, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
Hi TJ,
On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
> On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
> >
> > The system dma-buf heap lets userspace allocate buffers from the page
> > allocator. However, these allocations are not accounted for in memcg,
> > allowing processes to escape limits that may be configured.
> >
> > Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
>
> We had a discussion just last night in the MM track at LPC about how
> shared memory accounted in memcg is pretty broken. Without a way to
> identify (and possibly transfer) ownership of a shared buffer, this
> makes the accounting of shared memory, and zombie memcg problems
> worse. :\
Are there notes or a report from that discussion anywhere?
The way I see it, accounting for the *trivial* dma-buf heaps case is
non-existent at the moment, and that's definitely broken. Any application can
bypass its cgroup limits trivially, and that's a pretty big hole in the system.
The shared ownership is indeed broken, but it's not more or less broken
than, say, memfd + udmabuf, and I'm sure plenty of others.
So we really improve the common case, but only make the "advanced" case
slightly more broken than it already is.
Would you disagree?
Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-15 10:51 ` Maxime Ripard
@ 2025-12-15 13:30 ` Christian König
2025-12-15 13:59 ` Maxime Ripard
2025-12-16 2:06 ` T.J. Mercier
1 sibling, 1 reply; 12+ messages in thread
From: Christian König @ 2025-12-15 13:30 UTC (permalink / raw)
To: Maxime Ripard, T.J. Mercier
Cc: Eric Chanudet, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, linux-media, dri-devel, linaro-mm-sig, linux-kernel,
open list:MEMORY MANAGEMENT
On 12/15/25 11:51, Maxime Ripard wrote:
> Hi TJ,
>
> On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
>> On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
>>>
>>> The system dma-buf heap lets userspace allocate buffers from the page
>>> allocator. However, these allocations are not accounted for in memcg,
>>> allowing processes to escape limits that may be configured.
>>>
>>> Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
>>
>> We had a discussion just last night in the MM track at LPC about how
>> shared memory accounted in memcg is pretty broken. Without a way to
>> identify (and possibly transfer) ownership of a shared buffer, this
>> makes the accounting of shared memory, and zombie memcg problems
>> worse. :\
>
> Are there notes or a report from that discussion anywhere?
>
> The way I see it, the dma-buf heaps *trivial* case is non-existent at
> the moment and that's definitely broken. Any application can bypass its
> cgroups limits trivially, and that's a pretty big hole in the system.
Well, that is just the tip of the iceberg.
Pretty much all driver interfaces don't account to memcg at the moment, all the way from ALSA, over GPUs (both TTM and SHM-GEM) to V4L2.
> The shared ownership is indeed broken, but it's not more or less broken
> than, say, memfd + udmabuf, and I'm sure plenty of others.
>
> So we really improve the common case, but only make the "advanced"
> slightly more broken than it already is.
>
> Would you disagree?
I strongly disagree. As far as I can see there is a huge chance we break existing use cases with that.
There has been some work on TTM by Dave but I still haven't found time to wrap my head around all possible side effects such a change can have.
The fundamental problem is that neither memcg nor the classic resource tracking (e.g. the OOM killer) has a good understanding of shared resources.
For example you can use memfd to basically kill any process in the system because the OOM killer can't identify the process which holds the reference to the memory in question. And that is a *MUCH* bigger problem than just inaccurate memcg accounting.
Regards,
Christian.
>
> Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-15 13:30 ` Christian König
@ 2025-12-15 13:59 ` Maxime Ripard
2025-12-15 14:53 ` Christian König
0 siblings, 1 reply; 12+ messages in thread
From: Maxime Ripard @ 2025-12-15 13:59 UTC (permalink / raw)
To: Christian König
Cc: T.J. Mercier, Eric Chanudet, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On Mon, Dec 15, 2025 at 02:30:47PM +0100, Christian König wrote:
> On 12/15/25 11:51, Maxime Ripard wrote:
> > Hi TJ,
> >
> > On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
> >> On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
> >>>
> >>> The system dma-buf heap lets userspace allocate buffers from the page
> >>> allocator. However, these allocations are not accounted for in memcg,
> >>> allowing processes to escape limits that may be configured.
> >>>
> >>> Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
> >>
> >> We had a discussion just last night in the MM track at LPC about how
> >> shared memory accounted in memcg is pretty broken. Without a way to
> >> identify (and possibly transfer) ownership of a shared buffer, this
> >> makes the accounting of shared memory, and zombie memcg problems
> >> worse. :\
> >
> > Are there notes or a report from that discussion anywhere?
> >
> > The way I see it, the dma-buf heaps *trivial* case is non-existent at
> > the moment and that's definitely broken. Any application can bypass its
> > cgroups limits trivially, and that's a pretty big hole in the system.
>
> Well, that is just the tip of the iceberg.
>
> Pretty much all driver interfaces don't account to memcg at the
> moment, all the way from alsa, over GPUs (both TTM and SHM-GEM) to
> V4L2.
Yes, I know, and step 1 of the plan we discussed earlier this year is to
fix the heaps.
> > The shared ownership is indeed broken, but it's not more or less broken
> > than, say, memfd + udmabuf, and I'm sure plenty of others.
> >
> > So we really improve the common case, but only make the "advanced"
> > slightly more broken than it already is.
> >
> > Would you disagree?
>
> I strongly disagree. As far as I can see there is a huge chance we
> break existing use cases with that.
Which ones? And what about the ones that are already broken?
> There has been some work on TTM by Dave but I still haven't found time
> to wrap my head around all possible side effects such a change can
> have.
>
> The fundamental problem is that neither memcg nor the classic resource
> tracking (e.g. the OOM killer) has a good understanding of shared
> resources.
And yet heap allocations don't necessarily have to be shared. But they
all have to be allocated.
> For example you can use memfd to basically kill any process in the
> system because the OOM killer can't identify the process which holds
> the reference to the memory in question. And that is a *MUCH* bigger
> problem than just inaccurate memcg accounting.
When you frame it like that, sure. Also, you can use the system heap to
DoS any process in the system. I'm not saying that what you're concerned
about isn't an issue, but let's not brush off other people's legitimate
issues either.
Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-15 13:59 ` Maxime Ripard
@ 2025-12-15 14:53 ` Christian König
2025-12-16 2:08 ` T.J. Mercier
2025-12-19 10:25 ` Maxime Ripard
0 siblings, 2 replies; 12+ messages in thread
From: Christian König @ 2025-12-15 14:53 UTC (permalink / raw)
To: Maxime Ripard
Cc: T.J. Mercier, Eric Chanudet, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On 12/15/25 14:59, Maxime Ripard wrote:
> On Mon, Dec 15, 2025 at 02:30:47PM +0100, Christian König wrote:
>> On 12/15/25 11:51, Maxime Ripard wrote:
>>> Hi TJ,
>>>
>>> On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
>>>> On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
>>>>>
>>>>> The system dma-buf heap lets userspace allocate buffers from the page
>>>>> allocator. However, these allocations are not accounted for in memcg,
>>>>> allowing processes to escape limits that may be configured.
>>>>>
>>>>> Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
>>>>
>>>> We had a discussion just last night in the MM track at LPC about how
>>>> shared memory accounted in memcg is pretty broken. Without a way to
>>>> identify (and possibly transfer) ownership of a shared buffer, this
>>>> makes the accounting of shared memory, and zombie memcg problems
>>>> worse. :\
>>>
>>> Are there notes or a report from that discussion anywhere?
>>>
>>> The way I see it, the dma-buf heaps *trivial* case is non-existent at
>>> the moment and that's definitely broken. Any application can bypass its
>>> cgroups limits trivially, and that's a pretty big hole in the system.
>>
>> Well, that is just the tip of the iceberg.
>>
>> Pretty much all driver interfaces don't account to memcg at the
>> moment, all the way from alsa, over GPUs (both TTM and SHM-GEM) to
>> V4L2.
>
> Yes, I know, and step 1 of the plan we discussed earlier this year is to
> fix the heaps.
>
>>> The shared ownership is indeed broken, but it's not more or less broken
>>> than, say, memfd + udmabuf, and I'm sure plenty of others.
>>>
>>> So we really improve the common case, but only make the "advanced"
>>> slightly more broken than it already is.
>>>
>>> Would you disagree?
>>
>> I strongly disagree. As far as I can see there is a huge chance we
>> break existing use cases with that.
>
> Which ones? And what about the ones that are already broken?
Well, everybody who expects that driver resources are *not* accounted to memcg.
>> There has been some work on TTM by Dave but I still haven't found time
>> to wrap my head around all possible side effects such a change can
>> have.
>>
>> The fundamental problem is that neither memcg nor the classic resource
>> tracking (e.g. the OOM killer) has a good understanding of shared
>> resources.
>
> And yet heap allocations don't necessarily have to be shared. But they
> all have to be allocated.
>
>> For example you can use memfd to basically kill any process in the
>> system because the OOM killer can't identify the process which holds
>> the reference to the memory in question. And that is a *MUCH* bigger
>> problem than just inaccurate memcg accounting.
>
> When you frame it like that, sure. Also, you can use the system heap to
> DoS any process in the system. I'm not saying that what you're concerned
> about isn't an issue, but let's not brush off other people's legitimate
> issues as well.
Completely agree, but we should prioritize.
That driver-allocated memory is not memcg accounted is actually uAPI, i.e. that is not something which can easily change.
Fixing the OOM killer, on the other hand, looks perfectly doable and will then most likely also show a better path to fixing the memcg accounting.
Christian.
>
> Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-15 10:51 ` Maxime Ripard
2025-12-15 13:30 ` Christian König
@ 2025-12-16 2:06 ` T.J. Mercier
2025-12-19 10:19 ` Maxime Ripard
1 sibling, 1 reply; 12+ messages in thread
From: T.J. Mercier @ 2025-12-16 2:06 UTC (permalink / raw)
To: Maxime Ripard
Cc: Eric Chanudet, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, Christian Koenig, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On Mon, Dec 15, 2025 at 7:51 PM Maxime Ripard <mripard@redhat.com> wrote:
>
> Hi TJ,
Hi Maxime,
> On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
> > On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
> > >
> > > The system dma-buf heap lets userspace allocate buffers from the page
> > > allocator. However, these allocations are not accounted for in memcg,
> > > allowing processes to escape limits that may be configured.
> > >
> > > Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
> >
> > We had a discussion just last night in the MM track at LPC about how
> > shared memory accounted in memcg is pretty broken. Without a way to
> > identify (and possibly transfer) ownership of a shared buffer, this
> > makes the accounting of shared memory, and zombie memcg problems
> > worse. :\
>
> Are there notes or a report from that discussion anywhere?
The LPC vids haven't been clipped yet, and actually I can't even find
the recorded full live stream from Hall A2 on the first day. So I
don't think there's anything to look at, but I bet there's probably
nothing there you don't already know.
> The way I see it, the dma-buf heaps *trivial* case is non-existent at
> the moment and that's definitely broken. Any application can bypass its
> cgroups limits trivially, and that's a pretty big hole in the system.
Agree, but if we only charge the first allocator then limits can still
easily be bypassed assuming an app can cause an allocation outside of
its cgroup tree.
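A toy model may help picture the bypass described above. It is only a sketch: `struct memcg_model`, the two cgroups, their limits, and `model_charge()` are hypothetical and stand in for first-allocator charging, not for how memcg is actually implemented.

```c
#include <stddef.h>

/* Toy memcg: only a usage counter and a hard limit, both in bytes. */
struct memcg_model {
	const char *name;
	size_t usage;
	size_t limit;
};

/* Hypothetical setup: a constrained app and a separate allocator service. */
struct memcg_model app_cg     = { "app",       0, 10u << 20 };
struct memcg_model service_cg = { "allocator", 0, 512u << 20 };

/*
 * Charge 'bytes' to 'cg', the cgroup of whoever performs the allocation.
 * Returns 0 on success, -1 if the charge would exceed the limit.
 */
int model_charge(struct memcg_model *cg, size_t bytes)
{
	if (bytes > cg->limit - cg->usage)
		return -1;
	cg->usage += bytes;
	return 0;
}
```

With these numbers, a 64 MiB charge against `app_cg` is refused, but the same buffer allocated by the service on the app's behalf is charged to `service_cg` and succeeds, leaving `app_cg.usage` at zero even though the app is the real user of the buffer.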
I'm not sure using static memcg limits where a significant portion of
the memory can be shared is really feasible. Even with just pagecache
being charged to memcgs, we're having trouble defining a static memcg
limit that is really useful since it has to be high enough to
accommodate occasional spikes due to shared memory that might or might
not be charged (since it can only be charged to one memcg - it may be
spread around or it may all get charged to one memcg). So excessive
anonymous use has to get really bad before it gets punished.
What I've been hearing lately is that folks are polling memory.stat or
PSI or other metrics and using that to take actions (memory.reclaim /
killing / adjust memory.high) at runtime rather than relying on
memory.high/max behavior with a static limit.
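As a rough illustration of that polling approach, the sketch below reads the "some" avg10 value from a cgroup v2 memory.pressure file and maps it to an action. The function names, the enum, and the 5% / 20% thresholds are assumptions picked for the example, not a recommended policy; a real agent would go on to write to memory.reclaim or memory.high, which is left out here.

```c
#include <stdio.h>
#include <string.h>

enum pressure_action { ACTION_NONE, ACTION_RECLAIM, ACTION_LOWER_HIGH };

/*
 * Parse one PSI line such as
 *   "some avg10=1.23 avg60=0.50 avg300=0.10 total=12345"
 * and return its avg10 value if the line is of the requested kind
 * ("some" or "full"). Returns -1.0 on mismatch or parse failure.
 */
double psi_avg10(const char *line, const char *kind)
{
	char name[8];
	double avg10;

	if (sscanf(line, "%7s avg10=%lf", name, &avg10) != 2)
		return -1.0;
	if (strcmp(name, kind) != 0)
		return -1.0;
	return avg10;
}

/* Map pressure to an action; the thresholds are arbitrary examples. */
enum pressure_action choose_action(double some_avg10)
{
	if (some_avg10 >= 20.0)
		return ACTION_LOWER_HIGH;	/* e.g. tighten memory.high */
	if (some_avg10 >= 5.0)
		return ACTION_RECLAIM;		/* e.g. write to memory.reclaim */
	return ACTION_NONE;
}

/* Read the "some" avg10 from <cgroup_dir>/memory.pressure, or -1.0. */
double read_some_avg10(const char *cgroup_dir)
{
	char path[512], line[256];
	double avg10 = -1.0;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.pressure", cgroup_dir);
	f = fopen(path, "r");
	if (!f)
		return -1.0;
	while (fgets(line, sizeof(line), f)) {
		avg10 = psi_avg10(line, "some");
		if (avg10 >= 0.0)
			break;
	}
	fclose(f);
	return avg10;
}
```

A monitoring loop would call `read_some_avg10()` periodically on the cgroup directory it manages and feed the result to `choose_action()`.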
> The shared ownership is indeed broken, but it's not more or less broken
> than, say, memfd + udmabuf, and I'm sure plenty of others.
One thing that's worse about system heap buffers is that unlike memfd
the memory isn't reclaimable. So without killing all users there's
currently no way to deal with the zombie issue. Harry's proposing
reparenting, but I don't think our current interfaces support that
because we'd have to mess with the page structs behind system heap
dmabufs to change the memcg during reparenting.
Ah... but udmabuf pins the memfd pages, so you're right that memfd +
udmabuf isn't worse.
> So we really improve the common case, but only make the "advanced"
> slightly more broken than it already is.
>
> Would you disagree?
I think memcg limits in this case just wouldn't be usable because of
what I mentioned above. In our common case the allocator is in a
different cgroup tree than the real users of the buffer.
> Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-15 14:53 ` Christian König
@ 2025-12-16 2:08 ` T.J. Mercier
2025-12-19 10:25 ` Maxime Ripard
1 sibling, 0 replies; 12+ messages in thread
From: T.J. Mercier @ 2025-12-16 2:08 UTC (permalink / raw)
To: Christian König
Cc: Maxime Ripard, Eric Chanudet, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On Mon, Dec 15, 2025 at 11:53 PM Christian König
<christian.koenig@amd.com> wrote:
>
> On 12/15/25 14:59, Maxime Ripard wrote:
> > On Mon, Dec 15, 2025 at 02:30:47PM +0100, Christian König wrote:
> >> On 12/15/25 11:51, Maxime Ripard wrote:
> >>> Hi TJ,
> >>>
> >>> On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
> >>>> On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
> >>>>>
> >>>>> The system dma-buf heap lets userspace allocate buffers from the page
> >>>>> allocator. However, these allocations are not accounted for in memcg,
> >>>>> allowing processes to escape limits that may be configured.
> >>>>>
> >>>>> Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
> >>>>
> >>>> We had a discussion just last night in the MM track at LPC about how
> >>>> shared memory accounted in memcg is pretty broken. Without a way to
> >>>> identify (and possibly transfer) ownership of a shared buffer, this
> >>>> makes the accounting of shared memory, and zombie memcg problems
> >>>> worse. :\
> >>>
> >>> Are there notes or a report from that discussion anywhere?
> >>>
> >>> The way I see it, the dma-buf heaps *trivial* case is non-existent at
> >>> the moment and that's definitely broken. Any application can bypass its
> >>> cgroups limits trivially, and that's a pretty big hole in the system.
> >>
> >> Well, that is just the tip of the iceberg.
> >>
> >> Pretty much all driver interfaces don't account to memcg at the
> >> moment, all the way from alsa, over GPUs (both TTM and SHM-GEM) to
> >> V4L2.
> >
> > Yes, I know, and step 1 of the plan we discussed earlier this year is to
> > fix the heaps.
> >
> >>> The shared ownership is indeed broken, but it's not more or less broken
> >>> than, say, memfd + udmabuf, and I'm sure plenty of others.
> >>>
> >>> So we really improve the common case, but only make the "advanced"
> >>> slightly more broken than it already is.
> >>>
> >>> Would you disagree?
> >>
> >> I strongly disagree. As far as I can see there is a huge chance we
> >> break existing use cases with that.
> >
> > Which ones? And what about the ones that are already broken?
>
> Well everybody that expects that driver resources are *not* accounted to memcg.
>
> >> There has been some work on TTM by Dave but I still haven't found time
> >> to wrap my head around all possible side effects such a change can
> >> have.
> >>
> >> The fundamental problem is that neither memcg nor the classic resource
> >> tracking (e.g. the OOM killer) has a good understanding of shared
> >> resources.
> >
> > And yet heap allocations don't necessarily have to be shared. But they
> > all have to be allocated.
> >
> >> For example you can use memfd to basically kill any process in the
> >> system because the OOM killer can't identify the process which holds
> >> the reference to the memory in question. And that is a *MUCH* bigger
> >> problem than just inaccurate memcg accounting.
> >
> > When you frame it like that, sure. Also, you can use the system heap to
> > DoS any process in the system. I'm not saying that what you're concerned
> > about isn't an issue, but let's not brush off other people's legitimate
> > issues as well.
>
> Completely agree, but we should prioritize.
>
> That driver allocated memory is not memcg accounted is actually uAPI, e.g. that is not something which can easily change.
>
> While fixing the OOM killer looks perfectly doable and will then most likely also show a better path how to fix the memcg accounting.
You think so? I can see how the OOM killer could identify that a
process is using a dmabuf and include that memory use for its decision
making, but the memory for it won't be reclaimed unless *all* users
get killed, which isn't easily known right now.
> Christian.
>
> >
> > Maxime
>
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-16 2:06 ` T.J. Mercier
@ 2025-12-19 10:19 ` Maxime Ripard
2025-12-23 19:20 ` T.J. Mercier
0 siblings, 1 reply; 12+ messages in thread
From: Maxime Ripard @ 2025-12-19 10:19 UTC (permalink / raw)
To: T.J. Mercier
Cc: Eric Chanudet, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, Christian Koenig, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
Hi,
On Tue, Dec 16, 2025 at 11:06:59AM +0900, T.J. Mercier wrote:
> On Mon, Dec 15, 2025 at 7:51 PM Maxime Ripard <mripard@redhat.com> wrote:
> > On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
> > > On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
> > > >
> > > > The system dma-buf heap lets userspace allocate buffers from the page
> > > > allocator. However, these allocations are not accounted for in memcg,
> > > > allowing processes to escape limits that may be configured.
> > > >
> > > > Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
> > >
> > > We had a discussion just last night in the MM track at LPC about how
> > > shared memory accounted in memcg is pretty broken. Without a way to
> > > identify (and possibly transfer) ownership of a shared buffer, this
> > > makes the accounting of shared memory, and zombie memcg problems
> > > worse. :\
> >
> > Are there notes or a report from that discussion anywhere?
>
> The LPC vids haven't been clipped yet, and actually I can't even find
> the recorded full live stream from Hall A2 on the first day. So I
> don't think there's anything to look at, but I bet there's probably
> nothing there you don't already know.
Ack, thanks for looking at it still :)
> > The way I see it, the dma-buf heaps *trivial* case is non-existent at
> > the moment and that's definitely broken. Any application can bypass its
> > cgroups limits trivially, and that's a pretty big hole in the system.
>
> Agree, but if we only charge the first allocator then limits can still
> easily be bypassed assuming an app can cause an allocation outside of
> its cgroup tree.
>
> I'm not sure using static memcg limits where a significant portion of
> the memory can be shared is really feasible. Even with just pagecache
> being charged to memcgs, we're having trouble defining a static memcg
> limit that is really useful since it has to be high enough to
> accommodate occasional spikes due to shared memory that might or might
> not be charged (since it can only be charged to one memcg - it may be
> spread around or it may all get charged to one memcg). So excessive
> anonymous use has to get really bad before it gets punished.
>
> What I've been hearing lately is that folks are polling memory.stat or
> PSI or other metrics and using that to take actions (memory.reclaim /
> killing / adjust memory.high) at runtime rather than relying on
> memory.high/max behavior with a static limit.
But that's only side effects of a buffer being shared, right? (which,
for a buffer sharing mechanism is still pretty important, but still)
> > The shared ownership is indeed broken, but it's not more or less broken
> > than, say, memfd + udmabuf, and I'm sure plenty of others.
>
> One thing that's worse about system heap buffers is that unlike memfd
> the memory isn't reclaimable. So without killing all users there's
> currently no way to deal with the zombie issue. Harry's proposing
> reparenting, but I don't think our current interfaces support that
> because we'd have to mess with the page structs behind system heap
> dmabufs to change the memcg during reparenting.
>
> Ah... but udmabuf pins the memfd pages, so you're right that memfd +
> udmabuf isn't worse.
>
> > So we really improve the common case, but only make the "advanced"
> > slightly more broken than it already is.
> >
> > Would you disagree?
>
> I think memcg limits in this case just wouldn't be usable because of
> what I mentioned above. In our common case the allocator is in a
> different cgroup tree than the real users of the buffer.
So, my issue with this is that we want to fix not only dma-buf itself,
but every device buffer allocation mechanism, so also v4l2, drm, etc.
So we'll need a lot of infrastructure and rework outside of dma-buf to
get there, and figuring out how to solve the shared buffer accounting is
indeed part of it, but so far it was considered kind of the thing to do
last, the last time we discussed it.
What I get from that discussion is that we now consider it a
prerequisite, and given how that topic has been advancing so far, one
that would take a couple of years at best to materialize into something
useful and upstream.
Thus, it blocks all the work around it for years.
Would you be open to merging patches that work on it but only enabled
through a kernel parameter for example (and possibly taint the kernel?)?
That would allow us to work towards that goal while not being blocked by
the shared buffer accounting, and not affecting the general case either.
Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-15 14:53 ` Christian König
2025-12-16 2:08 ` T.J. Mercier
@ 2025-12-19 10:25 ` Maxime Ripard
2025-12-19 13:50 ` Christian König
1 sibling, 1 reply; 12+ messages in thread
From: Maxime Ripard @ 2025-12-19 10:25 UTC (permalink / raw)
To: Christian König
Cc: T.J. Mercier, Eric Chanudet, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On Mon, Dec 15, 2025 at 03:53:22PM +0100, Christian König wrote:
> On 12/15/25 14:59, Maxime Ripard wrote:
> > On Mon, Dec 15, 2025 at 02:30:47PM +0100, Christian König wrote:
> >> On 12/15/25 11:51, Maxime Ripard wrote:
> >>> Hi TJ,
> >>>
> >>> On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
> >>>> On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
> >>>>>
> >>>>> The system dma-buf heap lets userspace allocate buffers from the page
> >>>>> allocator. However, these allocations are not accounted for in memcg,
> >>>>> allowing processes to escape limits that may be configured.
> >>>>>
> >>>>> Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
> >>>>
> >>>> We had a discussion just last night in the MM track at LPC about how
> >>>> shared memory accounted in memcg is pretty broken. Without a way to
> >>>> identify (and possibly transfer) ownership of a shared buffer, this
> >>>> makes the accounting of shared memory, and zombie memcg problems
> >>>> worse. :\
> >>>
> >>> Are there notes or a report from that discussion anywhere?
> >>>
> >>> The way I see it, the dma-buf heaps *trivial* case is non-existent at
> >>> the moment and that's definitely broken. Any application can bypass its
> >>> cgroups limits trivially, and that's a pretty big hole in the system.
> >>
> >> Well, that is just the tip of the iceberg.
> >>
> >> Pretty much all driver interfaces don't account to memcg at the
> >> moment, all the way from alsa, over GPUs (both TTM and SHM-GEM) to
> >> V4L2.
> >
> > Yes, I know, and step 1 of the plan we discussed earlier this year is to
> > fix the heaps.
> >
> >>> The shared ownership is indeed broken, but it's not more or less broken
> >>> than, say, memfd + udmabuf, and I'm sure plenty of others.
> >>>
> >>> So we really improve the common case, but only make the "advanced"
> >>> slightly more broken than it already is.
> >>>
> >>> Would you disagree?
> >>
> >> I strongly disagree. As far as I can see there is a huge chance we
> >> break existing use cases with that.
> >
> > Which ones? And what about the ones that are already broken?
>
> Well everybody that expects that driver resources are *not* accounted to memcg.
Which is a thing only because these buffers have never been accounted
for in the first place. So I guess the conclusion is that we shouldn't
even try to do memory accounting, because someone somewhere might not
expect that one of their applications would take too much RAM in the
system?
> >> There has been some work on TTM by Dave but I still haven't found time
> >> to wrap my head around all possible side effects such a change can
> >> have.
> >>
> >> The fundamental problem is that neither memcg nor the classic resource
> >> tracking (e.g. the OOM killer) has a good understanding of shared
> >> resources.
> >
> > And yet heap allocations don't necessarily have to be shared. But they
> > all have to be allocated.
> >
> >> For example you can use memfd to basically kill any process in the
> >> system because the OOM killer can't identify the process which holds
> >> the reference to the memory in question. And that is a *MUCH* bigger
> >> problem than just inaccurate memcg accounting.
> >
> > When you frame it like that, sure. Also, you can use the system heap to
> > DoS any process in the system. I'm not saying that what you're concerned
> > about isn't an issue, but let's not brush off other people's legitimate
> > issues as well.
>
> Completely agree, but we should prioritize.
>
> That driver allocated memory is not memcg accounted is actually uAPI,
> e.g. that is not something which can easily change.
>
> While fixing the OOM killer looks perfectly doable and will then most
> likely also show a better path how to fix the memcg accounting.

I don't necessarily disagree, but we don't necessarily have the same
priorities either. Your use-cases are probably quite different from
mine, and that's ok. But that's precisely why all these discussions
should be made on the ML when possible, or at least have some notes when
a discussion has happened at a conference or something.

So far, my whole experience with this topic, despite being the only one
(afaik) sending patches about this for the last 1.5y, is that every time
some work on this is done the answer is "oh but you shouldn't have
worked on it because we completely changed our mind", and that's pretty
frustrating.

Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-19 10:25 ` Maxime Ripard
@ 2025-12-19 13:50 ` Christian König
2025-12-19 15:58 ` Maxime Ripard
0 siblings, 1 reply; 12+ messages in thread
From: Christian König @ 2025-12-19 13:50 UTC (permalink / raw)
To: Maxime Ripard
Cc: T.J. Mercier, Eric Chanudet, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On 12/19/25 11:25, Maxime Ripard wrote:
> On Mon, Dec 15, 2025 at 03:53:22PM +0100, Christian König wrote:
>> On 12/15/25 14:59, Maxime Ripard wrote:
...
>>>>> The shared ownership is indeed broken, but it's not more or less broken
>>>>> than, say, memfd + udmabuf, and I'm sure plenty of others.
>>>>>
>>>>> So we really improve the common case, but only make the "advanced"
>>>>> slightly more broken than it already is.
>>>>>
>>>>> Would you disagree?
>>>>
>>>> I strongly disagree. As far as I can see there is a huge chance we
>>>> break existing use cases with that.
>>>
>>> Which ones? And what about the ones that are already broken?
>>
>> Well everybody that expects that driver resources are *not* accounted to memcg.
>
> Which is a thing only because these buffers have never been accounted
> for in the first place.

Yeah, completely agree. By not accounting it for such a long time we
ended up with people depending on this behavior.

Not nice, but that's what it is.

> So I guess the conclusion is that we shouldn't
> even try to do memory accounting, because someone somewhere might not
> expect that one of its applications would take too much RAM in the
> system?

Well we do need some kind of solution to the problem. Either having
some setting where you say "This memcg limit is inclusive/exclusive
device driver allocated memory" or have a completely separate limit
for device driver allocated memory.

Key point is we have both use cases, so we need to support both.

>>>> There has been some work on TTM by Dave but I still haven't found time
>>>> to wrap my head around all possible side effects such a change can
>>>> have.
>>>>
>>>> The fundamental problem is that neither memcg nor the classic resource
>>>> tracking (e.g. the OOM killer) has a good understanding of shared
>>>> resources.
>>>
>>> And yet heap allocations don't necessarily have to be shared. But they
>>> all have to be allocated.
>>>
>>>> For example you can use memfd to basically kill any process in the
>>>> system because the OOM killer can't identify the process which holds
>>>> the reference to the memory in question. And that is a *MUCH* bigger
>>>> problem than just inaccurate memcg accounting.
>>>
>>> When you frame it like that, sure. Also, you can use the system heap to
>>> DoS any process in the system. I'm not saying that what you're concerned
>>> about isn't an issue, but let's not brush off other people's legitimate
>>> issues as well.
>>
>> Completely agree, but we should prioritize.
>>
>> That driver allocated memory is not memcg accounted is actually uAPI,
>> e.g. that is not something which can easily change.
>>
>> While fixing the OOM killer looks perfectly doable and will then most
>> likely also show a better path how to fix the memcg accounting.
>
> I don't necessarily disagree, but we don't necessarily have the same
> priorities either. Your use-cases are probably quite different from
> mine, and that's ok. But that's precisely why all these discussions
> should be made on the ML when possible, or at least have some notes when
> a discussion has happened at a conference or something.
>
> So far, my whole experience with this topic, despite being the only one
> (afaik) sending patches about this for the last 1.5y, is that every time
> some work on this is done the answer is "oh but you shouldn't have
> worked on it because we completely changed our mind", and that's pretty
> frustrating.

Welcome to the club :)

I've already posted patches to start addressing at least the OOM killer
issue ~10 years ago.

Those patches were not well received because back then driver memory was
negligible and the problem simply didn't hurt much.

But by now we have GPUs and AI accelerators which eat up 90% of your
system memory, security researchers stumbling over it and IIRC even
multiple CVE numbers for some of the resulting issues...

I should probably dig it up and re-send my patch set.

Happy holidays,
Christian.

>
> Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-19 13:50 ` Christian König
@ 2025-12-19 15:58 ` Maxime Ripard
0 siblings, 0 replies; 12+ messages in thread
From: Maxime Ripard @ 2025-12-19 15:58 UTC (permalink / raw)
To: Christian König
Cc: T.J. Mercier, Eric Chanudet, Sumit Semwal, Benjamin Gaignard,
Brian Starkey, John Stultz, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On Fri, Dec 19, 2025 at 02:50:50PM +0100, Christian König wrote:
> On 12/19/25 11:25, Maxime Ripard wrote:
> > On Mon, Dec 15, 2025 at 03:53:22PM +0100, Christian König wrote:
> >> On 12/15/25 14:59, Maxime Ripard wrote:
> ...
> >>>>> The shared ownership is indeed broken, but it's not more or less broken
> >>>>> than, say, memfd + udmabuf, and I'm sure plenty of others.
> >>>>>
> >>>>> So we really improve the common case, but only make the "advanced"
> >>>>> slightly more broken than it already is.
> >>>>>
> >>>>> Would you disagree?
> >>>>
> >>>> I strongly disagree. As far as I can see there is a huge chance we
> >>>> break existing use cases with that.
> >>>
> >>> Which ones? And what about the ones that are already broken?
> >>
> >> Well everybody that expects that driver resources are *not* accounted to memcg.
> >
> > Which is a thing only because these buffers have never been accounted
> > for in the first place.
>
> Yeah, completely agree. By not accounting it for such a long time we
> ended up with people depending on this behavior.
>
> Not nice, but that's what it is.
>
> > So I guess the conclusion is that we shouldn't
> > even try to do memory accounting, because someone somewhere might not
> > expect that one of its applications would take too much RAM in the
> > system?
>
> Well we do need some kind of solution to the problem. Either having
> some setting where you say "This memcg limit is inclusive/exclusive
> device driver allocated memory" or have a completely separate limit
> for device driver allocated memory.

A device driver memory specific limit sounds like a good idea because it
would make it easier to bridge the gap with dmem.

Happy holidays,
Maxime
* Re: [PATCH] dma-buf: system_heap: account for system heap allocation in memcg
2025-12-19 10:19 ` Maxime Ripard
@ 2025-12-23 19:20 ` T.J. Mercier
0 siblings, 0 replies; 12+ messages in thread
From: T.J. Mercier @ 2025-12-23 19:20 UTC (permalink / raw)
To: Maxime Ripard
Cc: Eric Chanudet, Sumit Semwal, Benjamin Gaignard, Brian Starkey,
John Stultz, Christian Koenig, linux-media, dri-devel,
linaro-mm-sig, linux-kernel, open list:MEMORY MANAGEMENT
On Fri, Dec 19, 2025 at 7:19 PM Maxime Ripard <mripard@redhat.com> wrote:
>
> Hi,
>
> On Tue, Dec 16, 2025 at 11:06:59AM +0900, T.J. Mercier wrote:
> > On Mon, Dec 15, 2025 at 7:51 PM Maxime Ripard <mripard@redhat.com> wrote:
> > > On Fri, Dec 12, 2025 at 08:25:19AM +0900, T.J. Mercier wrote:
> > > > On Fri, Dec 12, 2025 at 4:31 AM Eric Chanudet <echanude@redhat.com> wrote:
> > > > >
> > > > > The system dma-buf heap lets userspace allocate buffers from the page
> > > > > allocator. However, these allocations are not accounted for in memcg,
> > > > > allowing processes to escape limits that may be configured.
> > > > >
> > > > > Pass the __GFP_ACCOUNT for our allocations to account them into memcg.
> > > >
> > > > We had a discussion just last night in the MM track at LPC about how
> > > > shared memory accounted in memcg is pretty broken. Without a way to
> > > > identify (and possibly transfer) ownership of a shared buffer, this
> > > > makes the accounting of shared memory, and zombie memcg problems
> > > > worse. :\
> > >
> > > Are there notes or a report from that discussion anywhere?
> >
> > The LPC vids haven't been clipped yet, and actually I can't even find
> > the recorded full live stream from Hall A2 on the first day. So I
> > don't think there's anything to look at, but I bet there's probably
> > nothing there you don't already know.
>
> Ack, thanks for looking at it still :)
>
> > > The way I see it, the dma-buf heaps *trivial* case is non-existent at
> > > the moment and that's definitely broken. Any application can bypass its
> > > cgroups limits trivially, and that's a pretty big hole in the system.
> >
> > Agree, but if we only charge the first allocator then limits can still
> > easily be bypassed assuming an app can cause an allocation outside of
> > its cgroup tree.
> >
> > I'm not sure using static memcg limits where a significant portion of
> > the memory can be shared is really feasible. Even with just pagecache
> > being charged to memcgs, we're having trouble defining a static memcg
> > limit that is really useful since it has to be high enough to
> > accommodate occasional spikes due to shared memory that might or might
> > not be charged (since it can only be charged to one memcg - it may be
> > spread around or it may all get charged to one memcg). So excessive
> > anonymous use has to get really bad before it gets punished.
> >
> > What I've been hearing lately is that folks are polling memory.stat or
> > PSI or other metrics and using that to take actions (memory.reclaim /
> > killing / adjust memory.high) at runtime rather than relying on
> > memory.high/max behavior with a static limit.
>
> But that's only side effects of a buffer being shared, right? (which,
> for a buffer sharing mechanism is still pretty important, but still)
>
> > > The shared ownership is indeed broken, but it's not more or less broken
> > > than, say, memfd + udmabuf, and I'm sure plenty of others.
> >
> > One thing that's worse about system heap buffers is that unlike memfd
> > the memory isn't reclaimable. So without killing all users there's
> > currently no way to deal with the zombie issue. Harry's proposing
> > reparenting, but I don't think our current interfaces support that
> > because we'd have to mess with the page structs behind system heap
> > dmabufs to change the memcg during reparenting.
> >
> > Ah... but udmabuf pins the memfd pages, so you're right that memfd +
> > udmabuf isn't worse.
> >
> > > So we really improve the common case, but only make the "advanced"
> > > slightly more broken than it already is.
> > >
> > > Would you disagree?
> >
> > I think memcg limits in this case just wouldn't be usable because of
> > what I mentioned above. In our common case the allocator is in a
> > different cgroup tree than the real users of the buffer.
>
> So, my issue with this is that we want to fix not only dma-buf itself,
> but every device buffer allocation mechanism, so also v4l2, drm, etc.
>
> So we'll need a lot of infrastructure and rework outside of dma-buf to
> get there, and figuring out how to solve the shared buffer accounting is
> indeed one of them, but was so far considered kind of the thing to do
> last, the last time we discussed it.
>
> What I get from that discussion is that we now consider it a
> prerequisite, and given how that topic has been advancing so far, one
> that would take a couple of years at best to materialize into something
> useful and upstream.
>
> Thus, it blocks all the work around it for years.
>
> Would you be open to merging patches that work on it but only enabled
> through a kernel parameter for example (and possibly taint the kernel?)?
> That would allow to work towards that goal while not being blocked by
> the shared buffer accounting, and not affecting the general case either.
>
> Maxime

Hi Maxime,

A kernel param or a CONFIG sounds like a good compromise to allow work
to progress. I'd be happy to add my R-B to that.
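
As a rough illustration of the opt-in idea discussed above (not code
from any posted patch), the gate could be as small as choosing the GFP
mask at allocation time. Everything named here is an assumption made
for the sketch: the `mem_accounting` switch, the `system_heap_gfp()`
helper and the `DEMO_*` bit values are hypothetical stand-ins so the
logic compiles in userspace. In the kernel the real GFP_HIGHUSER,
__GFP_ZERO and __GFP_ACCOUNT flags would be used, and the switch would
presumably be a module parameter or a CONFIG option.

```c
/*
 * Sketch only: userspace stand-in for gating memcg accounting of
 * system heap allocations behind an opt-in switch.
 */
#include <assert.h>
#include <stdbool.h>

typedef unsigned int demo_gfp_t;

/* Hypothetical bit values; they only mirror the names used in the patch. */
#define DEMO_GFP_HIGHUSER 0x1u /* stands in for GFP_HIGHUSER */
#define DEMO_GFP_ZERO     0x2u /* stands in for __GFP_ZERO */
#define DEMO_GFP_ACCOUNT  0x4u /* stands in for __GFP_ACCOUNT */

#define DEMO_LOW_ORDER_GFP (DEMO_GFP_HIGHUSER | DEMO_GFP_ZERO)

/* Hypothetical opt-in switch, off by default so current behavior is kept. */
static bool mem_accounting;

/* Return the base mask, adding the accounting flag only when opted in. */
static demo_gfp_t system_heap_gfp(demo_gfp_t base)
{
	/* Callers are expected to pass the unaccounted base mask. */
	assert((base & DEMO_GFP_ACCOUNT) == 0);

	return mem_accounting ? (base | DEMO_GFP_ACCOUNT) : base;
}
```

With the switch left off the helper returns the mask unchanged, so the
general case is unaffected; only systems that opt in would see heap
allocations charged to the allocating memcg.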
end of thread, other threads:[~2025-12-23 19:20 UTC | newest]
Thread overview: 12+ messages
[not found] <20251211193106.755485-2-echanude@redhat.com>
2025-12-11 23:25 ` [PATCH] dma-buf: system_heap: account for system heap allocation in memcg T.J. Mercier
2025-12-15 10:51 ` Maxime Ripard
2025-12-15 13:30 ` Christian König
2025-12-15 13:59 ` Maxime Ripard
2025-12-15 14:53 ` Christian König
2025-12-16 2:08 ` T.J. Mercier
2025-12-19 10:25 ` Maxime Ripard
2025-12-19 13:50 ` Christian König
2025-12-19 15:58 ` Maxime Ripard
2025-12-16 2:06 ` T.J. Mercier
2025-12-19 10:19 ` Maxime Ripard
2025-12-23 19:20 ` T.J. Mercier