* [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Daniel Sedlak @ 2025-10-07 12:50 UTC
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
    Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern,
    Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm, netdev,
    Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, cgroups,
    Tejun Heo, Michal Koutný
Cc: Daniel Sedlak, Matyas Hurtik

This patch is a result of our long-standing debug sessions, where it all
started as "networking is slow": TCP network throughput suddenly dropped
from tens of Gbps to a few Mbps, and we could not see anything in the
kernel log or the netstat counters.

Currently, we have two memory pressure counters for TCP sockets [1],
which we manipulate only when the memory pressure is signaled through
the proto struct [2]. However, the memory pressure can also be signaled
through the cgroup memory subsystem, which we do not reflect in the
netstat counters. In the end, when the cgroup memory subsystem signals
that it is under pressure, we silently reduce the advertised TCP window
with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
throughput reduction.

Keep in mind that when the cgroup memory subsystem signals socket memory
pressure for a given cgroup, it affects all sockets used in that cgroup,
including child cgroups.

This patch exposes a new file for each cgroup in sysfs which is a
read-only single value file showing how many microseconds this cgroup
contributed to throttling the throughput of network sockets. The file is
accessible at the following path:

  /sys/fs/cgroup/**/<cgroup name>/memory.net.throttled_usec

To summarize the proposed methods of hierarchical propagation of
memory.net.throttled_usec:

1) None - keeping the reported duration local to that cgroup:

     value = self

   This would not be too out of place, since memory.events.local already
   does not accumulate hierarchically. To determine whether sockets in a
   memcg were throttled, we would traverse the /sys/fs/cgroup/ hierarchy
   from the root to the cgroup of interest and sum those local durations.

2) Propagating the duration upwards (using rstat or simple iteration
   towards the root memcg during write):

     value = self + sum of children

   This is the most semantically consistent with other exposed stat files
   and could be added as an entry in memory.stat. However, since the
   pressure gets applied from ancestors to children (see
   mem_cgroup_under_socket_pressure()), determining the duration of
   throttling for sockets in some cgroup would be hardest in this
   variant: it would involve iterating from the root to the examined
   cgroup, subtracting at each node the values of its children from that
   node's value, and summing the results to obtain the total duration
   throttled.

3) Propagating the duration downwards (write only locally, read by
   traversing the hierarchy upwards):

     value = self + sum of ancestors

   This mirrors the logic used in mem_cgroup_under_socket_pressure(); an
   increase in the reported value for a memcg would coincide with more
   throttling being applied to the sockets of that memcg.

We chose variant 1, which is why it is a separate file instead of another
counter in memory.stat. Variant 2 seems to be the most fitting; however,
the calculated value would be misleading and hard to interpret.
Ideally, we would go with variant 3, as it mirrors the logic of
mem_cgroup_under_socket_pressure(), but the third variant can also be
calculated manually from variant 1, and thus we chose variant 1 as the
most versatile option that does not leak internal implementation details
which may change in the future.

Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
---
Sorry for the delay between the versions.

Changes:
v4 -> v5:
- Rebased
- Extend commit message with design decisions
- Rename cgroup counter
- Link to v4: https://lore.kernel.org/netdev/20250805064429.77876-1-daniel.sedlak@cdn77.com/

v3 -> v4:
- Add documentation
- Expose pressure as cumulative counter in microseconds
- Link to v3: https://lore.kernel.org/netdev/20250722071146.48616-1-daniel.sedlak@cdn77.com/

v2 -> v3:
- Expose the socket memory pressure on the cgroups instead of netstat
- Split patch
- Link to v2: https://lore.kernel.org/netdev/20250714143613.42184-1-daniel.sedlak@cdn77.com/

v1 -> v2:
- Add tracepoint
- Link to v1: https://lore.kernel.org/netdev/20250707105205.222558-1-daniel.sedlak@cdn77.com/

 Documentation/admin-guide/cgroup-v2.rst | 10 ++++++
 include/linux/memcontrol.h              | 41 +++++++++++++++----------
 mm/memcontrol.c                         | 31 ++++++++++++++++++-
 3 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 51c0bc4c2dc5..fe81a134c156 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1887,6 +1887,16 @@ The following nested keys are defined.
 	Shows pressure stall information for memory. See
 	:ref:`Documentation/accounting/psi.rst <psi>` for details.
 
+  memory.net.throttled_usec
+	A read-only single value file showing how many microseconds this cgroup
+	contributed to throttling the throughput of network sockets.
+
+	Socket throttling is applied to a cgroup and to all its children,
+	as a consequence of high reclaim pressure.
+
+	Observing throttling of sockets in a particular cgroup can be done
+	by checking this file for that cgroup and also for all its ancestors.
+
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fb27e3d2fdac..647fba7dcc8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -247,14 +247,19 @@ struct mem_cgroup {
 	atomic_t		kmem_stat;
 #endif
 	/*
-	 * Hint of reclaim pressure for socket memroy management. Note
+	 * Hint of reclaim pressure for socket memory management. Note
 	 * that this indicator should NOT be used in legacy cgroup mode
 	 * where socket memory is accounted/charged separately.
 	 */
 	u64	socket_pressure;
-#if BITS_PER_LONG < 64
+	/* memory.net.throttled_usec */
+	u64 socket_pressure_duration;
+#if BITS_PER_LONG >= 64
+	spinlock_t socket_pressure_spinlock;
+#else
 	seqlock_t socket_pressure_seqlock;
 #endif
+
 	int kmemcg_id;
 	/*
 	 * memcg->objcg is wiped out as a part of the objcg repaprenting
@@ -1607,19 +1612,33 @@ bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
 			  gfp_t gfp_mask);
 void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages);
 
-#if BITS_PER_LONG < 64
 static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
 {
-	u64 val = get_jiffies_64() + HZ;
 	unsigned long flags;
 
+#if BITS_PER_LONG >= 64
+	spin_lock_irqsave(&memcg->socket_pressure_spinlock, flags);
+#else
 	write_seqlock_irqsave(&memcg->socket_pressure_seqlock, flags);
-	memcg->socket_pressure = val;
+#endif
+	u64 old_socket_pressure = memcg->socket_pressure;
+	u64 new_socket_pressure = get_jiffies_64() + HZ;
+
+	memcg->socket_pressure = new_socket_pressure;
+	memcg->socket_pressure_duration += jiffies_to_usecs(
+			min(new_socket_pressure - old_socket_pressure, HZ));
+#if BITS_PER_LONG >= 64
+	spin_unlock_irqrestore(&memcg->socket_pressure_spinlock, flags);
+#else
 	write_sequnlock_irqrestore(&memcg->socket_pressure_seqlock, flags);
+#endif
 }
 
 static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
 {
+#if BITS_PER_LONG >= 64
+	return READ_ONCE(memcg->socket_pressure);
+#else
 	unsigned int seq;
 	u64 val;
 
@@ -1629,18 +1648,8 @@ static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
 	} while (read_seqretry(&memcg->socket_pressure_seqlock, seq));
 
 	return val;
-}
-#else
-static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
-{
-	WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
-}
-
-static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
-{
-	return READ_ONCE(memcg->socket_pressure);
-}
 #endif
+}
 
 int alloc_shrinker_info(struct mem_cgroup *memcg);
 void free_shrinker_info(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df3e9205c9e6..d29147223822 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3755,7 +3755,10 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	INIT_LIST_HEAD(&memcg->swap_peaks);
 	spin_lock_init(&memcg->peaks_lock);
 	memcg->socket_pressure = get_jiffies_64();
-#if BITS_PER_LONG < 64
+	memcg->socket_pressure_duration = 0;
+#if BITS_PER_LONG >= 64
+	spin_lock_init(&memcg->socket_pressure_spinlock);
+#else
 	seqlock_init(&memcg->socket_pressure_seqlock);
 #endif
 	memcg1_memcg_init(memcg);
@@ -4579,6 +4582,27 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+static int memory_net_throttled_usec_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+	u64 throttled_usec;
+
+#if BITS_PER_LONG >= 64
+	throttled_usec = READ_ONCE(memcg->socket_pressure_duration);
+#else
+	unsigned int seq;
+
+	do {
+		seq = read_seqbegin(&memcg->socket_pressure_seqlock);
+		throttled_usec = memcg->socket_pressure_duration;
+	} while (read_seqretry(&memcg->socket_pressure_seqlock, seq));
+#endif
+
+	seq_printf(m, "%llu\n", throttled_usec);
+
+	return 0;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -4650,6 +4674,11 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	{
+		.name = "net.throttled_usec",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_net_throttled_usec_show,
+	},
 	{ }	/* terminate */
 };
base-commit: 312e6f7676e63bbb9b81e5c68e580a9f776cc6f0
-- 
2.39.5
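For illustration only (not part of the patch): the cover letter notes that
the variant-3 view can be derived manually from the variant-1 values. A
minimal userspace sketch of that derivation, summing the proposed
memory.net.throttled_usec over a cgroup and its ancestors; the cgroup2
mount point, the example cgroup path, and the error handling are
simplifying assumptions:

/* Sum memory.net.throttled_usec for a cgroup and all of its ancestors,
 * i.e. reconstruct "variant 3" from the "variant 1" values exposed by
 * the patch. Assumes cgroup2 is mounted at /sys/fs/cgroup.
 */
#include <limits.h>
#include <stdio.h>
#include <string.h>

static unsigned long long read_throttled_usec(const char *cgrp)
{
	char path[PATH_MAX];
	unsigned long long val = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup%s/memory.net.throttled_usec", cgrp);
	f = fopen(path, "r");
	if (!f)
		return 0;	/* e.g. the root cgroup has no such file */
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(int argc, char **argv)
{
	/* usage (illustrative): ./throttled /system.slice/nginx.service */
	char cgrp[PATH_MAX];
	unsigned long long total = 0;
	char *slash;

	if (argc != 2)
		return 1;
	snprintf(cgrp, sizeof(cgrp), "%s", argv[1]);
	for (;;) {
		total += read_throttled_usec(cgrp);
		slash = strrchr(cgrp, '/');
		if (!slash || slash == cgrp)
			break;	/* reached the top-level cgroup */
		*slash = '\0';	/* step up to the parent */
	}
	printf("sockets in %s throttled for %llu usec\n", argv[1], total);
	return 0;
}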
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Tejun Heo @ 2025-10-07 20:01 UTC
To: Daniel Sedlak

On Tue, Oct 07, 2025 at 02:50:56PM +0200, Daniel Sedlak wrote:
...
> 1) None - keeping the reported duration local to that cgroup:
> 2) Propagating the duration upwards (using rstat or simple iteration
>    towards the root memcg during write):
> 3) Propagating the duration downwards (write only locally, read by
>    traversing the hierarchy upwards):
...
> We chose variant 1, which is why it is a separate file instead of another
> counter in memory.stat. Variant 2 seems to be the most fitting; however,
> the calculated value would be misleading and hard to interpret. Ideally,
> we would go with variant 3, as it mirrors the logic of
> mem_cgroup_under_socket_pressure(), but the third variant can also be
> calculated manually from variant 1, and thus we chose variant 1 as the
> most versatile option that does not leak internal implementation details
> which may change in the future.

I'm not against going with 1), but let's not do a separate file for this.
Can't you do memory.stat.local? It'd be better to have the aggregation in
memory.stat, but we can worry about that later.

Thanks.

-- 
tejun
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Matyas Hurtik @ 2025-10-08 12:46 UTC
To: Tejun Heo, Daniel Sedlak

Hello,

> I'm not against going with 1), but let's not do a separate file for this.
> Can't you do memory.stat.local?

I can't find memory.stat.local, so should we create it and add the counter
as an entry there?

Regarding the code, is there anything you would like us to improve? I had
to rewrite it a bit because of the recent changes.

Thanks,
Matyas
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Tejun Heo @ 2025-10-08 18:17 UTC
To: Matyas Hurtik

On Wed, Oct 08, 2025 at 02:46:23PM +0200, Matyas Hurtik wrote:
> Hello,
>
> > I'm not against going with 1), but let's not do a separate file for this.
> > Can't you do memory.stat.local?
>
> I can't find memory.stat.local, so should we create it and add the counter
> as an entry there?

Yes.

> Regarding the code, is there anything you would like us to improve? I had
> to rewrite it a bit because of the recent changes.

I think the memcg folks will have a much better idea on the actual code.
Let's wait for them to chime in.

Thanks.

-- 
tejun
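As context for the suggestion above: memory.stat.local does not exist
upstream, so the following is a purely hypothetical sketch of how the
counter introduced by the patch could be exposed as an entry in such a
file instead of a dedicated one. The function and file names are
invented, and the 32-bit seqlock handling from the patch is omitted for
brevity:

/* Hypothetical sketch only -- memory.stat.local is not an existing file.
 * One possible shape for exposing the local (non-hierarchical) counter
 * there, reusing the value introduced by this patch.
 */
static int memory_stat_local_show(struct seq_file *m, void *v)
{
	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);

	/* only local counters; no rstat flush, no hierarchical aggregation */
	seq_printf(m, "sock_throttled_usec %llu\n",
		   (unsigned long long)READ_ONCE(memcg->socket_pressure_duration));
	return 0;
}

/* registered next to the other entries in memory_files[] */
static struct cftype memory_stat_local_file = {
	.name		= "stat.local",
	.flags		= CFTYPE_NOT_ON_ROOT,
	.seq_show	= memory_stat_local_show,
};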
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-08 18:58 UTC
To: Daniel Sedlak

Daniel Sedlak <daniel.sedlak@cdn77.com> writes:

> This patch is a result of our long-standing debug sessions, where it all
> started as "networking is slow": TCP network throughput suddenly dropped
> from tens of Gbps to a few Mbps, and we could not see anything in the
> kernel log or the netstat counters.
>
[...]
>
> This patch exposes a new file for each cgroup in sysfs which is a
> read-only single value file showing how many microseconds this cgroup
> contributed to throttling the throughput of network sockets. The file is
> accessible at the following path:
>
>   /sys/fs/cgroup/**/<cgroup name>/memory.net.throttled_usec

Hi Daniel!

How is this value going to be used? In other words, do you need an exact
number, or would something like memory.events::net_throttled be enough
for your case?

Thanks!
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Daniel Sedlak @ 2025-10-09 14:44 UTC
To: Roman Gushchin

Hi Roman,

On 10/8/25 8:58 PM, Roman Gushchin wrote:
>> This patch exposes a new file for each cgroup in sysfs which is a
>> read-only single value file showing how many microseconds this cgroup
>> contributed to throttling the throughput of network sockets. The file is
>> accessible at the following path:
>>
>>   /sys/fs/cgroup/**/<cgroup name>/memory.net.throttled_usec
>
> Hi Daniel!
>
> How is this value going to be used? In other words, do you need an exact
> number, or would something like memory.events::net_throttled be enough
> for your case?

Just incrementing a counter each time vmpressure() triggers IMO provides
bad semantics of what is actually happening, because it can hide important
details, mainly the _time_ for how long the network traffic was slowed
down.

For example, when memory.events::net_throttled=1000, it can mean that the
network was slowed down for 1 second, for 1000 seconds, or for anything in
between; the memory.net.throttled_usec proposed by this patch
disambiguates that.

In addition, v1/v2 of this series started that way; from v3 on we rewrote
it to calculate the duration instead, which proved to be better
information for debugging, as it is easier to understand the implications.

Thanks!
Daniel
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-09 15:32 UTC
To: Daniel Sedlak

Daniel Sedlak <daniel.sedlak@cdn77.com> writes:

> On 10/8/25 8:58 PM, Roman Gushchin wrote:
>> How is this value going to be used? In other words, do you need an exact
>> number, or would something like memory.events::net_throttled be enough
>> for your case?
>
> Just incrementing a counter each time vmpressure() triggers IMO provides
> bad semantics of what is actually happening, because it can hide important
> details, mainly the _time_ for how long the network traffic was slowed
> down.
>
> For example, when memory.events::net_throttled=1000, it can mean that the
> network was slowed down for 1 second, for 1000 seconds, or for anything in
> between; the memory.net.throttled_usec proposed by this patch
> disambiguates that.
>
> In addition, v1/v2 of this series started that way; from v3 on we rewrote
> it to calculate the duration instead, which proved to be better
> information for debugging, as it is easier to understand the implications.

But how are you planning to use this information? Is this just
"networking is under pressure for a non-trivial amount of time ->
raise the memcg limit", or something more complicated?

I am a bit concerned about making this metric part of the cgroup API
simply because it's too implementation-defined and, in my opinion, lacks
fundamental meaning.

Vmpressure is calculated based on the scanned/reclaimed ratio (which is
also not always the best proxy for the memory pressure level); then, if it
reaches some level, we basically throttle networking for 1s. So it's all
very arbitrary.

I totally get it from the debugging perspective, but I am not sure about
its usefulness as a permanent metric. This is why I'm asking if there are
lighter alternatives, e.g. memory.events or maybe even tracepoints.

Thanks!
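For reference, a condensed, standalone paraphrase of the vmpressure math
being described above (mm/vmpressure.c in recent kernels). The constants
are the upstream defaults, but the code below is a simplification for
illustration, not the kernel implementation, and the sample inputs come
from the bpftrace dump posted later in this thread:

#include <stdbool.h>
#include <stdio.h>

/* Reclaim efficiency is sampled over a window of scanned pages; anything
 * above the "low" pressure level arms memcg->socket_pressure for HZ
 * jiffies (i.e. one second of socket throttling). */
#define VMPRESSURE_WIN        512UL   /* SWAP_CLUSTER_MAX (32) * 16 pages */
#define VMPRESSURE_LEVEL_MED   60UL   /* >= medium means "above low"      */

static bool throttles_sockets(unsigned long scanned, unsigned long reclaimed)
{
	unsigned long pressure;

	if (scanned < VMPRESSURE_WIN)
		return false;			/* window not full yet */
	/* 0 == everything scanned was reclaimed, 100 == nothing was */
	pressure = 100 - (100 * reclaimed / scanned);
	return pressure >= VMPRESSURE_LEVEL_MED; /* then: socket_pressure = jiffies + HZ */
}

int main(void)
{
	printf("%d\n", throttles_sockets(556, 146));	/* poor efficiency -> 1 */
	printf("%d\n", throttles_sockets(22886, 13606)); /* below medium   -> 0 */
	return 0;
}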
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Shakeel Butt @ 2025-10-09 16:06 UTC
To: Roman Gushchin

On Thu, Oct 09, 2025 at 08:32:27AM -0700, Roman Gushchin wrote:
[...]
> I am a bit concerned about making this metric part of the cgroup API
> simply because it's too implementation-defined and, in my opinion, lacks
> fundamental meaning.
>
> Vmpressure is calculated based on the scanned/reclaimed ratio (which is
> also not always the best proxy for the memory pressure level); then, if it
> reaches some level, we basically throttle networking for 1s. So it's all
> very arbitrary.
>
> I totally get it from the debugging perspective, but I am not sure about
> its usefulness as a permanent metric. This is why I'm asking if there are
> lighter alternatives, e.g. memory.events or maybe even tracepoints.

I have a very similar opinion: if we expose the current implementation
detail through a stable interface, we might get stuck with this
implementation, and I do want to change it in the future.

Coming back to what information we should expose that will be helpful for
Daniel & Matyas and beneficial in general: after giving it some thought, I
think the time "the network was slowed down", or more specifically the
time window during which mem_cgroup_sk_under_memory_pressure() returns
true, might not be that useful without the actual network activity.
Basically, if no one is calling mem_cgroup_sk_under_memory_pressure() and
acting on it, the time window is not that useful.

How about we track the actions taken by the callers of
mem_cgroup_sk_under_memory_pressure()? Basically, if the network stack
reduces the buffer size, or whatever other actions it may take when
mem_cgroup_sk_under_memory_pressure() returns true, tracking those actions
is what I think is needed here, at least for the debugging use case.

WDYT?
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-09 17:58 UTC
To: Shakeel Butt

Shakeel Butt <shakeel.butt@linux.dev> writes:
[...]
> How about we track the actions taken by the callers of
> mem_cgroup_sk_under_memory_pressure()? Basically, if the network stack
> reduces the buffer size, or whatever other actions it may take when
> mem_cgroup_sk_under_memory_pressure() returns true, tracking those actions
> is what I think is needed here, at least for the debugging use case.
>
> WDYT?

I feel like if it's mostly intended for debugging purposes, a combination
of a tracepoint and bpftrace can work pretty well, so there is no need to
create a new sysfs interface.
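To make the tracepoint idea concrete: no such tracepoint exists upstream;
the following is a purely hypothetical sketch, with invented names and
fields, of an event that could be placed where vmpressure arms
memcg->socket_pressure and then consumed with bpftrace or perf (the usual
TRACE_SYSTEM/define_trace.h header boilerplate is omitted):

/* Hypothetical tracepoint sketch (not in any kernel tree). */
#include <linux/tracepoint.h>

TRACE_EVENT(memcg_socket_pressure,

	TP_PROTO(u64 cgroup_id, unsigned long scanned, unsigned long reclaimed),

	TP_ARGS(cgroup_id, scanned, reclaimed),

	TP_STRUCT__entry(
		__field(u64,		cgroup_id)
		__field(unsigned long,	scanned)
		__field(unsigned long,	reclaimed)
	),

	TP_fast_assign(
		__entry->cgroup_id	= cgroup_id;
		__entry->scanned	= scanned;
		__entry->reclaimed	= reclaimed;
	),

	TP_printk("cgroup_id=%llu scanned=%lu reclaimed=%lu",
		  __entry->cgroup_id, __entry->scanned, __entry->reclaimed)
);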
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Shakeel Butt @ 2025-10-09 18:32 UTC
To: Roman Gushchin

On Thu, Oct 09, 2025 at 10:58:51AM -0700, Roman Gushchin wrote:
[...]
> I feel like if it's mostly intended for debugging purposes, a combination
> of a tracepoint and bpftrace can work pretty well, so there is no need to
> create a new sysfs interface.

Definitely not a new interface, but I think having such information in
memory.events or memory.stat would be more convenient. Basically, the
number of times the sockets in this memcg have to be clamped due to memory
pressure would be useful in general.
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-09 19:02 UTC
To: Shakeel Butt

Shakeel Butt <shakeel.butt@linux.dev> writes:
[...]
> Definitely not a new interface, but I think having such information in
> memory.events or memory.stat would be more convenient. Basically, the
> number of times the sockets in this memcg have to be clamped due to memory
> pressure would be useful in general.

Yeah, if we're going to add something, memory.events looks like the best
option, also because it allows polling and getting notified when the event
occurs.
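For context on the notification mechanism mentioned above: cgroup v2 core
files such as memory.events generate a file-modified event when their
content changes, which userspace can consume with inotify or with poll().
A small sketch of the poll() variant; the cgroup path is illustrative:

/* Wait for changes to a cgroup's memory.events; kernfs raises
 * POLLPRI/POLLERR on such files when their content changes.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct pollfd pfd;
	ssize_t len;

	pfd.fd = open("/sys/fs/cgroup/system.slice/nginx.service/memory.events",
		      O_RDONLY);
	if (pfd.fd < 0)
		return 1;
	pfd.events = POLLPRI;

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		/* re-read the whole file after each notification */
		lseek(pfd.fd, 0, SEEK_SET);
		len = read(pfd.fd, buf, sizeof(buf) - 1);
		if (len <= 0)
			break;
		buf[len] = '\0';
		fputs(buf, stdout);	/* e.g. watch for a throttling key */
	}
	close(pfd.fd);
	return 0;
}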
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Daniel Sedlak @ 2025-10-13 14:30 UTC
To: Roman Gushchin, Shakeel Butt

On 10/9/25 9:02 PM, Roman Gushchin wrote:
[...]
>>>>> But how are you planning to use this information? Is this just
>>>>> "networking is under pressure for a non-trivial amount of time ->
>>>>> raise the memcg limit", or something more complicated?

We plan to use it mostly for observability purposes and to better
understand which traffic patterns affect the socket pressure the most (so
we can try to fix/delay/improve it). We do not know how commonly this
issue appears in other deployments, but in our deployment many of our
servers were affected by this slowdown, and they varied in terms of
hardware and software configuration. Currently, it is very hard to detect
whether a socket is under pressure without using tools like bpftrace, so
we would like to expose this metric in a more accessible way. So in the
end, we do not really care in which file this "socket pressure happened"
notification will be stored.

>>>>> I totally get it from the debugging perspective, but I am not sure
>>>>> about its usefulness as a permanent metric. This is why I'm asking if
>>>>> there are lighter alternatives, e.g. memory.events or maybe even
>>>>> tracepoints.

If the combination of memory.events(.local) and tracepoint hook(s) is okay
with you, we can use that and export the same information as in the
current patch version. We can incorporate that into the next version.

Also, would it be possible to make the socket pressure signal
configurable, e.g. via a sysctl or per cgroup, so that it does not trigger
the socket pressure signal? I cannot find the reasoning why this
throttling cannot be opted out of (maybe it can).

>>>> How about we track the actions taken by the callers of
>>>> mem_cgroup_sk_under_memory_pressure()? Basically, if the network stack
>>>> reduces the buffer size, or whatever other actions it may take when
>>>> mem_cgroup_sk_under_memory_pressure() returns true, tracking those
>>>> actions is what I think is needed here, at least for the debugging
>>>> use case.

I am not against it, but I feel that conveying those tracked actions (or
how to represent them) to the user will be much harder. Are there already
existing APIs to push this information to the user?

Thanks!
Daniel
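Purely to make the opt-out question concrete: no such knob exists
upstream. The following is a hypothetical sketch of how a per-memcg
boolean (say, set through an invented memory.net.pressure_enabled file)
could gate the existing ancestor walk; the field name is invented and the
cgroup v1 branch of the real helper is omitted:

/* Hypothetical sketch only -- socket_pressure_enabled does not exist. */
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
	do {
		/* invented opt-out: never throttle sockets in this subtree */
		if (!READ_ONCE(memcg->socket_pressure_enabled))
			return false;
		if (time_before64(get_jiffies_64(),
				  READ_ONCE(memcg->socket_pressure)))
			return true;
	} while ((memcg = parent_mem_cgroup(memcg)));
	return false;
}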
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-14 1:43 UTC
To: Daniel Sedlak

Daniel Sedlak <daniel.sedlak@cdn77.com> writes:
[...]
> If the combination of memory.events(.local) and tracepoint hook(s) is okay
> with you, we can use that and export the same information as in the
> current patch version. We can incorporate that into the next version.

In my opinion:
tracepoint > memory.events entry > memory.stat entry > new cgroupfs file.

> Also, would it be possible to make the socket pressure signal
> configurable, e.g. via a sysctl or per cgroup, so that it does not trigger
> the socket pressure signal? I cannot find the reasoning why this
> throttling cannot be opted out of (maybe it can).

It's a good point.

First, I think the vmpressure implementation is not the best and we might
want to switch to PSI (or something else) there. This is why I'm resistant
to exposing implementation-specific metrics.

That said, I believe that some level of customization here is justified.
Maybe opting out completely is too much, but in the end it's hard for the
kernel to balance the importance of e.g. page cache vs networking buffers,
as it might be really workload-dependent. Or some workloads would prefer
to risk being OOM-killed rather than tolerate sub-par networking
performance.

Thanks!
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Daniel Sedlak @ 2025-10-14 13:58 UTC
To: Roman Gushchin

On 10/14/25 3:43 AM, Roman Gushchin wrote:
[...]
> In my opinion:
> tracepoint > memory.events entry > memory.stat entry > new cgroupfs file.

Thanks, noted; we will incorporate it into the next version.

>> Also, would it be possible to make the socket pressure signal
>> configurable, e.g. via a sysctl or per cgroup, so that it does not
>> trigger the socket pressure signal? I cannot find the reasoning why this
>> throttling cannot be opted out of (maybe it can).
>
> It's a good point.
>
> First, I think the vmpressure implementation is not the best and we might
> want to switch to PSI (or something else) there. This is why I'm resistant
> to exposing implementation-specific metrics.
>
> That said, I believe that some level of customization here is justified.
> Maybe opting out completely is too much, but in the end it's hard for the
> kernel to balance the importance of e.g. page cache vs networking buffers,
> as it might be really workload-dependent. Or some workloads would prefer
> to risk being OOM-killed rather than tolerate sub-par networking
> performance.

As of now, socket pressure throttling can be disabled by moving the
processes causing the pressure into the root cgroup, so we would
definitely benefit from being able to disable it more idiomatically.

This bpftrace output was captured on a production server running an nginx
proxy, which we use as an HTTP cache (the left-most column is a timestamp
in nanoseconds). As you can see, it fluctuates a lot.

26920285712831843, unified:/system.slice/nginx.service, scanned: 556, reclaimed: 146, pressure: 73
26920285731493743, unified:/system.slice/nginx.service, scanned: 22886, reclaimed: 13606, pressure: 40
26920285779559500, unified:/system.slice/nginx.service, scanned: 21775, reclaimed: 11525, pressure: 47
26920285784845147, unified:/system.slice/nginx.service, scanned: 698, reclaimed: 522, pressure: 25
26920285833808666, unified:/system.slice/nginx.service, scanned: 740, reclaimed: 232, pressure: 68
26920285835668081, unified:/system.slice/nginx.service, scanned: 1475, reclaimed: 1224, pressure: 17
26920285838877445, unified:/system.slice/nginx.service, scanned: 2919, reclaimed: 2334, pressure: 20
26920285854811898, unified:/system.slice/nginx.service, scanned: 11586, reclaimed: 7666, pressure: 33
26920285873634643, unified:/system.slice/nginx.service, scanned: 22898, reclaimed: 13387, pressure: 41
26920285899176135, unified:/system.slice/nginx.service, scanned: 10957, reclaimed: 7077, pressure: 35
26920285901529378, unified:/system.slice/nginx.service, scanned: 587, reclaimed: 156, pressure: 73
26920286020702357, unified:/system.slice/nginx.service, scanned: 563, reclaimed: 87, pressure: 84
26920286037434038, unified:/system.slice/nginx.service, scanned: 22072, reclaimed: 14161, pressure: 35
26920285789562313, unified:/system.slice/nginx.service, scanned: 2810, reclaimed: 1696, pressure: 39
26920285879597883, unified:/system.slice/nginx.service, scanned: 693, reclaimed: 625, pressure: 9
26920285884686863, unified:/system.slice/nginx.service, scanned: 2768, reclaimed: 2284, pressure: 17

We believe that the issue originates from suboptimally chosen constants,
as seen in [1]. Currently, vmpressure triggers when it cannot reclaim a
few MiB of memory on a server that has over 500 GiB of memory.

Link: https://elixir.bootlin.com/linux/v6.17.1/source/mm/vmpressure.c#L38 [1]

We would like to work on that more after this patch, to try to find a
better constant or at least make it _more configurable_, if that makes
sense to you.

Thanks!
Daniel
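For scale, the worked numbers behind the "a few MiB" observation above,
assuming 4 KiB pages:

  vmpressure_win = SWAP_CLUSTER_MAX * 16 = 32 * 16 = 512 pages ~= 2 MiB
  2 MiB / 512 GiB ~= 0.0004 % of the machine's memory per sampling window

so a single inefficient reclaim pass over roughly 2 MiB is enough to mark
a memcg on a ~500 GiB machine as under socket pressure for a full second.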
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Shakeel Butt @ 2025-10-14 20:32 UTC
To: Daniel Sedlak

On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote:
[...]
> > > > > How about we track the actions taken by the callers of
> > > > > mem_cgroup_sk_under_memory_pressure()? Basically, if the network
> > > > > stack reduces the buffer size, or whatever other actions it may
> > > > > take when mem_cgroup_sk_under_memory_pressure() returns true,
> > > > > tracking those actions is what I think is needed here, at least
> > > > > for the debugging use case.
>
> I am not against it, but I feel that conveying those tracked actions (or
> how to represent them) to the user will be much harder. Are there already
> existing APIs to push this information to the user?

I discussed this with Wei Wang, and she suggested we should start by
tracking the calls to tcp_adjust_rcv_ssthresh() first. So, something like
the following. I would like feedback from networking folks as well:

From 54bd2bf6681c1c694295646532f2a62a205ee41a Mon Sep 17 00:00:00 2001
From: Shakeel Butt <shakeel.butt@linux.dev>
Date: Tue, 14 Oct 2025 13:27:36 -0700
Subject: [PATCH] memcg: track network throttling due to memcg memory pressure

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 1 +
 mm/memcontrol.c            | 2 ++
 net/ipv4/tcp_input.c       | 5 ++++-
 net/ipv4/tcp_output.c      | 8 ++++++--
 4 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 873e510d6f8d..5fe254813123 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -52,6 +52,7 @@ enum memcg_memory_event {
 	MEMCG_SWAP_HIGH,
 	MEMCG_SWAP_MAX,
 	MEMCG_SWAP_FAIL,
+	MEMCG_SOCK_THROTTLED,
 	MEMCG_NR_MEMORY_EVENTS,
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4deda33625f4..9207bba34e2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4463,6 +4463,8 @@ static void __memory_events_show(struct seq_file *m, atomic_long_t *events)
 		   atomic_long_read(&events[MEMCG_OOM_KILL]));
 	seq_printf(m, "oom_group_kill %lu\n",
 		   atomic_long_read(&events[MEMCG_OOM_GROUP_KILL]));
+	seq_printf(m, "sock_throttled %lu\n",
+		   atomic_long_read(&events[MEMCG_SOCK_THROTTLED]));
 }
 
 static int memory_events_show(struct seq_file *m, void *v)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 31ea5af49f2d..2206968fb505 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -713,6 +713,7 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb,
 		 * Adjust rcv_ssthresh according to reserved mem
 		 */
 		tcp_adjust_rcv_ssthresh(sk);
+		memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
 	}
 }
 
@@ -5764,8 +5765,10 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 
 	if (!tcp_can_ingest(sk, in_skb))
 		tcp_clamp_window(sk);
-	else if (tcp_under_memory_pressure(sk))
+	else if (tcp_under_memory_pressure(sk)) {
 		tcp_adjust_rcv_ssthresh(sk);
+		memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
+	}
 
 	if (tcp_can_ingest(sk, in_skb))
 		return 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bb3576ac0ad7..8fe8d973d7ac 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3275,8 +3275,10 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_under_memory_pressure(sk))
+		if (tcp_under_memory_pressure(sk)) {
 			tcp_adjust_rcv_ssthresh(sk);
+			memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
+		}
 
 		/* free_space might become our new window, make sure we don't
 		 * increase it due to wscale.
@@ -3334,8 +3336,10 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_under_memory_pressure(sk))
+		if (tcp_under_memory_pressure(sk)) {
 			tcp_adjust_rcv_ssthresh(sk);
+			memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
+		}
 
 		/* if free space is too low, return a zero window */
 		if (free_space < (allowed_space >> 4) || free_space < mss ||
-- 
2.47.3
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-14 20:32 ` Shakeel Butt @ 2025-10-15 13:57 ` Daniel Sedlak 2025-10-15 18:36 ` Shakeel Butt 2025-10-15 18:21 ` Kuniyuki Iwashima 1 sibling, 1 reply; 21+ messages in thread From: Daniel Sedlak @ 2025-10-15 13:57 UTC (permalink / raw) To: Shakeel Butt Cc: Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On 10/14/25 10:32 PM, Shakeel Butt wrote: > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > [...] >>>>>> How about we track the actions taken by the callers of >>>>>> mem_cgroup_sk_under_memory_pressure()? Basically if network stack >>>>>> reduces the buffer size or whatever the other actions it may take when >>>>>> mem_cgroup_sk_under_memory_pressure() returns, tracking those actions >>>>>> is what I think is needed here, at least for the debugging use-case. >> >> I am not against it, but I feel that conveying those tracked actions (or how >> to represent them) to the user will be much harder. Are there already >> existing APIs to push this information to the user? >> > > I discussed with Wei Wang and she suggested we should start tracking the > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > following. I would like feedback frm networking folks as well: Looks like a good start. Are you planning on sending this patch separately, or can we include it in our v6 (with maybe slight modifications)? > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 873e510d6f8d..5fe254813123 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -52,6 +52,7 @@ enum memcg_memory_event { > MEMCG_SWAP_HIGH, > MEMCG_SWAP_MAX, > MEMCG_SWAP_FAIL, > + MEMCG_SOCK_THROTTLED, This probably should be MEMCG_TCP_SOCK_THROTTLED, because it checks only tcp_under_memory_pressure, however there is also the sk_under_memory_pressure used in net/sctp/sm_statefuns.c:6597 to also reduce the sending rate. Or also add the counter there and keep the name? Thanks! Daniel ^ permalink raw reply [flat|nested] 21+ messages in thread
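One way to square the naming question with the sctp case is to keep the event protocol-agnostic and route all call sites through a tiny wrapper. The sketch below is hypothetical (the helper name is made up, only MEMCG_SOCK_THROTTLED comes from the posted patch); both the tcp call sites above and the sk_under_memory_pressure() path in net/sctp/sm_statefuns.c could then share it, and it also guards against sockets with no memcg attached:

	/*
	 * Hypothetical helper, not part of the posted patch: record the
	 * memcg throttling event for any protocol that backs off because
	 * of socket memory pressure. Skips sockets without memcg socket
	 * accounting.
	 */
	static inline void sk_memcg_note_throttled(struct sock *sk)
	{
		if (mem_cgroup_sockets_enabled && sk->sk_memcg)
			memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
	}

With something along these lines, the event could keep its generic name and the sctp rate-reduction path would only need one extra call.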
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-15 13:57 ` Daniel Sedlak @ 2025-10-15 18:36 ` Shakeel Butt 0 siblings, 0 replies; 21+ messages in thread From: Shakeel Butt @ 2025-10-15 18:36 UTC (permalink / raw) To: Daniel Sedlak Cc: Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On Wed, Oct 15, 2025 at 03:57:29PM +0200, Daniel Sedlak wrote: > On 10/14/25 10:32 PM, Shakeel Butt wrote: > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > > [...] > > > > > > > How about we track the actions taken by the callers of > > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack > > > > > > > reduces the buffer size or whatever the other actions it may take when > > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions > > > > > > > is what I think is needed here, at least for the debugging use-case. > > > > > > I am not against it, but I feel that conveying those tracked actions (or how > > > to represent them) to the user will be much harder. Are there already > > > existing APIs to push this information to the user? > > > > > > > I discussed with Wei Wang and she suggested we should start tracking the > > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > > following. I would like feedback frm networking folks as well: > > Looks like a good start. Are you planning on sending this patch separately, > or can we include it in our v6 (with maybe slight modifications)? What else you are planning to add in v6? > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 873e510d6f8d..5fe254813123 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -52,6 +52,7 @@ enum memcg_memory_event { > > MEMCG_SWAP_HIGH, > > MEMCG_SWAP_MAX, > > MEMCG_SWAP_FAIL, > > + MEMCG_SOCK_THROTTLED, > > This probably should be MEMCG_TCP_SOCK_THROTTLED, because it checks only > tcp_under_memory_pressure, however there is also the > sk_under_memory_pressure used in net/sctp/sm_statefuns.c:6597 to also reduce > the sending rate. Or also add the counter there and keep the name? Yeah makes sense to add the counter in sctp as well. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-14 20:32 ` Shakeel Butt 2025-10-15 13:57 ` Daniel Sedlak @ 2025-10-15 18:21 ` Kuniyuki Iwashima 2025-10-15 18:39 ` Shakeel Butt 1 sibling, 1 reply; 21+ messages in thread From: Kuniyuki Iwashima @ 2025-10-15 18:21 UTC (permalink / raw) To: Shakeel Butt Cc: Daniel Sedlak, Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On Tue, Oct 14, 2025 at 1:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > [...] > > > > > > How about we track the actions taken by the callers of > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack > > > > > > reduces the buffer size or whatever the other actions it may take when > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions > > > > > > is what I think is needed here, at least for the debugging use-case. > > > > I am not against it, but I feel that conveying those tracked actions (or how > > to represent them) to the user will be much harder. Are there already > > existing APIs to push this information to the user? > > > > I discussed with Wei Wang and she suggested we should start tracking the > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > following. I would like feedback frm networking folks as well: I think we could simply put memcg_memory_event() in mem_cgroup_sk_under_memory_pressure() when it returns true. Other than tcp_adjust_rcv_ssthresh(), if tcp_under_memory_pressure() returns true, it indicates something bad will happen, failure to expand rcvbuf and sndbuf, need to prune out-of-order queue more aggressively, FIN deferred to a retransmitted packet. Also, we could cover mptcp and sctp too. 
> > > From 54bd2bf6681c1c694295646532f2a62a205ee41a Mon Sep 17 00:00:00 2001 > From: Shakeel Butt <shakeel.butt@linux.dev> > Date: Tue, 14 Oct 2025 13:27:36 -0700 > Subject: [PATCH] memcg: track network throttling due to memcg memory pressure > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > --- > include/linux/memcontrol.h | 1 + > mm/memcontrol.c | 2 ++ > net/ipv4/tcp_input.c | 5 ++++- > net/ipv4/tcp_output.c | 8 ++++++-- > 4 files changed, 13 insertions(+), 3 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 873e510d6f8d..5fe254813123 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -52,6 +52,7 @@ enum memcg_memory_event { > MEMCG_SWAP_HIGH, > MEMCG_SWAP_MAX, > MEMCG_SWAP_FAIL, > + MEMCG_SOCK_THROTTLED, > MEMCG_NR_MEMORY_EVENTS, > }; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 4deda33625f4..9207bba34e2e 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4463,6 +4463,8 @@ static void __memory_events_show(struct seq_file *m, atomic_long_t *events) > atomic_long_read(&events[MEMCG_OOM_KILL])); > seq_printf(m, "oom_group_kill %lu\n", > atomic_long_read(&events[MEMCG_OOM_GROUP_KILL])); > + seq_printf(m, "sock_throttled %lu\n", > + atomic_long_read(&events[MEMCG_SOCK_THROTTLED])); > } > > static int memory_events_show(struct seq_file *m, void *v) > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 31ea5af49f2d..2206968fb505 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -713,6 +713,7 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb, > * Adjust rcv_ssthresh according to reserved mem > */ > tcp_adjust_rcv_ssthresh(sk); > + memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED); > } > } > > @@ -5764,8 +5765,10 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb) > > if (!tcp_can_ingest(sk, in_skb)) > tcp_clamp_window(sk); > - else if (tcp_under_memory_pressure(sk)) > + else if (tcp_under_memory_pressure(sk)) { > tcp_adjust_rcv_ssthresh(sk); > + memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED); > + } > > if (tcp_can_ingest(sk, in_skb)) > return 0; > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index bb3576ac0ad7..8fe8d973d7ac 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3275,8 +3275,10 @@ u32 __tcp_select_window(struct sock *sk) > if (free_space < (full_space >> 1)) { > icsk->icsk_ack.quick = 0; > > - if (tcp_under_memory_pressure(sk)) > + if (tcp_under_memory_pressure(sk)) { > tcp_adjust_rcv_ssthresh(sk); > + memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED); > + } > > /* free_space might become our new window, make sure we don't > * increase it due to wscale. > @@ -3334,8 +3336,10 @@ u32 __tcp_select_window(struct sock *sk) > if (free_space < (full_space >> 1)) { > icsk->icsk_ack.quick = 0; > > - if (tcp_under_memory_pressure(sk)) > + if (tcp_under_memory_pressure(sk)) { > tcp_adjust_rcv_ssthresh(sk); > + memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED); > + } > > /* if free space is too low, return a zero window */ > if (free_space < (allowed_space >> 4) || free_space < mss || > -- > 2.47.3 > ^ permalink raw reply [flat|nested] 21+ messages in thread
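To make Kuniyuki's suggestion above concrete, the idea is roughly the following. This is a sketch only: the exact shape of mem_cgroup_sk_under_memory_pressure() in the current tree differs, and mem_cgroup_from_sk() is assumed here as the accessor. Recording the event at the single choke point where the pressure check returns true covers tcp, mptcp and sctp callers without touching each call site:

	static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
	{
		struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);	/* assumed accessor */

		if (!memcg)
			return false;

		if (mem_cgroup_under_socket_pressure(memcg)) {
			/* count the throttling where the pressure is detected */
			memcg_memory_event(memcg, MEMCG_SOCK_THROTTLED);
			return true;
		}
		return false;
	}

One open question with this placement is that the helper is also called on paths that merely query the pressure without taking a throttling action, so the counter would be an upper bound on actual throttling rather than an exact count of actions taken.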
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-15 18:21 ` Kuniyuki Iwashima @ 2025-10-15 18:39 ` Shakeel Butt 2025-10-15 18:58 ` Kuniyuki Iwashima 0 siblings, 1 reply; 21+ messages in thread From: Shakeel Butt @ 2025-10-15 18:39 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Daniel Sedlak, Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On Wed, Oct 15, 2025 at 11:21:17AM -0700, Kuniyuki Iwashima wrote: > On Tue, Oct 14, 2025 at 1:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > > [...] > > > > > > > How about we track the actions taken by the callers of > > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack > > > > > > > reduces the buffer size or whatever the other actions it may take when > > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions > > > > > > > is what I think is needed here, at least for the debugging use-case. > > > > > > I am not against it, but I feel that conveying those tracked actions (or how > > > to represent them) to the user will be much harder. Are there already > > > existing APIs to push this information to the user? > > > > > > > I discussed with Wei Wang and she suggested we should start tracking the > > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > > following. I would like feedback frm networking folks as well: > > I think we could simply put memcg_memory_event() in > mem_cgroup_sk_under_memory_pressure() when it returns > true. > > Other than tcp_adjust_rcv_ssthresh(), if tcp_under_memory_pressure() > returns true, it indicates something bad will happen, failure to expand > rcvbuf and sndbuf, need to prune out-of-order queue more aggressively, > FIN deferred to a retransmitted packet. > > Also, we could cover mptcp and sctp too. > I wanted to start simple and focus on one specific action but I am open to other actins as well. Do we want a generic network throttled metric or do we want different metric for different action? At the moment I think for memcg, a single metric would be sufficient and then we can have tracepoints for more fine grained debugging. ^ permalink raw reply [flat|nested] 21+ messages in thread
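For the fine-grained side, a tracepoint along the following lines could complement the single memcg counter. This is illustrative only: the event name, placement and fields are made up for the example, and it would live in a trace/events header with the usual includes and boilerplate:

	TRACE_EVENT(sock_memcg_throttled,
		TP_PROTO(const struct sock *sk),
		TP_ARGS(sk),
		TP_STRUCT__entry(
			__field(__u16, sport)
			__field(__u16, dport)
		),
		TP_fast_assign(
			__entry->sport = ntohs(inet_sk(sk)->inet_sport);
			__entry->dport = ntohs(inet_sk(sk)->inet_dport);
		),
		TP_printk("sport=%u dport=%u", __entry->sport, __entry->dport)
	);

Per-socket detail like this arguably does not belong in a stable cgroup interface, which is a further argument for keeping memory.events to a single sock_throttled count and leaving the rest to tracing.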
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-15 18:39 ` Shakeel Butt @ 2025-10-15 18:58 ` Kuniyuki Iwashima 2025-10-15 20:17 ` Roman Gushchin 0 siblings, 1 reply; 21+ messages in thread From: Kuniyuki Iwashima @ 2025-10-15 18:58 UTC (permalink / raw) To: Shakeel Butt Cc: Daniel Sedlak, Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On Wed, Oct 15, 2025 at 11:39 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, Oct 15, 2025 at 11:21:17AM -0700, Kuniyuki Iwashima wrote: > > On Tue, Oct 14, 2025 at 1:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > > > [...] > > > > > > > > How about we track the actions taken by the callers of > > > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack > > > > > > > > reduces the buffer size or whatever the other actions it may take when > > > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions > > > > > > > > is what I think is needed here, at least for the debugging use-case. > > > > > > > > I am not against it, but I feel that conveying those tracked actions (or how > > > > to represent them) to the user will be much harder. Are there already > > > > existing APIs to push this information to the user? > > > > > > > > > > I discussed with Wei Wang and she suggested we should start tracking the > > > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > > > following. I would like feedback frm networking folks as well: > > > > I think we could simply put memcg_memory_event() in > > mem_cgroup_sk_under_memory_pressure() when it returns > > true. > > > > Other than tcp_adjust_rcv_ssthresh(), if tcp_under_memory_pressure() > > returns true, it indicates something bad will happen, failure to expand > > rcvbuf and sndbuf, need to prune out-of-order queue more aggressively, > > FIN deferred to a retransmitted packet. > > > > Also, we could cover mptcp and sctp too. > > > > I wanted to start simple and focus on one specific action but I am open > to other actins as well. Do we want a generic network throttled metric > or do we want different metric for different action? At the moment I > think for memcg, a single metric would be sufficient and then we can > have tracepoints for more fine grained debugging. I agree that a single metric would be enough if it can signal something bad is happening as a first step, then we can take further action with tracepoint, bpftrace, whatever. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-15 18:58 ` Kuniyuki Iwashima @ 2025-10-15 20:17 ` Roman Gushchin 0 siblings, 0 replies; 21+ messages in thread From: Roman Gushchin @ 2025-10-15 20:17 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Shakeel Butt, Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik Kuniyuki Iwashima <kuniyu@google.com> writes: > On Wed, Oct 15, 2025 at 11:39 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> >> On Wed, Oct 15, 2025 at 11:21:17AM -0700, Kuniyuki Iwashima wrote: >> > On Tue, Oct 14, 2025 at 1:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> > > >> > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: >> > > [...] >> > > > > > > > How about we track the actions taken by the callers of >> > > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack >> > > > > > > > reduces the buffer size or whatever the other actions it may take when >> > > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions >> > > > > > > > is what I think is needed here, at least for the debugging use-case. >> > > > >> > > > I am not against it, but I feel that conveying those tracked actions (or how >> > > > to represent them) to the user will be much harder. Are there already >> > > > existing APIs to push this information to the user? >> > > > >> > > >> > > I discussed with Wei Wang and she suggested we should start tracking the >> > > calls to tcp_adjust_rcv_ssthresh() first. So, something like the >> > > following. I would like feedback frm networking folks as well: >> > >> > I think we could simply put memcg_memory_event() in >> > mem_cgroup_sk_under_memory_pressure() when it returns >> > true. >> > >> > Other than tcp_adjust_rcv_ssthresh(), if tcp_under_memory_pressure() >> > returns true, it indicates something bad will happen, failure to expand >> > rcvbuf and sndbuf, need to prune out-of-order queue more aggressively, >> > FIN deferred to a retransmitted packet. >> > >> > Also, we could cover mptcp and sctp too. >> > >> >> I wanted to start simple and focus on one specific action but I am open >> to other actins as well. Do we want a generic network throttled metric >> or do we want different metric for different action? At the moment I >> think for memcg, a single metric would be sufficient and then we can >> have tracepoints for more fine grained debugging. > > I agree that a single metric would be enough if it can signal > something bad is happening as a first step, then we can take > further action with tracepoint, bpftrace, whatever. +1 to a single metric ^ permalink raw reply [flat|nested] 21+ messages in thread