* [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Daniel Sedlak @ 2025-10-07 12:50 UTC
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
    Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern,
    Andrew Morton, Shakeel Butt, Yosry Ahmed, linux-mm, netdev,
    Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, cgroups,
    Tejun Heo, Michal Koutný
Cc: Daniel Sedlak, Matyas Hurtik

This patch is a result of our long-standing debug sessions, where it all
started as "networking is slow": TCP network throughput suddenly dropped
from tens of Gbps to a few Mbps, and we could not see anything in the
kernel log or the netstat counters.

Currently, we have two memory pressure counters for TCP sockets [1],
which we manipulate only when the memory pressure is signaled through
the proto struct [2]. However, the memory pressure can also be signaled
through the cgroup memory subsystem, which we do not reflect in the
netstat counters. In the end, when the cgroup memory subsystem signals
that it is under pressure, we silently reduce the advertised TCP window
with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
throughput reduction.

Keep in mind that when the cgroup memory subsystem signals socket memory
pressure for a given cgroup, it affects all sockets used in that cgroup,
including child cgroups.

This patch exposes a new file for each cgroup in sysfs which is a
read-only single value file showing how many microseconds this cgroup
contributed to throttling the throughput of network sockets. The file is
accessible at the following path:

  /sys/fs/cgroup/**/<cgroup name>/memory.net.throttled_usec

To summarize the proposed methods of hierarchical propagation of
memory.net.throttled_usec:

1) None - keeping the reported duration local to that cgroup:

     value = self

   This would not be too out of place, since memory.events.local already
   does not accumulate hierarchically. To determine whether sockets in a
   memcg were throttled, we would traverse the /sys/fs/cgroup/ hierarchy
   from the root to the cgroup of interest and sum those local durations.

2) Propagating the duration upwards (using rstat or simple iteration
   towards the root memcg during write):

     value = self + sum of children

   This is the most semantically consistent with other exposed stat files
   and could be added as an entry in memory.stat. However, since the
   pressure gets applied from ancestors to children (see
   mem_cgroup_under_socket_pressure()), determining the duration of
   throttling for sockets in some cgroup would be hardest in this
   variant: it would involve iterating from the root to the examined
   cgroup, subtracting at each node the values of its children from that
   node's value, and summing the results to obtain the total duration
   throttled.

3) Propagating the duration downwards (write only locally, read by
   traversing the hierarchy upwards):

     value = self + sum of ancestors

   This mirrors the logic used in mem_cgroup_under_socket_pressure(); an
   increase in the reported value for a memcg would coincide with more
   throttling being applied to the sockets of that memcg.

We chose variant 1, which is why it is a separate file instead of another
counter in memory.stat. Variant 2 seems to be the most fitting; however,
the calculated value would be misleading and hard to interpret.
Ideally, we would go with variant 3, as it mirrors the logic of
mem_cgroup_under_socket_pressure(), but the third variant can also be
calculated manually from variant 1, and thus we chose variant 1 as the
most versatile option that does not leak internal implementation details
which may change in the future.

Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
---
Sorry for the delay between the versions.

Changes:
v4 -> v5:
- Rebased
- Extend commit message with design decisions
- Rename cgroup counter
- Link to v4: https://lore.kernel.org/netdev/20250805064429.77876-1-daniel.sedlak@cdn77.com/

v3 -> v4:
- Add documentation
- Expose pressure as cumulative counter in microseconds
- Link to v3: https://lore.kernel.org/netdev/20250722071146.48616-1-daniel.sedlak@cdn77.com/

v2 -> v3:
- Expose the socket memory pressure on the cgroups instead of netstat
- Split patch
- Link to v2: https://lore.kernel.org/netdev/20250714143613.42184-1-daniel.sedlak@cdn77.com/

v1 -> v2:
- Add tracepoint
- Link to v1: https://lore.kernel.org/netdev/20250707105205.222558-1-daniel.sedlak@cdn77.com/

 Documentation/admin-guide/cgroup-v2.rst | 10 ++++++
 include/linux/memcontrol.h              | 41 +++++++++++++++----------
 mm/memcontrol.c                         | 31 ++++++++++++++++++-
 3 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 51c0bc4c2dc5..fe81a134c156 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1887,6 +1887,16 @@ The following nested keys are defined.
 	Shows pressure stall information for memory. See
 	:ref:`Documentation/accounting/psi.rst <psi>` for details.
 
+  memory.net.throttled_usec
+	A read-only single value file showing how many microseconds this cgroup
+	contributed to throttling the throughput of network sockets.
+
+	Socket throttling is applied to a cgroup and to all its children,
+	as a consequence of high reclaim pressure.
+
+	Observing throttling of sockets in a particular cgroup can be done
+	by checking this file for that cgroup and also for all its ancestors.
+
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fb27e3d2fdac..647fba7dcc8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -247,14 +247,19 @@ struct mem_cgroup {
 	atomic_t		kmem_stat;
 #endif
 	/*
-	 * Hint of reclaim pressure for socket memroy management. Note
+	 * Hint of reclaim pressure for socket memory management. Note
 	 * that this indicator should NOT be used in legacy cgroup mode
 	 * where socket memory is accounted/charged separately.
 	 */
 	u64	socket_pressure;
-#if BITS_PER_LONG < 64
+	/* memory.net.throttled_usec */
+	u64 socket_pressure_duration;
+#if BITS_PER_LONG >= 64
+	spinlock_t socket_pressure_spinlock;
+#else
 	seqlock_t socket_pressure_seqlock;
 #endif
+
 	int kmemcg_id;
 	/*
 	 * memcg->objcg is wiped out as a part of the objcg repaprenting
@@ -1607,19 +1612,33 @@ bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
 			  gfp_t gfp_mask);
 void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages);
 
-#if BITS_PER_LONG < 64
 static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
 {
-	u64 val = get_jiffies_64() + HZ;
 	unsigned long flags;
 
+#if BITS_PER_LONG >= 64
+	spin_lock_irqsave(&memcg->socket_pressure_spinlock, flags);
+#else
 	write_seqlock_irqsave(&memcg->socket_pressure_seqlock, flags);
-	memcg->socket_pressure = val;
+#endif
+	u64 old_socket_pressure = memcg->socket_pressure;
+	u64 new_socket_pressure = get_jiffies_64() + HZ;
+
+	memcg->socket_pressure = new_socket_pressure;
+	memcg->socket_pressure_duration += jiffies_to_usecs(
+			min(new_socket_pressure - old_socket_pressure, HZ));
+#if BITS_PER_LONG >= 64
+	spin_unlock_irqrestore(&memcg->socket_pressure_spinlock, flags);
+#else
 	write_sequnlock_irqrestore(&memcg->socket_pressure_seqlock, flags);
+#endif
 }
 
 static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
 {
+#if BITS_PER_LONG >= 64
+	return READ_ONCE(memcg->socket_pressure);
+#else
 	unsigned int seq;
 	u64 val;
 
@@ -1629,18 +1648,8 @@ static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
 	} while (read_seqretry(&memcg->socket_pressure_seqlock, seq));
 
 	return val;
-}
-#else
-static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
-{
-	WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
-}
-
-static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
-{
-	return READ_ONCE(memcg->socket_pressure);
-}
 #endif
+}
 
 int alloc_shrinker_info(struct mem_cgroup *memcg);
 void free_shrinker_info(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df3e9205c9e6..d29147223822 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3755,7 +3755,10 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	INIT_LIST_HEAD(&memcg->swap_peaks);
 	spin_lock_init(&memcg->peaks_lock);
 	memcg->socket_pressure = get_jiffies_64();
-#if BITS_PER_LONG < 64
+	memcg->socket_pressure_duration = 0;
+#if BITS_PER_LONG >= 64
+	spin_lock_init(&memcg->socket_pressure_spinlock);
+#else
 	seqlock_init(&memcg->socket_pressure_seqlock);
 #endif
 	memcg1_memcg_init(memcg);
@@ -4579,6 +4582,27 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+static int memory_net_throttled_usec_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+	u64 throttled_usec;
+
+#if BITS_PER_LONG >= 64
+	throttled_usec = READ_ONCE(memcg->socket_pressure_duration);
+#else
+	unsigned int seq;
+
+	do {
+		seq = read_seqbegin(&memcg->socket_pressure_seqlock);
+		throttled_usec = memcg->socket_pressure_duration;
+	} while (read_seqretry(&memcg->socket_pressure_seqlock, seq));
+#endif
+
+	seq_printf(m, "%llu\n", throttled_usec);
+
+	return 0;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -4650,6 +4674,11 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	{
+		.name = "net.throttled_usec",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_net_throttled_usec_show,
+	},
 	{ }	/* terminate */
 };
base-commit: 312e6f7676e63bbb9b81e5c68e580a9f776cc6f0
-- 
2.39.5
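For illustration only (not part of the patch): the cover letter notes that
the variant-3 view can be derived manually from the variant-1 values. A
minimal userspace sketch of that derivation, summing the proposed
memory.net.throttled_usec over a cgroup and its ancestors; the cgroup2
mount point, the example cgroup path, and the error handling are
simplifying assumptions:

/* Sum memory.net.throttled_usec for a cgroup and all of its ancestors,
 * i.e. reconstruct "variant 3" from the "variant 1" values exposed by
 * the patch. Assumes cgroup2 is mounted at /sys/fs/cgroup.
 */
#include <limits.h>
#include <stdio.h>
#include <string.h>

static unsigned long long read_throttled_usec(const char *cgrp)
{
	char path[PATH_MAX];
	unsigned long long val = 0;
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cgroup%s/memory.net.throttled_usec", cgrp);
	f = fopen(path, "r");
	if (!f)
		return 0;	/* e.g. the root cgroup has no such file */
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(int argc, char **argv)
{
	/* usage (illustrative): ./throttled /system.slice/nginx.service */
	char cgrp[PATH_MAX];
	unsigned long long total = 0;
	char *slash;

	if (argc != 2)
		return 1;
	snprintf(cgrp, sizeof(cgrp), "%s", argv[1]);
	for (;;) {
		total += read_throttled_usec(cgrp);
		slash = strrchr(cgrp, '/');
		if (!slash || slash == cgrp)
			break;	/* reached the top-level cgroup */
		*slash = '\0';	/* step up to the parent */
	}
	printf("sockets in %s throttled for %llu usec\n", argv[1], total);
	return 0;
}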
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Tejun Heo @ 2025-10-07 20:01 UTC
To: Daniel Sedlak

On Tue, Oct 07, 2025 at 02:50:56PM +0200, Daniel Sedlak wrote:
...
> 1) None - keeping the reported duration local to that cgroup:
> 2) Propagating the duration upwards (using rstat or simple iteration
>    towards the root memcg during write):
> 3) Propagating the duration downwards (write only locally, read by
>    traversing the hierarchy upwards):
...
> We chose variant 1, which is why it is a separate file instead of another
> counter in memory.stat. Variant 2 seems to be the most fitting; however,
> the calculated value would be misleading and hard to interpret. Ideally,
> we would go with variant 3, as it mirrors the logic of
> mem_cgroup_under_socket_pressure(), but the third variant can also be
> calculated manually from variant 1, and thus we chose variant 1 as the
> most versatile option that does not leak internal implementation details
> which may change in the future.

I'm not against going with 1), but let's not do a separate file for this.
Can't you do memory.stat.local? It'd be better to have the aggregation in
memory.stat, but we can worry about that later.

Thanks.

-- 
tejun
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Matyas Hurtik @ 2025-10-08 12:46 UTC
To: Tejun Heo, Daniel Sedlak

Hello,

> I'm not against going with 1), but let's not do a separate file for this.
> Can't you do memory.stat.local?

I can't find memory.stat.local, so should we create it and add the counter
as an entry there?

Regarding the code, is there anything you would like us to improve? I had
to rewrite it a bit because of the recent changes.

Thanks,
Matyas
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Tejun Heo @ 2025-10-08 18:17 UTC
To: Matyas Hurtik

On Wed, Oct 08, 2025 at 02:46:23PM +0200, Matyas Hurtik wrote:
> Hello,
>
> > I'm not against going with 1), but let's not do a separate file for this.
> > Can't you do memory.stat.local?
>
> I can't find memory.stat.local, so should we create it and add the counter
> as an entry there?

Yes.

> Regarding the code, is there anything you would like us to improve? I had
> to rewrite it a bit because of the recent changes.

I think the memcg folks will have a much better idea on the actual code.
Let's wait for them to chime in.

Thanks.

-- 
tejun
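As context for the suggestion above: memory.stat.local does not exist
upstream, so the following is a purely hypothetical sketch of how the
counter introduced by the patch could be exposed as an entry in such a
file instead of a dedicated one. The function and file names are
invented, and the 32-bit seqlock handling from the patch is omitted for
brevity:

/* Hypothetical sketch only -- memory.stat.local is not an existing file.
 * One possible shape for exposing the local (non-hierarchical) counter
 * there, reusing the value introduced by this patch.
 */
static int memory_stat_local_show(struct seq_file *m, void *v)
{
	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);

	/* only local counters; no rstat flush, no hierarchical aggregation */
	seq_printf(m, "sock_throttled_usec %llu\n",
		   (unsigned long long)READ_ONCE(memcg->socket_pressure_duration));
	return 0;
}

/* registered next to the other entries in memory_files[] */
static struct cftype memory_stat_local_file = {
	.name		= "stat.local",
	.flags		= CFTYPE_NOT_ON_ROOT,
	.seq_show	= memory_stat_local_show,
};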
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-08 18:58 UTC
To: Daniel Sedlak

Daniel Sedlak <daniel.sedlak@cdn77.com> writes:

> This patch is a result of our long-standing debug sessions, where it all
> started as "networking is slow": TCP network throughput suddenly dropped
> from tens of Gbps to a few Mbps, and we could not see anything in the
> kernel log or the netstat counters.
>
[...]
>
> This patch exposes a new file for each cgroup in sysfs which is a
> read-only single value file showing how many microseconds this cgroup
> contributed to throttling the throughput of network sockets. The file is
> accessible at the following path:
>
>   /sys/fs/cgroup/**/<cgroup name>/memory.net.throttled_usec

Hi Daniel!

How is this value going to be used? In other words, do you need an exact
number, or would something like memory.events::net_throttled be enough
for your case?

Thanks!
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Daniel Sedlak @ 2025-10-09 14:44 UTC
To: Roman Gushchin

Hi Roman,

On 10/8/25 8:58 PM, Roman Gushchin wrote:
>> This patch exposes a new file for each cgroup in sysfs which is a
>> read-only single value file showing how many microseconds this cgroup
>> contributed to throttling the throughput of network sockets. The file is
>> accessible at the following path:
>>
>>   /sys/fs/cgroup/**/<cgroup name>/memory.net.throttled_usec
>
> Hi Daniel!
>
> How is this value going to be used? In other words, do you need an exact
> number, or would something like memory.events::net_throttled be enough
> for your case?

Just incrementing a counter each time vmpressure() triggers IMO provides
bad semantics of what is actually happening, because it can hide important
details, mainly the _time_ for how long the network traffic was slowed
down.

For example, when memory.events::net_throttled=1000, it can mean that the
network was slowed down for 1 second, for 1000 seconds, or for anything in
between; the memory.net.throttled_usec proposed by this patch
disambiguates that.

In addition, v1/v2 of this series started that way; from v3 on we rewrote
it to calculate the duration instead, which proved to be better
information for debugging, as it is easier to understand the implications.

Thanks!
Daniel
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-09 15:32 UTC
To: Daniel Sedlak

Daniel Sedlak <daniel.sedlak@cdn77.com> writes:

> On 10/8/25 8:58 PM, Roman Gushchin wrote:
>> How is this value going to be used? In other words, do you need an exact
>> number, or would something like memory.events::net_throttled be enough
>> for your case?
>
> Just incrementing a counter each time vmpressure() triggers IMO provides
> bad semantics of what is actually happening, because it can hide important
> details, mainly the _time_ for how long the network traffic was slowed
> down.
>
> For example, when memory.events::net_throttled=1000, it can mean that the
> network was slowed down for 1 second, for 1000 seconds, or for anything in
> between; the memory.net.throttled_usec proposed by this patch
> disambiguates that.
>
> In addition, v1/v2 of this series started that way; from v3 on we rewrote
> it to calculate the duration instead, which proved to be better
> information for debugging, as it is easier to understand the implications.

But how are you planning to use this information? Is this just
"networking is under pressure for a non-trivial amount of time ->
raise the memcg limit", or something more complicated?

I am a bit concerned about making this metric part of the cgroup API
simply because it's too implementation-defined and, in my opinion, lacks
fundamental meaning.

Vmpressure is calculated based on the scanned/reclaimed ratio (which is
also not always the best proxy for the memory pressure level); then, if it
reaches some level, we basically throttle networking for 1s. So it's all
very arbitrary.

I totally get it from the debugging perspective, but I am not sure about
its usefulness as a permanent metric. This is why I'm asking if there are
lighter alternatives, e.g. memory.events or maybe even tracepoints.

Thanks!
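For reference, a condensed, standalone paraphrase of the vmpressure math
being described above (mm/vmpressure.c in recent kernels). The constants
are the upstream defaults, but the code below is a simplification for
illustration, not the kernel implementation, and the sample inputs come
from the bpftrace dump posted later in this thread:

#include <stdbool.h>
#include <stdio.h>

/* Reclaim efficiency is sampled over a window of scanned pages; anything
 * above the "low" pressure level arms memcg->socket_pressure for HZ
 * jiffies (i.e. one second of socket throttling). */
#define VMPRESSURE_WIN        512UL   /* SWAP_CLUSTER_MAX (32) * 16 pages */
#define VMPRESSURE_LEVEL_MED   60UL   /* >= medium means "above low"      */

static bool throttles_sockets(unsigned long scanned, unsigned long reclaimed)
{
	unsigned long pressure;

	if (scanned < VMPRESSURE_WIN)
		return false;			/* window not full yet */
	/* 0 == everything scanned was reclaimed, 100 == nothing was */
	pressure = 100 - (100 * reclaimed / scanned);
	return pressure >= VMPRESSURE_LEVEL_MED; /* then: socket_pressure = jiffies + HZ */
}

int main(void)
{
	printf("%d\n", throttles_sockets(556, 146));	/* poor efficiency -> 1 */
	printf("%d\n", throttles_sockets(22886, 13606)); /* below medium   -> 0 */
	return 0;
}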
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Shakeel Butt @ 2025-10-09 16:06 UTC
To: Roman Gushchin

On Thu, Oct 09, 2025 at 08:32:27AM -0700, Roman Gushchin wrote:
[...]
> I am a bit concerned about making this metric part of the cgroup API
> simply because it's too implementation-defined and, in my opinion, lacks
> fundamental meaning.
>
> Vmpressure is calculated based on the scanned/reclaimed ratio (which is
> also not always the best proxy for the memory pressure level); then, if it
> reaches some level, we basically throttle networking for 1s. So it's all
> very arbitrary.
>
> I totally get it from the debugging perspective, but I am not sure about
> its usefulness as a permanent metric. This is why I'm asking if there are
> lighter alternatives, e.g. memory.events or maybe even tracepoints.

I have a very similar opinion: if we expose the current implementation
detail through a stable interface, we might get stuck with this
implementation, and I do want to change it in the future.

Coming back to what information we should expose that will be helpful for
Daniel & Matyas and beneficial in general: after giving it some thought, I
think the time "the network was slowed down", or more specifically the
time window during which mem_cgroup_sk_under_memory_pressure() returns
true, might not be that useful without the actual network activity.
Basically, if no one is calling mem_cgroup_sk_under_memory_pressure() and
acting on it, the time window is not that useful.

How about we track the actions taken by the callers of
mem_cgroup_sk_under_memory_pressure()? Basically, if the network stack
reduces the buffer size, or whatever other actions it may take when
mem_cgroup_sk_under_memory_pressure() returns true, tracking those actions
is what I think is needed here, at least for the debugging use case.

WDYT?
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-09 17:58 UTC
To: Shakeel Butt

Shakeel Butt <shakeel.butt@linux.dev> writes:
[...]
> How about we track the actions taken by the callers of
> mem_cgroup_sk_under_memory_pressure()? Basically, if the network stack
> reduces the buffer size, or whatever other actions it may take when
> mem_cgroup_sk_under_memory_pressure() returns true, tracking those actions
> is what I think is needed here, at least for the debugging use case.
>
> WDYT?

I feel like if it's mostly intended for debugging purposes, a combination
of a tracepoint and bpftrace can work pretty well, so there is no need to
create a new sysfs interface.
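To make the tracepoint idea concrete: no such tracepoint exists upstream;
the following is a purely hypothetical sketch, with invented names and
fields, of an event that could be placed where vmpressure arms
memcg->socket_pressure and then consumed with bpftrace or perf (the usual
TRACE_SYSTEM/define_trace.h header boilerplate is omitted):

/* Hypothetical tracepoint sketch (not in any kernel tree). */
#include <linux/tracepoint.h>

TRACE_EVENT(memcg_socket_pressure,

	TP_PROTO(u64 cgroup_id, unsigned long scanned, unsigned long reclaimed),

	TP_ARGS(cgroup_id, scanned, reclaimed),

	TP_STRUCT__entry(
		__field(u64,		cgroup_id)
		__field(unsigned long,	scanned)
		__field(unsigned long,	reclaimed)
	),

	TP_fast_assign(
		__entry->cgroup_id	= cgroup_id;
		__entry->scanned	= scanned;
		__entry->reclaimed	= reclaimed;
	),

	TP_printk("cgroup_id=%llu scanned=%lu reclaimed=%lu",
		  __entry->cgroup_id, __entry->scanned, __entry->reclaimed)
);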
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Shakeel Butt @ 2025-10-09 18:32 UTC
To: Roman Gushchin

On Thu, Oct 09, 2025 at 10:58:51AM -0700, Roman Gushchin wrote:
[...]
> I feel like if it's mostly intended for debugging purposes, a combination
> of a tracepoint and bpftrace can work pretty well, so there is no need to
> create a new sysfs interface.

Definitely not a new interface, but I think having such information in
memory.events or memory.stat would be more convenient. Basically, the
number of times the sockets in this memcg have to be clamped due to memory
pressure would be useful in general.
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-09 19:02 UTC
To: Shakeel Butt

Shakeel Butt <shakeel.butt@linux.dev> writes:
[...]
> Definitely not a new interface, but I think having such information in
> memory.events or memory.stat would be more convenient. Basically, the
> number of times the sockets in this memcg have to be clamped due to memory
> pressure would be useful in general.

Yeah, if we're going to add something, memory.events looks like the best
option, also because it allows polling and getting notified when the event
occurs.
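For context on the notification mechanism mentioned above: cgroup v2 core
files such as memory.events generate a file-modified event when their
content changes, which userspace can consume with inotify or with poll().
A small sketch of the poll() variant; the cgroup path is illustrative:

/* Wait for changes to a cgroup's memory.events; kernfs raises
 * POLLPRI/POLLERR on such files when their content changes.
 */
#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	struct pollfd pfd;
	ssize_t len;

	pfd.fd = open("/sys/fs/cgroup/system.slice/nginx.service/memory.events",
		      O_RDONLY);
	if (pfd.fd < 0)
		return 1;
	pfd.events = POLLPRI;

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			break;
		/* re-read the whole file after each notification */
		lseek(pfd.fd, 0, SEEK_SET);
		len = read(pfd.fd, buf, sizeof(buf) - 1);
		if (len <= 0)
			break;
		buf[len] = '\0';
		fputs(buf, stdout);	/* e.g. watch for a throttling key */
	}
	close(pfd.fd);
	return 0;
}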
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Daniel Sedlak @ 2025-10-13 14:30 UTC
To: Roman Gushchin, Shakeel Butt

On 10/9/25 9:02 PM, Roman Gushchin wrote:
[...]
>>>>> But how are you planning to use this information? Is this just
>>>>> "networking is under pressure for a non-trivial amount of time ->
>>>>> raise the memcg limit", or something more complicated?

We plan to use it mostly for observability purposes and to better
understand which traffic patterns affect the socket pressure the most (so
we can try to fix/delay/improve it). We do not know how commonly this
issue appears in other deployments, but in our deployment many of our
servers were affected by this slowdown, and they varied in terms of
hardware and software configuration. Currently, it is very hard to detect
whether a socket is under pressure without using tools like bpftrace, so
we would like to expose this metric in a more accessible way. So in the
end, we do not really care in which file this "socket pressure happened"
notification will be stored.

>>>>> I totally get it from the debugging perspective, but I am not sure
>>>>> about its usefulness as a permanent metric. This is why I'm asking if
>>>>> there are lighter alternatives, e.g. memory.events or maybe even
>>>>> tracepoints.

If the combination of memory.events(.local) and tracepoint hook(s) is okay
with you, we can use that and export the same information as in the
current patch version. We can incorporate that into the next version.

Also, would it be possible to make the socket pressure signal
configurable, e.g. via a sysctl or per cgroup, so that it does not trigger
the socket pressure signal? I cannot find the reasoning why this
throttling cannot be opted out of (maybe it can).

>>>> How about we track the actions taken by the callers of
>>>> mem_cgroup_sk_under_memory_pressure()? Basically, if the network stack
>>>> reduces the buffer size, or whatever other actions it may take when
>>>> mem_cgroup_sk_under_memory_pressure() returns true, tracking those
>>>> actions is what I think is needed here, at least for the debugging
>>>> use case.

I am not against it, but I feel that conveying those tracked actions (or
how to represent them) to the user will be much harder. Are there already
existing APIs to push this information to the user?

Thanks!
Daniel
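Purely to make the opt-out question concrete: no such knob exists
upstream. The following is a hypothetical sketch of how a per-memcg
boolean (say, set through an invented memory.net.pressure_enabled file)
could gate the existing ancestor walk; the field name is invented and the
cgroup v1 branch of the real helper is omitted:

/* Hypothetical sketch only -- socket_pressure_enabled does not exist. */
static inline bool mem_cgroup_under_socket_pressure(struct mem_cgroup *memcg)
{
	do {
		/* invented opt-out: never throttle sockets in this subtree */
		if (!READ_ONCE(memcg->socket_pressure_enabled))
			return false;
		if (time_before64(get_jiffies_64(),
				  READ_ONCE(memcg->socket_pressure)))
			return true;
	} while ((memcg = parent_mem_cgroup(memcg)));
	return false;
}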
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Roman Gushchin @ 2025-10-14 1:43 UTC
To: Daniel Sedlak

Daniel Sedlak <daniel.sedlak@cdn77.com> writes:
[...]
> If the combination of memory.events(.local) and tracepoint hook(s) is okay
> with you, we can use that and export the same information as in the
> current patch version. We can incorporate that into the next version.

In my opinion:
tracepoint > memory.events entry > memory.stat entry > new cgroupfs file.

> Also, would it be possible to make the socket pressure signal
> configurable, e.g. via a sysctl or per cgroup, so that it does not trigger
> the socket pressure signal? I cannot find the reasoning why this
> throttling cannot be opted out of (maybe it can).

It's a good point.

First, I think the vmpressure implementation is not the best and we might
want to switch to PSI (or something else) there. This is why I'm resistant
to exposing implementation-specific metrics.

That said, I believe that some level of customization here is justified.
Maybe opting out completely is too much, but in the end it's hard for the
kernel to balance the importance of e.g. page cache vs networking buffers,
as it might be really workload-dependent. Or some workloads would prefer
to risk being OOM-killed rather than tolerate sub-par networking
performance.

Thanks!
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Daniel Sedlak @ 2025-10-14 13:58 UTC
To: Roman Gushchin

On 10/14/25 3:43 AM, Roman Gushchin wrote:
[...]
> In my opinion:
> tracepoint > memory.events entry > memory.stat entry > new cgroupfs file.

Thanks, noted; we will incorporate it into the next version.

>> Also, would it be possible to make the socket pressure signal
>> configurable, e.g. via a sysctl or per cgroup, so that it does not
>> trigger the socket pressure signal? I cannot find the reasoning why this
>> throttling cannot be opted out of (maybe it can).
>
> It's a good point.
>
> First, I think the vmpressure implementation is not the best and we might
> want to switch to PSI (or something else) there. This is why I'm resistant
> to exposing implementation-specific metrics.
>
> That said, I believe that some level of customization here is justified.
> Maybe opting out completely is too much, but in the end it's hard for the
> kernel to balance the importance of e.g. page cache vs networking buffers,
> as it might be really workload-dependent. Or some workloads would prefer
> to risk being OOM-killed rather than tolerate sub-par networking
> performance.

As of now, socket pressure throttling can be disabled by moving the
processes causing the pressure into the root cgroup, so we would
definitely benefit from being able to disable it more idiomatically.

This bpftrace output was captured on a production server running an nginx
proxy, which we use as an HTTP cache (the left-most column is a timestamp
in nanoseconds). As you can see, it fluctuates a lot.

26920285712831843, unified:/system.slice/nginx.service, scanned: 556, reclaimed: 146, pressure: 73
26920285731493743, unified:/system.slice/nginx.service, scanned: 22886, reclaimed: 13606, pressure: 40
26920285779559500, unified:/system.slice/nginx.service, scanned: 21775, reclaimed: 11525, pressure: 47
26920285784845147, unified:/system.slice/nginx.service, scanned: 698, reclaimed: 522, pressure: 25
26920285833808666, unified:/system.slice/nginx.service, scanned: 740, reclaimed: 232, pressure: 68
26920285835668081, unified:/system.slice/nginx.service, scanned: 1475, reclaimed: 1224, pressure: 17
26920285838877445, unified:/system.slice/nginx.service, scanned: 2919, reclaimed: 2334, pressure: 20
26920285854811898, unified:/system.slice/nginx.service, scanned: 11586, reclaimed: 7666, pressure: 33
26920285873634643, unified:/system.slice/nginx.service, scanned: 22898, reclaimed: 13387, pressure: 41
26920285899176135, unified:/system.slice/nginx.service, scanned: 10957, reclaimed: 7077, pressure: 35
26920285901529378, unified:/system.slice/nginx.service, scanned: 587, reclaimed: 156, pressure: 73
26920286020702357, unified:/system.slice/nginx.service, scanned: 563, reclaimed: 87, pressure: 84
26920286037434038, unified:/system.slice/nginx.service, scanned: 22072, reclaimed: 14161, pressure: 35
26920285789562313, unified:/system.slice/nginx.service, scanned: 2810, reclaimed: 1696, pressure: 39
26920285879597883, unified:/system.slice/nginx.service, scanned: 693, reclaimed: 625, pressure: 9
26920285884686863, unified:/system.slice/nginx.service, scanned: 2768, reclaimed: 2284, pressure: 17

We believe that the issue originates from suboptimally chosen constants,
as seen in [1]. Currently, vmpressure triggers when it cannot reclaim a
few MiB of memory on a server that has over 500 GiB of memory.

Link: https://elixir.bootlin.com/linux/v6.17.1/source/mm/vmpressure.c#L38 [1]

We would like to work on that more after this patch, to try to find a
better constant or at least make it _more configurable_, if that makes
sense to you.

Thanks!
Daniel
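For scale, the worked numbers behind the "a few MiB" observation above,
assuming 4 KiB pages:

  vmpressure_win = SWAP_CLUSTER_MAX * 16 = 32 * 16 = 512 pages ~= 2 MiB
  2 MiB / 512 GiB ~= 0.0004 % of the machine's memory per sampling window

so a single inefficient reclaim pass over roughly 2 MiB is enough to mark
a memcg on a ~500 GiB machine as under socket pressure for a full second.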
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup
From: Shakeel Butt @ 2025-10-14 20:32 UTC
To: Daniel Sedlak

On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote:
[...]
> > > > > How about we track the actions taken by the callers of
> > > > > mem_cgroup_sk_under_memory_pressure()? Basically, if the network
> > > > > stack reduces the buffer size, or whatever other actions it may
> > > > > take when mem_cgroup_sk_under_memory_pressure() returns true,
> > > > > tracking those actions is what I think is needed here, at least
> > > > > for the debugging use case.
>
> I am not against it, but I feel that conveying those tracked actions (or
> how to represent them) to the user will be much harder. Are there already
> existing APIs to push this information to the user?

I discussed this with Wei Wang, and she suggested we should start by
tracking the calls to tcp_adjust_rcv_ssthresh() first. So, something like
the following. I would like feedback from networking folks as well:

From 54bd2bf6681c1c694295646532f2a62a205ee41a Mon Sep 17 00:00:00 2001
From: Shakeel Butt <shakeel.butt@linux.dev>
Date: Tue, 14 Oct 2025 13:27:36 -0700
Subject: [PATCH] memcg: track network throttling due to memcg memory pressure

Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
---
 include/linux/memcontrol.h | 1 +
 mm/memcontrol.c            | 2 ++
 net/ipv4/tcp_input.c       | 5 ++++-
 net/ipv4/tcp_output.c      | 8 ++++++--
 4 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 873e510d6f8d..5fe254813123 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -52,6 +52,7 @@ enum memcg_memory_event {
 	MEMCG_SWAP_HIGH,
 	MEMCG_SWAP_MAX,
 	MEMCG_SWAP_FAIL,
+	MEMCG_SOCK_THROTTLED,
 	MEMCG_NR_MEMORY_EVENTS,
 };
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4deda33625f4..9207bba34e2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4463,6 +4463,8 @@ static void __memory_events_show(struct seq_file *m, atomic_long_t *events)
 		   atomic_long_read(&events[MEMCG_OOM_KILL]));
 	seq_printf(m, "oom_group_kill %lu\n",
 		   atomic_long_read(&events[MEMCG_OOM_GROUP_KILL]));
+	seq_printf(m, "sock_throttled %lu\n",
+		   atomic_long_read(&events[MEMCG_SOCK_THROTTLED]));
 }
 
 static int memory_events_show(struct seq_file *m, void *v)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 31ea5af49f2d..2206968fb505 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -713,6 +713,7 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb,
 		 * Adjust rcv_ssthresh according to reserved mem
 		 */
 		tcp_adjust_rcv_ssthresh(sk);
+		memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
 	}
 }
 
@@ -5764,8 +5765,10 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb)
 
 	if (!tcp_can_ingest(sk, in_skb))
 		tcp_clamp_window(sk);
-	else if (tcp_under_memory_pressure(sk))
+	else if (tcp_under_memory_pressure(sk)) {
 		tcp_adjust_rcv_ssthresh(sk);
+		memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
+	}
 
 	if (tcp_can_ingest(sk, in_skb))
 		return 0;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bb3576ac0ad7..8fe8d973d7ac 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3275,8 +3275,10 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_under_memory_pressure(sk))
+		if (tcp_under_memory_pressure(sk)) {
 			tcp_adjust_rcv_ssthresh(sk);
+			memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
+		}
 
 		/* free_space might become our new window, make sure we don't
 		 * increase it due to wscale.
@@ -3334,8 +3336,10 @@ u32 __tcp_select_window(struct sock *sk)
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
-		if (tcp_under_memory_pressure(sk))
+		if (tcp_under_memory_pressure(sk)) {
 			tcp_adjust_rcv_ssthresh(sk);
+			memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
+		}
 
 		/* if free space is too low, return a zero window */
 		if (free_space < (allowed_space >> 4) || free_space < mss ||
-- 
2.47.3
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-14 20:32 ` Shakeel Butt @ 2025-10-15 13:57 ` Daniel Sedlak 2025-10-15 18:36 ` Shakeel Butt 2025-10-15 18:21 ` Kuniyuki Iwashima 1 sibling, 1 reply; 21+ messages in thread From: Daniel Sedlak @ 2025-10-15 13:57 UTC (permalink / raw) To: Shakeel Butt Cc: Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On 10/14/25 10:32 PM, Shakeel Butt wrote: > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > [...] >>>>>> How about we track the actions taken by the callers of >>>>>> mem_cgroup_sk_under_memory_pressure()? Basically if network stack >>>>>> reduces the buffer size or whatever the other actions it may take when >>>>>> mem_cgroup_sk_under_memory_pressure() returns, tracking those actions >>>>>> is what I think is needed here, at least for the debugging use-case. >> >> I am not against it, but I feel that conveying those tracked actions (or how >> to represent them) to the user will be much harder. Are there already >> existing APIs to push this information to the user? >> > > I discussed with Wei Wang and she suggested we should start tracking the > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > following. I would like feedback frm networking folks as well: Looks like a good start. Are you planning on sending this patch separately, or can we include it in our v6 (with maybe slight modifications)? > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 873e510d6f8d..5fe254813123 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -52,6 +52,7 @@ enum memcg_memory_event { > MEMCG_SWAP_HIGH, > MEMCG_SWAP_MAX, > MEMCG_SWAP_FAIL, > + MEMCG_SOCK_THROTTLED, This probably should be MEMCG_TCP_SOCK_THROTTLED, because it checks only tcp_under_memory_pressure, however there is also the sk_under_memory_pressure used in net/sctp/sm_statefuns.c:6597 to also reduce the sending rate. Or also add the counter there and keep the name? Thanks! Daniel ^ permalink raw reply [flat|nested] 21+ messages in thread
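One way to square the naming question with the sctp case is to keep the event protocol-agnostic and route all call sites through a tiny wrapper. The sketch below is hypothetical (the helper name is made up, only MEMCG_SOCK_THROTTLED comes from the posted patch); both the tcp call sites above and the sk_under_memory_pressure() path in net/sctp/sm_statefuns.c could then share it, and it also guards against sockets with no memcg attached:

	/*
	 * Hypothetical helper, not part of the posted patch: record the
	 * memcg throttling event for any protocol that backs off because
	 * of socket memory pressure. Skips sockets without memcg socket
	 * accounting.
	 */
	static inline void sk_memcg_note_throttled(struct sock *sk)
	{
		if (mem_cgroup_sockets_enabled && sk->sk_memcg)
			memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED);
	}

With something along these lines, the event could keep its generic name and the sctp rate-reduction path would only need one extra call.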
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-15 13:57 ` Daniel Sedlak @ 2025-10-15 18:36 ` Shakeel Butt 0 siblings, 0 replies; 21+ messages in thread From: Shakeel Butt @ 2025-10-15 18:36 UTC (permalink / raw) To: Daniel Sedlak Cc: Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On Wed, Oct 15, 2025 at 03:57:29PM +0200, Daniel Sedlak wrote: > On 10/14/25 10:32 PM, Shakeel Butt wrote: > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > > [...] > > > > > > > How about we track the actions taken by the callers of > > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack > > > > > > > reduces the buffer size or whatever the other actions it may take when > > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions > > > > > > > is what I think is needed here, at least for the debugging use-case. > > > > > > I am not against it, but I feel that conveying those tracked actions (or how > > > to represent them) to the user will be much harder. Are there already > > > existing APIs to push this information to the user? > > > > > > > I discussed with Wei Wang and she suggested we should start tracking the > > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > > following. I would like feedback frm networking folks as well: > > Looks like a good start. Are you planning on sending this patch separately, > or can we include it in our v6 (with maybe slight modifications)? What else you are planning to add in v6? > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 873e510d6f8d..5fe254813123 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -52,6 +52,7 @@ enum memcg_memory_event { > > MEMCG_SWAP_HIGH, > > MEMCG_SWAP_MAX, > > MEMCG_SWAP_FAIL, > > + MEMCG_SOCK_THROTTLED, > > This probably should be MEMCG_TCP_SOCK_THROTTLED, because it checks only > tcp_under_memory_pressure, however there is also the > sk_under_memory_pressure used in net/sctp/sm_statefuns.c:6597 to also reduce > the sending rate. Or also add the counter there and keep the name? Yeah makes sense to add the counter in sctp as well. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-14 20:32 ` Shakeel Butt 2025-10-15 13:57 ` Daniel Sedlak @ 2025-10-15 18:21 ` Kuniyuki Iwashima 2025-10-15 18:39 ` Shakeel Butt 1 sibling, 1 reply; 21+ messages in thread From: Kuniyuki Iwashima @ 2025-10-15 18:21 UTC (permalink / raw) To: Shakeel Butt Cc: Daniel Sedlak, Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On Tue, Oct 14, 2025 at 1:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > [...] > > > > > > How about we track the actions taken by the callers of > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack > > > > > > reduces the buffer size or whatever the other actions it may take when > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions > > > > > > is what I think is needed here, at least for the debugging use-case. > > > > I am not against it, but I feel that conveying those tracked actions (or how > > to represent them) to the user will be much harder. Are there already > > existing APIs to push this information to the user? > > > > I discussed with Wei Wang and she suggested we should start tracking the > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > following. I would like feedback frm networking folks as well: I think we could simply put memcg_memory_event() in mem_cgroup_sk_under_memory_pressure() when it returns true. Other than tcp_adjust_rcv_ssthresh(), if tcp_under_memory_pressure() returns true, it indicates something bad will happen, failure to expand rcvbuf and sndbuf, need to prune out-of-order queue more aggressively, FIN deferred to a retransmitted packet. Also, we could cover mptcp and sctp too. 
> > > From 54bd2bf6681c1c694295646532f2a62a205ee41a Mon Sep 17 00:00:00 2001 > From: Shakeel Butt <shakeel.butt@linux.dev> > Date: Tue, 14 Oct 2025 13:27:36 -0700 > Subject: [PATCH] memcg: track network throttling due to memcg memory pressure > > Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> > --- > include/linux/memcontrol.h | 1 + > mm/memcontrol.c | 2 ++ > net/ipv4/tcp_input.c | 5 ++++- > net/ipv4/tcp_output.c | 8 ++++++-- > 4 files changed, 13 insertions(+), 3 deletions(-) > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 873e510d6f8d..5fe254813123 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -52,6 +52,7 @@ enum memcg_memory_event { > MEMCG_SWAP_HIGH, > MEMCG_SWAP_MAX, > MEMCG_SWAP_FAIL, > + MEMCG_SOCK_THROTTLED, > MEMCG_NR_MEMORY_EVENTS, > }; > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index 4deda33625f4..9207bba34e2e 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -4463,6 +4463,8 @@ static void __memory_events_show(struct seq_file *m, atomic_long_t *events) > atomic_long_read(&events[MEMCG_OOM_KILL])); > seq_printf(m, "oom_group_kill %lu\n", > atomic_long_read(&events[MEMCG_OOM_GROUP_KILL])); > + seq_printf(m, "sock_throttled %lu\n", > + atomic_long_read(&events[MEMCG_SOCK_THROTTLED])); > } > > static int memory_events_show(struct seq_file *m, void *v) > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 31ea5af49f2d..2206968fb505 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -713,6 +713,7 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb, > * Adjust rcv_ssthresh according to reserved mem > */ > tcp_adjust_rcv_ssthresh(sk); > + memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED); > } > } > > @@ -5764,8 +5765,10 @@ static int tcp_prune_queue(struct sock *sk, const struct sk_buff *in_skb) > > if (!tcp_can_ingest(sk, in_skb)) > tcp_clamp_window(sk); > - else if (tcp_under_memory_pressure(sk)) > + else if (tcp_under_memory_pressure(sk)) { > tcp_adjust_rcv_ssthresh(sk); > + memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED); > + } > > if (tcp_can_ingest(sk, in_skb)) > return 0; > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index bb3576ac0ad7..8fe8d973d7ac 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -3275,8 +3275,10 @@ u32 __tcp_select_window(struct sock *sk) > if (free_space < (full_space >> 1)) { > icsk->icsk_ack.quick = 0; > > - if (tcp_under_memory_pressure(sk)) > + if (tcp_under_memory_pressure(sk)) { > tcp_adjust_rcv_ssthresh(sk); > + memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED); > + } > > /* free_space might become our new window, make sure we don't > * increase it due to wscale. > @@ -3334,8 +3336,10 @@ u32 __tcp_select_window(struct sock *sk) > if (free_space < (full_space >> 1)) { > icsk->icsk_ack.quick = 0; > > - if (tcp_under_memory_pressure(sk)) > + if (tcp_under_memory_pressure(sk)) { > tcp_adjust_rcv_ssthresh(sk); > + memcg_memory_event(sk->sk_memcg, MEMCG_SOCK_THROTTLED); > + } > > /* if free space is too low, return a zero window */ > if (free_space < (allowed_space >> 4) || free_space < mss || > -- > 2.47.3 > ^ permalink raw reply [flat|nested] 21+ messages in thread
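To make Kuniyuki's suggestion above concrete, the idea is roughly the following. This is a sketch only: the exact shape of mem_cgroup_sk_under_memory_pressure() in the current tree differs, and mem_cgroup_from_sk() is assumed here as the accessor. Recording the event at the single choke point where the pressure check returns true covers tcp, mptcp and sctp callers without touching each call site:

	static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
	{
		struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);	/* assumed accessor */

		if (!memcg)
			return false;

		if (mem_cgroup_under_socket_pressure(memcg)) {
			/* count the throttling where the pressure is detected */
			memcg_memory_event(memcg, MEMCG_SOCK_THROTTLED);
			return true;
		}
		return false;
	}

One open question with this placement is that the helper is also called on paths that merely query the pressure without taking a throttling action, so the counter would be an upper bound on actual throttling rather than an exact count of actions taken.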
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-15 18:21 ` Kuniyuki Iwashima @ 2025-10-15 18:39 ` Shakeel Butt 2025-10-15 18:58 ` Kuniyuki Iwashima 0 siblings, 1 reply; 21+ messages in thread From: Shakeel Butt @ 2025-10-15 18:39 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Daniel Sedlak, Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On Wed, Oct 15, 2025 at 11:21:17AM -0700, Kuniyuki Iwashima wrote: > On Tue, Oct 14, 2025 at 1:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > > [...] > > > > > > > How about we track the actions taken by the callers of > > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack > > > > > > > reduces the buffer size or whatever the other actions it may take when > > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions > > > > > > > is what I think is needed here, at least for the debugging use-case. > > > > > > I am not against it, but I feel that conveying those tracked actions (or how > > > to represent them) to the user will be much harder. Are there already > > > existing APIs to push this information to the user? > > > > > > > I discussed with Wei Wang and she suggested we should start tracking the > > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > > following. I would like feedback frm networking folks as well: > > I think we could simply put memcg_memory_event() in > mem_cgroup_sk_under_memory_pressure() when it returns > true. > > Other than tcp_adjust_rcv_ssthresh(), if tcp_under_memory_pressure() > returns true, it indicates something bad will happen, failure to expand > rcvbuf and sndbuf, need to prune out-of-order queue more aggressively, > FIN deferred to a retransmitted packet. > > Also, we could cover mptcp and sctp too. > I wanted to start simple and focus on one specific action but I am open to other actins as well. Do we want a generic network throttled metric or do we want different metric for different action? At the moment I think for memcg, a single metric would be sufficient and then we can have tracepoints for more fine grained debugging. ^ permalink raw reply [flat|nested] 21+ messages in thread
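For the fine-grained side, a tracepoint along the following lines could complement the single memcg counter. This is illustrative only: the event name, placement and fields are made up for the example, and it would live in a trace/events header with the usual includes and boilerplate:

	TRACE_EVENT(sock_memcg_throttled,
		TP_PROTO(const struct sock *sk),
		TP_ARGS(sk),
		TP_STRUCT__entry(
			__field(__u16, sport)
			__field(__u16, dport)
		),
		TP_fast_assign(
			__entry->sport = ntohs(inet_sk(sk)->inet_sport);
			__entry->dport = ntohs(inet_sk(sk)->inet_dport);
		),
		TP_printk("sport=%u dport=%u", __entry->sport, __entry->dport)
	);

Per-socket detail like this arguably does not belong in a stable cgroup interface, which is a further argument for keeping memory.events to a single sock_throttled count and leaving the rest to tracing.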
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-15 18:39 ` Shakeel Butt @ 2025-10-15 18:58 ` Kuniyuki Iwashima 2025-10-15 20:17 ` Roman Gushchin 0 siblings, 1 reply; 21+ messages in thread From: Kuniyuki Iwashima @ 2025-10-15 18:58 UTC (permalink / raw) To: Shakeel Butt Cc: Daniel Sedlak, Roman Gushchin, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik On Wed, Oct 15, 2025 at 11:39 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > On Wed, Oct 15, 2025 at 11:21:17AM -0700, Kuniyuki Iwashima wrote: > > On Tue, Oct 14, 2025 at 1:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: > > > > > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: > > > [...] > > > > > > > > How about we track the actions taken by the callers of > > > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack > > > > > > > > reduces the buffer size or whatever the other actions it may take when > > > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions > > > > > > > > is what I think is needed here, at least for the debugging use-case. > > > > > > > > I am not against it, but I feel that conveying those tracked actions (or how > > > > to represent them) to the user will be much harder. Are there already > > > > existing APIs to push this information to the user? > > > > > > > > > > I discussed with Wei Wang and she suggested we should start tracking the > > > calls to tcp_adjust_rcv_ssthresh() first. So, something like the > > > following. I would like feedback frm networking folks as well: > > > > I think we could simply put memcg_memory_event() in > > mem_cgroup_sk_under_memory_pressure() when it returns > > true. > > > > Other than tcp_adjust_rcv_ssthresh(), if tcp_under_memory_pressure() > > returns true, it indicates something bad will happen, failure to expand > > rcvbuf and sndbuf, need to prune out-of-order queue more aggressively, > > FIN deferred to a retransmitted packet. > > > > Also, we could cover mptcp and sctp too. > > > > I wanted to start simple and focus on one specific action but I am open > to other actins as well. Do we want a generic network throttled metric > or do we want different metric for different action? At the moment I > think for memcg, a single metric would be sufficient and then we can > have tracepoints for more fine grained debugging. I agree that a single metric would be enough if it can signal something bad is happening as a first step, then we can take further action with tracepoint, bpftrace, whatever. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v5] memcg: expose socket memory pressure in a cgroup 2025-10-15 18:58 ` Kuniyuki Iwashima @ 2025-10-15 20:17 ` Roman Gushchin 0 siblings, 0 replies; 21+ messages in thread From: Roman Gushchin @ 2025-10-15 20:17 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Shakeel Butt, Daniel Sedlak, David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern, Andrew Morton, Yosry Ahmed, linux-mm, netdev, Johannes Weiner, Michal Hocko, Muchun Song, cgroups, Tejun Heo, Michal Koutný, Matyas Hurtik Kuniyuki Iwashima <kuniyu@google.com> writes: > On Wed, Oct 15, 2025 at 11:39 AM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> >> On Wed, Oct 15, 2025 at 11:21:17AM -0700, Kuniyuki Iwashima wrote: >> > On Tue, Oct 14, 2025 at 1:33 PM Shakeel Butt <shakeel.butt@linux.dev> wrote: >> > > >> > > On Mon, Oct 13, 2025 at 04:30:53PM +0200, Daniel Sedlak wrote: >> > > [...] >> > > > > > > > How about we track the actions taken by the callers of >> > > > > > > > mem_cgroup_sk_under_memory_pressure()? Basically if network stack >> > > > > > > > reduces the buffer size or whatever the other actions it may take when >> > > > > > > > mem_cgroup_sk_under_memory_pressure() returns, tracking those actions >> > > > > > > > is what I think is needed here, at least for the debugging use-case. >> > > > >> > > > I am not against it, but I feel that conveying those tracked actions (or how >> > > > to represent them) to the user will be much harder. Are there already >> > > > existing APIs to push this information to the user? >> > > > >> > > >> > > I discussed with Wei Wang and she suggested we should start tracking the >> > > calls to tcp_adjust_rcv_ssthresh() first. So, something like the >> > > following. I would like feedback frm networking folks as well: >> > >> > I think we could simply put memcg_memory_event() in >> > mem_cgroup_sk_under_memory_pressure() when it returns >> > true. >> > >> > Other than tcp_adjust_rcv_ssthresh(), if tcp_under_memory_pressure() >> > returns true, it indicates something bad will happen, failure to expand >> > rcvbuf and sndbuf, need to prune out-of-order queue more aggressively, >> > FIN deferred to a retransmitted packet. >> > >> > Also, we could cover mptcp and sctp too. >> > >> >> I wanted to start simple and focus on one specific action but I am open >> to other actins as well. Do we want a generic network throttled metric >> or do we want different metric for different action? At the moment I >> think for memcg, a single metric would be sufficient and then we can >> have tracepoints for more fine grained debugging. > > I agree that a single metric would be enough if it can signal > something bad is happening as a first step, then we can take > further action with tracepoint, bpftrace, whatever. +1 to a single metric ^ permalink raw reply [flat|nested] 21+ messages in thread