From: Daniel Sedlak <daniel.sedlak@cdn77.com>
To: "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>, Jonathan Corbet <corbet@lwn.net>,
Neal Cardwell <ncardwell@google.com>,
Kuniyuki Iwashima <kuniyu@google.com>,
David Ahern <dsahern@kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Shakeel Butt <shakeel.butt@linux.dev>,
Yosry Ahmed <yosry.ahmed@linux.dev>,
linux-mm@kvack.org, netdev@vger.kernel.org
Cc: Daniel Sedlak <daniel.sedlak@cdn77.com>,
Matyas Hurtik <matyas.hurtik@cdn77.com>
Subject: [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
Date: Mon, 14 Jul 2025 16:36:12 +0200 [thread overview]
Message-ID: <20250714143613.42184-2-daniel.sedlak@cdn77.com> (raw)
In-Reply-To: <20250714143613.42184-1-daniel.sedlak@cdn77.com>
This patch is a result of our long-standing debug sessions, where it all
started as "networking is slow", and TCP network throughput suddenly
dropped from tens of Gbps to few Mbps, and we could not see anything in
the kernel log or netstat counters.
Currently, we have two memory pressure counters for TCP sockets [1],
which we manipulate only when the memory pressure is signalled through
the proto struct [2]. However, the memory pressure can also be signaled
through the cgroup memory subsystem, which we do not reflect in the
netstat counters. In the end, when the cgroup memory subsystem signals
that it is under pressure, we silently reduce the advertised TCP window
with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
throughput reduction.
So this patch adds a new counter to account for memory pressure
signaled by the memory cgroup, so it is much easier to spot.
Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
Co-developed-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Signed-off-by: Matyas Hurtik <matyas.hurtik@cdn77.com>
Signed-off-by: Daniel Sedlak <daniel.sedlak@cdn77.com>
---
Documentation/networking/net_cachelines/snmp.rst | 1 +
include/net/tcp.h | 14 ++++++++------
include/uapi/linux/snmp.h | 1 +
net/ipv4/proc.c | 1 +
4 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
index bd44b3eebbef..ed17ff84e39c 100644
--- a/Documentation/networking/net_cachelines/snmp.rst
+++ b/Documentation/networking/net_cachelines/snmp.rst
@@ -76,6 +76,7 @@ unsigned_long LINUX_MIB_TCPABORTONLINGER
unsigned_long LINUX_MIB_TCPABORTFAILED
unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
+unsigned_long LINUX_MIB_TCPCGROUPSOCKETPRESSURE
unsigned_long LINUX_MIB_TCPSACKDISCARD
unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 761c4a0ad386..aae3efe24282 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -267,6 +267,11 @@ extern long sysctl_tcp_mem[3];
#define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */
#define TCP_RACK_NO_DUPTHRESH 0x4 /* Do not use DUPACK threshold in RACK */
+#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
+#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
+#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
+#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
+
extern atomic_long_t tcp_memory_allocated;
DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
@@ -277,8 +282,10 @@ extern unsigned long tcp_memory_pressure;
static inline bool tcp_under_memory_pressure(const struct sock *sk)
{
if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
- mem_cgroup_under_socket_pressure(sk->sk_memcg))
+ mem_cgroup_under_socket_pressure(sk->sk_memcg)) {
+ TCP_INC_STATS(sock_net(sk), LINUX_MIB_TCPCGROUPSOCKETPRESSURE);
return true;
+ }
return READ_ONCE(tcp_memory_pressure);
}
@@ -316,11 +323,6 @@ bool tcp_check_oom(const struct sock *sk, int shift);
extern struct proto tcp_prot;
-#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
-#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
-#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
-#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
-
void tcp_tsq_work_init(void);
int tcp_v4_err(struct sk_buff *skb, u32);
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 1d234d7e1892..9e8d1a5e56a9 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -231,6 +231,7 @@ enum
LINUX_MIB_TCPABORTFAILED, /* TCPAbortFailed */
LINUX_MIB_TCPMEMORYPRESSURES, /* TCPMemoryPressures */
LINUX_MIB_TCPMEMORYPRESSURESCHRONO, /* TCPMemoryPressuresChrono */
+ LINUX_MIB_TCPCGROUPSOCKETPRESSURE, /* TCPCgroupSocketPressure */
LINUX_MIB_TCPSACKDISCARD, /* TCPSACKDiscard */
LINUX_MIB_TCPDSACKIGNOREDOLD, /* TCPSACKIgnoredOld */
LINUX_MIB_TCPDSACKIGNOREDNOUNDO, /* TCPSACKIgnoredNoUndo */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index ea2f01584379..0bcec9a51fb0 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -235,6 +235,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPAbortFailed", LINUX_MIB_TCPABORTFAILED),
SNMP_MIB_ITEM("TCPMemoryPressures", LINUX_MIB_TCPMEMORYPRESSURES),
SNMP_MIB_ITEM("TCPMemoryPressuresChrono", LINUX_MIB_TCPMEMORYPRESSURESCHRONO),
+ SNMP_MIB_ITEM("TCPCgroupSocketPressure", LINUX_MIB_TCPCGROUPSOCKETPRESSURE),
SNMP_MIB_ITEM("TCPSACKDiscard", LINUX_MIB_TCPSACKDISCARD),
SNMP_MIB_ITEM("TCPDSACKIgnoredOld", LINUX_MIB_TCPDSACKIGNOREDOLD),
SNMP_MIB_ITEM("TCPDSACKIgnoredNoUndo", LINUX_MIB_TCPDSACKIGNOREDNOUNDO),
--
2.39.5
next prev parent reply other threads:[~2025-07-14 14:37 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-14 14:36 [PATCH v2 net-next 0/2] account for TCP " Daniel Sedlak
2025-07-14 14:36 ` Daniel Sedlak [this message]
2025-07-16 16:49 ` [PATCH v2 net-next 1/2] tcp: account for " Shakeel Butt
2025-07-16 18:07 ` Kuniyuki Iwashima
2025-07-16 18:37 ` Shakeel Butt
2025-07-17 15:31 ` Daniel Sedlak
2025-07-17 17:26 ` Kuniyuki Iwashima
2025-07-14 14:36 ` [PATCH v2 net-next 2/2] mm/vmpressure: add tracepoint for socket pressure detection Daniel Sedlak
2025-07-14 18:02 ` Kuniyuki Iwashima
2025-07-15 7:01 ` Daniel Sedlak
2025-07-15 17:17 ` Kuniyuki Iwashima
2025-07-15 17:46 ` Kuniyuki Iwashima
2025-07-16 8:47 ` Daniel Sedlak
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250714143613.42184-2-daniel.sedlak@cdn77.com \
--to=daniel.sedlak@cdn77.com \
--cc=akpm@linux-foundation.org \
--cc=corbet@lwn.net \
--cc=davem@davemloft.net \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=horms@kernel.org \
--cc=kuba@kernel.org \
--cc=kuniyu@google.com \
--cc=linux-mm@kvack.org \
--cc=matyas.hurtik@cdn77.com \
--cc=ncardwell@google.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=shakeel.butt@linux.dev \
--cc=yosry.ahmed@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox