Date: Wed, 16 Jul 2025 09:49:49 -0700
From: Shakeel Butt
To: Daniel Sedlak
Cc: "David S. Miller", Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Simon Horman, Jonathan Corbet, Neal Cardwell, Kuniyuki Iwashima,
	David Ahern, Andrew Morton, Yosry Ahmed, linux-mm@kvack.org,
	netdev@vger.kernel.org, Matyas Hurtik
Subject: Re: [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
References: <20250714143613.42184-1-daniel.sedlak@cdn77.com>
	<20250714143613.42184-2-daniel.sedlak@cdn77.com>
In-Reply-To: <20250714143613.42184-2-daniel.sedlak@cdn77.com>

On Mon, Jul 14, 2025 at 04:36:12PM +0200, Daniel Sedlak wrote:
> This patch is a result of our long-standing debug sessions, where it all
> started as "networking is slow", and TCP network throughput suddenly
> dropped from tens of Gbps to few Mbps, and we could not see anything in
> the kernel log or netstat counters.
>
> Currently, we have two memory pressure counters for TCP sockets [1],
> which we manipulate only when the memory pressure is signalled through
> the proto struct [2]. However, the memory pressure can also be signaled
> through the cgroup memory subsystem, which we do not reflect in the
> netstat counters. In the end, when the cgroup memory subsystem signals
> that it is under pressure, we silently reduce the advertised TCP window
> with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
> throughput reduction.
>
> So this patch adds a new counter to account for memory pressure
> signaled by the memory cgroup, so it is much easier to spot.
>
> Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
> Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
> Co-developed-by: Matyas Hurtik
> Signed-off-by: Matyas Hurtik
> Signed-off-by: Daniel Sedlak
> ---
>  Documentation/networking/net_cachelines/snmp.rst |  1 +
>  include/net/tcp.h                                | 14 ++++++++------
>  include/uapi/linux/snmp.h                        |  1 +
>  net/ipv4/proc.c                                  |  1 +
>  4 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
> index bd44b3eebbef..ed17ff84e39c 100644
> --- a/Documentation/networking/net_cachelines/snmp.rst
> +++ b/Documentation/networking/net_cachelines/snmp.rst
> @@ -76,6 +76,7 @@ unsigned_long LINUX_MIB_TCPABORTONLINGER
>  unsigned_long LINUX_MIB_TCPABORTFAILED
>  unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
>  unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
> +unsigned_long LINUX_MIB_TCPCGROUPSOCKETPRESSURE
>  unsigned_long LINUX_MIB_TCPSACKDISCARD
>  unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
>  unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 761c4a0ad386..aae3efe24282 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -267,6 +267,11 @@ extern long sysctl_tcp_mem[3];
>  #define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */
>  #define TCP_RACK_NO_DUPTHRESH 0x4 /* Do not use DUPACK threshold in RACK */
>
> +#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> +#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> +#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
> +#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
> +
>  extern atomic_long_t tcp_memory_allocated;
>  DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
>
> @@ -277,8 +282,10 @@ extern unsigned long tcp_memory_pressure;
>  static inline bool tcp_under_memory_pressure(const struct sock *sk)
>  {
>  	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
> -	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
> +	    mem_cgroup_under_socket_pressure(sk->sk_memcg)) {
> +		TCP_INC_STATS(sock_net(sk), LINUX_MIB_TCPCGROUPSOCKETPRESSURE);
>  		return true;

Incrementing it here will give this stat very different semantics from
LINUX_MIB_TCPMEMORYPRESSURES. Here the count is the number of times the
kernel checks a socket and finds it under memcg pressure, summed per net
namespace. Is that what we want?
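
To make the difference concrete, here is a minimal userspace sketch. It is
not kernel code, and every name in it (struct pressure_counters, observe(),
replay()) is made up for illustration. It assumes, as far as I can tell from
tcp_enter_memory_pressure(), that LINUX_MIB_TCPMEMORYPRESSURES is bumped only
when the protocol enters memory pressure, while the counter in this patch is
bumped on every check that sees pressure:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct pressure_counters {
	unsigned long per_check;	/* bumped on every check that sees pressure */
	unsigned long per_transition;	/* bumped only on a false -> true edge */
	bool was_under_pressure;	/* state seen by the previous check */
};

/* Feed one observation of the pressure state into both counters. */
static void observe(struct pressure_counters *c, bool under_pressure)
{
	assert(c != NULL);

	if (under_pressure) {
		c->per_check++;			/* semantics of the patch as posted */
		if (!c->was_under_pressure)
			c->per_transition++;	/* "entered pressure" semantics */
	}
	c->was_under_pressure = under_pressure;
}

/* Replay a sequence of observations and return the resulting counters. */
static struct pressure_counters replay(const bool *samples, size_t n)
{
	struct pressure_counters c = { 0, 0, false };

	for (size_t i = 0; i < n; i++)
		observe(&c, samples[i]);
	return c;
}
```

For the sequence {0, 1, 1, 1, 0, 1, 1}, per_check ends at 5 and
per_transition ends at 2. With per-check counting the value scales with how
often the TCP code paths call the helper, not with how often the memcg
actually went under pressure.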