From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 16 Jul 2025 11:37:26 -0700
From: Shakeel Butt
To: Kuniyuki Iwashima
Cc: Daniel Sedlak, "David S. Miller", Eric Dumazet, Jakub Kicinski,
 Paolo Abeni, Simon Horman, Jonathan Corbet, Neal Cardwell, David Ahern,
 Andrew Morton, Yosry Ahmed, linux-mm@kvack.org, netdev@vger.kernel.org,
 Matyas Hurtik
Subject: Re: [PATCH v2 net-next 1/2] tcp: account for memory pressure signaled by cgroup
References: <20250714143613.42184-1-daniel.sedlak@cdn77.com>
 <20250714143613.42184-2-daniel.sedlak@cdn77.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Sender: owner-linux-mm@kvack.org

On Wed, Jul 16, 2025 at 11:07:44AM -0700, Kuniyuki Iwashima wrote:
> On Wed, Jul 16, 2025 at 9:50 AM Shakeel Butt wrote:
> >
> > On Mon, Jul 14, 2025 at 04:36:12PM +0200, Daniel Sedlak wrote:
> > > This patch is a result of our long-standing debug sessions, where it all
> > > started as "networking is slow", and TCP network throughput suddenly
> > > dropped from tens of Gbps to few Mbps, and we could not see anything in
> > > the kernel log or netstat counters.
> > >
> > > Currently, we have two memory pressure counters for TCP sockets [1],
> > > which we manipulate only when the memory pressure is signalled through
> > > the proto struct [2]. However, the memory pressure can also be signaled
> > > through the cgroup memory subsystem, which we do not reflect in the
> > > netstat counters. In the end, when the cgroup memory subsystem signals
> > > that it is under pressure, we silently reduce the advertised TCP window
> > > with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
> > > throughput reduction.
> > >
> > > So this patch adds a new counter to account for memory pressure
> > > signaled by the memory cgroup, so it is much easier to spot.
> > >
> > > Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
> > > Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
> > > Co-developed-by: Matyas Hurtik
> > > Signed-off-by: Matyas Hurtik
> > > Signed-off-by: Daniel Sedlak
> > > ---
> > >  Documentation/networking/net_cachelines/snmp.rst |  1 +
> > >  include/net/tcp.h                                | 14 ++++++++------
> > >  include/uapi/linux/snmp.h                        |  1 +
> > >  net/ipv4/proc.c                                  |  1 +
> > >  4 files changed, 11 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/Documentation/networking/net_cachelines/snmp.rst b/Documentation/networking/net_cachelines/snmp.rst
> > > index bd44b3eebbef..ed17ff84e39c 100644
> > > --- a/Documentation/networking/net_cachelines/snmp.rst
> > > +++ b/Documentation/networking/net_cachelines/snmp.rst
> > > @@ -76,6 +76,7 @@ unsigned_long LINUX_MIB_TCPABORTONLINGER
> > >  unsigned_long LINUX_MIB_TCPABORTFAILED
> > >  unsigned_long LINUX_MIB_TCPMEMORYPRESSURES
> > >  unsigned_long LINUX_MIB_TCPMEMORYPRESSURESCHRONO
> > > +unsigned_long LINUX_MIB_TCPCGROUPSOCKETPRESSURE
> > >  unsigned_long LINUX_MIB_TCPSACKDISCARD
> > >  unsigned_long LINUX_MIB_TCPDSACKIGNOREDOLD
> > >  unsigned_long LINUX_MIB_TCPDSACKIGNOREDNOUNDO
> > > diff --git a/include/net/tcp.h b/include/net/tcp.h
> > > index 761c4a0ad386..aae3efe24282 100644
> > > --- a/include/net/tcp.h
> > > +++ b/include/net/tcp.h
> > > @@ -267,6 +267,11 @@ extern long sysctl_tcp_mem[3];
> > >  #define TCP_RACK_STATIC_REO_WND 0x2 /* Use static RACK reo wnd */
> > >  #define TCP_RACK_NO_DUPTHRESH 0x4 /* Do not use DUPACK threshold in RACK */
> > >
> > > +#define TCP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> > > +#define __TCP_INC_STATS(net, field) __SNMP_INC_STATS((net)->mib.tcp_statistics, field)
> > > +#define TCP_DEC_STATS(net, field) SNMP_DEC_STATS((net)->mib.tcp_statistics, field)
> > > +#define TCP_ADD_STATS(net, field, val) SNMP_ADD_STATS((net)->mib.tcp_statistics, field, val)
> > > +
> > >  extern atomic_long_t tcp_memory_allocated;
> > >  DECLARE_PER_CPU(int, tcp_memory_per_cpu_fw_alloc);
> > >
> > > @@ -277,8 +282,10 @@ extern unsigned long tcp_memory_pressure;
> > >  static inline bool tcp_under_memory_pressure(const struct sock *sk)
> > >  {
> > >  	if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
> > > -	    mem_cgroup_under_socket_pressure(sk->sk_memcg))
> > > +	    mem_cgroup_under_socket_pressure(sk->sk_memcg)) {
> > > +		TCP_INC_STATS(sock_net(sk), LINUX_MIB_TCPCGROUPSOCKETPRESSURE);
> > >  		return true;
> >
> > Incrementing it here will give a very different semantic to this stat
> > compared to LINUX_MIB_TCPMEMORYPRESSURES. Here the increments mean the
> > number of times the kernel check if a given socket is under memcg
> > pressure for a net namespace. Is that what we want?
>
> I'm trying to decouple sk_memcg from the global tcp_memory_allocated
> as you and Wei planned before, and the two accounting already have the
> different semantics from day1 and will keep that, so a new stat having a
> different semantics would be fine.
>
> But I think per-memcg stat like memory.stat.XXX would be a good fit
> rather than pre-netns because one netns could be shared by multiple
> cgroups and multiple sockets in the same cgroup could be spread across
> multiple netns.

Yeah it makes much more sense to have memcg stat for memcg based socket
pressure.
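
To make the semantic difference discussed above concrete, here is a small
userspace sketch. It is not kernel code, and every name in it
(pressure_stats, pressure_state, check_counting_every_call,
note_transition, replay) is hypothetical and only for illustration. It
contrasts a counter bumped on every check that observes pressure, which is
roughly what the hunk above would do, with a counter bumped only on entry
into the pressure state, which is, as far as I understand it, closer to
what LINUX_MIB_TCPMEMORYPRESSURES reports today.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counters; stand-ins for a per-netns or per-memcg stat. */
struct pressure_stats {
	unsigned long checks_under_pressure; /* one per check that saw pressure */
	unsigned long pressure_events;       /* one per entry into pressure */
};

/* Hypothetical state needed to detect transitions. */
struct pressure_state {
	bool in_pressure;
};

/* Count every call that observes pressure (semantics of the posted hunk). */
static bool check_counting_every_call(bool under_pressure,
				      struct pressure_stats *st)
{
	if (under_pressure)
		st->checks_under_pressure++;
	return under_pressure;
}

/* Count only the false -> true transitions. */
static void note_transition(struct pressure_state *ps, bool under_pressure,
			    struct pressure_stats *st)
{
	if (under_pressure && !ps->in_pressure)
		st->pressure_events++;
	ps->in_pressure = under_pressure;
}

/* Feed the same sequence of observations to both counting schemes. */
static struct pressure_stats replay(const bool *samples, size_t n)
{
	struct pressure_stats st = { 0, 0 };
	struct pressure_state ps = { false };
	size_t i;

	for (i = 0; i < n; i++) {
		check_counting_every_call(samples[i], &st);
		note_transition(&ps, samples[i], &st);
	}
	return st;
}
```

With the same sequence of observations the two counters can diverge
arbitrarily: the first grows with how often the check runs, the second
only with how often pressure starts. That is why the choice of increment
site changes what the stat means.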