From: Shakeel Butt
To: Kuniyuki Iwashima
Cc: Eric Dumazet, Michal Koutný, Tejun Heo, "David S. Miller",
    Jakub Kicinski, Neal Cardwell, Paolo Abeni, Willem de Bruijn,
    Matthieu Baerts, Mat Martineau, Johannes Weiner, Michal Hocko,
    Roman Gushchin, Andrew Morton, Simon Horman, Geliang Tang,
    Muchun Song, Kuniyuki Iwashima, netdev@vger.kernel.org,
    mptcp@lists.linux.dev, cgroups@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg
    from global protocol memory accounting.
Date: Wed, 23 Jul 2025 10:28:23 -0700

Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF options.

On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
[...]
> >
> > Running workloads in root cgroup is not normal and comes with a warning
> > of no isolation provided.
> >
> > I looked at the patch again to understand the modes you are introducing.
> > Initially, I thought the series introduced multiple modes, including an
> > option to exclude network memory from memcg accounting. However, if I
> > understand correctly, that is not the case; the opt-out applies only to
> > the global TCP/UDP accounting. That’s a relief, and I apologize for the
> > misunderstanding.
> >
> > If I’m correct, you need a way to exclude a workload from the global
> > TCP/UDP accounting, and currently, memcg serves as a convenient
> > abstraction for the workload. Please let me know if I misunderstood.
>
> Correct.
>
> Currently, memcg by itself cannot guarantee that memory allocation for
> socket buffer does not fail even when memory.current < memory.max
> due to the global protocol limits.
>
> It means we need to increase the global limits to
>
>   (bytes of TCP socket buffer in each cgroup) * (number of cgroups)
>
> , which is hard to predict, and I guess that's the reason why you
> or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> limit.

No, that was not the reason. The main reason for setting the global
tcp_mem limit to the maximum was that the limit was not needed, since
memcg should account for and limit network memory. I think the reason
you don't want the global tcp_mem limit to be unlimited now is that you
have an internal feature that lets workloads opt out of memcg accounting
of network memory, which is causing isolation issues.

>
> But we should keep tcp_mem[] within a sane range in the first place.
>
> This series allows us to configure memcg limits only and let memcg
> guarantee no failure until it fully consumes memory.max.
>
> The point is that memcg should not be affected by the global limits,
> and this is orthogonal to the assumption that every workload should
> be running under memcg.
>
> >
> > Now memcg is one way to represent the workload. Another more natural, at
> > least to me, is the core cgroup. Basically cgroup.something interface.
> > BPF is yet another option.
> >
> > To me cgroup seems preferable but let's see what other memcg & cgroup
> > folks think. Also note that for cgroup and memcg the interface will need
> > to be hierarchical.
>
> As the root cgroup doesn't have the knob, these combinations are
> considered hierarchical:
>
>   (parent, child) = (0, 0), (0, 1), (1, 1)
>
> and only the pattern below is not considered hierarchical
>
>   (parent, child) = (1, 0)
>
> Let's say we lock the knob at the first socket creation like your
> idea above.
>
> If a parent's and its child's knobs are (0, 0) and the child creates a
> socket, the child memcg is locked as 0. When the parent enables
> the knob, we must check all child cgroups as well. Or, we lock
> all the parents' knobs when a socket is created in a child cgroup
> with knob=0 ? In any case we need a global lock.
>
> Well, I understand that the hierarchical semantics is preferable
> for cgroup but I think it does not resolve any real issue and rather
> churns the code unnecessarily.

All of this is implementation detail; I am asking about semantics. More
specifically:

1. Will the root always be non-isolated?
2. If a cgroup is isolated, does that mean all of its descendants are
   isolated?
3. Will there ever be a reasonable use case with a non-isolated subtree
   under an isolated ancestor?

Please give some thought to the above (and related) questions.

I am still not convinced that memcg is the right home for this opt-out
feature. I have CCed cgroup folks to get their opinion as well.
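
For reference, the hierarchy rule quoted above can be written down as a
small check. This is only an illustrative sketch in Python, not code from
the patch series: the `Cgroup` class, the `isolated` field, and
`is_hierarchical()` are hypothetical names, and modeling the root as
always 0 is an assumption based on the statement that the root cgroup
has no knob. It treats (parent, child) = (1, 0) as the only invalid
pattern, which is what questions 2 and 3 are asking about.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Cgroup:
    """Hypothetical model of a cgroup node; not a kernel structure."""
    name: str
    # Hypothetical knob: True (1) means the cgroup is decoupled from the
    # global TCP/UDP protocol memory accounting, False (0) means it is not.
    isolated: bool = False
    parent: Optional["Cgroup"] = None
    children: List["Cgroup"] = field(default_factory=list)

    def add_child(self, name: str, isolated: bool = False) -> "Cgroup":
        child = Cgroup(name, isolated, parent=self)
        self.children.append(child)
        return child


def is_hierarchical(cg: Cgroup) -> bool:
    """Return True if no (parent, child) pair in the subtree is (1, 0)."""
    for child in cg.children:
        if cg.isolated and not child.isolated:
            return False
        if not is_hierarchical(child):
            return False
    return True


# Assumption: the root has no knob, so it is modeled as always 0.
root = Cgroup("root")
a = root.add_child("a", isolated=False)  # (0, 0): allowed
b = a.add_child("b", isolated=True)      # (0, 1): allowed
c = b.add_child("c", isolated=True)      # (1, 1): allowed
ok = is_hierarchical(root)

d = b.add_child("d", isolated=False)     # (1, 0): the one disallowed pattern
broken = is_hierarchical(root)
```

Under this rule, an isolated cgroup implies that all of its descendants
are isolated. Whether a non-isolated subtree under an isolated ancestor
should ever be allowed is exactly the open semantic question, so the
check above encodes one possible answer, not a settled one.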