From: Kairui Song
Date: Wed, 29 May 2024 01:20:23 +0800
Subject: Re: [PATCH rfc 0/9] mm: memcg: separate legacy cgroup v1 code and put under config option
To: Roman Gushchin
Cc: Shakeel Butt, Andrew Morton, Muchun Song, Johannes Weiner, Michal Hocko, Matthew Wilcox, linux-mm@kvack.org, linux-kernel@vger.kernel.org, gthelen@google.coma, rientjes@google.com, Chris Li
References: <20240509034138.2207186-1-roman.gushchin@linux.dev>

On Fri, May 24, 2024 at 3:55 AM Roman Gushchin wrote:
>
> On Thu, May 23, 2024 at 01:58:49AM +0800, Kairui Song wrote:
> > On Thu, May 9, 2024 at 2:33 PM Shakeel Butt wrote:
> > >
> > > On Wed, May 08, 2024 at 08:41:29PM -0700, Roman Gushchin wrote:
> > > > Cgroups v2 have been around for a while and many users have fully adopted them,
> > > > so they never use cgroups v1 features and functionality.
> > > > Yet they have to "pay" for the cgroup v1 support anyway:
> > > > 1) the kernel binary contains useless cgroup v1 code,
> > > > 2) some common structures like task_struct and mem_cgroup have never used
> > > >    cgroup v1-specific members,
> > > > 3) some code paths have additional checks which are not needed.
> > > >
> > > > Cgroup v1's memory controller has a number of features that are not supported
> > > > by cgroup v2 and their implementation is pretty much self contained.
> > > > Most notably, these features are: soft limit reclaim, oom handling in userspace,
> > > > complicated event notification system, charge migration.
> > > >
> > > > Cgroup v1-specific code in memcontrol.c is close to 4k lines in size and it's
> > > > intertwined with generic and cgroup v2-specific code. It's a burden on
> > > > developers and maintainers.
> > > >
> > > > This patchset aims to solve these problems by:
> > > > 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> > > > 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
> > > >    mm/internal.h header
> > > > 3) introducing the CONFIG_MEMCG_V1 config option, turned on by default
> > > > 4) making memcontrol-v1.c compile only if CONFIG_MEMCG_V1 is set
> > > > 5) putting unused struct mem_cgroup and task_struct members under
> > > >    CONFIG_MEMCG_V1 as well.
> > > >
> > > > This is an RFC version, which is not 100% polished yet, but it would be great
> > > > to discuss and agree on the overall approach.
> > > >
> > > > Some open questions, opinions are appreciated:
> > > > 1) I consider renaming non-static functions in memcontrol-v1.c to have
> > > >    mem_cgroup_v1_ prefix. Is this a good idea?
> > > > 2) Do we want to extend it beyond the memory controller? Should
> > > > 3) Is it better to use a new include/linux/memcontrol-v1.h instead of
> > > >    mm/internal.h? Or mm/memcontrol-v1.h.
> > > >
> > >
> > > Hi Roman,
> > >
> > > A very timely and important topic and we should definitely talk about it
> > > during LSFMM as well. I have been thinking about this problem for quite
> > > some time and I am getting more and more convinced that we should aim to
> > > completely deprecate memcg-v1.
> > >
> > > More specifically:
> > >
> > > 1. What are the memcg-v1 features which have no alternative in memcg-v2
> > >    and are blockers for memcg-v1 users? (setting aside the cgroup v2
> > >    structural restrictions)
> > >
> > > 2. What are unused memcg-v1 features which we should start deprecating?
> > >
> > > IMO we should systematically start deprecating memcg-v1 features and
> > > start unblocking the users stuck on memcg-v1.
> > >
> > > Now regarding the proposal in this series, I think it can be a first
> > > step but should not give an impression that we are done. The only
> > > concern I have is the potential for an "out of sight, out of mind" situation
> > > with this change but if we keep the momentum of deprecation of memcg-v1
> > > it should be fine.
> > >
> > > I have CCed Greg and David from Google to get their opinion on what
> > > memcg-v1 features are blockers for their memcg-v2 migration and if they
> > > have concerns about the deprecation of memcg-v1 features.
> > >
> > > Anyone else still on memcg-v1, please do provide your input.
> > >
> >
> > Hi,
> >
> > Sorry for joining the discussion late, but I'd like to add some info
> > here: We are using the "memsw" feature a lot. It's a very useful knob
> > for container memory overcommitting: It's a great abstraction of the
> > "expected total memory usage" of a container, so containers can't
> > allocate too much memory using SWAP, but are still able to SWAP out.
> >
> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled; they
> > get OOM killed as they used to, but the host is now able to
> > offload cold pages.
> >
> > Similar ability seems absent with V2: With memory.swap.max == 0, the
> > host can't use SWAP to reclaim container memory at all. But with a
> > value larger than that, containers are able to overuse memory, causing
> > delayed OOM kill and thrashing, and the CPU/Memory usage ratio could be
> > heavily out of balance, especially with compressed SWAP backends.
> >
> > Cgroup accounting of ZSWAP/ZRAM doesn't really help, we want to
> > account for the total raw usage, not the compressed usage. One example
> > is that if a container uses tons of duplicated pages, then it can
> > allocate much more memory than its limit allows, and that could cause
> > trouble.
>
> So you don't need separate swap knobs, only combined, right?

Yes, currently we use either combined or separate knobs.

> > I saw Chris also mentioned Google has a workaround internally for it
> > for Cgroup V2. This will be a blocker for us and a similar workaround
> > might be needed. It would be great to see upstream support for this.
>
> I think that _at least_ we should refactor the code so that it would
> be a minimal patch (e.g. one #define) to switch to the old mode.
>
> I don't think it's reasonable to add a new interface, but having a
> patch/config option or even a mount option which changes the semantics
> of memory.swap.max to the v1-like behavior should be ok.
>
> I'll try to do the first part (refactoring this code), and we can have
> a discussion from there.

Thanks, that sounds like a good start.
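
For reference, the difference between the combined (v1 memsw) and separate (v2) limits discussed above can be shown with a toy model. This is only a sketch in plain Python, not kernel code: the function names, the MiB units and the numbers are made up for illustration, and it models nothing beyond the limit checks themselves.

```python
def v1_memsw_allows(mem, swap, mem_limit, memsw_limit):
    """Toy cgroup v1 model: memory is capped, and memory + swap is capped together."""
    return mem <= mem_limit and mem + swap <= memsw_limit


def v2_allows(mem, swap, memory_max, swap_max):
    """Toy cgroup v2 model: memory and swap are capped independently."""
    return mem <= memory_max and swap <= swap_max


LIMIT = 1024  # hypothetical container limit, in MiB

# Hypothetical container state: 512 MiB resident, 512 MiB swapped out,
# and it now tries to allocate another 256 MiB.
mem, swap, extra = 512, 512, 256

# v1 with memsw.limit == memory.limit: the total footprint is already at
# the limit, so further growth is refused even though cold pages were
# offloaded to swap.
v1_ok = v1_memsw_allows(mem + extra, swap, LIMIT, LIMIT)

# v2 with memory.swap.max == 0: a state with swapped-out pages is not
# allowed at all, so the host cannot offload this container's cold pages.
v2_noswap_ok = v2_allows(mem, swap, LIMIT, 0)

# v2 with memory.swap.max == LIMIT: the extra allocation is accepted,
# and the worst-case total footprint becomes memory.max + memory.swap.max.
v2_swap_ok = v2_allows(mem + extra, swap, LIMIT, LIMIT)
v2_worst_case = LIMIT + LIMIT

print("v1 memsw, growth allowed:", v1_ok)
print("v2 swap.max=0, swapped state allowed:", v2_noswap_ok)
print("v2 swap.max=LIMIT, growth allowed:", v2_swap_ok)
print("v2 worst-case total (MiB):", v2_worst_case)
```

In this model, no single v2 setting gives both properties at once (swap-out allowed, total footprint bounded by the original limit), which is the gap described above.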