From: Fabian Deutsch <fdeutsch@redhat.com>
Date: Thu, 14 Dec 2023 19:00:31 +0100
Subject: Re: [PATCH v6] zswap: memcontrol: implement zswap writeback disabling
To: Yu Zhao <yuzhao@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>, Minchan Kim, Chris Li, Nhat Pham,
    akpm@linux-foundation.org, tj@kernel.org, lizefan.x@bytedance.com,
    cerasuolodomenico@gmail.com, yosryahmed@google.com, sjenning@redhat.com,
    ddstreet@ieee.org, vitaly.wool@konsulko.com, mhocko@kernel.org,
    roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev,
    hughd@google.com, corbet@lwn.net, konrad.wilk@oracle.com,
    senozhatsky@chromium.org, rppt@kernel.org, linux-mm@kvack.org,
    kernel-team@meta.com, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
    david@ixit.cz, Kairui Song, Zhongkun He
References: <20231207192406.3809579-1-nphamcs@gmail.com>
    <20231209034229.GA1001962@cmpxchg.org>
    <20231214171137.GA261942@cmpxchg.org>

On Thu, Dec 14, 2023 at 6:24 PM Yu Zhao wrote:
> On Thu, Dec 14, 2023 at 10:11 AM Johannes Weiner
> wrote:
> >
> > On Mon, Dec 11, 2023 at 02:55:43PM -0800, Minchan Kim wrote:
> > > On Fri, Dec 08, 2023 at 10:42:29PM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 08, 2023 at 03:55:59PM -0800, Chris Li wrote:
> > > > > I can give you three usage cases right now:
> > > > > 1) Google's production kernel uses SSD-only swap; it is currently in
> > > > > pilot. This is not expressible by memory.zswap.writeback. You can
> > > > > set memory.zswap.max = 0 and memory.zswap.writeback = 1, and then use
> > > > > an SSD-backed swapfile. But the whole thing feels very clunky,
> > > > > especially when what you really want is SSD-only swap; you need to do
> > > > > all this zswap config dance. Google has an internal memory.swapfile
> > > > > feature implementing a per-cgroup swap file type of "zswap only",
> > > > > "real swap file only", "both", or "none" (the exact keywords might be
> > > > > different), running in production for almost 10 years. The need for
> > > > > more than a zswap type of per-cgroup control is really there.
> > > >
> > > > We use regular swap on SSD without zswap just fine. Of course it's
> > > > expressible.
> > > >
> > > > On dedicated systems, zswap is disabled in sysfs. On shared hosts
> > > > where it's determined based on which workload is scheduled, zswap is
> > > > generally enabled through sysfs, and individual cgroup access is
> > > > controlled via memory.zswap.max - which is what this knob is for.
> > > >
> > > > This is analogous to enabling swap globally, and then opting
> > > > individual cgroups in and out with memory.swap.max.
> > > >
> > > > So this use case is very much already supported, and it's expressed in
> > > > a way that's pretty natural for how cgroups express access and lack of
> > > > access to certain resources.
> > > >
> > > > I don't see how memory.swap.type or memory.swap.tiers would improve
> > > > this in any way. On the contrary, it would overlap and conflict with
> > > > existing controls to manage swap and zswap on a per-cgroup basis.
> > > >
> > > > > 2) As indicated by this discussion, Tencent has a use case for SSD
> > > > > and hard disk swap as overflow.
> > > > > https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> > > > > +Kairui
> > > >
> > > > Multiple swap devices for round robin or with different priorities
> > > > aren't new; they have been supported for a very, very long time. So
> > > > far nobody has proposed to control the exact behavior on a per-cgroup
> > > > basis, and I didn't see anybody in this thread asking for it either.
> > > >
> > > > So I don't see how this counts as an obvious and automatic use case for
> > > > memory.swap.tiers.
> > > >
> > > > > 3) Android has some fancy swap ideas led by those patches.
> > > > > https://lore.kernel.org/linux-mm/20230710221659.2473460-1-minchan@kernel.org/
> > > > > It got shot down due to the removal of frontswap. But the use case
> > > > > and product requirement is there.
> > > > > +Minchan
> > > >
> > > > This looks like an optimization for zram to bypass the block layer and
> > > > hook directly into the swap code. Correct me if I'm wrong, but this
> > > > doesn't appear to have anything to do with per-cgroup backend control.
> > >
> > > Hi Johannes,
> > >
> > > I haven't been following the thread closely, but I noticed the
> > > discussion about potential use cases for zram with memcg.
> > >
> > > One interesting idea I have is to implement a swap controller per
> > > cgroup. This would allow us to tailor the zram swap behavior to the
> > > specific needs of different groups.
> > >
> > > For example, Group A, which is sensitive to swap latency, could use
> > > zram swap with a fast compression setting, even if it sacrifices some
> > > compression ratio. This would prioritize quick access to swapped data,
> > > even if it takes up more space.
> > >
> > > On the other hand, Group B, which can tolerate higher swap latency,
> > > could benefit from a slower compression setting that achieves a higher
> > > compression ratio. This would maximize memory efficiency at the cost of
> > > slightly slower data access.
> > >
> > > This approach could provide a more nuanced and flexible way to manage
> > > swap usage within different cgroups.
> >
> > That makes sense to me.
> >
> > It sounds to me like per-cgroup swapfiles would be the easiest
> > solution to this.
>
> Someone posted it about 10 years ago :)
> https://lwn.net/Articles/592923/
>
> +fdeutsch@redhat.com
> Fabian recently asked me about its status.

Yep - for container use cases.

Now a few thoughts in this direction:
- With swap per cgroup you lose the big "statistical" benefit of having
  swap at the node level. Well, it depends on the size of the cgroup
  (i.e. system.slice is quite large).
- With today's node-level swap, setting memory.swap.max=0 for all cgroups
  allows you to achieve a similar behavior (only opt-in cgroups will get
  swap); see the sketch at the end of this mail.
- The above approach, however, will still have a shared swap backend for
  all cgroups.
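
To make that concrete, here is a minimal, untested sketch of what the
opt-in/opt-out configuration discussed above could look like on a cgroup v2
host. The group names are made up for illustration; memory.swap.max and
memory.zswap.max are existing cgroup v2 knobs, while memory.zswap.writeback
is the knob proposed by this series and only exists with the patch applied:

    # Illustrative sketch only (Python, run as root on a cgroup v2 host with
    # the memory controller enabled); group names are hypothetical.
    from pathlib import Path

    CGROUP_ROOT = Path("/sys/fs/cgroup")

    def ensure_cgroup(name: str) -> None:
        """Create the cgroup directory if it does not exist yet."""
        (CGROUP_ROOT / name).mkdir(exist_ok=True)

    def set_knob(name: str, knob: str, value: str) -> None:
        """Write one cgroup interface file, e.g. memory.swap.max."""
        (CGROUP_ROOT / name / knob).write_text(value)

    # Opt-out: this group never swaps at all.
    ensure_cgroup("noswap.slice")
    set_knob("noswap.slice", "memory.swap.max", "0")

    # Opt-in, compressed-only: may swap, but pages stay in zswap and are
    # never written back to the backing device (the knob added by this patch).
    ensure_cgroup("zswap-only.slice")
    set_knob("zswap-only.slice", "memory.swap.max", "max")
    set_knob("zswap-only.slice", "memory.zswap.writeback", "0")

    # Opt-in, SSD-only: may swap, but bypasses zswap entirely by capping the
    # per-cgroup zswap pool at zero (the "config dance" Chris describes).
    ensure_cgroup("ssd-swap.slice")
    set_knob("ssd-swap.slice", "memory.swap.max", "max")
    set_knob("ssd-swap.slice", "memory.zswap.max", "0")

Only the groups whose memory.swap.max stays above zero participate in
swapping at all, which is the opt-in behavior from the second bullet; the
shared backend from the last bullet is whatever swapfile or device the host
has activated with swapon.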