From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song
Date: Mon, 11 Dec 2023 17:31:05 +0800
Subject: Re: [PATCH v6] zswap: memcontrol: implement zswap writeback disabling
To: Chris Li
Cc: Nhat Pham, akpm@linux-foundation.org, tj@kernel.org,
 lizefan.x@bytedance.com, hannes@cmpxchg.org, cerasuolodomenico@gmail.com,
 yosryahmed@google.com, sjenning@redhat.com, ddstreet@ieee.org,
 vitaly.wool@konsulko.com, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeelb@google.com, muchun.song@linux.dev, hughd@google.com, corbet@lwn.net,
 konrad.wilk@oracle.com, senozhatsky@chromium.org, rppt@kernel.org,
 linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, david@ixit.cz, Minchan Kim, Zhongkun He
References: <20231207192406.3809579-1-nphamcs@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

Chris Li wrote on Sat, Dec 9, 2023 at 07:56:
>
> Hi Nhat,
>
> On Thu, Dec 7, 2023 at 5:03 PM Nhat Pham wrote:
> >
> > On Thu, Dec 7, 2023 at 4:19 PM Chris Li wrote:
> > >
> > > Hi Nhat,
> > >
> > >
> > > On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham wrote:
> > > >
> > > > During our experiment with zswap, we sometimes observe swap IOs due to
> > > > occasional zswap store failures and writebacks-to-swap. These swapping
> > > > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > > > to save memory and improve performance where possible.
> > > >
> > > > This patch adds the option to disable this behavior entirely: do not
> > > > writeback to backing swapping device when a zswap store attempt fail,
> > > > and do not write pages in the zswap pool back to the backing swap
> > > > device (both when the pool is full, and when the new zswap shrinker is
> > > > called).
> > > >
> > > > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > > > cgroup file. By default, writebacks to swap device is enabled, which is
> > > > the previous behavior. Initially, writeback is enabled for the root
> > > > cgroup, and a newly created cgroup will inherit the current setting of
> > > > its parent.
> > > >
> > > > Note that this is subtly different from setting memory.swap.max to 0, as
> > > > it still allows for pages to be stored in the zswap pool (which itself
> > > > consumes swap space in its current form).
> > > >
> > > > This patch should be applied on top of the zswap shrinker series:
> > > >
> > > > https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
> > > >
> > > > as it also disables the zswap shrinker, a major source of zswap
> > > > writebacks.
> > >
> > > I am wondering about the status of "memory.swap.tiers" proof of concept patch?
> > > Are we still on board to have this two patch merge together somehow so
> > > we can have
> > > "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap" cover the
> > > memory.zswap.writeback == 1 and memory.zswap.writeback == 0 case?
> > >
> > > Thanks
> > >
> > > Chris
> > >
> >
> > Hi Chris,
> >
> > I briefly summarized my recent discussion with Johannes here:
> >
> > https://lore.kernel.org/all/CAKEwX=NwGGRAtXoNPfq63YnNLBCF0ZDOdLVRsvzUmYhK4jxzHA@mail.gmail.com/
>
> Sorry I am traveling in a different time zone so not able to get to
> that email sooner. That email is only sent out less than one day
> before the V6 patch right?
>
> >
> > TL;DR is we acknowledge the potential usefulness of swap.tiers
> > interface, but the use case is not quite there yet, so it does not
>
> I disagree about no use case. No use case for Meta != no usage case
> for the rest of the linux kernel community. That mindset really needs
> to shift to do Linux kernel development. Respect other's usage cases.
> It is not just Meta's Linux kernel. It is everybody's Linux kernel.
>
> I can give you three usage cases right now:
> 1) Google producting kernel uses SSD only swap, it is currently on
> pilot. This is not expressible by the memory.zswap.writeback. You can
> set the memory.zswap.max = 0 and memory.zswap.writeback = 1, then SSD
> backed swapfile. But the whole thing feels very clunky, especially
> what you really want is SSD only swap, you need to do all this zswap
> config dance. Google has an internal memory.swapfile feature
> implemented per cgroup swap file type by "zswap only", "real swap file
> only", "both", "none" (the exact keyword might be different). running
> in the production for almost 10 years. The need for more than zswap
> type of per cgroup control is really there.
>
> 2) As indicated by this discussion, Tencent has a usage case for SSD
> and hard disk swap as overflow.
> https://lore.kernel.org/linux-mm/20231119194740.94101-9-ryncsn@gmail.com/
> +Kairui

Yes, we are not using zswap. We are using ZRAM for swap since we have
many different varieties of workload instances, with a very flexible
storage setup. Some of them don't have the ability to set up a swapfile.
So we built a set of kernel infrastructure on top of ZRAM, which has
worked pretty well so far.

The concern from some teams is that ZRAM (or zswap) can't always free
up memory, so it may lead to a higher risk of OOM compared to a
physical swap device, and they do have suitable devices for swap on
some of their machines. So secondary swap support is very helpful in
case of a memory usage peak.

Besides this, another requirement is that different containers may have
different priorities: some containers can tolerate high swap overhead
while some cannot, so swap tiering is useful for us in many ways.

And thanks to cloud infrastructure, the disk setup could change from
time to time depending on workload requirements, so our requirement is
to support ZRAM (always) + SSD (optional) + HDD (also optional) as swap
backends, while not making things too complex to maintain.

Currently we have implemented cgroup-based ZRAM compression algorithm
control, per-cgroup ZRAM accounting and limits, and an experimental
kernel worker that migrates cold swap entries from a high priority
device to a low priority device at very small scale. (We lack the basic
mechanics to do this at large scale; however, because slow devices have
low IOPS and cold pages are rarely accessed, this hasn't been much of a
problem so far, though it is kind of ugly.)

The rest of swapping (e.g. secondary swap when ZRAM is full) depends on
the kernel's native ability. So far it works, but not in the best form,
and it needs more patches to work better (e.g. the swapin/readahead
patch I sent previously). Some of our design may also need to change in
the long term, and we also want a well-built interface and kernel
mechanics to manage multi-tier swap. I'm very willing to talk and
collaborate on this.
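
To make the ZRAM (always) + SSD (optional) + HDD (optional) requirement
above a bit more concrete, here is a minimal sketch of priority-ordered
tier selection with a per-container allow list. It is a userspace toy
model in Python, not kernel code and not the interface proposed in this
thread: SwapTier, make_tiers, pick_tier, swap_out, and the priority and
capacity numbers are all hypothetical names and values chosen for
illustration. The only behavior it borrows from the kernel is that a
higher swap priority is tried first.

```python
from dataclasses import dataclass


@dataclass
class SwapTier:
    """Hypothetical model of one swap backend (not a kernel structure)."""
    name: str
    priority: int  # higher value is tried first, like swapon priorities
    capacity: int  # illustrative number of slots this backend can hold
    used: int = 0

    def has_room(self) -> bool:
        return self.used < self.capacity


def make_tiers():
    """Build an example setup: ZRAM first, then SSD, then HDD as overflow."""
    return [
        SwapTier("zram", priority=100, capacity=2),
        SwapTier("ssd", priority=50, capacity=1),
        SwapTier("hdd", priority=10, capacity=4),
    ]


def pick_tier(tiers, allowed):
    """Return the highest-priority allowed tier that still has room, or None."""
    for tier in sorted(tiers, key=lambda t: t.priority, reverse=True):
        if tier.name in allowed and tier.has_room():
            return tier
    return None


def swap_out(tiers, allowed):
    """Place one page on the chosen tier; return its name, or None if refused."""
    tier = pick_tier(tiers, allowed)
    if tier is None:
        return None
    tier.used += 1
    return tier.name
```

In this model a container that tolerates high swap overhead would pass
all three tier names as its allow list and spill from ZRAM to SSD to HDD
as each fills up, while a latency-sensitive container would pass only
"zram" and get None back (no swap-out) once ZRAM is full.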