From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 651D0C4332F for ; Fri, 3 Nov 2023 19:24:43 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C79778D00CD; Fri, 3 Nov 2023 15:24:42 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C022C8D000C; Fri, 3 Nov 2023 15:24:42 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A7D328D00CD; Fri, 3 Nov 2023 15:24:42 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 9218C8D000C for ; Fri, 3 Nov 2023 15:24:42 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 63464C11A4 for ; Fri, 3 Nov 2023 19:24:42 +0000 (UTC) X-FDA: 81417620004.14.C8D1C04 Received: from mail-io1-f50.google.com (mail-io1-f50.google.com [209.85.166.50]) by imf01.hostedemail.com (Postfix) with ESMTP id 871B540009 for ; Fri, 3 Nov 2023 19:24:40 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=TpyMcNhF; spf=pass (imf01.hostedemail.com: domain of nphamcs@gmail.com designates 209.85.166.50 as permitted sender) smtp.mailfrom=nphamcs@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1699039480; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=yZgFmOX3ULsBcBCYJS8Q2oH6cz5Ye/aK91vNIaxZApU=; b=aa9+WnFijgjMIe2/0bgYf8pSFzs0rEqTO1sN/Sx9sh5/O+xEAI5yKgnlMPHgexnwxkRp+C XIhymur0ghCKrRhF7U11Wa0TGvuwCp5PtCB4g4BeKTMjS0oBDA23F7UXi9IyPIjsqk5i01 202jiql8fD1/ZzFBv7KI/5IhZYfr/ao= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1699039480; a=rsa-sha256; cv=none; b=qr4oGQjR4xkT1POCY9uh+JogVymvCCCHCchWOzkmsxSDHSFZ99nBzGWHwhFyNON1v+SHIC KViKoGTzFAvQD9qRcP+roDfDsYgOZYJ6vIN9OQw9qSDRwxhc6JDpjsIQLnnSncyLezM39+ 5+lFeGN1YlYTtJlYpCBgaxoKTQsAgWM= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=TpyMcNhF; spf=pass (imf01.hostedemail.com: domain of nphamcs@gmail.com designates 209.85.166.50 as permitted sender) smtp.mailfrom=nphamcs@gmail.com; dmarc=pass (policy=none) header.from=gmail.com Received: by mail-io1-f50.google.com with SMTP id ca18e2360f4ac-7a693d8de75so82103739f.2 for ; Fri, 03 Nov 2023 12:24:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1699039479; x=1699644279; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=yZgFmOX3ULsBcBCYJS8Q2oH6cz5Ye/aK91vNIaxZApU=; b=TpyMcNhFhC4BfRfWCmDqwEywwr28nld+leyUBN9Qjf3F9j8NfzfX2ewR0fIIZmS07U cDop0FF/AQLmclPRV87cM91Ap2TiZnOylqSNVAGfv/PX+CrM++K6yDRhHJFlJSr+U1Op lwcNz1BLecMlfrP2LhdYuVO0DtYlbGCpFGLoBTtNxi8+Z8kYnkBtAled0JSjgulUwblv 1JKBa0J+7a3YhiYPHEquR5/uXxQLfoimBwSO0YLFzRgOcGEkroNRFoCv69XCeoi+1WHM p/ImoRfIRxtVKcXW/oCvYpKQAcH9DCngep+iFZdASMUcGyydvisLOEeWvnH2lfOU52VD NtdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1699039479; x=1699644279; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=yZgFmOX3ULsBcBCYJS8Q2oH6cz5Ye/aK91vNIaxZApU=; b=C3wj3FjCShc7faS4DmpjqNzgWoCfQivTZmqKBmTowTabj9icL6JfhOZ0XiZROHlTl8 wxTdF41zo2SwOUaG9+GFXOYrCOkrOle461o1L0NF8eU1FdXzi49X8D3eZQ/5eBnevZeq 8Bkvl19bYM2qZ1eMOmNPsfJ+H1NZaSQroKyhgxeRH1vcNPiLEQ27D3VilQobfhXus5n6 raMKc0T44r3HwPptpDxSVxMqsqVAgfZsJQxXee6iOofJmnNqS0qKoP8vwlJTGCm8Y69J MsQUrM53vxrtt2KYVkYd0WldUV1CmKRK+TLuceCEgoJuPNOIfzNAMIVaUzWZiSpKY/xz REgQ== X-Gm-Message-State: AOJu0YzQ8x54oivPDdxpWv+OW9DtHQD43wOMdj9fiCNqZCwq/1BgPu02 pfmm6EHb3BwmrtcrUIrZXnOa2xYGRA19KF+RIOw= X-Google-Smtp-Source: AGHT+IFSnD4V/LX6mRlu0PCTUa5qHRWWBa4S7AVEVw8WDTLWp95ndc+LdBPX4HP7r7C+1b3tL92MStshFm3nZszw5PY= X-Received: by 2002:a05:6602:2cc5:b0:790:cb89:44eb with SMTP id j5-20020a0566022cc500b00790cb8944ebmr28825006iow.1.1699039479485; Fri, 03 Nov 2023 12:24:39 -0700 (PDT) MIME-Version: 1.0 References: <20231102234236.1784543-1-nphamcs@gmail.com> In-Reply-To: From: Nhat Pham Date: Fri, 3 Nov 2023 12:24:27 -0700 Message-ID: Subject: Re: [RFC PATCH v3] zswap: memcontrol: implement zswap writeback disabling To: Yosry Ahmed Cc: akpm@linux-foundation.org, tj@kernel.org, lizefan.x@bytedance.com, hannes@cmpxchg.org, cerasuolodomenico@gmail.com, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev, hughd@google.com, corbet@lwn.net, konrad.wilk@oracle.com, senozhatsky@chromium.org, rppt@kernel.org, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, david@ixit.cz Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 871B540009 X-Rspam-User: X-Stat-Signature: h8x1x7xjhzx7jdjacrkchn3c3xowhsu5 X-Rspamd-Server: rspam03 X-HE-Tag: 1699039480-485056 X-HE-Meta: U2FsdGVkX18kvJP6dHZwEiytDU4mXWOWDlQpEo+sb6mnuZp0/t7gMdlBarW5UDpxHYbqb4LYapWI5T13oK7OOQN0YpnUPegFBeGwBhTe1PTZV4+cqNTZ4HkonNSZC6g/HFXhJk5HtZpB+mDg3g1nBp/NLcOi8pT2B31POwjcmCNeCoFfg8ncemRRRPI726joY3TVtMvBUi6F+psqZG4ysXRIt0MG9+9R26YvLdvw28R3hgDpRlGzVWDBepBtTFJyD927KZG0ZmxbVWQrWAZYJuj2Xpplfgtldprmaczti3sRU7MLaXP3th//7yNXaNTWbniFUFf44bbDM/jj7PMsHYcG0p8kIJbRjBBS03EXEApL+6b5Ex3bKAoudWHQfOQ5aixtTogrJ+LzkrHSVCF9k+ujFeK0E5j8XrWvctVBLXstVmryVvr9/eJBE2h1MXh4cFa5B9a+3VC8GexGmsLSaYAqtqvXOglEGQbqLUFjBuqddSvqKLsjVSW4oP6Lr2161n9eFqJl0X6i6bS5juOKFDMKehLzjg2US3Dit2ixNplCbqZjUuECU5XaLajaRkYtjPDiPUM1Xd/GwhAwFt7ED9uXbJxnMGc27UVbKj9JJpdXuE59wsA9i6+0HJyP2Ha7wGi0ZZVDJlrZFMk2YluLjHTtFmoaVchgVEF4nhZ5ekOiEkaLcrfqx55EvA5tRXcGr0mmeymluVkXGdk1hT6zURK5K1HG2FtzHuTy0+y7ZVpQYY3GsF+VmfzRnJ5nOxn1Fx36UvFK0+/YL+oXn6nxUlEABKUsxfc9RrY//AJIVR41ypodaIhQPf8uN8B02P7NdNLWgYeZ3HOxoBlpm+MWkjWjlP2hLA6ZhISPEKcWEmnJbPQ4OyPWaj2TRRiZjzaMMHhKIMES5L5KFcB2N6TdnvWKvMKA01DjlD5f1JnaJTCdvoUxggx5cmDh1hPmrojzNkpJSiTFG1Q/aYxIsY3 r8Nkro3I D+A2pYfulPAJExOmlejP55zCPMGEPMGcSDyUIT3eTaK41oU0Kfg06gczUIY4LjCxGoi6X4WsmYEBmoXLgemtvQNZPVvt0cmSYoZ+QrKEf1v9Wf/ObEJsnsvlC72Tjd8uBsIiUsdU8+12URnvmhmiVTI1/YwHkrhfZuZSgl/jXVsoHWBbeJRwJjgDOLUZ0AxzmUOYbcupV4q0Umdxw1h/Cqjpvu+yXVAWmdB6DlW1u7I8cRhOMBKYT5zjqANolSs6elEL0/F4xHc7rT2iDoQyf6wS5LhbuwtYctBQcXIgcDy6y+utyNrsROA1wrsUgx44FR990Sw8Gdw1y5248gsrsR2FC+fsceGAKnctU6olbtj2gKlN48/x0YO9UU10uCPDGKOI4kAObguiHggf3DC7c5flMgxPh+vGYYC1Egj36Zq4S4OvPpEEerA9Hsk2o2c0EAKitq5uu01zYZeY= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Thu, Nov 2, 2023 at 6:13=E2=80=AFPM Yosry Ahmed = wrote: > > On Thu, Nov 2, 2023 at 4:42=E2=80=AFPM Nhat Pham wrot= e: > > > > During our experiment with zswap, we sometimes observe swap IOs due to > > occasional zswap store failures and writebacks-to-swap. These swapping > > IOs prevent many users who cannot tolerate swapping from adopting zswap > > to save memory and improve performance where possible. > > > > This patch adds the option to disable this behavior entirely: do not > > writeback to backing swapping device when a zswap store attempt fail, > > and do not write pages in the zswap pool back to the backing swap > > device (both when the pool is full, and when the new zswap shrinker is > > called). > > > > This new behavior can be opted-in/out on a per-cgroup basis via a new > > cgroup file. By default, writebacks to swap device is enabled, which is > > the previous behavior. > > > > Note that this is subtly different from setting memory.swap.max to 0, a= s > > it still allows for pages to be stored in the zswap pool (which itself > > consumes swap space in its current form). > > > > Suggested-by: Johannes Weiner > > Signed-off-by: Nhat Pham > > --- > > Documentation/admin-guide/cgroup-v2.rst | 11 +++++++ > > Documentation/admin-guide/mm/zswap.rst | 6 ++++ > > include/linux/memcontrol.h | 12 ++++++++ > > include/linux/zswap.h | 6 ++++ > > mm/memcontrol.c | 38 +++++++++++++++++++++++++ > > mm/page_io.c | 6 ++++ > > mm/shmem.c | 3 +- > > mm/zswap.c | 14 +++++++++ > > 8 files changed, 94 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/ad= min-guide/cgroup-v2.rst > > index 606b2e0eac4b..18c4171392ea 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1672,6 +1672,17 @@ PAGE_SIZE multiple when read back. > > limit, it will refuse to take any more stores before existing > > entries fault back in or are written out to disk. > > > > + memory.zswap.writeback > > + A read-write single value file which exists on non-root > > + cgroups. The default value is "1". > > + > > + When this is set to 0, all swapping attempts to swapping device= s > > + are disabled. This included both zswap writebacks, and swapping= due > > + to zswap store failure. > > + > > + Note that this is subtly different from setting memory.swap.max= to > > + 0, as it still allows for pages to be written to the zswap pool= . > > + > > memory.pressure > > A read-only nested-keyed file. > > > > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/adm= in-guide/mm/zswap.rst > > index 522ae22ccb84..b987e58edb70 100644 > > --- a/Documentation/admin-guide/mm/zswap.rst > > +++ b/Documentation/admin-guide/mm/zswap.rst > > @@ -153,6 +153,12 @@ attribute, e. g.:: > > > > Setting this parameter to 100 will disable the hysteresis. > > > > +Some users cannot tolerate the swapping that comes with zswap store fa= ilures > > +and zswap writebacks. Swapping can be disabled entirely (without disab= ling > > +zswap itself) on a cgroup-basis as follows: > > + > > + echo 0 > /sys/fs/cgroup//memory.zswap.writeback > > + > > When there is a sizable amount of cold memory residing in the zswap po= ol, it > > can be advantageous to proactively write these cold pages to swap and = reclaim > > the memory for other use cases. By default, the zswap shrinker is disa= bled. > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index 95f6c9e60ed1..e51eafdf2a15 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -219,6 +219,12 @@ struct mem_cgroup { > > > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > > unsigned long zswap_max; > > + > > + /* > > + * Prevent pages from this memcg from being written back from z= swap to > > + * swap, and from being swapped out on zswap store failures. > > + */ > > + bool zswap_writeback; > > #endif > > > > unsigned long soft_limit; > > @@ -1931,6 +1937,7 @@ static inline void count_objcg_event(struct obj_c= group *objcg, > > bool obj_cgroup_may_zswap(struct obj_cgroup *objcg); > > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size); > > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size); > > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg); > > #else > > static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) > > { > > @@ -1944,6 +1951,11 @@ static inline void obj_cgroup_uncharge_zswap(str= uct obj_cgroup *objcg, > > size_t size) > > { > > } > > +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgrou= p *memcg) > > +{ > > + /* if zswap is disabled, do not block pages going to the swappi= ng device */ > > + return true; > > +} > > #endif > > > > #endif /* _LINUX_MEMCONTROL_H */ > > diff --git a/include/linux/zswap.h b/include/linux/zswap.h > > index cbd373ba88d2..b4997e27a74b 100644 > > --- a/include/linux/zswap.h > > +++ b/include/linux/zswap.h > > @@ -35,6 +35,7 @@ void zswap_swapoff(int type); > > void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg); > > void zswap_lruvec_state_init(struct lruvec *lruvec); > > void zswap_lruvec_swapin(struct page *page); > > +bool is_zswap_enabled(void); > > #else > > > > struct zswap_lruvec_state {}; > > @@ -55,6 +56,11 @@ static inline void zswap_swapoff(int type) {} > > static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memc= g) {} > > static inline void zswap_lruvec_init(struct lruvec *lruvec) {} > > static inline void zswap_lruvec_swapin(struct page *page) {} > > + > > +static inline bool is_zswap_enabled(void) > > +{ > > + return false; > > +} > > #endif > > > > #endif /* _LINUX_ZSWAP_H */ > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index e43b5aba8efc..8a6aadcc103c 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -5545,6 +5545,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *= parent_css) > > WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX); > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > > memcg->zswap_max =3D PAGE_COUNTER_MAX; > > + WRITE_ONCE(memcg->zswap_writeback, true); > > Generally LGTM, just one question. > > Would it be more convenient if the initial value is inherited from the > parent (the root starts with true)? > > I can see this being useful if we want to set it to false on the > entire machine or one a parent cgroup, we can set it before creating > any children instead of setting it to 0 every time we create a new > cgroup. I'm not 100% sure about the benefit or have a strong opinion one way or another, but this sounds like a nice-to-have detail to me, and a relativ= ely low cost one (both in effort and at runtime) at that too. Propagating the change everytime we modify the memory.zswap.writeback value of the ancestor might be data race-prone (and costly, depending on how big the cgroup subtree is), but this is just a one-time-per-cgroup propagation (at the new cgroup creation time). Can anyone come up with a failure case for this change, or why it might be a bad idea? Thanks for the suggestion, Yosry!