From: Yosry Ahmed
Date: Tue, 24 Sep 2024 12:38:32 -0700
Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
In-Reply-To: <20240924011709.7037-7-kanchana.p.sridhar@intel.com>
References: <20240924011709.7037-1-kanchana.p.sridhar@intel.com> <20240924011709.7037-7-kanchana.p.sridhar@intel.com>
To: Kanchana P Sridhar
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, hannes@cmpxchg.org,
    nphamcs@gmail.com, chengming.zhou@linux.dev, usamaarif642@gmail.com,
    shakeel.butt@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com,
    21cnbao@gmail.com, akpm@linux-foundation.org, nanhai.zou@intel.com,
    wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar wrote:
>
> zswap_store() will now store mTHP and PMD-size THP folios by compressing
> them page by page.
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. The corresponding tunable
> zswap module parameter is "mthp_enabled".
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
> "[RFC,v1] mm: zswap: Store large folios without splitting"
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar
> ---
>  mm/Kconfig |   8 ++++
>  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
>  2 files changed, 66 insertions(+), 64 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09aebca1cae3..c659fb732ec4 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
>           reducing the chance that cold pages will reside in the zswap pool
>           and consume memory indefinitely.
>
> +config ZSWAP_STORE_THP_DEFAULT_ON
> +       bool "Store mTHP and THP folios in zswap"
> +       depends on ZSWAP
> +       default n
> +       help
> +         If selected, zswap will process mTHP and THP folios by
> +         compressing and storing each 4K page in the large folio.
> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 8f2e0ab34c84..16ab770546d6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable zswap processing of mTHP folios.
> + * For now, only zswap_store will process mTHP folios.
> + */
> +static bool zswap_mthp_enabled = IS_ENABLED(
> +       CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> +
>  bool zswap_is_enabled(void)
>  {
>         return zswap_enabled;
> @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
>   * @objcg: The folio's objcg.
>   * @pool: The zswap_pool to store the compressed data for the page.
>   */
> -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> -                                           struct obj_cgroup *objcg,
> -                                           struct zswap_pool *pool)
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool)

As I mentioned earlier, the patch that introduced zswap_store_page()
should have directly used it in zswap_store(). This would make this
patch much clearer.

> {
>         swp_entry_t swp = folio->swap;
>         int type = swp_type(swp);
> @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
>         return false;
> }
>
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
>  bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +       long index;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       /* Storing large folios isn't enabled */

The comment is now stating the obvious, please remove it.

> +       if (!zswap_mthp_enabled && folio_test_large(folio))
>                 return false;
>
>         if (!zswap_enabled)
> -               goto check_old;
> +               goto reject;
>
> -       /* Check cgroup limits */
> +       /*
> +        * Check cgroup limits:
> +        *
> +        * The cgroup zswap limit check is done once at the beginning of an
> +        * mTHP store, and not within zswap_store_page() for each page
> +        * in the mTHP. We do however check the zswap pool limits at the
> +        * start of zswap_store_page(). What this means is, the cgroup
> +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> +        * However, the per-store-page zswap pool limits check should
> +        * hopefully trigger the cgroup aware and zswap LRU aware global
> +        * reclaim implemented in the shrinker. If this assumption holds,
> +        * the cgroup exceeding the zswap limits could potentially be
> +        * resolved before the next zswap_store, and if it is not, the next
> +        * zswap_store would fail the cgroup zswap limit check at the start.
> +        */

I do not really like this. Allowing going one page above the limit is
one thing, but one THP above the limit seems too much. I also don't
like relying on the repeated limit checking in zswap_store_page(), if
anything I think that should be batched too.

Is it too unreasonable to maintain the average compression ratio and
use that to estimate limit checking for both memcg and global limits?
Johannes, Nhat, any thoughts on this?

>         objcg = get_obj_cgroup_from_folio(folio);
>         if (objcg && !obj_cgroup_may_zswap(objcg)) {
>                 memcg = get_mem_cgroup_from_objcg(objcg);
>                 if (shrink_memcg(memcg)) {
>                         mem_cgroup_put(memcg);
> -                       goto reject;
> +                       goto put_objcg;
>                 }
>                 mem_cgroup_put(memcg);
>         }
>
>         if (zswap_check_limits())
> -               goto reject;
> -
> -       /* allocate entry */
> -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> -       if (!entry) {
> -               zswap_reject_kmemcache_fail++;
> -               goto reject;
> -       }
> +               goto put_objcg;
>
> -       /* if entry is successfully added, it keeps the reference */
> -       entry->pool = zswap_pool_current_get();
> -       if (!entry->pool)
> -               goto freepage;
> +       pool = zswap_pool_current_get();
> +       if (!pool)
> +               goto put_objcg;
>
>         if (objcg) {
>                 memcg = get_mem_cgroup_from_objcg(objcg);
> @@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio)
>                 mem_cgroup_put(memcg);
>         }
>
> -       if (!zswap_compress(&folio->page, entry))
> -               goto put_pool;
> -
> -       entry->swpentry = swp;
> -       entry->objcg = objcg;
> -       entry->referenced = true;
> -
> -       if (!zswap_store_entry(tree, entry))
> -               goto store_failed;
> -
> -       if (objcg) {
> -               obj_cgroup_charge_zswap(objcg, entry->length);
> -               count_objcg_event(objcg, ZSWPOUT);
> -       }
> -
>         /*
> -        * We finish initializing the entry while it's already in xarray.
> -        * This is safe because:
> -        *
> -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> -        *
> -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> -        *    The publishing order matters to prevent writeback from seeing
> -        *    an incoherent entry.
> +        * Store each page of the folio as a separate entry. If we fail to store
> +        * a page, unwind by removing all the previous pages we stored.
>          */
> -       if (entry->length) {
> -               INIT_LIST_HEAD(&entry->lru);
> -               zswap_lru_add(&zswap_list_lru, entry);
> +       for (index = 0; index < nr_pages; ++index) {
> +               if (!zswap_store_page(folio, index, objcg, pool))
> +                       goto put_pool;
>         }
>
> -       /* update stats */
> -       atomic_inc(&zswap_stored_pages);
> -       count_vm_event(ZSWPOUT);
> -
> -       return true;
> +       ret = true;
>
> -store_failed:
> -       zpool_free(entry->pool->zpool, entry->handle);
>  put_pool:
> -       zswap_pool_put(entry->pool);
> -freepage:
> -       zswap_entry_cache_free(entry);
> -reject:
> +       zswap_pool_put(pool);
> +put_objcg:
>         obj_cgroup_put(objcg);
>         if (zswap_pool_reached_full)
>                 queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> +reject:
>         /*
> -        * If the zswap store fails or zswap is disabled, we must invalidate the
> -        * possibly stale entry which was previously stored at this offset.
> -        * Otherwise, writeback could overwrite the new data in the swapfile.
> +        * If the zswap store fails or zswap is disabled, we must invalidate
> +        * the possibly stale entries which were previously stored at the
> +        * offsets corresponding to each page of the folio. Otherwise,
> +        * writeback could overwrite the new data in the swapfile.
>          */
> -       zswap_delete_stored_offsets(tree, offset, nr_pages);
> -       return false;
> +       if (!ret)
> +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> +
> +       return ret;
> }
>
> bool zswap_load(struct folio *folio)
> --
> 2.27.0
>