From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 30 Sep 2024 16:19:57 -0700
Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
To: Nhat Pham
Cc: Kanchana P Sridhar, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    hannes@cmpxchg.org, chengming.zhou@linux.dev, usamaarif642@gmail.com,
    shakeel.butt@linux.dev, ryan.roberts@arm.com, ying.huang@intel.com,
    21cnbao@gmail.com, akpm@linux-foundation.org, willy@infradead.org,
    nanhai.zou@intel.com, wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
References: <20240930221221.6981-1-kanchana.p.sridhar@intel.com>
    <20240930221221.6981-7-kanchana.p.sridhar@intel.com>
On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham wrote:
>
> On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar wrote:
> >
> > zswap_store() will store large folios by compressing them page by page.
> >
> > This patch provides a sequential implementation of storing a large folio
> > in zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > zswap_store() calls the newly added zswap_store_page() function for each
> > page in the folio. zswap_store_page() handles compressing and storing
> > each page.
> >
> > We check the global and per-cgroup limits once at the beginning of
> > zswap_store(), and only check that the limit is not reached yet. This is
> > racy and inaccurate, but it should be sufficient for now. We also obtain
> > initial references to the relevant objcg and pool to guarantee that
> > subsequent references can be acquired by zswap_store_page(). A new
> > function zswap_pool_get() is added to facilitate this.
> >
> > If these one-time checks pass, we compress the pages of the folio, while
> > maintaining a running count of compressed bytes for all the folio's
> > pages. If all pages are successfully compressed and stored, we do the
> > cgroup zswap charging with the total compressed bytes, and batch update
> > the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> > once, before returning from zswap_store().
> >
> > If an error is encountered during the store of any page in the folio,
> > all pages in that folio currently stored in zswap will be invalidated.
> > Thus, a folio is either entirely stored in zswap, or entirely not stored
> > in zswap.
> >
> > The most important value provided by this patch is that it enables
> > swapping out large folios to zswap without splitting them. Furthermore,
> > it batches some operations while doing so (cgroup charging, stats
> > updates).
> >
> > This patch also forms the basis for building compress batching of pages
> > in a large folio in zswap_store() by compressing up to, say, 8 pages of
> > the folio in parallel in hardware using the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar
> > ---
> >  mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
> >  1 file changed, 153 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2b8da50f6322..b74c8de99646 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> >         return percpu_ref_tryget(&pool->ref);
> >  }
> >
> > +/* The caller must already have a reference. */
> > +static void zswap_pool_get(struct zswap_pool *pool)
> > +{
> > +       percpu_ref_get(&pool->ref);
> > +}
> > +
> >  static void zswap_pool_put(struct zswap_pool *pool)
> >  {
> >         percpu_ref_put(&pool->ref);
> > @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
> >  /*********************************
> >  * main API
> >  **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @page: The page to store in zswap.
> > + * @objcg: The folio's objcg. Caller has a reference.
> > + * @pool: The zswap_pool to store the compressed data for the page.
> > + *        The caller should have obtained a reference to a valid
> > + *        zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + *        argument.
> > + * @tree: The xarray for the @page's folio's swap.
> > + * @compressed_bytes: The compressed entry->length value is added
> > + *                    to this, so that the caller can get the total
> > + *                    compressed lengths of all sub-pages in a folio.
> > + */
> > +static bool zswap_store_page(struct page *page,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool,
> > +                            struct xarray *tree,
> > +                            size_t *compressed_bytes)
> >  {
> > -       swp_entry_t swp = folio->swap;
> > -       pgoff_t offset = swp_offset(swp);
> > -       struct xarray *tree = swap_zswap_tree(swp);
> >         struct zswap_entry *entry, *old;
> > -       struct obj_cgroup *objcg = NULL;
> > -       struct mem_cgroup *memcg = NULL;
> > -
> > -       VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > -       VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > -
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > -               return false;
> > -
> > -       if (!zswap_enabled)
> > -               goto check_old;
> > -
> > -       /* Check cgroup limits */
> > -       objcg = get_obj_cgroup_from_folio(folio);
> > -       if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > -               memcg = get_mem_cgroup_from_objcg(objcg);
> > -               if (shrink_memcg(memcg)) {
> > -                       mem_cgroup_put(memcg);
> > -                       goto reject;
> > -               }
> > -               mem_cgroup_put(memcg);
> > -       }
> > -
> > -       if (zswap_check_limits())
> > -               goto reject;
> >
> >         /* allocate entry */
> > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
> >         if (!entry) {
> >                 zswap_reject_kmemcache_fail++;
> >                 goto reject;
> >         }
> >
> > -       /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > -               goto freepage;
> > +       /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> > +       zswap_pool_get(pool);
>
> Should we also batch-get references to the pool as well? i.e. add a
> helper function:
>
> /* The caller must already have a reference. */
> static void zswap_pool_get_many(struct zswap_pool *pool, unsigned long nr)
> {
>         percpu_ref_get_many(&pool->ref, nr);
> }
>
> then do it in one fell swoop after you're done storing all individual
> subpages (near atomic_long_add(nr_pages, &zswap_stored_pages)).
>
> Do double check that it is safe - I think it should be, since we have
> the folio locked in swapcache, so there should not be any shenanigans
> (e.g., no race with concurrent free or writeback).
>
> Perhaps a fixlet suffices?

I suggested this in a previous version, and Kanchana ran into some
complexities implementing it:
https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/

Basically, I think batch-getting the refs after the store is not safe:
once an entry is published, it is visible to writeback and can be
written back and freed, and that free would drop a pool ref we never
acquired.

Getting the refs before the store would work, but if the store then
fails at an arbitrary page, we must drop refs only for the pages that
were not added to the tree, because the cleanup loop with
zswap_entry_free() at the end of zswap_store() already drops the ref
for each entry that was added.

We agreed to (potentially) do the batching for refcounts as a followup.
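
To make the subtlety concrete, here is a rough sketch of the
"refs up front" variant (untested, not part of the patch; it assumes
zswap_store_page() is changed to *not* take its own per-page pool ref,
and the two _many() helpers are hypothetical wrappers around the
existing percpu_ref_get_many()/percpu_ref_put_many()):

/* Hypothetical helpers; the caller must already hold a reference. */
static void zswap_pool_get_many(struct zswap_pool *pool, unsigned long nr)
{
        percpu_ref_get_many(&pool->ref, nr);
}

static void zswap_pool_put_many(struct zswap_pool *pool, unsigned long nr)
{
        percpu_ref_put_many(&pool->ref, nr);
}

/* Inside a hypothetical zswap_store(), after the one-time checks: */

        /* Safe to batch here: nothing is published to the tree yet. */
        zswap_pool_get_many(pool, nr_pages);

        for (index = 0; index < nr_pages; ++index) {
                if (!zswap_store_page(folio_page(folio, index), objcg,
                                      pool, tree, &compressed_bytes))
                        goto put_unused_refs;
        }
        /* ... charge objcg, bump stats, return true ... */

put_unused_refs:
        /*
         * Entries for pages [0, index) are already in the tree; the
         * cleanup loop at the end of zswap_store() frees them via
         * zswap_entry_free(), which drops one pool ref each. So only
         * the refs for the pages that never made it into the tree may
         * be dropped here.
         */
        zswap_pool_put_many(pool, nr_pages - index);

That bookkeeping is easy to get subtly wrong, which is why deferring
the refcount batching to a followup seemed reasonable.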
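
And for anyone skimming the thread, this is roughly the zswap_store()
control flow the commit message above describes. It is a condensed
sketch, not the patch verbatim: the zswap_enabled check, memcg
shrinking, event counters, and the tree-invalidation walk are elided.

bool zswap_store(struct folio *folio)
{
        long nr_pages = folio_nr_pages(folio);
        struct xarray *tree = swap_zswap_tree(folio->swap);
        size_t compressed_bytes = 0;
        struct obj_cgroup *objcg;
        struct zswap_pool *pool;
        long index;

        /* One-time, racy checks of the global and per-cgroup limits. */
        objcg = get_obj_cgroup_from_folio(folio);
        if ((objcg && !obj_cgroup_may_zswap(objcg)) || zswap_check_limits())
                goto put_objcg;

        /* Initial pool ref; zswap_store_page() takes per-page refs. */
        pool = zswap_pool_current_get();
        if (!pool)
                goto put_objcg;

        /* Compress and store each page of the folio sequentially. */
        for (index = 0; index < nr_pages; ++index) {
                if (!zswap_store_page(folio_page(folio, index), objcg,
                                      pool, tree, &compressed_bytes))
                        goto store_failed;
        }

        /* Batched cgroup charging and stats, once per folio. */
        if (objcg)
                obj_cgroup_charge_zswap(objcg, compressed_bytes);
        atomic_long_add(nr_pages, &zswap_stored_pages);

        zswap_pool_put(pool);
        obj_cgroup_put(objcg);
        return true;

store_failed:
        /*
         * All-or-nothing: invalidate any entries this folio already
         * stored (elided: walk the folio's offsets in the tree and
         * free each entry found).
         */
        zswap_pool_put(pool);
put_objcg:
        obj_cgroup_put(objcg);
        return false;
}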