From: Yosry Ahmed <yosry.ahmed@linux.dev>
To: "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Nhat Pham <nphamcs@gmail.com>,
Chengming Zhou <chengming.zhou@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: [PATCH v2 mm-hotfixes] mm/zswap: fix inconsistent charging when zswap_store_page() fails
Date: Tue, 28 Jan 2025 19:19:39 +0000
Message-ID: <Z5kty3MCeILaoLwz@google.com>
In-Reply-To: <SA3PR11MB81206771932B54FCFFD0DF2CC9EF2@SA3PR11MB8120.namprd11.prod.outlook.com>
On Tue, Jan 28, 2025 at 07:09:05PM +0000, Sridhar, Kanchana P wrote:
> Hi Hyeonggon,
>
> > -----Original Message-----
> > From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > Sent: Tuesday, January 28, 2025 10:55 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Johannes Weiner
> > <hannes@cmpxchg.org>; Yosry Ahmed <yosryahmed@google.com>; Nhat
> > Pham <nphamcs@gmail.com>; Chengming Zhou
> > <chengming.zhou@linux.dev>; Andrew Morton <akpm@linux-
> > foundation.org>
> > Cc: linux-mm@kvack.org; Hyeonggon Yoo <42.hyeyoo@gmail.com>;
> > stable@vger.kernel.org
> > Subject: [PATCH v2 mm-hotfixes] mm/zswap: fix inconsistent charging when
> > zswap_store_page() fails
> >
> > Commit b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
> > skips charging any zswapped base pages when it fails to zswap the entire
> > folio.
> >
> > However, when some base pages have been zswapped but zswapping the entire
> > folio fails, the zswap operation is rolled back. When freeing the zswap
> > entries for those pages, zswap_entry_free() uncharges pages that were
> > never charged, causing zswap charging to become inconsistent.
> >
> > This inconsistency triggers two warnings with the following steps:
> > # On a machine with 64GiB of RAM and 36GiB of zswap
> > $ stress-ng --bigheap 2 # wait until the OOM-killer kills stress-ng
> > $ sudo reboot
> >
> > The two warnings are:
> > in mm/memcontrol.c:163, function obj_cgroup_release():
> > WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> >
> > in mm/page_counter.c:60, function page_counter_cancel():
> > if (WARN_ONCE(new < 0, "page_counter underflow: %ld nr_pages=%lu\n",
> >               new, nr_pages))
> >
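
For anyone following along, the underflow comes from the entry teardown
path: zswap_entry_free() uncharges the objcg and drops zswap_stored_pages
unconditionally for every entry, so entries that were never charged push
the counters negative. Roughly (a simplified sketch from memory, not the
verbatim code):

	static void zswap_entry_free(struct zswap_entry *entry)
	{
		/* sketch only: lru removal and zpool/pool teardown elided */
		...
		if (entry->objcg) {
			obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
			obj_cgroup_put(entry->objcg);
		}
		zswap_entry_cache_free(entry);
		atomic_long_dec(&zswap_stored_pages);
	}
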
> > While objcg events should only be accounted for when the entire folio is
> > zswapped, objcg charging should be performed regardless.
> > Fix accordingly.
> >
> > After resolving the inconsistency, these warnings disappear.
> >
> > Fixes: b7c0ccdfbafd ("mm: zswap: support large folios in zswap_store()")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > ---
> >
> > v1->v2:
> >
> > Fixed objcg events being accounted for on zswap failure.
> >
> > Fixed the incorrect description. I had misunderstood and thought the base
> > pages would remain stored in zswap, but their zswap entries are actually
> > freed immediately.
> >
> > Added a comment on why pages that are about to be removed from zswap are
> > still charged.
> >
> > mm/zswap.c | 14 ++++++++++----
> > 1 file changed, 10 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 6504174fbc6a..10b30ac46deb 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1568,20 +1568,26 @@ bool zswap_store(struct folio *folio)
> >
> >  		bytes = zswap_store_page(page, objcg, pool);
> >  		if (bytes < 0)
> > -			goto put_pool;
> > +			goto charge_zswap;
> >  		compressed_bytes += bytes;
> >  	}
> >
> > -	if (objcg) {
> > -		obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > +	if (objcg)
> >  		count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > -	}
> >
> >  	atomic_long_add(nr_pages, &zswap_stored_pages);
> >  	count_vm_events(ZSWPOUT, nr_pages);
> >
> >  	ret = true;
> >
> > +charge_zswap:
> > +	/*
> > +	 * Charge zswapped pages even when it failed to zswap the entire folio,
> > +	 * because zswap_entry_free() will uncharge them anyway.
> > +	 * Otherwise zswap charging will become inconsistent.
> > +	 */
> > +	if (objcg)
> > +		obj_cgroup_charge_zswap(objcg, compressed_bytes);
>
> Thanks for finding this bug! I am thinking it might make sense to charge
> and increment the zswap_stored_pages counter in zswap_store_page().
> Something like:
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index b84c20d889b1..fd2a72598a8a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1504,11 +1504,14 @@ static ssize_t zswap_store_page(struct page *page,
>  	entry->pool = pool;
>  	entry->swpentry = page_swpentry;
>  	entry->objcg = objcg;
> +	if (objcg)
> +		obj_cgroup_charge_zswap(objcg, entry->length);
>  	entry->referenced = true;
>  	if (entry->length) {
>  		INIT_LIST_HEAD(&entry->lru);
>  		zswap_lru_add(&zswap_list_lru, entry);
>  	}
> +	atomic_long_inc(&zswap_stored_pages);
>
>  	return entry->length;
>
> @@ -1526,7 +1529,6 @@ bool zswap_store(struct folio *folio)
>  	struct obj_cgroup *objcg = NULL;
>  	struct mem_cgroup *memcg = NULL;
>  	struct zswap_pool *pool;
> -	size_t compressed_bytes = 0;
>  	bool ret = false;
>  	long index;
>
> @@ -1569,15 +1571,11 @@ bool zswap_store(struct folio *folio)
>  		bytes = zswap_store_page(page, objcg, pool);
>  		if (bytes < 0)
>  			goto put_pool;
> -		compressed_bytes += bytes;
>  	}
>
> -	if (objcg) {
> -		obj_cgroup_charge_zswap(objcg, compressed_bytes);
> +	if (objcg)
>  		count_objcg_events(objcg, ZSWPOUT, nr_pages);
> -	}
>
> -	atomic_long_add(nr_pages, &zswap_stored_pages);
>  	count_vm_events(ZSWPOUT, nr_pages);
>
>  	ret = true;
>
> What do you think?
>
> Yosry, Nhat, Johannes, please let me know if this would be a cleaner
> approach. If so, I don't think we would be losing a lot of performance
> by not doing the one-time charge per folio, but please let me know
> your thoughts as well.

This is certainly cleaner, and thanks for catching that the
zswap_stored_pages accounting is also wrong.

I am not sure whether this has a meaningful impact on performance, but it
seems like we are doing a bit more work in the common success case to
avoid the work in the uncommon failure case.

Moving the charge (and the atomic addition) above the zswap_store_page()
loop would be doing the opposite, albeit less cleanly.

I don't feel strongly either way, but I slightly prefer the latter.
>
> Thanks,
> Kanchana
>
> >  put_pool:
> >  	zswap_pool_put(pool);
> >  put_objcg:
> > --
> > 2.47.1
>
>
Thread overview: 8+ messages
2025-01-28 18:55 Hyeonggon Yoo
2025-01-28 19:03 ` Yosry Ahmed
2025-01-28 19:19 ` Sridhar, Kanchana P
2025-01-28 19:09 ` Sridhar, Kanchana P
2025-01-28 19:19 ` Yosry Ahmed [this message]
2025-01-29 5:48 ` Hyeonggon Yoo
2025-01-29 6:40 ` Hyeonggon Yoo
2025-01-29 7:56 ` Sridhar, Kanchana P