[PATCH v3 0/2] mm: zswap: fixes for global shrinker

* [PATCH v3 0/2] mm: zswap: fixes for global shrinker
@ 2024-07-20  4:41 Takero Funaki
  2024-07-20  4:41 ` [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration Takero Funaki
  2024-07-20  4:41 ` [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic Takero Funaki
  0 siblings, 2 replies; 16+ messages in thread
From: Takero Funaki @ 2024-07-20  4:41 UTC
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou, Andrew Morton
  Cc: Takero Funaki, linux-mm, linux-kernel

This series fixes issues in the zswap global shrinker that prevented it
from shrinking stored pages. With this series, the shrinker more reliably
continues shrinking pages until it reaches the accept threshold.

These patches were extracted and updated from the original patch series
v2 (mm: zswap: global shrinker fix and proactive shrink):
https://lore.kernel.org/linux-mm/20240706022523.1104080-1-flintglass@gmail.com/

Changes in v3:
- Extract fixes for shrinker as a separate patch series.
- Fix comments and commit messages. (Chengming, Yosry)
- Drop logic to detect rare doubly advancing cursor. (Yosry)

Changes in v2:
mm: zswap: fix global shrinker memcg iteration:
- Change the loop style (Yosry, Nhat, Shakeel)
mm: zswap: fix global shrinker error handling logic:
- Change error code for no-writeback memcg. (Yosry)
- Use nr_scanned to check if lru is empty. (Yosry)

Changes in v1:
mm: zswap: fix global shrinker memcg iteration:
- Drop and reacquire spinlock before skipping a memcg.
- Add comments to clarify the locking mechanism.

---

Takero Funaki (2):
  mm: zswap: fix global shrinker memcg iteration
  mm: zswap: fix global shrinker error handling logic

 mm/zswap.c | 100 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 73 insertions(+), 27 deletions(-)

-- 
2.43.0




* [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration
  2024-07-20  4:41 [PATCH v3 0/2] mm: zswap: fixes for global shrinker Takero Funaki
@ 2024-07-20  4:41 ` Takero Funaki
  2024-07-22 21:39   ` Nhat Pham
                     ` (3 more replies)
  2024-07-20  4:41 ` [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic Takero Funaki
  1 sibling, 4 replies; 16+ messages in thread
From: Takero Funaki @ 2024-07-20  4:41 UTC
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou, Andrew Morton
  Cc: Takero Funaki, linux-mm, linux-kernel

This patch fixes an issue where the zswap global shrinker stopped
iterating through the memcg tree.

The problem was that shrink_worker() would stop iterating when a memcg
was being offlined and restart from the tree root.  Now, it properly
handles the offline memcg and continues shrinking with the next memcg.

To avoid holding a reference to an offline memcg encountered during the
memcg tree walk, shrink_worker() must keep iterating, releasing each
offline memcg, until the memcg stored in the cursor is online.

The offline memcg cleaner has also been changed to avoid the same issue.
When the memcg following the offlined memcg was also offline, the
reference stored in the iteration cursor was held until the next
shrink_worker() run. The cleaner must therefore keep advancing the cursor
until every consecutive offline memcg has been released.

Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 mm/zswap.c | 77 +++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 56 insertions(+), 21 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index a50e2986cd2f..6528668c9af3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -775,12 +775,33 @@ void zswap_folio_swapin(struct folio *folio)
 	}
 }
 
+/*
+ * This function should be called when a memcg is being offlined.
+ *
+ * Since the global shrinker shrink_worker() may hold a reference
+ * of the memcg, we must check and release the reference in
+ * zswap_next_shrink.
+ *
+ * shrink_worker() must handle the case where this function releases
+ * the reference of memcg being shrunk.
+ */
 void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
 {
 	/* lock out zswap shrinker walking memcg tree */
 	spin_lock(&zswap_shrink_lock);
-	if (zswap_next_shrink == memcg)
-		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+	if (zswap_next_shrink == memcg) {
+		do {
+			zswap_next_shrink = mem_cgroup_iter(NULL,
+					zswap_next_shrink, NULL);
+		} while (zswap_next_shrink &&
+				!mem_cgroup_online(zswap_next_shrink));
+		/*
+		 * We verified the next memcg is online.  Even if the next
+		 * memcg is being offlined here, another cleaner must be
+		 * waiting for our lock.  We can leave the online memcg
+		 * reference.
+		 */
+	}
 	spin_unlock(&zswap_shrink_lock);
 }
 
@@ -1319,18 +1340,38 @@ static void shrink_worker(struct work_struct *w)
 	/* Reclaim down to the accept threshold */
 	thr = zswap_accept_thr_pages();
 
-	/* global reclaim will select cgroup in a round-robin fashion. */
+	/* global reclaim will select cgroup in a round-robin fashion.
+	 *
+	 * We save iteration cursor memcg into zswap_next_shrink,
+	 * which can be modified by the offline memcg cleaner
+	 * zswap_memcg_offline_cleanup().
+	 *
+	 * Since the offline cleaner is called only once, we cannot leave an
+	 * offline memcg reference in zswap_next_shrink.
+	 * We can rely on the cleaner only if we get online memcg under lock.
+	 *
+	 * If we get an offline memcg, we cannot determine if the cleaner has
+	 * already been called or will be called later. We must put back the
+	 * reference before returning from this function. Otherwise, the
+	 * offline memcg left in zswap_next_shrink will hold the reference
+	 * until the next run of shrink_worker().
+	 */
 	do {
 		spin_lock(&zswap_shrink_lock);
-		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
-		memcg = zswap_next_shrink;
 
 		/*
-		 * We need to retry if we have gone through a full round trip, or if we
-		 * got an offline memcg (or else we risk undoing the effect of the
-		 * zswap memcg offlining cleanup callback). This is not catastrophic
-		 * per se, but it will keep the now offlined memcg hostage for a while.
-		 *
+		 * Start shrinking from the next memcg after zswap_next_shrink.
+		 * When the offline cleaner has already advanced the cursor,
+		 * advancing the cursor here overlooks one memcg, but this
+		 * should be negligibly rare.
+		 */
+		do {
+			zswap_next_shrink = mem_cgroup_iter(NULL,
+						zswap_next_shrink, NULL);
+			memcg = zswap_next_shrink;
+		} while (memcg && !mem_cgroup_tryget_online(memcg));
+
+		/*
 		 * Note that if we got an online memcg, we will keep the extra
 		 * reference in case the original reference obtained by mem_cgroup_iter
 		 * is dropped by the zswap memcg offlining callback, ensuring that the
@@ -1344,17 +1385,11 @@ static void shrink_worker(struct work_struct *w)
 			goto resched;
 		}
 
-		if (!mem_cgroup_tryget_online(memcg)) {
-			/* drop the reference from mem_cgroup_iter() */
-			mem_cgroup_iter_break(NULL, memcg);
-			zswap_next_shrink = NULL;
-			spin_unlock(&zswap_shrink_lock);
-
-			if (++failures == MAX_RECLAIM_RETRIES)
-				break;
-
-			goto resched;
-		}
+		/*
+		 * We verified the memcg is online and got an extra memcg
+		 * reference.  Our memcg might be offlined concurrently but the
+		 * respective offline cleaner must be waiting for our lock.
+		 */
 		spin_unlock(&zswap_shrink_lock);
 
 		ret = shrink_memcg(memcg);
-- 
2.43.0




* [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic
  2024-07-20  4:41 [PATCH v3 0/2] mm: zswap: fixes for global shrinker Takero Funaki
  2024-07-20  4:41 ` [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration Takero Funaki
@ 2024-07-20  4:41 ` Takero Funaki
  2024-07-22 21:51   ` Nhat Pham
  1 sibling, 1 reply; 16+ messages in thread
From: Takero Funaki @ 2024-07-20  4:41 UTC
  To: Johannes Weiner, Yosry Ahmed, Nhat Pham, Chengming Zhou, Andrew Morton
  Cc: Takero Funaki, linux-mm, linux-kernel

This patch fixes the zswap global shrinker, which did not shrink the
zpool as expected.

The issue it addresses is that `shrink_worker()` did not distinguish
between unexpected errors and expected error codes that should be
skipped, such as when there is no stored page in a memcg. This led to
the shrinking process being aborted on the expected error codes.

The shrinker should ignore these cases and skip to the next memcg.
However, skipping all memcgs presents another problem. To address this,
this patch tracks progress while walking the memcg tree and checks for
progress once the tree walk is completed.

To handle the empty memcg case, the helper function `shrink_memcg()` is
modified to check if the memcg is empty and then return -ENOENT.

Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
Signed-off-by: Takero Funaki <flintglass@gmail.com>
---
 mm/zswap.c | 23 +++++++++++++++++------
 1 file changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 6528668c9af3..053d5be81d9a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1310,10 +1310,10 @@ static struct shrinker *zswap_alloc_shrinker(void)
 
 static int shrink_memcg(struct mem_cgroup *memcg)
 {
-	int nid, shrunk = 0;
+	int nid, shrunk = 0, scanned = 0;
 
 	if (!mem_cgroup_zswap_writeback_enabled(memcg))
-		return -EINVAL;
+		return -ENOENT;
 
 	/*
 	 * Skip zombies because their LRUs are reparented and we would be
@@ -1327,14 +1327,19 @@ static int shrink_memcg(struct mem_cgroup *memcg)
 
 		shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
 					    &shrink_memcg_cb, NULL, &nr_to_walk);
+		scanned += 1 - nr_to_walk;
 	}
+
+	if (!scanned)
+		return -ENOENT;
+
 	return shrunk ? 0 : -EAGAIN;
 }
 
 static void shrink_worker(struct work_struct *w)
 {
 	struct mem_cgroup *memcg;
-	int ret, failures = 0;
+	int ret, failures = 0, progress = 0;
 	unsigned long thr;
 
 	/* Reclaim down to the accept threshold */
@@ -1379,9 +1384,12 @@ static void shrink_worker(struct work_struct *w)
 		 */
 		if (!memcg) {
 			spin_unlock(&zswap_shrink_lock);
-			if (++failures == MAX_RECLAIM_RETRIES)
+
+			/* tree walk completed but no progress */
+			if (!progress && ++failures == MAX_RECLAIM_RETRIES)
 				break;
 
+			progress = 0;
 			goto resched;
 		}
 
@@ -1396,10 +1404,13 @@ static void shrink_worker(struct work_struct *w)
 		/* drop the extra reference */
 		mem_cgroup_put(memcg);
 
-		if (ret == -EINVAL)
-			break;
+		if (ret == -ENOENT)
+			continue;
+
 		if (ret && ++failures == MAX_RECLAIM_RETRIES)
 			break;
+
+		++progress;
 resched:
 		cond_resched();
 	} while (zswap_total_pages() > thr);
-- 
2.43.0




* Re: [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration
  2024-07-20  4:41 ` [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration Takero Funaki
@ 2024-07-22 21:39   ` Nhat Pham
  2024-07-23 15:35     ` Takero Funaki
  2024-07-23  6:30   ` Yosry Ahmed
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Nhat Pham @ 2024-07-22 21:39 UTC
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Andrew Morton,
	linux-mm, linux-kernel

On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that shrink_worker() would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offline memcg and continues shrinking with the next memcg.
>
> To avoid holding refcount of offline memcg encountered during the memcg
> tree walking, shrink_worker() must continue iterating to release the
> offline memcg to ensure the next memcg stored in the cursor is online.
>
> The offline memcg cleaner has also been changed to avoid the same issue.
> When the next memcg of the offlined memcg is also offline, the refcount
> stored in the iteration cursor was held until the next shrink_worker()
> run. The cleaner must release the offline memcg recursively.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
Hmm LGTM for the most part - a couple nits
[...]
> +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> +                                       zswap_next_shrink, NULL);
nit: this can fit in a single line right? Looks like it's exactly 80 characters.
[...]
> +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> +                                               zswap_next_shrink, NULL);
Same with this.
[...]
> +               /*
> +                * We verified the memcg is online and got an extra memcg
> +                * reference.  Our memcg might be offlined concurrently but the
> +                * respective offline cleaner must be waiting for our lock.
> +                */
>                 spin_unlock(&zswap_shrink_lock);
nit: can we remove this spin_unlock() call + the one within the `if
(!memcg)` block, and just do it unconditionally outside of if
(!memcg)? Looks like we are unlocking regardless of whether memcg is
null or not.

memcg is a local variable, not protected by zswap_shrink_lock, so this
should be fine right?

Otherwise:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>



* Re: [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic
  2024-07-20  4:41 ` [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic Takero Funaki
@ 2024-07-22 21:51   ` Nhat Pham
  2024-07-23 16:44     ` Takero Funaki
  0 siblings, 1 reply; 16+ messages in thread
From: Nhat Pham @ 2024-07-22 21:51 UTC
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Andrew Morton,
	linux-mm, linux-kernel

On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes zswap global shrinker that did not shrink zpool as
> expected.
>
> The issue it addresses is that `shrink_worker()` did not distinguish
> between unexpected errors and expected error codes that should be
> skipped, such as when there is no stored page in a memcg. This led to
> the shrinking process being aborted on the expected error codes.

The code itself seems reasonable to me, but may I ask you to document
(as a comment) all the expected vs. unexpected cases? i.e. when do we
increment (or not increment) the failure counter?

My understanding is, we only increment the failure counter if we fail
to reclaim from a selected memcg that is non-empty and
writeback-enabled, or if we go a full tree walk without making any
progress. Is this correct?

>
> The shrinker should ignore these cases and skip to the next memcg.
> However,  skipping all memcgs presents another problem. To address this,
> this patch tracks progress while walking the memcg tree and checks for
> progress once the tree walk is completed.
>
> To handle the empty memcg case, the helper function `shrink_memcg()` is
> modified to check if the memcg is empty and then return -ENOENT.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>  mm/zswap.c | 23 +++++++++++++++++------
>  1 file changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 6528668c9af3..053d5be81d9a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1310,10 +1310,10 @@ static struct shrinker *zswap_alloc_shrinker(void)
>
>  static int shrink_memcg(struct mem_cgroup *memcg)
>  {
> -       int nid, shrunk = 0;
> +       int nid, shrunk = 0, scanned = 0;
>
>         if (!mem_cgroup_zswap_writeback_enabled(memcg))
> -               return -EINVAL;
> +               return -ENOENT;
>
>         /*
>          * Skip zombies because their LRUs are reparented and we would be
> @@ -1327,14 +1327,19 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>
>                 shrunk += list_lru_walk_one(&zswap_list_lru, nid, memcg,
>                                             &shrink_memcg_cb, NULL, &nr_to_walk);
> +               scanned += 1 - nr_to_walk;
>         }
> +
> +       if (!scanned)
> +               return -ENOENT;
> +
>         return shrunk ? 0 : -EAGAIN;
>  }
>
>  static void shrink_worker(struct work_struct *w)
>  {
>         struct mem_cgroup *memcg;
> -       int ret, failures = 0;
> +       int ret, failures = 0, progress = 0;
>         unsigned long thr;
>
>         /* Reclaim down to the accept threshold */
> @@ -1379,9 +1384,12 @@ static void shrink_worker(struct work_struct *w)
>                  */
>                 if (!memcg) {
>                         spin_unlock(&zswap_shrink_lock);
> -                       if (++failures == MAX_RECLAIM_RETRIES)
> +
> +                       /* tree walk completed but no progress */
> +                       if (!progress && ++failures == MAX_RECLAIM_RETRIES)
>                                 break;
>
> +                       progress = 0;
>                         goto resched;
>                 }
>
> @@ -1396,10 +1404,13 @@ static void shrink_worker(struct work_struct *w)
>                 /* drop the extra reference */
>                 mem_cgroup_put(memcg);
>
> -               if (ret == -EINVAL)
> -                       break;
> +               if (ret == -ENOENT)
> +                       continue;
> +
>                 if (ret && ++failures == MAX_RECLAIM_RETRIES)
>                         break;
> +
> +               ++progress;
>  resched:
>                 cond_resched();
>         } while (zswap_total_pages() > thr);
> --
> 2.43.0
>



* Re: [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration
  2024-07-20  4:41 ` [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration Takero Funaki
  2024-07-22 21:39   ` Nhat Pham
@ 2024-07-23  6:30   ` Yosry Ahmed
  2024-07-23  6:37   ` Yosry Ahmed
  2024-07-26  2:47   ` Chengming Zhou
  3 siblings, 0 replies; 16+ messages in thread
From: Yosry Ahmed @ 2024-07-23  6:30 UTC
  To: Takero Funaki
  Cc: Johannes Weiner, Nhat Pham, Chengming Zhou, Andrew Morton,
	linux-mm, linux-kernel

On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that shrink_worker() would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offline memcg and continues shrinking with the next memcg.
>
> To avoid holding refcount of offline memcg encountered during the memcg
> tree walking, shrink_worker() must continue iterating to release the
> offline memcg to ensure the next memcg stored in the cursor is online.
>
> The offline memcg cleaner has also been changed to avoid the same issue.
> When the next memcg of the offlined memcg is also offline, the refcount
> stored in the iteration cursor was held until the next shrink_worker()
> run. The cleaner must release the offline memcg recursively.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>  mm/zswap.c | 77 +++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 56 insertions(+), 21 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a50e2986cd2f..6528668c9af3 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -775,12 +775,33 @@ void zswap_folio_swapin(struct folio *folio)
>         }
>  }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>  {
>         /* lock out zswap shrinker walking memcg tree */
>         spin_lock(&zswap_shrink_lock);
> -       if (zswap_next_shrink == memcg)
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +       if (zswap_next_shrink == memcg) {
> +               do {
> +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> +                                       zswap_next_shrink, NULL);
> +               } while (zswap_next_shrink &&
> +                               !mem_cgroup_online(zswap_next_shrink));
> +               /*
> +                * We verified the next memcg is online.  Even if the next
> +                * memcg is being offlined here, another cleaner must be
> +                * waiting for our lock.  We can leave the online memcg
> +                * reference.
> +                */

I think this comment and the similar one at the end of the loop in
shrink_worker() are very similar and not necessary. The large comment
above the loop in shrink_worker() already explains how that loop and
the offline memcg cleaner interact, and I think the locking follows
naturally from there. You can explicitly mention the locking there as
well if you think it helps, but I think these comments are a little
repetitive and do not add much value.

I don't feel strongly about it tho, if Nhat feels like they add value
then I am okay with that.

Otherwise, and with Nhat's other comments addressed:
Acked-by: Yosry Ahmed <yosryahmed@google.com>

> +       }
>         spin_unlock(&zswap_shrink_lock);
>  }
>
> @@ -1319,18 +1340,38 @@ static void shrink_worker(struct work_struct *w)
>         /* Reclaim down to the accept threshold */
>         thr = zswap_accept_thr_pages();
>
> -       /* global reclaim will select cgroup in a round-robin fashion. */
> +       /* global reclaim will select cgroup in a round-robin fashion.
> +        *
> +        * We save iteration cursor memcg into zswap_next_shrink,
> +        * which can be modified by the offline memcg cleaner
> +        * zswap_memcg_offline_cleanup().
> +        *
> +        * Since the offline cleaner is called only once, we cannot leave an
> +        * offline memcg reference in zswap_next_shrink.
> +        * We can rely on the cleaner only if we get online memcg under lock.
> +        *
> +        * If we get an offline memcg, we cannot determine if the cleaner has
> +        * already been called or will be called later. We must put back the
> +        * reference before returning from this function. Otherwise, the
> +        * offline memcg left in zswap_next_shrink will hold the reference
> +        * until the next run of shrink_worker().
> +        */
>         do {
>                 spin_lock(&zswap_shrink_lock);
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> -               memcg = zswap_next_shrink;
>
>                 /*
> -                * We need to retry if we have gone through a full round trip, or if we
> -                * got an offline memcg (or else we risk undoing the effect of the
> -                * zswap memcg offlining cleanup callback). This is not catastrophic
> -                * per se, but it will keep the now offlined memcg hostage for a while.
> -                *
> +                * Start shrinking from the next memcg after zswap_next_shrink.
> +                * When the offline cleaner has already advanced the cursor,
> +                * advancing the cursor here overlooks one memcg, but this
> +                * should be negligibly rare.
> +                */
> +               do {
> +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> +                                               zswap_next_shrink, NULL);
> +                       memcg = zswap_next_shrink;
> +               } while (memcg && !mem_cgroup_tryget_online(memcg));
> +
> +               /*
>                  * Note that if we got an online memcg, we will keep the extra
>                  * reference in case the original reference obtained by mem_cgroup_iter
>                  * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1344,17 +1385,11 @@ static void shrink_worker(struct work_struct *w)
>                         goto resched;
>                 }
>
> -               if (!mem_cgroup_tryget_online(memcg)) {
> -                       /* drop the reference from mem_cgroup_iter() */
> -                       mem_cgroup_iter_break(NULL, memcg);
> -                       zswap_next_shrink = NULL;
> -                       spin_unlock(&zswap_shrink_lock);
> -
> -                       if (++failures == MAX_RECLAIM_RETRIES)
> -                               break;
> -
> -                       goto resched;
> -               }
> +               /*
> +                * We verified the memcg is online and got an extra memcg
> +                * reference.  Our memcg might be offlined concurrently but the
> +                * respective offline cleaner must be waiting for our lock.
> +                */
>                 spin_unlock(&zswap_shrink_lock);
>
>                 ret = shrink_memcg(memcg);
> --
> 2.43.0
>



* Re: [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration
  2024-07-20  4:41 ` [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration Takero Funaki
  2024-07-22 21:39   ` Nhat Pham
  2024-07-23  6:30   ` Yosry Ahmed
@ 2024-07-23  6:37   ` Yosry Ahmed
  2024-07-23 15:56     ` Takero Funaki
  2024-07-26  2:47   ` Chengming Zhou
  3 siblings, 1 reply; 16+ messages in thread
From: Yosry Ahmed @ 2024-07-23  6:37 UTC
  To: Takero Funaki
  Cc: Johannes Weiner, Nhat Pham, Chengming Zhou, Andrew Morton,
	linux-mm, linux-kernel

On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that shrink_worker() would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offline memcg and continues shrinking with the next memcg.

It is probably worth explicitly calling out that before this change,
the shrinker would count an offline memcg as a failure and stop after
hitting 16 failures, but after this change, offline memcgs are skipped
and a failure is counted on hitting the end of the tree instead. This
means that cgroup trees with a lot of offline cgroups will now observe
significantly higher zswap writeback activity.

Similarly, in the next patch commit log, please explicitly call out
the expected behavioral change, that hitting an empty memcg or
reaching the end of a tree is no longer considered a failure if there
is progress, which means that trees with a few cgroups using zswap
will now observe significantly higher zswap writeback activity.

>
> To avoid holding refcount of offline memcg encountered during the memcg
> tree walking, shrink_worker() must continue iterating to release the
> offline memcg to ensure the next memcg stored in the cursor is online.
>
> The offline memcg cleaner has also been changed to avoid the same issue.
> When the next memcg of the offlined memcg is also offline, the refcount
> stored in the iteration cursor was held until the next shrink_worker()
> run. The cleaner must release the offline memcg recursively.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>
> ---
>  mm/zswap.c | 77 +++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 56 insertions(+), 21 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a50e2986cd2f..6528668c9af3 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -775,12 +775,33 @@ void zswap_folio_swapin(struct folio *folio)
>         }
>  }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>  {
>         /* lock out zswap shrinker walking memcg tree */
>         spin_lock(&zswap_shrink_lock);
> -       if (zswap_next_shrink == memcg)
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +       if (zswap_next_shrink == memcg) {
> +               do {
> +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> +                                       zswap_next_shrink, NULL);
> +               } while (zswap_next_shrink &&
> +                               !mem_cgroup_online(zswap_next_shrink));
> +               /*
> +                * We verified the next memcg is online.  Even if the next
> +                * memcg is being offlined here, another cleaner must be
> +                * waiting for our lock.  We can leave the online memcg
> +                * reference.
> +                */
> +       }
>         spin_unlock(&zswap_shrink_lock);
>  }
>
> @@ -1319,18 +1340,38 @@ static void shrink_worker(struct work_struct *w)
>         /* Reclaim down to the accept threshold */
>         thr = zswap_accept_thr_pages();
>
> -       /* global reclaim will select cgroup in a round-robin fashion. */
> +       /* global reclaim will select cgroup in a round-robin fashion.
> +        *
> +        * We save iteration cursor memcg into zswap_next_shrink,
> +        * which can be modified by the offline memcg cleaner
> +        * zswap_memcg_offline_cleanup().
> +        *
> +        * Since the offline cleaner is called only once, we cannot leave an
> +        * offline memcg reference in zswap_next_shrink.
> +        * We can rely on the cleaner only if we get online memcg under lock.
> +        *
> +        * If we get an offline memcg, we cannot determine if the cleaner has
> +        * already been called or will be called later. We must put back the
> +        * reference before returning from this function. Otherwise, the
> +        * offline memcg left in zswap_next_shrink will hold the reference
> +        * until the next run of shrink_worker().
> +        */
>         do {
>                 spin_lock(&zswap_shrink_lock);
> -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> -               memcg = zswap_next_shrink;
>
>                 /*
> -                * We need to retry if we have gone through a full round trip, or if we
> -                * got an offline memcg (or else we risk undoing the effect of the
> -                * zswap memcg offlining cleanup callback). This is not catastrophic
> -                * per se, but it will keep the now offlined memcg hostage for a while.
> -                *
> +                * Start shrinking from the next memcg after zswap_next_shrink.
> +                * When the offline cleaner has already advanced the cursor,
> +                * advancing the cursor here overlooks one memcg, but this
> +                * should be negligibly rare.
> +                */
> +               do {
> +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> +                                               zswap_next_shrink, NULL);
> +                       memcg = zswap_next_shrink;
> +               } while (memcg && !mem_cgroup_tryget_online(memcg));
> +
> +               /*
>                  * Note that if we got an online memcg, we will keep the extra
>                  * reference in case the original reference obtained by mem_cgroup_iter
>                  * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1344,17 +1385,11 @@ static void shrink_worker(struct work_struct *w)
>                         goto resched;
>                 }
>
> -               if (!mem_cgroup_tryget_online(memcg)) {
> -                       /* drop the reference from mem_cgroup_iter() */
> -                       mem_cgroup_iter_break(NULL, memcg);
> -                       zswap_next_shrink = NULL;
> -                       spin_unlock(&zswap_shrink_lock);
> -
> -                       if (++failures == MAX_RECLAIM_RETRIES)
> -                               break;
> -
> -                       goto resched;
> -               }
> +               /*
> +                * We verified the memcg is online and got an extra memcg
> +                * reference.  Our memcg might be offlined concurrently but the
> +                * respective offline cleaner must be waiting for our lock.
> +                */
>                 spin_unlock(&zswap_shrink_lock);
>
>                 ret = shrink_memcg(memcg);
> --
> 2.43.0
>



* Re: [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration
  2024-07-22 21:39   ` Nhat Pham
@ 2024-07-23 15:35     ` Takero Funaki
  2024-07-23 15:55       ` Nhat Pham
  0 siblings, 1 reply; 16+ messages in thread
From: Takero Funaki @ 2024-07-23 15:35 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Andrew Morton,
	linux-mm, linux-kernel

On Tue, Jul 23, 2024 at 6:39 Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
> >
> > This patch fixes an issue where the zswap global shrinker stopped
> > iterating through the memcg tree.
> >
> > The problem was that shrink_worker() would stop iterating when a memcg
> > was being offlined and restart from the tree root.  Now, it properly
> > handles the offline memcg and continues shrinking with the next memcg.
> >
> > To avoid holding refcount of offline memcg encountered during the memcg
> > tree walking, shrink_worker() must continue iterating to release the
> > offline memcg to ensure the next memcg stored in the cursor is online.
> >
> > The offline memcg cleaner has also been changed to avoid the same issue.
> > When the next memcg of the offlined memcg is also offline, the refcount
> > stored in the iteration cursor was held until the next shrink_worker()
> > run. The cleaner must release the offline memcg recursively.
> >
> > Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> > Signed-off-by: Takero Funaki <flintglass@gmail.com>
> Hmm LGTM for the most part - a couple nits
> [...]
> > +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> > +                                       zswap_next_shrink, NULL);
> nit: this can fit in a single line right? Looks like it's exactly 80 characters.

Isn't that over 90 chars? But yes, we can reduce the line breaks by using
memcg as a temporary, like:
-       if (zswap_next_shrink == memcg)
-               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+       if (zswap_next_shrink == memcg) {
+               do {
+                       memcg = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
+                       zswap_next_shrink = memcg;
+               } while (memcg && !mem_cgroup_online(memcg));


> [...]
> > +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> > +                                               zswap_next_shrink, NULL);
> Same with this.
> [...]
> > +               /*
> > +                * We verified the memcg is online and got an extra memcg
> > +                * reference.  Our memcg might be offlined concurrently but the
> > +                * respective offline cleaner must be waiting for our lock.
> > +                */
> >                 spin_unlock(&zswap_shrink_lock);
> nit: can we remove this spin_unlock() call + the one within the `if
> (!memcg)` block, and just do it unconditionally outside of if
> (!memcg)? Looks like we are unlocking regardless of whether memcg is
> null or not.
>
> memcg is a local variable, not protected by zswap_shrink_lock, so this
> should be fine right?
>
> Otherwise:
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Ah, that's right. We no longer modify zswap_next_shrink in the if
branches, so I will merge the two spin_unlock() calls.



* Re: [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration
  2024-07-23 15:35     ` Takero Funaki
@ 2024-07-23 15:55       ` Nhat Pham
  0 siblings, 0 replies; 16+ messages in thread
From: Nhat Pham @ 2024-07-23 15:55 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Andrew Morton,
	linux-mm, linux-kernel

On Tue, Jul 23, 2024 at 8:35 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> On Tue, Jul 23, 2024 at 6:39 Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
> > >
> > > This patch fixes an issue where the zswap global shrinker stopped
> > > iterating through the memcg tree.
> > >
> > > The problem was that shrink_worker() would stop iterating when a memcg
> > > was being offlined and restart from the tree root.  Now, it properly
> > > handles the offline memcg and continues shrinking with the next memcg.
> > >
> > > To avoid holding refcount of offline memcg encountered during the memcg
> > > tree walking, shrink_worker() must continue iterating to release the
> > > offline memcg to ensure the next memcg stored in the cursor is online.
> > >
> > > The offline memcg cleaner has also been changed to avoid the same issue.
> > > When the next memcg of the offlined memcg is also offline, the refcount
> > > stored in the iteration cursor was held until the next shrink_worker()
> > > run. The cleaner must release the offline memcg recursively.
> > >
> > > Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> > > Signed-off-by: Takero Funaki <flintglass@gmail.com>
> > Hmm LGTM for the most part - a couple nits
> > [...]
> > > +                       zswap_next_shrink = mem_cgroup_iter(NULL,
> > > +                                       zswap_next_shrink, NULL);
> > nit: this can fit in a single line right? Looks like it's exactly 80 characters.
>
> Isn't that over 90 chars? But yes, we can reduce the line breaks by using
> memcg as a temporary, like:

Huh. Weird. I applied the patch locally, and it looked 80 chars to me ha.

Anyway - just some nits. If checkpatch complains then yeah no need to fix this.



* Re: [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration
  2024-07-23  6:37   ` Yosry Ahmed
@ 2024-07-23 15:56     ` Takero Funaki
  0 siblings, 0 replies; 16+ messages in thread
From: Takero Funaki @ 2024-07-23 15:56 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Nhat Pham, Chengming Zhou, Andrew Morton,
	linux-mm, linux-kernel

On Tue, Jul 23, 2024 at 15:37 Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
> >
> > This patch fixes an issue where the zswap global shrinker stopped
> > iterating through the memcg tree.
> >
> > The problem was that shrink_worker() would stop iterating when a memcg
> > was being offlined and restart from the tree root.  Now, it properly
> > handles the offline memcg and continues shrinking with the next memcg.
>
> It is probably worth explicitly calling out that before this change,
> the shrinker would stop considering an offline memcg as a failure and
> stop after hitting 16 failures, but after this change, a failure is
> hitting the end of the tree. This means that cgroup trees with a lot
> of offline cgroups will now observe significantly higher zswap
> writeback activity.
>
> Similarly, in the next patch commit log, please explicitly call out
> the expected behavioral change, that hitting an empty memcg or
> reaching the end of a tree is no longer considered a failure if there
> is progress, which means that trees with a few cgroups using zswap
> will now observe significantly higher zswap writeback activity.
>

Thanks for the comments.  I will drop the comments and change the
commit message to:
    The problem was that shrink_worker() would restart iterating the memcg
    tree from the root, considering an offline memcg as a failure, and abort
    shrinking after encountering the offline memcg 16 times, even if there
    was only one offline memcg. After this change, an offline memcg in the
    tree is no longer considered a failure. This allows the shrinker to
    continue shrinking the other online memcgs regardless of whether an
    offline memcg exists, resulting in higher zswap writeback activity.

These issues do not require many offline or empty memcgs. Without these
patches, the shrinker would abort shrinking too early even if there is
only one offline memcg or only one empty memcg. The shrinker counted the
same memcg as another failure in every tree walk, and the failures
limited writeback to at most 16 pages times the number of memcgs.
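
To illustrate the arithmetic, here is a userspace toy model, not kernel
code. It assumes that memcgs are array slots visited in order, that every
shrink of an online memcg writes back exactly one page, and that the
accept threshold is never reached. pages_before_fix() and
pages_after_fix() are names made up for this sketch; only
MAX_RECLAIM_RETRIES is taken from mm/zswap.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RECLAIM_RETRIES 16

/*
 * Toy model (assumption, not kernel code): one page is written back per
 * online memcg visited, and reaching the end of the tree is a failure.
 */

/* Before the fix: an offline memcg is a failure and resets the cursor. */
static size_t pages_before_fix(const bool *online, size_t n)
{
	size_t pages = 0, i = 0;
	int failures = 0;

	assert(online || n == 0);
	while (failures < MAX_RECLAIM_RETRIES) {
		if (i == n || !online[i]) {
			/* end of tree or offline memcg: restart from the root */
			failures++;
			i = 0;
			continue;
		}
		pages++;	/* one page written back from this memcg */
		i++;
	}
	return pages;
}

/* After the fix: offline memcgs are skipped, only the end of tree fails. */
static size_t pages_after_fix(const bool *online, size_t n)
{
	size_t pages = 0, i = 0;
	int failures = 0;

	assert(online || n == 0);
	while (failures < MAX_RECLAIM_RETRIES) {
		if (i == n) {
			/* full tree walk completed */
			failures++;
			i = 0;
			continue;
		}
		if (online[i])
			pages++;	/* offline slots are passed over */
		i++;
	}
	return pages;
}
```

With online = {true, false, true, true}, the model writes back 16 pages
before the fix, because only the memcg ahead of the offline one is ever
reached, and 48 pages after it (3 online memcgs times 16 tree walks).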



* Re: [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic
  2024-07-22 21:51   ` Nhat Pham
@ 2024-07-23 16:44     ` Takero Funaki
  2024-07-26  3:21       ` Chengming Zhou
  0 siblings, 1 reply; 16+ messages in thread
From: Takero Funaki @ 2024-07-23 16:44 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Andrew Morton,
	linux-mm, linux-kernel

On Tue, Jul 23, 2024 at 6:51 Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
> >
> > This patch fixes zswap global shrinker that did not shrink zpool as
> > expected.
> >
> > The issue it addresses is that `shrink_worker()` did not distinguish
> > between unexpected errors and expected error codes that should be
> > skipped, such as when there is no stored page in a memcg. This led to
> > the shrinking process being aborted on the expected error codes.
>
> The code itself seems reasonable to me, but may I ask you to document
> (as a comment) all the expected v.s unexpected cases? i.e when do we
> increment (or not increment) the failure counter?
>

In addition to the commit log changes suggested by Yosry, I am adding
some comments specifying which memcgs are (not) candidates for
writeback and what counts as a failure.

-       /* global reclaim will select cgroup in a round-robin fashion.
+       /*
+        * Global reclaim will select cgroup in a round-robin fashion from all
+        * online memcgs, but memcgs that have no pages in zswap and
+        * writeback-disabled memcgs (memory.zswap.writeback=0) are not
+        * candidates for shrinking.
+        *
+        * Shrinking will be aborted if we encounter the following
+        * MAX_RECLAIM_RETRIES times:
+        * - No writeback-candidate memcgs found in a memcg tree walk.
+        * - Shrinking a writeback-candidate memcg failed.
         *
         * We save iteration cursor memcg into zswap_next_shrink,
         * which can be modified by the offline memcg cleaner

and, the reasons to (not) increment the progress:

@@ -1387,10 +1407,20 @@ static void shrink_worker(struct work_struct *w)
                /* drop the extra reference */
                mem_cgroup_put(memcg);

-               if (ret == -EINVAL)
-                       break;
+               /*
+                * There are no writeback-candidate pages in the memcg.
+                * This is not an issue as long as we can find another memcg
+                * with pages in zswap. Skip this without incrementing progress
+                * and failures.
+                */
+               if (ret == -ENOENT)
+                       continue;
+
                if (ret && ++failures == MAX_RECLAIM_RETRIES)
                        break;
+
+               /* completed writeback or incremented failures */
+               ++progress;
 resched:
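
The accounting in this hunk can be restated as a small userspace helper.
This is only a sketch of the diff as posted: account_result() and
struct shrink_counters are hypothetical names, and the boolean return
value stands in for the break out of the worker loop.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

/* Hypothetical container for the two counters used by the worker. */
struct shrink_counters {
	int progress;	/* candidate memcgs handled in this tree walk */
	int failures;	/* failures accumulated so far */
};

/* Returns true when the worker should give up shrinking. */
static bool account_result(int ret, struct shrink_counters *c)
{
	assert(c);

	/* No writeback-candidate pages: skip, touch neither counter. */
	if (ret == -ENOENT)
		return false;

	/* Shrinking a writeback-candidate memcg failed. */
	if (ret && ++c->failures == MAX_RECLAIM_RETRIES)
		return true;

	/* Completed writeback or counted a failure. */
	c->progress++;
	return false;
}
```

Starting from zeroed counters, -ENOENT leaves both counters untouched, a
return of 0 bumps only progress, and any other error bumps both.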


> My understanding is, we only increment the failure counter if we fail
> to reclaim from a selected memcg that is non-empty and
> writeback-enabled, or if we go a full tree walk without making any
> progress. Is this correct?
>

Yes, that's the expected behavior.
Please let me know if there is more appropriate wording.

Thanks.



* Re: [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration
  2024-07-20  4:41 ` [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration Takero Funaki
                     ` (2 preceding siblings ...)
  2024-07-23  6:37   ` Yosry Ahmed
@ 2024-07-26  2:47   ` Chengming Zhou
  3 siblings, 0 replies; 16+ messages in thread
From: Chengming Zhou @ 2024-07-26  2:47 UTC (permalink / raw)
  To: Takero Funaki, Johannes Weiner, Yosry Ahmed, Nhat Pham, Andrew Morton
  Cc: linux-mm, linux-kernel

On 2024/7/20 12:41, Takero Funaki wrote:
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
> 
> The problem was that shrink_worker() would stop iterating when a memcg
> was being offlined and restart from the tree root.  Now, it properly
> handles the offline memcg and continues shrinking with the next memcg.
> 
> To avoid holding refcount of offline memcg encountered during the memcg
> tree walking, shrink_worker() must continue iterating to release the
> offline memcg to ensure the next memcg stored in the cursor is online.
> 
> The offline memcg cleaner has also been changed to avoid the same issue.
> When the next memcg of the offlined memcg is also offline, the refcount
> stored in the iteration cursor was held until the next shrink_worker()
> run. The cleaner must release the offline memcg recursively.
> 
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki <flintglass@gmail.com>

Looks good to me! With other comments addressed:

Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>

Thanks.

> ---
>   mm/zswap.c | 77 +++++++++++++++++++++++++++++++++++++++---------------
>   1 file changed, 56 insertions(+), 21 deletions(-)
> 
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a50e2986cd2f..6528668c9af3 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -775,12 +775,33 @@ void zswap_folio_swapin(struct folio *folio)
>   	}
>   }
>   
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
>   void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>   {
>   	/* lock out zswap shrinker walking memcg tree */
>   	spin_lock(&zswap_shrink_lock);
> -	if (zswap_next_shrink == memcg)
> -		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +	if (zswap_next_shrink == memcg) {
> +		do {
> +			zswap_next_shrink = mem_cgroup_iter(NULL,
> +					zswap_next_shrink, NULL);
> +		} while (zswap_next_shrink &&
> +				!mem_cgroup_online(zswap_next_shrink));
> +		/*
> +		 * We verified the next memcg is online.  Even if the next
> +		 * memcg is being offlined here, another cleaner must be
> +		 * waiting for our lock.  We can leave the online memcg
> +		 * reference.
> +		 */
> +	}
>   	spin_unlock(&zswap_shrink_lock);
>   }
>   
> @@ -1319,18 +1340,38 @@ static void shrink_worker(struct work_struct *w)
>   	/* Reclaim down to the accept threshold */
>   	thr = zswap_accept_thr_pages();
>   
> -	/* global reclaim will select cgroup in a round-robin fashion. */
> +	/* global reclaim will select cgroup in a round-robin fashion.
> +	 *
> +	 * We save iteration cursor memcg into zswap_next_shrink,
> +	 * which can be modified by the offline memcg cleaner
> +	 * zswap_memcg_offline_cleanup().
> +	 *
> +	 * Since the offline cleaner is called only once, we cannot leave an
> +	 * offline memcg reference in zswap_next_shrink.
> +	 * We can rely on the cleaner only if we get online memcg under lock.
> +	 *
> +	 * If we get an offline memcg, we cannot determine if the cleaner has
> +	 * already been called or will be called later. We must put back the
> +	 * reference before returning from this function. Otherwise, the
> +	 * offline memcg left in zswap_next_shrink will hold the reference
> +	 * until the next run of shrink_worker().
> +	 */
>   	do {
>   		spin_lock(&zswap_shrink_lock);
> -		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> -		memcg = zswap_next_shrink;
>   
>   		/*
> -		 * We need to retry if we have gone through a full round trip, or if we
> -		 * got an offline memcg (or else we risk undoing the effect of the
> -		 * zswap memcg offlining cleanup callback). This is not catastrophic
> -		 * per se, but it will keep the now offlined memcg hostage for a while.
> -		 *
> +		 * Start shrinking from the next memcg after zswap_next_shrink.
> +		 * When the offline cleaner has already advanced the cursor,
> +		 * advancing the cursor here overlooks one memcg, but this
> +		 * should be negligibly rare.
> +		 */
> +		do {
> +			zswap_next_shrink = mem_cgroup_iter(NULL,
> +						zswap_next_shrink, NULL);
> +			memcg = zswap_next_shrink;
> +		} while (memcg && !mem_cgroup_tryget_online(memcg));
> +
> +		/*
>   		 * Note that if we got an online memcg, we will keep the extra
>   		 * reference in case the original reference obtained by mem_cgroup_iter
>   		 * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1344,17 +1385,11 @@ static void shrink_worker(struct work_struct *w)
>   			goto resched;
>   		}
>   
> -		if (!mem_cgroup_tryget_online(memcg)) {
> -			/* drop the reference from mem_cgroup_iter() */
> -			mem_cgroup_iter_break(NULL, memcg);
> -			zswap_next_shrink = NULL;
> -			spin_unlock(&zswap_shrink_lock);
> -
> -			if (++failures == MAX_RECLAIM_RETRIES)
> -				break;
> -
> -			goto resched;
> -		}
> +		/*
> +		 * We verified the memcg is online and got an extra memcg
> +		 * reference.  Our memcg might be offlined concurrently but the
> +		 * respective offline cleaner must be waiting for our lock.
> +		 */
>   		spin_unlock(&zswap_shrink_lock);
>   
>   		ret = shrink_memcg(memcg);



* Re: [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic
  2024-07-23 16:44     ` Takero Funaki
@ 2024-07-26  3:21       ` Chengming Zhou
  2024-07-26  8:54         ` Takero Funaki
  0 siblings, 1 reply; 16+ messages in thread
From: Chengming Zhou @ 2024-07-26  3:21 UTC (permalink / raw)
  To: Takero Funaki, Nhat Pham
  Cc: Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm, linux-kernel

On 2024/7/24 00:44, Takero Funaki wrote:
> On Tue, Jul 23, 2024 at 6:51 Nhat Pham <nphamcs@gmail.com> wrote:
>>
>> On Fri, Jul 19, 2024 at 9:41 PM Takero Funaki <flintglass@gmail.com> wrote:
>>>
>>> This patch fixes zswap global shrinker that did not shrink zpool as
>>> expected.
>>>
>>> The issue it addresses is that `shrink_worker()` did not distinguish
>>> between unexpected errors and expected error codes that should be
>>> skipped, such as when there is no stored page in a memcg. This led to
>>> the shrinking process being aborted on the expected error codes.
>>
>> The code itself seems reasonable to me, but may I ask you to document
>> (as a comment) all the expected v.s unexpected cases? i.e when do we
>> increment (or not increment) the failure counter?
>>
> 
> In addition to the commit log changes suggested by Yosry, I am adding
> some comments specifying which memcgs are (not) candidates for
> writeback and what counts as a failure.
> 
> -       /* global reclaim will select cgroup in a round-robin fashion.
> +       /*
> +        * Global reclaim will select cgroup in a round-robin fashion from all
> +        * online memcgs, but memcgs that have no pages in zswap and
> +        * writeback-disabled memcgs (memory.zswap.writeback=0) are not
> +        * candidates for shrinking.
> +        *
> +        * Shrinking will be aborted if we encounter the following
> +        * MAX_RECLAIM_RETRIES times:
> +        * - No writeback-candidate memcgs found in a memcg tree walk.
> +        * - Shrinking a writeback-candidate memcg failed.
>           *
>           * We save iteration cursor memcg into zswap_next_shrink,
>           * which can be modified by the offline memcg cleaner
> 
> and, the reasons to (not) increment the progress:
> 
> @@ -1387,10 +1407,20 @@ static void shrink_worker(struct work_struct *w)
>                  /* drop the extra reference */
>                  mem_cgroup_put(memcg);
> 
> -               if (ret == -EINVAL)
> -                       break;
> +               /*
> +                * There are no writeback-candidate pages in the memcg.
> +                * This is not an issue as long as we can find another memcg
> +                * with pages in zswap. Skip this without incrementing progress
> +                * and failures.
> +                */
> +               if (ret == -ENOENT)
> +                       continue;
> +
>                  if (ret && ++failures == MAX_RECLAIM_RETRIES)
>                          break;
> +
> +               /* completed writeback or incremented failures */
> +               ++progress;

Maybe the name "progress" is a little confusing here? "progress" sounds 
to me that we have some writeback completed.

But actually it just means we have encountered some candidates, right?

Thanks.


>   resched:
> 
> 
>> My understanding is, we only increment the failure counter if we fail
>> to reclaim from a selected memcg that is non-empty and
>> writeback-enabled, or if we go a full tree walk without making any
>> progress. Is this correct?
>>
> 
> Yes, that's the expected behavior.
> Please let me know if there is more appropriate wording.
> 
> Thanks.



* Re: [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic
  2024-07-26  3:21       ` Chengming Zhou
@ 2024-07-26  8:54         ` Takero Funaki
  2024-07-26 18:01           ` Nhat Pham
  0 siblings, 1 reply; 16+ messages in thread
From: Takero Funaki @ 2024-07-26  8:54 UTC (permalink / raw)
  To: Chengming Zhou
  Cc: Nhat Pham, Johannes Weiner, Yosry Ahmed, Andrew Morton, linux-mm,
	linux-kernel

Thanks for your comments.


On Fri, Jul 26, 2024 at 12:21 Chengming Zhou <chengming.zhou@linux.dev> wrote:
> > and, the reasons to (not) increment the progress:
> >
> > @@ -1387,10 +1407,20 @@ static void shrink_worker(struct work_struct *w)
> >                  /* drop the extra reference */
> >                  mem_cgroup_put(memcg);
> >
> > -               if (ret == -EINVAL)
> > -                       break;
> > +               /*
> > +                * There are no writeback-candidate pages in the memcg.
> > +                * This is not an issue as long as we can find another memcg
> > +                * with pages in zswap. Skip this without incrementing progress
> > +                * and failures.
> > +                */
> > +               if (ret == -ENOENT)
> > +                       continue;
> > +
> >                  if (ret && ++failures == MAX_RECLAIM_RETRIES)
> >                          break;
> > +
> > +               /* completed writeback or incremented failures */
> > +               ++progress;
>
> Maybe the name "progress" is a little confusing here? "progress" sounds
> to me that we have some writeback completed.
>
> But actually it just means we have encountered some candidates, right?
>
> Thanks.
>
>

Yes, `++progress` counts both errors and successes as iteration
progress over valid memcgs (not the amount of writeback). Incrementing
it only on success would over-increment the failures counter: if there
is only one memcg, each tree walk adds two failures, one from the
writeback failure and one from the end of the tree walk, so the worker
aborts after 8 writeback failures instead of 16.
Would `++candidates;` be better? I will replace the name and fix the
commit messages for v4.
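
The 8-versus-16 arithmetic can be checked with a toy userspace model. It
assumes a tree with a single candidate memcg whose shrink always fails,
and that the end of a tree walk adds a failure only when the per-walk
counter is still zero, as discussed earlier in this thread.
shrink_calls_until_abort() is a hypothetical name for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RECLAIM_RETRIES 16

/*
 * Toy model (assumption, not kernel code): one memcg, every shrink fails.
 * Returns how many shrink calls happen before the worker aborts.
 */
static int shrink_calls_until_abort(bool count_failed_attempts)
{
	int failures = 0, attempts = 0, calls = 0;

	for (;;) {
		/* visit the only memcg; the shrink fails */
		calls++;
		if (++failures == MAX_RECLAIM_RETRIES)
			break;
		if (count_failed_attempts)
			attempts++;

		/* end of the tree walk: a walk with no attempts is a failure */
		if (!attempts && ++failures == MAX_RECLAIM_RETRIES)
			break;
		attempts = 0;
	}

	/* every call adds at least one failure, so calls is bounded */
	assert(calls <= MAX_RECLAIM_RETRIES);
	return calls;
}
```

shrink_calls_until_abort(true) returns 16, while
shrink_calls_until_abort(false) returns 8, because each walk then
contributes two failures.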



* Re: [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic
  2024-07-26  8:54         ` Takero Funaki
@ 2024-07-26 18:01           ` Nhat Pham
  2024-07-27 11:08             ` Takero Funaki
  0 siblings, 1 reply; 16+ messages in thread
From: Nhat Pham @ 2024-07-26 18:01 UTC (permalink / raw)
  To: Takero Funaki
  Cc: Chengming Zhou, Johannes Weiner, Yosry Ahmed, Andrew Morton,
	linux-mm, linux-kernel

On Fri, Jul 26, 2024 at 1:54 AM Takero Funaki <flintglass@gmail.com> wrote:
>
> Yes, `++progress` counts both errors and successes as iteration
> progress over valid memcgs (not the amount of writeback). Incrementing
> it only on success would over-increment the failures counter: if there
> is only one memcg, each tree walk adds two failures, one from the
> writeback failure and one from the end of the tree walk, so the worker
> aborts after 8 writeback failures instead of 16.
> Would `++candidates;` be better? I will replace the name and fix the
> commit messages for v4.

How about `attempt` or `attempted`? Naming is hard :)



* Re: [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic
  2024-07-26 18:01           ` Nhat Pham
@ 2024-07-27 11:08             ` Takero Funaki
  0 siblings, 0 replies; 16+ messages in thread
From: Takero Funaki @ 2024-07-27 11:08 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Chengming Zhou, Johannes Weiner, Yosry Ahmed, Andrew Morton,
	linux-mm, linux-kernel

On Sat, Jul 27, 2024 at 3:01 Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Fri, Jul 26, 2024 at 1:54 AM Takero Funaki <flintglass@gmail.com> wrote:
> >
> > Yes, `++progress` counts both errors and successes as iteration
> > progress over valid memcgs (not the amount of writeback). Incrementing
> > it only on success would over-increment the failures counter: if there
> > is only one memcg, each tree walk adds two failures, one from the
> > writeback failure and one from the end of the tree walk, so the worker
> > aborts after 8 writeback failures instead of 16.
> > Would `++candidates;` be better? I will replace the name and fix the
> > commit messages for v4.
>
> How about `attempt` or `attempted`? Naming is hard :)

Thanks. I will rewrite it with `attempts`, in line with the `failures` counter.



Thread overview: 16+ messages
2024-07-20  4:41 [PATCH v3 0/2] mm: zswap: fixes for global shrinker Takero Funaki
2024-07-20  4:41 ` [PATCH v3 1/2] mm: zswap: fix global shrinker memcg iteration Takero Funaki
2024-07-22 21:39   ` Nhat Pham
2024-07-23 15:35     ` Takero Funaki
2024-07-23 15:55       ` Nhat Pham
2024-07-23  6:30   ` Yosry Ahmed
2024-07-23  6:37   ` Yosry Ahmed
2024-07-23 15:56     ` Takero Funaki
2024-07-26  2:47   ` Chengming Zhou
2024-07-20  4:41 ` [PATCH v3 2/2] mm: zswap: fix global shrinker error handling logic Takero Funaki
2024-07-22 21:51   ` Nhat Pham
2024-07-23 16:44     ` Takero Funaki
2024-07-26  3:21       ` Chengming Zhou
2024-07-26  8:54         ` Takero Funaki
2024-07-26 18:01           ` Nhat Pham
2024-07-27 11:08             ` Takero Funaki
