[PATCH v6] memcg: Don't wait writeback completion when release memcg.

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v6] memcg: Don't wait writeback completion when release memcg.
@ 2025-09-17 21:29 Julian Sun
  2025-09-17 21:50 ` Tejun Heo
  2025-09-17 22:21 ` Andrew Morton
  0 siblings, 2 replies; 9+ messages in thread
From: Julian Sun @ 2025-09-17 21:29 UTC (permalink / raw)
  To: cgroups, linux-mm
  Cc: jack, tj, muchun.song, venkat88, hannes, mhocko, roman.gushchin,
	shakeel.butt, akpm

Recently, we encountered the following hung task:

INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
[Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
[Wed Jul 30 17:47:45 2025] Call Trace:
[Wed Jul 30 17:47:45 2025]  __schedule+0x934/0xe10
[Wed Jul 30 17:47:45 2025]  ? complete+0x3b/0x50
[Wed Jul 30 17:47:45 2025]  ? _cond_resched+0x15/0x30
[Wed Jul 30 17:47:45 2025]  schedule+0x40/0xb0
[Wed Jul 30 17:47:45 2025]  wb_wait_for_completion+0x52/0x80
[Wed Jul 30 17:47:45 2025]  ? finish_wait+0x80/0x80
[Wed Jul 30 17:47:45 2025]  mem_cgroup_css_free+0x22/0x1b0
[Wed Jul 30 17:47:45 2025]  css_free_rwork_fn+0x42/0x380
[Wed Jul 30 17:47:45 2025]  process_one_work+0x1a2/0x360
[Wed Jul 30 17:47:45 2025]  worker_thread+0x30/0x390
[Wed Jul 30 17:47:45 2025]  ? create_worker+0x1a0/0x1a0
[Wed Jul 30 17:47:45 2025]  kthread+0x110/0x130
[Wed Jul 30 17:47:45 2025]  ? __kthread_cancel_work+0x40/0x40
[Wed Jul 30 17:47:45 2025]  ret_from_fork+0x1f/0x30

The direct cause is that memcg spends a long time waiting for dirty page
writeback of foreign memcgs during release.

The root causes are:
    a. The wb may have multiple writeback tasks, containing millions
       of dirty pages, as shown below:

>>> for work in list_for_each_entry("struct wb_writeback_work", \
				    wb.work_list.address_of_(), "list"):
...     print(work.nr_pages, work.reason, hex(work))
...
900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
1275228 WB_REASON_FOREIGN_FLUSH 0xffff969d9b444bc0
1099673 WB_REASON_FOREIGN_FLUSH 0xffff969f0954d6c0
1351522 WB_REASON_FOREIGN_FLUSH 0xffff969e76713340
2567437 WB_REASON_FOREIGN_FLUSH 0xffff9694ae208400
2954033 WB_REASON_FOREIGN_FLUSH 0xffff96a22d62cbc0
3008860 WB_REASON_FOREIGN_FLUSH 0xffff969eee8ce3c0
3337932 WB_REASON_FOREIGN_FLUSH 0xffff9695b45156c0
3348916 WB_REASON_FOREIGN_FLUSH 0xffff96a22c7a4f40
3345363 WB_REASON_FOREIGN_FLUSH 0xffff969e5d872800
3333581 WB_REASON_FOREIGN_FLUSH 0xffff969efd0f4600
3382225 WB_REASON_FOREIGN_FLUSH 0xffff969e770edcc0
3418770 WB_REASON_FOREIGN_FLUSH 0xffff96a252ceea40
3387648 WB_REASON_FOREIGN_FLUSH 0xffff96a3bda86340
3385420 WB_REASON_FOREIGN_FLUSH 0xffff969efc6eb280
3418730 WB_REASON_FOREIGN_FLUSH 0xffff96a348ab1040
3426155 WB_REASON_FOREIGN_FLUSH 0xffff969d90beac00
3397995 WB_REASON_FOREIGN_FLUSH 0xffff96a2d7288800
3293095 WB_REASON_FOREIGN_FLUSH 0xffff969dab423240
3293595 WB_REASON_FOREIGN_FLUSH 0xffff969c765ff400
3199511 WB_REASON_FOREIGN_FLUSH 0xffff969a72d5e680
3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00

    b. The writeback might severely throttled by wbt, with a speed
       possibly less than 100kb/s, leading to a very long writeback time.

>>> wb.write_bandwidth
(unsigned long)24
>>> wb.write_bandwidth
(unsigned long)13

The wb_wait_for_completion() here is probably only used to prevent
use-after-free. Therefore, we manage 'done' separately and automatically
free it.

This allows us to remove wb_wait_for_completion() while preventing
the use-after-free issue.

Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
---
  Changes in v6:
    * Add comment to explain reason and usage of 
      struct cgwb_frn_wait.

  Changes in v5:
    * Protect operation on wq_head with spinlock.

  Changes in v4:
    * Check done->cnt in memcg_cgwb_waitq_callback_fn().
    * Code cleanups as Tejun suggested.

  Changes in v3:
    * Rename cgwb_frn_wq_entry to cgwb_frn_wait.
    * Define memcg_cgwb_waitq_callback_fn() only when
      CONFIG_CGROUP_WRITEBACK is defined.
    * Embed wb_completion into struct cgwb_frn_wait.

  Change in v2:
    * Use custom waitq function to free resources

 include/linux/memcontrol.h | 14 +++++++++-
 mm/memcontrol.c            | 57 ++++++++++++++++++++++++++++++++------
 2 files changed, 62 insertions(+), 9 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 785173aa0739..1568121d25ca 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -157,11 +157,23 @@ struct mem_cgroup_thresholds {
  */
 #define MEMCG_CGWB_FRN_CNT	4
 
+/*
+ * This structure exists to avoid waiting for writeback to finish on
+ * memcg release, which could lead to a hang task.
+ * @done.cnt is always > 0 before a memcg is released, so @wq_entry.func
+ * may only be invoked by finish_writeback_work() after memcg is freed.
+ * See mem_cgroup_css_free() for details.
+ */
+struct cgwb_frn_wait {
+	struct wb_completion done;
+	struct wait_queue_entry wq_entry;
+};
+
 struct memcg_cgwb_frn {
 	u64 bdi_id;			/* bdi->id of the foreign inode */
 	int memcg_id;			/* memcg->css.id of foreign inode */
 	u64 at;				/* jiffies_64 at the time of dirtying */
-	struct wb_completion done;	/* tracks in-flight foreign writebacks */
+	struct cgwb_frn_wait *wait;	/* tracks in-flight foreign writebacks */
 };
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..110937c3671a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3459,7 +3459,7 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct folio *folio,
 		    frn->memcg_id == wb->memcg_css->id)
 			break;
 		if (time_before64(frn->at, oldest_at) &&
-		    atomic_read(&frn->done.cnt) == 1) {
+		    atomic_read(&frn->wait->done.cnt) == 1) {
 			oldest = i;
 			oldest_at = frn->at;
 		}
@@ -3506,12 +3506,12 @@ void mem_cgroup_flush_foreign(struct bdi_writeback *wb)
 		 * already one in flight.
 		 */
 		if (time_after64(frn->at, now - intv) &&
-		    atomic_read(&frn->done.cnt) == 1) {
+		    atomic_read(&frn->wait->done.cnt) == 1) {
 			frn->at = 0;
 			trace_flush_foreign(wb, frn->bdi_id, frn->memcg_id);
 			cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id,
 					       WB_REASON_FOREIGN_FLUSH,
-					       &frn->done);
+					       &frn->wait->done);
 		}
 	}
 }
@@ -3702,13 +3702,29 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
 	__mem_cgroup_free(memcg);
 }
 
+#ifdef CONFIG_CGROUP_WRITEBACK
+static int memcg_cgwb_waitq_callback_fn(struct wait_queue_entry *wq_entry, unsigned int mode,
+					int flags, void *key)
+{
+	struct cgwb_frn_wait *frn_wait = container_of(wq_entry,
+						      struct cgwb_frn_wait, wq_entry);
+
+	if (!atomic_read(&frn_wait->done.cnt)) {
+		list_del(&wq_entry->entry);
+		kfree(frn_wait);
+	}
+
+	return 0;
+}
+#endif
+
 static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 {
 	struct memcg_vmstats_percpu *statc;
 	struct memcg_vmstats_percpu __percpu *pstatc_pcpu;
 	struct mem_cgroup *memcg;
 	int node, cpu;
-	int __maybe_unused i;
+	int __maybe_unused i = 0;
 	long error;
 
 	memcg = kmem_cache_zalloc(memcg_cachep, GFP_KERNEL);
@@ -3763,9 +3779,17 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	INIT_LIST_HEAD(&memcg->objcg_list);
 #ifdef CONFIG_CGROUP_WRITEBACK
 	INIT_LIST_HEAD(&memcg->cgwb_list);
-	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
-		memcg->cgwb_frn[i].done =
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i];
+
+		frn->wait = kmalloc(sizeof(struct cgwb_frn_wait), GFP_KERNEL);
+		if (!frn->wait)
+			goto fail;
+
+		frn->wait->done =
 			__WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq);
+		init_wait_func(&frn->wait->wq_entry, memcg_cgwb_waitq_callback_fn);
+	}
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	spin_lock_init(&memcg->deferred_split_queue.split_queue_lock);
@@ -3775,6 +3799,10 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	lru_gen_init_memcg(memcg);
 	return memcg;
 fail:
+#ifdef CONFIG_CGROUP_WRITEBACK
+	while (i--)
+		kfree(memcg->cgwb_frn[i].wait);
+#endif
 	mem_cgroup_id_remove(memcg);
 	__mem_cgroup_free(memcg);
 	return ERR_PTR(error);
@@ -3912,8 +3940,21 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 	int __maybe_unused i;
 
 #ifdef CONFIG_CGROUP_WRITEBACK
-	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
-		wb_wait_for_completion(&memcg->cgwb_frn[i].done);
+	spin_lock(&memcg_cgwb_frn_waitq.lock);
+	for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) {
+		struct cgwb_frn_wait *wait = memcg->cgwb_frn[i].wait;
+
+		/*
+		 * Not necessary to wait for wb completion which might cause task hung,
+		 * only used to free resources. See memcg_cgwb_waitq_callback_fn().
+		 */
+		__add_wait_queue_entry_tail(wait->done.waitq, &wait->wq_entry);
+		if (atomic_dec_and_test(&wait->done.cnt)) {
+			__remove_wait_queue(wait->done.waitq, &wait->wq_entry);
+			kfree(wait);
+		}
+	}
+	spin_unlock(&memcg_cgwb_frn_waitq.lock);
 #endif
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_nosocket)
 		static_branch_dec(&memcg_sockets_enabled_key);
-- 
2.39.5



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6] memcg: Don't wait writeback completion when release memcg.
  2025-09-17 21:29 [PATCH v6] memcg: Don't wait writeback completion when release memcg Julian Sun
@ 2025-09-17 21:50 ` Tejun Heo
  2025-09-18  2:27   ` [External] " Julian Sun
  2025-09-17 22:21 ` Andrew Morton
  1 sibling, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2025-09-17 21:50 UTC (permalink / raw)
  To: Julian Sun
  Cc: cgroups, linux-mm, jack, muchun.song, venkat88, hannes, mhocko,
	roman.gushchin, shakeel.butt, akpm

Hello,

On Thu, Sep 18, 2025 at 05:29:59AM +0800, Julian Sun wrote:
...
> The wb_wait_for_completion() here is probably only used to prevent
> use-after-free. Therefore, we manage 'done' separately and automatically
> free it.
> 
> This allows us to remove wb_wait_for_completion() while preventing
> the use-after-free issue.
> 
> Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
> Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> Signed-off-by: Julian Sun <sunjunchao@bytedance.com>

  Acked-by: Tejun Heo <tj@kernel.org>

Minor comments below:

> +/*
> + * This structure exists to avoid waiting for writeback to finish on
> + * memcg release, which could lead to a hang task.
                                           ^
                                           hung

> + * @done.cnt is always > 0 before a memcg is released, so @wq_entry.func
> + * may only be invoked by finish_writeback_work() after memcg is freed.
> + * See mem_cgroup_css_free() for details.
> + */

I'm not sure this gives enough of a picture of what's going on. It'd be
better to expand a bit - briefly describe what the whole mechanism is for
and how hung tasks can happen.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6] memcg: Don't wait writeback completion when release memcg.
  2025-09-17 21:29 [PATCH v6] memcg: Don't wait writeback completion when release memcg Julian Sun
  2025-09-17 21:50 ` Tejun Heo
@ 2025-09-17 22:21 ` Andrew Morton
  2025-09-18  2:43   ` Lance Yang
  2025-09-18  3:03   ` [External] " Julian Sun
  1 sibling, 2 replies; 9+ messages in thread
From: Andrew Morton @ 2025-09-17 22:21 UTC (permalink / raw)
  To: Julian Sun
  Cc: cgroups, linux-mm, jack, tj, muchun.song, venkat88, hannes,
	mhocko, roman.gushchin, shakeel.butt, Lance Yang,
	Masami Hiramatsu

On Thu, 18 Sep 2025 05:29:59 +0800 Julian Sun <sunjunchao@bytedance.com> wrote:

> Recently, we encountered the following hung task:
> 
> INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
> [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
>
> ...
> 
> The direct cause is that memcg spends a long time waiting for dirty page
> writeback of foreign memcgs during release.
> 
> The root causes are:
>     a. The wb may have multiple writeback tasks, containing millions
>        of dirty pages, as shown below:
> 
> >>> for work in list_for_each_entry("struct wb_writeback_work", \
> 				    wb.work_list.address_of_(), "list"):
> ...     print(work.nr_pages, work.reason, hex(work))
> ...
> 900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
> 1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
>
> ...
>

I don't think it's particularly harmful that a dedicated worker thread
has to wait for a long time in this fashion.  It doesn't have anything
else to do (does it?) and a blocked kernel thread is cheap.

> 3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
> 3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00
> 
>     b. The writeback might severely throttled by wbt, with a speed
>        possibly less than 100kb/s, leading to a very long writeback time.
>
> ...
> 
>  include/linux/memcontrol.h | 14 +++++++++-
>  mm/memcontrol.c            | 57 ++++++++++++++++++++++++++++++++------
>  2 files changed, 62 insertions(+), 9 deletions(-)

Seems we're adding a bunch of tricky code to fix a non-problem which
the hung-task detector undesirably reports.

Would a better fix be to simply suppress the warning?

I don't think we presently have a touch_hung_task_detector() (do we?)
but it's presumably pretty simple.  And maybe
touch_softlockup_watchdog) should be taught to call that
touch_hung_task_dectector().

Another approach might be to set some flag in the task_struct
instructing the hung task detector to ignore this thread.




^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH v6] memcg: Don't wait writeback completion when release memcg.
  2025-09-17 21:50 ` Tejun Heo
@ 2025-09-18  2:27   ` Julian Sun
  0 siblings, 0 replies; 9+ messages in thread
From: Julian Sun @ 2025-09-18  2:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cgroups, linux-mm, jack, muchun.song, venkat88, hannes, mhocko,
	roman.gushchin, shakeel.butt, akpm

Hi,

On Thu, Sep 18, 2025 at 5:50 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Thu, Sep 18, 2025 at 05:29:59AM +0800, Julian Sun wrote:
> ...
> > The wb_wait_for_completion() here is probably only used to prevent
> > use-after-free. Therefore, we manage 'done' separately and automatically
> > free it.
> >
> > This allows us to remove wb_wait_for_completion() while preventing
> > the use-after-free issue.
> >
> > Fixes: 97b27821b485 ("writeback, memcg: Implement foreign dirty flushing")
> > Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com>
> > Signed-off-by: Julian Sun <sunjunchao@bytedance.com>
>
>   Acked-by: Tejun Heo <tj@kernel.org>
>
> Minor comments below:
>
> > +/*
> > + * This structure exists to avoid waiting for writeback to finish on
> > + * memcg release, which could lead to a hang task.
>                                            ^
>                                            hung
>
> > + * @done.cnt is always > 0 before a memcg is released, so @wq_entry.func
> > + * may only be invoked by finish_writeback_work() after memcg is freed.
> > + * See mem_cgroup_css_free() for details.
> > + */
>
> I'm not sure this gives enough of a picture of what's going on. It'd be
> better to expand a bit - briefly describe what the whole mechanism is for
> and how hung tasks can happen.

Ah.. I think explaining how hungtask occurs is a bit too detailed and
verbose. If readers want to learn about more detailed content and
causes, I think the commit message would be a better choice...
>
> Thanks.
>
> --
> tejun


Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v6] memcg: Don't wait writeback completion when release memcg.
  2025-09-17 22:21 ` Andrew Morton
@ 2025-09-18  2:43   ` Lance Yang
  2025-09-18  3:03   ` [External] " Julian Sun
  1 sibling, 0 replies; 9+ messages in thread
From: Lance Yang @ 2025-09-18  2:43 UTC (permalink / raw)
  To: Julian Sun, Andrew Morton
  Cc: cgroups, linux-mm, jack, tj, muchun.song, venkat88, hannes,
	mhocko, roman.gushchin, shakeel.butt, Masami Hiramatsu

Hey Julian,

On 2025/9/18 06:21, Andrew Morton wrote:
> On Thu, 18 Sep 2025 05:29:59 +0800 Julian Sun <sunjunchao@bytedance.com> wrote:
> 
>> Recently, we encountered the following hung task:
>>
>> INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
>> [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
>>
>> ...
>>
>> The direct cause is that memcg spends a long time waiting for dirty page
>> writeback of foreign memcgs during release.
>>
>> The root causes are:
>>      a. The wb may have multiple writeback tasks, containing millions
>>         of dirty pages, as shown below:
>>
>>>>> for work in list_for_each_entry("struct wb_writeback_work", \
>> 				    wb.work_list.address_of_(), "list"):
>> ...     print(work.nr_pages, work.reason, hex(work))
>> ...
>> 900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
>> 1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
>>
>> ...
>>
> 
> I don't think it's particularly harmful that a dedicated worker thread
> has to wait for a long time in this fashion.  It doesn't have anything
> else to do (does it?) and a blocked kernel thread is cheap.

Looking at wb_wait_for_completion(), one could introduce a new helper that
internally uses wait_event_timeout() in a loop. Something like:

void wb_wait_for_completion_with_timeout(struct wb_completion *done)
{
	atomic_dec(&done->cnt);
	while (atomic_read(&done->cnt))
         	wait_event_timeout(*done->waitq, !atomic_read(&done->cnt), 
timeout);
}

With this, the detector should no longer complain for that specific case :)

> 
>> 3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
>> 3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00
>>
>>      b. The writeback might severely throttled by wbt, with a speed
>>         possibly less than 100kb/s, leading to a very long writeback time.
>>
>> ...
>>
>>   include/linux/memcontrol.h | 14 +++++++++-
>>   mm/memcontrol.c            | 57 ++++++++++++++++++++++++++++++++------
>>   2 files changed, 62 insertions(+), 9 deletions(-)
> 
> Seems we're adding a bunch of tricky code to fix a non-problem which
> the hung-task detector undesirably reports.
> 
> Would a better fix be to simply suppress the warning?
> 
> I don't think we presently have a touch_hung_task_detector() (do we?)
> but it's presumably pretty simple.  And maybe
> touch_softlockup_watchdog) should be taught to call that
> touch_hung_task_dectector().

Yes, introducing a touch_hung_task_detector() and having other tasks
periodically call it for the blocked worker seems like a much cleaner
approach to suppress the warning, IMHO.

> 
> Another approach might be to set some flag in the task_struct
> instructing the hung task detector to ignore this thread.

Alternatively, the idea of setting a flag for the worker to explicitly
ignore certain tasks by the detector is also interesting, especially if the
worker's blocked state is expected and benign ;)

Well, happy to help explore either of these paths if you'd like to go 
further
with them ;P

Cheers,
Lance



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH v6] memcg: Don't wait writeback completion when release memcg.
  2025-09-17 22:21 ` Andrew Morton
  2025-09-18  2:43   ` Lance Yang
@ 2025-09-18  3:03   ` Julian Sun
  2025-09-18  3:26     ` Andrew Morton
  1 sibling, 1 reply; 9+ messages in thread
From: Julian Sun @ 2025-09-18  3:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, linux-mm, jack, tj, muchun.song, venkat88, hannes,
	mhocko, roman.gushchin, shakeel.butt, Lance Yang,
	Masami Hiramatsu

Hi,
Thanks for your review and comments.

On Thu, Sep 18, 2025 at 6:22 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 18 Sep 2025 05:29:59 +0800 Julian Sun <sunjunchao@bytedance.com> wrote:
>
> > Recently, we encountered the following hung task:
> >
> > INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
> > [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
> >
> > ...
> >
> > The direct cause is that memcg spends a long time waiting for dirty page
> > writeback of foreign memcgs during release.
> >
> > The root causes are:
> >     a. The wb may have multiple writeback tasks, containing millions
> >        of dirty pages, as shown below:
> >
> > >>> for work in list_for_each_entry("struct wb_writeback_work", \
> >                                   wb.work_list.address_of_(), "list"):
> > ...     print(work.nr_pages, work.reason, hex(work))
> > ...
> > 900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
> > 1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
> >
> > ...
> >
>
> I don't think it's particularly harmful that a dedicated worker thread
> has to wait for a long time in this fashion.  It doesn't have anything
> else to do (does it?) and a blocked kernel thread is cheap.

It also delays the release of other resources and the update of
vmstats and vmevents statistics for the parent cgroup. But we can put
the wb_wait_for_completion() after the release of these resources.
>
> > 3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
> > 3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00
> >
> >     b. The writeback might severely throttled by wbt, with a speed
> >        possibly less than 100kb/s, leading to a very long writeback time.
> >
> > ...
> >
> >  include/linux/memcontrol.h | 14 +++++++++-
> >  mm/memcontrol.c            | 57 ++++++++++++++++++++++++++++++++------
> >  2 files changed, 62 insertions(+), 9 deletions(-)
>
> Seems we're adding a bunch of tricky code to fix a non-problem which
> the hung-task detector undesirably reports.

Emm.. What is the definition of 'undesirably' here?
>
> Would a better fix be to simply suppress the warning?
>
> I don't think we presently have a touch_hung_task_detector() (do we?)
> but it's presumably pretty simple.  And maybe
> touch_softlockup_watchdog) should be taught to call that
> touch_hung_task_dectector().
>
> Another approach might be to set some flag in the task_struct
> instructing the hung task detector to ignore this thread.

To me, this feels kind of like a workaround rather than a real fix.
And these approaches are beyond the scope of memcg, I'm not sure how
Tejun thinks about it since he mentioned before wanting to keep the
modifications inside the memcg. Given the fact we already have an
existing solution and code, and the scope of impact is confined to
memcg, I prefer to use the existing solution.
>
>

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH v6] memcg: Don't wait writeback completion when release memcg.
  2025-09-18  3:03   ` [External] " Julian Sun
@ 2025-09-18  3:26     ` Andrew Morton
  2025-09-18  4:22       ` Julian Sun
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2025-09-18  3:26 UTC (permalink / raw)
  To: Julian Sun
  Cc: cgroups, linux-mm, jack, tj, muchun.song, venkat88, hannes,
	mhocko, roman.gushchin, shakeel.butt, Lance Yang,
	Masami Hiramatsu

On Thu, 18 Sep 2025 11:03:18 +0800 Julian Sun <sunjunchao@bytedance.com> wrote:

> Hi,
> Thanks for your review and comments.
> 
> On Thu, Sep 18, 2025 at 6:22 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Thu, 18 Sep 2025 05:29:59 +0800 Julian Sun <sunjunchao@bytedance.com> wrote:
> >
> > > Recently, we encountered the following hung task:
> > >
> > > INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
> > > [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
> > >
> > > ...
> > >
> > > The direct cause is that memcg spends a long time waiting for dirty page
> > > writeback of foreign memcgs during release.
> > >
> > > The root causes are:
> > >     a. The wb may have multiple writeback tasks, containing millions
> > >        of dirty pages, as shown below:
> > >
> > > >>> for work in list_for_each_entry("struct wb_writeback_work", \
> > >                                   wb.work_list.address_of_(), "list"):
> > > ...     print(work.nr_pages, work.reason, hex(work))
> > > ...
> > > 900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
> > > 1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
> > >
> > > ...
> > >
> >
> > I don't think it's particularly harmful that a dedicated worker thread
> > has to wait for a long time in this fashion.  It doesn't have anything
> > else to do (does it?) and a blocked kernel thread is cheap.
> 
> It also delays the release of other resources and the update of
> vmstats and vmevents statistics for the parent cgroup.

This is new - such considerations weren't described in the changelog. 
How much of a problem are these things?

> But we can put
> the wb_wait_for_completion() after the release of these resources.

Can we move these actions into the writeback completion path which
seems to be the most accurate way to do them?

> >
> > > 3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
> > > 3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00
> > >
> > >     b. The writeback might severely throttled by wbt, with a speed
> > >        possibly less than 100kb/s, leading to a very long writeback time.
> > >
> > > ...
> > >
> > >  include/linux/memcontrol.h | 14 +++++++++-
> > >  mm/memcontrol.c            | 57 ++++++++++++++++++++++++++++++++------
> > >  2 files changed, 62 insertions(+), 9 deletions(-)
> >
> > Seems we're adding a bunch of tricky code to fix a non-problem which
> > the hung-task detector undesirably reports.
> 
> Emm.. What is the definition of 'undesirably' here?

Seems the intent here is mainly to prevent the warning.  If that
warning wasn't coming out, would we bother making these changes?  If
no, just kill the warning.

> >
> > Would a better fix be to simply suppress the warning?
> >
> > I don't think we presently have a touch_hung_task_detector() (do we?)
> > but it's presumably pretty simple.  And maybe
> > touch_softlockup_watchdog) should be taught to call that
> > touch_hung_task_dectector().
> >
> > Another approach might be to set some flag in the task_struct
> > instructing the hung task detector to ignore this thread.
> 
> To me, this feels kind of like a workaround rather than a real fix.

I don't see why.  It appears that the kworker's intended role is to
wait for writeback completion then to finish things up.  Which it does
just fine, except there's this pesky warning we get.  Therefore: kill
the pesky warning,

> And these approaches are beyond the scope of memcg,

So expand the scope?  If hung-task doesn't have a way to suppress
inappropriate warnings then add it and use it.

> I'm not sure how
> Tejun thinks about it since he mentioned before wanting to keep the
> modifications inside the memcg. Given the fact we already have an
> existing solution and code, and the scope of impact is confined to
> memcg, I prefer to use the existing solution.

mmm...  sunk cost fallacy!  Let's just opt for the best solution,
regardless of cost.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH v6] memcg: Don't wait writeback completion when release memcg.
  2025-09-18  3:26     ` Andrew Morton
@ 2025-09-18  4:22       ` Julian Sun
  2025-09-18  4:32         ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Julian Sun @ 2025-09-18  4:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, linux-mm, jack, tj, muchun.song, venkat88, hannes,
	mhocko, roman.gushchin, shakeel.butt, Lance Yang,
	Masami Hiramatsu

Hi,

On Thu, Sep 18, 2025 at 11:26 AM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Thu, 18 Sep 2025 11:03:18 +0800 Julian Sun <sunjunchao@bytedance.com> wrote:
>
> > Hi,
> > Thanks for your review and comments.
> >
> > On Thu, Sep 18, 2025 at 6:22 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > On Thu, 18 Sep 2025 05:29:59 +0800 Julian Sun <sunjunchao@bytedance.com> wrote:
> > >
> > > > Recently, we encountered the following hung task:
> > > >
> > > > INFO: task kworker/4:1:1334558 blocked for more than 1720 seconds.
> > > > [Wed Jul 30 17:47:45 2025] Workqueue: cgroup_destroy css_free_rwork_fn
> > > >
> > > > ...
> > > >
> > > > The direct cause is that memcg spends a long time waiting for dirty page
> > > > writeback of foreign memcgs during release.
> > > >
> > > > The root causes are:
> > > >     a. The wb may have multiple writeback tasks, containing millions
> > > >        of dirty pages, as shown below:
> > > >
> > > > >>> for work in list_for_each_entry("struct wb_writeback_work", \
> > > >                                   wb.work_list.address_of_(), "list"):
> > > > ...     print(work.nr_pages, work.reason, hex(work))
> > > > ...
> > > > 900628  WB_REASON_FOREIGN_FLUSH 0xffff969e8d956b40
> > > > 1116521 WB_REASON_FOREIGN_FLUSH 0xffff9698332a9540
> > > >
> > > > ...
> > > >
> > >
> > > I don't think it's particularly harmful that a dedicated worker thread
> > > has to wait for a long time in this fashion.  It doesn't have anything
> > > else to do (does it?) and a blocked kernel thread is cheap.
> >
> > It also delays the release of other resources and the update of
> > vmstats and vmevents statistics for the parent cgroup.
>
> This is new - such considerations weren't described in the changelog.
> How much of a problem are these things?

The cost of the release of other resources should be fine, but I'm not
very sure about the impact of delayed statistics..
>
> > But we can put
> > the wb_wait_for_completion() after the release of these resources.
>
> Can we move these actions into the writeback completion path which
> seems to be the most accurate way to do them?

IMHO, it is technically feasible but logically a little bit weird. The
wb_wait_for_completion() here is only used to wait for wb completion
of foreign dirty pages (pages that memcg and writeback ownerships
don't match) and there's no connection between foreign writeback and
release of memcg resources if I'm not missing something.  It's up to
Tejun anyway.
>
> > >
> > > > 3085016 WB_REASON_FOREIGN_FLUSH 0xffff969f0455e000
> > > > 3035712 WB_REASON_FOREIGN_FLUSH 0xffff969d9bbf4b00
> > > >
> > > >     b. The writeback might severely throttled by wbt, with a speed
> > > >        possibly less than 100kb/s, leading to a very long writeback time.
> > > >
> > > > ...
> > > >
> > > >  include/linux/memcontrol.h | 14 +++++++++-
> > > >  mm/memcontrol.c            | 57 ++++++++++++++++++++++++++++++++------
> > > >  2 files changed, 62 insertions(+), 9 deletions(-)
> > >
> > > Seems we're adding a bunch of tricky code to fix a non-problem which
> > > the hung-task detector undesirably reports.
> >
> > Emm.. What is the definition of 'undesirably' here?
>
> Seems the intent here is mainly to prevent the warning.  If that
> warning wasn't coming out, would we bother making these changes?  If
> no, just kill the warning.

Got it. Seems like there's no more impact other than the pesky warning.
BTW, I'm also seeing many hung task warnings when the mount/umount
syscall is waiting for the s_umount semaphore—while s_umount is held
by the writeback code path. I think the hung task is also undesirable,
right? Since AFAICS there's also no more impact instead of the pesky
warnings.
>
> > >
> > > Would a better fix be to simply suppress the warning?
> > >
> > > I don't think we presently have a touch_hung_task_detector() (do we?)
> > > but it's presumably pretty simple.  And maybe
> > > touch_softlockup_watchdog) should be taught to call that
> > > touch_hung_task_dectector().
> > >
> > > Another approach might be to set some flag in the task_struct
> > > instructing the hung task detector to ignore this thread.
> >
> > To me, this feels kind of like a workaround rather than a real fix.
>
> I don't see why.  It appears that the kworker's intended role is to
> wait for writeback completion then to finish things up.  Which it does
> just fine, except there's this pesky warning we get.  Therefore: kill
> the pesky warning,
>
> > And these approaches are beyond the scope of memcg,
>
> So expand the scope?  If hung-task doesn't have a way to suppress
> inappropriate warnings then add it and use it.

Sounds good and more general.
>
> > I'm not sure how
> > Tejun thinks about it since he mentioned before wanting to keep the
> > modifications inside the memcg. Given the fact we already have an
> > existing solution and code, and the scope of impact is confined to
> > memcg, I prefer to use the existing solution.
>
> mmm...  sunk cost fallacy!  Let's just opt for the best solution,
> regardless of cost.

Sure. I will try new approaches if it's fine to Tejun.
>

Thanks,
-- 
Julian Sun <sunjunchao@bytedance.com>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [External] Re: [PATCH v6] memcg: Don't wait writeback completion when release memcg.
  2025-09-18  4:22       ` Julian Sun
@ 2025-09-18  4:32         ` Andrew Morton
  0 siblings, 0 replies; 9+ messages in thread
From: Andrew Morton @ 2025-09-18  4:32 UTC (permalink / raw)
  To: Julian Sun
  Cc: cgroups, linux-mm, jack, tj, muchun.song, venkat88, hannes,
	mhocko, roman.gushchin, shakeel.butt, Lance Yang,
	Masami Hiramatsu

On Thu, 18 Sep 2025 12:22:35 +0800 Julian Sun <sunjunchao@bytedance.com> wrote:

> > Seems the intent here is mainly to prevent the warning.  If that
> > warning wasn't coming out, would we bother making these changes?  If
> > no, just kill the warning.
> 
> Got it. Seems like there's no more impact other than the pesky warning.
> BTW, I'm also seeing many hung task warnings when the mount/umount
> syscall is waiting for the s_umount semaphore—while s_umount is held
> by the writeback code path. I think the hung task is also undesirable,
> right? Since AFAICS there's also no more impact instead of the pesky
> warnings.

Sure.  Writeback is famous for potentially taking vast amounts of time
and that's perfectly reasonable and expected - lots of dirty data takes
lots of time to write out.  There is no misbehavior in this and warning
the operator about it is just silly.  If there is no action either they
or we can take to prevent the warning then just squash the warning.




^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-09-18  4:32 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-17 21:29 [PATCH v6] memcg: Don't wait writeback completion when release memcg Julian Sun
2025-09-17 21:50 ` Tejun Heo
2025-09-18  2:27   ` [External] " Julian Sun
2025-09-17 22:21 ` Andrew Morton
2025-09-18  2:43   ` Lance Yang
2025-09-18  3:03   ` [External] " Julian Sun
2025-09-18  3:26     ` Andrew Morton
2025-09-18  4:22       ` Julian Sun
2025-09-18  4:32         ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox