From: Shakeel Butt <shakeel.butt@linux.dev>
To: Zhongkun He
Cc: akpm@linux-foundation.org, tytso@mit.edu, jack@suse.com,
	hannes@cmpxchg.org, mhocko@kernel.org, muchun.song@linux.dev,
	linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, cgroups@vger.kernel.org
Subject: Re: [PATCH 0/2] Postpone memcg reclaim to return-to-user path
Date: Wed, 18 Jun 2025 15:37:20 -0700

Hi Zhongkun,

Thanks for the very detailed and awesome description of the problem. This
is a real issue and we at Meta face similar scenarios as well. However I
would not go for the PF_MEMALLOC_ACFORCE approach: it is easy to abuse,
and it is very manual in that it requires identifying each code path that
can cause such scenarios and opting them in case by case. I would prefer a
dynamic or automated approach where the kernel detects that such an issue
is happening and recovers from it. A case can also be made for avoiding
such scenarios in the first place, but that may not be possible every
time. In addition, this is very memcg specific; I can clearly see the same
scenario happening for global reclaim as well.

I have a couple of questions below.

On Wed, Jun 18, 2025 at 07:39:56PM +0800, Zhongkun He wrote:
> # Introduction
>
> This patchset aims to introduce an approach to ensure that memory
> allocations are forced to be accounted to the memory cgroup, even if
> they exceed the cgroup's maximum limit. In such cases, the reclaim
> process is postponed until the task returns to the user.

This breaks memory.max semantics. Is there a reason memory.high is not
used here? Basically, instead of memory.max, use memory.high as the job
limit. I would like to know where memory.high falls short for your
use-case; maybe we can fix that or introduce a new form of limit. However
this is memcg specific and will not resolve the global reclaim case.
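To be explicit about why I keep pointing at memory.high: the over-high
penalty is already applied on return to userspace rather than
synchronously at the charge site, so that reclaim runs without any of the
kernel locks your traces show. Very roughly (this is me paraphrasing
mm/memcontrol.c and the resume_user_mode path from memory; names are
abbreviated and details differ across kernel versions, so treat it as a
sketch, not the exact code):

	/* charge path: do not reclaim here, just record the debt */
	static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
				    unsigned int nr_pages)
	{
		/* ... page_counter_try_charge() etc. ... */
		if (usage_above_high(memcg)) {		/* simplified check */
			current->memcg_nr_pages_over_high += nr_pages;
			set_notify_resume(current);	/* reclaim later */
		}
		return 0;
	}

	/* return-to-user path: no kernel locks held, safe to reclaim/throttle */
	static void resume_user_mode_work(struct pt_regs *regs)
	{
		/* ... */
		mem_cgroup_handle_over_high(GFP_KERNEL);
	}

That is essentially the "postpone reclaim to return-to-user" behavior your
cover letter asks for, but without breaking the memory.max contract.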
> This is
> beneficial for users who perform over-max reclaim while holding multiple
> locks or other resources (especially resources related to file system
> writeback). If a task needs any of these resources, it would otherwise
> have to wait until the other task completes reclaim and releases the
> resources. Postponing reclaim to the return-to-user path helps avoid
> this issue.
>
> # Background
>
> We have been encountering a hung task issue for a long time.
> Specifically, when a task holds the jbd2 handler

Can you explain a bit more about the jbd2 handler? Is it a global shared
lock, or a workqueue that can only run a single thread at a time?
Basically, is there a way to get the current holder/owner of the jbd2
handler programmatically?

> and subsequently enters direct reclaim
> because it reaches the hard limit within a memory cgroup, the system may
> become blocked for a long time. The stack trace of the waiting thread
> holding the jbd2 handle is as follows (and many other threads are waiting
> on the same jbd2 handle):
>
>  #0 __schedule at ffffffff97abc6c9
>  #1 preempt_schedule_common at ffffffff97abcdaa
>  #2 __cond_resched at ffffffff97abcddd
>  #3 shrink_active_list at ffffffff9744dca2
>  #4 shrink_lruvec at ffffffff97451407
>  #5 shrink_node at ffffffff974517c9
>  #6 do_try_to_free_pages at ffffffff97451dae
>  #7 try_to_free_mem_cgroup_pages at ffffffff974542b8
>  #8 try_charge_memcg at ffffffff974f0ede
>  #9 charge_memcg at ffffffff974f1d0e
> #10 __mem_cgroup_charge at ffffffff974f391c
> #11 __add_to_page_cache_locked at ffffffff974313e5
> #12 add_to_page_cache_lru at ffffffff974324b2
> #13 pagecache_get_page at ffffffff974338e3
> #14 __getblk_gfp at ffffffff97556798
> #15 __ext4_get_inode_loc at ffffffffc07a5518 [ext4]
> #16 ext4_get_inode_loc at ffffffffc07a7fec [ext4]
> #17 ext4_reserve_inode_write at ffffffffc07a9fb1 [ext4]
> #18 __ext4_mark_inode_dirty at ffffffffc07aa249 [ext4]
> #19 __ext4_new_inode at ffffffffc079cbae [ext4]
> #20 ext4_create at ffffffffc07c3e56 [ext4]
> #21 path_openat at ffffffff9751f471
> #22 do_filp_open at ffffffff97521384
> #23 do_sys_openat2 at ffffffff97508fd6
> #24 do_sys_open at ffffffff9750a65b
> #25 do_syscall_64 at ffffffff97aaed14
>
> We obtained a coredump and dumped struct scan_control from it using the
> crash tool:
>
> struct scan_control {
>   nr_to_reclaim = 32,
>   order = 0 '\000',
>   priority = 1 '\001',
>   reclaim_idx = 4 '\004',
>   gfp_mask = 17861706,		__GFP_NOFAIL
>   nr_scanned = 27810,
>   nr_reclaimed = 0,
>   nr = {
>     dirty = 27797,
>     unqueued_dirty = 27797,
>     congested = 0,
>     writeback = 0,
>     immediate = 0,
>     file_taken = 27810,
>     taken = 27810
>   },
> }

What is the kernel version? Can you run scripts/gfp-translate on the
gfp_mask above? Does this kernel have a75ffa26122b ("memcg, oom: do not
bypass oom killer for dying tasks")?

> The ->nr_reclaimed is zero, meaning no memory has been reclaimed, because
> most of the file pages are unqueued dirty. And ->priority is 1, which
> also means we spent a lot of time in memory reclamation.

Is there a way to get how many times this thread has looped within
try_charge_memcg()?

> Since this thread has held the jbd2 handler, the jbd2 thread was waiting
> for the same jbd2 handler, which blocked many other threads from writing
> dirty pages as well.
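To explain why I am asking about the loop count: the charge-side retry/oom
flow is roughly the following (again paraphrased from mm/memcontrol.c from
memory; the exact conditions depend on your kernel version, so this is a
sketch rather than the real code):

	static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
				    unsigned int nr_pages)
	{
		unsigned int nr_retries = MAX_RECLAIM_RETRIES;
		bool passed_oom = false;
		struct page_counter *counter;

	retry:
		if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
			return 0;			/* fits under max */

		try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, 0);

		if (nr_retries--)
			goto retry;			/* reclaim and recheck */

		/* out of retries: ask the oom killer for help */
		if (!passed_oom &&
		    mem_cgroup_oom(memcg, gfp_mask,
				   get_order(nr_pages * PAGE_SIZE))) {
			passed_oom = true;
			nr_retries = MAX_RECLAIM_RETRIES;	/* reset! */
			goto retry;
		}

		/* __GFP_NOFAIL and similar cases: overcharge past memory.max */
		page_counter_charge(&memcg->memory, nr_pages);
		return 0;
	}

If the oom killer keeps "succeeding" against victims that never actually
exit (your D state tasks below), the nr_retries reset means this loop can
run for a very long time while the task still holds the jbd2 handler,
which matches what you are describing. (The kjournald2 trace quoted next
is presumably that jbd2 thread waiting.)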
>  0 [] __schedule at ffffffff97abc6c9
>  1 [] schedule at ffffffff97abcd01
>  2 [] jbd2_journal_wait_updates at ffffffffc05a522f [jbd2]
>  3 [] jbd2_journal_commit_transaction at ffffffffc05a72c6 [jbd2]
>  4 [] kjournald2 at ffffffffc05ad66d [jbd2]
>  5 [] kthread at ffffffff972bc4c0
>  6 [] ret_from_fork at ffffffff9720440f
>
> Furthermore, we observed that memory usage exceeded the configured memory
> maximum by around 38GB.
>
> memory.max  : 134896020	514 GB
> memory.usage: 144747169	552 GB

This is unexpected, and most probably our hacks to allow overcharge to
avoid similar situations are causing this.

> We investigated this issue and identified the root cause:
>
> try_charge_memcg:
>   retry charge
>   charge failed
>   -> direct reclaim
>   -> mem_cgroup_oom returns true, but the selected task is in an
>      uninterruptible state
>   -> retry charge

Oh, the oom reaper didn't help?

> In such cases, we saw many tasks in the uninterruptible (D) state with a
> pending SIGKILL signal. The OOM killer selects a victim and returns
> success, allowing the current thread to retry the memory charge. However,
> the selected task cannot acknowledge the SIGKILL signal because it is
> stuck in an uninterruptible state.

The OOM reaper usually helps in such cases, but I see below why it didn't
help here.

> As a result, the charging task resets nr_retries and attempts to reclaim
> again, but the victim task never exits. This causes the current thread to
> enter a prolonged retry loop during direct reclaim, holding the jbd2
> handler for much more time and leading to system-wide blocking.

Why are there so many uninterruptible (D) state tasks?

> Check the most common stack trace:
>
> crash> task_struct.__state ffff8c53a15b3080
>   __state = 2,		#define TASK_UNINTERRUPTIBLE 0x0002
>
>  0 [] __schedule at ffffffff97abc6c9
>  1 [] schedule at ffffffff97abcd01
>  2 [] schedule_preempt_disabled at ffffffff97abdf1a
>  3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
>  4 [] down_read at ffffffff97ac06b1
>  5 [] do_user_addr_fault at ffffffff9727f1e7
>  6 [] exc_page_fault at ffffffff97ab286e
>  7 [] asm_exc_page_fault at ffffffff97c00d42
>
> Check the owner of mm_struct.mmap_lock. The task below was entering
> memory reclaim while holding the mmap lock, and there are 68 tasks in
> this memory cgroup, with 23 of them in the memory reclaim context.

The following thread has mmap_lock in write mode and thus the oom-reaper
is not helping. Do you see "oom_reaper: unable to reap pid..." messages in
dmesg?

>  7 [] shrink_active_list at ffffffff9744dd46
>  8 [] shrink_lruvec at ffffffff97451407
>  9 [] shrink_node at ffffffff974517c9
> 10 [] do_try_to_free_pages at ffffffff97451dae
> 11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
> 12 [] try_charge_memcg at ffffffff974f0ede
> 13 [] obj_cgroup_charge_pages at ffffffff974f1dae
> 14 [] obj_cgroup_charge at ffffffff974f2fc2
> 15 [] kmem_cache_alloc at ffffffff974d054c
> 16 [] vm_area_dup at ffffffff972923f1
> 17 [] __split_vma at ffffffff97486c16
> 18 [] __do_munmap at ffffffff97486e78
> 19 [] __vm_munmap at ffffffff97487307
> 20 [] __x64_sys_munmap at ffffffff974873e7
> 21 [] do_syscall_64 at ffffffff97aaed14
>
> Many threads were entering memory reclaim in the UN state, and other
> threads were blocked on mmap_lock. Although the OOM killer selects a
> victim, it cannot terminate it.

Can you please confirm the above? Is the kernel able to oom-kill more
processes, or is it returning early because the current thread is dying?
However, if the cgroup has just one big process, this doesn't matter.
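On the oom-reaper point above: the reaper can only free the victim's
memory if it can take mmap_lock for read, and it gives up after a handful
of attempts, roughly like this (paraphrasing mm/oom_kill.c from memory;
details differ across versions):

	/* mm/oom_kill.c, simplified */
	static void oom_reap_task(struct task_struct *tsk)
	{
		int attempts = 0;
		struct mm_struct *mm = tsk->signal->oom_mm;

		/*
		 * oom_reap_task_mm() bails out if mmap_read_trylock(mm)
		 * fails, e.g. because a writer (like the munmap thread
		 * above) holds the lock while it is stuck in reclaim.
		 */
		while (attempts++ < MAX_OOM_REAP_RETRIES &&
		       !oom_reap_task_mm(tsk, mm))
			schedule_timeout_idle(HZ / 10);

		if (attempts > MAX_OOM_REAP_RETRIES)
			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
				task_pid_nr(tsk), tsk->comm);
	}

So if the victim (or a thread sharing its mm) is itself stuck in reclaim
under mmap_lock held for write, the victim's memory never gets reaped, the
usage never drops, and the charging thread keeps looping.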
> The task holding the jbd2 handle retries the memory charge, but it
> fails. Reclaim continues while holding the jbd2 handler. write_pages also
> fails while waiting for the same jbd2 handler, causing repeated shrink
> failures and potentially leading to a system-wide block.
>
>   ps | grep UN | wc -l
>   1463
>
> The system has 1463 UN-state tasks, so the way to break this
> "deadlock"-like situation is to let the thread holding the jbd2 handler
> quickly exit the memory reclamation process.
>
> We found that a related issue was reported and partially fixed in
> previous patches [1][2]. However, those fixes only skip direct reclaim
> and return a failure for some cases such as readahead requests. As
> sb_getblk() is called multiple times in __ext4_get_inode_loc() with the
> NOFAIL flag, the problem still exists. And it is not feasible to simply
> drop __GFP_DIRECT_RECLAIM while holding the jbd2 handle to avoid
> potentially very long memory reclaim latency, as __GFP_NOFAIL is not
> supported without __GFP_DIRECT_RECLAIM.
>
> # Fundamentals
>
> This patchset introduces a new task flag, PF_MEMALLOC_ACFORCE, to
> indicate that memory allocations are forced to be accounted to the memory
> cgroup even if they exceed the cgroup's maximum limit. The reclaim
> process is deferred until the task returns to userspace, where it holds
> no kernel resources needed for memory reclamation, thereby preventing
> priority inversion problems. Any users who might encounter similar issues
> can use this new flag when allocating memory to prevent long-term latency
> for the entire system.

I already explained upfront why this is not the approach we want. We do
see similar situations/scenarios, due to global/shared locks in btrfs, but
I expect any global lock or globally shared resource can cause such
priority inversion situations.
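To make the "very manual" concern more concrete: as far as I can tell,
every suspect code path would need an explicit opt-in bracket, presumably
along the lines of the existing memalloc_nofs_save()/memalloc_nofs_restore()
scoping helpers. A hypothetical sketch (memalloc_acforce_save/restore are
names I made up for illustration, not something in your patches or in the
tree):

	/* hypothetical helpers, modeled on memalloc_nofs_save()/restore() */
	static inline unsigned int memalloc_acforce_save(void)
	{
		unsigned int flags = current->flags & PF_MEMALLOC_ACFORCE;

		current->flags |= PF_MEMALLOC_ACFORCE;
		return flags;
	}

	static inline void memalloc_acforce_restore(unsigned int flags)
	{
		current->flags = (current->flags & ~PF_MEMALLOC_ACFORCE) | flags;
	}

	/*
	 * ...and every problematic call site has to be found and wrapped,
	 * e.g. (illustrative only) around allocations made under a jbd2
	 * handle:
	 */
	flags = memalloc_acforce_save();
	err = ext4_reserve_inode_write(handle, inode, &iloc);
	memalloc_acforce_restore(flags);

Finding and maintaining all such call sites (ext4 today, btrfs and
whatever else tomorrow) is exactly the case-by-case opt-in I would like to
avoid, which is why I would prefer the kernel to detect and recover from
this situation automatically.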