From: Zhongkun He <hezhongkun.hzk@bytedance.com>
To: akpm@linux-foundation.org, tytso@mit.edu, jack@suse.com, hannes@cmpxchg.org, mhocko@kernel.org
Cc: muchun.song@linux.dev, linux-ext4@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, Zhongkun He
Subject: [PATCH 0/2] Postpone memcg reclaim to return-to-user path
Date: Wed, 18 Jun 2025 19:39:56 +0800
# Introduction

This patchset introduces an approach that forces memory allocations to be accounted to the memory cgroup even if they exceed the cgroup's maximum limit.
In such cases, the reclaim process is postponed until the task returns to userspace. This benefits tasks that would otherwise perform over-max reclaim while holding multiple locks or other resources (especially resources related to file system writeback). Any task that needs one of those resources would otherwise have to wait until the reclaiming task finishes reclaim and releases it. Postponing reclaim to the return-to-user path avoids this issue.

# Background

We have been encountering a hung-task issue for a long time. Specifically, when a task holds the jbd2 handle and then enters direct reclaim because it has reached the hard limit of its memory cgroup, the system may become blocked for a long time. The stack trace of the thread holding the jbd2 handle is as follows (many other threads are waiting on the same jbd2 handle):

 #0 __schedule at ffffffff97abc6c9
 #1 preempt_schedule_common at ffffffff97abcdaa
 #2 __cond_resched at ffffffff97abcddd
 #3 shrink_active_list at ffffffff9744dca2
 #4 shrink_lruvec at ffffffff97451407
 #5 shrink_node at ffffffff974517c9
 #6 do_try_to_free_pages at ffffffff97451dae
 #7 try_to_free_mem_cgroup_pages at ffffffff974542b8
 #8 try_charge_memcg at ffffffff974f0ede
 #9 charge_memcg at ffffffff974f1d0e
#10 __mem_cgroup_charge at ffffffff974f391c
#11 __add_to_page_cache_locked at ffffffff974313e5
#12 add_to_page_cache_lru at ffffffff974324b2
#13 pagecache_get_page at ffffffff974338e3
#14 __getblk_gfp at ffffffff97556798
#15 __ext4_get_inode_loc at ffffffffc07a5518 [ext4]
#16 ext4_get_inode_loc at ffffffffc07a7fec [ext4]
#17 ext4_reserve_inode_write at ffffffffc07a9fb1 [ext4]
#18 __ext4_mark_inode_dirty at ffffffffc07aa249 [ext4]
#19 __ext4_new_inode at ffffffffc079cbae [ext4]
#20 ext4_create at ffffffffc07c3e56 [ext4]
#21 path_openat at ffffffff9751f471
#22 do_filp_open at ffffffff97521384
#23 do_sys_openat2 at ffffffff97508fd6
#24 do_sys_open at ffffffff9750a65b
#25 do_syscall_64 at ffffffff97aaed14

We obtained a coredump and dumped struct scan_control from it with the crash tool:

struct scan_control {
  nr_to_reclaim = 32,
  order = 0 '\000',
  priority = 1 '\001',
  reclaim_idx = 4 '\004',
  gfp_mask = 17861706,  __GFP_NOFAIL
  nr_scanned = 27810,
  nr_reclaimed = 0,
  nr = {
    dirty = 27797,
    unqueued_dirty = 27797,
    congested = 0,
    writeback = 0,
    immediate = 0,
    file_taken = 27810,
    taken = 27810
  },
}

->nr_reclaimed is zero, meaning no memory was reclaimed, because most of the file pages are unqueued dirty. ->priority is 1, which also means a great deal of time was spent on memory reclamation. Since this thread held the jbd2 handle, the jbd2 thread was waiting for that same handle, which in turn blocked many other threads from writing dirty pages:

 0 [] __schedule at ffffffff97abc6c9
 1 [] schedule at ffffffff97abcd01
 2 [] jbd2_journal_wait_updates at ffffffffc05a522f [jbd2]
 3 [] jbd2_journal_commit_transaction at ffffffffc05a72c6 [jbd2]
 4 [] kjournald2 at ffffffffc05ad66d [jbd2]
 5 [] kthread at ffffffff972bc4c0
 6 [] ret_from_fork at ffffffff9720440f

Furthermore, we observed that memory usage exceeded the configured maximum by around 38GB:

memory.max  : 134896020 (514 GB)
memory.usage: 144747169 (552 GB)

We investigated this issue and identified the root cause in try_charge_memcg():

  retry charge -> charge failed -> direct reclaim ->
  mem_cgroup_oom() returns true, but the selected victim is in an
  uninterruptible state -> retry charge

In these cases, we saw many tasks in the uninterruptible (D) state with a pending SIGKILL. The OOM killer selects a victim and returns success, allowing the current thread to retry the memory charge. However, the victim cannot act on the SIGKILL because it is stuck in an uninterruptible state. As a result, the charging task resets nr_retries and attempts to reclaim again, but the victim never exits.
This causes the current thread to enter a prolonged retry loop in direct reclaim, holding the jbd2 handle far longer and leading to system-wide blocking.

Why are there so many uninterruptible (D) state tasks? Check the most common stack trace:

crash> task_struct.__state ffff8c53a15b3080
  __state = 2,  (#define TASK_UNINTERRUPTIBLE 0x0002)

 0 [] __schedule at ffffffff97abc6c9
 1 [] schedule at ffffffff97abcd01
 2 [] schedule_preempt_disabled at ffffffff97abdf1a
 3 [] rwsem_down_read_slowpath at ffffffff97ac05bf
 4 [] down_read at ffffffff97ac06b1
 5 [] do_user_addr_fault at ffffffff9727f1e7
 6 [] exc_page_fault at ffffffff97ab286e
 7 [] asm_exc_page_fault at ffffffff97c00d42

Check the owner of mm_struct.mmap_lock. The task below entered memory reclaim while holding the mmap lock; there are 68 tasks in this memory cgroup, 23 of them in the memory reclaim context:

 7 [] shrink_active_list at ffffffff9744dd46
 8 [] shrink_lruvec at ffffffff97451407
 9 [] shrink_node at ffffffff974517c9
10 [] do_try_to_free_pages at ffffffff97451dae
11 [] try_to_free_mem_cgroup_pages at ffffffff974542b8
12 [] try_charge_memcg at ffffffff974f0ede
13 [] obj_cgroup_charge_pages at ffffffff974f1dae
14 [] obj_cgroup_charge at ffffffff974f2fc2
15 [] kmem_cache_alloc at ffffffff974d054c
16 [] vm_area_dup at ffffffff972923f1
17 [] __split_vma at ffffffff97486c16
18 [] __do_munmap at ffffffff97486e78
19 [] __vm_munmap at ffffffff97487307
20 [] __x64_sys_munmap at ffffffff974873e7
21 [] do_syscall_64 at ffffffff97aaed14

Many threads were in memory reclaim in the uninterruptible state, while others were blocked on mmap_lock. Although the OOM killer selects a victim, it cannot terminate it. The task holding the jbd2 handle retries the memory charge but fails, and keeps reclaiming while holding the handle. write_pages also fails while waiting for the same jbd2 handle, causing repeated shrink failures and potentially leading to a system-wide block.
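The retry loop described above can be modeled with a small userspace sketch (plain C, not kernel code; reclaim(), oom_kill(), and the constants below are stand-ins for the behavior observed in the coredump, not real kernel interfaces): with every reclaim attempt freeing nothing and the OOM victim stuck in D state, the charge path never makes progress.

```c
#include <stdbool.h>

/* Userspace model of the charge retry loop; numbers taken from the
 * memory.max/memory.usage readings above (in GB for simplicity). */
#define MAX_RETRIES 5

static long usage = 552, max = 514;
static bool victim_stuck_in_D = true; /* victim never handles SIGKILL */

static long reclaim(void)  { return 0; }    /* all pages unqueued dirty */
static bool oom_kill(void) { return true; } /* victim "selected" OK */

/* Returns the attempt on which the charge fits, or -1 if the cap is hit. */
static int try_charge_model(int cap)
{
    int retries = MAX_RETRIES;

    for (int attempts = 1; attempts <= cap; attempts++) {
        if (usage + 1 <= max)
            return attempts;        /* charge fits under the limit */
        usage -= reclaim();         /* direct reclaim frees nothing */
        if (--retries > 0)
            continue;
        if (oom_kill() && victim_stuck_in_D)
            retries = MAX_RETRIES;  /* victim never exits; start over */
    }
    return -1;                      /* effectively an unbounded loop */
}
```

Because the victim never releases memory, the model keeps resetting its retry budget and never returns success, which is exactly the "prolonged retry loop" behavior, only bounded here by the cap.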
ps | grep UN | wc -l
1463

With 1463 tasks in the UN state, the way to break this quasi-deadlock is to let the thread holding the jbd2 handle exit the memory reclamation process quickly.

We found that a related issue was reported and partially fixed by previous patches [1][2]. However, those fixes only skip direct reclaim and return a failure in some cases, such as readahead requests. Since sb_getblk() is called multiple times in __ext4_get_inode_loc() with the NOFAIL flag, the problem still exists. Nor is it feasible to simply drop __GFP_DIRECT_RECLAIM while holding the jbd2 handle to avoid potentially very long reclaim latency, because __GFP_NOFAIL is not supported without __GFP_DIRECT_RECLAIM.

# Fundamentals

This patchset introduces a new task flag, PF_MEMALLOC_ACFORCE, which indicates that memory allocations are forced to be accounted to the memory cgroup even if they exceed the cgroup's maximum limit. Reclaim is deferred until the task returns to userspace, where it holds no kernel resources needed for memory reclamation, thereby preventing priority-inversion problems. Any user who might hit a similar issue can use this new flag to allocate memory and avoid long-term latency for the entire system.

# References

[1] https://lore.kernel.org/linux-fsdevel/20230811071519.1094-1-teawaterz@linux.alibaba.com/
[2] https://lore.kernel.org/all/20230914150011.843330-1-willy@infradead.org/T/#u

Zhongkun He (2):
  mm: memcg: introduce PF_MEMALLOC_ACCOUNTFORCE to postpone reclaim to
    return-to-userland path
  jbd2: mark the transaction context with the scope PF_MEMALLOC_ACFORCE
    context

 fs/jbd2/transaction.c            | 15 +++++--
 include/linux/memcontrol.h       |  6 +++
 include/linux/resume_user_mode.h |  1 +
 include/linux/sched.h            | 11 ++++-
 include/linux/sched/mm.h         | 35 ++++++++++++++++
 mm/memcontrol.c                  | 71 ++++++++++++++++++++++++++++++++
 6 files changed, 133 insertions(+), 6 deletions(-)

--
2.39.5
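As an appendix, a rough userspace illustration of the intended PF_MEMALLOC_ACFORCE semantics. The function names memalloc_acforce_save()/memalloc_acforce_restore() are assumptions modeled on the existing memalloc_nofs_save()/restore() scoped-flag pattern; the actual interface is defined by the patches themselves, and the bodies below are a simplified model, not kernel code.

```c
#include <stdbool.h>

/* Hypothetical model: charges over memory.max succeed under the flag,
 * and the deferred reclaim runs on the return-to-user path. */
#define PF_MEMALLOC_ACFORCE 0x1

static struct {
    unsigned int flags;
    bool reclaim_pending;   /* reclaim deferred to return-to-user */
} current_task;

static struct { long usage, max; } memcg = { .usage = 0, .max = 100 };

static unsigned int memalloc_acforce_save(void)
{
    unsigned int old = current_task.flags;
    current_task.flags |= PF_MEMALLOC_ACFORCE;
    return old;
}

static void memalloc_acforce_restore(unsigned int old)
{
    current_task.flags = old;
}

/* Over-max charges succeed under the flag; reclaim is only noted. */
static bool try_charge(long nr)
{
    if (memcg.usage + nr > memcg.max) {
        if (!(current_task.flags & PF_MEMALLOC_ACFORCE))
            return false;             /* would direct-reclaim here */
        current_task.reclaim_pending = true;
    }
    memcg.usage += nr;
    return true;
}

/* On the return-to-user path the deferred reclaim finally runs,
 * with no jbd2 handle or other kernel resources held. */
static void resume_user_mode_work(void)
{
    if (current_task.reclaim_pending) {
        current_task.reclaim_pending = false;
        if (memcg.usage > memcg.max)
            memcg.usage = memcg.max;  /* stand-in for actual reclaim */
    }
}
```

In jbd2, the idea of patch 2 would then be to wrap the transaction scope in a save/restore pair, so that charges issued while the handle is held go over memory.max instead of reclaiming in place.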