From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=AujH=NJ=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-13.6 required=3.0 tests=BAYES_00,DKIM_INVALID,
	DKIM_SIGNED,HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH,
	MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=ham
	autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 3A5F6C4338F
	for <linux-mm@archiver.kernel.org>; Wed, 18 Aug 2021 06:30:57 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id AD5C46024A
	for <linux-mm@archiver.kernel.org>; Wed, 18 Aug 2021 06:30:56 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org AD5C46024A
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.dev
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org
Received: by kanga.kvack.org (Postfix)
	id 47C3F6B006C; Wed, 18 Aug 2021 02:30:56 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 405908D0001; Wed, 18 Aug 2021 02:30:56 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 2CD036B0073; Wed, 18 Aug 2021 02:30:56 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0008.hostedemail.com [216.40.44.8])
	by kanga.kvack.org (Postfix) with ESMTP id 0EC5B6B006C
	for <linux-mm@kvack.org>; Wed, 18 Aug 2021 02:30:56 -0400 (EDT)
Received: from smtpin07.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay01.hostedemail.com (Postfix) with ESMTP id A56FC181CBC02
	for <linux-mm@kvack.org>; Wed, 18 Aug 2021 06:30:55 +0000 (UTC)
X-FDA: 78487228470.07.59D5C54
Received: from out1.migadu.com (out1.migadu.com [91.121.223.63])
	by imf07.hostedemail.com (Postfix) with ESMTP id D451810004F1
	for <linux-mm@kvack.org>; Wed, 18 Aug 2021 06:30:54 +0000 (UTC)
Date: Wed, 18 Aug 2021 15:30:42 +0900
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.dev; s=key1;
	t=1629268252;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references;
	bh=4+y1w2cJDB4TCUahkeLKEQYMWNkdA4dKlcdPI1ef7f8=;
	b=G5LhJhkj3YFTZCwOF7qAvjiygCFB3G2KBK+yTUE4u/PwbXpaLUIIWcL92EyPwiN8Ex5uUI
	GJAmQyecSna7eSCsLHUN6PzI9ZWPMV/VLKt+/PIUhD5r/TfGmatNJYtwvKsnjbxA75hGZV
	QPLcbu7qfN//sgIMJoJ7PYpCqdopHmA=
X-Report-Abuse: Please report any abuse attempt to abuse@migadu.com and include these headers.
From: Naoya Horiguchi <naoya.horiguchi@linux.dev>
To: Yang Shi <shy828301@gmail.com>
Cc: osalvador@suse.de, tdmackey@twitter.com, akpm@linux-foundation.org,
	corbet@lwn.net, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: Re: [PATCH 1/2] mm: hwpoison: don't drop slab caches for offlining
 non-LRU page
Message-ID: <20210818063042.GA2310427@u2004>
References: <20210816180909.3603-1-shy828301@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20210816180909.3603-1-shy828301@gmail.com>
X-Migadu-Flow: FLOW_OUT
X-Migadu-Auth-User: naoya.horiguchi@linux.dev
X-Rspamd-Queue-Id: D451810004F1
Authentication-Results: imf07.hostedemail.com;
	dkim=pass header.d=linux.dev header.s=key1 header.b=G5LhJhkj;
	spf=pass (imf07.hostedemail.com: domain of naoya.horiguchi@linux.dev designates 91.121.223.63 as permitted sender) smtp.mailfrom=naoya.horiguchi@linux.dev;
	dmarc=pass (policy=none) header.from=linux.dev
X-Rspamd-Server: rspam01
X-Stat-Signature: phf3nih5n5o8y6c78m4w84yktzkz8oku
X-HE-Tag: 1629268254-735671
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
> In the current implementation of soft offline, if non-LRU page is met,
> all the slab caches will be dropped to free the page then offline.  But
> if the page is not slab page all the effort is wasted in vain.  Even
> though it is a slab page, it is not guaranteed the page could be freed
> at all.
>
> However the side effect and cost is quite high.  It does not only drop
> the slab caches, but also may drop a significant amount of page caches
> which are associated with inode caches.  It could make the most
> workingset gone in order to just offline a page.  And the offline is not
> guaranteed to succeed at all, actually I really doubt the success rate
> for real life workload.
>
> Furthermore the worse consequence is the system may be locked up and
> unusable since the page cache release may incur huge amount of works
> queued for memcg release.
>
> Actually we ran into such unpleasant case in our production environment.
> Firstly, the workqueue of memory_failure_work_func is locked up as
> below:
>
> BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
>     in-flight: 409271:memory_failure_work_func
>     pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
> workqueue mm_percpu_wq: flags=0x8
>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>     pending: vmstat_update
> workqueue cgroup_destroy: flags=0x0
>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
>     pending: css_release_work_fn
>
> There were over 12K css_release_work_fn queued, and this caused a few
> lockups due to the contention of worker pool lock with IRQ disabled, for
> example:
>
> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
> CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
> Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
> Workqueue: events memory_failure_work_func
> RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
> Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
> RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
> RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
> RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
> R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
> R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
> FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
> Call Trace:
>  __queue_work+0xd6/0x3c0
>  queue_work_on+0x1c/0x30
>  uncharge_batch+0x10e/0x110
>  mem_cgroup_uncharge_list+0x6d/0x80
>  release_pages+0x37f/0x3f0
>  __pagevec_release+0x1c/0x50
>  __invalidate_mapping_pages+0x348/0x380
>  ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
>  inode_lru_isolate+0x10a/0x160
>  ? iput+0x1d0/0x1d0
>  __list_lru_walk_one+0x7b/0x170
>  ? iput+0x1d0/0x1d0
>  list_lru_walk_one+0x4a/0x60
>  prune_icache_sb+0x37/0x50
>  super_cache_scan+0x123/0x1a0
>  do_shrink_slab+0x10c/0x2c0
>  shrink_slab+0x1f1/0x290
>  drop_slab_node+0x4d/0x70
>  soft_offline_page+0x1ac/0x5b0
>  ? dev_mce_log+0xee/0x110
>  ? notifier_call_chain+0x39/0x90
>  memory_failure_work_func+0x6a/0x90
>  process_one_work+0x19e/0x340
>  ? process_one_work+0x340/0x340
>  worker_thread+0x30/0x360
>  ? process_one_work+0x340/0x340
>  kthread+0x116/0x130
>
> The lockup made the machine is quite unusable.  And it also made the
> most workingset gone, the reclaimabled slab caches were reduced from 12G
> to 300MB, the page caches were decreased from 17G to 4G.
>
> But the most disappointing thing is all the effort doesn't make the page
> offline, it just returns:
>
> soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
>
> It seems the aggressive behavior for non-LRU page didn't pay back, so it
> doesn't make too much sense to keep it considering the terrible side
> effect.
>
> Reported-by: David Mackey <tdmackey@twitter.com>
> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Thank you. I agree with the idea of removing the drop_slab_node() call from
shake_page(), hoping that a range-based slab shrinker will be implemented in
the future.

This patch conflicts with the patch
https://lore.kernel.org/linux-mm/20210817053703.2267588-1-naoya.horiguchi@linux.dev/T/#u
which adds another shake_page() call, so could you add the following hunk to
your patch?

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 64f8ac969544..7dd2ca665866 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1198,7 +1198,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 			 * page, retry.
 			 */
 			if (pass++ < 3) {
-				shake_page(p, 1);
+				shake_page(p);
 				goto try_again;
 			}
 			goto out;


Thanks,
Naoya Horiguchi