Subject: Re: [PATCH 1/3] mm, memcg: Don't put offlined memcg into local stock
From: Waiman Long <longman@redhat.com>
Date: Sun, 30 Jan 2022 22:55:56 -0500
To: Roman Gushchin
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Vlastimil Babka, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Shakeel Butt, Muchun Song
References: <20211001190938.14050-1-longman@redhat.com> <20211001190938.14050-2-longman@redhat.com>
On 10/1/21 19:51, Roman Gushchin wrote:
> On Fri, Oct 01, 2021 at 03:09:36PM -0400, Waiman Long wrote:
>> When freeing a page associated with an offlined memcg, refill_stock()
>> will put it into the local stock, delaying its demise until another
>> memcg comes in to take its place in the stock. To avoid that, we now
>> check for an offlined memcg and, in that case, go directly to the
>> slowpath for the uncharge via the repurposed cancel_charge() function.
> Hi Waiman!
>
> I'm afraid it can make the cleanup of a dying cgroup slower: for every
> released page we'll potentially traverse the whole cgroup tree and
> decrease atomic page counters.
>
> I'm not sure I understand what benefits we get from this change that
> would justify the slowdown on the cleanup path.
>
> Thanks!

I was notified of a lockdep splat that this patch may help to prevent.

[18073.102101] ======================================================
[18073.102101] WARNING: possible circular locking dependency detected
[18073.102101] 5.14.0-42.el9.x86_64+debug #1 Not tainted
[18073.102101] ------------------------------------------------------
[18073.102101] bz1567074_bin/420270 is trying to acquire lock:
[18073.102101] ffffffff9bdfc478 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x79/0x210
[18073.102101]
[18073.102101] but task is already holding lock:
[18073.102101] ffff88806ba4ef18 (&sighand->siglock){-...}-{2:2}, at: force_sig_info_to_task+0x6c/0x370
[18073.102101]
[18073.102101] which lock already depends on the new lock.
[18073.102101]
[18073.102101]
[18073.102101] the existing dependency chain (in reverse order) is:
[18073.102101]
[18073.102101] -> #1 (&sighand->siglock){-...}-{2:2}:
[18073.102101]        __lock_acquire+0xb72/0x1870
[18073.102101]        lock_acquire.part.0+0x117/0x340
[18073.102101]        _raw_spin_lock_irqsave+0x43/0x90
[18073.102101]        __lock_task_sighand+0xa0/0x210
[18073.102101]        cgroup_freeze_task+0x6f/0x150
[18073.102101]        cgroup_migrate_execute+0x25f/0xf90
[18073.102101]        cgroup_update_dfl_csses+0x417/0x4f0
[18073.102101]        cgroup_subtree_control_write+0x67b/0xa10
[18073.102101]        cgroup_file_write+0x1ef/0x6a0
[18073.102101]        kernfs_fop_write_iter+0x2c7/0x460
[18073.102101]        new_sync_write+0x36f/0x610
[18073.102101]        vfs_write+0x5c6/0x890
[18073.102101]        ksys_write+0xf9/0x1d0
[18073.102101]        do_syscall_64+0x3b/0x90
[18073.102101]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[18073.102101]
[18073.102101] -> #0 (css_set_lock){..-.}-{2:2}:
[18073.102101]        check_prev_add+0x15e/0x20f0
[18073.102101]        validate_chain+0xac6/0xde0
[18073.102101]        __lock_acquire+0xb72/0x1870
[18073.102101]        lock_acquire.part.0+0x117/0x340
[18073.102101]        _raw_spin_lock_irqsave+0x43/0x90
[18073.102101]        obj_cgroup_release+0x79/0x210
[18073.102101]        percpu_ref_put_many.constprop.0+0x16b/0x1a0
[18073.102101]        drain_obj_stock+0x1a8/0x310
[18073.102101]        refill_obj_stock+0xa4/0x480
[18073.102101]        obj_cgroup_charge+0x104/0x240
[18073.102101]        kmem_cache_alloc+0x94/0x400
[18073.102101]        __sigqueue_alloc+0x1b9/0x460
[18073.102101]        __send_signal+0x4b2/0xf60
[18073.102101]        force_sig_info_to_task+0x226/0x370
[18073.102101]        force_sig_fault+0xb0/0xf0
[18073.102101]        noist_exc_debug+0xec/0x110
[18073.102101]        asm_exc_debug+0x2b/0x30
[18073.102101]
[18073.102101] other info that might help us debug this:
[18073.102101]
[18073.102101]  Possible unsafe locking scenario:
[18073.102101]
[18073.102101]        CPU0                    CPU1
[18073.102101]        ----                    ----
[18073.102101]   lock(&sighand->siglock);
[18073.102101]                                lock(css_set_lock);
[18073.102101]                                lock(&sighand->siglock);
[18073.102101]   lock(css_set_lock);
[18073.102101]
[18073.102101]  *** DEADLOCK ***
[18073.102101]
[18073.102101] 2 locks held by bz1567074_bin/420270:
[18073.102101]  #0: ffff88806ba4ef18 (&sighand->siglock){-...}-{2:2}, at: force_sig_info_to_task+0x6c/0x370
[18073.102101]  #1: ffffffff9bd0ea00 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x1a0
[18073.102101]
[18073.102101] stack backtrace:
[18073.102101] CPU: 0 PID: 420270 Comm: bz1567074_bin Kdump: loaded Not tainted 5.14.0-42.el9.x86_64+debug #1
[18073.102101] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[18073.102101] Call Trace:
[18073.102101]  dump_stack_lvl+0x57/0x7d
[18073.102101]  check_noncircular+0x26a/0x310
[18073.102101]  ? pvclock_clocksource_read+0x2b8/0x520
[18073.102101]  ? print_circular_bug+0x1f0/0x1f0
[18073.102101]  ? alloc_chain_hlocks+0x1de/0x530
[18073.102101]  check_prev_add+0x15e/0x20f0
[18073.102101]  validate_chain+0xac6/0xde0
[18073.102101]  ? check_prev_add+0x20f0/0x20f0
[18073.102101]  __lock_acquire+0xb72/0x1870
[18073.102101]  ? __lock_acquire+0xb72/0x1870
[18073.102101]  lock_acquire.part.0+0x117/0x340
[18073.102101]  ? obj_cgroup_release+0x79/0x210
[18073.102101]  ? rcu_read_unlock+0x40/0x40
[18073.102101]  ? rcu_read_lock_sched_held+0x3f/0x70
[18073.102101]  ? lock_acquire+0x224/0x2d0
[18073.102101]  ? obj_cgroup_release+0x79/0x210
[18073.102101]  _raw_spin_lock_irqsave+0x43/0x90
[18073.102101]  ? obj_cgroup_release+0x79/0x210
[18073.102101]  obj_cgroup_release+0x79/0x210
[18073.102101]  percpu_ref_put_many.constprop.0+0x16b/0x1a0
[18073.102101]  drain_obj_stock+0x1a8/0x310
[18073.102101]  refill_obj_stock+0xa4/0x480
[18073.102101]  obj_cgroup_charge+0x104/0x240
[18073.102101]  ? __sigqueue_alloc+0x1b9/0x460
[18073.102101]  kmem_cache_alloc+0x94/0x400
[18073.102101]  ? __sigqueue_alloc+0x129/0x460
[18073.102101]  __sigqueue_alloc+0x1b9/0x460
[18073.102101]  __send_signal+0x4b2/0xf60
[18073.102101]  ? send_signal+0x9f/0x580
[18073.102101]  force_sig_info_to_task+0x226/0x370
[18073.102101]  force_sig_fault+0xb0/0xf0
[18073.102101]  ? force_sig_fault_to_task+0xe0/0xe0
[18073.102101]  ? asm_exc_debug+0x23/0x30
[18073.102101]  ? notify_die+0x88/0x100
[18073.102101]  ? asm_exc_debug+0x23/0x30
[18073.102101]  noist_exc_debug+0xec/0x110
[18073.102101]  asm_exc_debug+0x2b/0x30

The &sighand->siglock => css_set_lock locking sequence arises when a
task holding sighand->siglock calls kmem_cache_alloc(GFP_ATOMIC), and
the release of an obj_cgroup from an offlined memcg still sitting in
the percpu stock leads to a call to obj_cgroup_release(), which takes
css_set_lock. The chance of hitting that is very small, but it can
still happen. So do you think addressing this possible deadlock
scenario is worth the possibly slower release of an offlined memcg?
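To spell the inversion out, the two chains above reduce to the classic
AB-BA ordering problem. Here is a minimal userspace sketch of that
pattern -- not kernel code; pthread mutexes stand in for
&sighand->siglock and css_set_lock, and only the kernel functions named
in the comments come from the splat:

/* Minimal userspace model of the AB-BA inversion lockdep reports.
 * Build with: gcc -pthread. Everything here is illustrative only. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* ~ &sighand->siglock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* ~ css_set_lock */

/* Chain #0: signal delivery takes siglock, then the GFP_ATOMIC
 * allocation drains an obj stock and obj_cgroup_release() takes
 * css_set_lock underneath it. */
static void *signal_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_a);	/* force_sig_info_to_task() */
	pthread_mutex_lock(&lock_b);	/* obj_cgroup_release() */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

/* Chain #1: cgroup_update_dfl_csses() holds css_set_lock and then
 * takes siglock via cgroup_freeze_task(). */
static void *cgroup_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_b);	/* cgroup_migrate_execute() */
	pthread_mutex_lock(&lock_a);	/* cgroup_freeze_task() */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	/* With unlucky timing each thread holds one lock and blocks
	 * forever on the other -- the deadlock lockdep warns about. */
	pthread_create(&t1, NULL, signal_path, NULL);
	pthread_create(&t2, NULL, cgroup_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("no deadlock on this run\n");
	return 0;
}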
Cheers,
Longman

>> Signed-off-by: Waiman Long
>> ---
>>   mm/memcontrol.c | 16 +++++++++++-----
>>   1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4b32896d87a2..4568363062c1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2167,6 +2167,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   	return ret;
>>   }
>>   
>> +static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
>> +
>>   /*
>>    * Returns stocks cached in percpu and reset cached information.
>>    */
>> @@ -2178,9 +2180,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>>   		return;
>>   
>>   	if (stock->nr_pages) {
>> -		page_counter_uncharge(&old->memory, stock->nr_pages);
>> -		if (do_memsw_account())
>> -			page_counter_uncharge(&old->memsw, stock->nr_pages);
>> +		cancel_charge(old, stock->nr_pages);
>>   		stock->nr_pages = 0;
>>   	}
>>   
>> @@ -2219,6 +2219,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   	struct memcg_stock_pcp *stock;
>>   	unsigned long flags;
>>   
>> +	/*
>> +	 * An offlined memcg shouldn't be put into stock.
>> +	 */
>> +	if (unlikely(memcg->kmem_state != KMEM_ONLINE)) {
>> +		cancel_charge(memcg, nr_pages);
>> +		return;
>> +	}
>> +
>>   	local_irq_save(flags);
>>   
>>   	stock = this_cpu_ptr(&memcg_stock);
>> @@ -2732,7 +2740,6 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>   	return try_charge_memcg(memcg, gfp_mask, nr_pages);
>>   }
>>   
>> -#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MMU)
>>   static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   {
>>   	if (mem_cgroup_is_root(memcg))
>> @@ -2742,7 +2749,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   	if (do_memsw_account())
>>   		page_counter_uncharge(&memcg->memsw, nr_pages);
>>   }
>> -#endif
>>   
>>   static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
>>   {
>> -- 
>> 2.18.1
>>
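P.S. On the cleanup-cost concern quoted at the top: what the slowpath
pays on every release is essentially one walk up the counter hierarchy.
A simplified userspace model of that per-release cost follows -- the
kernel's page_counter code is more involved, and the struct and
function names here are illustrative only:

/* Model of the per-release cost of bypassing the percpu stock:
 * one atomic RMW per ancestor level for every uncharge. */
#include <stdatomic.h>
#include <stdio.h>

struct counter {
	_Atomic long usage;
	struct counter *parent;
};

/* Uncharge nr_pages at every level up to the root, mirroring the
 * parent-chain loop in the kernel's page_counter_uncharge(). */
static void uncharge_hierarchy(struct counter *c, long nr_pages)
{
	for (; c; c = c->parent)
		atomic_fetch_sub_explicit(&c->usage, nr_pages,
					  memory_order_relaxed);
}

int main(void)
{
	struct counter root   = { .usage = 300, .parent = NULL };
	struct counter parent = { .usage = 200, .parent = &root };
	struct counter child  = { .usage = 100, .parent = &parent };

	/* One "release" from the child touches three counters here;
	 * in a deep cgroup tree the cost grows with the depth. */
	uncharge_hierarchy(&child, 1);
	printf("child=%ld parent=%ld root=%ld\n",
	       atomic_load(&child.usage),
	       atomic_load(&parent.usage),
	       atomic_load(&root.usage));
	return 0;
}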