Subject: Re: [PATCH 1/3] mm, memcg: Don't put offlined memcg into local stock
From: Waiman Long <longman@redhat.com>
Date: Sun, 30 Jan 2022 22:55:56 -0500
To: Roman Gushchin
Cc: Johannes Weiner, Michal Hocko, Vladimir Davydov, Andrew Morton, Vlastimil Babka, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, Shakeel Butt, Muchun Song
References: <20211001190938.14050-1-longman@redhat.com> <20211001190938.14050-2-longman@redhat.com>
On 10/1/21 19:51, Roman Gushchin wrote:
> On Fri, Oct 01, 2021 at 03:09:36PM -0400, Waiman Long wrote:
>> When freeing a page associated with an offlined memcg, refill_stock()
>> will put it into the local stock, delaying its demise until another
>> memcg comes in to take its place in the stock. To avoid that, we now
>> check for an offlined memcg and, in that case, go directly to the
>> slowpath for the uncharge via the repurposed cancel_charge() function.
> Hi Waiman!
>
> I'm afraid it can make the cleanup of a dying cgroup slower: for every
> released page we'll potentially traverse the whole cgroup tree and
> decrease atomic page counters.
>
> I'm not sure I understand what benefits we get from this change that
> would justify the slowdown on the cleanup path.
>
> Thanks!

I was notified of a lockdep splat that this patch may help to prevent.

[18073.102101] ======================================================
[18073.102101] WARNING: possible circular locking dependency detected
[18073.102101] 5.14.0-42.el9.x86_64+debug #1 Not tainted
[18073.102101] ------------------------------------------------------
[18073.102101] bz1567074_bin/420270 is trying to acquire lock:
[18073.102101] ffffffff9bdfc478 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x79/0x210
[18073.102101]
[18073.102101] but task is already holding lock:
[18073.102101] ffff88806ba4ef18 (&sighand->siglock){-...}-{2:2}, at: force_sig_info_to_task+0x6c/0x370
[18073.102101]
[18073.102101] which lock already depends on the new lock.
[18073.102101]
[18073.102101]
[18073.102101] the existing dependency chain (in reverse order) is:
[18073.102101]
[18073.102101] -> #1 (&sighand->siglock){-...}-{2:2}:
[18073.102101]        __lock_acquire+0xb72/0x1870
[18073.102101]        lock_acquire.part.0+0x117/0x340
[18073.102101]        _raw_spin_lock_irqsave+0x43/0x90
[18073.102101]        __lock_task_sighand+0xa0/0x210
[18073.102101]        cgroup_freeze_task+0x6f/0x150
[18073.102101]        cgroup_migrate_execute+0x25f/0xf90
[18073.102101]        cgroup_update_dfl_csses+0x417/0x4f0
[18073.102101]        cgroup_subtree_control_write+0x67b/0xa10
[18073.102101]        cgroup_file_write+0x1ef/0x6a0
[18073.102101]        kernfs_fop_write_iter+0x2c7/0x460
[18073.102101]        new_sync_write+0x36f/0x610
[18073.102101]        vfs_write+0x5c6/0x890
[18073.102101]        ksys_write+0xf9/0x1d0
[18073.102101]        do_syscall_64+0x3b/0x90
[18073.102101]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[18073.102101]
[18073.102101] -> #0 (css_set_lock){..-.}-{2:2}:
[18073.102101]        check_prev_add+0x15e/0x20f0
[18073.102101]        validate_chain+0xac6/0xde0
[18073.102101]        __lock_acquire+0xb72/0x1870
[18073.102101]        lock_acquire.part.0+0x117/0x340
[18073.102101]        _raw_spin_lock_irqsave+0x43/0x90
[18073.102101]        obj_cgroup_release+0x79/0x210
[18073.102101]        percpu_ref_put_many.constprop.0+0x16b/0x1a0
[18073.102101]        drain_obj_stock+0x1a8/0x310
[18073.102101]        refill_obj_stock+0xa4/0x480
[18073.102101]        obj_cgroup_charge+0x104/0x240
[18073.102101]        kmem_cache_alloc+0x94/0x400
[18073.102101]        __sigqueue_alloc+0x1b9/0x460
[18073.102101]        __send_signal+0x4b2/0xf60
[18073.102101]        force_sig_info_to_task+0x226/0x370
[18073.102101]        force_sig_fault+0xb0/0xf0
[18073.102101]        noist_exc_debug+0xec/0x110
[18073.102101]        asm_exc_debug+0x2b/0x30
[18073.102101]
[18073.102101] other info that might help us debug this:
[18073.102101]
[18073.102101]  Possible unsafe locking scenario:
[18073.102101]
[18073.102101]        CPU0                    CPU1
[18073.102101]        ----                    ----
[18073.102101]   lock(&sighand->siglock);
[18073.102101]                                lock(css_set_lock);
[18073.102101]                                lock(&sighand->siglock);
[18073.102101]   lock(css_set_lock);
[18073.102101]
[18073.102101]  *** DEADLOCK ***
[18073.102101]
[18073.102101] 2 locks held by bz1567074_bin/420270:
[18073.102101]  #0: ffff88806ba4ef18 (&sighand->siglock){-...}-{2:2}, at: force_sig_info_to_task+0x6c/0x370
[18073.102101]  #1: ffffffff9bd0ea00 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x1a0
[18073.102101]
[18073.102101] stack backtrace:
[18073.102101] CPU: 0 PID: 420270 Comm: bz1567074_bin Kdump: loaded Not tainted 5.14.0-42.el9.x86_64+debug #1
[18073.102101] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
[18073.102101] Call Trace:
[18073.102101]  dump_stack_lvl+0x57/0x7d
[18073.102101]  check_noncircular+0x26a/0x310
[18073.102101]  ? pvclock_clocksource_read+0x2b8/0x520
[18073.102101]  ? print_circular_bug+0x1f0/0x1f0
[18073.102101]  ? alloc_chain_hlocks+0x1de/0x530
[18073.102101]  check_prev_add+0x15e/0x20f0
[18073.102101]  validate_chain+0xac6/0xde0
[18073.102101]  ? check_prev_add+0x20f0/0x20f0
[18073.102101]  __lock_acquire+0xb72/0x1870
[18073.102101]  ? __lock_acquire+0xb72/0x1870
[18073.102101]  lock_acquire.part.0+0x117/0x340
[18073.102101]  ? obj_cgroup_release+0x79/0x210
[18073.102101]  ? rcu_read_unlock+0x40/0x40
[18073.102101]  ? rcu_read_lock_sched_held+0x3f/0x70
[18073.102101]  ? lock_acquire+0x224/0x2d0
[18073.102101]  ? obj_cgroup_release+0x79/0x210
[18073.102101]  _raw_spin_lock_irqsave+0x43/0x90
[18073.102101]  ? obj_cgroup_release+0x79/0x210
[18073.102101]  obj_cgroup_release+0x79/0x210
[18073.102101]  percpu_ref_put_many.constprop.0+0x16b/0x1a0
[18073.102101]  drain_obj_stock+0x1a8/0x310
[18073.102101]  refill_obj_stock+0xa4/0x480
[18073.102101]  obj_cgroup_charge+0x104/0x240
[18073.102101]  ? __sigqueue_alloc+0x1b9/0x460
[18073.102101]  kmem_cache_alloc+0x94/0x400
[18073.102101]  ? __sigqueue_alloc+0x129/0x460
[18073.102101]  __sigqueue_alloc+0x1b9/0x460
[18073.102101]  __send_signal+0x4b2/0xf60
[18073.102101]  ? send_signal+0x9f/0x580
[18073.102101]  force_sig_info_to_task+0x226/0x370
[18073.102101]  force_sig_fault+0xb0/0xf0
[18073.102101]  ? force_sig_fault_to_task+0xe0/0xe0
[18073.102101]  ? asm_exc_debug+0x23/0x30
[18073.102101]  ? notify_die+0x88/0x100
[18073.102101]  ? asm_exc_debug+0x23/0x30
[18073.102101]  noist_exc_debug+0xec/0x110
[18073.102101]  asm_exc_debug+0x2b/0x30

The &sighand->siglock => css_set_lock locking sequence arises when a
task holding sighand->siglock calls kmem_cache_alloc(GFP_ATOMIC), and
the release of an obj_cgroup from an offlined memcg still sitting in
the percpu stock leads to a call to obj_cgroup_release(), which takes
css_set_lock. The chance of hitting that is very small, but it can
still happen. So do you think addressing this possible deadlock
scenario is worth the possibly slower release of an offlined memcg?
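To spell the inversion out, the two chains above reduce to the classic
AB-BA ordering problem. Here is a minimal userspace sketch of that
pattern -- not kernel code; pthread mutexes stand in for
&sighand->siglock and css_set_lock, and only the kernel functions named
in the comments come from the splat:

/* Minimal userspace model of the AB-BA inversion lockdep reports.
 * Build with: gcc -pthread. Everything here is illustrative only. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER; /* ~ &sighand->siglock */
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER; /* ~ css_set_lock */

/* Chain #0: signal delivery takes siglock, then the GFP_ATOMIC
 * allocation drains an obj stock and obj_cgroup_release() takes
 * css_set_lock underneath it. */
static void *signal_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_a);	/* force_sig_info_to_task() */
	pthread_mutex_lock(&lock_b);	/* obj_cgroup_release() */
	pthread_mutex_unlock(&lock_b);
	pthread_mutex_unlock(&lock_a);
	return NULL;
}

/* Chain #1: cgroup_update_dfl_csses() holds css_set_lock and then
 * takes siglock via cgroup_freeze_task(). */
static void *cgroup_path(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&lock_b);	/* cgroup_migrate_execute() */
	pthread_mutex_lock(&lock_a);	/* cgroup_freeze_task() */
	pthread_mutex_unlock(&lock_a);
	pthread_mutex_unlock(&lock_b);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	/* With unlucky timing each thread holds one lock and blocks
	 * forever on the other -- the deadlock lockdep warns about. */
	pthread_create(&t1, NULL, signal_path, NULL);
	pthread_create(&t2, NULL, cgroup_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("no deadlock on this run\n");
	return 0;
}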
Cheers,
Longman

>> Signed-off-by: Waiman Long
>> ---
>>   mm/memcontrol.c | 16 +++++++++++-----
>>   1 file changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4b32896d87a2..4568363062c1 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2167,6 +2167,8 @@ static bool consume_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   	return ret;
>>   }
>>   
>> +static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
>> +
>>   /*
>>    * Returns stocks cached in percpu and reset cached information.
>>    */
>> @@ -2178,9 +2180,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
>>   		return;
>>   
>>   	if (stock->nr_pages) {
>> -		page_counter_uncharge(&old->memory, stock->nr_pages);
>> -		if (do_memsw_account())
>> -			page_counter_uncharge(&old->memsw, stock->nr_pages);
>> +		cancel_charge(old, stock->nr_pages);
>>   		stock->nr_pages = 0;
>>   	}
>>   
>> @@ -2219,6 +2219,14 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   	struct memcg_stock_pcp *stock;
>>   	unsigned long flags;
>>   
>> +	/*
>> +	 * An offlined memcg shouldn't be put into stock.
>> +	 */
>> +	if (unlikely(memcg->kmem_state != KMEM_ONLINE)) {
>> +		cancel_charge(memcg, nr_pages);
>> +		return;
>> +	}
>> +
>>   	local_irq_save(flags);
>>   
>>   	stock = this_cpu_ptr(&memcg_stock);
>> @@ -2732,7 +2740,6 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>>   	return try_charge_memcg(memcg, gfp_mask, nr_pages);
>>   }
>>   
>> -#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MMU)
>>   static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   {
>>   	if (mem_cgroup_is_root(memcg))
>> @@ -2742,7 +2749,6 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
>>   	if (do_memsw_account())
>>   		page_counter_uncharge(&memcg->memsw, nr_pages);
>>   }
>> -#endif
>>   
>>   static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
>>   {
>> -- 
>> 2.18.1
>>
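P.S. On the cleanup-cost concern quoted at the top: what the slowpath
pays on every release is essentially one walk up the counter hierarchy.
A simplified userspace model of that per-release cost follows -- the
kernel's page_counter code is more involved, and the struct and
function names here are illustrative only:

/* Model of the per-release cost of bypassing the percpu stock:
 * one atomic RMW per ancestor level for every uncharge. */
#include <stdatomic.h>
#include <stdio.h>

struct counter {
	_Atomic long usage;
	struct counter *parent;
};

/* Uncharge nr_pages at every level up to the root, mirroring the
 * parent-chain loop in the kernel's page_counter_uncharge(). */
static void uncharge_hierarchy(struct counter *c, long nr_pages)
{
	for (; c; c = c->parent)
		atomic_fetch_sub_explicit(&c->usage, nr_pages,
					  memory_order_relaxed);
}

int main(void)
{
	struct counter root   = { .usage = 300, .parent = NULL };
	struct counter parent = { .usage = 200, .parent = &root };
	struct counter child  = { .usage = 100, .parent = &parent };

	/* One "release" from the child touches three counters here;
	 * in a deep cgroup tree the cost grows with the depth. */
	uncharge_hierarchy(&child, 1);
	printf("child=%ld parent=%ld root=%ld\n",
	       atomic_load(&child.usage),
	       atomic_load(&parent.usage),
	       atomic_load(&root.usage));
	return 0;
}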