From: Waiman Long <longman@redhat.com>
Date: Wed, 24 May 2023 00:04:14 -0400
Subject: Re: [PATCH] blk-cgroup: Flush stats before releasing blkcg_gq
To: Yosry Ahmed, Ming Lei, Linux-MM, Michal Hocko, Shakeel Butt, Johannes Weiner, Roman Gushchin, Muchun Song
Cc: Jens Axboe, linux-block@vger.kernel.org, cgroups@vger.kernel.org, Tejun Heo, mkoutny@suse.com
Message-ID: <11e81fc8-24db-54db-a518-b9bb67d0b504@redhat.com>
References: <20230524011935.719659-1-ming.lei@redhat.com>

On 5/23/23 22:06, Yosry Ahmed wrote:
> Hi Ming,
>
> On Tue, May 23, 2023 at 6:21 PM Ming Lei wrote:
>> As noted by Michal, the blkg_iostat_set's in the lockless list
>> hold reference to blkg's to protect against their removal. Those
>> blkg's hold reference to blkcg. When a cgroup is being destroyed,
>> cgroup_rstat_flush() is only called at css_release_work_fn() which
>> is called when the blkcg reference count reaches 0. This circular
>> dependency will prevent blkcg and some blkgs from being freed after
>> they are made offline.
>
> I am not at all familiar with blkcg, but does calling
> cgroup_rstat_flush() in offline_css() fix the problem? or can items be
> added to the lockless list(s) after the blkcg is offlined?
>
>> It is less a problem if the cgroup to be destroyed also has other
>> controllers like memory that will call cgroup_rstat_flush() which will
>> clean up the reference count. If block is the only controller that uses
>> rstat, these offline blkcg and blkgs may never be freed leaking more
>> and more memory over time.
>>
>> To prevent this potential memory leak:
>>
>> - a new cgroup_rstat_css_cpu_flush() function is added to flush stats for
>>   a given css and cpu. This new function will be called in __blkg_release().
>>
>> - don't grab bio->bi_blkg when adding the stats into blkcg's per-cpu
>>   stat list, and this kind of handling is the most fragile part of
>>   original patch
>>
>> Based on Waiman's patch:
>>
>> https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@redhat.com/
>>
>> Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
>> Cc: Waiman Long
>> Cc: cgroups@vger.kernel.org
>> Cc: Tejun Heo
>> Cc: mkoutny@suse.com
>> Signed-off-by: Ming Lei
>> ---
>>  block/blk-cgroup.c     | 15 +++++++++++++--
>>  include/linux/cgroup.h |  1 +
>>  kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
>>  3 files changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index 0ce64dd73cfe..5437b6af3955 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -163,10 +163,23 @@ static void blkg_free(struct blkcg_gq *blkg)
>>  static void __blkg_release(struct rcu_head *rcu)
>>  {
>>         struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
>> +       struct blkcg *blkcg = blkg->blkcg;
>> +       int cpu;
>>
>>  #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
>>         WARN_ON(!bio_list_empty(&blkg->async_bios));
>>  #endif
>> +       /*
>> +        * Flush all the non-empty percpu lockless lists before releasing
>> +        * us. Meantime no new bio can refer to this blkg any more given
>> +        * the refcnt is killed.
>> +        */
>> +       for_each_possible_cpu(cpu) {
>> +               struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
>> +
>> +               if (!llist_empty(lhead))
>> +                       cgroup_rstat_css_cpu_flush(&blkcg->css, cpu);
>> +       }
>>
>>         /* release the blkcg and parent blkg refs this blkg has been holding */
>>         css_put(&blkg->blkcg->css);
>> @@ -991,7 +1004,6 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>>                 if (parent && parent->parent)
>>                         blkcg_iostat_update(parent, &blkg->iostat.cur,
>>                                             &blkg->iostat.last);
>> -               percpu_ref_put(&blkg->refcnt);
>>         }
>>
>> out:
>> @@ -2075,7 +2087,6 @@ void blk_cgroup_bio_start(struct bio *bio)
>>
>>                 llist_add(&bis->lnode, lhead);
>>                 WRITE_ONCE(bis->lqueued, true);
>> -               percpu_ref_get(&bis->blkg->refcnt);
>>         }
>>
>>         u64_stats_update_end_irqrestore(&bis->sync, flags);
>> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
>> index 885f5395fcd0..97d4764d8e6a 100644
>> --- a/include/linux/cgroup.h
>> +++ b/include/linux/cgroup.h
>> @@ -695,6 +695,7 @@ void cgroup_rstat_flush(struct cgroup *cgrp);
>>  void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
>>  void cgroup_rstat_flush_hold(struct cgroup *cgrp);
>>  void cgroup_rstat_flush_release(void);
>> +void cgroup_rstat_css_cpu_flush(struct cgroup_subsys_state *css, int cpu);
>>
>>  /*
>>   * Basic resource stats.
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index 9c4c55228567..96e7a4e6da72 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -281,6 +281,24 @@ void cgroup_rstat_flush_release(void)
>>         spin_unlock_irq(&cgroup_rstat_lock);
>>  }
>>
>> +/**
>> + * cgroup_rstat_css_cpu_flush - flush stats for the given css and cpu
>> + * @css: target css to be flush
>> + * @cpu: the cpu that holds the stats to be flush
>> + *
>> + * A lightweight rstat flush operation for a given css and cpu.
>> + * Only the cpu_lock is being held for mutual exclusion, the cgroup_rstat_lock
>> + * isn't used.
>
> (Adding linux-mm and memcg maintainers)
> +Linux-MM +Michal Hocko +Shakeel Butt +Johannes Weiner +Roman Gushchin
> +Muchun Song
>
> I don't think flushing the stats without holding cgroup_rstat_lock is
> safe for memcg stats flushing. mem_cgroup_css_rstat_flush() modifies
> some non-percpu data (e.g. memcg->vmstats->state,
> memcg->vmstats->state_pending).
>
> Perhaps have this be a separate callback than css_rstat_flush() (e.g.
> css_rstat_flush_cpu() or something), so that it's clear what
> subsystems support this? In this case, only blkcg would implement this
> callback.

That function is added only to call blkcg_rstat_flush(), which flushes the
stats in the blkcg, so it should be safe. I agree that we should note that
in the comment and list the preconditions for calling it.

Cheers,
Longman
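
For reference, here is a sketch of the kind of precondition note suggested
above. The wording is purely illustrative and is not taken from the posted
patch; the exact constraints would need to be checked against the final
implementation:

/**
 * cgroup_rstat_css_cpu_flush - flush stats for the given css and cpu
 * @css: target css whose stats are to be flushed
 * @cpu: the cpu that holds the stats to be flushed
 *
 * Lightweight rstat flush for a single css and cpu. Only the per-cpu
 * cpu_lock is taken for mutual exclusion; cgroup_rstat_lock is NOT held.
 * Callers must therefore guarantee that:
 *
 * - the subsystem's ->css_rstat_flush() only touches data private to
 *   @css (true for blkcg_rstat_flush(), but not for memcg, whose flush
 *   callback updates shared non-percpu state such as memcg->vmstats), and
 *
 * - no new per-cpu stats can be queued for @css while the flush runs
 *   (in __blkg_release() this holds because the blkg refcount is already
 *   dead, so no new bio can reference the blkg).
 */
void cgroup_rstat_css_cpu_flush(struct cgroup_subsys_state *css, int cpu);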