Date: Wed, 24 May 2023 10:37:10 +0800
From: Ming Lei <ming.lei@redhat.com>
To: Yosry Ahmed
Cc: Linux-MM, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Roman Gushchin, Muchun Song, Jens Axboe, linux-block@vger.kernel.org,
	Waiman Long, cgroups@vger.kernel.org, Tejun Heo, mkoutny@suse.com,
	ming.lei@redhat.com
Subject: Re: [PATCH] blk-cgroup: Flush stats before releasing blkcg_gq
References: <20230524011935.719659-1-ming.lei@redhat.com>

Hi Yosry,

On Tue, May 23, 2023 at 07:06:38PM -0700, Yosry Ahmed wrote:
> Hi Ming,
>
> On Tue, May 23, 2023 at 6:21 PM Ming Lei wrote:
> >
> > As noted by Michal, the blkg_iostat_set's in the lockless list hold
> > references to their blkg's to protect against their removal. Those
> > blkg's in turn hold references to the blkcg. When a cgroup is being
> > destroyed, cgroup_rstat_flush() is only called from
> > css_release_work_fn(), which runs once the blkcg reference count
> > reaches 0. This circular dependency will prevent the blkcg and some
> > blkgs from being freed after they are made offline.
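
To make the circular dependency above concrete, the chain is roughly
(my sketch, not taken from the patch):

	blkg_iostat_set (on the per-cpu lockless list)
	    holds a ref on --> blkg       (percpu_ref_get(&blkg->refcnt))
	blkg
	    holds a ref on --> blkcg css  (css_get)

	The blkg refs are only put by blkcg_rstat_flush(), which runs
	from css_release_work_fn(), i.e. only after the blkcg refcount
	has already reached zero, so the refs are never dropped.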
> I am not at all familiar with blkcg, but does calling
> cgroup_rstat_flush() in offline_css() fix the problem?

Not completely: apart from the offline case, this list also needs to be
flushed after the associated disk is deleted.

> or can items be
> added to the lockless list(s) after the blkcg is offlined?

Yes. percpu_ref_*get(&blkg->refcnt) can still succeed after the percpu
refcnt has been killed in blkg_destroy(), which is called both from
offlining the css and from removing the disk.

> >
> > It is less of a problem if the cgroup to be destroyed also has other
> > controllers, like memory, that will call cgroup_rstat_flush() and
> > clean up the reference counts. But if block is the only controller
> > that uses rstat, these offline blkcgs and blkgs may never be freed,
> > leaking more and more memory over time.
> >
> > To prevent this potential memory leak:
> >
> > - a new cgroup_rstat_css_cpu_flush() function is added to flush stats
> >   for a given css and cpu. This new function is called from
> >   __blkg_release().
> >
> > - don't grab a reference to bio->bi_blkg when adding the stats to the
> >   blkcg's per-cpu stat list; that handling was the most fragile part
> >   of the original patch.
> >
> > Based on Waiman's patch:
> >
> > https://lore.kernel.org/linux-block/20221215033132.230023-3-longman@redhat.com/
> >
> > Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> > Cc: Waiman Long
> > Cc: cgroups@vger.kernel.org
> > Cc: Tejun Heo
> > Cc: mkoutny@suse.com
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-cgroup.c     | 15 +++++++++++++--
> >  include/linux/cgroup.h |  1 +
> >  kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
> >  3 files changed, 32 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 0ce64dd73cfe..5437b6af3955 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -163,10 +163,23 @@ static void blkg_free(struct blkcg_gq *blkg)
> >  static void __blkg_release(struct rcu_head *rcu)
> >  {
> >  	struct blkcg_gq *blkg = container_of(rcu, struct blkcg_gq, rcu_head);
> > +	struct blkcg *blkcg = blkg->blkcg;
> > +	int cpu;
> >
> >  #ifdef CONFIG_BLK_CGROUP_PUNT_BIO
> >  	WARN_ON(!bio_list_empty(&blkg->async_bios));
> >  #endif
> > +	/*
> > +	 * Flush all the non-empty percpu lockless lists before releasing
> > +	 * us. Meantime no new bio can refer to this blkg any more given
> > +	 * the refcnt is killed.
> > +	 */
> > +	for_each_possible_cpu(cpu) {
> > +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > +
> > +		if (!llist_empty(lhead))
> > +			cgroup_rstat_css_cpu_flush(&blkcg->css, cpu);
> > +	}
> >
> >  	/* release the blkcg and parent blkg refs this blkg has been holding */
> >  	css_put(&blkg->blkcg->css);
> > @@ -991,7 +1004,6 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >  		if (parent && parent->parent)
> >  			blkcg_iostat_update(parent, &blkg->iostat.cur,
> >  					    &blkg->iostat.last);
> > -		percpu_ref_put(&blkg->refcnt);
> >  	}
> >
> >  out:
> > @@ -2075,7 +2087,6 @@ void blk_cgroup_bio_start(struct bio *bio)
> >
> >  		llist_add(&bis->lnode, lhead);
> >  		WRITE_ONCE(bis->lqueued, true);
> > -		percpu_ref_get(&bis->blkg->refcnt);
> >  	}
> >
> >  	u64_stats_update_end_irqrestore(&bis->sync, flags);
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 885f5395fcd0..97d4764d8e6a 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -695,6 +695,7 @@ void cgroup_rstat_flush(struct cgroup *cgrp);
> >  void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
> >  void cgroup_rstat_flush_hold(struct cgroup *cgrp);
> >  void cgroup_rstat_flush_release(void);
> > +void cgroup_rstat_css_cpu_flush(struct cgroup_subsys_state *css, int cpu);
> >
> >  /*
> >   * Basic resource stats.
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index 9c4c55228567..96e7a4e6da72 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -281,6 +281,24 @@ void cgroup_rstat_flush_release(void)
> >  	spin_unlock_irq(&cgroup_rstat_lock);
> >  }
> >
> > +/**
> > + * cgroup_rstat_css_cpu_flush - flush stats for the given css and cpu
> > + * @css: target css to be flushed
> > + * @cpu: the cpu that holds the stats to be flushed
> > + *
> > + * A lightweight rstat flush operation for a given css and cpu.
> > + * Only the cpu_lock is held for mutual exclusion; cgroup_rstat_lock
> > + * isn't used.
>
> (Adding linux-mm and memcg maintainers)
> +Linux-MM +Michal Hocko +Shakeel Butt +Johannes Weiner +Roman Gushchin
> +Muchun Song
>
> I don't think flushing the stats without holding cgroup_rstat_lock is
> safe for memcg stats flushing. mem_cgroup_css_rstat_flush() modifies
> some non-percpu data (e.g. memcg->vmstats->state,
> memcg->vmstats->state_pending).
>
> Perhaps make this a separate callback from css_rstat_flush() (e.g.
> css_rstat_flush_cpu() or something), so that it's clear which
> subsystems support this? In this case, only blkcg would implement the
> callback.

Also, I guess cgroup_rstat_flush() could be used here too. BTW,
cgroup_rstat_flush() is annotated with might_sleep(), but it doesn't
actually sleep, so could that might_sleep() be removed?

> > + */
> > +void cgroup_rstat_css_cpu_flush(struct cgroup_subsys_state *css, int cpu)
> > +{
> > +	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> > +
> > +	raw_spin_lock_irq(cpu_lock);
> > +	css->ss->css_rstat_flush(css, cpu);
>
> I think we need to check that css_rstat_flush() (or a new callback) is
> implemented before calling it here.

Good catch!
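
Something like this (an untested sketch of just that check) should do
it:

	raw_spin_lock_irq(cpu_lock);
	/* not every subsystem implements the rstat flush callback */
	if (css->ss->css_rstat_flush)
		css->ss->css_rstat_flush(css, cpu);
	raw_spin_unlock_irq(cpu_lock);


Thanks,
Ming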