From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id B4A3FCD128A for ; Wed, 3 Apr 2024 15:03:19 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 3698D6B0083; Wed, 3 Apr 2024 11:03:19 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 3199B6B0085; Wed, 3 Apr 2024 11:03:19 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 1BB616B0087; Wed, 3 Apr 2024 11:03:19 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id F2F1F6B0083 for ; Wed, 3 Apr 2024 11:03:18 -0400 (EDT) Received: from smtpin23.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay08.hostedemail.com (Postfix) with ESMTP id 50C6E140B3A for ; Wed, 3 Apr 2024 15:03:18 +0000 (UTC) X-FDA: 81968538876.23.87C032B Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by imf04.hostedemail.com (Postfix) with ESMTP id C0F6440016 for ; Wed, 3 Apr 2024 15:03:14 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=KdM7X9Y3; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf04.hostedemail.com: domain of bfoster@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=bfoster@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1712156594; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=1t9rNljNB7WZCGoFa1Ob1ROcqUyp21vTL5nU3nsygOo=; b=zKaf3PSBbBYfCNVv332KvgHaea53NZqj/5jOmAOMAYyZsR/6+rguUFrjLi6tfnfQjQ+eXz pqM5I+IsASg+tJsMkh+MGvQ6MRD36fSroKgutaCsS8y6Ckpvrf93moifgLSPQGP5/51Yni Vx7JkO2wBFq6hBdIqA1i5RN8nvs8lrw= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=KdM7X9Y3; dmarc=pass (policy=none) header.from=redhat.com; spf=pass (imf04.hostedemail.com: domain of bfoster@redhat.com designates 170.10.129.124 as permitted sender) smtp.mailfrom=bfoster@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1712156594; a=rsa-sha256; cv=none; b=ruwxj7Zl4Qj4iAZ2CnVhpUO2AfwvgJDISFGDGBuLwNb+dHk1oj4MzuhnYpOZrpwJuNTDOd 2Vd3UiQG/N0ZQzTMuMyxXWF9KPMekeRsecX/w2pkDg6eh/qQrcmtpbUPGx7iSJ0AkcR49s lJqEX2REacN5Pn7lF2J1ozqdHuFQno8= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1712156594; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=1t9rNljNB7WZCGoFa1Ob1ROcqUyp21vTL5nU3nsygOo=; b=KdM7X9Y3bgzauGYs6CCDp8fWYOzlbTAle4QbNyeEu6Hb9ft3WGn/H1tiCE0yIxB485gfJ0 J3hZMHEeC2blxFuaDKCizxpP/x6KR0dvX+uFqHgm11cwtEm4E7kiO5io602l1+MzAaoQP1 YsZhhg0YeyrOBWUY+9D5vXwUIKgzKNE= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-625-Rwfc9kBWMhWP0sXigXb-eA-1; Wed, 03 Apr 2024 11:03:02 -0400 X-MC-Unique: Rwfc9kBWMhWP0sXigXb-eA-1 Received: from smtp.corp.redhat.com (int-mx05.intmail.prod.int.rdu2.redhat.com [10.11.54.5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id 976748D1385; Wed, 3 Apr 2024 15:03:01 +0000 (UTC) Received: from bfoster (unknown [10.22.16.57]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 0FA7C10E47; Wed, 3 Apr 2024 15:03:00 +0000 (UTC) Date: Wed, 3 Apr 2024 11:04:58 -0400 From: Brian Foster To: Kemeng Shi Cc: akpm@linux-foundation.org, willy@infradead.org, jack@suse.cz, tj@kernel.org, dsterba@suse.com, mjguzik@gmail.com, dhowells@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org Subject: Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi Message-ID: References: <20240327155751.3536-1-shikemeng@huaweicloud.com> <20240327155751.3536-4-shikemeng@huaweicloud.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Scanned-By: MIMEDefang 3.4.1 on 10.11.54.5 X-Rspamd-Queue-Id: C0F6440016 X-Rspam-User: X-Rspamd-Server: rspam04 X-Stat-Signature: sakurprgsbmy4pumzn67q5n8opqhkcam X-HE-Tag: 1712156594-419270 X-HE-Meta: U2FsdGVkX1/JH5vTCveRICsGtUmkdySELys9PRMl8hUp4PLijD+oyF+nDWKajsCFr+DOjnwRNTPjHZFTSZ7B17VmbCXOmzwfoxGIzPo8uIQGvsSSrSOax4HBdt0PG36vv4WnFIrh/mu+xzRX82lfZmsiRBOXekuBesQHMdqRgkofJZ1U2KJ12BEloNYt23NdbhZDDavxFRk2SLpPkn743IRaKRcvLChQ/pGnxlLvArNJQnRs4Cp18ji6axwPsfvVy9irN7+FkENzoGTcnRDQE4nomsdGJ2x7Bv6k0oY7aQecvbKbE1/+ZBmj6SIP/YX/vhXTLO2KmBpv3mtCYm2e4bHhiV1lOiRov2oSN1apMY/vFsh7MYBDRo9eWJbTXkHIbatK45mWJhiRqjjGzLybypJYiP3t+vbVZvHNqWYo55MQADdclmliD82gxWvldlKPwMRX+Myvi9EIM7pQRFshvw0I0HxDNgzeuB3xWWOFFcRB6tZSCrytcmWBHY3rMyxhbUKyLod2g4ieffwaRl+KAFZIFCGDjpQlUyhJh31bpQdaPAghiey1gm2zrcOIPSaW8g6lucc9EvUav4ulciSKPXIKERMv1vfu0Cy4PqNYBA71L1s/rB2WaKZEe/awRa4Jn1uqekqOaFutHQPBnir7zfR6Wl8eZoHS6jgw/den5fIu3xeWTYsH0/U1GOooewFThWbno1DSNV5PNm4wDN3TQZUb909I2bkF0adZcFlg5ZTh5Git9AuPa3Fkeh8uRFwabyy7xoRvq+HtD8/Bqr3PBYp9BKfPBCqLzU1aXumQCZWQ8UtQdQSRj/vkwAwwuMHa+ZDd5xufdza1naWDLpncFZTiaIXWH7/jbJ9nvBUTsMrK0R2JpKMK5koWykwmtMxlr/Z0g6ez+q5A6L4xkme9SgJBb7MeuBmlFnUxAN39EWekBIgF3eVNlZVGfxN2hyQ4/wCijzZ/9nnrmJkQ2iD N5VCJQJW YLvpewH5ZHv7bQ1F14Slos0VS0klKgaXcW4caUtVbCckxGqvT1Rzlc3sTswmgr0PdYFmP7SrO5GktYtjrtMoSvddbJHpOWOE910nCnoyqbSX1Y3zJ4+w9if5FHedog2x6LjxiTyasjxRFOqBkmL/EeHL5mpaOHw8rRuFD+y9NymTyEvyPQeTGT1UoSxI5GpWVygtiVhRyCxZlwuZKysABcZd+wQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Wed, Apr 03, 2024 at 04:49:42PM +0800, Kemeng Shi wrote: > > > on 3/29/2024 9:10 PM, Brian Foster wrote: > > On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote: > >> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats > >> of bdi. > >> > > > > Hi Kemeng, > Hello Brian, > > > > Just a few random thoughts/comments.. > > > >> Following domain hierarchy is tested: > >> global domain (320G) > >> / \ > >> cgroup domain1(10G) cgroup domain2(10G) > >> | | > >> bdi wb1 wb2 > >> > >> /* per wb writeback info of bdi is collected */ > >> cat /sys/kernel/debug/bdi/252:16/wb_stats > >> WbCgIno: 1 > >> WbWriteback: 0 kB > >> WbReclaimable: 0 kB > >> WbDirtyThresh: 0 kB > >> WbDirtied: 0 kB > >> WbWritten: 0 kB > >> WbWriteBandwidth: 102400 kBps > >> b_dirty: 0 > >> b_io: 0 > >> b_more_io: 0 > >> b_dirty_time: 0 > >> state: 1 > > > > Maybe some whitespace or something between entries would improve > > readability? > Sure, I will add a whitespace in next version. > > > >> WbCgIno: 4094 > >> WbWriteback: 54432 kB > >> WbReclaimable: 766080 kB > >> WbDirtyThresh: 3094760 kB > >> WbDirtied: 1656480 kB > >> WbWritten: 837088 kB > >> WbWriteBandwidth: 132772 kBps > >> b_dirty: 1 > >> b_io: 1 > >> b_more_io: 0 > >> b_dirty_time: 0 > >> state: 7 > >> WbCgIno: 4135 > >> WbWriteback: 15232 kB > >> WbReclaimable: 786688 kB > >> WbDirtyThresh: 2909984 kB > >> WbDirtied: 1482656 kB > >> WbWritten: 681408 kB > >> WbWriteBandwidth: 124848 kBps > >> b_dirty: 0 > >> b_io: 1 > >> b_more_io: 0 > >> b_dirty_time: 0 > >> state: 7 > >> > >> Signed-off-by: Kemeng Shi > >> --- > >> include/linux/writeback.h | 1 + > >> mm/backing-dev.c | 88 +++++++++++++++++++++++++++++++++++++++ > >> mm/page-writeback.c | 19 +++++++++ > >> 3 files changed, 108 insertions(+) > >> > > ... > >> diff --git a/mm/backing-dev.c b/mm/backing-dev.c > >> index 8daf950e6855..e3953db7d88d 100644 > >> --- a/mm/backing-dev.c > >> +++ b/mm/backing-dev.c > >> @@ -103,6 +103,91 @@ static void collect_wb_stats(struct wb_stats *stats, > >> } > >> > >> #ifdef CONFIG_CGROUP_WRITEBACK > > ... > >> +static int cgwb_debug_stats_show(struct seq_file *m, void *v) > >> +{ > >> + struct backing_dev_info *bdi; > >> + unsigned long background_thresh; > >> + unsigned long dirty_thresh; > >> + struct bdi_writeback *wb; > >> + struct wb_stats stats; > >> + > >> + rcu_read_lock(); > >> + bdi = lookup_bdi(m); > >> + if (!bdi) { > >> + rcu_read_unlock(); > >> + return -EEXIST; > >> + } > >> + > >> + global_dirty_limits(&background_thresh, &dirty_thresh); > >> + > >> + list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) { > >> + memset(&stats, 0, sizeof(stats)); > >> + stats.dirty_thresh = dirty_thresh; > > > > If you did something like the following here, wouldn't that also zero > > the rest of the structure? > > > > struct wb_stats stats = { .dirty_thresh = dirty_thresh }; > > > Suer, will do it in next version. > >> + collect_wb_stats(&stats, wb); > >> + > > > > Also, similar question as before on whether you'd want to check > > WB_registered or something here.. > Still prefer to keep full debug info and user could filter out on > demand. Ok. I was more wondering if that was needed for correctness. If not, then that seems fair enough to me. > > > >> + if (mem_cgroup_wb_domain(wb) == NULL) { > >> + wb_stats_show(m, wb, &stats); > >> + continue; > >> + } > > > > Can you explain what this logic is about? Is the cgwb_calc_thresh() > > thing not needed in this case? A comment might help for those less > > familiar with the implementation details. > If mem_cgroup_wb_domain(wb) is NULL, then it's bdi->wb, otherwise, > it's wb in cgroup. For bdi->wb, there is no need to do wb_tryget > and cgwb_calc_thresh. Will add some comment in next version. > > > > BTW, I'm also wondering if something like the following is correct > > and/or roughly equivalent: > > > > list_for_each_*(wb, ...) { > > struct wb_stats stats = ...; > > > > if (!wb_tryget(wb)) > > continue; > > > > collect_wb_stats(&stats, wb); > > > > /* > > * Extra wb_thresh magic. Drop rcu lock because ... . We > > * can do so here because we have a ref. > > */ > > if (mem_cgroup_wb_domain(wb)) { > > rcu_read_unlock(); > > stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb)); > > rcu_read_lock(); > > } > > > > wb_stats_show(m, wb, &stats) > > wb_put(wb); > > } > It's correct as wb_tryget to bdi->wb has no harm. I have considered > to do it in this way, I change my mind to do it in new way for > two reason: > 1. Put code handling wb in cgroup more tight which could be easier > to maintain. > 2. Rmove extra wb_tryget/wb_put for wb in bdi. > Would this make sense to you? Ok, well assuming it is correct the above logic is a bit more simple and readable to me. I think you'd just need to fill in the comment around the wb_thresh thing rather than i.e. having to explain we don't need to ref bdi->wb even though it doesn't seem to matter. I kind of feel the same on the wb_stats file thing below just because it seems more consistent and available if wb_stats eventually grows more wb-specific data. That said, this is subjective and not hugely important so I don't insist on either point. Maybe wait a bit and see if Jan or Tejun or somebody has any thoughts..? If nobody else expresses explicit preference then I'm good with it either way. Brian > > > >> + > >> + /* > >> + * cgwb_release will destroy wb->memcg_completions which > >> + * will be ued in cgwb_calc_thresh. Use wb_tryget to prevent > >> + * memcg_completions destruction from cgwb_release. > >> + */ > >> + if (!wb_tryget(wb)) > >> + continue; > >> + > >> + rcu_read_unlock(); > >> + /* cgwb_calc_thresh may sleep in cgroup_rstat_flush */ > >> + stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb)); > >> + wb_stats_show(m, wb, &stats); > >> + rcu_read_lock(); > >> + wb_put(wb); > >> + } > >> + rcu_read_unlock(); > >> + > >> + return 0; > >> +} > >> +DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats); > >> + > >> +static void cgwb_debug_register(struct backing_dev_info *bdi) > >> +{ > >> + debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi, > >> + &cgwb_debug_stats_fops); > >> +} > >> + > >> static void bdi_collect_stats(struct backing_dev_info *bdi, > >> struct wb_stats *stats) > >> { > >> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi, > >> { > >> collect_wb_stats(stats, &bdi->wb); > >> } > >> + > >> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { } > > > > Could we just create the wb_stats file regardless of whether cgwb is > > enabled? Obviously theres only one wb in the !CGWB case and it's > > somewhat duplicative with the bdi stats file, but that seems harmless if > > the same code can be reused..? Maybe there's also a small argument for > > dropping the state info from the bdi stats file and moving it to > > wb_stats.In backing-dev.c, there are a lot "#ifdef CGWB .. #else .. #endif" to > avoid unneed extra cost when CGWB is not enabled. > I think it's better to avoid extra cost from wb_stats when CGWB is not > enabled. For now, we only save cpu cost to create and destroy wb_stats > and save memory cost to record debugfs file, we could save more in > future when wb_stats records more debug info. > Move state info from bdi stats to wb_stats make senses to me. The only > concern would be compatibility problem. I will add a new patch to this > to make this more noticeable and easier to revert. > Thanks a lot for review! > > Kemeng > > > > Brian > > > >> #endif > >> > >> static int bdi_debug_stats_show(struct seq_file *m, void *v) > >> @@ -182,6 +269,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name) > >> > >> debugfs_create_file("stats", 0444, bdi->debug_dir, bdi, > >> &bdi_debug_stats_fops); > >> + cgwb_debug_register(bdi); > >> } > >> > >> static void bdi_debug_unregister(struct backing_dev_info *bdi) > >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c > >> index 0e20467367fe..3724c7525316 100644 > >> --- a/mm/page-writeback.c > >> +++ b/mm/page-writeback.c > >> @@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) > >> return __wb_calc_thresh(&gdtc, thresh); > >> } > >> > >> +unsigned long cgwb_calc_thresh(struct bdi_writeback *wb) > >> +{ > >> + struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB }; > >> + struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) }; > >> + unsigned long filepages, headroom, writeback; > >> + > >> + gdtc.avail = global_dirtyable_memory(); > >> + gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) + > >> + global_node_page_state(NR_WRITEBACK); > >> + > >> + mem_cgroup_wb_stats(wb, &filepages, &headroom, > >> + &mdtc.dirty, &writeback); > >> + mdtc.dirty += writeback; > >> + mdtc_calc_avail(&mdtc, filepages, headroom); > >> + domain_dirty_limits(&mdtc); > >> + > >> + return __wb_calc_thresh(&mdtc, mdtc.thresh); > >> +} > >> + > >> /* > >> * setpoint - dirty 3 > >> * f(dirty) := 1.0 + (----------------) > >> -- > >> 2.30.0 > >> > > > > >