linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: akpm@linux-foundation.org, willy@infradead.org, jack@suse.cz,
	tj@kernel.org, dsterba@suse.com, mjguzik@gmail.com,
	dhowells@redhat.com, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi
Date: Wed, 3 Apr 2024 11:04:58 -0400	[thread overview]
Message-ID: <Zg1wGvTeQxjqjYUG@bfoster> (raw)
In-Reply-To: <e3816f9c-0f29-a0e4-8ad8-a6acf82a06ad@huaweicloud.com>

On Wed, Apr 03, 2024 at 04:49:42PM +0800, Kemeng Shi wrote:
> 
> 
> on 3/29/2024 9:10 PM, Brian Foster wrote:
> > On Wed, Mar 27, 2024 at 11:57:48PM +0800, Kemeng Shi wrote:
> >> Add /sys/kernel/debug/bdi/xxx/wb_stats to show per group writeback stats
> >> of bdi.
> >>
> > 
> > Hi Kemeng,
> Hello Brian,
> > 
> > Just a few random thoughts/comments..
> > 
> >> Following domain hierarchy is tested:
> >>                 global domain (320G)
> >>                 /                 \
> >>         cgroup domain1(10G)     cgroup domain2(10G)
> >>                 |                 |
> >> bdi            wb1               wb2
> >>
> >> /* per wb writeback info of bdi is collected */
> >> cat /sys/kernel/debug/bdi/252:16/wb_stats
> >> WbCgIno:                    1
> >> WbWriteback:                0 kB
> >> WbReclaimable:              0 kB
> >> WbDirtyThresh:              0 kB
> >> WbDirtied:                  0 kB
> >> WbWritten:                  0 kB
> >> WbWriteBandwidth:      102400 kBps
> >> b_dirty:                    0
> >> b_io:                       0
> >> b_more_io:                  0
> >> b_dirty_time:               0
> >> state:                      1
> > 
> > Maybe some whitespace or something between entries would improve
> > readability?
> Sure, I will add a whitespace in next version.
> > 
> >> WbCgIno:                 4094
> >> WbWriteback:            54432 kB
> >> WbReclaimable:         766080 kB
> >> WbDirtyThresh:        3094760 kB
> >> WbDirtied:            1656480 kB
> >> WbWritten:             837088 kB
> >> WbWriteBandwidth:      132772 kBps
> >> b_dirty:                    1
> >> b_io:                       1
> >> b_more_io:                  0
> >> b_dirty_time:               0
> >> state:                      7
> >> WbCgIno:                 4135
> >> WbWriteback:            15232 kB
> >> WbReclaimable:         786688 kB
> >> WbDirtyThresh:        2909984 kB
> >> WbDirtied:            1482656 kB
> >> WbWritten:             681408 kB
> >> WbWriteBandwidth:      124848 kBps
> >> b_dirty:                    0
> >> b_io:                       1
> >> b_more_io:                  0
> >> b_dirty_time:               0
> >> state:                      7
> >>
> >> Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com>
> >> ---
> >>  include/linux/writeback.h |  1 +
> >>  mm/backing-dev.c          | 88 +++++++++++++++++++++++++++++++++++++++
> >>  mm/page-writeback.c       | 19 +++++++++
> >>  3 files changed, 108 insertions(+)
> >>
> > ...
> >> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> >> index 8daf950e6855..e3953db7d88d 100644
> >> --- a/mm/backing-dev.c
> >> +++ b/mm/backing-dev.c
> >> @@ -103,6 +103,91 @@ static void collect_wb_stats(struct wb_stats *stats,
> >>  }
> >>  
> >>  #ifdef CONFIG_CGROUP_WRITEBACK
> > ...
> >> +static int cgwb_debug_stats_show(struct seq_file *m, void *v)
> >> +{
> >> +	struct backing_dev_info *bdi;
> >> +	unsigned long background_thresh;
> >> +	unsigned long dirty_thresh;
> >> +	struct bdi_writeback *wb;
> >> +	struct wb_stats stats;
> >> +
> >> +	rcu_read_lock();
> >> +	bdi = lookup_bdi(m);
> >> +	if (!bdi) {
> >> +		rcu_read_unlock();
> >> +		return -EEXIST;
> >> +	}
> >> +
> >> +	global_dirty_limits(&background_thresh, &dirty_thresh);
> >> +
> >> +	list_for_each_entry_rcu(wb, &bdi->wb_list, bdi_node) {
> >> +		memset(&stats, 0, sizeof(stats));
> >> +		stats.dirty_thresh = dirty_thresh;
> > 
> > If you did something like the following here, wouldn't that also zero
> > the rest of the structure?
> > 
> > 		struct wb_stats stats = { .dirty_thresh = dirty_thresh };
> > 
> Suer, will do it in next version.
> >> +		collect_wb_stats(&stats, wb);
> >> +
> > 
> > Also, similar question as before on whether you'd want to check
> > WB_registered or something here..
> Still prefer to keep full debug info and user could filter out on
> demand.

Ok. I was more wondering if that was needed for correctness. If not,
then that seems fair enough to me.

> > 
> >> +		if (mem_cgroup_wb_domain(wb) == NULL) {
> >> +			wb_stats_show(m, wb, &stats);
> >> +			continue;
> >> +		}
> > 
> > Can you explain what this logic is about? Is the cgwb_calc_thresh()
> > thing not needed in this case? A comment might help for those less
> > familiar with the implementation details.
> If mem_cgroup_wb_domain(wb) is NULL, then it's bdi->wb, otherwise,
> it's wb in cgroup. For bdi->wb, there is no need to do wb_tryget
> and cgwb_calc_thresh. Will add some comment in next version.
> > 
> > BTW, I'm also wondering if something like the following is correct
> > and/or roughly equivalent:
> > 	
> > 	list_for_each_*(wb, ...) {
> > 		struct wb_stats stats = ...;
> > 
> > 		if (!wb_tryget(wb))
> > 			continue;
> > 
> > 		collect_wb_stats(&stats, wb);
> > 
> > 		/*
> > 		 * Extra wb_thresh magic. Drop rcu lock because ... . We
> > 		 * can do so here because we have a ref.
> > 		 */
> > 		if (mem_cgroup_wb_domain(wb)) {
> > 			rcu_read_unlock();
> > 			stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
> > 			rcu_read_lock();
> > 		}
> > 
> > 		wb_stats_show(m, wb, &stats)
> > 		wb_put(wb);
> > 	}
> It's correct as wb_tryget to bdi->wb has no harm. I have considered
> to do it in this way, I change my mind to do it in new way for
> two reason:
> 1. Put code handling wb in cgroup more tight which could be easier
> to maintain.
> 2. Rmove extra wb_tryget/wb_put for wb in bdi.
> Would this make sense to you?

Ok, well assuming it is correct the above logic is a bit more simple and
readable to me. I think you'd just need to fill in the comment around
the wb_thresh thing rather than i.e. having to explain we don't need to
ref bdi->wb even though it doesn't seem to matter.

I kind of feel the same on the wb_stats file thing below just because it
seems more consistent and available if wb_stats eventually grows more
wb-specific data.

That said, this is subjective and not hugely important so I don't insist
on either point. Maybe wait a bit and see if Jan or Tejun or somebody
has any thoughts..? If nobody else expresses explicit preference then
I'm good with it either way.

Brian

> > 
> >> +
> >> +		/*
> >> +		 * cgwb_release will destroy wb->memcg_completions which
> >> +		 * will be ued in cgwb_calc_thresh. Use wb_tryget to prevent
> >> +		 * memcg_completions destruction from cgwb_release.
> >> +		 */
> >> +		if (!wb_tryget(wb))
> >> +			continue;
> >> +
> >> +		rcu_read_unlock();
> >> +		/* cgwb_calc_thresh may sleep in cgroup_rstat_flush */
> >> +		stats.wb_thresh = min(stats.wb_thresh, cgwb_calc_thresh(wb));
> >> +		wb_stats_show(m, wb, &stats);
> >> +		rcu_read_lock();
> >> +		wb_put(wb);
> >> +	}
> >> +	rcu_read_unlock();
> >> +
> >> +	return 0;
> >> +}
> >> +DEFINE_SHOW_ATTRIBUTE(cgwb_debug_stats);
> >> +
> >> +static void cgwb_debug_register(struct backing_dev_info *bdi)
> >> +{
> >> +	debugfs_create_file("wb_stats", 0444, bdi->debug_dir, bdi,
> >> +			    &cgwb_debug_stats_fops);
> >> +}
> >> +
> >>  static void bdi_collect_stats(struct backing_dev_info *bdi,
> >>  			      struct wb_stats *stats)
> >>  {
> >> @@ -117,6 +202,8 @@ static void bdi_collect_stats(struct backing_dev_info *bdi,
> >>  {
> >>  	collect_wb_stats(stats, &bdi->wb);
> >>  }
> >> +
> >> +static inline void cgwb_debug_register(struct backing_dev_info *bdi) { }
> > 
> > Could we just create the wb_stats file regardless of whether cgwb is
> > enabled? Obviously theres only one wb in the !CGWB case and it's
> > somewhat duplicative with the bdi stats file, but that seems harmless if
> > the same code can be reused..? Maybe there's also a small argument for
> > dropping the state info from the bdi stats file and moving it to
> > wb_stats.In backing-dev.c, there are a lot "#ifdef CGWB .. #else .. #endif" to
> avoid unneed extra cost when CGWB is not enabled.
> I think it's better to avoid extra cost from wb_stats when CGWB is not
> enabled. For now, we only save cpu cost to create and destroy wb_stats
> and save memory cost to record debugfs file, we could save more in
> future when wb_stats records more debug info.
> Move state info from bdi stats to wb_stats make senses to me. The only
> concern would be compatibility problem. I will add a new patch to this
> to make this more noticeable and easier to revert.
> Thanks a lot for review!
> 
> Kemeng
> > 
> > Brian
> > 
> >>  #endif
> >>  
> >>  static int bdi_debug_stats_show(struct seq_file *m, void *v)
> >> @@ -182,6 +269,7 @@ static void bdi_debug_register(struct backing_dev_info *bdi, const char *name)
> >>  
> >>  	debugfs_create_file("stats", 0444, bdi->debug_dir, bdi,
> >>  			    &bdi_debug_stats_fops);
> >> +	cgwb_debug_register(bdi);
> >>  }
> >>  
> >>  static void bdi_debug_unregister(struct backing_dev_info *bdi)
> >> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> >> index 0e20467367fe..3724c7525316 100644
> >> --- a/mm/page-writeback.c
> >> +++ b/mm/page-writeback.c
> >> @@ -893,6 +893,25 @@ unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh)
> >>  	return __wb_calc_thresh(&gdtc, thresh);
> >>  }
> >>  
> >> +unsigned long cgwb_calc_thresh(struct bdi_writeback *wb)
> >> +{
> >> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> >> +	struct dirty_throttle_control mdtc = { MDTC_INIT(wb, &gdtc) };
> >> +	unsigned long filepages, headroom, writeback;
> >> +
> >> +	gdtc.avail = global_dirtyable_memory();
> >> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> >> +		     global_node_page_state(NR_WRITEBACK);
> >> +
> >> +	mem_cgroup_wb_stats(wb, &filepages, &headroom,
> >> +			    &mdtc.dirty, &writeback);
> >> +	mdtc.dirty += writeback;
> >> +	mdtc_calc_avail(&mdtc, filepages, headroom);
> >> +	domain_dirty_limits(&mdtc);
> >> +
> >> +	return __wb_calc_thresh(&mdtc, mdtc.thresh);
> >> +}
> >> +
> >>  /*
> >>   *                           setpoint - dirty 3
> >>   *        f(dirty) := 1.0 + (----------------)
> >> -- 
> >> 2.30.0
> >>
> > 
> > 
> 



  reply	other threads:[~2024-04-03 15:03 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-27 15:57 [PATCH v2 0/6] Improve visibility of writeback Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 1/6] writeback: protect race between bdi release and bdi_debug_stats_show Kemeng Shi
2024-03-28 17:53   ` Brian Foster
2024-04-03  2:16     ` Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 2/6] writeback: collect stats of all wb of bdi in bdi_debug_stats_show Kemeng Shi
2024-03-29 13:04   ` Brian Foster
2024-04-03  7:49     ` Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 3/6] writeback: support retrieving per group debug writeback stats of bdi Kemeng Shi
2024-03-29 13:10   ` Brian Foster
2024-04-03  8:49     ` Kemeng Shi
2024-04-03 15:04       ` Brian Foster [this message]
2024-04-04  9:07         ` Jan Kara
2024-04-07  3:13           ` Kemeng Shi
2024-04-07  2:48         ` Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 4/6] writeback: add wb_monitor.py script to monitor writeback info on bdi Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 5/6] writeback: rename nr_reclaimable to nr_dirty in balance_dirty_pages Kemeng Shi
2024-03-27 15:57 ` [PATCH v2 6/6] writeback: define GDTC_INIT_NO_WB to null Kemeng Shi
2024-03-27 17:40 ` [PATCH v2 0/6] Improve visibility of writeback Andrew Morton
2024-03-28  1:59   ` Kemeng Shi
2024-03-28  8:23     ` Kemeng Shi
2024-03-28 19:15   ` Kent Overstreet
2024-03-28 19:23     ` Andrew Morton
2024-03-28 19:36       ` Kent Overstreet
2024-03-28 19:24 ` Kent Overstreet
2024-03-28 19:31   ` Tejun Heo
2024-03-28 19:40     ` Kent Overstreet
2024-03-28 19:46       ` Tejun Heo
2024-03-28 19:55         ` Kent Overstreet
2024-03-28 20:13           ` Tejun Heo
2024-03-28 20:22             ` Kent Overstreet
2024-03-28 20:46               ` Tejun Heo
2024-03-28 20:53                 ` Kent Overstreet
2024-04-03 16:27                 ` Jan Kara
2024-04-03 18:44                   ` Tejun Heo
2024-04-03 19:06                     ` Kent Overstreet
2024-04-03 19:21                       ` Tejun Heo
2024-04-03 22:24                         ` Kent Overstreet
2024-04-03  6:56   ` Kemeng Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zg1wGvTeQxjqjYUG@bfoster \
    --to=bfoster@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=dsterba@suse.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mjguzik@gmail.com \
    --cc=shikemeng@huaweicloud.com \
    --cc=tj@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox