linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Roman Gushchin <guro@fb.com>
To: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>, <linux-fsdevel@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Dennis Zhou <dennis@kernel.org>,
	Dave Chinner <dchinner@redhat.com>, <cgroups@vger.kernel.org>
Subject: Re: [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes
Date: Fri, 28 May 2021 09:25:58 -0700	[thread overview]
Message-ID: <YLEZlpUMc03uYM7V@carbon.DHCP.thefacebook.com> (raw)
In-Reply-To: <20210528130543.GB28653@quack2.suse.cz>

On Fri, May 28, 2021 at 03:05:43PM +0200, Jan Kara wrote:
> On Thu 27-05-21 10:48:59, Roman Gushchin wrote:
> > On Thu, May 27, 2021 at 01:24:03PM +0200, Jan Kara wrote:
> > > On Wed 26-05-21 15:25:57, Roman Gushchin wrote:
> > > > Asynchronously try to release dying cgwbs by switching clean attached
> > > > inodes to the bdi's wb. It helps to get rid of per-cgroup writeback
> > > > structures themselves and of pinned memory and block cgroups, which
> > > > are way larger structures (mostly due to large per-cpu statistics
> > > > data). It helps to prevent memory waste and different scalability
> > > > problems caused by large piles of dying cgroups.
> > > > 
> > > > A cgwb cleanup operation can fail due to different reasons (e.g. the
> > > > cgwb has in-glight/pending io, an attached inode is locked or isn't
> > > > clean, etc). In this case the next scheduled cleanup will make a new
> > > > attempt. An attempt is made each time a new cgwb is offlined (in other
> > > > words a memcg and/or a blkcg is deleted by a user). In the future an
> > > > additional attempt scheduled by a timer can be implemented.
> > > > 
> > > > Signed-off-by: Roman Gushchin <guro@fb.com>
> > > > ---
> > > >  fs/fs-writeback.c                | 35 ++++++++++++++++++
> > > >  include/linux/backing-dev-defs.h |  1 +
> > > >  include/linux/writeback.h        |  1 +
> > > >  mm/backing-dev.c                 | 61 ++++++++++++++++++++++++++++++--
> > > >  4 files changed, 96 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > > > index 631ef6366293..8fbcd50844f0 100644
> > > > --- a/fs/fs-writeback.c
> > > > +++ b/fs/fs-writeback.c
> > > > @@ -577,6 +577,41 @@ static void inode_switch_wbs(struct inode *inode, int new_wb_id)
> > > >  	kfree(isw);
> > > >  }
> > > >  
> > > > +/**
> > > > + * cleanup_offline_wb - detach associated clean inodes
> > > > + * @wb: target wb
> > > > + *
> > > > + * Switch the inode->i_wb pointer of the attached inodes to the bdi's wb and
> > > > + * drop the corresponding per-cgroup wb's reference. Skip inodes which are
> > > > + * dirty, freeing, in the active writeback process or are in any way busy.
> > > 
> > > I think the comment doesn't match the function anymore.
> > > 
> > > > + */
> > > > +void cleanup_offline_wb(struct bdi_writeback *wb)
> > > > +{
> > > > +	struct inode *inode, *tmp;
> > > > +
> > > > +	spin_lock(&wb->list_lock);
> > > > +restart:
> > > > +	list_for_each_entry_safe(inode, tmp, &wb->b_attached, i_io_list) {
> > > > +		if (!spin_trylock(&inode->i_lock))
> > > > +			continue;
> > > > +		xa_lock_irq(&inode->i_mapping->i_pages);
> > > > +		if ((inode->i_state & I_REFERENCED) != I_REFERENCED) {
> > > 
> > > Why the I_REFERENCED check here? That's just inode aging bit and I have
> > > hard time seeing how it would relate to whether inode should switch wbs...
> > 
> > What I tried to say (and failed :) ) was that I_REFERENCED is the only accepted
> > flag here. So there must be
> > 	if ((inode->i_state | I_REFERENCED) != I_REFERENCED)
> > 
> > Does this look good or I am wrong and there are other flags acceptable here?
> 
> Ah, I see. That makes more sense. I guess you could also exclude I_DONTCACHE
> and I_OVL_INUSE but that's not that important.
> 
> > > > +			struct bdi_writeback *bdi_wb = &inode_to_bdi(inode)->wb;
> > > > +
> > > > +			WARN_ON_ONCE(inode->i_wb != wb);
> > > > +
> > > > +			inode->i_wb = bdi_wb;
> > > > +			list_del_init(&inode->i_io_list);
> > > > +			wb_put(wb);
> > > 
> > > I was kind of hoping you'll use some variant of inode_switch_wbs() here.
> > 
> > My reasoning was that by definition inode_switch_wbs() handles dirty inodes,
> > while in the cleanup case we can deal only with clean inodes and clean wb's.
> > Hopefully this can make the whole procedure simpler/cheaper. Also, the number
> > of simultaneous switches is limited and I don't think cleanups should share
> > this limit.
> > However I agree that it would be nice to share at least some code.
> 
> I agree limits on parallel switches should not apply. Otherwise I agree
> some bits of inode_switch_wbs_work_fn() should not be strictly necessary
> but they should be pretty cheap anyway.
> 
> > > That way we have single function handling all the subtleties of switching
> > > inode->i_wb of an active inode. Maybe it isn't strictly needed here because
> > > you detach only from b_attached list and move to bdi_wb so things are
> > > indeed simpler here. But you definitely miss transferring WB_WRITEBACK stat
> > > and I'd also like to have a comment here explaining why this cannot race
> > > with other writeback handling or wb switching in a harmful way.
> > 
> > If we'll check under wb->list_lock that wb has no inodes on any writeback
> > lists (excluding b_attached), doesn't it mean that WB_WRITEBACK must be
> > 0?
> 
> No, pages under writeback are not reflected in inode->i_state in any way.
> You would need to check mapping_tagged(inode->i_mapping,
> PAGECACHE_TAG_WRITEBACK) to find that out. But if you'd use
> inode_switch_wbs_work_fn() you wouldn't even have to be that careful when
> switching inodes as it can handle alive inodes just fine...

I see...

> 
> > Re racing: my logic here was that we're taking all possible locks before doing
> > anything and then we check that the inode is entirely clean, so this must be
> > safe:
> > 	spin_lock(&wb->list_lock);
> > 	spin_trylock(&inode->i_lock);
> > 	xa_lock_irq(&inode->i_mapping->i_pages);
> > 	...
> > 
> > But now I see that the unlocked inode's wb access mechanism
> > (unlocked_inode_to_wb_begin()/end()) probably requires additional care.
> 
> Yeah, exactly corner case like this were not quite clear to me whether you
> have them correct or not.
> 
> > Repeating the mechanism with scheduling the switching of each inode separately
> > after an rcu grace period looks too slow. Maybe we can mark all inodes at once
> > and then switch them all at once, all in two steps. I need to think more.
> > Do you have any ideas/suggestions here?
> 
> Nothing really bright. As you say I'd do this in batches - i.e., tag all
> inodes for switching with I_WB_SWITCH, then synchronize_rcu(), then call
> inode_switch_wbs_work_fn() for each inode (or probably some helper function
> that has guts of inode_switch_wbs_work_fn() as we probably don't want to
> acquire wb->list_lock's and wb_switch_rwsem repeatedly unnecessarily).

Ok, sounds good to me. I'm a bit worried about the possible CPU overhead,
but hopefully we can switch inodes slow enough so that the impact on the
system will be acceptable.

Thanks!


  reply	other threads:[~2021-05-28 16:26 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-26 22:25 [PATCH v5 0/2] cgroup, blkcg: prevent dirty inodes to pin dying memory cgroups Roman Gushchin
2021-05-26 22:25 ` [PATCH v5 1/2] writeback, cgroup: keep list of inodes attached to bdi_writeback Roman Gushchin
2021-05-27 10:35   ` Jan Kara
2021-05-27 16:32     ` Roman Gushchin
2021-05-26 22:25 ` [PATCH v5 2/2] writeback, cgroup: release dying cgwbs by switching attached inodes Roman Gushchin
2021-05-27  3:24   ` Hillf Danton
2021-05-27 16:29     ` Roman Gushchin
2021-05-27 11:24   ` Jan Kara
2021-05-27 17:48     ` Roman Gushchin
2021-05-27 19:45       ` Roman Gushchin
2021-05-28 13:05       ` Jan Kara
2021-05-28 16:25         ` Roman Gushchin [this message]
2021-05-28  2:58   ` Ming Lei
2021-05-28 16:22     ` Roman Gushchin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YLEZlpUMc03uYM7V@carbon.DHCP.thefacebook.com \
    --to=guro@fb.com \
    --cc=cgroups@vger.kernel.org \
    --cc=dchinner@redhat.com \
    --cc=dennis@kernel.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=tj@kernel.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox