From: Roman Gushchin <guro@fb.com>
To: Dave Chinner <dchinner@redhat.com>
Cc: Rik van Riel <riel@surriel.com>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"mhocko@kernel.org" <mhocko@kernel.org>,
	"guroan@gmail.com" <guroan@gmail.com>,
	Kernel Team <Kernel-team@fb.com>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>
Subject: Re: [LSF/MM TOPIC] dying memory cgroups and slab reclaim issues
Date: Wed, 20 Feb 2019 05:31:01 +0000
Message-ID: <20190220053052.GA13267@castle.DHCP.thefacebook.com>
In-Reply-To: <20190220043332.GA31397@rh>

On Wed, Feb 20, 2019 at 03:33:32PM +1100, Dave Chinner wrote:
> On Tue, Feb 19, 2019 at 09:06:07PM -0500, Rik van Riel wrote:
> > On Wed, 2019-02-20 at 10:26 +1100, Dave Chinner wrote:
> > > On Tue, Feb 19, 2019 at 12:31:10PM -0500, Rik van Riel wrote:
> > > > On Tue, 2019-02-19 at 13:04 +1100, Dave Chinner wrote:
> > > > > On Tue, Feb 19, 2019 at 12:31:45AM +0000, Roman Gushchin wrote:
> > > > > > Sorry, resending with the fixed to/cc list. Please ignore the
> > > > > > first email.
> > > > > 
> > > > > Please resend again with linux-fsdevel on the cc list, because
> > > > > this isn't an MM topic given the regressions from the shrinker
> > > > > patches have all been on the filesystem side of the shrinkers....
> > > > 
> > > > It looks like there are two separate things going on here.
> > > > 
> > > > The first are MM issues, one of potentially leaking memory
> > > > by not scanning slabs with few items on them,
> > > 
> > > We don't leak memory. Slabs with very few freeable items on them
> > > just don't get scanned when there is only light memory pressure.
> > > That's /by design/ and it is behaviour we've tried hard over many
> > > years to preserve. Once memory pressure ramps up, they'll be
> > > scanned just like all the other slabs.
> > 
> > That may have been fine before cgroups, but when
> > a system can have (tens of) thousands of slab
> > caches, we DO want to scan slab caches with few
> > freeable items in them.
> > 
> > The threshold for "few items" is 4096, not some
> > actually tiny number. That can add up to a lot
> > of memory if a system has hundreds of cgroups.
> 
> That doesn't sound right. The threshold is supposed to be low single
> digits based on the amount of pressure on the page cache, and it's
> accumulated by deferral until the batch threshold (128) is exceeded.
> 
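
As an aside, here is a simplified user-space sketch of the accumulation
behaviour described above. It is only an illustration, not the kernel
code: the 128-object batch size and the single-digit per-pass delta come
from the mail, the rest is made up. A tiny delta is carried over in
nr_deferred from pass to pass until the accumulated total crosses the
batch size, and only then is the shrinker actually asked to scan.

#include <stdio.h>

#define SHRINK_BATCH 128        /* batch threshold mentioned above */

int main(void)
{
        long nr_deferred = 0;   /* carried over between reclaim passes */
        long delta = 3;         /* "low single digits" per pass */
        int pass;

        for (pass = 1; pass <= 100; pass++) {
                long total_scan = nr_deferred + delta;

                if (total_scan >= SHRINK_BATCH) {
                        printf("pass %3d: scan %ld objects\n",
                               pass, total_scan);
                        total_scan = 0; /* work handed to the shrinker */
                }
                nr_deferred = total_scan;       /* defer the remainder */
        }
        return 0;
}

With a delta of 3 per pass, the sketch reports a scan roughly every 43
passes: small per-pass deltas still turn into eventual scanning instead
of objects never being looked at.
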
> Ohhhhh. The penny just dropped - this whole sorry saga has been
> triggered because people are chasing a regression nobody has
> recognised as a regression because they don't actually understand
> how the shrinker algorithms are /supposed/ to work.
> 
> And I'm betting that it's been caused by some other recent FB
> shrinker change.....
> 
> Yup, there it is:
> 
> commit 9092c71bb724dba2ecba849eae69e5c9d39bd3d2
> Author: Josef Bacik <jbacik@fb.com>
> Date:   Wed Jan 31 16:16:26 2018 -0800
> 
>     mm: use sc->priority for slab shrink targets
> 
> ....
>     We don't need to know exactly how many pages each shrinker represents,
>     its objects are all the information we need.  Making this change allows
>     us to place an appropriate amount of pressure on the shrinker pools for
>     their relative size.
> ....
> 
> -       delta = (4 * nr_scanned) / shrinker->seeks;
> -       delta *= freeable;
> -       do_div(delta, nr_eligible + 1);
> +       delta = freeable >> priority;
> +       delta *= 4;
> +       do_div(delta, shrinker->seeks);
> 
> 
> So, prior to this change:
> 
> 	delta ~= (4 * nr_scanned * freeable) / nr_eligible
> 
> IOWs, the ratio of nr_scanned:nr_eligible determined the resolution
> of scan, and that meant delta could (and did!) have values in the
> single digit range.
> 
> The current code introduced by the above patch does:
> 
> 	delta ~= (freeable >> priority) * 4
> 
> Which, as you state, has a threshold of freeable > 4096 to trigger
> scanning under low memory pressure.
> 
> So, that's the original regression that people are trying to fix
> (root cause analysis FTW).  It was introduced in 4.16-rc1. The
> attempts to fix this regression (i.e. the lack of low free object
> shrinker scanning) were introduced into 4.18-rc1, which caused even
> worse regressions and led us directly to this point.
> 
> Ok, now I see where the real problem people are chasing is, I'll go
> write a patch to fix it.

Sounds good, I'll check whether it can prevent the memcg leak.
If it works, we're fine.
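
To make the difference concrete, here is a rough user-space comparison of
the two delta calculations quoted above. This is only a sketch, not kernel
code: DEFAULT_SEEKS == 2 and DEF_PRIORITY == 12 are the usual kernel
defaults, and the sample numbers below are invented. With freeable under
4096, the new formula yields zero at default priority, while the old one
could still produce a small non-zero delta that then accumulates through
deferral.

#include <stdio.h>

#define DEFAULT_SEEKS   2
#define DEF_PRIORITY    12

/* pre-4.16: resolution comes from the nr_scanned:nr_eligible ratio */
static unsigned long delta_old(unsigned long nr_scanned,
                               unsigned long nr_eligible,
                               unsigned long freeable)
{
        unsigned long long delta = 4ULL * nr_scanned / DEFAULT_SEEKS;

        delta *= freeable;
        return delta / (nr_eligible + 1);
}

/* 4.16+: resolution comes from the reclaim priority alone */
static unsigned long delta_new(int priority, unsigned long freeable)
{
        unsigned long long delta = freeable >> priority;

        delta *= 4;
        return delta / DEFAULT_SEEKS;
}

int main(void)
{
        /* light pressure: small page-cache scan, smallish slab cache */
        unsigned long nr_scanned = 32, nr_eligible = 10000, freeable = 1000;

        printf("old delta: %lu\n",
               delta_old(nr_scanned, nr_eligible, freeable));   /* 6 */
        printf("new delta: %lu\n",
               delta_new(DEF_PRIORITY, freeable));              /* 0 */
        return 0;
}

The old-style delta of 6 gets deferred and eventually triggers a scan;
the new-style delta stays at 0 until freeable exceeds 4095, which is the
threshold Rik and Dave are talking about.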

> 
> > Roman's patch, which reclaimed small slabs extra
> > aggressively, introduced issues, but reclaiming
> > small slabs at the same pressure/object as large
> > slabs seems like the desired behavior.
> 
> It's still broken. Both of your patches do the wrong thing because
> they don't address the resolution and accumulation regression and
> instead add another layer of heuristics over the top of the delta
> calculation to hide the lack of resolution.
> 
> > > That's a cgroup referencing and teardown problem, not a memory
> > > reclaim algorithm problem. To treat it as a memory reclaim problem
> > > smears memcg internal implementation bogosities all over the
> > > independent reclaim infrastructure. It violates the concepts of
> > > isolation, modularity, independence, abstraction layering, etc.
> > 
> > You are overlooking the fact that an inode loaded
> > into memory by one cgroup (which is getting torn
> > down) may be in active use by processes in other
> > cgroups.
> 
> No I am not. I am fully aware of this problem (have been since memcg
> day one because of the list_lru tracking issues Glauber and I had to
> sort out when we first realised shared inodes could occur). Sharing
> inodes across cgroups also causes "complexity" in things like cgroup
> writeback control (which cgroup dirty list tracks and does writeback
> of shared inodes?) and so on. Shared inodes across cgroups are
> considered the exception rather than the rule, and they are treated
> in many places with algorithms that assert "this is rare, if it's
> common we're going to be in trouble"....

No. Even if sharing inodes can be considered bad practice and may lead
to sub-optimal results, it shouldn't trigger obvious kernel issues like
memory leaks. Otherwise it becomes a security concern.

Also, in practice it's common to have a main workload and a couple of
supplementary processes (e.g. monitoring) in sibling cgroups which share
some inodes (e.g. log files).

Thanks!

