From: Chris Down <chris@chrisdown.name>
To: "Bruno Prémont" <bonbons@linux-vserver.org>
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Vladimir Davydov <vdavydov.dev@gmail.com>
Subject: Re: Memory CG and 5.1 to 5.6 uprade slows backup
Date: Thu, 9 Apr 2020 11:50:48 +0100
Message-ID: <20200409105048.GA1040020@chrisdown.name>
In-Reply-To: <20200409112505.2e1fc150@hemera.lan.sysophe.eu>

Hi Bruno,

Bruno Prémont writes:
>Upgrading from 5.1 kernel to 5.6 kernel on a production system using
>cgroups (v2) and having backup process in a memory.high=2G cgroup
>sees backup being highly throttled (there are about 1.5T to be
>backuped).

Before 5.4, memory usage with memory.high=N is essentially unbounded if the
system is unable to reclaim pages for some reason. That's because memory.high
throttling before that point is based purely on forcing direct reclaim for the
cgroup, and there's no guarantee that we can actually reclaim pages, or that
doing so serves as a meaningful time penalty.

In 5.4, my patch 0e4b01df8659 ("mm, memcg: throttle allocators when failing
reclaim over memory.high") changes kernel behaviour to actively penalise
cgroups that exceed their memory.high by a large amount. That is, if reclaim
fails to bring the cgroup back below the high threshold, we actively
deschedule the allocating task for a number of jiffies that grows
exponentially with the amount of overage incurred. This is so that cgroups
using memory.high cannot simply have runaway memory usage without any
consequences.
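
To give a rough feel for the mechanism, here's an illustrative Python sketch
of the shape of the behaviour. This is not the kernel code; the curve and
constants are made up for illustration, and only the clamp to 2 seconds
(mentioned further down) reflects the real limit.

    # Illustrative only: the penalty grows steeply with overage and is clamped.
    MAX_PENALTY_SECONDS = 2.0

    def penalty_seconds(usage_bytes, high_bytes):
        if usage_bytes <= high_bytes:
            return 0.0
        # The further over memory.high we are, the longer the forced sleep.
        overage_ratio = (usage_bytes - high_bytes) / high_bytes
        return min(MAX_PENALTY_SECONDS, 4.0 * overage_ratio ** 2)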

This is the patch that I'd particularly suspect is related to your problem. 
However:

>Most memory usage in that cgroup is for file cache.
>
>Here are the memory details for the cgroup:
>memory.current:2147225600
>[...]
>memory.events:high 423774
>memory.events:max 31131
>memory.high:2147483648
>memory.max:2415919104

Your high limit is being exceeded heavily and you are failing to reclaim. You
also have `max` events here, which means your application has, at least at
some point, gone more than 268 *mega*bytes over its memory.high.

So yes, we will penalise this cgroup heavily since we cannot reclaim from it. 
The real question is why we can't reclaim from it :-)
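
That number simply falls out of the gap between your two limits; for example:

    # memory.max events mean usage hit memory.max at some point, which is
    # at least (memory.max - memory.high) bytes over the high threshold.
    >>> 2415919104 - 2147483648
    268435456
    >>> _ / 10**6
    268.435456    # ~268 MB (256 MiB) over memory.high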

>memory.low:33554432

You have a memory.low set, which will bias reclaim away from this cgroup based 
on overage. It's not very large, though, so it shouldn't change the semantics 
here, although it's worth noting since it also changed in another one of my 
patches, 9783aa9917f8 ("mm, memcg: proportional memory.{low,min} reclaim"), 
which is also in 5.4.

In 5.1, as soon as you exceed memory.low, you immediately lose all protection.  
This is not ideal because it results in extremely binary, back-and-forth 
behaviour for cgroups using it (see the changelog for more information). This 
change means you will still receive some small amount of protection based on 
your overage, but it's fairly insignificant in this case (memory.current is 
about 64x larger than memory.low). What did you intend to do with this in 5.1? 
:-)
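
If it helps to see the magnitude, here's a toy Python model of the
proportional idea. It is not the kernel's exact arithmetic, just the rough
shape: protection fades with overage instead of vanishing the moment
memory.low is crossed.

    # Toy model of proportional memory.low reclaim (illustrative only).
    def reclaimable_fraction(usage_bytes, low_bytes):
        if usage_bytes <= low_bytes:
            return 0.0                      # fully protected
        # Protection fades as usage grows past memory.low.
        return 1.0 - low_bytes / usage_bytes

    # With your numbers: 1 - 33554432 / 2147225600 ~= 0.98,
    # so ~98% of the cgroup is fair game for reclaim anyway.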

>memory.stat:anon 10887168
>memory.stat:file 2062102528
>memory.stat:kernel_stack 73728
>memory.stat:slab 76148736
>memory.stat:sock 360448
>memory.stat:shmem 0
>memory.stat:file_mapped 12029952
>memory.stat:file_dirty 946176
>memory.stat:file_writeback 405504
>memory.stat:anon_thp 0
>memory.stat:inactive_anon 0
>memory.stat:active_anon 10121216
>memory.stat:inactive_file 1954959360
>memory.stat:active_file 106418176
>memory.stat:unevictable 0
>memory.stat:slab_reclaimable 75247616
>memory.stat:slab_unreclaimable 901120
>memory.stat:pgfault 8651676
>memory.stat:pgmajfault 2013
>memory.stat:workingset_refault 8670651
>memory.stat:workingset_activate 409200
>memory.stat:workingset_nodereclaim 62040
>memory.stat:pgrefill 1513537
>memory.stat:pgscan 47519855
>memory.stat:pgsteal 44933838
>memory.stat:pgactivate 7986
>memory.stat:pgdeactivate 1480623
>memory.stat:pglazyfree 0
>memory.stat:pglazyfreed 0
>memory.stat:thp_fault_alloc 0
>memory.stat:thp_collapse_alloc 0

It's hard to say exactly why we can't reclaim from these statistics alone;
usually, if anything, the kernel is *over*-eager to drop cache pages.

If the kernel thinks those file pages are too hot, though, it won't drop them. 
However, we only have 106M active file, compared to 2GB memory.current, so it 
doesn't look like this is the issue.

Can you please show io.pressure, io.stat, and cpu.pressure during these periods 
compared to baseline for this cgroup and globally (from /proc/pressure)? My 
suspicion is that we are not able to reclaim fast enough because memory 
management is getting stuck behind a slow disk.

Swap availability and usage information would also be helpful.
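
If it's easier to script the collection, something along these lines would
grab everything in one go. The cgroup path here is a placeholder; adjust it to
wherever your backup cgroup lives.

    # Hypothetical collection helper; run once at baseline and a few times
    # during the stalls, then compare.
    import pathlib

    CGROUP = pathlib.Path("/sys/fs/cgroup/backup.slice")   # placeholder path
    CG_FILES = ["io.pressure", "io.stat", "cpu.pressure",
                "memory.current", "memory.stat"]

    def snapshot():
        out = {}
        for name in CG_FILES:
            out[name] = (CGROUP / name).read_text()
        for name in ["io", "cpu", "memory"]:
            path = pathlib.Path("/proc/pressure") / name
            out[str(path)] = path.read_text()
        return out

    print(snapshot())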

>Regularly the backup process seems to be blocked for about 2s, but not
>within a syscall according to strace.

That 2 seconds is telling: it's the maximum time we allow the allocator
throttler to throttle a single allocation :-)

If you want to verify, you can look at /proc/<pid>/stack during these stalls
-- the task should be in mem_cgroup_handle_over_high, at an address related to
allocator throttling.
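
For example, a rough way to catch it in the act (the pid is whatever your
backup process is; reading /proc/<pid>/stack needs root):

    # Sample the backup process's kernel stack and flag throttled samples.
    import sys, time

    pid = sys.argv[1]
    for _ in range(100):
        with open(f"/proc/{pid}/stack") as f:
            stack = f.read()
        if "mem_cgroup_handle_over_high" in stack:
            print(time.strftime("%H:%M:%S"), "throttled in:")
            print(stack)
        time.sleep(0.1)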

>Is there a way to tell kernel that this cgroup should not be throttled

Huh? That's what memory.high is for, so why are you using it if you don't want
that?

>and its inactive file cache given up (rather quickly).

I suspect the kernel is reclaiming as far as it can, but is being stopped from 
doing so for some reason, which is why I'd like to see io.pressure and 
cpu.pressure.

>On a side note, I liked v1's mode of soft/hard memory limit where the
>memory amount between soft and hard could be used if system has enough
>free memory. For v2 the difference between high and max seems almost of
>no use.

For that use case, that's more or less what we've designed memory.low to do. 
The difference is that v1's soft limit almost never worked: the heuristics are 
extremely complicated, so complicated in fact that even we as memcg maintainers 
cannot reason about them. If we cannot reason about them, I'm quite sure it's 
not really doing what you expect :-)

In this case everything looks like it's working as intended; it's just that
this is all fallout from memory.high becoming less broken in 5.4. From your
description, I'm not sure that memory.high is what you want, either.
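
If what you're after really is "use spare memory when the system has it, give
it back under pressure", one option is to lean on memory.low rather than
memory.high. A sketch of what that might look like (the cgroup path and sizes
here are placeholders, not a recommendation for your workload):

    # Hypothetical: protect the cache under pressure instead of throttling it.
    import pathlib

    cg = pathlib.Path("/sys/fs/cgroup/backup.slice")       # placeholder path
    (cg / "memory.low").write_text(str(2 * 1024**3))       # keep ~2G under pressure
    (cg / "memory.high").write_text("max")                 # no allocator throttling
    (cg / "memory.max").write_text(str(3 * 1024**3))       # hard ceiling only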

>A cgroup parameter for impacting RO file cache differently than
>anonymous memory or otherwise dirty memory would be great too.

We had vm.swappiness in v1 and it manifested extremely poorly. I won't go too 
much into the details of that here though, since we already discussed it fairly 
comprehensively here[0].

Please feel free to send over the io.pressure, io.stat, cpu.pressure, and swap
metrics at baseline and during the throttled periods when possible. Thanks!

0: https://lore.kernel.org/patchwork/patch/1172080/


