Date: Thu, 9 Apr 2020 11:50:48 +0100
From: Chris Down
To: Bruno Prémont
Cc: cgroups@vger.kernel.org, linux-mm@kvack.org, Johannes Weiner,
 Michal Hocko, Vladimir Davydov
Subject: Re: Memory CG and 5.1 to 5.6 uprade slows backup
Message-ID: <20200409105048.GA1040020@chrisdown.name>
In-Reply-To: <20200409112505.2e1fc150@hemera.lan.sysophe.eu>

Hi Bruno,

Bruno Prémont writes:
>Upgrading from 5.1 kernel to 5.6 kernel on a production system using
>cgroups (v2) and having backup process in a memory.high=2G cgroup
>sees backup being highly throttled (there are about 1.5T to be
>backuped).

Before 5.4, memory usage with memory.high=N is essentially unbounded if the
system is not able to reclaim pages for some reason. This is because all
memory.high throttling before that point is just based on forcing direct
reclaim for a cgroup, but there's no guarantee that we can actually reclaim
pages, or that it will serve as a time penalty.

In 5.4, my patch 0e4b01df8659 ("mm, memcg: throttle allocators when failing
reclaim over memory.high") changes kernel behaviour to actively penalise
cgroups that exceed their memory.high by a large amount. That is, if reclaim
fails to bring the cgroup back below the high threshold, we actively
deschedule the allocating process for a number of jiffies that increases
exponentially with the amount of overage incurred. This is so that cgroups
using memory.high cannot simply have runaway memory usage without any
consequences.
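To make that a little more concrete, here is a rough sketch of the shape of
the logic, in Python. This is purely illustrative and not the kernel code:
the real implementation lives in mem_cgroup_handle_over_high(), works in
jiffies, and uses a different curve and different constants.

    # Sketch only: names and constants here are made up for illustration.
    MAX_PENALTY_MS = 2000  # a single stall is capped at roughly 2s (see below)

    def penalty_ms(usage_bytes, high_bytes):
        """Pick a sleep that grows steeply with the overage past memory.high."""
        if usage_bytes <= high_bytes:
            return 0
        overage = usage_bytes - high_bytes
        factor = overage / high_bytes
        # The kernel's curve is not exactly this; the point is only that small
        # overages cost little and large ones quickly become expensive.
        return min(int(factor * factor * MAX_PENALTY_MS), MAX_PENALTY_MS)

    # After reclaim fails to bring usage back under memory.high, the
    # allocating task is descheduled for roughly this long before it is
    # allowed to return to userspace.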
This is the patch that I'd particularly suspect is related to your problem.
However:

>Most memory usage in that cgroup is for file cache.
>
>Here are the memory details for the cgroup:
>memory.current:2147225600
>[...]
>memory.events:high 423774
>memory.events:max 31131
>memory.high:2147483648
>memory.max:2415919104

Your high limit is being exceeded heavily and you are failing to reclaim.
You have `max` events here, which means your application was, at least at
some point, more than 268 *mega*bytes over its memory.high.

So yes, we will penalise this cgroup heavily since we cannot reclaim from it.
The real question is why we can't reclaim from it :-)

>memory.low:33554432

You have a memory.low set, which will bias reclaim away from this cgroup
based on overage. It's not very large, though, so it shouldn't change the
semantics here, although it's worth noting since it also changed in another
one of my patches, 9783aa9917f8 ("mm, memcg: proportional memory.{low,min}
reclaim"), which is also in 5.4.

In 5.1, as soon as you exceed memory.low, you immediately lose all
protection. This is not ideal because it results in extremely binary,
back-and-forth behaviour for cgroups using it (see the changelog for more
information). This change means you will still receive some small amount of
protection based on your overage, but it's fairly insignificant in this case
(memory.current is about 64x larger than memory.low). What did you intend to
do with this in 5.1? :-)

>memory.stat:anon 10887168
>memory.stat:file 2062102528
>memory.stat:kernel_stack 73728
>memory.stat:slab 76148736
>memory.stat:sock 360448
>memory.stat:shmem 0
>memory.stat:file_mapped 12029952
>memory.stat:file_dirty 946176
>memory.stat:file_writeback 405504
>memory.stat:anon_thp 0
>memory.stat:inactive_anon 0
>memory.stat:active_anon 10121216
>memory.stat:inactive_file 1954959360
>memory.stat:active_file 106418176
>memory.stat:unevictable 0
>memory.stat:slab_reclaimable 75247616
>memory.stat:slab_unreclaimable 901120
>memory.stat:pgfault 8651676
>memory.stat:pgmajfault 2013
>memory.stat:workingset_refault 8670651
>memory.stat:workingset_activate 409200
>memory.stat:workingset_nodereclaim 62040
>memory.stat:pgrefill 1513537
>memory.stat:pgscan 47519855
>memory.stat:pgsteal 44933838
>memory.stat:pgactivate 7986
>memory.stat:pgdeactivate 1480623
>memory.stat:pglazyfree 0
>memory.stat:pglazyfreed 0
>memory.stat:thp_fault_alloc 0
>memory.stat:thp_collapse_alloc 0

It's hard to say from these statistics alone why we can't reclaim; if
anything, the kernel is usually *over*-eager to drop cache pages.

If the kernel thinks those file pages are too hot, though, it won't drop
them. However, we only have about 106M of active file compared to 2G of
memory.current, so it doesn't look like this is the issue.

Can you please show io.pressure, io.stat, and cpu.pressure during these
periods compared to baseline, both for this cgroup and globally (from
/proc/pressure)? My suspicion is that we are not able to reclaim fast enough
because memory management is getting stuck behind a slow disk. Swap
availability and usage information would also be helpful.

>Regularly the backup process seems to be blocked for about 2s, but not
>within a syscall according to strace.

2 seconds is significant: it's the maximum time we allow the allocator
throttler to throttle a single allocation :-)

If you want to verify, you can look at /proc/pid/stack during these stalls --
it should be in mem_cgroup_handle_over_high, or at an address related to
allocator throttling.
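If it helps with gathering that, here is a rough helper in Python (untested
against your setup; the cgroup path is a placeholder you'd need to adjust)
that samples the pressure files, io.stat, swap usage, and the backup
process's kernel stack once a second, so a stall can be compared against
baseline:

    #!/usr/bin/env python3
    # Hypothetical helper, not something from the kernel tree: samples PSI,
    # io.stat, swap usage, and the target task's kernel stack once a second.
    # Needs root for /proc/<pid>/stack. Adjust CGROUP for your layout.
    import sys
    import time

    CGROUP = "/sys/fs/cgroup/backup"   # placeholder: path to the backup cgroup
    PID = sys.argv[1]                  # PID of the backup process

    def read(path):
        try:
            with open(path) as f:
                return f.read().strip()
        except OSError as e:
            return "<unreadable: %s>" % e

    while True:
        print("===", time.strftime("%H:%M:%S"))
        for name in ("io.pressure", "cpu.pressure", "io.stat",
                     "memory.swap.current"):
            print(name, "\n" + read("%s/%s" % (CGROUP, name)))
        for path in ("/proc/pressure/io", "/proc/pressure/cpu",
                     "/proc/pressure/memory"):
            print(path, "\n" + read(path))
        # During a stall this should show mem_cgroup_handle_over_high.
        print("/proc/%s/stack" % PID, "\n" + read("/proc/%s/stack" % PID))
        time.sleep(1)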
>Is there a way to tell kernel that this cgroup should not be throttled

Huh? That's what memory.high is for, so why are you using it if you don't
want that?

>and its inactive file cache given up (rather quickly).

I suspect the kernel is reclaiming as far as it can, but is being stopped
from doing so for some reason, which is why I'd like to see io.pressure and
cpu.pressure.

>On a side note, I liked v1's mode of soft/hard memory limit where the
>memory amount between soft and hard could be used if system has enough
>free memory. For v2 the difference between high and max seems almost of
>no use.

For that use case, that's more or less what we've designed memory.low to do.
The difference is that v1's soft limit almost never worked: the heuristics
are extremely complicated, so complicated in fact that even we as memcg
maintainers cannot reason about them. If we cannot reason about them, I'm
quite sure it's not really doing what you expect :-)

In this case everything looks like it's working as intended; this is all just
the result of memory.high becoming less broken in 5.4. From your description,
I'm not sure that memory.high is what you want, either.

>A cgroup parameter for impacting RO file cache differently than
>anonymous memory or otherwise dirty memory would be great too.

We had vm.swappiness in v1 and it manifested extremely poorly. I won't go
into too much detail here though, since we already discussed it fairly
comprehensively in [0].

Please feel free to send over the io.pressure, io.stat, cpu.pressure, and
swap metrics at baseline and during these stalls when possible.

Thanks!

0: https://lore.kernel.org/patchwork/patch/1172080/