Date: Wed, 15 Apr 2020 12:17:53 +0200
From: Bruno Prémont
To: Michal Hocko
Cc: Chris Down, cgroups@vger.kernel.org, linux-mm@kvack.org,
 Johannes Weiner, Vladimir Davydov
Subject: Re: Memory CG and 5.1 to 5.6 upgrade slows backup
Message-ID: <20200415121753.3c8d700b@hemera.lan.sysophe.eu>
In-Reply-To: <20200414163134.GQ4629@dhcp22.suse.cz>
References: <20200409112505.2e1fc150@hemera.lan.sysophe.eu>
 <20200409094615.GE18386@dhcp22.suse.cz>
 <20200409121733.1a5ba17c@hemera.lan.sysophe.eu>
 <20200409103400.GF18386@dhcp22.suse.cz>
 <20200409170926.182354c3@hemera.lan.sysophe.eu>
 <20200409152540.GP18386@dhcp22.suse.cz>
 <20200410091525.287062fa@hemera.lan.sysophe.eu>
 <20200410104343.5bcde519@hemera.lan.sysophe.eu>
 <20200410115010.1d9f6a3f@hemera.lan.sysophe.eu>
 <20200414163134.GQ4629@dhcp22.suse.cz>

Hi Michal,

On Tue, 14 Apr 2020 18:31:34 Michal Hocko wrote:
> On Fri 10-04-20 11:50:10, Bruno Prémont wrote:
> > Hi Michal, Chris,
> > 
> > Sending ephemeral link to (now properly) captured cgroup details
> > off-list.

Re-adding the list and other readers.

> > It contains:
> > snapshots of the cgroup contents at 1s interval
> > backup was running through the full captured period.
> > 
> > You can see memory.max changes at
> > 26-21 (to high + 128M)
> > and
> > 30-24 (back to high)
> 
> OK, so you have started with high = max and do not see any stalls. As
> soon as you activate the high limit reclaim by increasing the max limit
> you get your stalls because those tasks are put into sleep. There are
> only 3 tasks in the cgroup and it seems to be shell, tar and some
> subshell, or is there anything else that could charge any memory?

On the production system it's just the backup software (3-4 processes).
On the basic reproducer it's bash, tar and tar's gzip subprocess.
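In case someone wants to replay it: the basic reproducer is nothing more
than a cgroup v2 memory limit wrapped around tar. A minimal sketch, with
illustrative paths and sizes, assuming cgroup2 is mounted at
/sys/fs/cgroup with the memory controller enabled:

  # dedicated test cgroup
  mkdir /sys/fs/cgroup/backup-test
  # high = 1G; max = high + 4096 (a single page), the setup that stalls
  echo 1G > /sys/fs/cgroup/backup-test/memory.high
  echo $((1024*1024*1024 + 4096)) > /sys/fs/cgroup/backup-test/memory.max
  # move this shell into the group, then run the backup-like workload
  echo $$ > /sys/fs/cgroup/backup-test/cgroup.procs
  tar -zcf /tmp/test.tar.gz /usr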
> Let's just focus on the time prior to the switch and after:
> * prior
> $ for i in $(seq -w 21); do cat 26-$i/memory.current ; done | calc_min_max.awk
> min: 2371895296.00 max: 2415874048.00 avg: 2404059136.00 std: 13469809.78 nr: 20
> 
> high = hard limit = 2415919104
> 
> * after
> $ for i in $(seq -w 22 59); do cat 26-$i/memory.current ; done | calc_min_max.awk
> min: 2409172992.00 max: 2415828992.00 avg: 2415420793.26 std: 1475181.24 nr: 38
> 
> high limit = 2415919104
> hard limit = 2550136832
> 
> Nothing interesting here. The charged memory stays below the high limit.
> This might be a matter of timing of course because your snapshot might
> hit the window when the situation was ok. But 90K is a larger margin
> than I would expect in such a case.
> 
> $ cat 27-*/memory.current | calc_min_max.awk
> min: 2408161280.00 max: 2415910912.00 avg: 2415709583.19 std: 993983.28 nr: 59
> 
> Still under the high limit but closer (8K below), so this looks much
> more like high limit reclaim in action.
> 
> $ cat 28-*/memory.current | calc_min_max.awk
> min: 2409123840.00 max: 2415910912.00 avg: 2415633019.59 std: 870311.11 nr: 58
> 
> Same here.
> 
> $ cat 29-*/memory.current | calc_min_max.awk
> min: 2400464896.00 max: 2415828992.00 avg: 2414819015.59 std: 3133978.89 nr: 59
> 
> Quite a bit below the high limit.
> 
> So I do not see any large high limit excess here but it could be a
> matter of timing as mentioned above. Let's have a look at the reclaim
> activity.
> 
> 27-00/memory.stat:pgscan 82811883
> 27-00/memory.stat:pgsteal 80223813
> 27-01/memory.stat:pgscan 82811883
> 27-01/memory.stat:pgsteal 80223813
> 
> No scanning.
> 
> 27-02/memory.stat:pgscan 82811947
> 27-02/memory.stat:pgsteal 80223877
> 
> 64 pages scanned and reclaimed.
> 
> 27-03/memory.stat:pgscan 82811947
> 27-03/memory.stat:pgsteal 80223877
> 27-04/memory.stat:pgscan 82811947
> 27-04/memory.stat:pgsteal 80223877
> 27-05/memory.stat:pgscan 82811947
> 27-05/memory.stat:pgsteal 80223877
> 
> No scanning.
> 
> 27-06/memory.stat:pgscan 82812011
> 27-06/memory.stat:pgsteal 80223941
> 
> 64 pages scanned and reclaimed.
> 
> 27-07/memory.stat:pgscan 82812011
> 27-07/memory.stat:pgsteal 80223941
> 27-08/memory.stat:pgscan 82812011
> 27-08/memory.stat:pgsteal 80223941
> 27-09/memory.stat:pgscan 82812011
> 27-09/memory.stat:pgsteal 80223941
> 
> No scanning.
> 
> 27-11/memory.stat:pgscan 82812075
> 27-11/memory.stat:pgsteal 80224005
> 
> 64 pages scanned.
> 
> 27-12/memory.stat:pgscan 82812075
> 27-12/memory.stat:pgsteal 80224005
> 27-13/memory.stat:pgscan 82812075
> 27-13/memory.stat:pgsteal 80224005
> 27-14/memory.stat:pgscan 82812075
> 27-14/memory.stat:pgsteal 80224005
> 
> No scanning, etc.
> 
> So it seems there were two rounds of scanning (we usually do 32 pages
> per batch) and the reclaim was really effective at reclaiming that
> memory, but then the task is put into sleep for 2-3s. This is quite
> unexpected because the collected stats do not show a high limit excess
> during the sleeping time.
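Aside, for anyone redoing the numbers: calc_min_max.awk is not part of
the thread. Presumably it is just a small helper along these lines
(shown here as a shell function; that std is a population standard
deviation is my guess), fed one value per input line:

  calc_min_max() {
      awk '{ if (NR == 1 || $1 < min) min = $1
             if (NR == 1 || $1 > max) max = $1
             sum += $1; sumsq += $1 * $1 }
           END { avg = sum / NR
                 printf "min: %.2f max: %.2f avg: %.2f std: %.2f nr: %d\n",
                        min, max, avg, sqrt(sumsq / NR - avg * avg), NR }'
  }

It would be used exactly like calc_min_max.awk above, e.g.
cat 27-*/memory.current | calc_min_max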
> It would be interesting to see more detailed information on the
> throttling itself. Which kernel version are you testing this on?
> A 5.6+ kernel needs
> http://lkml.kernel.org/r/20200331152424.GA1019937@chrisdown.name
> but please note that e26733e0d0ec ("mm, memcg: throttle allocators based
> on ancestral memory.high") has been marked for stable, so 5.4+ kernels
> might have it as well and would need the same fix too.
> I wouldn't be really surprised if this was the actual problem that you
> are hitting, because the reclaim could simply have made usage < high and
> then the math in calculate_high_delay doesn't work properly.

I'm on 5.6.2. It seems neither e26733e0d0ec nor the fix hit 5.6.2 (nor
the current 5.6.4).

> Anyway, the following simple tracing patch should give a better clue.
> The output will appear in the trace buffer (mount tracefs and read the
> trace_pipe file).

This is the output I get on 5.6.4 with a simple tar -zc call (max=high+4096):

 tar-16943 [000] .... 1098.796955: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
 tar-16943 [000] .... 1100.876794: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144
 tar-16943 [000] .... 1102.956636: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262120 high:262144
 tar-16943 [000] .... 1105.037388: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262121 high:262144
 tar-16943 [000] .... 1107.117246: mem_cgroup_handle_over_high: memcg_nr_pages_over_high:1 penalty_jiffies:200 current:262122 high:262144

With 5.7-rc1 it runs just fine: pressure remains zero, and there is no
output in trace_pipe and no throttling.

So the fixes that went in there do fix it. Now it's a matter of
cherry-picking the right ones... e26733e0d0ec and its follow-up fix,
maybe some others (I will start with those tagged for stable).

Thanks,
Bruno

> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 05b4ec2c6499..dcee3030309d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2417,6 +2417,10 @@ void mem_cgroup_handle_over_high(void)
>  	if (penalty_jiffies <= HZ / 100)
>  		goto out;
>  
> +	trace_printk("memcg_nr_pages_over_high:%d penalty_jiffies:%ld current:%lu high:%lu\n",
> +		     nr_pages, penalty_jiffies,
> +		     page_counter_read(&memcg->memory), READ_ONCE(memcg->high));
> +
>  	/*
>  	 * If we exit early, we're guaranteed to die (since
>  	 * schedule_timeout_killable sets TASK_KILLABLE). This means we don't
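P.S. For anyone reproducing the tracing: trace_printk() output lands in
the tracefs trace buffer, so reading it boils down to (tracefs is
usually auto-mounted on recent systems):

  mount -t tracefs nodev /sys/kernel/tracing 2>/dev/null  # no-op if already mounted
  cat /sys/kernel/tracing/trace_pipe

On the numbers above: current and high are page counts (the patch prints
page_counter_read(), which counts pages), so high:262144 is the 1G
memory.high and current sits a page or two *below* it while the sleep is
imposed. And penalty_jiffies:200 is, with HZ=100 (which the ~2.08s
spacing of the timestamps suggests), a 2s sleep per charge batch; if I
read the 5.6 source right, that is exactly the MEMCG_MAX_HIGH_DELAY_JIFFIES
ceiling, i.e. maximum throttling while not actually being over the limit.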