From: "T.J. Mercier" <tjmercier@google.com>
Date: Mon, 5 Feb 2024 12:47:47 -0800
Subject: Re: [PATCH v3] mm: memcg: Use larger batches for proactive reclaim
To: Michal Hocko
Cc: Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
 Andrew Morton, Efly Young, android-mm@google.com, yuzhao@google.com,
 mkoutny@suse.com, Yosry Ahmed, cgroups@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240202233855.1236422-1-tjmercier@google.com>
On Mon, Feb 5, 2024 at 12:36 PM Michal Hocko wrote:
>
> On Mon 05-02-24 12:26:10, T.J. Mercier wrote:
> > On Mon, Feb 5, 2024 at 11:40 AM Michal Hocko wrote:
> > >
> > > On Mon 05-02-24 11:29:49, T.J. Mercier wrote:
> > > > On Mon, Feb 5, 2024 at 2:40 AM Michal Hocko wrote:
> > > > >
> > > > > On Fri 02-02-24 23:38:54, T.J.
> > > > > Mercier wrote:
> > > > > > Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
> > > > > > reclaim") we passed the number of pages for the reclaim request directly
> > > > > > to try_to_free_mem_cgroup_pages, which could lead to significant
> > > > > > overreclaim. After 0388536ac291 the number of pages was limited to a
> > > > > > maximum of 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim.
> > > > > > However such a small batch size caused a regression in reclaim
> > > > > > performance due to many more reclaim start/stop cycles inside
> > > > > > memory_reclaim.
> > > > >
> > > > > You have mentioned that in one of the previous emails but it is good to
> > > > > mention what the source of that overhead is for future reference.
> > > >
> > > > I can add a sentence about the restart cost being amortized over more
> > > > pages with a large batch size. It covers things like repeatedly
> > > > flushing stats, walking the tree, evaluating protection limits, etc.
> > > >
> > > > > > Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> > > > > > nodes and cgroups over which the pages are spread. As such, the bigger
> > > > > > the request, the bigger the absolute overreclaim error. Historic
> > > > > > in-kernel users of reclaim have used fixed, small sized requests to
> > > > > > approach an appropriate reclaim rate over time. When we reclaim a user
> > > > > > request of arbitrary size, use decaying batch sizes to manage error while
> > > > > > maintaining reasonable throughput.
> > > > >
> > > > > Are these numbers with MGLRU or the default reclaim implementation?
> > > >
> > > > These numbers are for both. root uses the memcg LRU (MGLRU was
> > > > enabled), and /uid_0 does not.
> > >
> > > Thanks, it would be nice to outline that in the changelog.
> >
> > Ok, I'll update the table below for each case.
> > > > > > root - full reclaim       pages/sec   time (sec)
> > > > > > pre-0388536ac291          : 68047     10.46
> > > > > > post-0388536ac291         : 13742     inf
> > > > > > (reclaim-reclaimed)/4     : 67352     10.51
> > > > > >
> > > > > > /uid_0 - 1G reclaim       pages/sec   time (sec)   overreclaim (MiB)
> > > > > > pre-0388536ac291          : 258822    1.12         107.8
> > > > > > post-0388536ac291         : 105174    2.49         3.5
> > > > > > (reclaim-reclaimed)/4     : 233396    1.12         -7.4
> > > > > >
> > > > > > /uid_0 - full reclaim     pages/sec   time (sec)
> > > > > > pre-0388536ac291          : 72334     7.09
> > > > > > post-0388536ac291         : 38105     14.45
> > > > > > (reclaim-reclaimed)/4     : 72914     6.96
> > > > > >
> > > > > > Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
> > > > > > Signed-off-by: T.J. Mercier
> > > > > > Reviewed-by: Yosry Ahmed
> > > > > > Acked-by: Johannes Weiner
> > > > > >
> > > > > > ---
> > > > > > v3: Formatting fixes per Yosry Ahmed and Johannes Weiner. No functional
> > > > > > changes.
> > > > > > v2: Simplify the request size calculation per Johannes Weiner and Michal Koutný
> > > > > >
> > > > > >  mm/memcontrol.c | 6 ++++--
> > > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 46d8d02114cf..f6ab61128869 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -6976,9 +6976,11 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > > > > >  		if (!nr_retries)
> > > > > >  			lru_add_drain_all();
> > > > > >
> > > > > > +		/* Will converge on zero, but reclaim enforces a minimum */
> > > > > > +		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
> > > > >
> > > > > This doesn't fit into the existing coding style. I do not think there is
> > > > > a strong reason to go against it here.
> > > >
> > > > There's been some back and forth here.
> > > > You'd prefer to move this to
> > > > the top of the while loop, under the declaration of reclaimed? It's
> > > > farther from its use there, but it does match the existing style in
> > > > the file better.
> > >
> > > This is not something I deeply care about, but generally it is better to
> > > not mix styles unless that is a clear win. If you want to save one LOC
> > > you can just move it up - just a couple of lines up - or you can keep the
> > > definition closer and have a separate declaration.
> >
> > I find it nicer to have to search as little as possible for both the
> > declaration (type) and definition, but I am not attached to it either
> > and it's not worth annoying anyone over here. Let's move it up like
> > Yosry suggested initially.
> >
> > > > > > +
> > > > > >  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > > > > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > > > > -					GFP_KERNEL, reclaim_options);
> > > > > > +					batch_size, GFP_KERNEL, reclaim_options);
> > > > >
> > > > > Also, with the increased reclaim target do we need something like this?
> > > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 4f9c854ce6cc..94794cf5ee9f 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -1889,7 +1889,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > > > >
> > > > >  		/* We are about to die and free our memory. Return now. */
> > > > >  		if (fatal_signal_pending(current))
> > > > > -			return SWAP_CLUSTER_MAX;
> > > > > +			return sc->nr_to_reclaim;
> > > > >  	}
> > > > >
> > > > >  	lru_add_drain();
> > > > >
> > > > > >  		if (!reclaimed && !nr_retries--)
> > > > > >  			return -EAGAIN;
> > > > > > --
> > > >
> > > > This is interesting, but I don't think it's closely related to this
> > > > change. This section looks like it was added to delay OOM kills due to
> > > > apparent lack of reclaim progress when pages are isolated and the
> > > > direct reclaimer is scheduled out. A couple of things:
> > > >
> > > > In the context of proactive reclaim, current is not really undergoing
> > > > reclaim due to memory pressure. It's initiated from userspace. So
> > > > whether it has a fatal signal pending or not doesn't seem like it
> > > > should influence the return value of shrink_inactive_list for some
> > > > probably unrelated process. It seems more straightforward to me to
> > > > return 0, and add another fatal signal pending check to the caller
> > > > (shrink_lruvec) to bail out early (dealing with OOM kill avoidance
> > > > there if necessary) instead of waiting to accumulate fake
> > > > SWAP_CLUSTER_MAX values from shrink_inactive_list.
> > >
> > > The point of this code is to bail out early if the caller has fatal
> > > signals pending. That could be SIGTERM sent to the process performing
> > > the reclaim for whatever reason. The bailout is tuned for
> > > SWAP_CLUSTER_MAX as you can see, and your patch is increasing the
> > > reclaim target, which means that the bailout wouldn't work properly:
> > > you wouldn't get any useful work done, but you wouldn't really bail
> > > out either.
> >
> > It's increasing to 1/4 of what it was 6 months ago before 0388536ac291
> > ("mm:vmscan: fix inaccurate reclaim during proactive reclaim") and
> > this hasn't changed since then, so if anything the bailout should
> > happen quicker than originally tuned for.
>
> Yes, this wasn't handled properly back then either.
>
> > > > As far as changing the value, SWAP_CLUSTER_MAX puts the final value of
> > > > sc->nr_reclaimed pretty close to sc->nr_to_reclaim. Since there's a
> > > > loop for each evictable lru in shrink_lruvec, we could end up with 4 *
> > > > sc->nr_to_reclaim in sc->nr_reclaimed if we switched to
> > > > sc->nr_to_reclaim from SWAP_CLUSTER_MAX... an even bigger lie.
> > > > So I don't think we'd want to do that.
> > >
> > > The actual number returned from the reclaim is not really important
> > > because memory_reclaim would break out of the loop and userspace would
> > > never see the result.
> >
> > This makes sense, but it makes me uneasy. I can't point to anywhere
> > this would cause a problem currently (except maybe a super unlikely
> > overflow of nr_reclaimed), but it feels like a setup for future
> > unintended consequences.
>
> Think of something like
> 	timeout $TIMEOUT echo $TARGET > $MEMCG_PATH/memory.reclaim
> where timeout acts as a stopgap if the reclaim cannot finish in
> TIMEOUT.

Yeah, I get the desired behavior, but using sc->nr_reclaimed to achieve it
is what's bothering me. It's already wired up that way though, so if you
want to make this change now then I can try to test for the difference
using really large reclaim targets.