From mboxrd@z Thu Jan 1 00:00:00 1970
From: "T.J. Mercier" <tjmercier@google.com>
Date: Mon, 5 Feb 2024 11:29:49 -0800
Subject: Re: [PATCH v3] mm: memcg: Use larger batches for proactive reclaim
To: Michal Hocko
Cc: Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton, Efly Young, android-mm@google.com, yuzhao@google.com,
	mkoutny@suse.com, Yosry Ahmed, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240202233855.1236422-1-tjmercier@google.com>
Content-Type: text/plain; charset="UTF-8"
On Mon, Feb 5, 2024 at 2:40 AM Michal Hocko wrote:
>
> On Fri 02-02-24 23:38:54, T.J. Mercier wrote:
> > Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
> > reclaim") we passed the number of pages for the reclaim request directly
> > to try_to_free_mem_cgroup_pages, which could lead to significant
> > overreclaim.
> > After 0388536ac291 the number of pages was limited to a
> > maximum 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim.
> > However such a small batch size caused a regression in reclaim
> > performance due to many more reclaim start/stop cycles inside
> > memory_reclaim.
>
> You have mentioned that in one of the previous emails but it is good to
> mention what is the source of that overhead for the future reference.

I can add a sentence about the restart cost being amortized over more
pages with a large batch size. It covers things like repeatedly flushing
stats, walking the tree, evaluating protection limits, etc.

> > Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> > nodes and cgroups over which the pages are spread. As such, the bigger
> > the request, the bigger the absolute overreclaim error. Historic
> > in-kernel users of reclaim have used fixed, small sized requests to
> > approach an appropriate reclaim rate over time. When we reclaim a user
> > request of arbitrary size, use decaying batch sizes to manage error
> > while maintaining reasonable throughput.
>
> These numbers are with MGLRU or the default reclaim implementation?

These numbers are for both. root uses the memcg LRU (MGLRU was enabled),
and /uid_0 does not.

> > root - full reclaim       pages/sec   time (sec)
> > pre-0388536ac291      :     68047       10.46
> > post-0388536ac291     :     13742        inf
> > (reclaim-reclaimed)/4 :     67352       10.51
> >
> > /uid_0 - 1G reclaim       pages/sec   time (sec)   overreclaim (MiB)
> > pre-0388536ac291      :    258822        1.12           107.8
> > post-0388536ac291     :    105174        2.49             3.5
> > (reclaim-reclaimed)/4 :    233396        1.12            -7.4
> >
> > /uid_0 - full reclaim     pages/sec   time (sec)
> > pre-0388536ac291      :     72334        7.09
> > post-0388536ac291     :     38105       14.45
> > (reclaim-reclaimed)/4 :     72914        6.96
> >
> > Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
> > Signed-off-by: T.J. Mercier
> > Reviewed-by: Yosry Ahmed
> > Acked-by: Johannes Weiner
> >
> > ---
> > v3: Formatting fixes per Yosry Ahmed and Johannes Weiner. No functional
> > changes.
> > v2: Simplify the request size calculation per Johannes Weiner and Michal Koutný
> >
> >  mm/memcontrol.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 46d8d02114cf..f6ab61128869 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6976,9 +6976,11 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >               if (!nr_retries)
> >                       lru_add_drain_all();
> >
> > +             /* Will converge on zero, but reclaim enforces a minimum */
> > +             unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
>
> This doesn't fit into the existing coding style. I do not think there is
> a strong reason to go against it here.

There's been some back and forth here. You'd prefer to move this to the
top of the while loop, under the declaration of reclaimed? It's farther
from its use there, but it does match the existing style in the file
better.

> > +
> >               reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > -                                       min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > -                                       GFP_KERNEL, reclaim_options);
> > +                                       batch_size, GFP_KERNEL, reclaim_options);
>
> Also with the increased reclaim target do we need something like this?
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f9c854ce6cc..94794cf5ee9f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1889,7 +1889,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>
>                 /* We are about to die and free our memory. Return now. */
>                 if (fatal_signal_pending(current))
> -                       return SWAP_CLUSTER_MAX;
> +                       return sc->nr_to_reclaim;
>         }
>
>         lru_add_drain();
>
> >               if (!reclaimed && !nr_retries--)
> >                       return -EAGAIN;
> > --

This is interesting, but I don't think it's closely related to this
change.
This section looks like it was added to delay OOM kills due to an
apparent lack of reclaim progress when pages are isolated and the direct
reclaimer is scheduled out. A couple of things:

In the context of proactive reclaim, current is not really undergoing
reclaim due to memory pressure; the reclaim is initiated from userspace.
So whether current has a fatal signal pending or not doesn't seem like it
should influence the return value of shrink_inactive_list for some
probably unrelated process. It seems more straightforward to me to return
0, and add another fatal signal pending check to the caller
(shrink_lruvec) to bail out early (dealing with OOM kill avoidance there
if necessary), instead of waiting to accumulate fake SWAP_CLUSTER_MAX
values from shrink_inactive_list.

As far as changing the value goes, SWAP_CLUSTER_MAX puts the final value
of sc->nr_reclaimed pretty close to sc->nr_to_reclaim. Since there's a
loop for each evictable LRU in shrink_lruvec, we could end up with
4 * sc->nr_to_reclaim in sc->nr_reclaimed if we switched to
sc->nr_to_reclaim from SWAP_CLUSTER_MAX... an even bigger lie. So I don't
think we'd want to do that.