From: "T.J. Mercier" <tjmercier@google.com>
Date: Mon, 5 Feb 2024 12:47:47 -0800
Subject: Re: [PATCH v3] mm: memcg: Use larger batches for proactive reclaim
To: Michal Hocko
Cc: Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
 Andrew Morton, Efly Young, android-mm@google.com, yuzhao@google.com,
 mkoutny@suse.com, Yosry Ahmed, cgroups@vger.kernel.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240202233855.1236422-1-tjmercier@google.com>
On Mon, Feb 5, 2024 at 12:36 PM Michal Hocko wrote:
>
> On Mon 05-02-24 12:26:10, T.J. Mercier wrote:
> > On Mon, Feb 5, 2024 at 11:40 AM Michal Hocko wrote:
> > >
> > > On Mon 05-02-24 11:29:49, T.J. Mercier wrote:
> > > > On Mon, Feb 5, 2024 at 2:40 AM Michal Hocko wrote:
> > > > >
> > > > > On Fri 02-02-24 23:38:54, T.J.
> > > > > Mercier wrote:
> > > > > > Before 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive
> > > > > > reclaim") we passed the number of pages for the reclaim request directly
> > > > > > to try_to_free_mem_cgroup_pages, which could lead to significant
> > > > > > overreclaim. After 0388536ac291 the number of pages was limited to a
> > > > > > maximum of 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim.
> > > > > > However such a small batch size caused a regression in reclaim
> > > > > > performance due to many more reclaim start/stop cycles inside
> > > > > > memory_reclaim.
> > > > >
> > > > > You have mentioned that in one of the previous emails but it is good to
> > > > > mention what the source of that overhead is for future reference.
> > > >
> > > > I can add a sentence about the restart cost being amortized over more
> > > > pages with a large batch size. It covers things like repeatedly
> > > > flushing stats, walking the tree, evaluating protection limits, etc.
> > > >
> > > > > > Reclaim tries to balance nr_to_reclaim fidelity with fairness across
> > > > > > nodes and cgroups over which the pages are spread. As such, the bigger
> > > > > > the request, the bigger the absolute overreclaim error. Historic
> > > > > > in-kernel users of reclaim have used fixed, small sized requests to
> > > > > > approach an appropriate reclaim rate over time. When we reclaim a user
> > > > > > request of arbitrary size, use decaying batch sizes to manage error while
> > > > > > maintaining reasonable throughput.
> > > > >
> > > > > Are these numbers with MGLRU or the default reclaim implementation?
> > > >
> > > > These numbers are for both. root uses the memcg LRU (MGLRU was
> > > > enabled), and /uid_0 does not.
> > >
> > > Thanks, it would be nice to outline that in the changelog.
> >
> > Ok, I'll update the table below for each case.
> > > > > > root - full reclaim       pages/sec   time (sec)
> > > > > > pre-0388536ac291          : 68047     10.46
> > > > > > post-0388536ac291         : 13742     inf
> > > > > > (reclaim-reclaimed)/4     : 67352     10.51
> > > > > >
> > > > > > /uid_0 - 1G reclaim       pages/sec   time (sec)   overreclaim (MiB)
> > > > > > pre-0388536ac291          : 258822    1.12         107.8
> > > > > > post-0388536ac291         : 105174    2.49         3.5
> > > > > > (reclaim-reclaimed)/4     : 233396    1.12         -7.4
> > > > > >
> > > > > > /uid_0 - full reclaim     pages/sec   time (sec)
> > > > > > pre-0388536ac291          : 72334     7.09
> > > > > > post-0388536ac291         : 38105     14.45
> > > > > > (reclaim-reclaimed)/4     : 72914     6.96
> > > > > >
> > > > > > Fixes: 0388536ac291 ("mm:vmscan: fix inaccurate reclaim during proactive reclaim")
> > > > > > Signed-off-by: T.J. Mercier
> > > > > > Reviewed-by: Yosry Ahmed
> > > > > > Acked-by: Johannes Weiner
> > > > > >
> > > > > > ---
> > > > > > v3: Formatting fixes per Yosry Ahmed and Johannes Weiner. No functional
> > > > > > changes.
> > > > > > v2: Simplify the request size calculation per Johannes Weiner and Michal Koutný
> > > > > >
> > > > > >  mm/memcontrol.c | 6 ++++--
> > > > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 46d8d02114cf..f6ab61128869 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -6976,9 +6976,11 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > > > > >  		if (!nr_retries)
> > > > > >  			lru_add_drain_all();
> > > > > >
> > > > > > +		/* Will converge on zero, but reclaim enforces a minimum */
> > > > > > +		unsigned long batch_size = (nr_to_reclaim - nr_reclaimed) / 4;
> > > > >
> > > > > This doesn't fit into the existing coding style. I do not think there is
> > > > > a strong reason to go against it here.
> > > >
> > > > There's been some back and forth here.
> > > > You'd prefer to move this to
> > > > the top of the while loop, under the declaration of reclaimed? It's
> > > > farther from its use there, but it does match the existing style in
> > > > the file better.
> > >
> > > This is not something I deeply care about, but generally it is better to
> > > not mix styles unless that is a clear win. If you want to save one LOC
> > > you can just move it up - just a couple of lines up - or you can keep the
> > > definition closer and have a separate declaration.
> >
> > I find it nicer to have to search as little as possible for both the
> > declaration (type) and definition, but I am not attached to it either
> > and it's not worth annoying anyone over here. Let's move it up like
> > Yosry suggested initially.
> >
> > > > > > +
> > > > > >  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > > > > > -					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > > > > -					GFP_KERNEL, reclaim_options);
> > > > > > +					batch_size, GFP_KERNEL, reclaim_options);
> > > > >
> > > > > Also, with the increased reclaim target do we need something like this?
> > > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 4f9c854ce6cc..94794cf5ee9f 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -1889,7 +1889,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > > > >
> > > > >  		/* We are about to die and free our memory. Return now. */
> > > > >  		if (fatal_signal_pending(current))
> > > > > -			return SWAP_CLUSTER_MAX;
> > > > > +			return sc->nr_to_reclaim;
> > > > >  	}
> > > > >
> > > > >  	lru_add_drain();
> > > > >
> > > > > >  		if (!reclaimed && !nr_retries--)
> > > > > >  			return -EAGAIN;
> > > > > > --
> > > >
> > > > This is interesting, but I don't think it's closely related to this
> > > > change. This section looks like it was added to delay OOM kills due to
> > > > apparent lack of reclaim progress when pages are isolated and the
> > > > direct reclaimer is scheduled out. A couple of things:
> > > >
> > > > In the context of proactive reclaim, current is not really undergoing
> > > > reclaim due to memory pressure. It's initiated from userspace. So
> > > > whether it has a fatal signal pending or not doesn't seem like it
> > > > should influence the return value of shrink_inactive_list for some
> > > > probably unrelated process. It seems more straightforward to me to
> > > > return 0, and add another fatal signal pending check to the caller
> > > > (shrink_lruvec) to bail out early (dealing with OOM kill avoidance
> > > > there if necessary) instead of waiting to accumulate fake
> > > > SWAP_CLUSTER_MAX values from shrink_inactive_list.
> > >
> > > The point of this code is to bail out early if the caller has fatal
> > > signals pending. That could be SIGTERM sent to the process performing
> > > the reclaim for whatever reason. The bailout is tuned for
> > > SWAP_CLUSTER_MAX as you can see, and your patch is increasing the
> > > reclaim target, which means that the bailout wouldn't work properly:
> > > you wouldn't get any useful work done, but you wouldn't really bail
> > > out either.
> >
> > It's increasing to 1/4 of what it was 6 months ago before 0388536ac291
> > ("mm:vmscan: fix inaccurate reclaim during proactive reclaim") and
> > this hasn't changed since then, so if anything the bailout should
> > happen quicker than originally tuned for.
>
> Yes, this wasn't handled properly back then either.
>
> > > > As far as changing the value, SWAP_CLUSTER_MAX puts the final value of
> > > > sc->nr_reclaimed pretty close to sc->nr_to_reclaim. Since there's a
> > > > loop for each evictable lru in shrink_lruvec, we could end up with 4 *
> > > > sc->nr_to_reclaim in sc->nr_reclaimed if we switched to
> > > > sc->nr_to_reclaim from SWAP_CLUSTER_MAX... an even bigger lie.
> > > > So I don't think we'd want to do that.
> > >
> > > The actual number returned from the reclaim is not really important
> > > because memory_reclaim would break out of the loop and userspace would
> > > never see the result.
> >
> > This makes sense, but it makes me uneasy. I can't point to anywhere
> > this would cause a problem currently (except maybe a super unlikely
> > overflow of nr_reclaimed), but it feels like a setup for future
> > unintended consequences.
>
> Think of something like
> 	timeout $TIMEOUT echo $TARGET > $MEMCG_PATH/memory.reclaim
> where timeout acts as a stopgap if the reclaim cannot finish in
> TIMEOUT.

Yeah, I get the desired behavior, but using sc->nr_reclaimed to achieve it
is what's bothering me. It's already wired up that way though, so if you
want to make this change now then I can try to test for the difference
using really large reclaim targets.