From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 30 Oct 2024 14:01:11 -0700
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Usama Arif
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
In-Reply-To: <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
References: <20241027001444.3233-1-21cnbao@gmail.com> <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com> <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
>
>
>
> On 30/10/2024 19:51, Yosry Ahmed wrote:
> > [..]
> >>> My second point about the mitigation is as follows: for a system (or
> >>> memcg) under severe memory pressure, especially one without hardware TLB
> >>> optimization, is enabling mTHP always the right choice? Since mTHP operates at
> >>> a larger granularity, some internal fragmentation is unavoidable, regardless
> >>> of optimization. Could the mitigation code help in automatically tuning
> >>> this fragmentation?
> >>>
> >>
> >> I agree with the point that always enabling mTHP is not the right thing to do
> >> on all platforms. I also think it might be the case that enabling mTHP
> >> might be a good thing for some workloads, but enabling mTHP swapin along with
> >> it might not.
> >>
> >> As you said, when you have apps switching between foreground and background
> >> in Android, it probably makes sense to have large folio swapping, as you
> >> want to bring in all the pages from the background app as quickly as possible,
> >> and you also get all the TLB optimizations and smaller LRU overhead after
> >> you have brought in all the pages.
> >> The Linux kernel build test doesn't really benefit from the TLB optimization
> >> and smaller LRU overhead, as the pages are probably very short-lived. So I
> >> think it doesn't show the benefit of large folio swapin properly, and
> >> large folio swapin should probably be disabled for this kind of workload,
> >> even though mTHP should be enabled.
> >>
> >> I am not sure that the approach we are trying in this patch is the right way:
> >> - This patch makes it a memcg issue, but you could have memcg disabled and
> >> then the mitigation being tried here won't apply.
> >
> > Is the problem reproducible without memcg? I imagine only if the
> > entire system is under memory pressure. I guess we would want the same
> > "mitigation" either way.
>
> What would be a good open source benchmark/workload to test without limiting memory
> in memcg?
> For the kernel build test, I can only get zswap activity to happen if I build
> in a cgroup and limit memory.max.

You mean a benchmark that puts the entire system under memory
pressure? I am not sure; it ultimately depends on the size of memory
you have, among other factors.

What if you run the kernel build test in a VM? Then you can limit its
size like a memcg, although you'd probably need to leave more room
because the entire guest OS will also be subject to the same limit.
>
> I can just run zswap large folio zswapin in production and see, but that will take me a few
> days. Tbh, running in prod is a much better test, and if there isn't any sort of thrashing,
> then maybe it's not really an issue? I believe Barry doesn't see an issue on Android
> phones (but please correct me if I am wrong), and if there isn't an issue in Meta
> production as well, that's a good data point for servers too. And maybe
> kernel build in a 4G memcg is not a good test.

If there is a regression in the kernel build, this means some
workloads may be affected, even if Meta's prod isn't. I understand
that the benchmark is not very representative of real world workloads,
but in this instance I think the thrashing problem surfaced by the
benchmark is real.

>
> >> - Instead of this being a large folio swapin issue, is it more of a readahead
> >> issue? If we zswap (without the large folio swapin series) and change the window
> >> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
> >> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> >> well.
> >
> > I think large folio swapin would make the problem worse anyway. I am
> > also not sure if the readahead window adjusts on memory pressure or
> > not.
>
> The readahead window doesn't look at memory pressure. So maybe the same thing is being
> seen here as there would be in swapin_readahead?

Maybe readahead is not as aggressive in general as large folio
swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
of the window is the smaller of page_cluster (2 or 3) and
SWAP_RA_ORDER_CEILING (5).

Also, readahead will swap in 4K folios AFAICT, so we don't need a
contiguous allocation like large folio swapin. So that could be
another factor why readahead may not reproduce the problem.

> Maybe if we check kernel build test
> performance in a 4G memcg with the below diff, it might get better?
I think you can use the page_cluster tunable to do this at runtime.

>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 4669f29cf555..9e196e1e6885 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -809,7 +809,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>         pgoff_t ilx;
>         bool page_allocated;
>
> -       win = swap_vma_ra_win(vmf, &start, &end);
> +       win = 1;
>         if (win == 1)
>                 goto skip;
>