From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 30 Oct 2024 14:18:09 -0700
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Usama Arif
Cc: Barry Song <21cnbao@gmail.com>, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song

On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
>
>
>
> On 30/10/2024 21:01, Yosry Ahmed wrote:
> > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
> >>
> >>
> >>
> >> On 30/10/2024 19:51, Yosry Ahmed wrote:
> >>> [..]
> >>>>> My second point about the mitigation is as follows: for a system (or
> >>>>> memcg) under severe memory pressure, especially one without hardware
> >>>>> TLB optimization, is enabling mTHP always the right choice? Since mTHP
> >>>>> operates at a larger granularity, some internal fragmentation is
> >>>>> unavoidable, regardless of optimization. Could the mitigation code help
> >>>>> in automatically tuning this fragmentation?
> >>>>>
> >>>>
> >>>> I agree with the point that always enabling mTHP is not the right thing
> >>>> to do on all platforms. I also think it might be the case that enabling
> >>>> mTHP is a good thing for some workloads, but enabling mTHP swapin along
> >>>> with it might not be.
> >>>>
> >>>> As you said, when you have apps switching between foreground and
> >>>> background on Android, it probably makes sense to have large folio
> >>>> swapping, as you want to bring in all the pages of the background app as
> >>>> quickly as possible, and you also get the TLB optimizations and smaller
> >>>> LRU overhead once all the pages have been brought in.
> >>>> The Linux kernel build test doesn't really benefit from the TLB
> >>>> optimization and smaller LRU overhead, as the pages are probably very
> >>>> short lived. So I think it doesn't show the benefit of large folio
> >>>> swapin properly, and large folio swapin should probably be disabled for
> >>>> this kind of workload, even though mTHP should be enabled.
> >>>>
> >>>> I am not sure that the approach we are trying in this patch is the
> >>>> right way:
> >>>> - This patch makes it a memcg issue, but you could have memcg disabled,
> >>>> and then the mitigation being tried here won't apply.
> >>>
> >>> Is the problem reproducible without memcg? I imagine only if the
> >>> entire system is under memory pressure. I guess we would want the same
> >>> "mitigation" either way.
> >>>
> >> What would be a good open source benchmark/workload to test without
> >> limiting memory in a memcg?
> >> For the kernel build test, I can only get zswap activity to happen if I
> >> build in a cgroup and limit memory.max.
> >
> > You mean a benchmark that puts the entire system under memory
> > pressure? I am not sure, it ultimately depends on the size of memory
> > you have, among other factors.
> >
> > What if you run the kernel build test in a VM? Then you can limit its
> > size like a memcg, although you'd probably need to leave more room
> > because the entire guest OS will also be subject to the same limit.
> >
>
> I had tried this, but the variance in time/zswap numbers was very high,
> much higher than the AMD numbers I posted in reply to Barry, so I found
> it very difficult to make comparisons.

Hmm, yeah, maybe more factors come into play with global memory
pressure. I am honestly not sure how to test this scenario, and I
suspect variance will be high anyway.

We can just try to use whatever technique we use for the memcg limit
though, if possible, right?

>
> >>
> >> I can just run zswap large folio zswapin in production and see, but
> >> that will take me a few days.
> >> tbh, running in prod is a much better test, and if there isn't any
> >> sort of thrashing, then maybe it's not really an issue? I believe Barry
> >> doesn't see an issue on Android phones (but please correct me if I am
> >> wrong), and if there isn't an issue in Meta production as well, that's a
> >> good data point for servers too. And maybe the kernel build in a 4G
> >> memcg is not a good test.
> >
> > If there is a regression in the kernel build, this means some
> > workloads may be affected, even if Meta's prod isn't. I understand
> > that the benchmark is not very representative of real world workloads,
> > but in this instance I think the thrashing problem surfaced by the
> > benchmark is real.
> >
> >>
> >>>> - Instead of this being a large folio swapin issue, is it more of a
> >>>> readahead issue? If we use zswap (without the large folio swapin
> >>>> series) and change the window to 1 in swap_vma_readahead, we might see
> >>>> an improvement in Linux kernel build time when cgroup memory is
> >>>> limited, as readahead would probably cause swap thrashing as well.
> >>>
> >>> I think large folio swapin would make the problem worse anyway. I am
> >>> also not sure if the readahead window adjusts on memory pressure or
> >>> not.
> >>>
> >> The readahead window doesn't look at memory pressure. So maybe the same
> >> thing is being seen here as there would be in swapin_readahead?
> >
> > Maybe readahead is not as aggressive in general as large folio
> > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> > of the window is the smaller of page_cluster (2 or 3) and
> > SWAP_RA_ORDER_CEILING (5).
> Yes, I was seeing 8 pages swapped in (order 3) when testing. So it might
> be similar to enabling 32K mTHP?

Not quite.

>
> >
> > Also, readahead will swap in 4K folios AFAICT, so we don't need a
> > contiguous allocation like large folio swapin. So that could be
> > another factor why readahead may not reproduce the problem.

Because of this ^.
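
[Editor's note] To make the readahead window cap discussed above concrete, here is a minimal standalone sketch, assuming the maximum window order is simply the smaller of page_cluster and SWAP_RA_ORDER_CEILING as described in the thread; max_ra_pages() is a hypothetical helper for illustration, not the kernel's swap_vma_ra_win():

/*
 * Sketch of the readahead window cap discussed above.
 * Assumption: max window order = min(page_cluster, SWAP_RA_ORDER_CEILING).
 * max_ra_pages() is illustrative, not copied from mm/swap_state.c.
 */
#include <stdio.h>

#define SWAP_RA_ORDER_CEILING	5	/* hard cap on the window order */

static unsigned int max_ra_pages(unsigned int page_cluster)
{
	unsigned int order = page_cluster < SWAP_RA_ORDER_CEILING ?
			     page_cluster : SWAP_RA_ORDER_CEILING;

	return 1u << order;	/* window size in 4K pages */
}

int main(void)
{
	/* /proc/sys/vm/page-cluster is typically 2 or 3 by default */
	printf("page_cluster=3 -> at most %u pages per readahead\n",
	       max_ra_pages(3));
	printf("page_cluster=2 -> at most %u pages per readahead\n",
	       max_ra_pages(2));
	return 0;
}

With page_cluster = 3 this caps readahead at 8 pages, roughly the same amount of data as one 32K mTHP swapin, but brought in as individually allocated 4K folios rather than a single contiguous order-3 allocation, which is the distinction drawn in the last two replies above.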