From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 1 Nov 2024 09:19:29 -0700
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: Barry Song <21cnbao@gmail.com>
Cc: Johannes Weiner, Usama Arif, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, "Huang, Ying", Kairui Song, Ryan Roberts, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song

On Thu, Oct 31, 2024 at 2:00 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Nov 1, 2024 at 5:00 AM Yosry Ahmed wrote:
> >
> > On Thu, Oct 31, 2024 at 8:38 AM Johannes Weiner wrote:
> > >
> > > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> > > > On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
> > > > > On 30/10/2024 21:01, Yosry Ahmed wrote:
> > > > > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
> > > > > >>>> I am not sure that the approach we are trying in this patch is the right way:
> > > > > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> > > > > >>>> then the mitigation being tried here won't apply.
> > > > > >>>
> > > > > >>> Is the problem reproducible without memcg? I imagine only if the
> > > > > >>> entire system is under memory pressure. I guess we would want the same
> > > > > >>> "mitigation" either way.
> > > > > >>>
> > > > > >> What would be a good open source benchmark/workload to test without limiting memory
> > > > > >> in memcg?
> > > > > >> For the kernel build test, I can only get zswap activity to happen if I build
> > > > > >> in cgroup and limit memory.max.
> > > > > >
> > > > > > You mean a benchmark that puts the entire system under memory
> > > > > > pressure? I am not sure, it ultimately depends on the size of memory
> > > > > > you have, among other factors.
> > > > > >
> > > > > > What if you run the kernel build test in a VM? Then you can limit its
> > > > > > size like a memcg, although you'd probably need to leave more room
> > > > > > because the entire guest OS will also be subject to the same limit.
> > > > > >
> > > > >
> > > > > I had tried this, but the variance in time/zswap numbers was very high,
> > > > > much higher than the AMD numbers I posted in reply to Barry, so I found
> > > > > it very difficult to make comparisons.
> > > >
> > > > Hmm, yeah, maybe more factors come into play with global memory
> > > > pressure. I am honestly not sure how to test this scenario, and I
> > > > suspect variance will be high anyway.
> > > >
> > > > We can just try to use whatever technique we use for the memcg limit
> > > > though, if possible, right?
> > >
> > > You can boot a physical machine with mem=1G on the command line, which
> > > restricts the physical range of memory that will be initialized.
> > > Double-check /proc/meminfo after boot, because part of that physical
> > > range might not be usable RAM.
> > >
> > > I do this quite often to test physical memory pressure with workloads
> > > that don't scale up easily, like kernel builds.
> > >
> > > > > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> > > > > >>>> issue?
> > > > > >>>> If we zswap (without the large folio swapin series) and change the window
> > > > > >>>> to 1 in swap_vma_readahead, we might see an improvement in Linux kernel build time
> > > > > >>>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> > > > > >>>> well.
> > >
> > > +1
> > >
> > > I also think there is too much focus on cgroup alone. The bigger issue
> > > seems to be how much optimistic volume we swap in when we're under
> > > pressure already. This applies to both large folios and readahead, and
> > > to both global memory availability and cgroup limits.
> >
> > Agreed, although the characteristics of large folios and readahead are
> > different. But yeah, different flavors of the same problem.
> >
> > > It happens to manifest with THP in cgroups because that's what you
> > > guys are testing. But IMO, any solution to this problem should
> > > consider the wider scope.
> >
> > +1, and I really think this should be addressed separately, not just by
> > relying on large block compression/decompression to offset the cost. It's
> > probably not just a zswap/zram problem anyway, it just happens to be
> > what we support large folio swapin for.
>
> Agreed, these are two separate issues and both should be investigated,
> though 2 can offset the cost of 1:
> 1. swap thrashing
> 2. large block compression/decompression
>
> For point 1, we likely want to investigate the following:
>
> 1. Whether we can see the same thrashing if we always perform readahead
> (rapidly filling the memcg to full again after reclamation).
>
> 2. Whether there are any issues with balancing file and anon memory
> reclamation.
>
> The 'refault feedback loop' in MGLRU compares refault rates between anon and
> file pages to decide which type should be prioritized for reclamation:
>
> type = get_type_to_scan(lruvec, swappiness, &tier);
>
> static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx)
> {
>         ...
>         read_ctrl_pos(lruvec, LRU_GEN_ANON, 0, gain[LRU_GEN_ANON], &sp);
>         read_ctrl_pos(lruvec, LRU_GEN_FILE, 0, gain[LRU_GEN_FILE], &pv);
>         type = positive_ctrl_err(&sp, &pv);
>
>         read_ctrl_pos(lruvec, !type, 0, gain[!type], &sp);
>         for (tier = 1; tier < MAX_NR_TIERS; tier++) {
>                 read_ctrl_pos(lruvec, type, tier, gain[type], &pv);
>                 if (!positive_ctrl_err(&sp, &pv))
>                         break;
>         }
>
>         *tier_idx = tier - 1;
>         return type;
> }
>
> In this case, we may want to investigate whether reclamation is primarily
> targeting anonymous memory due to potential errors in the statistics path
> after mTHP is involved.
>
> 3. Determine whether this is a memcg-specific issue by setting mem=1GB and
> running the same test on the global system.
>
> Yosry, Johannes, Usama,
> Is there anything else that might interest us?
>
> I'll get back to you after completing the investigation mentioned above.

Thanks for looking into this.

Perhaps a naive question, but is this only related to swap faults? Can
the same scenario happen with other types of faults allocating large
folios (e.g. faulting in a file page, or a new anon allocation)? Do
swap faults use a different policy for determining the folio order, or
is it just that swap faults are naturally more correlated to memory
pressure, so that's how the issue was surfaced?
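
(Aside: a minimal sketch of the readahead experiment Usama suggests above,
forcing the VMA readahead window to 1. This assumes a kernel where
swap_vma_readahead() still sizes its window through swap_ra_info() and
ra_info.win; that code has moved around between versions, so treat this as
illustrative rather than an actual patch:

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ ... @@ static struct folio *swap_vma_readahead(...)
 	swap_ra_info(vmf, &ra_info);
+	/*
+	 * Experiment only: pretend the heuristic always chose a window
+	 * of 1, so we fault in just the target page and skip all
+	 * optimistic readahead.
+	 */
+	ra_info.win = 1;
 	if (ra_info.win == 1)
 		goto skip;

Roughly the same effect can be had without patching the kernel by setting
/proc/sys/vm/page-cluster to 0, which caps the swap readahead window at a
single page.)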