From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Tue, 5 Nov 2024 14:13:35 +1300
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing
 for nearly full memcg
To: "Huang, Ying"
Cc: Usama Arif, Johannes Weiner, Yosry Ahmed, akpm@linux-foundation.org,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song,
 Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li,
 Kairui Song, Ryan Roberts, Michal Hocko, Roman Gushchin,
 Shakeel Butt, Muchun Song
In-Reply-To: <87ses67b0b.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20241027001444.3233-1-21cnbao@gmail.com>
 <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com>
 <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com>
 <20241031153830.GA799903@cmpxchg.org>
 <87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <3f684183-c6df-4f2f-9e33-91ce43c791eb@gmail.com>
 <87ses67b0b.fsf@yhuang6-desk2.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
On Tue, Nov 5, 2024 at 2:01 PM Huang, Ying wrote:
>
> Usama Arif writes:
>
> > On 04/11/2024 06:42, Huang, Ying wrote:
> >> Johannes Weiner writes:
> >>
> >>> On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> >>>> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
> >>>>> On 30/10/2024 21:01, Yosry Ahmed wrote:
> >>>>>> On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
> >>>>>>>>> I am not sure that the approach we are trying in this patch is the right way:
> >>>>>>>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> >>>>>>>>> then the mitigation being tried here won't apply.
> >>>>>>>>
> >>>>>>>> Is the problem reproducible without memcg? I imagine only if the
> >>>>>>>> entire system is under memory pressure. I guess we would want the same
> >>>>>>>> "mitigation" either way.
> >>>>>>>>
> >>>>>>> What would be a good open source benchmark/workload to test without limiting
> >>>>>>> memory in memcg?
> >>>>>>> For the kernel build test, I can only get zswap activity to happen if I build
> >>>>>>> in a cgroup and limit memory.max.
> >>>>>>
> >>>>>> You mean a benchmark that puts the entire system under memory
> >>>>>> pressure? I am not sure, it ultimately depends on the size of memory
> >>>>>> you have, among other factors.
> >>>>>>
> >>>>>> What if you run the kernel build test in a VM? Then you can limit its
> >>>>>> size like a memcg, although you'd probably need to leave more room
> >>>>>> because the entire guest OS will also be subject to the same limit.
> >>>>>>
> >>>>>
> >>>>> I had tried this, but the variance in time/zswap numbers was very high,
> >>>>> much higher than the AMD numbers I posted in reply to Barry. So I found
> >>>>> it very difficult to make comparisons.
> >>>>
> >>>> Hmm, yeah, maybe more factors come into play with global memory
> >>>> pressure. I am honestly not sure how to test this scenario, and I
> >>>> suspect variance will be high anyway.
> >>>>
> >>>> We can just try to use whatever technique we use for the memcg limit
> >>>> though, if possible, right?
> >>>
> >>> You can boot a physical machine with mem=1G on the command line, which
> >>> restricts the physical range of memory that will be initialized.
> >>> Double-check /proc/meminfo after boot, because part of that physical
> >>> range might not be usable RAM.
> >>>
> >>> I do this quite often to test physical memory pressure with workloads
> >>> that don't scale up easily, like kernel builds.
> >>>
> >>>>>>>>> - Instead of this being a large folio swap-in issue, is it more of a readahead
> >>>>>>>>> issue? If we zswap (without the large folio swap-in series) and change the
> >>>>>>>>> window to 1 in swap_vma_readahead, we might see an improvement in Linux kernel
> >>>>>>>>> build time when cgroup memory is limited, as readahead would probably cause
> >>>>>>>>> swap thrashing as well.
> >>>
> >>> +1
> >>>
> >>> I also think there is too much focus on cgroup alone. The bigger issue
> >>> seems to be how much optimistic volume we swap in when we're under
> >>> pressure already. This applies to large folios and readahead; global
> >>> memory availability and cgroup limits.
> >>
> >> The current swap readahead logic is something like:
> >>
> >> 1. try to readahead some pages for a sequential access pattern, and mark
> >> them as readahead
> >>
> >> 2. if these readahead pages get accessed before being swapped out again,
> >> increase the 'hits' counter
> >>
> >> 3. for the next swap-in, try to readahead 'hits' pages and clear 'hits'.
> >>
> >> So, if there's heavy memory pressure, the readahead pages will not be
> >> accessed before being swapped out again (in 2 above), and the readahead
> >> window will be minimal.
> >>
> >> IMHO, mTHP swap-in is a kind of swap readahead in effect. That is, in
> >> addition to the pages actually accessed, the adjacent pages are
> >> swapped in (swap readahead) too. If these readahead pages are not
> >> accessed before being swapped out again, the system runs into more severe
> >> thrashing, because we lack the swap readahead window scaling
> >> mechanism above. And this is why I previously suggested combining the swap
> >> readahead mechanism and mTHP swap-in by default. That is, when the
> >> kernel swaps in a page, it checks the current swap readahead window and
> >> decides the mTHP order according to the window size. So, if memory
> >> pressure is heavy enough that the nearby pages will not be accessed before
> >> being swapped out again, the mTHP swap-in order can be adjusted
> >> automatically.
> >
> > This is a good idea to do, but I think the issue is that readahead
> > is a folio flag and not a page flag, so it only works when the folio size is 1.
> >
> > In the swapin_readahead swapcache path, the current implementation decides
> > the ra_window based on hits, which are incremented in swap_cache_get_folio
> > if the folio has not been fetched from the swapcache before.
> > The problem would be that we need information on how many distinct pages in
> > a swapped-in large folio have been accessed to decide the
> > hits/window size, which I don't think is possible: once the entire large
> > folio has been swapped in, we won't get a fault.
>
> To do that, we need to move the readahead flag from per-folio to per-page.
> And we need to map only the accessed page of the folio in the page fault
> handler. This may impact performance, so we may only do that for
> sampled folios, for example, every 100 folios.

I'm not entirely sure there's a chance to gain traction on this, as the current
trend clearly leans toward moving flags from page to folio, not from folio to
page :-)

>
> >>
> >>> It happens to manifest with THP in cgroups because that's what you
> >>> guys are testing. But IMO, any solution to this problem should
> >>> consider the wider scope.
> >>>
> >>>>>>>> I think large folio swap-in would make the problem worse anyway. I am
> >>>>>>>> also not sure if the readahead window adjusts on memory pressure or
> >>>>>>>> not.
> >>>>>>>>
> >>>>>>> The readahead window doesn't look at memory pressure. So maybe the same
> >>>>>>> thing is being seen here as there would be in swapin_readahead?
> >>>>>>
> >>>>>> Maybe readahead is not as aggressive in general as large folio
> >>>>>> swap-ins? Looking at swap_vma_ra_win(), it seems like the maximum order
> >>>>>> of the window is the smaller of page_cluster (2 or 3) and
> >>>>>> SWAP_RA_ORDER_CEILING (5).
> >>>>> Yes, I was seeing 8-page swap-ins (order 3) when testing. So that might
> >>>>> be similar to enabling 32K mTHP?
> >>>>
> >>>> Not quite.
> >>>
> >>> Actually, I would expect it to be...
> >>
> >> Me too.
> >>
> >>>>>> Also, readahead will swap in 4K folios AFAICT, so we don't need a
> >>>>>> contiguous allocation like large folio swap-in does. So that could be
> >>>>>> another factor in why readahead may not reproduce the problem.
> >>>>
> >>>> Because of this ^.
> >>>
> >>> ...this matters for the physical allocation, which might require more
> >>> reclaim and compaction to produce the 32K. But an earlier version of
> >>> Barry's patch did the cgroup margin fallback after the THP was already
> >>> physically allocated, and it still helped.
> >>>
> >>> So the issue in this test scenario seems to be mostly about cgroup
> >>> volume. And then 8 4K charges should be equivalent to a single 32K
> >>> charge when it comes to cgroup pressure.
>
> --
> Best Regards,
> Huang, Ying