From: Barry Song <21cnbao@gmail.com>
Date: Mon, 4 Nov 2024 21:06:46 +1300
Subject: Re: [PATCH RFC] mm: mitigate large folios usage and swap thrashing for nearly full memcg
To: "Huang, Ying"
Cc: Johannes Weiner, Yosry Ahmed, Usama Arif, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Barry Song, Kanchana P Sridhar, David Hildenbrand, Baolin Wang, Chris Li, Kairui Song, Ryan Roberts, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song
In-Reply-To: <87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20241027001444.3233-1-21cnbao@gmail.com> <33c5d5ca-7bc4-49dc-b1c7-39f814962ae0@gmail.com> <852211c6-0b55-4bdd-8799-90e1f0c002c1@gmail.com> <20241031153830.GA799903@cmpxchg.org> <87a5ef8ppq.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, Nov 4, 2024 at 7:46 PM Huang, Ying wrote:
>
> Johannes Weiner writes:
>
> > On Wed, Oct 30, 2024 at 02:18:09PM -0700, Yosry Ahmed wrote:
> >> On Wed, Oct 30, 2024 at 2:13 PM Usama Arif wrote:
> >> > On 30/10/2024 21:01, Yosry Ahmed wrote:
> >> > > On Wed, Oct 30, 2024 at 1:25 PM Usama Arif wrote:
> >> > >>>> I am not sure that the approach we are trying in this patch is the right way:
> >> > >>>> - This patch makes it a memcg issue, but you could have memcg disabled and
> >> > >>>> then the mitigation being tried here won't apply.
> >> > >>>
> >> > >>> Is the problem reproducible without memcg? I imagine only if the
> >> > >>> entire system is under memory pressure. I guess we would want the same
> >> > >>> "mitigation" either way.
> >> > >>>
> >> > >> What would be a good open source benchmark/workload to test without limiting
> >> > >> memory in memcg?
> >> > >> For the kernel build test, I can only get zswap activity to happen if I build
> >> > >> in a cgroup and limit memory.max.
> >> > >
> >> > > You mean a benchmark that puts the entire system under memory
> >> > > pressure? I am not sure, it ultimately depends on the size of memory
> >> > > you have, among other factors.
> >> > >
> >> > > What if you run the kernel build test in a VM? Then you can limit its
> >> > > size like a memcg, although you'd probably need to leave more room
> >> > > because the entire guest OS will also be subject to the same limit.
> >> > >
> >> >
> >> > I had tried this, but the variance in time/zswap numbers was very high.
> >> > Much higher than the AMD numbers I posted in reply to Barry. So I found
> >> > it very difficult to make comparisons.
> >>
> >> Hmm yeah maybe more factors come into play with global memory
> >> pressure. I am honestly not sure how to test this scenario, and I
> >> suspect variance will be high anyway.
> >>
> >> We can just try to use whatever technique we use for the memcg limit
> >> though, if possible, right?
> >
> > You can boot a physical machine with mem=1G on the command line, which
> > restricts the physical range of memory that will be initialized.
> > Double check /proc/meminfo after boot, because part of that physical
> > range might not be usable RAM.
> >
> > I do this quite often to test physical memory pressure with workloads
> > that don't scale up easily, like kernel builds.
> >
> >> > >>>> - Instead of this being a large folio swapin issue, is it more of a readahead
> >> > >>>> issue? If we zswap (without the large folio swapin series) and change the window
> >> > >>>> to 1 in swap_vma_readahead, we might see an improvement in linux kernel build time
> >> > >>>> when cgroup memory is limited, as readahead would probably cause swap thrashing as
> >> > >>>> well.
> >
> > +1
> >
> > I also think there is too much focus on cgroup alone. The bigger issue
> > seems to be how much optimistic volume we swap in when we're under
> > pressure already. This applies to large folios and readahead; global
> > memory availability and cgroup limits.
>
> The current swap readahead logic is something like:
>
> 1. try to readahead some pages for a sequential access pattern, and mark
> them as readahead
>
> 2. if these readahead pages get accessed before being swapped out again,
> increase the 'hits' counter
>
> 3. for the next swap-in, readahead 'hits' pages and clear 'hits'.
>
> So, if there is heavy memory pressure, the readahead pages will not be
> accessed before being swapped out again (in 2 above), and readahead
> will be minimal.
>
> IMHO, mTHP swap-in is a kind of swap readahead in effect. That is, in
> addition to the pages actually accessed, the adjacent pages are
> swapped in (readahead) too. If these readahead pages are not accessed
> before being swapped out again, the system runs into more severe
> thrashing, because mTHP swap-in lacks the readahead window scaling
> mechanism above. And this is why I previously suggested combining the
> swap readahead mechanism with mTHP swap-in by default. That is, when
> the kernel swaps in a page, it checks the current swap readahead window
> and decides the mTHP order according to the window size. So, if memory
> pressure is heavy enough that the nearby pages will not be accessed
> before being swapped out again, the mTHP swap-in order is adjusted down
> automatically.
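
As an aside, the feedback loop you describe could be modelled roughly as
in the user-space sketch below. This is only an illustration of the idea;
the helper names and the order-selection policy are my own assumptions,
not the existing kernel code. Under heavy pressure 'hits' stays at 0, the
window collapses to one page, and the swap-in order falls back to 0
(a small folio) automatically.

#include <stdio.h>

#define SWAP_RA_ORDER_CEILING	5	/* cap also mentioned later in this thread */

static unsigned int hits;	/* readahead pages re-referenced before being reclaimed */

/* Step 2 above: a readahead page was accessed before being swapped out again. */
static void readahead_page_accessed(void)
{
	hits++;
}

/* Step 3 above: size the next readahead window from 'hits', then clear it. */
static unsigned int next_readahead_window(void)
{
	unsigned int win = hits ? hits : 1;

	hits = 0;
	return win;
}

/* Hypothetical policy: pick the largest mTHP order that still fits the window. */
static unsigned int mthp_swapin_order(unsigned int win)
{
	unsigned int order = 0;

	while ((2u << order) <= win && order < SWAP_RA_ORDER_CEILING)
		order++;
	return order;
}

int main(void)
{
	unsigned int win, order;
	int i;

	/* Pretend 6 of the previously readahead pages turned out to be useful. */
	for (i = 0; i < 6; i++)
		readahead_page_accessed();

	win = next_readahead_window();
	order = mthp_swapin_order(win);
	printf("window=%u pages -> mTHP order=%u (%u pages)\n",
	       win, order, 1u << order);
	return 0;
}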
Such a window-scaled approach might help reduce memory reclamation
thrashing for a kernel build workload running in a memory-limited memcg,
which might not benefit that much from mTHP anyway. But the mechanism has
clear disadvantages:

1. Loss of large folios: For example, if you're using app A and then switch
to app B, a significant portion (around 60%) of A's memory might be swapped
out as it moves to the background. When you switch back to app A, a large
portion of the memory originally in mTHP could be lost by being swapped back
in as small folios. Essentially, systems with a user interface behave quite
differently from a kernel build workload running under a memory-limited
memcg, as they keep switching applications between the foreground and
background.

2. Fragmentation of swap slots: This fragmentation increases the likelihood
of mTHP swapout failures, as it makes it harder to maintain contiguous
blocks in swap.

3. It prevents the implementation of large-block compression and
decompression, which could achieve higher compression ratios and
significantly lower CPU consumption, because small-folio swap-ins would
still remain the predominant path.

Memory-limited systems often face challenges with larger page sizes. Even on
systems that support several base page sizes, such as 4KB, 16KB, and 64KB on
ARM64, using 16KB or 64KB as the base page size is not always the best
choice. With mTHP, we already have per-size settings. For this kernel build
workload operating within a limited memcg, enabling only the 16kB size is
likely the best option for optimizing performance and minimizing thrashing:

/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled

We could focus on mTHP and seek strategies to minimize thrashing when free
memory is severely limited :-)
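
To make the per-size control concrete: the following user-space sketch
(my own illustration, not part of the patch under discussion) disables
every mTHP size and then enables only the 16kB size through the per-size
sysfs knobs mentioned above. It needs root and keeps error handling
minimal.

#include <glob.h>
#include <stdio.h>

/* Write a value such as "always" or "never" to one sysfs knob. */
static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return -1;
	}
	fputs(val, f);
	fclose(f);
	return 0;
}

int main(void)
{
	glob_t g;
	size_t i;

	/* Disable every per-size mTHP control first. */
	if (glob("/sys/kernel/mm/transparent_hugepage/hugepages-*kB/enabled",
		 0, NULL, &g) == 0) {
		for (i = 0; i < g.gl_pathc; i++)
			write_str(g.gl_pathv[i], "never");
		globfree(&g);
	}

	/* Then enable only the 16kB size. */
	return write_str("/sys/kernel/mm/transparent_hugepage/hugepages-16kB/enabled",
			 "always") ? 1 : 0;
}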
>
> > It happens to manifest with THP in cgroups because that's what you
> > guys are testing. But IMO, any solution to this problem should
> > consider the wider scope.
> >
> >> > >>> I think large folio swapin would make the problem worse anyway. I am
> >> > >>> also not sure if the readahead window adjusts on memory pressure or
> >> > >>> not.
> >> > >>>
> >> > >> readahead window doesnt look at memory pressure. So maybe the same thing is being
> >> > >> seen here as there would be in swapin_readahead?
> >> > >
> >> > > Maybe readahead is not as aggressive in general as large folio
> >> > > swapins? Looking at swap_vma_ra_win(), it seems like the maximum order
> >> > > of the window is the smaller of page_cluster (2 or 3) and
> >> > > SWAP_RA_ORDER_CEILING (5).
> >> > Yes, I was seeing 8 pages swapin (order 3) when testing. So might
> >> > be similar to enabling 32K mTHP?
> >>
> >> Not quite.
> >
> > Actually, I would expect it to be...
>
> Me too.
>
> >> > > Also readahead will swapin 4k folios AFAICT, so we don't need a
> >> > > contiguous allocation like large folio swapin. So that could be
> >> > > another factor why readahead may not reproduce the problem.
> >>
> >> Because of this ^.
> >
> > ...this matters for the physical allocation, which might require more
> > reclaim and compaction to produce the 32k. But an earlier version of
> > Barry's patch did the cgroup margin fallback after the THP was already
> > physically allocated, and it still helped.
> >
> > So the issue in this test scenario seems to be mostly about cgroup
> > volume. And then 8 4k charges should be equivalent to a singular 32k
> > charge when it comes to cgroup pressure.
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry