From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id EBEAFC25B10
	for <linux-mm@archiver.kernel.org>; Mon, 13 May 2024 09:24:31 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 431FA6B0261; Mon, 13 May 2024 05:24:31 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 3E25B6B0266; Mon, 13 May 2024 05:24:31 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 2A8786B0269; Mon, 13 May 2024 05:24:31 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11])
	by kanga.kvack.org (Postfix) with ESMTP id 0CA2A6B0261
	for <linux-mm@kvack.org>; Mon, 13 May 2024 05:24:31 -0400 (EDT)
Received: from smtpin18.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay06.hostedemail.com (Postfix) with ESMTP id AC5D0A1745
	for <linux-mm@kvack.org>; Mon, 13 May 2024 09:24:30 +0000 (UTC)
X-FDA: 82112837100.18.FE363D0
Received: from mail-vs1-f54.google.com (mail-vs1-f54.google.com [209.85.217.54])
	by imf16.hostedemail.com (Postfix) with ESMTP id E6104180017
	for <linux-mm@kvack.org>; Mon, 13 May 2024 09:24:27 +0000 (UTC)
Received: by mail-vs1-f54.google.com with SMTP id ada2fe7eead31-47f13801f3aso1063172137.0
        for <linux-mm@kvack.org>; Mon, 13 May 2024 02:24:27 -0700 (PDT)
MIME-Version: 1.0
References: <20240408183946.2991168-1-ryan.roberts@arm.com>
 <20240408183946.2991168-6-ryan.roberts@arm.com> <CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com>
 <17b4f026-d734-4610-8517-d83081f75ed4@arm.com>
In-Reply-To: <17b4f026-d734-4610-8517-d83081f75ed4@arm.com>
From: Barry Song <21cnbao@gmail.com>
Date: Mon, 13 May 2024 21:24:13 +1200
Message-ID: <CAGsJ_4zEbqkEwzG0p-svwBA8obY0fSGqqthH7guc5qcxodM8hg@mail.gmail.com>
Subject: Re: [PATCH v7 5/7] mm: swap: Allow storage of all mTHP orders
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, David Hildenbrand <david@redhat.com>, 
	Matthew Wilcox <willy@infradead.org>, Huang Ying <ying.huang@intel.com>, Gao Xiang <xiang@kernel.org>, 
	Yu Zhao <yuzhao@google.com>, Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>, 
	Kefeng Wang <wangkefeng.wang@huawei.com>, Chris Li <chrisl@kernel.org>, 
	Lance Yang <ioworker0@gmail.com>, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Mon, May 13, 2024 at 8:43 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/05/2024 08:30, Barry Song wrote:
> > On Tue, Apr 9, 2024 at 6:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Multi-size THP enables performance improvements by allocating large,
> >> pte-mapped folios for anonymous memory. However I've observed that on an
> >> arm64 system running a parallel workload (e.g. kernel compilation)
> >> across many cores, under high memory pressure, the speed regresses. This
> >> is due to bottlenecking on the increased number of TLBIs added due to
> >> all the extra folio splitting when the large folios are swapped out.
> >>
> >> Therefore, solve this regression by adding support for swapping out mTHP
> >> without needing to split the folio, just like is already done for
> >> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
> >> and when the swap backing store is a non-rotating block device. These
> >> are the same constraints as for the existing PMD-sized THP swap-out
> >> support.
> >>
> >> Note that no attempt is made to swap-in (m)THP here - this is still done
> >> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
> >> prerequisite for swapping-in mTHP.
> >>
> >> The main change here is to improve the swap entry allocator so that it
> >> can allocate any power-of-2 number of contiguous entries between [1, (1
> >> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> >> order and allocating sequentially from it until the cluster is full.
> >> This ensures that we don't need to search the map and we get no
> >> fragmentation due to alignment padding for different orders in the
> >> cluster. If there is no current cluster for a given order, we attempt to
> >> allocate a free cluster from the list. If there are no free clusters, we
> >> fail the allocation and the caller can fall back to splitting the folio
> >> and allocating individual entries (as per existing PMD-sized THP
> >> fallback).
> >>
> >> The per-order current clusters are maintained per-cpu using the existing
> >> infrastructure. This is done to avoid interleaving pages from different
> >> tasks, which would prevent IO being batched. This is already done for
> >> the order-0 allocations so we follow the same pattern.
> >>
> >> As is done for order-0 per-cpu clusters, the scanner now can steal
> >> order-0 entries from any per-cpu-per-order reserved cluster. This
> >> ensures that when the swap file is getting full, space doesn't get tied
> >> up in the per-cpu reserves.
> >>
> >> This change only modifies swap to be able to accept any order mTHP. It
> >> doesn't change the callers to elide doing the actual split. That will be
> >> done in separate changes.
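
The per-order allocation scheme described above can be sketched as a toy
model. This is only an illustration under stated assumptions, not the kernel
implementation: the names toy_swap, toy_alloc and NEXT_INVALID are
hypothetical, the free list is reduced to a simple counter, per-cpu state and
locking are ignored, and a cluster is assumed to hold 512 slots.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; none of these names exist in the kernel. */
#define CLUSTER_SLOTS 512u        /* assumed slots per cluster */
#define NR_CLUSTERS   4u          /* assumed size of the toy swap device */
#define TOY_MAX_ORDER 9u          /* orders 0..9, up to 1 << 9 = 512 slots */
#define NEXT_INVALID  0xffffffffu /* "no current cluster for this order" */

struct toy_swap {
    /* Clusters [next_free_cluster, NR_CLUSTERS) are still on the free list. */
    unsigned int next_free_cluster;
    /* Next offset to hand out for each order, or NEXT_INVALID. */
    unsigned int next[TOY_MAX_ORDER + 1];
};

static void toy_swap_init(struct toy_swap *s)
{
    s->next_free_cluster = 0;
    for (unsigned int o = 0; o <= TOY_MAX_ORDER; o++)
        s->next[o] = NEXT_INVALID;
}

/*
 * Allocate (1 << order) contiguous slots. Returns false when the order has
 * no current cluster and no free cluster is left; a real caller would then
 * split the folio and retry with order 0.
 */
static bool toy_alloc(struct toy_swap *s, unsigned int order,
                      unsigned int *offset)
{
    assert(order <= TOY_MAX_ORDER);
    unsigned int nr = 1u << order;

    if (s->next[order] == NEXT_INVALID) {
        if (s->next_free_cluster == NR_CLUSTERS)
            return false;
        /* Take a whole free cluster and dedicate it to this order. */
        s->next[order] = s->next_free_cluster++ * CLUSTER_SLOTS;
    }

    /* Sequential allocation: no map search, no alignment padding. */
    *offset = s->next[order];
    s->next[order] += nr;

    /* Cluster full: the next request of this order needs a new cluster. */
    if (s->next[order] % CLUSTER_SLOTS == 0)
        s->next[order] = NEXT_INVALID;
    return true;
}
```

In this model two order-4 requests return offsets 0 and 16 from the same
cluster, while a request of a different order takes a fresh cluster. Once the
free clusters run out, an order with no current cluster fails even though
other clusters still contain free slots.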
>
> [...]
>
> >
> > Hi Ryan,
> >
> > Sorry for bringing up an old thread.
>
> No problem - thanks for the report!
>
> >
> > During the initial hour of utilizing an Android phone with 64KiB mTHP,
> > we noticed that the anon_swpout_fallback rate was less than 10%. However,
> > after several hours of phone usage, we observed a significant increase in
> > the anon_swpout_fallback rate, reaching 100%.
>
> I suspect this is due to fragmentation of the clusters: if there is just one
> page left in a cluster then the cluster can't be freed, and once the cluster free
> list is empty a new cluster allocation will fail and this will cause fallback to
> order-0.
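
The fragmentation effect described above can be illustrated with a small toy
model. This is a sketch under assumptions, not kernel code: toy_clusters and
its helpers are hypothetical names, and each cluster is reduced to a single
usage counter over an assumed 512 slots.

```c
#include <assert.h>

/* Hypothetical toy model; these names do not exist in the kernel. */
#define CLUSTER_SLOTS 512u  /* assumed slots per cluster */
#define NR_CLUSTERS   8u    /* assumed number of clusters */

struct toy_clusters {
    unsigned int used[NR_CLUSTERS];  /* slots in use per cluster */
};

/* A cluster counts as free only when none of its slots is in use. */
static unsigned int toy_free_clusters(const struct toy_clusters *c)
{
    unsigned int n = 0;

    for (unsigned int i = 0; i < NR_CLUSTERS; i++)
        if (c->used[i] == 0)
            n++;
    return n;
}

/* Total free slots, regardless of which cluster they sit in. */
static unsigned int toy_free_slots(const struct toy_clusters *c)
{
    unsigned int n = 0;

    for (unsigned int i = 0; i < NR_CLUSTERS; i++)
        n += CLUSTER_SLOTS - c->used[i];
    return n;
}

/* Model "fill every cluster, then free all but `leftover` slots in each". */
static void toy_fill_then_free(struct toy_clusters *c, unsigned int leftover)
{
    assert(leftover <= CLUSTER_SLOTS);
    for (unsigned int i = 0; i < NR_CLUSTERS; i++)
        c->used[i] = leftover;
}
```

With a single leftover slot per cluster, toy_free_slots() reports nearly the
whole device as free while toy_free_clusters() reports zero, which is the
state in which every allocation that needs a free cluster fails.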
>
> >
> > As I checked the code of scan_swap_map_try_ssd_cluster(),
> >
> > static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >         unsigned long *offset, unsigned long *scan_base, int order)
> > {
> >         unsigned int nr_pages = 1 << order;
> >         struct percpu_cluster *cluster;
> >         struct swap_cluster_info *ci;
> >         unsigned int tmp, max;
> >
> > new_cluster:
> >         cluster = this_cpu_ptr(si->percpu_cluster);
> >         tmp = cluster->next[order];
> >         if (tmp == SWAP_NEXT_INVALID) {
> >                 if (!cluster_list_empty(&si->free_clusters)) {
> >                         tmp = cluster_next(&si->free_clusters.head) *
> >                                         SWAPFILE_CLUSTER;
> >                 } else if (!cluster_list_empty(&si->discard_clusters)) {
> >                         /*
> >                          * we don't have free cluster but have some clusters in
> >                          * discarding, do discard now and reclaim them, then
> >                          * reread cluster_next_cpu since we dropped si->lock
> >                          */
> >                         swap_do_scheduled_discard(si);
> >                         *scan_base = this_cpu_read(*si->cluster_next_cpu);
> >                         *offset = *scan_base;
> >                         goto new_cluster;
> >                 } else
> >                         return false;
> >         }
> > ...
> >
> > }
> >
> > Considering the cluster_list_empty() checks, is it necessary to have a
> > free cluster to ensure a continuous allocation of swap slots for large
> > folio swap out?
>
> Yes, currently that is done by design; if we can't allocate a free cluster then
> we only scan for free space in an already allocated cluster for order-0
> allocations. I did this for a couple of reasons:
>
> 1: Simplicity.
>
> 2: Keep behavior the same as PMD-order allocations, which are never scanned
> (although the cluster is the same size as the PMD so scanning would be pointless
> there - so perhaps this is not a good argument for not scanning smaller high
> orders).
>
> 3: If scanning for a high order fails then we would fall back to order-0 and
> scan again, so I was trying to avoid the potential for 2 scans (although once
> you split the page, you'll end up scanning per-page, so perhaps it's not a real
> argument either).
>
> > For instance, if numerous clusters still possess ample free swap slots,
> > could we potentially miss out on them due to a lack of execution of a
> > slow scan?
>
> I think it would definitely be possible to add support for scanning high orders
> and from memory, I don't think it would be too difficult. Based on your
> experience, it sounds like this would be valuable.
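
One possible shape for such a high-order scan is sketched below. It is an
assumption-laden illustration rather than a proposal for the actual kernel
code: toy_scan_order is a hypothetical helper, the swap map is modelled as a
plain byte array where 0 means free, and locking, per-cpu state and cluster
boundaries are all ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, not kernel code: search map[0..nr_slots) for a
 * naturally aligned run of (1 << order) free slots. map[i] == 0 means
 * slot i is free. On success the start of the run is stored in *offset.
 */
static bool toy_scan_order(const unsigned char *map, size_t nr_slots,
                           unsigned int order, size_t *offset)
{
    size_t nr = (size_t)1 << order;

    assert(map && offset);

    /* Step in units of nr so that every candidate is naturally aligned. */
    for (size_t base = 0; base + nr <= nr_slots; base += nr) {
        size_t i;

        for (i = 0; i < nr; i++)
            if (map[base + i])
                break;  /* slot in use, try the next aligned block */

        if (i == nr) {
            *offset = base;
            return true;
        }
    }
    return false;
}
```

Stepping by the allocation size keeps the result aligned, so a partially used
cluster can still satisfy a smaller high order such as order 4 even when no
whole cluster is free. The cost is a linear walk over the map, which is the
extra scan mentioned in reason 3 above.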
>
> I'm going to be out on Paternity leave for 3 weeks from end of today, so I won't
> personally be able to do this until I get back. I might find some time to review
> if you were to post something though :)

Congratulations on the arrival of your precious little one! Forget about the
swap and mTHP, enjoy your time with the family :-)

>
> >
> > I'm not saying your patchset has problems, just that I have some questions.
>
> Let's call it "opportunity for further improvement" rather than problems. :)
>
> I suspect swap-in of large folios may help reduce the fragmentation a bit since
> we are less likely to keep parts of a previously swapped-out mTHP in swap.
>
> Also, I understand that Chris Li has been doing some thinking around an
> indirection layer which would remove the requirement for pages of a large folio
> to be stored contiguously in the swap file. I think he is planning to talk about
> that at LSFMM? (which I sadly won't be attending).
>
> Thanks,
> Ryan
>
> >

Thanks
Barry