From mboxrd@z Thu Jan 1 00:00:00 1970
From: Kairui Song <ryncsn@gmail.com>
Date: Fri, 31 May 2024 20:40:11 +0800
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: "Huang, Ying"
Cc: Chris Li, Andrew Morton, Ryan Roberts, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Barry Song
In-Reply-To: <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
 <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Fri, May 31, 2024 at 10:37 AM Huang, Ying wrote:
>
> Chris Li writes:
>
> > On Wed, May 29, 2024 at 7:54 PM Huang, Ying wrote:
> >>
> >> Chris Li writes:
> >>
> >> > Hi Ying,
> >> >
> >> > On Wed, May 29, 2024 at 1:57 AM Huang, Ying wrote:
> >> >>
> >> >> Chris Li writes:
> >> >>
> >> >> > I am spinning a new version for this series to
> >> >> > address two issues found in this series:
> >> >> >
> >> >> > 1) Oppo discovered a bug in the following line:
> >> >> >        + ci = si->cluster_info + tmp;
> >> >> >    It should be "tmp / SWAPFILE_CLUSTER" instead of "tmp". That is
> >> >> >    a serious bug, but trivial to fix.
> >> >> >
> >> >> > 2) Order 0 allocation currently blindly scans swap_map,
> >> >> >    disregarding the cluster->order.
> >> >>
> >> >> IIUC, now we only scan swap_map[] if list_empty(&si->free_clusters)
> >> >> && list_empty(&si->nonfull_clusters[order]). That is, if you don't
> >> >> run low on free swap space, you will not do that.
> >> >
> >> > You can still have swap space in order 0 clusters while order 4 runs
> >> > out of free_clusters or nonfull_clusters[order]. For Android that is
> >> > a common case.
> >>
> >> When we fail to allocate order 4, we will fall back to order 0, so we
> >> still don't need to scan swap_map[]. But after looking at your reply
> >> below, I realized that the swap space is almost full most of the time
> >> in your cases. Then, it's possible that we run into scanning
> >> swap_map[]: list_empty(&si->free_clusters) &&
> >> list_empty(&si->nonfull_clusters[order]) will become true if we put
> >> too many clusters in si->percpu_cluster. So, if we want to avoid
> >> scanning swap_map[], we can stop adding clusters to si->percpu_cluster
> >> when swap space runs low, and maybe take clusters out of
> >> si->percpu_cluster sometimes.
> >
> > One idea after reading your reply: if we run out of
> > nonfull_clusters[order], we should be able to use other CPUs'
> > si->percpu_cluster[] as well. That is a very small win for Android,
> > because Android does not have too many CPUs. We are talking about a
> > handful of clusters, which might not justify the code complexity. It
> > does not change the behavior that order 0 can pollute higher orders.
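For clarity, the cluster indexing fix described in (1) above can be sketched in plain userspace C. The SWAPFILE_CLUSTER value and the struct layout below are simplified stand-ins, not the kernel definitions:

```c
#include <assert.h>

/* Illustrative value; the kernel's SWAPFILE_CLUSTER differs. */
#define SWAPFILE_CLUSTER 512            /* swap entries per cluster */

struct swap_cluster_info {
    unsigned int order;
};

/* cluster_info[] has one element per *cluster*, while tmp is a
 * per-entry swap offset, so the offset must be scaled down to a
 * cluster index.  The buggy line added tmp directly, indexing far
 * past the intended cluster. */
static struct swap_cluster_info *
offset_to_cluster(struct swap_cluster_info *cluster_info, unsigned long tmp)
{
    return cluster_info + tmp / SWAPFILE_CLUSTER;
}
```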
>
> I have a feeling that you don't really know why swap_map[] is scanned.
> I suggest you do more testing and tracing to find out the reason. I
> suspect that there are some non-full cluster collection issues.
>
> >> Another issue is that nonfull_clusters[order1] cannot be used for
> >> nonfull_clusters[order2]. By definition, we should not fail order 0
> >> allocation; we need to steal nonfull_clusters[order > 0] for order 0
> >> allocation. This can avoid scanning swap_map[] too. This may not be
> >> perfect, but it is the simplest first-step implementation. You can
> >> optimize it further from there.
> >
> > Yes, that is listed as a limitation of this cluster order approach.
> > Initially we need to support one order well first. We might choose
> > which order that is: 16K or 64K folios. 4K pages are too small, and 2M
> > pages are too big. The sweet spot might be somewhere in between. If we
> > can support one order well, we can demonstrate the value of mTHP. We
> > can worry about other mixed orders later.
> >
> > Do you have any suggestions for how to prevent order 0 from polluting
> > the higher order clusters? If we allow that to happen, it defeats
> > the goal of being able to allocate higher order swap entries. The
> > tricky question is that we don't know how much swap space we should
> > reserve for each order. We can always break higher order clusters into
> > lower orders, but can't do the reverse. The current patch series lets
> > the actual usage determine the percentage of clusters for each order.
> > However, that seems not to be enough for the test case Barry has. When
> > an app gets OOM killed, that is where a large swing of order 0 swap
> > shows up, with not enough higher order usage for that brief moment.
> > The order 0 swap entries will pollute the high order clusters. We are
> > currently debating a "knob" to be able to reserve a certain % of swap
> > space for a certain order.
> > Those reservations will be guaranteed, and order 0 swap entries can't
> > pollute them even when we run out of swap space. That can make mTHP at
> > least usable for the Android case.
>
> IMO, the bottom line is that order-0 allocation is the first class
> citizen; we must keep it optimized. And OOM with free swap space isn't
> acceptable. Please consider the policy we used for page allocation.
>
> > Do you see another way to protect the high order clusters from being
> > polluted by lower order ones?
>
> If we use high-order page allocation as a reference, we need something
> like compaction to guarantee high-order allocation eventually. But we
> are too far from that.
>
> For a specific configuration, I believe that we can get a reasonable
> high-order swap entry allocation success rate for specific use cases.
> For example, if we only allow a limited maximum number of order-0 swap
> entry allocations, can we keep high-order clusters?

Doesn't limiting order-0 allocation break the bottom line that order-0
allocation is the first class citizen and should not fail if there is
space?

Just my two cents...

I had a try locally based on Chris's work, allowing order 0 to use
nonfull_clusters as Ying suggested, starting with the lowest order and
increasing the order until nonfull_clusters[order] is not empty. That
way, higher orders are better protected, because a direct scan won't
happen unless we run out of both free_clusters and nonfull_clusters.

More concretely, I applied the following changes, which didn't change
the code much:
- In scan_swap_map_try_ssd_cluster, check nonfull_clusters first, then
  free_clusters, then discard_clusters.
- If the request is order 0, also check nonfull_clusters[i] for every
  order (for (int i = 0; i < SWAP_NR_ORDERS; ++i)) before letting
  scan_swap_map_try_ssd_cluster return false.

A quick test, still using the memtier test, but with the swap device
size decreased from 10G to 8G for higher pressure:
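The order-0 fallback search described above could look roughly like the sketch below. This is a userspace model, not the actual scan_swap_map_try_ssd_cluster changes; the struct and the SWAP_NR_ORDERS value are illustrative stand-ins that only track list emptiness:

```c
#include <assert.h>
#include <stdbool.h>

#define SWAP_NR_ORDERS 9   /* illustrative count of supported orders */

/* Minimal stand-in for the per-device cluster lists: only emptiness
 * matters for the fallback decision sketched here. */
struct si_sketch {
    bool free_clusters_empty;
    bool nonfull_empty[SWAP_NR_ORDERS];
};

/* For an order-0 request, walk nonfull_clusters from the lowest order
 * up and return the first non-empty order, or -1 if every list is
 * empty and the allocator would have to fall back to scanning
 * swap_map[].  Starting from the lowest order means high-order
 * clusters are consumed last, i.e. they stay protected the longest. */
static int order0_pick_nonfull(const struct si_sketch *si)
{
    for (int i = 0; i < SWAP_NR_ORDERS; i++)
        if (!si->nonfull_empty[i])
            return i;
    return -1;
}
```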
Before:
hugepages-32kB/stats/swpout:34013
hugepages-32kB/stats/swpout_fallback:266
hugepages-512kB/stats/swpout:0
hugepages-512kB/stats/swpout_fallback:77
hugepages-2048kB/stats/swpout:0
hugepages-2048kB/stats/swpout_fallback:1
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:35088
hugepages-64kB/stats/swpout_fallback:66
hugepages-16kB/stats/swpout:31848
hugepages-16kB/stats/swpout_fallback:402
hugepages-256kB/stats/swpout:390
hugepages-256kB/stats/swpout_fallback:7244
hugepages-128kB/stats/swpout:28573
hugepages-128kB/stats/swpout_fallback:474

After:
hugepages-32kB/stats/swpout:31448
hugepages-32kB/stats/swpout_fallback:3354
hugepages-512kB/stats/swpout:30
hugepages-512kB/stats/swpout_fallback:33
hugepages-2048kB/stats/swpout:2
hugepages-2048kB/stats/swpout_fallback:0
hugepages-1024kB/stats/swpout:0
hugepages-1024kB/stats/swpout_fallback:0
hugepages-64kB/stats/swpout:31255
hugepages-64kB/stats/swpout_fallback:3112
hugepages-16kB/stats/swpout:29931
hugepages-16kB/stats/swpout_fallback:3397
hugepages-256kB/stats/swpout:5223
hugepages-256kB/stats/swpout_fallback:2351
hugepages-128kB/stats/swpout:25600
hugepages-128kB/stats/swpout_fallback:2194

The high order (256kB) swapout rate is significantly higher, and 512kB
swapout is now possible, which indicates high orders are better
protected. Lower orders are sacrificed, but that seems worth it.
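To put a number on the 256kB improvement: expressed as a fallback rate, fallback / (swpout + fallback), the numbers above go from roughly 95% fallback before to roughly 31% after. A throwaway helper (the function name is mine, not from the kernel):

```c
#include <assert.h>

/* Fallback rate in percent: the fraction of swapout attempts at a
 * given size that had to fall back to smaller orders. */
static double fallback_rate(unsigned long swpout, unsigned long fallback)
{
    return 100.0 * (double)fallback / (double)(swpout + fallback);
}
```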