From: Chris Li <chrisl@kernel.org>
Date: Tue, 25 Jun 2024 09:06:04 -0700
Subject: Re: [RFC PATCH v1 0/5] Alternative mTHP swap allocator improvements
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton, Kairui Song, "Huang, Ying", Kalesh Singh, Barry Song,
 Hugh Dickins, David Hildenbrand, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
In-Reply-To: <1cb60685-3c34-4441-81e1-a60adc70d1f2@arm.com>
References: <20240618232648.4090299-1-ryan.roberts@arm.com>
 <1cb60685-3c34-4441-81e1-a60adc70d1f2@arm.com>

On Tue, Jun 25, 2024 at 1:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 19/06/2024 00:26, Ryan Roberts wrote:
> > Hi All,
> >
> > Chris has been doing great work at [1] to clean up my mess in the mTHP
> > swap entry allocator. But Barry posted a test program and results at
> > [2] showing that even with Chris's changes, there are still some
> > fallbacks (around 5% - 25% in some cases).
> > I was interested in why that might be and ended up putting this PoC
> > patch set together to try to get a better understanding. This series
> > ends up achieving 0% fallback, even with small folios ("-s") enabled.
> > I haven't done much testing beyond that (yet) but thought it was worth
> > posting on the strength of that result alone.
> >
> > At a high level this works in a similar way to Chris's series; it
> > marks a cluster as being for a particular order, and if a new cluster
> > cannot be allocated then it scans through the existing non-full
> > clusters. But it does it by scanning through the clusters rather than
> > assembling them into a list. Cluster flags are used to mark clusters
> > that have been scanned and are known not to have enough contiguous
> > space, so the efficiency should be similar in practice.
> >
> > Because it's not based around a linked list, there is less churn, and
> > I'm wondering if this is perhaps easier to review and potentially even
> > get into v6.10-rcX to fix up what's already there, rather than having
> > to wait until v6.11 for Chris's series? I know Chris has a larger
> > roadmap of improvements, so at best I see this as a tactical fix that
> > will ultimately be superseded by Chris's work.
> >
> > There are a few differences to note vs Chris's series:
> >
> > - order-0 fallback scanning is still allowed in any cluster; the
> >   argument in the past was that swap should always use all the swap
> >   space, so I've left this mechanism in. It is only a fallback though;
> >   first the new per-order scanner is invoked, even for order-0, so if
> >   there are free slots in clusters already assigned for order-0, then
> >   the allocation will go there.
> >
> > - CPUs can steal slots from other CPUs' current clusters; those
> >   clusters remain scannable while they are current for a CPU and are
> >   only made unscannable when no more CPUs are scanning that particular
> >   cluster.
> >
> > - I'm preferring to allocate a free cluster ahead of per-order
> >   scanning, since, as I understand it, the original intent of a
> >   per-cpu current cluster was to get pages for an application adjacent
> >   in the swap to speed up IO.
>
> [...]
>
> I've been having a play, trying to extend this to actively free swap
> entries to make sufficient space for a new allocation if the entries
> are already in the swap cache. (To be clear, I'm not pursuing this
> series for inclusion, but was just trying to put some numbers to the
> idea, which could later be added to Chris's series if it makes sense.)
>
> However, I'm running into an unexpected issue; it was my previous
> understanding that if the swap map for an entry is SWAP_HAS_CACHE, then
> there is a folio in the swap cache and nothing is referencing the swap
> entry, so the entry can be freed with __try_to_reclaim_swap() - that's
> the pattern that the existing order-0 scanner uses.
>
> But I'm finding that __try_to_reclaim_swap() always returns 0,
> indicating that no folio was associated with the swap entry. How can
> that be, if swap_map[offset] == SWAP_HAS_CACHE?

I saw that in my test too. Some swap entries remain as SWAP_HAS_CACHE,
which contributes to the swap allocation failure rate.

One possibility is that zram is hitting the synchronous IO case for
swap-in: that path does not put the folio in the swap cache when serving
the swap-in. In commit 13ddaf26be324a7f951891ecd9ccd04466d27458 from
Kairui, the SWAP_HAS_CACHE flag was added for synchronous IO, in order
to address a race in the swap-in.
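To make that window concrete, here is a toy userspace model (a sketch
only: sync_swapin_begin()/sync_swapin_end(), the array sizes and the
slot number are invented for illustration; this is not the kernel code).
It shows how a concurrent scanner can observe SWAP_HAS_CACHE while there
is nothing in the swap cache to reclaim:

#include <stdio.h>

#define SWAP_HAS_CACHE 0x40            /* same bit value as the kernel's */

static unsigned char swap_map[8];      /* toy swap map, one byte per slot */
static void *swap_cache[8];            /* NULL: no folio cached for slot */

/* Synchronous swap-in (the zram-style path): pin the entry with
 * SWAP_HAS_CACHE but never insert a folio into the swap cache; the pin
 * is dropped when the fault finishes. */
static void sync_swapin_begin(int slot) { swap_map[slot] |= SWAP_HAS_CACHE; }
static void sync_swapin_end(int slot)   { swap_map[slot] &= ~SWAP_HAS_CACHE; }

int main(void)
{
	sync_swapin_begin(3);
	/* A concurrent allocator scan in this window sees the pin but
	 * finds no folio to reclaim, so reclaim would report 0. */
	if (swap_map[3] == SWAP_HAS_CACHE && swap_cache[3] == NULL)
		printf("slot 3: SWAP_HAS_CACHE set, nothing reclaimable\n");
	sync_swapin_end(3);
	return 0;
}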
But the SWAP_HAS_CACHE should be cleaned up after the swap-in fault,
though. I did not have a chance to dig deeper into the root cause of
those SWAP_HAS_CACHE entries yet.

Chris

>
> Here is my code, for reference. Note that "evict" is always set true to
> be very aggressive for now, but the eventual idea is that it would only
> be set to true when certain criteria are met (e.g. the last attempt to
> allocate for the order either failed or had to evict to succeed).
>
> With logging added, every single call to __try_to_reclaim_swap()
> returns 0. (Well, it also livelocks due to that with the code as shown,
> but I hacked around that in experiments.) Any ideas, before I dig
> deeper?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f6d78723f0fd..928d61e1d9d5 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -641,13 +641,25 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
>  }
>
>  static inline bool swap_range_empty(char *swap_map, unsigned int start,
> -				    unsigned int nr_pages)
> +				    unsigned int nr_pages, bool *inc_cached)
>  {
>  	unsigned int i;
> -
> -	for (i = 0; i < nr_pages; i++) {
> -		if (swap_map[start + i])
> -			return false;
> +	bool hit_cache = false;
> +
> +	if (*inc_cached) {
> +		for (i = 0; i < nr_pages; i++) {
> +			if (swap_map[start + i]) {
> +				if (swap_map[start + i] != SWAP_HAS_CACHE)
> +					return false;
> +				hit_cache = true;
> +			}
> +		}
> +		*inc_cached = hit_cache;
> +	} else {
> +		for (i = 0; i < nr_pages; i++) {
> +			if (swap_map[start + i])
> +				return false;
> +		}
>  	}
>
>  	return true;
> @@ -760,6 +772,7 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  	struct swap_cluster_info *ci;
>  	unsigned int tmp, max;
>  	unsigned int stop = SWAP_NEXT_INVALID;
> +	bool evict = true;
>
>  new_cluster:
>  	cluster = this_cpu_ptr(si->percpu_cluster);
> @@ -799,18 +812,48 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>  	 * natural alignment.
>  	 */
>  	max = ALIGN(tmp + 1, SWAPFILE_CLUSTER);
> -	ci = lock_cluster(si, tmp);
> -	while (tmp < max) {
> -		if (swap_range_empty(si->swap_map, tmp, nr_pages))
> -			break;
> -		tmp += nr_pages;
> +scan_cluster:
> +	if (tmp < max) {
> +		ci = lock_cluster(si, tmp);
> +		while (tmp < max) {
> +			if (swap_range_empty(si->swap_map, tmp, nr_pages, &evict))
> +				break;
> +			tmp += nr_pages;
> +		}
> +		unlock_cluster(ci);
>  	}
> -	unlock_cluster(ci);
>  	if (tmp >= max) {
>  		cluster_dec_scanners(ci);
>  		cluster->next[order] = SWAP_NEXT_INVALID;
>  		goto new_cluster;
>  	}
> +	if (evict) {
> +		unsigned int off;
> +		int nr;
> +
> +		spin_unlock(&si->lock);
> +
> +		for (off = tmp; off < tmp + nr_pages; off += nr) {
> +			nr = 1;
> +			if (READ_ONCE(si->swap_map[off]) == SWAP_HAS_CACHE) {
> +				nr = __try_to_reclaim_swap(si, off, TTRS_ANYWAY);
> +				if (nr < 0)
> +					break;
> +				else if (nr == 0)
> +					nr = 1;
> +				nr = ALIGN(off + 1, nr) - off;
> +			}
> +		}
> +
> +		spin_lock(&si->lock);
> +		*scan_base = this_cpu_read(*si->cluster_next_cpu);
> +		*offset = *scan_base;
> +
> +		if (nr < 0)
> +			tmp += nr_pages;
> +
> +		goto scan_cluster;
> +	}
>  	*offset = tmp;
>  	*scan_base = tmp;
>  	tmp += nr_pages;
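As a side note on the quoted diff: the advance "nr = ALIGN(off + 1, nr)
- off" after a successful reclaim is meant to skip the remainder of the
folio that was just reclaimed. Below is a standalone check of that
arithmetic, assuming (as the diff does) that __try_to_reclaim_swap()
returns the reclaimed folio's page count; align_up() stands in for the
kernel's ALIGN() macro, and the offsets are made up:

#include <assert.h>
#include <stdio.h>

/* Stand-in for the kernel's ALIGN(): round x up to a multiple of a.
 * (In the kernel a is a power of two; plain integer math is fine here.) */
static unsigned align_up(unsigned x, unsigned a)
{
	return (x + a - 1) / a * a;
}

int main(void)
{
	/* Say reclaim at offset 5 hit a 4-page folio covering slots 4..7. */
	unsigned off = 5, folio_pages = 4;
	unsigned step = align_up(off + 1, folio_pages) - off;

	printf("advance by %u -> next offset %u\n", step, off + step);
	assert(off + step == 8);	/* next probe starts past slot 7 */
	return 0;
}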