From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org>
 <20240711-swap-allocator-v4-2-0295a4d4c7aa@kernel.org>
 <874j8nxhiq.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o76qjhqs.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <43f73463-af42-4a00-8996-5f63bdf264a3@arm.com>
 <87jzhdkdzv.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sew0ei84.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <4ec149fc-7c13-4777-bc97-58ee455a3d7e@arm.com>
 <87plr26kg2.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87v80s3nvs.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87jzh83d3o.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87jzh83d3o.fsf@yhuang6-desk2.ccr.corp.intel.com>
From: Chris Li <chrisl@kernel.org>
Date: Fri, 26 Jul 2024 00:15:58 -0700
Subject: Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
To: "Huang, Ying"
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, Barry Song
Content-Type: text/plain; charset="UTF-8"
On Thu, Jul 25, 2024 at 11:05 PM Huang, Ying wrote:
>
> Chris Li writes:
>
> > On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying wrote:
> >> >
> >> > The current proposed order also improves things step by step. The
> >> > only disagreement here is at which point in the patch order we
> >> > introduce yet another list in addition to the nonfull one.
> >> > I just feel that it does not make sense to invest in new code if
> >> > that new code is going to be completely rewritten anyway in the
> >> > next two patches.
> >> >
> >> > Unless you mean that we should not do the patch 3 big rewrite and
> >> > should continue the scan_swap_map_try_ssd_cluster() way of only
> >> > doing half of the allocation job, letting scan_swap_map_slots() do
> >> > the complex retry on top of try_ssd(). I feel the overall code is
> >> > more complex and less maintainable that way.
> >>
> >> I haven't looked at [3/3]; I will wait for your next version for
> >> that. So, I cannot say which order is better. Please consider
> >> reviewers' effort too. Small-step patches are easier to understand
> >> and review.
> >
> > That is exactly the reason I don't want to introduce too much new
> > code depending on the scan_swap_map_slots() behavior, which will be
> > abandoned in the big rewrite. Their constraints are very different.
> > I want to make the big rewrite in patch 3 as small as possible,
> > using incremental follow-up patches to improve it.
> >
> >>
> >> >> > That is why I want to make this change in a patch after patch
> >> >> > 3. There is also the long test cycle after the modification to
> >> >> > make sure the swap code path is stable. I am not resisting a
> >> >> > change of patch order; it is just that this patch can't
> >> >> > directly be moved before patch 3, the big rewrite.
> >> >> >
> >> >> >>
> >> >> >> > Your original suggestion appears to be that you want to keep
> >> >> >> > every cluster with order-N on the nonfull list for order-N,
> >> >> >> > unless the number of free swap entries is less than 1<<N.
> >> >> >>
> >> >> >> Well I think that's certainly one of the conditions for
> >> >> >> removing it. But I agree that if a full scan of the cluster
> >> >> >> has been performed, and no swap entries have been freed since
> >> >> >> the scan started, then it should also be removed from the
> >> >> >> list.
> >> >> > Yes, in a later patch, beyond patch 3, we have the almost-full
> >> >> > cluster list, for clusters that have been scanned and were not
> >> >> > able to allocate an order-N entry.
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >>> And, I understand that in some situations it may be better
> >> >> >> >>> to share clusters among CPUs. So my suggestion is,
> >> >> >> >>>
> >> >> >> >>> - Make swap_cluster_info->order more accurate, don't
> >> >> >> >>>   pretend that we have free swap entries with that order
> >> >> >> >>>   even after we are sure that we haven't.
> >> >> >> >>
> >> >> >> >> Is this patch pretending that today? I don't think so?
> >> >> >> >
> >> >> >> > IIUC, in this patch swap_cluster_info->order is still "N"
> >> >> >> > even if we are sure that there are no order-N free swap
> >> >> >> > entries in the cluster.
> >> >> >>
> >> >> >> Oh I see what you mean. I think you and Chris already
> >> >> >> discussed this? IIRC Chris's point was that if you move that
> >> >> >> cluster to N-1, eventually all clusters are for order-0 and
> >> >> >> you have no means of allocating high orders until a whole
> >> >> >> cluster becomes free. That logic certainly makes sense to me,
> >> >> >> so I think it's better for swap_cluster_info->order to remain
> >> >> >> static while the cluster is allocated. (I only skimmed that
> >> >> >> conversation, so apologies if I got the conclusion wrong!)
> >> >> >
> >> >> > Yes, that is the original intent: keep the cluster's order as
> >> >> > long as possible.
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >> But I agree that a cluster should only be on the per-order
> >> >> >> >> nonfull list if we know there are at least enough free swap
> >> >> >> >> entries in that cluster to cover the order. Of course that
> >> >> >> >> doesn't tell us for sure, because they may not be
> >> >> >> >> contiguous.
> >> >> >> >
> >> >> >> > We can check that when freeing a swap entry, by checking the
> >> >> >> > adjacent swap entries.
> >> >> >> > IMHO, the performance should be acceptable.
> >> >> >>
> >> >> >> Would you then use the result of that scanning to "promote" a
> >> >> >> cluster's order, e.g. swap_cluster_info->order = N+1? That
> >> >> >> would be neat. But this all feels like a separate change on
> >> >> >> top of what Chris is doing here. For high orders there could
> >> >> >> be quite a bit of scanning required in the worst case for
> >> >> >> every page that gets freed.
> >> >> >
> >> >> > Right, I feel that is a different set of patches. Even this
> >> >> > series is hard enough to review. Order promotion and demotion
> >> >> > is heading towards a buddy-system design. I want to point out
> >> >> > that even the buddy system is not able to handle the case where
> >> >> > the swapfile is almost full and the recently freed swap entries
> >> >> > are not contiguous.
> >> >> >
> >> >> > We can invest in the buddy system, which doesn't handle all the
> >> >> > fragmentation issues. Or, as I prefer, go directly to the
> >> >> > discontiguous swap entry. We pay a price for the indirect
> >> >> > mapping of swap entries, but it will solve the fragmentation
> >> >> > issue 100%.
> >> >>
> >> >> It's good if we can solve the fragmentation issue 100%. We just
> >> >> need to pay attention to the cost.
> >> >
> >> > By cost, do you mean the development cost or the runtime cost
> >> > (memory and CPU)?
> >>
> >> I mean the runtime cost.
> >
> > Thanks for the clarification. Agreed that we need to pay attention
> > to the runtime cost. That is a given.
> >
> >> >> >> >>> My question is whether it's so important to share the
> >> >> >> >>> per-CPU cluster among CPUs?
> >> >> >> >>
> >> >> >> >> My rationale for sharing is that the preference previously
> >> >> >> >> has been to favour efficient use of swap space; we don't
> >> >> >> >> want to fail a request for allocation of a given order if
> >> >> >> >> there are actually slots available, just because they have
> >> >> >> >> been reserved by another CPU. And I'm still asserting that
> >> >> >> >> it should be ~zero cost to do this. If I'm wrong about the
> >> >> >> >> zero cost, or in practice the sharing doesn't actually help
> >> >> >> >> improve allocation success, then I'm happy to take the
> >> >> >> >> exclusive approach.
> >> >> >> >>
> >> >> >> >>> I suggest starting with a simple design, that is, the
> >> >> >> >>> per-CPU cluster will not be shared among CPUs in most
> >> >> >> >>> cases.
> >> >> >> >>
> >> >> >> >> I'm all for starting simple; I think that's what I already
> >> >> >> >> proposed (exclusive in this patch, then shared in the "big
> >> >> >> >> rewrite"). I'm just objecting to the current half-and-half
> >> >> >> >> policy in this patch.
> >> >> >> >
> >> >> >> > Sounds good to me. We can start with the exclusive solution
> >> >> >> > and evaluate whether the shared solution is better.
> >> >> >>
> >> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >> >> >
> >> >> > It is not able to avoid fragmentation 100% of the time. I
> >> >> > prefer the discontiguous swap entry as the next step, which
> >> >> > guarantees forward progress; we will not be stuck in a
> >> >> > situation where we are unable to allocate swap entries due to
> >> >> > fragmentation.
> >> >>
> >> >> If my understanding is correct, the implementation complexity of
> >> >> order promotion/demotion isn't at the same level as that of the
> >> >> discontiguous swap entry.
> >> >
> >> > The discontiguous swap entry has higher complexity but a higher
> >> > payoff as well. It can get us to a place where cluster
> >> > promotion/demotion can't.
> >> >
> >> > I also feel that if we implement something heading towards a
> >> > buddy-system allocator for swap, we should do a proper
> >> > buddy-allocator implementation with the right data structures.
> >>
> >> I don't think that it's easy to implement a real buddy allocator
> >> for swap entries. So, I avoided using the word "buddy".
> >
> > Then such a mix of cluster order promote/demote loses some of the
> > benefits of the buddy system.
> > Because it lacks the proper data structures to support buddy
> > allocation. The buddy allocator provides more general migration
> > between orders; only the limited use case of cluster
> > promotion/demotion is supported, and somewhat by luck. We need to
> > evaluate whether it is worth the additional complexity.
>
> TBH, I believe that the complexity of order promote/demote is quite
> low, both for development and at runtime. A real buddy allocator may
> increase the per-swap-entry memory footprint considerably.

My main concern is its effectiveness. Anyway, the series is already
complex enough with the big rewrite and reclaim on swap cache. Let me
know if you think it needs to be done before the big rewrite.

Chris