From: Chris Li <chrisl@kernel.org>
Date: Tue, 11 Jun 2024 00:11:42 -0700
Subject: Re: [PATCH 0/2] mm: swap: mTHP swap allocator base on swap cluster order
To: "Huang, Ying"
Cc: Andrew Morton, Kairui Song, Ryan Roberts, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Barry Song
In-Reply-To: <87wmmw6w9e.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240524-swap-allocator-v1-0-47861b423b26@kernel.org>
 <87cyp5575y.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xuw1062.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <875xum96nn.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87wmmw6w9e.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, Jun 10, 2024 at 7:38 PM Huang, Ying wrote:
>
> Chris Li writes:
>
> > On Wed, Jun 5, 2024 at 7:02 PM Huang, Ying wrote:
> >>
> >> Chris Li writes:
> >> >
> >> > On the page allocation side, we have hugetlbfs, which reserves some
> >> > memory for high order pages.
> >> > We should have similar things to allow reserving some high order swap
> >> > entries without getting polluted by low order ones.
> >>
> >> TBH, I don't like the idea of high order swap entries reservation.
>
> > May I know more about why you don't like the idea? I understand this
> > can be controversial, because previously we liked to treat THP as a
> > best-effort approach. If for some reason we can't make a THP, we fall
> > back to order 0.
> >
> > For discussion purposes, I want to break it down into smaller steps:
> >
> > First, can we agree that the following use case is reasonable: as
> > Barry has shown, zsmalloc can compress sizes bigger than 4K and get
> > both a better compression ratio and a CPU performance gain.
> > https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> >
> > So the goal is to give THP/mTHP a reasonable success rate under
> > mixed-size swap allocation, even after either low order or high order
> > swap requests have been able to overflow the swap file size. The
> > allocator can still recover from that after some swap entries get
> > freed.
> >
> > Please let me know if you think the above use case and goal are not
> > reasonable for the kernel.
>
> I think that it's reasonable to improve the success rate of high-order
> swap entries allocation.

Glad to hear that.

> I just think that it's hard to use the reservation based method. For
> example, how much should be reserved?

Understood, it is harder to use than a fully transparent method, but it
is still better than no solution at all. The alternative right now is
that we can't do it at all.

Regarding how much we should reserve: similarly, how should you choose
your swapfile size? If you choose N, why not N*120% or N*80%? That did
not stop us from having a swapfile, right?

> Why system OOM when there's still swap space available? And so forth.

Keep in mind that the reservation is an option. If you prefer the old
behavior, you don't have to use the reservation. That shouldn't be a
reason to stop others who want to use it. We don't have an alternative
solution for long-running mixed-size allocation yet. If there is one, I
would like to hear it.

> So, I prefer the transparent methods. Just like THP vs. hugetlbfs.

Me too. I prefer a transparent method over reservation if it can achieve
the same goal. Do we have a fully transparent method specced out? How do
we stay fully transparent and still avoid the fragmentation caused by
mixed-order allocation and free?

Keep in mind that we are still in the early stage of mTHP swap
development; I can have the reservation patch ready relatively easily.
If you come up with a better transparent-method patch which can achieve
the same goal later, we can use it instead.

> >> that's really important for you, I think that it's better to design
> >> something like hugetlbfs vs core mm, that is, be separated from the
> >> normal swap subsystem as much as possible.
> >
> > I am bringing up hugetlbfs just to make the point of using
> > reservation, or isolation of the resource, to prevent the mixed-order
> > fragmentation that exists in core mm.
> > I am not suggesting copying the hugetlbfs implementation to the swap
> > system. Unlike hugetlbfs, swap allocation is typically done by the
> > kernel and is transparent to the application. I don't think separating
> > it from the swap subsystem is a good way to go.
> >
> > This comes down to why you don't like the reservation. E.g. if we use
> > two swapfiles, and one swapfile is purely allocated for high order,
> > would that be better?
>
> Sorry, my words weren't accurate. Personally, I just think that it's
> better to make reservation-related code not too intrusive.

Yes, I will try to make it not too intrusive.

> And, before reservation, we need to consider something else first:
> is it generally good to swap in with the swap-out order?

When we have the reservation patch (or some other means to sustain
mixed-size swap allocation and free), we can test it out to get more
data to reason about it. I consider the swap-in size policy an
orthogonal issue.

> Should we consider memory wastage too? One static policy doesn't fit
> all; we may need either a dynamic policy, or to make the policy
> configurable.
>
> In general, I think that we need to do this step by step.

The core swap layer needs to be able to sustain mixed-size swap
allocation and free in the long run. Without that, the swap-in size
policy is meaningless. Yes, that is the step-by-step approach: allowing
long-running mixed-size swap allocation is the first step.

> >> >> > Do you see another way to protect the high order cluster from
> >> >> > being polluted by lower order ones?
> >> >>
> >> >> If we use high-order page allocation as a reference, we need
> >> >> something like compaction to guarantee high-order allocation
> >> >> finally. But we are too far from that.
> >> >
> >> > We should consider reservation for high-order swap entry
> >> > allocation, similar to hugetlbfs for memory.
> >> > Swap compaction will be very complicated because it needs to scan
> >> > the PTEs to migrate the swap entries. It might be easier to support
> >> > folio writeout to compound discontiguous swap entries. That is
> >> > another way to address the fragmentation issue. We are also too far
> >> > from that right now.
> >>
> >> It's not easy to write out compound discontiguous swap entries
> >> either. For example, how do we put such folios in the swap cache?
> >
> > I proposed the idea in the recent LSF/MM discussion; the last few
> > slides are about discontiguous swap, and it has the discontiguous
> > entries in the swap cache.
> > https://drive.google.com/file/d/10wN4WgEekaiTDiAx2AND97CYLgfDJXAD/view
> >
> > Agreed, it is not an easy change. The swap cache would have to drop
> > the assumption that all offsets are contiguous.
> > For swap, we already have some in-memory data associated with each
> > offset, so it might provide an opportunity to combine the per-offset
> > data structures for swap together. Another alternative might be using
> > the xarray without the multi-entry property, i.e. just treating each
> > offset as a single entry. I haven't dug deep into this direction yet.
>
> Thanks! I will study your idea.

I am happy to discuss if you have any questions.

> > We can have more discussion, maybe arrange an upstream alignment
> > meeting if there is interest.
>
> Sure.

Ideally, if we can resolve our differences over the mailing list, then
we don't need a separate meeting :-)

Chris
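
P.S. To make the "xarray without the multi-entry property" idea a bit
more concrete, here is a rough, hypothetical sketch. The helper name
swap_cache_store_discontig() and the per-subpage slots[] array are made
up purely for illustration; this is not the current swap cache code,
just what keying each swap offset as its own single-index entry could
look like:

#include <linux/mm.h>
#include <linux/swapops.h>
#include <linux/xarray.h>

/*
 * Hypothetical: insert a large folio into a swap cache xarray where
 * every subpage is keyed by its own swap offset.  The offsets in
 * slots[] do not have to be contiguous, which is the whole point.
 */
static int swap_cache_store_discontig(struct xarray *swap_cache,
				      struct folio *folio,
				      swp_entry_t *slots)
{
	long i, nr = folio_nr_pages(folio);
	void *ret = NULL;

	for (i = 0; i < nr; i++) {
		/* Plain single-index store, no multi-order entry. */
		ret = xa_store(swap_cache, swp_offset(slots[i]), folio,
			       GFP_KERNEL);
		if (xa_is_err(ret))
			goto undo;
	}
	return 0;

undo:
	/* Back out the offsets stored so far. */
	while (i-- > 0)
		xa_erase(swap_cache, swp_offset(slots[i]));
	return xa_err(ret);
}

The obvious trade-off is nr_pages stores (and per-subpage lookups on the
read side) plus nr_pages slots of metadata instead of one multi-index
entry, but it removes the assumption that a folio's swap offsets are
contiguous.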