From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 15 Mar 2024 23:01:46 +1300
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
To: "Huang, Ying"
Cc: Matthew Wilcox, akpm@linux-foundation.org, linux-mm@kvack.org,
 ryan.roberts@arm.com, chengming.zhou@linux.dev, chrisl@kernel.org,
 david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 mhocko@suse.com, nphamcs@gmail.com, shy828301@gmail.com,
 steven.price@arm.com, surenb@google.com, wangkefeng.wang@huawei.com,
 xiang@kernel.org, yosryahmed@google.com, yuzhao@google.com,
 Chuanhua Han, Barry Song
In-Reply-To: <87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240304081348.197341-1-21cnbao@gmail.com>
 <20240304081348.197341-6-21cnbao@gmail.com>
 <87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > From: Chuanhua Han
> >> >
> >> > On an embedded system like Android, more than half of anon memory is
> >> > actually in swap devices such as zRAM. For example, while an app is
> >> > switched to the background, most of its memory might be swapped out.
> >> >
> >> > Now we have mTHP features; unfortunately, if we don't support large folio
> >> > swap-in, once those large folios are swapped out, we immediately lose the
> >> > performance gain we can get through large folios and hardware optimization
> >> > such as CONT-PTE.
> >> >
> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP swap-in
> >> > to those contiguous swaps which were likely swapped out from mTHP as a
> >> > whole.
> >> >
> >> > Meanwhile, the current implementation only covers the SWAP_SYNCHRONOUS
> >> > case. It doesn't support swapin_readahead as large folios yet, since this
> >> > kind of shared memory is much less common than memory mapped by a single
> >> > process.
> >>
> >> In contrast, I still think that it's better to start with the normal swap-in
> >> path, then expand to the SWAP_SYNCHRONOUS case.
> >
> > I'd rather try the reverse direction, as non-sync anon memory is only
> > around 3% on a phone in my observation.
>
> Phone is not the only platform that Linux is running on.

I suppose it's generally true that forked shared anonymous pages only
constitute a small portion of all anonymous pages; the majority of
anonymous pages are within a single process.

I agree phones are not the only platform. But Rome wasn't built in a day.
I can only get started on hardware which I can easily reach and on which
I have enough hardware/test resources. So we may take the first step,
which can be applied on a real product and improve its performance, and
then, step by step, broaden it and make it widely useful to various areas
which I can't reach :-)

So probably we can have a sysfs "enable" entry with default "n", or a
maximum swap-in order as Ryan suggested [1] at the beginning (a rough
sketch of such a cap follows his quote below):

"
So in the common case, swap-in will pull in the same size of folio as was
swapped-out. Is that definitely the right policy for all folio sizes? Certainly
it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
it makes sense for 2M THP; As the size increases the chances of actually needing
all of the folio reduces so chances are we are wasting IO. There are similar
arguments for CoW, where we currently copy 1 page per fault - it probably makes
sense to copy the whole folio up to a certain size.
"
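To make that concrete, below is a minimal sketch of such a cap, assuming
a hypothetical "swapin_max_order" tunable (the sysfs plumbing is omitted,
and none of this is the actual patch code); it simply masks off every
order above the limit before we pick a swap-in folio size:

	/*
	 * Hypothetical sketch only: "swapin_max_order" is an assumed
	 * tunable, not an existing kernel interface.
	 */
	static unsigned long swapin_allowable_orders(struct vm_fault *vmf,
						     unsigned int swapin_max_order)
	{
		struct vm_area_struct *vma = vmf->vma;
		unsigned long orders;

		/* orders permitted by the VMA and by the mTHP sysfs policy */
		orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true,
						  true, BIT(PMD_ORDER) - 1);
		orders = thp_vma_suitable_orders(vma, vmf->address, orders);

		/* keep only orders <= swapin_max_order */
		orders &= BIT(swapin_max_order + 1) - 1;

		return orders;
	}

With swapin_max_order == 0 only order-0 survives, i.e. mTHP swap-in is
disabled; with 2 we cap swapped-in folios at 16KiB, matching the examples
discussed further down.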
" > > >> > >> In normal swap-in path, we can take advantage of swap readahead > >> information to determine the swapped-in large folio order. That is, i= f > >> the return value of swapin_nr_pages() > 1, then we can try to allocate > >> and swapin a large folio. > > > > I am not quite sure we still need to depend on this. in do_anon_page, > > we have broken the assumption and allocated a large folio directly. > > I don't think that we have a sophisticated policy to allocate large > folio. Large folio could waste memory for some workloads, so I think > that it's a good idea to allocate large folio always. i agree, but we still have the below check just like do_anon_page() has it, orders =3D thp_vma_allowable_orders(vma, vma->vm_flags, false, true= , true, BIT(PMD_ORDER) - 1); orders =3D thp_vma_suitable_orders(vma, vmf->address, orders); in do_anon_page, we don't worry about the waste so much, the same logic also applies to do_swap_page(). > > Readahead gives us an opportunity to play with the policy. I feel somehow the rules of the game have changed with an upper limit for swap-in size. for example, if the upper limit is 4 order, it limits folio size to 64KiB which is still a proper size for ARM64 whose base page can be 64KiB. on the other hand, while swapping out large folios, we will always compress them as a whole(zsmalloc/zram patch will come in a couple of days), if we choose to decompress a subpage instead of a large folio in do_swap_page(), we might need to decompress nr_pages times. for example, For large folios 16*4KiB, they are saved as a large object in zsmalloc(with the coming patch), if we swap in a small folio, we decompress the large object; next time, we will still need to decompress a large object. so it is more sensible to swap in a large folio if we find those swap entries are contiguous and were allocated by a large folio swap-out. > > > On the other hand, compressing/decompressing large folios as a > > whole rather than doing it one by one can save a large percent of > > CPUs and provide a much lower compression ratio. With a hardware > > accelerator, this is even faster. > > I am not against to support large folio for compressing/decompressing. > > I just suggest to do that later, after we play with normal swap-in. > SWAP_SYCHRONOUS related swap-in code is an optimization based on normal > swap. So, it seems natural to support large folio swap-in for normal > swap-in firstly. I feel like SWAP_SYCHRONOUS is a simpler case and even more "normal" than the swapcache path since it is the majority. and on the other hand, a = lot of modification is required for the swapcache path. in OPPO's code[1], we d= id bring-up both paths, but the swapcache path is much much more complicated than the SYNC path and hasn't really noticeable improvement. [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplu= s/sm8650_u_14.0.0_oneplus12 > > > So I'd rather more aggressively get large folios swap-in involved > > than depending on readahead. > > We can take advantage of readahead algorithm in SWAP_SYCHRONOUS > optimization too. The sub-pages that is not accessed by page fault can > be treated as readahead. I think that is a better policy than > allocating large folio always. Considering the zsmalloc optimization, it would be a better choice to always allocate large folios if we find those swap entries are for a swapped-out large foli= o. as decompressing just once, we get all subpages. 
Some hardware accelerators are even able to decompress a large folio with
multiple hardware threads; for example, 16 hardware threads can decompress
the subpages of a large folio at the same time, so it is just as fast as
decompressing one subpage.

For platforms without the above optimizations, a proper upper limit will
help them disable large folio swap-in or decrease its impact. For example,
if the upper limit is order 0, this patchset is effectively removed; if
the upper limit is order 2, it is as if the base page size were 16KiB.

> >>
> >> To do that, we need to track whether the sub-pages are accessed. I
> >> guess we need that information for large file folio readahead too.
> >>
> >> Hi, Matthew,
> >>
> >> Can you help us on tracking whether the sub-pages of a readahead large
> >> folio have been accessed?
> >>
> >> > Right now, we are re-faulting large folios which are still in swapcache
> >> > as a whole. This can effectively decrease the extra loops and early
> >> > exits which we have added in arch_swap_restore() while supporting MTE
> >> > restore for folios rather than pages. On the other hand, it can also
> >> > reduce the number of do_swap_page() calls, as PTEs used to be set one
> >> > by one even when we hit a large folio in the swapcache.
> >> >
> >> > --
>
> Best Regards,
> Huang, Ying

Thanks
Barry