From: Barry Song <21cnbao@gmail.com>
Date: Mon, 18 Mar 2024 15:41:25 +1300
Subject: Re: [RFC PATCH v3 5/5] mm: support large folios swapin as a whole
To: "Huang, Ying"
Cc: Matthew Wilcox, akpm@linux-foundation.org, linux-mm@kvack.org,
	ryan.roberts@arm.com, chengming.zhou@linux.dev, chrisl@kernel.org,
	david@redhat.com, hannes@cmpxchg.org, kasong@tencent.com,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	mhocko@suse.com, nphamcs@gmail.com, shy828301@gmail.com,
	steven.price@arm.com, surenb@google.com, wangkefeng.wang@huawei.com,
	xiang@kernel.org, yosryahmed@google.com, yuzhao@google.com,
	Chuanhua Han, Barry Song
In-Reply-To: <87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com>
References: <20240304081348.197341-1-21cnbao@gmail.com>
	<20240304081348.197341-6-21cnbao@gmail.com>
	<87wmq3yji6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87sf0rx3d6.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87jzm0wblq.fsf@yhuang6-desk2.ccr.corp.intel.com>

On Mon, Mar 18, 2024 at 2:54 PM Huang, Ying wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Fri, Mar 15, 2024 at 10:17 PM Huang, Ying wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Fri, Mar 15, 2024 at 9:43 PM Huang, Ying wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > From: Chuanhua Han
> >> >> >
> >> >> > On an embedded system like Android, more than half of anon memory is
> >> >> > actually in swap devices such as zRAM. For example, while an app is
> >> >> > switched to background, most of its memory might be swapped out.
> >> >> >
> >> >> > Now we have mTHP features. Unfortunately, if we don't support large
> >> >> > folio swap-in, once those large folios are swapped out, we
> >> >> > immediately lose the performance gain we can get through large
> >> >> > folios and hardware optimizations such as CONT-PTE.
> >> >> >
> >> >> > This patch brings up mTHP swap-in support. Right now, we limit mTHP
> >> >> > swap-in to those contiguous swaps which were likely swapped out from
> >> >> > an mTHP as a whole.
> >> >> >
> >> >> > Meanwhile, the current implementation only covers the
> >> >> > SWAP_SYNCHRONOUS case. It doesn't support swapin_readahead as large
> >> >> > folios yet, since this kind of shared memory is much less common
> >> >> > than memory mapped by a single process.
> >> >>
> >> >> In contrast, I still think that it's better to start with the normal
> >> >> swap-in path, then expand to the SWAP_SYNCHRONOUS case.
> >> >
> >> > I'd rather try the reverse direction, as non-sync anon memory is only
> >> > around 3% on a phone, in my observation.
> >>
> >> Phones are not the only platform that Linux is running on.
> >
> > I suppose it's generally true that forked shared anonymous pages only
> > constitute a small portion of all anonymous pages. The majority of
> > anonymous pages are within a single process.
>
> Yes. But IIUC, SWP_SYNCHRONOUS_IO is quite limited; it is set only for
> memory-backed swap devices.

SWP_SYNCHRONOUS_IO is the most common case for embedded Linux. Note that
almost all Android/embedded devices use zRAM rather than a disk for swap.

And we can have an upper-limit order, or a new control like
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/swapin,
and set it to 0 by default at first.
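(Illustration only: a minimal sketch of how such a per-size knob could
gate swap-in orders. The swapin_orders_enabled mask and the helper below
are hypothetical, not existing kernel code.)

        /*
         * Hypothetical sketch: each hugepages-<size>kB/swapin sysfs file
         * sets one bit in swapin_orders_enabled; swap-in then only
         * considers orders whose bit is set, so the default mask of 0
         * keeps mTHP swap-in disabled until userspace opts in.
         */
        static unsigned long swapin_orders_enabled;	/* default 0: disabled */

        static inline unsigned long swapin_filter_orders(unsigned long orders)
        {
                return orders & READ_ONCE(swapin_orders_enabled);
        }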
>
> > I agree phones are not the only platform. But Rome wasn't built in a
> > day. I can only get started on hardware which I can easily reach and
> > have enough hardware/test resources on. So we may take the first step,
> > which can be applied on a real product and improve its performance, and
> > step by step we broaden it and make it widely useful to various areas
> > which I can't reach :-)
>
> We must guarantee the normal swap path runs correctly and has no
> performance regression when developing the SWP_SYNCHRONOUS_IO
> optimization. So we have to put some effort into testing the normal path
> anyway.
>
> > so probably we can have a sysfs "enable" entry with default "n", or a
> > maximum swap-in order as Ryan suggested [1] at the beginning:
> >
> > "
> > So in the common case, swap-in will pull in the same size of folio as was
> > swapped-out. Is that definitely the right policy for all folio sizes? Certainly
> > it makes sense for "small" large folios (e.g. up to 64K IMHO). But I'm not sure
> > it makes sense for 2M THP; As the size increases the chances of actually needing
> > all of the folio reduces so chances are we are wasting IO. There are similar
> > arguments for CoW, where we currently copy 1 page per fault - it probably makes
> > sense to copy the whole folio up to a certain size.
> > "
> >
> >>
> >> >>
> >> >> In the normal swap-in path, we can take advantage of swap readahead
> >> >> information to determine the swapped-in large folio order. That is,
> >> >> if the return value of swapin_nr_pages() > 1, then we can try to
> >> >> allocate and swap in a large folio.
> >> >
> >> > I am not quite sure we still need to depend on this. In do_anon_page,
> >> > we have broken the assumption and allocated a large folio directly.
> >>
> >> I don't think that we have a sophisticated policy to allocate large
> >> folios. Large folios could waste memory for some workloads, so I don't
> >> think that it's a good idea to always allocate large folios.
> >
> > I agree, but we still have the below check, just like do_anon_page()
> > has:
> >
> >         orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
> >                                           BIT(PMD_ORDER) - 1);
> >         orders = thp_vma_suitable_orders(vma, vmf->address, orders);
> >
> > In do_anon_page, we don't worry about the waste so much; the same logic
> > also applies to do_swap_page().
>
> As I said, "readahead" may save us from application/user-specific
> configuration in most cases. It can be a starting point of "using mTHP
> automatically when it helps and not causing many issues".

I'd rather start from the simpler code path and really improve things on
phones & embedded Linux, which our team can actually reach :-)

>
> >>
> >> Readahead gives us an opportunity to play with the policy.
> >
> > I feel somehow the rules of the game have changed with an upper limit
> > for swap-in size. For example, if the upper limit is order 4, it limits
> > folio size to 64KiB, which is still a proper size for ARM64, whose base
> > page can be 64KiB.
> >
> > On the other hand, while swapping out large folios, we will always
> > compress them as a whole (a zsmalloc/zRAM patch will come in a couple
> > of days). If we choose to decompress a subpage instead of a large folio
> > in do_swap_page(), we might need to decompress nr_pages times. For
> > example:
> >
> > Large folios of 16 * 4KiB are saved as one large object in zsmalloc
> > (with the coming patch). If we swap in a small folio, we decompress the
> > large object; next time, we will still need to decompress the large
> > object. So it is more sensible to swap in a large folio if we find
> > those swap entries are contiguous and were allocated by a large-folio
> > swap-out.
>
> I understand that there are some special requirements for ZRAM. But I
> don't think it's a good idea to force the general code to fit the
> requirements of a specific swap device too much. This is one of the
> reasons that I think we should start with normal swap devices, then try
> to optimize for some specific devices.

I agree, but we are having a good start. zRAM is not one specific device;
it widely represents embedded Linux.
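To make the contiguity idea above concrete, the check could look roughly
like the sketch below. This is my illustration, not the patch's exact
code, and swap_entries_contiguous() is a hypothetical name.

        /*
         * Hedged sketch: return true if the nr PTEs starting at ptep hold
         * contiguous swap entries of one swap device, naturally aligned
         * to nr, i.e. they were most likely swapped out together from a
         * single large folio. Assumes ptep points at the first PTE of the
         * aligned range and the page table lock is held.
         */
        static bool swap_entries_contiguous(pte_t *ptep, swp_entry_t entry, int nr)
        {
                unsigned long start = ALIGN_DOWN(swp_offset(entry), nr);
                int i;

                for (i = 0; i < nr; i++) {
                        pte_t pte = ptep_get(ptep + i);
                        swp_entry_t e;

                        if (!is_swap_pte(pte))
                                return false;
                        e = pte_to_swp_entry(pte);
                        if (non_swap_entry(e) ||
                            swp_type(e) != swp_type(entry) ||
                            swp_offset(e) != start + i)
                                return false;
                }
                return true;
        }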
>
> >>
> >> > On the other hand, compressing/decompressing large folios as a whole
> >> > rather than doing it one by one can save a large percentage of CPU
> >> > time and provide a much better compression ratio. With a hardware
> >> > accelerator, this is even faster.
> >>
> >> I am not against supporting large folios for compressing/decompressing.
> >>
> >> I just suggest doing that later, after we play with normal swap-in.
> >> SWAP_SYNCHRONOUS-related swap-in code is an optimization based on
> >> normal swap. So, it seems natural to support large folio swap-in for
> >> normal swap-in first.
> >
> > I feel like SWAP_SYNCHRONOUS is a simpler case and even more "normal"
> > than the swapcache path, since it is the majority.
>
> I don't think so. Most PC and server systems use !SWAP_SYNCHRONOUS swap
> devices.

The problem is that our team is all focusing on phones; we won't have any
resources or bandwidth for PC and server. A more realistic goal is that we
at least let the solution benefit phones and similar embedded Linux first,
and extend it to more areas such as PC and server step by step.

I'd be quite happy if you or other people can join in on PC and server.

>
> > and on the other hand, a lot of modification is required for the
> > swapcache path. In OPPO's code [1], we did bring up both paths, but the
> > swapcache path is much, much more complicated than the SYNC path and
> > hasn't shown really noticeable improvement.
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8650/tree/oneplus/sm8650_u_14.0.0_oneplus12
>
> That's great. Please clean up the code and post it to the mailing list.
> Why doesn't it help? IIUC, it can optimize TLB at least.

I agree this can improve things, but most anon pages are single-process
mapped; only quite a few pages go through the readahead code path on
phones. That's why there is no noticeable improvement in the end. I
understand all the benefits you mentioned of changing readahead, but
simply because those kinds of pages are really, really rare, improving
that path doesn't help Android devices much.

>
> >>
> >> > So I'd rather more aggressively get large folio swap-in involved
> >> > than depend on readahead.
> >>
> >> We can take advantage of the readahead algorithm in the
> >> SWAP_SYNCHRONOUS optimization too. The sub-pages that are not accessed
> >> by page fault can be treated as readahead. I think that is a better
> >> policy than always allocating large folios.

This is also true in do_anonymous_page(), but we don't worry about it too
much there, as we have per-size control: the workload has the chance to
set its preferences.

        /*
         * Get a list of all the (large) orders below PMD_ORDER that are enabled
         * for this vma. Then filter out the orders that can't be allocated over
         * the faulting address and still be fully contained in the vma.
         */
        orders = thp_vma_allowable_orders(vma, vma->vm_flags, false, true, true,
                                          BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
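Combined with a contiguity check like the one sketched earlier, order
selection in the swap-in path could then work roughly as below (my hedged
sketch, modeled on the equivalent loop in alloc_anon_folio();
swap_entries_contiguous() is still a hypothetical name).

        /*
         * Hedged sketch: walk down from the highest enabled order and
         * pick the first order whose naturally aligned run of swap
         * entries is contiguous; order 0 remains the fallback once
         * 'orders' is exhausted. 'pte' is assumed to point at the mapped
         * and locked PTE for vmf->address holding the swap entry 'entry'.
         */
        order = highest_order(orders);
        while (orders) {
                addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
                ptep = pte - ((vmf->address - addr) >> PAGE_SHIFT);
                if (swap_entries_contiguous(ptep, entry, 1 << order))
                        break;
                order = next_order(&orders, order);
        }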
On the other hand, we are not always allocating large folios; we allocate
large folios only when the swapped-out folio was large. This is quite
important for embedded Linux, as swap happens so often: more than half of
memory can be in swap. If we swap folios out as large folios but swap them
in as small ones, we immediately lose all the advantages, such as fewer
page faults, CONT-PTE, etc.

>
> > Considering the zsmalloc optimization, it would be a better choice to
> > always allocate large folios if we find those swap entries are for a
> > swapped-out large folio, as by decompressing just once we get all the
> > subpages. Some hardware accelerators are even able to decompress a
> > large folio with multiple hardware threads; for example, 16 hardware
> > threads can decompress each subpage of a large folio at the same time,
> > so it is just as fast as decompressing one subpage.
> >
> > For platforms without the above optimizations, a proper upper limit
> > will help them disable large folio swap-in or decrease the impact. For
> > example, if the upper limit is order 0, we are just removing this
> > patchset; if the upper limit is order 2, it is just as if the BASE_PAGE
> > size were 16KiB.
> >
> >>
> >> >>
> >> >> To do that, we need to track whether the sub-pages are accessed. I
> >> >> guess we need that information for large file folio readahead too.
> >> >>
> >> >> Hi, Matthew,
> >> >>
> >> >> Can you help us with tracking whether the sub-pages of a readahead
> >> >> large folio have been accessed?
> >> >>
> >> >> > Right now, we are re-faulting large folios which are still in
> >> >> > swapcache as a whole. This can effectively decrease the extra
> >> >> > loops and early exits which we have introduced in
> >> >> > arch_swap_restore() while supporting MTE restore for folios rather
> >> >> > than pages. On the other hand, it can also decrease do_swap_page()
> >> >> > overhead, as PTEs used to be set one by one even when we hit a
> >> >> > large folio in the swapcache.
> >> >> >
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry