From: Barry Song <21cnbao@gmail.com>
Date: Tue, 30 Jul 2024 01:11:31 +1200
Subject: Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
To: Matthew Wilcox
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, ying.huang@intel.com,
 baolin.wang@linux.alibaba.com, chrisl@kernel.org, david@redhat.com,
 hannes@cmpxchg.org, hughd@google.com, kaleshsingh@google.com,
 kasong@tencent.com, linux-kernel@vger.kernel.org, mhocko@suse.com,
 minchan@kernel.org, nphamcs@gmail.com, ryan.roberts@arm.com,
 senozhatsky@chromium.org, shakeel.butt@linux.dev, shy828301@gmail.com,
 surenb@google.com, v-songbaohua@oppo.com, xiang@kernel.org,
 yosryahmed@google.com, Chuanhua Han
References: <20240726094618.401593-1-21cnbao@gmail.com> <20240726094618.401593-4-21cnbao@gmail.com>
On Tue, Jul 30, 2024 at 12:49 AM Matthew Wilcox wrote:
>
> On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote:
> > On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox wrote:
> > > >
> > > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > > -					vma, vmf->address, false);
> > > > > +		folio = alloc_swap_folio(vmf);
> > > > > 		page = &folio->page;
> > > >
> > > > This is no longer correct.  You need to set 'page' to the precise page
> > > > that is being faulted rather than the first page of the folio.  It was
> > > > fine before because it always allocated a single-page folio, but now it
> > > > must use folio_page() or folio_file_page() (whichever has the correct
> > > > semantics for you).
> > > >
> > > > Also you need to fix your test suite to notice this bug.  I suggest
> > > > doing that first so that you know whether you've got the calculation
> > > > correct.
> > >
> > > I don't understand why the code is designed in the way the page
> > > is the first page of this folio. Otherwise, we need lots of changes
> > > later while mapping the folio in ptes and rmap.
>
> What?
>
>         folio = swap_cache_get_folio(entry, vma, vmf->address);
>         if (folio)
>                 page = folio_file_page(folio, swp_offset(entry));
>
> page is the precise page, not the first page of the folio.

This is the case where we may get a large folio from the swapcache but
end up mapping only one subpage, because the condition for mapping the
whole folio is not met.
If we meet the condition, we instead set page to the head page and map
the whole mTHP:

        if (folio_test_large(folio) && folio_test_swapcache(folio)) {
                int nr = folio_nr_pages(folio);
                unsigned long idx = folio_page_idx(folio, page);
                unsigned long folio_start = address - idx * PAGE_SIZE;
                unsigned long folio_end = folio_start + nr * PAGE_SIZE;
                pte_t *folio_ptep;
                pte_t folio_pte;

                if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
                        goto check_folio;
                if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
                        goto check_folio;
                folio_ptep = vmf->pte - idx;
                folio_pte = ptep_get(folio_ptep);
                if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
                    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
                        goto check_folio;

                page_idx = idx;
                address = folio_start;
                ptep = folio_ptep;
                nr_pages = nr;
                entry = folio->swap;
                page = &folio->page;
        }

> > For both accessing large folios in the swapcache and allocating
> > new large folios, the page points to the first page of the folio. we
> > are mapping the whole folio not the specific page.
>
> But what address are we mapping the whole folio at?
>
> > for swapcache cases, you can find the same thing here,
> >
> > if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >         ...
> >         entry = folio->swap;
> >         page = &folio->page;
> > }
>
> Yes, but you missed some important lines from your quote:
>
>                 page_idx = idx;
>                 address = folio_start;
>                 ptep = folio_ptep;
>                 nr_pages = nr;
>
> We deliberate adjust the address so that, yes, we're mapping the entire
> folio, but we're mapping it at an address that means that the page we
> actually faulted on ends up at the address that we faulted on.

For this zRAM case, it is a newly allocated large folio; only when all
the conditions are met do we allocate and map the whole folio. You can
check can_swapin_thp() and thp_swap_suitable_orders():
static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
{
        struct swap_info_struct *si;
        unsigned long addr;
        swp_entry_t entry;
        pgoff_t offset;
        char has_cache;
        int idx, i;
        pte_t pte;

        addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
        idx = (vmf->address - addr) / PAGE_SIZE;
        pte = ptep_get(ptep);

        if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
                return false;
        entry = pte_to_swp_entry(pte);
        offset = swp_offset(entry);
        if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
                return false;

        si = swp_swap_info(entry);
        has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
        for (i = 1; i < nr_pages; i++) {
                /*
                 * While allocating a large folio and doing swap_read_folio for
                 * the SWP_SYNCHRONOUS_IO path, which is the case where the
                 * faulted pte doesn't have swapcache, we need to ensure all
                 * PTEs have no cache as well; otherwise, we might go to swap
                 * devices while the content is in swapcache.
                 */
                if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
                        return false;
        }

        return true;
}

and

static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
        ....
        entry = pte_to_swp_entry(vmf->orig_pte);
        /*
         * Get a list of all the (large) orders below PMD_ORDER that are
         * enabled and suitable for swapping THP.
         */
        orders = thp_vma_allowable_orders(vma, vma->vm_flags,
                        TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS,
                        BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
        orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address,
                                          orders);
        ....
}

static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
                unsigned long addr, unsigned long orders)
{
        int order, nr;

        order = highest_order(orders);

        /*
         * To swap-in a THP with nr pages, we require its first swap_offset
         * is aligned with nr. This can filter out most invalid entries.
         */
        while (orders) {
                nr = 1 << order;
                if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
                        break;
                order = next_order(&orders, order);
        }

        return orders;
}

An mTHP is swapped out at an aligned swap offset, and we only swap in
aligned mTHPs. If an mTHP is somehow mremap()ed to an unaligned address,
we won't swap it in as a large folio. For the swapcache case we still
handle unaligned mTHPs, but for newly allocated mTHPs it is a different
story: there is no need to support unaligned mTHPs, and no way to
support them unless something is recorded in the swap device to say
there was an mTHP there.

Thanks
Barry