From: Barry Song <21cnbao@gmail.com>
Date: Tue, 30 Jul 2024 01:11:31 +1200
Subject: Re: [PATCH v5 3/4] mm: support large folios swapin as a whole for zRAM-like swapfile
To: Matthew Wilcox
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, ying.huang@intel.com,
 baolin.wang@linux.alibaba.com, chrisl@kernel.org, david@redhat.com,
 hannes@cmpxchg.org, hughd@google.com, kaleshsingh@google.com,
 kasong@tencent.com, linux-kernel@vger.kernel.org, mhocko@suse.com,
 minchan@kernel.org, nphamcs@gmail.com, ryan.roberts@arm.com,
 senozhatsky@chromium.org, shakeel.butt@linux.dev, shy828301@gmail.com,
 surenb@google.com, v-songbaohua@oppo.com, xiang@kernel.org,
 yosryahmed@google.com, Chuanhua Han
References: <20240726094618.401593-1-21cnbao@gmail.com> <20240726094618.401593-4-21cnbao@gmail.com>
On Tue, Jul 30, 2024 at 12:49 AM Matthew Wilcox wrote:
>
> On Mon, Jul 29, 2024 at 04:46:42PM +1200, Barry Song wrote:
> > On Mon, Jul 29, 2024 at 4:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Mon, Jul 29, 2024 at 3:51 PM Matthew Wilcox wrote:
> > > >
> > > > On Fri, Jul 26, 2024 at 09:46:17PM +1200, Barry Song wrote:
> > > > > -		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
> > > > > -					vma, vmf->address, false);
> > > > > +		folio = alloc_swap_folio(vmf);
> > > > > 		page = &folio->page;
> > > >
> > > > This is no longer correct.  You need to set 'page' to the precise page
> > > > that is being faulted rather than the first page of the folio.  It was
> > > > fine before because it always allocated a single-page folio, but now it
> > > > must use folio_page() or folio_file_page() (whichever has the correct
> > > > semantics for you).
> > > >
> > > > Also you need to fix your test suite to notice this bug.  I suggest
> > > > doing that first so that you know whether you've got the calculation
> > > > correct.
> > >
> > > I don't understand why the code is designed in the way the page
> > > is the first page of this folio. Otherwise, we need lots of changes
> > > later while mapping the folio in ptes and rmap.
>
> What?
>
>         folio = swap_cache_get_folio(entry, vma, vmf->address);
>         if (folio)
>                 page = folio_file_page(folio, swp_offset(entry));
>
> page is the precise page, not the first page of the folio.

This is the case where we may get a large folio from the swapcache but
end up mapping only one subpage, because the condition for mapping the
whole folio is not met.
If we meet the condition, we instead set page to the head page and map
the whole mTHP:

        if (folio_test_large(folio) && folio_test_swapcache(folio)) {
                int nr = folio_nr_pages(folio);
                unsigned long idx = folio_page_idx(folio, page);
                unsigned long folio_start = address - idx * PAGE_SIZE;
                unsigned long folio_end = folio_start + nr * PAGE_SIZE;
                pte_t *folio_ptep;
                pte_t folio_pte;

                if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
                        goto check_folio;
                if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
                        goto check_folio;
                folio_ptep = vmf->pte - idx;
                folio_pte = ptep_get(folio_ptep);
                if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
                    swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
                        goto check_folio;

                page_idx = idx;
                address = folio_start;
                ptep = folio_ptep;
                nr_pages = nr;
                entry = folio->swap;
                page = &folio->page;
        }

> > For both accessing large folios in the swapcache and allocating
> > new large folios, the page points to the first page of the folio. we
> > are mapping the whole folio not the specific page.
>
> But what address are we mapping the whole folio at?
>
> > for swapcache cases, you can find the same thing here,
> >
> > if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> >         ...
> >         entry = folio->swap;
> >         page = &folio->page;
> > }
>
> Yes, but you missed some important lines from your quote:
>
>                 page_idx = idx;
>                 address = folio_start;
>                 ptep = folio_ptep;
>                 nr_pages = nr;
>
> We deliberate adjust the address so that, yes, we're mapping the entire
> folio, but we're mapping it at an address that means that the page we
> actually faulted on ends up at the address that we faulted on.

For this zRAM case, it is a newly allocated large folio; only when all
the conditions are met do we allocate and map the whole folio. You can
check can_swapin_thp() and thp_swap_suitable_orders():
static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
{
        struct swap_info_struct *si;
        unsigned long addr;
        swp_entry_t entry;
        pgoff_t offset;
        char has_cache;
        int idx, i;
        pte_t pte;

        addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
        idx = (vmf->address - addr) / PAGE_SIZE;
        pte = ptep_get(ptep);

        if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
                return false;
        entry = pte_to_swp_entry(pte);
        offset = swp_offset(entry);
        if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
                return false;

        si = swp_swap_info(entry);
        has_cache = si->swap_map[offset] & SWAP_HAS_CACHE;
        for (i = 1; i < nr_pages; i++) {
                /*
                 * While allocating a large folio and doing swap_read_folio for
                 * the SWP_SYNCHRONOUS_IO path, which is the case where the
                 * faulted pte doesn't have swapcache, we need to ensure all
                 * PTEs have no cache as well; otherwise, we might go to swap
                 * devices while the content is in swapcache.
                 */
                if ((si->swap_map[offset + i] & SWAP_HAS_CACHE) != has_cache)
                        return false;
        }

        return true;
}

and

static struct folio *alloc_swap_folio(struct vm_fault *vmf)
{
        ....
        entry = pte_to_swp_entry(vmf->orig_pte);
        /*
         * Get a list of all the (large) orders below PMD_ORDER that are
         * enabled and suitable for swapping THP.
         */
        orders = thp_vma_allowable_orders(vma, vma->vm_flags,
                        TVA_IN_PF | TVA_IN_SWAPIN | TVA_ENFORCE_SYSFS,
                        BIT(PMD_ORDER) - 1);
        orders = thp_vma_suitable_orders(vma, vmf->address, orders);
        orders = thp_swap_suitable_orders(swp_offset(entry), vmf->address,
                                          orders);
        ....
}

static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
                unsigned long addr, unsigned long orders)
{
        int order, nr;

        order = highest_order(orders);

        /*
         * To swap-in a THP with nr pages, we require its first swap_offset
         * is aligned with nr. This can filter out most invalid entries.
         */
        while (orders) {
                nr = 1 << order;
                if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
                        break;
                order = next_order(&orders, order);
        }

        return orders;
}

An mTHP is swapped out at an aligned swap offset, and we only swap in
aligned mTHPs. If an mTHP is somehow mremap()ed to an unaligned address,
we won't swap it in as a large folio. For the swapcache case we still
handle unaligned mTHPs, but for newly allocated mTHPs it is a different
story: there is no need to support unaligned mTHPs, and no way to
support them unless something is recorded in the swap device to say
there was an mTHP there.

Thanks
Barry