From: Barry Song <21cnbao@gmail.com>
Date: Tue, 22 Oct 2024 09:28:38 +1300
Subject: Re: [RFC 3/4] mm/zswap: add support for large folio zswapin
To: Usama Arif
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, hannes@cmpxchg.org,
	david@redhat.com, willy@infradead.org, kanchana.p.sridhar@intel.com,
	yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
	ryan.roberts@arm.com, ying.huang@intel.com, riel@surriel.com,
	shakeel.butt@linux.dev, kernel-team@meta.com,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
References: <20241018105026.2521366-1-usamaarif642@gmail.com>
	<20241018105026.2521366-4-usamaarif642@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Tue, Oct 22, 2024 at 1:21 AM Usama Arif wrote:
>
>
>
> On 21/10/2024 11:55, Barry Song wrote:
> > On Mon, Oct 21, 2024 at 11:44 PM Usama Arif wrote:
> >>
> >>
> >>
> >> On 21/10/2024 06:49, Barry Song wrote:
> >>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif wrote:
> >>>>
> >>>> At the time of folio allocation, alloc_swap_folio checks if the entire
> >>>> folio is in zswap to determine the folio order.
> >>>> During swap_read_folio, zswap_load will check if the entire folio
> >>>> is in zswap, and if it is, it will iterate through the pages in the
> >>>> folio and decompress them.
> >>>> This will mean the benefits of large folios (fewer page faults, batched
> >>>> PTE and rmap manipulation, reduced LRU list, TLB coalescing (for arm64
> >>>> and amd)) are not lost at swap out when using zswap.
> >>>> This patch does not add support for hybrid backends (i.e. folios
> >>>> partly present in swap and zswap).
> >>>>
> >>>> Signed-off-by: Usama Arif
> >>>> ---
> >>>>  mm/memory.c | 13 +++-------
> >>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++------------------------
> >>>>  2 files changed, 34 insertions(+), 47 deletions(-)
> >>>>
> >>>> diff --git a/mm/memory.c b/mm/memory.c
> >>>> index 49d243131169..75f7b9f5fb32 100644
> >>>> --- a/mm/memory.c
> >>>> +++ b/mm/memory.c
> >>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> >>>>
> >>>>         /*
> >>>>          * swap_read_folio() can't handle the case a large folio is hybridly
> >>>> -        * from different backends. And they are likely corner cases. Similar
> >>>> -        * things might be added once zswap support large folios.
> >>>> +        * from different backends. And they are likely corner cases.
> >>>>          */
> >>>>         if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> >>>>                 return false;
> >>>>         if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> >>>>                 return false;
> >>>> +       if (unlikely(!zswap_present_test(entry, nr_pages)))
> >>>> +               return false;
> >>>>
> >>>>         return true;
> >>>>  }
> >>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >>>>         if (unlikely(userfaultfd_armed(vma)))
> >>>>                 goto fallback;
> >>>>
> >>>> -       /*
> >>>> -        * A large swapped out folio could be partially or fully in zswap. We
> >>>> -        * lack handling for such cases, so fallback to swapping in order-0
> >>>> -        * folio.
> >>>> -        */
> >>>> -       if (!zswap_never_enabled())
> >>>> -               goto fallback;
> >>>> -
> >>>>         entry = pte_to_swp_entry(vmf->orig_pte);
> >>>>         /*
> >>>>          * Get a list of all the (large) orders below PMD_ORDER that are enabled
> >>>> diff --git a/mm/zswap.c b/mm/zswap.c
> >>>> index 9cc91ae31116..a5aa86c24060 100644
> >>>> --- a/mm/zswap.c
> >>>> +++ b/mm/zswap.c
> >>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
> >>>>
> >>>>  bool zswap_load(struct folio *folio)
> >>>>  {
> >>>> +       int nr_pages = folio_nr_pages(folio);
> >>>>         swp_entry_t swp = folio->swap;
> >>>> +       unsigned int type = swp_type(swp);
> >>>>         pgoff_t offset = swp_offset(swp);
> >>>>         bool swapcache = folio_test_swapcache(folio);
> >>>> -       struct xarray *tree = swap_zswap_tree(swp);
> >>>> +       struct xarray *tree;
> >>>>         struct zswap_entry *entry;
> >>>> +       int i;
> >>>>
> >>>>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >>>>
> >>>>         if (zswap_never_enabled())
> >>>>                 return false;
> >>>>
> >>>> -       /*
> >>>> -        * Large folios should not be swapped in while zswap is being used, as
> >>>> -        * they are not properly handled. Zswap does not properly load large
> >>>> -        * folios, and a large folio may only be partially in zswap.
> >>>> -        *
> >>>> -        * Return true without marking the folio uptodate so that an IO error is
> >>>> -        * emitted (e.g. do_swap_page() will sigbus).
> >>>> -        */
> >>>> -       if (WARN_ON_ONCE(folio_test_large(folio)))
> >>>> -               return true;
> >>>> -
> >>>> -       /*
> >>>> -        * When reading into the swapcache, invalidate our entry. The
> >>>> -        * swapcache can be the authoritative owner of the page and
> >>>> -        * its mappings, and the pressure that results from having two
> >>>> -        * in-memory copies outweighs any benefits of caching the
> >>>> -        * compression work.
> >>>> -        *
> >>>> -        * (Most swapins go through the swapcache. The notable
> >>>> -        * exception is the singleton fault on SWP_SYNCHRONOUS_IO
> >>>> -        * files, which reads into a private page and may free it if
> >>>> -        * the fault fails. We remain the primary owner of the entry.)
> >>>> -        */
> >>>> -       if (swapcache)
> >>>> -               entry = xa_erase(tree, offset);
> >>>> -       else
> >>>> -               entry = xa_load(tree, offset);
> >>>> -
> >>>> -       if (!entry)
> >>>> +       if (!zswap_present_test(folio->swap, nr_pages))
> >>>>                 return false;
> >>>
> >>> Hi Usama,
> >>>
> >>> Is there any chance that zswap_present_test() returns true
> >>> in do_swap_page() but false in zswap_load()? If that's
> >>> possible, could we be missing something? For example,
> >>> could it be that zswap has been partially released (with
> >>> part of it still present) during an mTHP swap-in?
> >>>
> >>> If this happens with an mTHP, my understanding is that
> >>> we shouldn't proceed with reading corrupted data from the
> >>> disk backend.
> >>>
> >>
> >> If it's not swapcache, the zswap entry is not deleted, so I think
> >> it should be ok?
> >>
> >> We can check over here if the entire folio is in zswap,
> >> and if not, return true without marking the folio uptodate
> >> to give an error.
> >
> > We have swapcache_prepare() called in do_swap_page(), which should
> > have protected these entries from being partially freed by other processes
> > (for example, if someone falls back to small folios for the same address).
> > Therefore, I believe that zswap_present_test() cannot be false for mTHP in
> > the current case where only synchronous I/O is supported.
> >
> > the below might help detect the bug?
> >
> > if (!zswap_present_test(folio->swap, nr_pages)) {
> >         if (WARN_ON_ONCE(nr_pages > 1))
> >                 return true;
> >         return false;
> > }
> >
>
> I think this isn't correct. If nr_pages > 1 and the entire folio is not in
> zswap, it should still return false. So we would need to check the whole
> folio if we want to warn. But I think if we are sure the code is ok, it is
> an unnecessary check.

My point is that zswap_present_test() can't differentiate between:
1. the *whole* folio is not in zswap
2. the folio is *partially* not in zswap

In case 2, returning false is wrong. And when nr_pages > 1, we have
already confirmed earlier in do_swap_page() that zswap_present_test()
is true. At this point, it must always be true; if it's false, it
indicates a bug.
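
To make the two cases explicit, zswap_load() could use something along
the lines of the untested sketch below; zswap_present_count() is a
made-up helper for illustration, not part of this series:

/*
 * Untested sketch: count how many of the nr_pages swap entries
 * starting at @swp currently have a zswap entry, so the caller can
 * tell "none in zswap" (fall back to the swap backend) apart from
 * "partially in zswap" (a bug when nr_pages > 1, since do_swap_page()
 * has already checked the whole range).
 */
static int zswap_present_count(swp_entry_t swp, int nr_pages)
{
        unsigned int type = swp_type(swp);
        pgoff_t offset = swp_offset(swp);
        int i, present = 0;

        for (i = 0; i < nr_pages; i++) {
                struct xarray *tree = swap_zswap_tree(swp_entry(type, offset + i));

                if (xa_load(tree, offset + i))
                        present++;
        }
        return present;
}

zswap_load() would then return false only when the count is 0, and
warn and return true without marking the folio uptodate (so the fault
errors out rather than reading stale data from disk) whenever
0 < count < nr_pages.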

>
> > the code seems quite ugly :-) do we have some way to unify the code
> > for large and small folios?
> >
> > not quite sure about shmem though....
> >
>
> If it's shmem, and the swap_count goes to 1, I think it's still ok? because
> then the folio will be gotten from swap_cache_get_folio if it has already
> been in swapcache.
>
> >>
> >>
> >>>>
> >>>> -       zswap_decompress(entry, &folio->page);
> >>>> +       for (i = 0; i < nr_pages; ++i) {
> >>>> +               tree = swap_zswap_tree(swp_entry(type, offset + i));
> >>>> +               /*
> >>>> +                * When reading into the swapcache, invalidate our entry. The
> >>>> +                * swapcache can be the authoritative owner of the page and
> >>>> +                * its mappings, and the pressure that results from having two
> >>>> +                * in-memory copies outweighs any benefits of caching the
> >>>> +                * compression work.
> >>>> +                *
> >>>> +                * (Swapins with swap count > 1 go through the swapcache.
> >>>> +                * For swap count == 1, the swapcache is skipped and we
> >>>> +                * remain the primary owner of the entry.)
> >>>> +                */
> >>>> +               if (swapcache)
> >>>> +                       entry = xa_erase(tree, offset + i);
> >>>> +               else
> >>>> +                       entry = xa_load(tree, offset + i);
> >>>>
> >>>> -       count_vm_event(ZSWPIN);
> >>>> -       if (entry->objcg)
> >>>> -               count_objcg_events(entry->objcg, ZSWPIN, 1);
> >>>> +               zswap_decompress(entry, folio_page(folio, i));
> >>>>
> >>>> -       if (swapcache) {
> >>>> -               zswap_entry_free(entry);
> >>>> -               folio_mark_dirty(folio);
> >>>> +               if (entry->objcg)
> >>>> +                       count_objcg_events(entry->objcg, ZSWPIN, 1);
> >>>> +               if (swapcache)
> >>>> +                       zswap_entry_free(entry);
> >>>>         }
> >>>>
> >>>> +       count_vm_events(ZSWPIN, nr_pages);
> >>>> +       if (swapcache)
> >>>> +               folio_mark_dirty(folio);
> >>>> +
> >>>>         folio_mark_uptodate(folio);
> >>>>         return true;
> >>>>  }
> >>>> --
> >>>> 2.43.5
> >>>>
> >>>
> >
> > Thanks
> > barry
>