From: Usama Arif <usamaarif642@gmail.com>
Date: Mon, 21 Oct 2024 13:21:04 +0100
Subject: Re: [RFC 3/4] mm/zswap: add support for large folio zswapin
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, hannes@cmpxchg.org,
 david@redhat.com, willy@infradead.org, kanchana.p.sridhar@intel.com,
 yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
 ryan.roberts@arm.com, ying.huang@intel.com, riel@surriel.com,
 shakeel.butt@linux.dev, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
References: <20241018105026.2521366-1-usamaarif642@gmail.com>
 <20241018105026.2521366-4-usamaarif642@gmail.com>
On 21/10/2024 11:55, Barry Song wrote:
> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif wrote:
>>
>> On 21/10/2024 06:49, Barry Song wrote:
>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif wrote:
>>>>
>>>> At the time of folio allocation, alloc_swap_folio checks if the
>>>> entire folio is in zswap to determine the folio order.
>>>> During swap_read_folio, zswap_load will check if the entire folio
>>>> is in zswap, and if it is, it will iterate through the pages in
>>>> the folio and decompress them.
>>>> This means the benefits of large folios (fewer page faults, batched
>>>> PTE and rmap manipulation, shorter LRU lists, TLB coalescing for
>>>> arm64 and amd) are no longer lost at swapout when zswap is used.
>>>> This patch does not add support for hybrid backends (i.e. folios
>>>> partly present in swap and partly in zswap).
>>>>
>>>> Signed-off-by: Usama Arif
>>>> ---
>>>>  mm/memory.c | 13 +++-------
>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------
>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
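For context, zswap_present_test() is added earlier in this series and is
used throughout as "is the whole folio in zswap?". Below is an untested
sketch of that semantic, assuming a simple per-entry xarray probe (the
exact implementation in the series may differ):

bool zswap_present_test(swp_entry_t swp, int nr_pages)
{
        unsigned int type = swp_type(swp);
        pgoff_t offset = swp_offset(swp);
        int i;

        /*
         * Sketch only: report whether all nr_pages consecutive swap
         * entries starting at @swp currently have a zswap entry.
         */
        for (i = 0; i < nr_pages; i++) {
                struct xarray *tree = swap_zswap_tree(swp_entry(type, offset + i));

                if (!xa_load(tree, offset + i))
                        return false;
        }
        return true;
}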
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 49d243131169..75f7b9f5fb32 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>>>>
>>>>  	/*
>>>>  	 * swap_read_folio() can't handle the case a large folio is hybridly
>>>> -	 * from different backends. And they are likely corner cases. Similar
>>>> -	 * things might be added once zswap support large folios.
>>>> +	 * from different backends. And they are likely corner cases.
>>>>  	 */
>>>>  	if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
>>>>  		return false;
>>>>  	if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
>>>>  		return false;
>>>> +	if (unlikely(!zswap_present_test(entry, nr_pages)))
>>>> +		return false;
>>>>
>>>>  	return true;
>>>>  }
>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>>  	if (unlikely(userfaultfd_armed(vma)))
>>>>  		goto fallback;
>>>>
>>>> -	/*
>>>> -	 * A large swapped out folio could be partially or fully in zswap. We
>>>> -	 * lack handling for such cases, so fallback to swapping in order-0
>>>> -	 * folio.
>>>> -	 */
>>>> -	if (!zswap_never_enabled())
>>>> -		goto fallback;
>>>> -
>>>>  	entry = pte_to_swp_entry(vmf->orig_pte);
>>>>  	/*
>>>>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>> index 9cc91ae31116..a5aa86c24060 100644
>>>> --- a/mm/zswap.c
>>>> +++ b/mm/zswap.c
>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
>>>>
>>>>  bool zswap_load(struct folio *folio)
>>>>  {
>>>> +	int nr_pages = folio_nr_pages(folio);
>>>>  	swp_entry_t swp = folio->swap;
>>>> +	unsigned int type = swp_type(swp);
>>>>  	pgoff_t offset = swp_offset(swp);
>>>>  	bool swapcache = folio_test_swapcache(folio);
>>>> -	struct xarray *tree = swap_zswap_tree(swp);
>>>> +	struct xarray *tree;
>>>>  	struct zswap_entry *entry;
>>>> +	int i;
>>>>
>>>>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>>
>>>>  	if (zswap_never_enabled())
>>>>  		return false;
>>>>
>>>> -	/*
>>>> -	 * Large folios should not be swapped in while zswap is being used, as
>>>> -	 * they are not properly handled. Zswap does not properly load large
>>>> -	 * folios, and a large folio may only be partially in zswap.
>>>> -	 *
>>>> -	 * Return true without marking the folio uptodate so that an IO error is
>>>> -	 * emitted (e.g. do_swap_page() will sigbus).
>>>> -	 */
>>>> -	if (WARN_ON_ONCE(folio_test_large(folio)))
>>>> -		return true;
>>>> -
>>>> -	/*
>>>> -	 * When reading into the swapcache, invalidate our entry. The
>>>> -	 * swapcache can be the authoritative owner of the page and
>>>> -	 * its mappings, and the pressure that results from having two
>>>> -	 * in-memory copies outweighs any benefits of caching the
>>>> -	 * compression work.
>>>> -	 *
>>>> -	 * (Most swapins go through the swapcache. The notable
>>>> -	 * exception is the singleton fault on SWP_SYNCHRONOUS_IO
>>>> -	 * files, which reads into a private page and may free it if
>>>> -	 * the fault fails. We remain the primary owner of the entry.)
>>>> -	 */
>>>> -	if (swapcache)
>>>> -		entry = xa_erase(tree, offset);
>>>> -	else
>>>> -		entry = xa_load(tree, offset);
>>>> -
>>>> -	if (!entry)
>>>> +	if (!zswap_present_test(folio->swap, nr_pages))
>>>>  		return false;
>>>
>>> Hi Usama,
>>>
>>> Is there any chance that zswap_present_test() returns true
>>> in do_swap_page() but false in zswap_load()? If that's
>>> possible, could we be missing something? For example,
>>> could it be that zswap has been partially released (with
>>> part of it still present) during an mTHP swap-in?
>>>
>>> If this happens with an mTHP, my understanding is that
>>> we shouldn't proceed with reading corrupted data from the
>>> disk backend.
>>>
>>
>> If it's not swapcache, the zswap entry is not deleted, so I think
>> it should be OK?
>>
>> We can check over here if the entire folio is in zswap,
>> and if not, return true without marking the folio uptodate
>> to give an error.
>
> We have swapcache_prepare() called in do_swap_page(), which should
> have protected these entries from being partially freed by other
> processes (for example, if someone falls back to small folios for
> the same address). Therefore, I believe that zswap_present_test()
> cannot be false for mTHP in the current case where only synchronous
> I/O is supported.
>
> the below might help detect the bug?
>
> if (!zswap_present_test(folio->swap, nr_pages)) {
>         if (WARN_ON_ONCE(nr_pages > 1))
>                 return true;
>         return false;
> }
>

I think this isn't correct. If nr_pages > 1 and the entire folio is
not in zswap, it should still return false, so we would need to check
the whole folio if we want to warn. But if we are sure the code is OK,
it is an unnecessary check.
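To make "check the whole folio" concrete, here is an untested sketch of
what zswap_load() could do instead, reusing the patch's
nr_pages/type/offset/tree/i locals (the partial-presence counting is
illustrative, not code from this series):

        if (!zswap_present_test(folio->swap, nr_pages)) {
                int present = 0;

                /* Count how many of the folio's entries are still in zswap. */
                for (i = 0; i < nr_pages; i++) {
                        tree = swap_zswap_tree(swp_entry(type, offset + i));
                        if (xa_load(tree, offset + i))
                                present++;
                }

                /*
                 * Partially present: reading the remainder from disk
                 * would return corrupted data, so return true without
                 * marking the folio uptodate and let do_swap_page()
                 * emit an error instead.
                 */
                if (WARN_ON_ONCE(present))
                        return true;

                /* Fully absent from zswap: fall back to the backend. */
                return false;
        }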
> the code seems quite ugly :-) do we have some way to unify the code
> for large and small folios?
>
> not quite sure about shmem though....
>

If it's shmem and the swap count goes to 1, I think it's still OK,
because the folio will then be obtained from swap_cache_get_folio() if
it is already in the swapcache.

>>
>>>>
>>>> -	zswap_decompress(entry, &folio->page);
>>>> +	for (i = 0; i < nr_pages; ++i) {
>>>> +		tree = swap_zswap_tree(swp_entry(type, offset + i));
>>>> +		/*
>>>> +		 * When reading into the swapcache, invalidate our entry. The
>>>> +		 * swapcache can be the authoritative owner of the page and
>>>> +		 * its mappings, and the pressure that results from having two
>>>> +		 * in-memory copies outweighs any benefits of caching the
>>>> +		 * compression work.
>>>> +		 *
>>>> +		 * (Swapins with swap count > 1 go through the swapcache.
>>>> +		 * For swap count == 1, the swapcache is skipped and we
>>>> +		 * remain the primary owner of the entry.)
>>>> +		 */
>>>> +		if (swapcache)
>>>> +			entry = xa_erase(tree, offset + i);
>>>> +		else
>>>> +			entry = xa_load(tree, offset + i);
>>>>
>>>> -	count_vm_event(ZSWPIN);
>>>> -	if (entry->objcg)
>>>> -		count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>> +		zswap_decompress(entry, folio_page(folio, i));
>>>>
>>>> -	if (swapcache) {
>>>> -		zswap_entry_free(entry);
>>>> -		folio_mark_dirty(folio);
>>>> +		if (entry->objcg)
>>>> +			count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>> +		if (swapcache)
>>>> +			zswap_entry_free(entry);
>>>>  	}
>>>>
>>>> +	count_vm_events(ZSWPIN, nr_pages);
>>>> +	if (swapcache)
>>>> +		folio_mark_dirty(folio);
>>>> +
>>>>  	folio_mark_uptodate(folio);
>>>>  	return true;
>>>>  }
>>>> --
>>>> 2.43.5
>>>>
>>>
>
> Thanks
> barry