Message-ID: <6d036c4d-ec2e-4562-98a1-6668948086b5@gmail.com>
Date: Mon, 21 Oct 2024 21:57:18 +0100
From: Usama Arif <usamaarif642@gmail.com>
Subject: Re: [RFC 3/4] mm/zswap: add support for large folio zswapin
To: Barry Song <21cnbao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org, hannes@cmpxchg.org,
 david@redhat.com, willy@infradead.org, kanchana.p.sridhar@intel.com,
 yosryahmed@google.com, nphamcs@gmail.com, chengming.zhou@linux.dev,
 ryan.roberts@arm.com, ying.huang@intel.com, riel@surriel.com,
 shakeel.butt@linux.dev, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
References: <20241018105026.2521366-1-usamaarif642@gmail.com>
 <20241018105026.2521366-4-usamaarif642@gmail.com>
Content-Type: text/plain; charset=UTF-8
On 21/10/2024 21:28, Barry Song wrote:
> On Tue, Oct 22, 2024 at 1:21 AM Usama Arif <usamaarif642@gmail.com> wrote:
>>
>> On 21/10/2024 11:55, Barry Song wrote:
>>> On Mon, Oct 21, 2024 at 11:44 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>
>>>> On 21/10/2024 06:49, Barry Song wrote:
>>>>> On Fri, Oct 18, 2024 at 11:50 PM Usama Arif <usamaarif642@gmail.com> wrote:
>>>>>>
>>>>>> At time of folio allocation, alloc_swap_folio checks if the entire
>>>>>> folio is in zswap to determine folio order.
>>>>>> During swap_read_folio, zswap_load will check if the entire folio
>>>>>> is in zswap, and if it is, it will iterate through the pages in the
>>>>>> folio and decompress them.
>>>>>> This will mean the benefits of large folios (fewer page faults, batched
>>>>>> PTE and rmap manipulation, reduced lru list, TLB coalescing (for arm64
>>>>>> and amd)) are not lost at swap out when using zswap.
>>>>>> This patch does not add support for hybrid backends (i.e. folios
>>>>>> partly present in swap and zswap).
>>>>>>
>>>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>>>> ---
>>>>>>  mm/memory.c | 13 +++-------
>>>>>>  mm/zswap.c  | 68 ++++++++++++++++++++++++-----------------------------
>>>>>>  2 files changed, 34 insertions(+), 47 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>>> index 49d243131169..75f7b9f5fb32 100644
>>>>>> --- a/mm/memory.c
>>>>>> +++ b/mm/memory.c
>>>>>> @@ -4077,13 +4077,14 @@ static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
>>>>>>
>>>>>>  	/*
>>>>>>  	 * swap_read_folio() can't handle the case a large folio is hybridly
>>>>>> -	 * from different backends. And they are likely corner cases. Similar
>>>>>> -	 * things might be added once zswap support large folios.
>>>>>> +	 * from different backends. And they are likely corner cases.
>>>>>>  	 */
>>>>>>  	if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
>>>>>>  		return false;
>>>>>>  	if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
>>>>>>  		return false;
>>>>>> +	if (unlikely(!zswap_present_test(entry, nr_pages)))
>>>>>> +		return false;
>>>>>>
>>>>>>  	return true;
>>>>>>  }
>>>>>> @@ -4130,14 +4131,6 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
>>>>>>  	if (unlikely(userfaultfd_armed(vma)))
>>>>>>  		goto fallback;
>>>>>>
>>>>>> -	/*
>>>>>> -	 * A large swapped out folio could be partially or fully in zswap. We
>>>>>> -	 * lack handling for such cases, so fallback to swapping in order-0
>>>>>> -	 * folio.
>>>>>> -	 */
>>>>>> -	if (!zswap_never_enabled())
>>>>>> -		goto fallback;
>>>>>> -
>>>>>>  	entry = pte_to_swp_entry(vmf->orig_pte);
>>>>>>  	/*
>>>>>>  	 * Get a list of all the (large) orders below PMD_ORDER that are enabled
>>>>>> diff --git a/mm/zswap.c b/mm/zswap.c
>>>>>> index 9cc91ae31116..a5aa86c24060 100644
>>>>>> --- a/mm/zswap.c
>>>>>> +++ b/mm/zswap.c
>>>>>> @@ -1624,59 +1624,53 @@ bool zswap_present_test(swp_entry_t swp, int nr_pages)
>>>>>>
>>>>>>  bool zswap_load(struct folio *folio)
>>>>>>  {
>>>>>> +	int nr_pages = folio_nr_pages(folio);
>>>>>>  	swp_entry_t swp = folio->swap;
>>>>>> +	unsigned int type = swp_type(swp);
>>>>>>  	pgoff_t offset = swp_offset(swp);
>>>>>>  	bool swapcache = folio_test_swapcache(folio);
>>>>>> -	struct xarray *tree = swap_zswap_tree(swp);
>>>>>> +	struct xarray *tree;
>>>>>>  	struct zswap_entry *entry;
>>>>>> +	int i;
>>>>>>
>>>>>>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>>>>>>
>>>>>>  	if (zswap_never_enabled())
>>>>>>  		return false;
>>>>>>
>>>>>> -	/*
>>>>>> -	 * Large folios should not be swapped in while zswap is being used, as
>>>>>> -	 * they are not properly handled. Zswap does not properly load large
>>>>>> -	 * folios, and a large folio may only be partially in zswap.
>>>>>> -	 *
>>>>>> -	 * Return true without marking the folio uptodate so that an IO error is
>>>>>> -	 * emitted (e.g. do_swap_page() will sigbus).
>>>>>> -	 */
>>>>>> -	if (WARN_ON_ONCE(folio_test_large(folio)))
>>>>>> -		return true;
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * When reading into the swapcache, invalidate our entry. The
>>>>>> -	 * swapcache can be the authoritative owner of the page and
>>>>>> -	 * its mappings, and the pressure that results from having two
>>>>>> -	 * in-memory copies outweighs any benefits of caching the
>>>>>> -	 * compression work.
>>>>>> -	 *
>>>>>> -	 * (Most swapins go through the swapcache. The notable
>>>>>> -	 * exception is the singleton fault on SWP_SYNCHRONOUS_IO
>>>>>> -	 * files, which reads into a private page and may free it if
>>>>>> -	 * the fault fails. We remain the primary owner of the entry.)
>>>>>> -	 */
>>>>>> -	if (swapcache)
>>>>>> -		entry = xa_erase(tree, offset);
>>>>>> -	else
>>>>>> -		entry = xa_load(tree, offset);
>>>>>> -
>>>>>> -	if (!entry)
>>>>>> +	if (!zswap_present_test(folio->swap, nr_pages))
>>>>>>  		return false;
>>>>>
>>>>> Hi Usama,
>>>>>
>>>>> Is there any chance that zswap_present_test() returns true
>>>>> in do_swap_page() but false in zswap_load()? If that's
>>>>> possible, could we be missing something? For example,
>>>>> could it be that zswap has been partially released (with
>>>>> part of it still present) during an mTHP swap-in?
>>>>>
>>>>> If this happens with an mTHP, my understanding is that
>>>>> we shouldn't proceed with reading corrupted data from the
>>>>> disk backend.
>>>>>
>>>>
>>>> If it's not in the swapcache, the zswap entry is not deleted, so I think
>>>> it should be ok?
>>>>
>>>> We can check over here if the entire folio is in zswap,
>>>> and if not, return true without marking the folio uptodate
>>>> to give an error.
>>>
>>> We have swapcache_prepare() called in do_swap_page(), which should
>>> have protected these entries from being partially freed by other processes
>>> (for example, if someone falls back to small folios for the same address).
>>> Therefore, I believe that zswap_present_test() cannot be false for mTHP in
>>> the current case where only synchronous I/O is supported.
>>>
>>> the below might help detect the bug?
>>>
>>> if (!zswap_present_test(folio->swap, nr_pages)) {
>>>         if (WARN_ON_ONCE(nr_pages > 1))
>>>                 return true;
>>>         return false;
>>> }
>>>
>>
>> I think this isn't correct. If nr_pages > 1 and the entire folio is not in
>> zswap, it should still return false. So we would need to check the whole folio
>> if we want to warn. But I think if we are sure the code is ok, it is an
>> unnecessary check.
>
> my point is that zswap_present_test() can't differentiate
> 1. the *whole* folio is not in zswap
> 2. the folio is *partially* not in zswap
>
> in case 2, returning false is wrong.
>

Agreed!

> And when nr_pages > 1, we have already confirmed earlier in
> do_swap_page() that zswap_present_test() is true. At this point,
> it must always be true; if it's false, it indicates a bug.
>

Yes, agreed! I was thinking from just the zswap_load perspective, irrespective
of who calls it. If someone adds large folio support to swapin_readahead, then
I think the above warn might be an issue. But just with this patch series,
doing what you suggested is correct. I will add it in the next revision. We can
deal with it once swap count > 1 starts supporting large folios.

>>
>>> the code seems quite ugly :-) do we have some way to unify the code
>>> for large and small folios?
>>>
>>> not quite sure about shmem though....
>>>
>>
>> If it's shmem and the swap_count goes to 1, I think it's still ok, because
>> then the folio will be gotten from swap_cache_get_folio if it has already
>> been in the swapcache.
>>
>>>>
>>>>
>>>>>>
>>>>>> -	zswap_decompress(entry, &folio->page);
>>>>>> +	for (i = 0; i < nr_pages; ++i) {
>>>>>> +		tree = swap_zswap_tree(swp_entry(type, offset + i));
>>>>>> +		/*
>>>>>> +		 * When reading into the swapcache, invalidate our entry. The
>>>>>> +		 * swapcache can be the authoritative owner of the page and
>>>>>> +		 * its mappings, and the pressure that results from having two
>>>>>> +		 * in-memory copies outweighs any benefits of caching the
>>>>>> +		 * compression work.
>>>>>> +		 *
>>>>>> +		 * (Swapins with swap count > 1 go through the swapcache.
>>>>>> +		 * For swap count == 1, the swapcache is skipped and we
>>>>>> +		 * remain the primary owner of the entry.)
>>>>>> +		 */
>>>>>> +		if (swapcache)
>>>>>> +			entry = xa_erase(tree, offset + i);
>>>>>> +		else
>>>>>> +			entry = xa_load(tree, offset + i);
>>>>>>
>>>>>> -	count_vm_event(ZSWPIN);
>>>>>> -	if (entry->objcg)
>>>>>> -		count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>>>> +		zswap_decompress(entry, folio_page(folio, i));
>>>>>>
>>>>>> -	if (swapcache) {
>>>>>> -		zswap_entry_free(entry);
>>>>>> -		folio_mark_dirty(folio);
>>>>>> +		if (entry->objcg)
>>>>>> +			count_objcg_events(entry->objcg, ZSWPIN, 1);
>>>>>> +		if (swapcache)
>>>>>> +			zswap_entry_free(entry);
>>>>>>  	}
>>>>>>
>>>>>> +	count_vm_events(ZSWPIN, nr_pages);
>>>>>> +	if (swapcache)
>>>>>> +		folio_mark_dirty(folio);
>>>>>> +
>>>>>>  	folio_mark_uptodate(folio);
>>>>>>  	return true;
>>>>>>  }
>>>>>> --
>>>>>> 2.43.5
>>>>>>
>>>>>
>>>
>>> Thanks
>>> barry
>>