From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 277D8C27C65 for ; Tue, 11 Jun 2024 16:52:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id B38186B009B; Tue, 11 Jun 2024 12:52:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id AE84D6B009C; Tue, 11 Jun 2024 12:52:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 9B0406B009E; Tue, 11 Jun 2024 12:52:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0010.hostedemail.com [216.40.44.10]) by kanga.kvack.org (Postfix) with ESMTP id 776286B009B for ; Tue, 11 Jun 2024 12:52:26 -0400 (EDT) Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 9FF65A157D for ; Tue, 11 Jun 2024 16:52:25 +0000 (UTC) X-FDA: 82219201050.08.F3D189E Received: from mail-lf1-f42.google.com (mail-lf1-f42.google.com [209.85.167.42]) by imf08.hostedemail.com (Postfix) with ESMTP id 9D185160010 for ; Tue, 11 Jun 2024 16:52:23 +0000 (UTC) Authentication-Results: imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=VYFcLhJL; spf=pass (imf08.hostedemail.com: domain of usamaarif642@gmail.com designates 209.85.167.42 as permitted sender) smtp.mailfrom=usamaarif642@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1718124743; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=OM5fL+OasdGiX0Yjubme1KK8azWbu6kXziuSZJqJ0pI=; b=17CX1DDwZyFPi2jViquu4wM3gBqTuttr2R67YnipD6ZYCwmT4/Rw0JMPS2tL8P2Bqkex2+ AVEyK0ISl8ioFeXnfVOBtDQdbeL39Z6XpU/4Mf1S2pj13PhPzD3bN/8ID06MZo9XK+G4Jm bxdqvZ6qljEh8gLMlMwPSZ+y4bxnCtI= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=pass header.d=gmail.com header.s=20230601 header.b=VYFcLhJL; spf=pass (imf08.hostedemail.com: domain of usamaarif642@gmail.com designates 209.85.167.42 as permitted sender) smtp.mailfrom=usamaarif642@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1718124743; a=rsa-sha256; cv=none; b=6PB+y3rp8SeRQEQjbvhi5qOXnk4B487HJ78NDP2zdQx6thnxUz7a8jFdLu6o7mlwO3c4cL /MQoTFMOdAp+K0U2ubanWWf02kHaDkqeeLDirY2LvRXER9vLt4DbLVaQ9NTtSIgAO+IpTN KKZz335oRpwlV6s+2oTI1+VJogbqTsA= Received: by mail-lf1-f42.google.com with SMTP id 2adb3069b0e04-52c9034860dso2354282e87.2 for ; Tue, 11 Jun 2024 09:52:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1718124742; x=1718729542; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=OM5fL+OasdGiX0Yjubme1KK8azWbu6kXziuSZJqJ0pI=; b=VYFcLhJLUOKKW9oh3dY4P1qjDveiqNIphCsBiYNK4ExpHpjQL/CdAN3XgWadFQVEil hf1EUEXGei6D663Io0kmGmTKRAC5cQ85pCtTagE2ajj/l0yBnuGDY/OqtU7WD6P1SQPq 4BBP48Zm25b7LYY1JiWCEVWovCm67oWY6oXgHwbdPQ+kdxwVJbcA7hRT9RAowSKWLT6o iFI7V5Ou1yydJDlJVeGNyHq5bU24a95ooLrsIBUI6nCso+iWKbPL47fL6KJc2t499lkK oCnf94wvFf75FcPkBLhWJuwbA7qnio643EoYI/HCMnB1K6DNy6NnTAsQAqF8rlRmoVlr AVJA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1718124742; x=1718729542; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=OM5fL+OasdGiX0Yjubme1KK8azWbu6kXziuSZJqJ0pI=; b=XcyVRh26CO6TjZDWGB3OuyAHN2/71voMNldWOeRTraNXcpIlBbTjKQijtmct5Reel6 iJJmd5vGPz+Jc0Uqd0wvqEfeHae/HU02PFc+fJ8mNnDtTFuF2kzo/aNnhv3O5Y1BnT+t 4rLhdL5BaBoWDI/svJJQXPhTSyqfP2nmVsuKIq2nM9nyWrAu6Jhqqdi+ua7aGcmg4ldB 7k89H9d533cGJyaxrtJOZJloLyqr7P1Pz7amCJE7u7zs04tfiNbILRISXXbmVN8ux+8F 6KOoGOBdtA1KE/uX4Mw5A4C8aCXFDaJTZUhvlDaBeZFo3iDffdjUtnmgi1rPRyQeLRB+ 1aJw== X-Forwarded-Encrypted: i=1; AJvYcCWDvVsCEo6upfr+FUvgrz4uVZ7Mu4HQFR1hGRjL8B9yHWi35VLKYDRrXMP5R1aAmmXQJgBKHC88MvoJmgz889z8aB8= X-Gm-Message-State: AOJu0YyNniHqKZivES+Eikszd3ozHrlf+4H56gC4RXUn8mNEB9Mr6GDf pNqC4EYbgejEzqnoHWk33fbzG00o+cKkiZn2fc+W+QW/UV8b/wu1 X-Google-Smtp-Source: AGHT+IHZhKb/zNl+HHrFFjpaoUWIukSeMequbtGMtPNNY3asyqjFnBzEMVvw+aly1fDiFFxZTCMvzg== X-Received: by 2002:ac2:4a69:0:b0:52b:bfa7:66a1 with SMTP id 2adb3069b0e04-52bbfa766f7mr8601373e87.5.1718124741606; Tue, 11 Jun 2024 09:52:21 -0700 (PDT) Received: from ?IPV6:2a03:83e0:1126:4:eb:d0d0:c7fd:c82c? ([2620:10d:c092:500::7:57b4]) by smtp.gmail.com with ESMTPSA id a640c23a62f3a-a6f0f5ee702sm428034866b.52.2024.06.11.09.52.20 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 11 Jun 2024 09:52:21 -0700 (PDT) Message-ID: <08ea43f2-13d2-4b27-ae62-42cebc185c7b@gmail.com> Date: Tue, 11 Jun 2024 17:52:20 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v3 1/2] mm: store zero pages to be swapped out in a bitmap To: Yosry Ahmed Cc: 21cnbao@gmail.com, akpm@linux-foundation.org, hannes@cmpxchg.org, david@redhat.com, ying.huang@intel.com, hughd@google.com, willy@infradead.org, nphamcs@gmail.com, chengming.zhou@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel-team@meta.com, Shakeel Butt References: <20240610121820.328876-1-usamaarif642@gmail.com> <20240610121820.328876-2-usamaarif642@gmail.com> <9ddfe544-636d-4638-ae0e-053674e47322@gmail.com> Content-Language: en-US From: Usama Arif In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 9D185160010 X-Stat-Signature: eef9n1o3yj69i461acq4fupapnzjx9b8 X-HE-Tag: 1718124743-502467 X-HE-Meta: U2FsdGVkX1+cBxBHrUL5GcQg5Tyl8u6U329mYprLX1io4AAWR/NWaJGOKs43/TNNaCFKVtM+OrgKhVqloxEobqV4wU26GNv7dTm6/Z3jx+f6qNA2bXQ1Ywx/sYSKFsticligPyo0nRYCSlxZfhYT4iXSsfenJqQYbrMRpglZaAqgrX4NN1tAq5MwCkOcOL/VL9szIM/sqomaTD1EZLxhG9TPj1VyUcaJNsn7KSgKGb2vk8/Z8E8W8erYkKEBGoHf6zvgrr7YzZAci4Hnf7l0I9IWtVaeu8YSlk57wwWdGSXZXu2oOROHK37KgLpvucNYAFkFyEWgjDJbBEJLtrxIHnxCaBPrg3FsuAKUXHh6FN2sbtw+DanYhM7+Y4uTvUMT748P6kA9gSYm3xOTBOXohqwRcLwRdqkLfxpun6wqcD1KOr7Yzj8WrjYPoZOZiwyqdq9auqaIgXc30WhMchULV1xOH7vN2pk1voTDtVHAtFj8fzgvs5uU1PBYfFH0w7VlJZ+xaQKD6WFWcGUGm4Yz4K8ZnWJvqt9yIGX1S9voBhRmc5TN/o7p5V7GddKat6xFMJrIkBrIqsyZ/uwCLyPi1pmkOuK0+OWRtwYwYgDJzDc0SRGLPYvN9IOvufeg2PMBP3+rX6Cro8eYmwu9WTXu1tFhlSlH1ITcgoR3Z4nTre8TJO9YPpMgpp1jRJH7wroScP2mNdbZw2h1/yYk7EGcEkjxxnLeEwMXJdj5wWKqaWa9ozondnIIOJLRVvOlZiOH3cRqzygmJ/cr0BaV4Kpjt/dYaZpkL3UCstutJlT3pwCvhNPzT60T4ZblgZ19gixalgofYYaXIN+E8RGOc1Erc7Gy00Nc05gLHa9LQIRwi+JAnbhFWPuPhK2OUEtmHiss6veHVxJIklggqhEVqlwMM5r/3M7HbP6zR9Fgo2RrlQdhoYJSApbL/MlqHzF6WLnFc+TBk8mMfND2HxrOzZe e+SUzFCp NmWG8teybpVhwWZFsyfAGOGeXe6A1ACKXW6NB32s9dvmpG0LBY5OCH2Eqd0V0611AxJELxBMiP69jBolFa1q5VrlW81aDtcmAkm76a/g/5zAdRemZuplcMpLmrGCNbkUt/zW38SniO+BkuRlzj2gBt3neazLkROpJcL15LuQIh0K2eMNVv0Jyg7juxacrpQVbVCSOOAhT/aU8b+vRBieNfM/DzSGgqZ/1yVN0D5VHZTr51XQgc1O1GhgDELsQ+l2EdXie5OwqAtINdy/Vkld8ERl0RsxR9PwuBx9aUNvxeJ8jRq4QZIkAtw5oGqqeIYfLmWMgzU2VPMmpPaTwu73fTg61pNPly0ctUm0FWsNay7VApJ7qBN37J2voHvxQBVr0MvM2GDlskW5wEBGPZ0fahiGYKoaBRwtFxCg46JIdiX2xmrkPWxQNZB3TWTQXgfhrqlWMjWG2jOSjwgAPBfxWIYaenDl7UNAs++7Zj1IXuBe7Gz5+HyuiVJOYnBS6WtpkySAjMWFtf0PGwh4= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 11/06/2024 16:42, Yosry Ahmed wrote: > On Tue, Jun 11, 2024 at 4:49 AM Usama Arif wrote: >> @@ -515,8 +600,11 @@ void swap_read_folio(struct folio *folio, bool >> synchronous, >>>>>> psi_memstall_enter(&pflags); >>>>>> } >>>>>> delayacct_swapin_start(); >>>>>> - >>>>>> - if (zswap_load(folio)) { >>>>>> + if (swap_zeromap_folio_test(folio)) { >>>>>> + folio_zero_fill(folio); >>>>>> + folio_mark_uptodate(folio); >>>>>> + folio_unlock(folio); >>>>> We don't currently support swapping in large folios, but it is a work >>>>> in progress, and this will break once we have it. >>>>> swap_zeromap_folio_test() will return false even if parts of the folio >>>>> are in fact zero-filled. Then, we will go read those from disk swap, >>>>> essentially corrupting data. >>>> So yes, with this patch I tested swap out of large zero folio, but when >>>> swapping in it was page by page. My assumption was that swap_read_folio >>>> (when support is added) would only pass a large folio that was earlier >>>> swapped out as a large folio. So if a zero filled large folio was >>>> swapped out, the zeromap will be set for all the pages in that folio, >>>> and at large folio swap in (when it is supported), it will see that all >>>> the bits in the zeromap for that folio are set, and will just >>>> folio_zero_fill. >>>> >>>> If even a single page in large folio has non-zero data then zeromap will >>>> not store it and it will go to either zswap or disk, and at read time as >>>> all the bits in zeromap are not set, it will still goto either zswap or >>>> disk, so I think this works? >>>> >>>> Is my assumption wrong that only large folios can be swapped in only if >>>> they were swapped out as large? I think this code works in that case. >>> I think the assumption is incorrect. I think we would just check if >>> contiguous PTEs have contiguous swap entries and swapin the folio as a >>> large folio in this case. It is likely that the swap entries are >>> contiguous because it was swapped out as a large folio, but I don't >>> think it's guaranteed. >> Yes, makes sense. Thanks for explaining this. >> >>> For example, here is a patch that implements large swapin support for >>> the synchronous swapin case, and I think it just checks that the PTEs >>> have contiguous swap entries: >>> https://lore.kernel.org/linux-mm/20240304081348.197341-6-21cnbao@gmail.com/ >>> >>> This makes a lot of sense because otherwise you'd have to keep track >>> of how the folios were composed at the time of swapout, to be able to >>> swap the same folios back in. >> I think the solution to large folio swap-in for this optimization and >> zswap is not in swap_read_folio in this patch-series or any call further >> down the stack, as its too late by the time you reach here, but in >> Barrys' patch. This needs to happen much earlier when deciding the size >> of the folio at folio creation in alloc_swap_folio, after Barry checks >> >> if (is_zswap_enabled()) >> goto fallback; >> >> once the order is decided, we would need to check the indexes in the >> zeromap array starting from swap_offset(entry) and ending at >> swap_offset(entry) + 2^order are set. If no bits are set, or all bits >> are set, then we allocate large folio. Otherwise, we goto fallback. >> >> I think its better to handle this in Barrys patch. I feel this series is >> close to its final state, i.e. the only diff I have for the next >> revision is below to remove start/end_writeback for zer_filled case. I >> will comment on Barrys patch once the I send out the next revision of this. > Sorry I did not make myself clearer. I did not mean that you should > handle the large folio swapin here. This needs to be handled at a > higher level because as you mentioned, a large folio may be partially > in the zeromap, zswap, swapcache, disk, etc. > > What I meant is that we should probably have a debug check to make > sure this doesn't go unhandled. For zswap, I am trying to add a > warning and fail the swapin operation if a large folio slips through > to zswap. We can do something similar here if folks agree this is the > right way in the interim: > https://lore.kernel.org/lkml/20240611024516.1375191-3-yosryahmed@google.com/. > > Maybe I am too paranoid, but I think it's easy to mess up these things > when working on large folio swapin imo. So there is a difference between zswap and this optimization. In this optimization, if the zeromap is set for all the folio bits, then we should do large folio swapin. There still needs to be a change in Barrys patch in alloc_swap_folio, but apart from that does the below diff over v3 make it better? I will send a v4 with this if it sounds good. diff --git a/mm/page_io.c b/mm/page_io.c index 6400be6e4291..bf01364748a9 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -234,18 +234,24 @@ static void swap_zeromap_folio_clear(struct folio *folio)         }  } -static bool swap_zeromap_folio_test(struct folio *folio) +/* + * Return the index of the first subpage which is not zero-filled + * according to swap_info_struct->zeromap. + * If all pages are zero-filled according to zeromap, it will return + * folio_nr_pages(folio). + */ +static long swap_zeromap_folio_test(struct folio *folio)  {         struct swap_info_struct *sis = swp_swap_info(folio->swap);         swp_entry_t entry; -       unsigned int i; +       long i;         for (i = 0; i < folio_nr_pages(folio); i++) {                 entry = page_swap_entry(folio_page(folio, i));                 if (!test_bit(swp_offset(entry), sis->zeromap)) -                       return false; +                       return i;         } -       return true; +       return i;  }  /* @@ -581,6 +587,7 @@ void swap_read_folio(struct folio *folio, bool synchronous,  {         struct swap_info_struct *sis = swp_swap_info(folio->swap);         bool workingset = folio_test_workingset(folio); +       long first_non_zero_page_idx;         unsigned long pflags;         bool in_thrashing; @@ -598,10 +605,19 @@ void swap_read_folio(struct folio *folio, bool synchronous,                 psi_memstall_enter(&pflags);         }         delayacct_swapin_start(); -       if (swap_zeromap_folio_test(folio)) { +       first_non_zero_page_idx = swap_zeromap_folio_test(folio); +       if (first_non_zero_page_idx == folio_nr_pages(folio)) {                 folio_zero_fill(folio);                 folio_mark_uptodate(folio);                 folio_unlock(folio); +       } else if (first_non_zero_page_idx != 0) { +               /* +                * The case for when only *some* of subpages being swapped-in were recorded +                * in sis->zeromap, while the rest are in zswap/disk is currently not handled. +                * WARN in this case and return without marking the folio uptodate so that +                * an IO error is emitted (e.g. do_swap_page() will sigbus). +                */ +                WARN_ON_ONCE(1);         } else if (zswap_load(folio)) {                 folio_mark_uptodate(folio);                 folio_unlock(folio);