From: Barry Song <21cnbao@gmail.com>
Date: Thu, 5 Sep 2024 22:10:30 +1200
Subject: Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
To: Yosry Ahmed
Cc: usamaarif642@gmail.com, akpm@linux-foundation.org, chengming.zhou@linux.dev,
    david@redhat.com, hannes@cmpxchg.org, hughd@google.com, kernel-team@meta.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, nphamcs@gmail.com,
    shakeel.butt@linux.dev, willy@infradead.org, ying.huang@intel.com,
    hanchuanhua@oppo.com
References: <20240612124750.2220726-2-usamaarif642@gmail.com> <20240904055522.2376-1-21cnbao@gmail.com>
On Thu, Sep 5, 2024 at 8:49 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed wrote:
> >
> > On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed wrote:
> > > >
> > > > [..]
> > > > > > I understand the point of doing this to unblock the synchronous
> > > > > > large folio swapin support work, but at some point we're gonna
> > > > > > have to actually handle the cases where a large folio being
> > > > > > swapped in is partially in the swap cache, zswap, the zeromap,
> > > > > > etc.
> > > > > >
> > > > > > All these cases will need similar-ish handling, and I suspect
> > > > > > we won't just skip swapping in large folios in all these cases.
> > > > >
> > > > > I agree that this is definitely the goal. `swap_read_folio()`
> > > > > should be a dependable API that always returns reliable data,
> > > > > regardless of whether `zeromap` or `zswap` is involved. Despite
> > > > > these issues, mTHP swap-in shouldn't be held back. Significant
> > > > > efforts are underway to support large folios in `zswap`, and
> > > > > progress is being made. Not to mention we've already allowed
> > > > > `zeromap` to proceed, even though it doesn't support large folios.
> > > > >
> > > > > It's genuinely unfair to let the lack of mTHP support in `zeromap`
> > > > > and `zswap` hold swap-in hostage.
> > > > >
> > > > > Hi Yosry,
> > >
> > > > Well, two points here:
> > > >
> > > > 1. I did not say that we should block the synchronous mTHP swapin
> > > > work for this :) I said the next item on the TODO list for mTHP
> > > > swapin support should be handling these cases.
> > >
> > > Thanks for your clarification!
> > >
> > > > 2. I think two things are getting conflated here. Zswap needs to
> > > > support mTHP swapin. Zeromap already supports mTHPs AFAICT. What is
> > > > truly missing, and is outside the scope of zswap/zeromap, is being
> > > > able to support hybrid mTHP swapin.
> > > >
> > > > When swapping in an mTHP, the swapped entries can be on disk, in
> > > > the swapcache, in zswap, or in the zeromap. Even if all these
> > > > things support mTHPs individually, we essentially need support to
> > > > form an mTHP from swap entries in different backends. That's what
> > > > I meant. Actually if we have that, we may not really need mTHP
> > > > swapin support in zswap, because we can just form the large folio
> > > > in the swap layer from multiple zswap entries.
> > > >
> > >
> > > After further consideration, I've actually started to disagree with
> > > the idea of supporting hybrid swapin (forming an mTHP from swap
> > > entries in different backends). My reasoning is as follows:
> >
> > I do not have any data about this, so you could very well be right
> > here.
> > Handling hybrid swapin could simply be falling back to the smallest
> > order we can swapin from a single backend. We can at least start with
> > this, and collect data about how many mTHP swapins fall back due to
> > hybrid backends. This way we only take the complexity if needed.
> >
> > I did imagine though that it's possible for two virtually contiguous
> > folios to be swapped out to contiguous swap entries and end up in
> > different media (e.g. if only one of them is zero-filled). I am not
> > sure how rare it would be in practice.
> >
> > > 1. The scenario where an mTHP is partially zeromap, partially zswap,
> > > etc., would be an extremely rare case, as long as we're swapping out
> > > the mTHP as a whole and all the modules are handling it accordingly.
> > > It's highly unlikely to form this mix of zeromap, zswap, and
> > > swapcache unless the contiguous VMA virtual addresses happen to get
> > > some small folios with aligned and contiguous swap slots. Even then,
> > > they would need to be partially zeromap and partially non-zeromap,
> > > zswap, etc.
> >
> > As I mentioned, we can start simple and collect data for this. If it's
> > rare and we don't need to handle it, that's good.
> >
> > > As you mentioned, zeromap handles an mTHP as a whole during swap-out,
> > > marking all subpages of the entire mTHP as zeromap rather than just a
> > > subset of them.
> > >
> > > And swap-in can also entirely map a swapcache folio which is a large
> > > folio, based on our previous patchset which is already in mainline:
> > > "mm: swap: entirely map large folios found in swapcache"
> > > https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@gmail.com/
> > >
> > > It seems the only thing we're missing is zswap support for mTHP.
> >
> > It is still possible for two virtually contiguous folios to be swapped
> > out to contiguous swap entries. It is also possible that a large folio
> > is swapped out as a whole, then only a part of it is swapped in later
> > due to memory pressure. If that part is later reclaimed again and gets
> > added to the swapcache, we can run into the hybrid swapin situation.
> > There may be other scenarios as well, I did not think this through.
> >
> > > 2. Implementing hybrid swap-in would be extremely tricky and could
> > > disrupt several software layers. I can share some pseudo code below:
> >
> > Yeah it definitely would be complex, so we need proper justification
> > for it.
> >
> > > swap_read_folio()
> > > {
> > >         if (zeromap_full)
> > >                 folio_read_from_zeromap()
> > >         else if (zswap_map_full)
> > >                 folio_read_from_zswap()
> > >         else {
> > >                 folio_read_from_swapfile()
> > >                 if (zeromap_partial)
> > >                         /* fill zero for partially zeromap subpages */
> > >                         folio_read_from_zeromap_fixup()
> > >                 if (zswap_partial)
> > >                         /* zswap_load for partially zswap-mapped subpages */
> > >                         folio_read_from_zswap_fixup()
> > >
> > >                 folio_mark_uptodate()
> > >                 folio_unlock()
> > >         }
> > > }
> > >
> > > We'd also need to modify folio_read_from_swapfile() to skip
> > > folio_mark_uptodate() and folio_unlock() after completing the BIO.
> > > This approach seems to entirely disrupt the software layers.
> > >
> > > This could also lead to unnecessary IO operations for subpages that
> > > require fixup. Since such cases are quite rare, I believe the added
> > > complexity isn't worth it.
> > >
> > > My point is that we should simply check that all PTEs have
> > > consistent zeromap, zswap, and swapcache statuses before proceeding,
> > > otherwise fall back to the next lower order if needed. This approach
> > > improves performance and avoids complex corner cases.
> >
> > Agree that we should start with that, although we should probably
> > fall back to the largest order we can swapin from a single backend,
> > rather than the next lower order.
> >
> > > So once zswap mTHP is there, I would also expect an API similar to
> > > swap_zeromap_entries_check(), for example:
> > > zswap_entries_check(entry, nr), which can return whether we have
> > > full, none, or partial zswap, to replace the existing
> > > zswap_never_enabled().
> >
> > I think a better API would be similar to what Usama had. Basically
> > take in (entry, nr) and return how much of it is in zswap starting at
> > entry, so that we can decide the swapin order.
> >
> > Maybe we can adjust your proposed swap_zeromap_entries_check() as well
> > to do that? Basically return the number of swap entries in the zeromap
> > starting at 'entry'. If 'entry' itself is not in the zeromap we return
> > 0 naturally. That would be a small adjustment/fix over what Usama had,
> > but implementing it with bitmap operations like you did would be
> > better.
>
> I assume you mean the below:
>
> /*
>  * Return the number of contiguous zeromap entries starting from entry
>  */
> static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
>                 int nr)
> {
>         struct swap_info_struct *sis = swp_swap_info(entry);
>         unsigned long start = swp_offset(entry);
>         unsigned long end = start + nr;
>         unsigned long idx;
>
>         idx = find_next_bit(sis->zeromap, end, start);
>         if (idx != start)
>                 return 0;
>
>         return find_next_zero_bit(sis->zeromap, end, start) - idx;
> }
>
> If yes, I really like this idea.
>
> It seems much better than using an enum, which would require adding a
> new data structure :-) Additionally, returning the number allows
> callers to fall back to the largest possible order, rather than trying
> next lower orders sequentially.

No, returning 0 after only checking the first entry would still
reintroduce the current bug, where the start entry is in the zeromap but
the other entries might not be. We need another value to indicate whether
the entries are consistent if we want to avoid the enum:

/*
 * Return the number of contiguous zeromap entries starting from entry;
 * if all entries have a consistent zeromap status, *consistent will be
 * true; otherwise, false.
 */
static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry,
                int nr, bool *consistent)
{
        struct swap_info_struct *sis = swp_swap_info(entry);
        unsigned long start = swp_offset(entry);
        unsigned long end = start + nr;
        unsigned long s_idx, c_idx;

        s_idx = find_next_bit(sis->zeromap, end, start);
        if (s_idx == end) {
                /* no bit set: entirely non-zeromap */
                *consistent = true;
                return 0;
        }

        c_idx = find_next_zero_bit(sis->zeromap, end, start);
        if (c_idx == end) {
                /* all bits set: entirely zeromap */
                *consistent = true;
                return nr;
        }

        /* mixed range */
        *consistent = false;
        if (s_idx != start)
                return 0;       /* entry itself is not in the zeromap */
        return c_idx - start;   /* length of the leading zeromap run */
}

I can actually switch the places of the "consistent" flag and the
returned number if that looks better.
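For illustration, a hypothetical caller could then use the returned count
to jump straight to the largest order the zeromap can serve on its own.
This is only a sketch under that assumption: swapin_order_for_zeromap()
is a made-up name, and the offset-alignment and scan-limit checks a real
caller would need are omitted.

static int swapin_order_for_zeromap(swp_entry_t entry, int max_order)
{
        bool consistent;
        unsigned int nr;

        nr = swap_zeromap_entries_count(entry, 1 << max_order, &consistent);
        if (consistent)
                return max_order;       /* all zeromap, or all non-zeromap */

        /*
         * Mixed range: nr is the length of the leading zeromap run
         * (0 if the first entry is not in the zeromap at all). Jump
         * directly to the largest order that run covers instead of
         * probing every lower order one by one.
         */
        return nr ? ilog2(nr) : 0;
}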
>
> Hi Usama,
> what is your take on this?
>
> > > Though I am not sure how cheap zswap can implement it,
> > > swap_zeromap_entries_check() could be two simple bit operations:
> > >
> > > +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
> > > +{
> > > +       struct swap_info_struct *sis = swp_swap_info(entry);
> > > +       unsigned long start = swp_offset(entry);
> > > +       unsigned long end = start + nr;
> > > +
> > > +       if (find_next_bit(sis->zeromap, end, start) == end)
> > > +               return SWAP_ZEROMAP_NON;
> > > +       if (find_next_zero_bit(sis->zeromap, end, start) == end)
> > > +               return SWAP_ZEROMAP_FULL;
> > > +
> > > +       return SWAP_ZEROMAP_PARTIAL;
> > > +}
> > >
> > > 3. swapcache is different from zeromap and zswap. Swapcache indicates
> > > that the memory is still available and should be re-mapped rather
> > > than allocating a new folio. Our previous patchset has implemented a
> > > full re-map of an mTHP in do_swap_page(), as mentioned in 1.
> > >
> > > For the same reason as point 1, partial swapcache is a rare edge
> > > case. Not re-mapping it and instead allocating a new folio would add
> > > significant complexity.
> > >
> > > > > Nonetheless, `zeromap` and `zswap` are distinct cases. With
> > > > > `zeromap`, we permit almost all mTHP swap-ins, except for those
> > > > > rare situations where small folios that were swapped out happen
> > > > > to have contiguous and aligned swap slots.
> > > > >
> > > > > swapcache is another quite different story; since our user
> > > > > scenarios begin from the simplest sync IO on mobile phones, we
> > > > > don't quite care about swapcache.
> > > >
> > > > Right. The reason I bring this up is, as I mentioned above, there
> > > > is a common problem of forming large folios from different sources,
> > > > which includes the swap cache. The fact that synchronous swapin
> > > > does not use the swapcache was a happy coincidence for you, as you
> > > > can add support for mTHP swapins without handling this case yet ;)
> > >
> > > As I mentioned above, I'd really rather filter out those corner
> > > cases than support them, not just for the current situation to
> > > unlock the swap-in series :-)
> >
> > If they are indeed corner cases, then I definitely agree.
>
> Thanks
> Barry