From: Barry Song <21cnbao@gmail.com>
Date: Thu, 5 Sep 2024 20:49:31 +1200
Subject: Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
To: Yosry Ahmed
Cc: usamaarif642@gmail.com, akpm@linux-foundation.org, chengming.zhou@linux.dev, david@redhat.com, hannes@cmpxchg.org, hughd@google.com, kernel-team@meta.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, nphamcs@gmail.com, shakeel.butt@linux.dev, willy@infradead.org, ying.huang@intel.com, hanchuanhua@oppo.com
References: <20240612124750.2220726-2-usamaarif642@gmail.com> <20240904055522.2376-1-21cnbao@gmail.com>

On Thu, Sep 5, 2024 at 7:55 PM Yosry Ahmed wrote:
>
> On Thu, Sep 5, 2024 at 12:03 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed wrote:
> > >
> > > [..]
> > > > > I understand the point of doing this to unblock the synchronous large
> > > > > folio swapin support work, but at some point we're gonna have to
> > > > > actually handle the cases where a large folio being swapped in is
> > > > > partially in the swap cache, zswap, the zeromap, etc.
> > > > >
> > > > > All these cases will need similar-ish handling, and I suspect we won't
> > > > > just skip swapping in large folios in all these cases.
> > > >
> > > > I agree that this is definitely the goal. `swap_read_folio()` should be a
> > > > dependable API that always returns reliable data, regardless of whether
> > > > `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in shouldn't
> > > > be held back. Significant efforts are underway to support large folios in
> > > > `zswap`, and progress is being made. Not to mention we've already allowed
> > > > `zeromap` to proceed, even though it doesn't support large folios.
> > > >
> > > > It's genuinely unfair to let the lack of mTHP support in `zeromap` and
> > > > `zswap` hold swap-in hostage.
> > >
> >
> > Hi Yosry,
> >
> > > Well, two points here:
> > >
> > > 1. I did not say that we should block the synchronous mTHP swapin work
> > > for this :) I said the next item on the TODO list for mTHP swapin
> > > support should be handling these cases.
> >
> > Thanks for your clarification!
> >
> > >
> > > 2. I think two things are getting conflated here. Zswap needs to
> > > support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
> > > truly, and is outside the scope of zswap/zeromap, is being able to
> > > support hybrid mTHP swapin.
> > >
> > > When swapping in an mTHP, the swapped entries can be on disk, in the
> > > swapcache, in zswap, or in the zeromap. Even if all these things
> > > support mTHPs individually, we essentially need support to form an
> > > mTHP from swap entries in different backends. That's what I meant.
> > > Actually if we have that, we may not really need mTHP swapin support
> > > in zswap, because we can just form the large folio in the swap layer
> > > from multiple zswap entries.
> > >
> >
> > After further consideration, I've actually started to disagree with the idea
> > of supporting hybrid swapin (forming an mTHP from swap entries in different
> > backends). My reasoning is as follows:
>
> I do not have any data about this, so you could very well be right
> here. Handling hybrid swapin could be simply falling back to the
> smallest order we can swapin from a single backend. We can at least
> start with this, and collect data about how many mTHP swapins fallback
> due to hybrid backends. This way we only take the complexity if
> needed.
>
> I did imagine though that it's possible for two virtually contiguous
> folios to be swapped out to contiguous swap entries and end up in
> different media (e.g. if only one of them is zero-filled). I am not
> sure how rare it would be in practice.
> >
> > 1. The scenario where an mTHP is partially zeromap, partially zswap, etc.,
> > would be an extremely rare case, as long as we're swapping out the mTHP as
> > a whole and all the modules are handling it accordingly. It's highly
> > unlikely to form this mix of zeromap, zswap, and swapcache unless the
> > contiguous VMA virtual address happens to get some small folios with
> > aligned and contiguous swap slots. Even then, they would need to be
> > partially zeromap and partially non-zeromap, zswap, etc.
>
> As I mentioned, we can start simple and collect data for this. If it's
> rare and we don't need to handle it, that's good.
> >
> > As you mentioned, zeromap handles mTHP as a whole during swapping
> > out, marking all subpages of the entire mTHP as zeromap rather than just
> > a subset of them.
> >
> > And swap-in can also entirely map a swapcache which is a large folio based
> > on our previous patchset which has been in mainline:
> > "mm: swap: entirely map large folios found in swapcache"
> > https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@gmail.com/
> >
> > It seems the only thing we're missing is zswap support for mTHP.
>
> It is still possible for two virtually contiguous folios to be swapped
> out to contiguous swap entries. It is also possible that a large folio
> is swapped out as a whole, then only a part of it is swapped in later
> due to memory pressure. If that part is later reclaimed again and gets
> added to the swapcache, we can run into the hybrid swapin situation.
> There may be other scenarios as well, I did not think this through.
> >
> > 2. Implementing hybrid swap-in would be extremely tricky and could disrupt
> > several software layers. I can share some pseudo code below:
>
> Yeah it definitely would be complex, so we need proper justification for it.
> >
> > swap_read_folio()
> > {
> >         if (zeromap_full)
> >                 folio_read_from_zeromap()
> >         else if (zswap_map_full)
> >                 folio_read_from_zswap()
> >         else {
> >                 folio_read_from_swapfile()
> >                 if (zeromap_partial)
> >                         folio_read_from_zeromap_fixup() /* fill zero for partially zeromap subpages */
> >                 if (zswap_partial)
> >                         folio_read_from_zswap_fixup() /* zswap_load for partially zswap-mapped subpages */
> >
> >                 folio_mark_uptodate()
> >                 folio_unlock()
> >         }
> > }
> >
> > We'd also need to modify folio_read_from_swapfile() to skip
> > folio_mark_uptodate() and folio_unlock() after completing the BIO.
> > This approach seems to entirely disrupt the software layers.
> >
> > This could also lead to unnecessary IO operations for subpages that
> > require fixup.
> > Since such cases are quite rare, I believe the added complexity isn't worth it.
> >
> > My point is that we should simply check that all PTEs have consistent zeromap,
> > zswap, and swapcache statuses before proceeding, otherwise fall back to the next
> > lower order if needed. This approach improves performance and avoids complex
> > corner cases.
>
> Agree that we should start with that, although we should probably
> fallback to the largest order we can swapin from a single backend,
> rather than the next lower order.
> >
> > So once zswap mTHP is there, I would also expect an API similar to
> > swap_zeromap_entries_check() for example:
> > zswap_entries_check(entry, nr) which can return if we are having
> > full, non, and partial zswap to replace the existing
> > zswap_never_enabled().
>
> I think a better API would be similar to what Usama had. Basically
> take in (entry, nr) and return how much of it is in zswap starting at
> entry, so that we can decide the swapin order.
>
> Maybe we can adjust your proposed swap_zeromap_entries_check() as well
> to do that? Basically return the number of swap entries in the zeromap
> starting at 'entry'. If 'entry' itself is not in the zeromap we return
> 0 naturally. That would be a small adjustment/fix over what Usama had,
> but implementing it with bitmap operations like you did would be
> better.

I assume you mean the below:

/*
 * Return the number of contiguous zeromap entries starting from entry
 */
static inline unsigned int swap_zeromap_entries_count(swp_entry_t entry, int nr)
{
        struct swap_info_struct *sis = swp_swap_info(entry);
        unsigned long start = swp_offset(entry);
        unsigned long end = start + nr;
        unsigned long idx;

        idx = find_next_bit(sis->zeromap, end, start);
        if (idx != start)
                return 0;

        return find_next_zero_bit(sis->zeromap, end, start) - idx;
}

If yes, I really like this idea.
It seems much better than using an enum, which would
require adding a new data structure :-) Additionally, returning the number
allows callers to fall back to the largest possible order, rather than trying
next lower orders sequentially.

Hi Usama, what is your take on this?

>
> >
> > Though I am not sure how cheap zswap can implement it,
> > swap_zeromap_entries_check() could be two simple bit operations:
> >
> > +static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
> > +{
> > +       struct swap_info_struct *sis = swp_swap_info(entry);
> > +       unsigned long start = swp_offset(entry);
> > +       unsigned long end = start + nr;
> > +
> > +       if (find_next_bit(sis->zeromap, end, start) == end)
> > +               return SWAP_ZEROMAP_NON;
> > +       if (find_next_zero_bit(sis->zeromap, end, start) == end)
> > +               return SWAP_ZEROMAP_FULL;
> > +
> > +       return SWAP_ZEROMAP_PARTIAL;
> > +}
> >
> > 3. swapcache is different from zeromap and zswap. Swapcache indicates
> > that the memory is still available and should be re-mapped rather than
> > allocating a new folio. Our previous patchset has implemented a full
> > re-map of an mTHP in do_swap_page() as mentioned in 1.
> >
> > For the same reason as point 1, partial swapcache is a rare edge case.
> > Not re-mapping it and instead allocating a new folio would add
> > significant complexity.
> >
> > > >
> > > > Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
> > > > permit almost all mTHP swap-ins, except for those rare situations where
> > > > small folios that were swapped out happen to have contiguous and aligned
> > > > swap slots.
> > > >
> > > > swapcache is another quite different story, since our user scenarios begin from
> > > > the simplest sync io on mobile phones, we don't quite care about swapcache.
> > >
> > > Right.
> > > The reason I bring this up is as I mentioned above, there is a
> > > common problem of forming large folios from different sources, which
> > > includes the swap cache. The fact that synchronous swapin does not use
> > > the swapcache was a happy coincidence for you, as you can add support
> > > mTHP swapins without handling this case yet ;)
> >
> > As I mentioned above, I'd really rather filter out those corner cases
> > than support them, not just for the current situation to unlock
> > swap-in series :-)
>
> If they are indeed corner cases, then I definitely agree.

Thanks
Barry
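
To make the "return a count, then pick the largest order" idea concrete, here is a minimal userspace sketch. It is only an illustration under stated assumptions: a plain bool array stands in for sis->zeromap, the names zeromap_entries_count() and largest_order() are hypothetical (not kernel APIs), and alignment of the swap offset to the folio size is ignored.

```c
#include <stdbool.h>

/*
 * Hypothetical userspace model: 'zeromap' is a bool array with one element
 * per swap slot, standing in for the kernel bitmap sis->zeromap.
 * Returns the number of contiguous zeromap entries beginning at 'start',
 * capped at 'nr'. Returns 0 if the first entry is not in the zeromap.
 */
static unsigned int zeromap_entries_count(const bool *zeromap,
                                          unsigned long start,
                                          unsigned int nr)
{
        unsigned int n = 0;

        while (n < nr && zeromap[start + n])
                n++;

        return n;
}

/*
 * Largest order such that (1 << order) <= count, or -1 when count is 0.
 * A caller could use this to go straight to the biggest usable order
 * instead of retrying each lower order in turn.
 */
static int largest_order(unsigned int count)
{
        int order = -1;

        while (count) {
                order++;
                count >>= 1;
        }

        return order;
}
```

For example, if slots 0..5 of an 8-slot range are zero-filled and slots 6..7 are not, zeromap_entries_count() returns 6 and largest_order(6) returns 2, so the caller would attempt an order-2 (4-page) swapin from the zeromap in one step.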