From: Barry Song <21cnbao@gmail.com>
Date: Thu, 5 Sep 2024 19:03:42 +1200
Subject: Re: [PATCH v4 1/2] mm: store zero pages to be swapped out in a bitmap
To: Yosry Ahmed
Cc: usamaarif642@gmail.com, akpm@linux-foundation.org, chengming.zhou@linux.dev,
    david@redhat.com, hannes@cmpxchg.org, hughd@google.com, kernel-team@meta.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, nphamcs@gmail.com,
    shakeel.butt@linux.dev, willy@infradead.org, ying.huang@intel.com,
    hanchuanhua@oppo.com
References: <20240612124750.2220726-2-usamaarif642@gmail.com> <20240904055522.2376-1-21cnbao@gmail.com>

On Thu, Sep 5, 2024 at 5:41 AM Yosry Ahmed wrote:
>
> [..]
> > > I understand the point of doing this to unblock the synchronous large
> > > folio swapin support work, but at some point we're gonna have to
> > > actually handle the cases where a large folio being swapped in is
> > > partially in the swap cache, zswap, the zeromap, etc.
> > >
> > > All these cases will need similar-ish handling, and I suspect we won't
> > > just skip swapping in large folios in all these cases.
> >
> > I agree that this is definitely the goal. `swap_read_folio()` should be a
> > dependable API that always returns reliable data, regardless of whether
> > `zeromap` or `zswap` is involved. Despite these issues, mTHP swap-in
> > shouldn't be held back. Significant efforts are underway to support large
> > folios in `zswap`, and progress is being made. Not to mention we've
> > already allowed `zeromap` to proceed, even though it doesn't support
> > large folios.
> >
> > It's genuinely unfair to let the lack of mTHP support in `zeromap` and
> > `zswap` hold swap-in hostage.
>

Hi Yosry,

> Well, two points here:
>
> 1. I did not say that we should block the synchronous mTHP swapin work
> for this :) I said the next item on the TODO list for mTHP swapin
> support should be handling these cases.

Thanks for your clarification!

> 2. I think two things are getting conflated here. Zswap needs to
> support mTHP swapin*. Zeromap already supports mTHPs AFAICT. What is
> truly missing, and is outside the scope of zswap/zeromap, is being able
> to support hybrid mTHP swapin.
>
> When swapping in an mTHP, the swapped entries can be on disk, in the
> swapcache, in zswap, or in the zeromap. Even if all these things
> support mTHPs individually, we essentially need support to form an
> mTHP from swap entries in different backends. That's what I meant.
> Actually if we have that, we may not really need mTHP swapin support
> in zswap, because we can just form the large folio in the swap layer
> from multiple zswap entries.

After further consideration, I've actually started to disagree with the
idea of supporting hybrid swapin (forming an mTHP from swap entries in
different backends). My reasoning is as follows:

1. The scenario where an mTHP is partially zeromap, partially zswap, etc.
would be an extremely rare case, as long as we swap out the mTHP as a
whole and all the modules handle it accordingly. It's highly unlikely to
end up with this mix of zeromap, zswap, and swapcache unless a contiguous
VMA range happens to get small folios with aligned and contiguous swap
slots, and even then they would need to be partially zeromap and partially
non-zeromap, zswap, etc.

As you mentioned, zeromap handles an mTHP as a whole during swap-out,
marking all subpages of the entire mTHP as zeromap rather than just a
subset of them.
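Roughly, that whole-folio marking at swap-out time is just a loop over the
folio's swap slots. A minimal sketch of the idea (assuming the sis->zeromap
bitmap introduced by this patch; the helper name is illustrative, not the
exact patch code):

static void zeromap_mark_folio(struct swap_info_struct *sis,
			       struct folio *folio)
{
	unsigned long start = swp_offset(folio->swap);
	long i;

	/*
	 * Sketch only: when an entirely zero-filled large folio is
	 * "swapped out", mark every slot it occupies in the per-device
	 * bitmap, so a later swap-in of the same range observes a
	 * uniform zeromap state.
	 */
	for (i = 0; i < folio_nr_pages(folio); i++)
		set_bit(start + i, sis->zeromap);
}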
Swap-in can also entirely map a swapcache large folio, based on our
previous patchset, which is already in mainline:
"mm: swap: entirely map large folios found in swapcache"
https://lore.kernel.org/all/20240529082824.150954-1-21cnbao@gmail.com/

It seems the only thing we're missing is zswap support for mTHP.

2. Implementing hybrid swap-in would be extremely tricky and could disrupt
several software layers. I can share some pseudo code below:

swap_read_folio()
{
	if (zeromap_full)
		folio_read_from_zeromap()
	else if (zswap_map_full)
		folio_read_from_zswap()
	else {
		folio_read_from_swapfile()

		if (zeromap_partial)
			folio_read_from_zeromap_fixup() /* fill zero for partially zeromap subpages */
		if (zswap_partial)
			folio_read_from_zswap_fixup() /* zswap_load for partially zswap-mapped subpages */

		folio_mark_uptodate()
		folio_unlock()
	}
}

We'd also need to modify folio_read_from_swapfile() to skip
folio_mark_uptodate() and folio_unlock() after completing the BIO. This
approach would entirely disrupt the software layers, and it could also lead
to unnecessary I/O operations for the subpages that require fixup. Since
such cases are quite rare, I believe the added complexity isn't worth it.

My point is that we should simply check that all PTEs have consistent
zeromap, zswap, and swapcache status before proceeding, and otherwise fall
back to the next lower order if needed (a sketch of this follows point 3
below). This approach improves performance and avoids complex corner cases.

So once zswap mTHP support is there, I would also expect an API similar to
swap_zeromap_entries_check(), for example:

	zswap_entries_check(entry, nr)

which can report whether we have full, no, or partial zswap coverage and
would replace the existing zswap_never_enabled(). Though I am not sure how
cheaply zswap can implement it, swap_zeromap_entries_check() could be two
simple bit operations:

+static inline zeromap_stat_t swap_zeromap_entries_check(swp_entry_t entry, int nr)
+{
+	struct swap_info_struct *sis = swp_swap_info(entry);
+	unsigned long start = swp_offset(entry);
+	unsigned long end = start + nr;
+
+	if (find_next_bit(sis->zeromap, end, start) == end)
+		return SWAP_ZEROMAP_NON;
+	if (find_next_zero_bit(sis->zeromap, end, start) == end)
+		return SWAP_ZEROMAP_FULL;
+
+	return SWAP_ZEROMAP_PARTIAL;
+}

3. Swapcache is different from zeromap and zswap: it indicates that the
memory is still available and should be re-mapped rather than a new folio
being allocated. Our previous patchset already implements a full re-map of
an mTHP in do_swap_page(), as mentioned in point 1. For the same reason as
in point 1, a partial swapcache hit is a rare edge case; not re-mapping it
and instead allocating a new folio would add significant complexity.
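To make the fall-back idea in point 2 concrete, here is a rough sketch of
how the swap-in path could pick an order. It assumes the
swap_zeromap_entries_check() above plus the hypothetical
zswap_entries_check() and a ZSWAP_PARTIAL return value, none of which exist
in mainline yet:

/*
 * Sketch only: walk orders from large to small and take the first one
 * whose swap entries are uniformly covered (or uniformly not covered)
 * by each backend, so no per-subpage fixup is ever needed. Assumes
 * `entry` is naturally aligned to each order being tested.
 */
static int thp_swapin_pick_order(swp_entry_t entry, int max_order)
{
	int order;

	for (order = max_order; order > 0; order--) {
		int nr = 1 << order;

		/* a partial zeromap mix would need per-subpage fixup: skip */
		if (swap_zeromap_entries_check(entry, nr) == SWAP_ZEROMAP_PARTIAL)
			continue;
		/* same rule once zswap grows a ranged check */
		if (zswap_entries_check(entry, nr) == ZSWAP_PARTIAL)
			continue;
		return order;
	}

	return 0;	/* fall back to order-0, the existing small-folio path */
}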
> > Nonetheless, `zeromap` and `zswap` are distinct cases. With `zeromap`, we
> > permit almost all mTHP swap-ins, except for those rare situations where
> > small folios that were swapped out happen to have contiguous and aligned
> > swap slots.
> >
> > swapcache is another quite different story; since our user scenarios begin
> > from the simplest sync io on mobile phones, we don't quite care about
> > swapcache.
>
> Right. The reason I bring this up is, as I mentioned above, there is a
> common problem of forming large folios from different sources, which
> includes the swap cache. The fact that synchronous swapin does not use
> the swapcache was a happy coincidence for you, as you can add support for
> mTHP swapins without handling this case yet ;)

As I mentioned above, I'd really rather filter out those corner cases than
support them, not just for the current situation to unlock the swap-in
series :-)

Thanks
Barry