From: Barry Song <21cnbao@gmail.com>
Date: Wed, 6 Mar 2024 14:33:14 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Nhat Pham
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, linux-mm, ryan.roberts@arm.com, David Hildenbrand, Chuanhua Han

On Fri, Mar 1, 2024 at 10:53 PM Nhat Pham wrote:
>
> On Fri, Mar 1, 2024 at 4:24 PM Chris Li wrote:
> >
> > In last year's LSF/MM I talked about a VFS-like swap system. That is
> > the pony that was chosen.
> > However, I did not have much chance to go into details.
>
> I'd love to attend this talk/chat :)
>
> >
> > This year, I would like to discuss what it takes to re-architect the
> > whole swap back end from scratch?
> >
> > Let's start from the requirements for the swap back end.
> >
> > 1) support the existing swap usage (not the implementation).
> >
> > Some other design goals::
> >
> > 2) low per swap entry memory usage.
> >
> > 3) low io latency.
> >
> > What are the functions the swap system needs to support?
> >
> > At the device level. Swap systems need to support a list of swap files
> > with a priority order. The same priority of swap device will do round
> > robin writing on the swap device. The swap device type includes zswap,
> > zram, SSD, spinning hard disk, swap file in a file system.
> >
> > At the swap entry level, here is the list of existing swap entry usage:
> >
> > * Swap entry allocation and free. Each swap entry needs to be
> > associated with a location of the disk space in the swapfile. (offset
> > of swap entry).
> > * Each swap entry needs to track the map count of the entry. (swap_map)
> > * Each swap entry needs to be able to find the associated memory
> > cgroup. (swap_cgroup_ctrl->map)
> > * Swap cache. Lookup folio/shadow from swap entry
> > * Swap page writes through a swapfile in a file system other than a
> > block device. (swap_extent)
> > * Shadow entry. (store in swap cache)
>
> IMHO, one thing this new abstraction should support is seamless
> transfer/migration of pages from one backend to another (perhaps from
> high to low priority backends, i.e writeback).
>
> I think this will require some careful redesigns. The closest thing we
> have right now is zswap -> backing swapfile. But it is currently
> handled in a rather peculiar manner - the underlying swap slot has
> already been reserved for the zswap entry. But there's a couple of
> problems with this:
>
> a) This is wasteful.
> We're essentially having the same piece of data
> occupying spaces in two levels in the hierarchies.
> b) How do we generalize to a multi-tier hierarchy?
> c) This is a bit too backend-specific. It'd be nice if we can make
> this as backend-agnostic as possible (if possible).
>
> Motivation: I'm currently working/thinking about decoupling zswap and
> swap, and this is one of the more challenging aspects (as I can't seem
> to find a precedent in the swap world for inter-swap backends pages
> migration), and especially with respect to concurrent loads (and
> swapcache interactions).
>
> I don't have good answers/designs quite yet - just raising some
> questions/concerns :)

I actually have one more problem here. To swap in a large folio with,
say, 16 subpages, it could be that 5 subpages are in zswap and the other
11 are in the backend swap. We have no way to differentiate this unless
we iterate over the subpages one by one within the large folio before
calling zswap_load(). Right now, swap_read_folio() can't handle this:

void swap_read_folio(struct folio *folio, bool synchronous,
		struct swap_iocb **plug)
{
	...
	if (zswap_load(folio)) {
		folio_mark_uptodate(folio);
		folio_unlock(folio);
	} else if (data_race(sis->flags & SWP_FS_OPS)) {
		swap_read_folio_fs(folio, plug);
	} else if (synchronous || (sis->flags & SWP_SYNCHRONOUS_IO)) {
		swap_read_folio_bdev_sync(folio, sis);
	} else {
		swap_read_folio_bdev_async(folio, sis);
	}
	...
}

Thanks
Barry
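
To make the mixed case above concrete, here is a minimal userspace sketch in C. It is not kernel code. It assumes a plain boolean array stands in for a per-subpage zswap lookup; classify_folio(), enum folio_backing and the in_zswap array are hypothetical names made up for illustration and do not exist in the kernel. It only shows the kind of one-by-one scan that would be needed before choosing a read path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result of scanning a large folio's subpages. */
enum folio_backing {
	FOLIO_ALL_ZSWAP,	/* every subpage is held by zswap */
	FOLIO_ALL_BACKEND,	/* every subpage is on the backing swap device */
	FOLIO_MIXED,		/* some subpages in zswap, some on the device */
};

/*
 * in_zswap[i] is true when subpage i is held by zswap. In the kernel this
 * would be some per-offset lookup; the array is an assumption made so the
 * sketch can run in userspace.
 */
enum folio_backing classify_folio(const bool *in_zswap, size_t nr_pages)
{
	size_t nr_zswap = 0;

	assert(in_zswap != NULL && nr_pages > 0);

	/* Walk the subpages one by one and count those found in zswap. */
	for (size_t i = 0; i < nr_pages; i++) {
		if (in_zswap[i])
			nr_zswap++;
	}

	if (nr_zswap == nr_pages)
		return FOLIO_ALL_ZSWAP;
	if (nr_zswap == 0)
		return FOLIO_ALL_BACKEND;
	return FOLIO_MIXED;
}
```

For the 16-subpage example (5 in zswap, 11 in the backend), such a scan would report the mixed case, which the single zswap_load(folio) call in swap_read_folio() has no way to express.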