From: Yosry Ahmed <yosryahmed@google.com>
Date: Fri, 24 Mar 2023 00:28:31 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: "Huang, Ying"
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner,
 Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins,
 Seth Jennings, Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu,
 Minchan Kim, Andrew Morton, Aneesh Kumar K V, Wei Xu
In-Reply-To: <87sfduri1j.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Thu, Mar 23, 2023 at 7:38 PM Huang, Ying wrote:
>
> Yosry Ahmed writes:
>
> > On Wed, Mar 22, 2023 at 10:39 PM Huang, Ying wrote:
> >>
> >> Yosry Ahmed writes:
> >>
> >> > On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying wrote:
> >> >>
> >> >> Yosry Ahmed writes:
> >> >>
> >> >> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yosry Ahmed writes:
> >> >> >>
> >> >> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yosry Ahmed writes:
> >> >> >> >>
> >> >> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying wrote:
> >> >> >> >> >>
> >> >> >> >> >> Yosry Ahmed writes:
> >> >> >> >> >>
> >> >> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying wrote:
> >> >> >> >> >> >>
> >> >> >> >> >> >> Yosry Ahmed writes:
> >> >> >> >> >> >>
> >>
> >> [snip]
> >>
> >> >> >> >> >>
> >> >> >> >> >> > xarray (b) is indexed by swap id as well
> >> >> >> >> >> > and contains a swap entry or zswap entry. Reverse mapping might be
> >> >> >> >> >> > needed.
> >> >> >> >> >>
> >> >> >> >> >> Reverse mapping isn't needed.
> >> >> >> >> >
> >> >> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
> >> >> >> >> > sure I understand how it can be indexed by the swap entry if the
> >> >> >> >> > indirection is enabled.
> >> >> >> >> >
> >> >> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> >> >> >> >> >> > the xarray (b) entry + memory overhead from the 2nd xarray + reverse
> >> >> >> >> >> > mapping where needed.
> >> >> >> >> >> >
> >> >> >> >> >> > There is also the extra cpu overhead for an extra lookup in certain
> >> >> >> >> >> > paths.
> >> >> >> >> >> >
> >> >> >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
> >> >> >> >> >> > good if the reverse mapping can be avoided in enough situations, and
> >> >> >> >> >> > that we should consider such alternatives otherwise. As I mentioned
> >> >> >> >> >> > above, I think it comes down to whether we can completely restrict
> >> >> >> >> >> > cluster readahead to rotating disks or not -- in which case we need to
> >> >> >> >> >> > decide what to do for shmem and for anon when vma readahead is
> >> >> >> >> >> > disabled.
> >> >> >> >> >>
> >> >> >> >> >> We can even have a minimal indirection implementation, where the swap
> >> >> >> >> >> cache and swap_map[] are kept as they were before, and just one xarray
> >> >> >> >> >> is added. The xarray is indexed by swap id (or swap_desc index) to
> >> >> >> >> >> store the corresponding swap entry.
> >> >> >> >> >>
> >> >> >> >> >> When indirection is disabled, there is no extra overhead.
> >> >> >> >> >>
> >> >> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
> >> >> >> >> >> swapped page.
> >> >> >> >> >>
> >> >> >> >> >> The basic migration support can be built on top of this.
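To make sure we are talking about the same thing, here is roughly what
I imagine that minimal scheme looks like (all names below are made up
for illustration, not actual proposals, and storing the entry as an
xarray value assumes swp_entry_t.val fits under LONG_MAX):

#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/xarray.h>

/* One extra xarray: stable swap id -> current swp_entry_t. */
static DEFINE_XARRAY(swap_id_to_entry);

/* Record (or update) where a swapped page currently lives. */
static int swap_id_set_entry(unsigned long id, swp_entry_t entry, gfp_t gfp)
{
	/* swp_entry_t is a bare unsigned long, so store it as an xarray value. */
	return xa_err(xa_store(&swap_id_to_entry, id,
			       xa_mk_value(entry.val), gfp));
}

/* Resolve the indirection: swap id -> current swap entry. */
static swp_entry_t swap_id_get_entry(unsigned long id)
{
	void *p = xa_load(&swap_id_to_entry, id);

	return (swp_entry_t){ .val = p ? xa_to_value(p) : 0 };
}

With indirection disabled, callers would keep using the swap entry
directly and none of this exists, which matches the "no extra
overhead" point above.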
> >> >> >> >> >> I think that this could be a baseline for indirection support. Then
> >> >> >> >> >> further optimization can be built on top of it step by step with
> >> >> >> >> >> supporting data.
> >> >> >> >> >
> >> >> >> >> > I am not sure how this works with zswap. Currently the swap_map[]
> >> >> >> >> > implementation is specific to swapfiles; it does not work for zswap
> >> >> >> >> > unless we implement separate swap counting logic for zswap &
> >> >> >> >> > swapfiles. Same for the swapcache: it currently supports being indexed
> >> >> >> >> > by a swap entry, so it would need to support being indexed by a swap
> >> >> >> >> > id, or have a separate swap cache for zswap. Having separate
> >> >> >> >> > implementations would add complexity, and we would need to perform
> >> >> >> >> > handoffs of the swap count/cache when a page is moved from zswap to a
> >> >> >> >> > swapfile.
> >> >> >> >>
> >> >> >> >> We can allocate a swap entry for each swapped page in zswap.
> >> >> >> >
> >> >> >> > This is exactly what the current implementation does and what we want
> >> >> >> > to move away from. The current implementation uses zswap as an
> >> >> >> > in-memory compressed cache on top of an actual swap device, and each
> >> >> >> > swapped page in zswap has a swap entry allocated. With this
> >> >> >> > implementation, zswap cannot be used without a swap device.
> >> >> >>
> >> >> >> I totally agree that we should avoid using an actual swap device under
> >> >> >> zswap. And, as a swap implementation, zswap can manage the swap entry
> >> >> >> inside zswap without an underlying actual swap device. For example,
> >> >> >> when we swap a page to zswap (actually compress it), we can allocate a
> >> >> >> (virtual) swap entry in zswap. I understand that there's overhead to
> >> >> >> manage the swap entry in zswap. We can consider how to reduce that
> >> >> >> overhead.
> >> >> >
> >> >> > I see. So we can (for example) use one of the swap types for zswap,
> >> >> > and then have the zswap code handle this entry according to its
> >> >> > implementation. We can then have an xarray that maps swap ID -> swap
> >> >> > entry, and this swap entry is used to index the swap cache and such.
> >> >> > When a swapped page is moved between backends, we update the swap ID ->
> >> >> > swap entry xarray.
> >> >> >
> >> >> > This is one possible implementation that I thought of (very briefly
> >> >> > tbh), but it does have its problems:
> >> >> >
> >> >> > For zswap:
> >> >> > - Managing swap entries inside zswap unnecessarily.
> >> >> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
> >> >> > similar to the current rbtree, which is something that we can get rid
> >> >> > of with the initial proposal if we embed the zswap_entry pointer
> >> >> > directly in the swap_desc (it can be encoded to avoid breaking the
> >> >> > abstraction).
> >> >> >
> >> >> > For mm/swap in general:
> >> >> > - When we allocate a swap entry today, we store it in folio->private
> >> >> > (or page->private), which is used by the unmapping code to be placed
> >> >> > in the page tables or shmem page cache. With this implementation, we
> >> >> > need to store the swap ID in page->private instead, which means that
> >> >> > every time we need to access the swap cache during reclaim/swapout we
> >> >> > need to look up the swap entry first.
> >> >> > - On the fault path, we need two lookups instead of one (swap ID ->
> >> >> > swap entry, swap entry -> swap cache); I am not sure how this affects
> >> >> > fault latency.
> >> >> > - Each swap backend will have its own separate implementation of swap
> >> >> > counting, which is hard to maintain and very error-prone since the
> >> >> > logic is backend-agnostic.
> >> >> > - Handing over a page from one swap backend to another includes
> >> >> > handing over swap cache entries and swap counts, which I imagine will
> >> >> > involve considerable synchronization.
> >> >> >
> >> >> > Do you have any thoughts on this?
> >> >>
> >> >> Yes. I understand there's additional overhead. I have no clear idea
> >> >> about how to reduce this now. We need to think about that in depth.
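To make the two-lookups concern concrete, building on the sketch
above, the fault path would look roughly like this (SWP_ZSWAP_TYPE and
zswap_load_id() are invented names for illustration; only swp_type()
and swap_cache_get_folio() are existing helpers):

/* Hypothetical: reserve one swap type for zswap-owned entries. */
#define SWP_ZSWAP_TYPE	(MAX_SWAPFILES - 1)	/* made-up reservation */

static struct folio *swap_id_fault(unsigned long id)
{
	/* Lookup 1: swap id -> current swap entry. */
	swp_entry_t entry = swap_id_get_entry(id);

	/* Hand zswap-owned (virtual) entries to zswap itself. */
	if (swp_type(entry) == SWP_ZSWAP_TYPE)
		return zswap_load_id(id);	/* invented helper */

	/* Lookup 2: swap entry -> swap cache (swap-in on miss not shown). */
	return swap_cache_get_folio(entry, NULL, 0);
}

That second lookup is the one I tried to bound with the perf numbers
further down.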
> >
> > I agree that we need to think deeper about the tradeoff here. It seems
> > like the extra xarray lookup may not be a huge problem, but there are
> > other concerns, such as having separate implementations of swap
> > counting that are basically doing the same thing in different ways for
> > different backends.
>
> In fact, I just suggest to use the minimal design on top of the current
> implementation as the first step. Then, you can improve it step by
> step.
>
> The first step could be the minimal effort to implement the indirection
> layer and moving swapped pages between swap implementations. Based on
> that, you can build other optimizations, such as pulling swap counting
> into the swap core. For each step, we can evaluate the gain and cost
> with data.

Right, I understand that, but to implement the indirection layer on top
of the current implementation, we will need to support using zswap
without a backing swap device. In order to do this without pulling swap
counting into the swap core, we need to implement swap counting logic
in zswap. We need to implement swap entry management in zswap as well.
Right?

> Anyway, I don't think you can just implement all of your final solution
> in one step. And, I think the minimal design suggested could be a
> starting point.

I agree that's a great point. I am just afraid that we will avoid
implementing that full final solution and instead do a lot of work
inside zswap to make up for the difference (e.g. swap entry management,
swap counting). Also, that work in zswap may end up being unacceptable
due to the maintenance burden and/or complexity.

> >> >>
> >> >> The bottom line is whether this is worse than the current zswap
> >> >> implementation?
> >> >
> >> > It's not just zswap; as I note above, this design would introduce
> >> > some overheads to the core swapping code as well, as long as the
> >> > indirection layer is active. I am particularly worried about the
> >> > extra lookups on the fault path.
> >>
> >> Maybe you can measure the time for the radix tree lookup? And compare
> >> it with the total fault time?
> >
> > I ran a simple test with perf, swapping in a 1G shmem file:
> >
> > |--1.91%--swap_cache_get_folio
> > |          |
> > |           --1.32%--__filemap_get_folio
> > |                      |
> > |                       --0.66%--xas_load
> >
> > Seems like the swap cache lookup is ~2%, and < 1% is coming from the
> > xarray lookup. I am not sure if the lookup time varies a lot with
> > fragmentation and different access patterns, but it seems like it's
> > generally not a major contributor to the latency.
>
> Thanks for the data!
>
> >>
> >> > For zswap, we already have a lookup today, so maintaining a swap
> >> > entry -> zswap entry mapping would not be a regression, but I am not
> >> > sure about the extra overhead to manage swap entries within zswap.
> >> > Keep in mind that using swap entries for zswap probably implies
> >> > having a fixed/max size for zswap (to be able to manage the swap
> >> > entries efficiently, similar to swap devices), which is a limitation
> >> > that the initial proposal was hoping to overcome.
> >>
> >> We have limited bits in the PTE, so the max number of zswap entries
> >> will be limited anyway. And, we don't need to manage swap entries in
> >> the same way as disks (which need to consider sequential writing,
> >> etc.).
> >
> > Right, the number of bits allowed would impose a maximum on the swap
> > ID, which would imply a maximum on the number of zswap entries. The
> > concern is about managing swap entries within zswap. If zswap needs to
> > keep track of the entries it has allocated and the entries that are
> > free, it needs a data structure to do so (e.g. a bitmap). The size of
> > this data structure can potentially scale with the maximum number of
> > entries, so we would want to impose a virtual limit on zswap entries
> > to limit the size of the data structure. Alternatively, we can have a
> > dynamic data structure, but this also comes with its complexities.
>
> Yes. We will need that.
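If we do go the virtual-entry route, one dynamic option for that
bookkeeping (instead of a fixed bitmap) could be the kernel's IDA
allocator; a minimal sketch, where ZSWAP_MAX_ENTRIES is a made-up
virtual limit:

#include <linux/idr.h>

#define ZSWAP_MAX_ENTRIES	(1U << 30)	/* made-up virtual limit */

static DEFINE_IDA(zswap_entry_ida);

/* Allocate a free virtual offset for a page entering zswap. */
static int zswap_alloc_virtual_offset(void)
{
	return ida_alloc_range(&zswap_entry_ida, 0, ZSWAP_MAX_ENTRIES - 1,
			       GFP_KERNEL);
}

/* Release the offset when the page is freed or moved to another backend. */
static void zswap_free_virtual_offset(unsigned int offset)
{
	ida_free(&zswap_entry_ida, offset);
}

The IDA only allocates its internal bitmap blocks on demand, so the
memory cost scales with the number of live entries rather than with
the virtual limit, though it is still one more piece of state to
maintain inside zswap.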
>
> Best Regards,
> Huang, Ying