From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 28 Feb 2023 00:09:07 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Kalesh Singh
Cc: Johannes Weiner, Yang Shi, lsf-pc@lists.linux-foundation.org, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton, Nhat Pham, Akilesh Kailash

On Mon, Feb 27, 2023 at 8:29 PM Kalesh Singh wrote:
>
> On Wed, Feb 22, 2023 at 2:47 PM Yosry Ahmed wrote:
> >
> > On Wed, Feb 22, 2023 at 8:57 AM Johannes Weiner wrote:
> > >
> > > Hello,
> > >
> > > thanks for proposing this, Yosry. I'm very interested in this work. Unfortunately, I won't be able to attend LSFMMBPF myself this time around due to a scheduling conflict :(
> >
> > Ugh, would have been great to have you, I guess there might be a remote option, or we will end up discussing on the mailing list eventually anyway.
> >
> > >
> > > On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> > > > On Tue, Feb 21, 2023 at 3:34 PM Yang Shi wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed wrote:
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi wrote:
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed wrote:
> > > > > > > >
> > > > > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi wrote:
> > > > > > > > >
> > > > > > > > > Hi Yosry,
> > > > > > > > >
> > > > > > > > > Thanks for proposing this topic. I was thinking about this before but I didn't make too much progress due to some other distractions, and I got a couple of follow up questions about your design. Please see the inline comments below.
> > > > > > > >
> > > > > > > > Great to see interested folks, thanks!
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed wrote:
> > > > > > > > > >
> > > > > > > > > > Hello everyone,
> > > > > > > > > >
> > > > > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May 2023 about swap & zswap (hope I am not too late).
> > > > > > > > > >
> > > > > > > > > > ==================== Intro ====================
> > > > > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary way. To use zswap, you need a swapfile configured (even if the space will not be used) and zswap is restricted by its size. When pages reside in zswap, the corresponding swap entry in the swapfile cannot be used, and is essentially wasted. We also go through unnecessary code paths when using zswap, such as finding and allocating a swap entry on the swapout path, or readahead in the swapin path. I am proposing a swapping abstraction layer that would allow us to remove zswap's dependency on swapfiles. This can be done by introducing a data structure between the actual swapping implementation (swapfiles, zswap) and the rest of the MM code.
> > > > > > > > > >
> > > > > > > > > > ==================== Objective ====================
> > > > > > > > > > Enabling the use of zswap without a backing swapfile, which makes zswap useful for a wider variety of use cases. Also, when zswap is used with a swapfile, the pages in zswap do not use up space in the swapfile, so the overall swapping capacity increases.
> > > > > > > > > >
> > > > > > > > > > ==================== Idea ====================
> > > > > > > > > > Introduce a data structure, which I currently call a swap_desc, as an abstraction layer between swapping implementation and the rest of MM code. Page tables & page caches would store a swap id (encoded as a swp_entry_t) instead of directly storing the swap entry associated with the swapfile. This swap id maps to a struct swap_desc, which acts as our abstraction layer. All MM code not concerned with swapping details would operate in terms of swap descs. The swap_desc can point to either a normal swap entry (associated with a swapfile) or a zswap entry. It can also include all non-backend specific operations, such as the swapcache (which would be a simple pointer in swap_desc), swap counting, etc. It creates a clear, nice abstraction layer between MM code and the actual swapping implementation.
> > > > > > > > >
> > > > > > > > > How will the swap_desc be allocated? Dynamically or preallocated? Is it 1:1 mapped to the swap slots on swap devices (whatever it is backed by, for example, zswap, swap partition, swapfile, etc)?
> > > > > > > >
> > > > > > > > I imagine swap_desc's would be dynamically allocated when we need to swap something out. When allocated, a swap_desc would either point to a zswap_entry (if available), or a swap slot otherwise. In this case, it would be 1:1 mapped to swapped out pages, not the swap slots on devices.
> > > > > > >
> > > > > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile is used as the backing of zswap.
> > > > > > >
> > > > > > > > I know that it might not be ideal to make allocations on the reclaim path (although it would be a small-ish slab allocation so we might be able to get away with it), but otherwise we would have statically allocated swap_desc's for all swap slots on a swap device, even unused ones, which I imagine is too expensive. Also for things like zswap, it doesn't really make sense to preallocate at all.
> > > > > > >
> > > > > > > Yeah, it is not perfect to allocate memory in the reclamation path. We do have such cases, but the fewer the better IMHO.
> > > > > >
> > > > > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the slab cache, idk if that makes sense, or if there is a way to tell slab to proactively refill a cache.
> > > > > >
> > > > > > I am open to suggestions here. I don't think we should/can preallocate the swap_desc's, and we cannot completely eliminate the allocations in the reclaim path. We can only try to minimize them through caching, etc. Right?
> > > > >
> > > > > Yeah, preallocation should not work. But I'm not sure whether caching works well for this case or not either. I suppose you were thinking about something similar to pcp. When the available number of elements is lower than a threshold, refill the cache. It should work well with moderate memory pressure. But I'm not sure how it would behave with severe memory pressure, particularly when anonymous memory dominates the memory usage. Or maybe dynamic allocation works well and we are just over-engineering.
> > > >
> > > > Yeah it would be interesting to look into whether the swap_desc allocation will be a bottleneck. Definitely something to look out for. I share your thoughts about wanting to do something about it but also not wanting to over-engineer it.
> > >
> > > I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning it's not subject to watermarks. And the swapped page is freed right afterwards. As long as the compression delta exceeds the size of swap_desc, the process is a net reduction in allocated memory. For regular swap, the only requirement is that swap_desc < page_size() :-)
> > >
> > > To put this into perspective, the zswap backends allocate backing pages on-demand during reclaim. zsmalloc also kmallocs metadata in that path. We haven't had any issues with this in production, even under fairly severe memory pressure scenarios.
> >
> > Right. The only problem would be for pages that do not compress well in zswap, in which case we might not end up freeing memory. As you said, this is already happening today with zswap tho.
> >
> > > > > > > > > >
> > > > > > > > > > ==================== Benefits ====================
> > > > > > > > > > This work enables using zswap without a backing swapfile and increases the swap capacity when zswap is used with a swapfile. It also creates a separation that allows us to skip code paths that don't make sense in the zswap path (e.g. readahead). We get to drop zswap's rbtree which might result in better performance (less lookups, less lock contention).
> > > > > > > > > >
> > > > > > > > > > The abstraction layer also opens the door for multiple cleanups (e.g. removing swapper address spaces, removing swap count continuation code, etc). Another nice cleanup that this work enables would be separating the overloaded swp_entry_t into two distinct types: one for things that are stored in page tables / caches, and one for actual swap entries. In the future, we can potentially further optimize how we use the bits in the page tables instead of sticking everything into the current type/offset format.
> > > > > > > > > >
> > > > > > > > > > Another potential win here can be swapoff, which can be more practical by directly scanning all swap_desc's instead of going through page tables and shmem page caches.
> > > > > > > > > >
> > > > > > > > > > Overall zswap becomes more accessible and available to a wider range of use cases.
> > > > > > > > >
> > > > > > > > > How will you handle zswap writeback? Zswap may write back to the backing swap device IIUC. Assuming you have both zswap and a swapfile, they are separate devices with this design, right? If so, is the swapfile still the writeback target of zswap? And if it is the writeback target, what if the swapfile is full?
> > > > > > > >
> > > > > > > > When we try to writeback from zswap, we try to allocate a swap slot in the swapfile, and switch the swap_desc to point to that instead. The process would be transparent to the rest of MM (page tables, page cache, etc). If the swapfile is full, then there's really nothing we can do, reclaim fails and we start OOMing. I imagine this is the same behavior as today when swap is full, the difference would be that we have to fill both zswap AND the swapfile to get to the OOMing point, so an overall increased swapping capacity.
> > > > > > >
> > > > > > > When zswap is full, but the swapfile is not yet, will the swap try to write back zswap to the swapfile to make more room for zswap, or just swap out to the swapfile directly?
> > > > > > >
> > > > > > The current behavior is that we swap to the swapfile directly in this case, which is far from ideal as we break LRU ordering by skipping zswap. I believe this should be addressed, but not as part of this effort. The work to make zswap respect the LRU ordering by writing back from zswap to make room can be done orthogonal to this effort. I believe Johannes was looking into this at some point.
> > > Actually, zswap already does LRU writeback when the pool is full. Nhat Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so as of today all backends support this.
> > >
> > > There are still a few quirks in zswap that can cause rejections which bypass the LRU that need fixing. But for the most part LRU writeback to the backing file is the default behavior.
> >
> > Right, I was specifically talking about this case. When zswap is full it rejects incoming pages and they go directly to the swapfile, but we also kick off writeback, so this only happens until we do some LRU writeback. I guess I should have been more clear here. Thanks for clarifying and correcting.
> >
> > > > > Other than breaking LRU ordering, I'm also concerned about the potential deteriorating performance when writing/reading from the swapfile when zswap is full. The zswap->swapfile order should be able to maintain a consistent performance for userspace.
> > > >
> > > > Right. This happens today anyway AFAICT, when zswap is full we just fall back to writing to the swapfile, so this would not be a behavior change. I agree it should be addressed anyway.
> > > >
> > > > > But anyway I don't have the data from real life workloads to back the above points. If you or Johannes could share some real data, that would be very helpful to make the decisions.
> > > >
> > > > I actually don't, since we mostly run zswap without a backing swapfile. Perhaps Johannes might be able to have some data on this (or anyone using zswap with a backing swapfile).
> > >
> > > Due to LRU writeback, the latency increase when zswap spills its coldest entries into backing swap is fairly linear, as you may expect. We have some limited production data on this from the webservers.
> > >
> > > The biggest challenge in this space is properly sizing the zswap pool, such that it's big enough to hold the warm set that the workload is most latency-sensitive to, yet small enough such that the cold pages get spilled to backing swap. Nhat is working on improving this.
> > >
> > > That said, I think this discussion is orthogonal to the proposed topic. zswap spills to backing swap in LRU order as of today. The LRU/pool size tweaking is an optimization to get smarter zswap/swap placement according to access frequency. The proposed swap descriptor is an optimization to get better disk utilization, the ability to run zswap without backing swap, and a dramatic speedup in swapoff time.
> >
> > Fully agree.
> >
> > > > > > > > >
> > > > > > > > > Anyway I'm interested in attending the discussion for this topic.
> > > > > > > >
> > > > > > > > Great! Looking forward to discussing this more!
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ==================== Cost ====================
> > > > > > > > > > The obvious downside of this is added memory overhead, specifically for users that use swapfiles without zswap. Instead of paying one byte (swap_map) for every potential page in the swapfile (+ swap count continuation), we pay the size of the swap_desc for every page that is actually in the swapfile, which I am estimating can be roughly around 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only scales with pages actually swapped out. For zswap users, it should be a win (or at least even) because we get to drop a lot of fields from struct zswap_entry (e.g. rbtree, index, etc).
> > >
> > > Shifting the cost from O(swapspace) to O(swapped) could be a win for many regular swap users too.
> > >
> > > There are the legacy setups that provision 2*RAM worth of swap as an emergency overflow that is then rarely used.
> > >
> > > We have setups that swap to disk more proactively, but we also overprovision those in terms of swap space due to the cliff behavior when swap fills up and the VM runs out of options.
> > >
> > > To make a fair comparison, you really have to take average swap utilization into account. And I doubt that's very high.
> >
> > Yeah I was looking for some data here, but it varies heavily based on the use case, so I opted to only state the overhead of the swap descriptor without directly comparing it to the current overhead.
> >
> > > In terms of worst-case behavior, +0.8% per swapped page doesn't sound like a show-stopper to me. Especially when compared to zswap's current O(swapped) waste of disk space.
> >
> > Yeah for zswap users this should be a win on most/all fronts, even memory overhead, as we will end up trimming struct zswap_entry which is also O(swapped) memory overhead. It should also make zswap available for more use cases. You don't need to provision and configure swap space, you just need to turn zswap on.
> >
> > > > > > > > > >
> > > > > > > > > > Another potential concern is readahead. With this design, we have no way to get a swap_desc given a swap entry (type & offset). We would need to maintain a reverse mapping, adding a little bit more overhead, or search all swapped out pages instead :). A reverse mapping might pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out memory).
> > > > > > > > > >
> > > > > > > > > > ==================== Bottom Line ====================
> > > > > > > > > > It would be nice to discuss the potential here and the tradeoffs. I know that other folks using zswap (or interested in using it) may find this very useful. I am sure I am missing some context on why things are the way they are, and perhaps some obvious holes in my story. Looking forward to discussing this with anyone interested :)
> > > > > > > > > >
> > > > > > > > > > I think Johannes may be interested in attending this discussion, since a lot of ideas here are inspired by discussions I had with him :)
>
> Hi everyone,
>
> I came across this interesting proposal and I would like to participate in the discussion. I think it will be useful/overlap with some projects we are currently planning in Android.

Great to see more interested folks! Looking forward to discussing that!

> Thanks,
> Kalesh
>
> >
> > > Thanks!
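For readers skimming the archive, below is a rough, illustrative-only sketch of the kind of indirection discussed in the thread. The names and fields (swap_desc, swap_backing, swap_desc_writeback, the fixed 64-bit slot, etc.) are assumptions made up for illustration, not code from any actual patch; the real layout, locking, and reference counting would be decided during implementation. It is plain standalone C so it can be compiled and poked at as-is:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct folio;        /* stand-in for the page cached in the swapcache */
struct zswap_entry;  /* stand-in for zswap's compressed object        */

/* Where the swapped-out copy of the page currently lives. */
enum swap_backing {
	SWAP_BACKING_ZSWAP,     /* compressed object owned by zswap  */
	SWAP_BACKING_SWAPFILE,  /* slot in a swapfile or partition   */
};

/*
 * One descriptor per swapped-out page (O(swapped), not O(swap space)).
 * Page tables and the page cache would store a swap id that resolves to
 * this structure instead of a (type, offset) swap entry.
 */
struct swap_desc {
	enum swap_backing backing;
	union {
		uint64_t swap_slot;          /* swapfile type + offset */
		struct zswap_entry *zswap;   /* zswap-owned entry      */
	};
	struct folio *swapcache;   /* swapcache as a simple pointer       */
	unsigned int swap_count;   /* replaces swap_map + continuations   */
};

/*
 * Hypothetical writeback handoff: when zswap evicts a cold page to the
 * backing swapfile, only the descriptor is repointed. Page tables and
 * the page cache keep the same swap id, so the move is transparent.
 */
static void swap_desc_writeback(struct swap_desc *desc, uint64_t new_slot)
{
	desc->backing = SWAP_BACKING_SWAPFILE;
	desc->swap_slot = new_slot;
}

int main(void)
{
	struct swap_desc *desc = calloc(1, sizeof(*desc));

	if (!desc)
		return 1;

	/* Page swapped out to zswap first; no swapfile slot is consumed.
	 * (The zswap pointer stays NULL here; it would point to a real
	 * compressed entry in an actual implementation.) */
	desc->backing = SWAP_BACKING_ZSWAP;
	desc->swap_count = 1;

	/* Later, zswap spills the cold page to a (hypothetical) slot 42. */
	swap_desc_writeback(desc, 42);

	printf("backing=%d slot=%llu count=%u\n", desc->backing,
	       (unsigned long long)desc->swap_slot, desc->swap_count);
	free(desc);
	return 0;
}

The point of the sketch is the indirection itself: a zswap-to-swapfile writeback, or running with no swapfile at all, changes only what the descriptor points to, while everything above it keeps using the same swap id.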