From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 28 Feb 2023 00:09:07 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Kalesh Singh
Cc: Johannes Weiner, Yang Shi, lsf-pc@lists.linux-foundation.org, Linux-MM, Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings, Dan Streetman, Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton, Nhat Pham, Akilesh Kailash

On Mon, Feb 27, 2023 at 8:29 PM Kalesh Singh wrote:
>
> On Wed, Feb 22, 2023 at 2:47 PM Yosry Ahmed wrote:
> >
> > On Wed, Feb 22, 2023 at 8:57 AM Johannes Weiner wrote:
> > >
> > > Hello,
> > >
> > > thanks for proposing this, Yosry. I'm very interested in this work. Unfortunately, I won't be able to attend LSFMMBPF myself this time around due to a scheduling conflict :(
> >
> > Ugh, would have been great to have you, I guess there might be a remote option, or we will end up discussing on the mailing list eventually anyway.
> >
> > >
> > > On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> > > > On Tue, Feb 21, 2023 at 3:34 PM Yang Shi wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed wrote:
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi wrote:
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed wrote:
> > > > > > > >
> > > > > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi wrote:
> > > > > > > > >
> > > > > > > > > Hi Yosry,
> > > > > > > > >
> > > > > > > > > Thanks for proposing this topic. I was thinking about this before but I didn't make too much progress due to some other distractions, and I got a couple of follow up questions about your design. Please see the inline comments below.
> > > > > > > >
> > > > > > > > Great to see interested folks, thanks!
> > > > > > > >
> > > > > > > > >
> > > > > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed wrote:
> > > > > > > > > >
> > > > > > > > > > Hello everyone,
> > > > > > > > > >
> > > > > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in May 2023 about swap & zswap (hope I am not too late).
> > > > > > > > > >
> > > > > > > > > > ==================== Intro ====================
> > > > > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary way. To use zswap, you need a swapfile configured (even if the space will not be used) and zswap is restricted by its size. When pages reside in zswap, the corresponding swap entry in the swapfile cannot be used, and is essentially wasted. We also go through unnecessary code paths when using zswap, such as finding and allocating a swap entry on the swapout path, or readahead in the swapin path. I am proposing a swapping abstraction layer that would allow us to remove zswap's dependency on swapfiles. This can be done by introducing a data structure between the actual swapping implementation (swapfiles, zswap) and the rest of the MM code.
> > > > > > > > > >
> > > > > > > > > > ==================== Objective ====================
> > > > > > > > > > Enabling the use of zswap without a backing swapfile, which makes zswap useful for a wider variety of use cases. Also, when zswap is used with a swapfile, the pages in zswap do not use up space in the swapfile, so the overall swapping capacity increases.
> > > > > > > > > >
> > > > > > > > > > ==================== Idea ====================
> > > > > > > > > > Introduce a data structure, which I currently call a swap_desc, as an abstraction layer between swapping implementation and the rest of MM code. Page tables & page caches would store a swap id (encoded as a swp_entry_t) instead of directly storing the swap entry associated with the swapfile. This swap id maps to a struct swap_desc, which acts as our abstraction layer. All MM code not concerned with swapping details would operate in terms of swap descs. The swap_desc can point to either a normal swap entry (associated with a swapfile) or a zswap entry. It can also include all non-backend specific operations, such as the swapcache (which would be a simple pointer in swap_desc), swap counting, etc. It creates a clear, nice abstraction layer between MM code and the actual swapping implementation.
> > > > > > > > >
> > > > > > > > > How will the swap_desc be allocated? Dynamically or preallocated? Is it 1:1 mapped to the swap slots on swap devices (whatever it is backed by, for example, zswap, swap partition, swapfile, etc)?
> > > > > > > >
> > > > > > > > I imagine swap_desc's would be dynamically allocated when we need to swap something out. When allocated, a swap_desc would either point to a zswap_entry (if available), or a swap slot otherwise. In this case, it would be 1:1 mapped to swapped out pages, not the swap slots on devices.
> > > > > > >
> > > > > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile is used as the backing of zswap.
> > > > > > >
> > > > > > > > I know that it might not be ideal to make allocations on the reclaim path (although it would be a small-ish slab allocation so we might be able to get away with it), but otherwise we would have statically allocated swap_desc's for all swap slots on a swap device, even unused ones, which I imagine is too expensive. Also for things like zswap, it doesn't really make sense to preallocate at all.
> > > > > > >
> > > > > > > Yeah, it is not perfect to allocate memory in the reclamation path. We do have such cases, but the fewer the better IMHO.
> > > > > >
> > > > > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the slab cache, idk if that makes sense, or if there is a way to tell slab to proactively refill a cache.
> > > > > >
> > > > > > I am open to suggestions here. I don't think we should/can preallocate the swap_desc's, and we cannot completely eliminate the allocations in the reclaim path. We can only try to minimize them through caching, etc. Right?
> > > > >
> > > > > Yeah, preallocation should not work. But I'm not sure whether caching works well for this case or not either. I suppose you were thinking about something similar to pcp. When the available number of elements is lower than a threshold, refill the cache. It should work well with moderate memory pressure. But I'm not sure how it would behave with severe memory pressure, particularly when anonymous memory dominates the memory usage. Or maybe dynamic allocation works well and we are just over-engineering.
> > > >
> > > > Yeah it would be interesting to look into whether the swap_desc allocation will be a bottleneck. Definitely something to look out for. I share your thoughts about wanting to do something about it but also not wanting to over-engineer it.
> > >
> > > I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning it's not subject to watermarks. And the swapped page is freed right afterwards. As long as the compression delta exceeds the size of swap_desc, the process is a net reduction in allocated memory. For regular swap, the only requirement is that swap_desc < page_size() :-)
> > >
> > > To put this into perspective, the zswap backends allocate backing pages on-demand during reclaim. zsmalloc also kmallocs metadata in that path. We haven't had any issues with this in production, even under fairly severe memory pressure scenarios.
> >
> > Right. The only problem would be for pages that do not compress well in zswap, in which case we might not end up freeing memory. As you said, this is already happening today with zswap tho.
> >
> > > > > > > > > >
> > > > > > > > > > ==================== Benefits ====================
> > > > > > > > > > This work enables using zswap without a backing swapfile and increases the swap capacity when zswap is used with a swapfile. It also creates a separation that allows us to skip code paths that don't make sense in the zswap path (e.g. readahead). We get to drop zswap's rbtree which might result in better performance (less lookups, less lock contention).
> > > > > > > > > >
> > > > > > > > > > The abstraction layer also opens the door for multiple cleanups (e.g. removing swapper address spaces, removing swap count continuation code, etc). Another nice cleanup that this work enables would be separating the overloaded swp_entry_t into two distinct types: one for things that are stored in page tables / caches, and one for actual swap entries. In the future, we can potentially further optimize how we use the bits in the page tables instead of sticking everything into the current type/offset format.
> > > > > > > > > >
> > > > > > > > > > Another potential win here can be swapoff, which can be more practical by directly scanning all swap_desc's instead of going through page tables and shmem page caches.
> > > > > > > > > >
> > > > > > > > > > Overall zswap becomes more accessible and available to a wider range of use cases.
> > > > > > > > >
> > > > > > > > > How will you handle zswap writeback? Zswap may write back to the backing swap device IIUC. Assuming you have both zswap and a swapfile, they are separate devices with this design, right? If so, is the swapfile still the writeback target of zswap? And if it is the writeback target, what if the swapfile is full?
> > > > > > > >
> > > > > > > > When we try to writeback from zswap, we try to allocate a swap slot in the swapfile, and switch the swap_desc to point to that instead. The process would be transparent to the rest of MM (page tables, page cache, etc). If the swapfile is full, then there's really nothing we can do, reclaim fails and we start OOMing. I imagine this is the same behavior as today when swap is full, the difference would be that we have to fill both zswap AND the swapfile to get to the OOMing point, so an overall increased swapping capacity.
> > > > > > >
> > > > > > > When zswap is full, but the swapfile is not yet, will the swap try to write back zswap to the swapfile to make more room for zswap, or just swap out to the swapfile directly?
> > > > > > >
> > > > > > The current behavior is that we swap to the swapfile directly in this case, which is far from ideal as we break LRU ordering by skipping zswap. I believe this should be addressed, but not as part of this effort. The work to make zswap respect the LRU ordering by writing back from zswap to make room can be done orthogonal to this effort. I believe Johannes was looking into this at some point.
> > > Actually, zswap already does LRU writeback when the pool is full. Nhat Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so as of today all backends support this.
> > >
> > > There are still a few quirks in zswap that can cause rejections which bypass the LRU that need fixing. But for the most part LRU writeback to the backing file is the default behavior.
> >
> > Right, I was specifically talking about this case. When zswap is full it rejects incoming pages and they go directly to the swapfile, but we also kick off writeback, so this only happens until we do some LRU writeback. I guess I should have been more clear here. Thanks for clarifying and correcting.
> >
> > > > > Other than breaking LRU ordering, I'm also concerned about the potential deteriorating performance when writing/reading from the swapfile when zswap is full. The zswap->swapfile order should be able to maintain a consistent performance for userspace.
> > > >
> > > > Right. This happens today anyway AFAICT, when zswap is full we just fall back to writing to the swapfile, so this would not be a behavior change. I agree it should be addressed anyway.
> > > >
> > > > > But anyway I don't have the data from real life workloads to back the above points. If you or Johannes could share some real data, that would be very helpful to make the decisions.
> > > >
> > > > I actually don't, since we mostly run zswap without a backing swapfile. Perhaps Johannes might be able to have some data on this (or anyone using zswap with a backing swapfile).
> > >
> > > Due to LRU writeback, the latency increase when zswap spills its coldest entries into backing swap is fairly linear, as you may expect. We have some limited production data on this from the webservers.
> > >
> > > The biggest challenge in this space is properly sizing the zswap pool, such that it's big enough to hold the warm set that the workload is most latency-sensitive to, yet small enough such that the cold pages get spilled to backing swap. Nhat is working on improving this.
> > >
> > > That said, I think this discussion is orthogonal to the proposed topic. zswap spills to backing swap in LRU order as of today. The LRU/pool size tweaking is an optimization to get smarter zswap/swap placement according to access frequency. The proposed swap descriptor is an optimization to get better disk utilization, the ability to run zswap without backing swap, and a dramatic speedup in swapoff time.
> >
> > Fully agree.
> >
> > > > > > > > >
> > > > > > > > > Anyway I'm interested in attending the discussion for this topic.
> > > > > > > >
> > > > > > > > Great! Looking forward to discussing this more!
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ==================== Cost ====================
> > > > > > > > > > The obvious downside of this is added memory overhead, specifically for users that use swapfiles without zswap. Instead of paying one byte (swap_map) for every potential page in the swapfile (+ swap count continuation), we pay the size of the swap_desc for every page that is actually in the swapfile, which I am estimating can be roughly around 24 bytes or so, so maybe 0.6% of swapped out memory. The overhead only scales with pages actually swapped out. For zswap users, it should be a win (or at least even) because we get to drop a lot of fields from struct zswap_entry (e.g. rbtree, index, etc).
> > >
> > > Shifting the cost from O(swapspace) to O(swapped) could be a win for many regular swap users too.
> > >
> > > There are the legacy setups that provision 2*RAM worth of swap as an emergency overflow that is then rarely used.
> > >
> > > We have setups that swap to disk more proactively, but we also overprovision those in terms of swap space due to the cliff behavior when swap fills up and the VM runs out of options.
> > >
> > > To make a fair comparison, you really have to take average swap utilization into account. And I doubt that's very high.
> >
> > Yeah I was looking for some data here, but it varies heavily based on the use case, so I opted to only state the overhead of the swap descriptor without directly comparing it to the current overhead.
> >
> > > In terms of worst-case behavior, +0.8% per swapped page doesn't sound like a show-stopper to me. Especially when compared to zswap's current O(swapped) waste of disk space.
> >
> > Yeah for zswap users this should be a win on most/all fronts, even memory overhead, as we will end up trimming struct zswap_entry which is also O(swapped) memory overhead. It should also make zswap available for more use cases. You don't need to provision and configure swap space, you just need to turn zswap on.
> >
> > > > > > > > > >
> > > > > > > > > > Another potential concern is readahead. With this design, we have no way to get a swap_desc given a swap entry (type & offset). We would need to maintain a reverse mapping, adding a little bit more overhead, or search all swapped out pages instead :). A reverse mapping might pump the per-swapped page overhead to ~32 bytes (~0.8% of swapped out memory).
> > > > > > > > > >
> > > > > > > > > > ==================== Bottom Line ====================
> > > > > > > > > > It would be nice to discuss the potential here and the tradeoffs. I know that other folks using zswap (or interested in using it) may find this very useful. I am sure I am missing some context on why things are the way they are, and perhaps some obvious holes in my story. Looking forward to discussing this with anyone interested :)
> > > > > > > > > >
> > > > > > > > > > I think Johannes may be interested in attending this discussion, since a lot of ideas here are inspired by discussions I had with him :)
>
> Hi everyone,
>
> I came across this interesting proposal and I would like to participate in the discussion. I think it will be useful/overlap with some projects we are currently planning in Android.

Great to see more interested folks! Looking forward to discussing that!

> Thanks,
> Kalesh
>
> >
> > > Thanks!
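For readers skimming the archive, below is a rough, illustrative-only sketch of the kind of indirection discussed in the thread. The names and fields (swap_desc, swap_backing, swap_desc_writeback, the fixed 64-bit slot, etc.) are assumptions made up for illustration, not code from any actual patch; the real layout, locking, and reference counting would be decided during implementation. It is plain standalone C so it can be compiled and poked at as-is:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct folio;        /* stand-in for the page cached in the swapcache */
struct zswap_entry;  /* stand-in for zswap's compressed object        */

/* Where the swapped-out copy of the page currently lives. */
enum swap_backing {
	SWAP_BACKING_ZSWAP,     /* compressed object owned by zswap  */
	SWAP_BACKING_SWAPFILE,  /* slot in a swapfile or partition   */
};

/*
 * One descriptor per swapped-out page (O(swapped), not O(swap space)).
 * Page tables and the page cache would store a swap id that resolves to
 * this structure instead of a (type, offset) swap entry.
 */
struct swap_desc {
	enum swap_backing backing;
	union {
		uint64_t swap_slot;          /* swapfile type + offset */
		struct zswap_entry *zswap;   /* zswap-owned entry      */
	};
	struct folio *swapcache;   /* swapcache as a simple pointer       */
	unsigned int swap_count;   /* replaces swap_map + continuations   */
};

/*
 * Hypothetical writeback handoff: when zswap evicts a cold page to the
 * backing swapfile, only the descriptor is repointed. Page tables and
 * the page cache keep the same swap id, so the move is transparent.
 */
static void swap_desc_writeback(struct swap_desc *desc, uint64_t new_slot)
{
	desc->backing = SWAP_BACKING_SWAPFILE;
	desc->swap_slot = new_slot;
}

int main(void)
{
	struct swap_desc *desc = calloc(1, sizeof(*desc));

	if (!desc)
		return 1;

	/* Page swapped out to zswap first; no swapfile slot is consumed.
	 * (The zswap pointer stays NULL here; it would point to a real
	 * compressed entry in an actual implementation.) */
	desc->backing = SWAP_BACKING_ZSWAP;
	desc->swap_count = 1;

	/* Later, zswap spills the cold page to a (hypothetical) slot 42. */
	swap_desc_writeback(desc, 42);

	printf("backing=%d slot=%llu count=%u\n", desc->backing,
	       (unsigned long long)desc->swap_slot, desc->swap_count);
	free(desc);
	return 0;
}

The point of the sketch is the indirection itself: a zswap-to-swapfile writeback, or running with no swapfile at all, changes only what the descriptor points to, while everything above it keeps using the same swap id.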