From: Nhat Pham <nphamcs@gmail.com>
Date: Tue, 5 Mar 2024 17:55:07 +0700
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Chengming Zhou
Cc: Chris Li, Matthew Wilcox, lsf-pc@lists.linux-foundation.org, linux-mm,
 ryan.roberts@arm.com, David Hildenbrand, Barry Song <21cnbao@gmail.com>,
 Chuanhua Han
In-Reply-To: <97e95dc3-bdc0-4dfd-aca9-2d2880e1fdf5@linux.dev>

On Tue, Mar 5, 2024 at 4:52 PM Chengming Zhou wrote:
>
> On 2024/3/5 17:32, Nhat Pham wrote:
> > On Tue, Mar 5, 2024 at 2:44 PM Chris Li wrote:
> >>
> >> On Mon, Mar 4, 2024 at 7:24 PM Chengming Zhou wrote:
> >>>
> >>> On 2024/3/5 06:58, Matthew Wilcox wrote:
> >>>> On Fri, Mar 01, 2024 at 04:53:43PM +0700, Nhat Pham wrote:
> >>>>> IMHO, one thing this new abstraction should support is seamless
> >>>>> transfer/migration of pages from one backend to another (perhaps from
> >>>>> high to low priority backends, i.e. writeback).
> >>>>>
> >>>>> I think this will require some careful redesigns. The closest thing we
> >>>>> have right now is zswap -> backing swapfile. But it is currently
> >>>>> handled in a rather peculiar manner - the underlying swap slot has
> >>>>> already been reserved for the zswap entry. There's a couple of
> >>>>> problems with this:
> >>>>>
> >>>>> a) This is wasteful. We're essentially having the same piece of data
> >>>>> occupying space at two levels of the hierarchy.
> >>>>> b) How do we generalize to a multi-tier hierarchy?
> >>>>> c) This is a bit too backend-specific. It'd be nice if we can make
> >>>>> this as backend-agnostic as possible (if possible).
> >>>>>
> >>>>> Motivation: I'm currently working/thinking about decoupling zswap and
> >>>>> swap, and this is one of the more challenging aspects (as I can't seem
> >>>>> to find a precedent in the swap world for inter-backend page
> >>>>> migration), especially with respect to concurrent loads (and
> >>>>> swapcache interactions).
> >>>>
> >>>> Have you considered (and already rejected?) the opposite approach --
> >>>> coupling zswap and swap more tightly? That is, we always write out
> >>>> the original pages today. Why don't we write out the compressed pages
> >>>> instead? For the same amount of I/O, we'd free up more memory! That
> >>>> sounds like a win to me.
> >
> > Compressed writeback (for lack of a better term) is on my
> > to-think-about list, precisely for the benefits you point out (swap
> > space savings + I/O efficiency).
> >
> > By decoupling, I was primarily aiming to reduce initial swap space
> > wastage. Specifically, as of now, we prematurely reserve swap space
> > even if the page is successfully stored in zswap. I'd like to avoid
> > this (if possible). Compressed writeback could be an orthogonal
> > improvement to this.
>
> Looks sensible. Now the zswap middle layer is transparent to frontend
> users, which just allocate a swap entry and swap out, and don't care
> whether it's swapped out to zswap or to a swap file.
>
> By decoupling, the frontend users need to know that they want to
> allocate a zswap entry instead of a swap entry, right? Which is no
> longer transparent to users.

Hmm, for now I was just thinking that it should always try zswap first,
and only fall back to swap if it fails to store to zswap, to maintain
the overall LRU ordering (best effort).

The minimal viable implementation I'm thinking of right now is
basically the "ghost swapfile" approach - i.e. represent zswap as a
swapfile. Writeback becomes quite hairy though, because there might be
two "swap" entries for the same object (the zswap swap entry and the
newly reserved swap entry) lying around near the end of the writeback
step, so we have to be careful with synchronization (read: juggling the
swap cache) to make sure concurrent swap-ins get something that makes
sense.
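To make the synchronization worry concrete, here is a very rough,
single-threaded userspace sketch. Every type and function name below is
invented for illustration - none of this is existing kernel API. The
point is just the ordering: the folio stays findable in the swap cache
under the old (ghost) entry for the whole transfer, and is only
re-keyed to the real slot once the write-out has finished.

#include <stdbool.h>
#include <stdio.h>

/* Toy stand-ins for kernel types; every name here is made up. */
typedef struct { int type; long offset; } swp_entry_t;

struct folio {
	swp_entry_t entry;   /* swap cache key */
	bool uptodate;
};

/* One-slot toy "swap cache", keyed by swap entry. */
static struct folio *swap_cache_slot;

static struct folio *swap_cache_lookup(swp_entry_t e)
{
	if (swap_cache_slot && swap_cache_slot->entry.type == e.type &&
	    swap_cache_slot->entry.offset == e.offset)
		return swap_cache_slot;
	return NULL;
}

/*
 * Write back from the ghost (zswap) device to a real slot. The folio
 * is inserted into the swap cache under the *ghost* entry first, so a
 * concurrent swap-in of the ghost entry hits the cache instead of
 * racing with the half-finished transfer. Only after the write to the
 * real slot completes is the folio re-keyed and the ghost slot freed.
 */
static void writeback(struct folio *f, swp_entry_t ghost, swp_entry_t real)
{
	f->entry = ghost;
	swap_cache_slot = f;  /* swap-ins of `ghost` now hit the cache */
	/* ... decompress the zswap data into f, write f to `real` ... */
	f->entry = real;      /* re-key: `real` now owns the data */
	/* ... free the ghost slot, fix up swap counts ... */
}

int main(void)
{
	swp_entry_t ghost = { .type = 0, .offset = 42 };  /* zswap "ghost" device */
	swp_entry_t real  = { .type = 1, .offset = 7 };   /* backing swapfile */
	struct folio f = { .uptodate = true };

	writeback(&f, ghost, real);
	printf("ghost lookup: %p\n", (void *)swap_cache_lookup(ghost)); /* NULL */
	printf("real lookup:  %p\n", (void *)swap_cache_lookup(real));  /* &f */
	return 0;
}

The hairy part in real life is of course doing that re-keying
atomically against concurrent swap-ins and swap count updates, which
this toy glosses over entirely.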
> On the other hand, MGLRU could use its more fine-grained information
> to decide whether to allocate a zswap entry or a swap entry, instead
> of always going through the zswap -> swap layers. Just some random
> thoughts :)
>
> Thanks.
>
> >
> >>
> >> I have considered that as well; that is further than writing from one
> >> swap device to another. The current swap device can't accept writes
> >> at non-page-aligned offsets. If we allow byte-aligned write-out
> >> sizes, the whole swap entry offset machinery needs some heavy
> >> changes.
> >>
> >> If we write out 4K pages and the compression ratio is lower than 50%,
> >> a combination of two compressed pages can't fit into one page, which
> >> means some of the pages read back will need to overflow into another
> >> page. We kind of need a small file system to keep track of how the
> >> compressed data is stored, because it is no longer page-aligned in
> >> size.
> >>
> >> We can write out zsmalloc blocks of data as they are, however there
> >> is no guarantee the data in a zsmalloc block has the same LRU order.
> >
> > zsmalloc used to do this - it writes out the entire zspage (which is
> > a multiple of pages). I don't think it's a good idea either. Objects
> > in the same zspage are of the same size class, and this does NOT
> > necessarily imply similar access recency/frequency IMHO.
> >
> > We need to figure out how to swap out non-page-sized entities before
> > compressed writeback becomes a thing.
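To show how quickly Chris's "small file system" appears, here is a toy
userspace sketch of the minimum bookkeeping involved (all names are
made up; nothing here is kernel code): an append-only extent table that
packs variable-sized compressed objects into pages and lets them
straddle page boundaries.

#include <stdio.h>

#define PAGE_SIZE 4096
#define MAX_OBJS  128

/* Byte range a compressed object occupies on the swap device. */
struct extent { long pgno; int offset; int len; };

static struct extent table[MAX_OBJS];
static long cur_pgno;  /* page currently being filled */
static int  cur_off;   /* next free byte within that page */

/* Append one compressed object; it may straddle a page boundary. */
static void alloc_extent(int id, int clen)
{
	table[id] = (struct extent){ .pgno = cur_pgno, .offset = cur_off,
				     .len = clen };
	cur_off += clen;
	while (cur_off >= PAGE_SIZE) {  /* spilled into the next page(s) */
		cur_pgno++;
		cur_off -= PAGE_SIZE;
	}
}

int main(void)
{
	/*
	 * Three objects that each compress to more than half a page
	 * (2500 of 4096 bytes) - exactly the "ratio lower than 50%"
	 * case: two of them can never share a single page, so object 1
	 * straddles pages 0 and 1 and a read-back must touch both.
	 */
	for (int i = 0; i < 3; i++)
		alloc_extent(i, 2500);
	for (int i = 0; i < 3; i++)
		printf("obj %d -> page %ld, offset %d, len %d\n",
		       i, table[i].pgno, table[i].offset, table[i].len);
	return 0;
}

Even this trivial packer needs per-object (page, offset, len) metadata,
multi-page reads for straddling objects, and some story for freeing and
compacting holes - which is exactly where it starts to smell like a
small file system.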
> >>
> >> It makes more sense when writing higher-order (> 0) swap pages, e.g.
> >> writing 64K pages in one buffer; then we can write out the
> >> compressed data page-boundary-aligned and in page-sized units,
> >> accepting the waste on the last compressed page, which might not
> >> fill up a whole page.
> >>
> >>>
> >>> Right, I also thought about this direction for some time.
> >>> Apart from less I/O, there are more advantages we can see:
> >>>
> >>> 1. We don't need to allocate a page when writing out compressed data.
> >>>    The current method actually has its own problems[1]: allocating
> >>>    a new page, putting it on the LRU list, and waiting for
> >>>    writeback and reclaim. If we write out compressed data directly,
> >>>    no page allocation is needed and these problems can be avoided.
> >>
> >> Does it go through the swap cache at all? If not, there will be some
> >> interesting synchronization issues when other tasks race to swap in
> >> the page and modify it.
> >
> > I agree. If we are to go with this approach, we will need to modify
> > the swap cache to synchronize concurrent swapins.

... with concurrent compressed writebacks.

One benefit of allocating the page is that we can share it with the
swap cache. Removing that from the equation can make everything
hairier, but I haven't thought too deeply about this.

> >
> > As usual, devils are in the details :)
> >
> >>
> >>>
> >>> 2. We don't need to decompress when writing out compressed data.
> >>
> >> Yes.
> >>
> >>>
> >>> [1] https://lore.kernel.org/all/20240209115950.3885183-1-chengming.zhou@linux.dev/
> >>>
> >>>>
> >>>> I'm sure it'd be a big redesign, but that seems to be what we're
> >>>> talking about anyway.
> >>>>
> >>>
> >>> Yes, we need to make modifications in a few places:
> >>>
> >>> 1. zsmalloc: compressed objects can be migrated at any time, so we
> >>>    need to support pinning.
> >>
> >> Or use a bounce buffer to read them out.
> >>
> >>>
> >>> 2. swapout: need to support non-folio write-out.
> >>
> >> Yes. Non-page-aligned write-out will change the swap backend design
> >> dramatically.
> >>
> >>>
> >>> 3. zswap: zswap needs to handle synchronization between compressed
> >>>    write-out and swapin, since they share the same swap entry.
> >>
> >> Exactly. Same for ZRAM as well.
> >>
> >
> > I agree with this list. 1 sounds the least invasive, but the other 2
> > will be quite massive :)
> >
> >> Chris
> >
> > Nhat