From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Wed, 6 Mar 2024 19:05:06 +1300
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"
To: Chris Li
Cc: lsf-pc@lists.linux-foundation.org, linux-mm, ryan.roberts@arm.com, David Hildenbrand, Chuanhua Han
Content-Type: text/plain; charset="UTF-8"

On Wed, Mar 6, 2024 at 4:00 PM Chris Li wrote:
>
> On Tue, Mar 5, 2024 at 5:15 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > > Another limitation I would like to address is that swap_writepage()
> > > can only write out IO in one contiguous chunk; it is not able to
> > > perform non-contiguous IO. When the swapfile is close to full, the
> > > unused entries are likely to be spread across different locations.
> > > It would be nice to be able to read and write a large folio using
> > > discontiguous disk IO locations.
> >
> > I don't think it will be too difficult for swap_writepage() to write
> > out a large folio which has discontiguous swap offsets.
> > Taking zRAM as an example, as long as the bio can be organized
> > correctly, zram should be able to write a large folio out one
> > subpage at a time:
>
> Yes.
>
> > static void zram_bio_write(struct zram *zram, struct bio *bio)
> > {
> > 	unsigned long start_time = bio_start_io_acct(bio);
> > 	struct bvec_iter iter = bio->bi_iter;
> >
> > 	do {
> > 		u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
> > 		u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
> > 				SECTOR_SHIFT;
> > 		struct bio_vec bv = bio_iter_iovec(bio, iter);
> >
> > 		bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);
> >
> > 		if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
> > 			atomic64_inc(&zram->stats.failed_writes);
> > 			bio->bi_status = BLK_STS_IOERR;
> > 			break;
> > 		}
> >
> > 		zram_slot_lock(zram, index);
> > 		zram_accessed(zram, index);
> > 		zram_slot_unlock(zram, index);
> >
> > 		bio_advance_iter_single(bio, &iter, bv.bv_len);
> > 	} while (iter.bi_size);
> >
> > 	bio_end_io_acct(bio, start_time);
> > 	bio_endio(bio);
> > }
> >
> > Right now, add_to_swap() lacks a way to record a discontiguous
> > offset for each subpage; all we have is folio->swap.
> >
> > I wonder if we can somehow make this page granularity: each subpage
> > would carry its own offset, somewhat like a page->swap. Then in
> > swap_writepage() we could build a bio with multiple discontiguous
> > I/O indexes, let add_to_swap() obtain nr_pages different swap
> > offsets, and fill one into each subpage.
>
> The key is where to store the subpage offset. It can't be stored in
> the tail page's page->swap, because some tail pages' page structs are
> just a mapping of the head page's page struct. I am afraid this
> mapping relationship has to be stored in the swap back end. That is
> the idea: have the swap backend keep track of an array of the
> subpages' swap locations. This array is looked up by the head swap
> offset.
I assume "some tail pages' page structs are just a mapping of the head
page's page struct" is only true of hugeTLB pages larger than the
PMD-mapped size (for example 2MB) at the moment? Will mTHP smaller
than the PMD-mapped size, the more common case, still have all of its
tail page structs?

Does "having the swap backend keep track of an array of the subpages'
swap locations" mean we will save this metadata in the swapfile? Will
that cost us extra I/O, especially when a large folio's mapping area
is partially unmapped, for example by MADV_DONTNEED, even after the
large folio has been swapped out, so that we have to update the
metadata? Right now we only need to change PTE entries and swap_map[]
in that case. Do we have some way to keep this data in memory instead?

> > But will this be a step back for folio?
>
> I think this should be separate from the folio; it lives in the swap
> backend. From the folio's point of view, it is just one swap page
> write. The swap back end knows how to write it out to the subpage
> locations.
>
> > > Some possible ideas for the fragmentation issue.
> > >
> > > a) A buddy allocator for swap entries, similar to the buddy
> > > allocator for memory. We can use a buddy allocator system for swap
> > > entries to keep low-order swap entries from fragmenting high-order
> > > swap entries too badly. It should greatly reduce the fragmentation
> > > caused by allocating and freeing swap entries of different sizes.
> > > However, the buddy allocator has its own limits as well. Unlike
> > > system memory, which we can move and compact, there is no rmap for
> > > a swap entry, so it is much harder to move a swap entry to another
> > > disk location. A buddy allocator for swap will therefore help, but
> > > it will not solve all the fragmentation issues.
> >
> > I agree buddy will help. Meanwhile, we might need something similar
> > to the MOVABLE/UNMOVABLE migratetypes.
> > For example, try to gather swap allocations for small folios
> > together and don't let them spread throughout the whole swapfile.
> > We might be able to dynamically classify swap clusters as being for
> > small folios or for large folios, and keep small folios from
> > spreading across all the clusters.
>
> This really depends on the swap entry allocation and free cycle. In
> the extreme case, all swap entries have been allocated and the
> swapfile is full; then some of the 4K entries are freed at
> discontiguous locations. Neither a buddy allocator nor a cluster
> allocator is going to save you from ending up with fragmented swap
> entries. That is why I think we still need b).

I agree. I believe classifying clusters has the potential to alleviate
fragmentation to some degree, although it cannot resolve it: it can
only limit how far small swap allocations spread.

> > > b) Large swap entries. Take a file as an example: a file on the
> > > file system can be written to discontiguous disk locations, and
> > > the file system is responsible for tracking how file offsets map
> > > to disk locations. A large swap entry can have a similar
> > > indirection array mapping out the disk locations of the different
> > > subpages within a folio. This allows a large folio to be written
> > > out to discontiguous swap entries in the swap file. The array will
> > > need to be stored somewhere as part of the overhead. When
> > > allocating swap entries for the folio, we can allocate a batch of
> > > smaller 4K swap entries into an array, and use this array to
> > > read/write the large folio. There will be a lot of plumbing work
> > > to get it to work.
> >
> > We already have the page struct; I wonder if we can record the
> > offset there, if that is not a step back for folio. On the other
> > hand, while
>
> No for the tail pages, because some of the tail pages' "struct page"
> are just a remapping of the head page's "struct page".
> > swapping in, we can also allow large folios to be swapped in from
> > non-contiguous places, and those offsets are in fact also present
> > in the PTE entries.
>
> These discontiguous subpage locations need to be stored outside of
> the folio. Keep in mind that you can have more than one PTE, in
> different processes, and those PTEs in different processes might not
> agree with each other. BTW, shmem stores the swap entry in the page
> cache, not in PTEs.

I don't quite understand what you mean by "those PTEs in different
processes might not agree with each other". Can we have a concrete
example? I assume this is also true for small folios, but it isn't a
problem there, since the process doing the swap-in only cares about
its own PTE entries?

> > I feel we have "page" to record the offset before pageout() is
> > done, and we have PTE entries to record the offset after pageout()
> > is done.
> >
> > But (a) is still needed, as we really hope large folios can be
> > placed at contiguous offsets. With that, we gain other benefits,
> > such as saving the whole compressed large folio as one object
> > rather than nr_pages objects in zsmalloc, and decompressing them
> > together while swapping in (a patchset is coming in a couple of
> > days for this). When a large folio is placed in nr_pages different
> > places, we can hardly do this in zsmalloc. But at least we can
> > still swap out large folios without splitting them, and swap in
> > large folios even though we read them back from nr_pages different
> > objects.
>
> Exactly.
>
> Chris

Thanks
Barry