From: Chris Li <chrisl@kernel.org>
To: David Hildenbrand <david@redhat.com>
Cc: Daniel Gomez <da.gomez@samsung.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Barry Song <v-songbaohua@oppo.com>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, Luis Chamberlain <mcgrof@kernel.org>,
Pankaj Raghav <p.raghav@samsung.com>
Subject: Re: Swap Min Order
Date: Wed, 8 Jan 2025 13:09:12 -0800 [thread overview]
Message-ID: <CACePvbUJHcPgus-02fcuYtBoTeWub-ibKue1mcF2EdiK6UeFHg@mail.gmail.com> (raw)
In-Reply-To: <470be5fa-97d6-4045-a855-5332d3a46443@redhat.com>
On Tue, Jan 7, 2025 at 8:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.01.25 13:29, Daniel Gomez wrote:
> > On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> >> On 07.01.25 10:43, Daniel Gomez wrote:
> >>> Hi,
> >>
> >> Hi,
> >>
> >>>
> >>> High-capacity SSDs require writes to be aligned with the drive's
> >>> indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> >>> support swap on these devices, we need to ensure that writes do not
> >>> cross IU boundaries. So, I think this may require increasing the minimum
> >>> allocation size for swap users.
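As a rough sketch of the alignment constraint described above (hypothetical 16 KiB IU, not tied to any real device):

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical IU size for illustration only. */
#define IU_SIZE (16 * 1024)

/* A write avoids a read-modify-write inside the drive only if both its
 * offset and its length are multiples of the indirection unit. */
static bool write_avoids_rmw(size_t offset, size_t len)
{
    return (offset % IU_SIZE == 0) && (len % IU_SIZE == 0);
}
```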
> >>
> >> How would we handle swapout/swapin when we have smaller pages (just imagine
> >> someone does a mmap(4KiB))?
> >
> > Swapout would need to be aligned to the IU. An mmap of 4 KiB would
> > still have to perform a full-IU write, e.g. 16 KiB or 32 KiB, to avoid
> > any potential RMW penalty. So, I think aligning the mmap allocation to
> > the IU would guarantee a write of the required granularity and alignment.
>
> We must be prepared to handle any VMA layout with single-page VMAs,
> single-page holes etc ... :/ IMHO we should try to handle this
> transparently to the application.
>
> > But let's also look at your suggestion below with swapcache.
> >
> > Swapin can still be performed at LBA-format granularity (e.g. 4 KiB)
> > without the same write penalty implications, with performance affected
> > only when I/Os do not conform to these boundaries. So, reading at IU
> > boundaries is preferred for optimal performance, but is not a
> > 'requirement'.
> >
> >>
> >> Could this be something that gets abstracted/handled by the swap
> >> implementation? (i.e., multiple small folios get added to the swapcache but
> >> get written out / read in as a single unit?).
> >
> > Do you mean merging like in the block layer? I'm not entirely sure if
> > this could deterministically guarantee the I/O boundaries the same way
> > min-order large folio allocations do in the page cache. But I guess
> > it's worth exploring as an optimization.
>
> Maybe the swapcache could somehow abstract that? We currently have the
> swap slot allocator, that assigns slots to pages.
>
> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various
> options to explore.
>
> For example, we could size swap slots at 16 KiB, and assign even a
> 4 KiB page a single slot. This would waste swap space with small
> folios; the waste would go away with large folios.
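A quick sketch of the cost of that fixed-slot option, with hypothetical sizes (16 KiB slots, 4 KiB pages):

```c
#include <stddef.h>

/* Hypothetical sizes for illustration. */
#define SLOT_SIZE (16 * 1024)
#define PAGE_SZ   (4 * 1024)

/* Swap space left unused when a folio of `nr_pages` 4 KiB pages
 * occupies exactly one fixed 16 KiB slot. */
static size_t slot_waste(unsigned int nr_pages)
{
    return SLOT_SIZE - (size_t)nr_pages * PAGE_SZ;
}
```

An order-0 folio wastes three quarters of its slot; an order-2 (16 KiB) folio wastes nothing.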
We can group multiple 4K swap entries into one 16K write unit, so no
SSD space is wasted.
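The grouping amounts to a simple index split; a sketch with hypothetical names (not existing kernel API), assuming four 4K entries per 16K write unit:

```c
/* Four consecutive 4K swap entries share one 16K write unit.
 * Names are illustrative, not kernel API. */
#define ENTRIES_PER_UNIT 4

/* Which 16K write unit a 4K swap entry lands in. */
static unsigned long entry_to_unit(unsigned long swp_offset)
{
    return swp_offset / ENTRIES_PER_UNIT;
}

/* Position of the entry inside that unit. */
static unsigned int entry_subslot(unsigned long swp_offset)
{
    return swp_offset % ENTRIES_PER_UNIT;
}
```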
>
> If we stick to 4 KiB swap slots, maybe pageout() could be taught to
> effectively writeback "everything" residing in the relevant swap slots
> that span a BS?
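As a sketch of what that gathering could look like, assuming a hypothetical 16 KiB BS over 4 KiB slots (four slots per BS):

```c
/* Given one dirty 4 KiB swap slot, compute the [start, end) range of
 * slots whose contents would have to be written together so the I/O
 * covers one whole block-size unit. Illustrative only. */
#define SLOTS_PER_BS 4

static void bs_span(unsigned long slot, unsigned long *start,
                    unsigned long *end)
{
    *start = slot - (slot % SLOTS_PER_BS);
    *end = *start + SLOTS_PER_BS;
}
```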
>
> I recall there was a discussion about atomic writes involving multiple
> pages, and how hard it is. Maybe with swapping it is "easier"? Absolutely
> no expert on that, unfortunately. Hoping Chris has some ideas.
Yes, see my other email about the "virtual swapfile" idea. More
detailed write up coming next week.
Chris
>
>
> >
> >>
> >> I recall that we have been talking about a better swap abstraction for years
> >> :)
> >
> > Adding Chris Li to the cc list in case he has more input.
> >
> >>
> >> Might be a good topic for LSF/MM (might or might not be a better place than
> >> the MM alignment session).
> >
> > Both options work for me. LSF/MM is in 12 weeks, so having an
> > earlier session would be great.
>
> Both work for me.
>
> --
> Cheers,
>
> David / dhildenb
>