linux-mm.kvack.org archive mirror
From: "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com>
To: Nhat Pham <nphamcs@gmail.com>, Usama Arif <usamaarif642@gmail.com>
Cc: David Hildenbrand <david@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	"yosryahmed@google.com" <yosryahmed@google.com>,
	"chengming.zhou@linux.dev" <chengming.zhou@linux.dev>,
	"ryan.roberts@arm.com" <ryan.roberts@arm.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	"21cnbao@gmail.com" <21cnbao@gmail.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"hughd@google.com" <hughd@google.com>,
	"willy@infradead.org" <willy@infradead.org>,
	"bfoster@redhat.com" <bfoster@redhat.com>,
	"dchinner@redhat.com" <dchinner@redhat.com>,
	"chrisl@kernel.org" <chrisl@kernel.org>,
	"Feghali, Wajdi K" <wajdi.k.feghali@intel.com>,
	"Gopal, Vinodh" <vinodh.gopal@intel.com>,
	"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com>
Subject: RE: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
Date: Fri, 18 Oct 2024 21:59:07 +0000	[thread overview]
Message-ID: <SJ0PR11MB5678A864244B09FDE4D914EEC9402@SJ0PR11MB5678.namprd11.prod.outlook.com> (raw)
In-Reply-To: <CAKEwX=NVUkjDgxvsr1g3o_2dWGjEF91_+q==MQE8VQc8o5vwtQ@mail.gmail.com>

Hi Usama, Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Friday, October 18, 2024 10:21 AM
> To: Usama Arif <usamaarif642@gmail.com>
> Cc: David Hildenbrand <david@redhat.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> hughd@google.com; willy@infradead.org; bfoster@redhat.com;
> dchinner@redhat.com; chrisl@kernel.org; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls
> swapin_readahead() zswap load batching interface.
> 
> On Fri, Oct 18, 2024 at 4:04 AM Usama Arif <usamaarif642@gmail.com>
> wrote:
> >
> >
> > On 18/10/2024 08:26, David Hildenbrand wrote:
> > > On 18.10.24 08:48, Kanchana P Sridhar wrote:
> > >> This patch invokes the swapin_readahead() based batching interface to
> > >> prefetch a batch of 4K folios for zswap load with batch decompressions
> > >> in parallel using IAA hardware. swapin_readahead() prefetches folios based
> > >> on vm.page-cluster and the usefulness of prior prefetches to the
> > >> workload. As folios are created in the swapcache and the readahead code
> > >> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> > >> respective folio_batches get populated with the folios to be read.
> > >>
> > >> Finally, the swapin_readahead() procedures will call the newly added
> > >> process_ra_batch_of_same_type() which:
> > >>
> > >>   1) Reads all the non_zswap_batch folios sequentially by calling
> > >>      swap_read_folio().
> > >>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch which calls
> > >>      zswap_finish_load_batch() that finally decompresses each
> > >>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e. upto 8 pages in a prefetch
> > >>      batch of say, 32 folios) in parallel with IAA.
> > >>
> > >> Within do_swap_page(), we try to benefit from batch decompressions in both
> > >> these scenarios:
> > >>
> > >>   1) single-mapped, SWP_SYNCHRONOUS_IO:
> > >>        We call swapin_readahead() with "single_mapped_path = true". This is
> > >>        done only in the !zswap_never_enabled() case.
> > >>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
> > >>        We call swapin_readahead() with "single_mapped_path = false".
> > >>
> > >> This will place folios in the swapcache: a design choice that handles cases
> > >> where a folio that is "single-mapped" in process 1 could be prefetched in
> > >> process 2; and handles highly contended server scenarios with stability.
> > >> There are checks added at the end of do_swap_page(), after the folio has
> > >> been successfully loaded, to detect if the single-mapped swapcache folio is
> > >> still single-mapped, and if so, folio_free_swap() is called on the folio.
> > >>
> > >> Within the swapin_readahead() functions, if single_mapped_path is true, and
> > >> either the platform does not have IAA, or, if the platform has IAA and the
> > >> user selects a software compressor for zswap (details of sysfs knob
> > >> follow), readahead/batching are skipped and the folio is loaded using
> > >> zswap_load().
> > >>
> > >> A new swap parameter "singlemapped_ra_enabled" (false by default) is added
> > >> for platforms that have IAA, zswap_load_batching_enabled() is true, and we
> > >> want to give the user the option to run experiments with IAA and with
> > >> software compressors for zswap (swap device is SWP_SYNCHRONOUS_IO):
> > >>
> > >> For IAA:
> > >>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> For software compressors:
> > >>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> > >>
> > >> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will
> skip
> > >> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO"
> do_swap_page()
> > >> path.
> > >>
> > >> Thanks Ying Huang for the really helpful brainstorming discussions on the
> > >> swap_read_folio() plug design.
> > >>
> > >> Suggested-by: Ying Huang <ying.huang@intel.com>
> > >> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > >> ---
> > >>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
> > >>   mm/shmem.c      |   2 +-
> > >>   mm/swap.h       |  12 ++--
> > >>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
> > >>   mm/swapfile.c   |   2 +-
> > >>   5 files changed, 299 insertions(+), 61 deletions(-)
> > >>
> > >> diff --git a/mm/memory.c b/mm/memory.c
> > >> index b5745b9ffdf7..9655b85fc243 100644
> > >> --- a/mm/memory.c
> > >> +++ b/mm/memory.c
> > >> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > >>       return 0;
> > >>   }
> > >>   +/*
> > >> + * swapin readahead based batching interface for zswap batched loads using IAA:
> > >> + *
> > >> + * Should only be called for and if the faulting swap entry in do_swap_page
> > >> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> > >> + *
> > >> + * Detect if the folio is in the swapcache, is still mapped to only this
> > >> + * process, and further, there are no additional references to this folio
> > >> + * (for e.g. if another process simultaneously readahead this swap entry
> > >> + * while this process was handling the page-fault, and got a pointer to the
> > >> + * folio allocated by this process in the swapcache), besides the references
> > >> + * that were obtained within __read_swap_cache_async() by this process that is
> > >> + * faulting in this single-mapped swap entry.
> > >> + */
> > >
> > > How is this supposed to work for large folios?
> > >
> >
> > Hi,
> >
> > I was looking at zswapin large folio support and have posted a RFC in [1].
> > I got bogged down with some prod stuff, so wasn't able to send it earlier.
> >
> > It looks quite different, and I think simpler than this series, so it might be
> > a good comparison.
> >
> > [1] https://lore.kernel.org/all/20241018105026.2521366-1-usamaarif642@gmail.com/
> >
> > Thanks,
> > Usama
> 
> I agree.
> 
> I think the lower hanging fruit here is to build upon Usama's patch.
> Kanchana, do you think we can just use the new batch decompressing
> infrastructure, and apply it to Usama's large folio zswap loading?
> 
> I'm not denying the readahead idea outright, but that seems much more
> complicated. There are questions regarding the benefits of
> readahead-ing when applied to zswap in the first place - IIUC, zram
> circumvents that logic in several cases, and zswap shares many
> characteristics with zram (fast, synchronous compression devices).
> 
> So let's reap the low hanging fruits first, get the wins as well as
> stress test the new infrastructure. Then we can discuss the readahead
> idea later?

Thanks Usama for publishing the zswap large folio swapin series, and
thanks Nhat for your suggestions. Sure, I can look into integrating the
new batch decompression infrastructure with Usama's large folio zswap
loading; a rough sketch of what I have in mind follows.
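
Purely as an illustration of that integration (my assumption of how the
pieces could fit, not code from either series; zswap_decompress_pages()
and SUB_BATCH_NR are made-up names):

/*
 * Illustrative sketch only (not code from either series): load a large folio
 * from zswap by decompressing its subpages in fixed-size sub-batches, so the
 * same batching engine (e.g. IAA) can back the large folio zswapin path.
 * zswap_decompress_pages() and SUB_BATCH_NR are hypothetical names.
 */
#include <linux/mm.h>		/* folio_nr_pages() */
#include <linux/swap.h>		/* swp_entry_t */
#include <linux/minmax.h>

#define SUB_BATCH_NR	8	/* e.g. SWAP_CRYPTO_SUB_BATCH_SIZE in this series */

static int zswap_load_large_folio_sketch(struct folio *folio, swp_entry_t entry)
{
	long nr = folio_nr_pages(folio);
	long i;

	for (i = 0; i < nr; i += SUB_BATCH_NR) {
		/* Decompress up to SUB_BATCH_NR subpages of the folio in parallel. */
		if (!zswap_decompress_pages(folio, entry, i,
					    min_t(long, SUB_BATCH_NR, nr - i)))
			return -EIO;
	}

	return 0;
}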

However, I think we first need clarity on a bigger question: does it
make sense to swap in large folios? Some important considerations
would be:

1) What are the tradeoffs in the memory footprint cost of swapping in a
   large folio (see the rough arithmetic below)?
2) If we let the user decide this via, say, an option that determines the
   swapin granularity (e.g. no more than 32k at a time), how does this
   constrain compression and zpool storage granularity?
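
To make (1) concrete with rough arithmetic: with 64k folios enabled, a
single 4 KB fault can decompress sixteen 4 KB pages, so up to 60 KB of
each such swapin is speculative and may or may not be used.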

Ultimately, I feel the bigger question is the memory utilization cost of
large folio swapin. The swapin_readahead() based approach instead uses the
workload's prefetch-usefulness history to decide how many 4K folios to
bring in, and recovers efficiency through strategies such as parallel
decompression (sketched below), trying to strike a balance between memory
utilization and performance.
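
A minimal sketch of that shape (illustrative only; process_ra_batch_sketch(),
zswap_decompress_sub_batch() and SUB_BATCH_NR are made-up names -- the series
itself uses process_ra_batch_of_same_type(), swap_read_zswap_batch_unplug()/
zswap_finish_load_batch() and SWAP_CRYPTO_SUB_BATCH_SIZE):

/*
 * Illustrative sketch only -- not the code in this series. The helper
 * zswap_decompress_sub_batch() and SUB_BATCH_NR are hypothetical names.
 */
#include <linux/pagevec.h>	/* struct folio_batch, folio_batch_count() */
#include <linux/minmax.h>
#include "swap.h"		/* swap_read_folio(), mm/ internal header */

#define SUB_BATCH_NR	8	/* up to 8 pages decompressed in parallel */

static void process_ra_batch_sketch(struct folio_batch *zswap_batch,
				    struct folio_batch *non_zswap_batch)
{
	unsigned int i, nr;

	/* 1) Non-zswap folios: issue the reads sequentially, as today. */
	for (i = 0; i < folio_batch_count(non_zswap_batch); i++)
		swap_read_folio(non_zswap_batch->folios[i], NULL);

	/* 2) zswap folios: decompress in sub-batches of up to SUB_BATCH_NR. */
	nr = folio_batch_count(zswap_batch);
	for (i = 0; i < nr; i += SUB_BATCH_NR)
		zswap_decompress_sub_batch(&zswap_batch->folios[i],
					   min_t(unsigned int, SUB_BATCH_NR,
						 nr - i));
}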

Usama, I downloaded your patch series and tried to understand this
better, and wanted to share the data.

I ran a kernel compilation (allmodconfig) with zstd, vm.page-cluster=0,
and 16k/32k/64k large folios enabled ("always"):

16k/32k/64k folios: kernel compilation with zstd:
 =================================================

 ------------------------------------------------------------------------------
                        mm-unstable-10-16-2024    + zswap large folios swapin
                                                                       series
 ------------------------------------------------------------------------------
 zswap compressor                         zstd                           zstd
 vm.page-cluster                             0                              0
 ------------------------------------------------------------------------------
 real_sec                               772.53                         870.61
 user_sec                            15,780.29                      15,836.71
 sys_sec                              5,353.20                       6,185.02
 Max_Res_Set_Size_KB                 1,873,348                      1,873,004
                                                                             
 ------------------------------------------------------------------------------
 memcg_high                                  0                              0
 memcg_swap_fail                             0                              0
 zswpout                            93,811,916                    111,663,872
 zswpin                             27,150,029                     54,730,678
 pswpout                                    64                             59
 pswpin                                     78                             53
 thp_swpout                                  0                              0
 thp_swpout_fallback                         0                              0
 16kB-mthp_swpout_fallback                   0                              0
 32kB-mthp_swpout_fallback                   0                              0
 64kB-mthp_swpout_fallback               5,470                              0
 pgmajfault                         29,019,256                     16,615,820
 swap_ra                                     0                              0
 swap_ra_hit                             3,004                          3,614
 ZSWPOUT-16kB                        1,324,160                      2,252,747
 ZSWPOUT-32kB                          730,534                      1,356,640
 ZSWPOUT-64kB                        3,039,760                      3,955,034
 ZSWPIN-16kB                                                        1,496,916
 ZSWPIN-32kB                                                        1,131,176
 ZSWPIN-64kB                                                        1,866,884
 SWPOUT-16kB                                 0                              0
 SWPOUT-32kB                                 0                              0
 SWPOUT-64kB                                 4                              3
 ------------------------------------------------------------------------------

It does appear that there is considerably higher swapout and swapin
activity as a result of swapping in large folios, which ends up
impacting performance.

I would appreciate thoughts on the usefulness of swapping in large
folios, given the considerations outlined earlier and other factors.

Thanks,
Kanchana


Thread overview: 15+ messages
2024-10-18  6:47 [RFC PATCH v1 0/7] zswap IAA decompress batching Kanchana P Sridhar
2024-10-18  6:47 ` [RFC PATCH v1 1/7] mm: zswap: Config variable to enable zswap loads with " Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 2/7] mm: swap: Add IAA batch decompression API swap_crypto_acomp_decompress_batch() Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 3/7] pagevec: struct folio_batch changes for decompress batching interface Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 4/7] mm: swap: swap_read_folio() can add a folio to a folio_batch if it is in zswap Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 5/7] mm: swap, zswap: zswap folio_batch processing with IAA decompression batching Kanchana P Sridhar
2024-10-18  6:48 ` [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface Kanchana P Sridhar
2024-10-18  7:26   ` David Hildenbrand
2024-10-18 11:04     ` Usama Arif
2024-10-18 17:21       ` Nhat Pham
2024-10-18 21:59         ` Sridhar, Kanchana P [this message]
2024-10-20 16:50           ` Usama Arif
2024-10-20 20:12             ` Sridhar, Kanchana P
2024-10-18 18:09     ` Sridhar, Kanchana P
2024-10-18  6:48 ` [RFC PATCH v1 7/7] mm: For IAA batching, reduce SWAP_BATCH and modify swap slot cache thresholds Kanchana P Sridhar
