From: David Hildenbrand <david@redhat.com>
To: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	hannes@cmpxchg.org, yosryahmed@google.com, nphamcs@gmail.com,
	chengming.zhou@linux.dev, usamaarif642@gmail.com,
	ryan.roberts@arm.com, ying.huang@intel.com, 21cnbao@gmail.com,
	akpm@linux-foundation.org, hughd@google.com, willy@infradead.org,
	bfoster@redhat.com, dchinner@redhat.com, chrisl@kernel.org
Cc: wajdi.k.feghali@intel.com, vinodh.gopal@intel.com
Subject: Re: [RFC PATCH v1 6/7] mm: do_swap_page() calls swapin_readahead() zswap load batching interface.
Date: Fri, 18 Oct 2024 09:26:30 +0200	[thread overview]
Message-ID: <71bcbd3f-a8bd-4014-aabe-081006cc62f8@redhat.com> (raw)
In-Reply-To: <20241018064805.336490-7-kanchana.p.sridhar@intel.com>

On 18.10.24 08:48, Kanchana P Sridhar wrote:
> This patch invokes the swapin_readahead() based batching interface to
> prefetch a batch of 4K folios for zswap load, decompressing them in
> parallel with IAA hardware. swapin_readahead() prefetches folios based
> on vm.page-cluster and on how useful prior prefetches have been to the
> workload. As folios are created in the swapcache and the readahead code
> calls swap_read_folio() with a "zswap_batch" and a "non_zswap_batch", the
> respective folio_batches get populated with the folios to be read.
> 
> Finally, the swapin_readahead() procedures will call the newly added
> process_ra_batch_of_same_type() which:
> 
>   1) Reads all the non_zswap_batch folios sequentially by calling
>      swap_read_folio().
>   2) Calls swap_read_zswap_batch_unplug() with the zswap_batch, which
>      calls zswap_finish_load_batch() to finally decompress each
>      SWAP_CRYPTO_SUB_BATCH_SIZE sub-batch (i.e., up to 8 pages at a time
>      in a prefetch batch of, say, 32 folios) in parallel with IAA (see
>      the sketch below).
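> 
> A minimal sketch of this flow, using the helper names from this series
> (the signature and argument order here are illustrative, not the
> actual ones):
> 
>   static void process_ra_batch_of_same_type(struct swap_iocb **splug,
>   					    struct folio_batch *zswap_batch,
>   					    struct folio_batch *non_zswap_batch)
>   {
>   	unsigned int i;
> 
>   	/* 1) Read the non-zswap folios sequentially, one at a time. */
>   	for (i = 0; i < folio_batch_count(non_zswap_batch); i++)
>   		swap_read_folio(non_zswap_batch->folios[i], splug, NULL, NULL);
> 
>   	/* 2) Batch-decompress the zswap folios in parallel with IAA. */
>   	swap_read_zswap_batch_unplug(zswap_batch, splug);
>   }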
> 
> Within do_swap_page(), we try to benefit from batch decompressions in both
> these scenarios:
> 
>   1) single-mapped, SWP_SYNCHRONOUS_IO:
>        We call swapin_readahead() with "single_mapped_path = true". This is
>        done only in the !zswap_never_enabled() case.
>   2) Shared and/or non-SWP_SYNCHRONOUS_IO folios:
>        We call swapin_readahead() with "single_mapped_path = false".
> 
> This will place folios in the swapcache: a design choice that handles
> cases where a folio that is "single-mapped" in process 1 could be
> prefetched in process 2, and that keeps highly contended server
> scenarios stable. Checks are added at the end of do_swap_page(), after
> the folio has been successfully loaded, to detect whether the
> single-mapped swapcache folio is still single-mapped; if so,
> folio_free_swap() is called on the folio, as sketched below.
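> 
> In code, that post-load check amounts to something like this (a sketch;
> should_free_singlemap_swapcache() is the helper added by this patch):
> 
>   if (single_mapped_swapcache &&
>       should_free_singlemap_swapcache(entry, folio))
>   	folio_free_swap(folio);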
> 
> Within the swapin_readahead() functions, if single_mapped_path is true
> and either the platform does not have IAA, or it has IAA but the user
> has selected a software compressor for zswap (details of the sysfs knob
> follow), readahead/batching is skipped and the folio is loaded using
> zswap_load(), roughly as sketched below.
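> 
> Roughly, for the skip case (a sketch: swap_use_singlemapped_ra() is a
> hypothetical stand-in for reading the "singlemapped_ra_enabled" knob
> described below, and error handling is elided):
> 
>   /* Inside the swapin_readahead() single-mapped path: */
>   if (single_mapped_path && !swap_use_singlemapped_ra()) {
>   	/* No readahead: fault in just this entry via the swapcache. */
>   	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>   					&page_allocated, false);
>   	if (folio && page_allocated)
>   		/* Synchronous single-folio read; zswap_load() decompresses. */
>   		swap_read_folio(folio, NULL, NULL, NULL);
>   	return folio;
>   }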
> 
> A new swap parameter, "singlemapped_ra_enabled" (false by default), is
> added for platforms that have IAA and on which zswap_load_batching_enabled()
> is true, to give the user the option to run experiments with IAA and
> with software compressors for zswap (the swap device being
> SWP_SYNCHRONOUS_IO):
> 
> For IAA:
>   echo true > /sys/kernel/mm/swap/singlemapped_ra_enabled
> 
> For software compressors:
>   echo false > /sys/kernel/mm/swap/singlemapped_ra_enabled
> 
> If "singlemapped_ra_enabled" is set to false, swapin_readahead() will skip
> prefetching folios in the "single-mapped SWP_SYNCHRONOUS_IO" do_swap_page()
> path.
> 
> Thanks Ying Huang for the really helpful brainstorming discussions on the
> swap_read_folio() plug design.
> 
> Suggested-by: Ying Huang <ying.huang@intel.com>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>   mm/memory.c     | 187 +++++++++++++++++++++++++++++++++++++-----------
>   mm/shmem.c      |   2 +-
>   mm/swap.h       |  12 ++--
>   mm/swap_state.c | 157 ++++++++++++++++++++++++++++++++++++----
>   mm/swapfile.c   |   2 +-
>   5 files changed, 299 insertions(+), 61 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b5745b9ffdf7..9655b85fc243 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3924,6 +3924,42 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
>   	return 0;
>   }
>   
> +/*
> + * swapin readahead based batching interface for zswap batched loads using IAA:
> + *
> + * Should only be called for and if the faulting swap entry in do_swap_page
> + * is single-mapped and SWP_SYNCHRONOUS_IO.
> + *
> + * Detect if the folio is in the swapcache, is still mapped only by this
> + * process, and has no references beyond the ones obtained within
> + * __read_swap_cache_async() by this process while faulting in this
> + * single-mapped swap entry. An additional reference could exist if,
> + * e.g., another process did readahead on this swap entry while this
> + * process was handling the page-fault, and obtained a pointer to the
> + * folio that this process allocated in the swapcache.
> + */

How is this supposed to work for large folios?

> +static inline bool should_free_singlemap_swapcache(swp_entry_t entry,
> +						   struct folio *folio)
> +{
> +	if (!folio_test_swapcache(folio))
> +		return false;
> +
> +	if (__swap_count(entry) != 0)
> +		return false;
> +
> +	/*
> +	 * The folio ref count for a single-mapped folio that was allocated
> +	 * in __read_swap_cache_async() can be a maximum of 3. These are the
> +	 * incrementors of the folio ref count in __read_swap_cache_async():
> +	 * folio_alloc_mpol(), add_to_swap_cache(), folio_add_lru().
> +	 */
> +
> +	if (folio_ref_count(folio) <= 3)
> +		return true;
> +
> +	return false;
> +}
> +
>   static inline bool should_try_to_free_swap(struct folio *folio,
>   					   struct vm_area_struct *vma,
>   					   unsigned int fault_flags)
> @@ -4215,6 +4251,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	swp_entry_t entry;
>   	pte_t pte;
>   	vm_fault_t ret = 0;
> +	bool single_mapped_swapcache = false;
>   	void *shadow = NULL;
>   	int nr_pages;
>   	unsigned long page_idx;
> @@ -4283,51 +4320,90 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   	if (!folio) {
>   		if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
>   		    __swap_count(entry) == 1) {
> -			/* skip swapcache */
> -			folio = alloc_swap_folio(vmf);
> -			if (folio) {
> -				__folio_set_locked(folio);
> -				__folio_set_swapbacked(folio);
> -
> -				nr_pages = folio_nr_pages(folio);
> -				if (folio_test_large(folio))
> -					entry.val = ALIGN_DOWN(entry.val, nr_pages);
> -				/*
> -				 * Prevent parallel swapin from proceeding with
> -				 * the cache flag. Otherwise, another thread
> -				 * may finish swapin first, free the entry, and
> -				 * swapout reusing the same entry. It's
> -				 * undetectable as pte_same() returns true due
> -				 * to entry reuse.
> -				 */
> -				if (swapcache_prepare(entry, nr_pages)) {
> +			if (zswap_never_enabled()) {
> +				/* skip swapcache */
> +				folio = alloc_swap_folio(vmf);
> +				if (folio) {
> +					__folio_set_locked(folio);
> +					__folio_set_swapbacked(folio);
> +
> +					nr_pages = folio_nr_pages(folio);
> +					if (folio_test_large(folio))
> +						entry.val = ALIGN_DOWN(entry.val, nr_pages);
>   					/*
> -					 * Relax a bit to prevent rapid
> -					 * repeated page faults.
> +					 * Prevent parallel swapin from proceeding with
> +					 * the cache flag. Otherwise, another thread
> +					 * may finish swapin first, free the entry, and
> +					 * swapout reusing the same entry. It's
> +					 * undetectable as pte_same() returns true due
> +					 * to entry reuse.
>   					 */
> -					add_wait_queue(&swapcache_wq, &wait);
> -					schedule_timeout_uninterruptible(1);
> -					remove_wait_queue(&swapcache_wq, &wait);
> -					goto out_page;
> +					if (swapcache_prepare(entry, nr_pages)) {
> +						/*
> +						 * Relax a bit to prevent rapid
> +						 * repeated page faults.
> +						 */
> +						add_wait_queue(&swapcache_wq, &wait);
> +						schedule_timeout_uninterruptible(1);
> +						remove_wait_queue(&swapcache_wq, &wait);
> +						goto out_page;
> +					}
> +					need_clear_cache = true;
> +
> +					mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> +
> +					shadow = get_shadow_from_swap_cache(entry);
> +					if (shadow)
> +						workingset_refault(folio, shadow);
> +
> +					folio_add_lru(folio);
> +
> +					/* To provide entry to swap_read_folio() */
> +					folio->swap = entry;
> +					swap_read_folio(folio, NULL, NULL, NULL);
> +					folio->private = NULL;
> +				}
> +			} else {
> +				/*
> +				 * zswap is enabled or was enabled at some point.
> +				 * Don't skip swapcache.
> +				 *
> +				 * swapin readahead based batching interface
> +				 * for zswap batched loads using IAA:
> +				 *
> +				 * Readahead is invoked in this path only if
> +				 * the "singlemapped_ra_enabled" sysfs swap
> +				 * parameter is set to true. By default,
> +				 * "singlemapped_ra_enabled" is false, which is
> +				 * the recommended setting for software compressors.
> +				 * For IAA, if "singlemapped_ra_enabled" is set
> +				 * to true, readahead will be deployed in this path
> +				 * as well.
> +				 *
> +				 * For single-mapped pages, the batching interface
> +				 * calls __read_swap_cache_async() to allocate and
> +				 * place the faulting page in the swapcache. This is
> +				 * to handle a scenario where the faulting page in
> +				 * this process happens to simultaneously be a
> +				 * readahead page in another process. By placing the
> +				 * single-mapped faulting page in the swapcache,
> +				 * we avoid race conditions and duplicate page
> +				 * allocations under these scenarios.
> +				 */
> +				folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> +							 vmf, true);
> +				if (!folio) {
> +					ret = VM_FAULT_OOM;
> +					goto out;
>   				}
> -				need_clear_cache = true;
> -
> -				mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
> -
> -				shadow = get_shadow_from_swap_cache(entry);
> -				if (shadow)
> -					workingset_refault(folio, shadow);
> -
> -				folio_add_lru(folio);
>   
> -				/* To provide entry to swap_read_folio() */
> -				folio->swap = entry;
> -				swap_read_folio(folio, NULL, NULL, NULL);
> -				folio->private = NULL;
> -			}
> +				single_mapped_swapcache = true;
> +				nr_pages = folio_nr_pages(folio);
> +				swapcache = folio;
> +			} /* swapin with zswap support. */
>   		} else {
>   			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
> -						vmf);
> +						 vmf, false);
>   			swapcache = folio;

I'm sorry, but making this function ever more complicated and ugly is 
not going to fly. The zswap special casing is quite ugly here as well.

Is there a way forward where we can make this code actually readable and
avoid the zswap special casing?

-- 
Cheers,

David / dhildenb



