From: "Huang, Ying" <ying.huang@intel.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, Kairui Song <kasong@tencent.com>,
Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Hugh Dickins <hughd@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Matthew Wilcox <willy@infradead.org>,
Michal Hocko <mhocko@suse.com>,
Yosry Ahmed <yosryahmed@google.com>,
David Hildenbrand <david@redhat.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally
Date: Tue, 09 Jan 2024 10:03:29 +0800 [thread overview]
Message-ID: <871qar9sb2.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <20240102175338.62012-10-ryncsn@gmail.com> (Kairui Song's message of "Wed, 3 Jan 2024 01:53:38 +0800")
Kairui Song <ryncsn@gmail.com> writes:
> From: Kairui Song <kasong@tencent.com>
>
> Currently, shmem uses cluster readahead for all swap backends. Cluster
> readahead is not a good solution for ramdisk based device (ZRAM) at all.
>
> After switching to the new helper, most benchmarks showed a good result:
>
> - Single file sequence read:
> perf stat --repeat 20 dd if=/tmpfs/test of=/dev/null bs=1M count=8192
> (/tmpfs/test is a zero filled file, using brd as swap, 4G memcg limit)
> Before: 22.248 +- 0.549
> After: 22.021 +- 0.684 (-1.1%)
>
> - Random read stress test:
> fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
> --size=256m --ioengine=mmap --rw=randread --random_distribution=random \
> --time_based --ramp_time=1m --runtime=5m --group_reporting
> (using brd as swap, 2G memcg limit)
>
> Before: 1818MiB/s
> After: 1888MiB/s (+3.85%)
>
> - Zipf biased random read stress test:
> fio -name=tmpfs --numjobs=16 --directory=/tmpfs \
> --size=256m --ioengine=mmap --rw=randread --random_distribution=zipf:1.2 \
> --time_based --ramp_time=1m --runtime=5m --group_reporting
> (using brd as swap, 2G memcg limit)
>
> Before: 31.1GiB/s
> After: 32.3GiB/s (+3.86%)
>
> So cluster readahead doesn't help much even for single sequence read,
> and for random stress test, the performance is better without it.
>
> Considering both memory and swap device will get more fragmented
> slowly, and commonly used ZRAM consumes much more CPU than plain
> ramdisk, false readahead could occur more frequently and waste
> more CPU. Direct SWAP is cheaper, so use the new helper and skip
> read ahead for SWP_SYNCHRONOUS_IO device.
It's good to take advantage of swap_direct (no readahead). I also hopes
we can take advantage of VMA based swapin if shmem is accessed via mmap.
That appears possible.
--
Best Regards,
Huang, Ying
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 67 +++++++++++++++++++++++++------------------------
> mm/swap.h | 9 -------
> mm/swap_state.c | 11 ++++++--
> 3 files changed, 43 insertions(+), 44 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9da9f7a0e620..3c0729fe934d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1564,20 +1564,6 @@ static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
> static struct mempolicy *shmem_get_pgoff_policy(struct shmem_inode_info *info,
> pgoff_t index, unsigned int order, pgoff_t *ilx);
>
> -static struct folio *shmem_swapin_cluster(swp_entry_t swap, gfp_t gfp,
> - struct shmem_inode_info *info, pgoff_t index)
> -{
> - struct mempolicy *mpol;
> - pgoff_t ilx;
> - struct folio *folio;
> -
> - mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> - folio = swap_cluster_readahead(swap, gfp, mpol, ilx);
> - mpol_cond_put(mpol);
> -
> - return folio;
> -}
> -
> /*
> * Make sure huge_gfp is always more limited than limit_gfp.
> * Some of the flags set permissions, while others set limitations.
> @@ -1851,9 +1837,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> {
> struct address_space *mapping = inode->i_mapping;
> struct shmem_inode_info *info = SHMEM_I(inode);
> + enum swap_cache_result cache_result;
> struct swap_info_struct *si;
> struct folio *folio = NULL;
> + struct mempolicy *mpol;
> swp_entry_t swap;
> + pgoff_t ilx;
> int error;
>
> VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
> @@ -1871,36 +1860,40 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> return -EINVAL;
> }
>
> - /* Look it up and read it in.. */
> - folio = swap_cache_get_folio(swap, NULL, 0, NULL);
> + mpol = shmem_get_pgoff_policy(info, index, 0, &ilx);
> + folio = swapin_entry_mpol(swap, gfp, mpol, ilx, &cache_result);
> + mpol_cond_put(mpol);
> +
> if (!folio) {
> - /* Or update major stats only when swapin succeeds?? */
> + error = -ENOMEM;
> + goto failed;
> + }
> + if (cache_result != SWAP_CACHE_HIT) {
> if (fault_type) {
> *fault_type |= VM_FAULT_MAJOR;
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> - /* Here we actually start the io */
> - folio = shmem_swapin_cluster(swap, gfp, info, index);
> - if (!folio) {
> - error = -ENOMEM;
> - goto failed;
> - }
> }
>
> /* We have to do this with folio locked to prevent races */
> folio_lock(folio);
> - if (!folio_test_swapcache(folio) ||
> - folio->swap.val != swap.val ||
> - !shmem_confirm_swap(mapping, index, swap)) {
> + if (cache_result != SWAP_CACHE_BYPASS) {
> + /* With cache bypass, folio is new allocated, sync, and not in cache */
> + if (!folio_test_swapcache(folio) || folio->swap.val != swap.val) {
> + error = -EEXIST;
> + goto unlock;
> + }
> + if (!folio_test_uptodate(folio)) {
> + error = -EIO;
> + goto failed;
> + }
> + folio_wait_writeback(folio);
> + }
> + if (!shmem_confirm_swap(mapping, index, swap)) {
> error = -EEXIST;
> goto unlock;
> }
> - if (!folio_test_uptodate(folio)) {
> - error = -EIO;
> - goto failed;
> - }
> - folio_wait_writeback(folio);
>
> /*
> * Some architectures may have to restore extra metadata to the
> @@ -1908,12 +1901,19 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> */
> arch_swap_restore(swap, folio);
>
> - if (shmem_should_replace_folio(folio, gfp)) {
> + /* With cache bypass, folio is new allocated and always respect gfp flags */
> + if (cache_result != SWAP_CACHE_BYPASS && shmem_should_replace_folio(folio, gfp)) {
> error = shmem_replace_folio(&folio, gfp, info, index);
> if (error)
> goto failed;
> }
>
> + /*
> + * The expected value checking below should be enough to ensure
> + * only one up-to-date swapin success. swap_free() is called after
> + * this, so the entry can't be reused. As long as the mapping still
> + * has the old entry value, it's never swapped in or modified.
> + */
> error = shmem_add_to_page_cache(folio, mapping, index,
> swp_to_radix_entry(swap), gfp);
> if (error)
> @@ -1924,7 +1924,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> if (sgp == SGP_WRITE)
> folio_mark_accessed(folio);
>
> - delete_from_swap_cache(folio);
> + if (cache_result != SWAP_CACHE_BYPASS)
> + delete_from_swap_cache(folio);
> folio_mark_dirty(folio);
> swap_free(swap);
> put_swap_device(si);
> diff --git a/mm/swap.h b/mm/swap.h
> index 8f790a67b948..20f4048c971c 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -57,9 +57,6 @@ void __delete_from_swap_cache(struct folio *folio,
> void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr,
> - void **shadowp);
> struct folio *filemap_get_incore_folio(struct address_space *mapping,
> pgoff_t index);
>
> @@ -123,12 +120,6 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
> return 0;
> }
>
> -static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr)
> -{
> - return NULL;
> -}
> -
> static inline
> struct folio *filemap_get_incore_folio(struct address_space *mapping,
> pgoff_t index)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3edf4b63158d..10eec68475dd 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -318,7 +318,14 @@ void free_pages_and_swap_cache(struct encoded_page **pages, int nr)
>
> static inline bool swap_use_no_readahead(struct swap_info_struct *si, swp_entry_t entry)
> {
> - return data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1;
> + int count;
> +
> + if (!data_race(si->flags & SWP_SYNCHRONOUS_IO))
> + return false;
> +
> + count = __swap_count(entry);
> +
> + return (count == 1 || count == SWAP_MAP_SHMEM);
> }
>
> static inline bool swap_use_vma_readahead(void)
> @@ -334,7 +341,7 @@ static inline bool swap_use_vma_readahead(void)
> *
> * Caller must lock the swap device or hold a reference to keep it valid.
> */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> +static struct folio *swap_cache_get_folio(swp_entry_t entry,
> struct vm_area_struct *vma, unsigned long addr, void **shadowp)
> {
> struct folio *folio;
next prev parent reply other threads:[~2024-01-09 2:05 UTC|newest]
Thread overview: 38+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-02 17:53 [PATCH v2 0/9] swapin refactor for optimization and unified readahead Kairui Song
2024-01-02 17:53 ` [PATCH v2 1/9] mm/swapfile.c: add back some comment Kairui Song
2024-01-02 17:53 ` [PATCH v2 2/9] mm/swap: move no readahead swapin code to a stand-alone helper Kairui Song
2024-01-04 7:28 ` Huang, Ying
2024-01-05 7:43 ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 3/9] mm/swap: avoid doing extra unlock error checks for direct swapin Kairui Song
2024-01-04 8:10 ` Huang, Ying
2024-01-09 9:38 ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 4/9] mm/swap: always account swapped in page into current memcg Kairui Song
2024-01-05 7:14 ` Huang, Ying
2024-01-05 7:33 ` Kairui Song
2024-01-08 7:44 ` Huang, Ying
2024-01-09 9:42 ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 5/9] mm/swap: introduce swapin_entry for unified readahead policy Kairui Song
2024-01-05 7:28 ` Huang, Ying
2024-01-10 2:42 ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 6/9] mm/swap: handle swapcache lookup in swapin_entry Kairui Song
2024-01-08 8:26 ` Huang, Ying
2024-01-10 2:53 ` Kairui Song
2024-01-15 1:45 ` Huang, Ying
2024-01-15 17:11 ` Kairui Song
2024-01-02 17:53 ` [PATCH v2 7/9] mm/swap: avoid a duplicated swap cache lookup for SWP_SYNCHRONOUS_IO Kairui Song
2024-01-03 12:50 ` kernel test robot
2024-01-02 17:53 ` [PATCH v2 8/9] mm/swap: introduce a helper for swapin without vmfault Kairui Song
2024-01-09 1:08 ` Huang, Ying
2024-01-10 3:32 ` Kairui Song
2024-01-15 1:52 ` Huang, Ying
2024-01-21 18:40 ` Kairui Song
2024-01-22 6:38 ` Huang, Ying
2024-01-22 11:35 ` Kairui Song
2024-01-24 3:31 ` Huang, Ying
2024-01-02 17:53 ` [PATCH v2 9/9] mm/swap, shmem: use new swapin helper to skip readahead conditionally Kairui Song
2024-01-03 11:56 ` kernel test robot
2024-01-03 13:45 ` kernel test robot
2024-01-09 2:03 ` Huang, Ying [this message]
2024-01-10 3:35 ` Kairui Song
2024-01-30 0:39 ` Kairui Song
2024-01-30 2:01 ` Huang, Ying
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=871qar9sb2.fsf@yhuang6-desk2.ccr.corp.intel.com \
--to=ying.huang@intel.com \
--cc=akpm@linux-foundation.org \
--cc=chrisl@kernel.org \
--cc=david@redhat.com \
--cc=hannes@cmpxchg.org \
--cc=hughd@google.com \
--cc=kasong@tencent.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=ryncsn@gmail.com \
--cc=willy@infradead.org \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox