linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Barry Song <21cnbao@gmail.com>
To: ryan.roberts@arm.com
Cc: akpm@linux-foundation.org, david@redhat.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	mhocko@suse.com, shy828301@gmail.com, wangkefeng.wang@huawei.com,
	willy@infradead.org, xiang@kernel.org, ying.huang@intel.com,
	yuzhao@google.com, chrisl@kernel.org, surenb@google.com,
	hanchuanhua@oppo.com
Subject: Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
Date: Mon,  5 Feb 2024 22:51:55 +1300	[thread overview]
Message-ID: <20240205095155.7151-1-v-songbaohua@oppo.com> (raw)
In-Reply-To: <20231025144546.577640-5-ryan.roberts@arm.com>

+Chris, Suren and Chuanhua

Hi Ryan,

> +	/*
> +	 * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> +	 * so indicate that we are scanning to synchronise with swapoff.
> +	 */
> +	si->flags += SWP_SCANNING;
> +	ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> +	si->flags -= SWP_SCANNING;

nobody is using this scan_base afterwards. it seems a bit weird to
pass a pointer.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  					if (!can_split_folio(folio, NULL))
>  						goto activate_locked;
>  					/*
> -					 * Split folios without a PMD map right
> -					 * away. Chances are some or all of the
> -					 * tail pages can be freed without IO.
> +					 * Split PMD-mappable folios without a
> +					 * PMD map right away. Chances are some
> +					 * or all of the tail pages can be freed
> +					 * without IO.
>  					 */
> -					if (!folio_entire_mapcount(folio) &&
> +					if (folio_test_pmd_mappable(folio) &&
> +					    !folio_entire_mapcount(folio) &&
>  					    split_folio_to_list(folio,
>  								folio_list))
>  						goto activate_locked;
> --

Chuanhua and I ran this patchset for a couple of days and found a race
between reclamation and split_folio. this might cause applications get
wrong data 0 while swapping-in.

in case one thread(T1) is reclaiming a large folio by some means, still
another thread is calling madvise MADV_PGOUT(T2). and at the same time,
we have two threads T3 and T4 to swap-in in parallel. T1 doesn't split
and T2 does split as below,

static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
                                unsigned long addr, unsigned long end, 
                                struct mm_walk *walk)
{

                /*   
                 * Creating a THP page is expensive so split it only if we
                 * are sure it's worth. Split it if we are only owner.
                 */
                if (folio_test_large(folio)) {
                        int err; 

                        if (folio_estimated_sharers(folio) != 1)
                                break;
                        if (pageout_anon_only_filter && !folio_test_anon(folio))
                                break;
                        if (!folio_trylock(folio))
                                break;
                        folio_get(folio);
                        arch_leave_lazy_mmu_mode();
                        pte_unmap_unlock(start_pte, ptl);
                        start_pte = NULL;
                        err = split_folio(folio);
                        folio_unlock(folio);
                        folio_put(folio);
                        if (err)
                                break;
                        start_pte = pte =
                                pte_offset_map_lock(mm, pmd, addr, &ptl);
                        if (!start_pte)
                                break;
                        arch_enter_lazy_mmu_mode();
                        pte--;
                        addr -= PAGE_SIZE;
                        continue;
                }    

        return 0;
}



if T3 and T4 swap-in same page, and they both do swap_read_folio(). the
first one of T3 and T4 who gets PTL will set pte, and the 2nd one will
check pte_same() and find pte has been changed by another thread, thus
goto out_nomap in do_swap_page.
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
        if (!folio) {
                if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
                    __swap_count(entry) == 1) {
                        /* skip swapcache */
                        folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
                                                vma, vmf->address, false);
                        page = &folio->page;
                        if (folio) {
                                __folio_set_locked(folio);
                                __folio_set_swapbacked(folio);
                         
                                /* To provide entry to swap_read_folio() */
                                folio->swap = entry;
                                swap_read_folio(folio, true, NULL);
                                folio->private = NULL;
                        }
                } else {
                }
        
        
        /*
         * Back out if somebody else already faulted in this pte.
         */
        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
                        &vmf->ptl);
        if (unlikely(!vmf->pte || !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
                goto out_nomap;

        swap_free(entry);
        pte = mk_pte(page, vma->vm_page_prot);

        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
        return ret;
}


while T1 and T2 is working in parallel, T2 will split folio. this can
run into race with T1's reclamation without splitting. T2 will split
a large folio into a couple of normal pages and reclaim them.

If T3 finishes swap_read_folio and gets PTL earlier than T4, it calls
set_pte and swap_free. this will cause zRAM to free the slot. then
t4 will get zero data in swap_read_folio() as the below zRAM code
will fill zero for freed slots, 

static int zram_read_from_zspool(struct zram *zram, struct page *page,
                                 u32 index)
{
        ...

        handle = zram_get_handle(zram, index);
        if (!handle || zram_test_flag(zram, index, ZRAM_SAME)) {
                unsigned long value;
                void *mem;

                value = handle ? zram_get_element(zram, index) : 0; 
                mem = kmap_local_page(page);
                zram_fill_page(mem, PAGE_SIZE, value);
                kunmap_local(mem);
                return 0;
        }
}

usually, after t3 frees swap and does set_pte, t4's pte_same becomes
false, it won't set pte again. So filling zero data into freed slot
by zRAM driver is not a problem at all. but the race is that T1 and
T2 might do set swap to ptes twice as t1 doesn't split but t2 splits
(splitted normal folios are also added into reclaim_list), thus, the
corrupted zero data will get a chance to be set into PTE by t4 as t4
reads the new PTE which is set secondly and has the same swap entry
as its orig_pte after T3 has swapped-in and free the swap entry.

we have worked around this problem by preventing T4 from splitting
large folios and letting it goto skip the large folios entirely in
MADV PAGEOUT once we detect a concurrent reclamation for this large
folio.

so my understanding is changing vmscan isn't sufficient to support
large folio swap-out without splitting. we have to adjust madvise
as well. we will have a fix for this problem in
[PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT
https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

But i feel this patch should be a part of your swap-out patchset rather
than the swap-in series of Chuanhua and me :-)

Thanks
Barry


  parent reply	other threads:[~2024-02-05  9:52 UTC|newest]

Thread overview: 116+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-25 14:45 [PATCH v3 0/4] " Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2024-02-22 10:19   ` David Hildenbrand
2024-02-22 10:20     ` David Hildenbrand
2024-02-26 17:41       ` Ryan Roberts
2024-02-27 17:10         ` Ryan Roberts
2024-02-27 19:17           ` David Hildenbrand
2024-02-28  9:37             ` Ryan Roberts
2024-02-28 12:12               ` David Hildenbrand
2024-02-28 14:57                 ` Ryan Roberts
2024-02-28 15:12                   ` David Hildenbrand
2024-02-28 15:18                     ` Ryan Roberts
2024-03-01 16:27                     ` Ryan Roberts
2024-03-01 16:31                       ` Matthew Wilcox
2024-03-01 16:44                         ` Ryan Roberts
2024-03-01 17:00                           ` David Hildenbrand
2024-03-01 17:14                             ` Ryan Roberts
2024-03-01 17:18                               ` David Hildenbrand
2024-03-01 17:06                           ` Ryan Roberts
2024-03-04  4:52                             ` Barry Song
2024-03-04  5:42                               ` Barry Song
2024-03-05  7:41                                 ` Ryan Roberts
2024-03-01 16:31                       ` Ryan Roberts
2024-03-01 16:32                       ` David Hildenbrand
2024-03-04 16:03                 ` Ryan Roberts
2024-03-04 17:30                   ` David Hildenbrand
2024-03-04 18:38                     ` Ryan Roberts
2024-03-04 20:50                       ` David Hildenbrand
2024-03-04 21:55                         ` Ryan Roberts
2024-03-04 22:02                           ` David Hildenbrand
2024-03-04 22:34                             ` Ryan Roberts
2024-03-05  6:11                               ` Huang, Ying
2024-03-05  8:35                                 ` David Hildenbrand
2024-03-05  8:46                                   ` Ryan Roberts
2024-02-28 13:33               ` Matthew Wilcox
2024-02-28 14:24                 ` Ryan Roberts
2024-02-28 14:59                   ` Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 3/4] mm: swap: Simplify ssd behavior when scanner steals entry Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
2023-10-30  8:18   ` Huang, Ying
2023-10-30 13:59     ` Ryan Roberts
2023-10-31  8:12       ` Huang, Ying
2023-11-03 11:42         ` Ryan Roberts
2023-11-02  7:40   ` Barry Song
2023-11-02 10:21     ` Ryan Roberts
2023-11-02 22:36       ` Barry Song
2023-11-03 11:31         ` Ryan Roberts
2023-11-03 13:57           ` Steven Price
2023-11-04  9:34             ` Barry Song
2023-11-06 10:12               ` Steven Price
2023-11-06 21:39                 ` Barry Song
2023-11-08 11:51                   ` Steven Price
2023-11-07 12:46               ` Ryan Roberts
2023-11-07 18:05                 ` Barry Song
2023-11-08 11:23                   ` Barry Song
2023-11-08 20:20                     ` Ryan Roberts
2023-11-08 21:04                       ` Barry Song
2023-11-04  5:49           ` Barry Song
2024-02-05  9:51   ` Barry Song [this message]
2024-02-05 12:14     ` Ryan Roberts
2024-02-18 23:40       ` Barry Song
2024-02-20 20:03         ` Ryan Roberts
2024-03-05  9:00         ` Ryan Roberts
2024-03-05  9:54           ` Barry Song
2024-03-05 10:44             ` Ryan Roberts
2024-02-27 12:28     ` Ryan Roberts
2024-02-27 13:37     ` Ryan Roberts
2024-02-28  2:46       ` Barry Song
2024-02-22  7:05   ` Barry Song
2024-02-22 10:09     ` David Hildenbrand
2024-02-23  9:46       ` Barry Song
2024-02-27 12:05         ` Ryan Roberts
2024-02-28  1:23           ` Barry Song
2024-02-28  9:34             ` David Hildenbrand
2024-02-28 23:18               ` Barry Song
2024-02-28 15:57             ` Ryan Roberts
2023-11-29  7:47 ` [PATCH v3 0/4] " Barry Song
2023-11-29 12:06   ` Ryan Roberts
2023-11-29 20:38     ` Barry Song
2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
2024-01-18 11:10   ` [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE Barry Song
2024-01-26 23:14     ` Chris Li
2024-02-26  2:59       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free() Barry Song
2024-01-26 23:17     ` Chris Li
2024-02-26  4:47       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
2024-01-26 23:22     ` Chris Li
2024-01-18 11:10   ` [PATCH RFC 4/6] mm: support large folios swapin as a whole Barry Song
2024-01-27 19:53     ` Chris Li
2024-02-26  7:29       ` Barry Song
2024-01-27 20:06     ` Chris Li
2024-02-26  7:31       ` Barry Song
2024-01-18 11:10   ` [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap() Barry Song
2024-01-18 11:54     ` David Hildenbrand
2024-01-23  6:49       ` Barry Song
2024-01-29  3:25         ` Chris Li
2024-01-29 10:06           ` David Hildenbrand
2024-01-29 16:31             ` Chris Li
2024-02-26  5:05               ` Barry Song
2024-04-06 23:27             ` Barry Song
2024-01-27 23:41     ` Chris Li
2024-01-18 11:10   ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
2024-01-29  2:15     ` Chris Li
2024-02-26  6:39       ` Barry Song
2024-02-27 12:22     ` Ryan Roberts
2024-02-27 22:39       ` Barry Song
2024-02-27 14:40     ` Ryan Roberts
2024-02-27 18:57       ` Barry Song
2024-02-28  3:49         ` Barry Song
2024-01-18 15:25   ` [PATCH RFC 0/6] mm: support large folios swap-in Ryan Roberts
2024-01-18 23:54     ` Barry Song
2024-01-19 13:25       ` Ryan Roberts
2024-01-27 14:27         ` Barry Song
2024-01-29  9:05   ` Huang, Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240205095155.7151-1-v-songbaohua@oppo.com \
    --to=21cnbao@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisl@kernel.org \
    --cc=david@redhat.com \
    --cc=hanchuanhua@oppo.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    --cc=ryan.roberts@arm.com \
    --cc=shy828301@gmail.com \
    --cc=surenb@google.com \
    --cc=wangkefeng.wang@huawei.com \
    --cc=willy@infradead.org \
    --cc=xiang@kernel.org \
    --cc=ying.huang@intel.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox