From: Barry Song <21cnbao@gmail.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: akpm@linux-foundation.org, david@redhat.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
mhocko@suse.com, shy828301@gmail.com,
wangkefeng.wang@huawei.com, willy@infradead.org,
xiang@kernel.org, ying.huang@intel.com, yuzhao@google.com,
chrisl@kernel.org, surenb@google.com, hanchuanhua@oppo.com
Subject: Re: [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting
Date: Wed, 28 Feb 2024 15:46:17 +1300 [thread overview]
Message-ID: <CAGsJ_4zdh5kOG7QP4UDaE-wmLFiTEJC2PX-_LxtOj=QrZSvkCA@mail.gmail.com> (raw)
In-Reply-To: <6a55e785-73dd-4951-bad8-2670810a13b6@arm.com>
On Wed, Feb 28, 2024 at 2:37 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/02/2024 09:51, Barry Song wrote:
> > +Chris, Suren and Chuanhua
> >
> > Hi Ryan,
> >
> >> + /*
> >> + * __scan_swap_map_try_ssd_cluster() may drop si->lock during discard,
> >> + * so indicate that we are scanning to synchronise with swapoff.
> >> + */
> >> + si->flags += SWP_SCANNING;
> >> + ret = __scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order);
> >> + si->flags -= SWP_SCANNING;
> >
> > nobody is using this scan_base afterwards. it seems a bit weird to
> > pass a pointer.
> >
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1212,11 +1212,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >> if (!can_split_folio(folio, NULL))
> >> goto activate_locked;
> >> /*
> >> - * Split folios without a PMD map right
> >> - * away. Chances are some or all of the
> >> - * tail pages can be freed without IO.
> >> + * Split PMD-mappable folios without a
> >> + * PMD map right away. Chances are some
> >> + * or all of the tail pages can be freed
> >> + * without IO.
> >> */
> >> - if (!folio_entire_mapcount(folio) &&
> >> + if (folio_test_pmd_mappable(folio) &&
> >> + !folio_entire_mapcount(folio) &&
> >> split_folio_to_list(folio,
> >> folio_list))
> >> goto activate_locked;
> >> --
> >
> > Chuanhua and I ran this patchset for a couple of days and found a race
> > between reclamation and split_folio. This might cause applications to
> > read wrong (zero-filled) data while swapping in.
> >
> > Suppose one thread (T1) is reclaiming a large folio by some means while
> > another thread (T2) is calling madvise MADV_PAGEOUT, and at the same time
> > two threads T3 and T4 swap in in parallel. T1 doesn't split
> > and T2 does split, as below:
>
Hi Ryan,
> Hi Barry,
>
> Do you have a test case you can share that provokes this problem? And is this a
> separate problem to the race you solved with TTU_SYNC or is this solving the
> same problem?
They are the same.
After sending you the report about the races, I spent some time and finally
figured out what was happening and why corrupted data appeared while swapping
in. It is absolutely not your fault; TTU_SYNC does somehow resolve my problem,
but it is not the root cause. The corrupted data can only be reproduced after
applying patch 4[1] of the swap-in series:

[1] [PATCH RFC 4/6] mm: support large folios swapin as a whole
https://lore.kernel.org/linux-mm/20240118111036.72641-5-21cnbao@gmail.com/
Suppose we have a large folio mapped by 16 PTEs as below. After add_to_swap(),
it is assigned swap offsets starting at 0x10000; its PTEs are all still
present because the folio is still mapped:

PTE     state
PTE0    present
PTE1    present
PTE2    present
PTE3    present
...
PTE15   present
Then we get to try_to_unmap_one(). Since try_to_unmap_one() doesn't
necessarily take the PTL starting at PTE0, while it scans the PTEs we might
see:

PTE     state
PTE0    none (someone is writing PTE0 for various reasons)
PTE1    present
PTE2    present
PTE3    present
...
PTE15   present

so we take the PTL only from PTE1 onwards.
After try_to_unmap_one(), the PTEs become:

PTE     state
PTE0    present (someone finished the write of PTE0)
PTE1    swap 0x10001
PTE2    swap 0x10002
PTE3    swap 0x10003
...
PTE15   swap 0x1000F

Thus, after try_to_unmap_one(), the large folio is still (partially) mapped,
so its swapcache entry will still be there.
Now a parallel thread runs MADV_PAGEOUT. It finds this large folio is not
completely mapped, so it splits the folio into 16 small folios, but their swap
offsets are kept. The swapcache now holds 16 small folios with contiguous swap
offsets. MADV_PAGEOUT will reclaim these 16 folios; after 16 new
try_to_unmap_one() calls we have:

PTE     state
PTE0    swap 0x10000 SWAP_HAS_CACHE
PTE1    swap 0x10001 SWAP_HAS_CACHE
PTE2    swap 0x10002 SWAP_HAS_CACHE
PTE3    swap 0x10003 SWAP_HAS_CACHE
...
PTE15   swap 0x1000F SWAP_HAS_CACHE
From this point on, these 16 PTEs can end up in various different states.
For example:

PTE     state
PTE0    swap 0x10000 SWAP_HAS_CACHE = 0  (cleared by finished pageout and remove_mapping)
PTE1    swap 0x10001 SWAP_HAS_CACHE = 0  (cleared by finished pageout and remove_mapping)
PTE2    swap 0x10002 SWAP_HAS_CACHE = 0  (cleared by a concurrent swap-in and swap-out)
PTE3    swap 0x10003 SWAP_HAS_CACHE = 1
...
PTE13   swap 0x1000D SWAP_HAS_CACHE = 1
PTE14   swap 0x1000E SWAP_HAS_CACHE = 1
PTE15   swap 0x1000F SWAP_HAS_CACHE = 1

All of them have swap_count = 1 but differ in SWAP_HAS_CACHE: some of these
small folios are still in the swapcache, others are not.
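
For reference, this is roughly how that state is encoded: each si->swap_map[]
byte carries both the swap count and the SWAP_HAS_CACHE flag (see
include/linux/swap.h and swap_count() in mm/swapfile.c). The tiny helper below
is only an illustrative sketch, not something in the tree:

static inline bool offset_has_cache(struct swap_info_struct *si,
				    unsigned long offset)
{
	unsigned char ent = si->swap_map[offset];

	/* the swap count is ent & ~SWAP_HAS_CACHE; the cache bit is separate */
	return ent & SWAP_HAS_CACHE;
}

So the table above is a case where the count bits of all 16 entries agree but
the cache bit does not.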
Then we run do_swap_page() at one PTE whose SWAP_HAS_CACHE = 0 and
swap_count = 1 (the folio is not in the swapcache and has thus been written to
swap), and the swap-in series does this check:
static bool pte_range_swap(pte_t *pte, int nr_pages)
{
	int i;
	swp_entry_t entry;
	unsigned type;
	pgoff_t start_offset;

	entry = pte_to_swp_entry(ptep_get_lockless(pte));
	if (non_swap_entry(entry))
		return false;
	start_offset = swp_offset(entry);
	if (start_offset % nr_pages)
		return false;

	type = swp_type(entry);
	for (i = 1; i < nr_pages; i++) {
		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
		if (non_swap_entry(entry))
			return false;
		if (swp_offset(entry) != start_offset + i)
			return false;
		if (swp_type(entry) != type)
			return false;
	}

	return true;
}
As those swp entries are contiguous, the check passes and we call
swap_read_folio() for the whole range. But for the entries whose folios are
still in the swapcache and have never been written out, zRAM has no data for
those slots, so we read back zero-filled pages.

So the root cause is that pte_range_swap() should also check that all 16
swap_map entries have the same SWAP_HAS_CACHE state (false, in this path), for
example:
static bool is_pte_range_contig_swap(pte_t *pte, int nr_pages)
{
	...
	count = si->swap_map[start_offset];
	for (i = 1; i < nr_pages; i++) {
		entry = pte_to_swp_entry(ptep_get_lockless(pte + i));
		if (non_swap_entry(entry))
			return false;
		if (swp_offset(entry) != start_offset + i)
			return false;
		if (swp_type(entry) != type)
			return false;
		/* fall back to small folios if SWAP_HAS_CACHE isn't the same */
		if (si->swap_map[start_offset + i] != count)
			return false;
	}

	return true;
}
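
To make the intended use concrete: the caller would bail out of the
large-folio path whenever this helper returns false. The snippet below is only
a sketch of that idea; the exact call site, variable names and fallback in
do_swap_page() are assumptions, not the RFC's literal code:

	/*
	 * Sketch: before attempting an nr_pages swap-in as a whole, require
	 * both contiguous swap offsets and identical swap_map state; any
	 * mismatch (e.g. a mix of SWAP_HAS_CACHE set/cleared) falls back to
	 * the ordinary single-page swap-in path.
	 */
	if (!is_pte_range_contig_swap(vmf->pte, nr_pages))
		nr_pages = 1;	/* hypothetical fallback to a small folio */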
Somehow, TTU_SYNC "resolves" this by giving MADV_PAGEOUT no chance to split
the folio, as the large folio's PTEs then either get entirely converted to
swap entries or entirely stay present.

Though the bug is within the swap-in series, I am still a big fan of TTU_SYNC
for large folio reclamation, for at least three reasons:

1. We remove some possibility of large folios failing to be reclaimed,
improving reclamation efficiency.

2. We avoid many strange cases and a potential folio_split() during
reclamation. Without TTU_SYNC, folios can be split later, or end up partially
converted to swap entries while partially still present.

3. We don't increase PTL contention. My testing shows try_to_unmap_one() will
always get the PTL after skipping at most one or two PTEs, because the
intermediate break-before-make windows are short. Of course, most of the time
try_to_unmap_one() gets the PTL starting from PTE0. A sketch of the relevant
rmap logic follows below.
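
For context on why TTU_SYNC behaves this way, this is my reading of the
relevant logic in mm/rmap.c and mm/page_vma_mapped.c (paraphrased, so treat it
as a sketch rather than an exact quote of the tree):

	/* try_to_unmap_one(): TTU_SYNC selects the synchronous PTE walk */
	if (flags & TTU_SYNC)
		pvmw.flags = PVMW_SYNC;

	/*
	 * page_vma_mapped_walk()/map_pte(): with PVMW_SYNC the walk takes
	 * the PTL up front via pte_offset_map_lock() instead of skipping
	 * not-present PTEs locklessly, so a transient pte_none() during a
	 * break-before-make window can no longer hide part of the folio.
	 */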
>
> Thanks,
> Ryan
>
Thanks
Barry
Thread overview: 116+ messages
2023-10-25 14:45 [PATCH v3 0/4] " Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 1/4] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2024-02-22 10:19 ` David Hildenbrand
2024-02-22 10:20 ` David Hildenbrand
2024-02-26 17:41 ` Ryan Roberts
2024-02-27 17:10 ` Ryan Roberts
2024-02-27 19:17 ` David Hildenbrand
2024-02-28 9:37 ` Ryan Roberts
2024-02-28 12:12 ` David Hildenbrand
2024-02-28 14:57 ` Ryan Roberts
2024-02-28 15:12 ` David Hildenbrand
2024-02-28 15:18 ` Ryan Roberts
2024-03-01 16:27 ` Ryan Roberts
2024-03-01 16:31 ` Matthew Wilcox
2024-03-01 16:44 ` Ryan Roberts
2024-03-01 17:00 ` David Hildenbrand
2024-03-01 17:14 ` Ryan Roberts
2024-03-01 17:18 ` David Hildenbrand
2024-03-01 17:06 ` Ryan Roberts
2024-03-04 4:52 ` Barry Song
2024-03-04 5:42 ` Barry Song
2024-03-05 7:41 ` Ryan Roberts
2024-03-01 16:31 ` Ryan Roberts
2024-03-01 16:32 ` David Hildenbrand
2024-03-04 16:03 ` Ryan Roberts
2024-03-04 17:30 ` David Hildenbrand
2024-03-04 18:38 ` Ryan Roberts
2024-03-04 20:50 ` David Hildenbrand
2024-03-04 21:55 ` Ryan Roberts
2024-03-04 22:02 ` David Hildenbrand
2024-03-04 22:34 ` Ryan Roberts
2024-03-05 6:11 ` Huang, Ying
2024-03-05 8:35 ` David Hildenbrand
2024-03-05 8:46 ` Ryan Roberts
2024-02-28 13:33 ` Matthew Wilcox
2024-02-28 14:24 ` Ryan Roberts
2024-02-28 14:59 ` Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 2/4] mm: swap: Remove struct percpu_cluster Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 3/4] mm: swap: Simplify ssd behavior when scanner steals entry Ryan Roberts
2023-10-25 14:45 ` [PATCH v3 4/4] mm: swap: Swap-out small-sized THP without splitting Ryan Roberts
2023-10-30 8:18 ` Huang, Ying
2023-10-30 13:59 ` Ryan Roberts
2023-10-31 8:12 ` Huang, Ying
2023-11-03 11:42 ` Ryan Roberts
2023-11-02 7:40 ` Barry Song
2023-11-02 10:21 ` Ryan Roberts
2023-11-02 22:36 ` Barry Song
2023-11-03 11:31 ` Ryan Roberts
2023-11-03 13:57 ` Steven Price
2023-11-04 9:34 ` Barry Song
2023-11-06 10:12 ` Steven Price
2023-11-06 21:39 ` Barry Song
2023-11-08 11:51 ` Steven Price
2023-11-07 12:46 ` Ryan Roberts
2023-11-07 18:05 ` Barry Song
2023-11-08 11:23 ` Barry Song
2023-11-08 20:20 ` Ryan Roberts
2023-11-08 21:04 ` Barry Song
2023-11-04 5:49 ` Barry Song
2024-02-05 9:51 ` Barry Song
2024-02-05 12:14 ` Ryan Roberts
2024-02-18 23:40 ` Barry Song
2024-02-20 20:03 ` Ryan Roberts
2024-03-05 9:00 ` Ryan Roberts
2024-03-05 9:54 ` Barry Song
2024-03-05 10:44 ` Ryan Roberts
2024-02-27 12:28 ` Ryan Roberts
2024-02-27 13:37 ` Ryan Roberts
2024-02-28 2:46 ` Barry Song [this message]
2024-02-22 7:05 ` Barry Song
2024-02-22 10:09 ` David Hildenbrand
2024-02-23 9:46 ` Barry Song
2024-02-27 12:05 ` Ryan Roberts
2024-02-28 1:23 ` Barry Song
2024-02-28 9:34 ` David Hildenbrand
2024-02-28 23:18 ` Barry Song
2024-02-28 15:57 ` Ryan Roberts
2023-11-29 7:47 ` [PATCH v3 0/4] " Barry Song
2023-11-29 12:06 ` Ryan Roberts
2023-11-29 20:38 ` Barry Song
2024-01-18 11:10 ` [PATCH RFC 0/6] mm: support large folios swap-in Barry Song
2024-01-18 11:10 ` [PATCH RFC 1/6] arm64: mm: swap: support THP_SWAP on hardware with MTE Barry Song
2024-01-26 23:14 ` Chris Li
2024-02-26 2:59 ` Barry Song
2024-01-18 11:10 ` [PATCH RFC 2/6] mm: swap: introduce swap_nr_free() for batched swap_free() Barry Song
2024-01-26 23:17 ` Chris Li
2024-02-26 4:47 ` Barry Song
2024-01-18 11:10 ` [PATCH RFC 3/6] mm: swap: make should_try_to_free_swap() support large-folio Barry Song
2024-01-26 23:22 ` Chris Li
2024-01-18 11:10 ` [PATCH RFC 4/6] mm: support large folios swapin as a whole Barry Song
2024-01-27 19:53 ` Chris Li
2024-02-26 7:29 ` Barry Song
2024-01-27 20:06 ` Chris Li
2024-02-26 7:31 ` Barry Song
2024-01-18 11:10 ` [PATCH RFC 5/6] mm: rmap: weaken the WARN_ON in __folio_add_anon_rmap() Barry Song
2024-01-18 11:54 ` David Hildenbrand
2024-01-23 6:49 ` Barry Song
2024-01-29 3:25 ` Chris Li
2024-01-29 10:06 ` David Hildenbrand
2024-01-29 16:31 ` Chris Li
2024-02-26 5:05 ` Barry Song
2024-04-06 23:27 ` Barry Song
2024-01-27 23:41 ` Chris Li
2024-01-18 11:10 ` [PATCH RFC 6/6] mm: madvise: don't split mTHP for MADV_PAGEOUT Barry Song
2024-01-29 2:15 ` Chris Li
2024-02-26 6:39 ` Barry Song
2024-02-27 12:22 ` Ryan Roberts
2024-02-27 22:39 ` Barry Song
2024-02-27 14:40 ` Ryan Roberts
2024-02-27 18:57 ` Barry Song
2024-02-28 3:49 ` Barry Song
2024-01-18 15:25 ` [PATCH RFC 0/6] mm: support large folios swap-in Ryan Roberts
2024-01-18 23:54 ` Barry Song
2024-01-19 13:25 ` Ryan Roberts
2024-01-27 14:27 ` Barry Song
2024-01-29 9:05 ` Huang, Ying