From: David Hildenbrand <david@redhat.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Barry Song <baohua@kernel.org>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
willy@infradead.org, wangkefeng.wang@huawei.com,
chrisl@kernel.org, ying.huang@intel.com, 21cnbao@gmail.com,
ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com,
ioworker0@gmail.com, da.gomez@samsung.com, p.raghav@samsung.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 0/9] support large folio swap-out and swap-in for shmem
Date: Thu, 20 Jun 2024 18:51:58 +0200 [thread overview]
Message-ID: <1cdf6a43-9599-4e34-8f70-29ed16295a60@redhat.com> (raw)
In-Reply-To: <f302f177-44e6-b74d-f140-b3bdf734b4b1@google.com>
On 20.06.24 18:27, Hugh Dickins wrote:
> On Thu, 20 Jun 2024, David Hildenbrand wrote:
>
>>>
>>> (I do have doubts about Barry's: the "_new" in folio_add_new_anon_rmap()
>>> was all about optimizing a known-exclusive case, so it surprises me
>>> to see it being extended to non-exclusive; and I worry over how its
>>> atomic_set(&page->_mapcount, 0)s can be safe when non-exclusive (but
>>> I've never caught up with David's exclusive changes, I'm out of date).
>>
>> We discussed that a while ago: if we wouldn't be holding the folio lock in the
>> "folio == swapcache" at that point (which we do for both do_swap_page() and
>> unuse_pte()) things would already be pretty broken.
>
> You're thinking of the non-atomic-ness of atomic_set(): I agree that
> the folio lock makes that safe (against other adds; but not against
> concurrent removes, which could never occur with the old "_new" usage).
>
> But what I'm worried about is the 0 in atomic_set(&page->_mapcount, 0):
> once the folio lock has been dropped, another non-exclusive add could
> come in and set _mapcount again to 0 instead of to 1 (mapcount 1 when
> it should be 2)?
That would mean that we have someone checking folio_test_anon() before
doing the folio_add_new*(), but *not* performing that check under the
folio lock, such that we can race with another folio_add_new*() call
(or not checking folio_test_anon() at all, which would be similarly
broken).

When thinking about this in the past I concluded that such code would be
inherently racy and wrong already, and that the existing
VM_WARN_ON_FOLIO() would have revealed that race.
Whoever calls folio_add_new*() either (a) just allocated that thing and
can be sure that nobody else does something concurrently -- no need for
the folio lock; or (b) calls it only when !folio_test_anon(), and
synchronizes against any other possible races using the folio lock.
swapin code falls into category (b), everything else we have in the tree
into category (a).
So far my thinking :)
>
>>
>> That's why I added this a while ago:
>>
>> 	if (unlikely(!folio_test_anon(folio))) {
>> 		VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
>> 		/*
>> 		 * For a PTE-mapped large folio, we only know that the single
>> 		 * PTE is exclusive. Further, __folio_set_anon() might not get
>> 		 * folio->index right when not given the address of the head
>> 		 * page.
>> 		 */
>> 		...
>>
>> We should probably move that VM_WARN_ON_FOLIO() to folio_add_new_anon_rmap()
>> and document that it's required in the non-exclusive case.
>>
>>>
>>> But even if those are wrong, I'd expect them to tend towards a mapped
>>> page becoming unreclaimable, then "Bad page map" when munmapped,
>>> not to any of the double-free symptoms I've actually seen.)
>>
>> What's the first known-good commit?
>
> I cannot answer that with any certainty: we're on the shifting sands
> of next and mm-unstable, and the bug itself is looking rather like
> something which gets amplified or masked by other changes - witness
> my confident arrival at Barry's series as introducing the badness,
> only for a longer run then to contradict that conclusion.
>
> There was no sign of a problem in a 20-hour run of the same load on
> rc3-based next-2024-06-13 (plus my posted fixes); there has been no
> sign of this problem on 6.10-rc1, rc2, rc3 (but I've not tried rc4
> itself, mm.git needing the more urgent attention). mm-everything-
> 2024-06-15 (minus Chris's mTHP swap) did not show this problem, but
> did show filemap_get_folio() hang. mm-everything-2024-06-18 (minus
> Baolin's mTHP shmem swap) is where I'm hunting it.
Good, as long as we suspect only something very recent is affected.
There was this report against rc3 upstream:
https://lore.kernel.org/all/CA+G9fYvKmr84WzTArmfaypKM9+=Aw0uXCtuUKHQKFCNMGJyOgQ@mail.gmail.com/
It's only been observed once, and who knows what's happening in that NFS
setup. I suspected NUMA hinting code, but I am not sure if that really
explains the issue ...
--
Cheers,
David / dhildenb