From: Charan Teja Kalla <quic_charante@quicinc.com>
To: "Matthew Wilcox (Oracle)" <willy@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>,
David Hildenbrand <david@redhat.com>,
Suren Baghdasaryan <surenb@google.com>
Subject: Re: [PATCH] mm: Migrate high-order folios in swap cache correctly
Date: Wed, 7 Feb 2024 20:08:28 +0530 [thread overview]
Message-ID: <d79f6acb-4663-0ce9-72cc-3f881c23a933@quicinc.com> (raw)
In-Reply-To: <69cb784f-578d-ded1-cd9f-c6db04696336@quicinc.com>
I am sorry that I didn't check the latest kernel before posting. We are
working on the 6.1 LTS kernel, where the mentioned issue is observed. I
assume the analysis still holds true for the 6.1 kernel.
I see that this is not a problem on the latest kernels because of [1],
by David.
[1]
https://lore.kernel.org/linux-mm/20230821160849.531668-2-david@redhat.com/#Z31mm:huge_memory.c
Thanks,
Charan
On 2/6/2024 8:24 PM, Charan Teja Kalla wrote:
> Hi Matthew,
>
> It seems that the issue is not completely fixed by the below change. This
> time it is because ->private is not filled for the subpages of a THP,
> which can result in the same type of issue this patch tried to fix:
>
> 1) For reclaim of an anon THP, the page is first added to the swap cache.
> As part of this, contiguous swap space of THP size is allocated, the
> folio_nr_pages swap cache indices are filled with the folio, and each
> subpage's ->private is set to its respective swap entry:
> add_to_swap_cache():
>     for (i = 0; i < nr; i++) {
>         set_page_private(folio_page(folio, i), entry.val + i);
>         xas_store(&xas, folio);
>         xas_next(&xas);
>     }
>
> 2) Migrating a folio that is sitting in the swap cache sets only the head
> page's ->private on newfolio to the swap entry; the tail pages are missed:
> folio_migrate_mapping():
>     newfolio->private = folio_get_private(folio);
>     for (i = 0; i < entries; i++) {
>         xas_store(&xas, newfolio);
>         xas_next(&xas);
>     }
>
> 3) Now, when this migrated page that is still sitting in the swap cache
> is migrated again, the THP can be split (see
> migrate_pages()->try_split_thp), which ends up storing the subpages in
> the respective swap cache indices, but these subpages' ->private does
> not contain the valid swap entry:
> __split_huge_page():
>     for (i = nr - 1; i >= 1; i--) {
>         if (...) {
>             ...
>         } else if (swap_cache) {
>             __xa_store(&swap_cache->i_pages, offset + i,
>                        head + i, 0);
>         }
>     }
>
> This leads to a state where all the subpages are sitting in the swap
> cache with ->private not holding a valid value. When
> delete_from_swap_cache() is tried on a subpage, it replaces the __wrong
> swap cache index__ with a NULL/shadow value and subsequently decreases
> the refcount of the subpage.
> for (i = 0; i < nr; i++) {
>     if (subpage == page)
>         continue;
>
>     free_page_and_swap_cache(subpage):
>         free_swap_cache(page);    (calls delete_from_swap_cache())
>         put_page(page);
> }
>
> Consider a folio just sitting in the swap cache: its subpages will now
> have a refcount of 1, down from 3 (isolate + swap cache +
> lru_add_page_tail()) after the decrements from the swap cache deletion
> and put_page. Now migrate_pages() is tried again on these split THP
> pages, but the refcount of 1 makes the pages be freed directly (in
> unmap_and_move).
>
> So this leads to a state of "freed page entry in the swap cache", which
> can cause various corruptions, loops under the RCU lock (observed in
> mapping_get_entry), etc.
>
> It seems the below change is also required here:
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9f5f52d..8049f4e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -427,10 +427,8 @@ int folio_migrate_mapping(struct address_space *mapping,
> folio_ref_add(newfolio, nr); /* add cache reference */
> if (folio_test_swapbacked(folio)) {
> __folio_set_swapbacked(newfolio);
> - if (folio_test_swapcache(folio)) {
> + if (folio_test_swapcache(folio))
> folio_set_swapcache(newfolio);
> - newfolio->private = folio_get_private(folio);
> - }
> entries = nr;
> } else {
> VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> @@ -446,6 +444,8 @@ int folio_migrate_mapping(struct address_space *mapping,
>
> /* Swap cache still stores N entries instead of a high-order entry */
> for (i = 0; i < entries; i++) {
> + set_page_private(folio_page(newfolio, i),
> + folio_page(folio, i)->private);
I see that __split_huge_page_tail() expects page_tail->private to be
zero, so this patch is also wrong :( .
> xas_store(&xas, newfolio);
> xas_next(&xas);
> }
>
>
>
> On 12/14/2023 10:28 AM, Matthew Wilcox (Oracle) wrote:
>> From: Charan Teja Kalla <quic_charante@quicinc.com>
>>
>> Large folios occupy N consecutive entries in the swap cache
>> instead of using multi-index entries like the page cache.
>> However, if a large folio is re-added to the LRU list, it can
>> be migrated. The migration code was not aware of the difference
>> between the swap cache and the page cache and assumed that a single
>> xas_store() would be sufficient.
>>
>> This leaves potentially many stale pointers to the now-migrated folio
>> in the swap cache, which can lead to almost arbitrary data corruption
>> in the future. This can also manifest as infinite loops with the
>> RCU read lock held.
>>
>> Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
>> [modifications to the changelog & tweaked the fix]
>> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
>> ---
>> mm/migrate.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index d9d2b9432e81..2d67ca47d2e2 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -405,6 +405,7 @@ int folio_migrate_mapping(struct address_space *mapping,
>> int dirty;
>> int expected_count = folio_expected_refs(mapping, folio) + extra_count;
>> long nr = folio_nr_pages(folio);
>> + long entries, i;
>>
>> if (!mapping) {
>> /* Anonymous page without mapping */
>> @@ -442,8 +443,10 @@ int folio_migrate_mapping(struct address_space *mapping,
>> folio_set_swapcache(newfolio);
>> newfolio->private = folio_get_private(folio);
>> }
>> + entries = nr;
>> } else {
>> VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
>> + entries = 1;
>> }
>>
>> /* Move dirty while page refs frozen and newpage not yet exposed */
>> @@ -453,7 +456,11 @@ int folio_migrate_mapping(struct address_space *mapping,
>> folio_set_dirty(newfolio);
>> }
>>
>> - xas_store(&xas, newfolio);
>> + /* Swap cache still stores N entries instead of a high-order entry */
>> + for (i = 0; i < entries; i++) {
>> + xas_store(&xas, newfolio);
>> + xas_next(&xas);
>> + }
>>
>> /*
>> * Drop cache reference from old page by unfreezing
Thread overview: 4+ messages
2023-12-14 4:58 Matthew Wilcox (Oracle)
2023-12-14 22:11 ` Andrew Morton
2024-02-06 14:54 ` Charan Teja Kalla
2024-02-07 14:38 ` Charan Teja Kalla [this message]