From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <da729e0b-4eae-451c-baec-e58a3b5a2752@arm.com>
Date: Thu, 7 Mar 2024 08:56:27 +0000
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v3 10/18] mm: Allow non-hugetlb large folios to be batch
 processed
Content-Language: en-GB
To: Matthew Wilcox <willy@infradead.org>, Zi Yan <ziy@nvidia.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, linux-mm@kvack.org,
 Yang Shi <shy828301@gmail.com>, Huang Ying <ying.huang@intel.com>
References: <20240227174254.710559-1-willy@infradead.org>
 <20240227174254.710559-11-willy@infradead.org>
 <367a14f7-340e-4b29-90ae-bc3fcefdd5f4@arm.com>
 <ZeiVIcq6VyWX13jQ@casper.infradead.org>
 <85cc26ed-6386-4d6b-b680-1e5fba07843f@arm.com>
 <36bdda72-2731-440e-ad15-39b845401f50@arm.com>
 <03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com>
 <ZejKRs0Ft9w2nm0s@casper.infradead.org>
 <ZejmTM1XbE0mPA2A@casper.infradead.org>
From: Ryan Roberts <ryan.roberts@arm.com>
In-Reply-To: <ZejmTM1XbE0mPA2A@casper.infradead.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On 06/03/2024 21:55, Matthew Wilcox wrote:
> On Wed, Mar 06, 2024 at 07:55:50PM +0000, Matthew Wilcox wrote:
>> Hang on, I think I see it.  It is a race between folio freeing and
>> deferred_split_scan(), but page migration is absolved.  Look:
>>
>> CPU 1: deferred_split_scan:
>> spin_lock_irqsave(split_queue_lock)
>> list_for_each_entry_safe()
>> folio_try_get()
>> list_move(&folio->_deferred_list, &list);
>> spin_unlock_irqrestore(split_queue_lock)
>> list_for_each_entry_safe() {
>> 	folio_trylock() <- fails
>> 	folio_put(folio);
>>
>> CPU 2: folio_put:
>> folio_undo_large_rmappable
>>         ds_queue = get_deferred_split_queue(folio);
>>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>                 list_del_init(&folio->_deferred_list);
>> *** at this point CPU 1 is not holding the split_queue_lock; the
>> folio is on the local list.  Which we just corrupted ***

Wow, this would have taken me weeks...

I just want to make sure I've understood correctly: CPU1's folio_put() is not the last reference, and it keeps iterating through the local list. Then CPU2 does the final folio_put() which causes list_del_init() to modify the local list concurrently with CPU1's iteration, so CPU1 probably goes into the weeds?
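
To make that interleaving concrete, here is a minimal single-threaded userspace sketch, not kernel code. The list helpers are re-implemented stand-ins for the kernel's, and `walk_terminates()` is a hypothetical name invented for this illustration. It replays the two CPUs' steps in a fixed order: the walker caches `next` the way list_for_each_entry_safe() does, then the "other CPU" calls list_del_init() on that cached node. This shows only one possible outcome of the race; the real corruption depends on timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's struct list_head (userspace model). */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Same semantics as the kernel helper: unlink, then point the node at itself. */
static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

/*
 * Hypothetical helper: replay the interleaving on one thread.
 * Returns true if the "CPU 1" walk gets back to the list head within
 * max_steps, false if it is still walking after that many steps.
 */
static bool walk_terminates(bool cpu2_deletes, int max_steps)
{
	struct list_head local, a, b, c;
	struct list_head *pos, *next;
	int steps = 0;

	list_init(&local);
	list_add_tail(&a, &local);
	list_add_tail(&b, &local);
	list_add_tail(&c, &local);

	/* Open-coded "safe" iteration: 'next' is cached before the body runs. */
	for (pos = local.next, next = pos->next; pos != &local;
	     pos = next, next = pos->next) {
		if (++steps > max_steps)
			return false;
		/* While CPU 1 is on 'a', CPU 2 does the final put on 'b'. */
		if (pos == &a && cpu2_deletes)
			list_del_init(&b);
	}
	return true;
}
```

With `cpu2_deletes` false the walk visits the three nodes and stops. With it true, the cached `next` points at a node that now links only to itself, so the walk never reaches the local list head again.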

>>
>> Now anything can happen.  It's a pretty tight race that involves at
>> least two CPUs (CPU 2 might have been the one to have the folio locked
>> at the time CPU 1 called folio_trylock()).  But I definitely widened
>> the window by moving the decrement of the refcount and the removal from
>> the deferred list further apart.
>>
>>
>> OK, so what's the solution here?  Personally I favour using a
>> folio_batch in deferred_split_scan() to hold the folios that we're
>> going to try to remove instead of a linked list.  Other ideas that are
>> perhaps less intrusive?
> 
> I looked at a few options, but I think we need to keep the refcount
> elevated until we've got the folios back on the deferred split list.
> And we can't call folio_put() while holding the split_queue_lock or
> we'll deadlock.  So we need to maintain a list of folios that isn't
> linked through deferred_list.  Anyway, this is basically untested,
> except that it compiles.

If we can't call folio_put() under spinlock, then I agree.

> 
> Opinions?  Better patches?

I assume the fact that one scan is limited to freeing a batch-worth of folios is not a problem? The shrinker will keep calling while there are folios on the deferred list?
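
A rough userspace model of the per-scan limit being asked about, under stated assumptions: `BATCH_CAPACITY` is a placeholder for the folio_batch size (15 is assumed here, not checked against this tree), `batch_add()` mimics folio_batch_add() returning the number of slots still free, and `scan_once()` is a hypothetical helper that mirrors only the loop-exit logic of the proposed first loop (it assumes every folio is successfully pinned and locked, and that nr_to_scan starts at 1 or more).

```c
#include <assert.h>
#include <stddef.h>

#define BATCH_CAPACITY 15u /* assumption: stand-in for the folio_batch size */

struct batch {
	unsigned int nr;
	void *items[BATCH_CAPACITY];
};

static void batch_init(struct batch *b) { b->nr = 0; }

/* Add one item and return the number of slots still free afterwards. */
static unsigned int batch_add(struct batch *b, void *item)
{
	b->items[b->nr++] = item;
	return BATCH_CAPACITY - b->nr;
}

/*
 * Hypothetical model of one scan over a queue of queue_len entries.
 * Returns how many entries were taken; *nr_to_scan is decremented once
 * per entry, as in the proposed loop.
 */
static unsigned int scan_once(unsigned int queue_len, unsigned long *nr_to_scan)
{
	static int dummy; /* placeholder object standing in for a folio */
	struct batch b;
	unsigned int taken = 0;

	batch_init(&b);
	while (taken < queue_len) {
		taken++;
		if (batch_add(&b, &dummy) == 0) {
			--*nr_to_scan; /* batch is full: stop this pass */
			break;
		}
		if (!--*nr_to_scan)
			break;
	}
	return taken;
}
```

In this model a single pass takes at most `BATCH_CAPACITY` entries however long the queue is, so draining a long queue relies on the caller invoking the scan again.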

> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index fd745bcc97ff..0120a47ea7a1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3312,7 +3312,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  	struct pglist_data *pgdata = NODE_DATA(sc->nid);
>  	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
>  	unsigned long flags;
> -	LIST_HEAD(list);
> +	struct folio_batch batch;
>  	struct folio *folio, *next;
>  	int split = 0;
>  
> @@ -3321,37 +3321,41 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>  		ds_queue = &sc->memcg->deferred_split_queue;
>  #endif
>  
> +	folio_batch_init(&batch);
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>  	/* Take pin on all head pages to avoid freeing them under us */
>  	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>  							_deferred_list) {
> -		if (folio_try_get(folio)) {
> -			list_move(&folio->_deferred_list, &list);
> -		} else {
> -			/* We lost race with folio_put() */
> -			list_del_init(&folio->_deferred_list);
> -			ds_queue->split_queue_len--;
> +		if (!folio_try_get(folio))
> +			continue;
> +		if (!folio_trylock(folio))
> +			continue;
> +		list_del_init(&folio->_deferred_list);
> +		if (folio_batch_add(&batch, folio) == 0) {
> +			--sc->nr_to_scan;
> +			break;
>  		}
>  		if (!--sc->nr_to_scan)
>  			break;
>  	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> -	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> -		if (!folio_trylock(folio))
> -			goto next;
> -		/* split_huge_page() removes page from list on success */
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
>  		if (!split_folio(folio))
>  			split++;
>  		folio_unlock(folio);
> -next:
> -		folio_put(folio);
>  	}
>  
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -	list_splice_tail(&list, &ds_queue->split_queue);
> +	while ((folio = folio_batch_next(&batch)) != NULL) {
> +		if (!folio_test_large(folio))
> +			continue;
> +		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
> +	}
>  	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>  
> +	folios_put(&batch);
> +
>  	/*
>  	 * Stop shrinker if we didn't split any page, but the queue is empty.
>  	 * This can happen if pages were freed under us.

I've added this patch to my branch and tested (still without the patch that I fingered as the culprit originally, for now). Unfortunately it is still blowing up at about the same rate, although it looks very different now. I've seen bad things twice. The first time was RCU stalls, but systemd had turned the log level down, so there was no stack trace and I didn't manage to get any further information. The second time, this:

[  338.519401] Unable to handle kernel paging request at virtual address fffc001b13a8c870
[  338.519402] Unable to handle kernel paging request at virtual address fffc001b13a8c870
[  338.519407] Mem abort info:
[  338.519407]   ESR = 0x0000000096000004
[  338.519408]   EC = 0x25: DABT (current EL), IL = 32 bits
[  338.519588] Unable to handle kernel paging request at virtual address fffc001b13a8c870
[  338.519591] Mem abort info:
[  338.519592]   ESR = 0x0000000096000004
[  338.519593]   EC = 0x25: DABT (current EL), IL = 32 bits
[  338.519594]   SET = 0, FnV = 0
[  338.519595]   EA = 0, S1PTW = 0
[  338.519596]   FSC = 0x04: level 0 translation fault
[  338.519597] Data abort info:
[  338.519597]   ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[  338.519598]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  338.519599]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  338.519600] [fffc001b13a8c870] address between user and kernel address ranges
[  338.519602] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[  338.519605] Modules linked in:
[  338.519607] CPU: 43 PID: 3234 Comm: usemem Not tainted 6.8.0-rc5-00465-g279cb41b481e-dirty #3
[  338.519610] Hardware name: linux,dummy-virt (DT)
[  338.519611] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  338.519613] pc : down_read_trylock+0x2c/0xd0
[  338.519618] lr : folio_lock_anon_vma_read+0x74/0x2c8
[  338.519623] sp : ffff800087f935c0
[  338.519623] x29: ffff800087f935c0 x28: 0000000000000000 x27: ffff800087f937e0
[  338.519626] x26: 0000000000000001 x25: ffff800087f937a8 x24: fffffc0007258180
[  338.519628] x23: ffff800087f936c8 x22: fffc001b13a8c870 x21: ffff0000f7d51d69
[  338.519630] x20: ffff0000f7d51d68 x19: fffffc0007258180 x18: 0000000000000000
[  338.519632] x17: 0000000000000001 x16: ffff0000c90ab458 x15: 0000000000000040
[  338.519634] x14: ffff0000c8c7b558 x13: 0000000000000228 x12: 000040f22f534640
[  338.519637] x11: 0000000000000000 x10: 0000000000000000 x9 : ffff800080338b3c
[  338.519639] x8 : ffff800087f93618 x7 : 0000000000000000 x6 : ffff0000c9692f50
[  338.519641] x5 : ffff800087f936b0 x4 : 0000000000000001 x3 : ffff0000d70d9140
[  338.519643] x2 : 0000000000000001 x1 : fffc001b13a8c870 x0 : fffc001b13a8c870
[  338.519645] Call trace:
[  338.519646]  down_read_trylock+0x2c/0xd0
[  338.519648]  folio_lock_anon_vma_read+0x74/0x2c8
[  338.519650]  rmap_walk_anon+0x1d8/0x2c0
[  338.519652]  folio_referenced+0x1b4/0x1e0
[  338.519655]  shrink_folio_list+0x768/0x10c8
[  338.519658]  shrink_lruvec+0x5dc/0xb30
[  338.519660]  shrink_node+0x4d8/0x8b0
[  338.519662]  do_try_to_free_pages+0xe0/0x5a8
[  338.519665]  try_to_free_mem_cgroup_pages+0x128/0x2d0
[  338.519667]  try_charge_memcg+0x114/0x658
[  338.519671]  __mem_cgroup_charge+0x6c/0xd0
[  338.519672]  __handle_mm_fault+0x42c/0x1640
[  338.519675]  handle_mm_fault+0x70/0x290
[  338.519677]  do_page_fault+0xfc/0x4d8
[  338.519681]  do_translation_fault+0xa4/0xc0
[  338.519682]  do_mem_abort+0x4c/0xa8
[  338.519685]  el0_da+0x2c/0x78
[  338.519687]  el0t_64_sync_handler+0xb8/0x130
[  338.519689]  el0t_64_sync+0x190/0x198
[  338.519692] Code: aa0003e1 b9400862 11000442 b9000862 (f9400000) 
[  338.519693] ---[ end trace 0000000000000000 ]---

The fault is when trying to do an atomic_long_read(&sem->count) here:

struct anon_vma *folio_lock_anon_vma_read(struct folio *folio,
					  struct rmap_walk_control *rwc)
{
	struct anon_vma *anon_vma = NULL;
	struct anon_vma *root_anon_vma;
	unsigned long anon_mapping;

retry:
	rcu_read_lock();
	anon_mapping = (unsigned long)READ_ONCE(folio->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;
	if (!folio_mapped(folio))
		goto out;

	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
	root_anon_vma = READ_ONCE(anon_vma->root);
	if (down_read_trylock(&root_anon_vma->rwsem)) { <<<<<<<

I guess we are still corrupting folios?
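
For reference, a small userspace sketch of the pointer decoding that folio_lock_anon_vma_read() does, applied to values from the register dump above. The mapping of registers to variables (x21 as anon_mapping, x20 as anon_vma, x22/x0 as the faulting rwsem address) is a guess from the dump, not confirmed by disassembly. The `is_canonical_48()` check assumes a 48-bit VA configuration, and both helper names below are invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low-bit tags stored in folio->mapping (values as in page-flags.h). */
#define PAGE_MAPPING_ANON    UINT64_C(0x1)
#define PAGE_MAPPING_MOVABLE UINT64_C(0x2)
#define PAGE_MAPPING_FLAGS   (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)

/* True if the tagged mapping value denotes a plain anon mapping. */
static bool mapping_is_anon(uint64_t mapping)
{
	return (mapping & PAGE_MAPPING_FLAGS) == PAGE_MAPPING_ANON;
}

/* Strip the tag to recover the anon_vma pointer value. */
static uint64_t mapping_to_anon_vma(uint64_t mapping)
{
	return mapping - PAGE_MAPPING_ANON;
}

/*
 * Assumption: 48-bit virtual addresses, so a valid address has its top
 * 16 bits either all zero (user) or all one (kernel).
 */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 48;

	return top == 0 || top == UINT64_C(0xffff);
}
```

Under those assumptions, 0xffff0000f7d51d69 decodes to a plausible anon_vma at 0xffff0000f7d51d68, while the faulting address 0xfffc001b13a8c870 is in neither address range. That would be consistent with anon_vma->root holding garbage rather than the folio->mapping value itself being malformed.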