From: Dev Jain <dev.jain@arm.com>
Date: Mon, 5 Aug 2024 15:21:36 +0530
Subject: Re: Race condition observed between page migration and page fault handling on arm64 machines
To: David Hildenbrand, Will Deacon
Cc: akpm@linux-foundation.org, willy@infradead.org, ryan.roberts@arm.com,
 anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, osalvador@suse.de,
 baolin.wang@linux.alibaba.com, dave.hansen@linux.intel.com,
 baohua@kernel.org, ioworker0@gmail.com, gshan@redhat.com, mark.rutland@arm.com,
 kirill.shutemov@linux.intel.com, hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, broonie@kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
In-Reply-To: <9359caf7-81a8-45d9-9787-9009b3b2eed3@redhat.com>
References: <20240801081657.1386743-1-dev.jain@arm.com>
 <3b82e195-5871-4880-9ce5-d01bb751f471@redhat.com>
 <92df0ee1-d3c9-41e2-834c-284127ae2c4c@arm.com>
 <19902a48-c59b-4e3b-afc5-e792506c2fd6@redhat.com>
 <6486a2b1-45ef-44b6-bd84-d402fc121373@redhat.com>
 <20240801134358.GB4794@willie-the-truck>
 <9359caf7-81a8-45d9-9787-9009b3b2eed3@redhat.com>

On 8/1/24 19:18, David Hildenbrand wrote:
> On 01.08.24 15:43, Will Deacon wrote:
>> On Thu, Aug 01, 2024 at 03:26:57PM +0200, David Hildenbrand wrote:
>>> On 01.08.24 15:13, David Hildenbrand wrote:
>>>>>>> To dampen the tradeoff, we could do this in shmem_fault()
>>>>>>> instead? But then, this would mean that we do this in all kinds
>>>>>>> of vma->vm_ops->fault, only when we discover another reference
>>>>>>> count race condition :) Doing this in do_fault() should solve
>>>>>>> this once and for all. In fact, do_pte_missing() may call
>>>>>>> do_anonymous_page() or do_fault(), and I just noticed that the
>>>>>>> former already checks this using vmf_pte_changed().
>>>>>>
>>>>>> What I am still missing is why this is (a) arm64 only; and (b) if
>>>>>> this is something we should really worry about. There are other
>>>>>> reasons (e.g., speculative references) why migration could
>>>>>> temporarily fail. Does it happen so often that it is really
>>>>>> something we have to worry about?
>>>>>
>>>>>
>>>>> (a) See discussion at [1]; I guess it passes on x86, which is quite
>>>>> strange since the race is clearly arch-independent.
>>>>
>>>> Yes, I think this is what we have to understand. Is the race simply
>>>> less likely to trigger on x86?
>>>>
>>>> I would assume that it would trigger on any arch.
>>>>
>>>> I just ran it on an x86 VM with 2 NUMA nodes and it also seems to
>>>> work here.
>>>>
>>>> Is this maybe related to deferred flushing? Such that the other CPU
>>>> will, by accident, just observe the !pte_none a little less often?
>>>>
>>>> But arm64 also usually defers flushes, right? At least unless
>>>> ARM64_WORKAROUND_REPEAT_TLBI is around. With that we never do deferred
>>>> flushes.
>>>
>>> Bingo!
>>>
>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>> index e51ed44f8b53..ce94b810586b 100644
>>> --- a/mm/rmap.c
>>> +++ b/mm/rmap.c
>>> @@ -718,10 +718,7 @@ static void set_tlb_ubc_flush_pending(struct mm_struct *mm, pte_t pteval,
>>>    */
>>>   static bool should_defer_flush(struct mm_struct *mm, enum ttu_flags flags)
>>>   {
>>> -       if (!(flags & TTU_BATCH_FLUSH))
>>> -               return false;
>>> -
>>> -       return arch_tlbbatch_should_defer(mm);
>>> +       return false;
>>>   }
>>>
>>>
>>> On x86:
>>>
>>> # ./migration
>>> TAP version 13
>>> 1..1
>>> # Starting 1 tests from 1 test cases.
>>> #  RUN           migration.shared_anon ...
>>> Didn't migrate 1 pages
>>> # migration.c:170:shared_anon:Expected migrate(ptr, self->n1, self->n2) (-2) == 0 (0)
>>> # shared_anon: Test terminated by assertion
>>> #          FAIL  migration.shared_anon
>>> not ok 1 migration.shared_anon
>>>
>>>
>>> It fails all of the time!
>>
>> Nice work! I suppose that makes sense as, with the eager TLB
>> invalidation, the window between the other CPU faulting and the
>> migration entry being written is fairly wide.
>>
>> Not sure about a fix though :/ It feels a bit overkill to add a new
>> invalid pte encoding just for this.
>
> Something like that might make the test happy in most cases:
>
> diff --git a/tools/testing/selftests/mm/migration.c b/tools/testing/selftests/mm/migration.c
> index 6908569ef406..4c18bfc13b94 100644
> --- a/tools/testing/selftests/mm/migration.c
> +++ b/tools/testing/selftests/mm/migration.c
> @@ -65,6 +65,7 @@ int migrate(uint64_t *ptr, int n1, int n2)
>         int ret, tmp;
>         int status = 0;
>         struct timespec ts1, ts2;
> +       int errors = 0;
>
>         if (clock_gettime(CLOCK_MONOTONIC, &ts1))
>                 return -1;
> @@ -79,12 +80,17 @@ int migrate(uint64_t *ptr, int n1, int n2)
>                 ret = move_pages(0, 1, (void **) &ptr, &n2, &status,
>                                 MPOL_MF_MOVE_ALL);
>                 if (ret) {
> -                       if (ret > 0)
> +                       if (ret > 0) {
> +                               if (++errors < 100)
> +                                       continue;
>                                 printf("Didn't migrate %d pages\n", ret);
> -                       else
> +                       } else {
>                                 perror("Couldn't migrate pages");
> +                       }
>                         return -2;
>                 }
> +               /* Progress! */
> +               errors = 0;
>
>                 tmp = n2;
>                 n2 = n1;
>
>
> [root@localhost mm]# ./migration
> TAP version 13
> 1..1
> # Starting 1 tests from 1 test cases.
> #  RUN           migration.shared_anon ...
> #            OK  migration.shared_anon
> ok 1 migration.shared_anon
> # PASSED: 1 / 1 tests passed.
> # Totals: pass:1 fail:0 xfail:0 xpass:0 skip:0 error:0

This does make the test pass, to my surprise, since what you are doing
from userspace should already be done by the kernel: it retries folio
unmapping and moving NR_MAX_MIGRATE_(A)SYNC_RETRY times, and I had
already tested bumping these macros up, with the original test still
failing.

Now, I dug in more, and, if the following assertion is correct:

  any thread holding a reference on a folio will end up calling
  folio_lock()

then it seems to me that the retry for-loop wrapped around
migrate_folio_move() inside migrate_pages_batch() is useless: if
migrate_folio_move() fails on the first iteration, it is going to fail
on every iteration, since, if I am reading the code path correctly, the
only way it can fail is when the actual refcount is not equal to the
expected refcount (in folio_migrate_mapping()), and there is no way the
extra reference can get released while the migration path holds the
folio lock.

This raises the question: isn't it logical to assert the actual refcount
against the expected refcount just after we have changed the PTEs, so
that, if the assertion fails, we can move on to the next iteration of
the migrate_folio_unmap() retry loop inside migrate_pages_batch(),
calling migrate_folio_undo_src()/dst() to restore the old state? I am
trying to implement this, but it is not as straightforward as it seemed
to me this morning.
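Concretely, something like the following is what I am experimenting
with; this is an untested sketch for migrate_folio_unmap() in
mm/migrate.c, to run after try_to_migrate() has installed the migration
entries. The exact placement and the rc/out unwinding are my
assumptions about how it would slot in; folio_expected_refs() is the
existing helper that folio_migrate_mapping() itself checks against:

	struct address_space *mapping = folio_mapping(src);

	/*
	 * A racing thread took a reference before we replaced the PTEs
	 * with migration entries. By the assertion above, it can only
	 * drop that reference after taking the folio lock, which we
	 * hold, so migrate_folio_move() is guaranteed to fail in
	 * folio_migrate_mapping(). Fail the unmap stage instead, so
	 * that migrate_pages_batch() restores the old state via
	 * migrate_folio_undo_src()/migrate_folio_undo_dst() and retries
	 * the unmap on the next iteration of its retry loop.
	 */
	if (!folio_mapped(src) &&
	    folio_ref_count(src) != folio_expected_refs(mapping, src)) {
		rc = -EAGAIN;
		goto out;
	}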