Date: Fri, 20 Dec 2024 10:07:47 +0530
Subject: Re: [PATCH] mm: migration :shared anonymous migration test is failing
From: Dev Jain
To: Baolin Wang, Donet Tom, Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Ritesh Harjani, "Aneesh Kumar K . V", Zi Yan, David Hildenbrand, shuah Khan
References: <20241219124717.4907-1-donettom@linux.ibm.com> <36f9ab13-e057-40a0-8d0b-9939df056fc6@linux.ibm.com> <4d76321e-7905-46e6-8105-f09afde516ff@linux.alibaba.com>
In-Reply-To: <4d76321e-7905-46e6-8105-f09afde516ff@linux.alibaba.com>

On 20/12/24 9:02 am, Baolin Wang wrote:
>
>
> On 2024/12/20 11:12, Donet Tom wrote:
>>
>> On 12/20/24 08:01, Baolin Wang wrote:
>>>
>>>
>>> On 2024/12/19 20:47, Donet Tom wrote:
>>>> The migration selftest is currently failing for shared anonymous
>>>> mappings due to a race condition.
>>>>
>>>> During migration, the source folio's PTE is unmapped by nuking the
>>>> PTE, flushing the TLB, and then marking the page for migration
>>>> (by creating the swap entries). The issue arises when, immediately
>>>> after the PTE is nuked and the TLB is flushed, but before the page
>>>> is marked for migration, another thread accesses the page. This
>>>> triggers a page fault, and the page fault handler invokes
>>>> do_pte_missing() instead of do_swap_page(), as the page is not yet
>>>> marked for migration.
>>>>
>>>> In the fault handling path, do_pte_missing() calls __do_fault()
>>>> -> shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry().
>>>> This eventually calls folio_try_get(), incrementing the reference
>>>> count of the folio undergoing migration. The thread then blocks
>>>> on folio_lock(), as the migration path holds the lock. This
>>>> results in the migration failing in __migrate_folio(), which expects
>>>> the folio's reference count to be 2. However, the reference count is
>>>> incremented by the fault handler, leading to the failure.
>>>>
>>>> The issue arises because, after nuking the PTE and before marking the
>>>> page for migration, the page is accessed. To address this, we have
>>>> updated the logic to first nuke the PTE, then mark the page for
>>>> migration, and only then flush the TLB. With this patch, if the
>>>> page is accessed immediately after nuking the PTE, the TLB entry is
>>>> still valid, so no fault occurs. After marking the page for migration,
>>>
>>> IMO, I don't think this assumption is correct. At this point, the
>>> TLB entry might also be evicted, so a page fault could still occur.
>>> It's just a matter of probability.
>> In this patch, we mark the page for migration before flushing the TLB.
>> This ensures that if someone accesses the page after the TLB flush,
>> a page fault will occur and the page fault handler will wait for the
>> migration to complete, so migration will not fail.
>>
>> Without this patch, if someone accesses the page after the TLB flush
>> but before it is marked for migration, the migration will fail.
>
> Actually my concern is the same as David's (I did not see David's
> reply before sending my comments), which is that your patch does not
> "rule out all cases".

I like this solution, but IMHO the proper fix for this one would be to
set the migration entry atomically, so that the PTE is never observed
empty.

>
>>> Additionally, IIUC, if another thread is accessing the shmem folio
>>> causing the migration to fail, I think this is expected, and
>>> migration failure is not a vital issue?
>>>
>> In my case, the shmem migration test is always failing,
>> even after retries. Would it be correct to consider this
>> as expected behavior?
>
> IMHO I think your test case is too aggressive and unlikely to occur in
> real-world scenarios. Additionally, as I mentioned, migration failure
> is not a vital issue in the system, and a temporary refcount can also
> lead to migration failure if you want to create such test cases. So
> personally, I don't think it is worth doing.

Agreed. AFAIR the test case starts faulting exactly on those pages which
we want to migrate, making this a very artificial scenario.
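
To make the ordering argument concrete, here is a toy model of the three
orderings discussed in this thread. It is plain Python, not kernel code:
the step names, the racy_points() helper and the "one concurrent access
after each step" model are all assumptions made for illustration. The
only details taken from the thread are that a fault on an empty PTE takes
an extra folio reference and that the migration path expects a reference
count of 2.

```python
from enum import Enum, auto


class Pte(Enum):
    PRESENT = auto()
    NONE = auto()       # PTE cleared, no migration entry yet
    MIGRATION = auto()  # migration (swap) entry installed


EXPECTED_REFS = 2  # refcount the thread says __migrate_folio() expects

# Hypothetical step names for the orderings under discussion.
CURRENT = ["clear_pte", "flush_tlb", "set_migration_entry"]
PATCHED = ["clear_pte", "set_migration_entry", "flush_tlb"]
ATOMIC = ["set_migration_entry", "flush_tlb"]  # PTE goes PRESENT -> MIGRATION


def racy_points(steps, tlb_evicted=False):
    """Return the steps after which one concurrent access would take an
    extra folio reference, i.e. would make the refcount check fail."""
    racy = []
    for i in range(len(steps)):
        pte, tlb_valid, refs = Pte.PRESENT, not tlb_evicted, EXPECTED_REFS
        # Replay the unmap sequence up to and including step i.
        for step in steps[: i + 1]:
            if step == "clear_pte":
                pte = Pte.NONE
            elif step == "set_migration_entry":
                pte = Pte.MIGRATION
            elif step == "flush_tlb":
                tlb_valid = False
            else:
                raise ValueError(f"unknown step: {step}")
        # Another thread touches the page now. A valid TLB entry means no
        # fault. An empty PTE models the do_pte_missing() path, which takes
        # a reference. A migration entry means the faulting thread waits.
        if not tlb_valid and pte is Pte.NONE:
            refs += 1
        if refs != EXPECTED_REFS:
            racy.append(steps[i])
    return racy


if __name__ == "__main__":
    for name, steps in (("current", CURRENT), ("patched", PATCHED),
                        ("atomic", ATOMIC)):
        for evicted in (False, True):
            print(name, "tlb_evicted=%s" % evicted,
                  racy_points(steps, tlb_evicted=evicted))
```

In this model the patched ordering has no racy step while the TLB entry
survives, but it still has one (right after clear_pte) when the entry has
already been evicted, which is the "does not rule out all cases" concern.
Only the ordering that never leaves the PTE empty has no racy step in
either case.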