From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <234a5423-8d6d-444a-a27c-c772a71c9871@huawei.com>
Date: Mon, 4 Mar 2024 16:47:26 +0800
Subject: Re: [Question] CoW on VM_PFNMAP vma during write fault
From: mawupeng <mawupeng1@huawei.com>
References: <20240227122814.3781907-1-mawupeng1@huawei.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8bit
Hi Maintainers, kindly ping...

On 2024/2/28 9:55, mawupeng wrote:
>
>
> On 2024/2/27 21:15, David Hildenbrand wrote:
>> On 27.02.24 14:00, David Hildenbrand wrote:
>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>> We find that a warning is produced during our test; the detailed log
>>>> is shown at the end.
>>>>
>>>> The core problem behind this warning is that the first pfn of this
>>>> pfnmap vma is cleared during memory-failure.
>>>> Digging into the source, we find that this problem can be triggered
>>>> as follows:
>>>>
>>>> // mmap with MAP_PRIVATE and a specific fd whose driver hooks mmap
>>>> mmap(MAP_PRIVATE, fd)
>>>>     __mmap_region
>>>>       remap_pfn_range
>>>>       // set vma with pfnmap and the prot of the pte is read only
>>>
>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>
>>> What fd is that exactly? Often, we disallow private mappings in the
>>> mmap() callback (for a good reason).

We found this problem in 5.10. Commit 9f78bf330a66 ("xsk: support use vaddr
as ring") fixed this problem, while adding vaddr support, by remapping
VM_PFNMAP as VM_MIXEDMAP. But other modules which use remap_pfn_range may
still have this problem.

It does seem weird for private mappings. What is the good reason?

>
> just a device fd with a device-specific mmap which uses remap_pfn_range to assign memory.
>
>>>
>>>> // memset this memory to trigger the fault
>>>> handle_mm_fault
>>>>     __handle_mm_fault
>>>>       handle_pte_fault
>>>>         // write fault and !pte_write(entry)
>>>>         do_wp_page
>>>>           wp_page_copy // this will alloc a new page with a valid page struct
>>>>                        // for this pfnmap vma
>>>
>>> Here we replace the mapped PFNMAP thingy by a proper anon folio.
>
> My problem is: can we replace a pfn with a fully functioning page for a pfnmap vma? This is not a MIXEDMAP vma.
>
>>>
>>>>
>>>> // inject a hwpoison to the first page of this vma
>>>
>>> I assume this is an anon folio?
>
> Yes.
>
>>>
>>>> madvise_inject_error
>>>>     memory_failure
>>>>       hwpoison_user_mappings
>>>>         try_to_unmap_one
>>>>           // mark this pte as invalid (hwpoison)
>>>>           mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
>>>>                   address, range.end);
>
> If we can replace the mapped PFNMAP thingy by a proper anon folio, we need to make memory_failure handle
> pfnmap vmas properly, since a pfnmap vma should not touch its struct pages?
>
> Currently this page has a valid mapping and can be unmapped.
>
> Maybe there is something wrong with my understanding of CoW on a private pfnmap vma.
>
>>>>
>>>> // during unmap of the vma, the first pfn of this pfnmap vma is invalid
>>>> vm_mmap_pgoff
>>>>     do_mmap
>>>>       __do_mmap_mm
>>>>         __mmap_region
>>>>           __do_munmap
>>>>             unmap_region
>>>>               unmap_vmas
>>>>                 unmap_single_vma
>>>>                   untrack_pfn
>>>>                     follow_phys // pte is already invalidated, WARN_ON here
>>>
>>> unmap_single_vma()->...->zap_pte_range() should do the right thing when
>>> calling vm_normal_page().
>>>
>>> untrack_pfn() is the problematic part.
>
> For a pfnmap vma, not every pfn has a valid page, so unmapping pages is not expected. In this case, it just
> checks whether the first address has a valid pte or not, which seems reasonable to me.
>
>>>
>>>>
>>>> CoW with a valid page for a pfnmap vma is weird to us. Can we use
>>>> remap_pfn_range for a private vma (read only)? Once CoW happens on a pfnmap
>>>> vma during a write fault, this page is normal (its page flags are valid) for most
>>>> mm subsystems, such as memory failure in this case, and extra work must be done
>>>> to handle this special page.
>>>>
>>>> During unmap, if this vma is pfnmap, unmapping shouldn't be done, since pages
>>>> should not be touched for a pfnmap vma.
>>>>
>>>> But the root problem is: can we insert a valid page into a pfnmap vma?
>>>>
>>>> Any thoughts on how to solve this warning?
>>>
>>> vm_normal_page() documentation explains how that magic is supposed to
>>> work. vm_normal_page() should be able to correctly identify whether we
>>> want to look at the struct page for an anon folio that was COWed.
>
> vm_normal_page() can find out a CoW mapping but
>
>>>
>>> untrack_pfn() indeed does not seem to be well prepared for handling
>>> MAP_PRIVATE mappings where we end up having anon folios.
>>>
>>> I think it will already *completely mess up* simply when unmapping the
>>> range without the memory failure involved.
>>>
>>> See, follow_phys() would get the PFN of the anon folio and then
>>> untrack_pfn() would do some nonsense with that. Completely broken.
>>>
>>> The WARN is just a side-effect of the brokenness.
>>>
>>> In follow_phys(), we'd likely have to call vm_normal_page(). If we get a
>>> page back, we'd likely have to fail follow_phys() instead of returning the
>>> PFN of an anon folio.
>>>
>>> Now, how do we fix untrack_pfn()? I really don't know. In theory, we
>>> might no longer have *any* PFNMAP PFN in there after COW'ing everything.
>>>
>>> Sounds like MAP_PRIVATE VM_PFNMAP + __HAVE_PFNMAP_TRACKING is some
>>> broken garbage (sorry). Can we disallow it?
>>
>> Staring at track_pfn_copy(), it's maybe similarly broken?
>>
>> I think we want to do:
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 098356b8805ae..da5d1e37c5534 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -6050,6 +6050,10 @@ int follow_phys(struct vm_area_struct *vma,
>>                 goto out;
>>         pte = ptep_get(ptep);
>>
>> +       /* Never return addresses of COW'ed anon folios. */
>> +       if (vm_normal_page(vma, address, pte))
>> +               goto unlock;
>> +
>>         if ((flags & FOLL_WRITE) && !pte_write(pte))
>>                 goto unlock;
>>
>>
>> And then, just disallow it with PAT involved:
>>
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index 0904d7e8e1260..e4d2b2e8c0281 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -997,6 +997,15 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>>                                 && size == (vma->vm_end - vma->vm_start))) {
>>                 int ret;
>>
>> +               /*
>> +                * untrack_pfn() and friends cannot handle regions that suddenly
>> +                * contain anon folios after COW.
>> +                * In particular, follow_phys() will fail when we have an anon
>> +                * folio at the beginning of the VMA.
>> +                */
>> +               if (vma && is_cow_mapping(vma->vm_flags))
>> +                       return -EINVAL;
>
> In this case, anyone who uses remap_pfn_range cannot have a CoW mapping, which means that if VM_MAYWRITE is
> set, VM_SHARED is needed for this vma.
>
> This can solve this CoW-on-private-vma problem.
>
>> +
>>                 ret = reserve_pfn_range(paddr, size, prot, 0);
>>                 if (ret == 0 && vma)
>>                         vm_flags_set(vma, VM_PAT);
>>
>>
>> I'm afraid that will break something. But well, it's already semi-broken.
>>
>> As long as VM_PAT is not involved, it should work as expected.
>>
>> In an ideal world, we'd get rid of follow_phys() completely and just
>> derive that information from the VMA?
>>