From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Huang, Ying"
To: "zhangpeng (AS)"
Cc: Yin Fengwei
Subject: Re: [RFC PATCH] mm: filemap: avoid unnecessary major faults in filemap_fault()
In-Reply-To: (zhangpeng's message of "Fri, 24 Nov 2023 15:27:25 +0800")
References: <20231122140052.4092083-1-zhangpeng362@huawei.com>
 <801bd0c9-7d0c-4231-93e5-7532e8231756@intel.com>
 <48235d73-3dc6-263d-7822-6d479b753d46@huawei.com>
 <87y1en7pq3.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87ttpb7p4z.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 24 Nov 2023 16:04:11 +0800
Message-ID: <87lean7f2c.fsf@yhuang6-desk2.ccr.corp.intel.com>
Sender: owner-linux-mm@kvack.org

"zhangpeng (AS)" writes:

> On 2023/11/24 12:26, Huang, Ying wrote:
>
>> "Huang, Ying" writes:
>>
>>> "zhangpeng (AS)" writes:
>>>
>>>> On 2023/11/23 13:26, Yin Fengwei wrote:
>>>>
>>>>> On 11/23/23 12:12, zhangpeng (AS) wrote:
>>>>>> On 2023/11/23 9:09, Yin Fengwei wrote:
>>>>>>
>>>>>>> Hi Peng,
>>>>>>>
>>>>>>> On 11/22/23 22:00, Peng Zhang wrote:
>>>>>>>> From: ZhangPeng
>>>>>>>>
>>>>>>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>>>>> in application, which leading to an unexpected performance issue[1].
>>>>>>>>
>>>>>>>> This caused by temporarily cleared pte during a read/modify/write update
>>>>>>>> of the pte, eg, do_numa_page()/change_pte_range().
>>>>>>>>
>>>>>>>> For the data segment of the user-mode program, the global variable area
>>>>>>>> is a private mapping. After the pagecache is loaded, the private anonymous
>>>>>>>> page is generated after the COW is triggered. Mlockall can lock COW pages
>>>>>>>> (anonymous pages), but the original file pages cannot be locked and may
>>>>>>>> be reclaimed. If the global variable (private anon page) is accessed when
>>>>>>>> vmf->pte is zeroed in numa fault, a file page fault will be triggered.
>>>>>>>>
>>>>>>>> At this time, the original private file page may have been reclaimed.
>>>>>>>> If the page cache is not available at this time, a major fault will be
>>>>>>>> triggered and the file will be read, causing additional overhead.
>>>>>>>>
>>>>>>>> Fix this by rechecking the pte by holding ptl in filemap_fault() before
>>>>>>>> triggering a major fault.
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
>>>>>>>>
>>>>>>>> Signed-off-by: ZhangPeng
>>>>>>>> Signed-off-by: Kefeng Wang
>>>>>>>> ---
>>>>>>>>   mm/filemap.c | 14 ++++++++++++++
>>>>>>>>   1 file changed, 14 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>>>>> index 71f00539ac00..bb5e6a2790dc 100644
>>>>>>>> --- a/mm/filemap.c
>>>>>>>> +++ b/mm/filemap.c
>>>>>>>> @@ -3226,6 +3226,20 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>>>>>>>>               mapping_locked = true;
>>>>>>>>           }
>>>>>>>>       } else {
>>>>>>>> +        pte_t *ptep = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>>>>>> +                          vmf->address, &vmf->ptl);
>>>>>>>> +        if (ptep) {
>>>>>>>> +            /*
>>>>>>>> +             * Recheck pte with ptl locked as the pte can be cleared
>>>>>>>> +             * temporarily during a read/modify/write update.
>>>>>>>> +             */
>>>>>>>> +            if (unlikely(!pte_none(ptep_get(ptep))))
>>>>>>>> +                ret = VM_FAULT_NOPAGE;
>>>>>>>> +            pte_unmap_unlock(ptep, vmf->ptl);
>>>>>>>> +            if (unlikely(ret))
>>>>>>>> +                return ret;
>>>>>>>> +        }
>>>>>>> I am curious. Did you try not to take PTL here and just check whether PTE is not NONE?
>>>>>> Thank you for your reply.
>>>>>>
>>>>>> If we don't take PTL, the current use case won't trigger this issue either.
>>>>> Is this verified by testing or just in theory?
>>>> If we add a delay between ptep_modify_prot_start() and ptep_modify_prot_commit(),
>>>> this issue will also trigger. Without delay, we haven't reproduced this problem
>>>> so far.
>>>>
>>>>>> In most cases, if we don't take PTL, this issue won't be triggered. However,
>>>>>> there is still a possibility of triggering this issue. The corner case is that
>>>>>> task 2 triggers a page fault when task 1 is between ptep_modify_prot_start()
>>>>>> and ptep_modify_prot_commit() in do_numa_page(). Furthermore,task 2 passes the
>>>>>> check whether the PTE is not NONE before task 1 updates PTE in
>>>>>> ptep_modify_prot_commit() without taking PTL.
>>>>> There is very limited operations between ptep_modify_prot_start() and
>>>>> ptep_modify_prot_commit(). While the code path from page fault to this check is
>>>>> long. My understanding is it's very likely the PTE is not NONE when do PTE check
>>>>> here without hold PTL (This is my theory. :)).
>>>> Yes, there is a high probability that this issue won't occur without taking PTL.
>>>>
>>>>> In the other side, acquiring/releasing PTL may bring performance impaction. It may
>>>>> not be big deal because the IO operations in this code path. But it's better to
>>>>> collect some performance data IMHO.
>>>> We tested the performance of file private mapping page fault (page_fault2.c of
>>>> will-it-scale [1]) and file shared mapping page fault (page_fault3.c of will-it-scale).
>>>> The difference in performance (in operations per second) before and after patch
>>>> applied is about 0.7% on a x86 physical machine.
>>> Whether is it improvement or reduction?
>> And I think that you need to test ramdisk cases too to verify whether
>> this will cause performance regression and how much.
>
> Yes, I will.
> In addition, are there any ramdisk test cases recommended? 😁

I think you can start with the will-it-scale test cases you used
before.  You can also try a workload that takes a large number of major
faults, such as reading a file through mmap().

--
Best Regards,
Huang, Ying

>> --
>> Best Regards,
>> Huang, Ying
>>
>>> --
>>> Best Regards,
>>> Huang, Ying
>>>
>>>> [1] https://github.com/antonblanchard/will-it-scale/tree/master
>>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>
>>>>>>> Regards
>>>>>>> Yin, Fengwei
>>>>>>>
>>>>>>>> +
>>>>>>>>           /* No page in the page cache at all */
>>>>>>>>           count_vm_event(PGMAJFAULT);
>>>>>>>>           count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
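
For the mmap file-read workload mentioned in my reply, a minimal
userspace sketch could look like the one below.  It is only an
illustration and was not posted or tested as part of this thread.  The
helper name count_major_faults_mmap_read() is hypothetical, and the
posix_fadvise()/madvise() calls are best-effort hints, so how many major
faults you actually see depends on the filesystem and on the page cache
state (on tmpfs the count may well be 0).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from this thread): map 'path' privately,
 * read one byte from every page, and return the number of major faults
 * the process took while doing so, or -1 on error.
 */
long count_major_faults_mmap_read(const char *path)
{
	struct rusage before, after;
	struct stat st;
	volatile unsigned char sink = 0;
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *map;
	off_t off;
	int fd;

	assert(page_size > 0);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}

	/* Best effort: ask the kernel to drop cached pages of this file. */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

	map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;

	/* Best effort: hint random access so readahead hides fewer faults. */
	madvise(map, st.st_size, MADV_RANDOM);

	getrusage(RUSAGE_SELF, &before);
	for (off = 0; off < st.st_size; off += page_size)
		sink += map[off];	/* touch each page once */
	getrusage(RUSAGE_SELF, &after);

	munmap(map, st.st_size);
	return after.ru_majflt - before.ru_majflt;
}
```

Calling it on a large file that is not in the page cache, before and
after applying the patch, and comparing the returned count together with
the elapsed time would give a rough picture of the cost on the major
fault path.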