From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0FFEEC4828F for ; Fri, 2 Feb 2024 00:41:42 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 825FE6B0075; Thu, 1 Feb 2024 19:41:42 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 7D1616B0078; Thu, 1 Feb 2024 19:41:42 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 673286B007B; Thu, 1 Feb 2024 19:41:42 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 57F8C6B0075 for ; Thu, 1 Feb 2024 19:41:42 -0500 (EST) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 3571216052C for ; Fri, 2 Feb 2024 00:41:42 +0000 (UTC) X-FDA: 81745010844.15.A5815B3 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.14]) by imf01.hostedemail.com (Postfix) with ESMTP id 21DD74000F for ; Fri, 2 Feb 2024 00:41:39 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=hUT4JIMu; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf01.hostedemail.com: domain of ying.huang@intel.com designates 198.175.65.14 as permitted sender) smtp.mailfrom=ying.huang@intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706834500; a=rsa-sha256; cv=none; b=KlP5FSp9Oo8bnFYf8vkoQ85T7OkgJR7gpXhYnzH5U6aC+gm1D4ro+uQVHJPKJGx4B7BBfY +Y0X/Oe0Ub4U7fppvTDIXw6uss6Iu0af+t0sf5IjLhfX6BOulI3OrZUbGMbHMK87zvCCLO M7P37YffZ1ste+IdWzhm0kWq5ZagcAI= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=hUT4JIMu; dmarc=pass (policy=none) header.from=intel.com; spf=pass (imf01.hostedemail.com: domain of ying.huang@intel.com designates 198.175.65.14 as permitted sender) smtp.mailfrom=ying.huang@intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1706834500; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=CI9mrzdnQFJYkZOppZ1+6RYK1zKhGA5GAXkOextxLuY=; b=47nAmoc6VoSSp3TsDuoWjfF7kUThiKHU+lLAJNIzfmbI0hnhaJdVZd1Y8aPu7ppvjmoP8l Vte9SeHxTh6hE+qiUlHvdx4iqDjvYdchigNfij4PaXRtx+da/E03oyYnma1O7Xv++88+uV yWNloniIi+t9xYCdbF+Z0FmoSOhJ4p8= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1706834500; x=1738370500; h=from:to:cc:subject:in-reply-to:references:date: message-id:mime-version:content-transfer-encoding; bh=HtYrXDLKOD6BHnTEFNusFxbc9wuUkDPz62Q/9vC2Bq0=; b=hUT4JIMuYdNDWQX8tK050qOrPrrKaBotJuRmxRGo8xHCGDfsMNkuhZCf y1IxnMlHZ5xSPoRnfpXPrHkP3jjO5yX6Zz02Iv0h98s6V6pcV+QZQLpdy xd7UeFMo+AKMorEp7iLs78U/mRKlBKxqmQ8L7WbxyBmJAQKdwlWEvrvaq ZO4c5kZk5T+mnBLRasuidhwpl7M7rpTCVXk2F30nLsmEjPgo3bND/ih63 X0G5SD1g+YlBdnXmpU5fdb02vRatTo6oMUWnhPubsFAdUZ7Ka9jVlxLEH MbcLh119tdhFynXqaP32d9HMaeyHWbOqE+gLU0bSJFDnN2MeIIOmA3HPr g==; X-IronPort-AV: E=McAfee;i="6600,9927,10971"; a="3896160" X-IronPort-AV: E=Sophos;i="6.05,236,1701158400"; d="scan'208";a="3896160" Received: from fmviesa003.fm.intel.com ([10.60.135.143]) by orvoesa106.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Feb 2024 16:41:38 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.05,236,1701158400"; d="scan'208";a="4569001" Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55]) by fmviesa003-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Feb 2024 16:41:35 -0800 From: "Huang, Ying" To: "zhangpeng (AS)" Cc: Yin Fengwei , , , , , , , , , , Subject: Re: [RFC PATCH] mm: filemap: avoid unnecessary major faults in filemap_fault() In-Reply-To: <4da573ec-a2f9-84f4-f729-540492192957@huawei.com> (zhangpeng's message of "Thu, 1 Feb 2024 20:10:24 +0800") References: <20231122140052.4092083-1-zhangpeng362@huawei.com> <801bd0c9-7d0c-4231-93e5-7532e8231756@intel.com> <48235d73-3dc6-263d-7822-6d479b753d46@huawei.com> <87y1en7pq3.fsf@yhuang6-desk2.ccr.corp.intel.com> <87ttpb7p4z.fsf@yhuang6-desk2.ccr.corp.intel.com> <87lean7f2c.fsf@yhuang6-desk2.ccr.corp.intel.com> <87plzt464d.fsf@yhuang6-desk2.ccr.corp.intel.com> <4da573ec-a2f9-84f4-f729-540492192957@huawei.com> Date: Fri, 02 Feb 2024 08:39:39 +0800 Message-ID: <87wmrnbsxg.fsf@yhuang6-desk2.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 21DD74000F X-Stat-Signature: khgtpjias9daampdnsf65cekgs98wt5u X-HE-Tag: 1706834499-115157 X-HE-Meta: U2FsdGVkX1+CViAi8iIplxcu0Lt4UYZANYkABmOWh7VNcptbKihPtM1QXmL0GsNaBgFk9IFbGUHCLe0+VSt0S8rTzFxeJaYcpL0atn4BWLE4iHM2SROl/nFu3LF/BuZGzoypom3b5lZQbIUj0I6kxwI/8DDA08tJ4BwNq/8pTbD1L0NqGfM2mSgvDPHVWIGTCkIedXdNignLJvPz11NPzoBfkL1nromini9+2DVG6CTPdQ9TM75IdNsW8Ut+laepuVaWgEn0US3hcNjptgQsnVWN15tPY9DykNKYx7wUmjyXe5/isEmKx7jlx3Ticz5m59p9WdUZoNY0o93J29811yP9/mIsyIKAN3ZMLQL0ZbF6kzarEfuA6oxgxO8eqHazOU9WclD+z/TmMf4xTlrbKy+cdKpnV7gSkdkCrasmO3yPZDA7eZT69leztJT8jlSkpAgAIsUYTZ7bkyKVJjRjPRBCVOCArFB9K0wUCRzjxGbSwJa6xssxHP9vwg9lgWzcfG6WGnzNfx2ZdaaPkw0evIXLU548lkQee3X/JRNGY7RUIfWiouhIWmb+KCzc7iOmsLiYak/VvuzQUYN0yrNWApTz6N3Sn/HfhOo4/tEL7o4d5aGoQXCH7W8F+p39aMiSJBP6QYECAl30bG7dqAJK5C22e75Qv5UMyzoRRRfuCD3OXVh3Xu2VuVwJP0xmSK1JtOJpUIEv0pOj27bjrzcf0unSd4oIZxg8luDBy/xdwR3LTAGDWGZvKJLIxqo3HnWn/f0zSk398vBHmWFvn6t+fIa3aSx4bjnNqHhBnPtCpva9yrcDl6Zi0FagWjeZtIeFEUvi+pF5+5wpgZRZsmMxehQy6i5e5XULIES/1dpJFbJKibm2ZxqOsTkxhsyisD9ogjfPGwFO+snFEAxsaaqwyc7ER0Q5H4/P9k6zkdu8/TwMPxod17+K2vbCE0NOqkDQixHR8Bvo2aPwmRWKv0n LDCxDnDM E8Pz9Z7CPrw7HE7u1wz9+laPxUKtoC3xvX/E8KUTE7o1GeA8uw4dWu6Out5XoJl7wPS//PkmcW1LSc+AOIMpDQ8pwG6YMfnmbTiicd1WwI1Ztv1V5TeTFCvMzyKy1h0+bUUlP3TyCTmvdmvMGaSduHXSkZGv9FLSC+6QewBgVOFEUgAKg8QgaB4f4vFUcp7XnJWBWOiX0cb6JxE3c6zchY3SjY72cQB8zMfCacpW7pd5t16N31Ra204OgidtFxEd6sL8WwPAEELUYVkc700PMDj1DnYQWNdlQyL55DlH9BDs3RDUBvSktcWHSJHnMq5qQRvxNtALluENwy54= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: "zhangpeng (AS)" writes: > On 2023/11/29 10:59, Huang, Ying wrote: > >> "zhangpeng (AS)" writes: >> >>> On 2023/11/24 16:04, Huang, Ying wrote: >>> >>>> "zhangpeng (AS)" writes: >>>> >>>>> On 2023/11/24 12:26, Huang, Ying wrote: >>>>> >>>>>> "Huang, Ying" writes: >>>>>> >>>>>>> "zhangpeng (AS)" writes: >>>>>>> >>>>>>>> On 2023/11/23 13:26, Yin Fengwei wrote: >>>>>>>> >>>>>>>>> On 11/23/23 12:12, zhangpeng (AS) wrote: >>>>>>>>>> On 2023/11/23 9:09, Yin Fengwei wrote: >>>>>>>>>> >>>>>>>>>>> Hi Peng, >>>>>>>>>>> >>>>>>>>>>> On 11/22/23 22:00, Peng Zhang wrote: >>>>>>>>>>>> From: ZhangPeng >>>>>>>>>>>> >>>>>>>>>>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL= _FUTURE) >>>>>>>>>>>> in application, which leading to an unexpected performance iss= ue[1]. >>>>>>>>>>>> >>>>>>>>>>>> This caused by temporarily cleared pte during a read/modify/wr= ite update >>>>>>>>>>>> of the pte, eg, do_numa_page()/change_pte_range(). >>>>>>>>>>>> >>>>>>>>>>>> For the data segment of the user-mode program, the global vari= able area >>>>>>>>>>>> is a private mapping. After the pagecache is loaded, the priva= te anonymous >>>>>>>>>>>> page is generated after the COW is triggered. Mlockall can loc= k COW pages >>>>>>>>>>>> (anonymous pages), but the original file pages cannot be locke= d and may >>>>>>>>>>>> be reclaimed. If the global variable (private anon page) is ac= cessed when >>>>>>>>>>>> vmf->pte is zeroed in numa fault, a file page fault will be tr= iggered. >>>>>>>>>>>> >>>>>>>>>>>> At this time, the original private file page may have been rec= laimed. >>>>>>>>>>>> If the page cache is not available at this time, a major fault= will be >>>>>>>>>>>> triggered and the file will be read, causing additional overhe= ad. >>>>>>>>>>>> >>>>>>>>>>>> Fix this by rechecking the pte by holding ptl in filemap_fault= () before >>>>>>>>>>>> triggering a major fault. >>>>>>>>>>>> >>>>>>>>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-4= 98fa17434ee@huawei.com/ >>>>>>>>>>>> >>>>>>>>>>>> Signed-off-by: ZhangPeng >>>>>>>>>>>> Signed-off-by: Kefeng Wang >>>>>>>>>>>> --- >>>>>>>>>>>> =C2=A0 mm/filemap.c | 14 ++++++++++++++ >>>>>>>>>>>> =C2=A0 1 file changed, 14 insertions(+) >>>>>>>>>>>> >>>>>>>>>>>> diff --git a/mm/filemap.c b/mm/filemap.c >>>>>>>>>>>> index 71f00539ac00..bb5e6a2790dc 100644 >>>>>>>>>>>> --- a/mm/filemap.c >>>>>>>>>>>> +++ b/mm/filemap.c >>>>>>>>>>>> @@ -3226,6 +3226,20 @@ vm_fault_t filemap_fault(struct vm_faul= t *vmf) >>>>>>>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0 mapping_locked =3D true; >>>>>>>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>>>>>>>> =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } else { >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 pte_t *ptep =3D pt= e_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0 vmf->address, &vmf->ptl); >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (ptep) { >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 /* >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 * Recheck pte with ptl locked as the pte can be cleared >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 * temporarily during a read/modify/write update. >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 */ >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 if (unlikely(!pte_none(ptep_get(ptep)))) >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ret =3D VM_FAULT_NOPAGE; >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 pte_unmap_unlock(ptep, vmf->ptl); >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 if (unlikely(ret)) >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return ret; >>>>>>>>>>>> +=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 } >>>>>>>>>>> I am curious. Did you try not to take PTL here and just check w= hether PTE is not NONE? >>>>>>>>>> Thank you for your reply. >>>>>>>>>> >>>>>>>>>> If we don't take PTL, the current use case won't trigger this is= sue either. >>>>>>>>> Is this verified by testing or just in theory? >>>>>>>> If we add a delay between ptep_modify_prot_start() and ptep_modify= _prot_commit(), >>>>>>>> this issue will also trigger. Without delay, we haven't reproduced= this problem >>>>>>>> so far. >>>>>>>> >>>>>>>>>> In most cases, if we don't take PTL, this issue won't be trigger= ed. However, >>>>>>>>>> there is still a possibility of triggering this issue. The corne= r case is that >>>>>>>>>> task 2 triggers a page fault when task 1 is between ptep_modify_= prot_start() >>>>>>>>>> and ptep_modify_prot_commit() in do_numa_page(). Furthermore,tas= k 2 passes the >>>>>>>>>> check whether the PTE is not NONE before task 1 updates PTE in >>>>>>>>>> ptep_modify_prot_commit() without taking PTL. >>>>>>>>> There is very limited operations between ptep_modify_prot_start()= and >>>>>>>>> ptep_modify_prot_commit(). While the code path from page fault to= this check is >>>>>>>>> long. My understanding is it's very likely the PTE is not NONE wh= en do PTE check >>>>>>>>> here without hold PTL (This is my theory. :)). >>>>>>>> Yes, there is a high probability that this issue won't occur witho= ut taking PTL. >>>>>>>> >>>>>>>>> In the other side, acquiring/releasing PTL may bring performance = impaction. It may >>>>>>>>> not be big deal because the IO operations in this code path. But = it's better to >>>>>>>>> collect some performance data IMHO. >>>>>>>> We tested the performance of file private mapping page fault (page= _fault2.c of >>>>>>>> will-it-scale [1]) and file shared mapping page fault (page_fault3= .c of will-it-scale). >>>>>>>> The difference in performance (in operations per second) before an= d after patch >>>>>>>> applied is about 0.7% on a x86 physical machine. >>>>>>> Whether is it improvement or reduction? >>>>>> And I think that you need to test ramdisk cases too to verify whether >>>>>> this will cause performance regression and how much. >>>>> Yes, I will. >>>>> In addition, are there any ramdisk test cases recommended? =F0=9F=98= =81 >>>> I think that you can start with the will-it-scale test case you used >>>> before. And you can try some workload with large number of major faul= t, >>>> like file read with mmap. >>> I used will-it-scale to test the page faults of ext4 files and >>> tmpfs files. The data is the average change compared with the >>> mainline after the patch is applied. The test results are within >>> the range of fluctuation, and there is no obvious difference. >>> The test results are as follows: >>> >>> processes processes_idle threads threads_idle >>> ext4 private file write: -0.51% 0.08% -0.03% -0.04% >>> ext4 shared file write: 0.135% -0.531% 2.883% -0.772% >>> tmpfs private file write: -0.344% -0.110% 0.200% 0.145% >>> tmpfs shared file write: 0.958% 0.101% 2.781% -0.337% >>> tmpfs private file read: -0.16% 0.00% -0.12% 0.41% >> Thank you very much for test results! >> >> We shouldn't use tmpfs, because there will be no major faults. Please >> check your major faults number to verify that. IIUC, ram disk + disk >> file system should be used. >> >> And, please make sure that there's no heavy lock contention in the base >> kernel. Because if some heavy lock contention kills performance, there >> will no performance difference between based and patched kernel. > > I'm so sorry I was so late to finish the test and reply. > > I used will-it-scale to test the page faults of ramdisk files. The > data is the average change compared with the mainline after the patch > is applied. The test results are as follows: > > processes processes_idle threads threads_idle > ramdisk private file write: -0.48% 0.23% -1.08% 0.27% > ramdisk private file read: 0.07% -6.90% -5.85% -0.70% ~~~~~~ It appears that the patch will cause some visible performance regression in this benchmark. We can try to verify that via `perf profile`. Or, we can just try Fengwei's idea, that is, check pte_none() without acquiring PTL. -- Best Regards, Huang, Ying > > Applied patch: > > diff --git a/mm/filemap.c b/mm/filemap.c > index 32eedf3afd45..2db9ccfbd5e3 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -3226,6 +3226,22 @@ vm_fault_t filemap_fault(struct vm_fault *vmf) > mapping_locked =3D true; > } > } else { > + if (!pmd_none(*vmf->pmd)) { > + pte_t *ptep =3D pte_offset_map_lock(vmf->vma->vm_= mm, vmf->pmd, > + vmf->address, &= vmf->ptl); > + if (unlikely(!ptep)) > + return VM_FAULT_NOPAGE; > + /* > + * Recheck pte with ptl locked as the pte can be = cleared > + * temporarily during a read/modify/write update. > + */ > + if (unlikely(!pte_none(ptep_get(ptep)))) > + ret =3D VM_FAULT_NOPAGE; > + pte_unmap_unlock(ptep, vmf->ptl); > + if (unlikely(ret)) > + return ret; > + } > + > /* No page in the page cache at all */ > count_vm_event(PGMAJFAULT); > count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT); > >> -- >> Best Regards, >> Huang, Ying